Recipe · change detection
PID loop hunting
A boiler hold temperature oscillates a couple of degrees around its setpoint — within spec, no alarm fires. Three months later the modulating valve fails. The trace had been telling you the whole time, in how often the signal turned around rather than how far it strayed.
This recipe counts how often the signal reverses direction inside a
sliding window. The winnow node fires on every trajectory turning
point, swStats averages those events into a continuous rate, and
categorize bands the rate into a graded health state.
Why frequency, not amplitude. Two oscillations at the same amplitude can produce opposite outcomes. A ten-minute period is a tuning nuisance. A one-minute period at the same amplitude shortens the modulating valve’s life by years. A control chart (mean ± 2σ) treats them as identical because it measures amplitude alone. This recipe measures how often the signal turns around.
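The contrast can be sketched in plain JavaScript, independent of the pipeline. The helper names (`sine`, `stdev`, `countReversals`) are illustrative assumptions and not part of the library; the sketch only shows that an amplitude statistic cannot separate two periods while a reversal count can.

```javascript
// Two noise-free sinusoids: same amplitude, different periods.
// All helpers are hypothetical stand-ins, not the library's nodes.
function sine(amplitude, periodSamples, n) {
  return Array.from({ length: n }, (_, i) =>
    amplitude * Math.sin((2 * Math.PI * i) / periodSamples));
}

// Population standard deviation: the amplitude-style statistic.
function stdev(xs) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
  return Math.sqrt(variance);
}

// Count sign changes of the first difference (peaks and troughs).
function countReversals(xs) {
  let count = 0;
  let prevSign = 0;
  for (let i = 1; i < xs.length; i++) {
    const sign = Math.sign(xs[i] - xs[i - 1]);
    if (sign !== 0 && prevSign !== 0 && sign !== prevSign) count++;
    if (sign !== 0) prevSign = sign;
  }
  return count;
}

const n = 3600;                // one hour at an assumed 1 sample/s
const slow = sine(2, 600, n);  // ten-minute period
const fast = sine(2, 60, n);   // one-minute period

console.log('stdev     slow/fast:', stdev(slow).toFixed(4), stdev(fast).toFixed(4));
console.log('reversals slow/fast:', countReversals(slow), countReversals(fast));
```

Both series report the same standard deviation, while the reversal counts differ by a factor of ten (12 versus 120 over 3600 samples). On a real, noisy signal a raw difference count like this would fire on every noise wiggle, which is why the recipe smooths and uses a deadband first.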

    flow('loop-hunting-detector')
      .sanitize('sane', 'temperature', { failureReason: 'failReason' },
        { ranges: { temperature: { min: 0, max: 200 } } })
      .median3('med', 'temperature', { median3: 'medT' })
      .esMean('slow', 'medT', { mean: 'slowT' }, { halfLife: 3 })
      .esStats('noise', 'slowT', { stdev: 'tStdev' }, { halfLife: 30 })
      .trend('dir', 'slowT',
        { trend: 'tDir', rocMean: 'tRoc' },
        { rocStatsHalfLife: 30, rocThreshold: 0.015, warmupSamples: 30 })
      .winnow('events', 'slowT',
        { significant: 'turningPoint' },
        {
          K: 7, tightenBase: 5000, maxGap: 5000,
          noiseField: 'tStdev', // from esStats above
          dirField: 'tDir',     // from trend above
          slopeField: 'tRoc'    // from trend above
        })
      .transform('encode', 'turningPoint',
        { result: 'eventBit' },
        { using: ( v ) => ( v ? 1 : 0 ) }) // boolean -> 0 or 1 for swStats
      .swStats('rate', 'eventBit',
        { mean: 'reversalRate' },
        { windowSize: 600 })
      .categorize('level', 'reversalRate',
        { category: 'cyclingHealth' },
        { thresholds: [ 0.001, 0.0115, 0.020 ],
          categories: [ 'stable', 'minor', 'major', 'severe' ] })
      .run()

Drag the slider through the four phases and watch the recipe walk
from stable through minor and major to severe — even though the
three oscillating phases share the same amplitude.
What You’re Seeing
The chart is divided into four 30-minute phases. Subtle background shading marks the three hunting phases; the unshaded first 30 minutes are a stable baseline.
The gray line is raw temperature near 80°C. Look at the wave heights across all four phases — they are the same, about ±2°C. An amplitude alarm would see no difference between any of them.
Now look at the wave speed. In the slow-hunt phase the signal completes one full cycle in about ten minutes. In the medium phase, five minutes. In the fast phase, two and a half. Same height, doubling speed — that is what a hunting control loop looks like as it worsens.
The amber curve (right axis) is the reversal rate — how many direction changes the pipeline counted per sample over the last 600 samples. It sits near zero in the stable phase and steps up at each phase boundary as faster oscillations feed more turning points into the window. This single number is what the pipeline is built to produce.
The health card below the chart shows the graded state at the
slider position: stable → minor → major → severe. A standard
control chart (mean ± 2σ) cannot make this distinction — all three
hunting phases have the same amplitude and would land in the same
bucket. This recipe separates them by frequency alone.
Drag past minute 10 to see the first state appear — trend and
swStats need their warmup windows to fill before they can report.
Where This Pattern Fits
| Domain | What hunts | Why it’s hard to catch |
|---|---|---|
| Boiler temperature loop | Jacket or outlet temp under PID control | Oscillation amplitude stays within spec; only the period changes as tuning degrades |
| Reactor jacket cooling | Inlet or outlet temperature | Slow oscillations look like normal batch behaviour on a dashboard |
| Modulating control valve | Valve position or downstream flow | Mechanical wear produces frequent small reversals that no single reading flags |
| Tank level under flow control | Level transmitter | Level oscillates around setpoint with no trend — every reading is in range |
| Heat exchanger outlet | Downstream fluid temperature | Hunting appears only when the upstream loop adjusts to new operating conditions |
How It Works
Two design choices make this recipe work. Both are calibrated against the feasibility study. Remove either one and classification accuracy drops from 100% to 36% or worse.
Smooth twice before taking the slope. median3 absorbs
single-sample spikes. esMean with halfLife = 3 then drops the
residual noise on the smoothed signal to about 0.013 — below
trend’s rocThreshold of 0.015. Without esMean, trend’s
rate-of-change estimate flickers on noise and winnow’s
trend-reversal trigger fires on every flicker, flooding the rate
with false events. The general rule: a slope estimator’s noise
floor scales as 2σ/√halfLife; the signal slope you want to detect
must sit above that floor.
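The two smoothing stages can be illustrated outside the pipeline. This is a minimal sketch, assuming a three-sample running median and an exponential mean whose weight is derived from the half-life as `alpha = 1 - 0.5^(1/halfLife)`; the library's exact definitions may differ, and the helper functions below only borrow the node names for readability.

```javascript
// Deterministic PRNG (mulberry32) so the sketch is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function stdev(xs) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  return Math.sqrt(xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length);
}

// Median of the current sample and the two before it (assumed behaviour).
function median3(xs) {
  return xs.map((x, i) => {
    if (i < 2) return x; // not enough history yet
    return [xs[i - 2], xs[i - 1], x].sort((a, b) => a - b)[1];
  });
}

// Exponential mean; alpha from a half-life in samples (assumed definition).
function esMean(xs, halfLife) {
  const alpha = 1 - Math.pow(0.5, 1 / halfLife);
  let mean = xs[0];
  return xs.map((x) => (mean += alpha * (x - mean)));
}

// Flat 80 °C signal with uniform noise (sigma about 0.1) and one spike.
const rand = mulberry32(1);
const raw = Array.from({ length: 2000 }, () => 80 + (rand() - 0.5) * 0.34);
raw[500] += 5; // single-sample spike

const med = median3(raw);
const smooth = esMean(med, 3);

console.log('raw stdev     ', stdev(raw).toFixed(4));
console.log('smoothed stdev', stdev(smooth).toFixed(4));
console.log('spike survives median3:', Math.max(...med) > 81);
```

The median removes the isolated spike entirely, and the exponential mean then cuts the remaining noise. The trade-off is lag: heavier smoothing also attenuates fast oscillations.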
Set winnow’s K to 7, not the default of 2. winnow’s deadband
fires when the signal departs from its projected path by more than
K × stdev. Think of K as a sensitivity dial — lower values trigger
more fires, higher values suppress false alarms. At K = 3, about
3% of samples in the stable phase trigger a background fire from
noise alone — enough to swamp the real hunting signal. At K = 7
the false-fire probability drops below one per million samples. The
general rule: when winnow feeds an event counter, K needs to be
well above the compression-oriented defaults.
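The dial behaviour of K can be sketched on idealized noise. This is a simplified stand-in, not winnow's rule: it fires whenever a zero-mean Gaussian sample exceeds K × σ, with no projected path and no estimated stdev. The absolute fractions it prints are therefore not expected to match the rates quoted above, which come from the real deadband; only the direction of the effect carries over.

```javascript
// Deterministic PRNG (mulberry32) so the sketch is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller transform: one standard normal draw per call.
function gaussian(rand) {
  const u = 1 - rand(); // in (0, 1], avoids log(0)
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Fraction of samples outside a fixed +/- K * sigma band (toy deadband).
function fireFraction(K, samples, sigma) {
  let fires = 0;
  for (const x of samples) if (Math.abs(x) > K * sigma) fires++;
  return fires / samples.length;
}

const rand = mulberry32(42);
const sigma = 1;
const noise = Array.from({ length: 100000 }, () => sigma * gaussian(rand));

const fractions = [2, 3, 7].map((K) => ({
  K,
  fraction: fireFraction(K, noise, sigma),
}));
console.log(fractions);
```

Each step up in K removes a large share of the noise-only fires, which is the property the event counter depends on.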
Keep the deadband width constant. winnow progressively narrows
its deadband as elapsed samples grow — a compression feature that
inserts intermediate anchor points in long quiet segments. In a pure
event-counting role that tightening would seed spurious fires inside
the sliding window and inflate the rate. Setting tightenBase: 5000
and maxGap: 5000 — both well beyond the 600-sample swStats
window — pins the deadband at its full K × σ width across the
entire window and disables the gap-fill anchor. Lower values are
correct for compression but wrong here.
Counting the events. Each winnow.significant = true is a single
event. The transform encoder converts the boolean to 0 or 1 so
swStats can read it as a numeric field. swStats(mean) over a
600-sample window gives a continuous rate — fires per sample.
Winnow fires at peaks, at troughs, and occasionally on large-enough
noise excursions during the cycle, so the reported rate tracks the
cycle frequency monotonically — faster hunting always produces a
higher rate — but it is not literally 2 × cycles-per-sample. That
is why categorize.thresholds are set against observed rates on a
clean baseline, not against a ground-truth formula. The operator
gets a single state field — stable, minor, major, or severe
— that they can log, alarm on, or display.
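The event-to-rate-to-state chain can be sketched in a few lines of plain JavaScript. The helpers below are hypothetical and only mimic what the text describes: a boolean event is encoded as 0 or 1, averaged over the last 600 samples, and banded with the recipe's thresholds. Treating a rate equal to a threshold as belonging to the higher band is an assumption.

```javascript
// Sliding mean over the last `windowSize` values, kept with a running sum.
function makeSlidingMean(windowSize) {
  const buf = [];
  let sum = 0;
  return (value) => {
    buf.push(value);
    sum += value;
    if (buf.length > windowSize) sum -= buf.shift();
    return sum / buf.length; // before the window fills, averages what it has
  };
}

// Band a rate into a category. Assumption: rate >= threshold moves up a band.
function categorize(rate, thresholds, categories) {
  let i = 0;
  while (i < thresholds.length && rate >= thresholds[i]) i++;
  return categories[i];
}

const thresholds = [0.001, 0.0115, 0.020];
const categories = ['stable', 'minor', 'major', 'severe'];

// Hypothetical event stream: one turning point every 75 samples.
const rate = makeSlidingMean(600);
let reversalRate = 0;
for (let i = 1; i <= 1200; i++) {
  const turningPoint = i % 75 === 0;
  const eventBit = turningPoint ? 1 : 0; // boolean -> 0 or 1
  reversalRate = rate(eventBit);
}

const cyclingHealth = categorize(reversalRate, thresholds, categories);
console.log(reversalRate.toFixed(4), cyclingHealth);
```

With eight events in the last 600 samples the rate is about 0.0133, which lands in the major band under these thresholds.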
Wire winnow to the renamed fields. Winnow reads four fields
from the message on every tick: its primary input (from.x), plus
a noise estimate, a trend direction, and a slope value for its
deadband and reversal checks. By default these three supporting
fields are named stdev, trendDir, and roc. The pipeline above
renames them with a t prefix (tStdev, tDir, tRoc), so
winnow must be pointed at the renamed fields via noiseField,
dirField, and slopeField. Miss this wiring and winnow silently
reads undefined, its warm-up guard fires on every sample, and the
downstream rate saturates at 1.0. Three options — easy to overlook,
load-bearing.
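One way to catch a wiring slip early is a pre-flight check on a sample message. The helper below is hypothetical and not part of the library; it assumes messages are plain objects keyed by field name and simply reports which expected fields are absent.

```javascript
// Hypothetical pre-flight check: which of the expected fields are missing?
function missingFields(message, fieldNames) {
  return fieldNames.filter((name) => message[name] === undefined);
}

// Illustrative message shaped like the renamed outputs in this recipe.
const sample = { slowT: 80.1, tStdev: 0.012, tDir: 1, tRoc: 0.002 };

// Wired to the renamed fields: nothing missing.
const wired = missingFields(sample, ['slowT', 'tStdev', 'tDir', 'tRoc']);

// Left at the default names: all three supporting fields are missing.
const defaults = missingFields(sample, ['slowT', 'stdev', 'trendDir', 'roc']);

console.log('renamed wiring missing:', wired);
console.log('default wiring missing:', defaults);
```

Running a check like this against the first message after warm-up makes the failure loud instead of letting the rate quietly saturate.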
The wider lesson is the composition pattern. A per-sample event detector, a sliding-window rate, and a graded state — the same shape applies wherever you need to turn discrete events into a continuous health indicator.
Tuning to your loop
The published thresholds are calibrated against the synthetic signal. On your own loop, calibrate from a clean baseline:
- Run the pipeline for at least a week on a loop you know is healthy. Capture reversalRate to storage.
- Note the stable-state value. Call it r_normal. On a healthy loop it is usually between 0 and ~0.002.
- Set categorize.thresholds to [ max(r_normal × 2, 0.001), r_normal × 5, r_normal × 10 ].
- Leave the rest (K, the smoothing chain, tightenBase, maxGap) at the values in this recipe. They are calibrated against the noise floor of the smoothed signal, not against the loop.
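The threshold formula in the procedure above can be written as a small helper. The function name and the `increasing` flag are assumptions added for illustration. The flag exists because the 0.001 floor applies only to the first threshold: for a very small r_normal (below about 0.0002) the first value can exceed the second, and the sketch reports that case rather than guessing a fix.

```javascript
// Thresholds from a measured healthy-loop baseline rate (formula from the text).
function thresholdsFromBaseline(rNormal) {
  const thresholds = [
    Math.max(rNormal * 2, 0.001), // floor applies to the first band only
    rNormal * 5,
    rNormal * 10,
  ];
  // Added check (assumption): banding needs strictly increasing thresholds.
  const increasing = thresholds.every(
    (t, i) => i === 0 || t > thresholds[i - 1]);
  return { thresholds, increasing };
}

const typical = thresholdsFromBaseline(0.002);
const veryQuiet = thresholdsFromBaseline(0.0001);

console.log(typical);   // thresholds near 0.004, 0.01, 0.02
console.log(veryQuiet); // increasing is false: review by hand
```

If `increasing` comes back false, the baseline is quieter than the formula anticipates and the bands need to be chosen manually.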
The table below is the reference for the other knobs.
| Parameter | How to set it |
|---|---|
| from.x | Your loop variable |
| winnow.K | 7 is the conservative floor. Drop to 6 only if your stable baseline is cleaner than the synthetic noise |
| categorize.thresholds | Use the four-step procedure above |
| swStats.windowSize | About three times the longest period of hunt you expect to detect. A ten-minute slowest hunt at one sample per second means windowSize: 1800 |
| esMean.halfLife | 3 is good for noise σ ≤ 0.1. Increase if your signal is noisier, but faster oscillations then get attenuated by the heavier smoothing |
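The window-size rule in the table is plain arithmetic. The helper below is a hypothetical convenience, not a library function; it multiplies the longest expected hunt period by the sample rate and by the factor of three given above.

```javascript
// windowSize = 3 x longest hunt period (s) x sample rate (samples/s).
function windowSizeFor(longestPeriodSeconds, samplesPerSecond) {
  return Math.ceil(3 * longestPeriodSeconds * samplesPerSecond);
}

const tenMinuteHunt = windowSizeFor(10 * 60, 1);   // one sample per second
const sameHuntSlower = windowSizeFor(10 * 60, 0.2); // one sample every 5 s

console.log(tenMinuteHunt, sameHuntSlower);
```

The first call reproduces the table's example of 1800 samples; the second shows how the window shrinks when the same loop is sampled less often.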
What this recipe is not
Period floor. This recipe is calibrated for slow-to-medium
control loops where one cycle contains at least about 150 samples.
For HVAC compressor short-cycling at 30–60 second intervals, the
trend node’s slope estimator cannot resolve the oscillation — use
a kalman1d-based detector instead.
Drift tolerance. The slope-aware projection handles baseline
drift up to roughly 0.005 units per sample. Faster drifts — strong
setpoint ramps, large ambient shifts, slow production trends — walk
winnow’s anchor out from under the signal and classification
degrades. Place a kalman1d-based detrender upstream in that regime.
Sinusoidal vs real limit cycles. The synthetic demo signal is
sinusoidal. Real hunting — especially from valve stiction — is
asymmetric: long dwells punctuated by sudden jumps at reversal. The
reversal rate still rises with cycle frequency, but the categorize
thresholds will need per-loop calibration against your own clean
baseline (see the procedure above).
Other shapes of failure. For a signal that drifts unidirectionally off its setpoint with no oscillation, use Gradual Drift. For a signal that jumps to a new level, use Sudden Shifts.
References
- Bristol, E.H. (1990). Swinging Door Trending: Adaptive Trend Recording? ISA National Conference Proceedings, pp. 749–756. (Direct heritage of winnow’s deadband + trajectory model.)
- Welford, B.P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420. doi:10.1080/00401706.1962.10490022 (The recurrence underneath esStats.)
- Åström, K.J. & Hägglund, T. (2006). Advanced PID Control. ISA Publishing. (Loop-tuning and hunting diagnosis, chapter 7.)
Next Steps
- Trajectory-Aware Adaptive Compression — the other winnow recipe; same per-sample event mechanism, a different downstream consumer
- Sudden Shifts — step changes rather than oscillations
- Composition Patterns — understand the per-sample-event → rate → state pattern in detail