Recipe · change detection

PID loop hunting

A boiler hold temperature oscillates a couple of degrees around its setpoint — within spec, no alarm fires. Three months later the modulating valve fails. The trace had been telling you the whole time, in how often the signal turned around rather than how far it strayed.

This recipe counts how often the signal reverses direction inside a sliding window. The winnow node fires on every trajectory turning point, swStats averages those events into a continuous rate, and categorize bands the rate into a graded health state.

Why frequency, not amplitude. Two oscillations at the same amplitude can produce opposite outcomes. A ten-minute period is a tuning nuisance. A one-minute period at the same amplitude shortens the modulating valve’s life by years. A control chart (mean ± 2σ) treats them as identical because it measures amplitude alone. This recipe measures how often the signal turns around.
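To make the contrast concrete, here is a minimal sketch in plain JavaScript. It does not use the recipe's flow API. It builds two noise-free sine waves with the same ±2 amplitude, one with a ten-minute period and one with a one-minute period, assuming one sample per second. The helper names (sineWave, controlBandHalfWidth, countReversals) are illustrative, and the naive sign-flip counter is only a stand-in for winnow that works because the signals carry no noise.

```javascript
// Synthetic sine wave; one sample per second is an assumption of this sketch.
function sineWave(samples, periodSamples, amplitude, offset) {
    const out = [];
    for (let i = 0; i < samples; i++) {
        out.push(offset + amplitude * Math.sin((2 * Math.PI * i) / periodSamples));
    }
    return out;
}

// Half-width of a mean ± 2σ control band (population standard deviation).
function controlBandHalfWidth(values) {
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const variance = values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
    return 2 * Math.sqrt(variance);
}

// Count direction reversals: places where the sign of the sample-to-sample step flips.
function countReversals(values) {
    let count = 0;
    let prevSign = 0;
    for (let i = 1; i < values.length; i++) {
        const sign = Math.sign(values[i] - values[i - 1]);
        if (sign !== 0 && prevSign !== 0 && sign !== prevSign) count++;
        if (sign !== 0) prevSign = sign;
    }
    return count;
}

const slowHunt = sineWave(3600, 600, 2, 80); // ten-minute period, ±2 around 80
const fastHunt = sineWave(3600, 60, 2, 80);  // one-minute period, same amplitude

console.log(controlBandHalfWidth(slowHunt), controlBandHalfWidth(fastHunt));
console.log(countReversals(slowHunt), countReversals(fastHunt));
```

Both waves produce the same control band, so an amplitude-based chart cannot separate them, while the reversal counts over the same hour differ by a factor of ten.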


flow('loop-hunting-detector')
    .sanitize('sane', 'temperature', { failureReason: 'failReason' },
        { ranges: { temperature: { min: 0, max: 200 } } })
    .median3('med', 'temperature', { median3: 'medT' })
    .esMean('slow', 'medT', { mean: 'slowT' }, { halfLife: 3 })
    .esStats('noise', 'slowT', { stdev: 'tStdev' }, { halfLife: 30 })
    .trend('dir', 'slowT',
        { trend: 'tDir', rocMean: 'tRoc' },
        { rocStatsHalfLife: 30, rocThreshold: 0.015, warmupSamples: 30 })
    .winnow('events', 'slowT',
        { significant: 'turningPoint' },
        {
            K: 7, tightenBase: 5000, maxGap: 5000,
            noiseField: 'tStdev',   // from esStats above
            dirField:   'tDir',      // from trend above
            slopeField: 'tRoc'       // from trend above
        })
    .transform('encode', 'turningPoint',
        { result: 'eventBit' },
        { using: ( v ) => ( v ? 1 : 0 ) })   // boolean to 0/1 so swStats can average it
    .swStats('rate', 'eventBit',
        { mean: 'reversalRate' },
        { windowSize: 600 })
    .categorize('level', 'reversalRate',
        { category: 'cyclingHealth' },
        { thresholds: [ 0.001, 0.0115, 0.020 ],
          categories: [ 'stable', 'minor', 'major', 'severe' ] })
    .run()

Drag the slider through the four phases and watch the recipe walk from stable through minor and major to severe — even though the three oscillating phases share the same amplitude.


What You’re Seeing

The chart is divided into four 30-minute phases. Subtle background shading marks the three hunting phases; the unshaded first 30 minutes are a stable baseline.

The gray line is raw temperature near 80°C. Look at the wave heights across all four phases — they are the same, about ±2°C. An amplitude alarm would see no difference between any of them.

Now look at the wave speed. In the slow-hunt phase the signal completes one full cycle in about ten minutes. In the medium phase, five minutes. In the fast phase, two and a half. Same height, doubling speed — that is what a hunting control loop looks like as it worsens.

The amber curve (right axis) is the reversal rate — how many direction changes the pipeline counted per sample over the last 600 samples. It sits near zero in the stable phase and steps up at each phase boundary as faster oscillations feed more turning points into the window. This single number is what the pipeline is built to produce.

The health card below the chart shows the graded state at the slider position: stable → minor → major → severe. A standard control chart (mean ± 2σ) cannot make this distinction — all three hunting phases have the same amplitude and would land in the same bucket. This recipe separates them by frequency alone.

Drag past minute 10 to see the first state appear — trend and swStats need their warmup windows to fill before they can report.

Where This Pattern Fits

Domain	What hunts	Why it’s hard to catch
Boiler temperature loop	Jacket or outlet temp under PID control	Oscillation amplitude stays within spec; only the period changes as tuning degrades
Reactor jacket cooling	Inlet or outlet temperature	Slow oscillations look like normal batch behaviour on a dashboard
Modulating control valve	Valve position or downstream flow	Mechanical wear produces frequent small reversals that no single reading flags
Tank level under flow control	Level transmitter	Level oscillates around setpoint with no trend — every reading is in range
Heat exchanger outlet	Downstream fluid temperature	Hunting appears only when the upstream loop adjusts to new operating conditions

How It Works

Two design choices make this recipe work. Both are calibrated against the feasibility study. Remove either one and classification accuracy drops from 100% to 36% or worse.

Smooth twice before taking the slope. median3 absorbs single-sample spikes. esMean with halfLife = 3 then drops the residual noise on the smoothed signal to about 0.013 — below trend’s rocThreshold of 0.015. Without esMean, trend’s rate-of-change estimate flickers on noise and winnow’s trend-reversal trigger fires on every flicker, flooding the rate with false events. The general rule: a slope estimator’s noise floor scales as 2σ/√halfLife; the signal slope you want to detect must sit above that floor.
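The rule of thumb above can be turned into a quick check. The sketch below is plain JavaScript with hypothetical helper names (slopeNoiseFloor, isDetectable). Which σ and which half-life enter the formula in the recipe's own calibration is not spelled out here, so the numbers in the example are illustrative only.

```javascript
// Rule of thumb from the text: noise floor of a slope estimate is about 2σ / sqrt(halfLife).
function slopeNoiseFloor(sigma, halfLife) {
    return (2 * sigma) / Math.sqrt(halfLife);
}

// A slope is treated as detectable only if it sits above that floor.
function isDetectable(signalSlope, sigma, halfLife) {
    return Math.abs(signalSlope) > slopeNoiseFloor(sigma, halfLife);
}

// Illustrative values, not taken from the recipe: σ = 0.05, halfLife = 25.
console.log(slopeNoiseFloor(0.05, 25));     // floor for these inputs
console.log(isDetectable(0.03, 0.05, 25));  // slope above the floor
console.log(isDetectable(0.01, 0.05, 25));  // slope buried in the floor
```

Because the floor falls only with the square root of the half-life, halving it requires four times as much smoothing, which is the trade-off against attenuating fast oscillations.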

Set winnow’s K to 7, not the default of 2. winnow’s deadband fires when the signal departs from its projected path by more than K × stdev. Think of K as a sensitivity dial — lower values trigger more fires, higher values suppress false alarms. At K = 3, about 3% of samples in the stable phase trigger a background fire from noise alone — enough to swamp the real hunting signal. At K = 7 the false-fire probability drops below one per million samples. The general rule: when winnow feeds an event counter, K needs to be well above the compression-oriented defaults.
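For intuition about how steeply false fires fall as K rises, the sketch below computes the two-sided tail probability of an ideal Gaussian, P(|Z| > K), by Simpson integration. This is an idealization and an assumption of the sketch: winnow compares against a projected path using an estimated stdev, so the figures quoted above come from the recipe itself and need not match these numbers. The function names are hypothetical.

```javascript
// Standard normal density.
function normalPdf(z) {
    return Math.exp(-0.5 * z * z) / Math.sqrt(2 * Math.PI);
}

// P(|Z| > k) for an ideal Gaussian: integrate the upper tail with Simpson's rule, then double it.
// `steps` must be even; the tail beyond k + span is negligible for span = 12.
function twoSidedTail(k, span = 12, steps = 12000) {
    const h = span / steps;
    let sum = normalPdf(k) + normalPdf(k + span);
    for (let i = 1; i < steps; i++) {
        sum += (i % 2 === 1 ? 4 : 2) * normalPdf(k + i * h);
    }
    return 2 * (h / 3) * sum;
}

for (const k of [2, 3, 5, 7]) {
    console.log(k, twoSidedTail(k));
}
```

Under this idealization the tail shrinks by many orders of magnitude between K = 2 and K = 7, which is the behaviour the sensitivity-dial description relies on.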

Keep the deadband width constant. winnow progressively narrows its deadband as elapsed samples grow — a compression feature that inserts intermediate anchor points in long quiet segments. In a pure event-counting role that tightening would seed spurious fires inside the sliding window and inflate the rate. Setting tightenBase: 5000 and maxGap: 5000 — both well beyond the 600-sample swStats window — pins the deadband at its full K × σ width across the entire window and disables the gap-fill anchor. Lower values are correct for compression but wrong here.

Counting the events. Each winnow.significant = true is a single event. The transform encoder converts the boolean to 0 or 1 so swStats can read it as a numeric field. swStats(mean) over a 600-sample window gives a continuous rate — fires per sample. Winnow fires at peaks, at troughs, and occasionally on large-enough noise excursions during the cycle, so the reported rate tracks the cycle frequency monotonically — faster hunting always produces a higher rate — but it is not literally 2 × cycles-per-sample. That is why categorize.thresholds are set against observed rates on a clean baseline, not against a ground-truth formula. The operator gets a single state field — stable, minor, major, or severe — that they can log, alarm on, or display.
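The encode-then-average step can be sketched outside the library. The helper below is hypothetical, not the swStats implementation: it keeps the last windowSize event bits and reports their mean, returning undefined until the window has filled (the warm-up behaviour is an assumption of this sketch).

```javascript
// Rolling mean of event bits over a fixed-size window.
function makeRateCounter(windowSize) {
    const buffer = [];
    let sum = 0;
    return function push(significant) {
        const bit = significant ? 1 : 0; // boolean to 0/1, the encoder's job
        buffer.push(bit);
        sum += bit;
        if (buffer.length > windowSize) sum -= buffer.shift();
        // No rate is reported until the window holds windowSize samples.
        return buffer.length < windowSize ? undefined : sum / windowSize;
    };
}

// Example: one event every 300 samples, averaged over a 600-sample window.
const pushEvent = makeRateCounter(600);
let lastRate;
for (let i = 0; i < 1200; i++) {
    lastRate = pushEvent(i % 300 === 0);
}
console.log(lastRate); // two events inside the final 600-sample window
```

The output is in fires per sample, the same unit as reversalRate, so it can be compared directly against the categorize thresholds.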

Wire winnow to the renamed fields. Winnow reads four fields from the message on every tick: its primary input (from.x), plus a noise estimate, a trend direction, and a slope value for its deadband and reversal checks. By default these three supporting fields are named stdev, trendDir, and roc. The pipeline above renames them with a t prefix (tStdev, tDir, tRoc), so winnow must be pointed at the renamed fields via noiseField, dirField, and slopeField. Miss this wiring and winnow silently reads undefined, its warm-up guard fires on every sample, and the downstream rate saturates at 1.0. Three options — easy to overlook, load-bearing.

The wider lesson is the composition pattern. A per-sample event detector, a sliding-window rate, and a graded state — the same shape applies wherever you need to turn discrete events into a continuous health indicator.

Tuning to your loop

The published thresholds are calibrated against the synthetic signal. On your own loop, calibrate from a clean baseline:

1. Run the pipeline for at least a week on a loop you know is healthy. Capture reversalRate to storage.
2. Note the stable-state value. Call it r_normal. On a healthy loop it is usually between 0 and ~0.002.
3. Set categorize.thresholds to [ max(r_normal × 2, 0.001), r_normal × 5, r_normal × 10 ].
4. Leave the rest (K, the smoothing chain, tightenBase, maxGap) at the values in this recipe. They are calibrated against the noise floor of the smoothed signal, not against the loop.
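The threshold step of the procedure above can be written as a small helper. This is a sketch in plain JavaScript with hypothetical function names. The banding helper only mimics what categorize is described as doing, and its boundary handling (strictly below a threshold falls in the lower band) is an assumption.

```javascript
// Thresholds from a measured healthy-loop baseline, following the rule in the procedure.
function thresholdsFromBaseline(rNormal) {
    return [Math.max(rNormal * 2, 0.001), rNormal * 5, rNormal * 10];
}

// Illustrative banding: pick the first band whose upper threshold exceeds the rate.
function bandRate(rate, thresholds, categories) {
    const idx = thresholds.findIndex((t) => rate < t);
    return categories[idx === -1 ? thresholds.length : idx];
}

const healthStates = ['stable', 'minor', 'major', 'severe'];
const baselineThresholds = thresholdsFromBaseline(0.002); // example baseline, not a measurement

console.log(baselineThresholds);
console.log(bandRate(0.015, baselineThresholds, healthStates));
```

One arithmetic consequence worth checking on your own data: for a baseline below 0.0002, the 0.001 floor on the first threshold exceeds r_normal × 5, so the list is no longer ascending. The procedure does not say how to handle that case.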

The table below is the reference for the other knobs.

Parameter	How to set it
`from.x`	Your loop variable
`winnow.K`	7 is the conservative choice. Drop to 6 only if your stable baseline is cleaner than the synthetic signal's noise
`categorize.thresholds`	Use the four-step procedure above
`swStats.windowSize`	About three times the longest period of hunt you expect to detect. A ten-minute slowest hunt at one sample per second means `windowSize: 1800`
`esMean.halfLife`	3 is good for noise σ ≤ 0.1. Increase if your signal is noisier — but faster oscillations then get attenuated by the heavier smoothing
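The windowSize rule in the table reduces to one line of arithmetic. A minimal sketch, with a hypothetical helper name and the factor of three taken from the table:

```javascript
// Window length in samples: about three times the longest hunt period you want to detect.
function windowSizeFor(longestPeriodSeconds, samplesPerSecond, multiple = 3) {
    return Math.round(multiple * longestPeriodSeconds * samplesPerSecond);
}

console.log(windowSizeFor(600, 1)); // ten-minute slowest hunt at one sample per second
```

A longer window gives a steadier rate but also a longer warm-up before the first state can be reported.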

What this recipe is not

Period floor. This recipe is calibrated for slow-to-medium control loops where one cycle contains at least about 150 samples. For HVAC compressor short-cycling at 30–60 second intervals, the trend node’s slope estimator cannot resolve the oscillation — use a kalman1d-based detector instead.
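A quick pre-check for the period floor, as a sketch with hypothetical names. It treats the "about 150 samples per cycle" figure as a hard cutoff, which is a simplification, and the examples assume one sample per second.

```javascript
// Approximate floor quoted in the text; treated as a hard limit only for this sketch.
const MIN_SAMPLES_PER_CYCLE = 150;

function samplesPerCycle(periodSeconds, samplesPerSecond) {
    return periodSeconds * samplesPerSecond;
}

function withinPeriodFloor(periodSeconds, samplesPerSecond) {
    return samplesPerCycle(periodSeconds, samplesPerSecond) >= MIN_SAMPLES_PER_CYCLE;
}

console.log(withinPeriodFloor(150, 1)); // 2.5-minute hunt at 1 Hz
console.log(withinPeriodFloor(60, 1));  // 60-second short-cycling at 1 Hz
```

Sampling faster raises the samples-per-cycle count, but whether the rest of the chain then needs retuning is outside what this recipe covers.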

Drift tolerance. The slope-aware projection handles baseline drift up to roughly 0.005 units per sample. Faster drifts — strong setpoint ramps, large ambient shifts, slow production trends — walk winnow’s anchor out from under the signal and classification degrades. Place a kalman1d-based detrender upstream in that regime.

Sinusoidal vs real limit cycles. The synthetic demo signal is sinusoidal. Real hunting — especially from valve stiction — is asymmetric: long dwells punctuated by sudden jumps at reversal. The reversal rate still rises with cycle frequency, but the categorize thresholds will need per-loop calibration against your own clean baseline (see the procedure above).

Other shapes of failure. For a signal that drifts unidirectionally off its setpoint with no oscillation, use Gradual Drift. For a signal that jumps to a new level, use Sudden Shifts.

References

Bristol, E.H. (1990). Swinging Door Trending: Adaptive Trend Recording? ISA National Conference Proceedings, pp. 749–756. (Direct heritage of winnow’s deadband + trajectory model.)
Welford, B.P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420. doi:10.1080/00401706.1962.10490022 (The recurrence underneath esStats.)
Åström, K.J. & Hägglund, T. (2006). Advanced PID Control. ISA Publishing. (Loop-tuning and hunting diagnosis, chapter 7.)

Next Steps

Trajectory-Aware Adaptive Compression — the other winnow recipe; same per-sample event mechanism, a different downstream consumer
Sudden Shifts — step changes rather than oscillations
Composition Patterns — understand the per-sample-event → rate → state pattern in detail