White's reality check¶
A bootstrap test for whether the best strategy among K candidates beats zero, after correcting for the bias from picking the best.
The data-snooping problem: test 50 random strategies and keep the one with the highest Sharpe; its marginal p-value looks great, but the multiple-comparison penalty usually erases the edge. White (2000) builds the penalty into the test directly. Under the null that no strategy has positive expected excess return, the test asks how likely the observed best statistic is.
The implementation uses the stationary bootstrap of Politis & Romano (1994) so the bootstrap samples preserve the autocorrelation of the original return series.
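The resampling idea is simple: rather than drawing periods independently, the stationary bootstrap draws blocks whose lengths are geometrically distributed around the mean block size, which preserves short-range autocorrelation. A minimal sketch of the index-generation step (not the library's implementation, just the published algorithm in plain NumPy):

```python
import numpy as np

def stationary_bootstrap_indices(T, avg_block_size, rng):
    """One stationary-bootstrap resample of the indices 0..T-1.

    Each position continues the current block with probability
    1 - 1/avg_block_size, or starts a new block at a uniformly random
    index otherwise (Politis & Romano, 1994). Blocks wrap around at T.
    """
    p = 1.0 / avg_block_size
    idx = np.empty(T, dtype=np.int64)
    idx[0] = rng.integers(T)
    for t in range(1, T):
        if rng.random() < p:
            idx[t] = rng.integers(T)        # start a new block
        else:
            idx[t] = (idx[t - 1] + 1) % T   # continue the current block
    return idx

# Resample a 250-period series with the sqrt(T) rule of thumb.
rng = np.random.default_rng(42)
idx = stationary_bootstrap_indices(250, 250 ** 0.5, rng)
```

Applying `idx` as a column index to the `(K, T)` returns matrix yields one bootstrap replication for all strategies at once, so cross-strategy correlation is preserved too.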
Inputs¶
- `returns` — a 2D matrix of excess returns shaped (num_strategies, num_periods). Each row is one strategy's return series. Benchmark adjustment is on the caller: pass raw returns to test against zero, or `returns - benchmark` otherwise.
- `num_bootstrap` — the resample count (default 10 000).
- `avg_block_size` — the mean block length; 0 (default) picks sqrt(num_periods), a standard rule of thumb for return-series autocorrelation.
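The benchmark-adjustment step is one line of NumPy broadcasting; the array values below are made up for illustration:

```python
import numpy as np

# Hypothetical raw per-period returns: rows = strategies, cols = periods.
raw = np.array([[0.01, -0.02, 0.03],
                [0.00,  0.01, 0.02]])
# Benchmark return series over the same periods.
benchmark = np.array([0.005, -0.01, 0.01])

# Subtracting a 1D series from a 2D matrix broadcasts across rows,
# giving each strategy's excess return over the benchmark.
excess = raw - benchmark
```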
Python¶
import flox_py as flox
import numpy as np
# rows = strategies, cols = periods. Pass excess returns.
returns = np.array([
strat_a_excess_returns, # length T
strat_b_excess_returns,
strat_c_excess_returns,
])
result = flox.whites_reality_check(
returns,
num_bootstrap=10_000,
avg_block_size=0.0, # auto = sqrt(T)
seed=42,
)
print(f"best strategy index: {result['best_index']}")
print(f"observed statistic : {result['best_stat']:.4f}")
print(f"p-value : {result['p_value']:.4f}")
A small p-value (say <5%) means the best strategy's edge survives the multiple-comparison correction. A large one says the apparent edge is likely a lucky pick.
Node¶
const flox = require('@flox-foundation/flox');
// Flat row-major: strategy 0 first, then strategy 1, ...
const K = 3; // strategies
const T = 1000; // periods
const returns = new Float64Array(K * T);
// fill with excess returns ...
const result = flox.whitesRealityCheck(returns, K, T, 10000, 0.0);
console.log(result); // { p_value, best_stat, best_index }
Codon¶
from C import flox_stat_whites_reality_check(cobj, u64, u64, u32, f64,
cobj, cobj, cobj) -> None
import numpy as np
returns = np.array([...], dtype=np.float64).reshape((K, T))
p = np.array([0.0])
stat = np.array([0.0])
idx = np.array([0], dtype=np.int32)
flox_stat_whites_reality_check(
returns.ctypes.data, K, T, 10000, 0.0,
p.ctypes.data, stat.ctypes.data, idx.ctypes.data,
)
print(p[0], stat[0], idx[0])
Pairing with GridSearch and walk-forward¶
Run this after a grid search, before keeping the top-Sharpe cell.
import numpy as np
import flox_py as flox
results = grid.run()
# Build (K, T) returns from per-cell return series. The exact source
# is up to your strategy: equity-curve log-returns, per-trade returns,
# or per-bar PnL.
returns = np.stack([cell["bar_returns"] for cell in results])
# Pass excess returns. Test against zero here; subtract a benchmark
# series before this line if needed.
out = flox.whites_reality_check(returns, num_bootstrap=10_000)
print(f"Best of {len(results)} cells: p={out['p_value']:.4f}")
Same pattern with walk-forward: feed the out-of-sample return series for each parameter setting and check whether the best one is still significant after the sweep.
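One practical wrinkle: walk-forward folds can leave the per-setting out-of-sample series at slightly different lengths, and the test needs a rectangular (K, T) matrix. A small sketch of the stacking step (the series here are synthetic stand-ins):

```python
import numpy as np

# Hypothetical out-of-sample return series, one per parameter setting;
# fold boundaries leave them slightly different lengths.
oos_series = [np.random.default_rng(i).normal(0.0, 0.01, n)
              for i, n in enumerate([252, 250, 251])]

# Truncate to the common length so the series stack into a (K, T) matrix.
T = min(len(s) for s in oos_series)
returns = np.stack([s[:T] for s in oos_series])
# returns has shape (3, 250), ready to pass to the test.
```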
Caveats¶
- The bootstrap recentres each strategy's mean to zero before building the bootstrap distribution. `best_stat` reflects the in-sample mean; the comparison happens against the bootstrap draws.
- The stationary bootstrap preserves short-range autocorrelation via block resampling, but does not address survivorship bias in the strategy pool or look-ahead bias in the data.
- The C ABI uses a fixed seed so the same inputs produce the same p-value. pybind11 exposes `seed=` for control; the NAPI binding inherits the C-side default (42).
- The penalty grows quickly with K. A 10 000-sample bootstrap is fine up to a few hundred candidates; above that, push `num_bootstrap` higher.
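To make the recentre-and-compare logic concrete, here is a minimal self-contained sketch of the test. It uses a plain iid bootstrap for brevity (the library uses the stationary bootstrap instead), so treat it as an illustration of the statistic and the p-value computation, not a substitute for the implementation:

```python
import numpy as np

def reality_check_sketch(returns, num_bootstrap=2000, seed=0):
    """Illustrative White's reality check with an iid bootstrap.

    returns: (K, T) matrix of excess returns, rows = strategies.
    """
    rng = np.random.default_rng(seed)
    K, T = returns.shape
    means = returns.mean(axis=1)
    best_index = int(np.argmax(means))
    best_stat = means[best_index]
    # Recentre each strategy so the null (zero mean) holds in resamples.
    centred = returns - means[:, None]
    count = 0
    for _ in range(num_bootstrap):
        idx = rng.integers(T, size=T)               # iid resample of periods
        boot_best = centred[:, idx].mean(axis=1).max()
        if boot_best >= best_stat:                  # best-of-K under the null
            count += 1
    return {"p_value": count / num_bootstrap,
            "best_stat": best_stat, "best_index": best_index}

# Demo: two noise strategies plus one with a genuine edge.
rng = np.random.default_rng(1)
demo = rng.normal(0.0, 1.0, (3, 200))
demo[0] += 1.0
out = reality_check_sketch(demo, num_bootstrap=500)
```

On the demo data the p-value comes out small, because the recentred bootstrap draws rarely produce a best-of-K mean as large as the observed one.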