Rolling Walk-Forward Backtest

Simulate periodic portfolio rebalancing over historical data to evaluate how an optimization strategy would have performed with regular weight updates, capturing realistic out-of-sample performance across multiple time periods.

Overview

A standard portfolio optimization produces a single set of weights based on the entire historical window. While useful, this approach does not reflect how portfolios are managed in practice — real investors periodically rebalance their holdings as new data arrives.

The rolling walk-forward backtest addresses this by splitting the historical period into multiple rebalance intervals. At each interval the optimizer is re-run on a training window of past data, producing fresh weights that are then held for the subsequent out-of-sample period. This cycle repeats across the entire history, generating a sequence of hold-period returns, weight snapshots, and performance metrics.
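As a rough sketch, this train-hold cycle is a loop over consecutive rebalance dates. The function and parameter names below are illustrative, not the platform's actual API:

```python
from datetime import date


def walk_forward(prices, boundaries, optimize):
    """Run one train/hold cycle per pair of consecutive rebalance dates.

    prices     : dict mapping trading date -> price observation
    boundaries : sorted list of rebalance dates
    optimize   : callable mapping a training slice to portfolio weights

    All names here are illustrative, not the platform's actual API.
    """
    results = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Train on everything strictly before the rebalance date...
        train = {d: p for d, p in prices.items() if d < start}
        # ...then hold the resulting weights over the next out-of-sample interval.
        hold = {d: p for d, p in prices.items() if start <= d < end}
        results.append({"rebalance": start, "weights": optimize(train), "hold": hold})
    return results
```

Each entry of the result corresponds to one train-hold cycle: weights fitted on past data only, evaluated on the unseen hold slice.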

The result is a far more realistic picture of strategy performance: you see how weights evolve over time, how much turnover each rebalance incurs, and whether the strategy consistently delivers positive risk-adjusted returns or benefits from a single favorable regime.

Configuration Options

Rebalance Frequency

Controls how often the portfolio is re-optimized. Available options:

| Frequency | Cadence | Use Case |
| --- | --- | --- |
| Annual | Once per year | Long-term, low-turnover strategies |
| Semi-Annual | Twice per year (H1/H2) | Balanced rebalancing cadence |
| Quarterly | Four times per year | Active strategies, faster adaptation |

Window Type

Determines how much historical data is used for training at each rebalance point. This is the key “window” parameter.

Expanding Window

Uses all available data from the start of the dataset up to the rebalance date. Each successive period trains on a larger dataset. The window_length_years parameter is ignored in this mode.

Rolling (Fixed) Window

Uses only the most recent window_length_years years of data before each rebalance date. The training window slides forward with each period, always maintaining the same size. Requires the window_length_years parameter.

Window Length (Years)

Only applies when window_type is set to rolling. Specifies how many years of historical data to include in each training window. For example, a value of 3 means each optimization uses the trailing 3 years of prices. A minimum of 60 trading days is required in each window.
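A minimal sketch of how the two window types and the 60-trading-day minimum might be enforced (function and parameter names are illustrative):

```python
from datetime import timedelta

MIN_TRAIN_DAYS = 60  # minimum trading days per window, per the docs


def training_window(dates, rebalance_date, window_type="expanding", window_length_years=None):
    """Select the training dates for one rebalance point.

    dates is a sorted list of trading dates. Illustrative sketch,
    not the platform's actual function.
    """
    if window_type == "rolling":
        if window_length_years is None:
            raise ValueError("rolling windows require window_length_years")
        # Slide a fixed-length window ending at the rebalance date.
        start = rebalance_date - timedelta(days=round(365.25 * window_length_years))
        window = [d for d in dates if start <= d < rebalance_date]
    else:
        # Expanding: everything from the start of the dataset.
        window = [d for d in dates if d < rebalance_date]
    if len(window) < MIN_TRAIN_DAYS:
        raise ValueError(f"need at least {MIN_TRAIN_DAYS} trading days, got {len(window)}")
    return window
```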

Window Types Explained

Expanding Window

The training set grows with each period as more data becomes available:

[Timeline figure, 2010–2021: in each of four successive periods, the training window (from the start of the data through the rebalance date) grows longer, followed by that period's hold window. Legend: Training Window, Hold Period.]

More data can stabilize covariance estimates but may include stale market regimes that no longer reflect current conditions.

Rolling Window (e.g. 3 years)

The training window slides forward, always keeping the same length:

[Timeline figure, 2010–2021: in each of four successive periods, a fixed-length training window slides forward to end at the rebalance date, followed by that period's hold window. Legend: Training Window, Hold Period.]

Keeps training data fresh and equally weighted across time, better reflecting recent market conditions at the cost of less data per window.

How It Works

The backtest proceeds through four stages for each rebalance period:

  1. Build Schedule: identify rebalance dates
  2. Train: compute μ, Σ & optimize
  3. Hold: out-of-sample returns
  4. Aggregate: summary statistics

1. Build Rebalance Schedule

Based on the chosen frequency, the system identifies all rebalance boundary dates within the available price history. At least two boundaries are required to produce one complete train-hold cycle.
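One way the scheduling step can be sketched, assuming boundaries fall on the first trading date of each January/April/July/October (quarterly), January/July (semi-annual), or January (annual) — the month choices are an assumption for illustration:

```python
def rebalance_boundaries(trading_dates, frequency="quarterly"):
    """Return the first trading date of each boundary month.

    Illustrative sketch of the scheduling step; month mapping is assumed.
    """
    months = {"annual": {1}, "semi-annual": {1, 7}, "quarterly": {1, 4, 7, 10}}[frequency]
    boundaries, seen = [], set()
    for d in sorted(trading_dates):
        key = (d.year, d.month)
        # Keep only the first trading date observed in each boundary month.
        if d.month in months and key not in seen:
            seen.add(key)
            boundaries.append(d)
    return boundaries
```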

2. Train

For each rebalance date, the optimizer selects the training window (expanding or rolling) and computes the risk model — expected returns and covariance matrix — from the training prices. The selected optimization method then produces a set of portfolio weights.
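The risk-model step can be sketched as follows, assuming simple daily returns and a 252-trading-day annualization convention (both assumptions; the platform's actual estimators may differ):

```python
TRADING_DAYS = 252  # assumed annualization factor


def risk_model(prices):
    """Estimate annualized expected returns (mu) and covariance (Sigma)
    from a {ticker: [aligned price series]} dict.

    Plain-Python sketch; a real implementation would typically use
    numpy/pandas and may use log returns or shrinkage estimators.
    """
    # Simple daily returns per ticker.
    rets = {t: [p[i] / p[i - 1] - 1 for i in range(1, len(p))] for t, p in prices.items()}
    n = len(next(iter(rets.values())))
    mean = {t: sum(r) / n for t, r in rets.items()}
    mu = {t: TRADING_DAYS * m for t, m in mean.items()}
    # Sample covariance (ddof=1), annualized, keyed by ticker pair.
    sigma = {
        (a, b): TRADING_DAYS * sum((ra[i] - mean[a]) * (rb[i] - mean[b]) for i in range(n)) / (n - 1)
        for a, ra in rets.items()
        for b, rb in rets.items()
    }
    return mu, sigma
```

The resulting μ and Σ feed into whichever optimization method is selected.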

3. Hold (Out-of-Sample Evaluation)

The optimized weights are applied to the next period's prices (the hold window). Portfolio returns are calculated and performance metrics (Sharpe ratio, CAGR, max drawdown, volatility) are measured on this unseen data.
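A sketch of the hold-period metrics, using standard definitions of annualized Sharpe, CAGR, max drawdown, and volatility (the 252-day annualization factor is an assumption):

```python
import math

TRADING_DAYS = 252  # assumed annualization factor


def hold_metrics(portfolio_returns, rf_daily=0.0):
    """Compute performance metrics for one hold period's daily
    portfolio returns. Illustrative textbook definitions."""
    n = len(portfolio_returns)
    mean = sum(portfolio_returns) / n
    var = sum((r - mean) ** 2 for r in portfolio_returns) / (n - 1)
    vol = math.sqrt(var * TRADING_DAYS)  # annualized volatility
    sharpe = ((mean - rf_daily) * TRADING_DAYS) / vol if vol else 0.0
    # Max drawdown from the cumulative growth path.
    growth, peak, max_dd = 1.0, 1.0, 0.0
    for r in portfolio_returns:
        growth *= 1 + r
        peak = max(peak, growth)
        max_dd = max(max_dd, 1 - growth / peak)
    cagr = growth ** (TRADING_DAYS / n) - 1
    return {"sharpe": sharpe, "cagr": cagr, "max_drawdown": max_dd, "volatility": vol}
```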

4. Record & Aggregate

Each period records its weights, weight deltas (changes from the prior period), and turnover. Turnover is defined as:

$$\text{Turnover}_t = \sum_i \lvert w_{i,t} - w_{i,t-1} \rvert$$

where $w_{i,t}$ is the weight of asset $i$ at rebalance $t$. After all periods complete, the system aggregates results into summary statistics: total return, CAGR, average Sharpe, max drawdown, total turnover, and annualized volatility.
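The turnover definition — the sum of absolute weight changes at one rebalance — translates directly to code (a sketch; the function name is illustrative):

```python
def turnover(prev_weights, new_weights):
    """Sum of absolute weight changes across all assets at one rebalance.

    Assets present on only one side count as weight 0 on the other,
    so entries and exits contribute their full weight.
    """
    tickers = set(prev_weights) | set(new_weights)
    return sum(abs(new_weights.get(t, 0.0) - prev_weights.get(t, 0.0)) for t in tickers)
```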

Output & Visualizations

The results page provides several views into the backtest output:

Summary Card

Displays six key aggregate metrics across all periods: Total Return, CAGR, Average Sharpe, Max Drawdown, Total Turnover, and Annualized Volatility. Values are color-coded (green for positive, red for negative).

Weight Timeline Chart

A stacked area chart showing how portfolio allocation evolves across rebalance periods. Each colored area represents a ticker's weight, stacking to 100%. Tickers can be toggled on/off via the legend.

Weight Delta Chart

A grouped bar chart showing the weight changes at each rebalance. Green bars indicate increased allocation, red bars indicate decreased allocation. Changes below 0.05% are treated as noise and shown as neutral.

Rolling Metrics Chart

A dual-axis line chart tracking three metrics over time: Sharpe ratio (left axis), period return / CAGR (right axis, percentage), and the risk-free rate used in each period (right axis, dashed). A horizontal reference line at Sharpe = 0 is included for context.

Period Details Table

A tabular breakdown of every rebalance period showing: rebalance date, hold dates, risk-free rate, Sharpe ratio, return, volatility, top holdings (top 3 by weight), and turnover. Each row represents one complete train-hold cycle.

Sharpe Ratio Inference

A raw Sharpe ratio is a point estimate. Without a measure of statistical significance, it is impossible to know whether observed performance reflects genuine skill or is simply the result of a short, lucky sample. This platform computes two inference statistics — the Probabilistic Sharpe Ratio (PSR) and the Minimum Track Record Length (MinTRL) — following López de Prado, Lipton & Zoonekynd (2025/2026) “How to Use the Sharpe Ratio”.

These statistics apply to both regular optimization results (using the full historical daily return series) and rolling backtest results (using the chained out-of-sample hold-period returns). In either case, the return series supplies the observations needed to estimate skewness, kurtosis, and serial correlation.

Why Point Estimates Are Misleading

The standard i.i.d.-Normal assumption for SR inference can underestimate the sampling variance of the Sharpe ratio by 4× or more for realistic returns with negative skew, excess kurtosis, and positive autocorrelation. The paper identifies five common pitfalls:

  1. Ignoring non-normality (skewness and kurtosis)
  2. Ignoring serial correlation (autocorrelation)
  3. Using too short a track record
  4. Treating the annualized SR as if it were normally distributed
  5. Comparing SRs across different frequencies without adjusting

Generalized Sampling Variance (Eq. 2–3)

The generalized variance of the Sharpe ratio estimator, accounting for non-Normality and serial correlation, is:

$$\widehat{V}[\widehat{SR}] = \frac{1}{n}\left(1 - \gamma_3\,\widehat{SR} + \frac{\gamma_4 - 1}{4}\,\widehat{SR}^2\right)\frac{1+\rho}{1-\rho}$$

where $\gamma_3$ is skewness, $\gamma_4$ is Pearson kurtosis ($\gamma_4 = 3$ for Normal returns), $\rho$ is the AR(1) autocorrelation, and $n$ is the number of observations. Setting $\gamma_3 = 0$, $\gamma_4 = 3$, $\rho = 0$ recovers the classical i.i.d.-Normal result $\widehat{V}[\widehat{SR}] = (1 + \widehat{SR}^2/2)/n$.
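A sketch of a generalized variance computation in this spirit, combining the skewness/kurtosis adjustment with a multiplicative AR(1) factor — a reconstruction under stated assumptions, not the paper's reference code:

```python
def sr_variance(sr, n, skew=0.0, kurt=3.0, rho=0.0):
    """Generalized sampling variance of the Sharpe ratio estimator.

    kurt is Pearson kurtosis (3 for Normal returns); rho is the AR(1)
    autocorrelation. With skew=0, kurt=3, rho=0 this reduces to the
    classical i.i.d.-Normal result (1 + sr**2 / 2) / n.
    """
    base = (1 - skew * sr + (kurt - 1) / 4 * sr ** 2) / n
    return base * (1 + rho) / (1 - rho)
```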

Probabilistic Sharpe Ratio — PSR (Eq. 4–5, 9)

The PSR is the probability that the true Sharpe ratio exceeds a benchmark value $SR_0$ (typically 0), adjusted for sample properties:

$$\text{PSR}(SR_0) = \Phi\!\left(\frac{\widehat{SR} - SR_0}{\hat{\sigma}_0}\right)$$

where $\Phi$ is the standard Normal CDF and $\hat{\sigma}_0$ is the generalized standard error evaluated at the null $SR_0$. PSR > 0.95 means 95% confidence that the strategy has genuine positive risk-adjusted returns.
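A PSR sketch that evaluates the generalized standard error at the null, as the text describes; the stdlib `statistics.NormalDist` supplies the Normal CDF, and the variance form (skewness/kurtosis terms with a multiplicative AR(1) factor) is an assumption:

```python
from statistics import NormalDist


def psr(sr, sr0, n, skew=0.0, kurt=3.0, rho=0.0):
    """Probabilistic Sharpe Ratio: P(true SR > sr0).

    Uses a generalized standard error evaluated at the null sr0;
    illustrative implementation, not the paper's reference code.
    """
    var0 = (1 - skew * sr0 + (kurt - 1) / 4 * sr0 ** 2) / n * (1 + rho) / (1 - rho)
    return NormalDist().cdf((sr - sr0) / var0 ** 0.5)
```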

Minimum Track Record Length — MinTRL (Eq. 10–11)

The minimum number of observations required to reject $SR_0$ at confidence level $\alpha$:

$$\text{MinTRL} = \left(1 - \gamma_3\,SR_0 + \frac{\gamma_4 - 1}{4}\,SR_0^2\right)\frac{1+\rho}{1-\rho}\left(\frac{z_\alpha}{\widehat{SR} - SR_0}\right)^2$$

where $z_\alpha$ is the standard Normal quantile at confidence $\alpha$. If $n$ < MinTRL, the observed Sharpe ratio cannot be reliably distinguished from the null even if it appears large. The platform displays whether your current track record is sufficient and how many additional observations are needed.
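A MinTRL sketch under the same assumed variance form (skewness/kurtosis terms with a multiplicative AR(1) factor, evaluated at the null), obtained by solving PSR ≥ α for the sample size:

```python
from statistics import NormalDist


def min_trl(sr, sr0, alpha=0.95, skew=0.0, kurt=3.0, rho=0.0):
    """Minimum number of observations needed to reject sr0 at
    confidence alpha. Illustrative reconstruction; assumes sr != sr0."""
    z = NormalDist().inv_cdf(alpha)
    # Per-observation generalized variance term, evaluated at the null.
    v0 = (1 - skew * sr0 + (kurt - 1) / 4 * sr0 ** 2) * (1 + rho) / (1 - rho)
    return v0 * (z / (sr - sr0)) ** 2
```

A larger observed Sharpe ratio shrinks the required track record, while negative skew, fat tails, or positive autocorrelation lengthen it.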

Paper Validation Example

López de Prado et al. (2025) report a validation example: for a monthly sample with the skewness, kurtosis, autocorrelation, and length given in the paper, the annualized SR estimate is 0.456, PSR(SR₀ = 0) = 0.966, and MinTRL(SR₀ = 0) ≈ 19.5 months.

Advantages & Limitations

Advantages

  • +Out-of-sample evaluation avoids look-ahead bias
  • +Reveals strategy robustness across different market regimes
  • +Turnover tracking exposes hidden transaction costs
  • +Supports both expanding and rolling window modes
  • +Works with all optimization methods (except Technical Indicator)
  • +Weight evolution charts show allocation drift over time

Limitations

  • Requires sufficient history (at least 60 trading days per window)
  • Does not model transaction costs or slippage directly
  • Assumes instantaneous rebalancing at period boundaries
  • Rolling window may discard useful long-term information
  • Computation time scales linearly with number of periods
  • Not available for Technical Indicator method

References

  1. Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance.” Notices of the AMS, 61(5), 458–471.
  2. Harvey, C. R., & Liu, Y. (2015). “Backtesting.” Journal of Portfolio Management, 42(1), 13–28.
  3. López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapters 11–12 on backtesting and cross-validation.