The simplest summary of the leaderboard data is also the most unsettling: the best covered call strategy on Lockheed Martin outperforms the best covered call strategy on Apple by a factor of nine. The same system, the same entry logic, the same statistical validation framework, applied to two of America's largest companies, produces a Sharpe ratio of 4.51 in one case and 0.50 in the other. Understanding why requires setting aside the usual explanations — better signals, smarter exits, more sophisticated agents — and confronting the more fundamental question of which assets are structurally suited to premium harvesting and which are not.
Nine hundred and twenty-three strategies were evaluated across 41 tickers and 22 distinct entry signal variants. The distribution of out-of-sample Sharpe ratios is neither bell-shaped nor random. It is bimodal, with a significant mass of strongly positive results clustered between Sharpe 2.0 and 4.5, a large body of strategies near zero, and a long left tail of negative outcomes that extends — in one extraordinary case — to −39.95.
The five-tier classification system — S, A, B, C, F — was defined by Sharpe thresholds set before the search began, not fitted to the results afterwards. Tier S requires an out-of-sample Sharpe above 2.0, a standard that most professional systematic strategies would be satisfied to meet across an entire fund. The system produced 192 such strategies, representing 21% of everything tested.
| Tier | OOS Sharpe threshold | Count | % of total | Interpretation |
|---|---|---|---|---|
| S | Sharpe > 2.0 | 192 | 20.8% | Elite. Deploy with full confidence. |
| A | 1.5 – 2.0 | 97 | 10.5% | Strong. Portfolio candidates. |
| B | 0.5 – 1.5 | 124 | 13.4% | Viable. Monitor closely in production. |
| C | 0.0 – 0.5 | 131 | 14.2% | Marginal. Structural edge unclear. |
| F | Sharpe < 0 | 379 | 41.1% | Failed. Active destruction of capital. |
The 41% failure rate is not a cause for alarm; it is a cause for reflection. A well-designed search should produce many failures — they are the evidence that the positive results are not artefacts of a permissive filtering process. If every strategy had shown positive Sharpe, the correct inference would be that the test framework was contaminated. Failure is the price of a credible pass.
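The tier thresholds and shares in the table above can be reproduced with a few lines of Python. This is a minimal sketch, not the system's own classifier: the function name `tier` is hypothetical, and the handling of exact boundary values (for example, a Sharpe of exactly 1.5 or 0.0) is an assumption, since the table does not say which tier owns each edge.

```python
def tier(oos_sharpe: float) -> str:
    """Map an out-of-sample Sharpe ratio to a leaderboard tier.

    Boundary handling is an assumption: each lower bound is treated as
    inclusive, except Tier S, which the table states as strictly > 2.0.
    """
    if oos_sharpe > 2.0:
        return "S"
    if oos_sharpe >= 1.5:
        return "A"
    if oos_sharpe >= 0.5:
        return "B"
    if oos_sharpe >= 0.0:
        return "C"
    return "F"


# Tier counts as reported in the table; the shares are recomputed from them.
counts = {"S": 192, "A": 97, "B": 124, "C": 131, "F": 379}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)   # number of strategies evaluated
print(shares)  # percentage of strategies per tier
```

Recomputing the shares from the counts is a cheap consistency check: the counts sum to the 923 strategies evaluated, and Tier F works out to the 41.1% failure rate discussed above.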
Walk-Forward Efficiency is the ratio of out-of-sample Sharpe to in-sample Sharpe. WFE > 1.0 means the strategy performs better on data it never saw than on data it was trained on — the hallmark of a structural rather than fitted edge.
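As a minimal sketch of that definition, the function below computes the ratio and returns `None` when the in-sample Sharpe is not positive. The function name is hypothetical, and treating a non-positive denominator as undefined is an assumption about how the leaderboard handles such rows. The inputs are the rounded Sharpe values from the leaderboard, so results can differ slightly from the published WFE column.

```python
from typing import Optional


def walk_forward_efficiency(oos_sharpe: float, is_sharpe: float) -> Optional[float]:
    """Ratio of out-of-sample to in-sample Sharpe.

    Returns None when the in-sample Sharpe is zero or negative, because the
    ratio then no longer measures how much of the training performance
    survived out of sample (assumed convention).
    """
    if is_sharpe <= 0:
        return None
    return oos_sharpe / is_sharpe


print(walk_forward_efficiency(4.27, 3.23))   # SLV / cc_bbsq
print(walk_forward_efficiency(3.77, -0.13))  # HON / cc_vrp, undefined
```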
The top twenty strategies by out-of-sample Sharpe span just six tickers: LMT, SLV, XOM, CVX, HON, and GLD. The concentration is not accidental. It reflects a structural property of the underlying assets that will be examined in detail in the next section. First, the rankings.
| # | Ticker | Strategy | OOS Sharpe | IS Sharpe | WFE |
|---|---|---|---|---|---|
| 1 | LMT | cc_vix | 4.51 | 1.03 | 4.36× |
| 2 | SLV | cc_bbsq | 4.27 | 3.23 | 1.32× |
| 3 | SLV | cc_vrp | 4.27 | 3.56 | 1.20× |
| 4 | LMT | cc_earn | 4.25 | 0.80 | 5.32× |
| 5 | XOM | cc_bbsq | 4.17 | 1.08 | 3.87× |
| 6 | LMT | cc_vrp | 4.11 | 0.93 | 4.40× |
| 7 | XOM | cc_vrp | 4.02 | 0.57 | 7.02× |
| 8 | XOM | cc_sink | 3.98 | 0.92 | 4.32× |
| 9 | XOM | cc_rsi | 3.97 | 1.08 | 3.67× |
| 10 | LMT | cc_bbsq | 3.97 | 0.84 | 4.71× |
| 11 | CVX | cc_sink | 3.88 | 0.83 | 4.69× |
| 12 | HON | cc_bbsq | 3.86 | 0.41 | 9.41× |
| 13 | XOM | cc_macd | 3.84 | 1.02 | 3.75× |
| 14 | GLD | cc_bbsq | 3.80 | 1.82 | 2.09× |
| 15 | HON | cc_vrp | 3.77 | −0.13 | — |
| 16 | LMT | cc_ma200 | 3.77 | 0.74 | 5.12× |
| 17 | XOM | cc_term | 3.72 | 1.14 | 3.27× |
| 18 | XOM | cc_vix | 3.72 | 1.14 | 3.27× |
| 19 | XOM | cc_always | 3.72 | 1.14 | 3.27× |
| 20 | GLD | cc_vol | 3.70 | 1.39 | 2.67× |
Row 15 is the one that demands explanation. HON/cc_vrp achieves a Sharpe of 3.77 out-of-sample while posting an in-sample Sharpe of −0.13: negative during training, exceptional during testing. Because the in-sample Sharpe is negative, the Walk-Forward Efficiency ratio would also be negative and cannot be read as an efficiency, so it is listed as undefined. This result is not a data error. It is the most extreme instance of a pattern visible throughout the top of the table: out-of-sample performance consistently and substantially exceeding in-sample performance. A WFE of 4.36 for the top-ranked strategy means that on data the model never saw during optimisation, it returned more than four times the risk-adjusted performance it showed during training. This is the opposite of what the overfitting hypothesis predicts.
"The portfolio actually performs better out-of-sample than in-sample. This is characteristic of a structural edge — like theta decay — rather than a fitted pattern."
Phase 6 Statistical Validation Report · Stack$Trader Documentation
The most striking pattern in the full 923-strategy dataset is not the magnitude of the best results; it is the consistency of the split between winning and losing ticker universes. There are two worlds in this data, and the dividing line runs directly between the old economy and the new.
On the winning side: energy majors (XOM, CVX), aerospace and defence (LMT, HON), precious metals ETFs (GLD, SLV), and diversified equity income (VYM). These assets share three properties: moderate implied volatility relative to realised, low susceptibility to overnight gap risk from earnings surprises or product launches, and stable option premium as a fraction of share price.
LMT 4.51 · SLV 4.27 · XOM 4.17 · CVX 3.88 · HON 3.86 · GLD 3.80 · VYM 3.50
On the losing side: big tech (MSFT, AAPL, AMZN, GOOG), high-beta growth (TSLA, SOFI, PLTR), crypto proxies (COIN, IBIT), and speculative small-caps. These assets share: high realised volatility that frequently exceeds implied, sustained upward price momentum that forces covered calls to be assigned or rolled at a loss, and binary event risk (product launches, analyst upgrades, macro betas) that overwhelms the theta premium.
MSFT −2.94 · TSLA −1.43 · AMZN −1.87 · AAPL −1.64 · IBIT −1.53
The intuition is this: a covered call earns money when the underlying moves less than implied volatility suggests it will. In stable, cash-flow-generating businesses (an oil major, a defence contractor, a gold ETF) this condition is reliably met. Implied volatility for these assets incorporates a structural risk premium that consistently exceeds realised volatility over a cycle. In technology stocks, the opposite holds: the stocks actually move as much as or more than options markets price in, and additionally tend to move directionally upward (capturing the call premium while generating assignment losses from truncated upside).
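The intuition above can be made concrete with the standard expiry payoff of a covered call. This is a minimal sketch with hypothetical numbers (entry at 100, a 105 strike, a premium of 1.00); it ignores dividends, early assignment, and transaction costs, and the function names are illustrative only.

```python
def covered_call_pnl(entry_price: float, strike: float,
                     premium: float, expiry_price: float) -> float:
    """P&L per share at expiry for long stock plus one short call."""
    stock_pnl = expiry_price - entry_price
    # The short call keeps the premium but pays out any value above the strike.
    short_call_pnl = premium - max(expiry_price - strike, 0.0)
    return stock_pnl + short_call_pnl


def stock_only_pnl(entry_price: float, expiry_price: float) -> float:
    """P&L per share for simply holding the stock."""
    return expiry_price - entry_price


# Hypothetical scenarios: a quiet drift and a sharp rally.
for expiry_price in (101.0, 120.0):
    cc = covered_call_pnl(100.0, 105.0, 1.0, expiry_price)
    stock = stock_only_pnl(100.0, expiry_price)
    print(f"expiry {expiry_price}: covered call {cc:+.2f}, stock only {stock:+.2f}")
```

In the quiet scenario the covered call beats the stock by the premium collected. In the rally the gain is capped at the strike plus premium, which is the truncated upside that penalises covered calls on strongly trending names.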
The data make this concrete. Ranked by their single best strategy across all entry variants:
The leaderboard encompasses not just the choice of underlying but also the choice of entry strategy — the signal logic that determines when the covered call position is opened. Twenty-two distinct entry strategy variants were tested, ranging from the simplest possible rule (cc_always: enter on every available date) to complex multi-signal composites. The results across entry strategies are, by comparison to the ticker variation, surprisingly uniform — but not entirely without structure.

| Entry logic or structure | Result |
|---|---|
| Unconditional entry. Sell a covered call whenever no position is open. The no-skill baseline. | Works on good tickers |
| Bollinger Band squeeze filter. Enter when realised volatility is contracting, so implied premium is relatively rich to near-term realised. | Top-ranked variant |
| Volatility Risk Premium gate. Enter when implied volatility exceeds recent realised volatility by a threshold. Explicitly targets the VRP structural edge. | Consistent top-5 |
| VIX-regime filter. Enter when market-wide fear is elevated, since premium is structurally richer during high-VIX environments. | Best for LMT (#1) |
| Earnings adjacency filter. Enters specifically around earnings calendar windows when implied vol is inflated above realised. | Best for LMT (#4) |
| Pre-earnings premium capture. Enters the week before earnings when implied vol is rising into the event, exits before the event resolves. | High WFE (7.92×) |
| Short OTM call + short OTM put. Collects premium on both sides. Unlimited downside risk if the underlying makes a large directional move. | Catastrophically fails |
| Four-leg defined-risk structure. Short OTM put spread + short OTM call spread. Theoretically superior capital efficiency. | Fails universally |
| Bull put spread. Short OTM put + long lower-strike put. Defined risk, lower premium than naked put. | Fails universally |
| Gold standard consensus: requires agreement across all major directional agents. Extremely selective entry, so few trades fire. | Mixed (high WFE) |

The structure results deserve separate attention. Iron condors, strangles, and vertical put spreads failed uniformly across all tickers, not merely underperforming but actively destroying capital in both in-sample and out-of-sample periods. The worst single result in the entire dataset, ONDS/strangle at −39.95, is a multi-leg structure. This is counterintuitive: iron condors and spreads are frequently advocated as superior to covered calls because they offer defined risk and capital efficiency. In the Stack$Trader dataset, they do not outperform. They catastrophically underperform, even on the tickers where the simple covered call excels.
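Of the entry variants listed above, the Volatility Risk Premium gate is the easiest to illustrate. The sketch below is an assumption-laden illustration rather than the system's actual signal: the function names, the 252-day annualisation, the lookback implied by the length of `closes`, and the `min_premium` threshold of 0.02 are all hypothetical.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualisation factor


def realised_vol(closes: list[float], trading_days: int = TRADING_DAYS) -> float:
    """Annualised realised volatility from a series of daily closes."""
    log_returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    return statistics.stdev(log_returns) * math.sqrt(trading_days)


def vrp_gate(implied_vol: float, closes: list[float],
             min_premium: float = 0.02) -> bool:
    """True when implied vol exceeds realised vol by at least min_premium."""
    return implied_vol - realised_vol(closes) >= min_premium


# Hypothetical price path and implied-vol quotes (annualised decimals).
closes = [100.0, 101.0, 100.0, 101.0, 100.0, 101.0]
print(round(realised_vol(closes), 3))
print(vrp_gate(0.25, closes))  # implied well above realised
print(vrp_gate(0.15, closes))  # implied below realised
```

The gate opens only when the option market is pricing in more movement than the underlying has recently delivered, which is the condition the covered call needs in order to earn its premium.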
The most likely explanation is execution realism. Multi-leg structures incur double the bid-ask spread friction, require simultaneous fills across multiple legs, and create complex delta-management problems when the underlying moves against one leg. The synthetic option model used here, which constructs positions from yfinance data rather than live quotes, may not capture the full cost of entering and exiting four-leg structures. The negative results should be treated as a warning flag rather than a definitive verdict on multi-leg strategies as a category.
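A rough way to see how friction scales with leg count is to express the round-trip spread cost as a share of the premium collected. Everything in this sketch is a hypothetical illustration: the function names, the 0.05 bid-ask spread, the 1.00 net premium, and the assumption that each fill crosses half the spread and every leg is both opened and closed in the market.

```python
def round_trip_friction(legs: int, bid_ask_spread: float) -> float:
    """Spread cost per share of opening and closing every option leg.

    Assumes each fill crosses half the quoted spread, so one open plus one
    close costs one full spread per leg.
    """
    return legs * bid_ask_spread


def friction_share(legs: int, bid_ask_spread: float, net_premium: float) -> float:
    """Round-trip spread cost as a fraction of the net premium collected."""
    return round_trip_friction(legs, bid_ask_spread) / net_premium


# Hypothetical quotes: same spread and net premium, different leg counts.
for name, legs in (("covered call", 1), ("strangle", 2), ("iron condor", 4)):
    print(f"{name}: {friction_share(legs, 0.05, 1.00):.0%} of premium")
```

Under these assumptions the cost grows linearly with the number of legs, so a four-leg structure gives up several times more of its premium to the spread than a single short call does.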
The bottom five results are as instructive as the top five. They fall into two categories: assets with structural incompatibility with premium selling, and structures with execution flaws that no amount of signal optimisation can overcome.
| Rank | Ticker | Strategy | OOS Sharpe | IS Sharpe | Diagnosis |
|---|---|---|---|---|---|
| 923 | ONDS | strangle | −39.95 | −7.07 | Low-float biotech + naked strangle = extreme gap risk |
| 922 | SLV | strangle | −10.30 | −5.50 | Even a top cc ticker fails with wrong structure |
| 921 | SLV | iron_condor | −4.01 | −7.03 | Defined risk but negative in both periods |
| 920 | TSM | iron_condor | −3.75 | −7.57 | Semiconductor volatility incompatible with condor |
| 919 | LRCX | strangle | −3.57 | −25.58 | LRCX IS Sharpe of −25.58 suggests structural data issue |
The ONDS/strangle result at −39.95 is the statistical equivalent of a car crash: it tells you something important happened, and the investigation is as informative as the result itself. ONDS is a small-cap biotech — a category with routine binary event risk from FDA approvals, clinical trial announcements, and financing events. Selling a naked strangle on such an asset creates a payoff profile that is short enormous gap risk on both sides for a premium that is structurally insufficient to compensate. The in-sample Sharpe of −7.07 confirms this is not an OOS anomaly; the strategy was losing money during training as well and should never have been deployed.
The SLV/strangle result at −10.30 makes the more interesting point. SLV ranks second overall at Sharpe 4.27 when a covered call strategy is applied, and third at 4.27 with a different entry variant. The same asset, on the same data period, produces a top-two result and a near-bottom result depending solely on the choice of structure. Asset selection and structure selection are not independent decisions — they interact multiplicatively. The worst thing a researcher can do is identify a strong underlying and then choose the wrong structure for it.
When every bucket of a grid search converges on identical optimal parameters, the grid search has found a structural optimum — not a local one fitted to the bucket's specific data. This is the strongest possible evidence that the parameters are real.
One of the most significant findings of the Phase 6 search is not in the Sharpe ratios but in the parameters. Across all tickers, all time buckets, and all strategy variants, the grid search converged on essentially identical optimal values: a call delta of 0.20, 14 days to expiry (DTE), and an ensemble score threshold of −0.15.
This convergence across independent subsets of the data is the statistical equivalent of replication. The optimal parameters are not the product of fitting to a particular market regime or ticker microstructure — they appear to describe an invariant of the strategy itself. Delta 0.20 balances premium income against assignment risk at the point where the expected value of the trade is maximised across a wide range of volatility environments. DTE 14 captures the steepest region of the theta decay curve while staying far enough from expiry to avoid gamma risk. A threshold of −0.15 (entry when the ensemble score is above this floor) provides meaningful filtering without excessive selectivity.
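To show what a delta of 0.20 at 14 DTE means in terms of strike selection, the sketch below inverts the Black-Scholes call delta to find the matching strike. This is an illustration under stated assumptions: the source does not say which pricing model the system uses, the function names are hypothetical, and the inputs (spot 100, 25% volatility, zero rate, no dividends, a 365-day year) are invented for the example.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()  # standard normal distribution


def call_delta(spot: float, strike: float, vol: float,
               dte: int, rate: float = 0.0) -> float:
    """Black-Scholes delta of a European call on a non-dividend-paying stock."""
    t = dte / 365
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * math.sqrt(t))
    return _NORMAL.cdf(d1)


def strike_for_delta(spot: float, vol: float, target_delta: float = 0.20,
                     dte: int = 14, rate: float = 0.0) -> float:
    """Strike whose Black-Scholes call delta equals target_delta."""
    t = dte / 365
    d1 = _NORMAL.inv_cdf(target_delta)  # solve N(d1) = target_delta for d1
    return spot * math.exp(-d1 * vol * math.sqrt(t) + (rate + 0.5 * vol ** 2) * t)


strike = strike_for_delta(spot=100.0, vol=0.25)
print(round(strike, 2))                                # strike above spot
print(round(call_delta(100.0, strike, 0.25, 14), 4))   # recovers the target delta
```

With these hypothetical inputs the 0.20-delta strike sits a little over 4% above spot, which gives a sense of how much room the underlying has before the short call caps the position.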
The implication is practical and important: the system does not require per-ticker parameter optimisation. A researcher who discovers a new candidate ticker can apply the universal parameters as a starting point with high confidence, reserving the grid search for validation rather than exploration. The variation between tickers is not in how the strategy is parameterised — it is in whether the underlying asset is structurally suited to the strategy at all.
The WFE of 3.886, the portfolio-level ratio of OOS to IS Sharpe, is the most important single number in the entire dataset. It is commonly assumed in the backtesting literature that sophisticated strategies will overfit: that as the number of parameters and tested variants increases, the gap between in-sample and out-of-sample performance will widen. In the Stack$Trader leaderboard, the opposite occurs. Adding complexity (more agents, more entry variants, more tickers) consistently improved out-of-sample performance relative to in-sample performance rather than degrading it.
This is explicable only if the edge being captured is structural rather than statistical. The permutation test result, reported in the companion article "Selling Insurance at Scale," points the same way: theta-normalised permutation p = 1.000, meaning the test finds no evidence that entry timing contributes anything beyond what randomly timed entries achieve. That is consistent with the returns coming from structural theta decay and the volatility risk premium rather than from timing. A structural edge does not overfit because it does not depend on fitting. The calendar does not care how many parameter combinations were tested; time passes, options decay, and the VRP pays its premium whether the researcher tested 923 configurations or nine.
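One generic way to test whether entry timing adds anything is a permutation test on the entry dates. The sketch below is a plain version of that idea, not the theta-normalised test from the companion article, whose details are not given here. The function name, the use of mean return as the test statistic, and the toy data are all assumptions.

```python
import random
from statistics import mean


def timing_permutation_p(daily_returns: list[float], entry_days: list[int],
                         n_perm: int = 999, seed: int = 0) -> float:
    """One-sided permutation p-value for entry timing.

    Compares the mean return on the actual entry days with the mean return
    on randomly chosen sets of days of the same size. The p-value is the
    share of random sets that do at least as well as the actual entries.
    """
    rng = random.Random(seed)
    observed = mean(daily_returns[i] for i in entry_days)
    k = len(entry_days)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(range(len(daily_returns)), k)
        if mean(daily_returns[i] for i in sample) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: every day pays the same, so timing cannot matter.
flat_returns = [0.01] * 20
print(timing_permutation_p(flat_returns, entry_days=[0, 5, 10]))
```

In this construction a p-value near 1 says that random entry dates do just as well as the chosen ones, so whatever the strategy earns is not coming from when it enters.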
The leaderboard is, in the end, an asset selection tool as much as a strategy validation tool. It establishes which tickers carry the structural properties that allow a systematic covered call strategy to function, and which do not. The list of winners (defence, energy, metals, income) is not a commentary on those companies' business prospects. It is a commentary on their option market dynamics: moderate and persistent volatility risk premia, low directional momentum, so that the capped upside of the covered call rarely binds, and stable implied-to-realised volatility ratios that make premium income predictable across market regimes. These are the properties that the leaderboard selects for, and they turn out to concentrate in exactly the sectors that most sophisticated options traders have discovered empirically over decades of practice. The system, in nine hundred and twenty-three trials, arrived at the same answer.