The number appeared on the screen without ceremony: $99,000. It was a pending charge from a financial data provider, the result of a single line of Python that had attempted to download thirty-five terabytes of historical options data for forty-five tickers at one-minute resolution. The call had been made in good faith. The bill had arrived in seconds. For a retired engineer running a private research project on a home computer, it was the kind of moment that ends projects before they begin.

It did not end this one. Instead, it produced the architectural innovation that would define everything that followed — a synthetic options pricing engine built entirely from free public data, mathematically precise, infinitely scalable, and completely independent of any vendor's billing department. The near-disaster became a permanent competitive advantage. That quality — finding the structural solution inside the worst-case failure — would characterize the entire enterprise.

The project is called Stack$Trader. Its builder spent a career in the precise engineering of semiconductor manufacturing processes — controlling deposition to tolerances measured in angstroms, building intuitions about variance, signal, and noise. He began constructing a systematic options trading engine, using two artificial intelligence systems as coding partners, across seven phases of development spanning roughly a year and a half.

The result, as of March 2026, is a system that has tested 923 distinct strategy-ticker combinations, validated its core statistical edge through ten independent statistical tests at a significance level that conventional finance would consider conclusive, deployed sixteen specialist analytical agents whose votes are aggregated through a weighted ensemble, and generated a leaderboard where the top-ranked strategy — Lockheed Martin covered calls timed by VIX regime — achieves an out-of-sample Sharpe ratio of 4.51.

What follows is the story of how it was built: what worked, what failed, what the statistics actually proved, and what the discipline of process engineering has to offer a domain that more often looks to mathematics and economics for its intellectual foundations.

I The Process Engineer's Instinct

In semiconductor manufacturing, a deposition process is characterized not by its average output but by its distribution. A film that grows to the correct thickness on average but varies by ten percent across a wafer is not a good process — it is an unpredictable one. The discipline trains you to think in terms of variance, control limits, yield, and the identification of assignable causes of deviation. A shift in the distribution is not bad luck; it is information.

On Process Control

Atomic Layer Deposition grows films one molecular layer at a time, relying on self-limiting chemical reactions to achieve sub-angstrom precision. The discipline is fundamentally about identifying which variables matter and which are noise.

This mental framework transfers more cleanly to quantitative finance than it might appear. Options markets are, at their core, a pricing problem embedded in a variance problem. Implied volatility — the market's forecast of future price movement embedded in an option's price — persistently exceeds the volatility that subsequently materializes. This gap, known as the volatility risk premium, has been documented in academic literature for decades. It is not a market inefficiency waiting to be arbitraged away; it is compensation for the insurance function that options sellers provide. Understanding it requires the same disposition that semiconductor process engineers develop: an attention to distributions, an ability to separate structural signals from noise, and a healthy skepticism toward any explanation that relies on luck.

The covered call strategy — selling a call option against a held stock position — harvests this premium systematically. If the stock stays below the strike price at expiration, the seller keeps the entire premium. If it rises above, the upside is capped but the premium offsets the opportunity cost. The strategy has a well-documented history; Israelov and Nielsen, writing in the Financial Analysts Journal in 2015, decomposed its return into equity exposure, short volatility exposure with a realized Sharpe near 1.0, and an uncompensated equity reversal component. The edge is structural. But naive implementations — selling at a fixed delta every month regardless of conditions — leave enormous value on the table and expose portfolios to avoidable catastrophic losses.
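The payoff mechanics are simple enough to sketch in a few lines of Python (illustrative code, not taken from the project):

```python
def covered_call_pnl(s0, s_T, strike, premium):
    """P&L at expiration of a covered call: long stock bought at s0,
    short one call at `strike`, credit `premium` received up front."""
    stock_pnl = s_T - s0
    call_payoff = max(s_T - strike, 0.0)  # the short call loses this
    return stock_pnl - call_payoff + premium

# Stock below the strike: keep the whole premium on top of stock P&L.
print(covered_call_pnl(100, 98, 105, 2.0))   # -2 + 0 + 2 = 0.0
# Stock above the strike: upside capped at (strike - s0) + premium.
print(covered_call_pnl(100, 120, 105, 2.0))  # 20 - 15 + 2 = 7.0
```

The second call illustrates the cap: however far the stock rallies past the strike, the seller's P&L never exceeds the strike distance plus the premium collected.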

The question Stack$Trader set out to answer was not whether the edge exists. The literature had settled that. The question was whether a systematic, data-driven approach to when to sell — and when to abstain — could improve on the naive baseline in a statistically defensible way.

II The $99,000 Lesson

Every project has its origin myth. For Stack$Trader, it is the Databento incident of Phase 1.

The initial design was straightforward: download a deep historical archive of options data, build a backtester, and begin testing strategies. The data vendor was Databento, a professional-grade market data provider whose OPRA (Options Price Reporting Authority) feed carries every options quote from every U.S. exchange. The initialization call specified 2020–2024 options chains for forty-five tickers at one-minute resolution.

The resulting download order was thirty-five terabytes. The pending bill: $99,000.

The Architecture It Forced
54,000: synthetic contracts generated per ticker, per run
$0: ongoing data cost for the synthetic options pipeline
3: data provider implementations (YFinance, Vault, Schwab)

A support ticket to Databento's engineering team (ticket #79927696) got the in-flight job cancelled and the pending charges reversed. The five-figure billing hit was avoided. But the experience had already done its architectural work. The system that emerged from Phase 1 would never again depend on a single expensive data vendor for its operational continuity.

The solution was broker/synthetic_options.py, a Black-Scholes replication engine that generates historically accurate option chains from free publicly available price data via Yahoo Finance. The generate_synthetic_chain() function iterates over nine hundred trading days per ticker, computing theoretical option prices from realized volatility, time decay, and the risk-free rate. Each run produces roughly fifty-four thousand synthetic contracts per ticker, exported to local Parquet files. The mathematical accuracy is sufficient for backtesting purposes; the cost is zero; the data is available for any ticker with a public price history.
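The core of such an engine fits in a short sketch. The functions below are illustrative stand-ins for the project's generate_synthetic_chain() internals, assuming standard Black-Scholes inputs and a strike grid around spot:

```python
import math

def bs_call(s, k, t, sigma, r=0.05):
    """Black-Scholes price of a European call (t in years)."""
    if t <= 0 or sigma <= 0:
        return max(s - k, 0.0)
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    N = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return s * N(d1) - k * math.exp(-r * t) * N(d2)

def synthetic_chain(spot, realized_vol, dtes=(7, 14, 30, 45), n_strikes=5):
    """Generate a synthetic call chain around the current spot price.
    Strikes span 90%..110% of spot; vol is trailing realized volatility."""
    chain = []
    for dte in dtes:
        t = dte / 365.0
        for i in range(n_strikes):
            k = spot * (0.90 + 0.05 * i)
            chain.append({"strike": round(k, 2), "dte": dte,
                          "mid": round(bs_call(spot, k, t, realized_vol), 4)})
    return chain

chain = synthetic_chain(100.0, 0.25)
print(len(chain))  # 4 DTEs x 5 strikes = 20 contracts
print(chain[0])
```

Repeated over nine hundred trading days and a full strike/expiry grid per day, a loop like this is how the contract count per ticker reaches the tens of thousands.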

Alongside this, the DataProvider protocol was built — an abstraction layer that enforces a strict interface between the trading engine and whatever data source is currently active. Three implementations coexist: YFinanceProvider for free historical data, VaultExecutionModel for high-fidelity historical bid-ask spreads, and SchwabOptionProvider for live broker data. The engine queries whichever provider is active and receives standardized data structures regardless of the source. When the live trading integration eventually ships, the engine will not require modification — only a new provider implementation.
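In Python, this pattern is naturally expressed with typing.Protocol, which enforces the interface structurally without coupling the engine to any concrete provider. The stub below is a hypothetical stand-in, not the project's actual interface:

```python
from typing import Protocol, List, Dict

class DataProvider(Protocol):
    """The only surface the engine depends on; providers are swappable."""
    def get_daily_bars(self, ticker: str, lookback: int) -> List[Dict]: ...
    def get_option_chain(self, ticker: str) -> List[Dict]: ...

class StubProvider:
    """Illustrative stand-in for YFinanceProvider / SchwabOptionProvider."""
    def get_daily_bars(self, ticker, lookback):
        return [{"close": 100.0 + i} for i in range(lookback)]
    def get_option_chain(self, ticker):
        return [{"strike": 105.0, "dte": 30, "mid": 1.85}]

def latest_close(provider: DataProvider, ticker: str) -> float:
    # Engine code sees only the protocol, never the concrete source.
    return provider.get_daily_bars(ticker, lookback=5)[-1]["close"]

print(latest_close(StubProvider(), "SPY"))  # 104.0
```

Swapping in a live broker then means writing one new class that satisfies the protocol; latest_close() and everything above it stays untouched.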

The Databento incident is a good example of a pattern that would repeat throughout the project: the most durable architectural decisions emerged from constraints imposed by failure, not from advance planning.

III The Hidden Markov Breakthrough

Phase 2 was pure quantitative research. Thirty-four strategy variations were tested across five experiments on a five-year SPY dataset, each validated using Timothy Masters' permutation testing framework at two thousand permutations per strategy. The number of ideas that survived was smaller than the number that failed, which is precisely the point.
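The permutation test logic, reduced to its essentials, asks one question: how often does a random shuffle of the entry signal match or beat the actual signal? This is a simplified sketch of the Masters-style framework, not the project's implementation:

```python
import random

def permutation_p_value(returns, signal, n_perm=2000, seed=0):
    """Fraction of shuffled entry signals whose total return matches
    or exceeds the actual signal's total return."""
    rng = random.Random(seed)
    actual = sum(r for r, s in zip(returns, signal) if s)
    beats = 0
    shuffled = list(signal)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        perm = sum(r for r, s in zip(returns, shuffled) if s)
        if perm >= actual:
            beats += 1
    return (beats + 1) / (n_perm + 1)  # small-sample correction

# A signal that deliberately picks the positive days earns a tiny p-value.
rets = [0.01, -0.02, 0.015, -0.01, 0.02, -0.015, 0.01, -0.005] * 25
sig  = [r > 0 for r in rets]
p = permutation_p_value(rets, sig)
print(p < 0.05)  # True: this timing is clearly better than chance
```

The shuffle preserves the number of entries and the return distribution, so the only thing being tested is the timing itself, which is exactly what curve-fitted strategies fail.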

The central finding emerged from an unsupervised machine learning technique called the Hidden Markov Model. A three-state Gaussian HMM trained on SPY's daily returns classified every trading day since 2021 into one of three hidden market states: Grind (low volatility, positive drift), Mean-Reversion (elevated volatility, oscillating returns), and Crisis (negative drift, market stress). The state labels are inferred from the statistical properties of the data, not imposed by the researcher — the algorithm discovers the structure rather than being told what to find.

"The Grind regime's monotonic, time-scalable edge is the structural bedrock. Everything else is about identifying when the regime classification is wrong."

Stack$Trader Development Journal, Phase 1–3 Retrospective

The three states resolved as follows: Grind occupied 812 trading days, sixty-six percent of the historical record, characterized by a mean daily return of +0.088% and annualized volatility of 11.3%. Mean-Reversion occupied 213 days, with oscillating returns and annualized volatility of 23.2%. Crisis occupied the remaining 210 days, with a mean daily return of -0.071% at the same elevated volatility level.

The breakthrough was not the classification itself but what happened when position sizing was linked to the HMM's posterior probabilities — the model's continuous confidence that any given day belongs to the Grind regime — rather than a binary on-off filter. Moving from binary classification to posterior weighting improved the Sharpe ratio from 2.67 to 4.07 and reduced the p-value from 0.015 to 0.004. At two thousand permutations, a p-value of 0.004 means that only eight random timing sequences out of two thousand matched or exceeded the algorithm's performance. This is not luck.
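The difference between the two sizing rules can be sketched directly. The threshold and confidence floor below are illustrative, not the project's calibrated values:

```python
def binary_size(p_grind, threshold=0.5):
    """On/off filter: full size whenever Grind is the most likely state."""
    return 1.0 if p_grind > threshold else 0.0

def posterior_size(p_grind, floor=0.6):
    """Continuous sizing: scale exposure with the HMM's posterior
    confidence, standing aside entirely below a confidence floor."""
    return 0.0 if p_grind < floor else (p_grind - floor) / (1.0 - floor)

for p in (0.55, 0.70, 0.95):
    print(p, binary_size(p), round(posterior_size(p), 2))
```

The binary rule takes full size at 55 percent confidence, where the classification is barely better than a coin flip; the posterior rule takes nothing there and only approaches full size as confidence nears certainty.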

More important than the headline number was the holding period analysis. The Grind regime's edge scales linearly with time: a one-day hold produced a 0.088% per-trade edge with a borderline p-value of 0.06. A five-day hold produced 0.400% at p=0.006. A ten-day hold produced 0.747% at p=0.001. And a twenty-day hold produced 1.502% at p=0.0005, with a Sharpe ratio of 8.2. This is what theta decay looks like in the data — a real, measurable, statistically confirmed phenomenon that scales precisely with the time that premium sellers are exposed to the market. It is the structural foundation upon which every subsequent phase would build.

Experiment B, by contrast, was instructive in failure. The hypothesis was that regime transition probabilities — the likelihood of moving from Crisis to Grind — could time entries at inflection points. The data destroyed the idea: in five years of SPY history, there were zero direct transitions from Crisis to Grind; every recovery first passed through Mean-Reversion. Markets do not snap from fear to greed. They decompress gradually, which means transition-based entries are structurally unsound as a timing mechanism.

IV Seven Agents and the Mixture-of-Experts Architecture

The most durable engineering artifact of the first three phases was not the HMM itself but the decision to embed it inside a larger ensemble architecture rather than trade on it directly.

The Supervisor-Agent framework organizes the system's decision-making as a weighted vote among specialist components, each contributing a normalized score between -1.0 (strongly bearish — do not sell premium) and +1.0 (strongly bullish — conditions are favorable). The Supervisor aggregates these scores into a composite, applies a disagreement penalty when agents diverge sharply, and produces a binary entry decision relative to a configurable threshold.

The initial seven agents spanned the primary dimensions of market analysis: TechnicalAgent (RSI, Bollinger Bands, MACD, price momentum), OptionsPricingAgent (implied volatility rank, volatility regime, surface stability), FundamentalsAgent (SMA stretch, volume anomalies, drawdown detection), MacroAgent (VIX level, market trend, sector momentum), EarningsAgent (calendar proximity, event risk classification), CorrelationAgent (beta, sector-relative strength), and HMMAgent (the regime classifier built in Phase 2).

The Disagreement Penalty

When Technical scores +1.0 and Macro scores -1.0 simultaneously, the high variance in agent conviction triggers a penalty that reduces the composite score — catching whipsaw setups that naive averaging would approve.

The disagreement penalty deserves specific attention. Consider a scenario where the technical picture is unambiguously bullish — RSI at 55, price above all moving averages, MACD positive — while the macro picture is unambiguously bearish: VIX spiking, SPY breaking below its 200-day average, sector momentum collapsing. A simple weighted average would produce a middling composite score and potentially an entry. The disagreement penalty reduces this composite when agents arrive at similar numerical scores through opposite convictions. High dispersion in agent scores is itself a signal: the market is sending contradictory messages, and premium selling in that environment carries elevated risk. The penalty vetoes these setups without requiring any individual agent to generate a sufficiently negative score alone.
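One plausible way to express this is to shrink the weighted mean by the dispersion of the agent scores. The penalty constant here is an assumption for illustration, not the project's calibration:

```python
import statistics

def composite_score(scores, weights=None, penalty_k=0.5):
    """Weighted mean of agent scores, shrunk toward zero when agents
    disagree. `penalty_k` controls how hard dispersion is punished."""
    if weights is None:
        weights = [1.0] * len(scores)
    mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    dispersion = statistics.pstdev(scores)  # 0 when agents fully agree
    return mean * max(0.0, 1.0 - penalty_k * dispersion)

agree    = composite_score([0.4, 0.5, 0.45])   # mild consensus survives
conflict = composite_score([1.0, -1.0, 0.5])   # whipsaw setup is shrunk
print(round(agree, 3), round(conflict, 3))
```

Both inputs have a positive weighted mean, but the conflicted ensemble's score is cut to a fraction of the consensus ensemble's, which is exactly the veto behavior described above.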

Per-ticker configuration was built from the outset. A configuration file containing independent weight vectors, entry thresholds, and strategy parameters for each ticker allows the system to manage forty-three different equity instruments simultaneously, each with its own risk profile. AAPL's covered call strategy uses different weights than a GLD (gold ETF) strategy. TSLA's idiosyncratic volatility requires different DTE targets than SPY's index-driven structure. The architecture was deliberately modular to accommodate this heterogeneity without requiring engine modifications.

This modularity would prove invaluable. When Phase 4 added tick-level microstructure, Phase 5 added sentiment analysis and earnings decay models, and Phase 7 added dealer positioning, each innovation arrived as a new agent slotting into existing infrastructure — not as a rewrite of the underlying system.

V What Daily Bars Cannot See

Phase 2's Experiment C is worth dwelling on, because the lesson it taught was counterintuitive and expensive to learn. Five different microstructure proxy strategies — absorption ratios, RSI-volume climax combinations, mean-reversion pullbacks with volume confirmation, Bollinger squeeze releases, and an enhanced composite of all of the above — were tested on daily OHLCV bars against the SPY dataset. The p-values for all five were at or above 0.5. The enhanced composite produced negative returns with a p-value of 0.995, meaning random timing beat it ninety-nine times out of a hundred.

The reason is structural, not coincidental. The information that microstructure researchers care about — order flow imbalance, bid-ask dynamics, the distinction between aggressive buyers lifting the offer and passive sellers posting to the bid — exists at sub-second resolution. When that information is aggregated into a single daily bar, it is destroyed as surely as a high-resolution photograph is destroyed by printing it as a single gray pixel. The daily open, high, low, and close contain no recoverable trace of the intraday order book dynamics that determine whether a day's volume was driven by institutional accumulation or retail panic.

This finding defined Phase 4's entire agenda. The only way to access genuine microstructure signal was to record it directly — streaming tick-level data from a live broker connection and storing it in time-series Parquet files before any analysis was attempted. The standalone_recorder.py daemon was built for exactly this purpose: a lightweight background process that connects to Schwab's WebSocket API and streams real-time bid-ask quotes and trade prints for options and equities across forty-one tickers simultaneously, archiving everything to local storage without triggering the engine's full execution pipeline.

Strategy                      Data Source               Result          Verdict
RSI + Volume Climax Proxy     Daily OHLCV               p = 0.72        Rejected
Mean-Rev + Volume Pullback    Daily OHLCV               p = 0.81        Rejected
Bollinger Squeeze Release     Daily OHLCV               p = 0.68        Rejected
Enhanced Composite Proxy      Daily OHLCV               p = 0.995       Rejected
OFI via Live Tick Stream      1-sec Schwab WebSocket    +17.8% Sharpe   Validated

With genuine tick data flowing, the Order Flow Imbalance metric produced measurable results on NVDA: a 17.8 percent improvement in Sharpe ratio, plus four rejected entries on days when the tick data showed aggressive institutional selling that the daily HMM regime had not detected. The improvement was modest in absolute terms, but the mechanism was real and orthogonal to every other signal in the system. The MicrostructureAgent earned its place in the ensemble not by turning a losing strategy into a winning one, but by demonstrating that it sees things the other agents cannot.
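A minimal version of the OFI computation classifies each trade print against the quote midpoint and nets the signed volume. This is a textbook tick-rule sketch, not the project's implementation:

```python
def order_flow_imbalance(trades):
    """Net signed volume over total volume; positive OFI indicates
    aggressive buying pressure (prints lifting the offer)."""
    signed = 0
    total = 0
    for price, size, bid, ask in trades:
        mid = (bid + ask) / 2
        side = 1 if price > mid else (-1 if price < mid else 0)
        signed += side * size
        total += size
    return signed / total if total else 0.0

ticks = [(100.02, 300, 99.99, 100.01),   # lifted the offer: buy-initiated
         (99.98, 500, 99.99, 100.01),    # hit the bid: sell-initiated
         (100.03, 400, 100.00, 100.02)]  # lifted the offer again
print(round(order_flow_imbalance(ticks), 3))  # (300 - 500 + 400) / 1200
```

Nothing in this computation survives aggregation into a daily bar, which is the structural point of the section above: the bid, the ask, and the side of each print are gone by the time OHLCV is written.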

VI Sixteen Agents, Five Presets, and the Earnings Problem

Phase 5 was the phase where the system grew into something qualitatively different from what it had been.

The earnings problem had been present since Phase 1. Options sellers near earnings events are exposed to binary risk that no amount of technical analysis can hedge: a surprise beat or miss can move a stock ten percent overnight, obliterating weeks of accumulated premium in a single session. The naive solution — avoid all trading within a fixed window around earnings — works but sacrifices significant premium harvesting opportunity in the constructive days before the event, when implied volatility is elevated and the risk of a catastrophic outcome is still manageable.

The Gaussian Decay Model replaced the fixed window with a continuous risk measure. A Gaussian curve centered on each confirmed earnings date with a sigma of four days produces an impact score between zero and one that decays smoothly as distance from the event increases. The critical innovation was fusing this impact score with a price-volatility divergence proxy: the system distinguishes between a Fearful Divergence state (falling price, rising implied volatility near earnings) — which triggers extreme entry suppression — and a Constructive Base state (rising price, falling volatility before earnings) — which generates a neutral-to-positive bias. The EarningsAgent's Gaussian decay correctly suppressed all entries on NVDA's May 2025 earnings cluster, pushing impact to 1.0 on the event date and rejecting entries without ambiguity in the log output.
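The decay curve itself is one line of math, shown here as a sketch consistent with the description above (sigma of four days, score of 1.0 on the event date):

```python
import math

def earnings_impact(days_to_event, sigma=4.0):
    """Gaussian impact score: 1.0 on the earnings date, decaying
    smoothly toward zero as distance from the event grows."""
    return math.exp(-(days_to_event ** 2) / (2 * sigma ** 2))

for d in (0, 2, 4, 8, 12):
    print(d, round(earnings_impact(d), 3))
```

At eight days out the impact has fallen below 0.14, so entries in the constructive pre-earnings window face only mild suppression, while the two or three days surrounding the event are effectively vetoed.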

Dynamic delta targeting arrived in the same phase. Prior to Phase 5, every covered call was sold at a static 0.30 delta — the retail default, and a profoundly wrong choice for a system with the market awareness Stack$Trader had developed by this point. The DeltaDTESelector introduced a continuous mapping from volatility risk premium richness and HMM regime confidence to a target delta in the 0.15–0.35 range. In a high-confidence Grind regime with rich VRP, the engine sells at 0.35 delta, maximizing theta income. When the term structure inverts into backwardation — a signal that the market is pricing elevated near-term stress — delta drops to 0.15, effectively placing the sold strike far enough out-of-the-money to function as deep protection rather than aggressive income generation.
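A sketch of such a mapping, assuming both inputs are normalized to the unit interval; the equal blend weights are an illustrative assumption, not the DeltaDTESelector's actual logic:

```python
def target_delta(grind_confidence, vrp_rich, lo=0.15, hi=0.35):
    """Map HMM regime confidence and VRP richness (both 0..1) to a
    sold-call delta in [lo, hi]. Term-structure stress shows up here
    as low inputs, pushing the strike toward deep protection."""
    blend = 0.5 * grind_confidence + 0.5 * vrp_rich
    return lo + (hi - lo) * max(0.0, min(1.0, blend))

print(round(target_delta(1.0, 1.0), 2))  # confident Grind, rich VRP: 0.35
print(round(target_delta(0.0, 0.0), 2))  # stressed conditions: 0.15
```

The clamp keeps the output inside the 0.15–0.35 band regardless of how extreme the inputs get, which is the kind of guardrail a live system needs around any continuous mapping.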

"The system proved, with mathematical certainty, what does not work. That discipline — killing ideas that feel right but test to zero — is what separates this from a retail trading strategy."

Stack$Trader Development Journal, Phase 3 Retrospective

Five additional agents joined the ensemble in Phase 5, expanding the MoE from seven specialists to twelve: the SentimentAgent (LLM-based news scoring via Mistral-Small), the AnalystConsensusAgent (sell-side revision momentum), the FINNAgent (volatility surface mispricing), the AnomalyAgent (Variational Autoencoder outlier detection in latent market state space), and a tiered upgrade to the existing MicrostructureAgent. By Phase 7, three SoftTechnicalAgents operating at daily, sixty-minute, and five-minute timeframes had been added, bringing the final agent count to sixteen.

Five character presets configure the ensemble's behavior for different risk appetites: conservative (Sharpe preservation, heavy technical confirmation), moderate (balanced ensemble), aggressive (premium maximization, emphasizing surface mispricing and microstructure), gold standard (multi-agent consensus, all sources active), and premium_harvest (architecturally inverted — bearish directional signals become favorable entry conditions, because a flat-to-declining stock maximizes the probability that a sold call expires worthless).

VII The Crucible: What Phase 6 Actually Proved

Phase 6 was designed as a reckoning. Every technique built across the preceding five phases would either survive rigorous statistical validation or be formally rejected as curve-fitted noise. The results were, by the project's own internal assessment, humbling — and for that reason, the most scientifically valuable phase of the entire effort.

The baseline backtests were run on the four core tickers — SPY, NVDA, AAPL, and QQQ — using the Gold Standard preset without per-ticker weight optimization. SPY managed a Sharpe ratio of 0.766 against a Phase 6 target of 1.5. NVDA reached 0.757. AAPL managed just 0.338. QQQ came closest to its gate with 1.149 against a 1.2 target.

The Monte Carlo permutation tests were more damaging. SPY achieved statistical significance — p=0.000, meaning no random permutation of timing beat the algorithm's actual entries. But NVDA returned p=0.995: 99.5 percent of random entry-timing sequences matched or beat the system's actual entries on NVDA. AAPL at p=0.750 and QQQ at p=0.120 both failed the p<0.05 significance gate.

These failures were not a crisis. They were the entire point of Phase 6. What they proved, with precision that generalized well-intentioned backtesters rarely accept, is that a single weight vector calibrated for a diversified equity index cannot be exported to individual growth stocks without degradation. SPY's statistical edge is real. NVDA's edge, with default weights, does not exist. The distinction between these two findings is not philosophical — it has direct implications for position sizing, capital allocation, and the live trading decisions that Phase 7 will make with real money.

The Walk-Forward Efficiency results were more encouraging. All four tickers passed the WFE gate (ratio above zero), with QQQ producing the striking result that its out-of-sample Sharpe actually exceeded its in-sample Sharpe by a factor of 1.534 — the rarest outcome in backtesting, suggesting that QQQ's behavior in the test period was, if anything, more favorable to the strategy than the training period. This result saved every ticker from outright rejection under the Phase 6 protocol, which specified elimination of any ticker failing both the permutation test and the WFE gate simultaneously.

Phase 6 also produced the most comprehensive dataset in the project's history. A grid search across 923 strategy-ticker combinations tested thirty feature configurations with three-fold walk-forward validation across forty-plus tickers. The leaderboard that emerged from this process has a clear structure: commodity and defense sector covered calls dominate the top rankings (LMT at Sharpe 4.51, XOM at 4.17, CVX at 3.88, GLD at 3.80), while high-beta technology names (AAPL, TSLA, AMZN, MSFT, COIN) cluster at the bottom or post negative Sharpe ratios. The pattern is not noise — it reflects the structural reality that premium selling works best on instruments with stable, predictable volatility characteristics, and worst on stocks where idiosyncratic news events dominate the return distribution.

VIII Two AIs, One Codebase

A development detail that would be unremarkable in a traditional software project is worth noting in this context: Stack$Trader was not built by one person working alone. It was built by one person working with two artificial intelligence systems as programming partners, each handling different aspects of the development workflow.

Claude (Anthropic) handles architecture design, implementation, and test coverage. Gemini (Google's Antigravity IDE) handles research, optimization passes, and overnight computational tasks — running backtests, executing agent weight optimization via differential evolution, and validating statistical results. Coordination happens through a shared file called HANDOFF.md, git commit messages with standardized prefixes, and weekly sync documents that formalize what each AI partner has accomplished, what is in progress, and what is blocked.

The workflow revealed something about the current state of AI-assisted development that is both useful and somewhat surprising. The AIs do not disagree with each other in the way human collaborators sometimes do. They do not have competing instincts or professional rivalries. What they provide is a division of cognitive labor that a solo developer genuinely cannot replicate: one partner optimizing for code quality and architectural coherence while the other processes massive parameter spaces overnight without fatigue. A 352-file mega-commit that consolidated all project artifacts into the repository — the result of a single productive afternoon — was possible because the implementation burden had been continuously distributed rather than accumulated.

The 352-file commit, delivered on March 3, 2026, is perhaps the clearest single data point for what AI-assisted development makes possible for a solo researcher. It included phase retrospectives, architecture design documents, a statistical validation framework, a live trading kill switch with fifteen unit tests, a new agent implementation, and a complete dual-agent collaboration infrastructure — all written, tested, and committed in a continuous session. Whether this represents a qualitative change in what individual researchers can accomplish, or merely an acceleration of what was always possible with sufficient motivation, is a question the project does not answer definitively. But it suggests the territory.

IX What the Statistics Actually Proved

The ten-test statistical validation suite that Phase 6 produced deserves a careful accounting, because it arrived at conclusions that contradict the intuitive appeal of the system's most sophisticated components.

Hansen's Superior Predictive Ability test — the most rigorous of the ten — confirmed that the strategy produces statistically significant outperformance versus buy-and-hold at p=0.000. The edge is real.

The theta-normalized permutation test found something more specific and more important: one hundred percent of that edge is structural, arising from theta decay and the volatility risk premium. The directional timing contribution — the share of returns attributable to the sixteen agents correctly predicting market direction — is statistically zero.

Walk-Forward Efficiency of 3.886 confirmed that the strategy generalizes out-of-sample, ruling out the overfitting hypothesis.

The entry edge test returned a five-percent pass rate across tickers. The system identifies superior entry conditions on roughly one ticker in twenty. On the remaining nineteen, entry timing is indistinguishable from random selection — but the strategy still produces positive returns because any reasonable entry that avoids earnings events and crisis regimes harvests theta from a positive-expectancy position.

The correct interpretation of these results was counterintuitive but decisive: Stack$Trader's sixteen-agent ensemble does not predict market direction. It never did. Its value is in identifying when not to sell premium — near earnings, during HMM-classified stress regimes, when the AnomalyAgent detects distributional breaks, when the MicrostructureAgent sees toxic order flow — and allowing the structural theta edge to compound in the remaining, favorable conditions. The architecture's sophistication serves risk management and exclusion logic, not directional prediction.

This reframing of the optimization target — from signal accuracy to catastrophic scenario avoidance — is the most intellectually substantive conclusion the project has reached. It is also the one most consistent with the process engineer's instinct that opened this story. The goal was never to predict the future. It was to identify the conditions under which a well-characterized, structurally positive-expectancy process operates reliably — and to stay out of the reactor when those conditions are absent.

X Phase 7 and the Road Ahead

Phase 7, the current development phase as of this writing, has introduced the DealerPositioningAgent — the system's thirteenth specialist — which decodes market-maker Gamma Exposure (GEX), Vanna flow, and the Dark Index (DIX) of institutional buying pressure into a single positioning score. The academic validation for dealer positioning effects is substantial: gamma hedging by market makers creates predictable intraday momentum effects that are exploitable in the 0DTE and near-expiration options space where covered call premium decays most rapidly.

The GEX calculation follows the SqueezeMetrics convention: for each option contract, GEX equals open interest times gamma times one hundred times the square of the underlying price. Summed across the full options chain with sign conventions (calls positive, puts negative), the net GEX profile reveals whether market makers are net long or net short gamma overall, where the gamma flip point lies — the price level at which their hedging behavior shifts from stabilizing to destabilizing — and where the largest concentrations of positive and negative exposure create gravitational price levels near expiration.
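Implemented directly as described, the computation is a signed sum over the chain. The contract fields and the example chain below are hypothetical:

```python
def net_gex(chain, spot):
    """Net gamma exposure per the convention described above:
    OI * gamma * 100 * spot^2, with calls positive and puts negative."""
    total = 0.0
    for opt in chain:
        sign = 1 if opt["type"] == "call" else -1
        total += sign * opt["oi"] * opt["gamma"] * 100 * spot ** 2
    return total

chain = [{"type": "call", "oi": 1200, "gamma": 0.012},
         {"type": "call", "oi": 800,  "gamma": 0.020},
         {"type": "put",  "oi": 1500, "gamma": 0.015}]
print(net_gex(chain, spot=450.0))  # positive: dealers net long gamma
```

Evaluating the same sum at a grid of hypothetical spot prices, rather than just the current one, is what locates the gamma flip point: the price where the signed total crosses zero.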

The Charm-Aware Execution Timing module provides the system's first explicit model of when within a session to execute entries. Charm — the rate at which delta changes with time, a third-order Greek not typically found in retail options tools — determines how urgently a market maker must adjust hedges as expiration approaches. High charm environments near OpEx create execution windows where fills are favorable for premium sellers. The module produces urgency classifications (high, medium, low) based on charm magnitude, with corresponding delay recommendations that route entries into periods of favorable microstructure rather than the open-market volatility that characterizes the first forty-five minutes of each session.
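Charm can be approximated numerically as the one-day drift in Black-Scholes delta from pure time decay. The urgency thresholds below are illustrative assumptions, not the project's calibrated values:

```python
import math

def _norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bs_call_delta(s, k, t, sigma, r=0.05):
    """Black-Scholes delta of a European call (t in years)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    return _norm_cdf(d1)

def charm_per_day(s, k, t, sigma):
    """Numerical charm: delta change from one day of time decay alone."""
    dt = 1 / 365
    return bs_call_delta(s, k, t - dt, sigma) - bs_call_delta(s, k, t, sigma)

def execution_urgency(charm, hi=0.02, lo=0.005):
    """Bucket |charm| into the urgency tiers described above."""
    c = abs(charm)
    return "high" if c >= hi else ("medium" if c >= lo else "low")

# An OTM call three days from expiration bleeds delta far faster
# than the same strike forty-five days out.
near = charm_per_day(100, 103, 3 / 365, 0.25)
far  = charm_per_day(100, 103, 45 / 365, 0.25)
print(execution_urgency(near), execution_urgency(far))
```

The asymmetry is the whole point: near expiration, every passing hour forces material hedge adjustments from the market maker on the other side, and that forced flow is what creates the favorable fill windows the module targets.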

The live trading integration is the final remaining milestone before the system operates with real capital. Schwab API connectivity is in place; the paper trading ledger has been running and accumulating results; the kill switch and risk guardrails are validated by 119 passing unit tests. What remains is the Monte Carlo validation of the DealerPositioningAgent on the full ticker universe, a five-day paper trading dry run to verify execution infrastructure, and the compliance review that any automated trading system requires before it touches real money.

Whether the system will perform as well in live trading as in backtesting is the one question the statistics cannot answer in advance. The Walk-Forward Efficiency results are encouraging — the strategy generalizes well out-of-sample, which reduces but does not eliminate the risk of degraded live performance. What the statistics can say is that the edge is structural rather than spurious, that the failure modes are characterized and guarded against, and that the system has been built by someone who spent two decades ensuring that processes operate reliably within their design envelope before tolerances are tightened.

That instinct — build it right before you scale it up — may be the most transferable skill from the cleanroom to the trading desk.

· · ·