User Manual
Methodology, formulas, and configuration for QuantFin Exeter — regime detection, PPO execution, benchmarks, and governance.
1. Introduction
QuantFin Exeter is a research and demo stack for optimal trade execution on SPY-style daily data: it labels market regimes, trains a PPO policy to schedule liquidation over a fixed horizon, compares against classical benchmarks (TWAP, VWAP, Almgren–Chriss, immediate), and optionally explains decisions with an LLM layer.
It is aimed at students, quants, and PMs who want a transparent pipeline from features → regimes → RL → benchmarks → narrative governance.
Source code: github.com/KirikPapka/quantfin_exeter
2. Data
The pipeline expects CRSP-style daily parquet (per-split feature files under data/features/features_{train,val,test}.parquet), filtered to ticker SPY when a ticker column exists.
Core columns built in load_features_parquet include:
- OHLCV: Open, High, Low, Close, Volume
- Volatility: realised_vol_20 (20-day realised vol), and \[ \sigma_{\text{daily}} = \frac{\mathrm{rv}_{20}}{\sqrt{252}} \]
- Liquidity / microstructure proxies: amihud_illiquidity, \[ \text{bid\_ask\_proxy} = \frac{H - L}{C + \varepsilon}, \qquad \text{volume\_to\_spread} = \frac{V}{\text{bid\_ask\_proxy} + \varepsilon} \]
- VIX: vix_aligned (forward-filled)
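The derived columns above can be sketched for a single daily bar as follows. This is an illustrative stand-alone function, not the code inside load_features_parquet: the function name, the dictionary layout, and the value of EPS are assumptions, and realised_vol_20 is treated as annualised because of the \(\sqrt{252}\) divisor.

```python
import math

EPS = 1e-12  # assumed guard value; the pipeline's actual epsilon is not stated


def derived_features(high, low, close, volume, realised_vol_20):
    """Compute the derived volatility and liquidity columns for one bar."""
    # Daily sigma from the (assumed annualised) 20-day realised vol.
    sigma_daily = realised_vol_20 / math.sqrt(252)
    # High-low range relative to the close, used as a spread proxy.
    bid_ask_proxy = (high - low) / (close + EPS)
    # Volume per unit of proxy spread.
    volume_to_spread = volume / (bid_ask_proxy + EPS)
    return {
        "sigma_daily": sigma_daily,
        "bid_ask_proxy": bid_ask_proxy,
        "volume_to_spread": volume_to_spread,
    }


row = derived_features(high=102.0, low=100.0, close=101.0,
                       volume=1_000_000, realised_vol_20=0.16)
```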
Optional enrichments:
- BBO order imbalance (NASDAQ ITCH L1): merged from data/processed/bbo_daily.parquet when present; daily series can start from 2019-01-02 depending on your ITCH extract. Appears as order_imbalance_daily for HMM features when merged.
- Finnhub news: optional daily news counts merged when a news parquet is configured in the pipeline.
3. Regime detection (HMM)
A Hidden Markov Model (HMM) assumes the market switches among a small number of latent states; each day you only observe features, not the state. GaussianHMM (hmmlearn) with full covariance is fit on scaled features; predicted states are re-ordered by mean realised volatility so lower indices mean calmer conditions (for 2 states: 0 = calm, 1 = elevated; with 3 states you get calm / elevated / stressed ordering by vol).
Default feature vector (before StandardScaler):
If there are too few rows, fit fails, or any state holds <5% of days, the detector falls back to a simple volatility threshold on realised_vol_20 (default threshold 0.24).
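The post-processing described above (re-ordering states by mean realised volatility, the 5% occupancy check, and the threshold fallback) can be sketched without hmmlearn. The function names are hypothetical, and whether the real fallback uses a strict or non-strict comparison against the threshold is an assumption.

```python
from collections import defaultdict


def reorder_by_vol(states, realised_vol):
    """Relabel raw states so that index 0 has the lowest mean realised vol."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, v in zip(states, realised_vol):
        sums[s] += v
        counts[s] += 1
    # Rank the raw labels by their mean volatility, calmest first.
    ranked = sorted(counts, key=lambda s: sums[s] / counts[s])
    mapping = {old: new for new, old in enumerate(ranked)}
    return [mapping[s] for s in states]


def needs_fallback(states, n_states, min_share=0.05):
    """True if there are no rows or any state holds under min_share of days."""
    n = len(states)
    if n == 0:
        return True
    return any(states.count(k) / n < min_share for k in range(n_states))


def threshold_regimes(realised_vol, threshold=0.24):
    """Fallback labels: 1 (elevated) above the threshold, else 0 (calm).

    The strict '>' is an assumption.
    """
    return [1 if v > threshold else 0 for v in realised_vol]
```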
4. Reinforcement learning (PPO)
PPO (Proximal Policy Optimization) is an on-policy actor–critic method that stabilizes policy updates by clipping the probability ratio between old and new policies. Here it learns a continuous policy over execution fractions on the OptimalExecutionEnv (Gymnasium).
Observation space (9 dimensions)
Vector: [inventory_norm, rem, S_ratio, liq_z, sig_z, regime, pva, twap_gap, news_z]
- \(x_{\text{inv}} = X / X_0\) — remaining inventory fraction.
- \(r_{\text{time}} = (T - t) / T\) — time remaining.
- \(S_{\text{ratio}} = \mathrm{Close}_t / \mathrm{Close}_{\text{start}}\).
- liq_z, sig_z — z-scores of Amihud and \(\sigma_{\text{daily}}\) vs train statistics.
- regime — HMM label.
- \(\text{pva} = \mathrm{clip}\bigl(\mathrm{Close}/p_{\text{arr}} - 1,\,-0.5,\,0.5\bigr)\).
- twap_gap — normalized inventory minus TWAP schedule target after start bar \(t_0\); 0 before \(t_0\).
- news_z — z-scored news count, clipped to \([-4,4]\).
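As a rough illustration of how the nine components fit together, the sketch below assembles the vector from already-computed inputs. build_observation and its argument names are hypothetical; the z-scores and twap_gap are passed in rather than derived, since their exact computation lives in the environment.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def build_observation(X, X0, t, T, close, close_start,
                      liq_z, sig_z, regime, p_arr, twap_gap, news_z):
    """Return the 9-dimensional observation in the documented order."""
    return [
        X / X0,                                 # inventory_norm
        (T - t) / T,                            # rem (time remaining)
        close / close_start,                    # S_ratio
        liq_z,                                  # Amihud z-score
        sig_z,                                  # sigma_daily z-score
        float(regime),                          # HMM label
        clip(close / p_arr - 1.0, -0.5, 0.5),   # pva
        twap_gap,                               # inventory minus TWAP target
        clip(news_z, -4.0, 4.0),                # news_z, clipped
    ]


obs = build_observation(X=50, X0=100, t=2, T=10, close=99.0, close_start=100.0,
                        liq_z=0.1, sig_z=-0.2, regime=1, p_arr=100.0,
                        twap_gap=0.0, news_z=7.5)
```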
Action space
One continuous action in \([0,1]\): fraction of current inventory to sell this step (optional per-step cap and residual-bound shaping).
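A minimal sketch of turning the action into a quantity sold. It assumes the optional per-step cap limits the same fraction of current inventory; the real environment may define the cap differently, and residual-bound shaping is not modelled. shares_to_sell is a hypothetical helper name.

```python
def shares_to_sell(action, inventory, max_fraction_per_step=None):
    """Map a raw action to a quantity sold this step."""
    # Keep the action inside the documented [0, 1] range.
    frac = min(1.0, max(0.0, action))
    # Apply the optional per-step cap (assumed to act on the same fraction).
    if max_fraction_per_step is not None:
        frac = min(frac, max_fraction_per_step)
    return frac * inventory
```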
Reward (physical USD mode, sketch)
Inventory-risk term scales roughly as
\[ \left(\frac{X}{X_0}\right)^2 \sigma_{\text{daily}}^2 \cdot w_{\text{time}}(r_{\text{time}}), \quad w_{\text{time}}(r)= r^{3/2} \]
plus normalized dollar shortfall vs arrival (scaled by is_reward_scale), optional TWAP-slice bonus, relative-IS term, terminal penalties/bonuses, and optional eval_is_reward_coef alignment.
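The inventory-risk scaling can be evaluated directly. The sketch below computes only the magnitude of that one term; its sign and coefficient inside the full reward, and all the other reward components, are not reproduced here. The function name is hypothetical.

```python
def inventory_risk_term(X, X0, sigma_daily, r_time):
    """Magnitude of (X/X0)^2 * sigma_daily^2 * w_time(r_time), w_time(r) = r^(3/2)."""
    w_time = r_time ** 1.5
    return (X / X0) ** 2 * sigma_daily ** 2 * w_time
```

With half the inventory left, a quarter of the horizon remaining, and \(\sigma_{\text{daily}} = 0.01\), the term is \(0.25 \times 10^{-4} \times 0.125 = 3.125 \times 10^{-6}\).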
physical_institutional_kwargs defaults
When notional > 0, the helper returns (unless overridden):
- max_inventory_fraction_per_step = 0.25
- is_reward_scale = 1.28
- twap_slice_bonus_coef = 0.60
- terminal_inventory_penalty = 5.0
If notional is zero or unset, the dict is empty (legacy dimensionless env).
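The documented behaviour can be mimicked as below. This is a stand-in written from the description above, not the repository's physical_institutional_kwargs; the name physical_kwargs_sketch and the keyword-override mechanism are assumptions.

```python
def physical_kwargs_sketch(notional=None, **overrides):
    """Return the documented defaults when notional > 0, else an empty dict."""
    if notional is None or notional <= 0:
        return {}  # legacy dimensionless env
    kwargs = {
        "max_inventory_fraction_per_step": 0.25,
        "is_reward_scale": 1.28,
        "twap_slice_bonus_coef": 0.60,
        "terminal_inventory_penalty": 5.0,
    }
    kwargs.update(overrides)  # caller-supplied values win
    return kwargs
```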
5. Market impact model
Trade dollars \(\mathrm{td} = |v|\, S_{\text{base}}\), bar dollar volume \(\mathrm{dv} \approx V \cdot S_{\text{close}}\) (with guards), participation \(\mathrm{part} = \mathrm{td}/\mathrm{dv}\). Then:
Defaults: \(\alpha_\sigma = 0.65\), \(\alpha_A = 0.35\), \(p_{\max} = 12\) (inside sqrt and Amihud term), \(\phi_{\max} = 0.35\).
Arrival price for benchmarks and physical RL: previous bar’s close before the first execution bar (arrival_price_full).
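The participation ratio and the arrival-price convention can be sketched as follows. The impact law itself is not reproduced. The floor on dollar volume stands in for the unspecified guards and is an assumption, as are the function names.

```python
def participation(v, s_base, bar_volume, s_close, min_dollar_volume=1.0):
    """Trade dollars divided by bar dollar volume."""
    td = abs(v) * s_base
    # Assumed guard: avoid dividing by a zero or tiny dollar volume.
    dv = max(bar_volume * s_close, min_dollar_volume)
    return td / dv


def arrival_price(closes, first_exec_index):
    """Close of the bar immediately before the first execution bar."""
    if first_exec_index < 1:
        raise ValueError("no prior bar available for the arrival price")
    return closes[first_exec_index - 1]
```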
6. Benchmarks
All strategies compute average execution price vs the same arrival, then IS in bps (section 7).
TWAP
Effective horizon \(T_{\text{eff}} = T - t_0\). Per-bar sell fraction in physical mode:
\[ q_k = \frac{Q}{T_{\text{eff}}}, \quad k = t_0,\ldots,T-1 \]
Each slice is priced with sell_effective_close; legacy mode uses bar closes without impact.
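A minimal sketch of the TWAP schedule, assuming zero trading before the start bar \(t_0\). Pricing of each slice is left out, and twap_schedule is a hypothetical name.

```python
def twap_schedule(Q, T, t0=0):
    """Equal slices Q / (T - t0) on bars t0..T-1, zero before t0."""
    t_eff = T - t0
    if t_eff <= 0:
        raise ValueError("start bar must fall inside the horizon")
    return [0.0] * t0 + [Q / t_eff] * t_eff
```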
VWAP
\[ w_i = \frac{V_i}{\sum_j V_j}, \quad q_i = Q \cdot w_i \]
Same impact law per slice in physical mode.
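The volume-weighted split can be sketched in a few lines. It assumes the per-bar volumes over the execution window are known up front; vwap_schedule is a hypothetical name.

```python
def vwap_schedule(Q, volumes):
    """Split Q across bars in proportion to each bar's volume."""
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total volume must be positive")
    return [Q * v / total for v in volumes]
```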
Almgren–Chriss
\[ \kappa = \sqrt{\frac{\lambda \sigma^2}{\eta}}, \qquad x_i = Q\,\frac{\sinh\bigl(\kappa(T_{\text{eff}}-i)\bigr)}{\sinh(\kappa T_{\text{eff}})}, \quad \Delta x_i = x_{i-1} - x_i \]
Parameters \(\eta,\gamma,\sigma,\lambda\) from run config; trades \(\Delta x_i\) executed with impact in physical mode.
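The holdings trajectory and per-bar trades follow directly from the formula. The sketch below is a plain transcription with one added assumption: when \(\kappa T_{\text{eff}}\) is numerically zero it falls back to equal slices, which is the limiting case of the formula. The function name is hypothetical.

```python
import math


def almgren_chriss_trades(Q, t_eff, lam, sigma, eta):
    """Per-bar sell quantities from the sinh holdings trajectory."""
    kappa = math.sqrt(lam * sigma ** 2 / eta)
    if kappa * t_eff < 1e-12:
        # Limiting case as kappa -> 0: equal slices (assumed guard).
        return [Q / t_eff] * t_eff
    denom = math.sinh(kappa * t_eff)
    # Holdings x_0..x_T, with x_0 = Q and x_T = 0.
    holdings = [Q * math.sinh(kappa * (t_eff - i)) / denom
                for i in range(t_eff + 1)]
    # Trade on bar i is the drop in holdings, x_{i-1} - x_i.
    return [holdings[i - 1] - holdings[i] for i in range(1, t_eff + 1)]


trades = almgren_chriss_trades(Q=1000, t_eff=5, lam=1e-6, sigma=0.02, eta=1e-6)
```

For positive \(\kappa\) the schedule is front-loaded: the first slice is the largest and the slices shrink toward the end of the window.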
Immediate
Sell full \(Q\) on the first execution bar with one impact call (legacy: first close).
7. Implementation shortfall (IS)
For a sell, mean IS in basis points:
\[ \mathrm{IS}_{\text{bps}} = \frac{\bar p_{\text{exec}} - p_{\text{arr}}}{p_{\text{arr}}} \times 10^4 \]
Higher \(\mathrm{IS}_{\text{bps}}\) is better for a sell (more proceeds vs arrival).
USD edge from a \(\Delta\mathrm{IS}_{\text{bps}}\) gap vs a benchmark:
\[ \Delta \mathrm{USD} \approx N \cdot \frac{\Delta\mathrm{IS}_{\text{bps}}}{10^4} \]
Example: \(\Delta\mathrm{IS} = 19.85\) bps on \(N = \$5{,}000{,}000\) → \(\Delta\mathrm{USD} \approx \$9{,}925\).
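Both formulas in this section are short enough to check by hand. The sketch below assumes \(\bar p_{\text{exec}}\) is the quantity-weighted average fill price; the function names and the (quantity, price) fill format are illustrative.

```python
def is_bps(fills, p_arr):
    """Sell-side IS in bps from (quantity, price) fills and the arrival price."""
    total_qty = sum(q for q, _ in fills)
    avg_exec = sum(q * p for q, p in fills) / total_qty
    return (avg_exec - p_arr) / p_arr * 1e4


def usd_edge(notional, delta_is_bps):
    """Approximate dollar value of an IS gap on a given notional."""
    return notional * delta_is_bps / 1e4


edge = usd_edge(5_000_000, 19.85)
```

Here edge reproduces the \$9,925 figure from the example, and two equal fills at 99.0 and 100.0 against an arrival of 100.0 give an IS of -50 bps.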
8. LLM governance
explain_execution builds a structured Claude prompt (regime name, volatility, Amihud liquidity, inventory, action fraction, cost vs TWAP/Almgren–Chriss). The prompt is hashed (SHA-256, first 16 hex chars); responses are cached under data/cached_llm/<hash>.json.
If ANTHROPIC_API_KEY is missing, an offline template paragraph is returned and still written to the cache for reproducibility. Model name defaults to ANTHROPIC_MODEL or claude-sonnet-4-20250514.
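The cache key scheme can be sketched with the standard library. This illustrates only the hashing and path layout described above, not explain_execution itself; the UTF-8 encoding of the prompt and the helper name cache_path_for are assumptions.

```python
import hashlib
from pathlib import Path


def cache_path_for(prompt, cache_dir="data/cached_llm"):
    """Path of the cached response for a prompt: <first 16 hex of SHA-256>.json."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"
```

Because the key depends only on the prompt text, an identical prompt always maps to the same file, which is what makes cached explanations reproducible.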
9. Configuration reference (Run page)
- Data split: Train 2018–2022, validation 2023, test 2024 (as used for split parquet files in the standard pipeline).
- HMM states: 2 or 3 Gaussian components; optional OBI feature when BBO data is merged.
- Horizon T: Number of daily bars in the execution window (episode length).
- Policy: Trained PPO checkpoint from models/ vs random agent baseline.
- Execution Start Date: Calendar date the execution program begins. The episode and all benchmarks are evaluated on the same T-day window starting from this date. Non-trading days (weekends, holidays) snap backward to the nearest previous session.
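The backward snap for the Execution Start Date can be sketched with a binary search over the available sessions. It assumes the trading dates are held in an ascending list; snap_to_session is a hypothetical helper, not the app's own function.

```python
from bisect import bisect_right
from datetime import date


def snap_to_session(requested, sessions):
    """Return the latest session on or before the requested date.

    `sessions` must be sorted in ascending order.
    """
    i = bisect_right(sessions, requested)
    if i == 0:
        raise ValueError("requested date precedes the first session")
    return sessions[i - 1]


sessions = [date(2024, 1, 5), date(2024, 1, 8), date(2024, 1, 9)]
snapped = snap_to_session(date(2024, 1, 6), sessions)  # a Saturday
```

In this example the Saturday request resolves to Friday 2024-01-05, while a request that already falls on a session is returned unchanged.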
10. Trend classification
Rolling return on Close with lookback \(L\) (default 20):
Classification (defaults \(+2\%\), \(-2\%\)):
\[ \text{label} = \begin{cases} \text{up} & r_t \ge 0.02 \\ \text{down} & r_t \le -0.02 \\ \text{mid} & \text{otherwise} \end{cases} \]
Code: TREND_UP, TREND_MID, TREND_DOWN.
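A sketch of the classification, assuming the rolling return is the simple return \(r_t = \mathrm{Close}_t / \mathrm{Close}_{t-L} - 1\); the pipeline may use a different definition (for example a log return). Labels are plain strings here because the values behind TREND_UP, TREND_MID, and TREND_DOWN are not given, and trend_labels is a hypothetical name.

```python
def trend_labels(closes, lookback=20, up=0.02, down=-0.02):
    """Label each bar up / mid / down from its lookback return.

    Bars without enough history get None.
    """
    labels = []
    for t, close in enumerate(closes):
        if t < lookback:
            labels.append(None)
            continue
        # Assumed simple return over the lookback window.
        r = close / closes[t - lookback] - 1.0
        if r >= up:
            labels.append("up")
        elif r <= down:
            labels.append("down")
        else:
            labels.append("mid")
    return labels
```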