User Manual — Optimal Execution

1. Introduction

QuantFin Exeter is a research and demo stack for optimal trade execution on SPY-style daily data. It labels market regimes, trains a PPO policy to schedule liquidation over a fixed horizon, compares that policy against classical benchmarks (TWAP, VWAP, Almgren–Chriss, immediate execution), and optionally explains its decisions with an LLM layer.

It is aimed at students, quants, and PMs who want a transparent pipeline from features → regimes → RL → benchmarks → narrative governance.

Source code: github.com/KirikPapka/quantfin_exeter

2. Data

The pipeline expects CRSP-style daily parquet (per-split feature files under data/features/features_{train,val,test}.parquet), filtered to ticker SPY when a ticker column exists.

Core columns built in load_features_parquet include:

OHLCV: Open, High, Low, Close, Volume
Volatility: realised_vol_20 (20-day realised vol, written $\mathrm{rv}_{20}$ and treated as annualised), and the daily volatility derived from it, \[ \sigma_{\text{daily}} = \frac{\mathrm{rv}_{20}}{\sqrt{252}} \]
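Liquidity / microstructure proxies: amihud_illiquidity, \[ \text{bid\_ask\_proxy} = \frac{H - L}{C + \varepsilon}, \qquad \text{volume\_to\_spread} = \frac{V}{\text{bid\_ask\_proxy} + \varepsilon} \] where $H$, $L$, $C$, $V$ are the bar's High, Low, Close, and Volume, and $\varepsilon$ is a small positive constant that guards against division by zero.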
VIX: vix_aligned (forward-filled)
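
The derived columns above can be sketched for a single bar as follows. This is an illustration only, not the body of load_features_parquet: the function name derived_columns and the value of EPS are assumptions, since the manual does not state the guard constant.

```python
import math

EPS = 1e-12  # assumed guard value; the manual only calls it epsilon


def derived_columns(high, low, close, volume, realised_vol_20):
    """Compute the section 2 derived quantities for one daily bar."""
    # Annualised 20-day realised vol converted to a daily figure.
    sigma_daily = realised_vol_20 / math.sqrt(252)
    # High-low range relative to close, used as a spread proxy.
    bid_ask_proxy = (high - low) / (close + EPS)
    # Volume per unit of proxied spread.
    volume_to_spread = volume / (bid_ask_proxy + EPS)
    return {
        "sigma_daily": sigma_daily,
        "bid_ask_proxy": bid_ask_proxy,
        "volume_to_spread": volume_to_spread,
    }


row = derived_columns(high=101.0, low=99.0, close=100.0,
                      volume=1_000_000, realised_vol_20=0.16)
print(row)
```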

Optional enrichments:

BBO order imbalance (NASDAQ ITCH L1): merged from data/processed/bbo_daily.parquet when present; daily series can start from 2019-01-02 depending on your ITCH extract. Appears as order_imbalance_daily for HMM features when merged.
Finnhub news: optional daily news counts merged when a news parquet is configured in the pipeline.

3. Regime detection (HMM)

A Hidden Markov Model (HMM) assumes the market switches among a small number of latent states; each day you only observe features, not the state. GaussianHMM (hmmlearn) with full covariance is fit on scaled features; predicted states are re-ordered by mean realised volatility so lower indices mean calmer conditions (for 2 states: 0 = calm, 1 = elevated; with 3 states you get calm / elevated / stressed ordering by vol).

Default feature vector (before StandardScaler):

\[ \bigl(\mathrm{rv}_{20},\; \log(1 + \max(0, \text{volume\_to\_spread})),\; \text{optional OBI}\bigr)^\top \]
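
A minimal sketch of how one raw feature row could be assembled before scaling. The function name hmm_feature_row is hypothetical, and passing None to omit the OBI term is an assumption about how the optional feature is switched off.

```python
import math


def hmm_feature_row(realised_vol_20, volume_to_spread, order_imbalance=None):
    """Build the unscaled HMM feature vector for one day."""
    row = [
        realised_vol_20,
        # log(1 + max(0, x)) keeps the transform defined for negative inputs.
        math.log1p(max(0.0, volume_to_spread)),
    ]
    if order_imbalance is not None:
        row.append(order_imbalance)  # optional OBI feature
    return row


print(hmm_feature_row(0.18, 5.0e7))
print(hmm_feature_row(0.18, 5.0e7, order_imbalance=0.03))
```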

If there are too few rows, the fit fails, or any state holds fewer than 5% of days, the detector falls back to a simple volatility threshold on realised_vol_20 (default threshold 0.24).
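
The re-ordering and fallback logic can be illustrated without hmmlearn. The sketch below takes already-predicted state labels as input; all three function names are hypothetical, and treating values strictly above the threshold as the elevated regime is an assumption.

```python
from collections import Counter, defaultdict


def relabel_by_mean_vol(states, realised_vol):
    """Renumber states so that lower labels have lower mean realised vol."""
    sums, counts = defaultdict(float), Counter()
    for s, rv in zip(states, realised_vol):
        sums[s] += rv
        counts[s] += 1
    ordered = sorted(counts, key=lambda s: sums[s] / counts[s])
    mapping = {old: new for new, old in enumerate(ordered)}
    return [mapping[s] for s in states]


def has_thin_state(states, n_states, min_share=0.05):
    """True if any of the n_states labels covers less than min_share of days."""
    if not states:
        return True
    counts = Counter(states)
    return any(counts.get(k, 0) / len(states) < min_share
               for k in range(n_states))


def threshold_regimes(realised_vol, threshold=0.24):
    """Fallback labelling: 0 = calm, 1 = elevated (strict '>' is assumed)."""
    return [1 if rv > threshold else 0 for rv in realised_vol]


raw = [1, 1, 0, 0]                 # raw labels, arbitrary order
rv = [0.10, 0.12, 0.30, 0.35]      # realised_vol_20 on the same days
labels = relabel_by_mean_vol(raw, rv)
if has_thin_state(labels, n_states=2):
    labels = threshold_regimes(rv)
print(labels)
```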

4. Reinforcement learning (PPO)

PPO (Proximal Policy Optimization) is an on-policy actor–critic method that stabilizes policy updates by clipping the probability ratio between old and new policies. Here it learns a continuous policy over execution fractions on the OptimalExecutionEnv (Gymnasium).

Observation space (9 dimensions)

Vector: [inventory_norm, rem, S_ratio, liq_z, sig_z, regime, pva, twap_gap, news_z]

inventory_norm: $x_{\text{inv}} = X / X_0$, the remaining inventory fraction.
rem: $r_{\text{time}} = (T - t) / T$, the fraction of the horizon remaining.
S_ratio: $S_{\text{ratio}} = \mathrm{Close}_t / \mathrm{Close}_{\text{start}}$.
liq_z, sig_z: z-scores of Amihud illiquidity and $\sigma_{\text{daily}}$ against train-split statistics.
regime: HMM label.
pva: $\mathrm{clip}\bigl(\mathrm{Close}/p_{\text{arr}} - 1,\,-0.5,\,0.5\bigr)$, the clipped relative deviation of Close from the arrival price $p_{\text{arr}}$.
twap_gap: normalized inventory minus the TWAP schedule target after start bar $t_0$; 0 before $t_0$.
news_z: z-scored news count, clipped to $[-4,4]$.
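
A few of these components are simple enough to sketch directly. The helper names below are hypothetical and the snippet is not taken from OptimalExecutionEnv; it only mirrors the formulas listed above.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def inventory_norm(inventory, initial_inventory):
    return inventory / initial_inventory


def time_remaining(t, horizon):
    return (horizon - t) / horizon


def price_vs_arrival(close, arrival_price):
    # Relative move of Close against arrival, bounded to [-0.5, 0.5].
    return clip(close / arrival_price - 1.0, -0.5, 0.5)


def clipped_news_z(news_z):
    return clip(news_z, -4.0, 4.0)


print(inventory_norm(600.0, 1000.0), time_remaining(2, 10),
      price_vs_arrival(102.0, 100.0), clipped_news_z(6.3))
```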

Action space

One continuous action in $[0,1]$: fraction of current inventory to sell this step (optional per-step cap and residual-bound shaping).

Reward (physical USD mode, sketch)

Inventory-risk term scales roughly as

\[ \left(\frac{X}{X_0}\right)^2 \sigma_{\text{daily}}^2 \cdot w_{\text{time}}(r_{\text{time}}), \quad w_{\text{time}}(r)= r^{3/2} \]

plus normalized dollar shortfall vs arrival (scaled by is_reward_scale), optional TWAP-slice bonus, relative-IS term, terminal penalties/bonuses, and optional eval_is_reward_coef alignment.
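
As a rough illustration of the inventory-risk scaling only, the magnitude of that term can be computed as below. The function name is hypothetical, and the sign and any coefficient applied inside the actual reward are not specified here, so the sketch returns the unsigned magnitude.

```python
def inventory_risk_magnitude(inventory, initial_inventory,
                             sigma_daily, time_remaining):
    """(X / X0)^2 * sigma_daily^2 * r^(3/2), without sign or coefficient."""
    w_time = time_remaining ** 1.5
    return (inventory / initial_inventory) ** 2 * sigma_daily ** 2 * w_time


# Half the inventory left at the start of the window, 1% daily vol.
print(inventory_risk_magnitude(50.0, 100.0, 0.01, 1.0))
# Same inventory with a quarter of the horizon left: weight drops to 0.125.
print(inventory_risk_magnitude(50.0, 100.0, 0.01, 0.25))
```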

physical_institutional_kwargs defaults

When notional > 0, the helper returns (unless overridden):

max_inventory_fraction_per_step = 0.25
is_reward_scale = 1.28
twap_slice_bonus_coef = 0.60
terminal_inventory_penalty = 5.0

If notional is zero or unset, the dict is empty (legacy dimensionless env).
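
The behaviour described above can be mimicked with a small stand-in. The function below is a hypothetical sketch, not the project's helper: its signature, the keyword-override mechanism, and returning an empty dict even when overrides are passed with no notional are all assumptions.

```python
def physical_institutional_kwargs_sketch(notional=None, **overrides):
    """Return the documented defaults when notional > 0, else an empty dict."""
    if notional is None or notional <= 0:
        return {}  # legacy dimensionless env
    kwargs = {
        "max_inventory_fraction_per_step": 0.25,
        "is_reward_scale": 1.28,
        "twap_slice_bonus_coef": 0.60,
        "terminal_inventory_penalty": 5.0,
    }
    kwargs.update(overrides)  # caller-supplied values win
    return kwargs


print(physical_institutional_kwargs_sketch(5_000_000))
print(physical_institutional_kwargs_sketch(5_000_000, is_reward_scale=1.0))
print(physical_institutional_kwargs_sketch(0))
```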

5. Market impact model

Trade dollars $\mathrm{td} = |v|\, S_{\text{base}}$, bar dollar volume $\mathrm{dv} \approx V \cdot S_{\text{close}}$ (with guards), participation $\mathrm{part} = \mathrm{td}/\mathrm{dv}$. Then:

\[ \sqrt{\mathrm{part}^\ast} = \sqrt{\min(\mathrm{part},\, p_{\max}) + \varepsilon} \]

\[ \phi = \mathrm{clip}\Bigl( \alpha_\sigma\, \sigma_{\text{daily}}\, \sqrt{\mathrm{part}^\ast} + \alpha_A\, \mathrm{Amihud}\, \min(\mathrm{part},\, p_{\max}),\; 0,\, \phi_{\max} \Bigr) \]

\[ P_{\text{eff}} = \max\bigl(S_{\text{base}}(1-\phi),\, \varepsilon\bigr) \]

Defaults: $\alpha_\sigma = 0.65$, $\alpha_A = 0.35$, $p_{\max} = 12$ (the participation cap, applied inside both the square-root term and the Amihud term), $\phi_{\max} = 0.35$.

Arrival price for benchmarks and physical RL: previous bar’s close before the first execution bar (arrival_price_full).
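
The impact law in this section can be written out as a short function. This is a sketch under stated assumptions rather than the project's pricing routine: the function name is hypothetical, the epsilon value is assumed, and the dollar-volume guard is reduced to a simple floor.

```python
import math


def sell_effective_price_sketch(shares, s_base, bar_volume, s_close,
                                sigma_daily, amihud,
                                alpha_sigma=0.65, alpha_amihud=0.35,
                                p_max=12.0, phi_max=0.35, eps=1e-12):
    """Effective sell price after the participation-based impact haircut."""
    trade_dollars = abs(shares) * s_base
    dollar_volume = max(bar_volume * s_close, eps)  # assumed guard: simple floor
    part = trade_dollars / dollar_volume
    capped = min(part, p_max)
    phi = (alpha_sigma * sigma_daily * math.sqrt(capped + eps)
           + alpha_amihud * amihud * capped)
    phi = max(0.0, min(phi, phi_max))               # clip to [0, phi_max]
    return max(s_base * (1.0 - phi), eps)


# 40,000 shares at 100 against 1,000,000 shares of bar volume: part = 0.04.
print(sell_effective_price_sketch(40_000, 100.0, 1_000_000, 100.0,
                                  sigma_daily=0.01, amihud=0.0))
```

With these inputs the square-root term is about 0.2, so the haircut is roughly 0.65 × 0.01 × 0.2 = 0.0013, or 13 bps off the base price.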

6. Benchmarks

All strategies compute average execution price vs the same arrival, then IS in bps (section 7).

TWAP

Effective horizon $T_{\text{eff}} = T - t_0$. Per-bar slice in physical mode:

\[ q_k = \frac{Q}{T_{\text{eff}}}, \quad k = t_0,\ldots,T-1 \]

Each slice priced with sell_effective_close; legacy mode uses bar closes without impact.
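
A minimal sketch of the TWAP slicing alone, before any pricing is applied. The function name and the dict-of-bars return shape are illustrative choices, not the project's interface.

```python
def twap_schedule(total_quantity, horizon, start_bar=0):
    """Equal slices Q / T_eff on bars t0, ..., T-1."""
    t_eff = horizon - start_bar
    if t_eff <= 0:
        raise ValueError("start_bar must be earlier than horizon")
    slice_qty = total_quantity / t_eff
    return {k: slice_qty for k in range(start_bar, horizon)}


schedule = twap_schedule(total_quantity=100.0, horizon=10, start_bar=2)
print(schedule)
```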

VWAP

\[ w_i = \frac{V_i}{\sum_j V_j}, \quad q_i = Q \cdot w_i \]

Same impact law per slice in physical mode.
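
The volume weighting can be sketched in a few lines. The function name is hypothetical, and rejecting a window with no volume is an assumption about how that edge case should be handled.

```python
def vwap_schedule(total_quantity, volumes):
    """Split Q across bars in proportion to each bar's volume."""
    total_volume = sum(volumes)
    if total_volume <= 0:
        raise ValueError("volumes must sum to a positive number")
    return [total_quantity * v / total_volume for v in volumes]


print(vwap_schedule(100.0, [1_000_000, 3_000_000]))
```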

Almgren–Chriss

\[ \kappa = \sqrt{\frac{\lambda \sigma^2}{\eta}}, \qquad x_i = Q\,\frac{\sinh\bigl(\kappa(T_{\text{eff}}-i)\bigr)}{\sinh(\kappa T_{\text{eff}})}, \quad \Delta x_i = x_{i-1} - x_i \]

Parameters $\eta,\gamma,\sigma,\lambda$ from run config; trades $\Delta x_i$ executed with impact in physical mode.
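
The holdings curve and the resulting trades can be sketched as below, using only the formula shown above. The function name is hypothetical, and the straight-line fallback for very small $\kappa T_{\text{eff}}$ is an added numerical guard, not something the manual specifies.

```python
import math


def almgren_chriss_trades(total_quantity, t_eff, lam, sigma, eta):
    """Per-bar trades from x_i = Q sinh(kappa (T_eff - i)) / sinh(kappa T_eff)."""
    kappa = math.sqrt(lam * sigma ** 2 / eta)
    if kappa * t_eff < 1e-12:
        # Assumed guard: as kappa -> 0 the curve tends to a straight line.
        holdings = [total_quantity * (1.0 - i / t_eff) for i in range(t_eff + 1)]
    else:
        denom = math.sinh(kappa * t_eff)
        holdings = [total_quantity * math.sinh(kappa * (t_eff - i)) / denom
                    for i in range(t_eff + 1)]
    # Trade on bar i is the drop in holdings from i-1 to i.
    return [holdings[i - 1] - holdings[i] for i in range(1, t_eff + 1)]


trades = almgren_chriss_trades(100.0, t_eff=5, lam=0.1, sigma=1.0, eta=0.1)
print([round(x, 3) for x in trades])
```

With positive risk aversion the trades are front-loaded: each slice is smaller than the one before it, and the slices sum to $Q$.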

Immediate

Sell full $Q$ on the first execution bar with one impact call (legacy: first close).

7. Implementation shortfall (IS)

For a sell, mean IS in basis points:

\[ \mathrm{IS}_{\text{bps}} = \frac{\bar p_{\text{exec}} - p_{\text{arr}}}{p_{\text{arr}}} \times 10^4 \]

Higher $\mathrm{IS}_{\text{bps}}$ is better for a sell (more proceeds vs arrival).

USD edge on notional $N$ from a $\Delta\mathrm{IS}_{\text{bps}}$ gap vs a benchmark:

\[ \Delta \mathrm{USD} \approx N \cdot \frac{\Delta\mathrm{IS}_{\text{bps}}}{10^4} \]

Example: $\Delta\mathrm{IS} = 19.85$ bps on $N = \$5{,}000{,}000$ → $\Delta\mathrm{USD} \approx \$9{,}925$.
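
Both formulas in this section are one-liners in code. The function names below are illustrative; the second call reproduces the worked example.

```python
def is_bps(avg_exec_price, arrival_price):
    """Implementation shortfall for a sell, in basis points (higher is better)."""
    return (avg_exec_price - arrival_price) / arrival_price * 1e4


def usd_edge(notional, delta_is_bps):
    """Approximate dollar value of an IS gap on a given notional."""
    return notional * delta_is_bps / 1e4


print(is_bps(99.87, 100.0))          # about -13 bps
print(usd_edge(5_000_000, 19.85))    # about 9925 USD
```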

8. LLM governance

explain_execution builds a structured Claude prompt (regime name, volatility, Amihud liquidity, inventory, action fraction, cost vs TWAP/Almgren–Chriss). The prompt is hashed (SHA-256, first 16 hex chars); responses are cached under data/cached_llm/<hash>.json.

If ANTHROPIC_API_KEY is missing, an offline template paragraph is returned and still written to the cache for reproducibility. Model name defaults to ANTHROPIC_MODEL or claude-sonnet-4-20250514.
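
The cache key described above can be derived as follows. This is a sketch, not the body of explain_execution: the function name is hypothetical and UTF-8 encoding of the prompt before hashing is an assumption.

```python
import hashlib
from pathlib import Path


def cache_path_for_prompt(prompt, cache_dir="data/cached_llm"):
    """Map a prompt to <cache_dir>/<first 16 hex chars of SHA-256>.json."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"


path = cache_path_for_prompt("regime=calm; action_fraction=0.12")
print(path)
```

Because the key depends only on the prompt text, an identical prompt always resolves to the same file, which is what makes cached explanations reproducible.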

9. Configuration reference (Run page)

Data split: Train 2018–2022, validation 2023, test 2024 (as used for split parquet files in the standard pipeline).
HMM states: 2 or 3 Gaussian components; optional OBI feature when BBO data is merged.
Horizon T: Number of daily bars in the execution window (episode length).
Policy: Trained PPO checkpoint from models/ vs random agent baseline.
Execution Start Date: Calendar date the execution program begins. The episode and all benchmarks are evaluated on the same T-day window starting from this date. Non-trading days (weekends, holidays) snap back to the most recent prior session.
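
The date-snapping rule can be illustrated with a sorted list of trading sessions. The function name is hypothetical, and raising an error when the requested date precedes all available sessions is an assumption.

```python
from bisect import bisect_right
from datetime import date


def snap_to_session(requested, sessions):
    """Return the latest session on or before `requested`.

    `sessions` must be sorted in ascending order.
    """
    idx = bisect_right(sessions, requested) - 1
    if idx < 0:
        raise ValueError("requested date is before the first session")
    return sessions[idx]


sessions = [date(2024, 1, 4), date(2024, 1, 5), date(2024, 1, 8)]
# 2024-01-06 is a Saturday, so it snaps back to Friday 2024-01-05.
print(snap_to_session(date(2024, 1, 6), sessions))
```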

10. Trend classification

Rolling return on Close with lookback $L$ (default 20):

\[ r_t = \frac{C_t}{C_{t-L}} - 1 \]

Classification (defaults $+2\%$, $-2\%$):

\[ \text{label} = \begin{cases} \text{up} & r_t \ge 0.02 \\ \text{down} & r_t \le -0.02 \\ \text{mid} & \text{otherwise} \end{cases} \]

Code: TREND_UP, TREND_MID, TREND_DOWN.
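
The classification rule can be sketched as a small function. The values behind TREND_UP, TREND_MID, and TREND_DOWN are not given here, so the sketch returns plain strings instead; the function name is illustrative.

```python
def trend_label(close_now, close_lookback, up=0.02, down=-0.02):
    """Classify the rolling return C_t / C_{t-L} - 1 as up, down, or mid."""
    r = close_now / close_lookback - 1.0
    if r >= up:
        return "up"
    if r <= down:
        return "down"
    return "mid"


print(trend_label(103.0, 100.0))   # +3% over the lookback
print(trend_label(97.0, 100.0))    # -3%
print(trend_label(100.5, 100.0))   # +0.5%
```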