Reinforcement learning for futures trading — combining multi-timeframe market structure analysis, LLM-powered regime classification, and MaskablePPO for discrete order generation on Nikkei 225 and Nasdaq 100 futures.
A fully offline, broker-decoupled training simulation for futures trading. Processes historical market data through a feature pipeline, enriches it with LLM assessments, and trains an RL agent to generate trading decisions.
Inverse Fair Value Gaps, multi-timeframe trend alignment, and liquidity pools computed with pure Polars on 5-minute OHLCV candles.
Claude assesses market regime, setup quality, and risk-reward — encoded as 11 normalized observation features for the RL agent.
Gymnasium environment with action masking trains a PPO agent to select entry, size, stop, and target for each 5-minute bar.
Sharpe, max drawdown, win rate, profit factor across episodes. Promotion criteria gate deployment quality.
CandleValidator, gap detection, quality audits with per-day completeness, spike detection, and OHLC relationship checks.
Fill simulation with 1-tick slippage, MAE/MFE tracking, breakeven trailing stops, and commission modeling (JPY & USD contracts).
An IFVG-centric trading strategy: detect structural gaps, wait for price inversion, confirm with multi-timeframe trend alignment, and size trades based on room-to-right.
Scan 3-candle patterns for gaps ≥ 4 ticks. When price trades through, it becomes an Inverse FVG — a support/resistance zone. NIY tick size: 5.0 JPY, NQ: 0.25 USD.
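The 3-candle scan can be sketched in a few lines (a minimal sketch, assuming a bullish gap means candle 3's low clears candle 1's high; the `Candle` type and function name are hypothetical, not the project's actual API):

```python
from dataclasses import dataclass

@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float

def find_bullish_fvgs(candles: list[Candle], tick_size: float,
                      min_gap_ticks: float = 4.0) -> list[tuple[int, float, float]]:
    """Scan rolling 3-candle windows for bullish Fair Value Gaps.

    A bullish FVG exists when candle 3's low sits above candle 1's high,
    leaving an unfilled zone of at least min_gap_ticks ticks.
    """
    gaps = []
    for i in range(len(candles) - 2):
        c1, c3 = candles[i], candles[i + 2]
        gap_ticks = (c3.low - c1.high) / tick_size
        if gap_ticks >= min_gap_ticks:
            gaps.append((i, c1.high, c3.low))  # (start index, zone bottom, zone top)
    return gaps
```

The zone only becomes an Inverse FVG later, once price trades back through it; that lifecycle is tracked separately.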
features/ifvg.py

Multi-timeframe structure: daily (weight 0.4), 4-hour (0.35), 15-minute (0.25). Composite trend score from −1.0 to +1.0 filters for directional conviction.
features/trend.py

Cluster swing points into liquidity pools, count obstacles toward target, check daily range exhaustion. Scored 0–100 for long and short.
features/room_to_right.py

Claude classifies regime (trending, choppy, event-driven, low-liquidity), assesses quality, and provides confidence + narrative. Encoded as 11 features.
llm/

MaskablePPO observes 39 features and selects from 81 action combinations: entry direction, position size, stop distance, and target type.
rl/env.py

Nine packages with clean dependency boundaries. Common provides the foundation, data handles ingestion, features compute market structure, sim manages offline simulation, llm integrates Claude, rl contains the environment, execution targets IB, and monitoring handles alerts.
Three pluggable data sources feed into a unified backfill pipeline that validates, normalizes, and stores 5-minute OHLCV candles in TimescaleDB.
1-minute CSVs aggregated to 5m via Polars group_by_dynamic. Symbol mapping:
NQ→NSXUSD, NIY→JPXJPY. Semicolon or comma delimited.
Pre-formatted 5m CSVs with standard 8-column schema. Auto-assigns contract months if missing. Supports Databento exports and custom sources.
data/sources.py

Real-time and historical via IB Gateway on port 4002. Read-only mode by default. Client ID isolation for concurrent connections.
data/ib_client.py

The BackfillService orchestrates historical data loading with resume support,
batch processing, validation, and progress logging.
Query MAX(timestamp) from candles_5m for the instrument. Skip past already-stored dates.
Configurable batch size (default 30 days). Each batch fetched from the CandleSource protocol.
OHLCV validation rejects null/negative/inconsistent rows. Valid candles upserted to TimescaleDB. Commit per batch for crash resilience.
@runtime_checkable
class CandleSource(Protocol):
def fetch(self, instrument: str, start: date, end: date) -> pl.DataFrame:
"""Return 5m OHLCV DataFrame with standard 8-column schema.
Columns: timestamp, instrument, open, high, low, close,
volume, contract_month"""
...
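The batch-windowing logic can be sketched in isolation (function name hypothetical; the real service first queries MAX(timestamp) to pick the resume start, then walks windows like these):

```python
from datetime import date, timedelta
from typing import Iterator

def batch_ranges(start: date, end: date,
                 batch_days: int = 30) -> Iterator[tuple[date, date]]:
    """Yield inclusive (batch_start, batch_end) windows covering [start, end].

    Each window is fetched from a CandleSource, validated, and upserted;
    committing once per batch means a crash loses at most one window.
    """
    cur = start
    while cur <= end:
        batch_end = min(cur + timedelta(days=batch_days - 1), end)
        yield cur, batch_end
        cur = batch_end + timedelta(days=1)
```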
Multi-layer validation ensures data integrity from ingestion through training. The CandleValidator runs four checks; the GapDetector finds missing candles; the QualityAudit produces per-day completeness reports.
| Check | Condition |
|---|---|
| ohlc_invalid | L ≤ O,C ≤ H relationship, null/missing values |
| spike | Close-to-close change > 5% threshold |
| zero_volume | Volume = 0 during active session hours |
| gap | Missing expected candles in session window |
Severity levels: WARNING, ERROR, CRITICAL
The QualityAudit scans date ranges producing DailyQualityMetrics:
- completeness_pct — actual vs expected candle count
- spike_count — anomalous price jumps
- gap_count — missing candles per day
- ohlc_issues — relationship violations
- zero_volume_count — dead periods

Fast Polars-based row filtering. Returns (valid_df, rejected_df). Checks:
# No NaN/null in OHLC, no negative prices
# low <= high, low <= open, low <= close
# high >= open, high >= close
# Volume: not null, not NaN, not negative
Quarterly contract calendar with automatic month assignment and two back-adjustment methods for creating seamless continuous price series across contract rolls.
| Code | Months |
|---|---|
| H (Mar) | Jan – Mar |
| M (Jun) | Apr – Jun |
| U (Sep) | Jul – Sep |
| Z (Dec) | Oct – Dec |
Example: 2024-02-15 → 2024H, 2024-07-01 → 2024U
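The calendar mapping reduces to a small function (a sketch that assigns purely by calendar quarter, ignoring any roll-date offset):

```python
from datetime import date

# Quarterly futures month codes: H=Mar, M=Jun, U=Sep, Z=Dec
_CODES = {3: "H", 6: "M", 9: "U", 12: "Z"}

def contract_month(d: date) -> str:
    """Assign a trade date to its quarterly contract, e.g. 2024-02-15 -> 2024H."""
    quarter_end = ((d.month - 1) // 3 + 1) * 3  # 3, 6, 9, or 12
    return f"{d.year}{_CODES[quarter_end]}"
```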
Ratio (default) — Multiply prior prices by
new_close / old_close at each roll. Preserves % returns, keeps prices positive.
Difference (Panama Canal) — Subtract the price gap at each roll from all earlier prices. Preserves absolute differences.
# Detect roll points where contract_month changes
rolls = detect_rolls(df_sorted) # -> list[RollPoint]
# Each RollPoint contains:
# timestamp, old_contract, new_contract,
# old_close, new_close, ratio_factor, diff_factor
# Apply cumulative adjustment backward from newest contract
result = back_adjust(df, method=AdjustmentMethod.RATIO)
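A pure-Python sketch of the ratio method (list-based for clarity; the real implementation operates on Polars DataFrames and RollPoint records):

```python
def back_adjust_ratio(prices: list[float],
                      rolls: list[tuple[int, float, float]]) -> list[float]:
    """Scale all prices before each roll by new_close / old_close.

    rolls: (index of the first bar on the new contract, old_close, new_close).
    Working newest-roll-first makes the factors accumulate backward, so the
    oldest contract ends up scaled by the product of every roll factor.
    """
    adjusted = list(prices)
    for idx, old_close, new_close in sorted(rolls, reverse=True):
        factor = new_close / old_close
        for i in range(idx):
            adjusted[i] *= factor
    return adjusted
```

Because every pre-roll price is multiplied by the same factor, percentage returns across the roll are preserved and prices stay positive, which is why ratio is the default.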
Six stages compute market structure from raw candles. Each is a pure function on Polars DataFrames — deterministic and side-effect free.
5m candles → 15m, 4h, daily frames via group_by_dynamic
EMA-20/50, slopes, ATR-14, swing points, structure classification, displacement detection. Composite score: D1×0.4 + H4×0.35 + M15×0.25
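The composite score is a straight weighted sum (a sketch; per-timeframe scores are assumed already normalized to [-1, 1]):

```python
def composite_trend_score(d1: float, h4: float, m15: float) -> float:
    """Blend per-timeframe trend scores: daily dominates, M15 refines."""
    return d1 * 0.4 + h4 * 0.35 + m15 * 0.25
```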
features/trend.py

3-candle gap detection → inversion tracking → lifecycle states (active/tested/mitigated/expired) → quality scoring
features/ifvg.py

Trading window boundaries, minutes since open, overnight range, prior session high/low/close
features/session.py

Liquidity pool clustering, obstacle counting, exhaustion measurement
features/room_to_right.py

Boolean filter: IFVGs ≥ 2, |trend_score| ≥ 0.3, max RTR ≥ 30, in trading window
features/pre_screen.py

| Quality | Gap Ticks | Body Ratio |
|---|---|---|
| High | ≥ 8 | AND body > 70% of range |
| Medium | ≥ 6 | OR body > 60% |
| Low | Everything else | |
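Read as code, the table becomes (a sketch of the tiering rules; the function name is hypothetical):

```python
def classify_ifvg_quality(gap_ticks: float, body_ratio: float) -> str:
    """Tier an IFVG: High needs both conditions, Medium needs either."""
    if gap_ticks >= 8 and body_ratio > 0.70:
        return "high"
    if gap_ticks >= 6 or body_ratio > 0.60:
        return "medium"
    return "low"
```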
Data flows through five transformations: enrichment, assessment, episode construction, time-based splitting, and RL training. Each stage uses Polars DataFrames with Parquet I/O.
Enriched candles grouped by session_date. Each session becomes one Episode
with window boundaries from in_trading_window indices.
Minimum 6 in-window candles required.
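Episode construction can be sketched like this (the session/flag representation is simplified; the real code slices enriched DataFrames by session_date):

```python
def build_episodes(sessions: dict[str, list[bool]],
                   min_candles: int = 6) -> list[tuple[str, int, int]]:
    """One episode per session_date: (date, first_in_window, last_in_window).

    Sessions with fewer than min_candles in-window bars are dropped.
    """
    episodes = []
    for day, in_window in sessions.items():
        idxs = [i for i, flag in enumerate(in_window) if flag]
        if len(idxs) >= min_candles:
            episodes.append((day, idxs[0], idxs[-1]))
    return episodes
```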
Three-way split by date. No data leakage — strictly chronological:
Anthropic Claude analyzes enriched candle windows to produce structured assessments with setup classification, confidence scoring, and market regime analysis. Assessments are cached to Parquet for reproducible training.
Context builder sends last 50 bars as compact CSV plus trend summary, IFVG context, session timing, and room-to-right metrics. Target: under 4,000 tokens.
System prompt defines IFVG criteria, 5 setup types, 5 trend alignments, 4 market regimes, and room-to-right scoring guidelines.
llm/prompt_v1.txt llm/context_builder.py

Pydantic model with 11 fields returned via tool use:

- setup_type — 5 enum values
- confidence — 0.0–1.0
- ifvg_quality — high/medium/low
- trend_alignment — per-timeframe dict
- regime — 4 enum values
- risk_reward_estimate, room_to_right_estimate
- narrative, concerns

The encode_assessment() function normalizes the structured assessment into 11 float features for the observation vector:
| Feature | Encoding | Range |
|---|---|---|
| llm_confidence | Direct passthrough | [0, 1] |
| llm_setup_type | Enum index / 4.0 | [0, 1] |
| llm_ifvg_quality | high=1.0, medium=0.66, low=0.33 | [0.33, 1] |
| llm_rr_estimate | min(rr / 5.0, 1.0) | [0, 1] |
| llm_regime | Enum index / 3.0 | [0, 1] |
| llm_trend_* | Bullish=1, Turning=±0.5, Neutral=0, Bearish=-1 | [-1, 1] |
| llm_rtr_estimate | value / 100.0 | [0, 1] |
| llm_concern_count | min(count / 5.0, 1.0) | [0, 1] |
Setup Types: bullish_reversal, bearish_reversal,
bullish_continuation, bearish_continuation, no_setup
Market Regimes: trending_day (strong directional),
choppy (overlapping candles), event_driven (unusual volatility),
low_liquidity (thin order book)
Trend Alignments: bullish, bearish,
turning_bullish, turning_bearish, neutral
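Putting the table and enums together, the encoder might look like this (a sketch: the dict-based input and the timeframe keys are assumptions, since the real input is a Pydantic model):

```python
SETUP_TYPES = ["bullish_reversal", "bearish_reversal",
               "bullish_continuation", "bearish_continuation", "no_setup"]
REGIMES = ["trending_day", "choppy", "event_driven", "low_liquidity"]
QUALITY = {"high": 1.0, "medium": 0.66, "low": 0.33}
TREND = {"bullish": 1.0, "turning_bullish": 0.5, "neutral": 0.0,
         "turning_bearish": -0.5, "bearish": -1.0}
TIMEFRAMES = ("d1", "h4", "m15", "m5")  # assumed keys for the 4 trend features

def encode_assessment(a: dict) -> list[float]:
    """Normalize an LLM assessment into 11 floats for the observation vector."""
    return [
        a["confidence"],                            # [0, 1] passthrough
        SETUP_TYPES.index(a["setup_type"]) / 4.0,   # enum index / 4
        QUALITY[a["ifvg_quality"]],
        min(a["risk_reward_estimate"] / 5.0, 1.0),  # capped at 5R
        REGIMES.index(a["regime"]) / 3.0,
        *(TREND[a["trend_alignment"][tf]] for tf in TIMEFRAMES),
        a["room_to_right_estimate"] / 100.0,
        min(len(a["concerns"]) / 5.0, 1.0),
    ]
```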
The agent observes 39 normalized features in 7 groups and selects from a MultiDiscrete action space with 4 dimensions. Action masking prevents invalid combinations.
- LLM (11): llm_confidence, llm_setup_type, llm_ifvg_quality, llm_rr_estimate, llm_regime, llm_trend_* ×4, llm_rtr_estimate, llm_concern_count
- IFVG (5): ifvg_count_active, ifvg_nearest_dist, ifvg_best_quality, ifvg_avg_fill_pct, ifvg_direction_bias
- Trend (7): trend_score, ema20_slope_* ×3, structure_* ×3
- Session (5): minutes_since_open, window_progress_pct, overnight_range, prior_session_range, in_trading_window
- Room-to-right (4): rtr_score_long, rtr_score_short, exhaustion_pct, exhaustion_flag
- Volatility (3): vol_ratio, bar_range_norm, bar_body_ratio
- Position (4): in_position, unrealized_pnl_r, daily_pnl_r, trades_today

Boolean mask of shape (12,) — flattened across 4 sub-actions. Entry blocked when:
| Feature | Normalization | Output Range |
|---|---|---|
| ifvg_count_active | val / 10.0 | [0, 1] |
| ifvg_nearest_dist | val / ATR | [0, 1] |
| ifvg_best_quality | val / 3.0 | [0, 1] |
| rtr_score_* | val / 100.0 | [0, 1] |
| minutes_since_open | val / 480.0 (8h max) | [0, 1] |
| overnight_range | val / ATR | [0, 1] |
| unrealized_pnl_r | unrealized / risk / 5.0 | [-1, 1] |
| daily_pnl_r | val / 5.0 | [-1, 1] |
| trades_today | val / 10.0 | [0, 1] |
| ema20_slope_* | rising=1, flat=0, falling=-1 | [-1, 1] |
| structure_* | uptrend=1, ranging=0, downtrend=-1 | [-1, 1] |
The position module tracks the full lifecycle of each trade: fill simulation with slippage, MAE/MFE tracking, breakeven trailing stops, and commission modeling.
Orders filled within candle range with 1-tick slippage.
Long: min(price + tick, high).
Short: max(price - tick, low).
Returns None if range doesn't reach order.
MAE (Max Adverse Excursion): worst drawdown in ticks. MFE (Max Favorable Excursion): best unrealized profit. Updated every bar for trade analysis.
Per-contract commission based on currency:
JPY: 80.0 × size
USD: 1.25 × size
Deducted as commission ticks from realized P&L.
Each bar, the position is updated with conservative order checking: stop first, then target.
Long: candle_low ≤ stop_price. Short: candle_high ≥ stop_price. Checked first (conservative).
Long: candle_high ≥ target. Short: candle_low ≤ target. Returns CompletedTrade if hit.
Track adverse/favorable excursion. When mfe_ticks ≥ risk_ticks (1R profit), move stop to breakeven.
Exit reasons: stop, target, session_end (force-close at window end).
Three reward components shape the agent toward profitable, disciplined trading.
Exact formulas from rl/reward.py:
Core signal: risk-reward ratio with target bonus.
Overtrading + accelerating loss penalty.
Prevents over-entering low-quality setups.
def compute_trade_reward(trade: CompletedTrade) -> float:
reward = trade.realized_rr * 1.0
if trade.hit_target:
reward += 0.3
return reward
def compute_step_penalty(trades_today, daily_pnl_r,
max_trades=5, drawdown_threshold=-2.0):
penalty = 0.0
if trades_today > max_trades:
penalty -= 0.05
if daily_pnl_r < drawdown_threshold:
penalty -= 0.8 * abs(daily_pnl_r - drawdown_threshold)
return penalty
def compute_patience_bonus(action_is_skip, in_position):
if action_is_skip and not in_position:
return 0.01
return 0.0
sb3-contrib MaskablePPO trains on session episodes with periodic checkpointing, evaluation callbacks, and TensorBoard logging.
total_timesteps 1,000,000
learning_rate 3e-4
n_steps 2,048
batch_size 256
gamma 0.99
clip_range 0.2
ent_coef 0.01
policy_net_arch [64, 64]
max_daily_loss_r 3.0 R
max_trades_session 5
checkpoint_freq 50,000 steps
eval_freq 50,000 steps
trailing_stop breakeven @ 1R
Two-layer MLP [64, 64] with shared feature extractor for actor and critic.
Input: 39-dim observation vector. Output: MultiDiscrete([3,3,3,3]) action logits + value estimate.
Trained models are evaluated on held-out episodes against baseline agents. Promotion criteria gate deployment quality.
Additional metrics: profit factor, avg RR, total R, trades per session. Sharpe annualized by √252.
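The annualization step can be sketched as (a sketch; assumes one episode per trading day and sample standard deviation):

```python
import math
from statistics import mean, stdev

def annualized_sharpe(daily_returns: list[float]) -> float:
    """Mean over std of per-day returns, scaled by sqrt(252) trading days."""
    sd = stdev(daily_returns)
    if sd == 0:
        return 0.0
    return mean(daily_returns) / sd * math.sqrt(252)
```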
Samples random valid actions respecting action masks. Uses per-sub-action sampling from valid options. Baseline for "can the agent beat random?"
rl/baselines.py

Always enters long with size=1, stop=medium, target=nearest (2R). Respects masks — skips when blocked. Tests "is selective entry better than always-in?"
rl/baselines.py

def meets_promotion_criteria(metrics: EvalMetrics) -> bool:
return (
metrics.sharpe_ratio >= 1.0
and metrics.max_drawdown_r <= 5.0
and metrics.win_rate >= 0.40
)
Trained models are saved with full metadata for reproducibility. Each training run creates a timestamped directory with the model, config, and evaluation results.
models/
20260222_101530/
model.zip # MaskablePPO weights
metadata.json # Training config, instrument, timestamps
eval_metrics.json # EvalMetrics from validation episodes
config_snapshot.yaml # Full GothamSettings at training time
Pydantic-settings with YAML layering. Four priority levels from init kwargs (highest) to default.yaml (lowest). 9 config sections.
Direct constructor arguments for testing and programmatic overrides.
GOTHAM_ prefix with __ nesting. Example: GOTHAM_TRAINING__LEARNING_RATE=1e-4
config/{GOTHAM_ENV}.yaml (default: dev)
config/default.yaml — singleton via get_settings()
database:
host: localhost
port: 5432
name: gotham
user: gotham
password: changeme
URL-encoded credentials. Sync + async URLs (asyncpg).
ib:
host: 127.0.0.1
port: 4002
client_id: 1
timeout: 30
readonly: true
instruments:
nikkei:
symbol: NIY
exchange: CME
currency: JPY
tick_size: 5.0
point_value: 500.0
session: tokyo
nasdaq:
symbol: NQ
exchange: CME
currency: USD
tick_size: 0.25
point_value: 20.0
session: us
features:
ifvg_min_gap_ticks: 4.0
ifvg_max_age_bars: 100
ema_fast: 20
ema_slow: 50
atr_period: 14
displacement_body_pct: 0.70
displacement_atr_mult: 1.5
rtr_lookback_days: 20
pre_screen_min_ifvgs: 2
pre_screen_min_trend: 0.3
pre_screen_min_rtr: 30.0
training:
total_timesteps: 1000000
learning_rate: 0.0003
n_steps: 2048
batch_size: 256
gamma: 0.99
clip_range: 0.2
ent_coef: 0.01
checkpoint_freq: 50000
eval_freq: 50000
max_daily_loss_r: 3.0
max_trades_per_session: 5
policy_net_arch: [64, 64]
llm:
model: claude-sonnet-4-5-20250929
max_tokens: 4096
temperature: 0.3
backfill:
data_dir: data/raw
source: histdata
batch_days: 30
sim:
data_dir: data
enriched_dir: data/enriched
assessments_dir: data/assessments
model_dir: models
logging:
level: INFO
format: json
rotation: "50MB"
log_dir: logs
Docker services, Makefile targets, CI/CD pipeline, and CLI commands for the full development lifecycle.
| Service | Details |
|---|---|
| TimescaleDB | pg18 on port 5432 |
| IB Gateway | Port 4002, paper trading |
make docker-up # start services
make docker-down # stop services
make install # uv sync --all-extras
make lint # ruff check + mypy
make format # ruff format + fix
make test # pytest -m unit
make test-all # pytest (all markers)
make test-cov # coverage report
# Train a model
uv run python -m gotham.rl train \
--enriched-path data/enriched/nq.parquet \
--instrument NQ --timesteps 1000000
# Evaluate a model
uv run python -m gotham.rl evaluate \
--model-path models/20260222_101530 \
--instrument NQ --n-episodes 50
# Backfill data
uv run python -m gotham.data backfill \
--instrument NQ --start 2023-01-01
# Quality audit
uv run python -m gotham.data quality-audit \
--instrument NQ --days 30
ruff — line-length 100, rules E,F,W,I,UP,B,SIM,RUF, target py311.
mypy — disallow_untyped_defs = true,
warn_return_any = true. All function signatures require type annotations.
pytest — markers: @unit, @integration,
@slow. conftest auto-clears settings cache.
uv — Fast Python package manager.
requires-python = ">=3.11".