MACHINE LEARNING

Machine Learning Pipeline

Three base learners, stacked meta-learning, isotonic calibration, and online monitoring. Production-grade ML for prediction market probability estimation.

Three Learners, One Calibrated Signal

LightGBM, XGBoost, and Logistic Regression each see the same features but learn different decision boundaries. A Ridge stacking meta-learner combines their calibrated outputs. Isotonic regression enforces monotonic probability mapping on every base model.

Ensemble flow: LightGBM (w = 0.50) + XGBoost (w = 0.30) + Logistic (w = 0.20) → isotonic calibration → Ridge stack → P(YES)
ml/models.py ModelConfig
# LightGBM — tuned for low-SNR prediction markets
lgb_params = {
    "objective": "binary",
    "n_estimators": 500,
    "learning_rate": 0.03,
    "max_depth": 4,
    "num_leaves": 15,
    "min_child_samples": 30,
    "reg_alpha": 0.5,  # L1
    "reg_lambda": 1.0,  # L2
}

# XGBoost — mirrors LGB regularization
xgb_params = {
    "objective": "binary:logistic",
    "n_estimators": 500,
    "learning_rate": 0.03,
    "max_depth": 4,
    "min_child_weight": 5,
    "gamma": 0.1,
}

# Isotonic calibration on held-out cal set
cal = IsotonicRegression(
    y_min=0.001, y_max=0.999,
    out_of_bounds="clip"
)
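The calibrate-then-stack flow can be sketched end to end. This is a minimal illustration using a single scikit-learn base learner and toy data, not the repo's `ModelConfig` API; the real stack also holds LightGBM and XGBoost, and fits the Ridge meta-learner on a separate stacking slice rather than reusing the calibration slice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Toy data standing in for market snapshots (the real pipeline has 48 features)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, shuffle=False)

# One base learner for brevity; each learner gets its own isotonic calibrator
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
iso = IsotonicRegression(y_min=0.001, y_max=0.999, out_of_bounds="clip")
iso.fit(base.predict_proba(X_cal)[:, 1], y_cal)

# Ridge meta-learner over calibrated base outputs (one column per learner)
stack_X = iso.predict(base.predict_proba(X_cal)[:, 1]).reshape(-1, 1)
meta = Ridge(alpha=1.0).fit(stack_X, y_cal)
p_yes = np.clip(meta.predict(stack_X), 0.0, 1.0)
```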

48 Engineered Features Across 7 Families

Each market snapshot is transformed into a rich feature vector spanning microstructure, momentum, volume, cross-market, external signals, time encoding, and market quality. Tree models handle NaN natively; the Logistic path uses learned median imputation.

Features are computed by the FeatureEngineering class with rolling windows at 5-minute, 15-minute, and 1-hour horizons. Cyclical time features use sin/cos encoding. Stability selection and mRMR pruning remove noisy inputs.

  • NaN-safe: tree models learn optimal split directions for missing data
  • Winsorized preprocessing prevents outlier distortion
  • Feature hash guard detects schema misalignment on model load
  • Bayesian hyperopt via Optuna for automated tuning
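The sin/cos cyclical encoding mentioned above maps periodic values onto the unit circle so that, for example, hour 23 lands next to hour 0. A minimal sketch (the function name is illustrative, not the `FeatureEngineering` API):

```python
import math

def encode_cyclical(value, period):
    """Map a periodic value (e.g. hour 0-23, weekday 0-6) onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

hour_sin, hour_cos = encode_cyclical(23, 24)  # 23:00 sits next to 00:00
dow_sin, dow_cos = encode_cyclical(6, 7)      # end of week wraps to start
```

With a raw hour feature, 23 and 0 look maximally far apart; on the circle their Euclidean distance is small, which is what tree splits and the logistic model actually need.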

Momentum 6

  • return_5m
  • return_15m
  • return_1h
  • volatility_1h
  • price_vs_twap
  • price_acceleration

Volume 4

  • volume_24h_log
  • volume_zscore
  • volume_rate_of_change
  • open_interest_log

Order Flow 7

  • spread_cents
  • spread_pct
  • spread_velocity
  • bid_depth_log
  • ask_depth_log
  • imbalance
  • microprice_vs_mid

Cross-Market 3

  • polymarket_spread
  • polymarket_spread_zscore
  • polymarket_momentum

Time 7

  • hour_of_day_sin / cos
  • day_of_week_sin / cos
  • is_weekend
  • market_age_days
  • time_to_resolution_days

Sentiment 4

  • news_sentiment
  • expert_forecast
  • expert_confidence
  • signal_agreement

Microstructure 5

  • price_bucket
  • distance_from_50
  • is_extreme_price
  • efficiency_score
  • liquidity_score
ml/cross_validation.py PurgedKFold
# 3-way temporal split with embargo
# [train | embargo | cal | embargo | stack]

class PurgedKFold:
    """de Prado (2018) Ch. 7"""
    n_splits = 5
    embargo_pct = 0.01  # 1%

class WalkForwardCV:
    """Expanding-window forward test"""
    n_splits = 5
    min_train_pct = 0.3
    embargo_pct = 0.01

# Ticker-group purging: if a ticker
# appears in test fold, ALL its samples
# are removed from training fold

# Preprocessor fitted INSIDE each fold
# (prevents winsor bounds leakage)
Temporal split layout: [ TRAIN | EMB | CALIBRATE | EMB | STACK ] + held-out TEST (15%)

No Lookahead, No Leakage

Standard k-fold cross-validation gives inflated metrics on financial data due to autocorrelation and same-ticker label leakage. The pipeline uses Purged K-Fold with 1% embargo periods at every boundary, following de Prado (2018).

Walk-Forward CV simulates actual deployment: train on the past, predict the future, advance the window, repeat. This is the gold standard for evaluating prediction market models -- the only evaluation that matches production conditions.

  • Embargo periods prevent autocorrelation leakage
  • Ticker-group purging removes all samples of test tickers from train
  • Preprocessor fitted inside each fold -- no test data leakage
  • Promotion gating: model must beat baseline Brier on held-out 15%
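The expanding-window scheme with embargo gaps can be sketched as a splitter. This is an illustrative implementation of the idea, not the repo's exact `WalkForwardCV` code:

```python
# Expanding-window walk-forward splitter with an embargo gap between
# the end of each training window and the start of its test window.
def walk_forward_splits(n_samples, n_splits=5, min_train_pct=0.3,
                        embargo_pct=0.01):
    embargo = max(1, int(n_samples * embargo_pct))
    first_train_end = int(n_samples * min_train_pct)
    test_size = (n_samples - first_train_end) // n_splits
    for k in range(n_splits):
        train_end = first_train_end + k * test_size   # window expands
        test_start = train_end + embargo              # embargo gap
        test_end = min(test_start + test_size, n_samples)
        if test_start >= test_end:
            break
        yield range(0, train_end), range(test_start, test_end)
```

Each split trains strictly on the past and tests strictly on the future, with the embargo absorbing autocorrelated samples at the boundary.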

Continuous Brier Monitoring

Every resolved market outcome is fed back to the OnlineEnsemble. A rolling window of the last 50 predictions tracks Brier score in real time. When the score degrades past 0.30, the system auto-retrains on the full Parquet history -- not just the in-memory buffer.

The feature hash guard prevents silent misalignment: a SHA-256 hash of the feature schema is saved alongside every model. On load, the hash is compared against the live codebase. If features have been added or removed since training, a warning fires and retrain is recommended.

  • Rolling Brier on last 50 observations triggers retrain
  • PSI per-feature drift detection (critical at 0.25+)
  • Concept drift: base-rate shift + calibration degradation
  • Hyperopt-derived config preserved across retrains
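PSI (Population Stability Index) compares the live feature distribution against the training distribution bin by bin. A minimal sketch matching the thresholds quoted above; the decile binning scheme here is an illustrative choice, not necessarily the repo's:

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI between training ('expected') and live ('actual') samples."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # absorb out-of-range values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

Identical distributions score near 0; a one-sigma mean shift in a feature blows well past the 0.25 retrain threshold.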
ml/model_server.py LIVE
# OnlineEnsemble auto-retrain logic
from statistics import mean

def retrain_if_needed(self,
    min_samples=200,
    brier_threshold=0.30
):
    recent = mean(brier[-50:])
    if recent > brier_threshold:
        return True  # trigger
    return False

# Feature hash guard on model load
saved_hash = meta["feature_hash"]
live_hash = sha256(",".join(
    FeatureEngineering.FEATURE_NAMES
).encode()).hexdigest()
if live_hash != saved_hash:
    warn("feature_hash_mismatch")

# PSI drift thresholds
# < 0.10 → ok
# 0.10-0.25 → investigate
# >= 0.25 → retrain
Health Monitor

  • Brier score (50-obs rolling): 0.182
  • Feature hash: d4e1...8f2a
  • Covariate shift AUC: 0.53
  • PSI (polymarket_spread): 0.14
  • Concept drift: stable

Four Market Regimes, Automatic Adjustment

A 5-component efficiency score (spread, volume, age, cross-market, news) classifies each market into a regime. Kelly fraction, minimum edge threshold, and execution urgency are all adjusted per-regime. The classifier uses weighted thresholds at 0.25, 0.50, and 0.75 boundaries.

Efficient

Normal conditions. Standard parameters, balanced risk/reward.
Kelly: 0.75x | Edge: 1.0x

Inefficient

Exploitable opportunities. Use full Kelly, act quickly before edge closes.
Kelly: 1.0x | Edge: 0.75x

Highly Inefficient

New markets or major news. Full Kelly, maximum urgency multiplier at 1.5x.
Kelly: 1.0x | Urgency: 1.5x
ml/regime.py RegimeConfig
# Efficiency score = weighted 5-component sum
spread_weight       = 0.30
volume_weight       = 0.25
age_weight          = 0.15
cross_market_weight = 0.15
news_weight         = 0.15

# Regime boundaries
highly_efficient = 0.75
efficient        = 0.50
inefficient      = 0.25
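Putting the weights and boundaries above together, the classifier reduces to a weighted sum plus threshold checks. A sketch with illustrative function names (each component assumed pre-normalized to [0, 1]):

```python
def efficiency_score(spread, volume, age, cross_market, news):
    """Weighted 5-component efficiency score; higher = more efficient market."""
    return (0.30 * spread + 0.25 * volume + 0.15 * age
            + 0.15 * cross_market + 0.15 * news)

def classify_regime(score):
    """Map the score onto the four regimes via the 0.25/0.50/0.75 boundaries."""
    if score >= 0.75:
        return "highly_efficient"
    if score >= 0.50:
        return "efficient"
    if score >= 0.25:
        return "inefficient"
    return "highly_inefficient"
```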

ML Feeds the Bayesian Estimator

The ensemble's calibrated probability is not used in isolation. It flows into the FairValueEstimator alongside cross-market prices, polling data, expert forecasts, and historical base rates. All signals are combined via weighted Bayesian updating in log-odds space.

Satopaa et al. (2014) extremization pushes the aggregated probability away from 50% with a tunable parameter d=1.3, correcting for the known underconfidence of averaged forecasts. Calibration-curve adjustments then correct for any remaining systematic bias in specific probability buckets.

  • Log-odds aggregation for numerical stability
  • Satopaa extremization (d=1.3) corrects underconfidence
  • Dynamic prior weight shrinks as signal count grows
  • Bucket-level calibration bias correction
strategies/fair_value.py FairValueEstimator
# Satopaa extremization
def extremize(p, d=1.3):
    p_d = p ** d
    q_d = (1 - p) ** d
    return p_d / (p_d + q_d)

# Bayesian estimator config
FairValueEstimator(
    prior_weight = 0.3,
    extremization_d = 1.3,
    consensus_bonus = 0.1,
    prior_weight_min = 0.1,
    prior_weight_max = 0.5,
)

# Signal sources combined in
# log-odds space:
# ML ensemble probability
# Cross-market prices
# Polling / expert forecasts
# Historical base rates
3 base learners · 48 engineered features · 0.18 Brier score · 4 market regimes

Put the Pipeline to Work

Production-grade ML that calibrates, monitors, and retrains itself. Every probability, backed by data.