MACHINE LEARNING

Machine Learning Pipeline

Three base learners, stacked meta-learning, isotonic calibration, and online monitoring. Production-grade ML for prediction market probability estimation.

Three Learners, One Calibrated Signal

LightGBM, XGBoost, and Logistic Regression each see the same features but learn different decision boundaries. A Ridge stacking meta-learner combines their calibrated outputs. Isotonic regression enforces monotonic probability mapping on every base model.

Ensemble flow: LightGBM (w = 0.50) + XGBoost (w = 0.30) + Logistic (w = 0.20) → isotonic calibration → Ridge stack → P(YES)
ml/models.py ModelConfig
# LightGBM — tuned for low-SNR prediction markets
lgb_params = {
    "objective": "binary",
    "n_estimators": 500,
    "learning_rate": 0.03,
    "max_depth": 4,
    "num_leaves": 15,
    "min_child_samples": 30,
    "reg_alpha": 0.5,  # L1
    "reg_lambda": 1.0,  # L2
}

# XGBoost — mirrors LGB regularization
xgb_params = {
    "objective": "binary:logistic",
    "n_estimators": 500,
    "learning_rate": 0.03,
    "max_depth": 4,
    "min_child_weight": 5,
    "gamma": 0.1,
}

# Isotonic calibration on held-out cal set
cal = IsotonicRegression(
    y_min=0.001, y_max=0.999,
    out_of_bounds="clip"
)
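The calibrate-then-stack flow can be sketched end to end. This is a minimal illustration using a single scikit-learn base learner and toy data, not the repo's `ModelConfig` API; the real stack also holds LightGBM and XGBoost, and fits the Ridge meta-learner on a separate stacking slice rather than reusing the calibration slice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Toy data standing in for market snapshots (the real pipeline has 48 features)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, shuffle=False)

# One base learner for brevity; each learner gets its own isotonic calibrator
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
iso = IsotonicRegression(y_min=0.001, y_max=0.999, out_of_bounds="clip")
iso.fit(base.predict_proba(X_cal)[:, 1], y_cal)

# Ridge meta-learner over calibrated base outputs (one column per learner)
stack_X = iso.predict(base.predict_proba(X_cal)[:, 1]).reshape(-1, 1)
meta = Ridge(alpha=1.0).fit(stack_X, y_cal)
p_yes = np.clip(meta.predict(stack_X), 0.0, 1.0)
```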

48 Engineered Features Across 7 Families

Each market snapshot is transformed into a rich feature vector spanning microstructure, momentum, volume, cross-market, external signals, time encoding, and market quality. Tree models handle NaN natively; the Logistic path uses learned median imputation.

Features are computed by the FeatureEngineering class with rolling windows at 5-minute, 15-minute, and 1-hour horizons. Cyclical time features use sin/cos encoding. Stability selection and mRMR pruning remove noisy inputs.

  • NaN-safe: tree models learn optimal split directions for missing data
  • Winsorized preprocessing prevents outlier distortion
  • Feature hash guard detects schema misalignment on model load
  • Bayesian hyperopt via Optuna for automated tuning
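The sin/cos cyclical encoding mentioned above maps periodic values onto the unit circle so that, for example, hour 23 lands next to hour 0. A minimal sketch (the function name is illustrative, not the `FeatureEngineering` API):

```python
import math

def encode_cyclical(value, period):
    """Map a periodic value (e.g. hour 0-23, weekday 0-6) onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

hour_sin, hour_cos = encode_cyclical(23, 24)  # 23:00 sits next to 00:00
dow_sin, dow_cos = encode_cyclical(6, 7)      # end of week wraps to start
```

With a raw hour feature, 23 and 0 look maximally far apart; on the circle their Euclidean distance is small, which is what tree splits and the logistic model actually need.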

Momentum 6

  • return_5m
  • return_15m
  • return_1h
  • volatility_1h
  • price_vs_twap
  • price_acceleration

Volume 4

  • volume_24h_log
  • volume_zscore
  • volume_rate_of_change
  • open_interest_log

Order Flow 7

  • spread_cents
  • spread_pct
  • spread_velocity
  • bid_depth_log
  • ask_depth_log
  • imbalance
  • microprice_vs_mid

Cross-Market 3

  • polymarket_spread
  • polymarket_spread_zscore
  • polymarket_momentum

Time 7

  • hour_of_day_sin / cos
  • day_of_week_sin / cos
  • is_weekend
  • market_age_days
  • time_to_resolution_days

Sentiment 4

  • news_sentiment
  • expert_forecast
  • expert_confidence
  • signal_agreement

Microstructure 5

  • price_bucket
  • distance_from_50
  • is_extreme_price
  • efficiency_score
  • liquidity_score
ml/cross_validation.py PurgedKFold
# 3-way temporal split with embargo
# [train | embargo | cal | embargo | stack]

class PurgedKFold:
    """de Prado (2018) Ch. 7"""
    n_splits = 5
    embargo_pct = 0.01  # 1%

class WalkForwardCV:
    """Expanding-window forward test"""
    n_splits = 5
    min_train_pct = 0.3
    embargo_pct = 0.01

# Ticker-group purging: if a ticker
# appears in test fold, ALL its samples
# are removed from training fold

# Preprocessor fitted INSIDE each fold
# (prevents winsor bounds leakage)
Temporal split layout: [ TRAIN | EMB | CALIBRATE | EMB | STACK ] + held-out TEST (15%)

No Lookahead, No Leakage

Standard k-fold cross-validation gives inflated metrics on financial data due to autocorrelation and same-ticker label leakage. The pipeline uses Purged K-Fold with 1% embargo periods at every boundary, following de Prado (2018).

Walk-Forward CV simulates actual deployment: train on the past, predict the future, advance the window, repeat. This is the gold standard for evaluating prediction market models -- the only evaluation that matches production conditions.

  • Embargo periods prevent autocorrelation leakage
  • Ticker-group purging removes all samples of test tickers from train
  • Preprocessor fitted inside each fold -- no test data leakage
  • Promotion gating: model must beat baseline Brier on held-out 15%
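The expanding-window scheme with embargo gaps can be sketched as a splitter. This is an illustrative implementation of the idea, not the repo's exact `WalkForwardCV` code:

```python
# Expanding-window walk-forward splitter with an embargo gap between
# the end of each training window and the start of its test window.
def walk_forward_splits(n_samples, n_splits=5, min_train_pct=0.3,
                        embargo_pct=0.01):
    embargo = max(1, int(n_samples * embargo_pct))
    first_train_end = int(n_samples * min_train_pct)
    test_size = (n_samples - first_train_end) // n_splits
    for k in range(n_splits):
        train_end = first_train_end + k * test_size   # window expands
        test_start = train_end + embargo              # embargo gap
        test_end = min(test_start + test_size, n_samples)
        if test_start >= test_end:
            break
        yield range(0, train_end), range(test_start, test_end)
```

Each split trains strictly on the past and tests strictly on the future, with the embargo absorbing autocorrelated samples at the boundary.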

Continuous Brier Monitoring

Every resolved market outcome is fed back to the OnlineEnsemble. A rolling window of the last 50 predictions tracks Brier score in real time. When the score degrades past 0.30, the system auto-retrains on the full Parquet history -- not just the in-memory buffer.

The feature hash guard prevents silent misalignment: a SHA-256 hash of the feature schema is saved alongside every model. On load, the hash is compared against the live codebase. If features have been added or removed since training, a warning fires and retrain is recommended.

  • Rolling Brier on last 50 observations triggers retrain
  • PSI per-feature drift detection (critical at 0.25+)
  • Concept drift: base-rate shift + calibration degradation
  • Hyperopt-derived config preserved across retrains
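PSI (Population Stability Index) compares the live feature distribution against the training distribution bin by bin. A minimal sketch matching the thresholds quoted above; the decile binning scheme here is an illustrative choice, not necessarily the repo's:

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI between training ('expected') and live ('actual') samples."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # absorb out-of-range values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, eps, None), np.clip(a, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

Identical distributions score near 0; a one-sigma mean shift in a feature blows well past the 0.25 retrain threshold.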
ml/model_server.py LIVE
# OnlineEnsemble auto-retrain logic
from statistics import mean

def retrain_if_needed(self,
    min_samples=200,
    brier_threshold=0.30
):
    recent = mean(brier[-50:])
    if recent > brier_threshold:
        return True  # trigger
    return False

# Feature hash guard on model load
saved_hash = meta["feature_hash"]
live_hash = sha256(",".join(
    FeatureEngineering.FEATURE_NAMES
).encode()).hexdigest()
if live_hash != saved_hash:
    warn("feature_hash_mismatch")

# PSI drift thresholds
# < 0.10 → ok
# 0.10-0.25 → investigate
# >= 0.25 → retrain
Health Monitor

  • Brier score (50-obs rolling): 0.182
  • Feature hash: d4e1...8f2a
  • Covariate shift AUC: 0.53
  • PSI (polymarket_spread): 0.14
  • Concept drift: stable

Four Market Regimes, Automatic Adjustment

A 5-component efficiency score (spread, volume, age, cross-market, news) classifies each market into a regime. Kelly fraction, minimum edge threshold, and execution urgency are all adjusted per-regime. The classifier uses weighted thresholds at 0.25, 0.50, and 0.75 boundaries.

Efficient

Normal conditions. Standard parameters, balanced risk/reward.
Kelly: 0.75x | Edge: 1.0x

Inefficient

Exploitable opportunities. Use full Kelly, act quickly before edge closes.
Kelly: 1.0x | Edge: 0.75x

Highly Inefficient

New markets or major news. Full Kelly, maximum urgency multiplier at 1.5x.
Kelly: 1.0x | Urgency: 1.5x
ml/regime.py RegimeConfig
# Efficiency score = weighted 5-component sum
spread_weight       = 0.30
volume_weight       = 0.25
age_weight          = 0.15
cross_market_weight = 0.15
news_weight         = 0.15

# Regime boundaries
highly_efficient = 0.75
efficient        = 0.50
inefficient      = 0.25
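Putting the weights and boundaries above together, the classifier reduces to a weighted sum plus threshold checks. A sketch with illustrative function names (each component assumed pre-normalized to [0, 1]):

```python
def efficiency_score(spread, volume, age, cross_market, news):
    """Weighted 5-component efficiency score; higher = more efficient market."""
    return (0.30 * spread + 0.25 * volume + 0.15 * age
            + 0.15 * cross_market + 0.15 * news)

def classify_regime(score):
    """Map the score onto the four regimes via the 0.25/0.50/0.75 boundaries."""
    if score >= 0.75:
        return "highly_efficient"
    if score >= 0.50:
        return "efficient"
    if score >= 0.25:
        return "inefficient"
    return "highly_inefficient"
```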

ML Feeds the Bayesian Estimator

The ensemble's calibrated probability is not used in isolation. It flows into the FairValueEstimator alongside cross-market prices, polling data, expert forecasts, and historical base rates. All signals are combined via weighted Bayesian updating in log-odds space.

Satopaa et al. (2014) extremization pushes the aggregated probability away from 50% with a tunable parameter d=1.3, correcting for the known underconfidence of averaged forecasts. Calibration-curve adjustments then correct for any remaining systematic bias in specific probability buckets.

  • Log-odds aggregation for numerical stability
  • Satopaa extremization (d=1.3) corrects underconfidence
  • Dynamic prior weight shrinks as signal count grows
  • Bucket-level calibration bias correction
strategies/fair_value.py FairValueEstimator
# Satopaa extremization
def extremize(p, d=1.3):
    p_d = p ** d
    q_d = (1 - p) ** d
    return p_d / (p_d + q_d)

# Bayesian estimator config
FairValueEstimator(
    prior_weight = 0.3,
    extremization_d = 1.3,
    consensus_bonus = 0.1,
    prior_weight_min = 0.1,
    prior_weight_max = 0.5,
)

# Signal sources combined in
# log-odds space:
# ML ensemble probability
# Cross-market prices
# Polling / expert forecasts
# Historical base rates
3 base learners · 48 engineered features · 0.18 Brier score · 4 market regimes

Put the Pipeline to Work

Production-grade ML that calibrates, monitors, and retrains itself. Every probability, backed by data.