TL;DR

Recent testing shows that Kronos, an open-source foundation model for financial time series, does not outperform the traditional Brownian motion model in predicting 5-minute Bitcoin (BTC) price movements. The experiment scores Kronos, the Brownian baseline, and market-implied probabilities on out-of-sample data and finds that Brownian motion remains competitive.

Over the past two weeks, a researcher tested Kronos against a Brownian motion baseline using historical trade data from Polybot, a paper-trading bot operating on Polymarket’s 5-minute BTC markets. The analysis involved reconstructing the market context for 497 trades, applying Kronos-small to forecast the probability of BTC closing above the open price, and comparing its performance to Brownian motion and market-implied probabilities.

The results showed that Kronos’s out-of-sample Brier score was statistically indistinguishable from Brownian motion’s: 0.189 vs 0.188, a gap of 0.0011 that sits inside the noise band of repeated runs with different sampling seeds. On the full sample, Brownian motion scored better on both metrics, with a Brier score of 0.193 vs 0.213 and a log-loss of 0.567 vs 1.080. Consequently, the hypothesis that a modern, learned model could beat the traditional Brownian baseline was not supported by this data.

Polybot Week 3 — Kronos vs Brownian — Thorsten Meyer AI

Research Series · Foundation Model vs Classical Baseline · 2026-05-17

Foundation model
vs Brownian motion.
Kronos on five-minute BTC.

A modern learned model just lost to math from 1900. On 497 paired trades. Stage 2 is not happening.

Polybot’s fair-value strategy uses a 1900s geometric Brownian model to price 5-minute BTC outcomes. The natural follow-up after two weeks of negative parametric results: would a modern learned model trained on millions of real candles do better? The credible candidate: Kronos, an open-source MIT-licensed foundation model with 25,000+ GitHub stars, an AAAI 2026 paper, four sizes from 4M to 499M parameters, trained on candles from 45 global exchanges.

Test design: 497 paired (FILL→SETTLE) trades, Brownian baseline reconstructed line-for-line, Kronos-small (24.7M params) sampled with 16 forecast paths, scored on Brier + log-loss + hypothetical P&L, chronologically split for out-of-sample discipline.

On 249 out-of-sample trades: Brownian 0.188 Brier vs Kronos 0.189 Brier. Gap 0.0011. Statistically indistinguishable. Stage 2 is not happening.

But the paradox is more interesting than the verdict: when used as a directional signal Kronos fires 28% less often and wins 60.7% vs Brownian’s 49.1%, making it a slightly better trader on hypothetical P&L even while systematically over-confident in the tails (predicts 2.4% chance → actual 20.4% win; predicts 84% → actual 69.6%). The negative result is the answer. The methodology is what gets published.

Thorsten Meyer AI | Polybot · Week 3 | MIT-licensed · methodology public | Research article · ~2,200 words | Forezai · Polybot

This is not financial advice. Nothing in this article should be used to inform real trading decisions. The bot trades simulated money. If you build something like it and run it with real funds, the most likely outcome — by a wide margin — is that you lose those funds. That holds whether you use a Brownian model, a 100-million-parameter foundation model, or any other forecaster.

497: paired (FILL→SETTLE) trades, all BTC · 5-min Up/Down markets
0.0011: out-of-sample Brier-score gap, 249 trades · statistically indistinguishable
2×: Kronos log-loss vs Brownian, signature of confident wrong predictions
+$538 / +$465: hypothetical Kronos vs Brownian P&L, the paradox · 60.7% vs 49.1% win rates

POLYBOT WEEK 3· KRONOS-SMALL · 24.7M PARAMS· BROWNIAN BASELINE· 497 PAIRED TRADES · BTC· POLYMARKET 5-MIN UP/DOWN· BRIER 0.193 / 0.211 / 0.213· LOG-LOSS 0.567 / 0.604 / 1.080· OUT-OF-SAMPLE 0.188 vs 0.189· GAP 0.0011 · INDISTINGUISHABLE· STAGE 2 NOT HAPPENING· KRONOS BETTER TRADER · WORSE FORECASTER· 60.7% vs 49.1% WIN RATE· TAILS: 2.4% → 20.4% · 84% → 69.6%· POLYBOT MIT· KRONOS MIT· AAAI 2026 PAPER · 25K+ STARS· 11 MIN MAC M-SERIES · MPS BACKEND· 1,300 LINES OF PYTHON· RESEARCH_PIPELINE.MD PUBLIC· SAME GAUNTLET · DIFFERENT MODEL

FIG. 01 — THE TEST PIPELINE

Five steps · for every paired (FILL → SETTLE) trade in the running session

~1,300 lines of Python · 11 minutes on Mac M-series with PyTorch MPS · methodology public, specific numbers local

Reconstruct OHLCV context of the 60 minutes leading up to fire-time. Pull from the bot’s local Binance recording where available; fall back to Binance’s public klines API otherwise. Cache to parquet so re-runs cost nothing.

Recompute the Brownian baseline in Python — a line-for-line port of the bot’s own fairValuePUp(spot, openPrice, secondsLeftFrac, windowVol) formula. Matches scipy.stats.norm.cdf to three decimal places.

Read off the market-implied probability from the FILL price — what Polymarket’s order book thought the side was worth at the moment of fire. The market’s view as a reference point.

Run Kronos-small (24.7M parameters) on the OHLCV context · sample 16 forecast paths to the window’s end · count the fraction in which the underlying closes above the open price. That fraction is Kronos’s predicted p(Up).

Record (p_brownian, p_market, p_kronos, actual_outcome, P&L). Score on Brier + log-loss + hypothetical P&L. Sort chronologically · split into first/second half · report on both halves separately.
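The Brownian step and the Kronos step can be sketched in a few lines of Python. The body of fairValuePUp is not reproduced in this article, so the driftless geometric-Brownian form below is an assumption: it treats windowVol as the standard deviation of the log return over the full window and scales it by the square root of the fraction of the window still remaining. The names fair_value_p_up, norm_cdf and path_fraction_p_up are illustrative and are not taken from the bot’s code.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fair_value_p_up(spot: float, open_price: float,
                    seconds_left_frac: float, window_vol: float) -> float:
    """Assumed driftless-GBM estimate of P(close > open).

    window_vol is taken to be the std-dev of the log return over the whole
    window; seconds_left_frac is the share of the window still to run.
    """
    if seconds_left_frac <= 0.0 or window_vol <= 0.0:
        # No time or no volatility left: the outcome is already decided.
        if spot > open_price:
            return 1.0
        if spot < open_price:
            return 0.0
        return 0.5
    sigma_left = window_vol * math.sqrt(seconds_left_frac)
    z = math.log(spot / open_price) / sigma_left
    return norm_cdf(z)


def path_fraction_p_up(final_closes: list[float], open_price: float) -> float:
    """Fraction of sampled forecast paths that end above the open price."""
    ups = sum(1 for close in final_closes if close > open_price)
    return ups / len(final_closes)


# Hypothetical inputs: spot slightly above the open, half the window left.
print(fair_value_p_up(spot=100.05, open_price=100.0,
                      seconds_left_frac=0.5, window_vol=0.001))
# 16 made-up path endpoints, 12 of them above the open.
print(path_fraction_p_up([100.1] * 12 + [99.9] * 4, open_price=100.0))
```

With 16 paths the Kronos estimate can only take values in steps of 1/16, which is one reason a sampled probability is noisier than the closed-form Brownian one.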

The discipline that matters: if a model wins on the first half but ties or loses on the second, that’s the curve-fit-in-slow-motion pattern the previous two articles named, and it doesn’t count as edge. The whole pipeline is reproducible from docs/RESEARCH_PIPELINE.md. Any future candidate model gets a sibling directory in research//, reuses the same Brownian baseline, the same trade-log loader, the same OHLCV fetcher, the same metrics, the same out-of-sample split. Same gauntlet, different model, same discipline.
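The first-half / second-half discipline is simple enough to sketch. The code below is a minimal illustration under assumptions: the Trade record and its field names are hypothetical, the split is a plain floor-midpoint cut after sorting by fire time, and the probabilities are synthetic. A floor-midpoint cut of 497 trades gives halves of 248 and 249, which is consistent with the 249-trade test half reported here, though the article does not state the exact split rule.

```python
import random
from dataclasses import dataclass


@dataclass
class Trade:
    fill_ts: float   # fire time, epoch seconds (hypothetical field name)
    p_model: float   # model's probability that the traded side wins
    outcome: int     # 1 if the side won at SETTLE, else 0


def chronological_halves(trades: list[Trade]) -> tuple[list[Trade], list[Trade]]:
    """Sort by fire time and cut at the midpoint: earlier half, later half."""
    ordered = sorted(trades, key=lambda t: t.fill_ts)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def brier(trades: list[Trade]) -> float:
    """Mean squared error between predicted probability and outcome."""
    return sum((t.p_model - t.outcome) ** 2 for t in trades) / len(trades)


# Synthetic stand-in for a trade log; none of these numbers come from Polybot.
rng = random.Random(42)
log = [Trade(fill_ts=rng.uniform(0, 1e6), p_model=rng.random(),
             outcome=rng.randint(0, 1)) for _ in range(497)]

first_half, second_half = chronological_halves(log)
print(len(first_half), len(second_half))
print(f"first half Brier {brier(first_half):.3f}, "
      f"second half Brier {brier(second_half):.3f}")
```

A model only counts as having edge if it holds up on the later half, which was never available when any choices were made on the earlier one.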

FIG. 02 — FULL-SAMPLE SCORING · 497 PAIRED TRADES

Three models · two probability-scoring metrics

Brier score and log-loss · the standard scoring rules for probability forecasts · lower is better

| Model | Brier ↓ | Log-loss ↓ |
| --- | --- | --- |
| Brownian (geometric Brownian motion · the 1900s baseline) | 0.193 | 0.567 |
| Market-implied (Polymarket order book at FILL · reference) | 0.211 | 0.604 |
| Kronos (24.7M-param foundation model · 16 sampled forecast paths) | 0.213 | 1.080 |

Kronos’s log-loss is roughly twice Brownian’s — the signature of a model that makes confident, wrong predictions in the tails. Polymarket’s order book sits between the two, reasonably calibrated, slightly worse than the bot’s Brownian and slightly better than the foundation model. The 100-year-old math beat the 24.7M-parameter foundation model on both probability-scoring metrics.
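To see why confident misses show up far more strongly in log-loss than in Brier score, it helps to write both rules out. This is a minimal sketch of the standard definitions; the clipping constant eps is an assumption added here so that a prediction of exactly 0 or 1 does not produce an infinite loss, and the two example forecasts are made up.

```python
import math


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs: list[float], outcomes: list[int], eps: float = 1e-6) -> float:
    """Mean negative log-likelihood; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# One event that actually happened (outcome 1), forecast two ways.
hedged = [0.40]      # mildly wrong
confident = [0.02]   # confidently wrong
won = [1]

print(brier_score(hedged, won), brier_score(confident, won))
print(log_loss(hedged, won), log_loss(confident, won))
```

Brier score is bounded at 1 per forecast, so a confident miss costs at most a fixed amount. Log-loss grows without bound as the predicted probability of the realised outcome approaches zero, so a handful of such misses can dominate the average.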

FIG. 03 — OUT-OF-SAMPLE VERDICT · 249-TRADE TEST HALF

Chronologically-separated · never seen by tuning

The verdict the test was designed to deliver · noise band of repeated runs with different sampling seeds

| 249-trade test half | Brier score, out-of-sample (lower is better) |
| --- | --- |
| Brownian | 0.188 |
| Kronos | 0.189 |
| The gap | 0.0011 (statistically indistinguishable · inside the noise band) |

Kronos does not beat Brownian on a held-out chronologically-separated sample. So Stage 2 is not happening.

“Stage 2” was the planned next step: wiring Kronos into Polybot as a live strategy if Stage 1 produced a clear signal. This data does not earn that case. For 5-minute BTC at the horizons the bot trades, the open Kronos-small checkpoint does not beat the Brownian baseline. Stop. The next candidate model (Chronos · TimesFM · Lag-Llama · a Kronos finetune on 5-min crypto · something else) goes through the same gauntlet. Most will fail it. That is the gauntlet doing its job.
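The idea of a noise band from repeated runs with different sampling seeds can be illustrated without running Kronos at all. The sketch below is a stand-in under loud assumptions: each forecast path is simulated as an independent coin flip with a made-up "true" probability, which is not how Kronos generates paths, and all data is synthetic. It only shows how much a Brier score can move when nothing changes except the seed behind 16 sampled paths per trade.

```python
import random
import statistics


def sampled_p_up(true_p: float, n_paths: int, rng: random.Random) -> float:
    """Fraction of n_paths simulated paths that end 'up' (coin-flip stand-in)."""
    ups = sum(1 for _ in range(n_paths) if rng.random() < true_p)
    return ups / n_paths


def brier_across_seeds(true_ps, outcomes, n_paths=16, seeds=range(10)):
    """Re-sample every trade's path fraction once per seed; one Brier per seed."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)
        preds = [sampled_p_up(p, n_paths, rng) for p in true_ps]
        score = sum((q - y) ** 2 for q, y in zip(preds, outcomes)) / len(preds)
        scores.append(score)
    return scores


# Synthetic 249-trade test half; none of these values come from the experiment.
data_rng = random.Random(0)
true_ps = [data_rng.uniform(0.2, 0.8) for _ in range(249)]
outcomes = [1 if data_rng.random() < p else 0 for p in true_ps]

scores = brier_across_seeds(true_ps, outcomes)
print(f"min {min(scores):.4f}  max {max(scores):.4f}  "
      f"stdev {statistics.stdev(scores):.4f}")
```

A gap between two models only means something if it is clearly larger than the spread this kind of re-seeding produces on the same trades.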

FIG. 04 — THE PARADOX · BETTER TRADER vs WORSE FORECASTER

By operational standards Kronos wins · by probabilistic standards Kronos loses

The hypothetical-P&L counterfactual replays the same data through “what if Polybot fired on each model’s probability”

Operational view · Kronos as the better trader

Kronos fires less · wins more · nets slightly more.

| Metric | Kronos | Brownian (reference) |
| --- | --- | --- |
| Hypothetical fires | 201 | 279 |
| Win rate | 60.7% | 49.1% |
| Hypothetical net P&L | +$538 | +$465 |

Fires ~28% less often and wins more reliably when it does. If you use Kronos as a directional signal in a broader system that does its own sizing — closer to how TradingAgents uses analyst outputs — the directional accuracy might still be useful.
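A counterfactual replay of this kind can be sketched as follows. Everything specific here is an assumption: the article does not publish the bot’s edge margin or risk gates, so edge_margin=0.05 is a placeholder, fees and slippage are ignored, and the three example trades are invented. The payoff uses the usual binary-market convention that a share bought at price q pays 1 if the side wins and 0 otherwise.

```python
def replay(trades, edge_margin: float = 0.05, stake: float = 1.0):
    """Replay (p_model, p_market, outcome) rows through a simple fire rule.

    Fire only when the model's probability exceeds the market price by at
    least edge_margin. Returns (fires, win_rate, net_pnl).
    """
    fires = 0
    wins = 0
    pnl = 0.0
    for p_model, p_market, outcome in trades:
        if p_model - p_market < edge_margin:
            continue  # not enough perceived edge: no trade
        fires += 1
        if outcome == 1:
            wins += 1
            pnl += stake * (1.0 - p_market)  # share bought at p_market pays 1
        else:
            pnl -= stake * p_market          # share expires worthless
    win_rate = wins / fires if fires else 0.0
    return fires, win_rate, pnl


# Invented rows: (model probability, market price, 1 if the side won).
example = [(0.70, 0.50, 1), (0.52, 0.50, 1), (0.90, 0.60, 0)]
print(replay(example))

# The "fires ~28% less often" figure follows from the two fire counts.
print(f"{1 - 201 / 279:.1%}")
```

A rule like this rewards direction more than calibration: a model can be badly over-confident and still come out ahead if the trades it selects win often enough at the prices paid.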

Probabilistic view · Kronos as the worse forecaster

Systematically over-confident in the tails.

| Kronos predicts | Trades actually win |
| --- | --- |
| 2.4% | 20.4% |
| 84% | 69.6% |

Log-loss vs Brownian: ~2× worse
Brier (full sample): 0.213 vs 0.193

If you are building a fully-probabilistic system where the probability feeds an expected-value calculation against the market’s implied price — which is what Polybot does — calibration is everything, and Kronos’s calibration is bad enough to disqualify it. It thinks it knows more than it does at both ends.
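Figures such as "predicts 2.4%, wins 20.4%" read like rows of a calibration table: forecasts are grouped into probability bins, and the average forecast in each bin is compared with the observed win rate. The sketch below shows one common way to build such a table; the equal-width binning, the choice of five bins and the toy data are assumptions, not the article’s method.

```python
def calibration_table(probs, outcomes, n_bins: int = 5):
    """Group forecasts into equal-width bins.

    Returns one (count, mean_predicted, observed_win_rate) row per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        win_rate = sum(y for _, y in members) / len(members)
        rows.append((len(members), mean_p, win_rate))
    return rows


# Toy data: an over-confident forecaster at both ends.
probs = [0.05, 0.05, 0.05, 0.05, 0.90, 0.90, 0.90, 0.90]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]

for count, mean_p, win_rate in calibration_table(probs, outcomes):
    print(f"n={count}  predicted {mean_p:.2f}  observed {win_rate:.2f}")
```

For a well-calibrated model the predicted and observed columns track each other. Over-confidence shows up as low bins that win more often than predicted and high bins that win less often.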

Both interpretations are honest. Neither earns the model a place in Polybot. One of them might earn it a place, later, in TradingAgents — as a 5th analyst voice that votes on direction without being trusted for calibrated odds. That experiment is not what this week tested; it is a separate hypothesis for a separate week.

FIG. 05 — WEEK FOUR · THREE POSSIBLE THREADS

Each is a separate article · the pattern across them is the same

Honest measurement · out-of-sample discipline · no rescue narratives when something doesn’t work

A second-tier candidate model · Amazon’s Chronos

Same general shape as Kronos · different training corpus · also open-source. Running it through the exact same gauntlet would say whether the negative result is specific to Kronos or generalises to learned models in this regime.

Generalisation test

Kronos with a finetune on 5-min crypto data

The Kronos repo ships a finetuning pipeline. Taking the open Kronos-base checkpoint, finetuning on the bot’s own recorded BTC tick history, re-testing. Isolates “is the pretrained distribution wrong for crypto?” from “is the architecture wrong for this horizon?”

Architecture vs distribution

A live-trading update on Polybot

The fleet has been running paper trades continuously across these three weeks. A fresh aggregate-P&L view, with the same calibration-style analysis applied to live performance rather than historical replay, is overdue.

Status reset

The contract is “same gauntlet, different model, same discipline.” Specific numbers stay local. Methodology is public on the repo’s docs/RESEARCH_PIPELINE.md. Publishing reproducible parameter recipes for strategies that might be marginally profitable encourages people to copy them with real money, and the prior on real-money outcomes when copying retail strategies is “they lose.” Publishing the methodology lets the next person test their own model honestly without inheriting any of mine.

By probabilistic standards · Kronos is a worse forecaster. By operational standards · Kronos is the better trader. Both interpretations are honest. Neither earns the model a place in Polybot. One of them might earn it a place, later, in TradingAgents.

Thorsten Meyer AI · Week 3 · Foundation Model vs Brownian Motion

Source dossier & methodology notes

Kronos — Open-source MIT-licensed foundation model for financial time series · 25,000+ stars on GitHub · AAAI 2026 paper · four model sizes (4M · 24.7M open small · 102M open · 499M closed) · trained on candles from 45 global exchanges · authors are explicit it is a research model, not a trading system · github.com/shiyu-coder/Kronos
Polybot — Open-source paper-trading bot · MIT-licensed · runs against Polymarket 5-minute Up/Down crypto markets · fair-value strategy uses geometric Brownian motion · two prior weeks of published research established that most parametric edges are mechanical artefacts
Polymarket 5-min Up/Down — Prediction-market windows on BTC, ETH, and other crypto · the test session: 497 paired (FILL → SETTLE) trades, all BTC · order-book at FILL serves as market-implied probability reference
Test design — Run for every paired trade · reconstruct 60-minute OHLCV context · recompute Brownian baseline (line-for-line port of fairValuePUp(spot, openPrice, secondsLeftFrac, windowVol)) · read market-implied probability · sample 16 Kronos forecast paths · record (p_brownian, p_market, p_kronos, actual_outcome, P&L)
Scoring rules — Brier score (mean squared error of probability vs actual outcome) · log-loss (penalises overconfidence) · hypothetical P&L (counterfactual if Polybot had fired on each model’s probability with the same edge-margin and risk gates) · all reported on first/second half separately
Full-sample (497 trades) — Brier: Brownian 0.193 · Market-implied 0.211 · Kronos 0.213 · Log-loss: Brownian 0.567 · Market-implied 0.604 · Kronos 1.080 · Kronos log-loss ~2× Brownian — confident-wrong-in-tails signature
Out-of-sample (249 trades, test half) — Brownian 0.188 · Kronos 0.189 · gap 0.0011 · well inside the noise band of repeated runs with different Kronos sampling seeds · statistically indistinguishable
Counterfactual P&L — Brownian: 279 fires / 49.1% win rate / +$465 net · Kronos: 201 fires / 60.7% win rate / +$538 net · Kronos fires ~28% less often, wins more reliably, nets slightly more by operational standards
Calibration failures — Kronos predicts 2.4% chance → trades win 20.4% · Kronos predicts 84% chance → trades win 69.6% · systematic overconfidence at both tails
Compute — Mac M-series with PyTorch MPS backend · ~1,300 lines of Python · 11 minutes clock-time · cached to parquet so re-runs cost nothing
Public methodology — docs/RESEARCH_PIPELINE.md on the project repo · same gauntlet runs any future candidate forecast model (Chronos · TimesFM · Lag-Llama · Kronos finetune · other) · sibling directories research//
What this does NOT prove — Not that Kronos is bad (one checkpoint · one horizon · one market) · not that Brownian is good (just not worse at this task; week-2 collapsed at higher sample) · not anything about Stages 2 or 3 of the broader pipeline · negative-Stage-1 kills this candidate at this horizon · the gauntlet does its job
Week 4 candidate threads — (a) Amazon’s Chronos generalisation test · (b) Kronos finetune isolating architecture vs distribution · (c) Polybot live-trading aggregate-P&L update with calibration-style analysis

Implications for AI-based Market Prediction Models

This finding suggests that, at least for short-term 5-minute BTC predictions, advanced foundation models like Kronos do not currently provide a measurable edge over the classical Brownian motion assumption. This challenges expectations that machine learning models trained on extensive historical data will automatically outperform traditional stochastic models in high-frequency trading contexts. It underscores the difficulty of capturing market dynamics beyond simple probabilistic assumptions and highlights the importance of rigorous out-of-sample testing for AI trading tools.

Background on Model Testing and Market Prediction Challenges

Historically, Brownian motion has served as a foundational model in financial theory, assuming independent, normally distributed log returns. Recent advances in AI have prompted attempts to develop more sophisticated models trained on vast datasets of market candles, aiming to improve short-term forecasts. Previous research has shown mixed results, with many AI models failing to demonstrate consistent outperformance in live or out-of-sample testing. The current study builds on this context by directly comparing a modern foundation model, Kronos, against the classical baseline in a paper-trading simulation environment, focusing on the highly volatile and short-term BTC market.

“Kronos does not outperform Brownian motion in predicting 5-minute BTC price movements in out-of-sample tests.”
— Thorsten Meyer

Unanswered Questions About Model Performance and Market Dynamics

It remains unclear whether different model configurations, larger training sets, or alternative market conditions could yield better results. The test focused solely on Kronos-small and a specific 5-minute horizon; other models or longer timeframes might perform differently. Additionally, the experiment does not address live trading performance, where factors like execution latency and market impact could influence outcomes. Further research is needed to determine if more advanced or fine-tuned models can surpass traditional assumptions in high-frequency trading.

Next Steps for AI Market Prediction Research

Future work may explore training larger or more specialized models, testing across different assets or timeframes, and integrating real-time data feeds. Researchers might also investigate hybrid approaches combining traditional stochastic models with AI predictions. The current results highlight the importance of rigorous out-of-sample validation before deploying AI models in live trading environments.

Key Questions

Does this mean AI models are useless for crypto trading?

No, this specific test suggests that Kronos does not outperform Brownian motion for 5-minute BTC predictions. However, other models, longer horizons, or different market conditions may yield different results. AI remains a promising research area, but practical advantages need rigorous validation.

Could larger or more specialized models perform better?

Yes, it is possible that bigger or more tailored models trained on more data could outperform simpler models. Further testing is necessary to confirm this, especially in live trading scenarios.

What does this mean for traders using AI today?

Traders should be cautious and rely on proven strategies. AI models require thorough out-of-sample testing to validate their effectiveness before deployment in real markets.

Will future research change these results?

Potentially. As models improve and more data becomes available, AI-based predictions may become more competitive. Ongoing research is essential to assess their real-world utility.

Source: ThorstenMeyerAI.com

Author

Auto Blogging Team
