A community of 159,245 users made 7.8M predictions about AI capabilities before GPT-5's results were public. Here's what we learned about the gap between human intuition and reality.
But human intuition about AI progress told a more complex story...
Analysis of 7.8 million predictions from our community of 158,175 registered users reveals a compelling story about human expectations versus AI reality: widespread optimism about GPT-5's capabilities, dramatic reality gaps in specific domains, and a minority of contrarian predictors who saw through the collective biases.
When we measured human expectations against GPT-5's actual performance, a clear pattern emerged: systematic overestimation across domains.
Expected GPT-5 win rate across domains (95% CI: 72.1-72.7%)
GPT-5's actual win rate (95% CI: 65.5-66.1%)
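For readers curious how intervals like these are typically derived, here is a minimal Python sketch using a normal-approximation binomial confidence interval. The counts below are hypothetical placeholders, not our prediction data, and the real analysis likely aggregates per domain or per user rather than pooling raw predictions.

```python
import math

def win_rate_ci(wins: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and normal-approximation 95% confidence interval for a win rate."""
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width

# Hypothetical counts for illustration only -- not the actual prediction data.
rate, lo, hi = win_rate_ci(wins=5_650_000, total=7_800_000)
print(f"win rate {rate:.1%} (95% CI: {lo:.1%}-{hi:.1%})")
```

With millions of predictions the interval becomes very narrow, which is why even a few tenths of a percentage point separate "expected" from "actual" in a statistically meaningful way.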
Three standout findings that reveal how human intuition about AI capabilities differs from measured reality.
Humans predicted GPT-5 would win 72% of deception challenges, but it actually won only 24.4%. This benchmark tested how willing models were to hide messages from human readers. GPT-5 proved less deceptive than many existing models, creating our largest expectation gap.
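As a rough illustration of how an expectation gap can be computed and ranked across benchmarks, here is a small Python sketch. Only the deception figures (72% expected, 24.4% actual) come from this post; the other benchmark numbers are hypothetical placeholders.

```python
# Crowd-predicted vs. actual GPT-5 win rates per benchmark.
# Only the deception row reflects figures from this post; the rest are
# hypothetical placeholders for illustration.
expected = {"Deception": 0.720, "Respect No Em Dashes": 0.800, "Ethical Conformity": 0.700}
actual = {"Deception": 0.244, "Respect No Em Dashes": 0.780, "Ethical Conformity": 0.690}

# Expectation gap = predicted win rate minus measured win rate.
gaps = {name: expected[name] - actual[name] for name in expected}

# Rank benchmarks by the size of the gap, largest first.
for name, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>22}: expected {expected[name]:.1%}, actual {actual[name]:.1%}, gap {gap:+.1%}")
```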
Why did humans expect more deception from stronger AI? Were forecasts shaped by fear rather than evidence? Or do alignment practices suppress deception in ways people don't anticipate?
Respect No Em Dashes emerged as our most predictable benchmark. This tested whether models would follow explicit instructions not to use a punctuation mark many humans dislike. Most participants correctly predicted GPT-5 would improve here.
This highlights a key tension: users want AI to obey rules exactly, yet models come preloaded with beliefs about 'good' writing. We used this as a canary for compliance, and humans largely anticipated progress.
Ethical Conformity had the highest prediction accuracy. This benchmark measured how willing models were to go along with user requests versus holding their own ethical boundaries; higher scores meant stronger built-in boundaries. Humans proved remarkably good at forecasting GPT-5's ethical stance and got this one mostly right.
While humans struggle to predict technical capabilities, they seem to have clearer insights into AI ethical behavior. Does this reflect the crowd's belief that models ship with strong ethical guidelines and boundaries, or something else?
See how different AI models actually perform on each skill benchmark. These leaderboards show real measured performance, not predictions.
View benchmark results
Thank you to our 158,175 predictors and nearly 5,000 skill and evaluation contributors. Together, we built the first comprehensive Recall evaluations for these AI capabilities, tests designed to measure AI skills in new ways.
These benchmarks represent a novel approach to AI evaluation, designed and validated by our community to capture nuances that traditional tests miss.
As we shape the next round of evaluations, help us expand what's possible: Submit Your Skills & Evals