1/9 We’re announcing the development of Tier 4, a new suite of math problems that go beyond the hardest problems in FrontierMath. o3’s performance is remarkable, but there’s still a ways to go before any single AI system nears the collective genius of the math community.
2/9 For context, FrontierMath currently spans three broad tiers:
• T1 (25%) Advanced, near top-tier undergrad/IMO
• T2 (50%) Needs serious grad-level background
• T3 (25%) Research problems demanding relevant research experience
All can take hours—or days—for experts to solve.
3/9 Although o3 solved problems in all three tiers, it likely still struggles on the most formidable Tier 3 tasks—those “exceptionally hard” challenges that Tao and Gowers say can stump even top mathematicians.
4/9 Tier 4 aims to push the boundary even further. We want to assemble problems so challenging that solving them would demonstrate capabilities on par with an entire top mathematics department.
5/9 Each problem will be composed by a team of 1-3 mathematicians specialized in the same field over a 6-week period, with weekly opportunities to discuss ideas with teams in related fields. We seek broad coverage of mathematics and want all major subfields represented in Tier 4.
6/9 Process for a Tier 4 problem:
1. One week crafting a robust problem concept, which “converts” research insights into a closed-answer problem.
2. Three weeks of collaborative research, with presentations among related teams for feedback.
3. Two weeks for the final submission.
7/9 We’re seeking mathematicians who can craft these next-level challenges. If you have research-grade ideas that transcend T3 difficulty, please email elliot@epoch.ai with your CV and a brief note on your interests.
8/9 We’ll also hire some red-teamers, tasked with finding clever ways a model can circumvent a problem’s intended difficulty, and some reviewers to check the mathematical correctness of final submissions. Contact me if you think you’re suitable for either role.
9/9 As AI keeps improving, we need benchmarks that reflect genuine mathematical depth. Tier 4 is our next (and possibly final) step in that direction.