@shinboson


A story about fraud in the AI research community: On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they've made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it's real. It isn't.

They get massive news coverage and are the talk of the town, so to speak. *If* this were real, it would represent a substantial advance in tuning LLMs at the *abstract* level, and could perhaps even lead to whole new directions of R&D. But soon, cracks appear in the story.

On September 7th, the first independent attempts to replicate their claimed results fail. Miserably, actually. The performance is awful. Further, it is discovered that Matt isn't being truthful about what the released model is actually based on under the hood.

Matt starts making claims that there's something wrong with the API. There's something wrong with the upload. For *some* reason there's some glitch that's just about to be fixed.

Proof points are needed and so Matt hits back. He provides access to a secret, private API that can be used to test "his model". And it performs great! For an open source model of that size, anyway. He even releases a publicly available endpoint for researchers to try out!

But the thing about a private API is it's not really clear what it's calling on the backend. They could be calling a more powerful proprietary model under the hood. We should test and see. Trust, but verify. And it turns out that Matt is a liar.

Their API was a Claude wrapper with a system prompt to make it act similar to the open source model. Amusingly, they appear to be redeploying their private API in response to distinctive tells sneaking through, playing whack-a-mole to try to not get found out.
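The "distinctive tells" approach can be sketched as a canary-word probe: ask the endpoint to echo a string containing the suspected backend's name, and see whether the echo comes back intact or filtered. The check below is a pure function, and the sample replies are invented for illustration; no live endpoint is called, and the exact filtering behavior of the real API is an assumption.

```python
# Canary-word probe sketch: a wrapper that scrubs its backend's identity
# from outputs will fail to echo a string containing that identity.

def echo_is_filtered(canary: str, reply: str) -> bool:
    """True if the reply fails to reproduce the canary word verbatim."""
    return canary.lower() not in reply.lower()

# Simulated replies (illustrative only, not captured API output):
honest_reply = 'The word you asked me to repeat is "Claude".'
filtered_reply = 'The word you asked me to repeat is "".'

assert not echo_is_filtered("Claude", honest_reply)
assert echo_is_filtered("Claude", filtered_reply)
```

A single filtered echo isn't proof on its own, which is presumably why repeated probes across redeployments mattered here: the tell has to survive the operator's whack-a-mole patching.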

tl;dr Matt Shumer is a liar and a fraud. Presumably he'll eventually throw some poor sap engineer under the bus and pretend he was lied to. Grifters shit in the communal pool, sucking capital, attention, and other resources away from people who could actually make use of them.

check out mythbuster extraordinaire @RealJosephus's great thread on this https://t.co/DnSABgzABo

@RealJosephus Since some people are saying this is premature and they want to wait for data and replications, I grabbed an API key, added support for OpenRouter to my eval script, and compared Reflection 70B to other leading models on an *unseen* test set. The results were bad.

@RealJosephus The test set was an algorithmically generated set of 200 multiple choice puzzles. They're unique every time they're generated so they can't be cheesed. There's no way to perform well on this test except intelligence.
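A minimal sketch of that kind of un-cheeseable test set. The thread doesn't specify the puzzle format, so simple arithmetic with near-miss distractors stands in as an assumed example; the point is only that a fresh random seed yields a fresh 200-item set every run, so nothing can be memorized.

```python
import random

def make_puzzle(rng: random.Random) -> tuple[str, list[int], int]:
    """Generate one fresh multiple-choice arithmetic puzzle.
    Returns (question, options, index_of_correct_option)."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    answer = a * b
    # Distractors sit near the true answer so cheap heuristics don't help.
    options = [
        answer,
        answer + rng.randint(1, 9),
        answer - rng.randint(1, 9),
        answer + 10 * rng.randint(1, 9),
    ]
    rng.shuffle(options)
    question = f"What is {a} * {b}? Options: {options}"
    return question, options, options.index(answer)

rng = random.Random()  # fresh seed => a unique test set each generation
test_set = [make_puzzle(rng) for _ in range(200)]
```

Scoring is then just comparing each model's chosen option index against the stored correct index; because the set is regenerated per evaluation, a model can't have seen it in training.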

