I'm looking for OCR models that can "guess" partial words and aren't restricted by safety filters or content policies, especially useful for journalism. Big models like GPT-4o and Google Gemini sometimes refuse to extract text, which isn't ideal. Microsoft's Kosmos-2.5 looks promising but may generate hallucinations. I want reliable, uncensored OCR tools.
Any OCR models out there with LLM-like capabilities - like the ability to "guess" partial words based on context - but that don't follow extra instructions or apply safety filters of any kind?
I want reliable OCR that can't be prompt injected and that won't sometimes refuse to return text.
Multimodal models like GPT-4o, Claude 3 Opus and Google Gemini seem great for OCR at first, but they're no good if they're going to refuse to return text because the content violates their content policies, or if they skip text labeled "ignore this text:" in the document!
This is not a theoretical concern: here's Claude 3 Opus refusing to extract JSON from a campaign finance report document because "... that would involve extracting and structuring private details about the individual"!
This is currently my strongest argument in favor of "uncensored" models: sometimes you just want to be able to do something useful - like OCR - against an arbitrary document.
Especially relevant to journalism, which often involves handling content from unsavory sources!
Anyone tried Microsoft's Kosmos-2.5?
Looks promising: "a multimodal literate model for machine reading of text-intensive images" - but the README does warn "Since this is a generative model, there is a risk of hallucination during the generation process"
@pagilgukey
@simonw sounds like you want something like https://t.co/d6KLcOH5Rs