New paper! LLMs Corrupt Your Documents When You Delegate
LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf.
Delegation requires trust: does the LLM complete tasks without introducing errors?
We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N
We built DELEGATE-52: 310 work environments across 52 professional domains.
Each has real documents + 5-10 complex editing tasks.
Key idea: every edit is reversible.
Apply forward edit → backward edit → compare with original for evaluation.
Chain 10 of these round trips (20 interactions) → simulate long-horizon delegated interactions.
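A minimal sketch of the round-trip idea, under stated assumptions: the similarity metric (`difflib.SequenceMatcher`), the helper names (`preservation`, `run_chain`), and the toy edit functions are illustrative stand-ins, not the benchmark's actual metric or tasks. In the real setup each forward and backward edit would be performed by an LLM.

```python
import difflib
from typing import Callable, List, Tuple

# An "edit" is any function from document text to document text.
# In a real run this would wrap an LLM call (hypothetical here).
Edit = Callable[[str], str]


def preservation(original: str, reconstructed: str) -> float:
    """Similarity in [0, 1] between the original and the round-tripped text.

    Assumption: a character-level ratio is used purely for illustration.
    """
    return difflib.SequenceMatcher(None, original, reconstructed).ratio()


def run_chain(document: str, round_trips: List[Tuple[Edit, Edit]]) -> List[float]:
    """Apply each (forward, backward) pair in order.

    After every round trip, compare the current text with the ORIGINAL
    document, so errors from earlier rounds carry over into later scores.
    """
    scores = []
    current = document
    for forward, backward in round_trips:
        current = backward(forward(current))
        scores.append(preservation(document, current))
    return scores


# Toy stand-ins for LLM edits: one lossless pair, one lossy pair.
def to_upper(text: str) -> str:
    return text.upper()


def to_lower(text: str) -> str:
    return text.lower()


def drop_last_line(text: str) -> str:
    return "\n".join(text.split("\n")[:-1])


def identity(text: str) -> str:
    return text


doc = "alpha\nbeta\ngamma"
scores = run_chain(doc, [(to_upper, to_lower), (drop_last_line, identity)])
print(scores)
```

The first round trip restores the document exactly, so it scores 1.0. The second one silently drops a line, and the score stays below 1.0 for every later round because nothing brings the lost content back.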
Finding #1: Every model degrades documents over time.
We tested 19 LLMs.
Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupt 25% of document content after 20 interactions.
Average across all models: 50% content loss.
Finding #2: Degradation depends on the domain.
In most domains, models make sparse but critical errors that corrupt documents.
Python programming is the exception: most models can manipulate Python code losslessly.
We define "ready" = 98%+ preservation after 20 interactions.
The best model (Gemini 3.1 Pro) is ready in only 11 of 52 domains.
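The "ready" criterion can be sketched as a simple threshold check. The domain names (other than Python) and all scores below are made-up placeholders, not results from the paper, and `ready_domains` is a hypothetical helper.

```python
from typing import Dict, List

# From the thread: "ready" = 98%+ preservation after 20 interactions.
READY_THRESHOLD = 0.98


def ready_domains(final_preservation: Dict[str, float],
                  threshold: float = READY_THRESHOLD) -> List[str]:
    """Return the domains whose final preservation meets the threshold."""
    return sorted(
        domain
        for domain, score in final_preservation.items()
        if score >= threshold
    )


# Hypothetical per-domain preservation after 20 interactions (illustrative only).
example_scores = {
    "python": 0.995,
    "domain_a": 0.91,
    "domain_b": 0.98,
    "domain_c": 0.62,
}

ready = ready_domains(example_scores)
print(f"ready in {len(ready)} of {len(example_scores)} domains: {ready}")
```

With these placeholder numbers, two of the four domains clear the bar. The same count over the real per-domain scores is what yields figures like "11 of 52".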
Finding #3: Giving LLMs tools doesn't help — it makes things worse.
We tested LLMs with an agent harness with file read/write + code execution.
The 4 tested models degraded documents *more* with tools than without.
Why? Tool use adds overhead: 2-5x more input tokens, which means more context to get lost in. Even with tools, models mostly write files manually, which suggests that computational editing (through code execution) remains challenging.
Finding #4: Everything compounds.
📏 Bigger documents → more degradation
⏳ Longer interaction → more degradation
📎 Distractor files → more degradation
And these multiply: the harm from larger documents snowballs 5x over the course of interaction.
Short evals dramatically underestimate real-world degradation.
Finding #5: It's not death by a thousand cuts.
Models maintain near-perfect reconstruction in some rounds, then experience *critical failures* — losing 10% of contents in a single step.
These sparse critical failures explain ~80% of total degradation.
Stronger models don't avoid small errors better; they delay catastrophic ones.
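One way to picture the "sparse critical failures" analysis, as a rough sketch: take a preservation score after each interaction, compute the per-step drops, and ask what fraction of the total loss comes from steps that lose 10 points or more. The trajectory below is invented for illustration, and treating "10%" as 10 percentage points of preservation is an assumption about the definition.

```python
from typing import List

# Assumption: a "critical failure" is a single step that loses at least
# 10 percentage points of preservation.
CRITICAL_DROP = 0.10


def critical_failure_share(scores: List[float],
                           threshold: float = CRITICAL_DROP) -> float:
    """Fraction of total degradation caused by single-step drops >= threshold.

    `scores` starts at 1.0 (untouched document) and holds the preservation
    score after each subsequent interaction.
    """
    # Per-step loss; steps where the score recovers count as zero loss.
    drops = [max(prev - cur, 0.0) for prev, cur in zip(scores, scores[1:])]
    total = sum(drops)
    if total == 0:
        return 0.0
    critical = sum(drop for drop in drops if drop >= threshold)
    return critical / total


# Invented trajectory: mostly flat, with two sudden large drops.
trajectory = [1.0, 0.99, 0.99, 0.85, 0.84, 0.84, 0.70]
share = critical_failure_share(trajectory)
print(f"share of degradation from critical failures: {share:.2f}")
```

In this toy trajectory the two large drops account for roughly 93% of the total loss, while the small step-to-step errors contribute very little. That is the shape the thread describes, as opposed to a slow, uniform decline.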
Conclusion: Current LLMs are unreliable delegates.
They introduce sparse but severe errors that compound and corrupt work documents.
We release DELEGATE-52 to encourage the study of delegated work across knowledge work domains:
💻
🤗
📄
Work done with my wonderful colleagues Tobias Schnabel and @ProfJenNeville at @MSFTResearch