Philippe Laban
@PhilippeLaban
Apr 22, 8 tweets

New paper! LLMs Corrupt Your Documents When You Delegate

LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf.
Delegation requires trust: does the LLM complete tasks without introducing errors?

We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N

We built DELEGATE-52: 310 work environments across 52 professional domains.

Each has real documents + 5-10 complex editing tasks.

Key idea: every edit is reversible.
Apply forward edit → backward edit → compare with original for evaluation.
Chain 10 of these → simulate long-horizon delegated interactions.
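
To make the round-trip idea concrete, here is a minimal sketch, not the paper's actual harness. The `Editor` callable is a hypothetical stand-in for an LLM call, the two toy editors are made up, and `difflib.SequenceMatcher` is an assumed similarity measure (the thread doesn't say which metric DELEGATE-52 uses).

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical editor signature: (document, instruction) -> edited document.
Editor = Callable[[str, str], str]

def preservation(original: str, reconstructed: str) -> float:
    """Assumed similarity metric in [0, 1]; 1.0 means identical text."""
    return SequenceMatcher(None, original, reconstructed, autojunk=False).ratio()

def round_trip_chain(doc: str, tasks: List[Tuple[str, str]], edit: Editor) -> List[float]:
    """Apply each (forward, backward) instruction pair in sequence and
    score the document against the original after every round trip."""
    original = doc
    scores = []
    for forward, backward in tasks:
        doc = edit(doc, forward)
        doc = edit(doc, backward)
        scores.append(preservation(original, doc))
    return scores

# Toy deterministic "editors" standing in for an LLM (illustration only).
def perfect_editor(doc: str, instruction: str) -> str:
    if instruction == "upper":
        return doc.upper()
    if instruction == "lower":
        return doc.lower()
    return doc

def lossy_editor(doc: str, instruction: str) -> str:
    # Drops the last character on every edit to mimic silent content loss.
    return perfect_editor(doc, instruction)[:-1]

tasks = [("upper", "lower")] * 10  # 10 round trips = 20 interactions
doc = "quarterly revenue rose while costs fell. " * 5
perfect_scores = round_trip_chain(doc, tasks, perfect_editor)
lossy_scores = round_trip_chain(doc, tasks, lossy_editor)
```

Because every forward edit has an inverse, the original document doubles as the reference answer, so no human grading is needed. The perfect editor stays at 1.0 in every round, while the lossy one drifts further from the original with each round trip.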

Finding 1: Every model degrades documents over time.

We tested 19 LLMs.
Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4) corrupt 25% of document content after 20 interactions.
Average across all models: 50% content loss.

Finding 2: Degradation depends on the domain.

In most domains, models make sparse but critical errors that corrupt documents.
Python programming is the exception: most models can manipulate Python code losslessly.

We define "ready" = 98%+ preservation after 20 interactions.
The best model (Gemini 3.1 Pro) is ready in only 11 of 52 domains.
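
As a rough sketch of how the "ready" criterion could be applied, assuming each domain gets one preservation score in [0, 1] after the final interaction. Apart from Python, the domain names and all scores below are made up for illustration and are not results from the paper.

```python
from typing import Dict, List

READY_THRESHOLD = 0.98  # "ready" = 98%+ preservation, per the thread

def ready_domains(final_scores: Dict[str, float],
                  threshold: float = READY_THRESHOLD) -> List[str]:
    """Return the domains whose final preservation meets the threshold."""
    return sorted(d for d, s in final_scores.items() if s >= threshold)

# Hypothetical scores, for illustration only.
example_scores = {
    "python": 0.995,
    "legal_contracts": 0.91,
    "spreadsheets": 0.98,
    "recipes": 0.62,
}
ready = ready_domains(example_scores)
print(f"{len(ready)} of {len(example_scores)} domains ready: {ready}")
```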

Finding 3: Giving LLMs tools doesn't help; it makes things worse.

We tested LLMs in an agent harness with file read/write + code execution.
The 4 tested models degraded documents *more* with tools than without.

Why? Tool overhead means 2-5x more input tokens, which is more context to get lost in. Even with tools, models mostly write files manually, showing that computational editing (through code execution) remains challenging.

Finding 4: Everything compounds.

📏 Bigger documents → more degradation
⏳ Longer interaction → more degradation
📎 Distractor files → more degradation

And these multiply: the harm from larger documents snowballs 5x over the course of interaction.

Short evals dramatically underestimate real-world degradation.

Finding 5: It's not death by a thousand cuts.

Models maintain near-perfect reconstruction in some rounds, then experience *critical failures* — losing 10% of contents in a single step.

These sparse critical failures explain ~80% of total degradation.

Stronger models don't avoid small errors better; they delay catastrophic ones.
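
One way to measure how much degradation comes from critical failures, sketched under assumptions: a "critical failure" is taken to be a single-step drop of at least 10 percentage points in preservation, and the trajectory below is made up. The paper's exact definition may differ.

```python
from typing import List, Tuple

def critical_failure_share(scores: List[float],
                           drop_threshold: float = 0.10) -> Tuple[List[int], float]:
    """Given preservation after each step (starting from 1.0), return the
    steps whose single-step loss is >= drop_threshold, plus the share of
    total loss those steps account for."""
    prev = 1.0
    critical_steps: List[int] = []
    critical_loss = 0.0
    total_loss = 0.0
    for step, score in enumerate(scores, start=1):
        loss = max(prev - score, 0.0)  # ignore steps where the score recovers
        total_loss += loss
        if loss >= drop_threshold:
            critical_steps.append(step)
            critical_loss += loss
        prev = score
    share = critical_loss / total_loss if total_loss > 0 else 0.0
    return critical_steps, share

# Hypothetical trajectory: mostly flat, with two sudden drops.
trajectory = [1.00, 0.99, 0.99, 0.85, 0.85, 0.84, 0.70, 0.69]
steps, share = critical_failure_share(trajectory)
print(f"critical steps: {steps}, share of total loss: {share:.0%}")
```

In this toy trajectory, two of the eight steps account for roughly 90% of the loss. That is the pattern the thread describes: long stretches of near-perfect reconstruction punctuated by a few large drops.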

Conclusion: Current LLMs are unreliable delegates.
They introduce sparse but severe errors that compound and corrupt work documents.

We release DELEGATE-52 to encourage the study of delegated work across knowledge work domains:

💻
🤗
📄

Work done with my wonderful colleagues Tobias Schnabel and at