Measuring how well models can find and fix errors in human-written text
Benchmarked 51 model variants across 1613 runs with --samples 3 --chunk-size 2000 --max-turns-per-chunk 3. Total runtime: 5d 6h 24m. Total cost: $558.
Updated Apr 7, 2026, 10:02 AM - commit 6c1be39 - art@revise.io
Proofreading with LLMs
LLMs make fast, cost-effective proofreaders - but which models are the best? How thoroughly can they find and fix errors, and how efficiently?
ErrataBench measures all of this by running a simple agent loop over large samples of text, each containing a wide variety of writing errors. We ask models to find and fix as many problems as they can using "find and replace" tools.
Overall Rankings
The chart below is an overall ranking of success rate. What percentage of errors did the model find and correctly fix?
For issues not fixed successfully, we make a distinction between omissions ("Not Addressed") and attempted fixes which were rejected as incorrect by the judge ("Bad Fix").
Models were all given the same prompt, tools, text chunks, and number of turns.
Efficiency
Here we compare the overall success rate with other dimensions like speed and cost. Only the best reasoning variant of each model is shown by default; toggle off "Best Variant Only" to see all data points.
This two-dimensional comparison gives the full picture of the trade-off between performance and cost, depending on what matters to you. Tool Call Efficiency measures how surgical the model was with its changes.
Use "Highlight Dominant Models" to see which models dominate on the axis you're viewing.
Methodology
Dataset
ErrataBench tests models against a dataset of English text from various sources (literature, law, technical manuals). Each source has been altered with a wide variety of writing errors, across several categories. These errors are tracked in a corresponding file.
Source text is altered using a corruptor model + review model pair; the corruptor is instructed to make small changes to the text to insert errors of specific types, sampled from the error category taxonomy. The review model reviews the changes as a second opinion before they are accepted into the dataset.
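The corruptor/review pairing above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `call_model` is a hypothetical helper standing in for an LLM API call, and the error-type list is a tiny subset of the taxonomy listed under "Error Categories".

```python
import random

# Hypothetical subset of the error taxonomy (see "Error Categories" below).
ERROR_TYPES = ["homophone confusion", "subject-verb agreement", "apostrophe error"]

def corrupt_passage(passage, call_model, n_errors=3, seed=0):
    """Ask a corruptor model to insert sampled error types, then have a
    review model accept or reject the result. `call_model(role, prompt)`
    is a hypothetical helper returning the model's text response."""
    rng = random.Random(seed)
    sampled = rng.sample(ERROR_TYPES, k=min(n_errors, len(ERROR_TYPES)))
    altered = call_model(
        "corruptor",
        f"Make small edits to insert these error types: {sampled}\n\n{passage}",
    )
    verdict = call_model(
        "review",
        f"Original:\n{passage}\n\nAltered:\n{altered}\n\nValid corruption? yes/no",
    )
    # Only accept the altered text into the dataset if the reviewer agrees.
    return altered if verdict.strip().lower().startswith("yes") else None
```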
Agent Loop
The benchmark runs a simple agent loop over each altered text, showing the model chunks of up to 2000 words at a time while keeping whole paragraphs together. The model is simply asked to proofread the text carefully, fixing errors while preserving meaning and details. The model is not given any specifics about what types of errors are hiding in the text or how many of them there are.
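The paragraph-preserving chunking described above can be sketched like this; a minimal version, assuming paragraphs are separated by blank lines (the benchmark's actual splitting rules may differ):

```python
def chunk_paragraphs(text, max_words=2000):
    """Split text into chunks of up to max_words words each,
    keeping whole paragraphs together."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that a single paragraph longer than `max_words` still becomes its own (oversized) chunk, since paragraphs are never split.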
The agent loop gives the model up to 3 turns with each chunk, during which the model can modify that text using simple tools: find_and_replace (to make surgical changes) and replace_paragraph (for wider rewrites). If the model changes the text to what it was originally, the error is immediately counted as fixed. If the model changes it to something else, an LLM judge is used to decide if that alternative is also correct.
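The tool semantics and grading rule above can be sketched as follows. This is a simplified illustration, not the benchmark's exact implementation: `llm_judge` is a hypothetical callable standing in for the judge model, and the single-match requirement for `find_and_replace` is an assumption about how surgical edits are kept unambiguous.

```python
def apply_find_and_replace(text, find, replace):
    """Sketch of a find_and_replace tool: apply the edit only when
    `find` matches exactly once, so the change is unambiguous."""
    if text.count(find) != 1:
        return text, False  # ambiguous or missing target; no edit made
    return text.replace(find, replace), True

def grade_fix(edited_span, original_span, llm_judge):
    """Count a fix as correct when it restores the original text exactly;
    otherwise defer to an LLM judge (hypothetical callable)."""
    if edited_span == original_span:
        return True
    return llm_judge(edited_span, original_span)
```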
Metrics
The headline Fix rate score is an equal-weighted per-dataset mean, not a pooled average over all runs. For each source dataset d, ErrataBench averages the successful-run quality scores for that dataset to get mean(q_d), then averages those dataset-level means across datasets. That keeps datasets with more successful repeats from counting more heavily than datasets with fewer.
The other scatterplot metrics are derived similarly. Cost Efficiency is corrected issues per USD at the run level, then averaged by dataset and across datasets. Speed is corrected issues per minute of run time, aggregated the same way. Tool Call Efficiency is shown as issues fixed per 100 benchmark-scoped tool characters, an inverted display of the underlying tool-chars-per-resolved-issue metric. Turns Used is the mean number of turns per chunk, averaged across datasets.
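The equal-weighted aggregation behind these metrics can be written as a one-liner; a minimal sketch, assuming `runs` maps each dataset name to its list of run-level scores (a hypothetical data shape, not the repository's actual structures):

```python
from statistics import mean

def macro_average(runs):
    """Equal-weighted per-dataset mean: average run-level scores within
    each dataset first, then average those dataset means. Datasets with
    more runs therefore do not count more heavily."""
    return mean(mean(scores) for scores in runs.values())
```

Compare this with a pooled average over all runs, which would weight dataset "a" below twice as heavily as dataset "b".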
The scatterplot's consistency metric is median_d range(q_d), the median per-dataset range of quality. For each dataset, ErrataBench measures the spread between the highest and lowest successful-run quality score, then takes the median of those dataset-level ranges. In the UI this appears as Consistency. Lower values mean the model's proofreading quality is more consistent from run to run, while the median makes the metric less sensitive to a few noisy source texts.
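The consistency metric can be sketched the same way; again assuming `runs` maps each dataset to its list of successful-run quality scores:

```python
from statistics import median

def consistency(runs):
    """median_d range(q_d): the per-dataset spread (max minus min) of
    quality scores, then the median of those spreads across datasets.
    Lower values mean more run-to-run consistency."""
    ranges = [max(q) - min(q) for q in runs.values()]
    return median(ranges)
```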
Reproducing Results
Source code, dataset, and results are available on GitHub. Results can be reproduced using an OpenRouter API key. The repository also includes a tool for creating your own dataset using any source text, making it easy to run this benchmark against new samples.
The benchmark has two primary inputs which determine how the agent loop works: the chunk size and the number of turns per chunk. Changing these values can produce different results - specifically, larger chunks and fewer turns favor models that do less reasoning, since high-reasoning models tend to get lost when presented with larger chunks and given fewer turns to edit them. The data published here was generated by running the benchmark with --samples 3 --chunk-size 2000 --max-turns-per-chunk 3, which was chosen as a reasonable general setting.
Error Categories
The full list of error categories used by ErrataBench can be found below.
- Orthography and word form
- Typographical errors: fat-finger typo, omission, insertion, transposition, duplication
- Spelling and casing: nonword spelling error, real-word spelling error, capitalization error, spacing error, hyphenation or compounding error
- Morphology and inflection: incorrect pluralization, incorrect conjugation, wrong word form
- Lexical choice and confusability
- Word selection: misused word, homophone confusion, homograph or near-neighbor confusion, malapropism, collocation error
- Mishearing and reinterpretation: mondegreen, eggcorn
- Idiomaticity: idiom error
- Grammar and syntax
- Agreement: subject-verb agreement, pronoun agreement
- Verb system: incorrect tense, auxiliary or modal error
- Clause and sentence structure: sentence fragment; run-on, fused sentence, or comma splice; word order error; extra word; parallelism error; relative-clause error; coordination or subordination error
- Function words and reference: article or determiner error, preposition error, pronoun-case error, pronoun-reference error, dangling or misplaced modifier
- Polarity and comparison: double negative, negation-scope error, comparative or superlative error, countability error
- Punctuation and boundaries
- punctuation error, apostrophe error, quotation or delimiter error, sentence-boundary error
- Semantics, discourse, and style
- Semantics: semantic anomaly, referential ambiguity
- Discourse: discourse coherence error, cross-sentence tense inconsistency
- Style and register: wordiness or redundancy, register mismatch, awkward phrasing, nonstandard dialect or colloquial form