Measuring how well models can find and fix errors in human-written text

Benchmarked 51 model variants across 1613 runs with --samples 3 --chunk-size 2000 --max-turns-per-chunk 3
Total runtime: 5d 6h 24m
Total cost: $558

Updated Apr 7, 2026, 10:02 AM - commit 6c1be39 - art@revise.io

Proofreading with LLMs

LLMs make fast, cost-effective proofreaders - but which models are the best? How thoroughly can they find and fix errors, and how efficiently?

ErrataBench measures all of this by running a simple agent loop over large samples of text, each containing a wide variety of writing errors. We ask models to find and fix as many problems as they can using "find and replace" tools.

Overall Rankings

The chart below ranks models by overall success rate: the percentage of errors each model found and correctly fixed.

For issues not fixed successfully, we make a distinction between omissions ("Not Addressed") and attempted fixes which were rejected as incorrect by the judge ("Bad Fix").

Models were all given the same prompt, tools, text chunks, and number of turns.

[Chart: overall ranking by model. Legend: Error Fixed, Bad Fix, Not Addressed]

Efficiency

Here we compare the overall success rate with other dimensions like speed and cost. Only the best reasoning variant of each model is shown by default; toggle off "Best Variant Only" to see all data points.

This two-dimensional comparison shows the trade-off between success rate and whichever second dimension (such as speed or cost) matters most to you. Tool Call Efficiency measures how surgical the model was with its changes.

Use "Highlight Dominant Models" to see which models dominate on the axis you're viewing.
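The page does not define "dominate" precisely. One common reading is Pareto dominance on the two plotted axes: a model is dominant if no other model is at least as good on both axes and different on at least one. The following is a minimal sketch under that assumption, with both axes treated as higher-is-better; the function name and the model labels are hypothetical.

```python
def dominant_models(points: dict[str, tuple[float, float]]) -> set[str]:
    """Return the models that no other model Pareto-dominates.

    points maps a model name to (x, y), where higher is better on both axes.
    This definition of dominance is an assumption, not taken from ErrataBench.
    """
    def dominates(q: tuple[float, float], p: tuple[float, float]) -> bool:
        # q dominates p if it is at least as good on both axes and not identical.
        return q[0] >= p[0] and q[1] >= p[1] and q != p

    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    }


# Hypothetical data: (success rate, issues fixed per USD).
example = {"A": (0.9, 1.0), "B": (0.8, 3.0), "C": (0.7, 2.0)}
frontier = dominant_models(example)
```

Here model C is beaten by B on both axes, so only A and B remain on the frontier. For a lower-is-better axis, negate its values before calling the function.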

Methodology

Dataset

ErrataBench tests models against a dataset of English text from various sources (literature, law, technical manuals). Each source has been altered with a wide variety of writing errors, across several categories. These errors are tracked in a corresponding file.

Source text is altered using a corruptor model + review model pair; the corruptor is instructed to make small changes to the text to insert errors of specific types, sampled from the error category taxonomy. The review model reviews the changes as a second opinion before they are accepted into the dataset.

Agent Loop

The benchmark runs a simple agent loop over each altered text, showing the model chunks of up to 2000 words at a time while keeping whole paragraphs together. The model is simply asked to proofread the text carefully, fixing errors while preserving meaning and details. The model is not given any specifics about what types of errors are hiding in the text or how many of them there are.
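The chunking step can be sketched roughly as follows. This is an illustration rather than the benchmark's actual code: it assumes paragraphs are separated by blank lines, counts words by whitespace splitting, and lets a single paragraph longer than the limit form its own oversized chunk. The function name is hypothetical.

```python
def chunk_paragraphs(text: str, max_words: int = 2000) -> list[str]:
    """Group whole paragraphs into chunks of at most max_words words."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would push it over the limit.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
small_chunks = chunk_paragraphs(sample, max_words=5)
```

With a limit of 5 words, the first two paragraphs (3 + 2 words) share a chunk and the third starts a new one, so no paragraph is ever split.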

The agent loop gives the model up to 3 turns with each chunk, during which the model can modify that text using simple tools: find_and_replace (to make surgical changes) and replace_paragraph (for wider rewrites). If the model changes the text to what it was originally, the error is immediately counted as fixed. If the model changes it to something else, an LLM judge is used to decide if that alternative is also correct.
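The scoring rule for a single tracked error might look like the sketch below. The three outcome labels come from the rankings chart; everything else is an assumption for illustration: the function name, comparing exact strings, treating unchanged text as "Not Addressed", and the judge being a callable that returns True when an alternative fix is acceptable.

```python
from typing import Callable


def classify_issue(
    original: str,
    corrupted: str,
    final: str,
    judge: Callable[[str, str], bool],
) -> str:
    """Classify one tracked error after the model's turns on a chunk."""
    if final == original:
        # The model restored the source text exactly: counted as fixed, no judge call.
        return "Error Fixed"
    if final == corrupted:
        # The erroneous text was left untouched (assumed meaning of an omission).
        return "Not Addressed"
    # The model wrote something else: defer to the judge (an LLM in the benchmark).
    return "Error Fixed" if judge(original, final) else "Bad Fix"


# A stand-in judge for demonstration only; the real judge is an LLM.
def lenient_judge(original: str, candidate: str) -> bool:
    return candidate.lower() == original.lower()


outcome = classify_issue("from case to case", "from case by to case",
                         "From case to case", lenient_judge)
```

Checking for an exact restoration first keeps judge calls, and therefore cost and judge noise, limited to the cases where the model chose a different wording.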

Metrics

The headline Fix rate score is an equal-weighted per-dataset mean, not a pooled average over all runs. For each source dataset d, ErrataBench averages the successful-run quality scores for that dataset to get mean(q_d), then averages those dataset-level means across datasets. That keeps datasets with more successful repeats from counting more heavily than datasets with fewer.
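A small sketch of the difference between the equal-weighted per-dataset mean and a pooled average. The input shape, a list of (dataset, quality) pairs for successful runs, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def fix_rate(runs: list[tuple[str, float]]) -> float:
    """Average run quality within each dataset, then average those means."""
    by_dataset: dict[str, list[float]] = defaultdict(list)
    for dataset, quality in runs:
        by_dataset[dataset].append(quality)
    return mean([mean(scores) for scores in by_dataset.values()])


def pooled_rate(runs: list[tuple[str, float]]) -> float:
    """Plain average over all runs, shown only for contrast."""
    return mean([quality for _, quality in runs])


# Hypothetical scores: "law" has three successful runs, "novel" only one.
runs = [("law", 0.5), ("law", 0.75), ("law", 1.0), ("novel", 0.25)]
headline = fix_rate(runs)   # (0.75 + 0.25) / 2 = 0.5
pooled = pooled_rate(runs)  # 2.5 / 4 = 0.625
```

The pooled figure is pulled toward "law" because it has more successful repeats; the per-dataset mean gives both datasets the same weight.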

The other scatterplot metrics are derived similarly. Cost Efficiency is corrected issues per USD at the run level, then averaged by dataset and across datasets. Speed is corrected issues per minute of run time, aggregated the same way. Tool Call Efficiency is shown as issues fixed per 100 benchmark-scoped tool characters, which is an inverted display of the underlying tool-chars-per-resolved-issue metric. Turns Used is the mean number of turns per chunk across datasets.
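These run-level ratios and their aggregation could be computed along the following lines. The Run record and its field names are hypothetical, the sketch assumes nonzero cost, run time, and tool characters, and it assumes Tool Call Efficiency is inverted per run before averaging, which the description above does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Run:
    dataset: str
    fixed: int        # corrected issues in this run
    cost_usd: float   # total API cost of the run
    minutes: float    # run time in minutes
    tool_chars: int   # benchmark-scoped characters sent in tool calls


def dataset_mean(runs: list[Run], per_run: Callable[[Run], float]) -> float:
    """Compute a per-run value, average by dataset, then across datasets."""
    by_dataset: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        by_dataset[run.dataset].append(per_run(run))
    return mean([mean(values) for values in by_dataset.values()])


def cost_efficiency(runs: list[Run]) -> float:
    return dataset_mean(runs, lambda r: r.fixed / r.cost_usd)


def speed(runs: list[Run]) -> float:
    return dataset_mean(runs, lambda r: r.fixed / r.minutes)


def tool_call_efficiency(runs: list[Run]) -> float:
    # Issues fixed per 100 tool characters (inverse of chars per resolved issue).
    return dataset_mean(runs, lambda r: 100 * r.fixed / r.tool_chars)


# Hypothetical runs over two datasets.
demo_runs = [
    Run("a", 10, 2.0, 5.0, 500),
    Run("a", 20, 2.0, 5.0, 500),
    Run("b", 4, 1.0, 2.0, 400),
]
```

For these hypothetical runs, dataset "a" averages 7.5 issues per USD and "b" gives 4.0, so Cost Efficiency is 5.75.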

The scatterplot's consistency metric is median_d range(q_d), the median per-dataset range of quality. For each dataset, ErrataBench measures the spread between the highest and lowest successful-run quality score, then takes the median of those dataset-level ranges. In the UI this appears as Consistency. Lower values mean the model's proofreading quality is more consistent from run to run, while the median makes the metric less sensitive to a few noisy source texts.
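As a sketch, median_d range(q_d) can be computed as below. The input shape, a list of (dataset, quality) pairs for successful runs, and the function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def consistency(runs: list[tuple[str, float]]) -> float:
    """Median over datasets of (max - min) successful-run quality."""
    by_dataset: dict[str, list[float]] = defaultdict(list)
    for dataset, quality in runs:
        by_dataset[dataset].append(quality)
    ranges = [max(scores) - min(scores) for scores in by_dataset.values()]
    return median(ranges)


# Hypothetical scores: per-dataset ranges are 0.25, 0.0, and 0.75.
quality_runs = [
    ("a", 0.5), ("a", 0.75),
    ("b", 0.5), ("b", 0.5),
    ("c", 0.25), ("c", 1.0),
]
spread = consistency(quality_runs)
```

Dataset "c" is very noisy in this example, but the median of the three ranges is still 0.25, which shows how the median limits the influence of a single outlier text.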

Reproducing Results

Source code, dataset, and results are available on GitHub. Results can be reproduced using an OpenRouter API key. The repository also includes a tool for creating your own dataset using any source text, making it easy to run this benchmark against new samples.

The benchmark has two primary inputs that determine how the agent loop works: the chunk size and the number of turns per chunk. Changing these values can produce different results. Specifically, larger chunks and fewer turns favor models that do less reasoning; high-reasoning models tend to get lost when presented with larger chunks and fewer turns in which to edit them. The data published here was generated by running the benchmark with --samples 3 --chunk-size 2000 --max-turns-per-chunk 3, which was chosen as a reasonable general setting.

Error Categories

The full list of error categories used by ErrataBench can be found below.

  • Orthography and word form
    • Typographical errors: fat-finger typo; omission; insertion; transposition; duplication
    • Spelling and casing: nonword spelling error; real-word spelling error; capitalization error; spacing error; hyphenation or compounding error
    • Morphology and inflection: incorrect pluralization; incorrect conjugation; wrong word form
  • Lexical choice and confusability
    • Word selection: misused word; homophone confusion; homograph or near-neighbor confusion; malapropism; collocation error
    • Mishearing and reinterpretation: mondegreen; eggcorn
    • Idiomaticity: idiom error
  • Grammar and syntax
    • Agreement: subject-verb agreement; pronoun agreement
    • Verb system: incorrect tense; auxiliary or modal error
    • Clause and sentence structure: sentence fragment; run-on, fused sentence, or comma splice; word order error; extra word; parallelism error; relative-clause error; coordination or subordination error
    • Function words and reference: article or determiner error; preposition error; pronoun-case error; pronoun-reference error; dangling or misplaced modifier
    • Polarity and comparison: double negative; negation-scope error; comparative or superlative error; countability error
  • Punctuation and boundaries
    • punctuation error; apostrophe error; quotation or delimiter error; sentence-boundary error
  • Semantics, discourse, and style
    • Semantics: semantic anomaly; referential ambiguity
    • Discourse: discourse coherence error; cross-sentence tense inconsistency
    • Style and register: wordiness or redundancy; register mismatch; awkward phrasing; nonstandard dialect or colloquial form

Error Category Ranking

Hardest

Lexical Choice And Confusability: Idiomaticity

Examples
The balance of these policy objectives varies from case by to case, because they may often conflict.
It is evidently disproportionate to the limits which we must here prescribe to ourselves, to enumerate the events which it would be agreeable to the interests of mankind in large general that nations should regard as giving, and alone giving, commencement and termination to rights of dominion; because, in order to afford an enumeration which would be in any degree instructive, the reasons must be given why one set of events, and not another, should have the privilege in question conferred upon them.
Places actually blockaded, that is surrounded with an hostile force for the immediate purpose of being reduced, either by arms, or by famine, would still form exceptions; because the admission of ships into them, with supplies either of food, or munition of war, would be directly in at variance with the very object of the blockade.
The theoretical argument for why women want fewer children as their opportunities in the labor market rise can be explained by the same framework proposed by Gary Becker that I laid down out above.
The Christian Bible, for example, teaches to "be faithful fruitful and multiply and fill the earth and subdue it".
All model variants: 51.2%
Top models
Claude Opus 4.6 (High)
15 / 15 (100.0%)
Claude Opus 4.6 (Low)
15 / 15 (100.0%)
Claude Opus 4.6 (Medium)
15 / 15 (100.0%)
Gemini 3.1 Pro Preview (High)
15 / 15 (100.0%)
Qwen 3.5 397B A17b (High)
8 / 9 (88.9%)
Claude Opus 4.6 (None)
13 / 15 (86.7%)
Gemini 3.1 Pro Preview (Low)
12 / 15 (80.0%)
Gemini 3.1 Pro Preview (Medium)
12 / 15 (80.0%)
GPT 5.4 (Medium)
12 / 15 (80.0%)
GPT 5.4 Mini (Medium)
12 / 15 (80.0%)

Punctuation And Boundaries: Sentence Boundary Error

Examples
It does not mean the four forces are equal, it equal. It means the opposing forces are equal to, and thereby cancel, the effects of each other.
The designers determine how far the center of pressure (CP) will travel, it travel. It is important to understand that an aircraft's weight is concentrated at the CG and the aerodynamic forces of lift occur at the CP.
If such trade dress is at issue, add the following after the third paragraph of this instruction: Trade dress concerns the overall visual impression created in the consumer's mind when viewing the non-functional aspects of the product. And product and not from the utilitarian or useful aspects of the product.
Only two questions, of any great importance, appear to remain. That remain; that relating to the march of troops, for a hostile purpose, through a neutral country, and that relating to the extent to which the operations of a successful war ought to be pursued.
They cannot learn it from their parents, as less than 3% of mothers can pass a simple literacy test, this test. This study concluded that the quality of teaching was poor because "teachers are isolated, underequipped, receive salaries after long delays, and have little training."
All model variants: 56.2%
Top models
Gemini 3.1 Pro Preview (Low)
29 / 30 (96.7%)
GPT 5.4 Mini (High)
27 / 30 (90.0%)
Gemini 3 Flash Preview (High)
25 / 30 (83.3%)
Gemini 3.1 Pro Preview (Medium)
25 / 30 (83.3%)
GPT 5.4 Mini (Medium)
24 / 30 (80.0%)
Claude Opus 4.6 (None)
23 / 30 (76.7%)
GPT 5.4 (Medium)
23 / 30 (76.7%)
Gemini 3.1 Pro Preview (High)
22 / 30 (73.3%)
GPT 5.4 (High)
22 / 30 (73.3%)
Claude Opus 4.6 (Low)
21 / 30 (70.0%)

Semantics, Discourse, And Style: Discourse

Examples
A trademark identifies the source of goods. See Brookfield Commc'ns Inc. v. W. Coast Ent. Corp., 174 F.3d 1036, 1051 (9th Cir. 1999). And But it fails to serve its source-identifying function when the public has never seen it, for instance when registered for an Internet domain name.
Desertion must take place either from the ships of war of the belligerent, nor or from its merchant ships.
It will presently be seen how much of the benefit capable of being derived from an international code must be lost, unless if it is left destitute of a similar organ.
A person acquired acquires the right to exclude others from using the same mark or a similar mark that is likely to cause confusion in the marketplace by being the first to use it in the marketplace, or by using it before the alleged infringer.
If, then, we should suppose that it were enacted as the law of nations, that the property of individuals passing on the seas should be equally respected, in peace and in war, we proceeded may proceed to consider whether any disadvantage, nearly countervailing the general good, would thence accrue to the belligerents.
All model variants: 59.1%
Top models
Qwen 3.5 397B A17b (High)
9 / 11 (81.8%)
Claude Opus 4.6 (High)
12 / 15 (80.0%)
Claude Opus 4.6 (Low)
12 / 15 (80.0%)
Claude Opus 4.6 (Medium)
12 / 15 (80.0%)
Gemini 3 Flash Preview (High)
12 / 15 (80.0%)
Gemini 3.1 Pro Preview (High)
12 / 15 (80.0%)
Gemini 3.1 Pro Preview (Low)
12 / 15 (80.0%)
Gemini 3.1 Pro Preview (Medium)
12 / 15 (80.0%)
GPT 5.4 (High)
12 / 15 (80.0%)
Grok 4.20 (High)
12 / 15 (80.0%)

Grammar And Syntax: Function Words And Reference

Examples
This model has been adopted for operating theatre departments that care for patients within an average-sized new-build acute general hospital undergoing elective or emergency surgery as in-patients within an average-sized new-build acute general hospital that serves a population of circa 300,000.
Waiting People may wish to wait here for long periods of time, so comfortable seating is essential. Daylight should be provided if possible; however, soft lighting is an acceptable alternative.
33 The general waiting room used by family and friends should not double up as a breaking bad news room.
4.47 When a child is undergoing a procedure, the anaesthetic room is the final destination for the parents escorting their child.
If unavailable, re-usable accessories should be purchased that are autoclaveable.autoclaveable should be purchased.
All model variants: 65.8%
Top models
GPT 5.4 Mini (High)
108 / 123 (87.8%)
Gemini 3.1 Pro Preview (Medium)
103 / 123 (83.7%)
GPT 5.4 (Medium)
98 / 118 (83.1%)
Claude Opus 4.6 (High)
102 / 123 (82.9%)
Gemini 3.1 Pro Preview (Low)
102 / 123 (82.9%)
Gemini 3 Flash Preview (High)
97 / 118 (82.2%)
GPT 5.4 (High)
101 / 123 (82.1%)
Claude Opus 4.6 (Medium)
99 / 123 (80.5%)
Claude Opus 4.6 (Low)
98 / 123 (79.7%)
GPT 5.4 Mini (Medium)
97 / 123 (78.9%)

Semantics, Discourse, And Style: Semantics

Examples
The intent of the Congress in passing section 43(a) was to create a right of action for a competitor to stop . . . fair unfair competition in interstate commerce." U-Haul Int'l, Inc.
In the previous section, we have seen that it is easy costly to bring the entire education system to fruition.
During the demographic transition — when mortality was high, low, and fertility was high — Sundstrom and David (1988)33 identified the significant support children offered to their parents in the United States before the Civil War.
The researchers van Ginneken and Razzaque (2003)49 studied the inclining declining fertility rate in Bangladesh.
10 Economic goods and services are those that can be produced and that are abundant scarce in relation to the demand for them. They stand in contrast to free goods, like sunlight, which are abundant, or those many important aspects in our lives that cannot be produced, like friendships.
All model variants: 66.2%
Top models
Claude Opus 4.6 (None)
72 / 78 (92.3%)
Claude Opus 4.6 (Medium)
71 / 78 (91.0%)
GPT 5.4 (Medium)
53 / 59 (89.8%)
Gemini 3.1 Pro Preview (Medium)
70 / 78 (89.7%)
Claude Opus 4.6 (High)
69 / 78 (88.5%)
Claude Opus 4.6 (Low)
69 / 78 (88.5%)
Gemini 3.1 Pro Preview (Low)
68 / 78 (87.2%)
GPT 5.4 (Low)
66 / 78 (84.6%)
Gemini 3.1 Pro Preview (High)
65 / 78 (83.3%)
GPT 5.4 (None)
65 / 78 (83.3%)

Punctuation And Boundaries: Quotation Or Delimiter Error

Examples
These facts are relevant to whether the defendant is liable for: [(1) infringing plaintiff's registered trademark rights, by using a trademark in a manner likely to cause confusion among consumers;] [(2) unfairly competing, by using a trademark in a manner likely to cause confusion as to the origin or quality of plaintiff's goods;) ] [(3) unfairly competing, by using trade dress in a manner likely to cause confusion as to the origin or quality of plaintiff's goods;] [(4) infringing plaintiff's trade name, by using similar corporate, business or professional names in a manner likely to cause confusion about the source of products in the minds of consumers;] [(5) false advertising, by making a false statement that was material and that tended to deceive consumers, injuring the plaintiff in the market.]
Every law imports, "that that something is to be done; or to be left undone.
It is unnecessary to pursue any farther the analysis of this "extraordinary extraordinary hypothesis. It is evident from what has been said, that it is full of impracticabilities.
There would also be certain other injuries pointed out, of a more doubtful character, which might, or might not, according to circumstances not easy to define), , be such as to justify recourse to war.
"As As there can be no reason why the demand of compensation should not always precede the use of arms, except in cases of such a necessity as will not allow time for demanding compensation—a necessity for the immediate use of arms, in order to prevent an evil immediately impending—those cases of urgent necessity should, as far as possible, be sought out, and defined.
All model variants: 66.5%
Top models
GPT 5.4 (Medium)
30 / 33 (90.9%)
Gemini 3 Flash Preview (High)
29 / 33 (87.9%)
GPT 5.4 (High)
29 / 33 (87.9%)
GPT 5.4 (None)
29 / 33 (87.9%)
Claude Haiku 4.5 (None)
27 / 33 (81.8%)
Gemini 3 Flash Preview (None)
27 / 33 (81.8%)
Gemini 3.1 Pro Preview (High)
27 / 33 (81.8%)
Gemini 3.1 Pro Preview (Low)
27 / 33 (81.8%)
GPT 5.4 (Low)
27 / 33 (81.8%)
Mimo V2 Pro (Low)
27 / 33 (81.8%)

Punctuation And Boundaries: Punctuation Error

Examples
Likewise, if the engine power is increased, thrust, becomes greater than drag and the airspeed increases.
Notice how the flat plate in Figure 57 5-7 causes the air to swirl around the edges until it eventually rejoins downstream.
This flow of air results in "spillage" over the tips , thereby setting up a whirlpool of air called a vortex.
But the question, in settling the difficulties of international jurisprudence , is not whether an advantage is gained, but whether the advantage, such as it is, be not gained, at too great a cost.
It was Spitz . The rabbit could not turn, and as the white teeth broke its back in mid air it shrieked as loudly as a stricken man may shriek.
All model variants: 71.7%
Top models
GPT 5.4 (Medium)
15 / 15 (100.0%)
GPT 5.4 Mini (Medium)
15 / 15 (100.0%)
Gemini 3 Flash Preview (High)
14 / 15 (93.3%)
Gemini 3.1 Pro Preview (High)
14 / 15 (93.3%)
GPT 5.4 Mini (High)
14 / 15 (93.3%)
Grok 4.20 (High)
14 / 15 (93.3%)
Grok 4.20 (Medium)
14 / 15 (93.3%)
Qwen 3.5 397B A17b (High)
11 / 12 (91.7%)
Claude Haiku 4.5 (Medium)
13 / 15 (86.7%)
GPT 5.4 (High)
13 / 15 (86.7%)

Grammar And Syntax: Clause And Sentence Structure

Examples
Experience is that which in this part we must depend on, on. And it were to be wished that it were more improved.
Hitherto we have exammined the extent of our knowledge, in respect of the several sorts of beings that are, there are. There is another extent of it, in respect of universality, which will also deserve to be considered ; and in this regard, our knowledge follows the nature of our ideas.
A trademark infringement case can be brought under three different causes of action: (1) statutory trademark infringement, (2) common law trademark infringement, and (3) unfair competition, although competition. Although elements of a claim in trademark may overlap with a claim in copyright, the acts do not preempt each other. See Polar Bear Prods.
If this were done, an international code would be composed, in which the rights of dominion would be accurately defined; and to determine any question about boundaries, or about the degree of dominion, nothing farther would then be necessary than an adequate inquiry respecting the state of the facts.
The first principle with regard to the sea is this, that all nations have an equal right to the use of it the it. The utility of recognizing this principle is so apparent, that it has never been the subject of any dispute.
All model variants: 73.4%
Top models
Gemini 3.1 Pro Preview (High)
242 / 258 (93.8%)
GPT 5.4 (Medium)
220 / 237 (92.8%)
Gemini 3.1 Pro Preview (Low)
238 / 258 (92.2%)
GPT 5.4 Mini (Medium)
238 / 258 (92.2%)
GPT 5.4 Mini (High)
236 / 258 (91.5%)
Gemini 3.1 Pro Preview (Medium)
235 / 258 (91.1%)
Gemini 3 Flash Preview (High)
212 / 237 (89.5%)
GPT 5.4 (Low)
226 / 258 (87.6%)
GPT 5.4 (High)
221 / 258 (85.7%)
GPT 5.4 (None)
221 / 258 (85.7%)

Grammar And Syntax: Agreement

Examples
We have now considered, though in a very general manner (and our limits preclude us from attempting any thing more), the mode in which nations should agree about the rights of one another (in other words, the laws it they should establish), in as far as the property of individuals' belonging to them, is concerned.
Though it would not be correct to say, that these do not contribute, or rather that it they may not be made to contribute, to the means with which the government carries on the war; yet it would be absurd not to recognize a very broad distinction between them, and the men and things which are immediately applied, or applicable to the war.
Better education also improves knowledge, the use of contraceptives, and the ability of better-educated women to reduce the gap between her their desired number of children and the actual number of children they have.
As we would expect from the theory above, it they have had much fewer children. Where women have more than eight years of education, the fertility rate is below four children per woman, and in many countries, it is below two.
Unplanned births include that those occurring two or more years sooner than desired ("mistimed") and those that were not wanted at all by the mother ("unwanted").
All model variants: 74.0%
Top models
Gemini 3 Flash Preview (High)
66 / 66 (100.0%)
Gemini 3.1 Pro Preview (High)
62 / 66 (93.9%)
GPT 5.4 (Medium)
60 / 66 (90.9%)
Grok 4.20 (High)
59 / 65 (90.8%)
GPT 5.4 (High)
59 / 66 (89.4%)
Grok 4.1 Fast (Medium)
59 / 66 (89.4%)
Qwen 3.5 27B (Medium)
59 / 66 (89.4%)
GPT 5.4 (Low)
58 / 66 (87.9%)
GPT 5.4 (None)
58 / 66 (87.9%)
Grok 4.1 Fast (High)
58 / 66 (87.9%)

Grammar And Syntax: Polarity And Comparison

Examples
Several studies go one step further — they not only look at two variables but also do not control for possibly confounding variables.
To solve the problems we face, it is not enough to not increase overall production. We also need to make good decisions about which goods and services we want to produce more of and which ones we want less of.
20 In order to preserve privacy and dignity of patients, reduce the noise levels and reduce the risk of cross-infection, personnel who are directly not directly involved with patient care should be accommodated outside patient areas.
62 It is essential not essential to have a door between the scrub room and theatre. If one is provided, it should be an automatic self-closing door to prevent scrubbed staff from re-contaminating their hands.
116 Many units have alternative chemical substances for disinfection. If not still used, glutaraldehyde is a hazardous substance. It is recognised to be toxic-irritant and allergenic.
All model variants: 79.2%
Top models
Gemini 3 Flash Preview (High)
51 / 53 (96.2%)
GPT 5.4 (Medium)
51 / 53 (96.2%)
Claude Opus 4.6 (None)
54 / 57 (94.7%)
Gemini 3.1 Pro Preview (Low)
54 / 57 (94.7%)
Gemini 3.1 Pro Preview (Medium)
53 / 57 (93.0%)
Qwen 3.5 397B A17b (High)
32 / 35 (91.4%)
Claude Opus 4.6 (Medium)
52 / 57 (91.2%)
Claude Opus 4.6 (High)
52 / 57 (91.2%)
Claude Opus 4.6 (Low)
52 / 57 (91.2%)
Qwen 3.5 397B A17b (Medium)
41 / 45 (91.1%)