QED Score Read 57,455 Preprints. It Beat Journal Rank 75% of the Time.

QED Score graded 57,455 life-science preprints and, in blind head-to-head tests, experts chose the AI’s pick over journal rank 75% of the time.

A new artificial intelligence tool called QED Score graded 57,455 life-science preprints posted between May 2025 and April 2026. In a separate validation study built from paper pairs where the score disagreed with the journal rank of each paper’s eventual venue, blinded domain experts sided with the AI in 75% of decisive judgements. The result is laid out in a white paper from QED Science published this week, and it puts a single automated metric ahead of journal rank in head-to-head tests.

The 75% figure is not a marketing claim. It comes from a study in which QED Science paired papers where the AI score and the journal rank of each paper’s eventual venue pointed in opposite directions, then asked expert professors to pick the better paper without seeing either signal.

What QED Score Is and Why It Exists

QED is the work of Oded Rechavi, a molecular biologist at Tel Aviv University, and a team of scientists and engineers. Its name comes from the Latin phrase quod erat demonstrandum, “which was to be demonstrated,” traditionally signed at the end of mathematical proofs.

Rechavi built the first version of QED last year as an AI assistant for reviewing scientific manuscripts. Users upload a paper or an early draft, and the system returns a structured “claim tree” that breaks the work into individual assertions, checks each for logical soundness, and flags experiments that could strengthen weak points. According to QED Science, the company behind the tool, the review report is typically produced within about 30 minutes.

The bottleneck the team is responding to is real and growing. With more than 1.5 million papers published in biomedicine and life sciences every year, and a mean journal acceptance rate of roughly 32%, an estimated 4.7 million manuscripts enter the submission pipeline annually. First peer-reviewed decisions in health and biomedical journals take a median of 60 days, and full submission-to-publication times can stretch from about three months to nearly two years.
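The 4.7 million figure appears to follow from dividing annual published output by the mean acceptance rate. The sketch below reproduces that arithmetic under one simplifying assumption that is not stated in the article: each published paper corresponds to exactly one accepted submission. The variable names are illustrative.

```python
# Figures quoted in the article.
published_per_year = 1_500_000   # papers published annually in biomedicine and life sciences
mean_acceptance_rate = 0.32      # mean journal acceptance rate (roughly 32%)

# Assumption (not from the article): one accepted submission per published paper,
# so total submissions = published papers / acceptance rate.
estimated_submissions = published_per_year / mean_acceptance_rate

print(f"{estimated_submissions:,.0f} submissions per year")
```

The result lands just under 4.7 million, in line with the estimate quoted above.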

Inside the Multi-Agent Reading Engine

QED Score works by decomposing each manuscript into its claims and the experiments supporting them, then running a panel of specialized AI agents in parallel over that structure. The pipeline was designed to score originality and validity, the two dimensions validated in the company’s three published studies.

Each agent examines a different dimension: inconsistencies across figures, statistical rigor, contradictions with the existing literature, alternative hypotheses, and adherence to field reporting standards. Their findings pass through a verification layer and a scoring layer before an aggregator synthesizes them into a single calibrated score. Every manuscript is anonymized before scoring. Author names and institutional affiliations are stripped, and the system does not know whether the paper came from a well-known lab or one it has never heard of.
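The white paper's description (parallel agents, a verification layer, then an aggregator) can be pictured with a toy pipeline. Everything in the sketch below is hypothetical: the agent functions, the `Finding` fields, the confidence threshold, and the aggregation formula are invented for illustration and are not QED Science's implementation. A real agent would call a language model rather than return fixed values.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from statistics import mean


@dataclass
class Finding:
    agent: str         # which hypothetical agent raised the issue
    severity: float    # 0.0 (trivial) to 1.0 (fatal); invented scale
    confidence: float  # agent's self-reported confidence; invented


def figure_consistency_agent(claims):
    # Placeholder logic standing in for a model call.
    return [Finding("figures", 0.4, 0.9)] if len(claims) > 2 else []


def statistics_agent(claims):
    return [Finding("statistics", 0.7, 0.6)]


def literature_agent(claims):
    return [Finding("literature", 0.2, 0.3)]


AGENTS = [figure_consistency_agent, statistics_agent, literature_agent]


def verify(findings, min_confidence=0.5):
    """Verification layer (assumed): drop low-confidence findings."""
    return [f for f in findings if f.confidence >= min_confidence]


def aggregate(findings):
    """Aggregator (assumed): 1 minus mean severity, or 1.0 if nothing survives."""
    return 1.0 - mean(f.severity for f in findings) if findings else 1.0


def score_manuscript(claims):
    # Run every agent over the same claim structure in parallel.
    with ThreadPoolExecutor() as pool:
        per_agent = list(pool.map(lambda agent: agent(claims), AGENTS))
    findings = [f for batch in per_agent for f in batch]
    return aggregate(verify(findings))


claims = ["claim A", "claim B", "claim C"]
print(round(score_manuscript(claims), 2))
```

The point of the sketch is the shape of the flow, not the numbers: independent checks run side by side, weak findings are filtered out, and only then is a single score produced.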

The score itself reflects two dimensions rather than one. Originality measures how far the finding advances what the field already knows, weighting genuine conceptual leaps over incremental refinements. Validity measures whether the evidence actually supports the conclusions drawn. Both are scored across multiple factors, not a single judgement.

The architecture is, in the company’s framing, deliberately distinct from prior systems. “QED Score differs from prior efforts to assess scientific work automatically in both domain and difficulty,” the white paper states, noting that most prior systems were built and validated on computer-science conference submissions rather than the wet-lab, mechanistic claims of the experimental life sciences.

What Three Validation Studies Found

Before deploying the score at scale, QED Science ran three independent validation studies, each from a different angle. The full results are in the QED Score white paper, published alongside the launch of The 1% ranking.

In the first study, the team obtained a professionally labelled corpus of 925 published papers from 185 authors, each rated as Limited, Satisfactory, or Strong by a panel of domain experts. QED Score separated Limited papers from the rest with an AUC of 0.867. On the same corpus, the SCImago Journal Rank separated them with an AUC of 0.804. For identifying Strong papers, the gap narrowed, with QED Score at 0.782 and journal rank at 0.774.
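For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen paper from one group receives a higher score than a randomly chosen paper from the other group, with 0.5 meaning no separation and 1.0 meaning perfect separation. The sketch below computes it by brute force on invented toy scores; none of the numbers come from the white paper.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented toy scores, not data from the white paper.
not_limited = [72, 65, 88, 54, 91]   # papers experts rated Satisfactory or Strong
limited = [40, 58, 33, 61]           # papers experts rated Limited

print(auc(not_limited, limited))
```

In this toy set, 18 of the 20 possible pairs are ordered correctly, so the AUC is 0.9.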

  • 925 papers in the labelled reference corpus (185 authors)
  • AUC 0.867 for QED Score vs 0.804 for journal rank at identifying Limited papers
  • Spearman ρ = 0.63 between 2,879 preprint QED Scores and the SCImago Journal Rank of their eventual publication venues
  • 100 paper pairs sent to blinded experts where QED Score and journal rank disagreed
  • 75% of 60 decisive expert judgements went to the QED-favored paper (95% Wilson CI 63%-84%)

In the second study, the team scored 4,953 bioRxiv preprints from April 2025 using language models whose knowledge cut-offs predated the corpus, then tracked where those papers were eventually published. Of the 4,953 preprints, 2,879 were matched to a published version with an associated SJR value. The correlation between the preprint QED Score and the eventual journal’s SCImago Journal Rank was Spearman ρ = 0.63, described in the white paper as substantial agreement between an AI assessment of unreviewed work and the outcome of formal peer review months later.
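Spearman's ρ measures how well two rankings agree: each list is converted to ranks, and the ordinary Pearson correlation is computed on those ranks. The sketch below shows the calculation on six invented score and SJR values; it is a generic illustration, not the white paper's analysis code.

```python
def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))


# Invented toy values, not data from the white paper.
preprint_scores = [55, 80, 62, 91, 47, 70]
venue_sjr = [1.2, 3.0, 4.1, 6.5, 0.9, 2.2]
print(round(spearman(preprint_scores, venue_sjr), 3))
```

Because only ranks matter, the statistic is insensitive to the very different scales of a 0-100 style score and a journal's SJR value.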

The third study asked what happens when the score and journal rank disagree. The team constructed 100 paper pairs in which QED Score and journal rank pointed in opposite directions, then sent them to blinded domain-expert professors. Of 70 confident judgements, 60 were decisive. In 75% of those, experts preferred the paper QED had rated more highly, three times as often as they preferred the journal-rank-favored paper. The 95% Wilson confidence interval ran 63% to 84%, with a two-sided exact sign test p-value below 0.001.
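The reported interval and p-value can be checked from the two numbers given. The sketch below assumes that 75% of 60 decisive judgements means exactly 45 went to the QED-favored paper; the function names are illustrative, and the formulas are the standard Wilson score interval and the exact binomial sign test against a 50/50 null.

```python
from math import comb, sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def sign_test_two_sided(successes, n):
    """Exact two-sided sign test against a 50/50 null hypothesis."""
    k = max(successes, n - successes)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * upper_tail)


decisive = 60        # decisive judgements reported in the article
qed_favored = 45     # assumption: 75% of 60 means exactly 45 judgements

low, high = wilson_interval(qed_favored, decisive)
p_value = sign_test_two_sided(qed_favored, decisive)
print(f"95% Wilson CI: {low:.1%} to {high:.1%}")
print(f"two-sided exact sign test p = {p_value:.2g}")
```

Under that assumption the interval rounds to 63% to 84% and the p-value falls below 0.001, consistent with the figures quoted from the white paper.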

Reading 57,455 Preprints to Surface The 1%

Following validation, QED Score was applied at scale. Between May 2025 and April 2026, the system scored 57,455 bioRxiv preprints, a near-complete census of life science preprint output over a full year.

Source                     Approximate acceptance rate
The 1% (QED Score)         1%
Nature                     roughly 8%
Science                    around 6%
Mean biomedical journal    approximately 32%

The papers scoring at the top 1% of that distribution are labeled “The 1%” by QED Science, and according to the company, the ranking is more selective than any major life science journal. The team scored works across all 25 life science disciplines covered by the bioRxiv taxonomy, from Genetics and Molecular Biology to Ecology and Systems Biology, with field representation varying accordingly.

The same scoring run, QED Science estimates, would have required between 860,000 and 1,030,000 hours of expert time under traditional peer review, the equivalent of more than 400 researcher-years. QED Score completed the equivalent assessments in hours, the company says.

The Bottleneck QED Was Built to Ease

The bottleneck is not theoretical. QED Science cites an estimated 35 to 40 million hours of peer review labour annually in the life sciences, and over 130 million hours across academic fields globally, applying published estimates of submission volumes, review rates, and reviewer time to the current annual output of 1.5 million papers.

In a typical case, the company says, reading a paper would take roughly eight hours, and three reviewers are needed. Multiplied across the 4.7 million manuscripts that enter the submission pipeline each year, that arithmetic puts the human cost of running the current system at more than 100 million hours annually.
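The multiplication is simple enough to check directly. The sketch below uses only the three inputs quoted above and assumes, as a simplification not stated in the article, that every manuscript is reviewed exactly once by three reviewers, ignoring desk rejections and resubmissions.

```python
# Inputs quoted in the article.
manuscripts_per_year = 4_700_000   # manuscripts entering the submission pipeline
hours_per_review = 8               # rough reading time per reviewer
reviewers_per_manuscript = 3

# Assumption (not from the article): each manuscript gets one full round of review.
total_hours = manuscripts_per_year * hours_per_review * reviewers_per_manuscript

print(f"{total_hours:,} reviewer-hours per year")
```

With these inputs the total comes to about 113 million reviewer-hours a year.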

QED does not aim to replace peer review, the company has stressed. The tool is built to act on the work before formal review begins, giving authors a structured signal about their claims while reviewers remain scarce. In November 2025, bioRxiv launched a pilot pipeline allowing authors to send submissions directly to QED Science from the preprint server’s author area, with the report produced within about 30 minutes of upload.

The QED team has continued to iterate. On November 6, 2025, Rechavi announced a collaboration between QED and openRxiv. Researchers can run their work through QED before posting a preprint on the site, and the company plans to track how authors improve their manuscripts in response to QED reports.

Independent Researchers Put It to the Test

Independent researchers who have used the tool describe a mix of relief and reserve, according to a feature in The Scientist that gathered reactions from beta testers and recent users.

Michał Turek, a biologist at the Institute of Biochemistry and Biophysics of the Polish Academy of Sciences, tested a manuscript he was working on as part of an August 2025 beta group. Despite large language models’ reputation for hallucinated citations, he said QED “gave pretty accurate suggestions on what you should do to support your claim,” and that the tool’s ability to position a piece of research within the current state of knowledge was something other language models were not doing reliably. Maria Elena de Bellard, a neurodevelopmental biologist at California State University, Northridge, was more direct. “ChatGPT will think for me, but QED makes me think,” de Bellard said.

Mark Hanson, an immunologist and evolutionary biologist at the University of Exeter, was less enthusiastic. He uploaded a previously published paper to QED and another AI peer-review agent. While he said QED “is doing something quite interesting in the power of how it is able to digest information,” he added that the suggestions were “not original” and offered an unsparing summary. “The AI does a great job of being an average critical thinker,” Hanson said.

The Limits of a Single Metric

The pushback is not just hypothetical. The validation work was done by QED Science itself, and the 15 domain experts who participated in the head-to-head study reviewed only the 100 paper pairs QED constructed.

Rechavi has been direct that QED Score is not a substitute for peer review. “Peer review should be done,” he said. “It’s great. It’s just not always available, and it fails often.” The company frames QED as a complementary tool that gives authors and hiring committees an early signal on preprints, where journal rank does not yet exist, rather than as a replacement for expert evaluation.

Even the strongest published critique is internal. The 75% expert preference is a meaningful signal, but it is also a number produced by 60 decisive judgements across 100 paper pairs designed to surface disagreement between QED and journal rank; it is not a population-level claim about every paper the system scores. Rechavi acknowledged the system is imperfect. “We know that it doesn’t capture [everything] in the world; no one score can,” he said. “But the nice thing about AI evaluations is that it’s much easier to improve them.” According to QED Science, as of May 31, 2026 the platform was in use by more than 10,000 laboratories across over 1,500 institutions in more than 70 countries.

Frequently Asked Questions

How does QED Score work?

QED Score is an AI metric that reads a manuscript in anonymized form and grades it on originality and validity using a multi-agent system. The score is expressed as a percentile rank, comparable across fields and over time. QED Science says it produces a report for each paper within about 30 minutes of upload. By the company’s estimate, traditional peer review of the same 57,455-paper corpus would have required 860,000 to 1,030,000 hours of expert time.

Is QED Score better than journal rank?

In QED Science’s three validation studies, the AI score separated Limited papers from Strong or Satisfactory ones with an AUC of 0.867, against 0.804 for the SCImago Journal Rank on the same corpus. Across 2,879 bioRxiv preprints matched to their eventual publication venues, the preprint QED Score correlated with the venue’s SJR at Spearman ρ = 0.63. In 100 head-to-head paper pairs where the score and the journal disagreed, blinded domain experts chose the QED-favored paper in 75% of 60 decisive cases.

Can QED Score replace peer review?

No. Rechavi has said QED Score is a complement to peer review, not a replacement. Peer review, in his framing, should be done and is great; it is just not always available and fails often. Independent researchers, including those quoted in The Scientist’s reporting, have noted that any single metric is too narrow for evaluating scientific work, even one as well-validated as QED Score.

Who built QED Score and who uses it?

QED Score was developed by QED Science, the company founded by Oded Rechavi, a molecular biologist at Tel Aviv University. According to the company’s own reporting, as of May 31, 2026 the platform was in use by more than 10,000 laboratories across over 1,500 institutions in more than 70 countries. Authors can submit papers to QED directly through bioRxiv following a pilot integration announced November 4, 2025.

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and freelance bylines covering Android devices, privacy-focused software, and budget gadgets. Logan founded Oton Technology to publish clear, no-nonsense tech news and reviews based on real hands-on testing. He has personally tested and reviewed dozens of mid-range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor's degree in English and studied digital marketing at a certificate level.
