Popular AIs head-to-head: OpenAI beats DeepSeek on sentence-level reasoning

AI Chatbot Citation Accuracy: A New Benchmark for Evaluating Reasoning

ChatGPT and similar AI chatbots employing large language models are known to occasionally fabricate information, including scientific and legal citations. New research indicates that evaluating the accuracy of an AI model’s citations is an effective method for assessing its reasoning capabilities. This finding highlights the crucial link between citation precision and the overall reliability of AI-generated information, especially in critical domains like research and law.

Understanding AI Reasoning through Citation Accuracy

An AI model ā€œreasonsā€ by dissecting a query into sequential steps and processing them systematically, akin to how individuals learn to solve mathematical problems. This step-by-step approach is fundamental to how these models process information and arrive at conclusions.

Ideally, for citation generation, an AI model should comprehend the core ideas within a document, produce a ranked list of pertinent scholarly works, and offer compelling justifications for how each proposed paper bolsters the corresponding text. It should underscore specific connections between the text and the cited research, explaining the relevance of each source.
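
To make this concrete, here is a minimal sketch, in Python, of the kind of structured output an ideal citation-generating model could return: a ranked list of candidate papers, each paired with a justification tying it back to the text. The class names, fields, and example entry are illustrative assumptions, not the format used by any particular model or by the Reasons benchmark.

```python
# Illustrative data structure for ranked citation suggestions with reasoning.
# All names and the example entry are hypothetical.
from dataclasses import dataclass

@dataclass
class CitationSuggestion:
    rank: int           # 1 = most relevant candidate paper
    title: str          # title of the proposed scholarly work
    justification: str  # how this paper supports the specific passage

@dataclass
class CitedPassage:
    text: str                              # the sentence or passage needing support
    suggestions: list[CitationSuggestion]  # ranked candidate citations

example = CitedPassage(
    text="Attention declines when people switch rapidly between on-screen tasks.",
    suggestions=[
        CitationSuggestion(
            rank=1,
            title="(invented example) Task Switching and Sustained Attention in HCI",
            justification="Reports measurements of attention loss during rapid "
                          "task switching, which directly supports the claim.",
        ),
    ],
)

if __name__ == "__main__":
    for s in example.suggestions:
        print(f"{s.rank}. {s.title}\n   why: {s.justification}")
```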

The central question is whether current models can be relied upon to establish these connections and provide clear reasoning that validates their source selections. This inquiry extends beyond mere citation accuracy, addressing the broader issue of how dependable and precise large language models are for any information retrieval purpose. Citation accuracy, therefore, serves as a key indicator of a model’s overall competence and reliability in handling and processing information.

Introducing the Reasons Benchmark

As a computer scientist, I worked with fellow researchers from the AI Institute at the University of South Carolina, Ohio State University, and the University of Maryland, Baltimore County to develop the Reasons benchmark. This tool is designed to evaluate how effectively large language models can automatically generate research citations and provide comprehensible reasoning.

We employed the benchmark to assess the performance of two widely used AI reasoning models: DeepSeek’s R1 and OpenAI’s o1. While DeepSeek gained attention for its impressive efficiency and cost-effectiveness, it still needs to improve to match OpenAI’s reasoning capabilities.

Sentence-Level Citation Accuracy

Citation accuracy depends heavily on whether the AI model analyzes information at the sentence level rather than at the paragraph or document level. Paragraph-level and document-level citation amounts to feeding a large amount of information into a large language model and asking it to return many citations at once.

In this approach, the large language model tends to overgeneralize and misinterpret specific sentences. Consequently, users receive citations that pertain to the entire paragraph or document, rather than the detailed information contained within the sentence.

Furthermore, reasoning quality diminishes when a large language model is tasked with processing an entire document. These models predominantly rely on memorizing patterns, which they more readily identify at the beginning and end of longer texts compared to the middle. This limitation hinders their capacity to fully grasp all critical information throughout an extensive document.

Large language models experience confusion due to the information density in paragraphs and documents, which adversely affects both citation generation and the reasoning process. As a result, reasoning based on paragraphs and documents becomes more akin to summarizing or paraphrasing information.

The Reasons benchmark specifically addresses this deficiency by scrutinizing large language models’ citation generation and reasoning abilities at the sentence level.
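
As a rough sketch of the difference in setup, the snippet below splits a paragraph into sentences and asks a model for a citation and its reasoning one sentence at a time, rather than handing over the whole paragraph at once. The sentence splitter and the `ask_model` call are hypothetical stand-ins, not the benchmark’s actual code.

```python
# Sketch of sentence-level citation prompting (contrast with paragraph-level,
# where the whole text is sent in one request). `ask_model` is a placeholder
# for a call to whatever LLM API is being evaluated.
import re

def split_sentences(paragraph: str) -> list[str]:
    # Naive splitter on sentence-ending punctuation; good enough for illustration.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM API call")

def cite_sentence_by_sentence(paragraph: str) -> list[dict]:
    results = []
    for sentence in split_sentences(paragraph):
        prompt = (
            "Suggest the most relevant research paper for the sentence below "
            "and explain how it supports it.\n\n"
            f"Sentence: {sentence}"
        )
        results.append({"sentence": sentence, "model_output": ask_model(prompt)})
    return results
```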

Testing Methodology and Results

Following the launch of DeepSeek R1 in January 2025, we sought to evaluate its accuracy in generating citations and the quality of its reasoning, and to compare it against OpenAI’s o1 model. We constructed a paragraph composed of sentences from diverse sources, presented individual sentences from this paragraph to the models, and requested citations and reasoning.

To initiate our evaluation, we developed a focused testbed comprising approximately 4,100 research articles across four core subjects related to human brains and computer science: neurons and cognition, human-computer interaction, databases, and artificial intelligence. We assessed the models using two metrics: the F-1 score, quantifying citation accuracy, and the hallucination rate, measuring the soundness of the model’s reasoning—specifically, how often it produces inaccurate or misleading responses.
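
As a rough illustration of these two metrics, the sketch below assumes a standard precision-recall definition of the F-1 score over cited papers and treats the hallucination rate as the fraction of responses judged inaccurate or misleading; the benchmark’s exact formulas may differ.

```python
# Hedged sketch of the two reported metrics; these definitions are assumptions,
# not the Reasons benchmark's published implementation.

def f1_score(predicted: set[str], relevant: set[str]) -> float:
    """Harmonic mean of precision and recall over the set of cited papers."""
    if not predicted or not relevant:
        return 0.0
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def hallucination_rate(judged_hallucinated: list[bool]) -> float:
    """Fraction of responses judged inaccurate or misleading."""
    if not judged_hallucinated:
        return 0.0
    return sum(judged_hallucinated) / len(judged_hallucinated)

# Toy usage with made-up values:
print(f1_score({"paper_a", "paper_b"}, {"paper_a", "paper_c"}))  # 0.5
print(hallucination_rate([True, False, False, True]))            # 0.5
```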

Our tests revealed notable performance variations between OpenAI o1 and DeepSeek R1 across different scientific fields. OpenAI’s o1 demonstrated proficiency in linking information across subjects, such as connecting research on neurons and cognition to human-computer interaction and then to concepts in artificial intelligence, while maintaining accuracy. Its performance metrics consistently surpassed DeepSeek R1’s across all evaluation categories, particularly in minimizing hallucinations and successfully executing assigned tasks.

OpenAI o1 exhibited superior ability in semantically integrating ideas, whereas R1 prioritized generating a response for every attribution task, which consequently increased hallucinations during reasoning. OpenAI o1 had a hallucination rate of around 35%, compared to DeepSeek R1’s rate of nearly 85% in the attribution-based reasoning task.

Performance Metrics: F-1 and BLEU Scores

In terms of accuracy and linguistic competence, OpenAI o1 achieved a score of about 0.65 on the F-1 test, indicating it was correct approximately 65% of the time in answering questions. It also scored around 0.70 on the BLEU test, which compares a model’s output with human-written reference text to gauge how natural its writing sounds. These scores are considered quite favorable.

DeepSeek R1 achieved lower scores, with roughly 0.35 on the F-1 test, meaning it was correct about 35% of the time. Furthermore, its BLEU score was only about 0.2, suggesting its writing was less natural-sounding than that of OpenAI’s o1. This comparison highlights o1’s advantage in presenting information in clear, natural language.
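
For readers unfamiliar with BLEU, the short example below shows how such a score can be computed with NLTK; the reference and candidate sentences are invented purely to show the mechanics and are not drawn from the study.

```python
# Toy BLEU computation with NLTK; the strings here are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cited paper supports the claim about sentence level accuracy".split()
candidate = "the cited paper supports the claim about accuracy at sentence level".split()

score = sentence_bleu(
    [reference],                                     # one or more reference token lists
    candidate,                                       # model-generated token list
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores on short texts
)
print(f"BLEU: {score:.2f}")
```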

OpenAI’s Competitive Edge

On other benchmarks, DeepSeek R1 performs comparably to OpenAI o1 in math, coding, and scientific reasoning tasks. However, the significant disparity observed in our benchmark implies that o1 offers more dependable information, while R1 struggles with factual consistency.

Although our comprehensive testing included other models, the performance difference specifically between o1 and R1 underscores the current competitive landscape in AI development. OpenAI’s offering maintains a considerable advantage in reasoning and knowledge integration capacities.

These findings suggest that OpenAI presently holds an advantage in source attribution and reasoning, potentially due to the nature and volume of its training data. The company recently announced its deep research tool, capable of generating reports with citations, answering follow-up questions, and providing reasoning for its responses.

Caveats and Future Outlook

The ultimate utility of this new tool for researchers remains to be seen, but a crucial reminder persists for everyone: always double-check all citations provided by AI.


šŸ• Top News in the Last Hour By Importance Score

# Title šŸ“Š i-Score
1 Top Australian soldier loses appeal over war crimes defamation case 🟢 82 / 100
2 Scientists say they've finally discovered cause of long Covid… and its terrifying link to dementia šŸ”“ 75 / 100
3 Trump news at a glance: top court divided on White House’s birthright citizenship restrictions šŸ”“ 72 / 100
4 Nine million Brits risk being turned away at airports due to four passport issues šŸ”“ 72 / 100
5 Justin Bieber proclaims he’s ā€˜not among’ Sean ā€˜Diddy’ Combs’ victims as rap mogul faces sex-trafficking trial šŸ”“ 65 / 100
6 Trump strikes ā€˜historic’ deal with UAE to build biggest AI campus outside US šŸ”“ 65 / 100
7 xAI blames Grok’s obsession with white genocide on an ā€˜unauthorized modification’ šŸ”“ 62 / 100
8 Western Bulldogs AFL coach Luke Beveridge promises he'll get revenge after David Koch made a shocking remark about his team šŸ”µ 55 / 100
9 Misery of the abandoned orcas: Killer whales and dolphins are left trapped in shut-down marine park – as campaigners plead for them to be saved šŸ”µ 55 / 100
10 Alibaba partly punctures China’s AI hopes — TradingView News šŸ”µ 55 / 100

View More Top News āž”ļø