GraphOmni: A Comprehensive Benchmark for LLMs on Graph-theoretic Tasks

1CUHK(SZ), 2University of Waterloo, 3UdeM/Mila, 4CityUHK, 5Autodesk, 6University of Sydney
† Equal Contribution, ✉ Corresponding Authors

TL;DR

GraphOmni delivers the most comprehensive evaluation of LLMs on graph reasoning tasks.

GraphOmni Overview
Overview of the GraphOmni benchmark. The benchmark evaluates LLMs across diverse graph types, serialization formats, and prompt schemes on a comprehensive suite of graph-theoretic tasks.

Introduction

GraphOmni is the first fully extensible benchmark for evaluating the reasoning capabilities of LLMs on graph-theoretic tasks in natural language. Spanning 6 graph tasks, 7 graph types, 7 serialization formats, and 9 prompt schemes, it systematically explores how these interdependent dimensions shape model performance. Our extensive evaluations reveal that even top models like Claude-3.5 and o4-mini excel on some combinations yet leave substantial room for improvement on harder tasks. To streamline tuning, we introduce a reinforcement learning-inspired framework that adaptively selects optimal configurations at minimal cost. This open, modular platform lays a robust foundation for advancing LLMs’ graph understanding and reasoning.
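To make the "reinforcement learning-inspired" configuration search concrete, here is a minimal, purely illustrative epsilon-greedy bandit sketch over (serialization, prompt-scheme) arms. This is not the framework described in the paper; the arm set and the `evaluate_config` callback are hypothetical placeholders.

```python
import random
from collections import defaultdict

# Hypothetical illustration: epsilon-greedy selection over
# (serialization, prompt-scheme) "arms". This is NOT the authors'
# algorithm, only a sketch of the bandit-style idea described above.
ARMS = [("adjacency_list", "cot"), ("edge_list", "instruct"),
        ("adjacency_matrix", "zero_shot")]            # assumed arm set

def select_best_config(evaluate_config, rounds=100, eps=0.1):
    """evaluate_config(arm) -> 1.0/0.0 correctness on a sampled instance."""
    counts = defaultdict(int)
    values = defaultdict(float)                        # running mean reward
    for _ in range(rounds):
        if random.random() < eps or not counts:
            arm = random.choice(ARMS)                  # explore
        else:
            arm = max(ARMS, key=lambda a: values[a])   # exploit best estimate
        reward = evaluate_config(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return max(ARMS, key=lambda a: values[a])          # best configuration found
```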

Benchmark Statistics
Radar charts comparing the performance of open-source (top row) and closed-source (bottom row) LLMs across six graph tasks (BFS Order, Connectivity Check, Cycle Detection, Diameter Calculation, Shortest Path, Triangle Counting) at three difficulty levels (easy, medium, hard).
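Each of these six tasks has a deterministic ground truth. As a point of reference, the sketch below computes them with `networkx`; this is an assumption for illustration and not necessarily the benchmark's own reference implementation.

```python
import networkx as nx

# Sketch of ground-truth computation for the six tasks.
G = nx.erdos_renyi_graph(n=8, p=0.4, seed=0)
src, dst = 0, 5

bfs_order = [src] + [v for _, v in nx.bfs_edges(G, src)]       # BFS Order
connected = nx.has_path(G, src, dst)                           # Connectivity Check
has_cycle = len(nx.cycle_basis(G)) > 0                         # Cycle Detection
diameter = nx.diameter(G) if nx.is_connected(G) else None      # Diameter Calculation
shortest = (nx.shortest_path_length(G, src, dst)
            if connected else None)                            # Shortest Path
triangles = sum(nx.triangles(G).values()) // 3                 # Triangle Counting
```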

What Is Missing in the Evaluation of Graph Reasoning Tasks?

Illustration of key challenges
  • Rich Graph Diversity: 7 synthetic graph types (others use only 1–4), covering a wider range of structural patterns.
  • Multiple Serializations: 7 formats (vs. 1–4), so models are tested on adjacency lists, matrices, edge lists, and more (see the serialization sketch after this list).
  • Varied Prompt Schemes: 9 schemes (vs. 1–6), from algorithmic templates to chain-of-thought, letting us probe prompting sensitivity.
  • Random Baseline & Modular Design: Only GraphOmni includes both a simple random baseline and a plug-and-play evaluation pipeline for easy extension and ablation.
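To make the serialization axis concrete, the following sketch renders one small graph in three textual formats. The exact templates GraphOmni uses may differ; this only illustrates why the same structure can look very different to a model.

```python
import networkx as nx

# The same graph rendered in three common serialization formats.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

edge_list = ", ".join(f"({u}, {v})" for u, v in G.edges())
adj_list = "; ".join(f"{n}: {sorted(G.neighbors(n))}" for n in G.nodes())
adj_matrix = "\n".join(" ".join(str(int(G.has_edge(i, j))) for j in G.nodes())
                       for i in G.nodes())

print("Edge list:       ", edge_list)
print("Adjacency list:  ", adj_list)
print("Adjacency matrix:\n" + adj_matrix)
```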

GraphOmni Benchmark

Evaluation Pipeline

Illustration of pipeline
GraphOmni Evaluation Pipeline. We convert graph-theoretic tasks into text-based questions. In the adjustable settings, we vary three dimensions (graph types, serialization formats, and prompt schemes) and generate every possible combination. Each combination is fed to the LLM, and its output is compared against the ground truth to assess performance.
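A minimal sketch of the combinatorial sweep described above. The dimension values are truncated examples, and `serialize`, `build_prompt`, `query_llm`, and `check` are placeholder callables rather than the benchmark's actual API.

```python
from itertools import product

GRAPH_TYPES = ["erdos_renyi", "barabasi_albert", "star"]    # 7 in the benchmark
SERIALIZATIONS = ["adjacency_list", "edge_list", "gml"]     # 7 in the benchmark
PROMPT_SCHEMES = ["zero_shot", "cot", "instruct"]           # 9 in the benchmark

def run_sweep(task, instances, serialize, build_prompt, query_llm, check):
    """Evaluate one task over every (graph type, serialization, prompt) cell."""
    results = {}
    for gtype, fmt, scheme in product(GRAPH_TYPES, SERIALIZATIONS, PROMPT_SCHEMES):
        correct = 0
        for graph, ground_truth in instances[gtype]:
            prompt = build_prompt(task, serialize(graph, fmt), scheme)
            correct += check(query_llm(prompt), ground_truth)
        results[(gtype, fmt, scheme)] = correct / len(instances[gtype])
    return results   # accuracy per combination, judged against ground truth
```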

Data Statistics

Stats_table
Statistical summary of GraphOmni over all tasks at different difficulty levels.
Token usage for prompt-serialization combinations.

Evaluation Results

Overall Results

Main Results
Benchmark Accuracy of Large Language Models Across Tasks (Mean ± 95% CI Margin). Bold orange / underlined blue / light purple highlights indicate best / second-best / third-best performance relative to each model type (i.e., open-source or closed-source models).
  • o4-mini Raises the Bar: Beats or ties GPT-4o and Claude-3.5 on 7/10 hard tasks, hitting 92–93% on Cycle Count and Connected Components.
  • Gap to Human Mastery: Accuracy stays below 35% on hard cases of Triangle Counting and BFS Order for all models, far below human-level mastery.
  • Closed > Open: The best-performing open-source LLM, Qwen-2.5, lags o4-mini by 8–15% on average, and the gap widens as difficulty increases.
  • Tight CIs: 95% CI margins are mostly below 2%, so the rank ordering is statistically robust (a sketch of the CI computation follows this list).
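The "Mean ± 95% CI Margin" figures can be reproduced from per-instance correctness flags with a normal-approximation interval; the sketch below shows one such computation (the paper may construct its intervals differently).

```python
import math

def mean_and_ci_margin(correct_flags, z=1.96):
    """Mean accuracy and half-width of a 95% normal-approximation CI."""
    n = len(correct_flags)
    mean = sum(correct_flags) / n
    # sample variance of the 0/1 outcomes
    var = sum((x - mean) ** 2 for x in correct_flags) / (n - 1)
    return mean, z * math.sqrt(var / n)

acc, ci = mean_and_ci_margin([1, 1, 0, 1, 0, 1, 1, 1])   # toy flags
print(f"{acc:.2%} ± {ci:.2%}")
```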

Different Combinations Make a Difference

Heatmap
Performance heatmaps for different prompt schemes and serialization formats for GPT-4o on diameter calculation. The color intensity represents accuracy, with darker colors indicating better performance. The solid and dashed boxes highlight the best and second-best performing combinations, respectively.
  • High Sensitivity: accuracy swings by roughly 40 percentage points (≈30% → 70%) on formatting alone.
  • Evaluate Everything: fixing one dimension and sweeping the other can misestimate capability, so we need to vary both (a pivot-table sketch follows this list)!
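A sketch of how a prompt-scheme × serialization-format accuracy matrix (the input to such a heatmap) can be assembled with `pandas`; the column names and accuracy values here are illustrative, not the paper's results.

```python
import pandas as pd

# Turn per-combination accuracies (e.g., from a sweep like the one above)
# into a prompt-scheme x serialization-format accuracy matrix.
records = [
    {"prompt": "cot",       "format": "adjacency_list", "accuracy": 0.71},
    {"prompt": "cot",       "format": "edge_list",      "accuracy": 0.64},
    {"prompt": "zero_shot", "format": "adjacency_list", "accuracy": 0.38},
    {"prompt": "zero_shot", "format": "edge_list",      "accuracy": 0.31},
]  # illustrative numbers only

table = (pd.DataFrame(records)
           .pivot(index="prompt", columns="format", values="accuracy"))
print(table)   # pass to seaborn.heatmap(table) or similar to plot
```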
Prompt Schemes
Accuracy of open-source versus closed-source models with different prompt schemes. (a) and (b) show the average performance with a 95% confidence interval for open-/closed-source models across various prompt schemes and tasks, with the x-axis sorted by mean accuracy.
  • Closed-Source Superiority: closed-source models consistently outperform open-source ones across all prompt schemes.
  • Favor CoT & Instruct: Chain-of-Thought and Instruct prompts yield the highest mean accuracies, especially on easy/medium-difficulty tasks.

Error Analysis

Case A. Misinterpretation of serialization formats: Models occasionally struggled to accurately interpret serialized graph representations, resulting in misunderstandings of the underlying graph structure.

Case B. Incorrect reasoning about graph-theoretic concepts: LLMs frequently exhibited fundamental misunderstandings of basic graph definitions and problem-solving methods.

BibTeX

@misc{xu2025graphomni,
      title={GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks}, 
      author={Hao Xu and Xiangru Jian and Xinjian Zhao and Wei Pang and Chao Zhang and Suyuchen Wang and Qixin Zhang and Zhengyuan Dong and Joao Monteiro and Bang Liu and Qiuzhuang Sun and Tianshu Yu},
      year={2025},
      url={https://arxiv.org/abs/2504.12764}
}