ClawArena: A Multi-Framework Benchmark for Evaluating AI Coding Agents on Realistic Multi-Session Scenarios

Published in arXiv preprint, 2026

ClawArena is a benchmark evaluation platform for AI coding agents. It provides a unified pipeline to run inference, score results, and compare performance across different agent frameworks on the same set of realistic, multi-session scenarios. It features 64 scenarios across 8 domains, 1,879 evaluation rounds mixing multiple-choice reasoning and execution-based checks, and supports framework-agnostic evaluation via a plugin system.

ClawArena: A Multi-Framework Benchmark for Evaluating AI Coding Agents on Realistic Multi-Session Scenarios

Recommended citation: Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao, "ClawArena: A Multi-Framework Benchmark for Evaluating AI Coding Agents on Realistic Multi-Session Scenarios," 2026. [Online]. Available: https://arxiv.org/abs/2604.04202
Download Paper | Download Bibtex