How to design an excellent benchmark

▼thought

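The user is asking "how to design an excellent benchmark," which is a methodology question. I need to give systematic guidance that covers every aspect of benchmark design. A question like this calls for both a theoretical framework and practical case studies. Let me search from several angles: benchmark design principles, problems with AI benchmarks (such as saturation and data contamination), benchmark design in computer architecture (classic cases such as SPEC), and recent benchmark critiques and best practices. Let me run several searches in parallel.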

▼✓Web search query: "how to design a good benchmark principles methodology", results: 10

{
  "results": [
    {
      "title": "How to Build Good Language Modeling Benchmarks – Ofir Press",
      "url": "https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/",
      "description": "Building benchmarks is important because they shine a spotlight on the weaknesses of existing language models and so can guide the community on how to improve them. I’ve spent a lot of my career both on building benchmarks and on building systems that push forward the state-of-the-art on a given benchmark, and I believe that building good benchmarks is *just as important* as building new systems. Designing a good benchmark is challenging and I’ve spent a lot of time recently thinking about what makes for a good benchmark. Try to build a benchmark that has natural questions that some category of humans ask on a frequent basis. B. **The questions are made up and not taken from actual questions asked by actual humans**: If you work for Google and you’re tasked with building a challenging question-answering benchmark, a really nonoptimal thing to do would be to sit around by yourself in a room and just try to think of questions."
    },
    {
      "title": "[PDF] Principles for Financial Benchmarks - IOSCO",
      "url": "https://www.iosco.org/library/pubdocs/pdf/ioscopd415.pdf",
      "description": "a) Clearly define and substantiate through appropriate written arrangements the roles and obligations of third parties who participate in the"
    },
    {
      "title": "Benchmark Standards for Justice Innovations",
      "url": "https://justiceinnovation.law.stanford.edu/resources/evaluation/benchmark/",
      "description": "The benchmark principles, checklists, and criteria can be used to review what you are creating. Judge your initiative against these expert best practices, and"
    },
    {
      "title": "[PDF] How to Build a Benchmark - SPEC Research Group",
      "url": "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf",
      "description": "The key characteristics can be or-ganized in the following groups, which will be discussed in more detail in the next sections: • Relevance How closely the benchmark behavior cor-relates to behaviors that are of interest to consumers of the results • Reproducibility The ability to consistently produce similar results when the benchmark is run with the same test conﬁguration • Fairness Allowing diﬀerent test conﬁgurations to com-pete on their merits with-out artiﬁcial limitations • Veriﬁability Providing conﬁdence that a benchmark result is accurate • Usability Avoiding roadblocks for users to run the benchmark in their test environments All benchmarks are subject to these same criteria, but each category includes additional issues that are speciﬁc to the individual benchmark, depending on the benchmark’s goals."
    },
    {
      "title": "Custom AI Benchmark Guide: Lessons from Public Evals",
      "url": "https://kili-technology.com/blog/custom-ai-benchmark-guide-what-the-best-public-evals-teach-you-about-building-your-own",
      "description": "Learn best practices for combining LLM-as-a-judge and HITL workflows for reliable AI. # Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own. The public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points in SWE-bench scores in a single year. A custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public ai benchmarks are now publicly working through."
    },
    {
      "title": "How to Build AI Benchmarks That Evolve | Label Studio",
      "url": "https://labelstud.io/blog/how-to-build-ai-benchmarks-that-evolve-with-your-models/",
      "description": "# How to Build AI Benchmarks that Evolve with your Models. In our first post in the Benchmark series, we explored why evaluating large language models (LLMs) is uniquely challenging—and how AI benchmarks offer a solution by bringing structure, repeatability, and objectivity to an otherwise subjective process. In this article, we’ll break down what makes a benchmark effective: the core components you need, different scoring approaches, and when to use them. We’ll also explore how benchmark strategies should evolve as your system matures—from early prototypes to production-ready applications—so you can evaluate your model in the right way, at the right time. An AI benchmark has two key components: a standardized set of tasks, and a scoring methodology. By asking the model to answer the same questions every time, we can begin to get a deeper understanding of where our model is performing well or poorly, and how it has changed over time."
    },
    {
      "title": "How Should I Build A Benchmark? Revisiting Code-Related Benchmarks For LLMs",
      "url": "https://arxiv.org/html/2501.10711v1",
      "description": "# How Should I Build A Benchmark? Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to assure its quality, reliability, and reproducibility. We propose How2Bench comprising a 55-criteria checklist as a set of guidelines to comprehensively govern the development of code-related benchmarks. Nearly 70% of the benchmarks did not take measures for data quality assurance;. Finally, we conducted a human study involving 49 participants and revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency. For example, a recent study pointed out that “current programming benchmarks are inadequate for assessing the actual correctness of LLM-generated code” (Liu et al., 2023a). This checklist covers the entire lifecycle of benchmark development, from design and construction to evaluation, analysis, and release as shown in Figure 1."
    },
    {
      "title": "Essential guidelines for computational method benchmarking - PMC",
      "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC6584985/",
      "description": "In general, we can define three broad types of benchmarking studies: (i) those by method developers, to demonstrate the merits of their approach (e.g., [22–26]); (ii) neutral studies performed to systematically compare methods for a certain analysis, either conducted directly by an independent group (e.g., [27–38]) or in collaboration with method authors (e.g., [39]); or (iii) those organized in the form of a community challenge, such as those from the DREAM [40–44], FlowCAP [45, 46], CASP [47, 48], CAMI [49], Assemblathon [50, 51], MAQC/SEQC [52–54], and GA4GH [55] consortia. Evaluation criteria: key quantitative performance metrics | +++ | Number and types of performance metrics | Subjectivity in the choice of metrics: e.g., selecting metrics that do not translate to real-world performance  Metrics that give over-optimistic estimates of performance  Methods may not be directly comparable according to individual metrics (e.g., if methods are designed for different tasks) |."
    },
    {
      "title": "What is Benchmarking? Technical & Competitive ... - ASQ",
      "url": "https://asq.org/quality-resources/benchmarking?srsltid=AfmBOoqVw5c0a3C-vqeTZlpsK1or5hUKoylrtrOH-S4jJKkMPSm1rBUZ",
      "description": "Benchmarking is a method that allows companies to compare products, performance and services to top competitors. Learn more as ASQ.org."
    },
    {
      "title": "What advice can you give me for writing a meaningful benchmark?",
      "url": "https://stackoverflow.com/questions/324200/what-advice-can-you-give-me-for-writing-a-meaningful-benchmark",
      "description": "The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers."
    }
  ]
}

▼✓Web search query: "benchmark design flaws pitfalls AI ML evaluation 2024 2025", results: 10

{
  "results": [
    {
      "title": "NeurIPS Poster Position: Benchmarking is Broken - Don't Let AI be Its Own Judge",
      "url": "https://neurips.cc/virtual/2025/poster/121919",
      "description": "Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. As a flood of participants enters the AI space, this \"Wild West\" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's.In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? For true and sustainable AI advancement, we call for a paradigm shift to a unified, live, and quality-controlled benchmarking framework—robust by construction rather than reliant on courtesy or goodwill. To concretize this position, we introduce the idea of PeerBench, a community-governed, proctored evaluation blueprint that seeks to improve security and credibility through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to lay the groundwork for evaluations that restore integrity and deliver genuinely trustworthy measures of AI progress."
    },
    {
      "title": "AI Benchmarks for 2025 ·",
      "url": "https://dasarpai.com/dsblog/ai-benchmarks-2025/",
      "description": "# AI Benchmarks for 2025. A term **“AI benchmark”** is thrown around a lot and can be confusing because it’s used in slightly different ways depending on the context. In this artcile we will try to understand what are the different meaning of this term and what are the latest AI benchmarks. In general, an **AI benchmark** is a **standardized way to evaluate the performance** of an AI system and/or models. For example, Image classification, text generation, translation, reasoning etc. **Dataset** : The data used to test performance of the model. **Metrics** : How performance is measured or what metrics are used to evalute the model performance. **Leaderboard** : A ranked list of various model’s performances on the against same benchmark. For example, when someone say the model was evaluated against SuperGLUE (a famous NLP benchmark) benchmark it means. Here’s a list of **LLM benchmarks released in the last 2–3 years (2022–2024)** — these are *recent*, research-driven, and widely cited or adopted:."
    },
    {
      "title": "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation",
      "url": "https://arxiv.org/html/2502.06559v1",
      "description": "An Interdisciplinary Review of Current Issues in AI Evaluation. Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Quantitative artificial intelligence (AI) benchmarks (i.e., combinations of test datasets and performance metrics that are taken to represent general or specific tasks and used to compare AI model capabilities and/or risks (Raji et al., 2021)) play a central role in the release and marketing of newly developed AI tools."
    },
    {
      "title": "Avoiding Common Pitfalls in LLM Evaluation",
      "url": "https://www.honeyhive.ai/post/avoiding-common-pitfalls-in-llm-evaluation",
      "description": "# Avoiding Common Pitfalls in LLM Evaluation. Discover the hidden challenges of LLM evaluation and the most common mistakes we've seen after helping hundreds of teams build effective evals that drive business results. ###### About the author:. ###### Felipe Adachi. ## Join our monthly newsletter. Your submission has been received! Something went wrong while submitting the form."
    },
    {
      "title": "Custom AI Benchmark Guide: Lessons from Public Evals",
      "url": "https://kili-technology.com/blog/custom-ai-benchmark-guide-what-the-best-public-evals-teach-you-about-building-your-own",
      "description": "Learn best practices for combining LLM-as-a-judge and HITL workflows for reliable AI. # Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own. The public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points in SWE-bench scores in a single year. A custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public ai benchmarks are now publicly working through."
    },
    {
      "title": "Technical Performance | The 2025 AI Index Report | Stanford HAI",
      "url": "https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance",
      "description": "In 2023, AI researchers introduced several challenging new benchmarks, including MMMU, GPQA, and SWE-bench, aimed at testing the limits of increasingly capable AI systems. By 2024, AI performance on these benchmarks saw remarkable improvements, with gains of 18.8 and 48.9 percentage points on MMMU and GPQA, respectively. On SWE-bench, AI systems could solve just 4.4% of coding problems in 2023—a figure that jumped to 71.7% in 2024. In early January 2024, the leading closed-weight model outperformed the top open-weight model by 8.04% on the Chatbot Arena Leaderboard. By early 2025, this gap had narrowed to just 5.4%. Likewise, the difference between the top two models shrank from 4.9% in 2023 to just 0.7% in 2024. The saturation of traditional AI benchmarks like MMLU, GSM8K, and HumanEval, coupled with improved performance on newer, more challenging benchmarks such as MMMU and GPQA, has pushed researchers to explore additional evaluation methods for leading AI systems."
    },
    {
      "title": "A benchmark of expert-level academic questions to assess AI ...",
      "url": "https://www.nature.com/articles/s41586-025-09962-4",
      "description": "# A benchmark of expert-level academic questions to assess AI capabilities | Nature. We also use optional cookies for advertising, personalisation of content, usage analysis, and social media, as well as to allow video information to be shared for both marketing, analytics and editorial purposes. By accepting optional cookies, you consent to the processing of your personal data - including transfers to third parties. See our privacy policy for more information on the use of your personal data. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To address this gap, we introduce HLE (originally defined as Humanity’s Last Exam, although we will use the term HLE for this paper), a benchmark of 2,500 challenging questions from dozens of subject areas, designed to assess LLM capabilities at an expert level in broad academic subjects."
    },
    {
      "title": "LLM Evaluation Framework: MMLU, Chatbot Arena & LLM-as-Judge [2026 Guide]",
      "url": "https://www.meta-intelligence.tech/en/insight-llm-evaluation",
      "description": "# LLM Evaluation Framework: MMLU, Chatbot Arena & LLM-as-Judge [2026 Guide]. Multi-dimensional LLM evaluation radar chart and leaderboard illustration. Evaluating large language models is one of the most challenging problems in the AI field. However, the capability dimensions of LLMs are extraordinarily broad: they simultaneously handle translation, summarization, code generation, mathematical reasoning, creative writing, fact-checking, and dozens of other tasks — no single metric can capture the full picture. The more fundamental difficulty is that **\"a good answer\" is itself a subjective and multi-dimensional concept.** One answer may be impeccable in factual accuracy but stiff in tone and lacking empathy; another may be beautifully written but contain subtle hallucinations. `The Core Dilemma of LLM Evaluation:. in their survey categorize LLM evaluation methods into three major classes: automated benchmark evaluation, human evaluation, and model-as-evaluator (LLM-as-Judge). This article will systematically dissect the current major LLM evaluation methodologies, from static benchmarks to dynamic human rankings, from automated judges to enterprise custom frameworks, providing readers with a complete evaluation decision map."
    },
    {
      "title": "2025 July \"AI Evaluation\" Digest",
      "url": "https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest",
      "description": "[![Image 1: The AI Evaluation Substack](https://substackcdn.com/image/fetch/$s_!xXz3!,w_40,h_40,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6dcf51-f765-489b-8c76-0a26cf708ebf_494x494.png)](https://aievaluation.substack.com/). # [The AI Evaluation Substack](https://aievaluation.substack.com/). [Sign in](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest). [![Image 3: AI Evaluation's avatar](https://substackcdn.com/image/fetch/$s_!bqQs!,w_36,h_36,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87229cf4-5123-4968-bfbf-3033864b23dc_440x542.png)](https://substack.com/@aievals). [AI Evaluation](https://substack.com/@aievals). Now, as the AI community faces new challenges with evaluating ever-larger LMs, new initiatives (e.g., [inspect_evals](https://arxiv.org/pdf/2507.06893) and [WEVAL](https://weval.org/)) are following in OpenML’s footsteps, showing that lasting impact comes not just from flashy algorithms or brute-force compute, but from building the shared foundations that make ongoing, open, and rigorous scientific progress possible. *   [MLR-Bench](https://arxiv.org/pdf/2505.19955) tests whether AI systems can truly conduct independent scientific research in the area of machine learning, evaluating models across 201 machine learning problems through all research stages. [![Image 4](https://substackcdn.com/image/fetch/$s_!xXz3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6dcf51-f765-489b-8c76-0a26cf708ebf_494x494.png)The AI Evaluation Substack A monthly digest of the latest developments, research trends and key initiatives in the realm of AI evaluation.](https://aievaluation.substack.com/?utm_source=substack&utm_campaign=publication_embed&utm_medium=web). 
[12 Likes](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest)∙. [Jul 25, 2025](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest/comment/138924777 \"Jul 25, 2025, 7:55 PM\"). [Aug 26, 2025](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest/comment/149274778 \"Aug 26, 2025, 1:43 PM\"). [2 more comments...](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest/comments). [2025 September \"AI Evaluation\" Digest](https://aievaluation.substack.com/p/2025-september-ai-evaluation-digest). [What could possibly go wrong?](https://aievaluation.substack.com/p/2025-september-ai-evaluation-digest). [2025 December \"AI Evaluation\" Digest](https://aievaluation.substack.com/p/2025-december-ai-evaluation-digest). [Call for Tributes: Your test of time.](https://aievaluation.substack.com/p/2025-december-ai-evaluation-digest). [2025 November \"AI Evaluation\" Digest](https://aievaluation.substack.com/p/2025-november-ai-evaluation-digest). Seeing is all you need](https://aievaluation.substack.com/p/2025-november-ai-evaluation-digest)."
    },
    {
      "title": "Why AI Needs Better Benchmarks",
      "url": "https://www.youtube.com/watch?v=kLZeFWZewH0",
      "description": "Why AI Needs Better Benchmarks\nThe AI Daily Brief: Artificial Intelligence News\n581000 subscribers\n102 likes\n3982 views\n27 Mar 2026\nARC-AGI-3 from the ARC Prize measures intelligence by testing learning efficiency across 135 interactive visual games. Coverage examines benchmark saturation, benchmark maxing, and the shift from memorization-focused tests toward interactive, tool-enabled real-world evaluations. ARC-AGI-3 aims to expose gaps between human learning and current models and to refocus research on genuine general intelligence.\n\nThe AI Daily Brief helps you understand the most important news and discussions in AI. \nSubscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614\nGet it ad free at http://patreon.com/aidailybrief\nLearn more about the show https://aidailybrief.ai/\n6 comments"
    }
  ]
}

▼✓Web search query: "benchmark saturation data contamination goodhart's law evaluation crisis", results: 10

{
  "results": [
    {
      "title": "The Measurement Crisis: Saturation, Goodhart's Law, and the End of ...",
      "url": "https://hub.stabilarity.com/the-measurement-crisis-saturation-goodharts-law-and-the-end-of-ai-leaderboards/",
      "description": "The Measurement Crisis ... Their analysis identifies two contamination pathways: direct (benchmark items in training data) ... benchmark saturation, Goodhart's Law,"
    },
    {
      "title": "A Systematic Study of Benchmark Saturation - arXiv",
      "url": "https://arxiv.org/html/2602.16763v1",
      "description": "Various works have highlighted fundamental evaluation pitfalls in AI evaluation: Data contamination, i.e., when test content appears in training, can"
    },
    {
      "title": "To Solve the Benchmark Crisis, Evals Must Think - Fig",
      "url": "https://blog.fig.inc/to-solve-the-benchmark-crisis-evals-must-think/",
      "description": "This is Goodhart's Law4 at scale: when a measure becomes a target, it ceases to be a good measure. A practical anecdote: A team we recently"
    },
    {
      "title": "Benchmark Saturation: AI Evaluation Metrics and Ceiling Effects - Interactive | Michael Brenndoerfer | Michael Brenndoerfer",
      "url": "https://mbrenndoerfer.com/writing/benchmark-saturation-ai-evaluation-metrics",
      "description": "# Benchmark Saturation: AI Evaluation Metrics and Ceiling Effects. Discover benchmark saturation in AI evaluation. Learn why static metrics hit ceiling effects, lose statistical power, and how dynamic benchmarks solve this. When GPT-4 achieved 86.4% on the MMLU benchmark in early 2023, it approached but did not surpass the average human performance of approximately 89.8% by expert test-takers in those subjects. This rapid ascent reveals a fundamental challenge in evaluating language models: benchmarks have a lifespan. What begins as a rigorous test of capability eventually becomes a checkmark on a datasheet, unable to distinguish between good models and great ones. This phenomenon is known as benchmark saturation. Benchmark saturation occurs when model performance on a static dataset approaches the theoretical ceiling, rendering the metric incapable of discriminating between improvements. When a benchmark saturates, the number no longer represents meaningful capability differences; it becomes an artifact of measurement precision, annotation noise, and memorization rather than understanding."
    },
    {
      "title": "The Sequence Opinion #750: The Paradox of AI Benchmarks: Challenges in Evaluation",
      "url": "https://thesequence.substack.com/p/the-sequence-opinion-750-the-paradox",
      "description": "# The Sequence Opinion #750: The Paradox of AI Benchmarks: Challenges in Evaluation. ### Goodhart’s Law and some of the challenges around AI benchmarks. We largely rely on standardized benchmarks and tidy metrics to declare winners and track improvement. Yet a central paradox undercuts this practice: as soon as a metric becomes the target, it often stops measuring what we intended. This is Goodhart’s Law in action—*“when a measure becomes a target, it ceases to be a good measure.”* In AI, that means a leaderboard gain may reflect success at **gaming the test**, not a durable leap in capability. This essay examines the **core challenges of AI benchmarks and evaluations** across language, vision, and reinforcement learning (RL). We’ll look at how once-useful measures turn into misleading goals; why good evals should be **accurate, inexpensive, and hard to memorize**; where academic and real-world tests clash; and what RL environments promise—and fail—to deliver as evolving evaluations. ## Goodhart’s Law and the Benchmark Trap."
    },
    {
      "title": "The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks",
      "url": "https://tianpan.co/blog/2026-04-19-goodharts-law-ai-benchmark-gaming",
      "description": "# The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks. In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — \"when a measure becomes a target, it ceases to be a good measure\" — was articulated in 1970s economic policy, but it describes AI benchmarking with eerie precision. ## The Lifecycle of a Benchmark. A research team identifies a meaningful capability — reasoning, code generation, multilingual comprehension — and constructs test questions to measure it."
    },
    {
      "title": "Is AI Cheating on the Test: Data Contamination, Gaming, and the ...",
      "url": "https://medium.com/@wasowski.jarek/is-ai-cheating-on-the-test-data-contamination-gaming-and-the-benchmark-crisis-9dff2fba494f",
      "description": "How data contamination, Goodhart's Law, and gaming turned objective AI tests into a marketing tool — and what to do about it as a practitioner."
    },
    {
      "title": "A Systematic Study of Benchmark Saturation",
      "url": "https://mcml.ai/publications/ars+26/",
      "description": "In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To"
    },
    {
      "title": "[PDF] Scaling Laws, Foundation Models, and the AI Singularity",
      "url": "https://wjarr.com/sites/default/files/fulltext_pdf/WJARR-2026-0011.pdf",
      "description": "As models consume their own outputs, recursive training risks variance collapse and model autophagy unless carefully mitigated through provenance filtering and human data prioritisation For scaling-law practice, the implication is that future performance may be bottlenecked less by raw compute and more by data governance: provenance tracking, contamination control, deduplication, and methods that preserve distributional diversity while improving informational density per token (Penedo et al., 2024; Villalobos et al., 2024). A useful taxonomy separates: (1) frontier catastrophic risks, such as misuse at scale, loss of control in tightly coupled systems, or rapid World Journal of Advanced Research and Reviews, 2026, 29(01), 111-134 125 capability jumps under competitive pressure; (2) high-frequency societal harms, including disinformation, discrimination, privacy violations, and workplace exploitation; and (3) systemic infrastructure risks, such as energy demand, critical dependency on concentrated providers, and fragility created by common-model monocultures (Bengio et al., 2025; Bontcheva et al., 2023; CMA, 2024; International Energy Agency, 2024)."
    },
    {
      "title": "The Metrology Imperative: The Necessity of Robust Evaluation Frameworks and Comprehensive Automated Judges in Generative AI\n\t\t\t\t\t\t\t| International Journal of Intelligent Systems and Applications in Engineering",
      "url": "https://ijisae.org/index.php/IJISAE/article/view/8188",
      "description": "# The Metrology Imperative: The Necessity of Robust Evaluation Frameworks and Comprehensive Automated Judges in Generative AI. Across the past several years, the accelerating advancement of Large Language Models (LLMs) and generative artificial intelligence has quietly produced a crisis that much of the field has been slow to name directly—a breakdown in the ability to evaluate what these systems can and cannot actually do. This brief argues, with considerable urgency, that building robust and dynamic evaluation frameworks alongside sophisticated automated judges—most prominently through the LLM-as-a-Judge paradigm—is not an optional enhancement to existing practices but an absolute prerequisite for the continued, safe, and value-aligned development of AI systems. Further, authors of accepted papers are prohibited from publishing the results in other publications that appear before the paper is published in the Journal unless they receive approval for doing so from the Editor-In-Chief."
    }
  ]
}

▼✓Feed full-text search keywords: "benchmark design methodology evaluation", items: 20

{
  "items": [
    {
      "id": "oai:arXiv.org:2605.18498v1",
      "url": "https://arxiv.org/abs/2605.18498",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs",
      "author": [
        "Jing Wang",
        "Hongxuan Lu",
        "Jazze Young",
        "Shu Wang",
        "Zhimin Xin"
      ],
      "summary": "arXiv:2605.18498v1 Announce Type: cross Abstract: Expert specialization in Mixture-of-Experts (MoE) models remains poorly understood, with traditional evaluations conflating architectural load-balancing with functional specialization. We introduce DBES, a comprehensive diagnostic framework combining a multi-domain benchmark with five theoretically grounded metrics: Routing Specialization, Normalized Effective Rank, Domain Isolation, Routing Stiffness Score, and N-gram Expertise measures. Critical findings demonstrate distinct specialization paradigms across models: Qwen-series exhibit modular specialization with high domain isolation, while DeepSeek and GLM employ distributed collaboration. However, we emphasize that specialization is a diagnostic dimension, necessary but not sufficient for downstream performance. Most crucially, interventional evidence validates the actionability of these metrics: by using DBES to identify high-specialization expert paths during domain-specific post-training, we achieved 66% to 94.48% improvement in specialized domains with only 15% of original training resources, demonstrating that these diagnostic tools can be converted into concrete optimization operators. This work provides the first systematic methodology for evaluating expert specialization independently of accuracy metrics, offering crucial insights for the design and post-training optimization of next-generation MoE systems.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.16282v1",
      "url": "https://arxiv.org/abs/2605.16282",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents",
      "author": [
        "Miles Q. Li",
        "Benjamin C. M. Fung",
        "Boyang Li",
        "Heba Ismail",
        "Farkhund Iqbal"
      ],
      "summary": "arXiv:2605.16282v1 Announce Type: cross Abstract: The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.05739v3",
      "url": "https://arxiv.org/abs/2605.05739",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback",
      "author": [
        "Mohammad Al Ridhawi",
        "Mahtab Haj Ali",
        "Hussein Al Osman"
      ],
      "summary": "arXiv:2605.05739v3 Announce Type: replace-cross Abstract: Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized 20-day Sharpe ratio. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty added to the Soft Actor-Critic reward. Three fine-tuning cycles, confined to validation data, reduce one-day MAPE from 0.61% to 0.54% (11.5% relative; p<0.001, d=0.31) on the held-out 2017 to 2025 test period, significant under Diebold-Mariano and localized by Giacomini-White to the high-volatility regime. The methodology is application-agnostic and applies to any agentic system whose intermediate decisions can be logged.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.16909v1",
      "url": "https://arxiv.org/abs/2605.16909",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents",
      "author": [
        "Zhiqiang Liu",
        "Wenhui Dong",
        "Yilang Tan",
        "Yuwen Qu",
        "Haochen Yin",
        "Chenyang Si"
      ],
      "summary": "arXiv:2605.16909v1 Announce Type: new Abstract: Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.13877v2",
      "url": "https://arxiv.org/abs/2605.13877",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "ARES-LSHADE: Autoresearch-Enhanced LSHADE with Memetic Polish for the GNBG Benchmark",
      "author": [
        "Abdullah Naeem",
        "Md Wasi Ul kabir",
        "Manish Bhatt",
        "Ayon Dey",
        "Anav Katwal",
        "Md Tamjidul Hoque"
      ],
      "summary": "arXiv:2605.13877v2 Announce Type: replace-cross Abstract: We present ARES-LSHADE, a memetic differential-evolution variant submitted to the GECCO 2026 competition on LLM-designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM-LSHADE 2025 winner, contributing two new components: (a) a scout-augmented mutation operator with adaptive CMA-ES integration, produced by an autonomous research loop across approximately thirty LLM-driven design experiments, and (b) a multi-start L-BFGS-B polish phase that respects strict blackbox treatment of the benchmark. On the official 31-run-per-function evaluation with the competition-specified function-evaluation budgets, ARES-LSHADE obtains 510 of 744 wins (per-function gap below 1e-8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG's compositional structure, and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM-driven research loop with operator-only edit surface and fitness-only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark's compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition's blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM-driven optimization-algorithm research. Code and reproducibility artifacts are available at https://github.com/anaeem1/ARES-LSHADE.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.17829v1",
      "url": "https://arxiv.org/abs/2605.17829",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Interactive Evaluation Requires a Design Science",
      "author": [
        "Keyang Xuan",
        "Peiyang Song",
        "Pan Lu",
        "Pengrui Han",
        "Wenkai Li",
        "Zhenyu Zhang",
        "Zexue He",
        "Wenyue Hua",
        "Manling Li",
        "Jiaxuan You",
        "Adrian Weller",
        "Yizhong Wang",
        "Jiaxin Pei"
      ],
      "summary": "arXiv:2605.17829v1 Announce Type: new Abstract: AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2510.18941v2",
      "url": "https://arxiv.org/abs/2510.18941",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge",
      "author": [
        "Zhilin Wang",
        "Jaehun Jung",
        "Ximing Lu",
        "Shizhe Diao",
        "Ellie Evans",
        "Jiaqi Zeng",
        "Pavlo Molchanov",
        "Yejin Choi",
        "Jan Kautz",
        "Yi Dong"
      ],
      "summary": "arXiv:2510.18941v2 Announce Type: replace-cross Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench and Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2601.11895v3",
      "url": "https://arxiv.org/abs/2601.11895",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models",
      "author": [
        "Adarsh Kumarappan",
        "Pareesa Ameneh Golnari",
        "Wen Wen",
        "Xiaoyu Liu",
        "Gabriel Ryan",
        "Yuting Sun",
        "Shengyu Fu",
        "Elsie Nallipogu"
      ],
      "summary": "arXiv:2601.11895v3 Announce Type: replace-cross Abstract: DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. 9 state-of-the-art models were assessed, with the strongest achieving only 43.5% Pass@1, confirming the benchmark remains challenging and revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2510.26384v2",
      "url": "https://arxiv.org/abs/2510.26384",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings",
      "author": [
        "Andrew M. Bean",
        "Nabeel Seedat",
        "Shengzhuang Chen",
        "Jonathan Richard Schwarz"
      ],
      "summary": "arXiv:2510.26384v2 Announce Type: replace Abstract: The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (\"cold-start\"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18421v1",
      "url": "https://arxiv.org/abs/2605.18421",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective",
      "author": [
        "Yuyao Wang",
        "Zhongjian Zhang",
        "Mo Chi",
        "Kaichi Yu",
        "Yuhan Li",
        "Miao Peng",
        "Bing Tong",
        "Chen Zhang",
        "Yan Zhou",
        "Jia Li"
      ],
      "summary": "arXiv:2605.18421v1 Announce Type: cross Abstract: Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2509.19590v2",
      "url": "https://arxiv.org/abs/2509.19590",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Position: AI Evaluations Should be Grounded on a Theory of Capability",
      "author": [
        "Nathanael Jo",
        "Ashia Wilson"
      ],
      "summary": "arXiv:2509.19590v2 Announce Type: replace Abstract: Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18693v1",
      "url": "https://arxiv.org/abs/2605.18693",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents",
      "author": [
        "Yifan Zhou",
        "Zhentao Zhang",
        "Ziming Cheng",
        "Shuo Zhang",
        "Qizhen Lan",
        "Zhangquan Chen",
        "Zhi Yang",
        "QianyuXu",
        "Ronghao Chen",
        "Huacan Wang",
        "Sen Hu"
      ],
      "summary": "arXiv:2605.18693v1 Announce Type: new Abstract: As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.17554v1",
      "url": "https://arxiv.org/abs/2605.17554",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps",
      "author": [
        "Tanmay Asthana",
        "Aman Saksena",
        "Divyansh Sahu"
      ],
      "summary": "arXiv:2605.17554v1 Announce Type: new Abstract: Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18663v1",
      "url": "https://arxiv.org/abs/2605.18663",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "GIM: Evaluating models via tasks that integrate multiple cognitive domains",
      "author": [
        "Rohit Patel",
        "Alexandre Rezende",
        "Steven McClain"
      ],
      "summary": "arXiv:2605.18663v1 Announce Type: new Abstract: As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public--private split provides built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.17373v1",
      "url": "https://arxiv.org/abs/2605.17373",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics",
      "author": [
        "Qiran Zou",
        "Hou Hei Lam",
        "Wenhao Zhao",
        "Tingting Chen",
        "Yiming Tang",
        "Samson Yu",
        "Yingtao Zhu",
        "Srinivas Anumasa",
        "Zufeng Zhang",
        "Tianyi Zhang",
        "Chang Liu",
        "Zhengyao Jiang",
        "Anirudh Goyal",
        "Dianbo Liu"
      ],
      "summary": "arXiv:2605.17373v1 Announce Type: cross Abstract: AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.14133v2",
      "url": "https://arxiv.org/abs/2605.14133",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents",
      "author": [
        "Yuxiang Lai",
        "Peng Xia",
        "Haonian Ji",
        "Kaiwen Xiong",
        "Kaide Zeng",
        "Jiaqi Liu",
        "Fang Wu",
        "Jike Zhong",
        "Zeyu Zheng",
        "Cihang Xie",
        "Huaxiu Yao"
      ],
      "summary": "arXiv:2605.14133v2 Announce Type: replace Abstract: Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present \\textbf{ClawForge}, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17\\% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2508.20810v3",
      "url": "https://arxiv.org/abs/2508.20810",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs",
      "author": [
        "Jessica M. Lundin",
        "Usman Nasir Nakakana",
        "Guillaume Chabot-Couture"
      ],
      "summary": "arXiv:2508.20810v3 Announce Type: replace Abstract: Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-18T04:00:00.000Z",
      "fetched_at": "2026-05-18T16:06:46.568Z"
    },
    {
      "id": "oai:arXiv.org:2605.17937v1",
      "url": "https://arxiv.org/abs/2605.17937",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting",
      "author": [
        "Zhensheng Wang",
        "Wenmian Yang",
        "Qingtai Wu",
        "Lequan Ma",
        "Yiquan Zhang",
        "Weijia Jia"
      ],
      "summary": "arXiv:2605.17937v1 Announce Type: cross Abstract: Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18630v1",
      "url": "https://arxiv.org/abs/2605.18630",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science",
      "author": [
        "Nithin Somasekharan",
        "Youssef Hassan",
        "Shiyao Lin",
        "Gihan Panapitiya",
        "Patrick Emami",
        "Anurag Acharya",
        "Sameera Horawalavithana",
        "Shaowu Pan"
      ],
      "summary": "arXiv:2605.18630v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi- turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and par- tial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (in- consistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM per- formance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs fre- quently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.16372v1",
      "url": "https://arxiv.org/abs/2605.16372",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "SwordBench: Evaluating Orthogonality of Steering Image Representations",
      "author": [
        "Vladimir Zaigrajew",
        "Dawid Pludowski",
        "Hubert Baniecki",
        "Przemyslaw Biecek"
      ],
      "summary": "arXiv:2605.16372v1 Announce Type: cross Abstract: Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    }
  ],
  "total": 5511
}

▼thought

Good, I've got a lot of material. Let me now fetch some of the most relevant articles for deeper reading, and also do additional targeted searches to fill gaps. Let me fetch:
1. The Ofir Press article on how to build good benchmarks
2. The SPEC paper on how to build a benchmark
3. The Kili Technology custom AI benchmark guide
4. The How2Bench paper
5. The Can We Trust AI Benchmarks survey
6. The Interactive Evaluation paper

Also need to search for:
- SPEC benchmark methodology (classic CS benchmark design)
- Psychometrics / IRT in AI benchmarking
- More on data contamination prevention

▼✓Fetch web page content, URL: "https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/"

{
"url": "https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/",
"title": "How to Build Good Language Modeling Benchmarks",
"author": "Ofir Press",
"summary": "Building benchmarks is important because they shine a spotlight on the weaknesses of existing language models and so can guide the community on how to improve them.",
"content": "<div id=\"readability-page-1\" class=\"page\"><div>\n Building benchmarks is important because they shine a spotlight on the weaknesses of existing language models and so can guide the community on how to improve them.\n\nI’ve spent a lot of my career both on building benchmarks and on building systems that push forward the state-of-the-art on a given benchmark, and I believe that building good benchmarks is just as important as building new systems.\n\nDesigning a good benchmark is challenging and I’ve spent a lot of time recently thinking about what makes for a good benchmark. I’ve distilled it down to three main properties:\n\n<h4 id=\"1--natural\">1. Natural:</h4>\nTry to build a benchmark that has natural questions that some category of humans ask on a frequent basis. For example, the questions in our <a href=\"https://swe-bench.com/\">SWE-bench</a> are made up of real bugs that users reported in popular GitHub repos. The task is to take the reported bug and the repo (as it was at the time the bug was reported) and try to fix the bug. That’s a very natural task that many people do on a daily basis (and even get paid for). Other natural tasks that we’ve recently turned into benchmarks include answering questions such as “What yoga studio near me has vinyasa classes before 8 AM on weekdays?” (see <a href=\"https://assistantbench.github.io/\">AssistantBench</a>) and “Which paper first showed that transformer language models can’t extrapolate to long sequences?” (see <a href=\"https://citeme.ai/\">CiteME</a>).\n\nSometimes I see new benchmarks come out that don’t fulfill the naturalness criteria, and they often have a hard time getting excitement from the community. I find benchmarks that contain IQ test-like questions, where you have to identify patterns in diagrams to not be very exciting. Or any ‘common sense’-like benchmarks that have questions like ‘Bob threw an egg at Alice’s face. 
Is Alice happy, sad, or ambivalent?’ These types of benchmarks might have been interesting in the past, when our LMs were still struggling with basic tasks, but now that our language models are becoming more capable, we need to challenge them with tougher and more realistic tasks.\n\nAnother way to think about whether a benchmark fulfills the naturalness criteria is to evaluate whether it fulfills what I term the usefulness criteria: would a system that got better-than-baseline accuracy on this benchmark be useful to humans? Would it make anyone more productive? A system that can autonomously fix bugs would save lots of time for developers, even if it only managed to fix the easiest ten percent of bugs. A system that can quickly find me a yoga class that meets my needs would save me time.\n\nI’ve also noticed that there are two simple indicators for a benchmark being unnatural, and so I try to avoid building benchmarks that have these properties:\n\nA. The question set-up is unrealistic: For example, if a benchmark contains multiple choice questions, I believe it is unnatural. When I go to the doctor, I never say “Doctor doctor, my elbow hurts, and it is definitely happening because of one of these four options…”. Always think of your question set-up and if it seems unrealistic, try to modify it.\n\nB. The questions are made up and not taken from actual questions asked by actual humans: If you work for Google and you’re tasked with building a challenging question-answering benchmark, a really nonoptimal thing to do would be to sit around by yourself in a room and just try to think of questions. You’d probably come up with weird questions that no real user would ever ask. 
A really smart thing to do would be to look at the Google Search logs and try to filter it to find questions that users entered and did not find a good answer to (for example, this might be indicated by the user going to the second result page or because the user spent more than five minutes on the initial results page).\n\nSWE-bench contains real bug reports filed by real users on real GitHub repos. I think this makes the benchmark much more exciting to the community. Using questions that actual users asked implies that by building systems that get higher scores on the benchmark, we would be fulfilling a real-world need.\n\n<h4 id=\"2--automatically-evaluateable\">2. Automatically Evaluateable:</h4>\nIn a benchmark, given a model-generated answer to a question, we need to determine if the model was right or wrong. Sometimes this is easy, but depending on the question type, this could be hard or impossible. Validating the correctness of code could be a challenging task, since there are many different ways to program a given function, and that’s why benchmarks such as HumanEval and SWE-bench use unit tests to automatically validate code.\n\nSummarization is a task that I think could be super useful for humans (“Write a 500 word summary of this patient’s medical file”) but we’ve seen very little development of new benchmarks in this space because evaluation is just so hard. There are many different ways to correctly summarize a given text, but evaluating these summaries is hard. Some have proposed using an LM to evaluate LM outputs but I don’t think that that’s the right way to go. We should either use an LM to solve a task, or use it to judge outputs, but if we use it as both the solver and the evaluator that leads to problems.\n\n<h4 id=\"3--challenging\">3. 
Challenging</h4>\nIf you launch an automatically evaluable and natural benchmark, but the accuracy of the best LM at launch is 80%, people will see your benchmark as already being solved and won’t want to try and build models to improve performance on it. I think making your benchmark challenging is critical. I think that at launch, a good benchmark should have the top LMs achieving between 1% and 35% accuracy on it.\n\nEdit, January 2025: Due to the extremely fast development of LMs these days, I currently recommend that benchmark builders launch their benchmarks with the top accuracy being between 0.1% and 9%. Anything higher probably means that the benchmark is too easy.\n\nEdit, May 2025: I had to make another edit. Due to the speed of development of AI, I’m now asking my collaborators not to think of benchmarks that would have AI systems achieving 0% at launch, but to think of benchmarks that would have systems achieving “-200%” at launch. Find questions that are so hard that even if the models improve 3x they’ll still get zero. Just building a benchmark where models get 0% today might not be enough anymore. You have to look at how the models have been improving over the past 3-6 months, try to predict where they’ll be in 6-12 months and build benchmarks that would not only make current models fail, but benchmarks that would make the models of next year fail as well. Anything easier than that might get saturated much more quickly than you expect.\n\nIf you find a benchmark idea and build it out and it’s natural and automatically evaluable but you build a baseline and it gets 70% right, one thing you might want to consider doing is to use that baseline to filter out the easier instances in the benchmark. For example, our <a href=\"https://ofir.io/The-Bamboogle-Dataset/\">Bamboogle benchmark</a> had tough-to-answer 2-hop questions, and we built the dataset by filtering out all questions that Google Search answered correctly. 
For CiteME, we filtered out all questions that GPT-4o managed to answer correctly in a prompting-only setting (i.e. non-agentic). I think that building benchmarks by finding tasks that a strong existing approach can’t solve is a great way to go.\n\nBeware- researchers are humans and humans have emotions. If at launch, the top model’s accuracy is less than 10%, that might seem very intimidating for most researchers, and they might not want to work on your benchmark at all. Try to plan for that. For example, when we launched SWE-bench the top model’s accuracy was 1.96%. Almost everyone I talked to at the time was intimidated by this and didn’t want to approach it. I wasn’t worried, because we immediately started working on SWE-agent after releasing SWE-bench. I remember telling the team that if we got anywhere near 10% accuracy, the community would see that SWE-bench isn’t as impossible as it seemed, and that would get the ball rolling. Eventually we launched SWE-agent at around 13% accuracy and soon afterwards a barrage of other models appeared, each getting better accuracy than the previous one.\n\n<h4 id=\"bonus-property\">Bonus Property:</h4>\n\nBuilding a benchmark that would be hard to leak into the training data is something that I think about all the time. Could we build a benchmark such that even if the benchmark itself leaks into an LM’s training data, it won’t really help that LM in getting a good score on the benchmark? In <a href=\"https://scicode-bench.github.io/\">SciCode</a>, we had PhDs write very tough programming challenges related to their field of study. Each instance in the dataset is a description of a function and the unit tests to validate whether the model programmed it correctly or not. We intentionally do not release any of the answers to these programming challenges, to make sure these answers are never inserted into any LMs training data. 
This way, even if our benchmark fully leaks into an LM’s training set, it still won’t be able to produce the right answers to the questions. Achieving this property is extremely difficult, and so it’s not something I try to do with every single benchmark I build.\n\n<h4 id=\"other-guidelines\">Other guidelines:</h4>\n\n<ol>\n <li>\n Have one number for your benchmark. One metric that people go for. “We get 87% on HumanEval” is the vibe you are going for. Don’t have three metrics, like accuracy, precision, and recall, have just one. Don’t divide accuracy by category, have just an overall accuracy. This is really important. You want to make use of your benchmark as easy as possible. You want people to get it right away. If you start having seventeen metrics and nineteen categories it’s going to be complicated for people to understand what you’re trying to do, and that will lower the chances of your benchmark catching on.\n\n When you write the analysis section in your paper about your benchmark, it’s totally fine to present other metrics for each model, or to break down performance by category, but you should only do that there, and not have the categories or other metrics when you generally talk about the benchmark.\n </li>\n <li>\n When you write a paper, always include very strong baselines, both based on strong proprietary models and on leading open source models. You should never try to make your benchmark look more impressive than it actually is by including only weak baselines, or baseline systems that use old or outdated models like GPT 3.5 or Llama 3 7B.\n </li>\n <li>\n Benchmarks typically get saturated within a year. And so when you build a benchmark, I don’t think it’s important to worry about questions like “Will this still be an interesting question in five years?”. Deep learning moves incredibly quickly, and there’s no way to predict where we’ll be in more than a year. 
For this reason, it’s also OK to have benchmarks that ask questions that will probably have totally different answers in a year or two, such as “Which yoga studios near Bushwick in NYC have a Vinyasa class before 7AM?”.\n </li>\n</ol>\n\nBenchmarks are great because they provide a lot of room for creativity, and they can be super impactful in guiding the community towards the future. I hope this post helps you in building the next big LM benchmark. And always remember- rules are meant to be broken. I do not think that a benchmark that does not follow all of these rules is bad. I just think that these guidelines are a good indicator for whether you’re on the right path or not.\n\n<h3 id=\"summary\">Summary</h3>\n\nHere are some questions to think about while you’re designing a new benchmark:\n<ol>\n <li>\n A benchmark is a collection of tasks, where each task is made up of &lt;request, environment, stopping criteria, scorer&gt; 4-tuples.\n &nbsp;&nbsp;&nbsp;&nbsp;\nA. The request is what you want the model to actually do, i.e. in SWE-bench it would be “Fix this issue “ + issue_text.\n &nbsp;&nbsp;&nbsp;&nbsp;\nB. The environment is a total description of the environment that the agent will act in while solving your request. Is internet access allowed? What dependencies are installed and which ones are not? Are there any special tools you will be providing the agent with?\n &nbsp;&nbsp;&nbsp;&nbsp;\nC. The stopping criteria is how you decide when to end an agent’s run. For some tasks the agent will probably issue a ‘submit’ command and exit but you need to decide how to act when that never happens. Are you going to have a turn limit per task? A cost limit? A walltime limit? A combination of these? All answers are viable, you just need to decide.\n &nbsp;&nbsp;&nbsp;&nbsp;\nD. The scorer takes the environment as it was when the agent exited and scores it. Will you build a binary pass/fail benchmark, like we did in SWE-bench with the fail2pass and pass2pass tests? 
Or will you build a benchmark with a continuous score, like we did in AlgoTune, where we ask agents to speed up computer programs, and the score per task is the total runtime of the agent’s code divided by our baseline’s total runtime. Or will you use Elo like we did in CodeClash? There are many possibilities here.\n </li>\n <li>\n What is the baseline scaffolding that you will use and how similar is it to the best scaffolding in common use right now? For example, if you’re asking coding questions, and your scaffolding doesn’t allow for code execution, that’s not a very good representation of reality. If you’re asking knowledge questions and don’t allow access to the internet, that’s not realistic. Try to make your scaffolding as good as you can. This frequently doesn’t take as much effort as people think. mini-SWE-agent is able to get scores very competitive with (and sometimes even surpassing) Claude Code these days, even though it is orders of magnitude simpler. I talk a lot about how much easier it is to sell a benchmark that is realistic, and part of that is making the tasks realistic, but you should also make your baseline scaffolding realistic, otherwise people will mistrust your results.\n </li>\n</ol>\n\nBenchmarks are what move the frontier of AI forward; there isn’t anything more important than building good new benchmarks. Good luck!\n\n\n </div></div>",
"markdown": "Building benchmarks is important because they shine a spotlight on the weaknesses of existing language models and so can guide the community on how to improve them.\n\nI’ve spent a lot of my career both on building benchmarks and on building systems that push forward the state-of-the-art on a given benchmark, and I believe that building good benchmarks is _just as important_ as building new systems.\n\nDesigning a good benchmark is challenging and I’ve spent a lot of time recently thinking about what makes for a good benchmark. I’ve distilled it down to three main properties:\n\n#### 1\\. Natural:\n\nTry to build a benchmark that has natural questions that some category of humans ask on a frequent basis. For example, the questions in our [SWE-bench](https://swe-bench.com/) are made up of _real_ bugs that users reported in popular GitHub repos. The task is to take the reported bug and the repo (as it was at the time the bug was reported) and try to fix the bug. That’s a very natural task that many people do on a daily basis (and even get paid for). Other natural tasks that we’ve recently turned into benchmarks include answering questions such as “What yoga studio near me has vinyasa classes before 8 AM on weekdays?” (see [AssistantBench](https://assistantbench.github.io/)) and “Which paper first showed that transformer language models can’t extrapolate to long sequences?” (see [CiteME](https://citeme.ai/)).\n\nSometimes I see new benchmarks come out that don’t fulfill the naturalness criteria, and they often have a hard time getting excitement from the community. I find benchmarks that contain IQ test-like questions, where you have to identify patterns in diagrams to not be very exciting. Or any ‘common sense’-like benchmarks that have questions like ‘Bob threw an egg at Alice’s face. 
Is Alice happy, sad, or ambivalent?’ These types of benchmarks might have been interesting in the past, when our LMs were still struggling with basic tasks, but now that our language models are becoming more capable, we need to challenge them with tougher and more realistic tasks.\n\nAnother way to think about whether a benchmark fulfills the naturalness criteria is to evaluate whether it fulfills what I term the usefulness criteria: would a system that got better-than-baseline accuracy on this benchmark be useful to humans? Would it make anyone more productive? A system that can autonomously fix bugs would save lots of time for developers, even if it only managed to fix the easiest ten percent of bugs. A system that can quickly find me a yoga class that meets my needs would save me time.\n\nI’ve also noticed that there are two simple indicators for a benchmark being unnatural, and so I try to avoid building benchmarks that have these properties:\n\nA. **The question set-up is unrealistic**: For example, if a benchmark contains multiple choice questions, I believe it is unnatural. When I go to the doctor, I never say “Doctor doctor, my elbow hurts, and it is definitely happening because of one of these _four_ options…”. Always think of your question set-up and if it seems unrealistic, try to modify it.\n\nB. **The questions are made up and not taken from actual questions asked by actual humans**: If you work for Google and you’re tasked with building a challenging question-answering benchmark, a really nonoptimal thing to do would be to sit around by yourself in a room and just try to think of questions. You’d probably come up with weird questions that no real user would ever ask. 
A really smart thing to do would be to look at the Google Search logs and try to filter it to find questions that users entered and did not find a good answer to (for example, this might be indicated by the user going to the second result page or because the user spent more than five minutes on the initial results page).\n\nSWE-bench contains real bug reports filed by real users on real GitHub repos. I think this makes the benchmark much more exciting to the community. Using questions that actual users asked implies that by building systems that get higher scores on the benchmark, we would be fulfilling a real-world need.\n\n#### 2\\. Automatically Evaluateable:\n\nIn a benchmark, given a model-generated answer to a question, we need to determine if the model was right or wrong. Sometimes this is easy, but depending on the question type, this could be hard or impossible. Validating the correctness of code could be a challenging task, since there are many different ways to program a given function, and that’s why benchmarks such as HumanEval and SWE-bench use unit tests to automatically validate code.\n\nSummarization is a task that I think could be super useful for humans (“Write a 500 word summary of this patient’s medical file”) but we’ve seen very little development of new benchmarks in this space because evaluation is just so hard. There are many different ways to correctly summarize a given text, but evaluating these summaries is hard. Some have proposed using an LM to evaluate LM outputs but I don’t think that that’s the right way to go. We should either use an LM to solve a task, or use it to judge outputs, but if we use it as both the solver and the evaluator that leads to problems.\n\n#### 3\\. Challenging\n\nIf you launch an automatically evaluatable and natural benchmark, but the accuracy of the best LM at launch is 80%, people will see your benchmark as already being _solved_ and won’t want to try and build models to improve performance on it. 
I think making your benchmark challenging is critical. I think that at launch, a good benchmark should have the top LMs achieving between 1% and 35% accuracy on it.\n\n_Edit, January 2025: Due to the extremely fast development of LMs these days, I currently recommend that benchmark builders launch their benchmarks with the top accuracy being between 0.1% and 9%. Anything higher probably means that the benchmark is too easy._\n\n_Edit, May 2025: I had to make another edit. Due to the speed of development of AI, I’m now asking my collaborators not to think of benchmarks that would have AI systems achieving 0% at launch, but to think of benchmarks that would have systems achieving “-200%” at launch. Find questions that are so hard that even if the models improve 3x they’ll still get zero. Just building a benchmark where models get 0% today might not be enough anymore. You have to look at how the models have been improving over the past 3-6 months, try to predict where they’ll be in 6-12 months and build benchmarks that would not only make current models fail, but benchmarks that would make the models of next year fail as well. Anything easier than that might get saturated much more quickly than you expect._\n\nIf you find a benchmark idea and build it out and it’s natural and automatically evaluable but you build a baseline and it gets 70% right, one thing you might want to consider doing is to use that baseline to **filter out the easier instances** in the benchmark. For example, our [Bamboogle benchmark](https://ofir.io/The-Bamboogle-Dataset/) had tough-to-answer 2-hop questions, and we built the dataset by filtering out all questions that Google Search answered correctly. For CiteME, we filtered out all questions that GPT-4o managed to answer correctly in a prompting-only setting (i.e. non-agentic). 
I think that building benchmarks by finding tasks that a strong existing approach can’t solve is a great way to go.\n\nBeware- researchers are humans and humans have emotions. If at launch, the top model’s accuracy is less than 10%, that might seem very intimidating for most researchers, and they might not want to work on your benchmark at all. Try to plan for that. For example, when we launched SWE-bench the top model’s accuracy was 1.96%. Almost everyone I talked to at the time was intimidated by this and didn’t want to approach it. I wasn’t worried, because we immediately started working on SWE-agent after releasing SWE-bench. I remember telling the team that if we got anywhere near 10% accuracy, the community would see that SWE-bench isn’t as impossible as it seemed, and that would get the ball rolling. Eventually we launched SWE-agent at around 13% accuracy and soon afterwards a barrage of other models appeared, each getting better accuracy than the previous one.\n\n#### Bonus Property:\n\nBuilding a benchmark that would be **hard to leak into the training data** is something that I think about all the time. Could we build a benchmark such that even if the benchmark itself leaks into an LM’s training data, it won’t really help that LM in getting a good score on the benchmark? In [SciCode](https://scicode-bench.github.io/), we had PhDs write very tough programming challenges related to their field of study. Each instance in the dataset is a description of a function and the unit tests to validate whether the model programmed it correctly or not. We intentionally _do not_ release any of the answers to these programming challenges, to make sure these answers are never inserted into any LMs training data. This way, even if our benchmark fully leaks into an LM’s training set, it still won’t be able to produce the right answers to the questions. 
Achieving this property is extremely difficult, and so it’s not something I try to do with every single benchmark I build.\n\n#### Other guidelines:\n\n1. Have **one** number for your benchmark. One metric that people go for. “We get 87% on HumanEval” is the vibe you are going for. Don’t have three metrics, like accuracy, precision, and recall, have just one. Don’t divide accuracy by category, have just an overall accuracy. This is really important. You want to make use of your benchmark as easy as possible. You want people to _get it_ right away. If you start having seventeen metrics and nineteen categories it’s going to be complicated for people to understand what you’re trying to do, and that will lower the chances of your benchmark catching on.\n \n When you write the analysis section in your paper about your benchmark, it’s totally fine to present other metrics for each model, or to break down performance by category, but you should only do that there, and not have the categories or other metrics when you generally talk about the benchmark.\n \n2. When you write a paper, always **include very strong baselines**, both based on strong proprietary models and on leading open source models. You should never try to make your benchmark look more impressive than it actually is by including only weak baselines, or baseline systems that use old or outdated models like GPT 3.5 or Llama 3 7B.\n \n3. Benchmarks typically get saturated within a year. And so when you build a benchmark, I don’t think it’s important to worry about questions like “Will this still be an interesting question in five years?”. Deep learning moves incredibly quickly, and there’s no way to predict where we’ll be in more than a year. 
For this reason, it’s also OK to have benchmarks that ask questions that will probably have totally different answers in a year or two, such as “Which yoga studios near Bushwick in NYC have a Vinyasa class before 7AM?”.\n \n\nBenchmarks are great because they provide a lot of room for creativity, and they can be super impactful in guiding the community towards the future. I hope this post helps you in building the next big LM benchmark. And always remember- rules are meant to be broken. I do not think that a benchmark that does not follow all of these rules is bad. I just think that these guidelines are a good indicator for whether you’re on the right path or not.\n\n### Summary\n\nHere are some questions to think about while you’re designing a new benchmark:\n\n1. A benchmark is a collection of tasks, where each task is made up of <request, environment, stopping criteria, scorer> 4-tuples. \n A. The **request** is what you want the model to actually do, i.e. in SWE-bench it would be “Fix this issue “ + issue\\_text. \n B. The **environment** is a total description of the environment that the agent will act in while solving your request. Is internet access allowed? What dependencies are installed and which ones are not? Are there any special tools you will be providing the agent with? \n C. The **stopping criteria** is how you decide when to end an agent’s run. For some tasks the agent will probably issue a ‘submit’ command and exit but you need to decide how to act when that never happens. Are you going to have a turn limit per task? A cost limit? A walltime limit? A combination of these? All answers are viable, you just need to decide. \n D. The **scorer** takes the environment as it was when the agent exited and scores it. Will you build a binary pass/fail benchmark, like we did in SWE-bench with the fail2pass and pass2pass tests? 
Or will you build a benchmark with a continuous score, like we did in AlgoTune, where we ask agents to speed up computer programs, and the score per task is the agent’s code total runtime divided by our baseline’s total runtime? Or will you use ELO like we did in CodeClash? There are many possibilities here.\n \n2. What is the baseline scaffolding that you will use and how similar is it to the best scaffolding in common use right now? For example, if you’re asking coding questions, and your scaffolding doesn’t allow for code execution, that’s not a very good representation of reality. If you’re asking knowledge questions and don’t allow access to the internet, that’s not realistic. Try to make your scaffolding as good as you can. This frequently doesn’t take as much effort as people think. mini-SWE-agent gets scores that are very competitive with (and sometimes even surpass) those of Claude Code these days, even though it is orders of magnitude simpler. I talk a lot about how much easier it is to sell a benchmark that is realistic, and part of that is making the tasks realistic, but you should also make your baseline scaffolding realistic, otherwise people will mistrust your results.\n \n\nBenchmarks are what move the frontier of AI forward; there isn’t anything more important than building good new benchmarks. Good luck!"
}
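
The <request, environment, stopping criteria, scorer> 4-tuple and the "filter out the easier instances" advice from the post above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the SWE-bench or Bamboogle implementation: `Task`, `binary_scorer`, `filter_hard_tasks`, `overall_score`, and the `final_state` dictionary layout are all hypothetical names chosen for the example, and the stopping criterion is reduced to a single turn limit.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """One benchmark task: <request, environment, stopping criteria, scorer>."""
    request: str                                # what the agent is asked to do
    environment: Dict[str, Any]                 # e.g. internet access, installed tools
    max_turns: int                              # stopping criterion (assumed: turn limit only)
    scorer: Callable[[Dict[str, Any]], float]   # maps the final state to a score


def binary_scorer(final_state: Dict[str, Any]) -> float:
    """Pass/fail scoring in the spirit of fail-to-pass / pass-to-pass tests.

    Assumes final_state holds two lists of booleans (hypothetical layout):
    'fail_to_pass' for tests that failed before the fix, and
    'pass_to_pass' for tests that must keep passing.
    """
    fail_to_pass = list(final_state["fail_to_pass"])
    pass_to_pass = list(final_state["pass_to_pass"])
    # Require at least one target test so an empty list is not a free pass.
    solved = bool(fail_to_pass) and all(fail_to_pass) and all(pass_to_pass)
    return 1.0 if solved else 0.0


def filter_hard_tasks(tasks: List[Task],
                      baseline_solves: Callable[[Task], bool]) -> List[Task]:
    """Keep only the tasks that a strong baseline fails to solve."""
    return [task for task in tasks if not baseline_solves(task)]


def overall_score(scores: List[float]) -> float:
    """The single headline number: mean score over all tasks."""
    if not scores:
        raise ValueError("cannot score an empty benchmark")
    return sum(scores) / len(scores)
```

A caller would build a list of `Task` objects, run a baseline once to decide which tasks `filter_hard_tasks` keeps, and then report only `overall_score` as the headline metric, keeping per-category breakdowns for the analysis section.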

▼✓Fetch web page content, URL: "https://kili-technology.com/blog/custom-ai-benchmark-guide-what-the-best-publ..."

{
  "url": "about:blank",
  "title": "Custom AI Benchmark Guide: Lessons from Public Evals",
  "summary": "How HELM, GPQA Diamond, SWE-bench, and LegalBench design choices translate into a practitioner methodology for custom AI benchmarks you can trust.",
  "content": "<div id=\"readability-page-1\" class=\"page\"><div><div><div><div role=\"list\" data-w-id=\"866f87fb-b78e-c941-0157-b3859e4af64a\"><div role=\"listitem\"><p>LLMs</p></div><div role=\"listitem\"><p>AI Evaluation</p></div><div role=\"listitem\"><p>Foundation Models</p></div></div><div><h2 data-w-id=\"403ca70e-d530-b6de-d868-c208d5f4dfbf\">Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own</h2><p data-w-id=\"9e94c3f3-3cf3-5e65-6535-70daafdadb94\">The public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust.</p></div><div data-w-id=\"0fe81583-09e6-8487-ab50-1aaf94dc0ea6\"><div><p><img src=\"https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1).png\" loading=\"lazy\" alt=\"\" sizes=\"(max-width: 767px) 100vw, (max-width: 991px) 727px, 939px\" srcset=\"https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-500.png 500w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-800.png 800w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-1080.png 1080w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-1600.png 1600w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-2000.png 2000w, 
https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-2600.png 2600w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-3200.png 3200w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1).png 3644w\"></p><p>Kili Technology</p></div><p>·</p><p>Apr 30, 2026</p></div></div></div><div id=\"w-node-_630176ae-f75c-4c07-5b3f-71f4b37392a2-0511a5bd\"><div><h2>AI Summary</h2><div><ul id=\"\"><li id=\"\">A 2025 review of 445 LLM benchmarks found pervasive construct-validity gaps across widely cited AI benchmarks.</li><li id=\"\">HELM's 7-metric design raised cross-model evaluation coverage from 17.9% to 96%.</li><li id=\"\">GPQA Diamond's expert-write, expert-validate, non-expert-test pipeline is the cleanest reusable protocol for sourcing tasks from domain experts.</li><li id=\"\">SWE-bench scores rose from 1.96% to 71.7% in one year, and a later human-filtered subset removed roughly a third of the original tasks as ambiguous or infeasible.</li><li id=\"\">LLM judge reliability hinges on rubric clarity, not chain-of-thought; the rubric is the operationalisation of the construct you claim to measure.</li><li id=\"\">Kili Technology supports the upstream work behind reliable custom AI benchmarks: expert annotator workflows, rubric design, multi-annotator validation, and ongoing re-annotation as models and production traffic evolve.</li></ul></div></div><div fs-toc-element=\"contents\" fs-toc-offsettop=\"7rem\"><div id=\"introduction\"><h2>Introduction</h2><p>Three years ago, \"build a benchmark\" meant picking a task, writing a few hundred examples, and publishing a leaderboard. 
The methodology was loose because the time horizons were long: MMLU, released in 2020, took until 2024 to saturate.</p><p>That assumption is gone. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points on SWE-bench in a single year. Humanity's Last Exam, released in early 2025 with the explicit goal of resisting frontier reasoning models, went from 8.8% (top score, January 2025) to over 50% by April 2026. The exam was designed as the hardest standardised exam ever assembled for an LLM, and even an exam of that calibre hit the saturation curve within fifteen months. Stanford HAI's Vanessa Parli framed the resulting question bluntly: are we measuring the right thing, are the benchmarks compromised, and how should the research community evaluate models?</p><p>The pressure is sharper for teams running production systems. Public leaderboards optimise for capability ceilings; production systems break on consistency floors. An MIT NANDA analysis of roughly 300 enterprise AI deployments found only 5% reach measurable P&amp;L impact, and the binding constraints are workflow integration and evaluation gaps, not model quality. Gartner expects more than 40% of agentic AI projects to be cancelled by 2027.</p><p>A custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public AI benchmarks are now publicly working through.</p></div><div id=\"why-are-public-ai-benchmarks-no-longer-enough\"><h2>Why Are Public AI Benchmarks No Longer Enough?</h2><p>Four forces have converged.</p><p>The first is <strong>saturation speed</strong>. When SWE-bench was introduced in late 2023, the best model (Claude 2) solved 1.96% of its real GitHub issues. By 2024, leading research systems and agents were solving 71.7%. 
The benchmark community responded with harder variants (the human-filtered SWE-bench Verified subset and a Pro tier), but the underlying pattern is structural: any static benchmark hard for today's frontier reasoning models will be solved by next year's. A useful heuristic for benchmark builders is that top models should land below roughly 35% accuracy at launch; anything easier is already a regression eval in disguise, and the resulting scores will not separate frontier systems from each other.</p><p>The second is <strong>construct validity erosion</strong>. The 2025 Reuel et al. systematic review of 445 LLM benchmarks identified prevalent gaps in construct validity, the property that the test actually measures the capability it claims to measure. Naming a benchmark \"general reasoning\" or \"general knowledge\" doesn't establish that the score generalises to the construct. Raji et al. flagged this in 2021, calling general-purpose benchmark framing \"ultimately dangerous and deceptive.\" The 2025 review confirms the problem is endemic, not isolated.</p><p>The third is <strong>benchmark exploitability</strong>. Recent research has shown that headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks. Researchers have demonstrated automated scanning agents that exploit structural flaws in popular AI benchmarks (for instance, the lack of strict isolation between the agent under test and the evaluator process) to achieve near-perfect scores without solving any of the tasks. The attack pattern unfolds in stages: the scanning agent probes the harness, identifies leaked grader signals, and produces outputs that satisfy the grader without solving the problem. The takeaway is unambiguous: high scores on static benchmarks can be deeply misleading, and inflated leaderboard scores can be uncorrelated with the underlying capabilities.</p><p>The fourth is the <strong>research-versus-production divergence</strong>. 
Public benchmarks measure peak capability on single attempts. Production systems need consistency: an <a href=\"https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4660\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4660\">Anthropic engineering analysis</a> of τ-bench found agents hitting 60% pass@1 dropped to 25% pass^k (consistency across k trials). That gap is invisible in single-run leaderboards and catastrophic for users. The same analysis pushes teams toward pass@k <em>and</em> pass^k, partial-credit graders, and balanced positive/negative cases, none of which are standard on public leaderboards.</p><p>If your system processes legal contracts, diagnoses medical images, or routes financial trades, the relevant comparison isn't whether your model beats GPT-4 on MMLU. It's whether it does the specific job you need it to do, reliably, on the kind of inputs your production traffic actually contains.</p></div><div id=\"what-makes-a-custom-ai-benchmark-actually-measure-what-it-claims\"><h2>What Makes a Custom AI Benchmark Actually Measure What It Claims?</h2><p>Construct validity is the first principle. Before you draft a single task, write down (in a paragraph, not a vibe) what capability you're measuring, why it matters for your system, and what it would mean for the score to go up or down. If you can't articulate the construct, you can't measure it. In practice this also means aligning the benchmark to a concrete business outcome or domain task — what \"right\" looks like — rather than chasing an abstract aggregate score.</p><div><p>The Reuel et al. 
research proposes four validity types worth carrying through:<br>- <strong>construct</strong> (does it measure what it names),<br>- <strong>criterion</strong> (does it correlate with downstream outcomes),<br>- <strong>consequential</strong> (does optimising for it produce the behaviours you want), and<br>- <strong>external</strong> (does it generalise outside the eval set).</p><p>A benchmark that ignores any of the four is gameable.</p></div><p>The second principle is <strong>multi-metric coverage with explicit gaps acknowledged</strong>. The <a href=\"https://arxiv.org/abs/2211.09110\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4673\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4673\">HELM framework from Stanford CRFM</a> rejected single-number leaderboards and instead reported 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 16 core scenarios. The point wasn't completeness; it was making trade-offs visible to anyone reading the data. Before HELM, models were evaluated on just 17.9% of its core scenarios on average; HELM raised this to 96%, putting 30 models on equal footing for direct comparison. For a custom benchmark, the lesson is to pick three or four metrics that capture the trade-offs you actually face (latency vs. accuracy, recall vs. precision, helpfulness vs. harm) and report them all, every time. A workable pattern is to lead with one primary metric for at-a-glance understanding and relegate detailed secondary metrics (and the parameters used to compute them) to an appendix that lets another team replicate the run end-to-end.</p><p>The third principle is <strong>balanced problem sets</strong>. Anthropic's engineering guidance phrases it directly: include cases where the behaviour should occur and cases where it shouldn't. 
A benchmark of \"did the agent successfully cancel the order\" is incomplete without \"did the agent refuse to cancel an order it shouldn't have.\" Balanced sets are how you catch over-eager agents and miss-rate trade-offs that single-axis evaluation hides. Edge cases and adversarial items belong in the same set — silent failures hide between average and worst-case behaviour.</p><p>Construct validity is also where rubric design enters early. <a href=\"https://arxiv.org/abs/2506.13639\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a467d\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a467d\">Park et al. (2025)</a> showed empirically that evaluation criteria are the dominant factor in LLM judge reliability; chain-of-thought offers minimal gains when the rubric is clear. <strong>The rubric is the operationalisation of the construct.</strong> If the rubric is vague, the construct is vague, and no amount of grader sophistication will save the score. 
A clear rubric also accelerates downstream understanding: when scores diverge across model versions, the team can read the rubric and trace the divergence to a specific criterion rather than re-arguing the underlying construct.</p><p>This is also where Kili-style annotation infrastructure starts to matter, not for the eval itself, but for the upstream work of writing the rubric, validating it against expert disagreement, and revising the data collection process before any tasks are graded.</p></div><div id=\"where-should-the-tasks-come-from-and-who-should-write-them\"><h2>Where Should the Tasks Come From, and Who Should Write Them?</h2><p>Two patterns from the public AI benchmarks dominate:</p><div id=\"real-artefacts-beat-synthetic-prompts\"><h3><strong>Real artefacts beat synthetic prompts</strong></h3><p><a href=\"https://arxiv.org/abs/2310.06770\" data-wf-native-id-path=\"2f63eefe-abe7-aa5a-7eb6-da7683ec89b4\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"2f63eefe-abe7-aa5a-7eb6-da7683ec89b4\">SWE-bench</a> drew its 2,294 tasks from real GitHub issues across 12 popular Python repositories and graded them by running the project's actual test suite. A patch passes only if it makes the failing tests pass without breaking the passing ones (FAIL_TO_PASS / PASS_TO_PASS). The realistic construction setting, in the authors' phrasing, gave the dataset properties no synthetic prompt collection could replicate: tasks are continually updatable from new pull requests, hard to game with surface heuristics, and grounded in code that other humans actually had to review. For a custom benchmark, the implication is to mine your own systems first. Bug trackers, support tickets, rejected agent outputs, escalations to human review: these are the highest-signal task sources you have.</p></div><div id=\"expert-authoring-with-adversarial-validation\"><h3><strong>Expert authoring with adversarial validation</strong>
</h3><p>‍<a href=\"https://arxiv.org/abs/2311.12022\" data-wf-native-id-path=\"a29c6242-7f8d-265f-9105-82d268588ad8\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"a29c6242-7f8d-265f-9105-82d268588ad8\">GPQA Diamond</a> is the cleanest published example. Its four-stage pipeline (expert authoring, expert validation, revision, non-expert validation) produced 448 multiple-choice questions where domain PhDs reach 65% accuracy, 74% if you discount clear mistakes, but skilled non-experts (humans with 30+ minutes of unrestricted web access) reach only 34%. The Diamond subset is the high-confidence slice — questions where two domain experts agreed on the answer and a third validated it independently. The non-expert validation stage is the underappreciated part: it gives you a quantitative answer to \"is this question actually hard, or just obscure?\" GPQA Diamond also embeds a canary string for contamination tracing, a small touch with disproportionate downstream value.</p><p><a href=\"https://arxiv.org/abs/2308.11462\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a469a\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a469a\">LegalBench</a> is the canonical example of the expert-led collaborative model: 162 tasks across 6 reasoning types (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical understanding) authored by 40+ contributors, including lawyers, law professors, and legal practitioners. 
The reasoning typology is itself a contribution: it forces task authors to articulate what kind of legal cognition each task is testing, which makes downstream interpretation of scores tractable.</p><p>Four operational rules follow:</p><p>- <strong>Start with real failures, then scale.</strong> Anthropic's guidance recommends 20–50 tasks drawn from production failures as a starting point; the working bar for a defensible domain benchmark sits higher, at a hand-picked, expert-labelled set of 200–1,000 examples that reflect real user journeys and edge cases for the use cases your agents actually serve in production. Treat dataset construction as a series of stages, not a one-shot exercise: a small seed set, then expansion, then iteration as failures surface. </p><p>- <strong>Write reference solutions</strong>: if your domain experts can't solve the task, the model definitely can't, and the failure tells you nothing. </p><p>- <strong>Use multiple raters per item</strong> to evaluate annotation quality: the SWE-bench Verified protocol used three annotators per item, severity-ensembled, to filter out infeasible or under-specified items, and produced what is now the de facto standard for executable-test benchmarks; single-rater work inherits single-rater blind spots, and using too few raters strips out the human disagreement that is itself a signal about task ambiguity. </p><p>- <strong>Keep the test set strictly separated from anything used to develop the system</strong>: train/test contamination quietly inflates scores on held-out evaluations, and the inflation is invisible until production performance disappoints.</p></div></div><div id=\"how-do-you-grade-outputs-that-dont-have-a-single-right-answer\"><h2>How Do You Grade Outputs That Don't Have a Single Right Answer?</h2><p>The Anthropic taxonomy is the cleanest framing: code-based graders, model-based graders (LLM judge), and human graders. 
Pick the cheapest grader that works for the task, and let the cost of being wrong, not the convenience of measuring, drive the choice.</p><p><strong>Code-based graders</strong> are the gold standard when applicable: exact match, regex, executable tests, structured output validation. The FAIL_TO_PASS / PASS_TO_PASS pattern from SWE-bench is a code-based grader; so is \"does the JSON parse and contain the required fields.\" If the task admits a code-based grader, use it. The grader is deterministic, free to run, and hard to game without solving the task, provided the harness is isolated from the system under test. For agents that produce structured outputs, this is the cheapest reliable signal you will ever have.</p><p><strong>Model-based graders</strong> are the workhorse for open-ended outputs. Park et al. showed evaluation criteria dominate reliability; rubric-guided judges (Prometheus, G-Eval) reach Pearson correlations around 0.897 with humans rating the same outputs when the rubric is unambiguous. The <a href=\"https://arxiv.org/abs/2510.07243\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46b7\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46b7\">LeMAJ legal evaluation framework</a> found inter-rater agreement among humans increased 11% when reviewers used a shared rubric, and rubric-guided LLM judge configurations hit Cohen's κ of 0.75 with human consensus. The pattern across these results: rubric clarity is the constraint, not judge model size.</p><p>Three operational rules. <strong>Validate against a human-labelled golden set</strong>: the working bar in practice is 75–90% agreement with human consensus; below that, the judge is amplifying noise. <strong>Decompose into structured criteria</strong>: binary checks (\"does the output cite a real case?\") plus ordinal scores (\"rate clarity 1–5 with anchor descriptions\") outperform single overall scores in nearly every published comparison. 
<strong>Watch the known biases.</strong> Model-based judges show position bias, length bias, and self-preference (preferring outputs from the same model family). Randomise position, normalise length when possible, and use a different model family for judging than for the system under test.</p><p><strong>Human graders</strong> are the calibration layer and the high-stakes layer. Reserve humans for golden-set construction, judge calibration, and tasks where the cost of a wrong answer makes a 90%-agreement model judge unacceptable: clinical safety, legal compliance, financial advice. The economics rarely support running humans across a full benchmark, but they almost always support running humans across a calibration subset — and the resulting scores anchor every other grader downstream.</p></div><div id=\"how-do-you-keep-a-custom-ai-benchmark-useful-over-time\"><h2>How Do You Keep a Custom AI Benchmark Useful Over Time?</h2><p>A custom benchmark is a versioned artefact with a maintenance schedule. Treat it that way and it stays useful; treat it as a one-time project and it expires inside a year.</p><p><strong>Contamination defence.</strong> The <a href=\"https://arxiv.org/abs/2411.03923\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46d1\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46d1\">Singh et al. (2024) ConTAM analysis</a> of 13 benchmarks across 7 models found contamination has been underestimated in many prominent LLM releases, even when developers attempted decontamination. 
Three practical defences: hold out a private split that never goes public; date-stamp every item so you can filter to post-cutoff data per model (the <a href=\"https://arxiv.org/abs/2403.07974\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46d4\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46d4\">LiveCodeBench</a> approach, which date-stamps problems from competitive programming contests since May 2023); and embed a canary string, GPQA Diamond–style, so you can later test whether a model has memorised your dataset.</p><p><strong>Harness isolation.</strong> A subtler failure mode is the absence of strict isolation between the system under test and the evaluator. When agents can read or write to the same filesystem as the grader, observe the grader's logs, or otherwise inspect the scoring process, automated exploits become trivial — and as recent scanning-agent research shows, headline scores in those conditions can reflect harness gaming rather than genuine capabilities. Build agent/evaluator isolation into the design, not as a follow-up.</p><p><strong>Saturation planning.</strong> Distinguish capability evals from regression evals from day one. A capability eval and a regression eval are two different stages of the same benchmark's life: capability evals start at low pass rates (5–30%) and let you hill-climb; once they saturate above 90%, they become regression evals, where the goal flips from \"can the model do this\" to \"did we break something that used to work.\" Both stages matter. A team with only capability evals goes blind once the model is good; a team with only regression evals never sees what it can't do yet, and the scores stop reflecting any meaningful comparison between systems.</p><p><strong>Continuous re-annotation.</strong> Pipe production failures back into the eval set. 
Every time a user flags a wrong answer, every human override, every escalation: these are pre-validated hard cases. The <a href=\"https://arxiv.org/abs/2411.12990\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46e3\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46e3\">BetterBench framework</a> found that of 24 evaluated SOTA benchmarks, only 3 included CI build status and only 4 provided replication scripts; the operational rigour we apply to code we routinely fail to apply to evals. Version your benchmark against your model releases. Tag eval runs with model version, prompt version, and rubric version.</p><p><strong>Standards alignment.</strong> For regulated industries, the NIST AI RMF and the <a href=\"https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46ea\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46ea\">NIST GenAI Profile (AI 600-1)</a> define testing, evaluation, verification, and validation (TEVV) as a core function of trustworthy AI. Mapping your benchmark to TEVV categories isn't decorative; it's the documentation trail you'll want when the auditor arrives.</p></div><div id=\"what-does-this-look-like-across-industries\"><h2>What Does This Look Like Across Industries?</h2><p>Four compressed examples to ground the principles. 
Each is anchored to a published benchmark so the design choices are inspectable.</p><p><strong>Healthcare.</strong> <a href=\"https://arxiv.org/abs/2212.13138\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46f5\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46f5\">MultiMedQA</a> combines six existing medical QA datasets with a new HealthSearchQA collection of consumer questions, then layers expert physician evaluation across multiple axes: factuality, possible harm, possible bias, scientific consensus alignment. The lesson: USMLE-style multiple-choice catches only the lower levels of Miller's pyramid (knows, knows how). Clinical safety requires open-ended generation graded by clinicians on multiple safety axes, not just answer accuracy.</p><p><strong>Legal.</strong> LegalBench's six reasoning types plus LegalBench-RAG's expert-annotated retrieval pairs together cover both answer correctness and retrieval-precision dimensions. The lesson: in regulated, citation-heavy domains, retrieval correctness matters as much as final-answer accuracy. A benchmark that grades only the synthesis ignores the failure mode that matters most: confidently wrong citations.</p><p><strong>Software engineering.</strong> The progression from SWE-bench to SWE-bench Verified to Terminal-Bench is the cleanest example of refinement under pressure. The <a href=\"https://openai.com/index/introducing-swe-bench-verified/\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4700\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4700\">OpenAI Verified work</a> filtered out roughly a third of the original items as ambiguous or infeasible, meaning roughly a third of the original benchmark measured noise from the benchmark itself, not signal from the model. The lesson: for any executable benchmark, agent harness bugs and grading-spec ambiguity cause more apparent failures than model limitations. 
Verify before you trust.</p><p><strong>Finance.</strong> The <a href=\"https://arxiv.org/abs/2508.00828\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4707\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4707\">Finance Agent Benchmark</a> provides 537 expert-authored questions covering retrieval through modelling, with an agentic harness that includes Google Search and SEC EDGAR access. The lesson: financial benchmarks need expert-authored questions, real document grounding, and tool-use evaluation; pure-text QA misses the workflow. The regulatory and compliance dimension is also load-bearing: a model that's right but cites a hallucinated 10-K creates legal exposure that pure accuracy metrics never surface.</p><p>The pattern across all four: the benchmark inherits the failure modes of the domain. Generic benchmarks miss these because they're generic. Custom benchmarks earn their cost by being specific.</p></div><div id=\"the-real-test-of-a-benchmark-is-whether-youd-trust-it-tomorrow\"><h2>The Real Test of a Benchmark Is Whether You'd Trust It Tomorrow</h2><p>The benchmark you build today will be consulted dozens of times before it expires. Each consultation is a decision: ship or don't, escalate or don't, retrain or don't. The cost of a bad benchmark isn't measured in eval-set creation hours; it's measured in the production decisions made in its name.</p><p>The public ai benchmarks that survived their first wave of scrutiny (HELM, GPQA Diamond, SWE-bench, LegalBench) share a small set of properties. They define the construct in writing. They use real artefacts when possible. They validate task design against domain experts before grading any model. They publish their rubrics. They plan for contamination, harness exploits, and saturation as design constraints, not afterthoughts. They version themselves like software.</p><p>These properties don't require frontier-lab budgets to replicate. 
They require treating the benchmark as the substrate of every downstream claim about your AI system, because that's what it is. The benchmarks that hold up are the ones built by people who understood the construct before they wrote the first task.</p></div><div id=\"ready-to-build-a-benchmark-that-actually-measures-your-ai-system\"><h2>Ready to Build a Benchmark That Actually Measures Your AI System?</h2><p>Kili Technology's data labeling and evaluation infrastructure supports the upstream work that makes custom benchmarks reliable: expert annotator workflows, rubric design and calibration, multi-annotator validation, and the continuous re-annotation cycle that keeps benchmarks useful as models and production traffic evolve. <a href=\"https://kili-technology.com/contact-us\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4718\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4718\">Talk to our team</a> about benchmark design for your domain.</p></div><div id=\"resources\"><h2>Resources</h2><div id=\"benchmark-methodology-papers\"><h3>Benchmark Methodology Papers</h3><ul role=\"list\"><li><strong>Holistic Evaluation of Language Models (HELM)</strong> – Stanford CRFM's multi-metric, multi-scenario framework<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2211.09110\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a47ed\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a47ed\">https://arxiv.org/abs/2211.09110</a></li></ul></li><li><strong>GPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark</strong> – Expert-authored, expert-validated, non-expert-tested protocol<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2311.12022\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a47f4\" data-wf-ao-click-engagement-tracking=\"true\" 
data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a47f4\">https://arxiv.org/abs/2311.12022</a></li></ul></li><li><strong>SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</strong> – Real-artefact tasks with executable graders<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2310.06770\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a47fb\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a47fb\">https://arxiv.org/abs/2310.06770</a></li></ul></li><li><strong>LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code</strong> – Date-stamped, contamination-aware design<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2403.07974\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4802\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4802\">https://arxiv.org/abs/2403.07974</a></li></ul></li><li><strong>LegalBench: A Collaboratively Built Benchmark for Legal Reasoning</strong> – Expert-led collaborative construction with reasoning typology<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2308.11462\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4809\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4809\">https://arxiv.org/abs/2308.11462</a></li></ul></li><li><strong>MultiMedQA / Large Language Models Encode Clinical Knowledge</strong> – Multi-axis human evaluation for medical LLMs<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2212.13138\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4810\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4810\">https://arxiv.org/abs/2212.13138</a></li></ul></li><li><strong>Finance Agent Benchmark</strong> – Expert-authored finance benchmark with agentic harness<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2508.00828\" 
data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4817\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4817\">https://arxiv.org/abs/2508.00828</a></li></ul></li></ul></div><div id=\"benchmark-quality-and-validity-research\"><h3>Benchmark Quality and Validity Research</h3><ul role=\"list\"><li><strong>Measuring What Matters: Construct Validity in LLM Benchmarks</strong> – 445-benchmark systematic review<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2511.04703\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a481e\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a481e\">https://arxiv.org/abs/2511.04703</a></li></ul></li><li><strong>BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices</strong> – Operational checklist for benchmark quality<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2411.12990\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4825\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4825\">https://arxiv.org/abs/2411.12990</a></li></ul></li><li><strong>Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter?</strong> – ConTAM analysis across 13 benchmarks × 7 models<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2411.03923\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a482c\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a482c\">https://arxiv.org/abs/2411.03923</a></li></ul></li><li><strong>An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability</strong> – Rubric clarity dominates reliability<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2506.13639\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4833\" data-wf-ao-click-engagement-tracking=\"true\" 
data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4833\">https://arxiv.org/abs/2506.13639</a></li></ul></li><li><strong>LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation</strong> – Rubric-guided judging in legal domain<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2510.07243\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a483a\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a483a\">https://arxiv.org/abs/2510.07243</a></li></ul></li></ul></div><div id=\"industry-reports-and-practitioner-guidance\"><h3>Industry Reports and Practitioner Guidance</h3><ul role=\"list\"><li><strong>Demystifying Evals for AI Agents (Anthropic Engineering)</strong> – Three-grader taxonomy and capability vs. regression evals<ul role=\"list\"><li><a href=\"https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4841\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4841\">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></li></ul></li><li><strong>2025 AI Index Report — Technical Performance (Stanford HAI)</strong> – Benchmark saturation data<ul role=\"list\"><li><a href=\"https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4848\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4848\">https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance</a></li></ul></li><li><strong>Introducing SWE-bench Verified (OpenAI)</strong> – Human-filtered SWE-bench subset methodology<ul role=\"list\"><li><a href=\"https://openai.com/index/introducing-swe-bench-verified/\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a484f\" data-wf-ao-click-engagement-tracking=\"true\" 
data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a484f\">https://openai.com/index/introducing-swe-bench-verified/</a></li></ul></li><li><strong>Bridging the Operational AI Gap (MIT Technology Review Insights)</strong> – Enterprise AI deployment gap analysis<ul role=\"list\"><li><a href=\"https://www.technologyreview.com/2026/03/04/1133642/\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4856\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4856\">https://www.technologyreview.com/2026/03/04/1133642/</a></li></ul></li></ul></div><div id=\"standards-and-policy\"><h3>Standards and Policy</h3><ul role=\"list\"><li><strong>NIST AI Risk Management Framework (AI RMF 1.0)</strong> – Foundational TEVV definitions<ul role=\"list\"><li><a href=\"https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a485d\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a485d\">https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf</a></li></ul></li><li><strong>NIST AI Risk Management Framework: Generative AI Profile (AI 600-1)</strong> – Pre-deployment testing guidance for GAI<ul role=\"list\"><li><a href=\"https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4864\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4864\">https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf</a></li></ul></li></ul></div></div></div></div></div></div>",
  "markdown": "LLMs\n\nAI Evaluation\n\nFoundation Models\n\n## Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own\n\nThe public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust.\n\n![](https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20\\(1\\).png)\n\nKili Technology\n\n·\n\nApr 30, 2026\n\n![](https://cdn.prod.website-files.com/68da32b2041c593b0511a582/69f34f60ec23da20cca59620_custom%20ai%20benchmarks.png)\n\n## AI Summary\n\n-   A 2025 review of 445 LLM benchmarks found pervasive construct-validity gaps across widely cited ai benchmarks.\n-   HELM's 7-metric design raised cross-model evaluation coverage from 17.9% to 96%.\n-   GPQA Diamond's expert-write, expert-validate, non-expert-test pipeline is the cleanest reusable protocol for sourcing tasks from domain experts.\n-   SWE-bench scores rose from 1.96% to 71.7% in one year — and a later human-filtered subset removed roughly a third of the original tasks as ambiguous or infeasible.\n-   LLM judge reliability hinges on rubric clarity, not chain-of-thought; the rubric is the operationalisation of the construct you claim to measure.\n-   Kili Technology supports the upstream work behind reliable custom AI benchmarks — expert annotator workflows, rubric design, multi-annotator validation, and ongoing re-annotation as models and production traffic evolve.\n\n## Introduction\n\nThree years ago, \"build a benchmark\" meant pick a task, write a few hundred examples, publish a leaderboard. 
The methodology was loose because the time horizons were long: MMLU, released in 2020, took until 2024 to saturate.\n\nThat assumption is gone. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points on SWE-bench in a single year. Humanity's Last Exam, released in early 2025 with the explicit goal of resisting frontier reasoning models, went from 8.8% (top score, January 2025) to over 50% by April 2026. The exam was designed as the hardest standardised exam ever assembled for an LLM, and even an exam of that calibre hit the saturation curve within fifteen months. Stanford HAI's Vanessa Parli framed the resulting question bluntly: are we measuring the right thing, are the benchmarks compromised, and how should the research community evaluate models?\n\nThe pressure is sharper for teams running production systems. Public leaderboards optimise for capability ceilings; production systems break on consistency floors. An MIT NANDA analysis of roughly 300 enterprise AI deployments found only 5% reach measurable P&L impact, and the binding constraints are workflow integration and evaluation gaps, not model quality. Gartner expects more than 40% of agentic AI projects to be cancelled by 2027.\n\nA custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public AI benchmarks are now publicly working through.\n\n## Why Are Public AI Benchmarks No Longer Enough?\n\nFour forces have converged.\n\nThe first is **saturation speed**. When SWE-bench was introduced in late 2023, the best model (Claude 2) solved 1.96% of its real GitHub issues. By 2024, leading research systems and agents were solving 71.7%. 
The benchmark community responded with harder variants — a verified human-filtered subset and a Pro tier — but the underlying pattern is structural: any static benchmark hard for today's frontier reasoning models will be solved by next year's. A useful heuristic for benchmark builders is that top models should land below roughly 35% accuracy at launch — anything easier is already a regression eval in disguise, and the resulting scores will not separate frontier systems from each other.\n\nThe second is **construct validity erosion**. The 2025 Reuel et al. systematic review of 445 LLM benchmarks identified prevalent gaps in construct validity, the property that the test actually measures the capability it claims to measure. Naming a benchmark \"general reasoning\" or \"general knowledge\" doesn't establish that the score generalises to the construct. Raji et al. flagged this in 2021, calling general-purpose benchmark framing \"ultimately dangerous and deceptive.\" The 2025 review confirms the problem is endemic, not isolated.\n\nThe third is **benchmark exploitability**. Recent research has shown that headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks. Researchers have demonstrated automated scanning agents that exploit structural flaws in popular AI benchmarks (for instance, the lack of strict isolation between the agent under test and the evaluator process) to achieve near-perfect scores without solving any of the tasks. The attack pattern unfolds in stages: the scanning agent probes the harness, identifies leaked grader signals, and produces outputs that satisfy the grader without solving the problem. The takeaway is unambiguous: high scores on static benchmarks can be deeply misleading, and inflated leaderboard scores can be uncorrelated with the underlying capabilities.\n\nThe fourth is the **research-versus-production divergence**. Public benchmarks measure peak capability on single attempts. 
Production systems need consistency: an [Anthropic engineering analysis](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) of τ-bench found agents hitting 60% pass@1 dropped to 25% pass^k (consistency across k trials). That gap is invisible in single-run leaderboards and catastrophic for users. The same analysis pushes teams toward pass@k _and_ pass^k, partial-credit graders, and balanced positive/negative cases, none of which are standard on public leaderboards.\n\nIf your system processes legal contracts, diagnoses medical images, or routes financial trades, the relevant comparison isn't whether your model beats GPT-4 on MMLU. It's whether it does the specific job you need it to do, reliably, on the kind of inputs your production traffic actually contains.\n\n## What Makes a Custom AI Benchmark Actually Measure What It Claims?\n\nConstruct validity is the first principle. Before you draft a single task, write down (in a paragraph, not a vibe) what capability you're measuring, why it matters for your system, and what it would mean for the score to go up or down. If you can't articulate the construct, you can't measure it. In practice this also means aligning the benchmark to a concrete business outcome or domain task — what \"right\" looks like — rather than chasing an abstract aggregate score.\n\nThe Reuel et al. research proposes four validity types worth carrying through:  \n\\- **construct** (does it measure what it names),  \n\\- **criterion** (does it correlate with downstream outcomes),  \n\\- **consequential** (does optimising for it produce the behaviours you want), and  \n\\- **external** (does it generalise outside the eval set).\n\nA benchmark that ignores any of the four is gameable.\n\nThe second principle is **multi-metric coverage with explicit gaps acknowledged**. 
The [HELM framework from Stanford CRFM](https://arxiv.org/abs/2211.09110) rejected single-number leaderboards and instead reported 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 16 core scenarios. The point wasn't completeness; it was making trade-offs visible to anyone reading the data. Before HELM, models were evaluated on just 17.9% of its core scenarios on average; HELM raised this to 96%, putting 30 models on equal footing for direct comparison. For a custom benchmark, the lesson is to pick three or four metrics that capture the trade-offs you actually face (latency vs. accuracy, recall vs. precision, helpfulness vs. harm) and report them all, every time. A workable pattern is to lead with one primary metric for at-a-glance understanding and relegate detailed secondary metrics — and the parameters used to compute them — to an appendix that lets another team replicate the run end-to-end.\n\nThe third principle is **balanced problem sets**. Anthropic's engineering guidance phrases it directly: include cases where the behaviour should occur and cases where it shouldn't. A benchmark of \"did the agent successfully cancel the order\" is incomplete without \"did the agent refuse to cancel an order it shouldn't have.\" Balanced sets are how you catch over-eager agents and miss-rate trade-offs that single-axis evaluation hides. Edge cases and adversarial items belong in the same set — silent failures hide between average and worst-case behaviour.\n\nConstruct validity is also where rubric design enters early. [Park et al. (2025)](https://arxiv.org/abs/2506.13639) showed empirically that evaluation criteria are the dominant factor in LLM judge reliability; chain-of-thought offers minimal gains when the rubric is clear. **The rubric is the operationalisation of the construct.** If the rubric is vague, the construct is vague, and no amount of grader sophistication will save the score. 
A clear rubric also accelerates downstream understanding: when scores diverge across model versions, the team can read the rubric and trace the divergence to a specific criterion rather than re-arguing the underlying construct.\n\nThis is also where Kili-style annotation infrastructure starts to matter, not for the eval itself, but for the upstream work of writing the rubric, validating it against expert disagreement, and revising the data collection process before any tasks are graded.\n\n## Where Should the Tasks Come From, and Who Should Write Them?\n\nTwo patterns from the public AI benchmarks dominate:\n\n### **Real artefacts beat synthetic prompts**\n\n[SWE-bench](https://arxiv.org/abs/2310.06770) drew its 2,294 tasks from real GitHub issues across 12 popular Python repositories and graded patches by running the project's actual test suite. A patch passes only if it makes the failing tests pass without breaking the passing ones (FAIL\\_TO\\_PASS / PASS\\_TO\\_PASS). The realistic construction setting, in the authors' phrasing, gave the dataset properties no synthetic prompt collection could replicate: tasks are continually updatable from new pull requests, hard to game with surface heuristics, and grounded in code that other humans actually had to review. For a custom benchmark, the implication is to mine your own systems first. Bug trackers, support tickets, rejected agent outputs, escalations to human review: these are the highest-signal task sources you have.\n\n### **Expert authoring with adversarial validation**\n\n[GPQA Diamond](https://arxiv.org/abs/2311.12022) is the cleanest published example. Its four-stage pipeline (expert authoring, expert validation, revision, non-expert validation) produced 448 multiple-choice questions where domain PhDs reach 65% accuracy, 74% if you discount clear mistakes, but skilled non-experts (humans with 30+ minutes of unrestricted web access) reach only 34%. 
The Diamond subset is the high-confidence slice — questions where two domain experts agreed on the answer and a third validated it independently. The non-expert validation stage is the underappreciated part: it gives you a quantitative answer to \"is this question actually hard, or just obscure?\" GPQA Diamond also embeds a canary string for contamination tracing, a small touch with disproportionate downstream value.\n\n[LegalBench](https://arxiv.org/abs/2308.11462) is the canonical example of the expert-led collaborative model: 162 tasks across 6 reasoning types (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical understanding) authored by 40+ contributors, including lawyers, law professors, and legal practitioners. The reasoning typology is itself a contribution: it forces task authors to articulate what kind of legal cognition each task is testing, which makes downstream interpretation of scores tractable.\n\nFour operational rules follow:\n\n\\- **Start with real failures, then scale.** Anthropic's guidance recommends 20–50 tasks drawn from production failures as a starting point; the working bar for a defensible domain benchmark sits higher, at a hand-picked, expert-labelled set of 200–1,000 examples that reflect real user journeys and edge cases for the specific use cases your agents actually serve in production. 
Treat dataset construction as a series of stages, not a one-shot exercise: a small seed set, then expansion, then iteration as failures surface.\n\n\\- **Write reference solutions** — if your domain experts can't solve the task, the model definitely can't, and the failure tells you nothing.\n\n\\- **Use multiple raters per item** to evaluate annotation quality: the verified-subset protocol used three annotators per item, severity-ensembled, to filter out infeasible or under-specified items, and produced what is now the de facto standard for executable-test benchmarks; single-rater work inherits single-rater blind spots, and using too few raters strips out the human disagreement that is itself a signal about task ambiguity.\n\n\\- **Keep the test set strictly separated from anything used to develop the system** — train/test contamination quietly inflates scores on held-out evaluations, and the inflation is invisible until production performance disappoints.\n\n## How Do You Grade Outputs That Don't Have a Single Right Answer?\n\nThe Anthropic taxonomy is the cleanest framing: code-based graders, model-based graders (LLM judge), and human graders. Pick the cheapest grader that works for the task — and let the cost of being wrong, not the convenience of measuring, drive the choice.\n\n**Code-based graders** are the gold standard when applicable: exact match, regex, executable tests, structured output validation. The FAIL\\_TO\\_PASS / PASS\\_TO\\_PASS pattern from the original benchmark is a code-based grader; so is \"does the JSON parse and contain the required fields.\" If the task admits a code-based grader, use it. The grader is deterministic, free to run, and impossible to game without solving the task — and for agents that produce structured outputs, this is the cheapest reliable signal you will ever have.\n\n**Model-based graders** are the workhorse for open-ended outputs. Park et al. 
showed evaluation criteria dominate reliability; rubric-guided judges (Prometheus, G-Eval) reach Pearson correlations around 0.897 with humans rating the same outputs when the rubric is unambiguous. The [LeMAJ legal evaluation framework](https://arxiv.org/abs/2510.07243) found inter-rater agreement among humans increased 11% when reviewers used a shared rubric, and rubric-guided LLM judge configurations hit Cohen's κ of 0.75 with human consensus. The pattern across these results: rubric clarity is the constraint, not judge model size.\n\nThree operational rules. **Validate against a human-labelled golden set** — the working bar in practice is 75–90% agreement with human consensus; below that, the judge is amplifying noise. **Decompose into structured criteria** — binary checks (\"does the output cite a real case?\") plus ordinal scores (\"rate clarity 1–5 with anchor descriptions\") outperform single overall scores in nearly every published comparison. **Watch the known biases.** Model-based judges show position bias, length bias, and self-preference (preferring outputs from the same model family). Randomise position, normalise length when possible, and use a different model family for judging than for the system under test.\n\n**Human graders** are the calibration layer and the high-stakes layer. Reserve humans for golden-set construction, judge calibration, and tasks where the cost of a wrong answer makes a 90%-agreement model judge unacceptable: clinical safety, legal compliance, financial advice. The economics rarely support running humans across a full benchmark, but they almost always support running humans across a calibration subset — and the resulting scores anchor every other grader downstream.\n\n## How Do You Keep a Custom AI Benchmark Useful Over Time?\n\nA custom benchmark is a versioned artefact with a maintenance schedule. 
Treat it that way and it stays useful; treat it as a one-time project and it expires inside a year.\n\n**Contamination defence.** The [Singh et al. (2024) ConTAM analysis](https://arxiv.org/abs/2411.03923) of 13 benchmarks across 7 models found contamination has been underestimated in many prominent LLM releases, even when developers attempted decontamination. Three practical defences: hold out a private split that never goes public; date-stamp every item so you can filter to post-cutoff data per model (the [LiveCodeBench](https://arxiv.org/abs/2403.07974) approach, which date-stamps problems from competitive programming contests since May 2023); and embed a canary string, GPQA Diamond–style, so you can later test whether a model has memorised your dataset.\n\n**Harness isolation.** A subtler failure mode is the absence of strict isolation between the system under test and the evaluator. When agents can read or write to the same filesystem as the grader, observe the grader's logs, or otherwise inspect the scoring process, automated exploits become trivial — and as recent scanning-agent research shows, headline scores in those conditions can reflect harness gaming rather than genuine capabilities. Build agent/evaluator isolation into the design, not as a follow-up.\n\n**Saturation planning.** Distinguish capability evals from regression evals from day one. A capability eval and a regression eval are two different stages of the same benchmark's life: capability evals start at low pass rates (5–30%) and let you hill-climb; once they saturate above 90%, they become regression evals, where the goal flips from \"can the model do this\" to \"did we break something that used to work.\" Both stages matter. 
A team with only capability evals goes blind once the model is good; a team with only regression evals never sees what it can't do yet, and the scores stop reflecting any meaningful comparison between systems.\n\n**Continuous re-annotation.** Pipe production failures back into the eval set. Every time a user flags a wrong answer, every human override, every escalation: these are pre-validated hard cases. The [BetterBench framework](https://arxiv.org/abs/2411.12990) found that of 24 evaluated SOTA benchmarks, only 3 included CI build status and only 4 provided replication scripts; the operational rigour we apply to code we routinely fail to apply to evals. Version your benchmark against your model releases. Tag eval runs with model version, prompt version, and rubric version.\n\n**Standards alignment.** For regulated industries, the NIST AI RMF and the [NIST GenAI Profile (AI 600-1)](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf) define testing, evaluation, verification, and validation (TEVV) as a core function of trustworthy AI. Mapping your benchmark to TEVV categories isn't decorative; it's the documentation trail you'll want when the auditor arrives.\n\n## What Does This Look Like Across Industries?\n\nFour compressed examples to ground the principles. Each is anchored to a published benchmark so the design choices are inspectable.\n\n**Healthcare.** [MultiMedQA](https://arxiv.org/abs/2212.13138) combines six existing medical QA datasets with a new HealthSearchQA collection of consumer questions, then layers expert physician evaluation across multiple axes: factuality, possible harm, possible bias, scientific consensus alignment. The lesson: USMLE-style multiple-choice catches only the lower levels of Miller's pyramid (knows, knows how). 
Clinical safety requires open-ended generation graded by clinicians on multiple safety axes, not just answer accuracy.\n\n**Legal.** LegalBench's six reasoning types plus LegalBench-RAG's expert-annotated retrieval pairs together cover both answer correctness and retrieval-precision dimensions. The lesson: in regulated, citation-heavy domains, retrieval correctness matters as much as final-answer accuracy. A benchmark that grades only the synthesis ignores the failure mode that matters most: confidently wrong citations.\n\n**Software engineering.** The progression from SWE-bench to its verified subset to Terminal-Bench is the cleanest example of refinement under pressure. The [OpenAI SWE-bench Verified work](https://openai.com/index/introducing-swe-bench-verified/) filtered roughly a third of original items as ambiguous or infeasible, meaning failures on those items were noise from the benchmark, not signal from the model. The lesson: for any executable benchmark, agent harness bugs and grading-spec ambiguity can cause apparent failures that have nothing to do with model limitations. Verify before you trust.\n\n**Finance.** The [Finance Agent Benchmark](https://arxiv.org/abs/2508.00828) provides 537 expert-authored questions covering retrieval through modelling, with an agentic harness that includes Google Search and SEC EDGAR access. The lesson: financial benchmarks need expert-authored questions, real document grounding, and tool-use evaluation; pure-text QA misses the workflow. The regulatory and compliance dimension is also load-bearing: a model that's right but cites a hallucinated 10-K creates legal exposure that pure accuracy metrics never surface.\n\nThe pattern across all four: the benchmark inherits the failure modes of the domain. Generic benchmarks miss these because they're generic. 
Custom benchmarks earn their cost by being specific.\n\n## The Real Test of a Benchmark Is Whether You'd Trust It Tomorrow\n\nThe benchmark you build today will be consulted dozens of times before it expires. Each consultation is a decision: ship or don't, escalate or don't, retrain or don't. The cost of a bad benchmark isn't measured in eval-set creation hours; it's measured in the production decisions made in its name.\n\nThe public AI benchmarks that survived their first wave of scrutiny (HELM, GPQA Diamond, SWE-bench, LegalBench) share a small set of properties. They define the construct in writing. They use real artefacts when possible. They validate task design against domain experts before grading any model. They publish their rubrics. They plan for contamination, harness exploits, and saturation as design constraints, not afterthoughts. They version themselves like software.\n\nThese properties don't require frontier-lab budgets to replicate. They require treating the benchmark as the substrate of every downstream claim about your AI system, because that's what it is. The benchmarks that hold up are the ones built by people who understood the construct before they wrote the first task.\n\n
## Resources\n\n### Benchmark Methodology Papers\n\n-   **Holistic Evaluation of Language Models (HELM)** – Stanford CRFM's multi-metric, multi-scenario framework\n    -   [https://arxiv.org/abs/2211.09110](https://arxiv.org/abs/2211.09110)\n-   **GPQA: A Graduate-Level Google-Proof Q&A Benchmark** – Expert-authored, expert-validated, non-expert-tested protocol\n    -   [https://arxiv.org/abs/2311.12022](https://arxiv.org/abs/2311.12022)\n-   **SWE-bench: Can Language Models Resolve Real-World GitHub Issues?** – Real-artefact tasks with executable graders\n    -   [https://arxiv.org/abs/2310.06770](https://arxiv.org/abs/2310.06770)\n-   **LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code** – Date-stamped, contamination-aware design\n    -   [https://arxiv.org/abs/2403.07974](https://arxiv.org/abs/2403.07974)\n-   **LegalBench: A Collaboratively Built Benchmark for Legal Reasoning** – Expert-led collaborative construction with reasoning typology\n    -   [https://arxiv.org/abs/2308.11462](https://arxiv.org/abs/2308.11462)\n-   **MultiMedQA / Large Language Models Encode Clinical Knowledge** – Multi-axis human evaluation for medical LLMs\n    -   [https://arxiv.org/abs/2212.13138](https://arxiv.org/abs/2212.13138)\n-   **Finance Agent Benchmark** – Expert-authored finance benchmark with agentic harness\n    -   [https://arxiv.org/abs/2508.00828](https://arxiv.org/abs/2508.00828)\n\n### Benchmark Quality and Validity Research\n\n-   **Measuring What Matters: Construct Validity in LLM Benchmarks** – 445-benchmark systematic review\n    -   [https://arxiv.org/abs/2511.04703](https://arxiv.org/abs/2511.04703)\n-   **BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices** – Operational checklist for benchmark quality\n    -   [https://arxiv.org/abs/2411.12990](https://arxiv.org/abs/2411.12990)\n-   **Evaluation 
Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter?** – ConTAM analysis across 13 benchmarks × 7 models\n    -   [https://arxiv.org/abs/2411.03923](https://arxiv.org/abs/2411.03923)\n-   **An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability** – Rubric clarity dominates reliability\n    -   [https://arxiv.org/abs/2506.13639](https://arxiv.org/abs/2506.13639)\n-   **LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation** – Rubric-guided judging in legal domain\n    -   [https://arxiv.org/abs/2510.07243](https://arxiv.org/abs/2510.07243)\n\n### Industry Reports and Practitioner Guidance\n\n-   **Demystifying Evals for AI Agents (Anthropic Engineering)** – Three-grader taxonomy and capability vs. regression evals\n    -   [https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)\n-   **2025 AI Index Report — Technical Performance (Stanford HAI)** – Benchmark saturation data\n    -   [https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance](https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance)\n-   **Introducing SWE-bench Verified (OpenAI)** – Human-filtered SWE-bench subset methodology\n    -   [https://openai.com/index/introducing-swe-bench-verified/](https://openai.com/index/introducing-swe-bench-verified/)\n-   **Bridging the Operational AI Gap (MIT Technology Review Insights)** – Enterprise AI deployment gap analysis\n    -   [https://www.technologyreview.com/2026/03/04/1133642/](https://www.technologyreview.com/2026/03/04/1133642/)\n\n### Standards and Policy\n\n-   **NIST AI Risk Management Framework (AI RMF 1.0)** – Foundational TEVV definitions\n    -   [https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf](https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf)\n-   **NIST AI Risk Management Framework: Generative AI Profile (AI 600-1)** – 
Pre-deployment testing guidance for GAI\n    -   [https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf)"
}

▼✓Web search query: "SPEC benchmark design principles relevance reproducibility fairness verifiabi...", results: 8

{
  "results": [
    {
      "title": "[PDF] How to Build a Benchmark - SPEC Research Group",
      "url": "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf",
      "description": "The key characteristics can be organized in the following groups, which will be discussed in more detail in the next sections: • Relevance: How closely the benchmark behavior correlates to behaviors that are of interest to consumers of the results • Reproducibility: The ability to consistently produce similar results when the benchmark is run with the same test configuration • Fairness: Allowing different test configurations to compete on their merits without artificial limitations • Verifiability: Providing confidence that a benchmark result is accurate • Usability: Avoiding roadblocks for users to run the benchmark in their test environments. All benchmarks are subject to these same criteria, but each category includes additional issues that are specific to the individual benchmark, depending on the benchmark’s goals."
    },
    {
      "title": "[PDF] Designing a Good Benchmark",
      "url": "https://www.spec.org/events/beijing2016/slides/012-Designing_a_Good_Benchmark2%20-%20Jeremy%20Arnold.pdf",
      "description": "Designing a Good Benchmark. Jeremy Arnold, SPECpower Architect. Copyright © 1988‐2016 Standard Performance Evaluation Corporation (SPEC). Benchmarks can be assessed on several different characteristics, including: Relevance, Reproducibility, Fairness, Verifiability, Usability. Reproducibility: Consistency (running the benchmark multiple times under the same conditions will produce the same results); Description (the hardware and software components and configuration are described in sufficient detail to allow an equivalent environment to be constructed); Power Measurements (power should be measurable using a variety of devices). Usability: Self-describing (includes tools for automatic discovery of system details); Practical (runs on reasonably sized systems); Configurability (allow flexibility for research); Energy Data Collection (use of SPEC PTDaemon or other tools to automatically collect power data)."
    },
    {
      "title": "SPEC Fair Use Rules",
      "url": "https://www.spec.org/products/fairuse/",
      "description": "# SPEC Fair Use Rules. General Requirements For Public Use of All SPEC Benchmark Results. To help assure that these principles are met, the following requirements must be met by any organization or individual who makes public use of SPEC benchmark results. ▹ Section I General Requirements: rules that apply to public use of all SPEC benchmarks. ▹ Section II Individual Requirements: additional rules that apply to specific benchmarks. It is intended that this document provides the information needed for compliance with Fair Use, and in the event of any inconsistencies, this document takes precedence over individual benchmark run rules fair use requirements. **Compliance.** Claimed results must be compliant with that benchmark's rules. The SPEC web site () or a suitable sub page must be noted as a resource for additional information about the benchmark. Statements regarding SPEC, its benchmarks, and results published by SPEC, must be clear and correct. SPEC's trademarks may not be used to mislabel something that is not a SPEC metric."
    },
    {
      "title": "The Art of Building a Good Benchmark - ResearchGate",
      "url": "https://www.researchgate.net/publication/220789383_The_Art_of_Building_a_Good_Benchmark",
      "description": "... The design of a systems benchmark must meet a number of requirements, namely relevance, repeatability and reproducibility, fairness, and portability [8, 26]"
    },
    {
      "title": "Relevance and Evolution of Benchmarking in Computer Systems",
      "url": "https://www.mdpi.com/2073-431X/14/12/516",
      "description": "This review synthesizes the evolution from classical synthetic benchmarks (Whetstone, Dhrystone) and application kernels (LINPACK) to modern suites (SPEC"
    },
    {
      "title": "[PDF] The Science of Systems Benchmarking",
      "url": "https://closer.scitevents.org/Documents/Previous_Invited_Speakers/2021/CLOSER_2021_KS_4_Presentation.pdf",
      "description": "S. Kounev. Benchmark Standardization: Standard Performance Evaluation Corporation (SPEC). Goal: provide standardized set of application benchmarks and standardized methodology for running them and reporting results. First benchmark was SPEC89 (4 C programs, 6 Fortran programs). Groups: OSG (Open Systems Group), HPG (High Performance Group), GWPG (Graphics and Workstation Performance Group), RG (Research Group); > 100 member organizations & associates. SPEC Research Group (RG), http://research.spec.org: Founded in March 2011. Scope: Systems benchmarking, performance evaluation, and experimental system analysis. Provide metrics, (research) benchmarks, methodologies and tools. Foster transfer of knowledge and collaboration btw. Agenda: Introduction; Benchmarking Education; Benchmark Standardization; Case Study on Cloud Benchmarking; Measuring and quantifying elasticity; Reproducibility of experimental evaluation."
    },
    {
      "title": "SPEC CPU 2026 in Detail - Why the Most Important CPU Benchmark Did No…",
      "url": "https://www.igorslab.de/en/?p=317730",
      "description": "# SPEC CPU 2026 in Detail – Why the Most Important CPU Benchmark Did Not Simply Get Bigger. With SPEC CPU 2026, one of the most important vendor-neutral CPU benchmarks after SPEC CPU 2017 receives a fundamental overhaul. This is more than just a version jump with new workloads, because the new suite attempts to capture several developments of recent years at once: significantly higher core counts, changed memory hierarchies, more open-source software in production environments, new compiler realities, greater platform diversity, and the question of how modern servers, workstations, laptops, and even smaller systems can be meaningfully compared. According to the published SPEC documents, SPEC CPU 2026 primarily measures the performance of general-purpose processors, the memory hierarchy, and compilers, optionally supplemented by energy efficiency or power consumption using the already familiar SPEC methodology. The suite was officially announced on May 5, 2026, and SPEC classifies it as a significant update to the established, vendor-neutral standard for compute-intensive performance measurements. The most important point is not only the number of benchmarks."
    },
    {
      "title": "Benchmarking Quantum Computers: Towards a Standard Performance Evaluation Approach",
      "url": "https://arxiv.org/html/2407.10941v4",
      "description": "# Benchmarking Quantum Computers: Towards a Standard Performance Evaluation Approach. The technological development of increasingly larger quantum processors on different quantum platforms raises the problem of how to fairly compare their performance, known as quantum benchmarking of quantum processors. This is a challenge that computer scientists have already faced when comparing classical processors, leading to the development of various mathematical tools to address it, but also to the identification of the limits of this problem. In this work, we briefly review the most important aspects of both classical processor benchmarks and the metrics comprising them, providing precise definitions and analyzing the quality attributes that they should exhibit. Additionally, we review some of the most important metrics and benchmarks for quantum processors proposed in the literature, assessing what quality attributes they fulfill. These guidelines aim to pave the way for establishing a roadmap towards standardizing the performance evaluation of quantum devices, ultimately leading to the creation of an organization akin to the Standard Performance Evaluation Corporation (SPEC)."
    }
  ]
}
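
The SPEC results above define reproducibility as the ability to consistently produce similar results when the benchmark is run with the same test configuration. A minimal sketch of how such a check could look, assuming each repeated run yields one scalar score; the function names and the 2% threshold are illustrative assumptions, not anything SPEC prescribes:

```python
import statistics


def coefficient_of_variation(scores):
    """Relative run-to-run spread: sample standard deviation / mean."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    if mean == 0:
        raise ValueError("mean score is zero; relative spread is undefined")
    return statistics.stdev(scores) / abs(mean)


def is_reproducible(scores, max_cv=0.02):
    """Hypothetical acceptance rule: spread must stay under max_cv (assumed 2%)."""
    return coefficient_of_variation(scores) <= max_cv


# Three runs under the same configuration (made-up numbers for illustration).
stable_runs = [99.0, 100.0, 101.0]
noisy_runs = [80.0, 100.0, 120.0]
print(is_reproducible(stable_runs), is_reproducible(noisy_runs))
```

The threshold is a policy choice: a benchmark whose run-to-run spread is larger than the differences it is meant to detect cannot rank systems reliably, so the acceptable spread should be set relative to those differences.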

▼✓Web search query: "psychometrics IRT item response theory AI benchmark evaluation design", results: 8

{
  "results": [
    {
      "title": "Item Response Theory for Evaluating Regression Algorithms",
      "url": "https://ieeexplore.ieee.org/document/9207030/",
      "description": "Item Response Theory (IRT) is a tool developed in psychometrics to measure latent abilities of human respondents based on their responses to items with"
    },
    {
      "title": "Item Response Theory (IRT): Intro, Models, and Examples",
      "url": "https://assess.com/what-is-item-response-theory/",
      "description": "# Item Response Theory (IRT): Intro, Models, and Examples. Item response theory (IRT) is a family of machine learning models in the field of psychometrics, which are used to design, analyze, validate, and score assessments. Want to learn more about IRT, how it works, and why it is so important for assessment? ## What is Item Response Theory? IRT is a family of models that try to describe *how examinees respond to items on a test*, hence the name. These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves. IRT is model-driven, in that there is a specific mathematical equation that is assumed, and we fit the models based on raw data, similar to linear regression. IRT used to be known as **latent trait theory** and **item characteristic curve theory**."
    },
    {
      "title": "Building an Evaluation Scale using Item Response Theory - PMC",
      "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC5167538/",
      "description": "We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation."
    },
    {
      "title": "Lifting the benchmark iceberg with item-response theory | OpenReview",
      "url": "https://openreview.net/forum?id=ZyVQqK7mcP",
      "description": "## Lifting the benchmark iceberg with item-response theory. **TL;DR:** We reveal how hidden implementation choices in LLM benchmarks bias model rankings and propose item response theory (IRT) as a solution for more transparent and reliable evaluations. **Abstract:** The evaluation of large language models (LLMs) through benchmarks has become a cornerstone of AI development, guiding critical decisions about model deployment and research directions. However, as benchmarks evolve from narrow task-specific assessments to broad capability evaluations, they become more difficult to develop, understand and analyze. Here, we report a “benchmark iceberg” phenomenon, where much of the variability in model rankings stems not from true capability differences, but from hidden implementation choices beneath the surface of reported scores. Our analysis demonstrates how minor changes to these implementation details can alter model rankings, a concerning finding given benchmarks' role in shaping the AI landscape. By adapting item response theory (IRT) we transform benchmarks from opaque leaderboards into transparent measurement instruments, revealing how hidden implementation choices currently distort our perception of model capabilities."
    },
    {
      "title": "[PDF] Efficient and Robust Model Benchmarks with Item Response Theory ...",
      "url": "https://www.turing.ac.uk/sites/default/files/2021-07/efficient_and_robust_paper.pdf",
      "description": "Given the current candidate ability θ, a fitted IRT model with the likelihood function g(r; ω, θ), and a set of L items with parameters {ω1, …, ωL}, the Fisher item information (FII) on the jth item is then calculated as in Eqs. (11) and (12). Here (ωj, θ) refers to the corresponding probability measure of the IRT model. C. The Constraint of Non-negative Discrimination: For typical IRT models, positive discrimination indicates the item has better average responses from candidates with higher ability estimates. To compare different IRT models, we evaluate their performance to make inferences over unseen responses (several standard machine learning evaluation measures). We use the training set to estimate the IRT models’ parameters, and the test set to verify the expected responses from each IRT model. During adaptive testing, each time we update the model ability, we use the trained IRT to infer the expected value of responses (performance measures)."
    },
    {
      "title": "Item Response Theory Overview - Emergent Mind",
      "url": "https://www.emergentmind.com/topics/item-response-theory-irt",
      "description": "Item Response Theory (IRT) is a statistical framework that models latent abilities via test responses to enable precise measurement and"
    },
    {
      "title": "Efficient and Robust Model Benchmarks with Item Response Theory ...",
      "url": "https://revistas.unir.net/index.php/ijimai/article/view/708",
      "description": "To this end, we adapt existing approaches from psychometrics: specifically, Item Response Theory and Adaptive Testing. Both are well-founded"
    },
    {
      "title": "When Classical Test Theory Alone Is Not Enough: How Item ...",
      "url": "https://luluyan.medium.com/why-classical-test-theory-falls-short-for-llm-evaluation-and-how-item-response-theory-irt-93f03f3b6a2a",
      "description": "Item Response Theory (IRT) and broader psychometric thinking offer a dramatically better toolkit, especially when we need to measure latent"
    }
  ]
}
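
Several of the results above mention fitted IRT models, Fisher item information, and adaptive testing. As a rough illustration, here is a sketch of the standard two-parameter logistic (2PL) model and its item information. It is a textbook formulation, not the specific model used in any of the listed papers, and the item names and parameter values are made up for the example:

```python
import math


def p_correct(theta, a, b):
    """2PL item characteristic curve: probability that a system with
    ability theta solves an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def most_informative_item(theta, items):
    """Adaptive-testing step: pick the item that is most informative
    at the current ability estimate."""
    return max(items, key=lambda name: item_information(theta, *items[name]))


# Hypothetical item bank: name -> (discrimination a, difficulty b).
ITEMS = {
    "easy": (1.0, -2.0),
    "matched": (1.0, 0.0),
    "hard": (1.0, 2.0),
}

print(most_informative_item(0.0, ITEMS))
```

Information peaks where difficulty matches ability, which is one way to see why a saturated benchmark says little about strong models: every item sits far below their ability, so each response carries almost no information.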

▼thought

Now let me fetch a couple more key sources: the SPEC paper on "How to Build a Benchmark", and the PMC article on computational method benchmarking guidelines.

▼✓Fetch page content, URL: "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf"

{
  "url": "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf",
  "error": "Navigation timeout of 60000 ms exceeded"
}

▼✓Web search query: "data contamination prevention benchmark design private holdout canary string ...", results: 8

{
  "results": [
    {
      "title": "BIG-Bench Canary Contamination in GPT-4 — LessWrong",
      "url": "https://www.lesswrong.com/posts/kSmHMoaLKGcGgyWzs/big-bench-canary-contamination-in-gpt-4",
      "description": "# BIG-Bench Canary Contamination in GPT-4. The BIG-Bench canary string is a unique string included in documents intended to be excluded from the training datasets of models (e.g. model evaluation tasks):. As a contingency, checking whether a model can generate the canary string would allow us to detect whether such documents have made it into training corpora anyway. In a thread from a few months ago, @niplav pointed out that Claude 3.5 Sonnet could reproduce the canary string. The publicly available GPT-4o chat model does not generate the BIG-Bench canary string, but as I discovered the non-public pre-RLHF GPT-4-base *does*. The canary string has also been used in a number of LessWrong posts (e.g. 1, 2, 3) containing text the authors didn’t want future AIs to be trivially aware of, which is another reason canaries being compromised is bad. After testing a subset of the BIG-Bench benchmark tasks, I found that GPT-4-base had memorized several tasks in the benchmark (all containing the canary string)."
    },
    {
      "title": "BIG-Bench Canary Contamination in GPT-4 — AI Alignment Forum",
      "url": "https://www.alignmentforum.org/posts/kSmHMoaLKGcGgyWzs/big-bench-canary-contamination-in-gpt-4",
      "description": "# BIG-Bench Canary Contamination in GPT-4. The BIG-Bench canary string is a unique string included in documents intended to be excluded from the training datasets of models (e.g. model evaluation tasks):. As a contingency, checking whether a model can generate the canary string would allow us to detect whether such documents have made it into training corpora anyway. In a thread from a few months ago, @niplav pointed out that Claude 3.5 Sonnet could reproduce the canary string. The publicly available GPT-4o chat model does not generate the BIG-Bench canary string, but as I discovered the non-public pre-RLHF GPT-4-base *does*. But to test this out further, @Linch suggested I check whether GPT-4-base was trained on benchmark tasks themselves. After testing a subset of the BIG-Bench benchmark tasks, I found that GPT-4-base had memorized several tasks in the benchmark (all containing the canary string). GPT-4-base can reproduce the canary string."
    },
    {
      "title": "Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation",
      "url": "https://arxiv.org/html/2502.17521v1",
      "description": "# Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation. Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. However, because these static benchmarks are released on the Internet for transparent evaluation, and LLMs gather as much data as possible from the Internet for training, potential data contamination is unavoidable Magar and Schwartz (2022); Deng et al. To mitigate the risk of data contamination in LLM benchmarking, researchers have proposed various methods to enhance static approaches, such as data encryption Jacovi et al."
    },
    {
      "title": "DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | OpenReview",
      "url": "https://openreview.net/forum?id=3BZyQqbytZ",
      "description": "## DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination. **Abstract:** The rapid advancement of code large language models (Code LLMs) underscores the critical need for effective and transparent benchmarking methods. To overcome these challenges, we propose DyCodeEval, a novel benchmarking suite specifically designed to evaluate Code LLMs under realistic contamination scenarios. We introduce a dynamic data generation method and conduct extensive empirical studies on two seed datasets involving 18 Code LLMs. The results demonstrate that DyCodeEval effectively assesses the reasoning capabilities of Code LLMs under contamination conditions while producing diverse problem variants, thereby ensuring robust and consistent benchmarking outcomes. **Lay Summary:** Large language models (LLMs) are increasingly used to write code and solve programming tasks. Our approach provides a more robust and diverse benchmark for assessing code LLMs, helping researchers and developers better understand what these models can and cannot do in realistic settings. **Keywords:** benchmarking, code generation, large language model, trustworthy ML."
    },
    {
      "title": "Don’t train on this data or what’s a canary string? – Jürg Stuker",
      "url": "https://stuker.com/2026/dont-train-on-this-data-or-whats-a-canary-string/",
      "description": "# Don’t train on this data or what’s a canary string? There are many reasons why I don’t want data to be fed into the training of large language models (LLMs). In the case of a benchmark, we would not be able to measure the ability of a system to answer a question, but rather how well the system can find a memorized answer within its model. A proposed approach to exclude assets from training is to place an instruction in the content I don’t want to be used. The term “canary” comes from coal miners who placed canaries into coal mines as an early-warning signal for toxic gases, primarily carbon monoxide. Somehow programmers adopted this for example to detect buffer overflows on stack-allocated variables or programs overwriting security-critical sections in memory. Content containing the string showed up in GPT-4, Opus 3, and Sonnet 3.5."
    },
    {
      "title": "Benchmarking Large Language Models Under Data Contamination",
      "url": "https://arxiv.org/html/2502.17521v2",
      "description": "In this work, we conduct an in-depth analysis of existing static and dynamic benchmarks for evaluating LLMs. We first examine methods that"
    },
    {
      "title": "GitHub - lyy1994/awesome-data-contamination: The Paper List on Data Contamination for Large Language Models Evaluation.",
      "url": "https://github.com/lyy1994/awesome-data-contamination",
      "description": "# GitHub - lyy1994/awesome-data-contamination: The Paper List on Data Contamination for Large Language Models Evaluation. Data Contamination, also known as [train-test contamination](https://arxiv.org/abs/2211.09110) or [benchmark leakage](https://arxiv.org/abs/2311.01964), indicates the case in which the model has seen information (e.g., test instances, test prompts, etc.) about the test set to be evaluated on during training. **Time Travel in LLMs: Tracing Data Contamination in Large Language Models** (ICLR 2024). **A Survey on Data Contamination for Large Language Models** (arXiv, 20 Feb 2025). **Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation** (arXiv, 23 Feb 2025). **Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models** (arXiv, 10 Oct 2025)."
    },
    {
      "title": "Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation - ACL Anthology",
      "url": "https://aclanthology.org/2025.emnlp-main.511/",
      "description": "## Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation. ##### Correct Metadata for. ##### Abstract. ##### Export citation. ##### Markdown (Informal). Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation (Chen et al., EMNLP 2025). ##### ACL. ACL materials are Copyright © 1963–2026 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License. The ACL Anthology is managed and built by the ACL Anthology team of volunteers. *Site last built on 08 May 2026 at 14:01 UTC with commit ff75cc6.*."
    }
  ]
}

▼✓Web search query: "Goodhart's law benchmark gaming overfitting leaderboard metric target", results: 8

{
  "results": [
    {
      "title": "Specification gaming, Goodhart’s law, and the metrics that lie about AI | explainx.ai Blog | explainx.ai",
      "url": "https://explainx.ai/blog/specification-gaming-goodharts-law-ai-metrics",
      "description": "# Specification gaming, Goodhart’s law, and the metrics that lie about AI. In AI, that shows up as reward hacking, benchmark overfitting, and agents that please evaluators while failing users. A practical take on Goodhart, proxy metrics, and what to do in product and governance. Goodhart’s law (paraphrased) warns that *any* proxy used as a *sole* target can eventually break as a measure. 1. **Games.** A simulated agent is rewarded for a score; it finds a weird strategy that maxes the score in a way humans would call unfair or brittle—the classic RL anecdote, still pedagogically useful. That is governance’s answer to the same structural uncertainty as Goodhart in a product dashboard. For most shipping teams, the honest goal is not ‘open the black box’ but ‘know when it breaks, why it might have broken, and what to do next’—tied to tools, data retention, and governance.Apr 23, 2026."
    },
    {
      "title": "The Paradox of AI Benchmarks: Challenges in Evaluation",
      "url": "https://thesequence.substack.com/p/the-sequence-opinion-750-the-paradox",
      "description": "# The Sequence Opinion #750: The Paradox of AI Benchmarks: Challenges in Evaluation. ### Goodhart’s Law and some of the challenges around AI benchmarks. We largely rely on standardized benchmarks and tidy metrics to declare winners and track improvement. Yet a central paradox undercuts this practice: as soon as a metric becomes the target, it often stops measuring what we intended. This is Goodhart’s Law in action—*“when a measure becomes a target, it ceases to be a good measure.”* In AI, that means a leaderboard gain may reflect success at **gaming the test**, not a durable leap in capability. This essay examines the **core challenges of AI benchmarks and evaluations** across language, vision, and reinforcement learning (RL). We’ll look at how once-useful measures turn into misleading goals; why good evals should be **accurate, inexpensive, and hard to memorize**; where academic and real-world tests clash; and what RL environments promise—and fail—to deliver as evolving evaluations. ## Goodhart’s Law and the Benchmark Trap."
    },
    {
      "title": "Gaming the System: Goodhart’s Law Exemplified in AI Leaderboard Controversy",
      "url": "https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy",
      "description": "# Gaming the System: Goodhart’s Law Exemplified in AI Leaderboard Controversy. ### How the race to the top in AI benchmarks is leading to specialized optimization at the expense of real-world performance. The recent uproar over the LMSYS (now LMArena) *Chatbot Arena* leaderboard is a striking, real-world example of **Goodhart’s Law** in action. Goodhart’s Law – originally from economics – warns that *“**when a measure becomes a target, it ceases to be a good measure**”*. In AI model evaluation, this means that once researchers and companies focus on beating a particular benchmark, the benchmark itself can become **misleading**. The Chatbot Arena controversy has proved this in practice: model developers have found ways to optimize for high Arena scores (the *target*), ultimately distorting the Arena’s ability to measure true model quality. **LMSYS** (now LMArena) is a crowdsourced evaluation platform for chat-based AI models."
    },
    {
      "title": "Goodhart's law states when a metric becomes a target, it ceases to ...",
      "url": "https://www.threads.com/@carnage4life/post/DWLQen2DY3K/goodharts-law-states-when-a-metric-becomes-a-target-it-ceases-to-be-a-good",
      "description": "Goodhart's law states when a metric becomes a target, it ceases to be a good metric. This describes the phenomenon of companies creating"
    },
    {
      "title": "The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks",
      "url": "https://tianpan.co/blog/2026-04-19-goodharts-law-ai-benchmark-gaming",
      "description": "# The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks. In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — \"when a measure becomes a target, it ceases to be a good measure\" — was articulated in 1970s economic policy, but it describes AI benchmarking with eerie precision. ## The Lifecycle of a Benchmark. A research team identifies a meaningful capability — reasoning, code generation, multilingual comprehension — and constructs test questions to measure it."
    },
    {
      "title": "Organizations everywhere fall prey to Goodhart's law: “When a ...",
      "url": "https://www.facebook.com/MITSloanManagementReview/posts/organizations-everywhere-fall-prey-to-goodharts-law-when-a-measure-becomes-a-tar/1317146260451960/",
      "description": "Organizations everywhere fall prey to Goodhart's law: “When a measure becomes a target, it ceases to be a good measure."
    },
    {
      "title": "LLM Leaderboards are Bullshit - Goodhart's Law Strikes Again",
      "url": "https://www.reddit.com/r/LocalLLaMA/comments/1bjvjaf/llm_leaderboards_are_bullshit_goodharts_law/",
      "description": "205 votes, 84 comments. The AI research community's obsession with leaderboards and evaluation benchmarks for large language models is a"
    },
    {
      "title": "Is AI Cheating on the Test: Data Contamination, Gaming, and the ...",
      "url": "https://medium.com/@wasowski.jarek/is-ai-cheating-on-the-test-data-contamination-gaming-and-the-benchmark-crisis-9dff2fba494f",
      "description": "How data contamination, Goodhart's Law, and gaming turned objective AI tests into a marketing tool — and what to do about it as a"
    }
  ]
}

▼thought

I now have a comprehensive set of sources. Let me synthesize this into a well-structured, comprehensive answer. The key areas to cover:
1. **Fundamental design principles** - from multiple traditions (SPEC's 5 criteria, Ofir Press's 3 properties, Reuel et al.'s 4 validity types)
2. **Task construction** - real artifacts vs synthetic, expert authoring, GPQA Diamond pipeline
3. **Scoring and metrics** - single vs multi-metric, code-based/model-based/human graders, rubric design
4. **Data contamination defense** - canary strings, private holdouts, dynamic generation, date-stamping
5. **Goodhart's Law and benchmark gaming** - the fundamental tension
6. **Saturation and lifecycle management** - capability vs regression evals, continuous updating
7. **Practical process** - from construct definition to maintenance
Let me organize this well. I should reference the key sources:
- Ofir Press blog post
- SPEC criteria (relevance, reproducibility, fairness, verifiability, usability)
- Kili Technology guide
- Reuel et al. 4 validity types
- GPQA Diamond pipeline
- HELM multi-metric approach
- ConTAM contamination analysis
- BetterBench framework
- Anthropic engineering guidance
- How2Bench 55-criteria checklist
- PeerBench
- GIM benchmark with IRT
- The arXiv paper on interactive evaluation
Let me write the answer now.

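How to Design an Excellent Benchmark

This question has become unusually urgent over the past three years. In late 2023, Claude 2 could resolve only 1.96% of real GitHub issues on SWE-bench; by 2024, frontier systems had reached 71.7% (Stanford HAI 2025 AI Index). Humanity's Last Exam went from 8.8% in early 2025 to above 50% in April 2026: a benchmark designed as the "hardest standardized exam" hit its saturation curve within fifteen months. A systematic review of 445 LLM benchmarks by Reuel et al. found that most benchmarks do not actually measure what they claim to measure (Measuring What Matters).

In this answer I draw on two threads, classical computer-systems benchmark methodology and contemporary AI benchmark practice, to lay out an actionable design framework.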

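I. The five foundational properties of a benchmark (SPEC framework)

Over more than three decades, the Standard Performance Evaluation Corporation (SPEC) has built a set of meta-criteria for evaluating benchmarks, and it remains the clearest foundational framework (SPEC Research Group - How to Build a Benchmark). The five dimensions:

Relevance: how closely the benchmark's behavior tracks real usage scenarios. This is the first principle: if nobody cares about what you measure, even a perfectly precise score is meaningless.

Reproducibility: whether repeated runs under the same configuration give consistent results. This is especially thorny for AI benchmarks: non-deterministic sampling, floating-point differences, and tiny prompt variations can all make scores drift.

Fairness: whether different systems under test can compete on equal terms, with no artificial bias toward a particular architecture or implementation.

Verifiability: whether a third party can independently verify a claimed result. AI benchmarks have serious problems here: the BetterBench study found that of the 24 SOTA benchmarks it assessed, only 3 included CI build status and only 4 provided reproduction scripts (BetterBench).

Usability: how hard it is for users to run the benchmark in their own test environment. The higher the barrier, the fewer people use it and the lower the benchmark's value to the ecosystem.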
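
One way to make the reproducibility property concrete is to score the same configuration several times and report the spread, not just a single number. The sketch below is only an illustration: `run_once` is a hypothetical callable standing in for a full evaluation run, and `fake_run` simulates sampling noise rather than calling any real model.

```python
import random
import statistics
from typing import Callable, List, Tuple


def repeated_runs(run_once: Callable[[int], float], seeds: List[int]) -> Tuple[float, float]:
    """Run the benchmark once per seed and return (mean, sample standard deviation)."""
    scores = [run_once(seed) for seed in seeds]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread


def fake_run(seed: int) -> float:
    """Hypothetical stand-in for a real evaluation run: a noisy score near 0.60."""
    rng = random.Random(seed)
    return 0.60 + rng.uniform(-0.02, 0.02)


mean_score, spread = repeated_runs(fake_run, seeds=[0, 1, 2, 3, 4])
print(f"accuracy = {mean_score:.3f} +/- {spread:.3f} over 5 seeds")
```

If the spread is of the same order as the gap between two systems, the benchmark cannot separate them at that sample size.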

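II. Three core properties of contemporary AI benchmarks (Ofir Press framework)

Ofir Press, an author of benchmarks including SWE-bench, AssistantBench, and CiteME, distills a good AI benchmark into three core properties (How to Build Good Language Modeling Benchmarks):

1. Natural

Tasks must come from questions real humans actually ask, not ones invented behind closed doors. Two hard signs that a benchmark is "unnatural":

The question setup is unrealistic: multiple choice, for example. "In reality nobody walks into a doctor's office and says 'Doctor, my elbow hurts, and the cause must be one of these four options...'"
The questions are made up rather than taken from real user questions: if you work at Google, rather than sitting in a room dreaming up questions, go through Search logs and find real queries that users searched for but got no good answer to

A good self-check is the "usefulness criteria": would a system that scores above the baseline on this benchmark actually be useful to humans? Would it actually raise productivity?

2. Automatically Evaluatable

Given a model output, you need to decide automatically whether it is right. Code tasks use unit tests (SWE-bench's FAIL_TO_PASS / PASS_TO_PASS pattern); structured outputs use JSON schema validation. Tasks like summarization, by contrast, are extremely hard to evaluate automatically, which explains why summarization benchmarks have progressed slowly despite their high practical value. Ofir explicitly opposes "using an LLM to grade an LLM": "Either use the LM to solve the problem or use it to judge the output; being both player and referee leads to fundamental problems."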
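
The FAIL_TO_PASS / PASS_TO_PASS idea can be sketched as a small pass/fail rule. This is not the SWE-bench harness itself, only an illustration of the logic: the function name, the test names, and the convention that a missing test counts as failed are all assumptions made for this example.

```python
from typing import Dict, List


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True only if every target test now passes and nothing regressed.

    A test missing from test_results is treated as failed (an assumption of this sketch).
    """
    fixed = all(test_results.get(name, False) for name in fail_to_pass)
    no_regression = all(test_results.get(name, False) for name in pass_to_pass)
    return fixed and no_regression


# Hypothetical post-patch results for one task instance.
results = {"test_bug_repro": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(results, ["test_bug_repro"], ["test_existing_a"]))
print(is_resolved(results, ["test_bug_repro"], ["test_existing_a", "test_existing_b"]))
```

The second call fails even though the bug is fixed, because a previously passing test broke. Requiring both conditions is what stops a patch from "solving" the issue by damaging other behavior.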

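3. Challenging

This is the fastest-moving criterion. Ofir's guidance has gone through three versions:

Initial version: top-model accuracy at release should be between 1% and 35%
January 2025 revision: it should be between 0.1% and 9%
May 2025 revision: "I now ask collaborators not to think about 'a benchmark where AI scores 0%' but to design one where AI scores -200%. Look for problems that would still score zero even if model performance tripled. It is not enough that models fail today; you have to forecast the rate of progress over the next 6–12 months and design a benchmark that next year's models cannot solve either."

A practical trick: use a strong baseline to filter out easy instances (Bamboogle filtered out questions Google could answer; CiteME filtered out questions GPT-4o could answer directly).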
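
The baseline-filtering trick amounts to one pass over the candidate pool. A minimal sketch, assuming a hypothetical `baseline_answer` callable (in practice a search engine or a strong model) and a hypothetical `is_correct` checker; the toy data is invented for illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def filter_hard_items(
    items: List[T],
    baseline_answer: Callable[[T], str],
    is_correct: Callable[[T, str], bool],
) -> List[T]:
    """Keep only the items that the strong baseline gets wrong."""
    return [item for item in items if not is_correct(item, baseline_answer(item))]


# Toy data: (question, gold answer) pairs and a hypothetical baseline lookup table.
candidates = [("2 + 2", "4"), ("capital of France", "Paris"), ("obscure question", "X")]
baseline_knows = {"2 + 2": "4", "capital of France": "Paris"}

hard = filter_hard_items(
    candidates,
    baseline_answer=lambda item: baseline_knows.get(item[0], ""),
    is_correct=lambda item, answer: answer == item[1],
)
print(hard)
```

One caveat on the design: filtering against a single baseline biases the surviving set against that baseline, so its score on the final benchmark is no longer a fair comparison point.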

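III. Construct validity: making the benchmark measure what it claims to measure

This is the most important concept in the whole design process, and the one most easily skipped.

The four validity types proposed by Reuel et al. form a complete checking framework (Measuring What Matters: Construct Validity in LLM Benchmarks):

Validity type	Core question	Typical failure mode
Construct	Does it measure the capability its name claims?	A benchmark is named a "general reasoning" test but mostly measures knowledge recall
Criterion	Does the score correlate with real downstream outcomes?	The benchmark score rises 10 points while production performance does not change at all
Consequential	Does optimizing this metric produce the behavior you want?	Optimizing pass@1 makes the model overly conservative, refusing reasonable questions
External	Does it generalize beyond the eval set?	Strong on the curated test set, collapses as soon as the distribution shifts

The core practice: before writing the first question, write down in one paragraph (in words, not as a hunch) what capability you are measuring, why it matters for the system, and what a rise or fall in the score means. If you cannot articulate the construct, you cannot measure it.

The GIM benchmark goes further and brings IRT (item response theory, a tool from psychometrics) into AI benchmarking: a 2PL model estimates difficulty and discrimination parameters for each item, so that a score is not just "how many were right" but a calibrated estimate of latent ability (GIM). A position paper likewise argues that AI evaluation should be "an inference task grounded in an explicit theory of capability" and proposes the Evaluation Card as a documentation tool (Position: AI Evaluations Should be Grounded on a Theory of Capability).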
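
For readers unfamiliar with IRT, the standard 2PL item response function is P(correct) = 1 / (1 + exp(-a(theta - b))), where theta is the test-taker's latent ability, b the item difficulty, and a the item discrimination. The sketch below only evaluates that textbook formula on made-up parameters. How GIM actually fits and uses its parameters is not described in the sources above and is not assumed here.

```python
import math


def p_correct(theta: float, discrimination: float, difficulty: float) -> float:
    """2PL item response function: probability that ability theta solves the item."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Two hypothetical items with the same difficulty but different discrimination.
for ability in (-1.0, 0.0, 1.0):
    flat = p_correct(ability, discrimination=0.5, difficulty=0.0)
    sharp = p_correct(ability, discrimination=2.0, difficulty=0.0)
    print(f"theta={ability:+.1f}  flat item={flat:.2f}  sharp item={sharp:.2f}")
```

A high-discrimination item separates systems just below and just above its difficulty much more sharply, which is why discrimination is worth estimating alongside raw accuracy.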

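IV. Where the questions come from, and who writes them

Real artifacts beat synthetic prompts

SWE-bench's 2,294 tasks come from real GitHub issues in 12 popular Python repositories, and scoring runs each project's actual test suite (SWE-bench). Building on real artifacts gives the dataset three properties that synthetic data cannot replicate: it can be refreshed continuously from new PRs, it is hard to attack with surface heuristics, and it is grounded in code that humans have reviewed.

For a custom benchmark, mine your own systems first: bug trackers, support tickets, rejected agent outputs, and cases escalated to human review are the highest-signal task sources. Anthropic's advice is to start with 20–50 cases drawn from production failures and expand gradually to 200–1,000 expert-annotated samples (Demystifying Evals for AI Agents).

Expert authoring + adversarial validation

GPQA Diamond is the cleanest reference implementation:

A domain expert writes the question
A second expert validates it
Revision
Non-expert validation: non-experts (with web access and 30+ minutes) attempt it

The result: domain PhDs reach 65% accuracy (74% after discounting clear mistakes), while skilled non-experts reach only 34%. The non-expert validation stage answers the key question: "Is this question genuinely hard, or merely obscure?" The Diamond subset is the high-confidence slice on which two experts agree and a third validator independently confirms. GPQA Diamond also embeds a canary string for contamination tracking.

LegalBench is the model of expert collaboration: 40+ lawyers, law professors, and legal practitioners contributed 162 tasks, classified into six reasoning types (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-understanding). The taxonomy is itself a contribution: it forces question writers to state which kind of legal reasoning each question tests.

Four operating rules

Start from real failures and expand iteratively: seed set → expansion → iteration
Write reference answers: if your domain experts cannot solve a question, the model certainly cannot, and that failure tells you nothing
Use multiple raters per question: a single rater passes on that rater's blind spots. SWE-bench Verified used three annotators with severity-ensembled filtering and removed roughly one third of the original questions as ambiguous or infeasible (Introducing SWE-bench Verified)
Strictly separate training and test sets: train/test contamination quietly inflates held-out scores, and you only notice once production performance falls short of expectations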

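V. Scoring: code graders, model graders, human graders

The Anthropic engineering team's taxonomy is the clearest framework (Demystifying Evals for AI Agents): three kinds of grader, chosen from cheapest to most expensive.

Code-based grader (the gold standard)

Use it whenever you can. Exact match, regex, executable tests, structured-output validation: deterministic, free to run, and not bypassable without solving the task. SWE-bench's FAIL_TO_PASS / PASS_TO_PASS pattern is a code-based grader.

Model-based grader (LLM judge, the workhorse for open-ended output)

The key finding comes from Park et al. (2025): the rubric is the dominant factor in LLM-judge reliability, and when the rubric is clear, chain-of-thought adds almost nothing (An Empirical Study of LLM-as-a-Judge). The rubric is the operationalization of the construct. If the rubric is vague, the construct is vague, and no grader, however sophisticated, can rescue it.

The LeMAJ legal evaluation framework found that a shared rubric raised inter-reviewer agreement by 11%, and that a rubric-guided LLM-judge configuration reached Cohen's κ = 0.75 against human consensus (LeMAJ).

Three practical rules:

Validate against a human-annotated golden set: 75–90% agreement with human consensus is the floor
Decompose into structured criteria: binary checks ("Does the output cite a real case?") plus ordinal ratings ("Rate clarity 1–5, with anchor descriptions") far outperform a single overall score
Watch for known biases: position bias, length bias, and self-preference (favoring outputs from the same model family). Randomize positions, normalize length, and use a judge from a different model family than the system under test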
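
Validating a judge against a golden set comes down to an agreement statistic. Raw percent agreement can look good by chance when one label dominates, which is why Cohen's kappa is often reported as well. Below is a minimal pure-Python sketch of Cohen's kappa for two raters; the label lists are invented for illustration and are not data from any of the studies cited above.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical golden-set labels from human consensus and from an LLM judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
print(f"kappa = {cohens_kappa(human, judge):.2f}")  # prints kappa = 0.75
```

Here the raters agree on 7 of 8 items (87.5%), but chance agreement is 50%, so kappa is 0.75. The two numbers answer different questions, and reporting both avoids overstating judge reliability.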

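Human grader (the calibration layer and the high-stakes layer)

Used for building the golden set, calibrating judges, and settings where the cost of error is unacceptable (clinical safety, legal compliance, financial advice). It is usually uneconomical to run humans over the full benchmark, but almost always feasible on a calibration subset, and those scores anchor every downstream grader.

VI. Data contamination: the thorniest problem in benchmark design

The ConTAM analysis by Singh et al. (2024) covered 13 benchmarks × 7 models and found that contamination is still underestimated even when developers have attempted decontamination (Evaluation Data Contamination in LLMs). In practice, even the BIG-Bench canary string has been memorized by GPT-4-base and Claude 3.5 Sonnet (BIG-Bench Canary Contamination in GPT-4).

Four layers of defense

Layer	Method	Representative example
Detection	Embed a canary string and test afterward whether the model can reproduce it	GPQA Diamond
Temporal isolation	Date-stamp every question and filter by each model's training cutoff	LiveCodeBench (timestamps competitive-programming problems from May 2023 onward)
Spatial isolation	Keep a private split that is never released	The GIM benchmark's 615 public + 205 private design (GIM)
Fundamental immunity	Do not release answers (SciCode, for example, releases only function descriptions and unit tests, not solution code), or generate questions dynamically	The shift from static to dynamic benchmarks (Recent Advances in LLM Benchmarks against Data Contamination)

Ofir Press's SciCode design embodies the idea that "even if the benchmark leaks into training data in full, the model still cannot score": PhDs write the programming problems, and only the descriptions and tests are released, never the answers.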
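
The first two layers, detection and temporal isolation, are simple enough to sketch. Everything named here is hypothetical: `BenchItem`, the item dates, and the canary value are invented for illustration and do not reproduce the format used by GPQA or LiveCodeBench.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class BenchItem:
    item_id: str
    published: date  # date the underlying problem first appeared publicly


def items_after_cutoff(items: List[BenchItem], training_cutoff: date) -> List[BenchItem]:
    """Temporal isolation: keep only items published after the model's training cutoff."""
    return [item for item in items if item.published > training_cutoff]


def canary_reproduced(model_output: str, canary: str) -> bool:
    """Detection: True if the model's output contains the benchmark's canary string."""
    return canary in model_output


# Hypothetical items and a made-up canary value, for illustration only.
pool = [
    BenchItem("q1", date(2023, 4, 30)),
    BenchItem("q2", date(2024, 2, 15)),
    BenchItem("q3", date(2024, 9, 1)),
]
CANARY = "EXAMPLE-CANARY-GUID-0000"

clean = items_after_cutoff(pool, training_cutoff=date(2024, 1, 1))
print([item.item_id for item in clean])
print(canary_reproduced("...completion text... EXAMPLE-CANARY-GUID-0000", CANARY))
```

A canary hit shows that benchmark files reached the training data, but a miss proves nothing, since the questions can leak without the canary. That is why the table treats detection as only the first layer.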

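Harness isolation: an easily overlooked problem

If the system under test and the grader share a file system, an agent can read grader logs, probe for scoring signals, and emit output that satisfies the grader without solving the problem. PeerBench proposes sealed execution as the remedy (PeerBench at NeurIPS 2025). This isolation should be built into the architecture at design time, not patched in afterward.

VII. Goodhart's law and benchmark lifecycle management

"When a measure becomes a target, it ceases to be a good measure."

OpenAI's o3 scored 75.7% on ARC-AGI, but look closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times the resources of the baseline. "It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough." (The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks)

Separate two kinds of eval from day one

Capability eval: low pass rate (5–30%); it shows you "what new things the model can do".
Regression eval: high pass rate (>90%); it shows you "whether we broke something that used to work".

A team with only capability evals goes blind once the model improves; a team with only regression evals never sees what the model still cannot do. The same benchmark moves through a lifecycle from capability eval to regression eval, so plan for that transition instead of scrambling once it saturates.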
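
The lifecycle transition can be tracked mechanically by labelling each eval suite by its current pass rate. The thresholds below come from the ranges quoted above (at most 30% for capability, above 90% for regression). The "transitioning" label for the band in between is an assumption of this sketch, not a term from the cited guidance.

```python
def eval_role(pass_rate: float) -> str:
    """Label an eval suite by the role its current pass rate lets it play."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if pass_rate <= 0.30:
        return "capability"  # still shows what the model cannot do yet
    if pass_rate > 0.90:
        return "regression"  # mostly useful for catching breakage
    return "transitioning"  # hypothetical middle label: plan a harder successor


# Hypothetical pass-rate history of one suite across model releases.
for release, rate in [("v1", 0.12), ("v2", 0.55), ("v3", 0.94)]:
    print(release, eval_role(rate))
```

A suite entering the middle band is the signal to start building its harder successor, while the old suite is retained for regression tracking.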

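Continuous updating

Feed production failures back into the eval set: every user-flagged wrong answer, every human override, every escalation is a pre-validated hard case
Version the benchmark and align versions with model versions
Tag every eval run with its model version, prompt version, and rubric version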
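
Tagging runs pays off when two scores are compared: a difference can be attributed to the model only if the benchmark, prompt, and rubric were held fixed. A minimal sketch of such a run record follows; the field names and version strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    benchmark_version: str
    model_version: str
    prompt_version: str
    rubric_version: str
    score: float


def comparable(a: EvalRun, b: EvalRun) -> bool:
    """Two scores are comparable only if everything except the model was held fixed."""
    return (
        a.benchmark_version == b.benchmark_version
        and a.prompt_version == b.prompt_version
        and a.rubric_version == b.rubric_version
    )


# Hypothetical runs: the third one silently changed the rubric.
old = EvalRun("bench-1.2", "model-a", "prompt-3", "rubric-2", 0.41)
new = EvalRun("bench-1.2", "model-b", "prompt-3", "rubric-2", 0.47)
drifted = EvalRun("bench-1.2", "model-b", "prompt-3", "rubric-3", 0.52)

print(comparable(old, new))
print(comparable(old, drifted))
```

The jump from 0.47 to 0.52 in the third run cannot be credited to the model, because the rubric changed at the same time.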

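VIII. A complete design checklist

Drawing on How2Bench's 55-criteria checklist (How2Bench), the BetterBench framework (BetterBench), and all the sources above, a benchmark design should cover the following dimensions:

Design phase

State the construct in one paragraph: what capability is measured, why it matters, what score changes mean
Define the target audience and use case (research comparison vs. production decisions vs. safety audit)
Pick one primary metric plus 3–4 secondary metrics that cover the real trade-offs you face
Plan the capability eval → regression eval lifecycle

Task construction

Prefer real artifacts (bug reports, support tickets, real queries)
If synthesis is needed, follow the GPQA Diamond pipeline: expert authoring → expert validation → revision → non-expert validation
Write a reference answer for every question
Use multiple raters, and record inter-rater disagreement as a signal of question ambiguity
Include balanced positive and negative cases ("agent successfully cancels an order" plus "agent refuses to cancel an order that should not be cancelled")

Scoring system

Use a code-based grader wherever one is possible
With an LLM judge the rubric is central: spell out an operational definition and anchors for every score level
Validate judge reliability against a human-annotated golden set (≥75–90% agreement)
Identify and mitigate judge biases (position, length, self-preference)
Isolate the system under test from the grader (harness isolation)

Contamination defense

Embed a canary string
Date-stamp every question / keep a private holdout
Where possible, do not publish answers
Consider a dynamic generation mechanism

Maintenance plan

Versioning: align benchmark versions with model versions
A pipeline that feeds production failures back in
Periodic re-annotation
Provide reproduction scripts and CI integration

Supplementary note: what this search did and did not cover

Covered: design principles (SPEC, Ofir Press), construct validity (Reuel et al.), task sourcing (SWE-bench, GPQA Diamond, LegalBench), scoring methods (the three-tier code/model/human grader taxonomy), contamination defense (canary, private holdout, dynamic generation, harness isolation), and Goodhart's law with lifecycle management.

Areas that need more depth and were not fully developed here:

Interactive evaluation as a paradigm in its own right: the focus of a May 2026 arXiv position paper (Interactive Evaluation Requires a Design Science), which argues that "interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks"
Challenges specific to multimodal benchmarks (handling image + text + code together, and so on)
Domain-specific benchmark design details (medical imaging, financial trading, and other areas that need deep domain-expert involvement to explain properly)
A systematic methodology for applying IRT (item response theory) in AI benchmarks: GIM and a few other benchmarks use it today, but it is not yet a community standard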


How to design an excellent benchmark

▼thought

▼✓Web search query: "how to design a good benchmark principles methodology", results: 10

{
  "results": [
    {
      "title": "How to Build Good Language Modeling Benchmarks – Ofir Press",
      "url": "https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/",
      "description": "Building benchmarks is important because they shine a spotlight on the weaknesses of existing language models and so can guide the community on how to improve them. I’ve spent a lot of my career both on building benchmarks and on building systems that push forward the state-of-the-art on a given benchmark, and I believe that building good benchmarks is *just as important* as building new systems. Designing a good benchmark is challenging and I’ve spent a lot of time recently thinking about what makes for a good benchmark. Try to build a benchmark that has natural questions that some category of humans ask on a frequent basis. B. **The questions are made up and not taken from actual questions asked by actual humans**: If you work for Google and you’re tasked with building a challenging question-answering benchmark, a really nonoptimal thing to do would be to sit around by yourself in a room and just try to think of questions."
    },
    {
      "title": "[PDF] Principles for Financial Benchmarks - IOSCO",
      "url": "https://www.iosco.org/library/pubdocs/pdf/ioscopd415.pdf",
      "description": "a) Clearly define and substantiate through appropriate written arrangements the roles and obligations of third parties who participate in the"
    },
    {
      "title": "Benchmark Standards for Justice Innovations",
      "url": "https://justiceinnovation.law.stanford.edu/resources/evaluation/benchmark/",
      "description": "The benchmark principles, checklists, and criteria can be used to review what you are creating. Judge your initiative against these expert best practices, and"
    },
    {
      "title": "[PDF] How to Build a Benchmark - SPEC Research Group",
      "url": "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf",
      "description": "The key characteristics can be or-ganized in the following groups, which will be discussed in more detail in the next sections: • Relevance How closely the benchmark behavior cor-relates to behaviors that are of interest to consumers of the results • Reproducibility The ability to consistently produce similar results when the benchmark is run with the same test conﬁguration • Fairness Allowing diﬀerent test conﬁgurations to com-pete on their merits with-out artiﬁcial limitations • Veriﬁability Providing conﬁdence that a benchmark result is accurate • Usability Avoiding roadblocks for users to run the benchmark in their test environments All benchmarks are subject to these same criteria, but each category includes additional issues that are speciﬁc to the individual benchmark, depending on the benchmark’s goals."
    },
    {
      "title": "Custom AI Benchmark Guide: Lessons from Public Evals",
      "url": "https://kili-technology.com/blog/custom-ai-benchmark-guide-what-the-best-public-evals-teach-you-about-building-your-own",
      "description": "Learn best practices for combining LLM-as-a-judge and HITL workflows for reliable AI. # Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own. The public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points in SWE-bench scores in a single year. A custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public ai benchmarks are now publicly working through."
    },
    {
      "title": "How to Build AI Benchmarks That Evolve | Label Studio",
      "url": "https://labelstud.io/blog/how-to-build-ai-benchmarks-that-evolve-with-your-models/",
      "description": "# How to Build AI Benchmarks that Evolve with your Models. In our first post in the Benchmark series, we explored why evaluating large language models (LLMs) is uniquely challenging—and how AI benchmarks offer a solution by bringing structure, repeatability, and objectivity to an otherwise subjective process. In this article, we’ll break down what makes a benchmark effective: the core components you need, different scoring approaches, and when to use them. We’ll also explore how benchmark strategies should evolve as your system matures—from early prototypes to production-ready applications—so you can evaluate your model in the right way, at the right time. An AI benchmark has two key components: a standardized set of tasks, and a scoring methodology. By asking the model to answer the same questions every time, we can begin to get a deeper understanding of where our model is performing well or poorly, and how it has changed over time."
    },
    {
      "title": "How Should I Build A Benchmark? Revisiting Code-Related Benchmarks For LLMs",
      "url": "https://arxiv.org/html/2501.10711v1",
      "description": "# How Should I Build A Benchmark? Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to assure its quality, reliability, and reproducibility. We propose How2Bench comprising a 55-criteria checklist as a set of guidelines to comprehensively govern the development of code-related benchmarks. Nearly 70% of the benchmarks did not take measures for data quality assurance;. Finally, we conducted a human study involving 49 participants and revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency. For example, a recent study pointed out that “current programming benchmarks are inadequate for assessing the actual correctness of LLM-generated code” (Liu et al., 2023a). This checklist covers the entire lifecycle of benchmark development, from design and construction to evaluation, analysis, and release as shown in Figure 1."
    },
    {
      "title": "Essential guidelines for computational method benchmarking - PMC",
      "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC6584985/",
      "description": "In general, we can define three broad types of benchmarking studies: (i) those by method developers, to demonstrate the merits of their approach (e.g., [22–26]); (ii) neutral studies performed to systematically compare methods for a certain analysis, either conducted directly by an independent group (e.g., [27–38]) or in collaboration with method authors (e.g., [39]); or (iii) those organized in the form of a community challenge, such as those from the DREAM [40–44], FlowCAP [45, 46], CASP [47, 48], CAMI [49], Assemblathon [50, 51], MAQC/SEQC [52–54], and GA4GH [55] consortia. Evaluation criteria: key quantitative performance metrics | +++ | Number and types of performance metrics | Subjectivity in the choice of metrics: e.g., selecting metrics that do not translate to real-world performance  Metrics that give over-optimistic estimates of performance  Methods may not be directly comparable according to individual metrics (e.g., if methods are designed for different tasks) |."
    },
    {
      "title": "What is Benchmarking? Technical & Competitive ... - ASQ",
      "url": "https://asq.org/quality-resources/benchmarking?srsltid=AfmBOoqVw5c0a3C-vqeTZlpsK1or5hUKoylrtrOH-S4jJKkMPSm1rBUZ",
      "description": "Benchmarking is a method that allows companies to compare products, performance and services to top competitors. Learn more as ASQ.org."
    },
    {
      "title": "What advice can you give me for writing a meaningful benchmark?",
      "url": "https://stackoverflow.com/questions/324200/what-advice-can-you-give-me-for-writing-a-meaningful-benchmark",
      "description": "The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers."
    }
  ]
}

▼✓Web search query: "benchmark design flaws pitfalls AI ML evaluation 2024 2025", results: 10

{
  "results": [
    {
      "title": "NeurIPS Poster Position: Benchmarking is Broken - Don't Let AI be Its Own Judge",
      "url": "https://neurips.cc/virtual/2025/poster/121919",
      "description": "Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. As a flood of participants enters the AI space, this \"Wild West\" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's.In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? For true and sustainable AI advancement, we call for a paradigm shift to a unified, live, and quality-controlled benchmarking framework—robust by construction rather than reliant on courtesy or goodwill. To concretize this position, we introduce the idea of PeerBench, a community-governed, proctored evaluation blueprint that seeks to improve security and credibility through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to lay the groundwork for evaluations that restore integrity and deliver genuinely trustworthy measures of AI progress."
    },
    {
      "title": "AI Benchmarks for 2025 ·",
      "url": "https://dasarpai.com/dsblog/ai-benchmarks-2025/",
      "description": "# AI Benchmarks for 2025. A term **“AI benchmark”** is thrown around a lot and can be confusing because it’s used in slightly different ways depending on the context. In this artcile we will try to understand what are the different meaning of this term and what are the latest AI benchmarks. In general, an **AI benchmark** is a **standardized way to evaluate the performance** of an AI system and/or models. For example, Image classification, text generation, translation, reasoning etc. **Dataset** : The data used to test performance of the model. **Metrics** : How performance is measured or what metrics are used to evalute the model performance. **Leaderboard** : A ranked list of various model’s performances on the against same benchmark. For example, when someone say the model was evaluated against SuperGLUE (a famous NLP benchmark) benchmark it means. Here’s a list of **LLM benchmarks released in the last 2–3 years (2022–2024)** — these are *recent*, research-driven, and widely cited or adopted:."
    },
    {
      "title": "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation",
      "url": "https://arxiv.org/html/2502.06559v1",
      "description": "An Interdisciplinary Review of Current Issues in AI Evaluation. Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Quantitative artificial intelligence (AI) benchmarks (i.e., combinations of test datasets and performance metrics that are taken to represent general or specific tasks and used to compare AI model capabilities and/or risks (Raji et al., 2021)) play a central role in the release and marketing of newly developed AI tools."
    },
    {
      "title": "Avoiding Common Pitfalls in LLM Evaluation",
      "url": "https://www.honeyhive.ai/post/avoiding-common-pitfalls-in-llm-evaluation",
      "description": "# Avoiding Common Pitfalls in LLM Evaluation. Discover the hidden challenges of LLM evaluation and the most common mistakes we've seen after helping hundreds of teams build effective evals that drive business results. ###### About the author:. ###### Felipe Adachi. ## Join our monthly newsletter. Your submission has been received! Something went wrong while submitting the form."
    },
    {
      "title": "Custom AI Benchmark Guide: Lessons from Public Evals",
      "url": "https://kili-technology.com/blog/custom-ai-benchmark-guide-what-the-best-public-evals-teach-you-about-building-your-own",
      "description": "Learn best practices for combining LLM-as-a-judge and HITL workflows for reliable AI. # Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own. The public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points in SWE-bench scores in a single year. A custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public ai benchmarks are now publicly working through."
    },
    {
      "title": "Technical Performance | The 2025 AI Index Report | Stanford HAI",
      "url": "https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance",
      "description": "In 2023, AI researchers introduced several challenging new benchmarks, including MMMU, GPQA, and SWE-bench, aimed at testing the limits of increasingly capable AI systems. By 2024, AI performance on these benchmarks saw remarkable improvements, with gains of 18.8 and 48.9 percentage points on MMMU and GPQA, respectively. On SWE-bench, AI systems could solve just 4.4% of coding problems in 2023—a figure that jumped to 71.7% in 2024. In early January 2024, the leading closed-weight model outperformed the top open-weight model by 8.04% on the Chatbot Arena Leaderboard. By early 2025, this gap had narrowed to just 5.4%. Likewise, the difference between the top two models shrank from 4.9% in 2023 to just 0.7% in 2024. The saturation of traditional AI benchmarks like MMLU, GSM8K, and HumanEval, coupled with improved performance on newer, more challenging benchmarks such as MMMU and GPQA, has pushed researchers to explore additional evaluation methods for leading AI systems."
    },
    {
      "title": "A benchmark of expert-level academic questions to assess AI ...",
      "url": "https://www.nature.com/articles/s41586-025-09962-4",
      "description": "# A benchmark of expert-level academic questions to assess AI capabilities | Nature. We also use optional cookies for advertising, personalisation of content, usage analysis, and social media, as well as to allow video information to be shared for both marketing, analytics and editorial purposes. By accepting optional cookies, you consent to the processing of your personal data - including transfers to third parties. See our privacy policy for more information on the use of your personal data. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To address this gap, we introduce HLE (originally defined as Humanity’s Last Exam, although we will use the term HLE for this paper), a benchmark of 2,500 challenging questions from dozens of subject areas, designed to assess LLM capabilities at an expert level in broad academic subjects."
    },
    {
      "title": "LLM Evaluation Framework: MMLU, Chatbot Arena & LLM-as-Judge [2026 Guide]",
      "url": "https://www.meta-intelligence.tech/en/insight-llm-evaluation",
      "description": "# LLM Evaluation Framework: MMLU, Chatbot Arena & LLM-as-Judge [2026 Guide]. Multi-dimensional LLM evaluation radar chart and leaderboard illustration. Evaluating large language models is one of the most challenging problems in the AI field. However, the capability dimensions of LLMs are extraordinarily broad: they simultaneously handle translation, summarization, code generation, mathematical reasoning, creative writing, fact-checking, and dozens of other tasks — no single metric can capture the full picture. The more fundamental difficulty is that **\"a good answer\" is itself a subjective and multi-dimensional concept.** One answer may be impeccable in factual accuracy but stiff in tone and lacking empathy; another may be beautifully written but contain subtle hallucinations. `The Core Dilemma of LLM Evaluation:. in their survey categorize LLM evaluation methods into three major classes: automated benchmark evaluation, human evaluation, and model-as-evaluator (LLM-as-Judge). This article will systematically dissect the current major LLM evaluation methodologies, from static benchmarks to dynamic human rankings, from automated judges to enterprise custom frameworks, providing readers with a complete evaluation decision map."
    },
    {
      "title": "2025 July \"AI Evaluation\" Digest",
      "url": "https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest",
      "description": "[![Image 1: The AI Evaluation Substack](https://substackcdn.com/image/fetch/$s_!xXz3!,w_40,h_40,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6dcf51-f765-489b-8c76-0a26cf708ebf_494x494.png)](https://aievaluation.substack.com/). # [The AI Evaluation Substack](https://aievaluation.substack.com/). [Sign in](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest). [![Image 3: AI Evaluation's avatar](https://substackcdn.com/image/fetch/$s_!bqQs!,w_36,h_36,c_fill,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87229cf4-5123-4968-bfbf-3033864b23dc_440x542.png)](https://substack.com/@aievals). [AI Evaluation](https://substack.com/@aievals). Now, as the AI community faces new challenges with evaluating ever-larger LMs, new initiatives (e.g., [inspect_evals](https://arxiv.org/pdf/2507.06893) and [WEVAL](https://weval.org/)) are following in OpenML’s footsteps, showing that lasting impact comes not just from flashy algorithms or brute-force compute, but from building the shared foundations that make ongoing, open, and rigorous scientific progress possible. *   [MLR-Bench](https://arxiv.org/pdf/2505.19955) tests whether AI systems can truly conduct independent scientific research in the area of machine learning, evaluating models across 201 machine learning problems through all research stages. [![Image 4](https://substackcdn.com/image/fetch/$s_!xXz3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b6dcf51-f765-489b-8c76-0a26cf708ebf_494x494.png)The AI Evaluation Substack A monthly digest of the latest developments, research trends and key initiatives in the realm of AI evaluation.](https://aievaluation.substack.com/?utm_source=substack&utm_campaign=publication_embed&utm_medium=web). 
[12 Likes](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest)∙. [Jul 25, 2025](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest/comment/138924777 \"Jul 25, 2025, 7:55 PM\"). [Aug 26, 2025](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest/comment/149274778 \"Aug 26, 2025, 1:43 PM\"). [2 more comments...](https://aievaluation.substack.com/p/2025-july-ai-evaluation-digest/comments). [2025 September \"AI Evaluation\" Digest](https://aievaluation.substack.com/p/2025-september-ai-evaluation-digest). [What could possibly go wrong?](https://aievaluation.substack.com/p/2025-september-ai-evaluation-digest). [2025 December \"AI Evaluation\" Digest](https://aievaluation.substack.com/p/2025-december-ai-evaluation-digest). [Call for Tributes: Your test of time.](https://aievaluation.substack.com/p/2025-december-ai-evaluation-digest). [2025 November \"AI Evaluation\" Digest](https://aievaluation.substack.com/p/2025-november-ai-evaluation-digest). Seeing is all you need](https://aievaluation.substack.com/p/2025-november-ai-evaluation-digest)."
    },
    {
      "title": "Why AI Needs Better Benchmarks",
      "url": "https://www.youtube.com/watch?v=kLZeFWZewH0",
      "description": "Why AI Needs Better Benchmarks\nThe AI Daily Brief: Artificial Intelligence News\n581000 subscribers\n102 likes\n3982 views\n27 Mar 2026\nARC-AGI-3 from the ARC Prize measures intelligence by testing learning efficiency across 135 interactive visual games. Coverage examines benchmark saturation, benchmark maxing, and the shift from memorization-focused tests toward interactive, tool-enabled real-world evaluations. ARC-AGI-3 aims to expose gaps between human learning and current models and to refocus research on genuine general intelligence.\n\nThe AI Daily Brief helps you understand the most important news and discussions in AI. \nSubscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614\nGet it ad free at http://patreon.com/aidailybrief\nLearn more about the show https://aidailybrief.ai/\n6 comments"
    }
  ]
}

▼✓Web search query: "benchmark saturation data contamination goodhart's law evaluation crisis", results: 10

{
  "results": [
    {
      "title": "The Measurement Crisis: Saturation, Goodhart's Law, and the End of ...",
      "url": "https://hub.stabilarity.com/the-measurement-crisis-saturation-goodharts-law-and-the-end-of-ai-leaderboards/",
      "description": "The Measurement Crisis ... Their analysis identifies two contamination pathways: direct (benchmark items in training data) ... benchmark saturation, Goodhart's Law,"
    },
    {
      "title": "A Systematic Study of Benchmark Saturation - arXiv",
      "url": "https://arxiv.org/html/2602.16763v1",
      "description": "Various works have highlighted fundamental evaluation pitfalls in AI evaluation: Data contamination, i.e., when test content appears in training, can"
    },
    {
      "title": "To Solve the Benchmark Crisis, Evals Must Think - Fig",
      "url": "https://blog.fig.inc/to-solve-the-benchmark-crisis-evals-must-think/",
      "description": "This is Goodhart's Law4 at scale: when a measure becomes a target, it ceases to be a good measure. A practical anecdote: A team we recently"
    },
    {
      "title": "Benchmark Saturation: AI Evaluation Metrics and Ceiling Effects - Interactive | Michael Brenndoerfer | Michael Brenndoerfer",
      "url": "https://mbrenndoerfer.com/writing/benchmark-saturation-ai-evaluation-metrics",
      "description": "# Benchmark Saturation: AI Evaluation Metrics and Ceiling Effects. Discover benchmark saturation in AI evaluation. Learn why static metrics hit ceiling effects, lose statistical power, and how dynamic benchmarks solve this. When GPT-4 achieved 86.4% on the MMLU benchmark in early 2023, it approached but did not surpass the average human performance of approximately 89.8% by expert test-takers in those subjects. This rapid ascent reveals a fundamental challenge in evaluating language models: benchmarks have a lifespan. What begins as a rigorous test of capability eventually becomes a checkmark on a datasheet, unable to distinguish between good models and great ones. This phenomenon is known as benchmark saturation. Benchmark saturation occurs when model performance on a static dataset approaches the theoretical ceiling, rendering the metric incapable of discriminating between improvements. When a benchmark saturates, the number no longer represents meaningful capability differences; it becomes an artifact of measurement precision, annotation noise, and memorization rather than understanding."
    },
    {
      "title": "The Sequence Opinion #750: The Paradox of AI Benchmarks: Challenges in Evaluation",
      "url": "https://thesequence.substack.com/p/the-sequence-opinion-750-the-paradox",
      "description": "# The Sequence Opinion #750: The Paradox of AI Benchmarks: Challenges in Evaluation. ### Goodhart’s Law and some of the challenges around AI benchmarks. We largely rely on standardized benchmarks and tidy metrics to declare winners and track improvement. Yet a central paradox undercuts this practice: as soon as a metric becomes the target, it often stops measuring what we intended. This is Goodhart’s Law in action—*“when a measure becomes a target, it ceases to be a good measure.”* In AI, that means a leaderboard gain may reflect success at **gaming the test**, not a durable leap in capability. This essay examines the **core challenges of AI benchmarks and evaluations** across language, vision, and reinforcement learning (RL). We’ll look at how once-useful measures turn into misleading goals; why good evals should be **accurate, inexpensive, and hard to memorize**; where academic and real-world tests clash; and what RL environments promise—and fail—to deliver as evolving evaluations. ## Goodhart’s Law and the Benchmark Trap."
    },
    {
      "title": "The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks",
      "url": "https://tianpan.co/blog/2026-04-19-goodharts-law-ai-benchmark-gaming",
      "description": "# The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks. In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — \"when a measure becomes a target, it ceases to be a good measure\" — was articulated in 1970s economic policy, but it describes AI benchmarking with eerie precision. ## The Lifecycle of a Benchmark. A research team identifies a meaningful capability — reasoning, code generation, multilingual comprehension — and constructs test questions to measure it."
    },
    {
      "title": "Is AI Cheating on the Test: Data Contamination, Gaming, and the ...",
      "url": "https://medium.com/@wasowski.jarek/is-ai-cheating-on-the-test-data-contamination-gaming-and-the-benchmark-crisis-9dff2fba494f",
      "description": "How data contamination, Goodhart's Law, and gaming turned objective AI tests into a marketing tool — and what to do about it as a practitioner."
    },
    {
      "title": "A Systematic Study of Benchmark Saturation",
      "url": "https://mcml.ai/publications/ars+26/",
      "description": "In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To"
    },
    {
      "title": "[PDF] Scaling Laws, Foundation Models, and the AI Singularity",
      "url": "https://wjarr.com/sites/default/files/fulltext_pdf/WJARR-2026-0011.pdf",
      "description": "As models consume their own outputs, recursive training risks variance collapse and model autophagy unless carefully mitigated through provenance filtering and human data prioritisation For scaling-law practice, the implication is that future performance may be bottlenecked less by raw compute and more by data governance: provenance tracking, contamination control, deduplication, and methods that preserve distributional diversity while improving informational density per token (Penedo et al., 2024; Villalobos et al., 2024). A useful taxonomy separates: (1) frontier catastrophic risks, such as misuse at scale, loss of control in tightly coupled systems, or rapid World Journal of Advanced Research and Reviews, 2026, 29(01), 111-134 125 capability jumps under competitive pressure; (2) high-frequency societal harms, including disinformation, discrimination, privacy violations, and workplace exploitation; and (3) systemic infrastructure risks, such as energy demand, critical dependency on concentrated providers, and fragility created by common-model monocultures (Bengio et al., 2025; Bontcheva et al., 2023; CMA, 2024; International Energy Agency, 2024)."
    },
    {
      "title": "The Metrology Imperative: The Necessity of Robust Evaluation Frameworks and Comprehensive Automated Judges in Generative AI\n\t\t\t\t\t\t\t| International Journal of Intelligent Systems and Applications in Engineering",
      "url": "https://ijisae.org/index.php/IJISAE/article/view/8188",
      "description": "# The Metrology Imperative: The Necessity of Robust Evaluation Frameworks and Comprehensive Automated Judges in Generative AI. Across the past several years, the accelerating advancement of Large Language Models (LLMs) and generative artificial intelligence has quietly produced a crisis that much of the field has been slow to name directly—a breakdown in the ability to evaluate what these systems can and cannot actually do. This brief argues, with considerable urgency, that building robust and dynamic evaluation frameworks alongside sophisticated automated judges—most prominently through the LLM-as-a-Judge paradigm—is not an optional enhancement to existing practices but an absolute prerequisite for the continued, safe, and value-aligned development of AI systems. Further, authors of accepted papers are prohibited from publishing the results in other publications that appear before the paper is published in the Journal unless they receive approval for doing so from the Editor-In-Chief."
    }
  ]
}

▼✓Feed full-text search keywords: "benchmark design methodology evaluation", items: 20

{
  "items": [
    {
      "id": "oai:arXiv.org:2605.18498v1",
      "url": "https://arxiv.org/abs/2605.18498",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs",
      "author": [
        "Jing Wang",
        "Hongxuan Lu",
        "Jazze Young",
        "Shu Wang",
        "Zhimin Xin"
      ],
      "summary": "arXiv:2605.18498v1 Announce Type: cross Abstract: Expert specialization in Mixture-of-Experts (MoE) models remains poorly understood, with traditional evaluations conflating architectural load-balancing with functional specialization. We introduce DBES, a comprehensive diagnostic framework combining a multi-domain benchmark with five theoretically grounded metrics: Routing Specialization, Normalized Effective Rank, Domain Isolation, Routing Stiffness Score, and N-gram Expertise measures. Critical findings demonstrate distinct specialization paradigms across models: Qwen-series exhibit modular specialization with high domain isolation, while DeepSeek and GLM employ distributed collaboration. However, we emphasize that specialization is a diagnostic dimension, necessary but not sufficient for downstream performance. Most crucially, interventional evidence validates the actionability of these metrics: by using DBES to identify high-specialization expert paths during domain-specific post-training, we achieved 66% to 94.48% improvement in specialized domains with only 15% of original training resources, demonstrating that these diagnostic tools can be converted into concrete optimization operators. This work provides the first systematic methodology for evaluating expert specialization independently of accuracy metrics, offering crucial insights for the design and post-training optimization of next-generation MoE systems.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.16282v1",
      "url": "https://arxiv.org/abs/2605.16282",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents",
      "author": [
        "Miles Q. Li",
        "Benjamin C. M. Fung",
        "Boyang Li",
        "Heba Ismail",
        "Farkhund Iqbal"
      ],
      "summary": "arXiv:2605.16282v1 Announce Type: cross Abstract: The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.05739v3",
      "url": "https://arxiv.org/abs/2605.05739",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback",
      "author": [
        "Mohammad Al Ridhawi",
        "Mahtab Haj Ali",
        "Hussein Al Osman"
      ],
      "summary": "arXiv:2605.05739v3 Announce Type: replace-cross Abstract: Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized 20-day Sharpe ratio. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty added to the Soft Actor-Critic reward. Three fine-tuning cycles, confined to validation data, reduce one-day MAPE from 0.61% to 0.54% (11.5% relative; p<0.001, d=0.31) on the held-out 2017 to 2025 test period, significant under Diebold-Mariano and localized by Giacomini-White to the high-volatility regime. The methodology is application-agnostic and applies to any agentic system whose intermediate decisions can be logged.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.16909v1",
      "url": "https://arxiv.org/abs/2605.16909",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents",
      "author": [
        "Zhiqiang Liu",
        "Wenhui Dong",
        "Yilang Tan",
        "Yuwen Qu",
        "Haochen Yin",
        "Chenyang Si"
      ],
      "summary": "arXiv:2605.16909v1 Announce Type: new Abstract: Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.13877v2",
      "url": "https://arxiv.org/abs/2605.13877",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "ARES-LSHADE: Autoresearch-Enhanced LSHADE with Memetic Polish for the GNBG Benchmark",
      "author": [
        "Abdullah Naeem",
        "Md Wasi Ul kabir",
        "Manish Bhatt",
        "Ayon Dey",
        "Anav Katwal",
        "Md Tamjidul Hoque"
      ],
      "summary": "arXiv:2605.13877v2 Announce Type: replace-cross Abstract: We present ARES-LSHADE, a memetic differential-evolution variant submitted to the GECCO 2026 competition on LLM-designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM-LSHADE 2025 winner, contributing two new components: (a) a scout-augmented mutation operator with adaptive CMA-ES integration, produced by an autonomous research loop across approximately thirty LLM-driven design experiments, and (b) a multi-start L-BFGS-B polish phase that respects strict blackbox treatment of the benchmark. On the official 31-run-per-function evaluation with the competition-specified function-evaluation budgets, ARES-LSHADE obtains 510 of 744 wins (per-function gap below 1e-8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG's compositional structure, and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM-driven research loop with operator-only edit surface and fitness-only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark's compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition's blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM-driven optimization-algorithm research. Code and reproducibility artifacts are available at https://github.com/anaeem1/ARES-LSHADE.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.17829v1",
      "url": "https://arxiv.org/abs/2605.17829",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Interactive Evaluation Requires a Design Science",
      "author": [
        "Keyang Xuan",
        "Peiyang Song",
        "Pan Lu",
        "Pengrui Han",
        "Wenkai Li",
        "Zhenyu Zhang",
        "Zexue He",
        "Wenyue Hua",
        "Manling Li",
        "Jiaxuan You",
        "Adrian Weller",
        "Yizhong Wang",
        "Jiaxin Pei"
      ],
      "summary": "arXiv:2605.17829v1 Announce Type: new Abstract: AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2510.18941v2",
      "url": "https://arxiv.org/abs/2510.18941",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge",
      "author": [
        "Zhilin Wang",
        "Jaehun Jung",
        "Ximing Lu",
        "Shizhe Diao",
        "Ellie Evans",
        "Jiaqi Zeng",
        "Pavlo Molchanov",
        "Yejin Choi",
        "Jan Kautz",
        "Yi Dong"
      ],
      "summary": "arXiv:2510.18941v2 Announce Type: replace-cross Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench and Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2601.11895v3",
      "url": "https://arxiv.org/abs/2601.11895",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models",
      "author": [
        "Adarsh Kumarappan",
        "Pareesa Ameneh Golnari",
        "Wen Wen",
        "Xiaoyu Liu",
        "Gabriel Ryan",
        "Yuting Sun",
        "Shengyu Fu",
        "Elsie Nallipogu"
      ],
      "summary": "arXiv:2601.11895v3 Announce Type: replace-cross Abstract: DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. 9 state-of-the-art models were assessed, with the strongest achieving only 43.5% Pass@1, confirming the benchmark remains challenging and revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2510.26384v2",
      "url": "https://arxiv.org/abs/2510.26384",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings",
      "author": [
        "Andrew M. Bean",
        "Nabeel Seedat",
        "Shengzhuang Chen",
        "Jonathan Richard Schwarz"
      ],
      "summary": "arXiv:2510.26384v2 Announce Type: replace Abstract: The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks (\"cold-start\"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18421v1",
      "url": "https://arxiv.org/abs/2605.18421",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective",
      "author": [
        "Yuyao Wang",
        "Zhongjian Zhang",
        "Mo Chi",
        "Kaichi Yu",
        "Yuhan Li",
        "Miao Peng",
        "Bing Tong",
        "Chen Zhang",
        "Yan Zhou",
        "Jia Li"
      ],
      "summary": "arXiv:2605.18421v1 Announce Type: cross Abstract: Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2509.19590v2",
      "url": "https://arxiv.org/abs/2509.19590",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Position: AI Evaluations Should be Grounded on a Theory of Capability",
      "author": [
        "Nathanael Jo",
        "Ashia Wilson"
      ],
      "summary": "arXiv:2509.19590v2 Announce Type: replace Abstract: Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18693v1",
      "url": "https://arxiv.org/abs/2605.18693",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents",
      "author": [
        "Yifan Zhou",
        "Zhentao Zhang",
        "Ziming Cheng",
        "Shuo Zhang",
        "Qizhen Lan",
        "Zhangquan Chen",
        "Zhi Yang",
        "QianyuXu",
        "Ronghao Chen",
        "Huacan Wang",
        "Sen Hu"
      ],
      "summary": "arXiv:2605.18693v1 Announce Type: new Abstract: As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.17554v1",
      "url": "https://arxiv.org/abs/2605.17554",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps",
      "author": [
        "Tanmay Asthana",
        "Aman Saksena",
        "Divyansh Sahu"
      ],
      "summary": "arXiv:2605.17554v1 Announce Type: new Abstract: Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18663v1",
      "url": "https://arxiv.org/abs/2605.18663",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "GIM: Evaluating models via tasks that integrate multiple cognitive domains",
      "author": [
        "Rohit Patel",
        "Alexandre Rezende",
        "Steven McClain"
      ],
      "summary": "arXiv:2605.18663v1 Announce Type: new Abstract: As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public--private split provides built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.17373v1",
      "url": "https://arxiv.org/abs/2605.17373",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics",
      "author": [
        "Qiran Zou",
        "Hou Hei Lam",
        "Wenhao Zhao",
        "Tingting Chen",
        "Yiming Tang",
        "Samson Yu",
        "Yingtao Zhu",
        "Srinivas Anumasa",
        "Zufeng Zhang",
        "Tianyi Zhang",
        "Chang Liu",
        "Zhengyao Jiang",
        "Anirudh Goyal",
        "Dianbo Liu"
      ],
      "summary": "arXiv:2605.17373v1 Announce Type: cross Abstract: AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.14133v2",
      "url": "https://arxiv.org/abs/2605.14133",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents",
      "author": [
        "Yuxiang Lai",
        "Peng Xia",
        "Haonian Ji",
        "Kaiwen Xiong",
        "Kaide Zeng",
        "Jiaqi Liu",
        "Fang Wu",
        "Jike Zhong",
        "Zeyu Zheng",
        "Cihang Xie",
        "Huaxiu Yao"
      ],
      "summary": "arXiv:2605.14133v2 Announce Type: replace Abstract: Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present \\textbf{ClawForge}, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17\\% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2508.20810v3",
      "url": "https://arxiv.org/abs/2508.20810",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs",
      "author": [
        "Jessica M. Lundin",
        "Usman Nasir Nakakana",
        "Guillaume Chabot-Couture"
      ],
      "summary": "arXiv:2508.20810v3 Announce Type: replace Abstract: Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-18T04:00:00.000Z",
      "fetched_at": "2026-05-18T16:06:46.568Z"
    },
    {
      "id": "oai:arXiv.org:2605.17937v1",
      "url": "https://arxiv.org/abs/2605.17937",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting",
      "author": [
        "Zhensheng Wang",
        "Wenmian Yang",
        "Qingtai Wu",
        "Lequan Ma",
        "Yiquan Zhang",
        "Weijia Jia"
      ],
      "summary": "arXiv:2605.17937v1 Announce Type: cross Abstract: Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.18630v1",
      "url": "https://arxiv.org/abs/2605.18630",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science",
      "author": [
        "Nithin Somasekharan",
        "Youssef Hassan",
        "Shiyao Lin",
        "Gihan Panapitiya",
        "Patrick Emami",
        "Anurag Acharya",
        "Sameera Horawalavithana",
        "Shaowu Pan"
      ],
      "summary": "arXiv:2605.18630v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi- turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and par- tial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (in- consistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM per- formance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs fre- quently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    },
    {
      "id": "oai:arXiv.org:2605.16372v1",
      "url": "https://arxiv.org/abs/2605.16372",
      "source_url": "https://rss.arxiv.org/rss/cs.ai",
      "title": "SwordBench: Evaluating Orthogonality of Steering Image Representations",
      "author": [
        "Vladimir Zaigrajew",
        "Dawid Pludowski",
        "Hubert Baniecki",
        "Przemyslaw Biecek"
      ],
      "summary": "arXiv:2605.16372v1 Announce Type: cross Abstract: Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.",
      "image_url": null,
      "cover_img": null,
      "pub_date": "2026-05-19T04:00:00.000Z",
      "fetched_at": "2026-05-19T16:06:09.129Z"
    }
  ],
  "total": 5511
}

▼thought

▼✓Fetch web page content, URL: "https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/"

▼✓Fetch web page content, URL: "https://kili-technology.com/blog/custom-ai-benchmark-guide-what-the-best-publ..."

{
  "url": "about:blank",
  "title": "Custom AI Benchmark Guide: Lessons from Public Evals",
  "summary": "How HELM, GPQA Diamond, SWE-bench, and LegalBench design choices translate into a practitioner methodology for custom AI benchmarks you can trust.",
  "content": "<div id=\"readability-page-1\" class=\"page\"><div><div><div><div role=\"list\" data-w-id=\"866f87fb-b78e-c941-0157-b3859e4af64a\"><div role=\"listitem\"><p>LLMs</p></div><div role=\"listitem\"><p>AI Evaluation</p></div><div role=\"listitem\"><p>Foundation Models</p></div></div><div><h2 data-w-id=\"403ca70e-d530-b6de-d868-c208d5f4dfbf\">Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own</h2><p data-w-id=\"9e94c3f3-3cf3-5e65-6535-70daafdadb94\">The public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust.</p></div><div data-w-id=\"0fe81583-09e6-8487-ab50-1aaf94dc0ea6\"><div><p><img src=\"https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1).png\" loading=\"lazy\" alt=\"\" sizes=\"(max-width: 767px) 100vw, (max-width: 991px) 727px, 939px\" srcset=\"https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-500.png 500w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-800.png 800w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-1080.png 1080w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-1600.png 1600w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-2000.png 2000w, 
https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-2600.png 2600w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1)-p-3200.png 3200w, https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20(1).png 3644w\"></p><p>Kili Technology</p></div><p>·</p><p>Apr 30, 2026</p></div></div><p><img src=\"https://cdn.prod.website-files.com/68da32b2041c593b0511a582/69f34f60ec23da20cca59620_custom%20ai%20benchmarks.png\" loading=\"lazy\" alt=\"\"></p></div><div id=\"w-node-_630176ae-f75c-4c07-5b3f-71f4b37392a2-0511a5bd\"><div><h2>AI Summary</h2><div><ul id=\"\"><li id=\"\">A 2025 review of 445 LLM benchmarks found pervasive construct-validity gaps across widely cited ai benchmarks.</li><li id=\"\">HELM's 7-metric design raised cross-model evaluation coverage from 17.9% to 96%.</li><li id=\"\">GPQA Diamond's expert-write, expert-validate, non-expert-test pipeline is the cleanest reusable protocol for sourcing tasks from domain experts.</li><li id=\"\">SWE-bench scores rose from 1.96% to 71.7% in one year — and a later human-filtered subset removed roughly a third of the original tasks as ambiguous or infeasible.</li><li id=\"\">LLM judge reliability hinges on rubric clarity, not chain-of-thought; the rubric is the operationalisation of the construct you claim to measure.</li><li id=\"\">Kili Technology supports the upstream work behind reliable custom AI benchmarks — expert annotator workflows, rubric design, multi-annotator validation, and ongoing re-annotation as models and production traffic evolve.</li></ul></div></div><div fs-toc-element=\"contents\" fs-toc-offsettop=\"7rem\"><div id=\"introduction\"><h2>Introduction</h2><p>Three years ago, \"build a benchmark\" meant pick a task, write a few hundred examples, publish a leaderboard. 
The methodology was loose because the time horizons were long: MMLU, released in 2020, took until 2024 to saturate.</p><p>That assumption is gone. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points in SWE-bench scores in a single year. Humanity's Last Exam, released in early 2025 with the explicit goal of resisting frontier reasoning models, went from 8.8% (top score, January 2025) to over 50% by April 2026. The exam was designed as the hardest standardised exam ever assembled for an LLM, and even an exam of that calibre hit the saturation curve within fifteen months. Stanford HAI's Vanessa Parli framed the resulting question bluntly: are we measuring the right thing, are the benchmarks compromised, and how should the research community evaluate models?</p><p>The pressure is sharper for teams running production systems. Public leaderboards optimise for capability ceilings; production systems break on consistency floors. An MIT NANDA analysis of roughly 300 enterprise AI deployments found only 5% reach measurable P&amp;L impact, and the binding constraints are workflow integration and evaluation gaps, not model quality. Gartner expects more than 40% of agentic AI projects to be cancelled by 2027.</p><p>A custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public AI benchmarks are now publicly working through.</p></div><div id=\"why-are-public-ai-benchmarks-no-longer-enough\"><h2>Why Are Public AI Benchmarks No Longer Enough?</h2><p>Four forces have converged.</p><p>The first is <strong>saturation speed</strong>. When SWE-bench was introduced in late 2023, the best model (Claude 2) solved 1.96% of its real GitHub issues. By 2024, leading research systems and agents were solving 71.7%. 
The benchmark community responded with harder variants (a verified human-filtered subset and a Pro tier), but the underlying pattern is structural: any static benchmark hard for today's frontier reasoning models will be solved by next year's. A useful heuristic for benchmark builders is that top models should land below roughly 35% accuracy at launch. Anything easier is already a regression eval in disguise, and the resulting scores will not separate frontier systems from each other.</p><p>The second is <strong>construct validity erosion</strong>. The 2025 Reuel et al. systematic review of 445 LLM benchmarks identified prevalent gaps in construct validity, the property that the test actually measures the capability it claims to measure. Naming a benchmark \"general reasoning\" or \"general knowledge\" doesn't establish that the score generalises to the construct. Raji et al. flagged this in 2021, calling general-purpose benchmark framing \"ultimately dangerous and deceptive.\" The 2025 review confirms the problem is endemic, not isolated.</p><p>The third is <strong>benchmark exploitability</strong>. Recent research has shown that headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks. Researchers have demonstrated automated scanning agents that exploit structural flaws in popular AI benchmarks (for instance, the lack of strict isolation between the agent under test and the evaluator process) to achieve near-perfect scores without solving any of the tasks. The attack pattern unfolds in stages: the scanning agent probes the harness, identifies leaked grader signals, and produces outputs that satisfy the grader without solving the problem. The takeaway is unambiguous: high scores on static benchmarks can be deeply misleading, and inflated leaderboard scores can be uncorrelated with the underlying capabilities.</p><p>The fourth is the <strong>research-versus-production divergence</strong>. 
Public benchmarks measure peak capability on single attempts. Production systems need consistency: an <a href=\"https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4660\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4660\">Anthropic engineering analysis</a> of τ-bench found agents hitting 60% pass@1 dropped to 25% pass^k (consistency across k trials). That gap is invisible in single-run leaderboards and catastrophic for users. The same analysis pushes teams toward pass@k <em>and</em> pass^k, partial-credit graders, and balanced positive/negative cases, none of which are standard on public leaderboards.</p><p>If your system processes legal contracts, diagnoses medical images, or routes financial trades, the relevant comparison isn't whether your model beats GPT-4 on MMLU. It's whether it does the specific job you need it to do, reliably, on the kind of inputs your production traffic actually contains.</p></div><div id=\"what-makes-a-custom-ai-benchmark-actually-measure-what-it-claims\"><h2>What Makes a Custom AI Benchmark Actually Measure What It Claims?</h2><p>Construct validity is the first principle. Before you draft a single task, write down (in a paragraph, not a vibe) what capability you're measuring, why it matters for your system, and what it would mean for the score to go up or down. If you can't articulate the construct, you can't measure it. In practice this also means aligning the benchmark to a concrete business outcome or domain task — what \"right\" looks like — rather than chasing an abstract aggregate score.</p><div><p>The Reuel et al. 
research proposes four validity types worth carrying through: <br>- <strong>construct</strong> (does it measure what it names),<br>- <strong>criterion</strong> (does it correlate with downstream outcomes),<br>- <strong>consequential</strong> (does optimising for it produce the behaviours you want), and<br>- <strong>external</strong> (does it generalise outside the eval set).</p><p>A benchmark that ignores any of the four is gameable.</p></div><p>The second principle is <strong>multi-metric coverage with explicit gaps acknowledged</strong>. The <a href=\"https://arxiv.org/abs/2211.09110\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4673\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4673\">HELM framework from Stanford CRFM</a> rejected single-number leaderboards and instead reported 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 16 core scenarios. The point wasn't completeness; it was making trade-offs visible to anyone reading the data. Before HELM, models were evaluated on just 17.9% of its core scenarios on average; HELM raised this to 96%, putting 30 models on equal footing for direct comparison. For a custom benchmark, the lesson is to pick three or four metrics that capture the trade-offs you actually face (latency vs. accuracy, recall vs. precision, helpfulness vs. harm) and report them all, every time. A workable pattern is to lead with one primary metric for at-a-glance understanding and relegate detailed secondary metrics (and the parameters used to compute them) to an appendix that lets another team replicate the run end-to-end.</p><p>The third principle is <strong>balanced problem sets</strong>. Anthropic's engineering guidance phrases it directly: include cases where the behaviour should occur and cases where it shouldn't. 
A benchmark of \"did the agent successfully cancel the order\" is incomplete without \"did the agent refuse to cancel an order it shouldn't have.\" Balanced sets are how you catch over-eager agents and miss-rate trade-offs that single-axis evaluation hides. Edge cases and adversarial items belong in the same set — silent failures hide between average and worst-case behaviour.</p><p>Construct validity is also where rubric design enters early. <a href=\"https://arxiv.org/abs/2506.13639\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a467d\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a467d\">Park et al. (2025)</a> showed empirically that evaluation criteria are the dominant factor in LLM judge reliability; chain-of-thought offers minimal gains when the rubric is clear. <strong>The rubric is the operationalisation of the construct.</strong> If the rubric is vague, the construct is vague, and no amount of grader sophistication will save the score. 
A clear rubric also accelerates downstream understanding: when scores diverge across model versions, the team can read the rubric and trace the divergence to a specific criterion rather than re-arguing the underlying construct.</p><p>This is also where Kili-style annotation infrastructure starts to matter, not for the eval itself, but for the upstream work of writing the rubric, validating it against expert disagreement, and revising the data collection process before any tasks are graded.</p></div><div id=\"where-should-the-tasks-come-from-and-who-should-write-them\"><h2>Where Should the Tasks Come From, and Who Should Write Them?</h2><p>Two patterns from the public AI benchmarks dominate:</p><div id=\"real-artefacts-beat-synthetic-prompts\"><h3><strong>Real artefacts beat synthetic prompts</strong></h3><p><a href=\"https://arxiv.org/abs/2310.06770\" data-wf-native-id-path=\"2f63eefe-abe7-aa5a-7eb6-da7683ec89b4\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"2f63eefe-abe7-aa5a-7eb6-da7683ec89b4\">SWE-bench</a> drew its 2,294 tasks from real GitHub issues across 12 popular Python repositories and grades each patch by running the project's actual test suite. A patch passes only if it makes the failing tests pass without breaking the passing ones (FAIL_TO_PASS / PASS_TO_PASS). The realistic construction setting, in the authors' phrasing, gave the dataset properties no synthetic prompt collection could replicate: tasks are continually updatable from new pull requests, hard to game with surface heuristics, and grounded in code that other humans actually had to review. For a custom benchmark, the implication is to mine your own systems first. Bug trackers, support tickets, rejected agent outputs, escalations to human review: these are the highest-signal task sources you have.</p></div><div id=\"expert-authoring-with-adversarial-validation\"><h3><strong>Expert authoring with adversarial validation</strong>
</h3><p><a href=\"https://arxiv.org/abs/2311.12022\" data-wf-native-id-path=\"a29c6242-7f8d-265f-9105-82d268588ad8\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"a29c6242-7f8d-265f-9105-82d268588ad8\">GPQA Diamond</a> is the cleanest published example. The four-stage GPQA pipeline (expert authoring, expert validation, revision, non-expert validation) produced 448 multiple-choice questions where domain PhDs reach 65% accuracy, 74% if you discount clear mistakes, but skilled non-experts (humans with 30+ minutes of unrestricted web access) reach only 34%. The Diamond subset is the high-confidence slice: the questions that the expert validators answered correctly and that most non-expert validators got wrong. The non-expert validation stage is the underappreciated part: it gives you a quantitative answer to \"is this question actually hard, or just obscure?\" GPQA also embeds a canary string for contamination tracing, a small touch with disproportionate downstream value.</p><p><a href=\"https://arxiv.org/abs/2308.11462\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a469a\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a469a\">LegalBench</a> is the canonical example of the expert-led collaborative model: 162 tasks across 6 reasoning types (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical understanding) authored by 40+ contributors, including lawyers, law professors, and legal practitioners. 
The reasoning typology is itself a contribution: it forces task authors to articulate what kind of legal cognition each task is testing, which makes downstream interpretation of scores tractable.</p><p>Four operational rules follow:</p><p>- <strong>Start with real failures, then scale.</strong> Anthropic's guidance recommends 20–50 tasks drawn from production failures as a starting point; the working bar for a defensible domain benchmark sits higher, at a hand-picked, expert-labelled set of 200–1,000 examples that reflect real user journeys and edge cases for the specific use cases your agents actually serve in production. Treat dataset construction as a series of stages, not a one-shot exercise: a small seed set, then expansion, then iteration as failures surface. </p><p>- <strong>Write reference solutions.</strong> If your domain experts can't solve the task, the model very likely can't either, and the failure tells you nothing. </p><p>- <strong>Use multiple raters per item</strong> to evaluate annotation quality: the SWE-bench Verified protocol used three annotators per item, severity-ensembled, to filter out infeasible or under-specified items, and produced what is now the de facto standard for executable-test benchmarks. Single-rater work inherits single-rater blind spots, and using too few raters strips out the human disagreement that is itself a signal about task ambiguity. </p><p>- <strong>Keep the test set strictly separated from anything used to develop the system.</strong> Train/test contamination quietly inflates scores on held-out evaluations, and the inflation is invisible until production performance disappoints.</p></div></div><div id=\"how-do-you-grade-outputs-that-dont-have-a-single-right-answer\"><h2>How Do You Grade Outputs That Don't Have a Single Right Answer?</h2><p>The Anthropic taxonomy is the cleanest framing: code-based graders, model-based graders (LLM judge), and human graders. 
Pick the cheapest grader that works for the task, and let the cost of being wrong, not the convenience of measuring, drive the choice.</p><p><strong>Code-based graders</strong> are the gold standard when applicable: exact match, regex, executable tests, structured output validation. The FAIL_TO_PASS / PASS_TO_PASS pattern from SWE-bench is a code-based grader; so is \"does the JSON parse and contain the required fields.\" If the task admits a code-based grader, use it. The grader is deterministic, cheap to run, and hard to game without solving the task as long as the harness is isolated from the system under test. For agents that produce structured outputs, this is the cheapest reliable signal you will ever have.</p><p><strong>Model-based graders</strong> are the workhorse for open-ended outputs. Park et al. showed evaluation criteria dominate reliability; rubric-guided judges (Prometheus, G-Eval) reach Pearson correlations around 0.897 with humans rating the same outputs when the rubric is unambiguous. The <a href=\"https://arxiv.org/abs/2510.07243\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46b7\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46b7\">LeMAJ legal evaluation framework</a> found inter-rater agreement among humans increased 11% when reviewers used a shared rubric, and rubric-guided LLM judge configurations hit Cohen's κ of 0.75 with human consensus. The pattern across these results: rubric clarity is the constraint, not judge model size.</p><p>Three operational rules. <strong>Validate against a human-labelled golden set</strong>: the working bar in practice is 75–90% agreement with human consensus; below that, the judge is amplifying noise. <strong>Decompose into structured criteria</strong>: binary checks (\"does the output cite a real case?\") plus ordinal scores (\"rate clarity 1–5 with anchor descriptions\") outperform single overall scores in nearly every published comparison. 
<strong>Watch the known biases.</strong> Model-based judges show position bias, length bias, and self-preference (preferring outputs from the same model family). Randomise position, normalise length when possible, and use a different model family for judging than for the system under test.</p><p><strong>Human graders</strong> are the calibration layer and the high-stakes layer. Reserve humans for golden-set construction, judge calibration, and tasks where the cost of a wrong answer makes a 90%-agreement model judge unacceptable: clinical safety, legal compliance, financial advice. The economics rarely support running humans across a full benchmark, but they almost always support running humans across a calibration subset — and the resulting scores anchor every other grader downstream.</p></div><div id=\"how-do-you-keep-a-custom-ai-benchmark-useful-over-time\"><h2>How Do You Keep a Custom AI Benchmark Useful Over Time?</h2><p>A custom benchmark is a versioned artefact with a maintenance schedule. Treat it that way and it stays useful; treat it as a one-time project and it expires inside a year.</p><p><strong>Contamination defence.</strong> The <a href=\"https://arxiv.org/abs/2411.03923\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46d1\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46d1\">Singh et al. (2024) ConTAM analysis</a> of 13 benchmarks across 7 models found contamination has been underestimated in many prominent LLM releases, even when developers attempted decontamination. 
Three practical defences: hold out a private split that never goes public; date-stamp every item so you can filter to post-cutoff data per model (the <a href=\"https://arxiv.org/abs/2403.07974\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46d4\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46d4\">LiveCodeBench</a> approach, which date-stamps problems from competitive programming contests since May 2023); and embed a canary string, GPQA Diamond–style, so you can later test whether a model has memorised your dataset.</p><p><strong>Harness isolation.</strong> A subtler failure mode is the absence of strict isolation between the system under test and the evaluator. When agents can read or write to the same filesystem as the grader, observe the grader's logs, or otherwise inspect the scoring process, automated exploits become trivial — and as recent scanning-agent research shows, headline scores in those conditions can reflect harness gaming rather than genuine capabilities. Build agent/evaluator isolation into the design, not as a follow-up.</p><p><strong>Saturation planning.</strong> Distinguish capability evals from regression evals from day one. A capability eval and a regression eval are two different stages of the same benchmark's life: capability evals start at low pass rates (5–30%) and let you hill-climb; once they saturate above 90%, they become regression evals, where the goal flips from \"can the model do this\" to \"did we break something that used to work.\" Both stages matter. A team with only capability evals goes blind once the model is good; a team with only regression evals never sees what it can't do yet, and the scores stop reflecting any meaningful comparison between systems.</p><p><strong>Continuous re-annotation.</strong> Pipe production failures back into the eval set. 
Every time a user flags a wrong answer, every human override, every escalation: these are pre-validated hard cases. The <a href=\"https://arxiv.org/abs/2411.12990\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46e3\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46e3\">BetterBench framework</a> found that of 24 evaluated SOTA benchmarks, only 3 included CI build status and only 4 provided replication scripts; the operational rigour we apply to code we routinely fail to apply to evals. Version your benchmark against your model releases. Tag eval runs with model version, prompt version, and rubric version.</p><p><strong>Standards alignment.</strong> For regulated industries, the NIST AI RMF and the <a href=\"https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46ea\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46ea\">NIST GenAI Profile (AI 600-1)</a> define testing, evaluation, verification, and validation (TEVV) as a core function of trustworthy AI. Mapping your benchmark to TEVV categories isn't decorative; it's the documentation trail you'll want when the auditor arrives.</p></div><div id=\"what-does-this-look-like-across-industries\"><h2>What Does This Look Like Across Industries?</h2><p>Four compressed examples to ground the principles. 
Each is anchored to a published benchmark so the design choices are inspectable.</p><p><strong>Healthcare.</strong> <a href=\"https://arxiv.org/abs/2212.13138\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a46f5\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a46f5\">MultiMedQA</a> combines six existing medical QA datasets with a new HealthSearchQA collection of consumer questions, then layers expert physician evaluation across multiple axes: factuality, possible harm, possible bias, scientific consensus alignment. The lesson: USMLE-style multiple-choice catches only the lower levels of Miller's pyramid (knows, knows how). Clinical safety requires open-ended generation graded by clinicians on multiple safety axes, not just answer accuracy.</p><p><strong>Legal.</strong> LegalBench's six reasoning types plus LegalBench-RAG's expert-annotated retrieval pairs together cover both answer correctness and retrieval-precision dimensions. The lesson: in regulated, citation-heavy domains, retrieval correctness matters as much as final-answer accuracy. A benchmark that grades only the synthesis ignores the failure mode that matters most: confidently wrong citations.</p><p><strong>Software engineering.</strong> The progression from SWE-bench to SWE-bench Verified to Terminal-Bench is the cleanest example of refinement under pressure. The <a href=\"https://openai.com/index/introducing-swe-bench-verified/\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4700\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4700\">OpenAI Verified work</a> filtered out roughly a third of original items as ambiguous or infeasible, meaning failures on those items reflected noise from the benchmark, not signal from the model. The lesson: for any executable benchmark, agent harness bugs and grading-spec ambiguity can cause apparent failures that have nothing to do with model limitations. 
Verify before you trust.</p><p><strong>Finance.</strong> The <a href=\"https://arxiv.org/abs/2508.00828\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4707\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4707\">Finance Agent Benchmark</a> provides 537 expert-authored questions covering retrieval through modelling, with an agentic harness that includes Google Search and SEC EDGAR access. The lesson: financial benchmarks need expert-authored questions, real document grounding, and tool-use evaluation; pure-text QA misses the workflow. The regulatory and compliance dimension is also load-bearing: a model that's right but cites a hallucinated 10-K creates legal exposure that pure accuracy metrics never surface.</p><p>The pattern across all four: the benchmark inherits the failure modes of the domain. Generic benchmarks miss these because they're generic. Custom benchmarks earn their cost by being specific.</p></div><div id=\"the-real-test-of-a-benchmark-is-whether-youd-trust-it-tomorrow\"><h2>The Real Test of a Benchmark Is Whether You'd Trust It Tomorrow</h2><p>The benchmark you build today will be consulted dozens of times before it expires. Each consultation is a decision: ship or don't, escalate or don't, retrain or don't. The cost of a bad benchmark isn't measured in eval-set creation hours; it's measured in the production decisions made in its name.</p><p>The public ai benchmarks that survived their first wave of scrutiny (HELM, GPQA Diamond, SWE-bench, LegalBench) share a small set of properties. They define the construct in writing. They use real artefacts when possible. They validate task design against domain experts before grading any model. They publish their rubrics. They plan for contamination, harness exploits, and saturation as design constraints, not afterthoughts. They version themselves like software.</p><p>These properties don't require frontier-lab budgets to replicate. 
They require treating the benchmark as the substrate of every downstream claim about your AI system, because that's what it is. The benchmarks that hold up are the ones built by people who understood the construct before they wrote the first task.</p></div><div id=\"ready-to-build-a-benchmark-that-actually-measures-your-ai-system\"><h2>Ready to Build a Benchmark That Actually Measures Your AI System?</h2><p>Kili Technology's data labeling and evaluation infrastructure supports the upstream work that makes custom benchmarks reliable: expert annotator workflows, rubric design and calibration, multi-annotator validation, and the continuous re-annotation cycle that keeps benchmarks useful as models and production traffic evolve. <a href=\"https://kili-technology.com/contact-us\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4718\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4718\">Talk to our team</a> about benchmark design for your domain.</p></div><div id=\"resources\"><h2>Resources</h2><div id=\"benchmark-methodology-papers\"><h3>Benchmark Methodology Papers</h3><ul role=\"list\"><li><strong>Holistic Evaluation of Language Models (HELM)</strong> – Stanford CRFM's multi-metric, multi-scenario framework<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2211.09110\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a47ed\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a47ed\">https://arxiv.org/abs/2211.09110</a></li></ul></li><li><strong>GPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark</strong> – Expert-authored, expert-validated, non-expert-tested protocol<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2311.12022\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a47f4\" data-wf-ao-click-engagement-tracking=\"true\" 
data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a47f4\">https://arxiv.org/abs/2311.12022</a></li></ul></li><li><strong>SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</strong> – Real-artefact tasks with executable graders<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2310.06770\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a47fb\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a47fb\">https://arxiv.org/abs/2310.06770</a></li></ul></li><li><strong>LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code</strong> – Date-stamped, contamination-aware design<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2403.07974\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4802\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4802\">https://arxiv.org/abs/2403.07974</a></li></ul></li><li><strong>LegalBench: A Collaboratively Built Benchmark for Legal Reasoning</strong> – Expert-led collaborative construction with reasoning typology<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2308.11462\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4809\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4809\">https://arxiv.org/abs/2308.11462</a></li></ul></li><li><strong>MultiMedQA / Large Language Models Encode Clinical Knowledge</strong> – Multi-axis human evaluation for medical LLMs<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2212.13138\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4810\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4810\">https://arxiv.org/abs/2212.13138</a></li></ul></li><li><strong>Finance Agent Benchmark</strong> – Expert-authored finance benchmark with agentic harness<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2508.00828\" 
data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4817\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4817\">https://arxiv.org/abs/2508.00828</a></li></ul></li></ul></div><div id=\"benchmark-quality-and-validity-research\"><h3>Benchmark Quality and Validity Research</h3><ul role=\"list\"><li><strong>Measuring What Matters: Construct Validity in LLM Benchmarks</strong> – 445-benchmark systematic review<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2511.04703\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a481e\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a481e\">https://arxiv.org/abs/2511.04703</a></li></ul></li><li><strong>BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices</strong> – Operational checklist for benchmark quality<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2411.12990\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4825\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4825\">https://arxiv.org/abs/2411.12990</a></li></ul></li><li><strong>Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter?</strong> – ConTAM analysis across 13 benchmarks × 7 models<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2411.03923\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a482c\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a482c\">https://arxiv.org/abs/2411.03923</a></li></ul></li><li><strong>An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability</strong> – Rubric clarity dominates reliability<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2506.13639\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4833\" data-wf-ao-click-engagement-tracking=\"true\" 
data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4833\">https://arxiv.org/abs/2506.13639</a></li></ul></li><li><strong>LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation</strong> – Rubric-guided judging in legal domain<ul role=\"list\"><li><a href=\"https://arxiv.org/abs/2510.07243\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a483a\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a483a\">https://arxiv.org/abs/2510.07243</a></li></ul></li></ul></div><div id=\"industry-reports-and-practitioner-guidance\"><h3>Industry Reports and Practitioner Guidance</h3><ul role=\"list\"><li><strong>Demystifying Evals for AI Agents (Anthropic Engineering)</strong> – Three-grader taxonomy and capability vs. regression evals<ul role=\"list\"><li><a href=\"https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4841\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4841\">https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents</a></li></ul></li><li><strong>2025 AI Index Report — Technical Performance (Stanford HAI)</strong> – Benchmark saturation data<ul role=\"list\"><li><a href=\"https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4848\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4848\">https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance</a></li></ul></li><li><strong>Introducing SWE-bench Verified (OpenAI)</strong> – Human-filtered SWE-bench subset methodology<ul role=\"list\"><li><a href=\"https://openai.com/index/introducing-swe-bench-verified/\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a484f\" data-wf-ao-click-engagement-tracking=\"true\" 
data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a484f\">https://openai.com/index/introducing-swe-bench-verified/</a></li></ul></li><li><strong>Bridging the Operational AI Gap (MIT Technology Review Insights)</strong> – Enterprise AI deployment gap analysis<ul role=\"list\"><li><a href=\"https://www.technologyreview.com/2026/03/04/1133642/\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4856\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4856\">https://www.technologyreview.com/2026/03/04/1133642/</a></li></ul></li></ul></div><div id=\"standards-and-policy\"><h3>Standards and Policy</h3><ul role=\"list\"><li><strong>NIST AI Risk Management Framework (AI RMF 1.0)</strong> – Foundational TEVV definitions<ul role=\"list\"><li><a href=\"https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a485d\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a485d\">https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf</a></li></ul></li><li><strong>NIST AI Risk Management Framework: Generative AI Profile (AI 600-1)</strong> – Pre-deployment testing guidance for GAI<ul role=\"list\"><li><a href=\"https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf\" data-wf-native-id-path=\"900a637e-0ac4-6d1d-0559-258d5d6a4864\" data-wf-ao-click-engagement-tracking=\"true\" data-wf-element-id=\"900a637e-0ac4-6d1d-0559-258d5d6a4864\">https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf</a></li></ul></li></ul></div></div></div></div></div></div>",
  "markdown": "LLMs\n\nAI Evaluation\n\nFoundation Models\n\n## Custom AI Benchmark Guide: What the Best Public Evals Teach You About Building Your Own\n\nThe public benchmarks that defined AI progress for the last three years are saturating in months, not years, and a 445-benchmark systematic review found most don't accurately capture what their names claim. This guide distils the design choices behind HELM, GPQA Diamond, SWE-bench, LiveCodeBench, LegalBench, and MultiMedQA into a practitioner methodology for building AI benchmarks you can actually trust.\n\n![](https://cdn.prod.website-files.com/68da32b2041c593b0511a582/68e5167fa5f9ef2e3b0a82c5_Kili_Wordmark_Midnight_RGB%20\\(1\\).png)\n\nKili Technology\n\n·\n\nApr 30, 2026\n\n![](https://cdn.prod.website-files.com/68da32b2041c593b0511a582/69f34f60ec23da20cca59620_custom%20ai%20benchmarks.png)\n\n## AI Summary\n\n-   A 2025 review of 445 LLM benchmarks found pervasive construct-validity gaps across widely cited ai benchmarks.\n-   HELM's 7-metric design raised cross-model evaluation coverage from 17.9% to 96%.\n-   GPQA Diamond's expert-write, expert-validate, non-expert-test pipeline is the cleanest reusable protocol for sourcing tasks from domain experts.\n-   SWE-bench scores rose from 1.96% to 71.7% in one year — and a later human-filtered subset removed roughly a third of the original tasks as ambiguous or infeasible.\n-   LLM judge reliability hinges on rubric clarity, not chain-of-thought; the rubric is the operationalisation of the construct you claim to measure.\n-   Kili Technology supports the upstream work behind reliable custom AI benchmarks — expert annotator workflows, rubric design, multi-annotator validation, and ongoing re-annotation as models and production traffic evolve.\n\n## Introduction\n\nThree years ago, \"build a benchmark\" meant pick a task, write a few hundred examples, publish a leaderboard. 
The methodology was loose because the time horizons were long: MMLU, released in 2020, took until 2024 to saturate.\n\nThat assumption is gone. The 2025 AI Index reports gains of 18.8 points on MMMU, 48.9 on GPQA Diamond, and 67.3 points in SWE-bench scores in a single year. Humanity's Last Exam, released in early 2025 with the explicit goal of resisting frontier reasoning models, went from 8.8% (top score, January 2025) to over 50% by April 2026. The exam was designed as the hardest standardised exam ever assembled for an LLM, and even an exam of that calibre hit the saturation curve within fifteen months. Stanford HAI's Vanessa Parli framed the resulting question bluntly: are we measuring the right thing, are the benchmarks compromised, and how should the research community evaluate models?\n\nThe pressure is sharper for teams running production systems. Public leaderboards optimise for capability ceilings; production systems break on consistency floors. An MIT NANDA analysis of roughly 300 enterprise AI deployments found only 5% reach measurable P&L impact, and the binding constraints are workflow integration and evaluation gaps, not model quality. Gartner expects more than 40% of agentic AI projects to be cancelled by 2027.\n\nA custom benchmark is no longer optional infrastructure for any team deploying AI in a domain that matters. The question is how to evaluate the system you actually ship without repeating the validity, contamination, and saturation mistakes that the public AI benchmarks are now publicly working through.\n\n## Why Are Public AI Benchmarks No Longer Enough?\n\nFour forces have converged.\n\nThe first is **saturation speed**. When SWE-bench was introduced in late 2023, the best model (Claude 2) solved 1.96% of its real GitHub issues. By 2024, leading research systems and agents were solving 71.7%. 
The benchmark community responded with harder variants — a verified human-filtered subset and a Pro tier — but the underlying pattern is structural: any static benchmark hard for today's frontier reasoning models will be solved by next year's. A useful heuristic for benchmark builders is that top models should land below roughly 35% accuracy at launch — anything easier is already a regression eval in disguise, and the resulting scores will not separate frontier systems from each other.\n\nThe second is **construct validity erosion**. The 2025 Reuel et al. systematic review of 445 LLM benchmarks identified prevalent gaps in construct validity, the property that the test actually measures the capability it claims to measure. Naming a benchmark \"general reasoning\" or \"general knowledge\" doesn't establish that the score generalises to the construct. Raji et al. flagged this in 2021, calling general-purpose benchmark framing \"ultimately dangerous and deceptive.\" The 2025 review confirms the problem is endemic, not isolated.\n\nThe third is **benchmark exploitability**. Recent research has shown that headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks. Automated scanning agents have been demonstrated that exploit structural flaws in popular ai benchmarks — for instance, the lack of strict isolation between the agent under test and the evaluator process — to achieve near-perfect scores without solving any of the tasks. The attack pattern unfolds in stages: the scanning agent probes the harness, identifies leaked grader signals, and produces outputs that satisfy the grader without solving the problem. The takeaway is unambiguous: high scores on static benchmarks can be deeply misleading, and inflated leaderboard scores can be uncorrelated with the underlying capabilities.\n\nThe fourth is the **research-versus-production divergence**. Public benchmarks measure peak capability on single attempts. 
Production systems need consistency: an [Anthropic engineering analysis](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) of τ-bench found agents hitting 60% pass@1 dropped to 25% pass^k (consistency across k trials). That gap is invisible in single-run leaderboards and catastrophic for users. The same analysis pushes teams toward pass@k _and_ pass^k, partial-credit graders, and balanced positive/negative cases, none of which are standard on public leaderboards.\n\nIf your system processes legal contracts, diagnoses medical images, or routes financial trades, the relevant comparison isn't whether your model beats GPT-4 on MMLU. It's whether it does the specific job you need it to do, reliably, on the kind of inputs your production traffic actually contains.\n\n## What Makes a Custom AI Benchmark Actually Measure What It Claims?\n\nConstruct validity is the first principle. Before you draft a single task, write down (in a paragraph, not a vibe) what capability you're measuring, why it matters for your system, and what it would mean for the score to go up or down. If you can't articulate the construct, you can't measure it. In practice this also means aligning the benchmark to a concrete business outcome or domain task (what \"right\" looks like) rather than chasing an abstract aggregate score.\n\nThe Reuel et al. research proposes four validity types worth carrying through:  \n\\- **construct** (does it measure what it names),  \n\\- **criterion** (does it correlate with downstream outcomes),  \n\\- **consequential** (does optimising for it produce the behaviours you want), and  \n\\- **external** (does it generalise outside the eval set).\n\nA benchmark that ignores any of the four is gameable.\n\nThe second principle is **multi-metric coverage with explicit gaps acknowledged**. 
The [HELM framework from Stanford CRFM](https://arxiv.org/abs/2211.09110) rejected single-number leaderboards and instead reported 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 16 core scenarios. The point wasn't completeness; it was making trade-offs visible to anyone reading the data. Before HELM, models were evaluated on just 17.9% of its core scenarios on average; HELM raised this to 96%, putting 30 models on equal footing for direct comparison. For a custom benchmark, the lesson is to pick three or four metrics that capture the trade-offs you actually face (latency vs. accuracy, recall vs. precision, helpfulness vs. harm) and report them all, every time. A workable pattern is to lead with one primary metric for at-a-glance understanding and relegate detailed secondary metrics — and the parameters used to compute them — to an appendix that lets another team replicate the run end-to-end.\n\nThe third principle is **balanced problem sets**. Anthropic's engineering guidance phrases it directly: include cases where the behaviour should occur and cases where it shouldn't. A benchmark of \"did the agent successfully cancel the order\" is incomplete without \"did the agent refuse to cancel an order it shouldn't have.\" Balanced sets are how you catch over-eager agents and miss-rate trade-offs that single-axis evaluation hides. Edge cases and adversarial items belong in the same set — silent failures hide between average and worst-case behaviour.\n\nConstruct validity is also where rubric design enters early. [Park et al. (2025)](https://arxiv.org/abs/2506.13639) showed empirically that evaluation criteria are the dominant factor in LLM judge reliability; chain-of-thought offers minimal gains when the rubric is clear. **The rubric is the operationalisation of the construct.** If the rubric is vague, the construct is vague, and no amount of grader sophistication will save the score. 
A clear rubric also accelerates downstream understanding: when scores diverge across model versions, the team can read the rubric and trace the divergence to a specific criterion rather than re-arguing the underlying construct.\n\nThis is also where Kili-style annotation infrastructure starts to matter, not for the eval itself, but for the upstream work of writing the rubric, validating it against expert disagreement, and revising the data collection process before any tasks are graded.\n\n## Where Should the Tasks Come From, and Who Should Write Them?\n\nTwo patterns from the public AI benchmarks dominate:\n\n### **Real artefacts beat synthetic prompts**\n\n[SWE-bench](https://arxiv.org/abs/2310.06770) drew its 2,294 tasks from real GitHub issues across 12 popular Python repositories and graded them by running the project's actual test suite. A patch passes only if it makes the failing tests pass without breaking the passing ones (FAIL\\_TO\\_PASS / PASS\\_TO\\_PASS). The realistic construction setting, in the authors' phrasing, gave the dataset properties no synthetic prompt collection could replicate: tasks are continually updatable from new pull requests, hard to game with surface heuristics, and grounded in code that other humans actually had to review. For a custom benchmark, the implication is to mine your own systems first. Bug trackers, support tickets, rejected agent outputs, escalations to human review: these are the highest-signal task sources you have.\n\n### **Expert authoring with adversarial validation**\n\n[GPQA Diamond](https://arxiv.org/abs/2311.12022) is the cleanest published example. Its four-stage pipeline (expert authoring, expert validation, revision, non-expert validation) produced 448 multiple-choice questions where domain PhDs reach 65% accuracy, 74% if you discount clear mistakes, but skilled non-experts (humans with 30+ minutes of unrestricted web access) reach only 34%. 
The Diamond subset is the high-confidence slice: questions where two domain experts agreed on the answer and a third validated it independently. The non-expert validation stage is the underappreciated part: it gives you a quantitative answer to \"is this question actually hard, or just obscure?\" GPQA Diamond also embeds a canary string for contamination tracing, a small touch with disproportionate downstream value.\n\n[LegalBench](https://arxiv.org/abs/2308.11462) is the canonical example of the expert-led collaborative model: 162 tasks across 6 reasoning types (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical understanding) authored by 40+ contributors, including lawyers, law professors, and legal practitioners. The reasoning typology is itself a contribution: it forces task authors to articulate what kind of legal cognition each task is testing, which makes downstream interpretation of scores tractable.\n\nFour operational rules follow:\n\n\\- **Start with real failures, then scale.** Anthropic's guidance recommends 20–50 tasks drawn from production failures as a starting point; the working bar for a defensible domain benchmark sits higher, at a hand-picked, expert-labelled set of 200–1,000 examples that reflect real user journeys and edge cases for the specific use cases your agents actually serve in production. 
Treat dataset construction as a series of stages, not a one-shot exercise: a small seed set, then expansion, then iteration as failures surface.\n\n\\- **Write reference solutions** — if your domain experts can't solve the task, the model definitely can't, and the failure tells you nothing.\n\n\\- **Use multiple raters per item** to evaluate annotation quality: the verified-subset protocol used three annotators per item, severity-ensembled, to filter out infeasible or under-specified items, and produced what is now the de facto standard for executable-test benchmarks; single-rater work inherits single-rater blind spots, and using too few raters strips out the human disagreement that is itself a signal about task ambiguity.\n\n\\- **Keep the test set strictly separated from anything used to develop the system** — train/test contamination quietly inflates scores on held-out evaluations, and the inflation is invisible until production performance disappoints.\n\n## How Do You Grade Outputs That Don't Have a Single Right Answer?\n\nThe Anthropic taxonomy is the cleanest framing: code-based graders, model-based graders (LLM judge), and human graders. Pick the cheapest grader that works for the task — and let the cost of being wrong, not the convenience of measuring, drive the choice.\n\n**Code-based graders** are the gold standard when applicable: exact match, regex, executable tests, structured output validation. The FAIL\\_TO\\_PASS / PASS\\_TO\\_PASS pattern from the original benchmark is a code-based grader; so is \"does the JSON parse and contain the required fields.\" If the task admits a code-based grader, use it. The grader is deterministic, free to run, and impossible to game without solving the task — and for agents that produce structured outputs, this is the cheapest reliable signal you will ever have.\n\n**Model-based graders** are the workhorse for open-ended outputs. Park et al. 
showed evaluation criteria dominate reliability; rubric-guided judges (Prometheus, G-Eval) reach Pearson correlations around 0.897 with humans rating the same outputs when the rubric is unambiguous. The [LeMAJ legal evaluation framework](https://arxiv.org/abs/2510.07243) found inter-rater agreement among humans increased 11% when reviewers used a shared rubric, and rubric-guided LLM judge configurations hit Cohen's κ of 0.75 with human consensus. The pattern across these results: rubric clarity is the constraint, not judge model size.\n\nThree operational rules. **Validate against a human-labelled golden set** — the working bar in practice is 75–90% agreement with human consensus; below that, the judge is amplifying noise. **Decompose into structured criteria** — binary checks (\"does the output cite a real case?\") plus ordinal scores (\"rate clarity 1–5 with anchor descriptions\") outperform single overall scores in nearly every published comparison. **Watch the known biases.** Model-based judges show position bias, length bias, and self-preference (preferring outputs from the same model family). Randomise position, normalise length when possible, and use a different model family for judging than for the system under test.\n\n**Human graders** are the calibration layer and the high-stakes layer. Reserve humans for golden-set construction, judge calibration, and tasks where the cost of a wrong answer makes a 90%-agreement model judge unacceptable: clinical safety, legal compliance, financial advice. The economics rarely support running humans across a full benchmark, but they almost always support running humans across a calibration subset — and the resulting scores anchor every other grader downstream.\n\n## How Do You Keep a Custom AI Benchmark Useful Over Time?\n\nA custom benchmark is a versioned artefact with a maintenance schedule. 
Treat it that way and it stays useful; treat it as a one-time project and it expires inside a year.\n\n**Contamination defence.** The [Singh et al. (2024) ConTAM analysis](https://arxiv.org/abs/2411.03923) of 13 benchmarks across 7 models found contamination has been underestimated in many prominent LLM releases, even when developers attempted decontamination. Three practical defences: hold out a private split that never goes public; date-stamp every item so you can filter to post-cutoff data per model (the [LiveCodeBench](https://arxiv.org/abs/2403.07974) approach, which date-stamps problems from competitive programming contests since May 2023); and embed a canary string, GPQA Diamond–style, so you can later test whether a model has memorised your dataset.\n\n**Harness isolation.** A subtler failure mode is the absence of strict isolation between the system under test and the evaluator. When agents can read or write to the same filesystem as the grader, observe the grader's logs, or otherwise inspect the scoring process, automated exploits become trivial — and as recent scanning-agent research shows, headline scores in those conditions can reflect harness gaming rather than genuine capabilities. Build agent/evaluator isolation into the design, not as a follow-up.\n\n**Saturation planning.** Distinguish capability evals from regression evals from day one. A capability eval and a regression eval are two different stages of the same benchmark's life: capability evals start at low pass rates (5–30%) and let you hill-climb; once they saturate above 90%, they become regression evals, where the goal flips from \"can the model do this\" to \"did we break something that used to work.\" Both stages matter. 
A team with only capability evals goes blind once the model is good; a team with only regression evals never sees what it can't do yet, and the scores stop reflecting any meaningful comparison between systems.\n\n**Continuous re-annotation.** Pipe production failures back into the eval set. Every time a user flags a wrong answer, every human override, every escalation: these are pre-validated hard cases. The [BetterBench framework](https://arxiv.org/abs/2411.12990) found that of 24 evaluated SOTA benchmarks, only 3 included CI build status and only 4 provided replication scripts; the operational rigour we apply to code we routinely fail to apply to evals. Version your benchmark against your model releases. Tag eval runs with model version, prompt version, and rubric version.\n\n**Standards alignment.** For regulated industries, the NIST AI RMF and the [NIST GenAI Profile (AI 600-1)](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf) define testing, evaluation, verification, and validation (TEVV) as a core function of trustworthy AI. Mapping your benchmark to TEVV categories isn't decorative; it's the documentation trail you'll want when the auditor arrives.\n\n## What Does This Look Like Across Industries?\n\nFour compressed examples to ground the principles. Each is anchored to a published benchmark so the design choices are inspectable.\n\n**Healthcare.** [MultiMedQA](https://arxiv.org/abs/2212.13138) combines six existing medical QA datasets with a new HealthSearchQA collection of consumer questions, then layers expert physician evaluation across multiple axes: factuality, possible harm, possible bias, scientific consensus alignment. The lesson: USMLE-style multiple-choice catches only the lower levels of Miller's pyramid (knows, knows how). 
Clinical safety requires open-ended generation graded by clinicians on multiple safety axes, not just answer accuracy.\n\n**Legal.** LegalBench's six reasoning types plus LegalBench-RAG's expert-annotated retrieval pairs together cover both answer correctness and retrieval-precision dimensions. The lesson: in regulated, citation-heavy domains, retrieval correctness matters as much as final-answer accuracy. A benchmark that grades only the synthesis ignores the failure mode that matters most: confidently wrong citations.\n\n**Software engineering.** The progression from the original benchmark to its verified subset to Terminal-Bench is the cleanest example of refinement under pressure. The [OpenAI Verified work](https://openai.com/index/introducing-swe-bench-verified/) filtered roughly a third of original items as ambiguous or infeasible, meaning a third of the original scores were noise from the benchmark, not signal from the model. The lesson: for any executable benchmark, agent harness bugs and grading-spec ambiguity cause more apparent failures than model limitations. Verify before you trust.\n\n**Finance.** The [Finance Agent Benchmark](https://arxiv.org/abs/2508.00828) provides 537 expert-authored questions covering retrieval through modelling, with an agentic harness that includes Google Search and SEC EDGAR access. The lesson: financial benchmarks need expert-authored questions, real document grounding, and tool-use evaluation; pure-text QA misses the workflow. The regulatory and compliance dimension is also load-bearing: a model that's right but cites a hallucinated 10-K creates legal exposure that pure accuracy metrics never surface.\n\nThe pattern across all four: the benchmark inherits the failure modes of the domain. Generic benchmarks miss these because they're generic. 
Custom benchmarks earn their cost by being specific.\n\n## The Real Test of a Benchmark Is Whether You'd Trust It Tomorrow\n\nThe benchmark you build today will be consulted dozens of times before it expires. Each consultation is a decision: ship or don't, escalate or don't, retrain or don't. The cost of a bad benchmark isn't measured in eval-set creation hours; it's measured in the production decisions made in its name.\n\nThe public ai benchmarks that survived their first wave of scrutiny (HELM, GPQA Diamond, SWE-bench, LegalBench) share a small set of properties. They define the construct in writing. They use real artefacts when possible. They validate task design against domain experts before grading any model. They publish their rubrics. They plan for contamination, harness exploits, and saturation as design constraints, not afterthoughts. They version themselves like software.\n\nThese properties don't require frontier-lab budgets to replicate. They require treating the benchmark as the substrate of every downstream claim about your AI system, because that's what it is. The benchmarks that hold up are the ones built by people who understood the construct before they wrote the first task.\n\n## Ready to Build a Benchmark That Actually Measures Your AI System?\n\nKili Technology's data labeling and evaluation infrastructure supports the upstream work that makes custom benchmarks reliable: expert annotator workflows, rubric design and calibration, multi-annotator validation, and the continuous re-annotation cycle that keeps benchmarks useful as models and production traffic evolve. 
[Talk to our team](https://kili-technology.com/contact-us) about benchmark design for your domain.\n\n## Resources\n\n### Benchmark Methodology Papers\n\n-   **Holistic Evaluation of Language Models (HELM)** – Stanford CRFM's multi-metric, multi-scenario framework\n    -   [https://arxiv.org/abs/2211.09110](https://arxiv.org/abs/2211.09110)\n-   **GPQA: A Graduate-Level Google-Proof Q&A Benchmark** – Expert-authored, expert-validated, non-expert-tested protocol\n    -   [https://arxiv.org/abs/2311.12022](https://arxiv.org/abs/2311.12022)\n-   **SWE-bench: Can Language Models Resolve Real-World GitHub Issues?** – Real-artefact tasks with executable graders\n    -   [https://arxiv.org/abs/2310.06770](https://arxiv.org/abs/2310.06770)\n-   **LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code** – Date-stamped, contamination-aware design\n    -   [https://arxiv.org/abs/2403.07974](https://arxiv.org/abs/2403.07974)\n-   **LegalBench: A Collaboratively Built Benchmark for Legal Reasoning** – Expert-led collaborative construction with reasoning typology\n    -   [https://arxiv.org/abs/2308.11462](https://arxiv.org/abs/2308.11462)\n-   **MultiMedQA / Large Language Models Encode Clinical Knowledge** – Multi-axis human evaluation for medical LLMs\n    -   [https://arxiv.org/abs/2212.13138](https://arxiv.org/abs/2212.13138)\n-   **Finance Agent Benchmark** – Expert-authored finance benchmark with agentic harness\n    -   [https://arxiv.org/abs/2508.00828](https://arxiv.org/abs/2508.00828)\n\n### Benchmark Quality and Validity Research\n\n-   **Measuring What Matters: Construct Validity in LLM Benchmarks** – 445-benchmark systematic review\n    -   [https://arxiv.org/abs/2511.04703](https://arxiv.org/abs/2511.04703)\n-   **BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices** – Operational checklist for benchmark quality\n    -   [https://arxiv.org/abs/2411.12990](https://arxiv.org/abs/2411.12990)\n-   **Evaluation 
Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter?** – ConTAM analysis across 13 benchmarks × 7 models\n    -   [https://arxiv.org/abs/2411.03923](https://arxiv.org/abs/2411.03923)\n-   **An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability** – Rubric clarity dominates reliability\n    -   [https://arxiv.org/abs/2506.13639](https://arxiv.org/abs/2506.13639)\n-   **LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation** – Rubric-guided judging in legal domain\n    -   [https://arxiv.org/abs/2510.07243](https://arxiv.org/abs/2510.07243)\n\n### Industry Reports and Practitioner Guidance\n\n-   **Demystifying Evals for AI Agents (Anthropic Engineering)** – Three-grader taxonomy and capability vs. regression evals\n    -   [https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)\n-   **2025 AI Index Report — Technical Performance (Stanford HAI)** – Benchmark saturation data\n    -   [https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance](https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance)\n-   **Introducing SWE-bench Verified (OpenAI)** – Human-filtered SWE-bench subset methodology\n    -   [https://openai.com/index/introducing-swe-bench-verified/](https://openai.com/index/introducing-swe-bench-verified/)\n-   **Bridging the Operational AI Gap (MIT Technology Review Insights)** – Enterprise AI deployment gap analysis\n    -   [https://www.technologyreview.com/2026/03/04/1133642/](https://www.technologyreview.com/2026/03/04/1133642/)\n\n### Standards and Policy\n\n-   **NIST AI Risk Management Framework (AI RMF 1.0)** – Foundational TEVV definitions\n    -   [https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf](https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf)\n-   **NIST AI Risk Management Framework: Generative AI Profile (AI 600-1)** – 
Pre-deployment testing guidance for GAI\n    -   [https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf)"
}

▼✓Web search query: "SPEC benchmark design principles relevance reproducibility fairness verifiabi...", result count: 8

{
  "results": [
    {
      "title": "[PDF] How to Build a Benchmark - SPEC Research Group",
      "url": "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf",
      "description": "The key characteristics can be organized in the following groups, which will be discussed in more detail in the next sections: • Relevance: How closely the benchmark behavior correlates to behaviors that are of interest to consumers of the results • Reproducibility: The ability to consistently produce similar results when the benchmark is run with the same test configuration • Fairness: Allowing different test configurations to compete on their merits without artificial limitations • Verifiability: Providing confidence that a benchmark result is accurate • Usability: Avoiding roadblocks for users to run the benchmark in their test environments. All benchmarks are subject to these same criteria, but each category includes additional issues that are specific to the individual benchmark, depending on the benchmark’s goals."
    },
    {
      "title": "[PDF] Designing a Good Benchmark",
      "url": "https://www.spec.org/events/beijing2016/slides/012-Designing_a_Good_Benchmark2%20-%20Jeremy%20Arnold.pdf",
      "description": "Reproducibility: Consistency (running the benchmark multiple times under the same conditions will produce the same results); Description (the hardware and software components and configuration are described in sufficient detail to allow an equivalent environment to be constructed); Power Measurements (power should be measurable using a variety of devices). Designing a Good Benchmark, Jeremy Arnold, SPECpower Architect. Copyright © 1988‐2016 Standard Performance Evaluation Corporation (SPEC). Benchmarks can be assessed on several different characteristics, including: Relevance, Reproducibility, Fairness, Verifiability, Usability. Usability: Self-describing (includes tools for automatic discovery of system details); Practical (runs on reasonably sized systems); Configurability (allow flexibility for research); Energy Data Collection (use of SPEC PTDaemon or other tools to automatically collect power data)."
    },
    {
      "title": "SPEC Fair Use Rules",
      "url": "https://www.spec.org/products/fairuse/",
      "description": "# SPEC Fair Use Rules. General Requirements For Public Use of All SPEC Benchmark Results. To help assure that these principles are met, the following requirements must be met by any organization or individual who makes public use of SPEC benchmark results. ▹ Section I General Requirements: rules that apply to public use of all SPEC benchmarks. ▹ Section II Individual Requirements: additional rules that apply to specific benchmarks. It is intended that this document provides the information needed for compliance with Fair Use, and in the event of any inconsistencies, this document takes precedence over individual benchmark run rules fair use requirements. **Compliance.** Claimed results must be compliant with that benchmark's rules. The SPEC web site () or a suitable sub page must be noted as a resource for additional information about the benchmark. Statements regarding SPEC, its benchmarks, and results published by SPEC, must be clear and correct. SPEC's trademarks may not be used to mislabel something that is not a SPEC metric."
    },
    {
      "title": "The Art of Building a Good Benchmark - ResearchGate",
      "url": "https://www.researchgate.net/publication/220789383_The_Art_of_Building_a_Good_Benchmark",
      "description": "... The design of a systems benchmark must meet a number of requirements, namely relevance, repeatability and reproducibility, fairness, and portability [8, 26]"
    },
    {
      "title": "Relevance and Evolution of Benchmarking in Computer Systems",
      "url": "https://www.mdpi.com/2073-431X/14/12/516",
      "description": "This review synthesizes the evolution from classical synthetic benchmarks (Whetstone, Dhrystone) and application kernels (LINPACK) to modern suites (SPEC"
    },
    {
      "title": "[PDF] The Science of Systems Benchmarking",
      "url": "https://closer.scitevents.org/Documents/Previous_Invited_Speakers/2021/CLOSER_2021_KS_4_Presentation.pdf",
      "description": "Benchmark Standardization: Standard Performance Evaluation Corporation (SPEC). Goal: provide standardized set of application benchmarks and standardized methodology for running them and reporting results. First benchmark was SPEC89 (4 C programs, 6 Fortran programs). Groups: OSG (Open Systems Group), HPG (High Performance Group), GWPG (Graphics and Workstation Performance Group), RG (Research Group). > 100 member organizations & associates. SPEC Research Group (RG), http://research.spec.org: founded in March 2011. Scope: systems benchmarking, performance evaluation, and experimental system analysis. Provide metrics, (research) benchmarks, methodologies and tools. Foster transfer of knowledge and collaboration btw. Agenda: Introduction; Benchmarking Education; Benchmark Standardization; Case Study on Cloud Benchmarking; Measuring and quantifying elasticity; Reproducibility of experimental evaluation."
    },
    {
      "title": "SPEC CPU 2026 in Detail - Why the Most Important CPU Benchmark Did No…",
      "url": "https://www.igorslab.de/en/?p=317730",
      "description": "# SPEC CPU 2026 in Detail – Why the Most Important CPU Benchmark Did Not Simply Get Bigger. With SPEC CPU 2026, one of the most important vendor-neutral CPU benchmarks after SPEC CPU 2017 receives a fundamental overhaul. This is more than just a version jump with new workloads, because the new suite attempts to capture several developments of recent years at once: significantly higher core counts, changed memory hierarchies, more open-source software in production environments, new compiler realities, greater platform diversity, and the question of how modern servers, workstations, laptops, and even smaller systems can be meaningfully compared. According to the published SPEC documents, SPEC CPU 2026 primarily measures the performance of general-purpose processors, the memory hierarchy, and compilers, optionally supplemented by energy efficiency or power consumption using the already familiar SPEC methodology. The suite was officially announced on May 5, 2026, and SPEC classifies it as a significant update to the established, vendor-neutral standard for compute-intensive performance measurements. The most important point is not only the number of benchmarks."
    },
    {
      "title": "Benchmarking Quantum Computers: Towards a Standard Performance Evaluation Approach",
      "url": "https://arxiv.org/html/2407.10941v4",
      "description": "# Benchmarking Quantum Computers: Towards a Standard Performance Evaluation Approach. The technological development of increasingly larger quantum processors on different quantum platforms raises the problem of how to fairly compare their performance, known as quantum benchmarking of quantum processors. This is a challenge that computer scientists have already faced when comparing classical processors, leading to the development of various mathematical tools to address it, but also to the identification of the limits of this problem. In this work, we briefly review the most important aspects of both classical processor benchmarks and the metrics comprising them, providing precise definitions and analyzing the quality attributes that they should exhibit. Additionally, we review some of the most important metrics and benchmarks for quantum processors proposed in the literature, assessing what quality attributes they fulfill. These guidelines aim to pave the way for establishing a roadmap towards standardizing the performance evaluation of quantum devices, ultimately leading to the creation of an organization akin to the Standard Performance Evaluation Corporation (SPEC)."
    }
  ]
}

▼✓Web search query: "psychometrics IRT item response theory AI benchmark evaluation design", results: 8

{
  "results": [
    {
      "title": "Item Response Theory for Evaluating Regression Algorithms",
      "url": "https://ieeexplore.ieee.org/document/9207030/",
      "description": "Item Response Theory (IRT) is a tool developed in psychometrics to measure latent abilities of human respondents based on their responses to items with"
    },
    {
      "title": "Item Response Theory (IRT): Intro, Models, and Examples",
      "url": "https://assess.com/what-is-item-response-theory/",
      "description": "# Item Response Theory (IRT): Intro, Models, and Examples. Home » Item Response Theory (IRT): Intro, Models, and Examples. Item response theory (IRT) is a family of machine learning models in the field of psychometrics, which are used to design, analyze, validate, and score assessments. Want to learn more about IRT, how it works, and why it is so important for assessment? ## What is Item Response Theory? IRT is a family of models that try to describe *how examinees respond to items on a test*, hence the name. These models can be used to evaluate item performance, because the descriptions are quite useful in and of themselves. IRT is model-driven, in that there is a specific mathematical equation that is assumed, and we fit the models based on raw data, similar to linear regression. IRT used to be known as **latent trait theory** and **item characteristic curve theory**."
    },
    {
      "title": "Building an Evaluation Scale using Item Response Theory - PMC",
      "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC5167538/",
      "description": "We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation."
    },
    {
      "title": "Lifting the benchmark iceberg with item-response theory | OpenReview",
      "url": "https://openreview.net/forum?id=ZyVQqK7mcP",
      "description": "## Lifting the benchmark iceberg with item-response theory. **TL;DR:** We reveal how hidden implementation choices in LLM benchmarks bias model rankings and propose item response theory (IRT) as a solution for more transparent and reliable evaluations. **Abstract:** The evaluation of large language models (LLMs) through benchmarks has become a cornerstone of AI development, guiding critical decisions about model deployment and research directions. However, as benchmarks evolve from narrow task-specific assessments to broad capability evaluations, they become more difficult to develop, understand and analyze. Here, we report a \"benchmark iceberg\" phenomenon, where much of the variability in model rankings stems not from true capability differences, but from hidden implementation choices beneath the surface of reported scores. Our analysis demonstrates how minor changes to these implementation details can alter model rankings, a concerning finding given benchmarks' role in shaping the AI landscape. By adapting item response theory (IRT) we transform benchmarks from opaque leaderboards into transparent measurement instruments, revealing how hidden implementation choices currently distort our perception of model capabilities."
    },
    {
      "title": "[PDF] Efficient and Robust Model Benchmarks with Item Response Theory ...",
      "url": "https://www.turing.ac.uk/sites/default/files/2021-07/efficient_and_robust_paper.pdf",
      "description": "Given the current candidate ability θ, a fitted IRT model with the likelihood function g(r; ω, θ), and a set of L items with parameters {ω1, …, ωL}, the Fisher item information (FII) on the jth item is then calculated as: (11) (12) Here (ωj, θ) refers to the corresponding probability measure of the IRT model. C.\tThe Constraint of Non-negative Discrimination For typical IRT models, positive discrimination indicates the item has better average responses from candidates with higher ability estimates. 1. To compare different IRT models, we evaluate their performance to make inferences over unseen responses (several standard machine learning evaluation measures). We use the training set to estimate the IRT models’ parameters, and the test set to verify the expected responses from each IRT model. During adaptive testing, each time we update the model ability, we use the trained IRT to infer the expected value of responses (performance measures)."
    },
    {
      "title": "Item Response Theory Overview - Emergent Mind",
      "url": "https://www.emergentmind.com/topics/item-response-theory-irt",
      "description": "Item Response Theory (IRT) is a statistical framework that models latent abilities via test responses to enable precise measurement and"
    },
    {
      "title": "Efficient and Robust Model Benchmarks with Item Response Theory ...",
      "url": "https://revistas.unir.net/index.php/ijimai/article/view/708",
      "description": "To this end, we adapt existing approaches from psychometrics: specifically, Item Response Theory and Adaptive Testing. Both are well-founded"
    },
    {
      "title": "When Classical Test Theory Alone Is Not Enough: How Item ...",
      "url": "https://luluyan.medium.com/why-classical-test-theory-falls-short-for-llm-evaluation-and-how-item-response-theory-irt-93f03f3b6a2a",
      "description": "Item Response Theory (IRT) and broader psychometric thinking offer a dramatically better toolkit, especially when we need to measure latent"
    }
  ]
}

▼thought

Now let me fetch a couple more key sources: the SPEC paper on "How to Build a Benchmark", and the PMC article on computational method benchmarking guidelines.

▼✓Fetch page content, URL: "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf"

{
  "url": "https://research.spec.org/icpe_proceedings/2015/proceedings/p333.pdf",
  "error": "Navigation timeout of 60000 ms exceeded"
}

▼✓Web search query: "data contamination prevention benchmark design private holdout canary string ...", results: 8

{
  "results": [
    {
      "title": "BIG-Bench Canary Contamination in GPT-4 — LessWrong",
      "url": "https://www.lesswrong.com/posts/kSmHMoaLKGcGgyWzs/big-bench-canary-contamination-in-gpt-4",
      "description": "# BIG-Bench Canary Contamination in GPT-4. The BIG-Bench canary string is a unique string included in documents intended to be excluded from the training datasets of models (e.g. model evaluation tasks):. As a contingency, checking whether a model can generate the canary string would allow us to detect whether such documents have made it into training corpora anyway. In a thread from a few months ago, @niplav pointed out that Claude 3.5 Sonnet could reproduce the canary string. The publicly available GPT-4o chat model does not generate the BIG-Bench canary string, but as I discovered the non-public pre-RLHF GPT-4-base *does*. The canary string has also been used in a number of LessWrong posts (e.g. 1, 2, 3) containing text the authors didn’t want future AIs to be trivially aware of, which is another reason canaries being compromised is bad. After testing a subset of the BIG-Bench benchmark tasks, I found that GPT-4-base had memorized several tasks in the benchmark (all containing the canary string)."
    },
    {
      "title": "BIG-Bench Canary Contamination in GPT-4 — AI Alignment Forum",
      "url": "https://www.alignmentforum.org/posts/kSmHMoaLKGcGgyWzs/big-bench-canary-contamination-in-gpt-4",
      "description": "# BIG-Bench Canary Contamination in GPT-4. The BIG-Bench canary string is a unique string included in documents intended to be excluded from the training datasets of models (e.g. model evaluation tasks):. As a contingency, checking whether a model can generate the canary string would allow us to detect whether such documents have made it into training corpora anyway. In a thread from a few months ago, @niplav pointed out that Claude 3.5 Sonnet could reproduce the canary string. The publicly available GPT-4o chat model does not generate the BIG-Bench canary string, but as I discovered the non-public pre-RLHF GPT-4-base *does*. But to test this out further, @Linch suggested I check whether GPT-4-base was trained on benchmark tasks themselves. After testing a subset of the BIG-Bench benchmark tasks, I found that GPT-4-base had memorized several tasks in the benchmark (all containing the canary string). GPT-4-base can reproduce the canary string."
    },
    {
      "title": "Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation",
      "url": "https://arxiv.org/html/2502.17521v1",
      "description": "# Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation. Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. However, because these static benchmarks are released on the Internet for transparent evaluation, and LLMs gather as much data as possible from the Internet for training, potential data contamination is unavoidable Magar and Schwartz (2022); Deng et al. To mitigate the risk of data contamination in LLM benchmarking, researchers have proposed various methods to enhance static approaches, such as data encryption Jacovi et al."
    },
    {
      "title": "DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | OpenReview",
      "url": "https://openreview.net/forum?id=3BZyQqbytZ",
      "description": "## DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination. **Abstract:** The rapid advancement of code large language models (Code LLMs) underscores the critical need for effective and transparent benchmarking methods. To overcome these challenges, we propose DyCodeEval, a novel benchmarking suite specifically designed to evaluate Code LLMs under realistic contamination scenarios. We introduce a dynamic data generation method and conduct extensive empirical studies on two seed datasets involving 18 Code LLMs. The results demonstrate that DyCodeEval effectively assesses the reasoning capabilities of Code LLMs under contamination conditions while producing diverse problem variants, thereby ensuring robust and consistent benchmarking outcomes. **Lay Summary:** Large language models (LLMs) are increasingly used to write code and solve programming tasks. Our approach provides a more robust and diverse benchmark for assessing code LLMs, helping researchers and developers better understand what these models can and cannot do in realistic settings. **Keywords:** benchmarking, code generation, large language model, trustworthy ML."
    },
    {
      "title": "Don’t train on this data or what’s a canary string? – Jürg Stuker",
      "url": "https://stuker.com/2026/dont-train-on-this-data-or-whats-a-canary-string/",
      "description": "# Don’t train on this data or what’s a canary string? There are many reasons why I don’t want data to be fed into the training of large language models (LLMs). In the case of a benchmark, we would not be able to measure the ability of a system to answer a question, but rather how well the system can find a memorized answer within its model. A proposed approach to exclude assets from training is to place an instruction in the content I don’t want to be used. The term “canary” comes from coal miners who placed canaries into coal mines as an early-warning signal for toxic gases, primarily carbon monoxide. Somehow programmers adopted this for example to detect buffer overflows on stack-allocated variables or programs overwriting security-critical sections in memory. Content containing the string showed up in GPT-4, Opus 3, and Sonnet 3.5."
    },
    {
      "title": "Benchmarking Large Language Models Under Data Contamination",
      "url": "https://arxiv.org/html/2502.17521v2",
      "description": "In this work, we conduct an in-depth analysis of existing static and dynamic benchmarks for evaluating LLMs. We first examine methods that"
    },
    {
      "title": "GitHub - lyy1994/awesome-data-contamination: The Paper List on Data Contamination for Large Language Models Evaluation.",
      "url": "https://github.com/lyy1994/awesome-data-contamination",
      "description": "# GitHub - lyy1994/awesome-data-contamination: The Paper List on Data Contamination for Large Language Models Evaluation. Data Contamination, also known as [train-test contamination](https://arxiv.org/abs/2211.09110) or [benchmark leakage](https://arxiv.org/abs/2311.01964), indicates the case in which the model has seen information (e.g., test instances, test prompts, etc.) about the test set to be evaluated on during training. **Time Travel in LLMs: Tracing Data Contamination in Large Language Models** (ICLR 2024). **A Survey on Data Contamination for Large Language Models** (arXiv, 20 Feb 2025). **Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation** (arXiv, 23 Feb 2025). **Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models** (arXiv, 10 Oct 2025)."
    },
    {
      "title": "Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation - ACL Anthology",
      "url": "https://aclanthology.org/2025.emnlp-main.511/",
      "description": "## Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation (Chen et al., EMNLP 2025)."
    }
  ]
}

▼✓Web search query: "Goodhart's law benchmark gaming overfitting leaderboard metric target", results: 8

{
  "results": [
    {
      "title": "Specification gaming, Goodhart’s law, and the metrics that lie about AI | explainx.ai Blog | explainx.ai",
      "url": "https://explainx.ai/blog/specification-gaming-goodharts-law-ai-metrics",
      "description": "# Specification gaming, Goodhart’s law, and the metrics that lie about AI. In AI, that shows up as reward hacking, benchmark overfitting, and agents that please evaluators while failing users. A practical take on Goodhart, proxy metrics, and what to do in product and governance. Goodhart’s law (paraphrased) warns that *any* proxy used as a *sole* target can eventually break as a measure. 1. **Games.** A simulated agent is rewarded for a score; it finds a weird strategy that maxes the score in a way humans would call unfair or brittle—the classic RL anecdote, still pedagogically useful. That is governance’s answer to the same structural uncertainty as Goodhart in a product dashboard. For most shipping teams, the honest goal is not ‘open the black box’ but ‘know when it breaks, why it might have broken, and what to do next’—tied to tools, data retention, and governance.Apr 23, 2026."
    },
    {
      "title": "The Paradox of AI Benchmarks: Challenges in Evaluation",
      "url": "https://thesequence.substack.com/p/the-sequence-opinion-750-the-paradox",
      "description": "# The Sequence Opinion #750: The Paradox of AI Benchmarks: Challenges in Evaluation. ### Goodhart’s Law and some of the challenges around AI benchmarks. We largely rely on standardized benchmarks and tidy metrics to declare winners and track improvement. Yet a central paradox undercuts this practice: as soon as a metric becomes the target, it often stops measuring what we intended. This is Goodhart’s Law in action—*“when a measure becomes a target, it ceases to be a good measure.”* In AI, that means a leaderboard gain may reflect success at **gaming the test**, not a durable leap in capability. This essay examines the **core challenges of AI benchmarks and evaluations** across language, vision, and reinforcement learning (RL). We’ll look at how once-useful measures turn into misleading goals; why good evals should be **accurate, inexpensive, and hard to memorize**; where academic and real-world tests clash; and what RL environments promise—and fail—to deliver as evolving evaluations. ## Goodhart’s Law and the Benchmark Trap."
    },
    {
      "title": "Gaming the System: Goodhart’s Law Exemplified in AI Leaderboard Controversy",
      "url": "https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy",
      "description": "# Gaming the System: Goodhart’s Law Exemplified in AI Leaderboard Controversy. ### How the race to the top in AI benchmarks is leading to specialized optimization at the expense of real-world performance. The recent uproar over the LMSYS (now LMArena) *Chatbot Arena* leaderboard is a striking, real-world example of **Goodhart’s Law** in action. Goodhart’s Law – originally from economics – warns that *“**when a measure becomes a target, it ceases to be a good measure**”*. In AI model evaluation, this means that once researchers and companies focus on beating a particular benchmark, the benchmark itself can become **misleading**. The Chatbot Arena controversy has proved this in practice: model developers have found ways to optimize for high Arena scores (the *target*), ultimately distorting the Arena’s ability to measure true model quality. **LMSYS** (now LMArena) is a crowdsourced evaluation platform for chat-based AI models."
    },
    {
      "title": "Goodhart's law states when a metric becomes a target, it ceases to ...",
      "url": "https://www.threads.com/@carnage4life/post/DWLQen2DY3K/goodharts-law-states-when-a-metric-becomes-a-target-it-ceases-to-be-a-good",
      "description": "Goodhart's law states when a metric becomes a target, it ceases to be a good metric. This describes the phenomenon of companies creating"
    },
    {
      "title": "The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks",
      "url": "https://tianpan.co/blog/2026-04-19-goodharts-law-ai-benchmark-gaming",
      "description": "# The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks. In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — \"when a measure becomes a target, it ceases to be a good measure\" — was articulated in 1970s economic policy, but it describes AI benchmarking with eerie precision. ## The Lifecycle of a Benchmark. A research team identifies a meaningful capability — reasoning, code generation, multilingual comprehension — and constructs test questions to measure it."
    },
    {
      "title": "Organizations everywhere fall prey to Goodhart's law: “When a ...",
      "url": "https://www.facebook.com/MITSloanManagementReview/posts/organizations-everywhere-fall-prey-to-goodharts-law-when-a-measure-becomes-a-tar/1317146260451960/",
      "description": "Organizations everywhere fall prey to Goodhart's law: “When a measure becomes a target, it ceases to be a good measure."
    },
    {
      "title": "LLM Leaderboards are Bullshit - Goodhart's Law Strikes Again",
      "url": "https://www.reddit.com/r/LocalLLaMA/comments/1bjvjaf/llm_leaderboards_are_bullshit_goodharts_law/",
      "description": "205 votes, 84 comments. The AI research community's obsession with leaderboards and evaluation benchmarks for large language models is a"
    },
    {
      "title": "Is AI Cheating on the Test: Data Contamination, Gaming, and the ...",
      "url": "https://medium.com/@wasowski.jarek/is-ai-cheating-on-the-test-data-contamination-gaming-and-the-benchmark-crisis-9dff2fba494f",
      "description": "How data contamination, Goodhart's Law, and gaming turned objective AI tests into a marketing tool — and what to do about it as a"
    }
  ]
}

▼thought

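How to Design a Good Benchmark

In this answer I draw on two threads, the classic methodology of computer-systems benchmarks and current AI benchmark practice, to lay out a practical design framework.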

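I. Five Basic Properties of a Benchmark (the SPEC Framework)

Relevance: how closely the benchmark's behavior tracks real usage scenarios. This is the first principle: if nobody cares about what you measure, a precise score is meaningless.

Reproducibility: running the benchmark repeatedly under the same conditions yields the same results, and the hardware and software configuration is described in enough detail for others to build an equivalent environment.

Fairness: whether different systems under test can compete on equal terms, with no artificial bias toward a particular architecture or implementation.

Verifiability: how much confidence a reader can have that a reported result is accurate.

Usability: how hard it is for users to run the benchmark in their own test environment. The higher the barrier, the fewer people run it and the less value the benchmark has for its ecosystem.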

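II. Three Core Properties of a Modern AI Benchmark (the Ofir Press Framework)

Ofir Press, an author of benchmarks including SWE-bench, AssistantBench, and CiteME, distills a good AI benchmark into three core properties (How to Build Good Language Modeling Benchmarks):

1. Natural

Tasks should come from questions real people actually ask, not from questions invented in isolation. Two signs that a benchmark is unnatural:

The problem setup is unrealistic. Multiple choice is the standard example: nobody visits a doctor and says "my elbow hurts, and the cause must be one of these four options..."
The questions are made up rather than taken from real users. If you work at Google, do not sit in a room inventing questions; go through the Search logs and find real queries that users issued without getting a good answer.

A good self-check is the "usefulness criteria": would a system that beats the baseline on this benchmark actually be useful to people? Would it actually raise productivity?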

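2. Automatically Evaluable

3. Challenging

This is the criterion that changes fastest. Ofir's advice has gone through three versions:

Original: at release, the best models should score between 1% and 35%.
January 2025 revision: between 0.1% and 9%.
May 2025 revision: he now asks collaborators to aim not for a benchmark on which AI scores 0% but for one on which it scores -200%, meaning problems that would still score zero even if model performance tripled. Making today's models fail is not enough; you have to forecast the next 6–12 months of progress and design a benchmark that next year's models cannot solve either.

A practical trick: use a strong baseline to filter out easy instances (Bamboogle dropped questions that Google could answer; CiteME dropped questions that GPT-4o could answer directly).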
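The baseline-filtering trick can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by Bamboogle or CiteME: the item schema (`question`, `answer`), the `baseline_answer` callable, and the exact-match grader are all assumptions made for the example.

```python
from typing import Callable, Iterable


def filter_easy_instances(
    items: Iterable[dict],
    baseline_answer: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> list[dict]:
    """Keep only the items that the strong baseline gets wrong."""
    hard = []
    for item in items:
        prediction = baseline_answer(item["question"])
        if not is_correct(prediction, item["answer"]):
            hard.append(item)
    return hard


# Toy stand-ins (hypothetical): a lookup-table "baseline" and exact-match grading.
_known = {"capital of France?": "Paris"}


def toy_baseline(question: str) -> str:
    return _known.get(question, "")


def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()


pool = [
    {"question": "capital of France?", "answer": "Paris"},
    {"question": "a question the baseline cannot answer", "answer": "42"},
]
hard_pool = filter_easy_instances(pool, toy_baseline, exact_match)
```

In practice `baseline_answer` would wrap a search engine or a frontier model, and the filter would be rerun whenever a stronger baseline becomes available.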

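III. Construct Validity: Making the Benchmark Measure What It Claims to Measure

This is the most important concept in the whole design process, and the one most often skipped.

The four validity types proposed by Reuel et al. form a complete checking framework (Measuring What Matters: Construct Validity in LLM Benchmarks):

Validity type	Core question	Typical failure mode
Construct	Does it measure the capability its name claims?	A benchmark is labeled a "general reasoning" test but mostly measures knowledge recall
Criterion	Do scores correlate with real downstream outcomes?	The benchmark score rises 10 points while production behavior does not change
Consequential	Does optimizing this metric produce the behavior you want?	Optimizing pass@1 makes the model overly conservative, so it refuses reasonable requests
External	Does it generalize beyond the eval set?	Strong results on a curated test set that collapse under distribution shift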

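IV. Where the Tasks Come From and Who Writes Them

Real artifacts beat synthetic prompts

Expert authoring + adversarial validation

GPQA Diamond is the cleanest reference implementation:

A domain expert writes the question
A second expert validates it
The question is revised
Non-expert validation: non-experts (with 30+ minutes of web access) attempt the question

Four operating rules

Start from real failures and expand iteratively: seed set → expansion → iteration
Write out reference answers: if your own domain experts cannot solve a task, a model's failure on it tells you nothing
Use multiple raters for every item: a single rater passes on that rater's blind spots. SWE-bench Verified had three annotators label each sample, ensembled their severity labels, and filtered out roughly two-thirds (68.3%) of the annotated samples as underspecified or otherwise unreliable (Introducing SWE-bench Verified)
Strictly separate training and test sets: train/test contamination quietly inflates held-out scores, and you only find out when production performance falls short of expectations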
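
The multi-rater rule can be made concrete with a small filter. The sketch below assumes each item has severity labels from several annotators on a 0 to 3 scale and drops any item whose worst label reaches a cut-off of 2; the scale, the cut-off, and the function names are illustrative assumptions, not the published SWE-bench Verified procedure.

```python
def ensemble_severity(labels: list[int]) -> int:
    """Combine annotator labels conservatively by taking the most severe one."""
    return max(labels)


def filter_items(annotations: dict[str, list[int]], threshold: int = 2):
    """Split item ids into (kept, dropped) using the ensembled severity."""
    kept, dropped = [], []
    for item_id, labels in annotations.items():
        if ensemble_severity(labels) >= threshold:
            dropped.append(item_id)
        else:
            kept.append(item_id)
    return kept, dropped


# Invented labels: three annotators per task, 0 = no issue, 3 = severe issue.
annotations = {
    "task-1": [0, 0, 1],
    "task-2": [0, 2, 1],
    "task-3": [3, 3, 2],
}
kept, dropped = filter_items(annotations)
```

Taking the maximum favors precision over recall: one annotator flagging a serious problem is enough to drop the item, which shrinks the benchmark but makes the surviving items more trustworthy.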

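V. Grading: Code, Model, or Human

The taxonomy from Anthropic's engineering team is the clearest framework (Demystifying Evals for AI Agents): three kinds of grader, chosen in order from cheapest to most expensive.

Code-based grader (the gold standard)

Model-based grader (LLM judge, the workhorse for open-ended output)

The LeMAJ legal evaluation framework found that a shared rubric raised inter-reviewer agreement by 11%, and that a rubric-guided LLM judge configuration reached Cohen's κ = 0.75 against human consensus (LeMAJ).

Three practical rules:

Validate against a human-annotated golden set: 75–90% agreement with human consensus is the floor
Decompose the judgment into structured criteria: binary checks ("Does the output cite a real case?") plus ordinal scores ("Rate clarity from 1 to 5, with anchor descriptions") work far better than a single overall score
Watch for known biases: position bias, length bias, and self-preference (favoring output from the judge's own model family). Randomize positions, normalize length, and use a judge from a different model family than the system under test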
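
To validate a judge against a golden set, it helps to report Cohen's κ alongside raw agreement, because raw agreement can look high purely by chance when one label dominates. A minimal standard-library sketch follows; the label lists are invented, and `raw_agreement` and `cohens_kappa` are hypothetical helper names.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = raw_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented golden-set labels from human consensus and from an LLM judge.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
agreement = raw_agreement(human, judge)
kappa = cohens_kappa(human, judge)
```

In this toy data the judge agrees with the humans on 6 of 8 items, yet κ is well below that raw figure, which is why a percentage-agreement floor and a κ target are not interchangeable.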

Human grader (the calibration layer and the high-stakes layer)

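VI. Data Contamination: The Thorniest Problem in Benchmark Design

Four layers of defense

Layer	Method	Representative example
Detection	Embed a canary string, then test afterwards whether a model can reproduce it	GPQA Diamond
Temporal isolation	Date-stamp every problem and filter by each model's training cutoff	LiveCodeBench (timestamps competitive-programming problems from May 2023 onward)
Spatial isolation	Keep a private split that is never released	The GIM benchmark's 615 public + 205 private design (GIM)
Structural immunity	Do not release answers (SciCode, for example, releases only function descriptions and unit tests, not solution code); or generate problems dynamically	The shift from static to dynamic benchmarks (Recent Advances in LLM Benchmarks against Data Contamination)

Ofir Press's design for SciCode reflects the idea that "even if the benchmark leaks entirely into training data, a model still cannot score": PhDs write the programming problems, and only the descriptions and tests are released, not the solutions.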
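
The temporal-isolation layer reduces to a date comparison once every problem carries a release date. The sketch below is an illustration under assumptions: the `Problem` record, the `uncontaminated` helper, and all dates are hypothetical, and real training cutoffs are often only approximately known, so a safety margin may be needed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the problem first became public


def uncontaminated(problems, training_cutoff: date):
    """Keep problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Invented problems and cutoff, for illustration only.
problems = [
    Problem("p1", date(2023, 4, 30)),
    Problem("p2", date(2023, 9, 1)),
    Problem("p3", date(2024, 2, 15)),
]
fresh = uncontaminated(problems, training_cutoff=date(2023, 8, 31))
fresh_ids = [p.problem_id for p in fresh]
```

Because the filter depends on the cutoff, each model is effectively scored on its own subset; comparisons between models are cleanest on the problems released after the latest cutoff among them.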

Harness isolation: an easily overlooked problem

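VII. Goodhart's Law and Benchmark Lifecycle Management

"When a measure becomes a target, it ceases to be a good measure."

Separate two kinds of eval from day one

Continuous updates

Feed production failures back into the eval set: every user-flagged wrong answer, every human override, and every escalation is a pre-validated hard case
Version the benchmark, aligned with model versions
Tag every eval run with its model version, prompt version, and rubric version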
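
Tagging each run is easiest when the version fields are part of the result record itself. The following is a minimal sketch, assuming results are stored as one JSON line per run; the `EvalRunRecord` type, its field names, and the version strings are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunRecord:
    benchmark_version: str
    model_version: str
    prompt_version: str
    rubric_version: str
    score: float


# Invented values for illustration.
record = EvalRunRecord(
    benchmark_version="bench-1.2.0",
    model_version="model-2025-06",
    prompt_version="prompt-v7",
    rubric_version="rubric-v3",
    score=0.42,
)

# One self-describing line per run; sorted keys keep diffs stable.
line = json.dumps(asdict(record), sort_keys=True)
restored = EvalRunRecord(**json.loads(line))
```

Keeping all four versions in the same record means that a score change can always be attributed to a change in the model, the prompt, the rubric, or the benchmark itself.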

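VIII. A Complete Design Checklist

Drawing on How2Bench's 55-item checklist (How2Bench), the BetterBench framework (BetterBench), and all of the sources above, a benchmark design should cover the following dimensions:

Design phase

State the construct in one paragraph: which capability is measured, why it matters, and what a change in score means
Specify the target audience and use case (research comparison vs. production decisions vs. safety audits)
Pick one primary metric plus 3–4 secondary metrics that cover the real trade-offs you face
Plan the lifecycle from capability eval to regression eval

Task construction

Prefer real artifacts (bug reports, support tickets, real queries)
If you must synthesize, follow the GPQA Diamond pipeline: expert authoring → expert validation → revision → non-expert validation
Write a reference answer for every item
Use multiple raters, and record inter-rater disagreement as a signal of item ambiguity
Include balanced positive and negative cases ("the agent successfully cancels the order" + "the agent refuses to cancel an order that should not be cancelled")

Grading system

Use a code-based grader wherever one is possible
With an LLM judge, the rubric is central: write an operational definition and an anchor for every score level
Validate judge reliability against a human-annotated golden set (75–90% agreement as the floor)
Watch for and mitigate judge biases (position, length, self-preference)
Isolate the system under test from the grader (harness isolation)

Contamination defense

Embed a canary string
Date-stamp every item / keep a private holdout
Where possible, do not publish the answers
Consider a dynamic generation mechanism

Maintenance plan

Versioning: align benchmark versions with model versions
A pipeline that feeds production failures back into the eval set
Periodic re-annotation
Provide reproducible scripts and CI integration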

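Addendum: What This Search Did and Did Not Cover

Areas that need deeper treatment but were not fully developed here:

The methodology of interactive evaluation as a paradigm in its own right: this is the focus of a May 2026 arXiv position paper (Interactive Evaluation Requires a Design Science), which argues that "interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks"
The particular challenges of multimodal benchmarks (handling image, text, and code together, among others)
Domain-specific benchmark design details (medical imaging, financial trading, and similar fields, which need deep involvement from domain experts to explain properly)
A systematic methodology for applying IRT (item response theory) to AI benchmarks: GIM and a few other benchmarks use it today, but it has not yet become a community standard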
