Do a deep dive into lm-evaluation-harness

▼thought

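The user wants an in-depth understanding of `lm-evaluation-harness`. It is a widely used LLM evaluation framework maintained by EleutherAI. I need to research it from several angles: 1. First, search the in-site feed for related content. 2. Search the external web for the latest information. 3. Possibly also look at the GitHub repository. Let me run these searches in parallel.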

▼✓Feed full-text search, keywords: "lm-evaluation-harness EleutherAI benchmark", count: 30

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"items\": [\n    {\n      \"id\": \"oai:arXiv.org:2605.04454v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04454\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone\",\n      \"author\": [\n        \"Varad Vishwarupe\",\n        \"Nigel Shadbolt\",\n        \"Marina Jirotka\",\n        \"Ivan Flechais\"\n      ],\n      \"summary\": \"arXiv:2605.04454v1 Announce Type: cross Abstract: Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. 
This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.10144v2\",\n      \"url\": \"https://arxiv.org/abs/2602.10144\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"When LLMs get significantly worse: A statistical approach to detect model degradations\",\n      \"author\": [\n        \"Jonas K\\\\\\\"ubler\",\n        \"Kailash Budhathoki\",\n        \"Matth\\\\\\\"aus Kleindessner\",\n        \"Xiong Zhou\",\n        \"Junming Yin\",\n        \"Ashish Khetan\",\n        \"George Karypis\"\n      ],\n      \"summary\": \"arXiv:2602.10144v2 Announce Type: replace-cross Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is an evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. 
We propose a statistically sound hypothesis testing framework based on McNemar's test allowing to efficiently detect model degradations, while guaranteeing a controlled rate of false positives. The crucial insight is that we have to confront the model scores on each sample, rather than aggregated on the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the largely adopted open source LM Evaluation Harness and provide a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04523v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04523\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation\",\n      \"author\": [\n        \"Ivan Bondarenko\",\n        \"Roman Derunets\",\n        \"Oleg Sedukhin\",\n        \"Mikhail Komarov\",\n        \"Ivan Chernov\",\n        \"Mikhail Kulakov\"\n      ],\n      \"summary\": \"arXiv:2605.04523v1 Announce Type: new Abstract: We present our winning system for Task~B (generation with reference passages) in SemEval-2026 Task~8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. 
We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost--performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"a0f60af6af70259d9ac714a9b879b056074869ef062450ebda96f1f65f23df29\",\n      \"url\": \"https://github.com/bogini\",\n      \"source_url\": \"https://github.com/trending/developers/typescript?since=daily\",\n      \"title\": \"bogini\",\n      \"author\": [\n        \"bogini\"\n      ],\n      \"summary\": \"Popular repo: function-health-exporter - Turn your Function Health labs into LLM-ready reports\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T16:48:45.730Z\",\n      \"fetched_at\": \"2026-05-07T16:48:46.617Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03858v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03858\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following\",\n      \"author\": [\n        \"Jaeyun Lee\",\n        \"Junyoung Koh\",\n        \"Zeynel Tok\",\n        \"Hunar Batra\",\n        \"Ronald Clark\"\n      ],\n      \"summary\": \"arXiv:2605.03858v1 Announce Type: new Abstract: Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges 
are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:21.529Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05175v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05175\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge\",\n      \"author\": [\n        \"Perry E. Radau\"\n      ],\n      \"summary\": \"arXiv:2605.05175v1 Announce Type: cross Abstract: Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. 
No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. 
MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"55c5e35b9bbd92d15c88e3a7d0d8a7800ee1a3c0f0ad444c77ed2d5b62b143aa\",\n      \"url\": \"https://github.com/VectifyAI/PageIndex\",\n      \"source_url\": \"https://github.com/trending?since=daily\",\n      \"title\": \"VectifyAI/PageIndex\",\n      \"author\": [\n        \"VectifyAI\"\n      ],\n      \"summary\": \"📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T16:45:06.412Z\",\n      \"fetched_at\": \"2026-05-07T16:45:07.284Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.00933v2\",\n      \"url\": \"https://arxiv.org/abs/2602.00933\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers\",\n      \"author\": [\n        \"Chaithanya Bandi\",\n        \"Ben Hertzberg\",\n        \"Geobio Boo\",\n        \"Tejas Polakam\",\n        \"Jeff Da\",\n        \"Sami Hassaan\",\n        \"Manasi Sharma\",\n        \"Andrew Park\",\n        \"Ernesto Hernandez\",\n        \"Dan Rambado\",\n        \"Ivan Salazar\",\n        \"Rafael Cruz\",\n        \"Chetan Rane\",\n        \"Ben Levin\",\n        \"Brad Kenstler\",\n        \"Bing Liu\"\n      ],\n      \"summary\": \"arXiv:2602.00933v2 Announce Type: replace-cross Abstract: The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. 
However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Tasks use natural language prompts that avoid naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%, with primary failures arising from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark dataset to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-05T16:07:19.561Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04357v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04357\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs\",\n      \"author\": [\n        \"Yixuan Mei\",\n        \"Zikun Li\",\n        \"Zixuan Chen\",\n        \"Shiqi Pan\",\n        \"Mengdi Wu\",\n        \"Xupeng Miao\",\n        \"Zhihao Jia\",\n        \"K. V. 
Rashmi\"\n      ],\n      \"summary\": \"arXiv:2605.04357v1 Announce Type: cross Abstract: The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\\\\times$ over the best baseline, and delivers up to 2.39$\\\\times$ higher goodput under scarce resource availability.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"1cad32c447f9131e633154d70b4987bb207a86751c3c8538ce833d36909d798c\",\n      \"url\": \"https://ahk.cardor.dev\",\n      \"source_url\": \"https://news.ycombinator.com/rss\",\n      \"title\": \"Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)\",\n      \"author\": null,\n      \"summary\": \"Comments\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T10:45:59.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:17.897Z\"\n    },\n    {\n      \"id\": \"38bf6ecc5fc6495ace1d93af960422ad1fe6b249a479a6470528e9ca7c98cf4e\",\n      \"url\": \"https://trendshift.io/repositories/14565\",\n      
\"source_url\": \"https://trendshift.io/github-trending-repositories?trending-range=1\",\n      \"title\": \"free-llm-api-resources\",\n      \"author\": [\n        \"cheahjs\"\n      ],\n      \"summary\": \"A list of free LLM inference resources accessible via API.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T16:05:38.142Z\",\n      \"fetched_at\": \"2026-05-06T16:05:42.052Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04298v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04298\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs\",\n      \"author\": [\n        \"Stefano Bann\\\\`o\",\n        \"Kate Knill\",\n        \"Mark Gales\"\n      ],\n      \"summary\": \"arXiv:2605.04298v1 Announce Type: new Abstract: Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system's true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. 
We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"cb49d85958c64b4aa4c473e84eef58173340bb3ed59f3ddf6d38ffd6b25c4f2b\",\n      \"url\": \"https://trendshift.io/repositories/22628\",\n      \"source_url\": \"https://trendshift.io/github-trending-repositories?trending-range=1\",\n      \"title\": \"decolua/9router\",\n      \"author\": null,\n      \"summary\": \"Unlimited FREE AI coding. Connect Claude Code, Codex, Cursor, Cline, Copilot, Antigravity to FREE Claude/GPT/Gemini via 40+ providers. 
Auto-fallback, RTK -40% tokens, never hit limits.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T16:14:28.906Z\",\n      \"fetched_at\": \"2026-05-07T16:14:29.690Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05170v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05170\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours\",\n      \"author\": [\n        \"The Verkor Team\",\n        \"Ravi Krishna\",\n        \"Suresh Krishna\",\n        \"David Chin\"\n      ],\n      \"summary\": \"arXiv:2605.05170v1 Announce Type: cross Abstract: Driven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced \\\"Design Conductor\\\" (or just \\\"Conductor\\\"), a system capable of building a 5-stage Linux-capable RISC-V CPU in 12 hours. In this work, we introduce an updated multi-agent harness powered by frontier models released in April 2026, which is able to handle 80x larger tasks, at higher quality, fully autonomously. Following a brief introduction, we examine 4 designs that the system produced autonomously, including \\\"VerTQ\\\", an LLM inference accelerator which hard-wires support for TurboQuant in a 240-cycle pipeline, starting from the TurboQuant arXiv paper. VerTQ includes heavy compute processing, with 5129 FP16/32 units; the design was mapped to an FPGA at 125 MHz and consumes 5.7 mm^2 in TSMC 16FF (8 attention pipes). We review the key new characteristics that enabled these results. 
Finally, we analyze Design Conductor's token usage and other empirical characteristics, including its limitations.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04135v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04135\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation\",\n      \"author\": [\n        \"David Gringras\",\n        \"Misha Salahshoor\"\n      ],\n      \"summary\": \"arXiv:2605.04135v1 Announce Type: cross Abstract: Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about \\\"AI\\\" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. 
The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of \\\"AI\\\" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"064c2984cfb59b2fa13683fc6a0d7ae1e6872fb7ccbd288d5ddcebb87d622f70\",\n      \"url\": \"https://doi.org/10.7910/DVN/FB7V2B\",\n      \"source_url\": \"https://dataverse.harvard.edu/\",\n      \"title\": \"Indic MMLU\",\n      \"author\": [\n        \"Anonymous, Anonymous\"\n      ],\n      \"summary\": \"The Indic MMLU dataset is a multilingual adaptation of the Massive Multitask Language Understanding (MMLU) benchmark developed to evaluate the reasoning, knowledge comprehension, and multilingual capabilities of Large Language Models (LLMs) across Indian languages. The dataset consists of professionally translated and quality-filtered multiple-choice question-answer pairs spanning diverse academic and professional domains, including science, mathematics, history, law, medicine, engineering, humanities, and social sciences. The primary purpose of this dataset is to provide a standardized benchmark for assessing model performance in low-resource and linguistically diverse Indic settings. 
The dataset enables research in multilingual NLP, cross-lingual transfer learning, language alignment, and culturally grounded AI evaluation. The dataset was generated through a structured pipeline involving machine-assisted translation of the original English MMLU benchmark into selected Indic languages, followed by extensive quality filtering using translation evaluation metrics such as BLEU, chrF++, and TER. Additional validation steps were applied to preserve semantic fidelity, answer consistency, and linguistic fluency. The final data is provided in standardized machine-readable formats suitable for benchmarking and downstream evaluation workflows. Indic MMLU is intended for researchers, academic institutions, and industry practitioners working on multilingual AI systems, Indic language technologies, and large-scale language model evaluation. By extending a widely recognized benchmark into Indian languages, the dataset contributes toward more inclusive, representative, and culturally relevant evaluation standards for modern AI systems\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T06:07:44.000Z\",\n      \"fetched_at\": \"2026-05-07T16:35:52.266Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04083v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04083\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals\",\n      \"author\": [\n        \"Tadhg Looram\",\n        \"Lucas Nuzzi\",\n        \"Kyle Waters\",\n        \"Steven Dillmann\"\n      ],\n      \"summary\": \"arXiv:2605.04083v1 Announce Type: new Abstract: Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. 
Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from $75.9\\\\%$ to $89.6\\\\%$ (strict common-subset agreement: $77.8\\\\%$ to $92.1\\\\%$), while compact juries exhibit substantially higher internal dissent (3--2 split rate $28.7\\\\%$--$32.4\\\\%$) than frontier juries ($6.1\\\\%$--$11.5\\\\%$). 
Verifier traces further show that compact juries reduce per-criterion judging cost to roughly $4.2\\\\%$--$5.6\\\\%$ of frontier and latency to roughly $21.7\\\\%$--$27.1\\\\%$, even as aggregated task-level outcomes often remain comparatively stable.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04118v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04118\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation\",\n      \"author\": [\n        \"Cong Liu\",\n        \"Milong Ren\",\n        \"Jiaqi Guan\",\n        \"Chengyue Gong\",\n        \"Jinyuan Sun\",\n        \"Xinshi Chen\",\n        \"Wenzhi Xiao\"\n      ],\n      \"summary\": \"arXiv:2605.04118v1 Announce Type: cross Abstract: Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. 
Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04503v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04503\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning\",\n      \"author\": [\n        \"Yuancheng Wei\",\n        \"Haojie Zhang\",\n        \"Linli Yao\",\n        \"Lei Li\",\n        \"Jiali Chen\",\n        \"Tao Huang\",\n        \"Yiting Lu\",\n        \"Duojun Huang\",\n        \"Xin Li\",\n        \"Zhao Zhong\"\n      ],\n      \"summary\": \"arXiv:2605.04503v1 Announce Type: cross Abstract: Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. 
To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2601.22725v3\",\n      \"url\": \"https://arxiv.org/abs/2601.22725\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation\",\n      \"author\": [\n        \"Jin Li\",\n        \"Tao Chen\",\n        \"Shuai Jiang\",\n        \"Weijie Wang\",\n        \"Jingwen Luo\",\n        \"Chenhui Wu\"\n      ],\n      \"summary\": \"arXiv:2601.22725v3 Announce Type: replace-cross Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. 
Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \\\\times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\\\\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03792v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03792\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"TriBench-Ko: Evaluating LLM Risks in Judicial Workflows\",\n      \"author\": [\n        \"Haesung Lee\",\n        \"Gyubin Choi\",\n        \"Eun-Ju Lee\",\n        \"So-Min Lee\",\n        \"Youkang Ko\",\n        \"Dogyoon Lim\",\n        \"Sung-Kyoung Jang\",\n        \"Yohan Jo\"\n      ],\n      \"summary\": \"arXiv:2605.03792v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into legal workflows. 
However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. 
Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:21.529Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03936v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03936\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models\",\n      \"author\": [\n        \"Daniel Drucker\",\n        \"Kyle Mahowald\"\n      ],\n      \"summary\": \"arXiv:2605.03936v1 Announce Type: new Abstract: Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. 
These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:21.529Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2603.16659v2\",\n      \"url\": \"https://arxiv.org/abs/2603.16659\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"LLMs learn scientific taste from institutional traces across the social sciences\",\n      \"author\": [\n        \"Ziqin Gong\",\n        \"Ning Li\",\n        \"Huaikang Zhou\"\n      ],\n      \"summary\": \"arXiv:2603.16659v2 Announce Type: replace Abstract: Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. 
In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say \\\"I'm sure\\\" versus \\\"I'm guessing.\\\" Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02974v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02974\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"PHBench: A Benchmark for Predicting Startup Series A Funding from Product Hunt Launch Signals\",\n      \"author\": [\n        \"Yagiz Ihlamur\",\n        \"Ben Griffin\",\n        \"Rick Chen\"\n      ],\n      \"summary\": \"arXiv:2605.02974v1 Announce Type: cross Abstract: Structured launch signals on Product Hunt contain statistically significant predictive information for Series A funding outcomes. We construct PHBench from 67,292 featured Product Hunt posts spanning 2019-2025, linked to Crunchbase funding records via deterministic domain matching, identifying 528 verified Series A raises within 18 months of launch (positive rate: 0.78%). 
Our best-performing model, a three-component ensemble (ENS_avg, ENS_ISO, XGB) selected by validation F0.5, achieves F0.5 = 0.097 and AP = 0.037 (95% CI: 0.024-0.072; 4.7x lift over random) on the private held-out test set (103 positives). A paired bootstrap confirms a statistically credible advantage over the logistic regression baseline (AP delta: +0.013, 95% CI: [0.004, 0.039], p < 0.001; F0.5 delta: +0.056, 95% CI: [0.006, 0.122], p = 0.016). Validation-set metrics (F0.5 = 0.284, AP = 0.126) reflect best-of-144 selection bias on 53 positives and are reported for benchmark reproducibility only. We further evaluate three zero-shot Gemini models (Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 3.1 Pro) in an anonymized numerical setting. The best LLM achieves AP = 0.034 (Gemini 3 Flash), below the LR baseline AP of 0.044. Notably, the most capable Gemini variant (Gemini 3.1 Pro, AP = 0.023) performs worst -- an unexpected pattern that warrants further investigation across providers and prompting strategies. Both ML and LLM models show the same temporal performance decay tracking the 2020-2021 funding boom and subsequent contraction, confirming the dataset captures genuine market structure rather than noise. PHBench provides a reproducible framework comprising public training, validation, and blind test splits; 61 engineered features; a five-metric evaluation harness; and a public leaderboard at https://phbench.com. 
All code, baseline models, and anonymized dataset splits are publicly available.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"ad87ab2f75834cf15e480f1ab02abe44dee472a1f5fcb50674048dce52e5976d\",\n      \"url\": \"https://github.com/darkrishabh/agent-skills-eval\",\n      \"source_url\": \"https://news.ycombinator.com/rss\",\n      \"title\": \"Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs\",\n      \"author\": null,\n      \"summary\": \"Comments\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T06:12:25.000Z\",\n      \"fetched_at\": \"2026-05-07T10:00:03.704Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04539v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04539\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization\",\n      \"author\": [\n        \"Qiming Bao\",\n        \"Juho Leinonen\",\n        \"Paul Denny\",\n        \"Michael J. Witbrock\"\n      ],\n      \"summary\": \"arXiv:2605.04539v1 Announce Type: new Abstract: Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the \\\"alignment tax\\\" of single-signal optimization. 
Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.22480v2\",\n      \"url\": \"https://arxiv.org/abs/2602.22480\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"VeRO: An Evaluation Harness for Agents to Optimize Agents\",\n      \"author\": [\n        \"Varun Ursekar (Emily)\",\n        \"Apaar Shanker (Emily)\",\n        \"Veronica Chatrath (Emily)\",\n        \"Yuan (Emily)\",\n        \"Xue\",\n        \"Sam Denton\"\n      ],\n      \"summary\": \"arXiv:2602.22480v2 Announce Type: replace Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. 
Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-05T16:07:19.561Z\"\n    },\n    {\n      \"id\": \"f44dc79f7ed763f48b0544dd7231ac19fe6346d8e7081314dc755ea0cd5ce723\",\n      \"url\": \"https://www.reddit.com/r/LocalLLaMA/comments/1t5yajb/qwen36_27b_uncensored_heretic_v2_native_mtp/\",\n      \"source_url\": \"https://www.reddit.com/r/localllama.rss\",\n      \"title\": \"Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats.\",\n      \"author\": [\n        \"/u/LLMFan46\"\n      ],\n      \"summary\": \"llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: 
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 All are confirmed to have their full 15 MTPs retained and preserved. Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 submitted by /u/LLMFan46\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T02:59:31.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:35.893Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2511.06452v3\",\n      \"url\": \"https://arxiv.org/abs/2511.06452\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains\",\n      \"author\": [\n        \"Leyan Xue\",\n        \"Changqing Zhang\",\n        \"Kecheng Xue\",\n        \"Xiaohong Liu\",\n        \"Guangyu Wang\",\n        \"Zongbo Han\"\n      ],\n      \"summary\": \"arXiv:2511.06452v3 Announce Type: replace Abstract: Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. 
On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"d784734c9cb6667fe53b569b13f402f11bd62e5d676abe30f47114e90c834d23\",\n      \"url\": \"https://doi.org/10.7910/DVN/RYY9V9\",\n      \"source_url\": \"https://dataverse.harvard.edu/\",\n      \"title\": \"TraceSuite Review Artifact\",\n      \"author\": [\n        \"NeurIPS, AuthorAno\"\n      ],\n      \"summary\": \"Review artifact for a benchmark and dataset submission. 
Contains trajectory metadata, a representative sample, and sharded image archives.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T06:01:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:35:52.266Z\"\n    }\n  ],\n  \"total\": 2985\n}"
    }
  ],
  "details": {
    "items": [
      {
        "id": "oai:arXiv.org:2605.04454v1",
        "url": "https://arxiv.org/abs/2605.04454",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone",
        "author": [
          "Varad Vishwarupe",
          "Nigel Shadbolt",
          "Marina Jirotka",
          "Ivan Flechais"
        ],
        "summary": "arXiv:2605.04454v1 Announce Type: cross Abstract: Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2602.10144v2",
        "url": "https://arxiv.org/abs/2602.10144",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "When LLMs get significantly worse: A statistical approach to detect model degradations",
        "author": [
          "Jonas Kübler",
          "Kailash Budhathoki",
          "Matthäus Kleindessner",
          "Xiong Zhou",
          "Junming Yin",
          "Ashish Khetan",
          "George Karypis"
        ],
        "summary": "arXiv:2602.10144v2 Announce Type: replace-cross Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is an evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test allowing to efficiently detect model degradations, while guaranteeing a controlled rate of false positives. The crucial insight is that we have to confront the model scores on each sample, rather than aggregated on the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the largely adopted open source LM Evaluation Harness and provide a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04523v1",
        "url": "https://arxiv.org/abs/2605.04523",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation",
        "author": [
          "Ivan Bondarenko",
          "Roman Derunets",
          "Oleg Sedukhin",
          "Mikhail Komarov",
          "Ivan Chernov",
          "Mikhail Kulakov"
        ],
        "summary": "arXiv:2605.04523v1 Announce Type: new Abstract: We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost-performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "a0f60af6af70259d9ac714a9b879b056074869ef062450ebda96f1f65f23df29",
        "url": "https://github.com/bogini",
        "source_url": "https://github.com/trending/developers/typescript?since=daily",
        "title": "bogini",
        "author": [
          "bogini"
        ],
        "summary": "Popular repo: function-health-exporter - Turn your Function Health labs into LLM-ready reports",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T16:48:45.730Z",
        "fetched_at": "2026-05-07T16:48:46.617Z"
      },
      {
        "id": "oai:arXiv.org:2605.03858v1",
        "url": "https://arxiv.org/abs/2605.03858",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following",
        "author": [
          "Jaeyun Lee",
          "Junyoung Koh",
          "Zeynel Tok",
          "Hunar Batra",
          "Ronald Clark"
        ],
        "summary": "arXiv:2605.03858v1 Announce Type: new Abstract: Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:21.529Z"
      },
      {
        "id": "oai:arXiv.org:2605.05175v1",
        "url": "https://arxiv.org/abs/2605.05175",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge",
        "author": [
          "Perry E. Radau"
        ],
        "summary": "arXiv:2605.05175v1 Announce Type: cross Abstract: Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "55c5e35b9bbd92d15c88e3a7d0d8a7800ee1a3c0f0ad444c77ed2d5b62b143aa",
        "url": "https://github.com/VectifyAI/PageIndex",
        "source_url": "https://github.com/trending?since=daily",
        "title": "VectifyAI/PageIndex",
        "author": [
          "VectifyAI"
        ],
        "summary": "📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T16:45:06.412Z",
        "fetched_at": "2026-05-07T16:45:07.284Z"
      },
      {
        "id": "oai:arXiv.org:2602.00933v2",
        "url": "https://arxiv.org/abs/2602.00933",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers",
        "author": [
          "Chaithanya Bandi",
          "Ben Hertzberg",
          "Geobio Boo",
          "Tejas Polakam",
          "Jeff Da",
          "Sami Hassaan",
          "Manasi Sharma",
          "Andrew Park",
          "Ernesto Hernandez",
          "Dan Rambado",
          "Ivan Salazar",
          "Rafael Cruz",
          "Chetan Rane",
          "Ben Levin",
          "Brad Kenstler",
          "Bing Liu"
        ],
        "summary": "arXiv:2602.00933v2 Announce Type: replace-cross Abstract: The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Tasks use natural language prompts that avoid naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%, with primary failures arising from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark dataset to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T04:00:00.000Z",
        "fetched_at": "2026-05-05T16:07:19.561Z"
      },
      {
        "id": "oai:arXiv.org:2605.04357v1",
        "url": "https://arxiv.org/abs/2605.04357",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs",
        "author": [
          "Yixuan Mei",
          "Zikun Li",
          "Zixuan Chen",
          "Shiqi Pan",
          "Mengdi Wu",
          "Xupeng Miao",
          "Zhihao Jia",
          "K. V. Rashmi"
        ],
        "summary": "arXiv:2605.04357v1 Announce Type: cross Abstract: The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\\times$ over the best baseline, and delivers up to 2.39$\\times$ higher goodput under scarce resource availability.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "1cad32c447f9131e633154d70b4987bb207a86751c3c8538ce833d36909d798c",
        "url": "https://ahk.cardor.dev",
        "source_url": "https://news.ycombinator.com/rss",
        "title": "Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)",
        "author": null,
        "summary": "Comments",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T10:45:59.000Z",
        "fetched_at": "2026-05-07T16:00:17.897Z"
      },
      {
        "id": "38bf6ecc5fc6495ace1d93af960422ad1fe6b249a479a6470528e9ca7c98cf4e",
        "url": "https://trendshift.io/repositories/14565",
        "source_url": "https://trendshift.io/github-trending-repositories?trending-range=1",
        "title": "free-llm-api-resources",
        "author": [
          "cheahjs"
        ],
        "summary": "A list of free LLM inference resources accessible via API.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T16:05:38.142Z",
        "fetched_at": "2026-05-06T16:05:42.052Z"
      },
      {
        "id": "oai:arXiv.org:2605.04298v1",
        "url": "https://arxiv.org/abs/2605.04298",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs",
        "author": [
          "Stefano Bann\\`o",
          "Kate Knill",
          "Mark Gales"
        ],
        "summary": "arXiv:2605.04298v1 Announce Type: new Abstract: Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system's true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "cb49d85958c64b4aa4c473e84eef58173340bb3ed59f3ddf6d38ffd6b25c4f2b",
        "url": "https://trendshift.io/repositories/22628",
        "source_url": "https://trendshift.io/github-trending-repositories?trending-range=1",
        "title": "decolua/9router",
        "author": null,
        "summary": "Unlimited FREE AI coding. Connect Claude Code, Codex, Cursor, Cline, Copilot, Antigravity to FREE Claude/GPT/Gemini via 40+ providers. Auto-fallback, RTK -40% tokens, never hit limits.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T16:14:28.906Z",
        "fetched_at": "2026-05-07T16:14:29.690Z"
      },
      {
        "id": "oai:arXiv.org:2605.05170v1",
        "url": "https://arxiv.org/abs/2605.05170",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours",
        "author": [
          "The Verkor Team",
          "Ravi Krishna",
          "Suresh Krishna",
          "David Chin"
        ],
        "summary": "arXiv:2605.05170v1 Announce Type: cross Abstract: Driven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced \"Design Conductor\" (or just \"Conductor\"), a system capable of building a 5-stage Linux-capable RISC-V CPU in 12 hours. In this work, we introduce an updated multi-agent harness powered by frontier models released in April 2026, which is able to handle 80x larger tasks, at higher quality, fully autonomously. Following a brief introduction, we examine 4 designs that the system produced autonomously, including \"VerTQ\", an LLM inference accelerator which hard-wires support for TurboQuant in a 240-cycle pipeline, starting from the TurboQuant arXiv paper. VerTQ includes heavy compute processing, with 5129 FP16/32 units; the design was mapped to an FPGA at 125 MHz and consumes 5.7 mm^2 in TSMC 16FF (8 attention pipes). We review the key new characteristics that enabled these results. Finally, we analyze Design Conductor's token usage and other empirical characteristics, including its limitations.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.04135v1",
        "url": "https://arxiv.org/abs/2605.04135",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation",
        "author": [
          "David Gringras",
          "Misha Salahshoor"
        ],
        "summary": "arXiv:2605.04135v1 Announce Type: cross Abstract: Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about \"AI\" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of \"AI\" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "064c2984cfb59b2fa13683fc6a0d7ae1e6872fb7ccbd288d5ddcebb87d622f70",
        "url": "https://doi.org/10.7910/DVN/FB7V2B",
        "source_url": "https://dataverse.harvard.edu/",
        "title": "Indic MMLU",
        "author": [
          "Anonymous, Anonymous"
        ],
        "summary": "The Indic MMLU dataset is a multilingual adaptation of the Massive Multitask Language Understanding (MMLU) benchmark developed to evaluate the reasoning, knowledge comprehension, and multilingual capabilities of Large Language Models (LLMs) across Indian languages. The dataset consists of professionally translated and quality-filtered multiple-choice question-answer pairs spanning diverse academic and professional domains, including science, mathematics, history, law, medicine, engineering, humanities, and social sciences. The primary purpose of this dataset is to provide a standardized benchmark for assessing model performance in low-resource and linguistically diverse Indic settings. The dataset enables research in multilingual NLP, cross-lingual transfer learning, language alignment, and culturally grounded AI evaluation. The dataset was generated through a structured pipeline involving machine-assisted translation of the original English MMLU benchmark into selected Indic languages, followed by extensive quality filtering using translation evaluation metrics such as BLEU, chrF++, and TER. Additional validation steps were applied to preserve semantic fidelity, answer consistency, and linguistic fluency. The final data is provided in standardized machine-readable formats suitable for benchmarking and downstream evaluation workflows. Indic MMLU is intended for researchers, academic institutions, and industry practitioners working on multilingual AI systems, Indic language technologies, and large-scale language model evaluation. By extending a widely recognized benchmark into Indian languages, the dataset contributes toward more inclusive, representative, and culturally relevant evaluation standards for modern AI systems",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T06:07:44.000Z",
        "fetched_at": "2026-05-07T16:35:52.266Z"
      },
      {
        "id": "oai:arXiv.org:2605.04083v1",
        "url": "https://arxiv.org/abs/2605.04083",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals",
        "author": [
          "Tadhg Looram",
          "Lucas Nuzzi",
          "Kyle Waters",
          "Steven Dillmann"
        ],
        "summary": "arXiv:2605.04083v1 Announce Type: new Abstract: Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from $75.9\\%$ to $89.6\\%$ (strict common-subset agreement: $77.8\\%$ to $92.1\\%$), while compact juries exhibit substantially higher internal dissent (3--2 split rate $28.7\\%$--$32.4\\%$) than frontier juries ($6.1\\%$--$11.5\\%$). Verifier traces further show that compact juries reduce per-criterion judging cost to roughly $4.2\\%$--$5.6\\%$ of frontier and latency to roughly $21.7\\%$--$27.1\\%$, even as aggregated task-level outcomes often remain comparatively stable.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04118v1",
        "url": "https://arxiv.org/abs/2605.04118",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation",
        "author": [
          "Cong Liu",
          "Milong Ren",
          "Jiaqi Guan",
          "Chengyue Gong",
          "Jinyuan Sun",
          "Xinshi Chen",
          "Wenzhi Xiao"
        ],
        "summary": "arXiv:2605.04118v1 Announce Type: cross Abstract: Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.04503v1",
        "url": "https://arxiv.org/abs/2605.04503",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning",
        "author": [
          "Yuancheng Wei",
          "Haojie Zhang",
          "Linli Yao",
          "Lei Li",
          "Jiali Chen",
          "Tao Huang",
          "Yiting Lu",
          "Duojun Huang",
          "Xin Li",
          "Zhao Zhong"
        ],
        "summary": "arXiv:2605.04503v1 Announce Type: cross Abstract: Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2601.22725v3",
        "url": "https://arxiv.org/abs/2601.22725",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation",
        "author": [
          "Jin Li",
          "Tao Chen",
          "Shuai Jiang",
          "Weijie Wang",
          "Jingwen Luo",
          "Chenhui Wu"
        ],
        "summary": "arXiv:2601.22725v3 Announce Type: replace-cross Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \\times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.03792v1",
        "url": "https://arxiv.org/abs/2605.03792",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "TriBench-Ko: Evaluating LLM Risks in Judicial Workflows",
        "author": [
          "Haesung Lee",
          "Gyubin Choi",
          "Eun-Ju Lee",
          "So-Min Lee",
          "Youkang Ko",
          "Dogyoon Lim",
          "Sung-Kyoung Jang",
          "Yohan Jo"
        ],
        "summary": "arXiv:2605.03792v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:21.529Z"
      },
      {
        "id": "oai:arXiv.org:2605.03936v1",
        "url": "https://arxiv.org/abs/2605.03936",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models",
        "author": [
          "Daniel Drucker",
          "Kyle Mahowald"
        ],
        "summary": "arXiv:2605.03936v1 Announce Type: new Abstract: Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:21.529Z"
      },
      {
        "id": "oai:arXiv.org:2603.16659v2",
        "url": "https://arxiv.org/abs/2603.16659",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "LLMs learn scientific taste from institutional traces across the social sciences",
        "author": [
          "Ziqin Gong",
          "Ning Li",
          "Huaikang Zhou"
        ],
        "summary": "arXiv:2603.16659v2 Announce Type: replace Abstract: Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say \"I'm sure\" versus \"I'm guessing.\" Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.02974v1",
        "url": "https://arxiv.org/abs/2605.02974",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "PHBench: A Benchmark for Predicting Startup Series A Funding from Product Hunt Launch Signals",
        "author": [
          "Yagiz Ihlamur",
          "Ben Griffin",
          "Rick Chen"
        ],
        "summary": "arXiv:2605.02974v1 Announce Type: cross Abstract: Structured launch signals on Product Hunt contain statistically significant predictive information for Series A funding outcomes. We construct PHBench from 67,292 featured Product Hunt posts spanning 2019-2025, linked to Crunchbase funding records via deterministic domain matching, identifying 528 verified Series A raises within 18 months of launch (positive rate: 0.78%). Our best-performing model, a three-component ensemble (ENS_avg, ENS_ISO, XGB) selected by validation F0.5, achieves F0.5 = 0.097 and AP = 0.037 (95% CI: 0.024-0.072; 4.7x lift over random) on the private held-out test set (103 positives). A paired bootstrap confirms a statistically credible advantage over the logistic regression baseline (AP delta: +0.013, 95% CI: [0.004, 0.039], p < 0.001; F0.5 delta: +0.056, 95% CI: [0.006, 0.122], p = 0.016). Validation-set metrics (F0.5 = 0.284, AP = 0.126) reflect best-of-144 selection bias on 53 positives and are reported for benchmark reproducibility only. We further evaluate three zero-shot Gemini models (Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 3.1 Pro) in an anonymized numerical setting. The best LLM achieves AP = 0.034 (Gemini 3 Flash), below the LR baseline AP of 0.044. Notably, the most capable Gemini variant (Gemini 3.1 Pro, AP = 0.023) performs worst -- an unexpected pattern that warrants further investigation across providers and prompting strategies. Both ML and LLM models show the same temporal performance decay tracking the 2020-2021 funding boom and subsequent contraction, confirming the dataset captures genuine market structure rather than noise. PHBench provides a reproducible framework comprising public training, validation, and blind test splits; 61 engineered features; a five-metric evaluation harness; and a public leaderboard at https://phbench.com. All code, baseline models, and anonymized dataset splits are publicly available.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "ad87ab2f75834cf15e480f1ab02abe44dee472a1f5fcb50674048dce52e5976d",
        "url": "https://github.com/darkrishabh/agent-skills-eval",
        "source_url": "https://news.ycombinator.com/rss",
        "title": "Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs",
        "author": null,
        "summary": "Comments",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T06:12:25.000Z",
        "fetched_at": "2026-05-07T10:00:03.704Z"
      },
      {
        "id": "oai:arXiv.org:2605.04539v1",
        "url": "https://arxiv.org/abs/2605.04539",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization",
        "author": [
          "Qiming Bao",
          "Juho Leinonen",
          "Paul Denny",
          "Michael J. Witbrock"
        ],
        "summary": "arXiv:2605.04539v1 Announce Type: new Abstract: Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the \"alignment tax\" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2602.22480v2",
        "url": "https://arxiv.org/abs/2602.22480",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "VeRO: An Evaluation Harness for Agents to Optimize Agents",
        "author": [
          "Varun Ursekar (Emily)",
          "Apaar Shanker (Emily)",
          "Veronica Chatrath (Emily)",
          "Yuan (Emily)",
          "Xue",
          "Sam Denton"
        ],
        "summary": "arXiv:2602.22480v2 Announce Type: replace Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T04:00:00.000Z",
        "fetched_at": "2026-05-05T16:07:19.561Z"
      },
      {
        "id": "f44dc79f7ed763f48b0544dd7231ac19fe6346d8e7081314dc755ea0cd5ce723",
        "url": "https://www.reddit.com/r/LocalLLaMA/comments/1t5yajb/qwen36_27b_uncensored_heretic_v2_native_mtp/",
        "source_url": "https://www.reddit.com/r/localllama.rss",
        "title": "Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats.",
        "author": [
          "/u/LLMFan46"
        ],
        "summary": "llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 All are confirmed to have their full 15 MTPs retained and preserved. Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 submitted by /u/LLMFan46 [link] [留言]",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T02:59:31.000Z",
        "fetched_at": "2026-05-07T16:36:35.893Z"
      },
      {
        "id": "oai:arXiv.org:2511.06452v3",
        "url": "https://arxiv.org/abs/2511.06452",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains",
        "author": [
          "Leyan Xue",
          "Changqing Zhang",
          "Kecheng Xue",
          "Xiaohong Liu",
          "Guangyu Wang",
          "Zongbo Han"
        ],
        "summary": "arXiv:2511.06452v3 Announce Type: replace Abstract: Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "d784734c9cb6667fe53b569b13f402f11bd62e5d676abe30f47114e90c834d23",
        "url": "https://doi.org/10.7910/DVN/RYY9V9",
        "source_url": "https://dataverse.harvard.edu/",
        "title": "TraceSuite Review Artifact",
        "author": [
          "NeurIPS, AuthorAno"
        ],
        "summary": "Review artifact for a benchmark and dataset submission. Contains trajectory metadata, a representative sample, and sharded image archives.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T06:01:00.000Z",
        "fetched_at": "2026-05-07T16:35:52.266Z"
      }
    ],
    "total": 2985
  }
}

▼✓Web search query: "lm-evaluation-harness EleutherAI 2024 2025 overview architecture", results: 15

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\\\"hf\\\", pretrained=\\\"gpt2\\\") = \\\"hf\\\" = \\\"gpt2\\\"results = evaluator.evaluate(model, tasks=[\\\"hellaswag\\\"], num_fewshot=5) = = \\\"hellaswag\\\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\\\"gpt2\\\", = \\\"gpt2\\\" tasks=[\\\"hellaswag\\\", \\\"mmlu\\\"], = \\\"hellaswag\\\" \\\"mmlu\\\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\\\lm-eval --model \\\\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\\\ --model_args \\\\ --tasks mmlu,hellaswag \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size auto \\\\ --batch_size \\\\ --output_path results/mistral-7b.json \\\\ --output_path \\\\ --apply_chat_template --apply_chat_template. 
# Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\\\lm-eval --model \\\\ --model_args model=gpt-4-turbo \\\\ --model_args \\\\ --tasks mmlu,bbh \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size 10 --batch_size.\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.9 - ADS\",\n      \"url\": \"https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract\",\n      \"description\": \"by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825 Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889 Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870 Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876 Add device arg to model\\\\_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879 paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759 Delete scripts/cost\\\\_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985 Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980 [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997 fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998 Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002 [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000 Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993 
use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958 Revert \\\"feat: add question suffix (#2876)\\\" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007 [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013 [Fix] Update resolve\\\\_hf\\\\_chat\\\\_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992 Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987 [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011 add arab\\\\_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006 chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030 Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999 [Fix] acc\\\\_mutual\\\\_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035 Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025 add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995 remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041 [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983 Fallback to super implementation in fewshot\\\\_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023 Fix Typo in README and Comment in utils\\\\_mcq.py by @vtjl10 in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3057 fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061 mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070 Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060 bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073 New\"\n    },\n    {\n      \"title\": \"EleutherAI LM Evaluation Harness | VerifyWise AI Governance Library\",\n      \"url\": \"https://verifywise.ai/ai-governance-library/assessment-and-evaluation/model-evaluation-harness\",\n      \"description\": \"# EleutherAI LM Evaluation Harness. The EleutherAI LM Evaluation Harness is the Swiss Army knife of language model evaluation, offering a standardized way to benchmark LLMs across hundreds of tasks with just a few lines of code. Rather than cobbling together different evaluation scripts and dealing with inconsistent metrics, this open-source framework lets you run comprehensive assessments covering everything from basic language understanding to complex reasoning, safety, and alignment properties. It's become the de facto standard for reproducible LLM evaluation in the research community. The harness also handles the technical complexity of model evaluation automatically. The framework lets you run individual tasks, task groups, or comprehensive suites. For your first evaluation, try running a subset of tasks on a smaller model to understand the output format and timing. With hundreds of available tasks, it's easy to run evaluations that aren't relevant to your specific use case or to miss important benchmark categories.\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. 
· GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"LM Evaluation Harness\",\n      \"url\": \"https://slyracoon23.github.io/lm-evaluation-harness/\",\n      \"description\": \"Overview¶. 
The LM Evaluation Harness is designed to facilitate the integration of various API-based language models into a standardized evaluation framework.\"\n    },\n    {\n      \"title\": \"Evaluating LLMs — EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/projects/large-language-model-evaluation\",\n      \"description\": \"# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs  Next.\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. 
In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity….\"\n    },\n    {\n      \"title\": \"Integrating benchmarks into LM Evaluation Harness - Hugging Face\",\n      \"url\": \"https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness\",\n      \"description\": \"The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range of NLP benchmarks. Several steps are required to integrate a custom benchmark from a local dataset into lm evaluation harness, this blog will tackle integrating your custom dataset in multiple choice question fashion following the standard MMLU approach and the MMLU-Var approach. The standard MMLU benchmark uses a multiple choice format, The model receives a question along with a list of possible answers (A, B, C, D) and is prompted to select the correct one. For example, in a question about the capital of the UAE, the model sees both the question and all four answer choices in the input. 
So during the evaluation the choices are used by the evaluator to compute the probability of each choice as a continuation to the given prompt following this simplified formula for each output token:.\"\n    },\n    {\n      \"title\": \"Releases · EleutherAI/lm-evaluation-harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\n      \"description\": \"A framework for few-shot evaluation of language models. - Releases · EleutherAI/lm-evaluation-harness.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness - Jimmy Song\",\n      \"url\": \"https://jimmysong.io/ai/lm-evaluation-harness/\",\n      \"description\": \"LLM, AI Native Infra and Agentic AI. Recaps and insights from tech conferences like KubeCon. Thoughts and analysis on economics, culture, politics, and society. #### AI Infrastructure. AI infrastructure framework through Yin-Yang and Five Elements model from Eastern philosophy. AI-native infrastructure handbook for CTOs and platform leaders. Comprehensive directory of AI open-source software projects. Discover experiences and journeys beyond technology. Get in touch with Jimmy for inquiries. #### LinkedIn. Scan to connect on LinkedIn for professional networking. Scan to follow my X account for updates on cloud native and AI. The Language Model Evaluation Harness is a framework for large-scale, reproducible evaluation of generative language models across many tasks and datasets. The lm-evaluation-harness provides a unified interface and a large collection of tasks (e.g. Hellaswag, LAMBADA) for evaluating generative LMs. It supports local models, Hugging Face models, and commercial APIs. ReLE (chinese-llm-benchmark) is a continuously updated Chinese LLM evaluation …. 
DeepEval is an open-source LLM evaluation framework that provides modular ….\"\n    },\n    {\n      \"title\": \"Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment | by Frank Morales Aguilera | Artificial Intelligence in Plain English\",\n      \"url\": \"https://ai.plainenglish.io/language-model-evaluation-harness-a-comprehensive-tool-for-language-model-assessment-3666b55c9c25\",\n      \"description\": \"# Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment. The **Language Model Evaluation Harness** is a powerful tool designed to evaluate language models in a standardized and comprehensive manner[1]. The Language Model Evaluation Harness offers a wide range of features that make it a versatile tool for language model evaluation[1,2]:. 1. **Unified Framework**: It provides a unified interface for testing generative language models on many evaluation tasks[1,2]. 2. **Broad Benchmark Support**: The harness supports over 60 standard academic benchmarks for language models, with hundreds of subtasks and variants implemented[1,2]. The Language Model Evaluation Harness is the backend for Hugging Face’s popular Open LLM Leaderboard [1,2]. This widespread usage is a testament to the tool’s credibility and significant role in language model research and development. ### Standard evaluation metrics for language models. In conclusion, the Language Model Evaluation Harness is a crucial tool that addresses the challenge of comprehensively auditing the performance of language models [3].\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/12608602\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. 
The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/lm_eval/tasks/aime/README.md at main\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/aime/README.md\",\n      \"description\": \"A framework for few-shot evaluation of language models. - lm-evaluation-harness/lm_eval/tasks/aime/README.md at main · EleutherAI/lm-evaluation-harness.\"\n    },\n    {\n      \"title\": \"LM Eval Harness - EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/artifacts/lm-eval-harness\",\n      \"description\": \"LM Eval Harness. Library. 13 May. Written By Stella Biderman. Our library for reproducible and transparent evaluation of LLMs. NLPEvaluation.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\"hf\", pretrained=\"gpt2\") = \"hf\" = \"gpt2\"results = evaluator.evaluate(model, tasks=[\"hellaswag\"], num_fewshot=5) = = \"hellaswag\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\"gpt2\", = \"gpt2\" tasks=[\"hellaswag\", \"mmlu\"], = \"hellaswag\" \"mmlu\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\lm-eval --model \\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\ --model_args \\ --tasks mmlu,hellaswag \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size auto \\ --batch_size \\ --output_path results/mistral-7b.json \\ --output_path \\ --apply_chat_template --apply_chat_template. # Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\lm-eval --model \\ --model_args model=gpt-4-turbo \\ --model_args \\ --tasks mmlu,bbh \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size 10 --batch_size."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.9 - ADS",
        "url": "https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract",
        "description": "by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825 Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889 Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870 Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876 Add device arg to model_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879 paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759 Delete scripts/cost_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985 Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980 [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997 fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998 Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002 [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000 Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993 use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958 Revert \"feat: add question suffix (#2876)\" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007 [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013 [Fix] Update 
resolve_hf_chat_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992 Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987 [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011 add arab_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006 chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030 Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999 [Fix] acc_mutual_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035 Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025 add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995 remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041 [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983 Fallback to super implementation in fewshot_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023 Fix Typo in README and Comment in utils_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057 fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061 mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070 Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060 bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073 New"
      },
      {
        "title": "EleutherAI LM Evaluation Harness | VerifyWise AI Governance Library",
        "url": "https://verifywise.ai/ai-governance-library/assessment-and-evaluation/model-evaluation-harness",
        "description": "# EleutherAI LM Evaluation Harness. The EleutherAI LM Evaluation Harness is the Swiss Army knife of language model evaluation, offering a standardized way to benchmark LLMs across hundreds of tasks with just a few lines of code. Rather than cobbling together different evaluation scripts and dealing with inconsistent metrics, this open-source framework lets you run comprehensive assessments covering everything from basic language understanding to complex reasoning, safety, and alignment properties. It's become the de facto standard for reproducible LLM evaluation in the research community. The harness also handles the technical complexity of model evaluation automatically. The framework lets you run individual tasks, task groups, or comprehensive suites. For your first evaluation, try running a subset of tasks on a smaller model to understand the output format and timing. With hundreds of available tasks, it's easy to run evaluations that aren't relevant to your specific use case or to miss important benchmark categories."
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "LM Evaluation Harness",
        "url": "https://slyracoon23.github.io/lm-evaluation-harness/",
        "description": "Overview¶. The LM Evaluation Harness is designed to facilitate the integration of various API-based language models into a standardized evaluation framework."
      },
      {
        "title": "Evaluating LLMs — EleutherAI",
        "url": "https://www.eleuther.ai/projects/large-language-model-evaluation",
        "description": "# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility."
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem."
      },
      {
        "title": "Integrating benchmarks into LM Evaluation Harness - Hugging Face",
        "url": "https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness",
        "description": "The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range of NLP benchmarks. Several steps are required to integrate a custom benchmark from a local dataset into lm evaluation harness, this blog will tackle integrating your custom dataset in multiple choice question fashion following the standard MMLU approach and the MMLU-Var approach. The standard MMLU benchmark uses a multiple choice format, The model receives a question along with a list of possible answers (A, B, C, D) and is prompted to select the correct one. For example, in a question about the capital of the UAE, the model sees both the question and all four answer choices in the input. So during the evaluation the choices are used by the evaluator to compute the probability of each choice as a continuation to the given prompt following this simplified formula for each output token:."
      },
      {
        "title": "Releases · EleutherAI/lm-evaluation-harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
        "description": "A framework for few-shot evaluation of language models. - Releases · EleutherAI/lm-evaluation-harness."
      },
      {
        "title": "lm-evaluation-harness - Jimmy Song",
        "url": "https://jimmysong.io/ai/lm-evaluation-harness/",
        "description": "The Language Model Evaluation Harness is a framework for large-scale, reproducible evaluation of generative language models across many tasks and datasets. The lm-evaluation-harness provides a unified interface and a large collection of tasks (e.g. Hellaswag, LAMBADA) for evaluating generative LMs. It supports local models, Hugging Face models, and commercial APIs."
      },
      {
        "title": "Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment | by Frank Morales Aguilera | Artificial Intelligence in Plain English",
        "url": "https://ai.plainenglish.io/language-model-evaluation-harness-a-comprehensive-tool-for-language-model-assessment-3666b55c9c25",
        "description": "# Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment. The **Language Model Evaluation Harness** is a powerful tool designed to evaluate language models in a standardized and comprehensive manner[1]. The Language Model Evaluation Harness offers a wide range of features that make it a versatile tool for language model evaluation[1,2]:. 1. **Unified Framework**: It provides a unified interface for testing generative language models on many evaluation tasks[1,2]. 2. **Broad Benchmark Support**: The harness supports over 60 standard academic benchmarks for language models, with hundreds of subtasks and variants implemented[1,2]. The Language Model Evaluation Harness is the backend for Hugging Face’s popular Open LLM Leaderboard [1,2]. This widespread usage is a testament to the tool’s credibility and significant role in language model research and development. ### Standard evaluation metrics for language models. In conclusion, the Language Model Evaluation Harness is a crucial tool that addresses the challenge of comprehensively auditing the performance of language models [3]."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo",
        "url": "https://zenodo.org/records/12608602",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder."
      },
      {
        "title": "lm-evaluation-harness/lm_eval/tasks/aime/README.md at main",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/aime/README.md",
        "description": "A framework for few-shot evaluation of language models. - lm-evaluation-harness/lm_eval/tasks/aime/README.md at main · EleutherAI/lm-evaluation-harness."
      },
      {
        "title": "LM Eval Harness - EleutherAI",
        "url": "https://www.eleuther.ai/artifacts/lm-eval-harness",
        "description": "LM Eval Harness. Library. 13 May. Written By Stella Biderman. Our library for reproducible and transparent evaluation of LLMs. NLPEvaluation."
      }
    ]
  }
}

▼✓Web search query: "lm-evaluation-harness GitHub EleutherAI latest release features", results: 10

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Language Model Evaluation Harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You can find\"\n    },\n    {\n      \"title\": \"Language Model Evaluation Harness - GitHub\",\n      \"url\": \"https://github.com/HazyResearch/based-evaluation-harness\",\n      \"description\": \"4.0 release of lm-evaluation-harness is available ! New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier\"\n    },\n    {\n      \"title\": \"Releases · EleutherAI/lm-evaluation-harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\n      \"description\": \"We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm .\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/12608602\",\n      \"description\": \"We're releasing a new version of LM Eval Harness for PyPI users at long last. 
We intend to release new PyPI versions more frequently in the future.\"\n    },\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI's lm-evaluation-harness has emerged as one of the most robust and comprehensive frameworks for evaluating language models.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/task_guide.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md\",\n      \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the\"\n    },\n    {\n      \"title\": \"Has anyone gotten EleutherAI/lm-evaluation-harness to work on ...\",\n      \"url\": \"https://www.reddit.com/r/LocalLLaMA/comments/136akxj/has_anyone_gotten_eleutherailmevaluationharness/\",\n      \"description\": \"As the title says. 
GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models.\"\n    },\n    {\n      \"title\": \"Integrating benchmarks into LM Evaluation Harness - Hugging Face\",\n      \"url\": \"https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness\",\n      \"description\": \"The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md\",\n      \"description\": \"A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"This tool has been designed to integrate seamlessly with common datasets, offering customizable settings to suit a variety of tasks. One of the\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Language Model Evaluation Harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You can find"
      },
      {
        "title": "Language Model Evaluation Harness - GitHub",
        "url": "https://github.com/HazyResearch/based-evaluation-harness",
        "description": "4.0 release of lm-evaluation-harness is available ! New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier"
      },
      {
        "title": "Releases · EleutherAI/lm-evaluation-harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
        "description": "We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm ."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo",
        "url": "https://zenodo.org/records/12608602",
        "description": "We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future."
      },
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI's lm-evaluation-harness has emerged as one of the most robust and comprehensive frameworks for evaluating language models."
      },
      {
        "title": "lm-evaluation-harness/docs/task_guide.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md",
        "description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the"
      },
      {
        "title": "Has anyone gotten EleutherAI/lm-evaluation-harness to work on ...",
        "url": "https://www.reddit.com/r/LocalLLaMA/comments/136akxj/has_anyone_gotten_eleutherailmevaluationharness/",
        "description": "As the title says. GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models."
      },
      {
        "title": "Integrating benchmarks into LM Evaluation Harness - Hugging Face",
        "url": "https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness",
        "description": "The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range"
      },
      {
        "title": "lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md",
        "description": "A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language"
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "This tool has been designed to integrate seamlessly with common datasets, offering customizable settings to suit a variety of tasks. One of the"
      }
    ]
  }
}

▼thought

The feeds_search results are mostly about other evaluation benchmarks and not specifically about lm-evaluation-harness itself. Let me now fetch the GitHub README and the releases page for more detailed information, as well as some of the more informative pages.

▼✓Fetch web page content, URL: "https://github.com/EleutherAI/lm-evaluation-harness"

{
  "content": [
    {
      "type": "text",
      "text": "{\"url\":\"https://github.com/EleutherAI/lm-evaluation-harness\",\"error\":\"Navigation timeout of 60000 ms exceeded\"}"
    }
  ],
  "details": {
    "url": "https://github.com/EleutherAI/lm-evaluation-harness",
    "error": "Navigation timeout of 60000 ms exceeded"
  },
  "isError": true
}

▼✓Fetch web page content, URL: "https://github.com/EleutherAI/lm-evaluation-harness/releases"

{
  "content": [
    {
      "type": "text",
      "text": "{\"url\":\"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\"error\":\"Navigation timeout of 60000 ms exceeded\"}"
    }
  ],
  "details": {
    "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
    "error": "Navigation timeout of 60000 ms exceeded"
  },
  "isError": true
}

▼✓Fetch web page content, URL: "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-met..."

{
  "content": [
    {
      "type": "text",
      "text": "{\"url\":\"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\"error\":\"Navigation timeout of 60000 ms exceeded\"}"
    }
  ],
  "details": {
    "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
    "error": "Navigation timeout of 60000 ms exceeded"
  },
  "isError": true
}

▼thought

GitHub pages are timing out. Let me try fetching the raw README from GitHub's raw content, and also try other sources.
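
A minimal sketch of that fallback, assuming the common `raw.githubusercontent.com/<owner>/<repo>/<branch>/<path>` layout. The helper name `to_raw_github_url` and the `default_branch` parameter are hypothetical, introduced only for illustration, and branch names containing slashes are not handled.

```python
from urllib.parse import urlparse


def to_raw_github_url(url: str, default_branch: str = "main") -> str:
    """Map a github.com repo or blob URL to a raw.githubusercontent.com URL.

    Hypothetical helper: assumes single-segment branch names and, for a bare
    repository URL, that README.md sits at the repository root.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {url}")
    owner, repo = parts[0], parts[1]
    if len(parts) >= 5 and parts[2] == "blob":
        # Shape: /<owner>/<repo>/blob/<branch>/<path...>
        branch, path = parts[3], "/".join(parts[4:])
    else:
        # Bare repository URL: fall back to the root README on the default branch.
        branch, path = default_branch, "README.md"
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


if __name__ == "__main__":
    print(to_raw_github_url("https://github.com/EleutherAI/lm-evaluation-harness"))
```

The raw endpoint serves the file as plain text, so it avoids rendering the full GitHub page that timed out above.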

▼✓Fetch web page content, URL: "https://raw.githubusercontent.com/EleutherAI/lm-evaluation-harness/main/READM..."

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://raw.githubusercontent.com/EleutherAI/lm-evaluation-harness/main/README.md\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><pre># Language Model Evaluation Harness\\n\\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256836.svg)](https://doi.org/10.5281/zenodo.10256836)\\n\\n---\\n\\n## Latest News 📣\\n- [2025/12] **CLI refactored** with subcommands (`run`, `ls`, `validate`) and YAML config file support via `--config`. See the [CLI Reference](./docs/interface.md) and [Configuration Guide](./docs/config_files.md).\\n- [2025/12] **Lighter install**: Base package no longer includes `transformers`/`torch`. Install model backends separately: `pip install lm_eval[hf]`, `lm_eval[vllm]`, etc.\\n- [2025/07] Added `think_end_token` arg to `hf` (token/str), `vllm` and `sglang` (str) for stripping CoT reasoning traces from models that support it.\\n- [2025/03] Added support for steering HF models!\\n- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!\\n- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.\\n- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. 
**To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**\\n- [2024/07] New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.\\n\\n---\\n\\n## Announcement\\n\\n**A new v0.4.0 release of lm-evaluation-harness is available** !\\n\\nNew updates and features include:\\n\\n- **New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.**\\n- Internal refactoring\\n- Config-based task creation and configuration\\n- Easier import and sharing of externally-defined task config YAMLs\\n- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource\\n- More advanced configuration options, including output post-processing, answer extraction, and multiple LM generations per document, configurable fewshot settings, and more\\n- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more\\n- Logging and usability changes\\n- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more\\n\\nPlease see our updated documentation pages in `docs/` for more details.\\n\\nDevelopment will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https://discord.gg/eleutherai)!\\n\\n---\\n\\n## Overview\\n\\nThis project provides a unified framework to test generative language models on a large number of different evaluation tasks.\\n\\n**Features:**\\n\\n- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.\\n- Support for models loaded via 
[transformers](https://github.com/huggingface/transformers/) (including quantization via [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.\\n- Support for fast and memory-efficient inference with [vLLM](https://github.com/vllm-project/vllm).\\n- Support for commercial APIs including [OpenAI](https://openai.com), and [TextSynth](https://textsynth.com/).\\n- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).\\n- Support for local models and benchmarks.\\n- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.\\n- Easy support for custom prompts and evaluation metrics.\\n\\nThe Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;authuser=2&amp;cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.\\n\\n## Install\\n\\nTo install the `lm-eval` package from the github repository, run:\\n\\n```bash\\ngit clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness\\ncd lm-evaluation-harness\\npip install -e .\\n```\\n\\n### Installing Model Backends\\n\\nThe base installation provides the core evaluation framework. 
**Model backends must be installed separately** using optional extras:\\n\\nFor HuggingFace transformers models:\\n\\n```bash\\npip install \\\"lm_eval[hf]\\\"\\n```\\n\\nFor vLLM inference:\\n\\n```bash\\npip install \\\"lm_eval[vllm]\\\"\\n```\\n\\nFor API-based models (OpenAI, Anthropic, etc.):\\n\\n```bash\\npip install \\\"lm_eval[api]\\\"\\n```\\n\\nMultiple backends can be installed together:\\n\\n```bash\\npip install \\\"lm_eval[hf,vllm,api]\\\"\\n```\\n\\nA detailed table of all optional extras is available at the end of this document.\\n\\n## Basic Usage\\n\\n### Documentation\\n\\n| Guide | Description |\\n|-------|-------------|\\n| [CLI Reference](./docs/interface.md) | Command-line arguments and subcommands |\\n| [Configuration Guide](./docs/config_files.md) | YAML config file format and examples |\\n| [Python API](./docs/python-api.md) | Programmatic usage with `simple_evaluate()` |\\n| [Task Guide](./lm_eval/tasks/README.md) | Available tasks and task configuration |\\n\\nUse `lm-eval -h` to see available options, or `lm-eval run -h` for evaluation options.\\n\\nList available tasks with:\\n\\n```bash\\nlm-eval ls tasks\\n```\\n\\n### Hugging Face `transformers`\\n\\n&gt; [!Important]\\n&gt; To use the HuggingFace backend, first install: `pip install \\\"lm_eval[hf]\\\"`\\n\\nTo evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\\\\n    --tasks hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\nAdditional arguments can be provided to the model constructor using the `--model_args` flag. 
Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\\\"float\\\" \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\nModels that are loaded via both `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) and `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.\\n\\nBatch size selection can be automated by setting the  ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\\\"float\\\" \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size auto:4\\n```\\n\\n&gt; [!Note]\\n&gt; Just like you can provide a local path to `transformers.AutoModel`, you can also provide a local path to `lm_eval` via `--model_args pretrained=/path/to/model`\\n\\n#### Evaluating GGUF Models\\n\\n`lm-eval` supports evaluating models in GGUF format using the Hugging Face (`hf`) backend. 
This allows you to use quantized models compatible with `transformers`, `AutoModel`, and llama.cpp conversions.\\n\\nTo evaluate a GGUF model, pass the path to the directory containing the model weights, the `gguf_file`, and optionally a separate `tokenizer` path using the `--model_args` flag.\\n\\n**🚨 Important Note:**  \\nIf no separate tokenizer is provided, Hugging Face will attempt to reconstruct the tokenizer from the GGUF file — this can take **hours** or even hang indefinitely. Passing a separate tokenizer avoids this issue and can reduce tokenizer loading time from hours to seconds.\\n\\n**✅ Recommended usage:**\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=/path/to/gguf_folder,gguf_file=model-name.gguf,tokenizer=/path/to/tokenizer \\\\\\n    --tasks hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\n&gt; [!Tip]\\n&gt; Ensure the tokenizer path points to a valid Hugging Face tokenizer directory (e.g., containing tokenizer_config.json, vocab.json, etc.).\\n\\n#### Multi-GPU Evaluation with Hugging Face `accelerate`\\n\\nWe support three main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.\\n\\nTo perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:\\n\\n```bash\\naccelerate launch -m lm_eval --model hf \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --batch_size 16\\n```\\n\\n(or via `accelerate launch --no-python lm_eval`).\\n\\nFor cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.\\n\\n**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.\\n\\nThe second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*\\n\\nIn this 
setting, run the library *outside the `accelerate` launcher*, but passing `parallelize=True` to `--model_args` as follows:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --model_args parallelize=True \\\\\\n    --batch_size 16\\n```\\n\\nThis means that your model's weights will be split across all available GPUs.\\n\\nFor more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well:\\n\\n- `device_map_option`: How to split model weights across available GPUs. defaults to \\\"auto\\\".\\n- `max_memory_per_gpu`: the max GPU memory to use per GPU in loading the model.\\n- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.\\n- `offload_folder`: a folder where model weights will be offloaded to disk if needed.\\n\\nThe third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.\\n\\n```bash\\naccelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \\\\\\n    -m lm_eval --model hf \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --model_args parallelize=True \\\\\\n    --batch_size 16\\n```\\n\\nTo learn more about model parallelism and how to use it with the `accelerate` library, see the [accelerate documentation](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism)\\n\\n**Warning: We do not natively support multi-node evaluation using the `hf` model type! 
Please reference [our GPT-NeoX library integration](https://github.com/EleutherAI/gpt-neox/blob/main/eval.py) for an example of code in which a custom multi-machine evaluation script is written.**\\n\\n**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**\\n\\n#### Tensor Parallelism (native PyTorch)\\n\\nFor models that support PyTorch's native Tensor Parallelism (via DTensor), you can shard model weights across GPUs without `accelerate`'s device-map by passing `tp_plan=auto` in `--model_args`. Launch with `torchrun` or `accelerate launch`:\\n\\n```bash\\ntorchrun --nproc-per-node=4 -m lm_eval \\\\\\n    --model hf \\\\\\n    --model_args pretrained=google/gemma-4-31B-it,tp_plan=auto \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --batch_size 16\\n```\\n\\n**Constraints:**\\n\\n- `tp_plan` and `parallelize=True` are mutually exclusive — use one or the other.\\n- The number of key-value heads in the model must be divisible by `--nproc-per-node` (the TP degree).\\n- Requires PyTorch &gt;= 2.4 and a `transformers` version that exposes a TP plan for the model (v4.47+).\\n\\n### Steered Hugging Face `transformers` models\\n\\nTo evaluate a Hugging Face `transformers` model with steering vectors applied, specify the model type as `steered` and provide the path to either a PyTorch file containing pre-defined steering vectors, or a CSV file that specifies how to derive steering vectors from pretrained `sparsify` or `sae_lens` models (you will need to install the corresponding optional dependency for this method).\\n\\nSpecify pre-defined steering vectors:\\n\\n```python\\nimport torch\\n\\nsteer_config = {\\n    \\\"layers.3\\\": {\\n        \\\"steering_vector\\\": torch.randn(1, 
768),\\n        \\\"bias\\\": torch.randn(1, 768),\\n        \\\"steering_coefficient\\\": 1,\\n        \\\"action\\\": \\\"add\\\"\\n    },\\n}\\ntorch.save(steer_config, \\\"steer_config.pt\\\")\\n```\\n\\nSpecify derived steering vectors:\\n\\n```python\\nimport pandas as pd\\n\\npd.DataFrame({\\n    \\\"loader\\\": [\\\"sparsify\\\"],\\n    \\\"action\\\": [\\\"add\\\"],\\n    \\\"sparse_model\\\": [\\\"EleutherAI/sae-pythia-70m-32k\\\"],\\n    \\\"hookpoint\\\": [\\\"layers.3\\\"],\\n    \\\"feature_index\\\": [30],\\n    \\\"steering_coefficient\\\": [10.0],\\n}).to_csv(\\\"steer_config.csv\\\", index=False)\\n```\\n\\nRun the evaluation harness with steering vectors applied:\\n\\n```bash\\nlm_eval --model steered \\\\\\n    --model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\n### NVIDIA `nemo` models\\n\\n[NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) is a generative AI framework built for researchers and PyTorch developers working on language models.\\n\\nTo evaluate a `nemo` model, start by installing NeMo following [the documentation](https://github.com/NVIDIA/NeMo?tab=readme-ov-file#installation). We highly recommend using the NVIDIA PyTorch or NeMo container, especially if you have issues installing Apex or any other dependencies (see [latest released containers](https://github.com/NVIDIA/NeMo/releases)). Please also install the lm evaluation harness library following the instructions in [the Install section](https://github.com/EleutherAI/lm-evaluation-harness/tree/main?tab=readme-ov-file#install).\\n\\nNeMo models can be obtained through the [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/models) or on [NVIDIA's Hugging Face page](https://huggingface.co/nvidia). 
In [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling) there are conversion scripts to convert the `hf` checkpoints of popular models like llama, falcon, mixtral or mpt to `nemo`.\\n\\nRun a `nemo` model on one GPU:\\n\\n```bash\\nlm_eval --model nemo_lm \\\\\\n    --model_args path=&lt;path_to_nemo_model&gt; \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 32\\n```\\n\\nWe recommend unpacking the `nemo` model beforehand, because unpacking inside the Docker container may exhaust disk space. To do so, run:\\n\\n```bash\\nmkdir MY_MODEL\\ntar -xvf MY_MODEL.nemo -C MY_MODEL\\n```\\n\\n#### Multi-GPU evaluation with NVIDIA `nemo` models\\n\\nBy default, only one GPU is used. However, we support either data replication or tensor/pipeline parallelism during evaluation, on one node.\\n\\n1) To enable data replication, set `devices` in `model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:\\n\\n```bash\\ntorchrun --nproc-per-node=8 --no-python lm_eval \\\\\\n    --model nemo_lm \\\\\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=8 \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 32\\n```\\n\\n2) To enable tensor and/or pipeline parallelism, set `tensor_model_parallel_size` and/or `pipeline_model_parallel_size` in `model_args`. In addition, you must set `devices` equal to the product of `tensor_model_parallel_size` and `pipeline_model_parallel_size`. 
For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:\\n\\n```bash\\ntorchrun --nproc-per-node=4 --no-python lm_eval \\\\\\n    --model nemo_lm \\\\\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 32\\n```\\n\\nNote that it is recommended to substitute the `python` command by `torchrun --nproc-per-node=&lt;number of devices&gt; --no-python` to facilitate loading the model into the GPUs. This is especially important for large checkpoints loaded into multiple GPUs.\\n\\nNot supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.\\n\\n### Megatron-LM models\\n\\n[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) is NVIDIA's large-scale transformer training framework. This backend allows direct evaluation of Megatron-LM checkpoints without conversion.\\n\\n**Requirements:**\\n- Megatron-LM must be installed or accessible via `MEGATRON_PATH` environment variable\\n- PyTorch with CUDA support\\n\\n**Setup:**\\n\\n```bash\\n# Set environment variable pointing to Megatron-LM installation\\nexport MEGATRON_PATH=/path/to/Megatron-LM\\n```\\n\\n**Basic usage (single GPU):**\\n\\n```bash\\nlm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_type=HuggingFaceTokenizer,tokenizer_model=/path/to/tokenizer \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 1\\n```\\n\\n**Supported checkpoint formats:**\\n- Standard Megatron checkpoints (`model_optim_rng.pt`)\\n- Distributed checkpoints (`.distcp` format, auto-detected)\\n\\n#### Parallelism Modes\\n\\nThe Megatron-LM backend supports the following parallelism modes:\\n\\n| Mode | Configuration | Description |\\n|------|---------------|-------------|\\n| Single GPU | `devices=1` (default) | Standard single GPU evaluation |\\n| Data Parallelism | 
`devices&gt;1, TP=1` | Each GPU has a full model replica, data is distributed |\\n| Tensor Parallelism | `TP == devices` | Model layers are split across GPUs |\\n| Expert Parallelism | `EP == devices, TP=1` | For MoE models, experts are distributed across GPUs |\\n\\n&gt; [!Note]\\n&gt; - Pipeline Parallelism (PP &gt; 1) is not currently supported.\\n&gt; - Expert Parallelism (EP) cannot be combined with Tensor Parallelism (TP).\\n\\n**Data Parallelism (4 GPUs, each with full model replica):**\\n\\n```bash\\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=4 \\\\\\n    --tasks hellaswag\\n```\\n\\n**Tensor Parallelism (TP=2):**\\n\\n```bash\\ntorchrun --nproc-per-node=2 -m lm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=2,tensor_model_parallel_size=2 \\\\\\n    --tasks hellaswag\\n```\\n\\n**Expert Parallelism for MoE models (EP=4):**\\n\\n```bash\\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/moe_checkpoint,tokenizer_model=/path/to/tokenizer,devices=4,expert_model_parallel_size=4 \\\\\\n    --tasks hellaswag\\n```\\n\\n**Using extra_args for additional Megatron options:**\\n\\n```bash\\nlm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,extra_args=\\\"--no-rope-fusion --trust-remote-code\\\" \\\\\\n    --tasks hellaswag\\n```\\n\\n&gt; [!Note]\\n&gt; The `--use-checkpoint-args` flag is enabled by default, which loads model architecture parameters from the checkpoint. For checkpoints converted via Megatron-Bridge, this typically includes all necessary model configuration.\\n\\n#### Multi-GPU evaluation with OpenVINO models\\n\\nPipeline parallelism during evaluation is supported with OpenVINO models\\n\\nTo enable pipeline parallelism, set the `model_args` of `pipeline_parallel`. 
In addition, set `device` to `HETERO:&lt;GPU index1&gt;,&lt;GPU index2&gt;`, for example `HETERO:GPU.1,GPU.0`. The command to use pipeline parallelism of 2 is:\\n\\n```bash\\nlm_eval --model openvino \\\\\\n    --tasks wikitext \\\\\\n    --model_args pretrained=&lt;path_to_ov_model&gt;,pipeline_parallel=True \\\\\\n    --device HETERO:GPU.1,GPU.0\\n```\\n\\n### Tensor + Data Parallel and Optimized Inference with `vLLM`\\n\\nWe also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), with a particularly large speedup when splitting a model across multiple GPUs. The following example covers single-GPU or multi-GPU inference (tensor parallel, data parallel, or a combination of both):\\n\\n```bash\\nlm_eval --model vllm \\\\\\n    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \\\\\\n    --tasks lambada_openai \\\\\\n    --batch_size auto\\n```\\n\\nTo use vLLM, run `pip install \\\"lm_eval[vllm]\\\"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.\\n\\n&gt; [!Note]\\n&gt; `data_parallel_size&gt;1` dispatches each replica as a separate [ray](https://github.com/ray-project/ray) actor and requires `pip install ray`. Each actor reserves `tensor_parallel_size` GPUs (default 1).\\n\\nvLLM occasionally differs in output from Huggingface. 
We treat Huggingface as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vLLM results against HF.\\n\\n&gt; [!Tip]\\n&gt; For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!\\n\\n&gt; [!Tip]\\n&gt; Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use auto batch size, such as for Mistral-7B-v0.1, which defaults to a maximum length of 32k.\\n\\n### Tensor + Data Parallel and Fast Offline Batching Inference with `SGLang`\\n\\nWe support SGLang for efficient offline batch inference. Its **[Fast Backend Runtime](https://docs.sglang.ai/index.html)** delivers high performance through optimized memory management and parallel processing techniques. Key features include tensor parallelism, continuous batching, and support for various quantization methods (FP8/INT4/AWQ/GPTQ).\\n\\nTo use SGLang as the evaluation backend, please **install it in advance** by following the [SGLang installation documentation](https://docs.sglang.io/get_started/install.html#install-sglang).\\n\\n&gt; [!Tip]\\n&gt; Because of how [`Flashinfer`](https://docs.flashinfer.ai/), a fast attention kernel library, is installed, we don't include the dependencies of `SGLang` within [pyproject.toml](pyproject.toml). Note that `Flashinfer` also places some requirements on the `torch` version.\\n\\nSGLang's server arguments differ slightly from those of other backends; see [here](https://docs.sglang.io/advanced_features/server_arguments.html) for more information. 
We provide an example of the usage here:\\n\\n```bash\\nlm_eval --model sglang \\\\\\n    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \\\\\\n    --tasks gsm8k_cot \\\\\\n    --batch_size auto\\n```\\n\\n&gt; [!Tip]\\n&gt; When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:\\n&gt;\\n&gt; 1. Use a manual `batch_size`, rather than `auto`.\\n&gt; 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` in your model arguments, for example `--model_args pretrained=...,mem_fraction_static=0.7`.\\n&gt; 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).\\n\\n### Windows ML\\n\\nWe support **Windows ML** for hardware-accelerated inference on Windows platforms. This enables evaluation on CPU, GPU, and **NPU (Neural Processing Unit)** devices.\\n\\nFor background, see the [Windows ML overview](https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview).\\n\\nTo use Windows ML, install the required dependencies:\\n\\n```bash\\npip install wasdk-Microsoft.Windows.AI.MachineLearning[all] wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap onnxruntime-windowsml onnxruntime-genai-winml\\n```\\n\\nEvaluate an ONNX Runtime GenAI LLM on NPU/GPU/CPU on Windows:\\n\\n```bash\\nlm_eval --model winml \\\\\\n    --model_args pretrained=/path/to/onnx/model \\\\\\n    --tasks mmlu \\\\\\n    --batch_size 1\\n```\\n\\n&gt; [!Note]\\n&gt; The Windows ML backend is ONLY for ONNX Runtime GenAI model format. Models targeting `transformers.js` won't work. You can verify this by finding the `genai_config.json` file in the model folder.\\n\\n&gt; [!Note]\\n&gt; To run an ONNX Runtime GenAI model on the target device, you MUST convert the original model to that vendor and device type. Converted models will not work, or will work poorly, on other vendors or device types. 
To learn more on model conversion, please visit [Microsoft AI Tool Kit](https://code.visualstudio.com/docs/intelligentapps/modelconversion)\\n\\n### Model APIs and Inference Servers\\n\\n&gt; [!Important]\\n&gt; To use API-based models, first install: `pip install \\\"lm_eval[api]\\\"`\\n\\nOur library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.\\n\\nTo call a hosted model, use:\\n\\n```bash\\nexport OPENAI_API_KEY=YOUR_KEY_HERE\\nlm_eval --model openai-completions \\\\\\n    --model_args model=davinci-002 \\\\\\n    --tasks lambada_openai,hellaswag\\n```\\n\\nWe also support using your own local inference server with servers that mirror the OpenAI Completions and ChatCompletions APIs.\\n\\n```bash\\nlm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16\\n```\\n\\nNote that for externally hosted models, configs such as `--device` which relate to where to place a local model should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.\\n\\n| API or Inference Server                                                                                                   | Implemented?                                                                                            
| `--model &lt;xxx&gt;` name                                  | Models supported:                                                                                                                                               | Request Types:                                                                 |\\n|---------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|\\n| OpenAI Completions                                                                                                        | :heavy_check_mark:                                                                                      | `openai-completions`, `local-completions`             | All OpenAI Completions API models                                                                                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| OpenAI ChatCompletions                                                                                                    | :heavy_check_mark:                                                                                      | `openai-chat-completions`, `local-chat-completions`   | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt)                                                                                   | `generate_until` (no logprobs)                                                 |\\n| Anthropic                                                                                 
                                | :heavy_check_mark:                                                                                      | `anthropic`                                           | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model)                                                                    | `generate_until` (no logprobs)                                                 |\\n| Anthropic Chat                                                                                                            | :heavy_check_mark:                                                                                      | `anthropic-chat`, `anthropic-chat-completions`        | [Supported Anthropic Engines](https://docs.anthropic.com/claude/docs/models-overview)                                                                           | `generate_until` (no logprobs)                                                 |\\n| [LiteLLM](https://github.com/BerriAI/litellm) (gateway to 100+ providers)                                                 | :heavy_check_mark:                                                                                      | `litellm`, `litellm-chat`, `litellm-chat-completions` | [All LiteLLM-supported providers](https://docs.litellm.ai/docs/providers)                                                                                       | `generate_until` (no logprobs)                                                 |\\n| Textsynth                                                                                                                 | :heavy_check_mark:                                                                                      | `textsynth`                                           | [All supported engines](https://textsynth.com/documentation.html#engines)                                                                                       | `generate_until`, `loglikelihood`, 
`loglikelihood_rolling`                     |\\n| Cohere                                                                                                                    | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A                                                   | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models)                                                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark:                                                                                      | `gguf`, `ggml`                                        | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp)                                                                                     | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |\\n| vLLM                                                                                                                      | :heavy_check_mark:                                                                                      | `vllm`                                                | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html)                                                                   | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Mamba                                                                                                                     | :heavy_check_mark:                                                                                      | `mamba_ssm`                                           | [Mamba architecture Language Models via the `mamba_ssm` 
package](https://huggingface.co/state-spaces)                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Huggingface Optimum (Causal LMs)                                                                                          | :heavy_check_mark:                                                                                      | `openvino`                                            | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format                                 | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Huggingface Optimum-intel IPEX (Causal LMs)                                                                               | :heavy_check_mark:                                                                                      | `ipex`                                                | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Huggingface Optimum-habana (Causal LMs)                                                                                   | :heavy_check_mark:                                                                                      | `habana`                                              | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Neuron via AWS Inf2 (Causal LMs)                                                                                          | :heavy_check_mark:                                                                    
                  | `neuronx`                                             | Any decoder-only AutoModelForCausalLM supported to run on [huggingface-ami image for inferentia2](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| NVIDIA NeMo                                                                                                               | :heavy_check_mark:                                                                                      | `nemo_lm`                                             | [All supported models](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/core/core.html#nemo-models)                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| NVIDIA Megatron-LM                                                                                                        | :heavy_check_mark:                                                                                      | `megatron_lm`                                         | [Megatron-LM GPT models](https://github.com/NVIDIA/Megatron-LM) (standard and distributed checkpoints)                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Watsonx.ai                                                                                                                | :heavy_check_mark:                                                                                      | `watsonx_llm`                                         | [Supported Watsonx.ai Engines](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx)                                      | `generate_until` `loglikelihood`                                               |\\n| Windows ML                                                                    
                                            | :heavy_check_mark:                                                                                      | `winml`                                               | [ONNX models in GenAI format](https://code.visualstudio.com/docs/intelligentapps/modelconversion)                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| [Your local inference server!](docs/API_guide.md)                                                                         | :heavy_check_mark:                                                                                      | `local-completions` or `local-chat-completions`       | Support for OpenAI API-compatible servers, with easy customization for other APIs.                                                                              | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n\\nModels which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.\\n\\nFor more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\\n\\n&gt; [!Note]\\n&gt; For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm answer extraction and scoring on generative tasks is performing as expected. 
Providing `system=\\\"&lt;some system prompt here&gt;\\\"` within `--model_args` for `anthropic-chat-completions`, to instruct the model what format to respond in, may be useful.\\n\\n### Other Frameworks\\n\\nA number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).\\n\\nTo create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).\\n\\n### Additional Features\\n\\n&gt; [!Note]\\n&gt; For tasks unsuitable for direct evaluation, either due to risks associated with executing untrusted code or complexities in the evaluation process, the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.\\n\\nIf you have a Metal-compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher). **Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. 
If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on `--device cpu` and `--device mps` match.**\\n\\n&gt; [!Note]\\n&gt; You can inspect what the LM inputs look like by running the following command:\\n&gt;\\n&gt; ```bash\\n&gt; python write_out.py \\\\\\n&gt;     --tasks &lt;task1,task2,...&gt; \\\\\\n&gt;     --num_fewshot 5 \\\\\\n&gt;     --num_examples 10 \\\\\\n&gt;     --output_base_path /path/to/output/folder\\n&gt; ```\\n&gt;\\n&gt; This will write out one text file for each task.\\n\\nTo verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:\\n\\n```bash\\nlm_eval --model openai \\\\\\n    --model_args engine=davinci-002 \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --check_integrity\\n```\\n\\n## Advanced Usage Tips\\n\\nFor models loaded with the HuggingFace  `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \\\\\\n    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \\\\\\n    --device cuda:0\\n```\\n\\nModels provided as delta weights can be easily loaded using the Hugging Face transformers library. 
Within --model_args, set the delta argument to specify the delta weights, and use the pretrained argument to designate the relative base model to which they will be applied:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=Ejafa/llama_7B,delta=lmsys/vicuna-7b-delta-v1.1 \\\\\\n    --tasks hellaswag\\n```\\n\\nGPTQ quantized models can be loaded using [GPTQModel](https://github.com/ModelCloud/GPTQModel) (faster) or [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)\\n\\nGPTQModel: add `,gptqmodel=True` to `model_args`\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=model-name-or-path,gptqmodel=True \\\\\\n    --tasks hellaswag\\n```\\n\\nAutoGPTQ: add `,autogptq=True` to `model_args`:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\\\\n    --tasks hellaswag\\n```\\n\\nWe support wildcards in task names, for example you can run all of the machine-translated lambada tasks via `--task lambada_openai_mt_*`.\\n\\n## Saving &amp; Caching Results\\n\\nTo save evaluation results provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.\\n\\n&gt; [!TIP]\\n&gt; Use `--use_cache &lt;DIR&gt;` to cache evaluation results and skip previously evaluated samples when resuming runs of the same (model, task) pairs. Note that caching is rank-dependent, so restart with the same GPU count if interrupted. You can also use --cache_requests to save dataset preprocessing steps for faster evaluation resumption.\\n\\nTo push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. 
Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub (see this [example dataset on the HF Hub](https://huggingface.co/datasets/KonradSzafer/lm-eval-results-demo)). For instance:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\\\\n    --tasks hellaswag \\\\\\n    --log_samples \\\\\\n    --output_path results \\\\\\n    --hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False\\n```\\n\\nThis allows you to easily download the results and samples from the Hub, using:\\n\\n```python\\nfrom datasets import load_dataset\\n\\nload_dataset(\\\"EleutherAI/lm-eval-results-private\\\", \\\"hellaswag\\\", \\\"latest\\\")\\n```\\n\\nFor a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!\\n\\n## Visualizing Results\\n\\nYou can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights &amp; Biases (W&amp;B) and Zeno.\\n\\n### Zeno\\n\\nYou can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.\\n\\nFirst, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).\\nAdd this key as an environment variable:\\n\\n```bash\\nexport ZENO_API_KEY=[your api key]\\n```\\n\\nYou'll also need to install the `lm_eval[zeno]` package extra.\\n\\nTo visualize the results, run the eval harness with the `--log_samples` and `--output_path` flags.\\nWe expect `output_path` to contain multiple folders that represent individual model names.\\nYou can thus run your evaluation on any number of tasks and models and upload all of the results as projects on 
Zeno.\\n\\n```bash\\nlm_eval \\\\\\n    --model hf \\\\\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\\\\n    --tasks hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8 \\\\\\n    --log_samples \\\\\\n    --output_path output/gpt-j-6B\\n```\\n\\nThen, you can upload the resulting data using the `zeno_visualize` script:\\n\\n```bash\\npython scripts/zeno_visualize.py \\\\\\n    --data_path output \\\\\\n    --project_name \\\"Eleuther Project\\\"\\n```\\n\\nThis will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.\\nIf you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\\n\\nYou can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).\\n\\n### Weights and Biases\\n\\nWith the [Weights and Biases](https://wandb.ai/site) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights &amp; Biases (W&amp;B) platform.\\n\\nThe integration provides functionality to:\\n\\n- automatically log the evaluation results,\\n- log the samples as W&amp;B Tables for easy visualization,\\n- log the `results.json` file as an artifact for version control,\\n- log the `&lt;task_name&gt;_eval_samples.json` file if the samples are logged,\\n- generate a comprehensive report for analysis and visualization with all the important metrics,\\n- log task- and CLI-specific configs,\\n- and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\\n\\nFirst, install the `lm_eval[wandb]` package extra with `pip install lm_eval[wandb]`.\\n\\nAuthenticate your machine with your unique W&amp;B token. Visit https://wandb.ai/authorize to get one. 
Then run `wandb login` in your command-line terminal.\\n\\nRun the eval harness as usual with the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma-separated string arguments.\\n\\n```bash\\nlm_eval \\\\\\n    --model hf \\\\\\n    --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\\\\n    --tasks hellaswag,mmlu_abstract_algebra \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8 \\\\\\n    --output_path output/phi-2 \\\\\\n    --limit 10 \\\\\\n    --wandb_args project=lm-eval-harness-integration \\\\\\n    --log_samples\\n```\\n\\nIn the stdout, you will find the link to the W&amp;B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb), and an example of how to integrate it beyond the CLI.\\n\\n## Contributing\\n\\nCheck out our [open issues](https://github.com/EleutherAI/lm-evaluation-harness/issues) and feel free to submit pull requests!\\n\\nFor more information on the library and how everything fits together, see our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs).\\n\\nTo get started with development, first clone the repository and install the dev dependencies:\\n\\n```bash\\ngit clone https://github.com/EleutherAI/lm-evaluation-harness\\ncd lm-evaluation-harness\\npip install -e \\\".[dev,hf]\\\"\\n```\\n\\n### Implementing new tasks\\n\\nTo implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).\\n\\nIn general, we follow this priority list for addressing concerns about prompting and other eval details:\\n\\n1. If there is widespread agreement among people who train LLMs, use the agreed-upon procedure.\\n2. If there is a clear and unambiguous official implementation, use that procedure.\\n3. 
If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.\\n4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. As before, prioritize choosing from among the implementations found in LLM training papers.\\n\\nThese are guidelines and not rules, and can be overruled in special circumstances.\\n\\nWe try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different\"\n}"
    }
  ],
  "details": {
    "url": "https://raw.githubusercontent.com/EleutherAI/lm-evaluation-harness/main/README.md",
    "content": "<div id=\"readability-page-1\" class=\"page\"><pre># Language Model Evaluation Harness\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256836.svg)](https://doi.org/10.5281/zenodo.10256836)\n\n---\n\n## Latest News 📣\n- [2025/12] **CLI refactored** with subcommands (`run`, `ls`, `validate`) and YAML config file support via `--config`. See the [CLI Reference](./docs/interface.md) and [Configuration Guide](./docs/config_files.md).\n- [2025/12] **Lighter install**: Base package no longer includes `transformers`/`torch`. Install model backends separately: `pip install lm_eval[hf]`, `lm_eval[vllm]`, etc.\n- [2025/07] Added `think_end_token` arg to `hf` (token/str), `vllm` and `sglang` (str) for stripping CoT reasoning traces from models that support it.\n- [2025/03] Added support for steering HF models!\n- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!\n- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.\n- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**\n- [2024/07] New Open LLM Leaderboard tasks have been added ! 
You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.\n\n---\n\n## Announcement\n\n**A new v0.4.0 release of lm-evaluation-harness is available** !\n\nNew updates and features include:\n\n- **New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.**\n- Internal refactoring\n- Config-based task creation and configuration\n- Easier import and sharing of externally-defined task config YAMLs\n- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource\n- More advanced configuration options, including output post-processing, answer extraction, and multiple LM generations per document, configurable fewshot settings, and more\n- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more\n- Logging and usability changes\n- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more\n\nPlease see our updated documentation pages in `docs/` for more details.\n\nDevelopment will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https://discord.gg/eleutherai)!\n\n---\n\n## Overview\n\nThis project provides a unified framework to test generative language models on a large number of different evaluation tasks.\n\n**Features:**\n\n- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.\n- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and 
[Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.\n- Support for fast and memory-efficient inference with [vLLM](https://github.com/vllm-project/vllm).\n- Support for commercial APIs including [OpenAI](https://openai.com), and [TextSynth](https://textsynth.com/).\n- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).\n- Support for local models and benchmarks.\n- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.\n- Easy support for custom prompts and evaluation metrics.\n\nThe Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;authuser=2&amp;cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.\n\n## Install\n\nTo install the `lm-eval` package from the github repository, run:\n\n```bash\ngit clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\n\n### Installing Model Backends\n\nThe base installation provides the core evaluation framework. 
**Model backends must be installed separately** using optional extras:\n\nFor HuggingFace transformers models:\n\n```bash\npip install \"lm_eval[hf]\"\n```\n\nFor vLLM inference:\n\n```bash\npip install \"lm_eval[vllm]\"\n```\n\nFor API-based models (OpenAI, Anthropic, etc.):\n\n```bash\npip install \"lm_eval[api]\"\n```\n\nMultiple backends can be installed together:\n\n```bash\npip install \"lm_eval[hf,vllm,api]\"\n```\n\nA detailed table of all optional extras is available at the end of this document.\n\n## Basic Usage\n\n### Documentation\n\n| Guide | Description |\n|-------|-------------|\n| [CLI Reference](./docs/interface.md) | Command-line arguments and subcommands |\n| [Configuration Guide](./docs/config_files.md) | YAML config file format and examples |\n| [Python API](./docs/python-api.md) | Programmatic usage with `simple_evaluate()` |\n| [Task Guide](./lm_eval/tasks/README.md) | Available tasks and task configuration |\n\nUse `lm-eval -h` to see available options, or `lm-eval run -h` for evaluation options.\n\nList available tasks with:\n\n```bash\nlm-eval ls tasks\n```\n\n### Hugging Face `transformers`\n\n&gt; [!Important]\n&gt; To use the HuggingFace backend, first install: `pip install \"lm_eval[hf]\"`\n\nTo evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\nAdditional arguments can be provided to the model constructor using the `--model_args` flag. 
Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\nModels that are loaded via both `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) and `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.\n\nBatch size selection can be automated by setting the  ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size auto:4\n```\n\n&gt; [!Note]\n&gt; Just like you can provide a local path to `transformers.AutoModel`, you can also provide a local path to `lm_eval` via `--model_args pretrained=/path/to/model`\n\n#### Evaluating GGUF Models\n\n`lm-eval` supports evaluating models in GGUF format using the Hugging Face (`hf`) backend. 
This allows you to use quantized models compatible with `transformers`, `AutoModel`, and llama.cpp conversions.\n\nTo evaluate a GGUF model, pass the path to the directory containing the model weights, the `gguf_file`, and optionally a separate `tokenizer` path using the `--model_args` flag.\n\n**🚨 Important Note:**  \nIf no separate tokenizer is provided, Hugging Face will attempt to reconstruct the tokenizer from the GGUF file — this can take **hours** or even hang indefinitely. Passing a separate tokenizer avoids this issue and can reduce tokenizer loading time from hours to seconds.\n\n**✅ Recommended usage:**\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=/path/to/gguf_folder,gguf_file=model-name.gguf,tokenizer=/path/to/tokenizer \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n&gt; [!Tip]\n&gt; Ensure the tokenizer path points to a valid Hugging Face tokenizer directory (e.g., containing tokenizer_config.json, vocab.json, etc.).\n\n#### Multi-GPU Evaluation with Hugging Face `accelerate`\n\nWe support three main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.\n\nTo perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:\n\n```bash\naccelerate launch -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --batch_size 16\n```\n\n(or via `accelerate launch --no-python lm_eval`).\n\nFor cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.\n\n**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.\n\nThe second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*\n\nIn this setting, run the library *outside the `accelerate` 
launcher*, but passing `parallelize=True` to `--model_args` as follows:\n\n```bash\nlm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\nThis means that your model's weights will be split across all available GPUs.\n\nFor more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well:\n\n- `device_map_option`: How to split model weights across available GPUs. defaults to \"auto\".\n- `max_memory_per_gpu`: the max GPU memory to use per GPU in loading the model.\n- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.\n- `offload_folder`: a folder where model weights will be offloaded to disk if needed.\n\nThe third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.\n\n```bash\naccelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \\\n    -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\nTo learn more about model parallelism and how to use it with the `accelerate` library, see the [accelerate documentation](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism)\n\n**Warning: We do not natively support multi-node evaluation using the `hf` model type! 
Please reference [our GPT-NeoX library integration](https://github.com/EleutherAI/gpt-neox/blob/main/eval.py) for an example of code in which a custom multi-machine evaluation script is written.**\n\n**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**\n\n#### Tensor Parallelism (native PyTorch)\n\nFor models that support PyTorch's native Tensor Parallelism (via DTensor), you can shard model weights across GPUs without `accelerate`'s device-map by passing `tp_plan=auto` in `--model_args`. Launch with `torchrun` or `accelerate launch`:\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval \\\n    --model hf \\\n    --model_args pretrained=google/gemma-4-31B-it,tp_plan=auto \\\n    --tasks lambada_openai,arc_easy \\\n    --batch_size 16\n```\n\n**Constraints:**\n\n- `tp_plan` and `parallelize=True` are mutually exclusive — use one or the other.\n- The number of key-value heads in the model must be divisible by `--nproc-per-node` (the TP degree).\n- Requires PyTorch &gt;= 2.4 and a `transformers` version that exposes a TP plan for the model (v4.47+).\n\n### Steered Hugging Face `transformers` models\n\nTo evaluate a Hugging Face `transformers` model with steering vectors applied, specify the model type as `steered` and provide the path to either a PyTorch file containing pre-defined steering vectors, or a CSV file that specifies how to derive steering vectors from pretrained `sparsify` or `sae_lens` models (you will need to install the corresponding optional dependency for this method).\n\nSpecify pre-defined steering vectors:\n\n```python\nimport torch\n\nsteer_config = {\n    \"layers.3\": {\n        \"steering_vector\": torch.randn(1, 768),\n        \"bias\": torch.randn(1, 768),\n       
 \"steering_coefficient\": 1,\n        \"action\": \"add\"\n    },\n}\ntorch.save(steer_config, \"steer_config.pt\")\n```\n\nSpecify derived steering vectors:\n\n```python\nimport pandas as pd\n\npd.DataFrame({\n    \"loader\": [\"sparsify\"],\n    \"action\": [\"add\"],\n    \"sparse_model\": [\"EleutherAI/sae-pythia-70m-32k\"],\n    \"hookpoint\": [\"layers.3\"],\n    \"feature_index\": [30],\n    \"steering_coefficient\": [10.0],\n}).to_csv(\"steer_config.csv\", index=False)\n```\n\nRun the evaluation harness with steering vectors applied:\n\n```bash\nlm_eval --model steered \\\n    --model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n### NVIDIA `nemo` models\n\n[NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) is a generative AI framework built for researchers and pytorch developers working on language models.\n\nTo evaluate a `nemo` model, start by installing NeMo following [the documentation](https://github.com/NVIDIA/NeMo?tab=readme-ov-file#installation). We highly recommended to use the NVIDIA PyTorch or NeMo container, especially if having issues installing Apex or any other dependencies (see [latest released containers](https://github.com/NVIDIA/NeMo/releases)). Please also install the lm evaluation harness library following the instructions in [the Install section](https://github.com/EleutherAI/lm-evaluation-harness/tree/main?tab=readme-ov-file#install).\n\nNeMo models can be obtained through [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/models) or in [NVIDIA's Hugging Face page](https://huggingface.co/nvidia). 
In [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling) there are conversion scripts to convert the `hf` checkpoints of popular models like llama, falcon, mixtral or mpt to `nemo`.\n\nRun a `nemo` model on one GPU:\n\n```bash\nlm_eval --model nemo_lm \\\n    --model_args path=&lt;path_to_nemo_model&gt; \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\nIt is recommended to unpack the `nemo` model beforehand, because unpacking inside the docker container may overflow disk space. To do so, run:\n\n```bash\nmkdir MY_MODEL\ntar -xvf MY_MODEL.nemo -C MY_MODEL\n```\n\n#### Multi-GPU evaluation with NVIDIA `nemo` models\n\nBy default, only one GPU is used. But we do support either data replication or tensor/pipeline parallelism during evaluation, on one node.\n\n1) To enable data replication, set `devices` in `--model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:\n\n```bash\ntorchrun --nproc-per-node=8 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=8 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\n2) To enable tensor and/or pipeline parallelism, set `tensor_model_parallel_size` and/or `pipeline_model_parallel_size` in `--model_args`. In addition, you also have to set `devices` equal to the product of `tensor_model_parallel_size` and `pipeline_model_parallel_size`. 
For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:\n\n```bash\ntorchrun --nproc-per-node=4 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\nNote that it is recommended to substitute the `python` command by `torchrun --nproc-per-node=&lt;number of devices&gt; --no-python` to facilitate loading the model into the GPUs. This is especially important for large checkpoints loaded into multiple GPUs.\n\nNot supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.\n\n### Megatron-LM models\n\n[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) is NVIDIA's large-scale transformer training framework. This backend allows direct evaluation of Megatron-LM checkpoints without conversion.\n\n**Requirements:**\n- Megatron-LM must be installed or accessible via `MEGATRON_PATH` environment variable\n- PyTorch with CUDA support\n\n**Setup:**\n\n```bash\n# Set environment variable pointing to Megatron-LM installation\nexport MEGATRON_PATH=/path/to/Megatron-LM\n```\n\n**Basic usage (single GPU):**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_type=HuggingFaceTokenizer,tokenizer_model=/path/to/tokenizer \\\n    --tasks hellaswag \\\n    --batch_size 1\n```\n\n**Supported checkpoint formats:**\n- Standard Megatron checkpoints (`model_optim_rng.pt`)\n- Distributed checkpoints (`.distcp` format, auto-detected)\n\n#### Parallelism Modes\n\nThe Megatron-LM backend supports the following parallelism modes:\n\n| Mode | Configuration | Description |\n|------|---------------|-------------|\n| Single GPU | `devices=1` (default) | Standard single GPU evaluation |\n| Data Parallelism | `devices&gt;1, TP=1` | Each GPU has a full model replica, data is 
distributed |\n| Tensor Parallelism | `TP == devices` | Model layers are split across GPUs |\n| Expert Parallelism | `EP == devices, TP=1` | For MoE models, experts are distributed across GPUs |\n\n&gt; [!Note]\n&gt; - Pipeline Parallelism (PP &gt; 1) is not currently supported.\n&gt; - Expert Parallelism (EP) cannot be combined with Tensor Parallelism (TP).\n\n**Data Parallelism (4 GPUs, each with full model replica):**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=4 \\\n    --tasks hellaswag\n```\n\n**Tensor Parallelism (TP=2):**\n\n```bash\ntorchrun --nproc-per-node=2 -m lm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=2,tensor_model_parallel_size=2 \\\n    --tasks hellaswag\n```\n\n**Expert Parallelism for MoE models (EP=4):**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=/path/to/moe_checkpoint,tokenizer_model=/path/to/tokenizer,devices=4,expert_model_parallel_size=4 \\\n    --tasks hellaswag\n```\n\n**Using extra_args for additional Megatron options:**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,extra_args=\"--no-rope-fusion --trust-remote-code\" \\\n    --tasks hellaswag\n```\n\n&gt; [!Note]\n&gt; The `--use-checkpoint-args` flag is enabled by default, which loads model architecture parameters from the checkpoint. For checkpoints converted via Megatron-Bridge, this typically includes all necessary model configuration.\n\n#### Multi-GPU evaluation with OpenVINO models\n\nPipeline parallelism during evaluation is supported with OpenVINO models\n\nTo enable pipeline parallelism, set the `model_args` of `pipeline_parallel`. 
In addition, you also have to set `--device` to `HETERO:&lt;GPU index1&gt;,&lt;GPU index2&gt;`, for example `HETERO:GPU.1,GPU.0`. The command to use pipeline parallelism of 2 is:\n\n```bash\nlm_eval --model openvino \\\n    --tasks wikitext \\\n    --model_args pretrained=&lt;path_to_ov_model&gt;,pipeline_parallel=True \\\n    --device HETERO:GPU.1,GPU.0\n```\n\n### Tensor + Data Parallel and Optimized Inference with `vLLM`\n\nWe also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), with the largest speedups when splitting a model across multiple GPUs. The following example covers single-GPU or multi-GPU inference (tensor parallel, data parallel, or a combination of both):\n\n```bash\nlm_eval --model vllm \\\n    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \\\n    --tasks lambada_openai \\\n    --batch_size auto\n```\n\nTo use vLLM, run `pip install \"lm_eval[vllm]\"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.\n\n&gt; [!Note]\n&gt; `data_parallel_size&gt;1` dispatches each replica as a separate [ray](https://github.com/ray-project/ray) actor and requires `pip install ray`. Each actor reserves `tensor_parallel_size` GPUs (default 1).\n\nvLLM occasionally differs in output from Huggingface. 
We treat Huggingface as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.\n\n&gt; [!Tip]\n&gt; For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!\n\n&gt; [!Tip]\n&gt; Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use auto batch size, such as for Mistral-7B-v0.1 which defaults to a maximum length of 32k.\n\n### Tensor + Data Parallel and Fast Offline Batching Inference with `SGLang`\n\nWe support SGLang for efficient offline batch inference. Its **[Fast Backend Runtime](https://docs.sglang.ai/index.html)** delivers high performance through optimized memory management and parallel processing techniques. Key features include tensor parallelism, continuous batching, and support for various quantization methods (FP8/INT4/AWQ/GPTQ).\n\nTo use SGLang as the evaluation backend, please **install it in advance** via SGLang documents [here](https://docs.sglang.io/get_started/install.html#install-sglang).\n\n&gt; [!Tip]\n&gt; Due to the installing method of [`Flashinfer`](https://docs.flashinfer.ai/)-- a fast attention kernel library, we don't include the dependencies of `SGLang` within [pyproject.toml](pyproject.toml). Note that the `Flashinfer` also has some requirements on `torch` version.\n\nSGLang's server arguments are slightly different from other backends, see [here](https://docs.sglang.io/advanced_features/server_arguments.html) for more information. 
We provide an example of the usage here:\n\n```bash\nlm_eval --model sglang \\\n    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \\\n    --tasks gsm8k_cot \\\n    --batch_size auto\n```\n\n&gt; [!Tip]\n&gt; When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:\n&gt;\n&gt; 1. Use a manual `batch_size`, rather than `auto`.\n&gt; 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` - Add to your model arguments for example `--model_args pretrained=...,mem_fraction_static=0.7`.\n&gt; 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).\n\n### Windows ML\n\nWe support **Windows ML** for hardware-accelerated inference on Windows platforms. This enables evaluation on CPU, GPU, and **NPU (Neural Processing Unit)** devices.\n\nWindows ML?\nhttps://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview\n\nTo use Windows ML, install the required dependencies:\n\n```bash\npip install wasdk-Microsoft.Windows.AI.MachineLearning[all] wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap onnxruntime-windowsml onnxruntime-genai-winml\n```\n\nEvaluate an ONNX Runtime GenAI LLM on NPU/GPU/CPU on Windows:\n\n```bash\nlm_eval --model winml \\\n    --model_args pretrained=/path/to/onnx/model \\\n    --tasks mmlu \\\n    --batch_size 1\n```\n\n&gt; [!Note]\n&gt; The Windows ML backend is ONLY for ONNX Runtime GenAI model format. Models targeting `transformers.js` won't work. You can verify this by finding the `genai_config.json` file in the model folder.\n\n&gt; [!Note]\n&gt; To run an ONNX Runtime GenAI model on the target device, you MUST convert the original model to that vendor and device type. Converted models won't work / work well on other vendor or device types. 
To learn more on model conversion, please visit [Microsoft AI Tool Kit](https://code.visualstudio.com/docs/intelligentapps/modelconversion)\n\n### Model APIs and Inference Servers\n\n&gt; [!Important]\n&gt; To use API-based models, first install: `pip install \"lm_eval[api]\"`\n\nOur library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.\n\nTo call a hosted model, use:\n\n```bash\nexport OPENAI_API_KEY=YOUR_KEY_HERE\nlm_eval --model openai-completions \\\n    --model_args model=davinci-002 \\\n    --tasks lambada_openai,hellaswag\n```\n\nWe also support using your own local inference server with servers that mirror the OpenAI Completions and ChatCompletions APIs.\n\n```bash\nlm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16\n```\n\nNote that for externally hosted models, configs such as `--device` which relate to where to place a local model should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.\n\n| API or Inference Server                                                                                                   | Implemented?                                                                                            
| `--model &lt;xxx&gt;` name                                  | Models supported:                                                                                                                                               | Request Types:                                                                 |\n|---------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|\n| OpenAI Completions                                                                                                        | :heavy_check_mark:                                                                                      | `openai-completions`, `local-completions`             | All OpenAI Completions API models                                                                                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| OpenAI ChatCompletions                                                                                                    | :heavy_check_mark:                                                                                      | `openai-chat-completions`, `local-chat-completions`   | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt)                                                                                   | `generate_until` (no logprobs)                                                 |\n| Anthropic                                                                                     
                            | :heavy_check_mark:                                                                                      | `anthropic`                                           | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model)                                                                    | `generate_until` (no logprobs)                                                 |\n| Anthropic Chat                                                                                                            | :heavy_check_mark:                                                                                      | `anthropic-chat`, `anthropic-chat-completions`        | [Supported Anthropic Engines](https://docs.anthropic.com/claude/docs/models-overview)                                                                           | `generate_until` (no logprobs)                                                 |\n| [LiteLLM](https://github.com/BerriAI/litellm) (gateway to 100+ providers)                                                 | :heavy_check_mark:                                                                                      | `litellm`, `litellm-chat`, `litellm-chat-completions` | [All LiteLLM-supported providers](https://docs.litellm.ai/docs/providers)                                                                                       | `generate_until` (no logprobs)                                                 |\n| Textsynth                                                                                                                 | :heavy_check_mark:                                                                                      | `textsynth`                                           | [All supported engines](https://textsynth.com/documentation.html#engines)                                                                                       | `generate_until`, `loglikelihood`, 
`loglikelihood_rolling`                     |\n| Cohere                                                                                                                    | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A                                                   | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models)                                                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark:                                                                                      | `gguf`, `ggml`                                        | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp)                                                                                     | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |\n| vLLM                                                                                                                      | :heavy_check_mark:                                                                                      | `vllm`                                                | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html)                                                                   | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Mamba                                                                                                                     | :heavy_check_mark:                                                                                      | `mamba_ssm`                                           | [Mamba architecture Language Models via the `mamba_ssm` 
package](https://huggingface.co/state-spaces)                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum (Causal LMs)                                                                                          | :heavy_check_mark:                                                                                      | `openvino`                                            | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format                                 | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-intel IPEX (Causal LMs)                                                                               | :heavy_check_mark:                                                                                      | `ipex`                                                | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-habana (Causal LMs)                                                                                   | :heavy_check_mark:                                                                                      | `habana`                                              | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Neuron via AWS Inf2 (Causal LMs)                                                                                          | :heavy_check_mark:                                                                        
              | `neuronx`                                             | Any decoder-only AutoModelForCausalLM supported to run on [huggingface-ami image for inferentia2](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA NeMo                                                                                                               | :heavy_check_mark:                                                                                      | `nemo_lm`                                             | [All supported models](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/core/core.html#nemo-models)                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA Megatron-LM                                                                                                        | :heavy_check_mark:                                                                                      | `megatron_lm`                                         | [Megatron-LM GPT models](https://github.com/NVIDIA/Megatron-LM) (standard and distributed checkpoints)                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Watsonx.ai                                                                                                                | :heavy_check_mark:                                                                                      | `watsonx_llm`                                         | [Supported Watsonx.ai Engines](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx)                                      | `generate_until` `loglikelihood`                                               |\n| Windows ML                                                                            
                                    | :heavy_check_mark:                                                                                      | `winml`                                               | [ONNX models in GenAI format](https://code.visualstudio.com/docs/intelligentapps/modelconversion)                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| [Your local inference server!](docs/API_guide.md)                                                                         | :heavy_check_mark:                                                                                      | `local-completions` or `local-chat-completions`       | Support for OpenAI API-compatible servers, with easy customization for other APIs.                                                                              | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n\nModels which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.\n\nFor more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\n\n&gt; [!Note]\n&gt; For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm answer extraction and scoring on generative tasks is performing as expected. 
Providing `system=\"&lt;some system prompt here&gt;\"` within `--model_args` for anthropic-chat-completions, to instruct the model what format to respond in, may be useful.\n\n### Other Frameworks\n\nA number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).\n\nTo create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).\n\n### Additional Features\n\n&gt; [!Note]\n&gt; For tasks unsuitable for direct evaluation, either due to risks associated with executing untrusted code or due to complexities in the evaluation process, the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.\n\nIf you have a Metal-compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher). **Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. 
If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on `--device cpu` and `--device mps` match.**\n\n&gt; [!Note]\n&gt; You can inspect what the LM inputs look like by running the following command:\n&gt;\n&gt; ```bash\n&gt; python write_out.py \\\n&gt;     --tasks &lt;task1,task2,...&gt; \\\n&gt;     --num_fewshot 5 \\\n&gt;     --num_examples 10 \\\n&gt;     --output_base_path /path/to/output/folder\n&gt; ```\n&gt;\n&gt; This will write out one text file for each task.\n\nTo verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:\n\n```bash\nlm_eval --model openai \\\n    --model_args engine=davinci-002 \\\n    --tasks lambada_openai,hellaswag \\\n    --check_integrity\n```\n\n## Advanced Usage Tips\n\nFor models loaded with the HuggingFace  `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \\\n    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \\\n    --device cuda:0\n```\n\nModels provided as delta weights can be easily loaded using the Hugging Face transformers library. 
Within `--model_args`, set the `delta` argument to specify the delta weights, and use the `pretrained` argument to designate the base model to which they will be applied:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=Ejafa/llama_7B,delta=lmsys/vicuna-7b-delta-v1.1 \\\n    --tasks hellaswag\n```\n\nGPTQ quantized models can be loaded using [GPTQModel](https://github.com/ModelCloud/GPTQModel) (faster) or [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).\n\nGPTQModel: add `,gptqmodel=True` to `model_args`:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,gptqmodel=True \\\n    --tasks hellaswag\n```\n\nAutoGPTQ: add `,autogptq=True` to `model_args`:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag\n```\n\nWe support wildcards in task names. For example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`.\n\n## Saving &amp; Caching Results\n\nTo save evaluation results, provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.\n\n&gt; [!TIP]\n&gt; Use `--use_cache &lt;DIR&gt;` to cache evaluation results and skip previously evaluated samples when resuming runs of the same (model, task) pairs. Note that caching is rank-dependent, so restart with the same GPU count if interrupted. You can also use `--cache_requests` to save dataset preprocessing steps for faster evaluation resumption.\n\nTo push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub - [example dataset on the HF Hub](https://huggingface.co/datasets/KonradSzafer/lm-eval-results-demo). 
For instance:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag \\\n    --log_samples \\\n    --output_path results \\\n    --hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False\n```\n\nThis allows you to easily download the results and samples from the Hub, using:\n\n```python\nfrom datasets import load_dataset\n\nload_dataset(\"EleutherAI/lm-eval-results-private\", \"hellaswag\", \"latest\")\n```\n\nFor a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!\n\n## Visualizing Results\n\nYou can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights &amp; Biases (W&amp;B) and Zeno.\n\n### Zeno\n\nYou can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.\n\nFirst, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).\nAdd this key as an environment variable:\n\n```bash\nexport ZENO_API_KEY=[your api key]\n```\n\nYou'll also need to install the `lm_eval[zeno]` package extra.\n\nTo visualize the results, run the eval harness with the `--log_samples` and `--output_path` flags.\nWe expect `output_path` to contain multiple folders that represent individual model names.\nYou can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --log_samples \\\n    --output_path output/gpt-j-6B\n```\n\nThen, you can upload the resulting data using the `zeno_visualize` script:\n\n```bash\npython 
scripts/zeno_visualize.py \\\n    --data_path output \\\n    --project_name \"Eleuther Project\"\n```\n\nThis will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.\nIf you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\n\nYou can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).\n\n### Weights and Biases\n\nWith the [Weights and Biases](https://wandb.ai/site) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights &amp; Biases (W&amp;B) platform.\n\nThe integration provides functionality to:\n\n- automatically log the evaluation results,\n- log the samples as W&amp;B Tables for easy visualization,\n- log the `results.json` file as an artifact for version control,\n- log the `&lt;task_name&gt;_eval_samples.json` file if the samples are logged,\n- generate a comprehensive report for analysis and visualization with all the important metrics,\n- log task-specific and CLI-specific configs,\n- log further details out of the box, such as the command used to run the evaluation, GPU/CPU counts, and timestamp.\n\nFirst, install the `lm_eval[wandb]` package extra with `pip install \"lm_eval[wandb]\"`.\n\nAuthenticate your machine with your unique W&amp;B token. Visit https://wandb.ai/authorize to get one, then run `wandb login` in your command line terminal.\n\nRun the eval harness as usual with the `--wandb_args` flag. 
Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma-separated string arguments.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n    --tasks hellaswag,mmlu_abstract_algebra \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --output_path output/phi-2 \\\n    --limit 10 \\\n    --wandb_args project=lm-eval-harness-integration \\\n    --log_samples\n```\n\nIn stdout, you will find the link to the W&amp;B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb), and an example of how to integrate it beyond the CLI.\n\n## Contributing\n\nCheck out our [open issues](https://github.com/EleutherAI/lm-evaluation-harness/issues) and feel free to submit pull requests!\n\nFor more information on the library and how everything fits together, see our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs).\n\nTo get started with development, first clone the repository and install the dev dependencies:\n\n```bash\ngit clone https://github.com/EleutherAI/lm-evaluation-harness\ncd lm-evaluation-harness\npip install -e \".[dev,hf]\"\n```\n\n### Implementing new tasks\n\nTo implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).\n\nIn general, we follow this priority list for addressing concerns about prompting and other eval details:\n\n1. If there is widespread agreement among people who train LLMs, use the agreed-upon procedure.\n2. If there is a clear and unambiguous official implementation, use that procedure.\n3. If there is widespread agreement among people who evaluate LLMs, use the agreed-upon procedure.\n4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. 
As before, prioritize choosing from among the implementations found in LLM training papers.\n\nThese are guidelines and not rules, and can be overruled in special circumstances.\n\nWe try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different"
  }
}

▼✓Web search, result count: 15, query: "lm-evaluation-harness v0.4.9 v0.4.8 new features tasks 2025"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.9 - NASA ADS\",\n      \"url\": \"https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract\",\n      \"description\": \"by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825 Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889 Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870 Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876 Add device arg to model\\\\_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879 paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759 Delete scripts/cost\\\\_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985 Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980 [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997 fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998 Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002 [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000 Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993 use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958 Revert \\\"feat: add question suffix (#2876)\\\" by @baberabb 
in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007 [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013 [Fix] Update resolve\\\\_hf\\\\_chat\\\\_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992 Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987 [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011 add arab\\\\_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006 chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030 Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999 [Fix] acc\\\\_mutual\\\\_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035 Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025 add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995 remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041 [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983 Fallback to super implementation in fewshot\\\\_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023 Fix Typo in README and Comment in utils\\\\_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057 fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061 mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070 Fix typos 
in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060 bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073 New\"\n    },\n    {\n      \"title\": \"user_13987929/lm-evaluation-harness - Gitee\",\n      \"url\": \"https://gitee.com/aikekj_admin/lm-evaluation-harness\",\n      \"description\": \"New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier import and sharing of externally-defined task\"\n    },\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\\\"hf\\\", pretrained=\\\"gpt2\\\") = \\\"hf\\\" = \\\"gpt2\\\"results = evaluator.evaluate(model, tasks=[\\\"hellaswag\\\"], num_fewshot=5) = = \\\"hellaswag\\\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\\\"gpt2\\\", = \\\"gpt2\\\" tasks=[\\\"hellaswag\\\", \\\"mmlu\\\"], = \\\"hellaswag\\\" \\\"mmlu\\\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. 
lm-eval --model hf \\\\lm-eval --model \\\\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\\\ --model_args \\\\ --tasks mmlu,hellaswag \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size auto \\\\ --batch_size \\\\ --output_path results/mistral-7b.json \\\\ --output_path \\\\ --apply_chat_template --apply_chat_template. # Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\\\lm-eval --model \\\\ --model_args model=gpt-4-turbo \\\\ --model_args \\\\ --tasks mmlu,bbh \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size 10 --batch_size.\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.8 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/14970487\",\n      \"description\": \"4.8 Release Notes. Key Improvements. New Backend Support: Added SGLang as new evaluation backend! Enabled model steering with vector support via\"\n    },\n    {\n      \"title\": \"Releases · EleutherAI/lm-evaluation-harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\n      \"description\": \"This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking aheas, we're gearing up for some bigger updates to\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. 
For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.9.1\",\n      \"url\": \"https://zenodo.org/records/16737642\",\n      \"description\": \"There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: v0.4.9.1. # lm-eval v0.4.9.1 Release Notes. This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking aheas, we're gearing up for some bigger updates to tackle common community pain points. 
We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon! ## New Benchmarks & Tasks. ### Tasks & Benchmarks:. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9...v0.4.9.1. ### Files (9.2 MB). ### EleutherAI/lm-evaluation-harness-v0.4.9.1.zip. | EleutherAI/lm-evaluation-harness-v0.4.9.1.zip md5:c9963fc62b221f792bfdbdc69681b33f | 9.2 MB | Preview Download |. ## Additional details. | Views  Total views | 48,774 | 1,131 |. | Downloads  Total downloads | 1,405 | 59 |. | Data volume  Total data volume | 4.6 GB | 597.1 MB |. More info on how stats are collected.... Release: v0.4.9.1. ## Details. ### Image URL. ### Target URL. Find out more on how we use cookies.\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.4\",\n      \"url\": \"https://zenodo.org/records/13694023\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.4. # lm-eval v0.4.4 Release Notes. Thank you to the HF team (@clefourrier, @NathanHB , @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here. ## New Tasks. As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! New tasks as of v0.4.4 include:. Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for \\\"parent\\\" tasks like `mmlu` to aggregate and report a unified score across a set of component \\\"subtasks\\\". 
There were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:.\"\n    },\n    {\n      \"title\": \"blockdialect - lm-evaluation-harness - Stanford GitLab\",\n      \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness\",\n      \"description\": \"A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/task_guide.md at main\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md\",\n      \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes\",\n      \"url\": \"https://zenodo.org/records/17728786\",\n      \"description\": \"There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes. ### Authors/Creators. ## Description. This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: **Python 3.10 is now the minimum required version**. ### New Benchmarks & Tasks. A big wave of new evaluation tasks this release:. ### Fixes & Improvements. ### Model & Backend Support. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2. ### Files (10.0 MB). ### EleutherAI/lm-evaluation-harness-v0.4.9.2.zip. | EleutherAI/lm-evaluation-harness-v0.4.9.2.zip md5:a1821ba45eee3362f49140f332ff6512 | 10.0 MB | Preview Download |. ## Additional details. | Views  Total views | 49,045 | 689 |. 
| Downloads  Total downloads | 1,420 | 39 |. | Data volume  Total data volume | 4.7 GB | 389.4 MB |. More info on how stats are collected.... Release: v0.4.9.2. ## Details. #### ``` 10.5281/zenodo.17728786 ```. ### Image URL. ### Target URL. ## Support. Find out more on how we use cookies.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md\",\n      \"description\": \"A list of supported tasks and task groupings can be viewed with lm-eval ls tasks . For more information, including a full list of task names and their\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/12608602\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. 
Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder.\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity….\"\n    },\n    {\n      \"title\": \"v0.4.0 - EleutherAI/lm-evaluation-harness - MyGit\",\n      \"url\": \"https://mygit.osfipin.com/release/132521756\",\n      \"description\": \"* Update `actions/setup-python`in CI workflows by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/365. * Update `lambada_openai` multilingual data source by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/370. 
* Update Pile Test/Val Download URLs by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/373. * Add accuracy metric to crows-pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/380. * Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/381. * Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in https://github.com/EleutherAI/lm-evaluation-harness/pull/403. * Update README installation instructions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/407. * fix: add seq2seq peft by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/418. * Add support for load\\\\_in\\\\_8bit and trust\\\\_remote\\\\_code model params by @philwee in https://github.com/EleutherAI/lm-evaluation-harness/pull/422. * Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/427. * [Refactor, WIP] Multiple Choice + loglikelihood\\\\_rolling support for YAML tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/467. * [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/486. * Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/501. * [Refactor] Docs update by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/577. * Add error handling for calling `.to(device)` by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/585.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.9 - NASA ADS",
        "url": "https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract",
        "description": "by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825; Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889; Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870; Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876; Add device arg to model_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879; paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759; Delete scripts/cost_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985; Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980; [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997; fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998; Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002; [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000; Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993; use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981; Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958; Revert \"feat: add question suffix (#2876)\" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007; [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013; [Fix] Update 
resolve_hf_chat_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992; Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987; [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011; add arab_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006; chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030; Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999; [Fix] acc_mutual_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035; Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025; add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995; remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041; [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983; Fallback to super implementation in fewshot_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023; Fix Typo in README and Comment in utils_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057; fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061; mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070; Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060; bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073; New"
      },
      {
        "title": "user_13987929/lm-evaluation-harness - Gitee",
        "url": "https://gitee.com/aikekj_admin/lm-evaluation-harness",
        "description": "New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier import and sharing of externally-defined task"
      },
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks; from lm_eval.models import get_model; model = get_model(\"hf\", pretrained=\"gpt2\"); results = evaluator.evaluate(model, tasks=[\"hellaswag\"], num_fewshot=5). import lm_eval; results = lm_eval.simple_evaluate(model=\"gpt2\", tasks=[\"hellaswag\", \"mmlu\"], num_fewshot=0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true --tasks mmlu,hellaswag --num_fewshot 5 --batch_size auto --output_path results/mistral-7b.json --apply_chat_template. # Set OPENAI_API_KEY environment variable first. lm-eval --model openai-chat --model_args model=gpt-4-turbo --tasks mmlu,bbh --num_fewshot 5 --batch_size 10."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.8 - Zenodo",
        "url": "https://zenodo.org/records/14970487",
        "description": "4.8 Release Notes. Key Improvements. New Backend Support: Added SGLang as new evaluation backend! Enabled model steering with vector support via"
      },
      {
        "title": "Releases · EleutherAI/lm-evaluation-harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
        "description": "This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to"
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.9.1",
        "url": "https://zenodo.org/records/16737642",
        "description": "There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: v0.4.9.1. # lm-eval v0.4.9.1 Release Notes. This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon! ## New Benchmarks & Tasks. ### Tasks & Benchmarks:. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9...v0.4.9.1."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.4",
        "url": "https://zenodo.org/records/13694023",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.4. # lm-eval v0.4.4 Release Notes. Thank you to the HF team (@clefourrier, @NathanHB , @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here. ## New Tasks. As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! New tasks as of v0.4.4 include:. Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for \"parent\" tasks like `mmlu` to aggregate and report a unified score across a set of component \"subtasks\". There were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:."
      },
      {
        "title": "blockdialect - lm-evaluation-harness - Stanford GitLab",
        "url": "https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness",
        "description": "A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You"
      },
      {
        "title": "lm-evaluation-harness/docs/task_guide.md at main",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md",
        "description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the"
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes",
        "url": "https://zenodo.org/records/17728786",
        "description": "There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes. This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: **Python 3.10 is now the minimum required version**. ### New Benchmarks & Tasks. A big wave of new evaluation tasks this release:. ### Fixes & Improvements. ### Model & Backend Support. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2. DOI: 10.5281/zenodo.17728786."
      },
      {
        "title": "lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md",
        "description": "A list of supported tasks and task groupings can be viewed with lm-eval ls tasks . For more information, including a full list of task names and their"
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo",
        "url": "https://zenodo.org/records/12608602",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder."
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ### Introduction. In the lm-evaluation-harness, the task config is the main way to select a preconfigured evaluation task, which determines the dataset, group, types of measurements, post-processing settings, etc. When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures that **lm-evaluation-harness** provides. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2-1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem."
      },
      {
        "title": "v0.4.0 - EleutherAI/lm-evaluation-harness - MyGit",
        "url": "https://mygit.osfipin.com/release/132521756",
        "description": "* Update `actions/setup-python` in CI workflows by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/365. * Update `lambada_openai` multilingual data source by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/370. * Update Pile Test/Val Download URLs by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/373. * Add accuracy metric to crows-pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/380. * Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/381. * Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in https://github.com/EleutherAI/lm-evaluation-harness/pull/403. * Update README installation instructions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/407. * fix: add seq2seq peft by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/418. * Add support for load_in_8bit and trust_remote_code model params by @philwee in https://github.com/EleutherAI/lm-evaluation-harness/pull/422. * Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/427. * [Refactor, WIP] Multiple Choice + loglikelihood_rolling support for YAML tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/467. * [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/486. * Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/501. * [Refactor] Docs update by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/577. * Add error handling for calling `.to(device)` by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/585."
      }
    ]
  }
}

▼✓Web search result count: 10, query: "lm-evaluation-harness architecture task model interface design internals"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\\\"hf\\\", pretrained=\\\"gpt2\\\") = \\\"hf\\\" = \\\"gpt2\\\"results = evaluator.evaluate(model, tasks=[\\\"hellaswag\\\"], num_fewshot=5) = = \\\"hellaswag\\\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\\\"gpt2\\\", = \\\"gpt2\\\" tasks=[\\\"hellaswag\\\", \\\"mmlu\\\"], = \\\"hellaswag\\\" \\\"mmlu\\\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\\\lm-eval --model \\\\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\\\ --model_args \\\\ --tasks mmlu,hellaswag \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size auto \\\\ --batch_size \\\\ --output_path results/mistral-7b.json \\\\ --output_path \\\\ --apply_chat_template --apply_chat_template. 
# Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\\\lm-eval --model \\\\ --model_args model=gpt-4-turbo \\\\ --model_args \\\\ --tasks mmlu,bbh \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size 10 --batch_size.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/new_task_guide.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md\",\n      \"description\": \"[Skip to content](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#start-of-content). [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You signed out in another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You switched accounts on another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.Dismiss alert. [History](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md)History. [Raw](https://github.com/EleutherAI/lm-evaluation-harness/raw/refs/heads/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#new-task-guide). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setup). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#creating-a-yaml-file). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#selecting-and-configuring-a-dataset). 
Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hellaswag/utils.py)):. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-template). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#basic-prompts). For example in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` given the name of one of the feature each. `doc_to_choice` can be directly given a list of text as option (See [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)). if a dataset feature is already a list, you can set the name of the feature as `doc_to_choice` (See [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13)). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-with-jinja-2). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#multiple-choice-format). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-python-functions-for-prompts). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#importing-a-prompt-from-promptsource). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setting-metrics). 
For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#optional-more-advanced-setup). For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-name--tags-registering-a-task). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs). See [./task_guide.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md) for a full list of related sub-keys. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#configuring-python-classes). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-validity). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#versioning). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-performance--equivalence). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-validity-checklist). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#submitting-your-task).\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. 
· GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. 
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/task_guide.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md\",\n      \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the\"\n    },\n    {\n      \"title\": \"LM Harness Evaluations — NVIDIA NeMo Microservices\",\n      \"url\": \"https://docs.nvidia.com/nemo/microservices/25.9.0/evaluate/flows/academic-benchmarks/lm-harness.html\",\n      \"description\": \"# LM Harness Evaluations#. LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks. All LM Harness evaluations use the same target structure. | `type` | Always `\\\"model\\\"` for LM Harness evaluations. The URL of the API endpoint for the model. Always `\\\"model\\\"` for LM Harness evaluations. | `url` | The URL of the API endpoint for the model. Example LM Harness Tasks by Category#. | Math & Reasoning | `gsm8k` | Grade school and advanced math word problems. Math word problems and translation tasks in multiple languages. | Multilingual Tasks | `mgsm`, `wikilingua` | Math word problems and translation tasks in multiple languages. For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run `python -m lm_eval --tasks list`. For the full list of LM Harness tasks, refer to tasks. 
These parameters control how requests are made to the model:.\"\n    },\n    {\n      \"title\": \"Evaluating LLMs — EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/projects/large-language-model-evaluation\",\n      \"description\": \"# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs  Next.\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). 
When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity….\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.2\",\n      \"url\": \"https://zenodo.org/records/10829972\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.2. # lm-eval v0.4.2 Release Notes. We are releasing a new minor version of lm-eval for PyPI users! There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:. previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs. lm_eval.tasks.include_path(\\\"/path/to/my/custom/tasks\\\"). lm_eval.simple_evaluate(model=lm, tasks=[\\\"arc_easy\\\"])`. task_manager = TaskManager() # pass include_path=\\\"/path/to/my/custom/tasks\\\" if desired. lm_eval.simple_evaluate(model=lm, tasks=[\\\"arc_easy\\\"], task_manager=task_manager)`. Previous versions of the library incorrectly reported erroneously large `stderr` scores for groups of tasks such as MMLU. We've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see #1390 #1427 for more information. We're grateful for the community's support. 
**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.1...v0.4.2. ### EleutherAI/lm-evaluation-harness-v0.4.2.zip. | EleutherAI/lm-evaluation-harness-v0.4.2.zip md5:b975dd5fdc27c997493db2b723272b63 | 2.3 MB | Preview Download |.\"\n    },\n    {\n      \"title\": \"blockdialect - lm-evaluation-harness - Stanford GitLab\",\n      \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness\",\n      \"description\": \"This project provides a unified framework to test generative language models on a large number of different evaluation tasks.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness with LoRa fine tuned models Hugging Face\",\n      \"url\": \"https://stackoverflow.com/questions/78306200/lm-evaluation-harness-with-lora-fine-tuned-models-hugging-face\",\n      \"description\": \"They suggest how to use lm_eval when evaluating on a peft model: you should add in pretrained the model used for tuning and add peft=<PEFT_MODEL_NAME> into the\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\"hf\", pretrained=\"gpt2\") = \"hf\" = \"gpt2\"results = evaluator.evaluate(model, tasks=[\"hellaswag\"], num_fewshot=5) = = \"hellaswag\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\"gpt2\", = \"gpt2\" tasks=[\"hellaswag\", \"mmlu\"], = \"hellaswag\" \"mmlu\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\lm-eval --model \\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\ --model_args \\ --tasks mmlu,hellaswag \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size auto \\ --batch_size \\ --output_path results/mistral-7b.json \\ --output_path \\ --apply_chat_template --apply_chat_template. # Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\lm-eval --model \\ --model_args model=gpt-4-turbo \\ --model_args \\ --tasks mmlu,bbh \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size 10 --batch_size."
      },
      {
        "title": "lm-evaluation-harness/docs/new_task_guide.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md",
        "description": "[Skip to content](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#start-of-content). [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You signed out in another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You switched accounts on another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.Dismiss alert. [History](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md)History. [Raw](https://github.com/EleutherAI/lm-evaluation-harness/raw/refs/heads/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#new-task-guide). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setup). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#creating-a-yaml-file). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#selecting-and-configuring-a-dataset). Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hellaswag/utils.py)):. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-template). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#basic-prompts). 
For example in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` given the name of one of the feature each. `doc_to_choice` can be directly given a list of text as option (See [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)). if a dataset feature is already a list, you can set the name of the feature as `doc_to_choice` (See [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13)). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-with-jinja-2). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#multiple-choice-format). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-python-functions-for-prompts). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#importing-a-prompt-from-promptsource). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setting-metrics). For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#optional-more-advanced-setup). For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-name--tags-registering-a-task). 
[](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs). See [./task_guide.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md) for a full list of related sub-keys. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#configuring-python-classes). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-validity). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#versioning). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-performance--equivalence). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-validity-checklist). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#submitting-your-task)."
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "lm-evaluation-harness/docs/task_guide.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md",
        "description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the"
      },
      {
        "title": "LM Harness Evaluations — NVIDIA NeMo Microservices",
        "url": "https://docs.nvidia.com/nemo/microservices/25.9.0/evaluate/flows/academic-benchmarks/lm-harness.html",
        "description": "# LM Harness Evaluations#. LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks. All LM Harness evaluations use the same target structure. | `type` | Always `\"model\"` for LM Harness evaluations. The URL of the API endpoint for the model. Always `\"model\"` for LM Harness evaluations. | `url` | The URL of the API endpoint for the model. Example LM Harness Tasks by Category#. | Math & Reasoning | `gsm8k` | Grade school and advanced math word problems. Math word problems and translation tasks in multiple languages. | Multilingual Tasks | `mgsm`, `wikilingua` | Math word problems and translation tasks in multiple languages. For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run `python -m lm_eval --tasks list`. For the full list of LM Harness tasks, refer to tasks. These parameters control how requests are made to the model:."
      },
      {
        "title": "Evaluating LLMs — EleutherAI",
        "url": "https://www.eleuther.ai/projects/large-language-model-evaluation",
        "description": "# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs  Next."
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity…."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.2",
        "url": "https://zenodo.org/records/10829972",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.2. # lm-eval v0.4.2 Release Notes. We are releasing a new minor version of lm-eval for PyPI users! There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:. previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs. lm_eval.tasks.include_path(\"/path/to/my/custom/tasks\"). lm_eval.simple_evaluate(model=lm, tasks=[\"arc_easy\"])`. task_manager = TaskManager() # pass include_path=\"/path/to/my/custom/tasks\" if desired. lm_eval.simple_evaluate(model=lm, tasks=[\"arc_easy\"], task_manager=task_manager)`. Previous versions of the library incorrectly reported erroneously large `stderr` scores for groups of tasks such as MMLU. We've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see #1390 #1427 for more information. We're grateful for the community's support. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.1...v0.4.2. ### EleutherAI/lm-evaluation-harness-v0.4.2.zip. | EleutherAI/lm-evaluation-harness-v0.4.2.zip md5:b975dd5fdc27c997493db2b723272b63 | 2.3 MB | Preview Download |."
      },
      {
        "title": "blockdialect - lm-evaluation-harness - Stanford GitLab",
        "url": "https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness",
        "description": "This project provides a unified framework to test generative language models on a large number of different evaluation tasks."
      },
      {
        "title": "lm-evaluation-harness with LoRa fine tuned models Hugging Face",
        "url": "https://stackoverflow.com/questions/78306200/lm-evaluation-harness-with-lora-fine-tuned-models-hugging-face",
        "description": "They suggest how to use lm_eval when evaluating on a peft model: you should add in pretrained the model used for tuning and add peft=<PEFT_MODEL_NAME> into the"
      }
    ]
  }
}

▼thought

I now have a fairly comprehensive picture of lm-evaluation-harness. Next, I'll fetch the latest release notes and more detail on the architecture, and search for critical perspectives and comparisons with other evaluation frameworks.

▼✓Web search result count: 10, query: "lm-evaluation-harness vs lmms-eval vs HELM comparison evaluation framework 2025"

{
"content": [
{
"type": "text",
"text": "{\n \"results\": [\n {\n \"title\": \"HELM vs LM Evaluation Harness (2026): Honest Comparison [Free]\",\n \"url\": \"https://aisecurityandsafety.org/en/compare/helm-vs-lm-evaluation-harness/\",\n \"description\": \"# HELM vs LM Evaluation Harness: Comparison. Compare HELM and LM Evaluation Harness, two evaluation tools for AI safety and security. LM Evaluation Harness is a evaluation tool by EleutherAI that eleutherAI's widely-used open-source framework for evaluating LLMs across hundreds of benchmarks and tasks. LM Evaluation Harness provides hundreds of built in evaluation tasks, support for local and api based models, and hugging face open llm leaderboard backend. In terms of language support, HELM works with Python while LM Evaluation Harness supports Python. Both HELM and LM Evaluation Harness are open-source and freely available, so pricing is unlikely to be the deciding factor. What is the difference between HELM and LM Evaluation Harness? : HELM is a evaluation tool that is open-source and freely available, while LM Evaluation Harness is a evaluation tool that is open-source and freely available. Which is better, HELM or LM Evaluation Harness? Can HELM and LM Evaluation Harness be used together?\"\n },\n {\n \"title\": \"25 Resources for Model Evaluation Capabilities\",\n \"url\": \"https://fmcheatsheet.org/foundation-model-resources/model-evaluation-capabilities/\",\n \"description\": \"LM Evaluation Harness. Orchestration framework for standardizing LM prompted evaluation, supporting hundreds of subtasks. ... HELM with comparisons to many\"\n },\n {\n \"title\": \"Top 10 LLM Evaluation Harnesses: Features, Pros, Cons ...\",\n \"url\": \"https://www.devopsschool.com/blog/top-10-llm-evaluation-harnesses-features-pros-cons-comparison/\",\n \"description\": \"# Top 10 LLM Evaluation Harnesses: Features, Pros, Cons & Comparison. 
LLM Evaluation Harnesses are tools, frameworks, and platforms that help teams test large language models, prompts, RAG pipelines, chatbots, copilots, and AI agents before they are released into production. Instead of relying on manual checking or public benchmark scores, evaluation harnesses allow teams to run repeatable tests against their own datasets, prompts, model versions, and real-world scenarios. LLM Evaluation Harnesses help teams catch these issues early by adding structured tests, regression checks, human review, LLM-as-judge scoring, trace inspection, and production feedback loops. **Best for:** AI engineers, LLMOps teams, ML platform teams, product teams, AI startups, enterprise AI teams, and organizations building chatbots, copilots, RAG applications, or AI agents. **One-line verdict:** Best for open benchmark evaluation of language models across standardized academic and research tasks. EleutherAI LM Evaluation Harness is an open-source framework used to evaluate language models across a wide range of benchmark tasks.\"\n },\n {\n \"title\": \"Evaluating LLMs — EleutherAI\",\n \"url\": \"https://www.eleuther.ai/projects/large-language-model-evaluation\",\n \"description\": \"# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. 
This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs Next.\"\n },\n {\n \"title\": \"The 4 LLM Evaluation Frameworks: How to Benchmark AI Like ...\",\n \"url\": \"https://pub.towardsai.net/the-4-llm-evaluation-frameworks-how-to-benchmark-ai-like-google-and-openai-do-a08d0cd0b689\",\n \"description\": \"# The 4 LLM Evaluation Frameworks: How to Benchmark AI Like Google and OpenAI Do | by TANVEER MUSTAFA | Towards AI. ## Towards AI. Join 100K+ AI practitioners on Towards AI Academy. # The 4 LLM Evaluation Frameworks: How to Benchmark AI Like Google and OpenAI Do. Press enter or click to view image in full size. Image generated by Author using AI. _Understanding EleutherAI Harness, HELM, BIG-bench, and Domain-Specific Evals_. You’ve spent $2 million on training and can’t answer the simplest question: Is it actually good? This is the evaluation crisis. Compared to what?) The result: Hype without substance, wasted resources, and users losing trust. Enter standardized evaluation frameworks: EleutherAI Harness runs 200+ tasks in one command. ## Create an account to read the full story. If you’re new to Medium, create a new account to read this story on us. ## Published in Towards AI. AI systems engineer writing about LLM architecture, evaluation frameworks and scalable product development.\"\n },\n {\n \"title\": \"Holistic Evaluation of Language Models (HELM)\",\n \"url\": \"https://crfm.stanford.edu/helm/\",\n \"description\": \"# **A reproducible and transparent framework for evaluating foundation models.**. ### Find leaderboards with many scenarios, metrics, and models with support for multimodality and model-graded evaluation. 
A new leaderboard for evaluating general capabilities of language models. Holistic Evaluation of Audio-Language Models. Lightweight, broad evaluation of the capabilities of language models using in-context learning. Thorough language model evaluations based on the scenarios from the original HELM paper. Holistic evaluation of text-to-image models. Evaluations of instruction following models with absolute ratings. Massive Multitask Language Understanding (MMLU) evaluations using standardized prompts. Holistic Evaluation of Vision-Language Models. Evaluations of Vision-Language Models on extracting structured information from images. Safety benchmark that aggregates popular safety benchmarks across 6 risk vectors. Chinese-language benchmark for holistic evaluation of Chinese language models. Thai-language evaluations of language models on standardized examinations in Thailand. Assessment of large language models across various tasks, emphasizing Southeast Asian languages. A benchmark of LLM long context capabilities. Evaluation of LLMs on 7 popular Arabic-language benchmarks.\"\n },\n {\n \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub\",\n \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. 
We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n },\n {\n \"title\": \"Evaluation Types — NVIDIA NeMo Microservices\",\n \"url\": \"https://docs.nvidia.com/nemo/microservices/25.4.0/evaluate/evaluation-types.html\",\n \"description\": \"NVIDIA NeMo Evaluator supports evaluation of LLMs through academic benchmarks, custom automated evaluations, and LLM-as-a-Judge. Beyond LLM evaluation, NeMo Evaluator also supports evaluation of retriever and RAG pipelines. Similarity Metrics evaluation is best suited for use cases where the LLM generations are not expected to be highly creative. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator. LLM-as-a-Judge Evaluation can be run on Foundational, Aligned or Fine-tuned models. The following are limitations of LLM-as-a-Judge Evaluations:. 
For non-chat models that do not use a chat template, NeMo Evaluator uses a default chat template in the following formats:. For models that do not support chat functionality, we recommend that you use Academic Benchmarks, such as LM Evaluation Harness and BigCode Evaluation Harness, and Similarity Metrics Evaluation with appropriate custom data. For chat models you can use the standard MT-Bench dataset or a Similarity Metrics Evaluation with appropriate custom data. You can use LLM-as-a-Judge evaluation for chat-enabled models in NIM.\"\n },\n {\n \"title\": \"Best practice in evaluating Base vs. Instruct Llama Models (with lm ...\",\n \"url\": \"https://www.reddit.com/r/LocalLLaMA/comments/1pwizo0/best_practice_in_evaluating_base_vs_instruct/\",\n \"description\": \"I'm currently benchmarking Llama 3.3 70B instruct (including quantized variants) using lm-evaluation-harness.\"\n },\n {\n \"title\": \"Guide to Real-World LLM Evaluation Frameworks & Benchmarks\",\n \"url\": \"https://medium.com/algomart/guide-to-real-world-llm-evaluation-frameworks-benchmarks-9a4bada5bd04\",\n \"description\": \"# Guide to Real-World LLM Evaluation Frameworks & Benchmarks | by Yash Jain | AlgoMart | Medium. # Guide to Real-World LLM Evaluation Frameworks & Benchmarks. Whether you’re building consumer apps, advancing research, or deploying infrastructure, knowing how to evaluate large language models (LLMs) isn’t just helpful — it’s essential. This guide offers a hands-on view of real-world evaluation tools — from broad, multi-metric frameworks to purpose-built benchmarks. ## Published in AlgoMart. ## More from Yash Jain and AlgoMart. Image 10: Designing a FastAPI + LLM System for 10K Concurrent Users and Scaling RAG to 100K Daily Users. ## Designing a FastAPI + LLM System for 10K Concurrent Users and Scaling RAG to 100K Daily Users ### Building an LLM-powered API that works for 10 users is easy. 
## Compared 6 Python AI Agent Frameworks So You Don’t Have To ### Choosing an AI agent framework in 2026 feels a lot like asking the internet which JavaScript framework to use in 2018.\"\n }\n ]\n}"
}
],
"details": {
"results": [
{
"title": "HELM vs LM Evaluation Harness (2026): Honest Comparison [Free]",
"url": "https://aisecurityandsafety.org/en/compare/helm-vs-lm-evaluation-harness/",
"description": "# HELM vs LM Evaluation Harness: Comparison. Compare HELM and LM Evaluation Harness, two evaluation tools for AI safety and security. LM Evaluation Harness is a evaluation tool by EleutherAI that eleutherAI's widely-used open-source framework for evaluating LLMs across hundreds of benchmarks and tasks. LM Evaluation Harness provides hundreds of built in evaluation tasks, support for local and api based models, and hugging face open llm leaderboard backend. In terms of language support, HELM works with Python while LM Evaluation Harness supports Python. Both HELM and LM Evaluation Harness are open-source and freely available, so pricing is unlikely to be the deciding factor. What is the difference between HELM and LM Evaluation Harness? : HELM is a evaluation tool that is open-source and freely available, while LM Evaluation Harness is a evaluation tool that is open-source and freely available. Which is better, HELM or LM Evaluation Harness? Can HELM and LM Evaluation Harness be used together?"
},
{
"title": "25 Resources for Model Evaluation Capabilities",
"url": "https://fmcheatsheet.org/foundation-model-resources/model-evaluation-capabilities/",
"description": "LM Evaluation Harness. Orchestration framework for standardizing LM prompted evaluation, supporting hundreds of subtasks. ... HELM with comparisons to many"
},
{
"title": "Top 10 LLM Evaluation Harnesses: Features, Pros, Cons ...",
"url": "https://www.devopsschool.com/blog/top-10-llm-evaluation-harnesses-features-pros-cons-comparison/",
"description": "# Top 10 LLM Evaluation Harnesses: Features, Pros, Cons & Comparison. LLM Evaluation Harnesses are tools, frameworks, and platforms that help teams test large language models, prompts, RAG pipelines, chatbots, copilots, and AI agents before they are released into production. Instead of relying on manual checking or public benchmark scores, evaluation harnesses allow teams to run repeatable tests against their own datasets, prompts, model versions, and real-world scenarios. LLM Evaluation Harnesses help teams catch these issues early by adding structured tests, regression checks, human review, LLM-as-judge scoring, trace inspection, and production feedback loops. **Best for:** AI engineers, LLMOps teams, ML platform teams, product teams, AI startups, enterprise AI teams, and organizations building chatbots, copilots, RAG applications, or AI agents. **One-line verdict:** Best for open benchmark evaluation of language models across standardized academic and research tasks. EleutherAI LM Evaluation Harness is an open-source framework used to evaluate language models across a wide range of benchmark tasks."
},
{
"title": "Evaluating LLMs — EleutherAI",
"url": "https://www.eleuther.ai/projects/large-language-model-evaluation",
"description": "# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs Next."
},
{
"title": "The 4 LLM Evaluation Frameworks: How to Benchmark AI Like ...",
"url": "https://pub.towardsai.net/the-4-llm-evaluation-frameworks-how-to-benchmark-ai-like-google-and-openai-do-a08d0cd0b689",
"description": "# The 4 LLM Evaluation Frameworks: How to Benchmark AI Like Google and OpenAI Do | by TANVEER MUSTAFA | Towards AI. ## Towards AI. Join 100K+ AI practitioners on Towards AI Academy. # The 4 LLM Evaluation Frameworks: How to Benchmark AI Like Google and OpenAI Do. Press enter or click to view image in full size. Image generated by Author using AI. _Understanding EleutherAI Harness, HELM, BIG-bench, and Domain-Specific Evals_. You’ve spent $2 million on training and can’t answer the simplest question: Is it actually good? This is the evaluation crisis. Compared to what?) The result: Hype without substance, wasted resources, and users losing trust. Enter standardized evaluation frameworks: EleutherAI Harness runs 200+ tasks in one command. ## Create an account to read the full story. If you’re new to Medium, create a new account to read this story on us. ## Published in Towards AI. AI systems engineer writing about LLM architecture, evaluation frameworks and scalable product development."
},
{
"title": "Holistic Evaluation of Language Models (HELM)",
"url": "https://crfm.stanford.edu/helm/",
"description": "# **A reproducible and transparent framework for evaluating foundation models.**. ### Find leaderboards with many scenarios, metrics, and models with support for multimodality and model-graded evaluation. A new leaderboard for evaluating general capabilities of language models. Holistic Evaluation of Audio-Language Models. Lightweight, broad evaluation of the capabilities of language models using in-context learning. Thorough language model evaluations based on the scenarios from the original HELM paper. Holistic evaluation of text-to-image models. Evaluations of instruction following models with absolute ratings. Massive Multitask Language Understanding (MMLU) evaluations using standardized prompts. Holistic Evaluation of Vision-Language Models. Evaluations of Vision-Language Models on extracting structured information from images. Safety benchmark that aggregates popular safety benchmarks across 6 risk vectors. Chinese-language benchmark for holistic evaluation of Chinese language models. Thai-language evaluations of language models on standardized examinations in Thailand. Assessment of large language models across various tasks, emphasizing Southeast Asian languages. A benchmark of LLM long context capabilities. Evaluation of LLMs on 7 popular Arabic-language benchmarks."
},
{
"title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
"url": "https://github.com/EleutherAI/lm-evaluation-harness",
"description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
},
{
"title": "Evaluation Types — NVIDIA NeMo Microservices",
"url": "https://docs.nvidia.com/nemo/microservices/25.4.0/evaluate/evaluation-types.html",
"description": "NVIDIA NeMo Evaluator supports evaluation of LLMs through academic benchmarks, custom automated evaluations, and LLM-as-a-Judge. Beyond LLM evaluation, NeMo Evaluator also supports evaluation of retriever and RAG pipelines. Similarity Metrics evaluation is best suited for use cases where the LLM generations are not expected to be highly creative. For more information, refer to Use Custom Data with NVIDIA NeMo Evaluator. LLM-as-a-Judge Evaluation can be run on Foundational, Aligned or Fine-tuned models. The following are limitations of LLM-as-a-Judge Evaluations:. For non-chat models that do not use a chat template, NeMo Evaluator uses a default chat template in the following formats:. For models that do not support chat functionality, we recommend that you use Academic Benchmarks, such as LM Evaluation Harness and BigCode Evaluation Harness, and Similarity Metrics Evaluation with appropriate custom data. For chat models you can use the standard MT-Bench dataset or a Similarity Metrics Evaluation with appropriate custom data. You can use LLM-as-a-Judge evaluation for chat-enabled models in NIM."
},
{
"title": "Best practice in evaluating Base vs. Instruct Llama Models (with lm ...",
"url": "https://www.reddit.com/r/LocalLLaMA/comments/1pwizo0/best_practice_in_evaluating_base_vs_instruct/",
"description": "I'm currently benchmarking Llama 3.3 70B instruct (including quantized variants) using lm-evaluation-harness."
},
{
"title": "Guide to Real-World LLM Evaluation Frameworks & Benchmarks",
"url": "https://medium.com/algomart/guide-to-real-world-llm-evaluation-frameworks-benchmarks-9a4bada5bd04",
"description": "# Guide to Real-World LLM Evaluation Frameworks & Benchmarks | by Yash Jain | AlgoMart | Medium. # Guide to Real-World LLM Evaluation Frameworks & Benchmarks. Whether you’re building consumer apps, advancing research, or deploying infrastructure, knowing how to evaluate large language models (LLMs) isn’t just helpful — it’s essential. This guide offers a hands-on view of real-world evaluation tools — from broad, multi-metric frameworks to purpose-built benchmarks. ## Published in AlgoMart. ## More from Yash Jain and AlgoMart. Image 10: Designing a FastAPI + LLM System for 10K Concurrent Users and Scaling RAG to 100K Daily Users. ## Designing a FastAPI + LLM System for 10K Concurrent Users and Scaling RAG to 100K Daily Users ### Building an LLM-powered API that works for 10 users is easy. ## Compared 6 Python AI Agent Frameworks So You Don’t Have To ### Choosing an AI agent framework in 2026 feels a lot like asking the internet which JavaScript framework to use in 2018."
}
]
}
}

▼✓Web search result count: 10, query: "lm-evaluation-harness criticism limitations issues reproducibility"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Conference Talk 16: A Deep Dive on LLM Evaluation – Christian Mills\",\n      \"url\": \"https://christianjmills.com/posts/mastering-llms-course-notes/conference-talk-016/\",\n      \"description\": \"# Conference Talk 16: A Deep Dive on LLM Evaluation. In this talk, **Hailey Schoelkopf** from **Eleuther AI** provides an overview of the challenges in LLM evaluation, exploring different measurement techniques, highlighting reproducibility issues, and advocating for best practices like sharing evaluation code and using task-specific downstream evaluations. * **Slides:** A Deep Dive on LM Evaluation. * **Topic:** Deep dive into the challenges and best practices of Large Language Model (LLM) evaluation. + Maintainer of the **LM Evaluation Harness**, a widely used open-source library for evaluating LLMs. ## Challenges in LLM Evaluation. This involves summing the log probabilities of each token in Y, conditioned on the preceding tokens in X and Y. - where \\\\(\\\\log p(y\\\\_i | x, y\\\\_0, \\\\ldots, y\\\\_{i-1})\\\\) is the log probability of the \\\\(i\\\\)-th target token conditioned on the full input \\\\(x\\\\) and the preceding target tokens. * **Calculation:** Based on the average per-token log probability of the text, with lower perplexity indicating a better fit to the data.\"\n    },\n    {\n      \"title\": \"Challenges in Language Model Evaluations\",\n      \"url\": \"https://lm-evaluation-challenges.github.io/\",\n      \"description\": \"Reproducibility challenges unique to LMs, including the importance of prompting, prompt engineering, and LMs' non-robustness to implementation details. 
Data\"\n    },\n    {\n      \"title\": \"blockdialect - lm-evaluation-harness - Stanford GitLab\",\n      \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness\",\n      \"description\": \"Evaluation with publicly available prompts ensures reproducibility and comparability between papers. Easy support for custom prompts and evaluation metrics.\"\n    },\n    {\n      \"title\": \"LLM evaluation and reproducibility : r/MLQuestions - Reddit\",\n      \"url\": \"https://www.reddit.com/r/MLQuestions/comments/1ppqcjd/llm_evaluation_and_reproducibility/\",\n      \"description\": \"Ok so it's a problem with how the results are reported. I am seeing a 20% difference in accuracy compared to what is shown on the leader board\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. 
| Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"[Literature Review] Lessons from the Trenches on Reproducible ...\",\n      \"url\": \"https://www.themoonlight.io/en/review/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models\",\n      \"description\": \"The authors present multiple case studies illustrating how lm-eval has been practically applied to improve evaluation methods, detailing various modeling\"\n    },\n    {\n      \"title\": \"The LLM Evaluation Problem Nobody Talks About - Towards AI\",\n      \"url\": \"https://pub.towardsai.net/the-llm-evaluation-problem-nobody-talks-about-5e876a12a5af\",\n      \"description\": \"Your benchmark numbers are probably not reproducible. NeMo Evaluator SDK was built specifically to fix that and it's more opinionated than\"\n    },\n    {\n      \"title\": \"Lessons from the Trenches on Reproducible Evaluation ... 
- EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/papers-blog/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models\",\n      \"description\": \"First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or\"\n    },\n    {\n      \"title\": \"Lessons from the Trenches on Reproducible Evaluation of ... - arXiv\",\n      \"url\": \"https://arxiv.org/html/2405.14782v1\",\n      \"description\": \"In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.\"\n    },\n    {\n      \"title\": \"Evaluating Large Language Models: Techniques, Challenges, and ...\",\n      \"url\": \"https://medium.com/@prag-matic/evaluating-large-language-models-techniques-challenges-and-best-practices-8b3f15f13253\",\n      \"description\": \"This blog delves into the multifaceted world of LLM evaluations, exploring methodologies, detailed evaluation metrics, challenges, and emerging trends.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Conference Talk 16: A Deep Dive on LLM Evaluation – Christian Mills",
        "url": "https://christianjmills.com/posts/mastering-llms-course-notes/conference-talk-016/",
        "description": "# Conference Talk 16: A Deep Dive on LLM Evaluation. In this talk, **Hailey Schoelkopf** from **Eleuther AI** provides an overview of the challenges in LLM evaluation, exploring different measurement techniques, highlighting reproducibility issues, and advocating for best practices like sharing evaluation code and using task-specific downstream evaluations. * **Slides:** A Deep Dive on LM Evaluation. * **Topic:** Deep dive into the challenges and best practices of Large Language Model (LLM) evaluation. + Maintainer of the **LM Evaluation Harness**, a widely used open-source library for evaluating LLMs. ## Challenges in LLM Evaluation. This involves summing the log probabilities of each token in Y, conditioned on the preceding tokens in X and Y. - where \\(\\log p(y\\_i | x, y\\_0, \\ldots, y\\_{i-1})\\) is the log probability of the \\(i\\)-th target token conditioned on the full input \\(x\\) and the preceding target tokens. * **Calculation:** Based on the average per-token log probability of the text, with lower perplexity indicating a better fit to the data."
      },
      {
        "title": "Challenges in Language Model Evaluations",
        "url": "https://lm-evaluation-challenges.github.io/",
        "description": "Reproducibility challenges unique to LMs, including the importance of prompting, prompt engineering, and LMs' non-robustness to implementation details. Data"
      },
      {
        "title": "blockdialect - lm-evaluation-harness - Stanford GitLab",
        "url": "https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness",
        "description": "Evaluation with publicly available prompts ensures reproducibility and comparability between papers. Easy support for custom prompts and evaluation metrics."
      },
      {
        "title": "LLM evaluation and reproducibility : r/MLQuestions - Reddit",
        "url": "https://www.reddit.com/r/MLQuestions/comments/1ppqcjd/llm_evaluation_and_reproducibility/",
        "description": "Ok so it's a problem with how the results are reported. I am seeing a 20% difference in accuracy compared to what is shown on the leader board"
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "[Literature Review] Lessons from the Trenches on Reproducible ...",
        "url": "https://www.themoonlight.io/en/review/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models",
        "description": "The authors present multiple case studies illustrating how lm-eval has been practically applied to improve evaluation methods, detailing various modeling"
      },
      {
        "title": "The LLM Evaluation Problem Nobody Talks About - Towards AI",
        "url": "https://pub.towardsai.net/the-llm-evaluation-problem-nobody-talks-about-5e876a12a5af",
        "description": "Your benchmark numbers are probably not reproducible. NeMo Evaluator SDK was built specifically to fix that and it's more opinionated than"
      },
      {
        "title": "Lessons from the Trenches on Reproducible Evaluation ... - EleutherAI",
        "url": "https://www.eleuther.ai/papers-blog/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models",
        "description": "First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or"
      },
      {
        "title": "Lessons from the Trenches on Reproducible Evaluation of ... - arXiv",
        "url": "https://arxiv.org/html/2405.14782v1",
        "description": "In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers."
      },
      {
        "title": "Evaluating Large Language Models: Techniques, Challenges, and ...",
        "url": "https://medium.com/@prag-matic/evaluating-large-language-models-techniques-challenges-and-best-practices-8b3f15f13253",
        "description": "This blog delves into the multifaceted world of LLM evaluations, exploring methodologies, detailed evaluation metrics, challenges, and emerging trends."
      }
    ]
  }
}

▼✓抓取网页正文网址: "https://zenodo.org/records/17728786"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://zenodo.org/records/17728786\",\n  \"title\": \"EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes\",\n  \"summary\": \"This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum requir…\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article>\\n          \\n\\n              \\n\\n              \\n\\n              <div id=\\\"record-info\\\" aria-label=\\\"Publication date and version number\\\">\\n                    <p><span title=\\\"Publication date\\\">\\n                        Published November 26, 2025\\n                      </span>\\n                      <span> | Version v0.4.9.2</span>\\n                    </p>\\n                    <p><span role=\\\"note\\\" aria-label=\\\"Resource type\\\">\\n                          Software\\n                        </span>\\n                      \\n\\n                      <span role=\\\"note\\\" data-tooltip=\\\"The record and files are publicly accessible.\\\" data-inverted=\\\"\\\" aria-label=\\\"Access status\\\">\\n                        \\n                          \\n                        \\n                        <span aria-label=\\\"The record and files are publicly accessible.\\\">\\n                          Open\\n                        </span>\\n                      </span>\\n                    </p>\\n                  </div>\\n              <div id=\\\"record-title-section\\\" aria-label=\\\"Record title, authors and contributors\\\">\\n      <div>\\n          <h3>Authors/Creators</h3>\\n          <ul>\\n            \\n  \\n  <li>\\n    <a data-tooltip=\\\"Language Technologies Institute, CMU\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Lintang+Sutawika%22\\\">\\n\\n      <span>Lintang Sutawika</span><sup>1</sup></a>\\n  \\n\\n  \\n    \\n      
\\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hailey+Schoelkopf%22\\\">\\n\\n      <span>Hailey Schoelkopf</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leo+Gao%22\\\">\\n\\n      <span>Leo Gao</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Baber+Abbasi%22\\\">\\n\\n      <span>Baber Abbasi</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Booz Allen Hamilton, EleutherAI\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Stella+Biderman%22\\\">\\n\\n      <span>Stella Biderman</span><sup>2</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jonathan+Tow%22\\\">\\n\\n      <span>Jonathan Tow</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22ben+fattori%22\\\">\\n\\n      <span>ben fattori</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      
<i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Charles+Lovering%22\\\">\\n\\n      <span>Charles Lovering</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22farzanehnakhaee70%22\\\">\\n\\n      <span>farzanehnakhaee70</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jason+Phang%22\\\">\\n\\n      <span>Jason Phang</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"playscape.gg\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Anish+Thite%22\\\">\\n\\n      <span>Anish Thite</span><sup>3</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Fazz%22\\\">\\n\\n      <span>Fazz</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Max Planck Institute for Software Systems: MPI SWS\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Aflah%22\\\">\\n\\n      <span>Aflah</span><sup>4</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  
<li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Niklas%22\\\">\\n\\n      <span>Niklas</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"MistralAI\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Thomas+Wang%22\\\">\\n\\n      <span>Thomas Wang</span><sup>5</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22sdtblck%22\\\">\\n\\n      <span>sdtblck</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22gakada%22\\\">\\n\\n      <span>gakada</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22nopperl%22\\\">\\n\\n      <span>nopperl</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22researcher2%22\\\">\\n\\n      <span>researcher2</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22tttyuntian%22\\\">\\n\\n      <span>tttyuntian</span></a>\\n  \\n\\n  
\\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Hitz Zentroa UPV/EHU\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Julen+Etxaniz%22\\\">\\n\\n      <span>Julen Etxaniz</span><sup>6</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"@azurro\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Chris%22\\\">\\n\\n      <span>Chris</span><sup>7</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"MIT\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22James+A.+Michaelov%22\\\">\\n\\n      <span>James A. 
Michaelov</span><sup>8</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Shinhan Securities Co.\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hanwool+Albert+Lee%22\\\">\\n\\n      <span>Hanwool Albert Lee</span><sup>9</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Janna%22\\\">\\n\\n      <span>Janna</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leonid+Sinev%22\\\">\\n\\n      <span>Leonid Sinev</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Charles University\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Zden%C4%9Bk+Kasner%22\\\">\\n\\n      <span>Zdeněk Kasner</span><sup>10</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Open Source Developer @ IBM\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Kiersten+Stokes%22\\\">\\n\\n      <span>Kiersten Stokes</span><sup>11</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a 
href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Khalid%22\\\">\\n\\n      <span>Khalid</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22KonradSzafer%22\\\">\\n\\n      <span>KonradSzafer</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n          </ul>\\n      </div>\\n\\n        \\n        \\n          \\n\\n\\n<section id=\\\"creators-affiliations\\\" aria-label=\\\"Affiliations for creators\\\">\\n  <ul>\\n    \\n    <li>\\n      1.\\n\\n      \\n\\n      Language Technologies Institute, CMU\\n    </li>\\n  \\n    <li>\\n      2.\\n\\n      \\n\\n      Booz Allen Hamilton, EleutherAI\\n    </li>\\n  \\n    <li>\\n      3.\\n\\n      \\n\\n      playscape.gg\\n    </li>\\n  \\n    <li>\\n      4.\\n\\n      \\n\\n      Max Planck Institute for Software Systems: MPI SWS\\n    </li>\\n  \\n    <li>\\n      5.\\n\\n      \\n\\n      MistralAI\\n    </li>\\n  \\n    <li>\\n      6.\\n\\n      \\n\\n      Hitz Zentroa UPV/EHU\\n    </li>\\n  \\n    <li>\\n      7.\\n\\n      \\n\\n      @azurro\\n    </li>\\n  \\n    <li>\\n      8.\\n\\n      \\n\\n      MIT\\n    </li>\\n  \\n    <li>\\n      9.\\n\\n      \\n\\n      Shinhan Securities Co.\\n    </li>\\n  \\n    <li>\\n      10.\\n\\n      \\n\\n      Charles University\\n    </li>\\n  \\n    <li>\\n      11.\\n\\n      \\n\\n      Open Source Developer @ IBM\\n    </li>\\n  \\n\\n    </ul>\\n</section>\\n\\n        \\n\\n    </div>\\n\\n\\n\\n  <section id=\\\"description\\\" aria-label=\\\"Record description\\\">\\n    <h2 id=\\\"description-heading\\\">Description</h2>\\n    \\n    <div>\\n      <p>This release continues our steady stream of community contributions with 
a batch of new benchmarks, expanded model support, and important fixes. A notable change: <strong>Python 3.10 is now the minimum required version</strong>.</p>\\n<h3>New Benchmarks &amp; Tasks</h3>\\n<p>A big wave of new evaluation tasks this release:</p>\\n<ul>\\n<li><strong>AIME</strong> and <strong>MATH500</strong> math reasoning benchmarks by @jannalulu in #3248, #3311</li>\\n<li><strong>BabiLong</strong> and <strong>Longbench v2</strong> for long-context evaluation by @jannalulu in #3287, #3338</li>\\n<li><strong>GraphWalks</strong> by @jannalulu in #3377</li>\\n<li><strong>ZhoBLiMP</strong>, <strong>BLiMP-NL</strong>, <strong>TurBLiMP</strong>, <strong>LM-SynEval</strong>, and <strong>BHS</strong> linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265</li>\\n<li><strong>Icelandic WinoGrande</strong> by @jmichaelov in #3277</li>\\n<li><strong>CLIcK</strong> Korean benchmark by @shing100 in #3173</li>\\n<li><strong>MMLU-Redux</strong> (generative) and Spanish translation by @luiscosio in #2705</li>\\n<li><strong>EsBBQ</strong> and <strong>CaBBQ</strong> bias benchmarks by @valleruizf in #3167</li>\\n<li><strong>EQBench</strong> in Spanish and Catalan by @priverabsc in #3168</li>\\n<li><strong>Anthropic discrim-eval</strong> by @Helw150 in #3091</li>\\n<li><strong>XNLI-VA</strong> by @FranValero97 in #3194</li>\\n<li><strong>Bangla MMLU</strong> (Titulm) by @Ismail-Hossain-1 in #3317</li>\\n<li><strong>HumanEval infilling</strong> by @its-alpesh in #3299</li>\\n<li><strong>CNN-DailyMail 3.0.0</strong> by @preordinary in #3426</li>\\n<li><strong>Global PIQA</strong> and new <code>acc_norm_bytes</code> metric by @baberabb in #3368</li>\\n</ul>\\n<h3>Fixes &amp; Improvements</h3>\\n<p><strong>Core Changes:</strong></p>\\n<ul>\\n<li><strong>Python 3.10 minimum</strong> by @jannalulu in #3337</li>\\n<li><strong>Unpinned <code>datasets</code></strong> library by @baberabb in #3316</li>\\n<li><strong>BOS token handling</strong>: Delegate to 
tokenizer; <code>add_bos_token</code> now defaults to <code>None</code> by @baberabb in #3347</li>\\n<li>Renamed <code>LOGLEVEL</code> env var to <code>LMEVAL_LOG_LEVEL</code> to avoid conflicts by @fxmarty-amd in #3418</li>\\n<li>Resolve duplicate task names with safeguards by @giuliolovisotto in #3394</li>\\n</ul>\\n<p><strong>Task Fixes:</strong></p>\\n<ul>\\n<li>Fixed MMLU-Redux to exclude samples without <code>error_type=\\\"ok\\\"</code> and display summary table by @fxmarty-amd in #3410, #3406</li>\\n<li>Fixed AIME answer extraction by @jannalulu in #3353</li>\\n<li>Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361</li>\\n<li>Fixed <code>crows_pairs</code> dataset by @jannalulu in #3378</li>\\n<li>Fixed Gemma tokenizer <code>add_bos_token</code> not updating by @DarkLight1337 in #3206</li>\\n<li>Fixed <code>lambada_multilingual_stablelm</code> by @jmichaelov, @HallerPatrick in #3294, #3222</li>\\n<li>Fixed CodeXGLUE by @gsaltintas in #3238</li>\\n<li>Pinned correct MMLUSR version by @christinaexyou in #3350</li>\\n<li>Updated <code>minerva_math</code> by @baberabb in #3259</li>\\n</ul>\\n<p><strong>Backend Fixes:</strong></p>\\n<ul>\\n<li>Fixed vLLM import errors when not installed by @fxmarty-amd in #3292</li>\\n<li>Fixed vLLM <code>data_parallel_size&gt;1</code> issue by @Dornavineeth in #3303</li>\\n<li>Resolved deprecated <code>vllm.utils.get_open_port</code> by @DarkLight1337 in #3398</li>\\n<li>Fixed GPT series model bugs by @zinccat in #3348</li>\\n<li>Fixed PIL image hashing to use actual bytes by @tboerstad in #3331</li>\\n<li>Fixed <code>additional_config</code> parsing by @brian-dellabetta in #3393</li>\\n<li>Fixed batch chunking seed handling with groupby by @slimfrkha in #3047</li>\\n<li>Fixed no-output error handling by @Oseltamivir in #3395</li>\\n<li>Replaced deprecated <code>torch_dtype</code> with <code>dtype</code> by @AbdulmalikDS in #3415</li>\\n<li>Fixed custom task config reading by 
@SkyR0ver in #3425</li>\\n</ul>\\n<h3>Model &amp; Backend Support</h3>\\n<ul>\\n<li><strong>OpenAI GPT-5</strong> support by @babyplutokurt in #3247</li>\\n<li><strong>Azure OpenAI</strong> support by @zinccat in #3349</li>\\n<li><strong>Fine-tuned Gemma3</strong> evaluation support by @LearnerSXH in #3234</li>\\n<li><strong>OpenVINO text2text</strong> models by @nikita-savelyevv in #3101</li>\\n<li><strong>Intel XPU</strong> support for HFLM by @kaixuanliu in #3211</li>\\n<li><strong>Attention head steering</strong> support by @luciaquirke in #3279</li>\\n<li>Leverage vLLM's <code>tokenizer_info</code> endpoint to avoid manual duplication by @m-misiura in #3185</li>\\n</ul>\\n<h2>What's Changed</h2>\\n<ul>\\n<li>Remove <code>trust_remote_code: True</code> from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213</li>\\n<li>Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\\n<li>Fix <code>add_bos_token</code> not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206</li>\\n<li>remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\\n<li>Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\\n<li>Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\\n<li>Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\\n<li>Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218</li>\\n<li>Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221</li>\\n<li>Add TurBLiMP by @jmichaelov in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3219</li>\\n<li>Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184</li>\\n<li>Fix unknown group key to tag in yaml config for <code>lambada_multilingual_stablelm</code> by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\\n<li>update <code>minerva_math</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259</li>\\n<li>feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173</li>\\n<li>Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\\n<li>Add support for OpenVINO text2text generation models  by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\\n<li>Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\\n<li>Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\\n<li>feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\\n<li>pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268</li>\\n<li>Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\\n<li>Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265</li>\\n<li>Add <code>acc_norm</code> metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272</li>\\n<li>Add <code>acc_norm</code> metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271</li>\\n<li>Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\\n<li>Add support for steering 
individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279</li>\\n<li>Add the Icelandic WinoGrande benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3277</li>\\n<li>Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047</li>\\n<li>[fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292</li>\\n<li>Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\\n<li>add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\\n<li>feat: Add mmlu-redux and it's spanish transaltion as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705</li>\\n<li>Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287</li>\\n<li>Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296</li>\\n<li>Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\\n<li>Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\\n<li>[fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321</li>\\n<li>Fix: VLLM model when data_parallel_size&gt;1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\\n<li>unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316</li>\\n<li>bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337</li>\\n<li>Longbench v2 by @jannalulu in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3338</li>\\n<li>Leverage vllm's <code>tokenizer_info</code> endpoint to avoid manual duplication  by @m-misiura in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\\n<li>Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\\n<li>remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343</li>\\n<li>Align <code>humaneval_64_instruct</code> task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344</li>\\n<li>Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\\n<li>[fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353</li>\\n<li>add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368</li>\\n<li>[fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378</li>\\n<li>Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\\n<li>fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\\n<li>added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349</li>\\n<li>Delegate BOS to the tokenizer; <code>add_bos_token</code> defaults to <code>None</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347</li>\\n<li>fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361</li>\\n<li>[feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377</li>\\n<li>Longbench group fix by 
@jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359</li>\\n<li>Resolve deprecation of <code>vllm.utils.get_open_port</code> by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3398</li>\\n<li>Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\\n<li>Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\\n<li>Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\\n<li>[MMLU redux] Do not use samples which do not have <code>error_type=\\\"ok\\\"</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410</li>\\n<li>fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394</li>\\n<li>Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311</li>\\n<li>[bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\\n<li>fix(tasks):pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\\n<li>Fix <code>lambada_multilingual_stablelm</code> by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294</li>\\n<li>Fix descriptions in the Moral Stories and Histoires Morales tasks. 
by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374</li>\\n<li>Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\\n<li>[fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406</li>\\n<li>Rename the conflicting environment variable <code>LOGLEVEL</code> to <code>LMEVAL_LOG_LEVEL</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418</li>\\n<li>Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\\n<li>Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\\n<li>New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\\n</ul>\\n<h2>New Contributors</h2>\\n<ul>\\n<li>@LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\\n<li>@ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\\n<li>@Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\\n<li>@babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\\n<li>@FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\\n<li>@HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\\n<li>@Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\\n<li>@nikita-savelyevv made their first contribution in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\\n<li>@weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\\n<li>@jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\\n<li>@slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\\n<li>@gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\\n<li>@valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\\n<li>@TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\\n<li>@kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\\n<li>@its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\\n<li>@priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\\n<li>@Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\\n<li>@m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\\n<li>@Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\\n<li>@zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\\n<li>@marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\\n<li>@neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\\n<li>@ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\\n<li>@Oseltamivir made their first 
contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\\n<li>@tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\\n<li>@brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\\n<li>@christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\\n<li>@AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\\n<li>@Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\\n<li>@SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\\n<li>@preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\\n</ul>\\n<p><strong>Full Changelog</strong>: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2</p>\\n    </div>\\n  </section>\\n\\n\\n<section id=\\\"record-files\\\" aria-label=\\\"Files\\\"><h2 id=\\\"files-heading\\\">Files</h2>\\n            \\n  <div href=\\\"#files-preview-accordion-panel\\\">\\n    <h3>\\n      <p><span id=\\\"preview-file-title\\\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</span>\\n        \\n      </p>\\n    </h3>\\n    \\n  </div>\\n  <div href=\\\"#files-list-accordion-panel\\\">\\n    <h3>\\n      <p>\\n        Files\\n        <small> (10.0 MB)</small>\\n        \\n      </p>\\n    </h3>\\n\\n    <div role=\\\"region\\\" id=\\\"files-list-accordion-panel\\\" aria-labelledby=\\\"files-list-accordion-trigger\\\">\\n        \\n  <table id=\\\"file-list-table\\\">\\n    <thead>\\n      <tr>\\n        <th>Name</th>\\n        <th>Size</th>\\n        <th>\\n            \\n            <a role=\\\"button\\\" href=\\\"https://zenodo.org/api/records/17728786/files-archive\\\">\\n               Download all\\n    
        </a>\\n        </th>\\n      </tr>\\n    </thead>\\n    <tbody>\\n    \\n      \\n        \\n        \\n        <tr>\\n          <td>\\n            <div>\\n              <p><a href=\\\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\\\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</a>\\n            </p></div>\\n            <small>md5:a1821ba45eee3362f49140f332ff6512\\n            \\n            </small>\\n          </td>\\n          <td>10.0 MB</td>\\n          <td>\\n            <span>\\n              \\n                <a role=\\\"button\\\" href=\\\"https://zenodo.org/records/17728786/preview/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?include_deleted=0\\\" target=\\\"preview-iframe\\\" data-file-key=\\\"EleutherAI/lm-evaluation-harness-v0.4.9.2.zip\\\">\\n                  Preview\\n                </a>\\n              \\n              <a role=\\\"button\\\" href=\\\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\\\">\\n                Download\\n              </a>\\n            </span>\\n          </td>\\n        </tr>\\n      \\n    \\n    </tbody>\\n  </table>\\n      </div>\\n  </div>\\n\\n    </section>\\n              \\n\\n  <section id=\\\"additional-details\\\" aria-label=\\\"Additional record details\\\">\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n  <h2 id=\\\"record-details-heading\\\">Additional details</h2>\\n\\n  \\n\\n  \\n\\n  \\n\\n  \\n    \\n      \\n    \\n    \\n  \\n\\n  \\n\\n  \\n\\n  \\n      \\n      \\n      \\n      \\n        \\n          \\n        \\n      \\n        \\n      \\n        \\n      \\n\\n      \\n        \\n        \\n      \\n      \\n      \\n      \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n     
\\n\\n  </section>\\n  \\n        </article></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://zenodo.org/records/17728786",
    "title": "EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes",
    "summary": "This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum requir…",
    "content": "<div id=\"readability-page-1\" class=\"page\"><article>\n          \n\n              \n\n              \n\n              <div id=\"record-info\" aria-label=\"Publication date and version number\">\n                    <p><span title=\"Publication date\">\n                        Published November 26, 2025\n                      </span>\n                      <span> | Version v0.4.9.2</span>\n                    </p>\n                    <p><span role=\"note\" aria-label=\"Resource type\">\n                          Software\n                        </span>\n                      \n\n                      <span role=\"note\" data-tooltip=\"The record and files are publicly accessible.\" data-inverted=\"\" aria-label=\"Access status\">\n                        \n                          \n                        \n                        <span aria-label=\"The record and files are publicly accessible.\">\n                          Open\n                        </span>\n                      </span>\n                    </p>\n                  </div>\n              <div id=\"record-title-section\" aria-label=\"Record title, authors and contributors\">\n      <div>\n          <h3>Authors/Creators</h3>\n          <ul>\n            \n  \n  <li>\n    <a data-tooltip=\"Language Technologies Institute, CMU\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Lintang+Sutawika%22\">\n\n      <span>Lintang Sutawika</span><sup>1</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hailey+Schoelkopf%22\">\n\n      <span>Hailey Schoelkopf</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a 
href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leo+Gao%22\">\n\n      <span>Leo Gao</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Baber+Abbasi%22\">\n\n      <span>Baber Abbasi</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"Booz Allen Hamilton, EleutherAI\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Stella+Biderman%22\">\n\n      <span>Stella Biderman</span><sup>2</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jonathan+Tow%22\">\n\n      <span>Jonathan Tow</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22ben+fattori%22\">\n\n      <span>ben fattori</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Charles+Lovering%22\">\n\n      <span>Charles Lovering</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22farzanehnakhaee70%22\">\n\n      <span>farzanehnakhaee70</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  
<li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jason+Phang%22\">\n\n      <span>Jason Phang</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"playscape.gg\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Anish+Thite%22\">\n\n      <span>Anish Thite</span><sup>3</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Fazz%22\">\n\n      <span>Fazz</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"Max Planck Institute for Software Systems: MPI SWS\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Aflah%22\">\n\n      <span>Aflah</span><sup>4</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Niklas%22\">\n\n      <span>Niklas</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"MistralAI\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Thomas+Wang%22\">\n\n      <span>Thomas Wang</span><sup>5</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22sdtblck%22\">\n\n      <span>sdtblck</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n  
  \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22gakada%22\">\n\n      <span>gakada</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22nopperl%22\">\n\n      <span>nopperl</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22researcher2%22\">\n\n      <span>researcher2</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22tttyuntian%22\">\n\n      <span>tttyuntian</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"Hitz Zentroa UPV/EHU\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Julen+Etxaniz%22\">\n\n      <span>Julen Etxaniz</span><sup>6</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"@azurro\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Chris%22\">\n\n      <span>Chris</span><sup>7</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"MIT\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22James+A.+Michaelov%22\">\n\n      <span>James A. 
Michaelov</span><sup>8</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"Shinhan Securities Co.\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hanwool+Albert+Lee%22\">\n\n      <span>Hanwool Albert Lee</span><sup>9</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Janna%22\">\n\n      <span>Janna</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leonid+Sinev%22\">\n\n      <span>Leonid Sinev</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"Charles University\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Zden%C4%9Bk+Kasner%22\">\n\n      <span>Zdeněk Kasner</span><sup>10</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"Open Source Developer @ IBM\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Kiersten+Stokes%22\">\n\n      <span>Kiersten Stokes</span><sup>11</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Khalid%22\">\n\n      <span>Khalid</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a 
href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22KonradSzafer%22\">\n\n      <span>KonradSzafer</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n          </ul>\n      </div>\n\n        \n        \n          \n\n\n<section id=\"creators-affiliations\" aria-label=\"Affiliations for creators\">\n  <ul>\n    \n    <li>\n      1.\n\n      \n\n      Language Technologies Institute, CMU\n    </li>\n  \n    <li>\n      2.\n\n      \n\n      Booz Allen Hamilton, EleutherAI\n    </li>\n  \n    <li>\n      3.\n\n      \n\n      playscape.gg\n    </li>\n  \n    <li>\n      4.\n\n      \n\n      Max Planck Institute for Software Systems: MPI SWS\n    </li>\n  \n    <li>\n      5.\n\n      \n\n      MistralAI\n    </li>\n  \n    <li>\n      6.\n\n      \n\n      Hitz Zentroa UPV/EHU\n    </li>\n  \n    <li>\n      7.\n\n      \n\n      @azurro\n    </li>\n  \n    <li>\n      8.\n\n      \n\n      MIT\n    </li>\n  \n    <li>\n      9.\n\n      \n\n      Shinhan Securities Co.\n    </li>\n  \n    <li>\n      10.\n\n      \n\n      Charles University\n    </li>\n  \n    <li>\n      11.\n\n      \n\n      Open Source Developer @ IBM\n    </li>\n  \n\n    </ul>\n</section>\n\n        \n\n    </div>\n\n\n\n  <section id=\"description\" aria-label=\"Record description\">\n    <h2 id=\"description-heading\">Description</h2>\n    \n    <div>\n      <p>This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. 
A notable change: <strong>Python 3.10 is now the minimum required version</strong>.</p>\n<h3>New Benchmarks &amp; Tasks</h3>\n<p>A big wave of new evaluation tasks this release:</p>\n<ul>\n<li><strong>AIME</strong> and <strong>MATH500</strong> math reasoning benchmarks by @jannalulu in #3248, #3311</li>\n<li><strong>BabiLong</strong> and <strong>Longbench v2</strong> for long-context evaluation by @jannalulu in #3287, #3338</li>\n<li><strong>GraphWalks</strong> by @jannalulu in #3377</li>\n<li><strong>ZhoBLiMP</strong>, <strong>BLiMP-NL</strong>, <strong>TurBLiMP</strong>, <strong>LM-SynEval</strong>, and <strong>BHS</strong> linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265</li>\n<li><strong>Icelandic WinoGrande</strong> by @jmichaelov in #3277</li>\n<li><strong>CLIcK</strong> Korean benchmark by @shing100 in #3173</li>\n<li><strong>MMLU-Redux</strong> (generative) and Spanish translation by @luiscosio in #2705</li>\n<li><strong>EsBBQ</strong> and <strong>CaBBQ</strong> bias benchmarks by @valleruizf in #3167</li>\n<li><strong>EQBench</strong> in Spanish and Catalan by @priverabsc in #3168</li>\n<li><strong>Anthropic discrim-eval</strong> by @Helw150 in #3091</li>\n<li><strong>XNLI-VA</strong> by @FranValero97 in #3194</li>\n<li><strong>Bangla MMLU</strong> (Titulm) by @Ismail-Hossain-1 in #3317</li>\n<li><strong>HumanEval infilling</strong> by @its-alpesh in #3299</li>\n<li><strong>CNN-DailyMail 3.0.0</strong> by @preordinary in #3426</li>\n<li><strong>Global PIQA</strong> and new <code>acc_norm_bytes</code> metric by @baberabb in #3368</li>\n</ul>\n<h3>Fixes &amp; Improvements</h3>\n<p><strong>Core Changes:</strong></p>\n<ul>\n<li><strong>Python 3.10 minimum</strong> by @jannalulu in #3337</li>\n<li><strong>Unpinned <code>datasets</code></strong> library by @baberabb in #3316</li>\n<li><strong>BOS token handling</strong>: Delegate to tokenizer; <code>add_bos_token</code> now defaults to <code>None</code> by @baberabb in 
#3347</li>\n<li>Renamed <code>LOGLEVEL</code> env var to <code>LMEVAL_LOG_LEVEL</code> to avoid conflicts by @fxmarty-amd in #3418</li>\n<li>Resolve duplicate task names with safeguards by @giuliolovisotto in #3394</li>\n</ul>\n<p><strong>Task Fixes:</strong></p>\n<ul>\n<li>Fixed MMLU-Redux to exclude samples without <code>error_type=\"ok\"</code> and display summary table by @fxmarty-amd in #3410, #3406</li>\n<li>Fixed AIME answer extraction by @jannalulu in #3353</li>\n<li>Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361</li>\n<li>Fixed <code>crows_pairs</code> dataset by @jannalulu in #3378</li>\n<li>Fixed Gemma tokenizer <code>add_bos_token</code> not updating by @DarkLight1337 in #3206</li>\n<li>Fixed <code>lambada_multilingual_stablelm</code> by @jmichaelov, @HallerPatrick in #3294, #3222</li>\n<li>Fixed CodeXGLUE by @gsaltintas in #3238</li>\n<li>Pinned correct MMLUSR version by @christinaexyou in #3350</li>\n<li>Updated <code>minerva_math</code> by @baberabb in #3259</li>\n</ul>\n<p><strong>Backend Fixes:</strong></p>\n<ul>\n<li>Fixed vLLM import errors when not installed by @fxmarty-amd in #3292</li>\n<li>Fixed vLLM <code>data_parallel_size&gt;1</code> issue by @Dornavineeth in #3303</li>\n<li>Resolved deprecated <code>vllm.utils.get_open_port</code> by @DarkLight1337 in #3398</li>\n<li>Fixed GPT series model bugs by @zinccat in #3348</li>\n<li>Fixed PIL image hashing to use actual bytes by @tboerstad in #3331</li>\n<li>Fixed <code>additional_config</code> parsing by @brian-dellabetta in #3393</li>\n<li>Fixed batch chunking seed handling with groupby by @slimfrkha in #3047</li>\n<li>Fixed no-output error handling by @Oseltamivir in #3395</li>\n<li>Replaced deprecated <code>torch_dtype</code> with <code>dtype</code> by @AbdulmalikDS in #3415</li>\n<li>Fixed custom task config reading by @SkyR0ver in #3425</li>\n</ul>\n<h3>Model &amp; Backend Support</h3>\n<ul>\n<li><strong>OpenAI GPT-5</strong> support by 
@babyplutokurt in #3247</li>\n<li><strong>Azure OpenAI</strong> support by @zinccat in #3349</li>\n<li><strong>Fine-tuned Gemma3</strong> evaluation support by @LearnerSXH in #3234</li>\n<li><strong>OpenVINO text2text</strong> models by @nikita-savelyevv in #3101</li>\n<li><strong>Intel XPU</strong> support for HFLM by @kaixuanliu in #3211</li>\n<li><strong>Attention head steering</strong> support by @luciaquirke in #3279</li>\n<li>Leverage vLLM's <code>tokenizer_info</code> endpoint to avoid manual duplication by @m-misiura in #3185</li>\n</ul>\n<h2>What's Changed</h2>\n<ul>\n<li>Remove <code>trust_remote_code: True</code> from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213</li>\n<li>Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\n<li>Fix <code>add_bos_token</code> not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206</li>\n<li>remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\n<li>Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\n<li>Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\n<li>Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\n<li>Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218</li>\n<li>Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221</li>\n<li>Add TurBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3219</li>\n<li>Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184</li>\n<li>Fix unknown group key to 
tag in yaml config for <code>lambada_multilingual_stablelm</code> by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\n<li>update <code>minerva_math</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259</li>\n<li>feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173</li>\n<li>Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\n<li>Add support for OpenVINO text2text generation models  by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\n<li>Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\n<li>Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\n<li>feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\n<li>pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268</li>\n<li>Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\n<li>Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265</li>\n<li>Add <code>acc_norm</code> metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272</li>\n<li>Add <code>acc_norm</code> metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271</li>\n<li>Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\n<li>Add support for steering individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279</li>\n<li>Add the Icelandic WinoGrande benchmark by @jmichaelov in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3277</li>\n<li>Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047</li>\n<li>[fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292</li>\n<li>Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\n<li>add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\n<li>feat: Add mmlu-redux and it's spanish transaltion as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705</li>\n<li>Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287</li>\n<li>Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296</li>\n<li>Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\n<li>Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\n<li>[fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321</li>\n<li>Fix: VLLM model when data_parallel_size&gt;1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\n<li>unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316</li>\n<li>bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337</li>\n<li>Longbench v2 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3338</li>\n<li>Leverage vllm's <code>tokenizer_info</code> endpoint to avoid manual duplication  by @m-misiura in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\n<li>Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\n<li>remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343</li>\n<li>Align <code>humaneval_64_instruct</code> task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344</li>\n<li>Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\n<li>[fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353</li>\n<li>add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368</li>\n<li>[fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378</li>\n<li>Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\n<li>fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\n<li>added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349</li>\n<li>Delegate BOS to the tokenizer; <code>add_bos_token</code> defaults to <code>None</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347</li>\n<li>fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361</li>\n<li>[feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377</li>\n<li>Longbench group fix by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359</li>\n<li>Resolve deprecation of <code>vllm.utils.get_open_port</code> by @DarkLight1337 in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3398</li>\n<li>Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\n<li>Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\n<li>Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\n<li>[MMLU redux] Do not use samples which do not have <code>error_type=\"ok\"</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410</li>\n<li>fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394</li>\n<li>Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311</li>\n<li>[bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\n<li>fix(tasks):pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\n<li>Fix <code>lambada_multilingual_stablelm</code> by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294</li>\n<li>Fix descriptions in the Moral Stories and Histoires Morales tasks. 
by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374</li>\n<li>Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\n<li>[fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406</li>\n<li>Rename the conflicting environment variable <code>LOGLEVEL</code> to <code>LMEVAL_LOG_LEVEL</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418</li>\n<li>Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\n<li>Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\n<li>New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\n</ul>\n<h2>New Contributors</h2>\n<ul>\n<li>@LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\n<li>@ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\n<li>@Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\n<li>@babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\n<li>@FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\n<li>@HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\n<li>@Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\n<li>@nikita-savelyevv made their first contribution in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\n<li>@weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\n<li>@jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\n<li>@slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\n<li>@gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\n<li>@valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\n<li>@TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\n<li>@kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\n<li>@its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\n<li>@priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\n<li>@Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\n<li>@m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\n<li>@Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\n<li>@zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\n<li>@marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\n<li>@neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\n<li>@ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\n<li>@Oseltamivir made their first contribution in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\n<li>@tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\n<li>@brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\n<li>@christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\n<li>@AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\n<li>@Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\n<li>@SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\n<li>@preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\n</ul>\n<p><strong>Full Changelog</strong>: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2</p>\n    </div>\n  </section>\n\n\n<section id=\"record-files\" aria-label=\"Files\"><h2 id=\"files-heading\">Files</h2>\n            \n  <div href=\"#files-preview-accordion-panel\">\n    <h3>\n      <p><span id=\"preview-file-title\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</span>\n        \n      </p>\n    </h3>\n    \n  </div>\n  <div href=\"#files-list-accordion-panel\">\n    <h3>\n      <p>\n        Files\n        <small> (10.0 MB)</small>\n        \n      </p>\n    </h3>\n\n    <div role=\"region\" id=\"files-list-accordion-panel\" aria-labelledby=\"files-list-accordion-trigger\">\n        \n  <table id=\"file-list-table\">\n    <thead>\n      <tr>\n        <th>Name</th>\n        <th>Size</th>\n        <th>\n            \n            <a role=\"button\" href=\"https://zenodo.org/api/records/17728786/files-archive\">\n               Download all\n            </a>\n        </th>\n      </tr>\n    </thead>\n    <tbody>\n    \n      \n        \n        \n     
   <tr>\n          <td>\n            <div>\n              <p><a href=\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</a>\n            </p></div>\n            <small>md5:a1821ba45eee3362f49140f332ff6512\n            </small>\n          </td>\n          <td>10.0 MB</td>\n          <td>\n            <span>\n                <a role=\"button\" href=\"https://zenodo.org/records/17728786/preview/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?include_deleted=0\" target=\"preview-iframe\" data-file-key=\"EleutherAI/lm-evaluation-harness-v0.4.9.2.zip\">\n                  Preview\n                </a>\n              <a role=\"button\" href=\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\">\n                Download\n              </a>\n            </span>\n          </td>\n        </tr>\n    </tbody>\n  </table>\n      </div>\n  </div>\n\n    </section>\n\n  <section id=\"additional-details\" aria-label=\"Additional record details\">\n  <h2 id=\"record-details-heading\">Additional details</h2>\n  </section>\n        </article></div>"
  }
}

▼thought

I now have a comprehensive picture. Let me also fetch the EleutherAI paper "Lessons from the Trenches on Reproducible Evaluation of Language Models", a key reference for understanding the philosophy behind the harness.

▼✓Fetch web page body text URL: "https://arxiv.org/html/2405.14782v1"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://arxiv.org/html/2405.14782v1\",\n  \"title\": \"Lessons from the Trenches on Reproducible Evaluation of Language Models\",\n  \"author\": \"Stella Biderman\",\n  \"summary\": \"Hailey Schoelkopf\\n\\n\\n\\n\\nLintang Sutawika\\n\\n\\n\\n\\n\\nLeo Gao\\n\\nEleutherAI\\n\\n\\nJonathan Tow\\n\\nStability AI\\n\\n\\nBaber Abbasi\\n\\nEleutherAI\\n\\n\\nAlham Fikri Aji\\n\\nMBZUAI\\n\\n\\nPawan Sasanka Ammanamanchi\\n\\nIIIT Hyderabad\\n\\n\\nSidney …\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article>\\n\\n<p>\\n<span>\\n<span>Hailey Schoelkopf\\n</span><span>\\n<span>\\n</span></span></span>\\n<span>\\n<span>Lintang Sutawika\\n</span><span>\\n<span>\\n</span></span></span>\\n<span>\\n<span>\\n<br>Leo Gao\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Jonathan Tow\\n</span><span>\\n<span>Stability AI\\n</span></span></span>\\n<span>\\n<span>Baber Abbasi\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Alham Fikri Aji\\n</span><span>\\n<span>MBZUAI\\n</span></span></span>\\n<span>\\n<span>Pawan Sasanka Ammanamanchi\\n</span><span>\\n<span>IIIT Hyderabad\\n</span></span></span>\\n<span>\\n<span>Sidney Black\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Jordan Clive\\n</span><span>\\n<span>Chattermill AI\\n</span></span></span>\\n<span>\\n<span>Anthony DiPofi\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Julen Etxaniz\\n</span><span>\\n<span>HiTZ Center - Ixa, UPV/EHU\\n</span></span></span>\\n<span>\\n<span>Benjamin Fattori\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Jessica Zosa Forde\\n</span><span>\\n<span>Brown University\\n</span></span></span>\\n<span>\\n<span>Charles Foster\\n</span><span>\\n<span>Finetune\\n</span></span></span>\\n<span>\\n<span>Mimansa Jaiswal\\n</span><span>\\n<span>University of 
Michigan\\n</span></span></span>\\n<span>\\n<span>Wilson Y. Lee\\n</span><span>\\n<span>HubSpot\\n</span></span></span>\\n<span>\\n<span>Haonan Li\\n</span><span>\\n<span>MBZUAI\\n</span>\\n<span>LibrAI\\n</span></span></span>\\n<span>\\n<span>Charles Lovering\\n</span><span>\\n<span>Kensho\\n</span></span></span>\\n<span>\\n<span>Niklas Muennighoff\\n</span><span>\\n<span>Contextual AI\\n</span></span></span>\\n<span>\\n<span>Ellie Pavlick\\n</span><span>\\n<span>HiTZ Center - Ixa, UPV/EHU\\n</span></span></span>\\n<span>\\n<span>Jason Phang\\n</span><span>\\n<span>EleutherAI\\n</span>\\n<span>New York University\\n</span></span></span>\\n<span>\\n<span>Aviya Skowron\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Samson Tan\\n</span><span>\\n<span>Amazon\\n</span></span></span>\\n<span>\\n<span>Xiangru Tang\\n</span><span>\\n<span>Yale University\\n</span></span></span>\\n<span>\\n<span>Kevin A. Wang\\n</span><span>\\n<span>HiTZ Center - Ixa, UPV/EHU\\n</span></span></span>\\n<span>\\n<span>Genta Indra Winata\\n</span><span>\\n<span>HKUST\\n</span></span></span>\\n<span>\\n<span>François Yvon\\n</span><span>\\n<span>Sorbonne University\\n</span></span></span>\\n<span>\\n<span>Andy Zou\\n</span><span>\\n<span>CMU\\n</span></span></span>\\n</p>\\n<div id=\\\"abstract\\\">\\n<h6>Abstract</h6><p id=\\\"id1.id1\\\">Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. 
Third, we present the&nbsp;<span id=\\\"id1.id1.1\\\">Language Model Evaluation Harness</span>&nbsp;(<span id=\\\"id1.id1.2\\\">lm-eval</span>): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns.</p></div>\\n<section id=\\\"S1\\\">\\n<h2>\\n<span>1 </span>Introduction</h2>\\n<div id=\\\"S1.p1\\\">\\n<p id=\\\"S1.p1.1\\\">Evaluation on shared benchmark tasks is a crucial tool used to track and communicate progress in the machine learning and language modeling communities <cite>(Ruder, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib98\\\" title=\\\"\\\">2021</a>)</cite>. Benchmarks are used to track progress toward shared community goals and to demonstrate the improvements of newly proposed methods over prior baselines. Evaluation practices thus play a crucial role in the direction of the field: inconsistencies or biases in evaluation practices can lead to skewed performance comparisons, which may influence the direction of future research and the adoption of new methods by the community <cite>(Dehghani et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib28\\\" title=\\\"\\\">2021</a>)</cite> or lead to adverse effects from deploying suboptimal or harmful models <cite>(Bender &amp; Friedman, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib10\\\" title=\\\"\\\">2018</a>)</cite> on tasks for which they are ill-suited <cite>(Raji et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib94\\\" title=\\\"\\\">2022</a>)</cite>.</p></div>\\n<div id=\\\"S1.p2\\\">\\n<p id=\\\"S1.p2.1\\\">In this work, we detail our lessons learned that have been especially beneficial to obtaining useful and rigorous findings. By sharing these lessons, we aim to foster a more robust and reliable evaluation ecosystem. 
Our contributions are threefold:</p><ol id=\\\"S1.I1\\\">\\n<li id=\\\"S1.I1.i1\\\">\\n<span>1.</span>\\n<div id=\\\"S1.I1.i1.p1\\\">\\n<p id=\\\"S1.I1.i1.p1.1\\\">We highlight several commonly faced challenges in evaluating language models, including the difficulty of assessing the correctness of natural language responses, challenges in benchmark design, and the dependence upon implementation details that are often obscured or unreported (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S2\\\" title=\\\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2</span></a>).</p></div>\\n</li>\\n<li id=\\\"S1.I1.i2\\\">\\n<span>2.</span>\\n<div id=\\\"S1.I1.i2.p1\\\">\\n<p id=\\\"S1.I1.i2.p1.1\\\">We then discuss best practices we’ve identified to improve how results are communicated and to strengthen evaluation rigor in the language modeling community, despite (or to mitigate the impact of) these challenges (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S3\\\" title=\\\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>3</span></a>).</p></div>\\n</li>\\n<li id=\\\"S1.I1.i3\\\">\\n<span>3.</span>\\n<div id=\\\"S1.I1.i3.p1\\\">\\n<p id=\\\"S1.I1.i3.p1.1\\\">To enable researchers and engineers to easily utilize the best practices we have identified, we present&nbsp;<span id=\\\"S1.I1.i3.p1.1.1\\\">lm-eval</span>, our open-source library for reproducible evaluation of language models (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S4\\\" title=\\\"4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>4</span></a>). 
<span id=\\\"S1.I1.i3.p1.1.2\\\">lm-eval</span>’s framework defines a flexible API for both model implementation and evaluation task implementation, and then handles all of the work to <span id=\\\"S1.I1.i3.p1.1.3\\\">orchestrate</span> evaluations internally, allowing users to evaluate their chosen combinations of models + tasks, dependent on use case. We additionally present a set of case studies in which&nbsp;<span id=\\\"S1.I1.i3.p1.1.4\\\">lm-eval</span>&nbsp;was used to improve the level of rigor of a set of evaluations (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S5\\\" title=\\\"5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>5</span></a>).</p></div>\\n</li>\\n</ol>\\n</div>\\n</section>\\n<section id=\\\"S2\\\">\\n<h2>\\n<span>2 </span>Challenges in Evaluating Language Models</h2>\\n<section id=\\\"S2.SS1\\\">\\n<h3>\\n<span>2.1 </span>Evaluating and Scoring Natural Language Abilities</h3>\\n<div id=\\\"S2.SS1.p1\\\">\\n<p id=\\\"S2.SS1.p1.1\\\">The biggest challenge in language model evaluation is a concept we term <span id=\\\"S2.SS1.p1.1.1\\\">the Key Problem</span>: When evaluating language models, there can be many semantically equivalent but syntactically different ways of expressing the same idea. In an ideal world, we would have a way to automatically detect when two sentences express the same content but in different words. Unfortunately, our best tools for determining whether two sentences are semantically equivalent <span id=\\\"S2.SS1.p1.1.2\\\">are the very models we are seeking to evaluate</span>. This problem drives many of the approaches to LM benchmarking, and many problems in LM evaluation stem from there not being any silver bullets for solving the Key Problem.</p></div>\\n<div id=\\\"S2.SS1.p2\\\">\\n<p id=\\\"S2.SS1.p2.1\\\">In principle, this would be solvable by simply having expert human annotators score model responses for correctness. 
The main reason this is not ubiquitous is cost: performing accurate human studies is not only difficult and time-consuming but also very expensive once annotators are fairly compensated, pricing smaller actors or organizations out of performing such evaluations. Additionally, there are other reasons why relying solely on human assessments calls for caution: they can be flawed and biased, especially for complex judgments such as factuality <cite>(Hosking et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib47\\\" title=\\\"\\\">2024</a>; Xu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib127\\\" title=\\\"\\\">2023</a>; Wu &amp; Aji, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib126\\\" title=\\\"\\\">2023</a>)</cite>. Expert, trained human judgment can alleviate these issues but is inherently non-scalable.</p></div>\\n<div id=\\\"S2.SS1.p3\\\">\\n<p id=\\\"S2.SS1.p3.1\\\">To address the high costs of manual human evaluation, automated metrics are often used. These offer notable advantages in that they are (theoretically) fully reproducible, far easier and cheaper to compute, and can avoid some of the issues faced by human studies <cite>(Wei &amp; Jia, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib124\\\" title=\\\"\\\">2021</a>; Freitag et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib33\\\" title=\\\"\\\">2021</a>; Amidei et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib4\\\" title=\\\"\\\">2020</a>)</cite>. 
Automated metrics such as BLEU <cite>(Papineni et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib81\\\" title=\\\"\\\">2002</a>)</cite> and ROUGE <cite>(Lin, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib65\\\" title=\\\"\\\">2004</a>)</cite> seek to directly solve the Key Problem by measuring the distance from a generated response to a gold-standard one, for example by counting the n-gram overlap between the two texts.\\nHeuristic-based metrics such as BLEU (and its derivatives) have flaws <cite>(Callison-Burch et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib19\\\" title=\\\"\\\">2006</a>)</cite> and present reproducibility challenges <cite>(Marie et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib74\\\" title=\\\"\\\">2021</a>)</cite>, but can be useful. More recently, model-based metrics have gained momentum through evaluation methods that use large language models as graders <cite>(Kim et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib56\\\" title=\\\"\\\">2024</a>; Wang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib120\\\" title=\\\"\\\">2024</a>; Liu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib68\\\" title=\\\"\\\">2023b</a>)</cite>, especially as proxies for human preference evaluation <cite>(Zheng et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib133\\\" title=\\\"\\\">2023</a>)</cite>, but these are known to be flawed <cite>(Wang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib118\\\" title=\\\"\\\">2023</a>; Huang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib49\\\" title=\\\"\\\">2024</a>; Shen et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib106\\\" title=\\\"\\\">2023</a>; Zeng et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib130\\\" title=\\\"\\\">2024</a>; Hu et&nbsp;al., <a 
href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib48\\\" title=\\\"\\\">2024</a>; Liu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib69\\\" title=\\\"\\\">2023c</a>; Chen et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib20\\\" title=\\\"\\\">2024</a>)</cite> and suffer from similar reproducibility issues as BLEU, ROUGE, and their variants.</p></div>\\n<div id=\\\"S2.SS1.p4\\\">\\n<p id=\\\"S2.SS1.p4.1\\\">The Key Problem can alternately be sidestepped by artificially restricting the answer space. The most prevalent way to achieve this is to reframe questions as multiple choice problems, with a single gold target answer and a finite, static set of possible responses <cite>(Hendrycks et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib42\\\" title=\\\"\\\">2020</a>; Srivastava et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib110\\\" title=\\\"\\\">2022</a>; Li’evin et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib64\\\" title=\\\"\\\">2022</a>; Lin et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib66\\\" title=\\\"\\\">2022</a>; Robinson et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib95\\\" title=\\\"\\\">2023</a>; Holtzman et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib46\\\" title=\\\"\\\">2022</a>)</cite>. 
Alternatively, when a reference answer is known, one can apply heuristic string matching to determine whether the model’s answer matches the ground truth <cite>(Dua et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib30\\\" title=\\\"\\\">2019</a>; Joshi et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib53\\\" title=\\\"\\\">2017</a>; Hendrycks et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib43\\\" title=\\\"\\\">2021</a>)</cite>.</p></div>\\n<div id=\\\"S2.SS1.p5\\\">\\n<p id=\\\"S2.SS1.p5.1\\\">This challenge does not necessarily impact other applications of language models and related technologies, such as playing games where it is easy to check that the game has ended <cite>(Romstad et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib97\\\" title=\\\"\\\">2008</a>; Silver et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib107\\\" title=\\\"\\\">2018</a>; FAIR et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib31\\\" title=\\\"\\\">2022</a>)</cite>, more constrained scientific applications <cite>(Jumper et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib54\\\" title=\\\"\\\">2021</a>; Ahdritz et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib1\\\" title=\\\"\\\">2022</a>)</cite>, or domains where we have <span id=\\\"S2.SS1.p5.1.1\\\">practically usable verifiers</span> even when the solutions are not checkable in all contexts <cite>(Biderman, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib12\\\" title=\\\"\\\">2020</a>; Biderman &amp; Raff, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib13\\\" title=\\\"\\\">2022</a>; Lewkowycz et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib59\\\" title=\\\"\\\">2022</a>)</cite>. 
In the case of LLMs, the most notable domains where such a ground-truth verifier is available are coding and mathematics problems, although the verifiers used, such as unit tests, may still break down in edge cases <cite>(Liu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib67\\\" title=\\\"\\\">2023a</a>)</cite>.</p></div>\\n</section>\\n<section id=\\\"S2.SS2\\\">\\n<h3>\\n<span>2.2 </span>Benchmark Design and Validity</h3>\\n<div id=\\\"S2.SS2.p1\\\">\\n<p id=\\\"S2.SS2.p1.1\\\">Typically, we do not care about the actual numeric score of a model on a benchmark. Instead, we desire the benchmark to be a useful proxy for some real-world phenomenon. The <span id=\\\"S2.SS2.p1.1.1\\\">validity</span> of an evaluation is the extent to which the benchmark score correlates with that phenomenon <cite>(Messick, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib76\\\" title=\\\"\\\">1994</a>)</cite>. For a recent overview of validity concerns in NLP benchmarking, see <cite>Subramonian et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib111\\\" title=\\\"\\\">2023</a>)</cite>. Also see <cite>Raji et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib93\\\" title=\\\"\\\">2021</a>); Saphra et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib101\\\" title=\\\"\\\">2023</a>); Davis (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib25\\\" title=\\\"\\\">2023</a>)</cite> for extended discussion of construct validity in LLM evaluation.</p></div>\\n<div id=\\\"S2.SS2.p2\\\">\\n<p id=\\\"S2.SS2.p2.1\\\">While validity is an ongoing problem in language model evaluation, we focus on mitigating other concerns first: as we will describe,&nbsp;<span id=\\\"S2.SS2.p2.1.1\\\">lm-eval</span>&nbsp;is designed to ensure measurements are <span id=\\\"S2.SS2.p2.1.2\\\">consistent</span> across runs and models, regardless of (construct) validity.</p></div>\\n</section>\\n<section id=\\\"S2.SS3\\\">\\n<h3>\\n<span>2.3 </span>Implementation Difficulties and (Ir)Reproducibility</h3>\\n<div id=\\\"S2.SS3.p1\\\">\\n<p id=\\\"S2.SS3.p1.1\\\">Once a benchmark has been designed, it then needs to be implemented by machine learning researchers around the world to see use in driving progress in the field. This introduces a host of new challenges that need to be addressed in order to ensure that everyone is evaluating models on a benchmark in the same fashion when comparing results. This adaptation process can introduce inconsistencies and make it difficult to draw conclusions across different implementations. Researchers must <span id=\\\"S2.SS3.p1.1.1\\\">adapt</span> it to their own workflows and libraries for the purposes of actually adopting the benchmark in their research.</p></div>\\n<section id=\\\"S2.SS3.SSS1\\\">\\n<h4>\\n<span>2.3.1 </span>“Minor” Implementation Details Matter</h4>\\n<div id=\\\"S2.SS3.SSS1.p1\\\">\\n<p id=\\\"S2.SS3.SSS1.p1.1\\\">The importance of interoperability and full reproducibility stems from the fact that language models are incredibly sensitive to precise details that may not be obvious to practitioners. 
Even minor variations in prompts, formatting, or other implementation details can significantly impact the performance and validity of evaluations <cite>(Weber et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib121\\\" title=\\\"\\\">2023</a>; Sclar et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib103\\\" title=\\\"\\\">2023</a>; Mizrahi et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib79\\\" title=\\\"\\\">2024</a>; Alzahrani et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib3\\\" title=\\\"\\\">2024</a>; Lu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib70\\\" title=\\\"\\\">2022</a>; Webson &amp; Pavlick, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib122\\\" title=\\\"\\\">2022</a>; Min et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib78\\\" title=\\\"\\\">2022</a>)</cite>.\\nWithout access to the original evaluation code, when re-implementing evaluation procedures from scratch is required, it is nearly impossible to account for all the subtle details that can affect outcomes. As a result, these implementations are likely to diverge in ways that make it extremely difficult to ensure fair comparisons across works, even when evaluating on the same benchmark. 
Even having the prompts reported in a paper is no substitute for having access to the actual evaluation code: prompts in papers are often incorrect or difficult to map to the exact code implementation because they’ve been stylized to be human-readable.</p></div>\\n</section>\\n<section id=\\\"S2.SS3.SSS2\\\">\\n<h4>\\n<span>2.3.2 </span>Lack of Agreement About “Apples to Apples”</h4>\\n<div id=\\\"S2.SS3.SSS2.p1\\\">\\n<p id=\\\"S2.SS3.SSS2.p1.1\\\">Even assuming that benchmarks are implemented consistently across works, the question of <span id=\\\"S2.SS3.SSS2.p1.1.1\\\">how to draw fair comparisons</span> across models and methods is still difficult for LMs.</p></div>\\n<div id=\\\"S2.SS3.SSS2.p2\\\">\\n<p id=\\\"S2.SS3.SSS2.p2.1\\\">For instance, different instruction-tuned models may be trained to expect certain formats&nbsp;<cite>(Taori et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib113\\\" title=\\\"\\\">2023</a>; Sanh et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib100\\\" title=\\\"\\\">2022</a>; Wei et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib123\\\" title=\\\"\\\">2022</a>)</cite> – using these models’ intended prompt formats can make the evaluation tasks inherently different or change their difficulty, but not using these can also bias against models trained with formats not matching tasks’ “standard” prompting styles. Likewise, if an original benchmark implementation (including prompting and postprocessing) is tailored for a specific model, other models trained differently will suffer, artificially skewing perceptions of what techniques are effective.</p></div>\\n<div id=\\\"S2.SS3.SSS2.p3\\\">\\n<p id=\\\"S2.SS3.SSS2.p3.1\\\">Likewise, some questions of how to set up controlled experiments are still open–is it ideal to normalize performance and comparisons by the number of parameters? Training FLOPs? Inference cost? Must training data be held equal? 
How should models which can leverage external resources such as retrieved documents or external tools be compared? These questions are all context-dependent but can impact findings significantly. For example, <cite>Wang et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib119\\\" title=\\\"\\\">2022</a>)</cite> explore comparisons across architectures and training objectives, and choose to normalize for FLOPs, thus comparing encoder-decoder models with double the parameters to decoder-only models. Comparing results of models with equivalent training FLOPs, regardless of the allocation of those FLOPs, is commonplace (<cite>Hoffmann et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib44\\\" title=\\\"\\\">2022</a>); Peng et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib83\\\" title=\\\"\\\">2023</a>); Touvron et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib115\\\" title=\\\"\\\">2023</a>)</cite>, <span id=\\\"S2.SS3.SSS2.p3.1.1\\\">inter alia</span>). However, in a more memory-constrained setting, comparing models equi-parameter may be more logical. 
While this is not inherently problematic, as different application contexts motivate different evaluation criteria, it is common to gloss headline claims as “model X is better than model Y” or “model X doesn’t really work as advertised” without paying significant attention to how comparisons were made.</p></div>\\n</section>\\n<section id=\\\"S2.SS3.SSS3\\\">\\n<h4>\\n<span>2.3.3 </span>Comparisons with Prior Work are Expensive (and Sometimes Impossible)</h4>\\n<div id=\\\"S2.SS3.SSS3.p1\\\">\\n<p id=\\\"S2.SS3.SSS3.p1.1\\\">Setting aside the question of establishing fair comparisons between methods or models, an additional key challenge in language modeling research is that many barriers prevent thorough comparison with related work.</p></div>\\n<div id=\\\"S2.SS3.SSS3.p2\\\">\\n<p id=\\\"S2.SS3.SSS3.p2.1\\\">Many LMs developed by industrial labs, often used as reference points for benchmarks, have never been released externally <cite>(Chowdhery et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib21\\\" title=\\\"\\\">2023</a>; Hoffmann et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib44\\\" title=\\\"\\\">2022</a>)</cite>, preventing comparisons except by pulling unverified evaluation numbers from technical reports. Models that have been made available via APIs may differ from the published versions without disclosure, or may otherwise be modified for deployment. Additionally, these API models are often quickly <span id=\\\"S2.SS3.SSS3.p2.1.1\\\">deprecated</span> and become inaccessible, rendering slews of work no longer reproducible<span id=\\\"footnote1\\\"><sup>1</sup><span><span>Notably, OpenAI’s <span id=\\\"footnote1.1\\\">code-davinci-002</span> model was <a href=\\\"https://platform.openai.com/docs/deprecations/base-gpt-models\\\" title=\\\"\\\">deprecated in January 2024</a>, making at minimum hundreds of research studies irreproducible.</span></span></span>. 
API access, especially for large volumes of evaluation, is quite expensive.</p></div>\\n\\n</section>\\n</section>\\n<section id=\\\"S2.SS4\\\">\\n<h3>\\n<span>2.4 </span>Fast-changing Progress and Conventions</h3>\\n<figure id=\\\"S2.F1\\\"><img alt=\\\"Refer to caption\\\" height=\\\"246\\\" id=\\\"S2.F1.g1\\\" src=\\\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/Timeline_v1.png\\\" width=\\\"598\\\">\\n<figcaption><span>Figure 1: </span>A timeline showing the relative release dates of a selection of notable benchmarks used to evaluate LMs, as compared to the release dates of BERT <cite>(Devlin et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib29\\\" title=\\\"\\\">2018</a>)</cite>, GPT-2 <cite>(Radford et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib92\\\" title=\\\"\\\">2019</a>)</cite>, GPT-3 <cite>(Brown et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>, and ChatGPT, used as approximate stand-ins for shifts in how the community uses and therefore evaluates LMs.</figcaption>\\n</figure>\\n<div id=\\\"S2.SS4.p1\\\">\\n<p id=\\\"S2.SS4.p1.1\\\">Due to the time-consuming nature of developing good benchmarks and the rapid pace of change in NLP research in the past decade, many widely used language model evaluation benchmarks do not represent the current paradigm of how language models are trained. 
This has two major impacts:</p><ul id=\\\"S2.I1\\\">\\n<li id=\\\"S2.I1.i1\\\">\\n<span>•</span>\\n<div id=\\\"S2.I1.i1.p1\\\">\\n<p id=\\\"S2.I1.i1.p1.1\\\">Benchmarks are being used for purposes they were not originally designed for or designed for validity under: for example, a large number of benchmarks have been built around fine-tuning on a known training set and closed space of labels <cite>(Wang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib117\\\" title=\\\"\\\">2019b</a>; <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib116\\\" title=\\\"\\\">a</a>)</cite>.</p></div>\\n</li>\\n<li id=\\\"S2.I1.i2\\\">\\n<span>•</span>\\n<div id=\\\"S2.I1.i2.p1\\\">\\n<p id=\\\"S2.I1.i2.p1.1\\\">There is no “ground-truth” implementation from the original benchmark authors for many of these popular benchmarks “retrofitted” to be used with prompted autoregressive LMs. In the absence of a clear standard, the community’s methodology for evaluating on these benchmarks may be fragmented or undocumented <cite>(Clark et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib23\\\" title=\\\"\\\">2018</a>; Paperno et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib80\\\" title=\\\"\\\">2016</a>)</cite>.</p></div>\\n</li>\\n</ul>\\n</div>\\n<div id=\\\"S2.SS4.p2\\\">\\n<p id=\\\"S2.SS4.p2.1\\\">To illustrate the effects of this development timeline, Figure <a href=\\\"https://arxiv.org/html/2405.14782v1#S2.F1\\\" title=\\\"Figure 1 ‣ 2.4 Fast-changing Progress and Conventions ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>1</span></a> shows how many prominent LM benchmarks were designed prior to shifts such as in-context learning and chat interaction, and therefore were not designed to take these formats and approaches into account. 
This can affect validity or difficulty in unforeseen ways.</p></div>\\n</section>\\n</section>\\n<section id=\\\"S3\\\">\\n<h2>\\n<span>3 </span>Best Practices for Language Model Evaluation</h2>\\n<div id=\\\"S3.p1\\\">\\n<p id=\\\"S3.p1.1\\\">While LM evaluation is difficult and suffers from a number of challenges as we have described, there are measures that can be taken to significantly improve current practices. We provide our high-level recommendations regarding such measures, and detail our motivations briefly for each.</p></div>\\n<section id=\\\"S3.SS0.SSS0.Px1\\\">\\n<h5>Always share your exact prompts and code</h5>\\n<div id=\\\"S3.SS0.SSS0.Px1.p1\\\">\\n<ul id=\\\"S3.I1\\\">\\n<li id=\\\"S3.I1.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I1.i1.p1\\\">\\n<p id=\\\"S3.I1.i1.p1.1\\\">If possible, the full <span id=\\\"S3.I1.i1.p1.1.1\\\">evaluation code</span>, including the full prompts used, should be provided for reproducible evaluation runs, along with further identifiers such as links to the specific commits used. Failing this, sharing the prompts alone, though often not done, can drastically improve reproducibility.\\n</p></div>\\n</li>\\n<li id=\\\"S3.I1.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I1.i2.p1\\\">\\n<p id=\\\"S3.I1.i2.p1.1\\\">For fair comparison against other models, evaluation should be done with the same set of prompts unless there’s a good reason not to. 
<span id=\\\"S3.I1.i2.p1.1.1\\\">Prompts should not be optimized for performance on a given model but not others, and the amount of prompt engineering done should be disclosed.</span></p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px2\\\">\\n<h5>Avoid copying results from other implementations</h5>\\n<div id=\\\"S3.SS0.SSS0.Px2.p1\\\">\\n<ul id=\\\"S3.I2\\\">\\n<li id=\\\"S3.I2.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I2.i1.p1\\\">\\n<p id=\\\"S3.I2.i1.p1.1\\\">Comparing results across papers can be misleading due to a wide range of experimental differences, including prompts, sample size, metric calculation, and more <cite>(Marie et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib74\\\" title=\\\"\\\">2021</a>)</cite>.</p></div>\\n</li>\\n<li id=\\\"S3.I2.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I2.i2.p1\\\">\\n<p id=\\\"S3.I2.i2.p1.1\\\">Results should <span id=\\\"S3.I2.i2.p1.1.1\\\">not</span> be copied or reported from other papers&nbsp;<cite>(Marie, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib73\\\" title=\\\"\\\">2022</a>)</cite> whenever possible, unless one can verify that the exact same code has been used to run the experiments in those papers. 
If such copying is unavoidable, it should be clearly marked as such and treated carefully.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px3\\\">\\n<h5>Always provide model outputs</h5>\\n<div id=\\\"S3.SS0.SSS0.Px3.p1\\\">\\n<ul id=\\\"S3.I3\\\">\\n<li id=\\\"S3.I3.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I3.i1.p1\\\">\\n<p id=\\\"S3.I3.i1.p1.1\\\">Providing model outputs alongside evaluation code can allow others to recalculate scores based on these artifacts, which can be useful for performing statistical significance testing and for assessing the impact of different evaluation metrics or scoring approaches.</p></div>\\n</li>\\n<li id=\\\"S3.I3.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I3.i2.p1\\\">\\n<p id=\\\"S3.I3.i2.p1.1\\\">Evaluation of large models or APIs can be quite costly–sharing such artifacts allows researchers without access to significant compute to participate in evaluation research.</p></div>\\n</li>\\n<li id=\\\"S3.I3.i3\\\">\\n<span>•</span>\\n<div id=\\\"S3.I3.i3.p1\\\">\\n<p id=\\\"S3.I3.i3.p1.1\\\">Finally, sharing outputs can allow results on API models to be reproduced to some extent, even if the models are subsequently deprecated.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px4\\\">\\n<h5>Perform qualitative analyses</h5>\\n<div id=\\\"S3.SS0.SSS0.Px4.p1\\\">\\n<ul id=\\\"S3.I4\\\">\\n<li id=\\\"S3.I4.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I4.i1.p1\\\">\\n<p id=\\\"S3.I4.i1.p1.1\\\">Qualitatively review a small batch of results and outputs before testing at scale: it is very easy to have bugs in your generation code, especially when working with multiple sets of benchmarks, prompts, and models of different architectures. 
Catching issues early can save a lot of time and compute, and increase confidence in results.</p></div>\\n</li>\\n<li id=\\\"S3.I4.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I4.i2.p1\\\">\\n<p id=\\\"S3.I4.i2.p1.1\\\">Quantitative scores only provide so much information. To understand why a model is scoring so well or so poorly, it is important to do some sort of qualitative error analysis. This can sometimes reveal superficial errors that are easier to correct with post-processing&nbsp;<cite>Bawden &amp; Yvon (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib8\\\" title=\\\"\\\">2023</a>)</cite>, or more fundamental errors.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px5\\\">\\n<h5>Perform statistical significance testing</h5>\\n<div id=\\\"S3.SS0.SSS0.Px5.p1\\\">\\n<ul id=\\\"S3.I5\\\">\\n<li id=\\\"S3.I5.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I5.i1.p1\\\">\\n<p id=\\\"S3.I5.i1.p1.1\\\">Most works on language modeling do not perform statistical significance testing <cite>(Marie et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib74\\\" title=\\\"\\\">2021</a>)</cite>. This simple addition can dramatically boost the reliability of claimed results.</p></div>\\n</li>\\n<li id=\\\"S3.I5.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I5.i2.p1\\\">\\n<p id=\\\"S3.I5.i2.p1.1\\\">Although costly, reporting results run over more than one random seed can dramatically boost the validity and utility of results. For example, averaging across model runs <cite>(Sellam et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib104\\\" title=\\\"\\\">2022</a>)</cite>, or averaging over multiple selections of few-shot examples.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n</section>\\n<section id=\\\"S4\\\">\\n<h2>\\n<span>4 </span>The Language Model Evaluation Harness</h2>\\n<div id=\\\"S4.p1\\\">\\n<p id=\\\"S4.p1.1\\\">Informed by these practices we have built <span id=\\\"S4.p1.1.1\\\">lm-eval</span>. 
Unlike prior work on unified benchmarking libraries <cite>(Liang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib63\\\" title=\\\"\\\">2023</a>; Srivastava et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib110\\\" title=\\\"\\\">2022</a>)</cite>, the Evaluation Library does not seek to solely prescribe what the correct benchmark or evaluation protocols to use are, and allows users to select their desired tasks and use cases.</p></div>\\n<div id=\\\"S4.p2\\\">\\n<p id=\\\"S4.p2.1\\\">The role of the <span id=\\\"S4.p2.1.1\\\">lm-eval</span>&nbsp;is to solve the <span id=\\\"S4.p2.1.2\\\">orchestration problem</span>: previously, performing thorough LM evaluations would require painstaking re-implementation of previous tasks (likely to introduce subtle methodological divergences) or the individual installation and usage of extant code for each benchmark. Our goal is to make it easy to allow researchers or library users to simply install one codebase, and run their method plus selected baselines on their desired tasks in a controlled fashion. We hope to make it more effort-intensive to <span id=\\\"S4.p2.1.3\\\">not</span> follow minimum best practices for evaluation rigor.</p></div>\\n<section id=\\\"S4.SS1\\\">\\n<h3>\\n<span>4.1 </span>Design</h3>\\n<div id=\\\"S4.SS1.p1\\\">\\n<p id=\\\"S4.SS1.p1.1\\\">We provide an overview of <span id=\\\"S4.SS1.p1.1.1\\\">lm-eval</span>’s major components and design philosophy. 
At its core, <span id=\\\"S4.SS1.p1.1.2\\\">lm-eval</span>&nbsp;allows for the contribution of two types of implementations: evaluation <span id=\\\"S4.SS1.p1.1.3\\\">Tasks</span> and integrations with novel <span id=\\\"S4.SS1.p1.1.4\\\">LM</span> implementations.</p></div>\\n<section id=\\\"S4.SS1.SSS0.Px1\\\">\\n<h5>Tasks</h5>\\n<div id=\\\"S4.SS1.SSS0.Px1.p1\\\">\\n<p id=\\\"S4.SS1.SSS0.Px1.p1.1\\\"><span id=\\\"S4.SS1.SSS0.Px1.p1.1.1\\\">lm-eval</span>&nbsp;is built around modular implementations of evaluation tasks, implemented as a <span id=\\\"S4.SS1.SSS0.Px1.p1.1.2\\\">Task</span> class using a common API. This allows tasks to be collected in a common library, for new tasks to be extended or implemented easily, and for novel tasks to be easily shared reproducibly among practitioners or other library users. Users can implement tasks either via YAML-based configuration files or via subclassing the provided <span id=\\\"S4.SS1.SSS0.Px1.p1.1.3\\\">Task</span> class and providing custom code for specific methods. In Figure <a href=\\\"https://arxiv.org/html/2405.14782v1#S4.F2\\\" title=\\\"Figure 2 ‣ Tasks ‣ 4.1 Design ‣ 4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2</span></a>, we show an example of the evaluation logic packaged within a <span id=\\\"S4.SS1.SSS0.Px1.p1.1.4\\\">Task</span> class.</p></div>\\n<figure id=\\\"S4.F2\\\"><img alt=\\\"Refer to caption\\\" height=\\\"199\\\" id=\\\"S4.F2.g1\\\" src=\\\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/taskoverview.png\\\" width=\\\"598\\\">\\n<figcaption><span>Figure 2: </span>The operations performed by a <span id=\\\"S4.F2.7.1\\\">Task</span> object in <span id=\\\"S4.F2.8.2\\\">lm-eval</span>. 
Tasks are configured by YAML files or as a Python subclass, and encompass 1) a data source (using the <span id=\\\"S4.F2.9.3\\\">Datasets</span> library<cite>(Lhoest et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib60\\\" title=\\\"\\\">2021</a>)</cite>), 2) tools for defining prompts and format, 3) mapping these prompts to rendered inputs and expected output type from an <span id=\\\"S4.F2.10.4\\\">LM</span> in the form of <span id=\\\"S4.F2.11.5\\\">Requests</span>, and 4) rules for post-processing the <span id=\\\"S4.F2.12.6\\\">LM</span>’s outputs and calculating the final task metrics.</figcaption>\\n</figure>\\n<div id=\\\"S4.SS1.SSS0.Px1.p2\\\">\\n<p id=\\\"S4.SS1.SSS0.Px1.p2.1\\\">We provide a number of implementations for common tasks, and accept new tasks sourced from the community. We strive to match the paper originally introducing a benchmark dataset in its methodology, including using the same prompts if applicable. For tasks such as those introduced prior to prompted evaluation becoming the standard, we source evaluation methodology from the paper first posing the evaluation dataset as a prompted task. For example, we implement many tasks as adapted for in-context learning by <cite>Brown et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>.</p></div>\\n</section>\\n<section id=\\\"S4.SS1.SSS0.Px2\\\">\\n<h5>LMs</h5>\\n<div id=\\\"S4.SS1.SSS0.Px2.p1\\\">\\n<p id=\\\"S4.SS1.SSS0.Px2.p1.1\\\">The next core piece of <span id=\\\"S4.SS1.SSS0.Px2.p1.1.1\\\">lm-eval</span>&nbsp;is the <span id=\\\"S4.SS1.SSS0.Px2.p1.1.2\\\">LM</span> API. 
Because effective <span id=\\\"S4.SS1.SSS0.Px2.p1.1.3\\\">orchestration</span> is our core goal, we allow arbitrary software libraries or (autoregressive) language model architectures to extend a provided interface for <span id=\\\"S4.SS1.SSS0.Px2.p1.1.4\\\">LM</span> objects.</p></div>\\n<div id=\\\"S4.SS1.SSS0.Px2.p2\\\">\\n<p id=\\\"S4.SS1.SSS0.Px2.p2.1\\\">For ease of use, and compartmentalization of the model definition and external library integrations for custom models away from core evaluation logic, we assume that LMs operate upon dispatched <span id=\\\"S4.SS1.SSS0.Px2.p2.1.1\\\">Requests</span> which consist of mapping <span id=\\\"S4.SS1.SSS0.Px2.p2.1.2\\\">string inputs</span> to some <span id=\\\"S4.SS1.SSS0.Px2.p2.1.3\\\">string or probability</span> as output. We thus abstract tokenizers away within the <span id=\\\"S4.SS1.SSS0.Px2.p2.1.4\\\">LM</span> class, and treat a neural language model combined with its tokenizer as a single system being evaluated.</p></div>\\n<div id=\\\"S4.SS1.SSS0.Px2.p3\\\">\\n<p id=\\\"S4.SS1.SSS0.Px2.p3.1\\\">LMs implement a simple interface, consisting of several types of <span id=\\\"S4.SS1.SSS0.Px2.p3.1.1\\\">Requests</span> in order to be used within the library for all supported tasks.</p></div>\\n</section>\\n<section id=\\\"S4.SS1.SSS0.Px3\\\">\\n<h5>Request Types</h5>\\n<div id=\\\"S4.SS1.SSS0.Px3.p1\\\">\\n<p id=\\\"S4.SS1.SSS0.Px3.p1.1\\\">We allow for 3 core types of <span id=\\\"S4.SS1.SSS0.Px3.p1.1.1\\\">Requests</span> that may be sent to a language model, which consist of distinct types of <span id=\\\"S4.SS1.SSS0.Px3.p1.1.2\\\">measurements</span> that can be performed to observe a model’s response or latent capabilities in a prompted format. 
These are:</p></div>\\n<div id=\\\"S4.SS1.SSS0.Px3.p2\\\">\\n<ul id=\\\"S4.I1\\\">\\n<li id=\\\"S4.I1.i1\\\">\\n<span>•</span>\\n<div id=\\\"S4.I1.i1.p1\\\">\\n<p id=\\\"S4.I1.i1.p1.1\\\">(Conditional) Loglikelihoods (<span id=\\\"S4.I1.i1.p1.1.1\\\">loglikelihood, multiple_choice</span>) - computing the probability of given output string(s), conditioned on some provided input.</p></div>\\n</li>\\n<li id=\\\"S4.I1.i2\\\">\\n<span>•</span>\\n<div id=\\\"S4.I1.i2.p1\\\">\\n<p id=\\\"S4.I1.i2.p1.1\\\">Perplexities (<span id=\\\"S4.I1.i2.p1.1.1\\\">loglikelihood_rolling</span>) - measuring the average loglikelihood or probability of producing the tokens in a given dataset.</p></div>\\n</li>\\n<li id=\\\"S4.I1.i3\\\">\\n<span>•</span>\\n<div id=\\\"S4.I1.i3.p1\\\">\\n<p id=\\\"S4.I1.i3.p1.1\\\">Generation (<span id=\\\"S4.I1.i3.p1.1.1\\\">generate_until</span>) - generating text until a given stopping condition is reached, from a model conditioned on some provided input.</p></div>\\n</li>\\n</ul>\\n</div>\\n<figure id=\\\"S4.F3\\\"><img alt=\\\"Refer to caption\\\" height=\\\"295\\\" id=\\\"S4.F3.g1\\\" src=\\\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/typesoftasks.png\\\" width=\\\"598\\\">\\n<figcaption><span>Figure 3: </span>Overview of the three core <span id=\\\"S4.F3.2.1\\\">Request</span> types supported by our evaluation framework. These include (1) conditional loglikelihoods, (2) perplexities, and (3) generation-based requests.</figcaption>\\n</figure>\\n<div id=\\\"S4.SS1.SSS0.Px3.p3\\\">\\n<p id=\\\"S4.SS1.SSS0.Px3.p3.1\\\">Provided with these three primitive operations, we are able to implement the major ways in the literature that have been used to evaluate LMs (<cite>Gao et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib36\\\" title=\\\"\\\">2020</a>)</cite>, <cite>Brown et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>, <span id=\\\"S4.SS1.SSS0.Px3.p3.1.1\\\">inter alia</span>). While these high-level approaches are standard, they all contain a number of subtle implementation decisions which are often not disclosed in papers. Therefore, we include a full formal description of common implementation details involved in ours and others’ approaches within Appendix <a href=\\\"https://arxiv.org/html/2405.14782v1#A1\\\" title=\\\"Appendix A Formalizing Measurements ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>A</span></a> for completeness, which we hope will be a useful contribution to the literature.</p></div>\\n</section>\\n</section>\\n<section id=\\\"S4.SS2\\\">\\n<h3>\\n<span>4.2 </span>Addressing Challenges and Incorporating Best Practices</h3>\\n<div id=\\\"S4.SS2.p1\\\">\\n<p id=\\\"S4.SS2.p1.1\\\">Here we detail how we position <span id=\\\"S4.SS2.p1.1.1\\\">lm-eval</span>&nbsp;to address the issues mentioned in Section <a href=\\\"https://arxiv.org/html/2405.14782v1#S2\\\" title=\\\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2</span></a> and incorporate the recommendations in Section <a href=\\\"https://arxiv.org/html/2405.14782v1#S3\\\" title=\\\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>3</span></a>, in order to encourage a more robust evaluation ecosystem.</p></div>\\n<section id=\\\"S4.SS2.SSS0.Px1\\\">\\n<h5>Reproducibility</h5>\\n<div id=\\\"S4.SS2.SSS0.Px1.p1\\\">\\n<p id=\\\"S4.SS2.SSS0.Px1.p1.1\\\"><span id=\\\"S4.SS2.SSS0.Px1.p1.1.1\\\">lm-eval</span>&nbsp;encourages and enables reproducible evaluation in several ways. 
First, by providing a standardized implementation of many common tasks, practitioners can report on these tasks and ensure they are evaluating on the same prompt and implementation as other users of the library.</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px1.p2\\\">\\n<p id=\\\"S4.SS2.SSS0.Px1.p2.1\\\">Alongside task results we report a <span id=\\\"S4.SS2.SSS0.Px1.p2.1.1\\\">version</span> field, incremented each time a task must be modified in a way that affects its scoring. Therefore, in the case where task implementations have bugs or must otherwise be updated, one can still reference the version of the task used, to ensure future research can reproduce reported results.</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px1.p3\\\">\\n<p id=\\\"S4.SS2.SSS0.Px1.p3.1\\\">While this is not a panacea for the costs of comparing to prior work, and rerunning baselines oneself is advised, when prior work uses our library one can be confident that the results from prior work match what one would have gotten had one rerun it oneself using that version of the library <cite>(Beeching et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib9\\\" title=\\\"\\\">2023</a>)</cite>.</p></div>\\n</section>\\n<section id=\\\"S4.SS2.SSS0.Px2\\\">\\n<h5>Qualitative Analysis</h5>\\n<div id=\\\"S4.SS2.SSS0.Px2.p1\\\">\\n<p id=\\\"S4.SS2.SSS0.Px2.p1.1\\\"><span id=\\\"S4.SS2.SSS0.Px2.p1.1.1\\\">lm-eval</span>&nbsp;provides support for performing qualitative analysis of evaluation scores. 
In keeping with our recommended best practices, we implement the following, which allow for qualitative checks to be a core part of the evaluation workflow when using&nbsp;<span id=\\\"S4.SS2.SSS0.Px2.p1.1.2\\\">lm-eval</span>:</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px2.p2\\\">\\n<ul id=\\\"S4.I2\\\">\\n<li id=\\\"S4.I2.i1\\\">\\n<span>•</span>\\n<div id=\\\"S4.I2.i1.p1\\\">\\n<p id=\\\"S4.I2.i1.p1.1\\\">We allow for artificially <span id=\\\"S4.I2.i1.p1.1.1\\\">limit</span>ing the amount of samples used for a given evaluation run, to enable code to be tested and outputs to be reviewed in small batches prior to full evaluation runs.</p></div>\\n</li>\\n<li id=\\\"S4.I2.i2\\\">\\n<span>•</span>\\n<div id=\\\"S4.I2.i2.p1\\\">\\n<p id=\\\"S4.I2.i2.p1.1\\\">Per-sample logging is supported, for post-hoc reproduction of scores or error analysis of model mistakes or evaluation implementation.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S4.SS2.SSS0.Px3\\\">\\n<h5>Statistical Testing</h5>\\n<div id=\\\"S4.SS2.SSS0.Px3.p1\\\">\\n<p id=\\\"S4.SS2.SSS0.Px3.p1.1\\\"><span id=\\\"S4.SS2.SSS0.Px3.p1.1.1\\\">lm-eval</span>&nbsp;reports the standard error (SE) of most supported metrics, calculated by either bootstrapping or dividing the sample standard deviation by the root of the sample size.</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px3.p2\\\">\\n<p id=\\\"S4.SS2.SSS0.Px3.p2.1\\\">By reporting these SE calculations prominently in every evaluation run, we make it trivial for practitioners to add simple statistical measures such as confidence intervals to their results. 
While we believe more rigorous and widespread statistical testing in LM evaluation is still needed, we hope that this will spur the community to report and be more aware of statistical significance concerns by lowering the difficulty of reporting such measures.</p></div>\\n</section>\\n</section>\\n</section>\\n<section id=\\\"S5\\\">\\n<h2>\\n<span>5 </span>Case Studies</h2>\\n<div id=\\\"S5.p1\\\">\\n<p id=\\\"S5.p1.1\\\">Finally, we demonstrate <span id=\\\"S5.p1.1.1\\\">lm-eval</span>’s utility for improving evaluation rigor and understanding via case studies of its successful usage. We provide additional case studies within Appendix <a href=\\\"https://arxiv.org/html/2405.14782v1#A2\\\" title=\\\"Appendix B Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>B</span></a>.</p></div>\\n<section id=\\\"S5.SS1\\\">\\n<h3>\\n<span>5.1 </span>Prompts Massively Affect Results</h3>\\n<div id=\\\"S5.SS1.p1\\\">\\n<p id=\\\"S5.SS1.p1.1\\\">As mentioned in Section <a href=\\\"https://arxiv.org/html/2405.14782v1#S2.SS3.SSS1\\\" title=\\\"2.3.1 “Minor” Implementation Details Matter ‣ 2.3 Implementation Difficulties and (Ir)Reproducibility ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2.3.1</span></a>, language models can be very sensitive to the specific prompt or framing of an evaluation task.</p></div>\\n<div id=\\\"S5.SS1.p2\\\">\\n<p id=\\\"S5.SS1.p2.1\\\">Here we detail how <span id=\\\"S5.SS1.p2.1.1\\\">lm-eval</span>&nbsp;has been used to study this sensitivity, and can be used to improve confidence in the comparison of scores across models. 
We focus our attention on two popular language modeling benchmarks: the ARC question answering benchmark <cite>(Clark et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib23\\\" title=\\\"\\\">2018</a>)</cite> and MMLU <cite>(Hendrycks et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib43\\\" title=\\\"\\\">2021</a>)</cite>. However, not all papers evaluate on these tasks in the same way as the original formats.</p></div>\\n<figure id=\\\"S5.T1\\\">\\n\\n<figcaption><span>Table 1: </span>Comparison of 0-shot model performance for several pretrained LMs <cite>(Black et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib16\\\" title=\\\"\\\">2022</a>; Touvron et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib115\\\" title=\\\"\\\">2023</a>; Penedo et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib82\\\" title=\\\"\\\">2023</a>; Jiang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib51\\\" title=\\\"\\\">2023</a>; <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib52\\\" title=\\\"\\\">2024</a>)</cite> on ARC (Easy and Challenge subsets) and MMLU across two commonly used prompt styles, evaluated using <span id=\\\"S5.T1.3.1\\\">lm-eval</span>.</figcaption>\\n</figure>\\n<div id=\\\"S5.SS1.p3\\\">\\n<p id=\\\"S5.SS1.p3.1\\\">ARC is first adapted to the in-context learning setting by <cite>Brown et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>, who implement the dataset as a “cloze” task: the model is prompted via <span id=\\\"S5.SS1.p3.1.1\\\">‘‘Question: {question}<math alttext=\\\"\\\\backslash\\\" display=\\\"inline\\\" id=\\\"S5.SS1.p3.1.1.m1.1\\\"><semantics id=\\\"S5.SS1.p3.1.1.m1.1a\\\"><mo id=\\\"S5.SS1.p3.1.1.m1.1.1\\\" xref=\\\"S5.SS1.p3.1.1.m1.1.1.cmml\\\">\\\\</mo><annotation-xml encoding=\\\"MathML-Content\\\" id=\\\"S5.SS1.p3.1.1.m1.1b\\\"><ci id=\\\"S5.SS1.p3.1.1.m1.1.1.cmml\\\" xref=\\\"S5.SS1.p3.1.1.m1.1.1\\\">\\\\</ci></annotation-xml><annotation encoding=\\\"application/x-tex\\\" id=\\\"S5.SS1.p3.1.1.m1.1c\\\">\\\\backslash</annotation><annotation encoding=\\\"application/x-llamapun\\\" id=\\\"S5.SS1.p3.1.1.m1.1d\\\">\\\\</annotation></semantics></math>nAnswer:’’</span> and the likelihood of each potential completion string’s <span id=\\\"S5.SS1.p3.1.2\\\">text</span> is compared. Comparatively, MMLU is implemented in <cite>Hendrycks et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib42\\\" title=\\\"\\\">2020</a>)</cite> via providing the model with the question text, each of the 4 possible answers preceded by an answer letter A, B, C, or D, and scoring over the <span id=\\\"S5.SS1.p3.1.3\\\">answer letters’ probabilities</span>.</p></div>\\n<div id=\\\"S5.SS1.p4\\\">\\n<p id=\\\"S5.SS1.p4.1\\\">However, if models do not adopt these approaches, or disclose their exact settings, it is impossible to reliably compare stated model performance. In Table <a href=\\\"https://arxiv.org/html/2405.14782v1#S5.T1\\\" title=\\\"Table 1 ‣ 5.1 Prompts Massively Affect Results ‣ 5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>1</span></a>, we compare evaluation on the Easy and Challenge sets of ARC using the prompt from <cite>Brown et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite> (“Cloze”) and using an MMLU-style answer letter with explicit multiple choice options (“MMLU-style”). We additionally compare MMLU scores between the original MMLU prompting style (“MMLU-style”) and an approach we term “Hybrid”, consisting of an MMLU-style prompt but using the <span id=\\\"S5.SS1.p4.1.1\\\">answer strings</span> instead of answer letters as the set of continuations over which we can scor\"\n}"
    }
  ],
  "details": {
    "url": "https://arxiv.org/html/2405.14782v1",
    "title": "Lessons from the Trenches on Reproducible Evaluation of Language Models",
    "author": "Stella Biderman",
    "summary": "Hailey Schoelkopf\n\n\n\n\nLintang Sutawika\n\n\n\n\n\nLeo Gao\n\nEleutherAI\n\n\nJonathan Tow\n\nStability AI\n\n\nBaber Abbasi\n\nEleutherAI\n\n\nAlham Fikri Aji\n\nMBZUAI\n\n\nPawan Sasanka Ammanamanchi\n\nIIIT Hyderabad\n\n\nSidney …",
    "content": "<div id=\"readability-page-1\" class=\"page\"><article>\n\n<p>\n<span>\n<span>Hailey Schoelkopf\n</span><span>\n<span>\n</span></span></span>\n<span>\n<span>Lintang Sutawika\n</span><span>\n<span>\n</span></span></span>\n<span>\n<span>\n<br>Leo Gao\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Jonathan Tow\n</span><span>\n<span>Stability AI\n</span></span></span>\n<span>\n<span>Baber Abbasi\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Alham Fikri Aji\n</span><span>\n<span>MBZUAI\n</span></span></span>\n<span>\n<span>Pawan Sasanka Ammanamanchi\n</span><span>\n<span>IIIT Hyderabad\n</span></span></span>\n<span>\n<span>Sidney Black\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Jordan Clive\n</span><span>\n<span>Chattermill AI\n</span></span></span>\n<span>\n<span>Anthony DiPofi\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Julen Etxaniz\n</span><span>\n<span>HiTZ Center - Ixa, UPV/EHU\n</span></span></span>\n<span>\n<span>Benjamin Fattori\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Jessica Zosa Forde\n</span><span>\n<span>Brown University\n</span></span></span>\n<span>\n<span>Charles Foster\n</span><span>\n<span>Finetune\n</span></span></span>\n<span>\n<span>Mimansa Jaiswal\n</span><span>\n<span>University of Michigan\n</span></span></span>\n<span>\n<span>Wilson Y. 
Lee\n</span><span>\n<span>HubSpot\n</span></span></span>\n<span>\n<span>Haonan Li\n</span><span>\n<span>MBZUAI\n</span>\n<span>LibrAI\n</span></span></span>\n<span>\n<span>Charles Lovering\n</span><span>\n<span>Kensho\n</span></span></span>\n<span>\n<span>Niklas Muennighoff\n</span><span>\n<span>Contextual AI\n</span></span></span>\n<span>\n<span>Ellie Pavlick\n</span><span>\n<span>HiTZ Center - Ixa, UPV/EHU\n</span></span></span>\n<span>\n<span>Jason Phang\n</span><span>\n<span>EleutherAI\n</span>\n<span>New York University\n</span></span></span>\n<span>\n<span>Aviya Skowron\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Samson Tan\n</span><span>\n<span>Amazon\n</span></span></span>\n<span>\n<span>Xiangru Tang\n</span><span>\n<span>Yale University\n</span></span></span>\n<span>\n<span>Kevin A. Wang\n</span><span>\n<span>HiTZ Center - Ixa, UPV/EHU\n</span></span></span>\n<span>\n<span>Genta Indra Winata\n</span><span>\n<span>HKUST\n</span></span></span>\n<span>\n<span>François Yvon\n</span><span>\n<span>Sorbonne University\n</span></span></span>\n<span>\n<span>Andy Zou\n</span><span>\n<span>CMU\n</span></span></span>\n</p>\n<div id=\"abstract\">\n<h6>Abstract</h6><p id=\"id1.id1\">Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. 
Third, we present the&nbsp;<span id=\"id1.id1.1\">Language Model Evaluation Harness</span>&nbsp;(<span id=\"id1.id1.2\">lm-eval</span>): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns.</p></div>\n<section id=\"S1\">\n<h2>\n<span>1 </span>Introduction</h2>\n<div id=\"S1.p1\">\n<p id=\"S1.p1.1\">Evaluation on shared benchmark tasks is a crucial tool used to track and communicate progress in the machine learning and language modeling communities <cite>(Ruder, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib98\" title=\"\">2021</a>)</cite>. Benchmarks are used to track progress toward shared community goals and to demonstrate the improvements of newly proposed methods over prior baselines. Evaluation practices thus play a crucial role in the direction of the field: inconsistencies or biases in evaluation practices can lead to skewed performance comparisons, which may influence the direction of future research and the adoption of new methods by the community <cite>(Dehghani et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib28\" title=\"\">2021</a>)</cite> or lead to adverse effects from deploying suboptimal or harmful models <cite>(Bender &amp; Friedman, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib10\" title=\"\">2018</a>)</cite> on tasks for which they are ill-suited <cite>(Raji et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib94\" title=\"\">2022</a>)</cite>.</p></div>\n<div id=\"S1.p2\">\n<p id=\"S1.p2.1\">In this work, we detail our lessons learned that have been especially beneficial to obtaining useful and rigorous findings. By sharing these lessons, we aim to foster a more robust and reliable evaluation ecosystem. 
Our contributions are threefold:</p><ol id=\"S1.I1\">\n<li id=\"S1.I1.i1\">\n<span>1.</span>\n<div id=\"S1.I1.i1.p1\">\n<p id=\"S1.I1.i1.p1.1\">We highlight several commonly-faced challenges in evaluating language models, including the difficulty of assessing the correctness of natural language responses, challenges in benchmark design, and the dependence upon implementation details that are often obscured or unreported (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S2\" title=\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2</span></a>).</p></div>\n</li>\n<li id=\"S1.I1.i2\">\n<span>2.</span>\n<div id=\"S1.I1.i2.p1\">\n<p id=\"S1.I1.i2.p1.1\">We then discuss best practices we’ve identified for communicating results and improving evaluation rigor in the language modeling community, despite (or to mitigate the impact of) these challenges (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S3\" title=\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>3</span></a>).</p></div>\n</li>\n<li id=\"S1.I1.i3\">\n<span>3.</span>\n<div id=\"S1.I1.i3.p1\">\n<p id=\"S1.I1.i3.p1.1\">To enable researchers and engineers to easily utilize the best practices we have identified, we present&nbsp;<span id=\"S1.I1.i3.p1.1.1\">lm-eval</span>, our open-source library for reproducible evaluation of language models (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S4\" title=\"4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>4</span></a>). 
<span id=\"S1.I1.i3.p1.1.2\">lm-eval</span>’s framework defines a flexible API for both model implementation and evaluation task implementation, and then handles all of the work to <span id=\"S1.I1.i3.p1.1.3\">orchestrate</span> evaluations internally, allowing users to evaluate their chosen combinations of models + tasks, dependent on use case. We additionally present a set of case studies in which&nbsp;<span id=\"S1.I1.i3.p1.1.4\">lm-eval</span>&nbsp;was used to improve the level of rigor of a set of evaluations (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S5\" title=\"5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>5</span></a>).</p></div>\n</li>\n</ol>\n</div>\n</section>\n<section id=\"S2\">\n<h2>\n<span>2 </span>Challenges in Evaluating Language Models</h2>\n<section id=\"S2.SS1\">\n<h3>\n<span>2.1 </span>Evaluating and Scoring Natural Language Abilities</h3>\n<div id=\"S2.SS1.p1\">\n<p id=\"S2.SS1.p1.1\">The biggest challenge in language model evaluation is a concept we term <span id=\"S2.SS1.p1.1.1\">the Key Problem</span>: When evaluating language models, there can be many semantically equivalent but syntactically different ways of expressing the same idea. In an ideal world, we would have a way to automatically detect when two sentences express the same content but in different words. Unfortunately, our best tools for determining whether two sentences are semantically equivalent <span id=\"S2.SS1.p1.1.2\">are the very models we are seeking to evaluate</span>. This problem drives many of the approaches to LM benchmarking, and many problems in LM evaluation stem from there not being any silver bullets for solving the Key Problem.</p></div>\n<div id=\"S2.SS1.p2\">\n<p id=\"S2.SS1.p2.1\">In principle, this would be solvable by simply having expert human annotators score model responses for correctness. 
The main reason this is not ubiquitous is cost: performing accurate human studies is not only difficult and time-consuming but also very expensive due to fair compensation, pricing smaller actors or organizations out of performing such evaluations. Additionally, there are other reasons relying on solely human assessments must be done with caution: they can be flawed and biased, especially for complex judgments such as factuality <cite>(Hosking et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib47\" title=\"\">2024</a>; Xu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib127\" title=\"\">2023</a>; Wu &amp; Aji, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib126\" title=\"\">2023</a>)</cite>. Expert, trained human judgment can alleviate these issues but is inherently non-scalable.</p></div>\n<div id=\"S2.SS1.p3\">\n<p id=\"S2.SS1.p3.1\">To address the high costs of manual human evaluation, automated metrics are often used. These offer notable advantages in that they are (theoretically) fully reproducible, far easier and cheaper to compute, and can avoid some of the issues faced by human studies <cite>(Wei &amp; Jia, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib124\" title=\"\">2021</a>; Freitag et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib33\" title=\"\">2021</a>; Amidei et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib4\" title=\"\">2020</a>)</cite>. 
Automated metrics such as BLEU <cite>(Papineni et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib81\" title=\"\">2002</a>)</cite> and ROUGE <cite>(Lin, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib65\" title=\"\">2004</a>)</cite> seek to directly solve the Key Problem by measuring the distance from a generated response to a gold-standard one, for example by counting the n-gram overlap between the two texts.\nHeuristic-based metrics such as BLEU (and its derivatives) have flaws <cite>(Callison-Burch et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib19\" title=\"\">2006</a>)</cite> and present reproducibility challenges <cite>(Marie et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib74\" title=\"\">2021</a>)</cite>, but can be useful. More recently, model-based metrics have gained momentum through evaluation methods that leverage large language models as graders <cite>(Kim et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib56\" title=\"\">2024</a>; Wang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib120\" title=\"\">2024</a>; Liu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib68\" title=\"\">2023b</a>)</cite>, especially as proxies for human preference evaluation <cite>(Zheng et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib133\" title=\"\">2023</a>)</cite>, but these are known to be flawed <cite>(Wang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib118\" title=\"\">2023</a>; Huang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib49\" title=\"\">2024</a>; Shen et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib106\" title=\"\">2023</a>; Zeng et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib130\" title=\"\">2024</a>; Hu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib48\" title=\"\">2024</a>; Liu et&nbsp;al., <a 
href=\"https://arxiv.org/html/2405.14782v1#bib.bib69\" title=\"\">2023c</a>; Chen et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib20\" title=\"\">2024</a>)</cite> and suffer from similar reproducibility issues as BLEU, ROUGE, and their variants.</p></div>\n<div id=\"S2.SS1.p4\">\n<p id=\"S2.SS1.p4.1\">The Key Problem can alternately be sidestepped by artificially restricting the answer space. The most prevalent way to achieve this is to reframe questions as multiple choice problems, with a single gold target answer and a finite, static set of possible responses <cite>(Hendrycks et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib42\" title=\"\">2020</a>; Srivastava et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib110\" title=\"\">2022</a>; Li’evin et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib64\" title=\"\">2022</a>; Lin et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib66\" title=\"\">2022</a>; Robinson et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib95\" title=\"\">2023</a>; Holtzman et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib46\" title=\"\">2022</a>)</cite>. 
Alternatively, when a reference answer is known, one can perform string-matching approaches heuristically to determine whether the model’s answer matches the ground truth <cite>(Dua et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib30\" title=\"\">2019</a>; Joshi et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib53\" title=\"\">2017</a>; Hendrycks et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib43\" title=\"\">2021</a>)</cite>.</p></div>\n<div id=\"S2.SS1.p5\">\n<p id=\"S2.SS1.p5.1\">This challenge does not necessarily impact other applications of language models and related technologies, such as playing games where it is easy to check that the game has ended <cite>(Romstad et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib97\" title=\"\">2008</a>; Silver et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib107\" title=\"\">2018</a>; FAIR et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib31\" title=\"\">2022</a>)</cite>, more constrained scientific applications <cite>(Jumper et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib54\" title=\"\">2021</a>; Ahdritz et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib1\" title=\"\">2022</a>)</cite>, or domains where we have <span id=\"S2.SS1.p5.1.1\">practically usable verifiers</span> even when the solutions are not checkable in all contexts <cite>(Biderman, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib12\" title=\"\">2020</a>; Biderman &amp; Raff, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib13\" title=\"\">2022</a>; Lewkowycz et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib59\" title=\"\">2022</a>)</cite>. 
In the case of LLMs, the most notable cases where this ground-truth verifier is known are coding and mathematics problems, although the verifiers used, such as unit tests, may still break down in edge cases <cite>(Liu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib67\" title=\"\">2023a</a>)</cite>.</p></div>\n</section>\n<section id=\"S2.SS2\">\n<h3>\n<span>2.2 </span>Benchmark Design and Validity</h3>\n<div id=\"S2.SS2.p1\">\n<p id=\"S2.SS2.p1.1\">Typically, we do not care about the actual numeric score of a model on a benchmark. Instead, we desire the benchmark to be a useful proxy for some real-world phenomenon. The <span id=\"S2.SS2.p1.1.1\">validity</span> of an evaluation is the extent to which these correlate <cite>(Messick, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib76\" title=\"\">1994</a>)</cite>. For a recent overview of validity concerns in NLP benchmarking, see <cite>Subramonian et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib111\" title=\"\">2023</a>)</cite>. Also see <cite>Raji et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib93\" title=\"\">2021</a>); Saphra et&nbsp;al. 
(<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib101\" title=\"\">2023</a>); Davis (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib25\" title=\"\">2023</a>)</cite> for extended discussion of construct validity in LLM evaluation.</p></div>\n<div id=\"S2.SS2.p2\">\n<p id=\"S2.SS2.p2.1\">While validity is an ongoing problem in language model evaluation, we focus on mitigating other concerns first: as we will describe,&nbsp;<span id=\"S2.SS2.p2.1.1\">lm-eval</span>&nbsp;is designed to ensure measurements are <span id=\"S2.SS2.p2.1.2\">consistent</span> across runs and models, regardless of (construct) validity.</p></div>\n</section>\n<section id=\"S2.SS3\">\n<h3>\n<span>2.3 </span>Implementation Difficulties and (Ir)Reproducibility</h3>\n<div id=\"S2.SS3.p1\">\n<p id=\"S2.SS3.p1.1\">Once a benchmark has been designed, it then needs to be implemented by machine learning researchers around the world to see use in driving progress in the field. This introduces a host of new challenges that need to be addressed in order to ensure that everyone is evaluating models on a benchmark in the same fashion when comparing results. This adaptation process can introduce inconsistencies and make it difficult to draw conclusions across different implementations. Researchers must <span id=\"S2.SS3.p1.1.1\">adapt</span> it to their own workflows and libraries for the purposes of actually adopting the benchmark in their research.</p></div>\n<section id=\"S2.SS3.SSS1\">\n<h4>\n<span>2.3.1 </span>“Minor” Implementation Details Matter</h4>\n<div id=\"S2.SS3.SSS1.p1\">\n<p id=\"S2.SS3.SSS1.p1.1\">The importance of interoperability and full reproducibility stems from the fact that language models are incredibly sensitive to precise details that may not be obvious to practitioners. 
Even minor variations in prompts, formatting, or other implementation details can significantly impact the performance and validity of evaluations <cite>(Weber et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib121\" title=\"\">2023</a>; Sclar et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib103\" title=\"\">2023</a>; Mizrahi et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib79\" title=\"\">2024</a>; Alzahrani et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib3\" title=\"\">2024</a>; Lu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib70\" title=\"\">2022</a>; Webson &amp; Pavlick, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib122\" title=\"\">2022</a>; Min et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib78\" title=\"\">2022</a>)</cite>.\nWithout access to the original evaluation code, when re-implementing evaluation procedures from scratch is required, it is nearly impossible to account for all the subtle details that can affect outcomes. As a result, these implementations are likely to diverge in ways that make it extremely difficult to ensure fair comparisons across works, even when evaluating on the same benchmark. 
Even having the prompts reported in a paper is no substitute for having access to the actual evaluation code: prompts in papers are often incorrect or difficult to map to the exact code implementation because they’ve been stylized to be human-readable.</p></div>\n</section>\n<section id=\"S2.SS3.SSS2\">\n<h4>\n<span>2.3.2 </span>Lack of Agreement About “Apples to Apples”</h4>\n<div id=\"S2.SS3.SSS2.p1\">\n<p id=\"S2.SS3.SSS2.p1.1\">Even assuming that benchmarks are implemented consistently across works, the question of <span id=\"S2.SS3.SSS2.p1.1.1\">how to draw fair comparisons</span> across models and methods is still difficult for LMs.</p></div>\n<div id=\"S2.SS3.SSS2.p2\">\n<p id=\"S2.SS3.SSS2.p2.1\">For instance, different instruction-tuned models may be trained to expect certain formats&nbsp;<cite>(Taori et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib113\" title=\"\">2023</a>; Sanh et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib100\" title=\"\">2022</a>; Wei et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib123\" title=\"\">2022</a>)</cite> – using these models’ intended prompt formats can make the evaluation tasks inherently different or change their difficulty, but not using these can also bias against models trained with formats not matching tasks’ “standard” prompting styles. Likewise, if an original benchmark implementation (including prompting and postprocessing) is tailored for a specific model, other models trained differently will suffer, artificially skewing perceptions of what techniques are effective.</p></div>\n<div id=\"S2.SS3.SSS2.p3\">\n<p id=\"S2.SS3.SSS2.p3.1\">Likewise, some questions of how to set up controlled experiments are still open–is it ideal to normalize performance and comparisons by the number of parameters? Training FLOPs? Inference cost? Must training data be held equal? 
How should models which can leverage external resources such as retrieved documents or external tools be compared? These questions are all context-dependent but can impact findings significantly. For example, <cite>Wang et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib119\" title=\"\">2022</a>)</cite> explore comparisons across architectures and training objectives, and choose to normalize for FLOPs, thus comparing encoder-decoder models with double the parameters to decoder-only models. Comparing results of models with equivalent training FLOPs, regardless of the allocation of those FLOPs, is commonplace (<cite>Hoffmann et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib44\" title=\"\">2022</a>); Peng et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib83\" title=\"\">2023</a>); Touvron et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib115\" title=\"\">2023</a>)</cite>, <span id=\"S2.SS3.SSS2.p3.1.1\">inter alia</span>). However, in a more memory-constrained setting, comparing models equi-parameter may be more logical. 
While this is not inherently problematic, as different application contexts motivate different evaluation criteria, it is common to gloss headline claims as “model X is better than model Y” or “model X doesn’t really work as advertised” without paying significant attention to how comparisons were made.</p></div>\n</section>\n<section id=\"S2.SS3.SSS3\">\n<h4>\n<span>2.3.3 </span>Comparisons with Prior Work are Expensive (and Sometimes Impossible)</h4>\n<div id=\"S2.SS3.SSS3.p1\">\n<p id=\"S2.SS3.SSS3.p1.1\">Setting aside the question of establishing fair comparisons between methods or models, an additional key challenge in language modeling research is that many barriers prevent thorough comparison with related work.</p></div>\n<div id=\"S2.SS3.SSS3.p2\">\n<p id=\"S2.SS3.SSS3.p2.1\">Many LMs developed by industrial labs, often used as reference points for benchmarks, have never been released externally <cite>(Chowdhery et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib21\" title=\"\">2023</a>; Hoffmann et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib44\" title=\"\">2022</a>)</cite>, preventing comparisons except by pulling unverified evaluation numbers from technical reports. Those models that have been made available via APIs may silently differ from the published versions or otherwise be modified for deployment. Additionally, these API models are quickly <span id=\"S2.SS3.SSS3.p2.1.1\">deprecated</span> and become inaccessible, rendering slews of work no longer reproducible<span id=\"footnote1\"><sup>1</sup><span><span>Notably, OpenAI’s <span id=\"footnote1.1\">code-davinci-002</span> model was <a href=\"https://platform.openai.com/docs/deprecations/base-gpt-models\" title=\"\">deprecated in January 2024</a>, making at least hundreds of research studies irreproducible.</span></span></span>. 
API access, especially for large volumes of evaluation, is quite expensive.</p></div>\n\n</section>\n</section>\n<section id=\"S2.SS4\">\n<h3>\n<span>2.4 </span>Fast-changing Progress and Conventions</h3>\n<figure id=\"S2.F1\"><img alt=\"Refer to caption\" height=\"246\" id=\"S2.F1.g1\" src=\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/Timeline_v1.png\" width=\"598\">\n<figcaption><span>Figure 1: </span>A timeline showing the relative release dates of a selection of notable benchmarks used to evaluate LMs, as compared to the release dates of BERT <cite>(Devlin et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib29\" title=\"\">2018</a>)</cite>, GPT-2 <cite>(Radford et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib92\" title=\"\">2019</a>)</cite>, GPT-3 <cite>(Brown et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>, and ChatGPT, used as approximate stand-ins for shifts in how the community uses and therefore evaluates LMs.</figcaption>\n</figure>\n<div id=\"S2.SS4.p1\">\n<p id=\"S2.SS4.p1.1\">Due to the time-consuming nature of developing good benchmarks and the rapid pace of change in NLP research in the past decade, many widely used language model evaluation benchmarks do not represent the current paradigm of how language models are trained. 
This has two major impacts:</p><ul id=\"S2.I1\">\n<li id=\"S2.I1.i1\">\n<span>•</span>\n<div id=\"S2.I1.i1.p1\">\n<p id=\"S2.I1.i1.p1.1\">Benchmarks are being used for purposes they were not originally designed for or designed for validity under: for example, a large number of benchmarks have been built around fine-tuning on a known training set and closed space of labels <cite>(Wang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib117\" title=\"\">2019b</a>; <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib116\" title=\"\">a</a>)</cite>.</p></div>\n</li>\n<li id=\"S2.I1.i2\">\n<span>•</span>\n<div id=\"S2.I1.i2.p1\">\n<p id=\"S2.I1.i2.p1.1\">There is no “ground-truth” implementation from the original benchmark authors for many of these popular benchmarks “retrofitted” to be used with prompted autoregressive LMs. In the absence of a clear standard, the community’s methodology for evaluating on these benchmarks may be fragmented or undocumented <cite>(Clark et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib23\" title=\"\">2018</a>; Paperno et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib80\" title=\"\">2016</a>)</cite>.</p></div>\n</li>\n</ul>\n</div>\n<div id=\"S2.SS4.p2\">\n<p id=\"S2.SS4.p2.1\">To illustrate the effects of this development timeline, Figure <a href=\"https://arxiv.org/html/2405.14782v1#S2.F1\" title=\"Figure 1 ‣ 2.4 Fast-changing Progress and Conventions ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>1</span></a> shows how many prominent LM benchmarks were designed prior to shifts such as in-context learning and chat interaction, and therefore were not designed to take these formats and approaches into account. 
This can affect validity or difficulty in unforeseen ways.</p></div>\n</section>\n</section>\n<section id=\"S3\">\n<h2>\n<span>3 </span>Best Practices for Language Model Evaluation</h2>\n<div id=\"S3.p1\">\n<p id=\"S3.p1.1\">While LM evaluation is difficult and suffers from a number of challenges as we have described, there are measures that can be taken to significantly improve current practices. We provide our high-level recommendations regarding such measures, and detail our motivations briefly for each.</p></div>\n<section id=\"S3.SS0.SSS0.Px1\">\n<h5>Always share your exact prompts and code</h5>\n<div id=\"S3.SS0.SSS0.Px1.p1\">\n<ul id=\"S3.I1\">\n<li id=\"S3.I1.i1\">\n<span>•</span>\n<div id=\"S3.I1.i1.p1\">\n<p id=\"S3.I1.i1.p1.1\">If possible, full <span id=\"S3.I1.i1.p1.1.1\">evaluation code</span> including the full prompts used should be provided for reproducible evaluation runs, as well as further identifiers such as links to specific commits used. Failing this, sharing the prompts alone, though often not done, can drastically improve reproducibility.\n</p></div>\n</li>\n<li id=\"S3.I1.i2\">\n<span>•</span>\n<div id=\"S3.I1.i2.p1\">\n<p id=\"S3.I1.i2.p1.1\">For fair comparison against other models, evaluation should be done with the same set of prompts unless there’s a good reason not to. 
<span id=\"S3.I1.i2.p1.1.1\">Prompts should not be optimized for performance on a given model but not others, and the amount of prompt engineering done should be disclosed.</span></p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px2\">\n<h5>Avoid copying results from other implementations</h5>\n<div id=\"S3.SS0.SSS0.Px2.p1\">\n<ul id=\"S3.I2\">\n<li id=\"S3.I2.i1\">\n<span>•</span>\n<div id=\"S3.I2.i1.p1\">\n<p id=\"S3.I2.i1.p1.1\">Comparing results across papers can be misleading due to a wide range of experimental differences, including prompts, sample size, metric calculation, and more <cite>(Marie et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib74\" title=\"\">2021</a>)</cite>.</p></div>\n</li>\n<li id=\"S3.I2.i2\">\n<span>•</span>\n<div id=\"S3.I2.i2.p1\">\n<p id=\"S3.I2.i2.p1.1\">Results should <span id=\"S3.I2.i2.p1.1.1\">not</span> be copied or reported from other papers&nbsp;<cite>(Marie, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib73\" title=\"\">2022</a>)</cite> whenever possible, unless one can verify that the exact same code has been used to run the experiments in those papers. 
If such copying is unavoidable, it should be clearly marked as such and treated carefully.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px3\">\n<h5>Always provide model outputs</h5>\n<div id=\"S3.SS0.SSS0.Px3.p1\">\n<ul id=\"S3.I3\">\n<li id=\"S3.I3.i1\">\n<span>•</span>\n<div id=\"S3.I3.i1.p1\">\n<p id=\"S3.I3.i1.p1.1\">Providing model outputs alongside evaluation code can allow others to recalculate scores based on these artifacts, which can be useful for performing statistical significance testing and for assessing the impact of different evaluation metrics or scoring approaches.</p></div>\n</li>\n<li id=\"S3.I3.i2\">\n<span>•</span>\n<div id=\"S3.I3.i2.p1\">\n<p id=\"S3.I3.i2.p1.1\">Evaluation of large models or APIs can be quite costly–sharing such artifacts allows researchers without access to significant compute to participate in evaluation research.</p></div>\n</li>\n<li id=\"S3.I3.i3\">\n<span>•</span>\n<div id=\"S3.I3.i3.p1\">\n<p id=\"S3.I3.i3.p1.1\">Finally, sharing outputs can allow results on API models to be reproduced to some extent, even if the models are subsequently deprecated.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px4\">\n<h5>Perform qualitative analyses</h5>\n<div id=\"S3.SS0.SSS0.Px4.p1\">\n<ul id=\"S3.I4\">\n<li id=\"S3.I4.i1\">\n<span>•</span>\n<div id=\"S3.I4.i1.p1\">\n<p id=\"S3.I4.i1.p1.1\">Qualitatively review a small batch of results and outputs before testing at scale: it is very easy to have bugs in your generation code, especially when working with multiple sets of benchmarks, prompts, and models of different architectures. Catching issues early can save a lot of time and compute, and increase confidence in results.</p></div>\n</li>\n<li id=\"S3.I4.i2\">\n<span>•</span>\n<div id=\"S3.I4.i2.p1\">\n<p id=\"S3.I4.i2.p1.1\">Quantitative scores only provide so much information. 
To understand why a model is scoring so well or so poorly, it is important to do some sort of qualitative error analysis. This can sometimes reveal superficial errors that are easier to correct with post-processing&nbsp;<cite>Bawden &amp; Yvon (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib8\" title=\"\">2023</a>)</cite>, or more fundamental errors.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px5\">\n<h5>Perform statistical significance testing</h5>\n<div id=\"S3.SS0.SSS0.Px5.p1\">\n<ul id=\"S3.I5\">\n<li id=\"S3.I5.i1\">\n<span>•</span>\n<div id=\"S3.I5.i1.p1\">\n<p id=\"S3.I5.i1.p1.1\">Most works on language modeling do not perform statistical significance testing <cite>(Marie et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib74\" title=\"\">2021</a>)</cite>. This simple addition can dramatically boost the reliability of claimed results.</p></div>\n</li>\n<li id=\"S3.I5.i2\">\n<span>•</span>\n<div id=\"S3.I5.i2.p1\">\n<p id=\"S3.I5.i2.p1.1\">Although costly, reporting results run over more than one random seed can dramatically boost the validity and utility of results. For example, averaging across model runs <cite>(Sellam et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib104\" title=\"\">2022</a>)</cite>, or averaging over multiple selections of few-shot examples.</p></div>\n</li>\n</ul>\n</div>\n</section>\n</section>\n<section id=\"S4\">\n<h2>\n<span>4 </span>The Language Model Evaluation Harness</h2>\n<div id=\"S4.p1\">\n<p id=\"S4.p1.1\">Informed by these practices we have built <span id=\"S4.p1.1.1\">lm-eval</span>. 
Unlike prior work on unified benchmarking libraries <cite>(Liang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib63\" title=\"\">2023</a>; Srivastava et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib110\" title=\"\">2022</a>)</cite>, the Evaluation Library does not seek to solely prescribe what the correct benchmark or evaluation protocols to use are, and allows users to select their desired tasks and use cases.</p></div>\n<div id=\"S4.p2\">\n<p id=\"S4.p2.1\">The role of the <span id=\"S4.p2.1.1\">lm-eval</span>&nbsp;is to solve the <span id=\"S4.p2.1.2\">orchestration problem</span>: previously, performing thorough LM evaluations would require painstaking re-implementation of previous tasks (likely to introduce subtle methodological divergences) or the individual installation and usage of extant code for each benchmark. Our goal is to make it easy to allow researchers or library users to simply install one codebase, and run their method plus selected baselines on their desired tasks in a controlled fashion. We hope to make it more effort-intensive to <span id=\"S4.p2.1.3\">not</span> follow minimum best practices for evaluation rigor.</p></div>\n<section id=\"S4.SS1\">\n<h3>\n<span>4.1 </span>Design</h3>\n<div id=\"S4.SS1.p1\">\n<p id=\"S4.SS1.p1.1\">We provide an overview of <span id=\"S4.SS1.p1.1.1\">lm-eval</span>’s major components and design philosophy. 
At its core, <span id=\"S4.SS1.p1.1.2\">lm-eval</span>&nbsp;allows for the contribution of two types of implementations: evaluation <span id=\"S4.SS1.p1.1.3\">Tasks</span> and integrations with novel <span id=\"S4.SS1.p1.1.4\">LM</span> implementations.</p></div>\n<section id=\"S4.SS1.SSS0.Px1\">\n<h5>Tasks</h5>\n<div id=\"S4.SS1.SSS0.Px1.p1\">\n<p id=\"S4.SS1.SSS0.Px1.p1.1\"><span id=\"S4.SS1.SSS0.Px1.p1.1.1\">lm-eval</span>&nbsp;is built around modular implementations of evaluation tasks, implemented as a <span id=\"S4.SS1.SSS0.Px1.p1.1.2\">Task</span> class using a common API. This allows tasks to be collected in a common library, for new tasks to be extended or implemented easily, and for novel tasks to be easily shared reproducibly among practitioners or other library users. Users can implement tasks either via YAML-based configuration files or via subclassing the provided <span id=\"S4.SS1.SSS0.Px1.p1.1.3\">Task</span> class and providing custom code for specific methods. In Figure <a href=\"https://arxiv.org/html/2405.14782v1#S4.F2\" title=\"Figure 2 ‣ Tasks ‣ 4.1 Design ‣ 4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2</span></a>, we show an example of the evaluation logic packaged within a <span id=\"S4.SS1.SSS0.Px1.p1.1.4\">Task</span> class.</p></div>\n<figure id=\"S4.F2\"><img alt=\"Refer to caption\" height=\"199\" id=\"S4.F2.g1\" src=\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/taskoverview.png\" width=\"598\">\n<figcaption><span>Figure 2: </span>The operations performed by a <span id=\"S4.F2.7.1\">Task</span> object in <span id=\"S4.F2.8.2\">lm-eval</span>. 
Tasks are configured by YAML files or as a Python subclass, and encompass 1) a data source (using the <span id=\"S4.F2.9.3\">Datasets</span> library<cite>(Lhoest et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib60\" title=\"\">2021</a>)</cite>), 2) tools for defining prompts and format, 3) mapping these prompts to rendered inputs and expected output type from an <span id=\"S4.F2.10.4\">LM</span> in the form of <span id=\"S4.F2.11.5\">Requests</span>, and 4) rules for post-processing the <span id=\"S4.F2.12.6\">LM</span>’s outputs and calculating the final task metrics.</figcaption>\n</figure>\n<div id=\"S4.SS1.SSS0.Px1.p2\">\n<p id=\"S4.SS1.SSS0.Px1.p2.1\">We provide a number of implementations for common tasks, and accept new tasks sourced from the community. We strive to match the paper originally introducing a benchmark dataset in its methodology, including using the same prompts if applicable. For tasks such as those introduced prior to prompted evaluation becoming the standard, we source evaluation methodology from the paper first posing the evaluation dataset as a prompted task. For example, we implement many tasks as adapted for in-context learning by <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>.</p></div>\n</section>\n<section id=\"S4.SS1.SSS0.Px2\">\n<h5>LMs</h5>\n<div id=\"S4.SS1.SSS0.Px2.p1\">\n<p id=\"S4.SS1.SSS0.Px2.p1.1\">The next core piece of <span id=\"S4.SS1.SSS0.Px2.p1.1.1\">lm-eval</span>&nbsp;is the <span id=\"S4.SS1.SSS0.Px2.p1.1.2\">LM</span> API. 
Because effective <span id=\"S4.SS1.SSS0.Px2.p1.1.3\">orchestration</span> is our core goal, we allow arbitrary software libraries or (autoregressive) language model architectures to extend a provided interface for <span id=\"S4.SS1.SSS0.Px2.p1.1.4\">LM</span> objects.</p></div>\n<div id=\"S4.SS1.SSS0.Px2.p2\">\n<p id=\"S4.SS1.SSS0.Px2.p2.1\">For ease of use, and compartmentalization of the model definition and external library integrations for custom models away from core evaluation logic, we assume that LMs operate upon dispatched <span id=\"S4.SS1.SSS0.Px2.p2.1.1\">Requests</span> which consist of mapping <span id=\"S4.SS1.SSS0.Px2.p2.1.2\">string inputs</span> to some <span id=\"S4.SS1.SSS0.Px2.p2.1.3\">string or probability</span> as output. We thus abstract tokenizers away within the <span id=\"S4.SS1.SSS0.Px2.p2.1.4\">LM</span> class, and treat a neural language model combined with its tokenizer as a single system being evaluated.</p></div>\n<div id=\"S4.SS1.SSS0.Px2.p3\">\n<p id=\"S4.SS1.SSS0.Px2.p3.1\">LMs implement a simple interface, consisting of several types of <span id=\"S4.SS1.SSS0.Px2.p3.1.1\">Requests</span> in order to be used within the library for all supported tasks.</p></div>\n</section>\n<section id=\"S4.SS1.SSS0.Px3\">\n<h5>Request Types</h5>\n<div id=\"S4.SS1.SSS0.Px3.p1\">\n<p id=\"S4.SS1.SSS0.Px3.p1.1\">We allow for 3 core types of <span id=\"S4.SS1.SSS0.Px3.p1.1.1\">Requests</span> that may be sent to a language model, which consist of distinct types of <span id=\"S4.SS1.SSS0.Px3.p1.1.2\">measurements</span> that can be performed to observe a model’s response or latent capabilities in a prompted format. 
These are:</p></div>\n<div id=\"S4.SS1.SSS0.Px3.p2\">\n<ul id=\"S4.I1\">\n<li id=\"S4.I1.i1\">\n<span>•</span>\n<div id=\"S4.I1.i1.p1\">\n<p id=\"S4.I1.i1.p1.1\">(Conditional) Loglikelihoods (<span id=\"S4.I1.i1.p1.1.1\">loglikelihood, multiple_choice</span>) - computing the probability of given output string(s), conditioned on some provided input.</p></div>\n</li>\n<li id=\"S4.I1.i2\">\n<span>•</span>\n<div id=\"S4.I1.i2.p1\">\n<p id=\"S4.I1.i2.p1.1\">Perplexities (<span id=\"S4.I1.i2.p1.1.1\">loglikelihood_rolling</span>) - measuring the average loglikelihood or probability of producing the tokens in a given dataset.</p></div>\n</li>\n<li id=\"S4.I1.i3\">\n<span>•</span>\n<div id=\"S4.I1.i3.p1\">\n<p id=\"S4.I1.i3.p1.1\">Generation (<span id=\"S4.I1.i3.p1.1.1\">generate_until</span>) - generating text until a given stopping condition is reached, from a model conditioned on some provided input.</p></div>\n</li>\n</ul>\n</div>\n<figure id=\"S4.F3\"><img alt=\"Refer to caption\" height=\"295\" id=\"S4.F3.g1\" src=\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/typesoftasks.png\" width=\"598\">\n<figcaption><span>Figure 3: </span>Overview of the three core <span id=\"S4.F3.2.1\">Request</span> types supported by our evaluation framework. These include (1) conditional loglikelihoods, (2) perplexities, and (3) generation-based requests.</figcaption>\n</figure>\n<div id=\"S4.SS1.SSS0.Px3.p3\">\n<p id=\"S4.SS1.SSS0.Px3.p3.1\">Provided with these three primitive operations, we are able to implement the major ways in the literature that have been used to evaluate LMs (<cite>Gao et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib36\" title=\"\">2020</a>)</cite>, <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>, <span id=\"S4.SS1.SSS0.Px3.p3.1.1\">inter alia</span>). 
While these high-level approaches are standard, they all contain a number of subtle implementation decisions which are often not disclosed in papers. Therefore, we include a full formal description of common implementation details involved in ours and others’ approaches within Appendix <a href=\"https://arxiv.org/html/2405.14782v1#A1\" title=\"Appendix A Formalizing Measurements ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>A</span></a> for completeness, which we hope will be a useful contribution to the literature.</p></div>\n</section>\n</section>\n<section id=\"S4.SS2\">\n<h3>\n<span>4.2 </span>Addressing Challenges and Incorporating Best Practices</h3>\n<div id=\"S4.SS2.p1\">\n<p id=\"S4.SS2.p1.1\">Here we detail how we position <span id=\"S4.SS2.p1.1.1\">lm-eval</span>&nbsp;to address the issues mentioned in Section <a href=\"https://arxiv.org/html/2405.14782v1#S2\" title=\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2</span></a> and incorporate the recommendations in Section <a href=\"https://arxiv.org/html/2405.14782v1#S3\" title=\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>3</span></a>, in order to encourage a more robust evaluation ecosystem.</p></div>\n<section id=\"S4.SS2.SSS0.Px1\">\n<h5>Reproducibility</h5>\n<div id=\"S4.SS2.SSS0.Px1.p1\">\n<p id=\"S4.SS2.SSS0.Px1.p1.1\"><span id=\"S4.SS2.SSS0.Px1.p1.1.1\">lm-eval</span>&nbsp;encourages and enables reproducible evaluation in several ways. 
First, by providing a standardized implementation of many common tasks, practitioners can report on these tasks and ensure they are evaluating on the same prompt and implementation as other users of the library.</p></div>\n<div id=\"S4.SS2.SSS0.Px1.p2\">\n<p id=\"S4.SS2.SSS0.Px1.p2.1\">Alongside task results we report a <span id=\"S4.SS2.SSS0.Px1.p2.1.1\">version</span> field, incremented each time a task must be modified in a way that affects its scoring. Therefore, in the case where task implementations have bugs or must otherwise be updated, one can still reference the version of the task used, to ensure future research can reproduce reported results.</p></div>\n<div id=\"S4.SS2.SSS0.Px1.p3\">\n<p id=\"S4.SS2.SSS0.Px1.p3.1\">While this is not a panacea for the costs of comparing to prior work, and rerunning baselines oneself is advised, when prior work uses our library one can be confident that the results from prior work match what one would have gotten had one rerun it oneself using that version of the library <cite>(Beeching et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib9\" title=\"\">2023</a>)</cite>.</p></div>\n</section>\n<section id=\"S4.SS2.SSS0.Px2\">\n<h5>Qualitative Analysis</h5>\n<div id=\"S4.SS2.SSS0.Px2.p1\">\n<p id=\"S4.SS2.SSS0.Px2.p1.1\"><span id=\"S4.SS2.SSS0.Px2.p1.1.1\">lm-eval</span>&nbsp;provides support for performing qualitative analysis of evaluation scores. 
In keeping with our recommended best practices, we implement the following, which allow for qualitative checks to be a core part of the evaluation workflow when using&nbsp;<span id=\"S4.SS2.SSS0.Px2.p1.1.2\">lm-eval</span>:</p></div>\n<div id=\"S4.SS2.SSS0.Px2.p2\">\n<ul id=\"S4.I2\">\n<li id=\"S4.I2.i1\">\n<span>•</span>\n<div id=\"S4.I2.i1.p1\">\n<p id=\"S4.I2.i1.p1.1\">We allow for artificially <span id=\"S4.I2.i1.p1.1.1\">limit</span>ing the amount of samples used for a given evaluation run, to enable code to be tested and outputs to be reviewed in small batches prior to full evaluation runs.</p></div>\n</li>\n<li id=\"S4.I2.i2\">\n<span>•</span>\n<div id=\"S4.I2.i2.p1\">\n<p id=\"S4.I2.i2.p1.1\">Per-sample logging is supported, for post-hoc reproduction of scores or error analysis of model mistakes or evaluation implementation.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S4.SS2.SSS0.Px3\">\n<h5>Statistical Testing</h5>\n<div id=\"S4.SS2.SSS0.Px3.p1\">\n<p id=\"S4.SS2.SSS0.Px3.p1.1\"><span id=\"S4.SS2.SSS0.Px3.p1.1.1\">lm-eval</span>&nbsp;reports the standard error (SE) of most supported metrics, calculated by either bootstrapping or dividing the sample standard deviation by the root of the sample size.</p></div>\n<div id=\"S4.SS2.SSS0.Px3.p2\">\n<p id=\"S4.SS2.SSS0.Px3.p2.1\">By reporting these SE calculations prominently in every evaluation run, we make it trivial for practitioners to add simple statistical measures such as confidence intervals to their results. 
While we believe more rigorous and widespread statistical testing in LM evaluation is still needed, we hope that this will spur the community to report and be more aware of statistical significance concerns by lowering the difficulty of reporting such measures.</p></div>\n</section>\n</section>\n</section>\n<section id=\"S5\">\n<h2>\n<span>5 </span>Case Studies</h2>\n<div id=\"S5.p1\">\n<p id=\"S5.p1.1\">Finally, we demonstrate <span id=\"S5.p1.1.1\">lm-eval</span>’s utility for improving evaluation rigor and understanding via case studies of its successful usage. We provide additional case studies within Appendix <a href=\"https://arxiv.org/html/2405.14782v1#A2\" title=\"Appendix B Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>B</span></a>.</p></div>\n<section id=\"S5.SS1\">\n<h3>\n<span>5.1 </span>Prompts Massively Affect Results</h3>\n<div id=\"S5.SS1.p1\">\n<p id=\"S5.SS1.p1.1\">As mentioned in Section <a href=\"https://arxiv.org/html/2405.14782v1#S2.SS3.SSS1\" title=\"2.3.1 “Minor” Implementation Details Matter ‣ 2.3 Implementation Difficulties and (Ir)Reproducibility ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2.3.1</span></a>, language models can be very sensitive to the specific prompt or framing of an evaluation task.</p></div>\n<div id=\"S5.SS1.p2\">\n<p id=\"S5.SS1.p2.1\">Here we detail how <span id=\"S5.SS1.p2.1.1\">lm-eval</span>&nbsp;has been used to study this sensitivity, and can be used to improve confidence in the comparison of scores across models. We focus our attention on two popular language modeling benchmarks: the ARC question answering benchmark <cite>(Clark et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib23\" title=\"\">2018</a>)</cite> and MMLU <cite>(Hendrycks et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib43\" title=\"\">2021</a>)</cite>. 
However, not all papers evaluate on these tasks in the same way as the original formats.</p></div>\n<figure id=\"S5.T1\">\n\n<figcaption><span>Table 1: </span>Comparison of 0-shot model performance for several pretrained LMs <cite>(Black et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib16\" title=\"\">2022</a>; Touvron et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib115\" title=\"\">2023</a>; Penedo et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib82\" title=\"\">2023</a>; Jiang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib51\" title=\"\">2023</a>; <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib52\" title=\"\">2024</a>)</cite> on ARC (Easy and Challenge subsets) and MMLU across two commonly used prompt styles, evaluated using <span id=\"S5.T1.3.1\">lm-eval</span>.</figcaption>\n</figure>\n<div id=\"S5.SS1.p3\">\n<p id=\"S5.SS1.p3.1\">ARC is first adapted to the in-context learning setting by <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>, who implement the dataset as a “cloze” task: the model is prompted via <span id=\"S5.SS1.p3.1.1\">‘‘Question: {question}<math alttext=\"\\backslash\" display=\"inline\" id=\"S5.SS1.p3.1.1.m1.1\"><semantics id=\"S5.SS1.p3.1.1.m1.1a\"><mo id=\"S5.SS1.p3.1.1.m1.1.1\" xref=\"S5.SS1.p3.1.1.m1.1.1.cmml\">\\</mo><annotation-xml encoding=\"MathML-Content\" id=\"S5.SS1.p3.1.1.m1.1b\"><ci id=\"S5.SS1.p3.1.1.m1.1.1.cmml\" xref=\"S5.SS1.p3.1.1.m1.1.1\">\\</ci></annotation-xml><annotation encoding=\"application/x-tex\" id=\"S5.SS1.p3.1.1.m1.1c\">\\backslash</annotation><annotation encoding=\"application/x-llamapun\" id=\"S5.SS1.p3.1.1.m1.1d\">\\</annotation></semantics></math>nAnswer:’’</span> and the likelihood of each potential completion string’s <span id=\"S5.SS1.p3.1.2\">text</span> is compared. Comparatively, MMLU is implemented in <cite>Hendrycks et&nbsp;al. 
(<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib42\" title=\"\">2020</a>)</cite> via providing the model with the question text, each of the 4 possible answers preceded by an answer letter A, B, C, or D, and scoring over the <span id=\"S5.SS1.p3.1.3\">answer letters’ probabilities</span>.</p></div>\n<div id=\"S5.SS1.p4\">\n<p id=\"S5.SS1.p4.1\">However, if models do not adopt these approaches, or disclose their exact settings, it is impossible to reliably compare stated model performance. In Table <a href=\"https://arxiv.org/html/2405.14782v1#S5.T1\" title=\"Table 1 ‣ 5.1 Prompts Massively Affect Results ‣ 5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>1</span></a>, we compare evaluation on the Easy and Challenge sets of ARC using the prompt from <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite> (“Cloze”) and using an MMLU-style answer letter with explicit multiple choice options (“MMLU-style”). We additionally compare MMLU scores between the original MMLU prompting style (“MMLU-style”) and an approach we term “Hybrid”, consisting of an MMLU-style prompt but using the <span id=\"S5.SS1.p4.1.1\">answer strings</span> instead of answer letters as the set of continuations over which we can scor"
  }
}
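The "Statistical Testing" passage in the paper text above says lm-eval reports a standard error computed either by bootstrapping or by dividing the sample standard deviation by the root of the sample size. A minimal sketch of both calculations follows, using only the Python standard library. The function names, the resample count, and the toy score list are illustrative assumptions, not lm-eval's actual API or defaults.

```python
import math
import random
import statistics


def analytic_standard_error(scores):
    """Standard error of the mean: sample std dev divided by sqrt(n)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two per-sample scores")
    return statistics.stdev(scores) / math.sqrt(n)


def bootstrap_standard_error(scores, metric=statistics.fmean, iters=1000, seed=0):
    """Std dev of `metric` over resamples drawn with replacement.

    `iters` and `seed` are arbitrary choices made for this sketch.
    """
    rng = random.Random(seed)
    n = len(scores)
    estimates = [metric(rng.choices(scores, k=n)) for _ in range(iters)]
    return statistics.stdev(estimates)


def confidence_interval(mean, se, z=1.96):
    """Normal-approximation interval (z=1.96 gives roughly 95%)."""
    return (mean - z * se, mean + z * se)


# Toy per-sample correctness flags (hypothetical data, 1 = correct).
per_sample_correct = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
accuracy = statistics.fmean(per_sample_correct)
se = analytic_standard_error(per_sample_correct)
boot_se = bootstrap_standard_error(per_sample_correct)
low, high = confidence_interval(accuracy, se)
```

The analytic form suits metrics that are plain means of per-sample scores. The bootstrap form accepts any `metric` callable, which is why it is the natural fallback for metrics that are not simple averages.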

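Section 5.1 of the paper text above contrasts three ways of posing the same multiple-choice item: "Cloze" (score each answer string after `Question: {question}\nAnswer:`), "MMLU-style" (list lettered options, score the answer letters), and "Hybrid" (MMLU-style prompt, but score the answer strings). The sketch below builds the context and candidate continuations for each style. The exact option layout (`A. text`), the leading space on continuations, and the `loglikelihood` callable are assumptions made for illustration; they are not taken from lm-eval's task definitions.

```python
LETTERS = ["A", "B", "C", "D"]


def cloze_request(question, choices):
    """Cloze: no options shown; each answer string is a continuation."""
    context = f"Question: {question}\nAnswer:"
    return context, [f" {choice}" for choice in choices]


def mmlu_style_request(question, choices):
    """MMLU-style: lettered options in the prompt; letters are scored."""
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, choices))
    context = f"{question}\n{options}\nAnswer:"
    return context, [f" {l}" for l in LETTERS[: len(choices)]]


def hybrid_request(question, choices):
    """Hybrid: MMLU-style prompt, but answer strings are scored."""
    context, _ = mmlu_style_request(question, choices)
    return context, [f" {choice}" for choice in choices]


def pick_answer(context, continuations, loglikelihood):
    """Return the index of the continuation with the highest score.

    `loglikelihood(context, continuation)` is a stand-in for a real
    model call that returns log P(continuation | context).
    """
    scores = [loglikelihood(context, c) for c in continuations]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage with a toy scorer that simply prefers shorter continuations.
ctx, conts = hybrid_request("2 + 2 = ?", ["3", "4", "five", "22"])
chosen = pick_answer(ctx, conts, lambda c, k: -len(k))
```

Because the three builders differ only in what the model sees and what gets scored, swapping one for another changes the measurement without touching the dataset, which is the source of the score differences the paper reports.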
▼✓Web search, result count: 10, query: "lm-evaluation-harness simple_evaluate predict_only loglikelihood generate_unt..."

{
"content": [
{
"type": "text",
"text": "{\n \"results\": [\n {\n \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n \"description\": \"# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity….\"\n },\n {\n \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub\",\n \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. 
For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n },\n {\n \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n \"description\": \"EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. 
from lm_eval import evaluator, tasks; from lm_eval.models import get_model; model = get_model(\\\"hf\\\", pretrained=\\\"gpt2\\\"); results = evaluator.evaluate(model, tasks=[\\\"hellaswag\\\"], num_fewshot=5). import lm_eval; results = lm_eval.simple_evaluate(model=\\\"gpt2\\\", tasks=[\\\"hellaswag\\\", \\\"mmlu\\\"], num_fewshot=0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true --tasks mmlu,hellaswag --num_fewshot 5 --batch_size auto --output_path results/mistral-7b.json --apply_chat_template. # Set OPENAI_API_KEY environment variable first. lm-eval --model openai-chat --model_args model=gpt-4-turbo --tasks mmlu,bbh --num_fewshot 5 --batch_size 10.\"\n },\n {\n \"title\": \"lm-evaluation-harness/docs/task_guide.md at main - GitHub\",\n \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md\",\n \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined.
All tasks in the\"\n },\n {\n \"title\": \"lm-evaluation-harness/docs/task_guide.md - Stanford GitLab\",\n \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/blob/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness/docs/task_guide.md\",\n \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined.\"\n },\n {\n \"title\": \"LM Evaluation Harness\",\n \"url\": \"https://slyracoon23.github.io/lm-evaluation-harness/\",\n \"description\": \"This tool allows researchers and developers to: Evaluate model performance on a wide range of tasks; Compare different models using consistent metrics; Extend\"\n },\n {\n \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo\",\n \"url\": \"https://zenodo.org/records/12608602\",\n \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. 
Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder.\"\n },\n {\n \"title\": \"blockdialect - lm-evaluation-harness - Stanford GitLab\",\n \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness\",\n \"description\": \"This project provides a unified framework to test generative language models on a large number of different evaluation tasks.\"\n },\n {\n \"title\": \"Evaluating LLMs — EleutherAI\",\n \"url\": \"https://www.eleuther.ai/projects/large-language-model-evaluation\",\n \"description\": \"# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. 
## Training LLMs Next.\"\n },\n {\n \"title\": \"lm-evaluation-harness/examples/lm-eval-overview.ipynb at main · EleutherAI/lm-evaluation-harness\",\n \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/examples/lm-eval-overview.ipynb\",\n \"description\": \"# lm-evaluation-harness/examples/lm-eval-overview.ipynb at main · EleutherAI/lm-evaluation-harness · GitHub. * GitHub Copilot Write better code with AI. * GitHub Spark Build and deploy intelligent apps. * GitHub Models Manage and compare prompts. * Code Review Manage code changes. * GitHub Advanced Security Find and fix vulnerabilities. * Code security Secure your code as you build. * GitHub Advanced Security Enterprise-grade security features. * Copilot for Business Enterprise-grade AI features. # Search code, repositories, users, issues, pull requests... We read every piece of feedback, and take your input very seriously. ## Use saved searches to filter your results more quickly. You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert. EleutherAI/**lm-evaluation-harness**Public. * NotificationsYou must be signed in to change notification settings. You must be signed in to make or propose changes. ### Footer navigation.\"\n }\n ]\n}"
}
],
"details": {
"results": [
{
"title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
"url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
"description": "# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ### Introduction. In the lm-evaluation-harness, the task config is the main way to select a preconfigured evaluation task, which determines the dataset, group, types of measurements, post-processing settings, etc. When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures that **lm-evaluation-harness** provides. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2-1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem."
},
{
"title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
"url": "https://github.com/EleutherAI/lm-evaluation-harness",
"description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
},
{
"title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
"url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
"description": "EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks; from lm_eval.models import get_model; model = get_model(\"hf\", pretrained=\"gpt2\"); results = evaluator.evaluate(model, tasks=[\"hellaswag\"], num_fewshot=5). import lm_eval; results = lm_eval.simple_evaluate(model=\"gpt2\", tasks=[\"hellaswag\", \"mmlu\"], num_fewshot=0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true --tasks mmlu,hellaswag --num_fewshot 5 --batch_size auto --output_path results/mistral-7b.json --apply_chat_template. # Set OPENAI_API_KEY environment variable first. lm-eval --model openai-chat --model_args model=gpt-4-turbo --tasks mmlu,bbh --num_fewshot 5 --batch_size 10."
},
{
"title": "lm-evaluation-harness/docs/task_guide.md at main - GitHub",
"url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md",
"description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the"
},
{
"title": "lm-evaluation-harness/docs/task_guide.md - Stanford GitLab",
"url": "https://code.stanford.edu/tambe-lab/blockdialect/-/blob/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness/docs/task_guide.md",
"description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined."
},
{
"title": "LM Evaluation Harness",
"url": "https://slyracoon23.github.io/lm-evaluation-harness/",
"description": "This tool allows researchers and developers to: Evaluate model performance on a wide range of tasks; Compare different models using consistent metrics; Extend"
},
{
"title": "EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo",
"url": "https://zenodo.org/records/12608602",
"description": "# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder."
},
{
"title": "blockdialect - lm-evaluation-harness - Stanford GitLab",
"url": "https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness",
"description": "This project provides a unified framework to test generative language models on a large number of different evaluation tasks."
},
{
"title": "Evaluating LLMs — EleutherAI",
"url": "https://www.eleuther.ai/projects/large-language-model-evaluation",
"description": "# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs Next."
},
{
"title": "lm-evaluation-harness/examples/lm-eval-overview.ipynb at main · EleutherAI/lm-evaluation-harness",
"url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/examples/lm-eval-overview.ipynb",
"description": "# lm-evaluation-harness/examples/lm-eval-overview.ipynb at main · EleutherAI/lm-evaluation-harness · GitHub. * GitHub Copilot Write better code with AI. * GitHub Spark Build and deploy intelligent apps. * GitHub Models Manage and compare prompts. * Code Review Manage code changes. * GitHub Advanced Security Find and fix vulnerabilities. * Code security Secure your code as you build. * GitHub Advanced Security Enterprise-grade security features. * Copilot for Business Enterprise-grade AI features. # Search code, repositories, users, issues, pull requests... We read every piece of feedback, and take your input very seriously. ## Use saved searches to filter your results more quickly. You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert. EleutherAI/**lm-evaluation-harness**Public. * NotificationsYou must be signed in to change notification settings. You must be signed in to make or propose changes. ### Footer navigation."
}
]
}
}

▼thought

Now I have a very comprehensive picture. Let me compile this into a thorough, well-structured deep-dive on lm-evaluation-harness. I have: 1. The full README from the GitHub repo (raw) 2. The v0.4.9.2 release notes from Zenodo 3. The EleutherAI paper "Lessons from the Trenches on Reproducible Evaluation of Language Models" 4. Various search results about architecture, comparisons, limitations Let me now write a comprehensive deep-dive.

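LM Evaluation Harness: A Deep-Dive Report

In one sentence

lm-evaluation-harness (lm-eval for short) is an open-source framework developed by EleutherAI for standardized, reproducible few-shot evaluation of language models. It addresses the central "orchestration problem" of LLM evaluation: researchers used to install separate code and re-implement prompts for every benchmark, which easily introduced subtle differences that made results incomparable. Through a single interface, lm-eval covers 60+ standard academic benchmarks with hundreds of subtasks and variants. It is the backend of the Hugging Face Open LLM Leaderboard and is used internally by dozens of organizations, including NVIDIA, Cohere, BigScience, and Nous Research (GitHub README).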

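1. Design philosophy: why it is needed

In their 2024 paper Lessons from the Trenches on Reproducible Evaluation of Language Models, the EleutherAI team systematically describes the challenges facing LLM evaluation and the design principles behind lm-eval:

Core problems

The "Key Problem" (semantically equivalent but syntactically different answers): the same meaning can be expressed in countless ways, and the best available tool for judging whether two sentences are semantically equivalent is the very kind of model we are trying to evaluate. This forces most benchmarks to settle for second best: either recast the task as multiple choice (restricting the answer space) or fall back on heuristic string matching.
"Minor" implementation details matter: small changes to the prompt, the formatting, or tokenization details can significantly affect results. The paper's case study shows that the same model can score very differently on ARC depending on whether a "cloze"-style or an "MMLU-style" prompt is used (Lessons from the Trenches, Section 5.1).
Cross-paper comparison is nearly impossible: papers differ in their prompts, their choice of few-shot examples, and how they compute metrics. Even when two papers claim to evaluate "the same benchmark", they may in practice be running quite different things.
API models are deprecated quickly: OpenAI's code-davinci-002 was deprecated in January 2024, leaving hundreds of studies impossible to reproduce.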

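How `lm-eval` responds

Unified codebase: all tasks are implemented within one framework, so prompts, metrics, and post-processing logic stay consistent
Task versioning: a task's version number is incremented whenever a change affects its scores, keeping results traceable
Standard-error reporting: every evaluation automatically reports bootstrap standard errors, lowering the barrier to statistical significance testing
Per-sample logging: each sample's model output can be recorded for later analysis and reproduction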
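
To make the standard-error point concrete, here is a minimal sketch of a bootstrap standard error for an accuracy metric, computed from per-sample scores. It is an illustration of the general technique rather than lm-eval's own implementation; the function names, the resample count, and the 500-item score list are assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_stderr(scores, n_resamples=1000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


def analytic_stderr(scores):
    """Closed-form standard error of the mean, shown for comparison."""
    return statistics.stdev(scores) / len(scores) ** 0.5


# Hypothetical per-sample correctness for a 500-item task (1 = correct).
per_sample = [1] * 350 + [0] * 150
accuracy = sum(per_sample) / len(per_sample)

print(f"acc = {accuracy:.3f}")
print(f"bootstrap stderr = {bootstrap_stderr(per_sample):.4f}")
print(f"analytic stderr  = {analytic_stderr(per_sample):.4f}")
```

For a simple mean the two estimates should agree closely; the bootstrap matters for metrics that have no convenient closed form.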

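2. Architecture: three core abstractions

2.1 Request Types (three measurement primitives)

lm-eval reduces the measurement of model capability to three basic request types (Lessons from the Trenches, Section 4.1):

Request type	What it measures	Typical use
`loglikelihood` / `multiple_choice`	The conditional log-probability of an output string given an input	Multiple-choice tasks such as MMLU and HellaSwag
`loglikelihood_rolling`	The log-probability of a full text with no conditioning context, from which perplexity is computed	Language-modeling tasks such as WikiText
`generate_until`	Text generated from a given input until a stop condition is met	Generative tasks such as GSM8K, BBH, and HumanEval

Key design decision: API models that do not expose logprobs (for example Anthropic Claude and OpenAI ChatCompletions) can only be used with `generate_until` tasks, whereas local models and APIs that return logprobs can run every task type: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`, the last of which is scored through `loglikelihood` requests (GitHub README).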
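
The sketch below shows how the likelihood-based request types could be used for scoring, with a toy unigram model standing in for a real LM. The class `ToyUnigramLM`, the helpers `pick_choice` and `word_perplexity`, and the tiny corpus are all hypothetical and are not part of lm-eval; the toy model also ignores the context, which a real model would not.

```python
import math
from collections import Counter


class ToyUnigramLM:
    """Toy stand-in for a language model: unigram probabilities, add-one smoothing."""

    def __init__(self, corpus):
        words = corpus.split()
        self.counts = Counter(words)
        self.total = len(words)
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def _word_logprob(self, word):
        return math.log((self.counts[word] + 1) / (self.total + self.vocab_size))

    def loglikelihood(self, context, continuation):
        """log P(continuation | context); this toy model disregards `context`."""
        return sum(self._word_logprob(w) for w in continuation.split())

    def loglikelihood_rolling(self, text):
        """Total log-probability of a whole text with no conditioning context."""
        return sum(self._word_logprob(w) for w in text.split())


def pick_choice(lm, context, choices):
    """Multiple choice: score every candidate continuation, return the argmax index."""
    scores = [lm.loglikelihood(context, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def word_perplexity(lm, text):
    """Perplexity = exp(-total log-likelihood / number of words)."""
    n_words = len(text.split())
    return math.exp(-lm.loglikelihood_rolling(text) / n_words)


lm = ToyUnigramLM("the cat sat on the mat the cat ate the fish")
best = pick_choice(lm, "The animal on the mat is", ["the cat", "a dog"])
print("chosen index:", best)
print("perplexity:", round(word_perplexity(lm, "the cat sat"), 3))
```

A model that cannot return log-probabilities has no way to implement the first two methods, which is why such APIs are limited to generation-style tasks.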

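2.2 Task (task abstraction)

Each Task encapsulates four stages:

Data source: loaded through the Hugging Face datasets library
Prompt construction: via YAML configuration (Jinja2 templates), a Python subclass, or import from PromptSource
Request mapping: renders the prompt into a concrete input and specifies the expected output type
Post-processing and scoring: parses the model output and computes metrics

A Task can be defined in a YAML configuration file (the recommended approach) or by subclassing the Python Task class to implement custom logic (new_task_guide.md).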
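
The four stages above can be sketched as a small pipeline. This is a simplified illustration and not lm-eval's Task class: `string.Template` stands in for Jinja2, `fake_model` returns canned text instead of calling a model, and the documents, function names, and request dictionary layout are all assumptions made for the example.

```python
import re
from string import Template

# Stage 1, data source: hypothetical documents (a real task loads a dataset).
DOCS = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 10 - 4?", "answer": "6"},
]

# Stage 2, prompt construction: string.Template stands in for a Jinja2 template.
PROMPT = Template("Question: $question\nAnswer:")


def build_request(doc):
    """Stage 3: render the prompt and declare the request type and stop sequence."""
    return {
        "type": "generate_until",
        "prompt": PROMPT.substitute(question=doc["question"]),
        "until": ["\n"],
    }


def fake_model(request):
    """Placeholder model that returns a canned generation for the demo."""
    canned = {
        "What is 2 + 3?": " The answer is 5.",
        "What is 10 - 4?": " I think 7.",
    }
    for question, output in canned.items():
        if question in request["prompt"]:
            return output
    return ""


def extract_answer(generation):
    """Stage 4a, post-processing: keep the last integer in the generation."""
    numbers = re.findall(r"-?\d+", generation)
    return numbers[-1] if numbers else ""


def exact_match(prediction, reference):
    """Stage 4b, scoring: 1.0 if the extracted answer matches the reference."""
    return float(prediction.strip() == reference.strip())


scores = []
for doc in DOCS:
    generation = fake_model(build_request(doc))
    scores.append(exact_match(extract_answer(generation), doc["answer"]))

print("exact_match:", sum(scores) / len(scores))
```

Keeping extraction separate from scoring is what lets the same metric be reused across tasks whose outputs need different parsing.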

2.3 LM (model abstraction)

A model is abstracted behind a single LM interface that treats the tokenizer and the neural network as one system. A wide range of model backends is supported:

Category	Backend	Model type name
Local inference	HuggingFace transformers	`hf`
Local inference	vLLM	`vllm`
Local inference	SGLang	`sglang`
Local inference	GGUF / llama.cpp	`gguf`, `ggml`
Local inference	Mamba SSM	`mamba_ssm`
Local inference	NVIDIA NeMo	`nemo_lm`
Local inference	NVIDIA Megatron-LM	`megatron_lm`
Local inference	OpenVINO	`openvino`
Local inference	Windows ML (NPU)	`winml`
API	OpenAI Completions / Chat	`openai-completions`, `openai-chat-completions`
API	Anthropic	`anthropic-chat`, `anthropic-chat-completions`
API	LiteLLM (100+ providers)	`litellm`, `litellm-chat`
API	Textsynth	`textsynth`
API	Azure OpenAI	via `openai-chat-completions`
Local server	OpenAI-compatible API	`local-completions`, `local-chat-completions`
Special	Steered HF models	`steered`
Special	HF Multimodal / vLLM VLM	`hf-multimodal`, `vllm-vlm`

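3. Version history and recent developments

Key releases

Version	Date	Key changes
v0.4.0	2024	Major refactor: config-based task creation, Jinja2 prompt design, vLLM support, new tasks
v0.4.2	2024	Fixed the stderr aggregation formula for group tasks such as MMLU; introduced `TaskManager`
v0.4.3	2024	Chat templating support (`hf` and `vllm`), addressing prompt formatting for instruct models
v0.4.4	2024	Open LLM Leaderboard v2 tasks; prettier output for `lm-eval --tasks list`
v0.4.8	2025	SGLang backend support; support for model steering vectors
v0.4.9	2025	vLLM data parallel V1; ACPBench Hard; multimodal API model support
v0.4.9.1	2025	Patch release: new tasks and fixes
v0.4.9.2	2025-11	Python 3.10 minimum requirement; many new benchmarks; GPT-5 support; Azure OpenAI support

v0.4.9.2 Release Notes (Zenodo)

Latest changes as of December 2025 (as noted in the README)

CLI refactor: subcommands (run, ls, validate) and YAML config file support (--config)
Lighter installation: the base package no longer includes transformers/torch; backends are installed on demand, e.g. pip install lm_eval[hf], lm_eval[vllm]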

4. Usage

CLI (command line)

# Basic usage
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8

# With a chat template (recommended for instruct models)
lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 \
    --tasks mmlu \
    --num_fewshot 5 \
    --apply_chat_template

# API model; a chat endpoint without logprobs can only serve generate_until tasks such as bbh
lm_eval --model openai-chat-completions \
    --model_args model=gpt-4-turbo \
    --tasks mmlu,bbh \
    --num_fewshot 5

# List available tasks
lm-eval ls tasks

Python API

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args={"pretrained": "gpt2"},
    tasks=["hellaswag", "mmlu"],
    num_fewshot=0,
)
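
Once `simple_evaluate` returns, the scores need to be pulled out of the result dictionary. The helper below is a sketch under an assumption: that per-task metrics sit under `results["results"][task]` with keys of the form `"metric,filter"` (for example `"acc,none"` and `"acc_stderr,none"`), as recent 0.4.x releases appear to produce. The `summarize` function and the example dictionary are hypothetical, and the numbers are made up.

```python
def summarize(results):
    """Flatten a results dict into (task, metric, value, stderr) rows."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            # Skip non-numeric entries (such as aliases) and the stderr entries
            # themselves; stderr is attached to its metric below.
            if not isinstance(value, (int, float)) or "_stderr" in key:
                continue
            name, _, filt = key.partition(",")
            stderr = metrics.get(f"{name}_stderr,{filt}")
            rows.append((task, name, value, stderr))
    return rows


# Hypothetical output using the assumed layout; values are illustrative only.
example = {
    "results": {
        "hellaswag": {
            "alias": "hellaswag",
            "acc,none": 0.29,
            "acc_stderr,none": 0.0045,
        },
    }
}

for task, metric, value, stderr in summarize(example):
    print(f"{task:12s} {metric:10s} {value:.4f} +/- {stderr}")
```

If the installed version uses a different key layout, only the `partition` and `get` lines would need to change.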

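Advanced features

--predict_only: output only the decoded text without automatic scoring (for tasks that cannot be scored automatically)
--log_samples: record each sample's model output for later analysis
--use_cache <path>: cache model responses so that an interrupted run can resume without recomputing them
--hf_hub_log_args: push results and samples to the Hugging Face Hub
W&B integration: --wandb_args project=my-project logs runs to Weights & Biases automatically
Zeno integration: visualize evaluation results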

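5. Ecosystem position and comparison with alternatives

Differences from HELM (Stanford CRFM)

HELM is Stanford's broad-coverage evaluation framework. It aims to be holistic: more scenarios (QA, summarization, information retrieval, toxicity, bias, and so on), more metrics, and more models. Its design philosophy is to "tell you how a model performs in the real world".

lm-eval takes a different view: it does not prescribe a "correct" evaluation protocol but provides a flexible orchestration framework in which users pick the tasks and configurations they care about. It is lighter, easier to extend, and more focused on running standard academic benchmarks quickly.

Relationship to lmms-eval

lmms-eval began as a fork of lm-evaluation-harness and focuses on evaluating multimodal (vision-language) models. lm-eval has since added multimodal model types of its own (hf-multimodal, vllm-vlm), but the maintainers still recommend lmms-eval when a broader range of multimodal tasks is needed.

Relationship to NVIDIA NeMo Evaluator

NVIDIA NeMo Evaluator integrates lm-eval directly as the backend for its Academic Benchmarks evaluations (NVIDIA NeMo Microservices docs), an indication that lm-eval has become a standard industry component.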

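6. Known limitations and criticisms

6.1 Evaluation is not the same as real capability

lm-eval's design documentation states plainly that what it guarantees is measurement consistency, not construct validity, that is, whether a benchmark score actually reflects the real-world capability we care about (Lessons from the Trenches, Section 2.2). This is an inherent problem of all benchmarks and is not specific to lm-eval.

6.2 Prompt sensitivity

The same model can score very differently under different prompt styles (the paper presents ARC and MMLU as case studies). lm-eval versions its tasks so that results within one task stay comparable, but it cannot settle the argument over which prompt variant is fairer.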
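
To illustrate what "prompt style" means here, the sketch below renders one multiple-choice item in a cloze style and in an MMLU style. The question, the templates, and the function names are hypothetical approximations written for this example; they are not the exact templates used by lm-eval or by the paper.

```python
def cloze_prompt(question):
    """Cloze style: the model scores each full answer string as a continuation."""
    return f"Question: {question}\nAnswer:"


def mmlu_style_prompt(question, choices):
    """MMLU style: options are listed with letters and the model scores a letter."""
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"


question = "Which gas do plants absorb from the atmosphere?"
choices = ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"]

# The continuations whose likelihoods are compared under each style.
cloze_targets = [f" {c}" for c in choices]
mmlu_targets = [f" {letter}" for letter in "ABCD"]

print(cloze_prompt(question))
print(cloze_targets)
print(mmlu_style_prompt(question, choices))
print(mmlu_targets)
```

In the cloze style the model never sees the competing options, while in the MMLU style it must map its answer onto a letter, so the two formats probe somewhat different skills even though the underlying item is identical.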

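6.3 No native multi-node evaluation

The hf model type does not natively support multi-node evaluation; it requires an external inference server or a custom distributed script.

6.4 vLLM output does not always match HF

vLLM occasionally produces output that differs from the HuggingFace reference implementation. The project provides a comparison script for checking consistency.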
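
One simple way to judge whether a gap between two backends is larger than evaluation noise is a z-test on the two reported accuracies and their standard errors. The sketch below shows that general technique; it is not a description of the official comparison script, and the function name and the input numbers are assumptions.

```python
import math


def two_sided_p_value(acc_a, stderr_a, acc_b, stderr_b):
    """z-test for the difference between two accuracy estimates.

    Treats the two estimates as independent, which is only an approximation
    when both runs score the same samples.
    """
    z = (acc_a - acc_b) / math.sqrt(stderr_a**2 + stderr_b**2)
    # Two-sided tail probability of the standard normal via the error function.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical accuracies and standard errors from an HF run and a vLLM run.
z, p = two_sided_p_value(0.571, 0.0049, 0.569, 0.0049)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A large p-value means the difference is compatible with noise. Because both runs see the same items, a paired per-sample test would be more sensitive than this independent-samples approximation.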

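6.5 Cohere API support blocked long-term

The Cohere API integration has long been blocked by an API bug (GitHub PR #395).

6.6 "Evaluation lag" in academic papers

One arXiv paper indexed in the site feed, Frontier Lag (arXiv:2605.04135), is a bibliometric audit of 112,303 LLM-keyword-matched candidate records. It finds that the median paper evaluates a model about 10.85 ECI points behind the contemporaneous frontier (roughly 1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) and that the gap is widening by +5.53 ECI per year. Even with a standardized tool such as lm-eval, academic evaluation therefore still faces the structural problem of being out of date by the time it is published.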

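7. Summary

lm-evaluation-harness has become the de facto standard for LLM evaluation. Its core value lies not in offering "the best benchmark" but in the following:

Lowering the barrier to evaluation: a single command can run any of 60+ benchmarks
Ensuring reproducibility: unified codebase, versioned tasks, and standard-error reporting
Flexibility and extensibility: new tasks via YAML configuration, and support for nearly every mainstream model backend
Community-driven development: the community keeps contributing new benchmarks and model backends (v0.4.9.2 alone had 32 new contributors)

If you need to evaluate an LLM quickly and in a standardized way, lm-eval is currently the most mature option. Keep in mind, though, that it solves the problem of measurement consistency, not the problem of what is worth measuring; the latter still requires the researcher's own judgment.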


Deep-dive research on lm-evaluation-har...

Deep-dive research on lm-evaluation-harness

▼thought

▼✓Feed full-text search keywords: "lm-evaluation-harness EleutherAI benchmark", count: 30

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"items\": [\n    {\n      \"id\": \"oai:arXiv.org:2605.04454v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04454\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone\",\n      \"author\": [\n        \"Varad Vishwarupe\",\n        \"Nigel Shadbolt\",\n        \"Marina Jirotka\",\n        \"Ivan Flechais\"\n      ],\n      \"summary\": \"arXiv:2605.04454v1 Announce Type: cross Abstract: Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. 
This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.10144v2\",\n      \"url\": \"https://arxiv.org/abs/2602.10144\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"When LLMs get significantly worse: A statistical approach to detect model degradations\",\n      \"author\": [\n        \"Jonas K\\\\\\\"ubler\",\n        \"Kailash Budhathoki\",\n        \"Matth\\\\\\\"aus Kleindessner\",\n        \"Xiong Zhou\",\n        \"Junming Yin\",\n        \"Ashish Khetan\",\n        \"George Karypis\"\n      ],\n      \"summary\": \"arXiv:2602.10144v2 Announce Type: replace-cross Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is an evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. 
We propose a statistically sound hypothesis testing framework based on McNemar's test allowing to efficiently detect model degradations, while guaranteeing a controlled rate of false positives. The crucial insight is that we have to confront the model scores on each sample, rather than aggregated on the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the largely adopted open source LM Evaluation Harness and provide a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04523v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04523\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation\",\n      \"author\": [\n        \"Ivan Bondarenko\",\n        \"Roman Derunets\",\n        \"Oleg Sedukhin\",\n        \"Mikhail Komarov\",\n        \"Ivan Chernov\",\n        \"Mikhail Kulakov\"\n      ],\n      \"summary\": \"arXiv:2605.04523v1 Announce Type: new Abstract: We present our winning system for Task~B (generation with reference passages) in SemEval-2026 Task~8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. 
We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost--performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"a0f60af6af70259d9ac714a9b879b056074869ef062450ebda96f1f65f23df29\",\n      \"url\": \"https://github.com/bogini\",\n      \"source_url\": \"https://github.com/trending/developers/typescript?since=daily\",\n      \"title\": \"bogini\",\n      \"author\": [\n        \"bogini\"\n      ],\n      \"summary\": \"Popular repo: function-health-exporter - Turn your Function Health labs into LLM-ready reports\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T16:48:45.730Z\",\n      \"fetched_at\": \"2026-05-07T16:48:46.617Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03858v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03858\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following\",\n      \"author\": [\n        \"Jaeyun Lee\",\n        \"Junyoung Koh\",\n        \"Zeynel Tok\",\n        \"Hunar Batra\",\n        \"Ronald Clark\"\n      ],\n      \"summary\": \"arXiv:2605.03858v1 Announce Type: new Abstract: Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges 
are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:21.529Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05175v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05175\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge\",\n      \"author\": [\n        \"Perry E. Radau\"\n      ],\n      \"summary\": \"arXiv:2605.05175v1 Announce Type: cross Abstract: Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. 
No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. 
MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"55c5e35b9bbd92d15c88e3a7d0d8a7800ee1a3c0f0ad444c77ed2d5b62b143aa\",\n      \"url\": \"https://github.com/VectifyAI/PageIndex\",\n      \"source_url\": \"https://github.com/trending?since=daily\",\n      \"title\": \"VectifyAI/PageIndex\",\n      \"author\": [\n        \"VectifyAI\"\n      ],\n      \"summary\": \"📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T16:45:06.412Z\",\n      \"fetched_at\": \"2026-05-07T16:45:07.284Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.00933v2\",\n      \"url\": \"https://arxiv.org/abs/2602.00933\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers\",\n      \"author\": [\n        \"Chaithanya Bandi\",\n        \"Ben Hertzberg\",\n        \"Geobio Boo\",\n        \"Tejas Polakam\",\n        \"Jeff Da\",\n        \"Sami Hassaan\",\n        \"Manasi Sharma\",\n        \"Andrew Park\",\n        \"Ernesto Hernandez\",\n        \"Dan Rambado\",\n        \"Ivan Salazar\",\n        \"Rafael Cruz\",\n        \"Chetan Rane\",\n        \"Ben Levin\",\n        \"Brad Kenstler\",\n        \"Bing Liu\"\n      ],\n      \"summary\": \"arXiv:2602.00933v2 Announce Type: replace-cross Abstract: The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. 
However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Tasks use natural language prompts that avoid naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%, with primary failures arising from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark dataset to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-05T16:07:19.561Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04357v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04357\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs\",\n      \"author\": [\n        \"Yixuan Mei\",\n        \"Zikun Li\",\n        \"Zixuan Chen\",\n        \"Shiqi Pan\",\n        \"Mengdi Wu\",\n        \"Xupeng Miao\",\n        \"Zhihao Jia\",\n        \"K. V. 
Rashmi\"\n      ],\n      \"summary\": \"arXiv:2605.04357v1 Announce Type: cross Abstract: The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\\\\times$ over the best baseline, and delivers up to 2.39$\\\\times$ higher goodput under scarce resource availability.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"1cad32c447f9131e633154d70b4987bb207a86751c3c8538ce833d36909d798c\",\n      \"url\": \"https://ahk.cardor.dev\",\n      \"source_url\": \"https://news.ycombinator.com/rss\",\n      \"title\": \"Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)\",\n      \"author\": null,\n      \"summary\": \"Comments\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T10:45:59.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:17.897Z\"\n    },\n    {\n      \"id\": \"38bf6ecc5fc6495ace1d93af960422ad1fe6b249a479a6470528e9ca7c98cf4e\",\n      \"url\": \"https://trendshift.io/repositories/14565\",\n      
\"source_url\": \"https://trendshift.io/github-trending-repositories?trending-range=1\",\n      \"title\": \"free-llm-api-resources\",\n      \"author\": [\n        \"cheahjs\"\n      ],\n      \"summary\": \"A list of free LLM inference resources accessible via API.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T16:05:38.142Z\",\n      \"fetched_at\": \"2026-05-06T16:05:42.052Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04298v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04298\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs\",\n      \"author\": [\n        \"Stefano Bann\\\\`o\",\n        \"Kate Knill\",\n        \"Mark Gales\"\n      ],\n      \"summary\": \"arXiv:2605.04298v1 Announce Type: new Abstract: Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system's true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. 
We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"cb49d85958c64b4aa4c473e84eef58173340bb3ed59f3ddf6d38ffd6b25c4f2b\",\n      \"url\": \"https://trendshift.io/repositories/22628\",\n      \"source_url\": \"https://trendshift.io/github-trending-repositories?trending-range=1\",\n      \"title\": \"decolua/9router\",\n      \"author\": null,\n      \"summary\": \"Unlimited FREE AI coding. Connect Claude Code, Codex, Cursor, Cline, Copilot, Antigravity to FREE Claude/GPT/Gemini via 40+ providers. 
Auto-fallback, RTK -40% tokens, never hit limits.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T16:14:28.906Z\",\n      \"fetched_at\": \"2026-05-07T16:14:29.690Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.05170v1\",\n      \"url\": \"https://arxiv.org/abs/2605.05170\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours\",\n      \"author\": [\n        \"The Verkor Team\",\n        \"Ravi Krishna\",\n        \"Suresh Krishna\",\n        \"David Chin\"\n      ],\n      \"summary\": \"arXiv:2605.05170v1 Announce Type: cross Abstract: Driven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced \\\"Design Conductor\\\" (or just \\\"Conductor\\\"), a system capable of building a 5-stage Linux-capable RISC-V CPU in 12 hours. In this work, we introduce an updated multi-agent harness powered by frontier models released in April 2026, which is able to handle 80x larger tasks, at higher quality, fully autonomously. Following a brief introduction, we examine 4 designs that the system produced autonomously, including \\\"VerTQ\\\", an LLM inference accelerator which hard-wires support for TurboQuant in a 240-cycle pipeline, starting from the TurboQuant arXiv paper. VerTQ includes heavy compute processing, with 5129 FP16/32 units; the design was mapped to an FPGA at 125 MHz and consumes 5.7 mm^2 in TSMC 16FF (8 attention pipes). We review the key new characteristics that enabled these results. 
Finally, we analyze Design Conductor's token usage and other empirical characteristics, including its limitations.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04135v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04135\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation\",\n      \"author\": [\n        \"David Gringras\",\n        \"Misha Salahshoor\"\n      ],\n      \"summary\": \"arXiv:2605.04135v1 Announce Type: cross Abstract: Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about \\\"AI\\\" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. 
The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of \\\"AI\\\" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"064c2984cfb59b2fa13683fc6a0d7ae1e6872fb7ccbd288d5ddcebb87d622f70\",\n      \"url\": \"https://doi.org/10.7910/DVN/FB7V2B\",\n      \"source_url\": \"https://dataverse.harvard.edu/\",\n      \"title\": \"Indic MMLU\",\n      \"author\": [\n        \"Anonymous, Anonymous\"\n      ],\n      \"summary\": \"The Indic MMLU dataset is a multilingual adaptation of the Massive Multitask Language Understanding (MMLU) benchmark developed to evaluate the reasoning, knowledge comprehension, and multilingual capabilities of Large Language Models (LLMs) across Indian languages. The dataset consists of professionally translated and quality-filtered multiple-choice question-answer pairs spanning diverse academic and professional domains, including science, mathematics, history, law, medicine, engineering, humanities, and social sciences. The primary purpose of this dataset is to provide a standardized benchmark for assessing model performance in low-resource and linguistically diverse Indic settings. 
The dataset enables research in multilingual NLP, cross-lingual transfer learning, language alignment, and culturally grounded AI evaluation. The dataset was generated through a structured pipeline involving machine-assisted translation of the original English MMLU benchmark into selected Indic languages, followed by extensive quality filtering using translation evaluation metrics such as BLEU, chrF++, and TER. Additional validation steps were applied to preserve semantic fidelity, answer consistency, and linguistic fluency. The final data is provided in standardized machine-readable formats suitable for benchmarking and downstream evaluation workflows. Indic MMLU is intended for researchers, academic institutions, and industry practitioners working on multilingual AI systems, Indic language technologies, and large-scale language model evaluation. By extending a widely recognized benchmark into Indian languages, the dataset contributes toward more inclusive, representative, and culturally relevant evaluation standards for modern AI systems\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T06:07:44.000Z\",\n      \"fetched_at\": \"2026-05-07T16:35:52.266Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04083v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04083\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals\",\n      \"author\": [\n        \"Tadhg Looram\",\n        \"Lucas Nuzzi\",\n        \"Kyle Waters\",\n        \"Steven Dillmann\"\n      ],\n      \"summary\": \"arXiv:2605.04083v1 Announce Type: new Abstract: Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. 
Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from $75.9\\\\%$ to $89.6\\\\%$ (strict common-subset agreement: $77.8\\\\%$ to $92.1\\\\%$), while compact juries exhibit substantially higher internal dissent (3--2 split rate $28.7\\\\%$--$32.4\\\\%$) than frontier juries ($6.1\\\\%$--$11.5\\\\%$). 
Verifier traces further show that compact juries reduce per-criterion judging cost to roughly $4.2\\\\%$--$5.6\\\\%$ of frontier and latency to roughly $21.7\\\\%$--$27.1\\\\%$, even as aggregated task-level outcomes often remain comparatively stable.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04118v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04118\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation\",\n      \"author\": [\n        \"Cong Liu\",\n        \"Milong Ren\",\n        \"Jiaqi Guan\",\n        \"Chengyue Gong\",\n        \"Jinyuan Sun\",\n        \"Xinshi Chen\",\n        \"Wenzhi Xiao\"\n      ],\n      \"summary\": \"arXiv:2605.04118v1 Announce Type: cross Abstract: Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. 
Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04503v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04503\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning\",\n      \"author\": [\n        \"Yuancheng Wei\",\n        \"Haojie Zhang\",\n        \"Linli Yao\",\n        \"Lei Li\",\n        \"Jiali Chen\",\n        \"Tao Huang\",\n        \"Yiting Lu\",\n        \"Duojun Huang\",\n        \"Xin Li\",\n        \"Zhao Zhong\"\n      ],\n      \"summary\": \"arXiv:2605.04503v1 Announce Type: cross Abstract: Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. 
To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2601.22725v3\",\n      \"url\": \"https://arxiv.org/abs/2601.22725\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation\",\n      \"author\": [\n        \"Jin Li\",\n        \"Tao Chen\",\n        \"Shuai Jiang\",\n        \"Weijie Wang\",\n        \"Jingwen Luo\",\n        \"Chenhui Wu\"\n      ],\n      \"summary\": \"arXiv:2601.22725v3 Announce Type: replace-cross Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. 
Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \\\\times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\\\\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03792v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03792\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"TriBench-Ko: Evaluating LLM Risks in Judicial Workflows\",\n      \"author\": [\n        \"Haesung Lee\",\n        \"Gyubin Choi\",\n        \"Eun-Ju Lee\",\n        \"So-Min Lee\",\n        \"Youkang Ko\",\n        \"Dogyoon Lim\",\n        \"Sung-Kyoung Jang\",\n        \"Yohan Jo\"\n      ],\n      \"summary\": \"arXiv:2605.03792v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into legal workflows. 
However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. 
Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:21.529Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.03936v1\",\n      \"url\": \"https://arxiv.org/abs/2605.03936\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models\",\n      \"author\": [\n        \"Daniel Drucker\",\n        \"Kyle Mahowald\"\n      ],\n      \"summary\": \"arXiv:2605.03936v1 Announce Type: new Abstract: Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. 
These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:21.529Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2603.16659v2\",\n      \"url\": \"https://arxiv.org/abs/2603.16659\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"LLMs learn scientific taste from institutional traces across the social sciences\",\n      \"author\": [\n        \"Ziqin Gong\",\n        \"Ning Li\",\n        \"Huaikang Zhou\"\n      ],\n      \"summary\": \"arXiv:2603.16659v2 Announce Type: replace Abstract: Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. 
In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say \\\"I'm sure\\\" versus \\\"I'm guessing.\\\" Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:24.058Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.02974v1\",\n      \"url\": \"https://arxiv.org/abs/2605.02974\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"PHBench: A Benchmark for Predicting Startup Series A Funding from Product Hunt Launch Signals\",\n      \"author\": [\n        \"Yagiz Ihlamur\",\n        \"Ben Griffin\",\n        \"Rick Chen\"\n      ],\n      \"summary\": \"arXiv:2605.02974v1 Announce Type: cross Abstract: Structured launch signals on Product Hunt contain statistically significant predictive information for Series A funding outcomes. We construct PHBench from 67,292 featured Product Hunt posts spanning 2019-2025, linked to Crunchbase funding records via deterministic domain matching, identifying 528 verified Series A raises within 18 months of launch (positive rate: 0.78%). 
Our best-performing model, a three-component ensemble (ENS_avg, ENS_ISO, XGB) selected by validation F0.5, achieves F0.5 = 0.097 and AP = 0.037 (95% CI: 0.024-0.072; 4.7x lift over random) on the private held-out test set (103 positives). A paired bootstrap confirms a statistically credible advantage over the logistic regression baseline (AP delta: +0.013, 95% CI: [0.004, 0.039], p < 0.001; F0.5 delta: +0.056, 95% CI: [0.006, 0.122], p = 0.016). Validation-set metrics (F0.5 = 0.284, AP = 0.126) reflect best-of-144 selection bias on 53 positives and are reported for benchmark reproducibility only. We further evaluate three zero-shot Gemini models (Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 3.1 Pro) in an anonymized numerical setting. The best LLM achieves AP = 0.034 (Gemini 3 Flash), below the LR baseline AP of 0.044. Notably, the most capable Gemini variant (Gemini 3.1 Pro, AP = 0.023) performs worst -- an unexpected pattern that warrants further investigation across providers and prompting strategies. Both ML and LLM models show the same temporal performance decay tracking the 2020-2021 funding boom and subsequent contraction, confirming the dataset captures genuine market structure rather than noise. PHBench provides a reproducible framework comprising public training, validation, and blind test splits; 61 engineered features; a five-metric evaluation harness; and a public leaderboard at https://phbench.com. 
All code, baseline models, and anonymized dataset splits are publicly available.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-06T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-06T16:00:29.193Z\"\n    },\n    {\n      \"id\": \"ad87ab2f75834cf15e480f1ab02abe44dee472a1f5fcb50674048dce52e5976d\",\n      \"url\": \"https://github.com/darkrishabh/agent-skills-eval\",\n      \"source_url\": \"https://news.ycombinator.com/rss\",\n      \"title\": \"Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs\",\n      \"author\": null,\n      \"summary\": \"Comments\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T06:12:25.000Z\",\n      \"fetched_at\": \"2026-05-07T10:00:03.704Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2605.04539v1\",\n      \"url\": \"https://arxiv.org/abs/2605.04539\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.cl\",\n      \"title\": \"RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization\",\n      \"author\": [\n        \"Qiming Bao\",\n        \"Juho Leinonen\",\n        \"Paul Denny\",\n        \"Michael J. Witbrock\"\n      ],\n      \"summary\": \"arXiv:2605.04539v1 Announce Type: new Abstract: Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the \\\"alignment tax\\\" of single-signal optimization. 
Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:22.293Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2602.22480v2\",\n      \"url\": \"https://arxiv.org/abs/2602.22480\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.ai\",\n      \"title\": \"VeRO: An Evaluation Harness for Agents to Optimize Agents\",\n      \"author\": [\n        \"Varun Ursekar (Emily)\",\n        \"Apaar Shanker (Emily)\",\n        \"Veronica Chatrath (Emily)\",\n        \"Yuan (Emily)\",\n        \"Xue\",\n        \"Sam Denton\"\n      ],\n      \"summary\": \"arXiv:2602.22480v2 Announce Type: replace Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. 
Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-05T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-05T16:07:19.561Z\"\n    },\n    {\n      \"id\": \"f44dc79f7ed763f48b0544dd7231ac19fe6346d8e7081314dc755ea0cd5ce723\",\n      \"url\": \"https://www.reddit.com/r/LocalLLaMA/comments/1t5yajb/qwen36_27b_uncensored_heretic_v2_native_mtp/\",\n      \"source_url\": \"https://www.reddit.com/r/localllama.rss\",\n      \"title\": \"Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats.\",\n      \"author\": [\n        \"/u/LLMFan46\"\n      ],\n      \"summary\": \"llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: 
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 All are confirmed to have their full 15 MTPs retained and preserved. Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 submitted by /u/LLMFan46 [link] [comments]\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T02:59:31.000Z\",\n      \"fetched_at\": \"2026-05-07T16:36:35.893Z\"\n    },\n    {\n      \"id\": \"oai:arXiv.org:2511.06452v3\",\n      \"url\": \"https://arxiv.org/abs/2511.06452\",\n      \"source_url\": \"https://rss.arxiv.org/rss/cs.lg\",\n      \"title\": \"MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains\",\n      \"author\": [\n        \"Leyan Xue\",\n        \"Changqing Zhang\",\n        \"Kecheng Xue\",\n        \"Xiaohong Liu\",\n        \"Guangyu Wang\",\n        \"Zongbo Han\"\n      ],\n      \"summary\": \"arXiv:2511.06452v3 Announce Type: replace Abstract: Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. 
On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T04:00:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:00:30.005Z\"\n    },\n    {\n      \"id\": \"d784734c9cb6667fe53b569b13f402f11bd62e5d676abe30f47114e90c834d23\",\n      \"url\": \"https://doi.org/10.7910/DVN/RYY9V9\",\n      \"source_url\": \"https://dataverse.harvard.edu/\",\n      \"title\": \"TraceSuite Review Artifact\",\n      \"author\": [\n        \"NeurIPS, AuthorAno\"\n      ],\n      \"summary\": \"Review artifact for a benchmark and dataset submission. 
Contains trajectory metadata, a representative sample, and sharded image archives.\",\n      \"image_url\": null,\n      \"cover_img\": null,\n      \"pub_date\": \"2026-05-07T06:01:00.000Z\",\n      \"fetched_at\": \"2026-05-07T16:35:52.266Z\"\n    }\n  ],\n  \"total\": 2985\n}"
    }
  ],
  "details": {
    "items": [
      {
        "id": "oai:arXiv.org:2605.04454v1",
        "url": "https://arxiv.org/abs/2605.04454",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone",
        "author": [
          "Varad Vishwarupe",
          "Nigel Shadbolt",
          "Marina Jirotka",
          "Ivan Flechais"
        ],
        "summary": "arXiv:2605.04454v1 Announce Type: cross Abstract: Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2602.10144v2",
        "url": "https://arxiv.org/abs/2602.10144",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "When LLMs get significantly worse: A statistical approach to detect model degradations",
        "author": [
          "Jonas K\\\"ubler",
          "Kailash Budhathoki",
          "Matth\\\"aus Kleindessner",
          "Xiong Zhou",
          "Junming Yin",
          "Ashish Khetan",
          "George Karypis"
        ],
        "summary": "arXiv:2602.10144v2 Announce Type: replace-cross Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization. In all of these cases it is crucial to ensure that the model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust even to theoretically lossless model optimizations due to numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is an evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test allowing to efficiently detect model degradations, while guaranteeing a controlled rate of false positives. The crucial insight is that we have to confront the model scores on each sample, rather than aggregated on the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the largely adopted open source LM Evaluation Harness and provide a case study illustrating that the method correctly flags degraded models, while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04523v1",
        "url": "https://arxiv.org/abs/2605.04523",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation",
        "author": [
          "Ivan Bondarenko",
          "Roman Derunets",
          "Oleg Sedukhin",
          "Mikhail Komarov",
          "Ivan Chernov",
          "Mikhail Kulakov"
        ],
        "summary": "arXiv:2605.04523v1 Announce Type: new Abstract: We present our winning system for Task~B (generation with reference passages) in SemEval-2026 Task~8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model. We also introduce Meno-Lite-0.1, a 7B domain-adapted model with a strong cost--performance trade-off, and analyse MTRAGEval, highlighting annotation limitations and directions for improvement. Our code is publicly available: https://github.com/RaguTeam/ragu_mtrag_semeval",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "a0f60af6af70259d9ac714a9b879b056074869ef062450ebda96f1f65f23df29",
        "url": "https://github.com/bogini",
        "source_url": "https://github.com/trending/developers/typescript?since=daily",
        "title": "bogini",
        "author": [
          "bogini"
        ],
        "summary": "Popular repo: function-health-exporter - Turn your Function Health labs into LLM-ready reports",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T16:48:45.730Z",
        "fetched_at": "2026-05-07T16:48:46.617Z"
      },
      {
        "id": "oai:arXiv.org:2605.03858v1",
        "url": "https://arxiv.org/abs/2605.03858",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following",
        "author": [
          "Jaeyun Lee",
          "Junyoung Koh",
          "Zeynel Tok",
          "Hunar Batra",
          "Ronald Clark"
        ],
        "summary": "arXiv:2605.03858v1 Announce Type: new Abstract: Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:21.529Z"
      },
      {
        "id": "oai:arXiv.org:2605.05175v1",
        "url": "https://arxiv.org/abs/2605.05175",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge",
        "author": [
          "Perry E. Radau"
        ],
        "summary": "arXiv:2605.05175v1 Announce Type: cross Abstract: Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice. Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses. Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims. Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%. Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "55c5e35b9bbd92d15c88e3a7d0d8a7800ee1a3c0f0ad444c77ed2d5b62b143aa",
        "url": "https://github.com/VectifyAI/PageIndex",
        "source_url": "https://github.com/trending?since=daily",
        "title": "VectifyAI/PageIndex",
        "author": [
          "VectifyAI"
        ],
        "summary": "📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T16:45:06.412Z",
        "fetched_at": "2026-05-07T16:45:07.284Z"
      },
      {
        "id": "oai:arXiv.org:2602.00933v2",
        "url": "https://arxiv.org/abs/2602.00933",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers",
        "author": [
          "Chaithanya Bandi",
          "Ben Hertzberg",
          "Geobio Boo",
          "Tejas Polakam",
          "Jeff Da",
          "Sami Hassaan",
          "Manasi Sharma",
          "Andrew Park",
          "Ernesto Hernandez",
          "Dan Rambado",
          "Ivan Salazar",
          "Rafael Cruz",
          "Chetan Rane",
          "Ben Levin",
          "Brad Kenstler",
          "Bing Liu"
        ],
        "summary": "arXiv:2602.00933v2 Announce Type: replace-cross Abstract: The Model Context Protocol (MCP) is rapidly becoming the standard interface for Large Language Models (LLMs) to discover and invoke external tools. However, existing evaluations often fail to capture the complexity of real-world scenarios, relying on restricted toolsets, simplistic workflows, or subjective LLM-as-a-judge metrics. We introduce MCP-Atlas, a large-scale benchmark for evaluating tool-use competency, comprising 36 real MCP servers and 220 tools. It includes 1,000 tasks designed to assess tool-use competency in realistic, multi-step workflows. Tasks use natural language prompts that avoid naming specific tools or servers, requiring agents to identify and orchestrate 3-6 tool calls across multiple servers. We score tasks using a claims-based rubric that awards partial credit based on the factual claims satisfied in the model's final answer, complemented by internal diagnostics on tool discovery, parameterization, syntax, error recovery, and efficiency. Evaluation results on frontier models reveal that top models achieve pass rates exceeding 50%, with primary failures arising from inadequate tool usage and task understanding. We release the task schema, containerized harness, and a 500-task public subset of the benchmark dataset to facilitate reproducible comparisons and advance the development of robust, tool-augmented agents.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T04:00:00.000Z",
        "fetched_at": "2026-05-05T16:07:19.561Z"
      },
      {
        "id": "oai:arXiv.org:2605.04357v1",
        "url": "https://arxiv.org/abs/2605.04357",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs",
        "author": [
          "Yixuan Mei",
          "Zikun Li",
          "Zixuan Chen",
          "Shiqi Pan",
          "Mengdi Wu",
          "Xupeng Miao",
          "Zhihao Jia",
          "K. V. Rashmi"
        ],
        "summary": "arXiv:2605.04357v1 Announce Type: cross Abstract: The usage of large language models (LLMs) has grown increasingly fragmented, with no single model dominating. Meanwhile, cloud providers offer a wide range of mid-tier and older-generation GPUs that enjoy better availability and deliver comparable performance per dollar to top-tier hardware. To efficiently harness these heterogeneous resources for serving multiple LLMs concurrently, we introduce Coral, an adaptive heterogeneity-aware multi-LLM serving system. The key idea behind Coral is to jointly optimize resource allocation and the serving strategy of each model replica across all models. To keep pace with shifting throughput demand and resource availability, Coral applies a lossless two-stage decomposition that preserves joint optimality while cutting online solve time from hours to tens of seconds. Our evaluation across 6 models and 20 GPU configurations shows that Coral reduces serving cost by up to 2.79$\\times$ over the best baseline, and delivers up to 2.39$\\times$ higher goodput under scarce resource availability.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "1cad32c447f9131e633154d70b4987bb207a86751c3c8538ce833d36909d798c",
        "url": "https://ahk.cardor.dev",
        "source_url": "https://news.ycombinator.com/rss",
        "title": "Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)",
        "author": null,
        "summary": "Comments",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T10:45:59.000Z",
        "fetched_at": "2026-05-07T16:00:17.897Z"
      },
      {
        "id": "38bf6ecc5fc6495ace1d93af960422ad1fe6b249a479a6470528e9ca7c98cf4e",
        "url": "https://trendshift.io/repositories/14565",
        "source_url": "https://trendshift.io/github-trending-repositories?trending-range=1",
        "title": "free-llm-api-resources",
        "author": [
          "cheahjs"
        ],
        "summary": "A list of free LLM inference resources accessible via API.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T16:05:38.142Z",
        "fetched_at": "2026-05-06T16:05:42.052Z"
      },
      {
        "id": "oai:arXiv.org:2605.04298v1",
        "url": "https://arxiv.org/abs/2605.04298",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs",
        "author": [
          "Stefano Bann\\`o",
          "Kate Knill",
          "Mark Gales"
        ],
        "summary": "arXiv:2605.04298v1 Announce Type: new Abstract: Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system's true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "cb49d85958c64b4aa4c473e84eef58173340bb3ed59f3ddf6d38ffd6b25c4f2b",
        "url": "https://trendshift.io/repositories/22628",
        "source_url": "https://trendshift.io/github-trending-repositories?trending-range=1",
        "title": "decolua/9router",
        "author": null,
        "summary": "Unlimited FREE AI coding. Connect Claude Code, Codex, Cursor, Cline, Copilot, Antigravity to FREE Claude/GPT/Gemini via 40+ providers. Auto-fallback, RTK -40% tokens, never hit limits.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T16:14:28.906Z",
        "fetched_at": "2026-05-07T16:14:29.690Z"
      },
      {
        "id": "oai:arXiv.org:2605.05170v1",
        "url": "https://arxiv.org/abs/2605.05170",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "Design Conductor 2.0: An agent builds a TurboQuant inference accelerator in 80 hours",
        "author": [
          "The Verkor Team",
          "Ravi Krishna",
          "Suresh Krishna",
          "David Chin"
        ],
        "summary": "arXiv:2605.05170v1 Announce Type: cross Abstract: Driven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced \"Design Conductor\" (or just \"Conductor\"), a system capable of building a 5-stage Linux-capable RISC-V CPU in 12 hours. In this work, we introduce an updated multi-agent harness powered by frontier models released in April 2026, which is able to handle 80x larger tasks, at higher quality, fully autonomously. Following a brief introduction, we examine 4 designs that the system produced autonomously, including \"VerTQ\", an LLM inference accelerator which hard-wires support for TurboQuant in a 240-cycle pipeline, starting from the TurboQuant arXiv paper. VerTQ includes heavy compute processing, with 5129 FP16/32 units; the design was mapped to an FPGA at 125 MHz and consumes 5.7 mm^2 in TSMC 16FF (8 attention pipes). We review the key new characteristics that enabled these results. Finally, we analyze Design Conductor's token usage and other empirical characteristics, including its limitations.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.04135v1",
        "url": "https://arxiv.org/abs/2605.04135",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation",
        "author": [
          "David Gringras",
          "Misha Salahshoor"
        ],
        "summary": "arXiv:2605.04135v1 Announce Type: cross Abstract: Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about \"AI\" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of \"AI\" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "064c2984cfb59b2fa13683fc6a0d7ae1e6872fb7ccbd288d5ddcebb87d622f70",
        "url": "https://doi.org/10.7910/DVN/FB7V2B",
        "source_url": "https://dataverse.harvard.edu/",
        "title": "Indic MMLU",
        "author": [
          "Anonymous, Anonymous"
        ],
        "summary": "The Indic MMLU dataset is a multilingual adaptation of the Massive Multitask Language Understanding (MMLU) benchmark developed to evaluate the reasoning, knowledge comprehension, and multilingual capabilities of Large Language Models (LLMs) across Indian languages. The dataset consists of professionally translated and quality-filtered multiple-choice question-answer pairs spanning diverse academic and professional domains, including science, mathematics, history, law, medicine, engineering, humanities, and social sciences. The primary purpose of this dataset is to provide a standardized benchmark for assessing model performance in low-resource and linguistically diverse Indic settings. The dataset enables research in multilingual NLP, cross-lingual transfer learning, language alignment, and culturally grounded AI evaluation. The dataset was generated through a structured pipeline involving machine-assisted translation of the original English MMLU benchmark into selected Indic languages, followed by extensive quality filtering using translation evaluation metrics such as BLEU, chrF++, and TER. Additional validation steps were applied to preserve semantic fidelity, answer consistency, and linguistic fluency. The final data is provided in standardized machine-readable formats suitable for benchmarking and downstream evaluation workflows. Indic MMLU is intended for researchers, academic institutions, and industry practitioners working on multilingual AI systems, Indic language technologies, and large-scale language model evaluation. By extending a widely recognized benchmark into Indian languages, the dataset contributes toward more inclusive, representative, and culturally relevant evaluation standards for modern AI systems",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T06:07:44.000Z",
        "fetched_at": "2026-05-07T16:35:52.266Z"
      },
      {
        "id": "oai:arXiv.org:2605.04083v1",
        "url": "https://arxiv.org/abs/2605.04083",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals",
        "author": [
          "Tadhg Looram",
          "Lucas Nuzzi",
          "Kyle Waters",
          "Steven Dillmann"
        ],
        "summary": "arXiv:2605.04083v1 Announce Type: new Abstract: Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from $75.9\\%$ to $89.6\\%$ (strict common-subset agreement: $77.8\\%$ to $92.1\\%$), while compact juries exhibit substantially higher internal dissent (3--2 split rate $28.7\\%$--$32.4\\%$) than frontier juries ($6.1\\%$--$11.5\\%$). Verifier traces further show that compact juries reduce per-criterion judging cost to roughly $4.2\\%$--$5.6\\%$ of frontier and latency to roughly $21.7\\%$--$27.1\\%$, even as aggregated task-level outcomes often remain comparatively stable.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "oai:arXiv.org:2605.04118v1",
        "url": "https://arxiv.org/abs/2605.04118",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation",
        "author": [
          "Cong Liu",
          "Milong Ren",
          "Jiaqi Guan",
          "Chengyue Gong",
          "Jinyuan Sun",
          "Xinshi Chen",
          "Wenzhi Xiao"
        ],
        "summary": "arXiv:2605.04118v1 Announce Type: cross Abstract: Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.04503v1",
        "url": "https://arxiv.org/abs/2605.04503",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning",
        "author": [
          "Yuancheng Wei",
          "Haojie Zhang",
          "Linli Yao",
          "Lei Li",
          "Jiali Chen",
          "Tao Huang",
          "Yiting Lu",
          "Duojun Huang",
          "Xin Li",
          "Zhao Zhong"
        ],
        "summary": "arXiv:2605.04503v1 Announce Type: cross Abstract: Image Difference Captioning (IDC) generates natural language descriptions that precisely identify differences between two images, serving as a key benchmark for fine-grained change perception, cross-modal reasoning, and image editing data construction. However, existing benchmarks lack diversity and compositional complexity, and standard lexical-overlap metrics (e.g., BLEU, METEOR) fail to capture semantic consistency or penalize hallucinations, which together prevent a comprehensive and robust evaluation of multimodal large language models (MLLMs) on IDC. To address these gaps, we introduce DiffCap-Bench, a comprehensive IDC benchmark covering ten distinct difference categories to ensure diversity and compositional complexity. Furthermore, we propose an LLM-as-a-Judge evaluation protocol grounded in human-validated Difference Lists, enabling a robust assessment of models' ability to both capture and describe visual changes. Through extensive evaluation of state-of-the-art MLLMs, we reveal significant performance gaps between proprietary and open-source models, highlight the critical importance of reasoning capability, and identify clear limitations in model scaling. Our framework also demonstrates strong alignment with human expert judgments and strong correlation with downstream image editing data construction quality. These findings establish DiffCap-Bench as both a reliable IDC evaluation framework and a practical predictor of downstream utility. The benchmark and code will be made publicly available to support further research.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2601.22725v3",
        "url": "https://arxiv.org/abs/2601.22725",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation",
        "author": [
          "Jin Li",
          "Tao Chen",
          "Shuai Jiang",
          "Weijie Wang",
          "Jingwen Luo",
          "Chenhui Wu"
        ],
        "summary": "arXiv:2601.22725v3 Announce Type: replace-cross Abstract: Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \\times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.03792v1",
        "url": "https://arxiv.org/abs/2605.03792",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "TriBench-Ko: Evaluating LLM Risks in Judicial Workflows",
        "author": [
          "Haesung Lee",
          "Gyubin Choi",
          "Eun-Ju Lee",
          "So-Min Lee",
          "Youkang Ko",
          "Dogyoon Lim",
          "Sung-Kyoung Jang",
          "Yohan Jo"
        ],
        "summary": "arXiv:2605.03792v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:21.529Z"
      },
      {
        "id": "oai:arXiv.org:2605.03936v1",
        "url": "https://arxiv.org/abs/2605.03936",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models",
        "author": [
          "Daniel Drucker",
          "Kyle Mahowald"
        ],
        "summary": "arXiv:2605.03936v1 Announce Type: new Abstract: Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 concepts and thousands of counterexample-repair cycles, we find that, although many LM-generated counterexamples are judged invalid by both expert humans and an LM judge, the LM judge accepts roughly twice as many as humans do. Nonetheless, per-item validity judgments are moderately consistent across humans and between humans and the LM. We further find that extended iteration produces increasingly verbose definitions without improving accuracy. We also see that some concepts resist stable definitions in general. These findings suggest that while LMs can engage in philosophical reasoning, the counterexample-repair loop hits diminishing returns quickly and could be a fruitful test case for evaluating whether LMs can sustain high-level iterated philosophical reasoning.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:21.529Z"
      },
      {
        "id": "oai:arXiv.org:2603.16659v2",
        "url": "https://arxiv.org/abs/2603.16659",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "LLMs learn scientific taste from institutional traces across the social sciences",
        "author": [
          "Ziqin Gong",
          "Ning Li",
          "Huaikang Zhou"
        ],
        "summary": "arXiv:2603.16659v2 Announce Type: replace Abstract: Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say \"I'm sure\" versus \"I'm guessing.\" Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:36:24.058Z"
      },
      {
        "id": "oai:arXiv.org:2605.02974v1",
        "url": "https://arxiv.org/abs/2605.02974",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "PHBench: A Benchmark for Predicting Startup Series A Funding from Product Hunt Launch Signals",
        "author": [
          "Yagiz Ihlamur",
          "Ben Griffin",
          "Rick Chen"
        ],
        "summary": "arXiv:2605.02974v1 Announce Type: cross Abstract: Structured launch signals on Product Hunt contain statistically significant predictive information for Series A funding outcomes. We construct PHBench from 67,292 featured Product Hunt posts spanning 2019-2025, linked to Crunchbase funding records via deterministic domain matching, identifying 528 verified Series A raises within 18 months of launch (positive rate: 0.78%). Our best-performing model, a three-component ensemble (ENS_avg, ENS_ISO, XGB) selected by validation F0.5, achieves F0.5 = 0.097 and AP = 0.037 (95% CI: 0.024-0.072; 4.7x lift over random) on the private held-out test set (103 positives). A paired bootstrap confirms a statistically credible advantage over the logistic regression baseline (AP delta: +0.013, 95% CI: [0.004, 0.039], p < 0.001; F0.5 delta: +0.056, 95% CI: [0.006, 0.122], p = 0.016). Validation-set metrics (F0.5 = 0.284, AP = 0.126) reflect best-of-144 selection bias on 53 positives and are reported for benchmark reproducibility only. We further evaluate three zero-shot Gemini models (Gemini 2.5 Flash, Gemini 3 Flash, and Gemini 3.1 Pro) in an anonymized numerical setting. The best LLM achieves AP = 0.034 (Gemini 3 Flash), below the LR baseline AP of 0.044. Notably, the most capable Gemini variant (Gemini 3.1 Pro, AP = 0.023) performs worst -- an unexpected pattern that warrants further investigation across providers and prompting strategies. Both ML and LLM models show the same temporal performance decay tracking the 2020-2021 funding boom and subsequent contraction, confirming the dataset captures genuine market structure rather than noise. PHBench provides a reproducible framework comprising public training, validation, and blind test splits; 61 engineered features; a five-metric evaluation harness; and a public leaderboard at https://phbench.com. All code, baseline models, and anonymized dataset splits are publicly available.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-06T04:00:00.000Z",
        "fetched_at": "2026-05-06T16:00:29.193Z"
      },
      {
        "id": "ad87ab2f75834cf15e480f1ab02abe44dee472a1f5fcb50674048dce52e5976d",
        "url": "https://github.com/darkrishabh/agent-skills-eval",
        "source_url": "https://news.ycombinator.com/rss",
        "title": "Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs",
        "author": null,
        "summary": "Comments",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T06:12:25.000Z",
        "fetched_at": "2026-05-07T10:00:03.704Z"
      },
      {
        "id": "oai:arXiv.org:2605.04539v1",
        "url": "https://arxiv.org/abs/2605.04539",
        "source_url": "https://rss.arxiv.org/rss/cs.cl",
        "title": "RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization",
        "author": [
          "Qiming Bao",
          "Juho Leinonen",
          "Paul Denny",
          "Michael J. Witbrock"
        ],
        "summary": "arXiv:2605.04539v1 Announce Type: new Abstract: Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the \"alignment tax\" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:22.293Z"
      },
      {
        "id": "oai:arXiv.org:2602.22480v2",
        "url": "https://arxiv.org/abs/2602.22480",
        "source_url": "https://rss.arxiv.org/rss/cs.ai",
        "title": "VeRO: An Evaluation Harness for Agents to Optimize Agents",
        "author": [
          "Varun Ursekar (Emily)",
          "Apaar Shanker (Emily)",
          "Veronica Chatrath (Emily)",
          "Yuan (Emily)",
          "Xue",
          "Sam Denton"
        ],
        "summary": "arXiv:2602.22480v2 Announce Type: replace Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-05T04:00:00.000Z",
        "fetched_at": "2026-05-05T16:07:19.561Z"
      },
      {
        "id": "f44dc79f7ed763f48b0544dd7231ac19fe6346d8e7081314dc755ea0cd5ce723",
        "url": "https://www.reddit.com/r/LocalLLaMA/comments/1t5yajb/qwen36_27b_uncensored_heretic_v2_native_mtp/",
        "source_url": "https://www.reddit.com/r/localllama.rss",
        "title": "Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats.",
        "author": [
          "/u/LLMFan46"
        ],
        "summary": "llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-MLP-Only llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 All are confirmed to have their full 15 MTPs retained and preserved. Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 submitted by /u/LLMFan46 [link] [留言]",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T02:59:31.000Z",
        "fetched_at": "2026-05-07T16:36:35.893Z"
      },
      {
        "id": "oai:arXiv.org:2511.06452v3",
        "url": "https://arxiv.org/abs/2511.06452",
        "source_url": "https://rss.arxiv.org/rss/cs.lg",
        "title": "MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains",
        "author": [
          "Leyan Xue",
          "Changqing Zhang",
          "Kecheng Xue",
          "Xiaohong Liu",
          "Guangyu Wang",
          "Zongbo Han"
        ],
        "summary": "arXiv:2511.06452v3 Announce Type: replace Abstract: Although multimodal fusion has made significant progress, its advancement is severely hindered by the lack of adequate evaluation benchmarks. Current fusion methods are typically evaluated on a small selection of public datasets, a limited scope that inadequately represents the complexity and diversity of real-world scenarios, potentially leading to biased evaluations. This issue presents a twofold challenge. On one hand, models may overfit to the biases of specific datasets, hindering their generalization to broader practical applications. On the other hand, the absence of a unified evaluation standard makes fair and objective comparisons between different fusion methods difficult. Consequently, a truly universal and high-performance fusion model has yet to emerge. To address these challenges, we have developed a large-scale, domain-adaptive benchmark for multimodal evaluation. This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks across key application domains. To complement this, we have also developed an open-source, unified, and automated evaluation pipeline that includes standardized implementations of state-of-the-art models and diverse fusion paradigms. Leveraging this platform, we have conducted large-scale experiments, successfully establishing new performance baselines across multiple tasks. This work provides the academic community with a crucial platform for rigorous and reproducible assessment of multimodal models, aiming to propel the field of multimodal artificial intelligence to new heights.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T04:00:00.000Z",
        "fetched_at": "2026-05-07T16:00:30.005Z"
      },
      {
        "id": "d784734c9cb6667fe53b569b13f402f11bd62e5d676abe30f47114e90c834d23",
        "url": "https://doi.org/10.7910/DVN/RYY9V9",
        "source_url": "https://dataverse.harvard.edu/",
        "title": "TraceSuite Review Artifact",
        "author": [
          "NeurIPS, AuthorAno"
        ],
        "summary": "Review artifact for a benchmark and dataset submission. Contains trajectory metadata, a representative sample, and sharded image archives.",
        "image_url": null,
        "cover_img": null,
        "pub_date": "2026-05-07T06:01:00.000Z",
        "fetched_at": "2026-05-07T16:35:52.266Z"
      }
    ],
    "total": 2985
  }
}

▼✓Web search query: "lm-evaluation-harness EleutherAI 2024 2025 overview architecture", results: 15

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\\\"hf\\\", pretrained=\\\"gpt2\\\") = \\\"hf\\\" = \\\"gpt2\\\"results = evaluator.evaluate(model, tasks=[\\\"hellaswag\\\"], num_fewshot=5) = = \\\"hellaswag\\\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\\\"gpt2\\\", = \\\"gpt2\\\" tasks=[\\\"hellaswag\\\", \\\"mmlu\\\"], = \\\"hellaswag\\\" \\\"mmlu\\\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\\\lm-eval --model \\\\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\\\ --model_args \\\\ --tasks mmlu,hellaswag \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size auto \\\\ --batch_size \\\\ --output_path results/mistral-7b.json \\\\ --output_path \\\\ --apply_chat_template --apply_chat_template. 
# Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\\\lm-eval --model \\\\ --model_args model=gpt-4-turbo \\\\ --model_args \\\\ --tasks mmlu,bbh \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size 10 --batch_size.\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.9 - ADS\",\n      \"url\": \"https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract\",\n      \"description\": \"by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825 Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889 Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870 Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876 Add device arg to model\\\\_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879 paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759 Delete scripts/cost\\\\_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985 Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980 [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997 fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998 Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002 [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000 Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993 
use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958 Revert \\\"feat: add question suffix (#2876)\\\" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007 [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013 [Fix] Update resolve\\\\_hf\\\\_chat\\\\_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992 Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987 [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011 add arab\\\\_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006 chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030 Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999 [Fix] acc\\\\_mutual\\\\_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035 Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025 add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995 remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041 [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983 Fallback to super implementation in fewshot\\\\_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023 Fix Typo in README and Comment in utils\\\\_mcq.py by @vtjl10 in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3057 fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061 mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070 Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060 bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073 New\"\n    },\n    {\n      \"title\": \"EleutherAI LM Evaluation Harness | VerifyWise AI Governance Library\",\n      \"url\": \"https://verifywise.ai/ai-governance-library/assessment-and-evaluation/model-evaluation-harness\",\n      \"description\": \"# EleutherAI LM Evaluation Harness. The EleutherAI LM Evaluation Harness is the Swiss Army knife of language model evaluation, offering a standardized way to benchmark LLMs across hundreds of tasks with just a few lines of code. Rather than cobbling together different evaluation scripts and dealing with inconsistent metrics, this open-source framework lets you run comprehensive assessments covering everything from basic language understanding to complex reasoning, safety, and alignment properties. It's become the de facto standard for reproducible LLM evaluation in the research community. The harness also handles the technical complexity of model evaluation automatically. The framework lets you run individual tasks, task groups, or comprehensive suites. For your first evaluation, try running a subset of tasks on a smaller model to understand the output format and timing. With hundreds of available tasks, it's easy to run evaluations that aren't relevant to your specific use case or to miss important benchmark categories.\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. 
· GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"LM Evaluation Harness\",\n      \"url\": \"https://slyracoon23.github.io/lm-evaluation-harness/\",\n      \"description\": \"Overview¶. 
The LM Evaluation Harness is designed to facilitate the integration of various API-based language models into a standardized evaluation framework.\"\n    },\n    {\n      \"title\": \"Evaluating LLMs — EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/projects/large-language-model-evaluation\",\n      \"description\": \"# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs  Next.\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. 
In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity….\"\n    },\n    {\n      \"title\": \"Integrating benchmarks into LM Evaluation Harness - Hugging Face\",\n      \"url\": \"https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness\",\n      \"description\": \"The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range of NLP benchmarks. Several steps are required to integrate a custom benchmark from a local dataset into lm evaluation harness, this blog will tackle integrating your custom dataset in multiple choice question fashion following the standard MMLU approach and the MMLU-Var approach. The standard MMLU benchmark uses a multiple choice format, The model receives a question along with a list of possible answers (A, B, C, D) and is prompted to select the correct one. For example, in a question about the capital of the UAE, the model sees both the question and all four answer choices in the input. 
So during the evaluation the choices are used by the evaluator to compute the probability of each choice as a continuation to the given prompt following this simplified formula for each output token:.\"\n    },\n    {\n      \"title\": \"Releases · EleutherAI/lm-evaluation-harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\n      \"description\": \"A framework for few-shot evaluation of language models. - Releases · EleutherAI/lm-evaluation-harness.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness - Jimmy Song\",\n      \"url\": \"https://jimmysong.io/ai/lm-evaluation-harness/\",\n      \"description\": \"LLM, AI Native Infra and Agentic AI. Recaps and insights from tech conferences like KubeCon. Thoughts and analysis on economics, culture, politics, and society. #### AI Infrastructure. AI infrastructure framework through Yin-Yang and Five Elements model from Eastern philosophy. AI-native infrastructure handbook for CTOs and platform leaders. Comprehensive directory of AI open-source software projects. Discover experiences and journeys beyond technology. Get in touch with Jimmy for inquiries. #### LinkedIn. Scan to connect on LinkedIn for professional networking. Scan to follow my X account for updates on cloud native and AI. The Language Model Evaluation Harness is a framework for large-scale, reproducible evaluation of generative language models across many tasks and datasets. The lm-evaluation-harness provides a unified interface and a large collection of tasks (e.g. Hellaswag, LAMBADA) for evaluating generative LMs. It supports local models, Hugging Face models, and commercial APIs. ReLE (chinese-llm-benchmark) is a continuously updated Chinese LLM evaluation …. 
DeepEval is an open-source LLM evaluation framework that provides modular ….\"\n    },\n    {\n      \"title\": \"Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment | by Frank Morales Aguilera | Artificial Intelligence in Plain English\",\n      \"url\": \"https://ai.plainenglish.io/language-model-evaluation-harness-a-comprehensive-tool-for-language-model-assessment-3666b55c9c25\",\n      \"description\": \"# Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment. The **Language Model Evaluation Harness** is a powerful tool designed to evaluate language models in a standardized and comprehensive manner[1]. The Language Model Evaluation Harness offers a wide range of features that make it a versatile tool for language model evaluation[1,2]:. 1. **Unified Framework**: It provides a unified interface for testing generative language models on many evaluation tasks[1,2]. 2. **Broad Benchmark Support**: The harness supports over 60 standard academic benchmarks for language models, with hundreds of subtasks and variants implemented[1,2]. The Language Model Evaluation Harness is the backend for Hugging Face’s popular Open LLM Leaderboard [1,2]. This widespread usage is a testament to the tool’s credibility and significant role in language model research and development. ### Standard evaluation metrics for language models. In conclusion, the Language Model Evaluation Harness is a crucial tool that addresses the challenge of comprehensively auditing the performance of language models [3].\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/12608602\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. 
The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/lm_eval/tasks/aime/README.md at main\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/aime/README.md\",\n      \"description\": \"A framework for few-shot evaluation of language models. - lm-evaluation-harness/lm_eval/tasks/aime/README.md at main · EleutherAI/lm-evaluation-harness.\"\n    },\n    {\n      \"title\": \"LM Eval Harness - EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/artifacts/lm-eval-harness\",\n      \"description\": \"LM Eval Harness. Library. 13 May. Written By Stella Biderman. Our library for reproducible and transparent evaluation of LLMs. NLPEvaluation.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\"hf\", pretrained=\"gpt2\") = \"hf\" = \"gpt2\"results = evaluator.evaluate(model, tasks=[\"hellaswag\"], num_fewshot=5) = = \"hellaswag\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\"gpt2\", = \"gpt2\" tasks=[\"hellaswag\", \"mmlu\"], = \"hellaswag\" \"mmlu\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\lm-eval --model \\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\ --model_args \\ --tasks mmlu,hellaswag \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size auto \\ --batch_size \\ --output_path results/mistral-7b.json \\ --output_path \\ --apply_chat_template --apply_chat_template. # Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\lm-eval --model \\ --model_args model=gpt-4-turbo \\ --model_args \\ --tasks mmlu,bbh \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size 10 --batch_size."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.9 - ADS",
        "url": "https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract",
        "description": "by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825 Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889 Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870 Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876 Add device arg to model\\_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879 paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759 Delete scripts/cost\\_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985 Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980 [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997 fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998 Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002 [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000 Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993 use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958 Revert \"feat: add question suffix (#2876)\" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007 [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013 [Fix] Update 
resolve\\_hf\\_chat\\_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992 Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987 [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011 add arab\\_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006 chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030 Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999 [Fix] acc\\_mutual\\_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035 Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025 add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995 remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041 [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983 Fallback to super implementation in fewshot\\_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023 Fix Typo in README and Comment in utils\\_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057 fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061 mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070 Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060 bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073 New"
      },
      {
        "title": "EleutherAI LM Evaluation Harness | VerifyWise AI Governance Library",
        "url": "https://verifywise.ai/ai-governance-library/assessment-and-evaluation/model-evaluation-harness",
        "description": "# EleutherAI LM Evaluation Harness. The EleutherAI LM Evaluation Harness is the Swiss Army knife of language model evaluation, offering a standardized way to benchmark LLMs across hundreds of tasks with just a few lines of code. Rather than cobbling together different evaluation scripts and dealing with inconsistent metrics, this open-source framework lets you run comprehensive assessments covering everything from basic language understanding to complex reasoning, safety, and alignment properties. It's become the de facto standard for reproducible LLM evaluation in the research community. The harness also handles the technical complexity of model evaluation automatically. The framework lets you run individual tasks, task groups, or comprehensive suites. For your first evaluation, try running a subset of tasks on a smaller model to understand the output format and timing. With hundreds of available tasks, it's easy to run evaluations that aren't relevant to your specific use case or to miss important benchmark categories."
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "LM Evaluation Harness",
        "url": "https://slyracoon23.github.io/lm-evaluation-harness/",
        "description": "Overview¶. The LM Evaluation Harness is designed to facilitate the integration of various API-based language models into a standardized evaluation framework."
      },
      {
        "title": "Evaluating LLMs — EleutherAI",
        "url": "https://www.eleuther.ai/projects/large-language-model-evaluation",
        "description": "# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs  Next."
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity…."
      },
      {
        "title": "Integrating benchmarks into LM Evaluation Harness - Hugging Face",
        "url": "https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness",
        "description": "The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range of NLP benchmarks. Several steps are required to integrate a custom benchmark from a local dataset into lm evaluation harness, this blog will tackle integrating your custom dataset in multiple choice question fashion following the standard MMLU approach and the MMLU-Var approach. The standard MMLU benchmark uses a multiple choice format, The model receives a question along with a list of possible answers (A, B, C, D) and is prompted to select the correct one. For example, in a question about the capital of the UAE, the model sees both the question and all four answer choices in the input. So during the evaluation the choices are used by the evaluator to compute the probability of each choice as a continuation to the given prompt following this simplified formula for each output token:."
      },
      {
        "title": "Releases · EleutherAI/lm-evaluation-harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
        "description": "A framework for few-shot evaluation of language models. - Releases · EleutherAI/lm-evaluation-harness."
      },
      {
        "title": "lm-evaluation-harness - Jimmy Song",
        "url": "https://jimmysong.io/ai/lm-evaluation-harness/",
        "description": "LLM, AI Native Infra and Agentic AI. Recaps and insights from tech conferences like KubeCon. Thoughts and analysis on economics, culture, politics, and society. #### AI Infrastructure. AI infrastructure framework through Yin-Yang and Five Elements model from Eastern philosophy. AI-native infrastructure handbook for CTOs and platform leaders. Comprehensive directory of AI open-source software projects. Discover experiences and journeys beyond technology. Get in touch with Jimmy for inquiries. #### LinkedIn. Scan to connect on LinkedIn for professional networking. Scan to follow my X account for updates on cloud native and AI. The Language Model Evaluation Harness is a framework for large-scale, reproducible evaluation of generative language models across many tasks and datasets. The lm-evaluation-harness provides a unified interface and a large collection of tasks (e.g. Hellaswag, LAMBADA) for evaluating generative LMs. It supports local models, Hugging Face models, and commercial APIs. ReLE (chinese-llm-benchmark) is a continuously updated Chinese LLM evaluation …. DeepEval is an open-source LLM evaluation framework that provides modular …."
      },
      {
        "title": "Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment | by Frank Morales Aguilera | Artificial Intelligence in Plain English",
        "url": "https://ai.plainenglish.io/language-model-evaluation-harness-a-comprehensive-tool-for-language-model-assessment-3666b55c9c25",
        "description": "# Language Model Evaluation Harness: A Comprehensive Tool for Language Model Assessment. The **Language Model Evaluation Harness** is a powerful tool designed to evaluate language models in a standardized and comprehensive manner[1]. The Language Model Evaluation Harness offers a wide range of features that make it a versatile tool for language model evaluation[1,2]:. 1. **Unified Framework**: It provides a unified interface for testing generative language models on many evaluation tasks[1,2]. 2. **Broad Benchmark Support**: The harness supports over 60 standard academic benchmarks for language models, with hundreds of subtasks and variants implemented[1,2]. The Language Model Evaluation Harness is the backend for Hugging Face’s popular Open LLM Leaderboard [1,2]. This widespread usage is a testament to the tool’s credibility and significant role in language model research and development. ### Standard evaluation metrics for language models. In conclusion, the Language Model Evaluation Harness is a crucial tool that addresses the challenge of comprehensively auditing the performance of language models [3]."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo",
        "url": "https://zenodo.org/records/12608602",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder."
      },
      {
        "title": "lm-evaluation-harness/lm_eval/tasks/aime/README.md at main",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/aime/README.md",
        "description": "A framework for few-shot evaluation of language models. - lm-evaluation-harness/lm_eval/tasks/aime/README.md at main · EleutherAI/lm-evaluation-harness."
      },
      {
        "title": "LM Eval Harness - EleutherAI",
        "url": "https://www.eleuther.ai/artifacts/lm-eval-harness",
        "description": "LM Eval Harness. Library. 13 May. Written By Stella Biderman. Our library for reproducible and transparent evaluation of LLMs. NLPEvaluation."
      }
    ]
  }
}

▼✓Web search query: "lm-evaluation-harness GitHub EleutherAI latest release features", result count: 10

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Language Model Evaluation Harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You can find\"\n    },\n    {\n      \"title\": \"Language Model Evaluation Harness - GitHub\",\n      \"url\": \"https://github.com/HazyResearch/based-evaluation-harness\",\n      \"description\": \"4.0 release of lm-evaluation-harness is available ! New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier\"\n    },\n    {\n      \"title\": \"Releases · EleutherAI/lm-evaluation-harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\n      \"description\": \"We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm .\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/12608602\",\n      \"description\": \"We're releasing a new version of LM Eval Harness for PyPI users at long last. 
We intend to release new PyPI versions more frequently in the future.\"\n    },\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI's lm-evaluation-harness has emerged as one of the most robust and comprehensive frameworks for evaluating language models.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/task_guide.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md\",\n      \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the\"\n    },\n    {\n      \"title\": \"Has anyone gotten EleutherAI/lm-evaluation-harness to work on ...\",\n      \"url\": \"https://www.reddit.com/r/LocalLLaMA/comments/136akxj/has_anyone_gotten_eleutherailmevaluationharness/\",\n      \"description\": \"As the title says. 
GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models.\"\n    },\n    {\n      \"title\": \"Integrating benchmarks into LM Evaluation Harness - Hugging Face\",\n      \"url\": \"https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness\",\n      \"description\": \"The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md\",\n      \"description\": \"A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"This tool has been designed to integrate seamlessly with common datasets, offering customizable settings to suit a variety of tasks. One of the\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Language Model Evaluation Harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You can find"
      },
      {
        "title": "Language Model Evaluation Harness - GitHub",
        "url": "https://github.com/HazyResearch/based-evaluation-harness",
        "description": "4.0 release of lm-evaluation-harness is available ! New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier"
      },
      {
        "title": "Releases · EleutherAI/lm-evaluation-harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
        "description": "We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm ."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo",
        "url": "https://zenodo.org/records/12608602",
        "description": "We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future."
      },
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI's lm-evaluation-harness has emerged as one of the most robust and comprehensive frameworks for evaluating language models."
      },
      {
        "title": "lm-evaluation-harness/docs/task_guide.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md",
        "description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the"
      },
      {
        "title": "Has anyone gotten EleutherAI/lm-evaluation-harness to work on ...",
        "url": "https://www.reddit.com/r/LocalLLaMA/comments/136akxj/has_anyone_gotten_eleutherailmevaluationharness/",
        "description": "As the title says. GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models."
      },
      {
        "title": "Integrating benchmarks into LM Evaluation Harness - Hugging Face",
        "url": "https://huggingface.co/blog/Neo111x/integrating-benchmarks-into-lm-evaluation-harness",
        "description": "The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range"
      },
      {
        "title": "lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md",
        "description": "A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language"
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "This tool has been designed to integrate seamlessly with common datasets, offering customizable settings to suit a variety of tasks. One of the"
      }
    ]
  }
}

▼✓Fetch page text, URL: "https://github.com/EleutherAI/lm-evaluation-harness"

{
  "content": [
    {
      "type": "text",
      "text": "{\"url\":\"https://github.com/EleutherAI/lm-evaluation-harness\",\"error\":\"Navigation timeout of 60000 ms exceeded\"}"
    }
  ],
  "details": {
    "url": "https://github.com/EleutherAI/lm-evaluation-harness",
    "error": "Navigation timeout of 60000 ms exceeded"
  },
  "isError": true
}

▼✓Fetch page text, URL: "https://github.com/EleutherAI/lm-evaluation-harness/releases"

{
  "content": [
    {
      "type": "text",
      "text": "{\"url\":\"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\"error\":\"Navigation timeout of 60000 ms exceeded\"}"
    }
  ],
  "details": {
    "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
    "error": "Navigation timeout of 60000 ms exceeded"
  },
  "isError": true
}

▼✓Fetch page text, URL: "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-met..."

{
  "content": [
    {
      "type": "text",
      "text": "{\"url\":\"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\"error\":\"Navigation timeout of 60000 ms exceeded\"}"
    }
  ],
  "details": {
    "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
    "error": "Navigation timeout of 60000 ms exceeded"
  },
  "isError": true
}

▼thought

GitHub pages are timing out. Let me try fetching the raw README from GitHub's raw content, and also try other sources.
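
A minimal sketch of that fallback, assuming the default branch is `main` and the file of interest is `README.md`. The helper name `to_raw_url` is hypothetical and only illustrates how a `github.com` repository URL maps onto the `raw.githubusercontent.com` layout used in the next fetch.

```python
from urllib.parse import urlparse


def to_raw_url(repo_url: str, path: str = "README.md", branch: str = "main") -> str:
    """Map a github.com repository URL to a raw.githubusercontent.com file URL.

    Hypothetical helper: the branch and path defaults are assumptions.
    """
    parts = urlparse(repo_url)
    if parts.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {repo_url}")
    # The first two path segments are the owner and the repository name;
    # anything after them (e.g. /releases) is ignored.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"URL has no owner/repo: {repo_url}")
    owner, repo = segments[:2]
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


raw_url = to_raw_url("https://github.com/EleutherAI/lm-evaluation-harness")
print(raw_url)
```

The raw host serves the file as plain text without the rendered page around it, which is one plausible reason it responds when the HTML page times out.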

▼✓Fetch page content, URL: "https://raw.githubusercontent.com/EleutherAI/lm-evaluation-harness/main/READM..."

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://raw.githubusercontent.com/EleutherAI/lm-evaluation-harness/main/README.md\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><pre># Language Model Evaluation Harness\\n\\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256836.svg)](https://doi.org/10.5281/zenodo.10256836)\\n\\n---\\n\\n## Latest News 📣\\n- [2025/12] **CLI refactored** with subcommands (`run`, `ls`, `validate`) and YAML config file support via `--config`. See the [CLI Reference](./docs/interface.md) and [Configuration Guide](./docs/config_files.md).\\n- [2025/12] **Lighter install**: Base package no longer includes `transformers`/`torch`. Install model backends separately: `pip install lm_eval[hf]`, `lm_eval[vllm]`, etc.\\n- [2025/07] Added `think_end_token` arg to `hf` (token/str), `vllm` and `sglang` (str) for stripping CoT reasoning traces from models that support it.\\n- [2025/03] Added support for steering HF models!\\n- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!\\n- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.\\n- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. 
**To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**\\n- [2024/07] New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.\\n\\n---\\n\\n## Announcement\\n\\n**A new v0.4.0 release of lm-evaluation-harness is available** !\\n\\nNew updates and features include:\\n\\n- **New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.**\\n- Internal refactoring\\n- Config-based task creation and configuration\\n- Easier import and sharing of externally-defined task config YAMLs\\n- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource\\n- More advanced configuration options, including output post-processing, answer extraction, and multiple LM generations per document, configurable fewshot settings, and more\\n- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more\\n- Logging and usability changes\\n- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more\\n\\nPlease see our updated documentation pages in `docs/` for more details.\\n\\nDevelopment will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https://discord.gg/eleutherai)!\\n\\n---\\n\\n## Overview\\n\\nThis project provides a unified framework to test generative language models on a large number of different evaluation tasks.\\n\\n**Features:**\\n\\n- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.\\n- Support for models loaded via 
[transformers](https://github.com/huggingface/transformers/) (including quantization via [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.\\n- Support for fast and memory-efficient inference with [vLLM](https://github.com/vllm-project/vllm).\\n- Support for commercial APIs including [OpenAI](https://openai.com), and [TextSynth](https://textsynth.com/).\\n- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).\\n- Support for local models and benchmarks.\\n- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.\\n- Easy support for custom prompts and evaluation metrics.\\n\\nThe Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;authuser=2&amp;cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.\\n\\n## Install\\n\\nTo install the `lm-eval` package from the github repository, run:\\n\\n```bash\\ngit clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness\\ncd lm-evaluation-harness\\npip install -e .\\n```\\n\\n### Installing Model Backends\\n\\nThe base installation provides the core evaluation framework. 
**Model backends must be installed separately** using optional extras:\\n\\nFor HuggingFace transformers models:\\n\\n```bash\\npip install \\\"lm_eval[hf]\\\"\\n```\\n\\nFor vLLM inference:\\n\\n```bash\\npip install \\\"lm_eval[vllm]\\\"\\n```\\n\\nFor API-based models (OpenAI, Anthropic, etc.):\\n\\n```bash\\npip install \\\"lm_eval[api]\\\"\\n```\\n\\nMultiple backends can be installed together:\\n\\n```bash\\npip install \\\"lm_eval[hf,vllm,api]\\\"\\n```\\n\\nA detailed table of all optional extras is available at the end of this document.\\n\\n## Basic Usage\\n\\n### Documentation\\n\\n| Guide | Description |\\n|-------|-------------|\\n| [CLI Reference](./docs/interface.md) | Command-line arguments and subcommands |\\n| [Configuration Guide](./docs/config_files.md) | YAML config file format and examples |\\n| [Python API](./docs/python-api.md) | Programmatic usage with `simple_evaluate()` |\\n| [Task Guide](./lm_eval/tasks/README.md) | Available tasks and task configuration |\\n\\nUse `lm-eval -h` to see available options, or `lm-eval run -h` for evaluation options.\\n\\nList available tasks with:\\n\\n```bash\\nlm-eval ls tasks\\n```\\n\\n### Hugging Face `transformers`\\n\\n&gt; [!Important]\\n&gt; To use the HuggingFace backend, first install: `pip install \\\"lm_eval[hf]\\\"`\\n\\nTo evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\\\\n    --tasks hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\nAdditional arguments can be provided to the model constructor using the `--model_args` flag. 
Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\\\"float\\\" \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\nModels that are loaded via both `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) and `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.\\n\\nBatch size selection can be automated by setting the  ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\\\"float\\\" \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size auto:4\\n```\\n\\n&gt; [!Note]\\n&gt; Just like you can provide a local path to `transformers.AutoModel`, you can also provide a local path to `lm_eval` via `--model_args pretrained=/path/to/model`\\n\\n#### Evaluating GGUF Models\\n\\n`lm-eval` supports evaluating models in GGUF format using the Hugging Face (`hf`) backend. 
This allows you to use quantized models compatible with `transformers`, `AutoModel`, and llama.cpp conversions.\\n\\nTo evaluate a GGUF model, pass the path to the directory containing the model weights, the `gguf_file`, and optionally a separate `tokenizer` path using the `--model_args` flag.\\n\\n**🚨 Important Note:**  \\nIf no separate tokenizer is provided, Hugging Face will attempt to reconstruct the tokenizer from the GGUF file — this can take **hours** or even hang indefinitely. Passing a separate tokenizer avoids this issue and can reduce tokenizer loading time from hours to seconds.\\n\\n**✅ Recommended usage:**\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=/path/to/gguf_folder,gguf_file=model-name.gguf,tokenizer=/path/to/tokenizer \\\\\\n    --tasks hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\n&gt; [!Tip]\\n&gt; Ensure the tokenizer path points to a valid Hugging Face tokenizer directory (e.g., containing tokenizer_config.json, vocab.json, etc.).\\n\\n#### Multi-GPU Evaluation with Hugging Face `accelerate`\\n\\nWe support three main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.\\n\\nTo perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:\\n\\n```bash\\naccelerate launch -m lm_eval --model hf \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --batch_size 16\\n```\\n\\n(or via `accelerate launch --no-python lm_eval`).\\n\\nFor cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.\\n\\n**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.\\n\\nThe second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*\\n\\nIn this 
setting, run the library *outside the `accelerate` launcher*, but passing `parallelize=True` to `--model_args` as follows:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --model_args parallelize=True \\\\\\n    --batch_size 16\\n```\\n\\nThis means that your model's weights will be split across all available GPUs.\\n\\nFor more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well:\\n\\n- `device_map_option`: How to split model weights across available GPUs. defaults to \\\"auto\\\".\\n- `max_memory_per_gpu`: the max GPU memory to use per GPU in loading the model.\\n- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.\\n- `offload_folder`: a folder where model weights will be offloaded to disk if needed.\\n\\nThe third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.\\n\\n```bash\\naccelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \\\\\\n    -m lm_eval --model hf \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --model_args parallelize=True \\\\\\n    --batch_size 16\\n```\\n\\nTo learn more about model parallelism and how to use it with the `accelerate` library, see the [accelerate documentation](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism)\\n\\n**Warning: We do not natively support multi-node evaluation using the `hf` model type! 
Please reference [our GPT-NeoX library integration](https://github.com/EleutherAI/gpt-neox/blob/main/eval.py) for an example of code in which a custom multi-machine evaluation script is written.**\\n\\n**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**\\n\\n#### Tensor Parallelism (native PyTorch)\\n\\nFor models that support PyTorch's native Tensor Parallelism (via DTensor), you can shard model weights across GPUs without `accelerate`'s device-map by passing `tp_plan=auto` in `--model_args`. Launch with `torchrun` or `accelerate launch`:\\n\\n```bash\\ntorchrun --nproc-per-node=4 -m lm_eval \\\\\\n    --model hf \\\\\\n    --model_args pretrained=google/gemma-4-31B-it,tp_plan=auto \\\\\\n    --tasks lambada_openai,arc_easy \\\\\\n    --batch_size 16\\n```\\n\\n**Constraints:**\\n\\n- `tp_plan` and `parallelize=True` are mutually exclusive — use one or the other.\\n- The number of key-value heads in the model must be divisible by `--nproc-per-node` (the TP degree).\\n- Requires PyTorch &gt;= 2.4 and a `transformers` version that exposes a TP plan for the model (v4.47+).\\n\\n### Steered Hugging Face `transformers` models\\n\\nTo evaluate a Hugging Face `transformers` model with steering vectors applied, specify the model type as `steered` and provide the path to either a PyTorch file containing pre-defined steering vectors, or a CSV file that specifies how to derive steering vectors from pretrained `sparsify` or `sae_lens` models (you will need to install the corresponding optional dependency for this method).\\n\\nSpecify pre-defined steering vectors:\\n\\n```python\\nimport torch\\n\\nsteer_config = {\\n    \\\"layers.3\\\": {\\n        \\\"steering_vector\\\": torch.randn(1, 
768),\\n        \\\"bias\\\": torch.randn(1, 768),\\n        \\\"steering_coefficient\\\": 1,\\n        \\\"action\\\": \\\"add\\\"\\n    },\\n}\\ntorch.save(steer_config, \\\"steer_config.pt\\\")\\n```\\n\\nSpecify derived steering vectors:\\n\\n```python\\nimport pandas as pd\\n\\npd.DataFrame({\\n    \\\"loader\\\": [\\\"sparsify\\\"],\\n    \\\"action\\\": [\\\"add\\\"],\\n    \\\"sparse_model\\\": [\\\"EleutherAI/sae-pythia-70m-32k\\\"],\\n    \\\"hookpoint\\\": [\\\"layers.3\\\"],\\n    \\\"feature_index\\\": [30],\\n    \\\"steering_coefficient\\\": [10.0],\\n}).to_csv(\\\"steer_config.csv\\\", index=False)\\n```\\n\\nRun the evaluation harness with steering vectors applied:\\n\\n```bash\\nlm_eval --model steered \\\\\\n    --model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8\\n```\\n\\n### NVIDIA `nemo` models\\n\\n[NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) is a generative AI framework built for researchers and pytorch developers working on language models.\\n\\nTo evaluate a `nemo` model, start by installing NeMo following [the documentation](https://github.com/NVIDIA/NeMo?tab=readme-ov-file#installation). We highly recommend using the NVIDIA PyTorch or NeMo container, especially if you have issues installing Apex or any other dependencies (see [latest released containers](https://github.com/NVIDIA/NeMo/releases)). Please also install the lm evaluation harness library following the instructions in [the Install section](https://github.com/EleutherAI/lm-evaluation-harness/tree/main?tab=readme-ov-file#install).\\n\\nNeMo models can be obtained through [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/models) or in [NVIDIA's Hugging Face page](https://huggingface.co/nvidia). 
In [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling) there are conversion scripts to convert the `hf` checkpoints of popular models like llama, falcon, mixtral or mpt to `nemo`.\\n\\nRun a `nemo` model on one GPU:\\n\\n```bash\\nlm_eval --model nemo_lm \\\\\\n    --model_args path=&lt;path_to_nemo_model&gt; \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 32\\n```\\n\\nIt is recommended to unpack the `nemo` model to avoid the unpacking inside the docker container - it may overflow disk space. For that you can run:\\n\\n```bash\\nmkdir MY_MODEL\\ntar -xvf MY_MODEL.nemo -C MY_MODEL\\n```\\n\\n#### Multi-GPU evaluation with NVIDIA `nemo` models\\n\\nBy default, only one GPU is used. But we do support either data replication or tensor/pipeline parallelism during evaluation, on one node.\\n\\n1) To enable data replication, set the `model_args` of `devices` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:\\n\\n```bash\\ntorchrun --nproc-per-node=8 --no-python lm_eval \\\\\\n    --model nemo_lm \\\\\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=8 \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 32\\n```\\n\\n2) To enable tensor and/or pipeline parallelism, set the `model_args` of `tensor_model_parallel_size` and/or `pipeline_model_parallel_size`. In addition, you also have to set up `devices` to be equal to the product of `tensor_model_parallel_size` and/or `pipeline_model_parallel_size`. 
For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:\\n\\n```bash\\ntorchrun --nproc-per-node=4 --no-python lm_eval \\\\\\n    --model nemo_lm \\\\\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 32\\n```\\n\\nNote that it is recommended to substitute the `python` command by `torchrun --nproc-per-node=&lt;number of devices&gt; --no-python` to facilitate loading the model into the GPUs. This is especially important for large checkpoints loaded into multiple GPUs.\\n\\nNot supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.\\n\\n### Megatron-LM models\\n\\n[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) is NVIDIA's large-scale transformer training framework. This backend allows direct evaluation of Megatron-LM checkpoints without conversion.\\n\\n**Requirements:**\\n- Megatron-LM must be installed or accessible via `MEGATRON_PATH` environment variable\\n- PyTorch with CUDA support\\n\\n**Setup:**\\n\\n```bash\\n# Set environment variable pointing to Megatron-LM installation\\nexport MEGATRON_PATH=/path/to/Megatron-LM\\n```\\n\\n**Basic usage (single GPU):**\\n\\n```bash\\nlm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_type=HuggingFaceTokenizer,tokenizer_model=/path/to/tokenizer \\\\\\n    --tasks hellaswag \\\\\\n    --batch_size 1\\n```\\n\\n**Supported checkpoint formats:**\\n- Standard Megatron checkpoints (`model_optim_rng.pt`)\\n- Distributed checkpoints (`.distcp` format, auto-detected)\\n\\n#### Parallelism Modes\\n\\nThe Megatron-LM backend supports the following parallelism modes:\\n\\n| Mode | Configuration | Description |\\n|------|---------------|-------------|\\n| Single GPU | `devices=1` (default) | Standard single GPU evaluation |\\n| Data Parallelism | 
`devices&gt;1, TP=1` | Each GPU has a full model replica, data is distributed |\\n| Tensor Parallelism | `TP == devices` | Model layers are split across GPUs |\\n| Expert Parallelism | `EP == devices, TP=1` | For MoE models, experts are distributed across GPUs |\\n\\n&gt; [!Note]\\n&gt; - Pipeline Parallelism (PP &gt; 1) is not currently supported.\\n&gt; - Expert Parallelism (EP) cannot be combined with Tensor Parallelism (TP).\\n\\n**Data Parallelism (4 GPUs, each with full model replica):**\\n\\n```bash\\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=4 \\\\\\n    --tasks hellaswag\\n```\\n\\n**Tensor Parallelism (TP=2):**\\n\\n```bash\\ntorchrun --nproc-per-node=2 -m lm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=2,tensor_model_parallel_size=2 \\\\\\n    --tasks hellaswag\\n```\\n\\n**Expert Parallelism for MoE models (EP=4):**\\n\\n```bash\\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/moe_checkpoint,tokenizer_model=/path/to/tokenizer,devices=4,expert_model_parallel_size=4 \\\\\\n    --tasks hellaswag\\n```\\n\\n**Using extra_args for additional Megatron options:**\\n\\n```bash\\nlm_eval --model megatron_lm \\\\\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,extra_args=\\\"--no-rope-fusion --trust-remote-code\\\" \\\\\\n    --tasks hellaswag\\n```\\n\\n&gt; [!Note]\\n&gt; The `--use-checkpoint-args` flag is enabled by default, which loads model architecture parameters from the checkpoint. For checkpoints converted via Megatron-Bridge, this typically includes all necessary model configuration.\\n\\n#### Multi-GPU evaluation with OpenVINO models\\n\\nPipeline parallelism during evaluation is supported with OpenVINO models\\n\\nTo enable pipeline parallelism, set the `model_args` of `pipeline_parallel`. 
In addition, you also have to set up `device` to value `HETERO:&lt;GPU index1&gt;,&lt;GPU index2&gt;` for example `HETERO:GPU.1,GPU.0` For example, the command to use pipeline parallelism of 2 is:\\n\\n```bash\\nlm_eval --model openvino \\\\\\n    --tasks wikitext \\\\\\n    --model_args pretrained=&lt;path_to_ov_model&gt;,pipeline_parallel=True \\\\\\n    --device HETERO:GPU.1,GPU.0\\n```\\n\\n### Tensor + Data Parallel and Optimized Inference with `vLLM`\\n\\nWe also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), especially faster when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:\\n\\n```bash\\nlm_eval --model vllm \\\\\\n    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \\\\\\n    --tasks lambada_openai \\\\\\n    --batch_size auto\\n```\\n\\nTo use vllm, do `pip install \\\"lm_eval[vllm]\\\"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.\\n\\n&gt; [!Note]\\n&gt; `data_parallel_size&gt;1` dispatches each replica as a separate [ray](https://github.com/ray-project/ray) actor and requires `pip install ray`. Each actor reserves `tensor_parallel_size` GPUs (default 1).\\n\\nvLLM occasionally differs in output from Huggingface. 
We treat Huggingface as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.\\n\\n&gt; [!Tip]\\n&gt; For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!\\n\\n&gt; [!Tip]\\n&gt; Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use auto batch size, such as for Mistral-7B-v0.1 which defaults to a maximum length of 32k.\\n\\n### Tensor + Data Parallel and Fast Offline Batching Inference with `SGLang`\\n\\nWe support SGLang for efficient offline batch inference. Its **[Fast Backend Runtime](https://docs.sglang.ai/index.html)** delivers high performance through optimized memory management and parallel processing techniques. Key features include tensor parallelism, continuous batching, and support for various quantization methods (FP8/INT4/AWQ/GPTQ).\\n\\nTo use SGLang as the evaluation backend, please **install it in advance** via SGLang documents [here](https://docs.sglang.io/get_started/install.html#install-sglang).\\n\\n&gt; [!Tip]\\n&gt; Due to the installing method of [`Flashinfer`](https://docs.flashinfer.ai/)-- a fast attention kernel library, we don't include the dependencies of `SGLang` within [pyproject.toml](pyproject.toml). Note that the `Flashinfer` also has some requirements on `torch` version.\\n\\nSGLang's server arguments are slightly different from other backends, see [here](https://docs.sglang.io/advanced_features/server_arguments.html) for more information. 
We provide an example of the usage here:\\n\\n```bash\\nlm_eval --model sglang \\\\\\n    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \\\\\\n    --tasks gsm8k_cot \\\\\\n    --batch_size auto\\n```\\n\\n&gt; [!Tip]\\n&gt; When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:\\n&gt;\\n&gt; 1. Use a manual `batch_size`, rather than `auto`.\\n&gt; 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` - Add to your model arguments for example `--model_args pretrained=...,mem_fraction_static=0.7`.\\n&gt; 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).\\n\\n### Windows ML\\n\\nWe support **Windows ML** for hardware-accelerated inference on Windows platforms. This enables evaluation on CPU, GPU, and **NPU (Neural Processing Unit)** devices.\\n\\nWindows ML?\\nhttps://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview\\n\\nTo use Windows ML, install the required dependencies:\\n\\n```bash\\npip install wasdk-Microsoft.Windows.AI.MachineLearning[all] wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap onnxruntime-windowsml onnxruntime-genai-winml\\n```\\n\\nEvaluate an ONNX Runtime GenAI LLM on NPU/GPU/CPU on Windows:\\n\\n```bash\\nlm_eval --model winml \\\\\\n    --model_args pretrained=/path/to/onnx/model \\\\\\n    --tasks mmlu \\\\\\n    --batch_size 1\\n```\\n\\n&gt; [!Note]\\n&gt; The Windows ML backend is ONLY for ONNX Runtime GenAI model format. Models targeting `transformers.js` won't work. You can verify this by finding the `genai_config.json` file in the model folder.\\n\\n&gt; [!Note]\\n&gt; To run an ONNX Runtime GenAI model on the target device, you MUST convert the original model to that vendor and device type. Converted models won't work / work well on other vendor or device types. 
To learn more on model conversion, please visit [Microsoft AI Tool Kit](https://code.visualstudio.com/docs/intelligentapps/modelconversion)\\n\\n### Model APIs and Inference Servers\\n\\n&gt; [!Important]\\n&gt; To use API-based models, first install: `pip install \\\"lm_eval[api]\\\"`\\n\\nOur library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.\\n\\nTo call a hosted model, use:\\n\\n```bash\\nexport OPENAI_API_KEY=YOUR_KEY_HERE\\nlm_eval --model openai-completions \\\\\\n    --model_args model=davinci-002 \\\\\\n    --tasks lambada_openai,hellaswag\\n```\\n\\nWe also support using your own local inference server with servers that mirror the OpenAI Completions and ChatCompletions APIs.\\n\\n```bash\\nlm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16\\n```\\n\\nNote that for externally hosted models, configs such as `--device` which relate to where to place a local model should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.\\n\\n| API or Inference Server                                                                                                   | Implemented?                                                                                            
| `--model &lt;xxx&gt;` name                                  | Models supported:                                                                                                                                               | Request Types:                                                                 |\\n|---------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|\\n| OpenAI Completions                                                                                                        | :heavy_check_mark:                                                                                      | `openai-completions`, `local-completions`             | All OpenAI Completions API models                                                                                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| OpenAI ChatCompletions                                                                                                    | :heavy_check_mark:                                                                                      | `openai-chat-completions`, `local-chat-completions`   | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt)                                                                                   | `generate_until` (no logprobs)                                                 |\\n| Anthropic                                                                                 
                                | :heavy_check_mark:                                                                                      | `anthropic`                                           | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model)                                                                    | `generate_until` (no logprobs)                                                 |\\n| Anthropic Chat                                                                                                            | :heavy_check_mark:                                                                                      | `anthropic-chat`, `anthropic-chat-completions`        | [Supported Anthropic Engines](https://docs.anthropic.com/claude/docs/models-overview)                                                                           | `generate_until` (no logprobs)                                                 |\\n| [LiteLLM](https://github.com/BerriAI/litellm) (gateway to 100+ providers)                                                 | :heavy_check_mark:                                                                                      | `litellm`, `litellm-chat`, `litellm-chat-completions` | [All LiteLLM-supported providers](https://docs.litellm.ai/docs/providers)                                                                                       | `generate_until` (no logprobs)                                                 |\\n| Textsynth                                                                                                                 | :heavy_check_mark:                                                                                      | `textsynth`                                           | [All supported engines](https://textsynth.com/documentation.html#engines)                                                                                       | `generate_until`, `loglikelihood`, 
`loglikelihood_rolling`                     |\\n| Cohere                                                                                                                    | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A                                                   | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models)                                                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark:                                                                                      | `gguf`, `ggml`                                        | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp)                                                                                     | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |\\n| vLLM                                                                                                                      | :heavy_check_mark:                                                                                      | `vllm`                                                | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html)                                                                   | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Mamba                                                                                                                     | :heavy_check_mark:                                                                                      | `mamba_ssm`                                           | [Mamba architecture Language Models via the `mamba_ssm` 
package](https://huggingface.co/state-spaces)                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Huggingface Optimum (Causal LMs)                                                                                          | :heavy_check_mark:                                                                                      | `openvino`                                            | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format                                 | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Huggingface Optimum-intel IPEX (Causal LMs)                                                                               | :heavy_check_mark:                                                                                      | `ipex`                                                | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Huggingface Optimum-habana (Causal LMs)                                                                                   | :heavy_check_mark:                                                                                      | `habana`                                              | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Neuron via AWS Inf2 (Causal LMs)                                                                                          | :heavy_check_mark:                                                                    
                  | `neuronx`                                             | Any decoder-only AutoModelForCausalLM supported to run on [huggingface-ami image for inferentia2](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| NVIDIA NeMo                                                                                                               | :heavy_check_mark:                                                                                      | `nemo_lm`                                             | [All supported models](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/core/core.html#nemo-models)                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| NVIDIA Megatron-LM                                                                                                        | :heavy_check_mark:                                                                                      | `megatron_lm`                                         | [Megatron-LM GPT models](https://github.com/NVIDIA/Megatron-LM) (standard and distributed checkpoints)                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| Watsonx.ai                                                                                                                | :heavy_check_mark:                                                                                      | `watsonx_llm`                                         | [Supported Watsonx.ai Engines](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx)                                      | `generate_until` `loglikelihood`                                               |\\n| Windows ML                                                                    
                                            | :heavy_check_mark:                                                                                      | `winml`                                               | [ONNX models in GenAI format](https://code.visualstudio.com/docs/intelligentapps/modelconversion)                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n| [Your local inference server!](docs/API_guide.md)                                                                         | :heavy_check_mark:                                                                                      | `local-completions` or `local-chat-completions`       | Support for OpenAI API-compatible servers, with easy customization for other APIs.                                                                              | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\\n\\nModels which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.\\n\\nFor more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\\n\\n&gt; [!Note]\\n&gt; For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm answer extraction and scoring on generative tasks is performing as expected. 
Providing `system=\\\"&lt;some system prompt here&gt;\\\"` within `--model_args` for anthropic-chat-completions, to instruct the model what format to respond in, may be useful.\\n\\n### Other Frameworks\\n\\nA number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).\\n\\nTo create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).\\n\\n### Additional Features\\n\\n&gt; [!Note]\\n&gt; For tasks unsuitable for direct evaluation, whether due to risks associated with executing untrusted code or to complexities in the evaluation process, the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.\\n\\nIf you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher). **Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. 
If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on `--device cpu` and `--device mps` match.**\\n\\n&gt; [!Note]\\n&gt; You can inspect what the LM inputs look like by running the following command:\\n&gt;\\n&gt; ```bash\\n&gt; python write_out.py \\\\\\n&gt;     --tasks &lt;task1,task2,...&gt; \\\\\\n&gt;     --num_fewshot 5 \\\\\\n&gt;     --num_examples 10 \\\\\\n&gt;     --output_base_path /path/to/output/folder\\n&gt; ```\\n&gt;\\n&gt; This will write out one text file for each task.\\n\\nTo verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:\\n\\n```bash\\nlm_eval --model openai \\\\\\n    --model_args engine=davinci-002 \\\\\\n    --tasks lambada_openai,hellaswag \\\\\\n    --check_integrity\\n```\\n\\n## Advanced Usage Tips\\n\\nFor models loaded with the HuggingFace  `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \\\\\\n    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \\\\\\n    --device cuda:0\\n```\\n\\nModels provided as delta weights can be easily loaded using the Hugging Face transformers library. 
Within `--model_args`, set the `delta` argument to specify the delta weights, and use the `pretrained` argument to designate the base model to which they will be applied:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=Ejafa/llama_7B,delta=lmsys/vicuna-7b-delta-v1.1 \\\\\\n    --tasks hellaswag\\n```\\n\\nGPTQ quantized models can be loaded using [GPTQModel](https://github.com/ModelCloud/GPTQModel) (faster) or [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).\\n\\nGPTQModel: add `,gptqmodel=True` to `model_args`:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=model-name-or-path,gptqmodel=True \\\\\\n    --tasks hellaswag\\n```\\n\\nAutoGPTQ: add `,autogptq=True` to `model_args`:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\\\\n    --tasks hellaswag\\n```\\n\\nWe support wildcards in task names. For example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`.\\n\\n## Saving &amp; Caching Results\\n\\nTo save evaluation results, provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.\\n\\n&gt; [!TIP]\\n&gt; Use `--use_cache &lt;DIR&gt;` to cache evaluation results and skip previously evaluated samples when resuming runs of the same (model, task) pairs. Note that caching is rank-dependent, so restart with the same GPU count if interrupted. You can also use `--cache_requests` to save dataset preprocessing steps for faster evaluation resumption.\\n\\nTo push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. 
Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub - [example dataset on the  HF Hub](https://huggingface.co/datasets/KonradSzafer/lm-eval-results-demo). For instance:\\n\\n```bash\\nlm_eval --model hf \\\\\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\\\\n    --tasks hellaswag \\\\\\n    --log_samples \\\\\\n    --output_path results \\\\\\n    --hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False \\\\\\n```\\n\\nThis allows you to easily download the results and samples from the Hub, using:\\n\\n```python\\nfrom datasets import load_dataset\\n\\nload_dataset(\\\"EleutherAI/lm-eval-results-private\\\", \\\"hellaswag\\\", \\\"latest\\\")\\n```\\n\\nFor a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!\\n\\n## Visualizing Results\\n\\nYou can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights &amp; Biases (W&amp;B) and Zeno.\\n\\n### Zeno\\n\\nYou can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.\\n\\nFirst, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).\\nAdd this key as an environment variable:\\n\\n```bash\\nexport ZENO_API_KEY=[your api key]\\n```\\n\\nYou'll also need to install the `lm_eval[zeno]` package extra.\\n\\nTo visualize the results, run the eval harness with the `log_samples` and `output_path` flags.\\nWe expect `output_path` to contain multiple folders that represent individual model names.\\nYou can thus run your evaluation on any number of tasks and models and upload all of the results as projects on 
Zeno.\\n\\n```bash\\nlm_eval \\\\\\n    --model hf \\\\\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\\\\n    --tasks hellaswag \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8 \\\\\\n    --log_samples \\\\\\n    --output_path output/gpt-j-6B\\n```\\n\\nThen, you can upload the resulting data using the `zeno_visualize` script:\\n\\n```bash\\npython scripts/zeno_visualize.py \\\\\\n    --data_path output \\\\\\n    --project_name \\\"Eleuther Project\\\"\\n```\\n\\nThis will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.\\nIf you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\\n\\nYou can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).\\n\\n### Weights and Biases\\n\\nWith the [Weights and Biases](https://wandb.ai/site) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights &amp; Biases (W&amp;B) platform.\\n\\nThe integration provide functionalities\\n\\n- to automatically log the evaluation results,\\n- log the samples as W&amp;B Tables for easy visualization,\\n- log the `results.json` file as an artifact for version control,\\n- log the `&lt;task_name&gt;_eval_samples.json` file if the samples are logged,\\n- generate a comprehensive report for analysis and visualization with all the important metric,\\n- log task and cli specific configs,\\n- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\\n\\nFirst you'll need to install the lm_eval[wandb] package extra. Do `pip install lm_eval[wandb]`.\\n\\nAuthenticate your machine with an your unique W&amp;B token. Visit https://wandb.ai/authorize to get one. 
Run `wandb login` in your command line terminal.\\n\\nRun the eval harness as usual with the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma-separated string arguments.\\n\\n```bash\\nlm_eval \\\\\\n    --model hf \\\\\\n    --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\\\\n    --tasks hellaswag,mmlu_abstract_algebra \\\\\\n    --device cuda:0 \\\\\\n    --batch_size 8 \\\\\\n    --output_path output/phi-2 \\\\\\n    --limit 10 \\\\\\n    --wandb_args project=lm-eval-harness-integration \\\\\\n    --log_samples\\n```\\n\\nIn the stdout, you will find the link to the W&amp;B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb), and an example of how to integrate it beyond the CLI.\\n\\n## Contributing\\n\\nCheck out our [open issues](https://github.com/EleutherAI/lm-evaluation-harness/issues) and feel free to submit pull requests!\\n\\nFor more information on the library and how everything fits together, see our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs).\\n\\nTo get started with development, first clone the repository and install the dev dependencies:\\n\\n```bash\\ngit clone https://github.com/EleutherAI/lm-evaluation-harness\\ncd lm-evaluation-harness\\npip install -e \\\".[dev,hf]\\\"\\n```\\n\\n### Implementing new tasks\\n\\nTo implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).\\n\\nIn general, we follow this priority list for addressing concerns about prompting and other eval details:\\n\\n1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.\\n2. If there is a clear and unambiguous official implementation, use that procedure.\\n3. 
If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.\\n4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. As before, prioritize choosing from among the implementations found in LLM training papers.\\n\\nThese are guidelines and not rules, and can be overruled in special circumstances.\\n\\nWe try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different\"\n}"
    }
  ],
  "details": {
    "url": "https://raw.githubusercontent.com/EleutherAI/lm-evaluation-harness/main/README.md",
    "content": "<div id=\"readability-page-1\" class=\"page\"><pre># Language Model Evaluation Harness\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256836.svg)](https://doi.org/10.5281/zenodo.10256836)\n\n---\n\n## Latest News 📣\n- [2025/12] **CLI refactored** with subcommands (`run`, `ls`, `validate`) and YAML config file support via `--config`. See the [CLI Reference](./docs/interface.md) and [Configuration Guide](./docs/config_files.md).\n- [2025/12] **Lighter install**: Base package no longer includes `transformers`/`torch`. Install model backends separately: `pip install lm_eval[hf]`, `lm_eval[vllm]`, etc.\n- [2025/07] Added `think_end_token` arg to `hf` (token/str), `vllm` and `sglang` (str) for stripping CoT reasoning traces from models that support it.\n- [2025/03] Added support for steering HF models!\n- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!\n- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.\n- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**\n- [2024/07] New Open LLM Leaderboard tasks have been added ! 
You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.\n\n---\n\n## Announcement\n\n**A new v0.4.0 release of lm-evaluation-harness is available** !\n\nNew updates and features include:\n\n- **New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.**\n- Internal refactoring\n- Config-based task creation and configuration\n- Easier import and sharing of externally-defined task config YAMLs\n- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource\n- More advanced configuration options, including output post-processing, answer extraction, and multiple LM generations per document, configurable fewshot settings, and more\n- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more\n- Logging and usability changes\n- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more\n\nPlease see our updated documentation pages in `docs/` for more details.\n\nDevelopment will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https://discord.gg/eleutherai)!\n\n---\n\n## Overview\n\nThis project provides a unified framework to test generative language models on a large number of different evaluation tasks.\n\n**Features:**\n\n- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.\n- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and 
[Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.\n- Support for fast and memory-efficient inference with [vLLM](https://github.com/vllm-project/vllm).\n- Support for commercial APIs including [OpenAI](https://openai.com), and [TextSynth](https://textsynth.com/).\n- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).\n- Support for local models and benchmarks.\n- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.\n- Easy support for custom prompts and evaluation metrics.\n\nThe Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&amp;hl=en&amp;authuser=2&amp;cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.\n\n## Install\n\nTo install the `lm-eval` package from the github repository, run:\n\n```bash\ngit clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\n\n### Installing Model Backends\n\nThe base installation provides the core evaluation framework. 
**Model backends must be installed separately** using optional extras:\n\nFor HuggingFace transformers models:\n\n```bash\npip install \"lm_eval[hf]\"\n```\n\nFor vLLM inference:\n\n```bash\npip install \"lm_eval[vllm]\"\n```\n\nFor API-based models (OpenAI, Anthropic, etc.):\n\n```bash\npip install \"lm_eval[api]\"\n```\n\nMultiple backends can be installed together:\n\n```bash\npip install \"lm_eval[hf,vllm,api]\"\n```\n\nA detailed table of all optional extras is available at the end of this document.\n\n## Basic Usage\n\n### Documentation\n\n| Guide | Description |\n|-------|-------------|\n| [CLI Reference](./docs/interface.md) | Command-line arguments and subcommands |\n| [Configuration Guide](./docs/config_files.md) | YAML config file format and examples |\n| [Python API](./docs/python-api.md) | Programmatic usage with `simple_evaluate()` |\n| [Task Guide](./lm_eval/tasks/README.md) | Available tasks and task configuration |\n\nUse `lm-eval -h` to see available options, or `lm-eval run -h` for evaluation options.\n\nList available tasks with:\n\n```bash\nlm-eval ls tasks\n```\n\n### Hugging Face `transformers`\n\n&gt; [!Important]\n&gt; To use the HuggingFace backend, first install: `pip install \"lm_eval[hf]\"`\n\nTo evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command (this assumes you are using a CUDA-compatible GPU):\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\nAdditional arguments can be provided to the model constructor using the `--model_args` flag. 
Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\nModels that are loaded via both `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) and `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.\n\nBatch size selection can be automated by setting the  ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=\"float\" \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size auto:4\n```\n\n&gt; [!Note]\n&gt; Just like you can provide a local path to `transformers.AutoModel`, you can also provide a local path to `lm_eval` via `--model_args pretrained=/path/to/model`\n\n#### Evaluating GGUF Models\n\n`lm-eval` supports evaluating models in GGUF format using the Hugging Face (`hf`) backend. 
This allows you to use quantized models compatible with `transformers`, `AutoModel`, and llama.cpp conversions.\n\nTo evaluate a GGUF model, pass the path to the directory containing the model weights, the `gguf_file`, and optionally a separate `tokenizer` path using the `--model_args` flag.\n\n**🚨 Important Note:**  \nIf no separate tokenizer is provided, Hugging Face will attempt to reconstruct the tokenizer from the GGUF file — this can take **hours** or even hang indefinitely. Passing a separate tokenizer avoids this issue and can reduce tokenizer loading time from hours to seconds.\n\n**✅ Recommended usage:**\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=/path/to/gguf_folder,gguf_file=model-name.gguf,tokenizer=/path/to/tokenizer \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n&gt; [!Tip]\n&gt; Ensure the tokenizer path points to a valid Hugging Face tokenizer directory (e.g., containing tokenizer_config.json, vocab.json, etc.).\n\n#### Multi-GPU Evaluation with Hugging Face `accelerate`\n\nWe support three main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.\n\nTo perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:\n\n```bash\naccelerate launch -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --batch_size 16\n```\n\n(or via `accelerate launch --no-python lm_eval`).\n\nFor cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.\n\n**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.\n\nThe second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*\n\nIn this setting, run the library *outside the `accelerate` 
launcher*, but passing `parallelize=True` to `--model_args` as follows:\n\n```bash\nlm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\nThis means that your model's weights will be split across all available GPUs.\n\nFor more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well:\n\n- `device_map_option`: How to split model weights across available GPUs. defaults to \"auto\".\n- `max_memory_per_gpu`: the max GPU memory to use per GPU in loading the model.\n- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.\n- `offload_folder`: a folder where model weights will be offloaded to disk if needed.\n\nThe third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.\n\n```bash\naccelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \\\n    -m lm_eval --model hf \\\n    --tasks lambada_openai,arc_easy \\\n    --model_args parallelize=True \\\n    --batch_size 16\n```\n\nTo learn more about model parallelism and how to use it with the `accelerate` library, see the [accelerate documentation](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism)\n\n**Warning: We do not natively support multi-node evaluation using the `hf` model type! 
Please reference [our GPT-NeoX library integration](https://github.com/EleutherAI/gpt-neox/blob/main/eval.py) for an example of code in which a custom multi-machine evaluation script is written.**\n\n**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**\n\n#### Tensor Parallelism (native PyTorch)\n\nFor models that support PyTorch's native Tensor Parallelism (via DTensor), you can shard model weights across GPUs without `accelerate`'s device-map by passing `tp_plan=auto` in `--model_args`. Launch with `torchrun` or `accelerate launch`:\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval \\\n    --model hf \\\n    --model_args pretrained=google/gemma-4-31B-it,tp_plan=auto \\\n    --tasks lambada_openai,arc_easy \\\n    --batch_size 16\n```\n\n**Constraints:**\n\n- `tp_plan` and `parallelize=True` are mutually exclusive — use one or the other.\n- The number of key-value heads in the model must be divisible by `--nproc-per-node` (the TP degree).\n- Requires PyTorch &gt;= 2.4 and a `transformers` version that exposes a TP plan for the model (v4.47+).\n\n### Steered Hugging Face `transformers` models\n\nTo evaluate a Hugging Face `transformers` model with steering vectors applied, specify the model type as `steered` and provide the path to either a PyTorch file containing pre-defined steering vectors, or a CSV file that specifies how to derive steering vectors from pretrained `sparsify` or `sae_lens` models (you will need to install the corresponding optional dependency for this method).\n\nSpecify pre-defined steering vectors:\n\n```python\nimport torch\n\nsteer_config = {\n    \"layers.3\": {\n        \"steering_vector\": torch.randn(1, 768),\n        \"bias\": torch.randn(1, 768),\n       
 \"steering_coefficient\": 1,\n        \"action\": \"add\"\n    },\n}\ntorch.save(steer_config, \"steer_config.pt\")\n```\n\nSpecify derived steering vectors:\n\n```python\nimport pandas as pd\n\npd.DataFrame({\n    \"loader\": [\"sparsify\"],\n    \"action\": [\"add\"],\n    \"sparse_model\": [\"EleutherAI/sae-pythia-70m-32k\"],\n    \"hookpoint\": [\"layers.3\"],\n    \"feature_index\": [30],\n    \"steering_coefficient\": [10.0],\n}).to_csv(\"steer_config.csv\", index=False)\n```\n\nRun the evaluation harness with steering vectors applied:\n\n```bash\nlm_eval --model steered \\\n    --model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \\\n    --tasks lambada_openai,hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8\n```\n\n### NVIDIA `nemo` models\n\n[NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) is a generative AI framework built for researchers and pytorch developers working on language models.\n\nTo evaluate a `nemo` model, start by installing NeMo following [the documentation](https://github.com/NVIDIA/NeMo?tab=readme-ov-file#installation). We highly recommended to use the NVIDIA PyTorch or NeMo container, especially if having issues installing Apex or any other dependencies (see [latest released containers](https://github.com/NVIDIA/NeMo/releases)). Please also install the lm evaluation harness library following the instructions in [the Install section](https://github.com/EleutherAI/lm-evaluation-harness/tree/main?tab=readme-ov-file#install).\n\nNeMo models can be obtained through [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/models) or in [NVIDIA's Hugging Face page](https://huggingface.co/nvidia). 
In [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling) there are conversion scripts to convert the `hf` checkpoints of popular models like Llama, Falcon, Mixtral or MPT to `nemo`.\n\nRun a `nemo` model on one GPU:\n\n```bash\nlm_eval --model nemo_lm \\\n    --model_args path=&lt;path_to_nemo_model&gt; \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\nWe recommend unpacking the `nemo` model beforehand so that it is not unpacked inside the Docker container, where it may overflow disk space. To do so, run:\n\n```bash\nmkdir MY_MODEL\ntar -xvf MY_MODEL.nemo -C MY_MODEL\n```\n\n#### Multi-GPU evaluation with NVIDIA `nemo` models\n\nBy default, only one GPU is used. However, data replication and tensor/pipeline parallelism are both supported during evaluation on a single node.\n\n1) To enable data replication, set `devices` in `model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:\n\n```bash\ntorchrun --nproc-per-node=8 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=8 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\n2) To enable tensor and/or pipeline parallelism, set `tensor_model_parallel_size` and/or `pipeline_model_parallel_size` in `model_args`. In addition, set `devices` equal to the product of `tensor_model_parallel_size` and `pipeline_model_parallel_size`. 
For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:\n\n```bash\ntorchrun --nproc-per-node=4 --no-python lm_eval \\\n    --model nemo_lm \\\n    --model_args path=&lt;path_to_nemo_model&gt;,devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \\\n    --tasks hellaswag \\\n    --batch_size 32\n```\n\nNote that it is recommended to substitute the `python` command by `torchrun --nproc-per-node=&lt;number of devices&gt; --no-python` to facilitate loading the model into the GPUs. This is especially important for large checkpoints loaded into multiple GPUs.\n\nNot supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.\n\n### Megatron-LM models\n\n[Megatron-LM](https://github.com/NVIDIA/Megatron-LM) is NVIDIA's large-scale transformer training framework. This backend allows direct evaluation of Megatron-LM checkpoints without conversion.\n\n**Requirements:**\n- Megatron-LM must be installed or accessible via `MEGATRON_PATH` environment variable\n- PyTorch with CUDA support\n\n**Setup:**\n\n```bash\n# Set environment variable pointing to Megatron-LM installation\nexport MEGATRON_PATH=/path/to/Megatron-LM\n```\n\n**Basic usage (single GPU):**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_type=HuggingFaceTokenizer,tokenizer_model=/path/to/tokenizer \\\n    --tasks hellaswag \\\n    --batch_size 1\n```\n\n**Supported checkpoint formats:**\n- Standard Megatron checkpoints (`model_optim_rng.pt`)\n- Distributed checkpoints (`.distcp` format, auto-detected)\n\n#### Parallelism Modes\n\nThe Megatron-LM backend supports the following parallelism modes:\n\n| Mode | Configuration | Description |\n|------|---------------|-------------|\n| Single GPU | `devices=1` (default) | Standard single GPU evaluation |\n| Data Parallelism | `devices&gt;1, TP=1` | Each GPU has a full model replica, data is 
distributed |\n| Tensor Parallelism | `TP == devices` | Model layers are split across GPUs |\n| Expert Parallelism | `EP == devices, TP=1` | For MoE models, experts are distributed across GPUs |\n\n&gt; [!Note]\n&gt; - Pipeline Parallelism (PP &gt; 1) is not currently supported.\n&gt; - Expert Parallelism (EP) cannot be combined with Tensor Parallelism (TP).\n\n**Data Parallelism (4 GPUs, each with full model replica):**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=4 \\\n    --tasks hellaswag\n```\n\n**Tensor Parallelism (TP=2):**\n\n```bash\ntorchrun --nproc-per-node=2 -m lm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,devices=2,tensor_model_parallel_size=2 \\\n    --tasks hellaswag\n```\n\n**Expert Parallelism for MoE models (EP=4):**\n\n```bash\ntorchrun --nproc-per-node=4 -m lm_eval --model megatron_lm \\\n    --model_args load=/path/to/moe_checkpoint,tokenizer_model=/path/to/tokenizer,devices=4,expert_model_parallel_size=4 \\\n    --tasks hellaswag\n```\n\n**Using extra_args for additional Megatron options:**\n\n```bash\nlm_eval --model megatron_lm \\\n    --model_args load=/path/to/checkpoint,tokenizer_model=/path/to/tokenizer,extra_args=\"--no-rope-fusion --trust-remote-code\" \\\n    --tasks hellaswag\n```\n\n&gt; [!Note]\n&gt; The `--use-checkpoint-args` flag is enabled by default, which loads model architecture parameters from the checkpoint. For checkpoints converted via Megatron-Bridge, this typically includes all necessary model configuration.\n\n#### Multi-GPU evaluation with OpenVINO models\n\nPipeline parallelism during evaluation is supported with OpenVINO models\n\nTo enable pipeline parallelism, set the `model_args` of `pipeline_parallel`. 
In addition, set `device` to `HETERO:&lt;GPU index1&gt;,&lt;GPU index2&gt;`, such as `HETERO:GPU.1,GPU.0`. For example, the command to use pipeline parallelism of 2 is:\n\n```bash\nlm_eval --model openvino \\\n    --tasks wikitext \\\n    --model_args pretrained=&lt;path_to_ov_model&gt;,pipeline_parallel=True \\\n    --device HETERO:GPU.1,GPU.0\n```\n\n### Tensor + Data Parallel and Optimized Inference with `vLLM`\n\nWe also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html). The speedup is largest when splitting a model across multiple GPUs. vLLM can be used for single-GPU inference or for multi-GPU inference with tensor parallelism, data parallelism, or a combination of both. For example:\n\n```bash\nlm_eval --model vllm \\\n    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \\\n    --tasks lambada_openai \\\n    --batch_size auto\n```\n\nTo use vLLM, run `pip install \"lm_eval[vllm]\"`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.\n\n&gt; [!Note]\n&gt; `data_parallel_size&gt;1` dispatches each replica as a separate [ray](https://github.com/ray-project/ray) actor and requires `pip install ray`. Each actor reserves `tensor_parallel_size` GPUs (default 1).\n\nvLLM occasionally differs in output from Huggingface. 
We treat Huggingface as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vllm results against HF.\n\n&gt; [!Tip]\n&gt; For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!\n\n&gt; [!Tip]\n&gt; Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use auto batch size, such as for Mistral-7B-v0.1 which defaults to a maximum length of 32k.\n\n### Tensor + Data Parallel and Fast Offline Batching Inference with `SGLang`\n\nWe support SGLang for efficient offline batch inference. Its **[Fast Backend Runtime](https://docs.sglang.ai/index.html)** delivers high performance through optimized memory management and parallel processing techniques. Key features include tensor parallelism, continuous batching, and support for various quantization methods (FP8/INT4/AWQ/GPTQ).\n\nTo use SGLang as the evaluation backend, please **install it in advance** via SGLang documents [here](https://docs.sglang.io/get_started/install.html#install-sglang).\n\n&gt; [!Tip]\n&gt; Due to the installing method of [`Flashinfer`](https://docs.flashinfer.ai/)-- a fast attention kernel library, we don't include the dependencies of `SGLang` within [pyproject.toml](pyproject.toml). Note that the `Flashinfer` also has some requirements on `torch` version.\n\nSGLang's server arguments are slightly different from other backends, see [here](https://docs.sglang.io/advanced_features/server_arguments.html) for more information. 
We provide an example of the usage here:\n\n```bash\nlm_eval --model sglang \\\n    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \\\n    --tasks gsm8k_cot \\\n    --batch_size auto\n```\n\n&gt; [!Tip]\n&gt; When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:\n&gt;\n&gt; 1. Use a manual `batch_size`, rather than `auto`.\n&gt; 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static` in your model arguments, for example `--model_args pretrained=...,mem_fraction_static=0.7`.\n&gt; 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).\n\n### Windows ML\n\nWe support **Windows ML** for hardware-accelerated inference on Windows platforms. This enables evaluation on CPU, GPU, and **NPU (Neural Processing Unit)** devices.\n\nFor background, see the [Windows ML overview](https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview).\n\nTo use Windows ML, install the required dependencies:\n\n```bash\npip install wasdk-Microsoft.Windows.AI.MachineLearning[all] wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap onnxruntime-windowsml onnxruntime-genai-winml\n```\n\nEvaluate an ONNX Runtime GenAI LLM on NPU/GPU/CPU on Windows:\n\n```bash\nlm_eval --model winml \\\n    --model_args pretrained=/path/to/onnx/model \\\n    --tasks mmlu \\\n    --batch_size 1\n```\n\n&gt; [!Note]\n&gt; The Windows ML backend is ONLY for ONNX Runtime GenAI model format. Models targeting `transformers.js` won't work. You can verify this by finding the `genai_config.json` file in the model folder.\n\n&gt; [!Note]\n&gt; To run an ONNX Runtime GenAI model on the target device, you MUST convert the original model for that vendor and device type. Converted models may not work, or may work poorly, on other vendor or device types. 
To learn more on model conversion, please visit [Microsoft AI Tool Kit](https://code.visualstudio.com/docs/intelligentapps/modelconversion)\n\n### Model APIs and Inference Servers\n\n&gt; [!Important]\n&gt; To use API-based models, first install: `pip install \"lm_eval[api]\"`\n\nOur library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.\n\nTo call a hosted model, use:\n\n```bash\nexport OPENAI_API_KEY=YOUR_KEY_HERE\nlm_eval --model openai-completions \\\n    --model_args model=davinci-002 \\\n    --tasks lambada_openai,hellaswag\n```\n\nWe also support using your own local inference server with servers that mirror the OpenAI Completions and ChatCompletions APIs.\n\n```bash\nlm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16\n```\n\nNote that for externally hosted models, configs such as `--device` which relate to where to place a local model should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.\n\n| API or Inference Server                                                                                                   | Implemented?                                                                                            
| `--model &lt;xxx&gt;` name                                  | Models supported:                                                                                                                                               | Request Types:                                                                 |\n|---------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|\n| OpenAI Completions                                                                                                        | :heavy_check_mark:                                                                                      | `openai-completions`, `local-completions`             | All OpenAI Completions API models                                                                                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| OpenAI ChatCompletions                                                                                                    | :heavy_check_mark:                                                                                      | `openai-chat-completions`, `local-chat-completions`   | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt)                                                                                   | `generate_until` (no logprobs)                                                 |\n| Anthropic                                                                                     
                            | :heavy_check_mark:                                                                                      | `anthropic`                                           | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model)                                                                    | `generate_until` (no logprobs)                                                 |\n| Anthropic Chat                                                                                                            | :heavy_check_mark:                                                                                      | `anthropic-chat`, `anthropic-chat-completions`        | [Supported Anthropic Engines](https://docs.anthropic.com/claude/docs/models-overview)                                                                           | `generate_until` (no logprobs)                                                 |\n| [LiteLLM](https://github.com/BerriAI/litellm) (gateway to 100+ providers)                                                 | :heavy_check_mark:                                                                                      | `litellm`, `litellm-chat`, `litellm-chat-completions` | [All LiteLLM-supported providers](https://docs.litellm.ai/docs/providers)                                                                                       | `generate_until` (no logprobs)                                                 |\n| Textsynth                                                                                                                 | :heavy_check_mark:                                                                                      | `textsynth`                                           | [All supported engines](https://textsynth.com/documentation.html#engines)                                                                                       | `generate_until`, `loglikelihood`, 
`loglikelihood_rolling`                     |\n| Cohere                                                                                                                    | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A                                                   | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models)                                                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark:                                                                                      | `gguf`, `ggml`                                        | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp)                                                                                     | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |\n| vLLM                                                                                                                      | :heavy_check_mark:                                                                                      | `vllm`                                                | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html)                                                                   | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Mamba                                                                                                                     | :heavy_check_mark:                                                                                      | `mamba_ssm`                                           | [Mamba architecture Language Models via the `mamba_ssm` 
package](https://huggingface.co/state-spaces)                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum (Causal LMs)                                                                                          | :heavy_check_mark:                                                                                      | `openvino`                                            | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format                                 | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-intel IPEX (Causal LMs)                                                                               | :heavy_check_mark:                                                                                      | `ipex`                                                | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Huggingface Optimum-habana (Causal LMs)                                                                                   | :heavy_check_mark:                                                                                      | `habana`                                              | Any decoder-only AutoModelForCausalLM                                                                                                                           | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Neuron via AWS Inf2 (Causal LMs)                                                                                          | :heavy_check_mark:                                                                        
              | `neuronx`                                             | Any decoder-only AutoModelForCausalLM supported to run on [huggingface-ami image for inferentia2](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA NeMo                                                                                                               | :heavy_check_mark:                                                                                      | `nemo_lm`                                             | [All supported models](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/core/core.html#nemo-models)                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| NVIDIA Megatron-LM                                                                                                        | :heavy_check_mark:                                                                                      | `megatron_lm`                                         | [Megatron-LM GPT models](https://github.com/NVIDIA/Megatron-LM) (standard and distributed checkpoints)                                                          | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| Watsonx.ai                                                                                                                | :heavy_check_mark:                                                                                      | `watsonx_llm`                                         | [Supported Watsonx.ai Engines](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx)                                      | `generate_until` `loglikelihood`                                               |\n| Windows ML                                                                            
                                    | :heavy_check_mark:                                                                                      | `winml`                                               | [ONNX models in GenAI format](https://code.visualstudio.com/docs/intelligentapps/modelconversion)                                                               | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n| [Your local inference server!](docs/API_guide.md)                                                                         | :heavy_check_mark:                                                                                      | `local-completions` or `local-chat-completions`       | Support for OpenAI API-compatible servers, with easy customization for other APIs.                                                                              | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                     |\n\nModels which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.\n\nFor more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\n\n&gt; [!Note]\n&gt; For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm answer extraction and scoring on generative tasks is performing as expected. 
Providing `system=\"&lt;some system prompt here&gt;\"` within `--model_args` for anthropic-chat-completions, to instruct the model what format to respond in, may be useful.\n\n### Other Frameworks\n\nA number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).\n\nTo create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).\n\n### Additional Features\n\n&gt; [!Note]\n&gt; For tasks unsuitable for direct evaluation, either due to risks associated with executing untrusted code or complexities in the evaluation process, the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.\n\nIf you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher). **Note that the PyTorch MPS backend is still in early stages of development, so correctness issues or unsupported operations may exist. 
If you observe oddities in model performance on the MPS back-end, we recommend first checking that a forward pass of your model on `--device cpu` and `--device mps` match.**\n\n&gt; [!Note]\n&gt; You can inspect what the LM inputs look like by running the following command:\n&gt;\n&gt; ```bash\n&gt; python write_out.py \\\n&gt;     --tasks &lt;task1,task2,...&gt; \\\n&gt;     --num_fewshot 5 \\\n&gt;     --num_examples 10 \\\n&gt;     --output_base_path /path/to/output/folder\n&gt; ```\n&gt;\n&gt; This will write out one text file for each task.\n\nTo verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:\n\n```bash\nlm_eval --model openai-completions \\\n    --model_args model=davinci-002 \\\n    --tasks lambada_openai,hellaswag \\\n    --check_integrity\n```\n\n## Advanced Usage Tips\n\nFor models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \\\n    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \\\n    --device cuda:0\n```\n\nModels provided as delta weights can be easily loaded using the Hugging Face transformers library. 
Within `--model_args`, set the `delta` argument to specify the delta weights, and use the `pretrained` argument to designate the base model to which they will be applied:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=Ejafa/llama_7B,delta=lmsys/vicuna-7b-delta-v1.1 \\\n    --tasks hellaswag\n```\n\nGPTQ quantized models can be loaded using [GPTQModel](https://github.com/ModelCloud/GPTQModel) (faster) or [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).\n\nGPTQModel: add `,gptqmodel=True` to `model_args`:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,gptqmodel=True \\\n    --tasks hellaswag\n```\n\nAutoGPTQ: add `,autogptq=True` to `model_args`:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag\n```\n\nWe support wildcards in task names. For example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`.\n\n## Saving &amp; Caching Results\n\nTo save evaluation results provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.\n\n&gt; [!TIP]\n&gt; Use `--use_cache &lt;DIR&gt;` to cache evaluation results and skip previously evaluated samples when resuming runs of the same (model, task) pairs. Note that caching is rank-dependent, so restart with the same GPU count if interrupted. You can also use `--cache_requests` to save dataset preprocessing steps for faster evaluation resumption.\n\nTo push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub - [example dataset on the HF Hub](https://huggingface.co/datasets/KonradSzafer/lm-eval-results-demo). 
For instance:\n\n```bash\nlm_eval --model hf \\\n    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \\\n    --tasks hellaswag \\\n    --log_samples \\\n    --output_path results \\\n    --hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False\n```\n\nThis allows you to easily download the results and samples from the Hub, using:\n\n```python\nfrom datasets import load_dataset\n\nload_dataset(\"EleutherAI/lm-eval-results-private\", \"hellaswag\", \"latest\")\n```\n\nFor a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!\n\n## Visualizing Results\n\nYou can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights &amp; Biases (W&amp;B) and Zeno.\n\n### Zeno\n\nYou can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.\n\nFirst, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).\nAdd this key as an environment variable:\n\n```bash\nexport ZENO_API_KEY=[your api key]\n```\n\nYou'll also need to install the `lm_eval[zeno]` package extra.\n\nTo visualize the results, run the eval harness with the `log_samples` and `output_path` flags.\nWe expect `output_path` to contain multiple folders that represent individual model names.\nYou can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=EleutherAI/gpt-j-6B \\\n    --tasks hellaswag \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --log_samples \\\n    --output_path output/gpt-j-6B\n```\n\nThen, you can upload the resulting data using the `zeno_visualize` script:\n\n```bash\npython 
scripts/zeno_visualize.py \\\n    --data_path output \\\n    --project_name \"Eleuther Project\"\n```\n\nThis will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.\nIf you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\n\nYou can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).\n\n### Weights and Biases\n\nWith the [Weights and Biases](https://wandb.ai/site) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n\nThe integration provides functionality to:\n\n- automatically log the evaluation results,\n- log the samples as W&B Tables for easy visualization,\n- log the `results.json` file as an artifact for version control,\n- log the `<task_name>_eval_samples.json` file if the samples are logged,\n- generate a comprehensive report for analysis and visualization with all the important metrics,\n- log task- and CLI-specific configs,\n- and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\n\nFirst, install the `lm_eval[wandb]` package extra with `pip install lm_eval[wandb]`.\n\nAuthenticate your machine with your unique W&B token. Visit https://wandb.ai/authorize to get one, then run `wandb login` in your command line terminal.\n\nRun the eval harness as usual with a `wandb_args` flag. 
Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma-separated string arguments.\n\n```bash\nlm_eval \\\n    --model hf \\\n    --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n    --tasks hellaswag,mmlu_abstract_algebra \\\n    --device cuda:0 \\\n    --batch_size 8 \\\n    --output_path output/phi-2 \\\n    --limit 10 \\\n    --wandb_args project=lm-eval-harness-integration \\\n    --log_samples\n```\n\nIn the stdout, you will find the link to the W&B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb), which also includes an example of how to integrate it beyond the CLI.\n\n## Contributing\n\nCheck out our [open issues](https://github.com/EleutherAI/lm-evaluation-harness/issues) and feel free to submit pull requests!\n\nFor more information on the library and how everything fits together, see our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs).\n\nTo get started with development, first clone the repository and install the dev dependencies:\n\n```bash\ngit clone https://github.com/EleutherAI/lm-evaluation-harness\ncd lm-evaluation-harness\npip install -e \".[dev,hf]\"\n```\n\n### Implementing new tasks\n\nTo implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).\n\nIn general, we follow this priority list for addressing concerns about prompting and other eval details:\n\n1. If there is widespread agreement among people who train LLMs, use the agreed-upon procedure.\n2. If there is a clear and unambiguous official implementation, use that procedure.\n3. If there is widespread agreement among people who evaluate LLMs, use the agreed-upon procedure.\n4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. 
As before, prioritize choosing from among the implementations found in LLM training papers.\n\nThese are guidelines and not rules, and can be overruled in special circumstances.\n\nWe try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different"
  }
}
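
The README excerpt above passes `--model_args` and `--wandb_args` as comma-separated `key=value` strings. As a minimal sketch of assembling such an invocation from Python, the snippet below builds the argument list from dictionaries. The helper names `to_arg_string` and `build_lm_eval_command` are hypothetical and not part of `lm_eval`; only flags that appear in the excerpt are used, and the snippet prints the command rather than running it.

```python
import shlex


def to_arg_string(options):
    """Join a dict into the comma-separated key=value form used by
    flags such as --model_args and --wandb_args (hypothetical helper)."""
    return ",".join(f"{key}={value}" for key, value in options.items())


def build_lm_eval_command(model, model_args, tasks, output_path,
                          wandb_args=None, log_samples=True):
    """Assemble an lm_eval argument list (hypothetical helper, not part of lm_eval)."""
    command = [
        "lm_eval",
        "--model", model,
        "--model_args", to_arg_string(model_args),
        "--tasks", ",".join(tasks),
        "--output_path", output_path,
    ]
    if wandb_args:
        command += ["--wandb_args", to_arg_string(wandb_args)]
    if log_samples:
        command.append("--log_samples")
    return command


# Values mirror the W&B example in the README excerpt.
command = build_lm_eval_command(
    model="hf",
    model_args={"pretrained": "microsoft/phi-2", "trust_remote_code": True},
    tasks=["hellaswag", "mmlu_abstract_algebra"],
    output_path="output/phi-2",
    wandb_args={"project": "lm-eval-harness-integration"},
)

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))
```

To actually launch the evaluation, the list could be handed to `subprocess.run(command, check=True)`, assuming `lm_eval` is installed and on the `PATH`.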

▼✓Web search result count: 15, query: "lm-evaluation-harness v0.4.9 v0.4.8 new features tasks 2025"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.9 - NASA ADS\",\n      \"url\": \"https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract\",\n      \"description\": \"by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825 Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889 Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870 Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876 Add device arg to model\\\\_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879 paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759 Delete scripts/cost\\\\_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985 Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980 [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997 fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998 Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002 [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000 Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993 use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958 Revert \\\"feat: add question suffix (#2876)\\\" by @baberabb 
in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007 [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013 [Fix] Update resolve\\\\_hf\\\\_chat\\\\_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992 Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987 [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011 add arab\\\\_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006 chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030 Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999 [Fix] acc\\\\_mutual\\\\_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035 Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025 add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995 remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041 [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983 Fallback to super implementation in fewshot\\\\_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023 Fix Typo in README and Comment in utils\\\\_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057 fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061 mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070 Fix typos 
in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060 bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073 New\"\n    },\n    {\n      \"title\": \"user_13987929/lm-evaluation-harness - Gitee\",\n      \"url\": \"https://gitee.com/aikekj_admin/lm-evaluation-harness\",\n      \"description\": \"New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier import and sharing of externally-defined task\"\n    },\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks; from lm_eval.models import get_model; model = get_model(\\\"hf\\\", pretrained=\\\"gpt2\\\"); results = evaluator.evaluate(model, tasks=[\\\"hellaswag\\\"], num_fewshot=5). import lm_eval; results = lm_eval.simple_evaluate(model=\\\"gpt2\\\", tasks=[\\\"hellaswag\\\", \\\"mmlu\\\"], num_fewshot=0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. 
lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true --tasks mmlu,hellaswag --num_fewshot 5 --batch_size auto --output_path results/mistral-7b.json --apply_chat_template. # Set OPENAI_API_KEY environment variable first. lm-eval --model openai-chat --model_args model=gpt-4-turbo --tasks mmlu,bbh --num_fewshot 5 --batch_size 10.\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.8 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/14970487\",\n      \"description\": \"4.8 Release Notes. Key Improvements. New Backend Support: Added SGLang as new evaluation backend! Enabled model steering with vector support via\"\n    },\n    {\n      \"title\": \"Releases · EleutherAI/lm-evaluation-harness - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/releases\",\n      \"description\": \"This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. 
For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.9.1\",\n      \"url\": \"https://zenodo.org/records/16737642\",\n      \"description\": \"There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: v0.4.9.1. # lm-eval v0.4.9.1 Release Notes. This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. 
We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon! ## New Benchmarks & Tasks. ### Tasks & Benchmarks:. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9...v0.4.9.1. ### Files (9.2 MB). ### EleutherAI/lm-evaluation-harness-v0.4.9.1.zip. | EleutherAI/lm-evaluation-harness-v0.4.9.1.zip md5:c9963fc62b221f792bfdbdc69681b33f | 9.2 MB | Preview Download |. ## Additional details. | Views  Total views | 48,774 | 1,131 |. | Downloads  Total downloads | 1,405 | 59 |. | Data volume  Total data volume | 4.6 GB | 597.1 MB |. More info on how stats are collected.... Release: v0.4.9.1. ## Details. ### Image URL. ### Target URL. Find out more on how we use cookies.\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.4\",\n      \"url\": \"https://zenodo.org/records/13694023\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.4. # lm-eval v0.4.4 Release Notes. Thank you to the HF team (@clefourrier, @NathanHB , @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here. ## New Tasks. As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! New tasks as of v0.4.4 include:. Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for \\\"parent\\\" tasks like `mmlu` to aggregate and report a unified score across a set of component \\\"subtasks\\\". 
There were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:.\"\n    },\n    {\n      \"title\": \"blockdialect - lm-evaluation-harness - Stanford GitLab\",\n      \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness\",\n      \"description\": \"A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/task_guide.md at main\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md\",\n      \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes\",\n      \"url\": \"https://zenodo.org/records/17728786\",\n      \"description\": \"There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes. ### Authors/Creators. ## Description. This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: **Python 3.10 is now the minimum required version**. ### New Benchmarks & Tasks. A big wave of new evaluation tasks this release:. ### Fixes & Improvements. ### Model & Backend Support. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2. ### Files (10.0 MB). ### EleutherAI/lm-evaluation-harness-v0.4.9.2.zip. | EleutherAI/lm-evaluation-harness-v0.4.9.2.zip md5:a1821ba45eee3362f49140f332ff6512 | 10.0 MB | Preview Download |. ## Additional details. | Views  Total views | 49,045 | 689 |. 
| Downloads  Total downloads | 1,420 | 39 |. | Data volume  Total data volume | 4.7 GB | 389.4 MB |. More info on how stats are collected.... Release: v0.4.9.2. ## Details. #### ``` 10.5281/zenodo.17728786 ```. ### Image URL. ### Target URL. ## Support. Find out more on how we use cookies.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md\",\n      \"description\": \"A list of supported tasks and task groupings can be viewed with lm-eval ls tasks . For more information, including a full list of task names and their\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo\",\n      \"url\": \"https://zenodo.org/records/12608602\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. 
Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder.\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity….\"\n    },\n    {\n      \"title\": \"v0.4.0 - EleutherAI/lm-evaluation-harness - MyGit\",\n      \"url\": \"https://mygit.osfipin.com/release/132521756\",\n      \"description\": \"* Update `actions/setup-python`in CI workflows by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/365. * Update `lambada_openai` multilingual data source by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/370. 
* Update Pile Test/Val Download URLs by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/373. * Add accuracy metric to crows-pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/380. * Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/381. * Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in https://github.com/EleutherAI/lm-evaluation-harness/pull/403. * Update README installation instructions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/407. * fix: add seq2seq peft by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/418. * Add support for load\\\\_in\\\\_8bit and trust\\\\_remote\\\\_code model params by @philwee in https://github.com/EleutherAI/lm-evaluation-harness/pull/422. * Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/427. * [Refactor, WIP] Multiple Choice + loglikelihood\\\\_rolling support for YAML tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/467. * [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/486. * Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/501. * [Refactor] Docs update by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/577. * Add error handling for calling `.to(device)` by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/585.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.9 - NASA ADS",
        "url": "https://ui.adsabs.harvard.edu/abs/2025zndo..15699229S/abstract",
        "description": "by @JessicaOjo in https://github.com/EleutherAI/lm-evaluation-harness/pull/2825 Added C4 Support by @Zephyr271828 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2889 Fixed a bug that in MMLU-Pro utils.py that throw index error if one choice was removed by @sleepingcat4 in https://github.com/EleutherAI/lm-evaluation-harness/pull/2870 Add question suffix before the <|assistant|> tag by @TingchenFu in https://github.com/EleutherAI/lm-evaluation-harness/pull/2876 Add device arg to model\\_args passed to LLM object in VLLM model class by @momentino in https://github.com/EleutherAI/lm-evaluation-harness/pull/2879 paws-x fix formatting by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2759 Delete scripts/cost\\_estimate.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/2985 Adding ACPBench Hard tasks by @harshakokel in https://github.com/EleutherAI/lm-evaluation-harness/pull/2980 [SGLANG] Add the SGLANG generate API by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2997 fix example notebook by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2998 Log tokenized request warning only once by @RobGeada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3002 [Add Dataset Update] KBL 2025 by @abzb1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3000 Output path fix by @Niccolo-Ajroldi in https://github.com/EleutherAI/lm-evaluation-harness/pull/2993 use images with api models by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2981 Adding resize images support by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/2958 Revert \"feat: add question suffix (#2876)\" by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3007 [hotfix] modify multimodal check in evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3013 [Fix] Update 
resolve\\_hf\\_chat\\_template arguments by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/2992 Fix error due in Collating queries with different continuation lengths (fixes #2984) by @ameyagodbole in https://github.com/EleutherAI/lm-evaluation-harness/pull/2987 [vllm] data parallel for V1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3011 add arab\\_culture task by @bodasadallah in https://github.com/EleutherAI/lm-evaluation-harness/pull/3006 chore: clean up and extend .gitignore rules by @e1washere in https://github.com/EleutherAI/lm-evaluation-harness/pull/3030 Enable text-only evals for VLM models by @ysulsky in https://github.com/EleutherAI/lm-evaluation-harness/pull/2999 [Fix] acc\\_mutual\\_info metric calculation bug by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3035 Fix: fix vllm issue with DP>1 by @younesbelkada in https://github.com/EleutherAI/lm-evaluation-harness/pull/3025 add Mbpp instruct by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2995 remove prints by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3041 [longbench] fix metric calculation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2983 Fallback to super implementation in fewshot\\_context for Unitxt tasks by @kiersten-stokes in https://github.com/EleutherAI/lm-evaluation-harness/pull/3023 Fix Typo in README and Comment in utils\\_mcq.py by @vtjl10 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3057 fix longbech citation by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3061 mmlu task: update README.md by @annafontanaa in https://github.com/EleutherAI/lm-evaluation-harness/pull/3070 Fix typos in docstrings in instructions.py by @maximevtush in https://github.com/EleutherAI/lm-evaluation-harness/pull/3060 bump version to 0.4.9 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3073 New"
      },
      {
        "title": "user_13987929/lm-evaluation-harness - Gitee",
        "url": "https://gitee.com/aikekj_admin/lm-evaluation-harness",
        "description": "New updates and features include: Internal refactoring; Config-based task creation and configuration; Easier import and sharing of externally-defined task"
      },
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks; from lm_eval.models import get_model; model = get_model(\"hf\", pretrained=\"gpt2\"); results = evaluator.evaluate(model, tasks=[\"hellaswag\"], num_fewshot=5). import lm_eval; results = lm_eval.simple_evaluate(model=\"gpt2\", tasks=[\"hellaswag\", \"mmlu\"], num_fewshot=0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true --tasks mmlu,hellaswag --num_fewshot 5 --batch_size auto --output_path results/mistral-7b.json --apply_chat_template. # Set OPENAI_API_KEY environment variable first. lm-eval --model openai-chat --model_args model=gpt-4-turbo --tasks mmlu,bbh --num_fewshot 5 --batch_size 10."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.8 - Zenodo",
        "url": "https://zenodo.org/records/14970487",
        "description": "4.8 Release Notes. Key Improvements. New Backend Support: Added SGLang as new evaluation backend! Enabled model steering with vector support via"
      },
      {
        "title": "Releases · EleutherAI/lm-evaluation-harness - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/releases",
        "description": "This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to"
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.9.1",
        "url": "https://zenodo.org/records/16737642",
        "description": "There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: v0.4.9.1. # lm-eval v0.4.9.1 Release Notes. This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon! ## New Benchmarks & Tasks. ### Tasks & Benchmarks:. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9...v0.4.9.1. ### Files (9.2 MB). ### EleutherAI/lm-evaluation-harness-v0.4.9.1.zip. | EleutherAI/lm-evaluation-harness-v0.4.9.1.zip md5:c9963fc62b221f792bfdbdc69681b33f | 9.2 MB | Preview Download |. ## Additional details. | Views  Total views | 48,774 | 1,131 |. | Downloads  Total downloads | 1,405 | 59 |. | Data volume  Total data volume | 4.6 GB | 597.1 MB |. More info on how stats are collected.... Release: v0.4.9.1. ## Details. ### Image URL. ### Target URL. Find out more on how we use cookies."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.4",
        "url": "https://zenodo.org/records/13694023",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.4. # lm-eval v0.4.4 Release Notes. Thank you to the HF team (@clefourrier, @NathanHB , @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here. ## New Tasks. As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! New tasks as of v0.4.4 include:. Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for \"parent\" tasks like `mmlu` to aggregate and report a unified score across a set of component \"subtasks\". There were two ways to add a task to a given `group` name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:."
      },
      {
        "title": "blockdialect - lm-evaluation-harness - Stanford GitLab",
        "url": "https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness",
        "description": "A new v0.4.0 release of lm-evaluation-harness is available ! New updates and features include: New Open LLM Leaderboard tasks have been added ! You"
      },
      {
        "title": "lm-evaluation-harness/docs/task_guide.md at main",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md",
        "description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the"
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes",
        "url": "https://zenodo.org/records/17728786",
        "description": "There is a **newer version** of the record. # EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes. ### Authors/Creators. ## Description. This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: **Python 3.10 is now the minimum required version**. ### New Benchmarks & Tasks. A big wave of new evaluation tasks this release:. ### Fixes & Improvements. ### Model & Backend Support. ## New Contributors. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2. ### Files (10.0 MB). ### EleutherAI/lm-evaluation-harness-v0.4.9.2.zip. | EleutherAI/lm-evaluation-harness-v0.4.9.2.zip md5:a1821ba45eee3362f49140f332ff6512 | 10.0 MB | Preview Download |. ## Additional details. | Views  Total views | 49,045 | 689 |. | Downloads  Total downloads | 1,420 | 39 |. | Data volume  Total data volume | 4.7 GB | 389.4 MB |. More info on how stats are collected.... Release: v0.4.9.2. ## Details. #### ``` 10.5281/zenodo.17728786 ```. ### Image URL. ### Target URL. ## Support. Find out more on how we use cookies."
      },
      {
        "title": "lm-evaluation-harness/lm_eval/tasks/README.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md",
        "description": "A list of supported tasks and task groupings can be viewed with lm-eval ls tasks . For more information, including a full list of task names and their"
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.3 - Zenodo",
        "url": "https://zenodo.org/records/12608602",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.3. # lm-eval v0.4.3 Release Notes. We're releasing a new version of LM Eval Harness for PyPI users at long last. We intend to release new PyPI versions more frequently in the future. The big new feature is the often-requested **Chat Templating**, contributed by @KonradSzafer @clefourrier @NathanHB and also worked on by a number of other awesome contributors! This feature is **currently only supported for model types `hf` and `vllm`** but we intend to gather feedback on improvements and also extend this to other relevant models such as APIs. There's a lot more to check out, including:. We had a number of new tasks contributed. **A listing of subfolders and a brief description of the tasks contained in them can now be found at `lm_eval/tasks/README.md`**. Hopefully this will be a useful step to help users to locate the definitions of relevant tasks more easily, by first visiting this page and then locating the README.md for further info on each task contained within a given folder."
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity…."
      },
      {
        "title": "v0.4.0 - EleutherAI/lm-evaluation-harness - MyGit",
        "url": "https://mygit.osfipin.com/release/132521756",
        "description": "* Update `actions/setup-python`in CI workflows by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/365. * Update `lambada_openai` multilingual data source by @jon-tow in https://github.com/EleutherAI/lm-evaluation-harness/pull/370. * Update Pile Test/Val Download URLs by @fattorib in https://github.com/EleutherAI/lm-evaluation-harness/pull/373. * Add accuracy metric to crows-pairs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/380. * Upstream `hf-causal` and `hf-seq2seq` model implementations by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/381. * Change device parameter to cuda:0 to avoid runtime error by @Jeffwan in https://github.com/EleutherAI/lm-evaluation-harness/pull/403. * Update README installation instructions by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/407. * fix: add seq2seq peft by @zanussbaum in https://github.com/EleutherAI/lm-evaluation-harness/pull/418. * Add support for load\\_in\\_8bit and trust\\_remote\\_code model params by @philwee in https://github.com/EleutherAI/lm-evaluation-harness/pull/422. * Hotfix: patch issues with the `huggingface.py` model classes by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/427. * [Refactor, WIP] Multiple Choice + loglikelihood\\_rolling support for YAML tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/467. * [Refactor] Add decorator for registering YAMLs as tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/486. * Update config by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/501. * [Refactor] Docs update by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/577. * Add error handling for calling `.to(device)` by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/585."
      }
    ]
  }
}
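
The README excerpt in the results above states that models which do not supply logits or logprobs can only run `generate_until` tasks, while backends that expose prompt logprobs can run all four task types. A minimal sketch of that rule follows. It is plain Python and does not call lm-evaluation-harness; the names `ALL_OUTPUT_TYPES`, `runnable_output_types`, and `can_run` are hypothetical helpers introduced only for illustration.

```python
# Illustrative only: encodes the compatibility rule quoted from the README.
# None of these names are part of the lm_eval package (assumption).

ALL_OUTPUT_TYPES = (
    "generate_until",
    "loglikelihood",
    "loglikelihood_rolling",
    "multiple_choice",
)


def runnable_output_types(supplies_prompt_logprobs: bool) -> tuple:
    """Return the task output types a backend can serve.

    Backends that return logprobs/logits for their prompts can run every
    task type; backends that do not are limited to free-form generation.
    """
    if supplies_prompt_logprobs:
        return ALL_OUTPUT_TYPES
    return ("generate_until",)


def can_run(task_output_type: str, supplies_prompt_logprobs: bool) -> bool:
    """Check whether a task with the given output type fits the backend."""
    if task_output_type not in ALL_OUTPUT_TYPES:
        raise ValueError(f"unknown output type: {task_output_type!r}")
    return task_output_type in runnable_output_types(supplies_prompt_logprobs)


if __name__ == "__main__":
    # A chat-style API without prompt logprobs: generation tasks only.
    print(can_run("generate_until", supplies_prompt_logprobs=False))
    print(can_run("multiple_choice", supplies_prompt_logprobs=False))
    # A local model exposing logits: all task types.
    print(runnable_output_types(supplies_prompt_logprobs=True))
```

A check like this could be run before launching an evaluation, so that a likelihood-based task is not scheduled against a backend that cannot score it.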

▼✓Web search, result count: 10, query: "lm-evaluation-harness architecture task model interface design internals"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"EleutherAI's lm-evaluation-harness: Architecture and Configuration\",\n      \"url\": \"https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html\",\n      \"description\": \"EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\\\"hf\\\", pretrained=\\\"gpt2\\\") = \\\"hf\\\" = \\\"gpt2\\\"results = evaluator.evaluate(model, tasks=[\\\"hellaswag\\\"], num_fewshot=5) = = \\\"hellaswag\\\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\\\"gpt2\\\", = \\\"gpt2\\\" tasks=[\\\"hellaswag\\\", \\\"mmlu\\\"], = \\\"hellaswag\\\" \\\"mmlu\\\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\\\lm-eval --model \\\\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\\\ --model_args \\\\ --tasks mmlu,hellaswag \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size auto \\\\ --batch_size \\\\ --output_path results/mistral-7b.json \\\\ --output_path \\\\ --apply_chat_template --apply_chat_template. 
# Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\\\lm-eval --model \\\\ --model_args model=gpt-4-turbo \\\\ --model_args \\\\ --tasks mmlu,bbh \\\\ --tasks \\\\ --num_fewshot 5 \\\\ --num_fewshot \\\\ --batch_size 10 --batch_size.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/new_task_guide.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md\",\n      \"description\": \"[Skip to content](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#start-of-content). [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You signed out in another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You switched accounts on another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.Dismiss alert. [History](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md)History. [Raw](https://github.com/EleutherAI/lm-evaluation-harness/raw/refs/heads/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#new-task-guide). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setup). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#creating-a-yaml-file). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#selecting-and-configuring-a-dataset). 
Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hellaswag/utils.py)):. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-template). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#basic-prompts). For example in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` given the name of one of the feature each. `doc_to_choice` can be directly given a list of text as option (See [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)). if a dataset feature is already a list, you can set the name of the feature as `doc_to_choice` (See [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13)). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-with-jinja-2). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#multiple-choice-format). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-python-functions-for-prompts). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#importing-a-prompt-from-promptsource). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setting-metrics). 
For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#optional-more-advanced-setup). For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-name--tags-registering-a-task). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs). See [./task_guide.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md) for a full list of related sub-keys. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#configuring-python-classes). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-validity). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#versioning). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-performance--equivalence). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-validity-checklist). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#submitting-your-task).\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. 
· GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. 
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness/docs/task_guide.md at main - GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md\",\n      \"description\": \"The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the\"\n    },\n    {\n      \"title\": \"LM Harness Evaluations — NVIDIA NeMo Microservices\",\n      \"url\": \"https://docs.nvidia.com/nemo/microservices/25.9.0/evaluate/flows/academic-benchmarks/lm-harness.html\",\n      \"description\": \"# LM Harness Evaluations. LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks. All LM Harness evaluations use the same target structure. | `type` | Always `\\\"model\\\"` for LM Harness evaluations. | `url` | The URL of the API endpoint for the model. Example LM Harness Tasks by Category. | Math & Reasoning | `gsm8k` | Grade school and advanced math word problems. | Multilingual Tasks | `mgsm`, `wikilingua` | Math word problems and translation tasks in multiple languages. For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run `python -m lm_eval --tasks list`. 
These parameters control how requests are made to the model.\"\n    },\n    {\n      \"title\": \"Evaluating LLMs — EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/projects/large-language-model-evaluation\",\n      \"description\": \"# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve as numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility.\"\n    },\n    {\n      \"title\": \"LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly\",\n      \"url\": \"https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545\",\n      \"description\": \"# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ### Introduction. In the lm-evaluation-harness, the task config is the main way to select a preconfigured evaluation task, which decides the dataset, group, types of measurements, post-processing settings, etc. 
When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures that **lm-evaluation-harness** provides. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2-1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem.\"\n    },\n    {\n      \"title\": \"EleutherAI/lm-evaluation-harness: v0.4.2\",\n      \"url\": \"https://zenodo.org/records/10829972\",\n      \"description\": \"# EleutherAI/lm-evaluation-harness: v0.4.2. # lm-eval v0.4.2 Release Notes. We are releasing a new minor version of lm-eval for PyPI users! There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:. previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs. lm_eval.tasks.include_path(\\\"/path/to/my/custom/tasks\\\"). lm_eval.simple_evaluate(model=lm, tasks=[\\\"arc_easy\\\"])`. task_manager = TaskManager() # pass include_path=\\\"/path/to/my/custom/tasks\\\" if desired. lm_eval.simple_evaluate(model=lm, tasks=[\\\"arc_easy\\\"], task_manager=task_manager)`. Previous versions of the library incorrectly reported erroneously large `stderr` scores for groups of tasks such as MMLU. We've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see #1390 #1427 for more information. We're grateful for the community's support. 
**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.1...v0.4.2. ### EleutherAI/lm-evaluation-harness-v0.4.2.zip. | EleutherAI/lm-evaluation-harness-v0.4.2.zip md5:b975dd5fdc27c997493db2b723272b63 | 2.3 MB | Preview Download |.\"\n    },\n    {\n      \"title\": \"blockdialect - lm-evaluation-harness - Stanford GitLab\",\n      \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness\",\n      \"description\": \"This project provides a unified framework to test generative language models on a large number of different evaluation tasks.\"\n    },\n    {\n      \"title\": \"lm-evaluation-harness with LoRa fine tuned models Hugging Face\",\n      \"url\": \"https://stackoverflow.com/questions/78306200/lm-evaluation-harness-with-lora-fine-tuned-models-hugging-face\",\n      \"description\": \"They suggest how to use lm_eval when evaluating on a peft model: you should add in pretrained the model used for tuning and add peft=<PEFT_MODEL_NAME> into the\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "EleutherAI's lm-evaluation-harness: Architecture and Configuration",
        "url": "https://slyracoon23.github.io/blog/posts/2025-03-21_eleutherai-evaluation-methods.html",
        "description": "EleutherAI’s lm-evaluation-harness architecture diagram showing the relationship between models, tasks, and evaluation metrics. from lm_eval import evaluator, tasks from importfrom lm_eval.models import get_model from import model = get_model(\"hf\", pretrained=\"gpt2\") = \"hf\" = \"gpt2\"results = evaluator.evaluate(model, tasks=[\"hellaswag\"], num_fewshot=5) = = \"hellaswag\" = 5. import lm_eval import results = lm_eval.simple_evaluate(= model=\"gpt2\", = \"gpt2\" tasks=[\"hellaswag\", \"mmlu\"], = \"hellaswag\" \"mmlu\" num_fewshot=0 = 0). lm-eval --model hf --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 --tasks mmlu --num_fewshot 5 --apply_chat_templatelm-eval --model --model_args --tasks --num_fewshot --apply_chat_template. The chat template handling in lm-evaluation-harness has been updated to better support likelihood and multiple-choice based tasks with chat templates. lm-eval --model hf \\lm-eval --model \\ --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2,device_map=auto,trust_remote_code=true \\ --model_args \\ --tasks mmlu,hellaswag \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size auto \\ --batch_size \\ --output_path results/mistral-7b.json \\ --output_path \\ --apply_chat_template --apply_chat_template. # Set OPENAI_API_KEY environment variable first # Set OPENAI_API_KEY environment variable firstlm-eval --model openai-chat \\lm-eval --model \\ --model_args model=gpt-4-turbo \\ --model_args \\ --tasks mmlu,bbh \\ --tasks \\ --num_fewshot 5 \\ --num_fewshot \\ --batch_size 10 --batch_size."
      },
      {
        "title": "lm-evaluation-harness/docs/new_task_guide.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md",
        "description": "[Skip to content](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#start-of-content). [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You signed out in another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.You switched accounts on another tab or window. [Reload](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to refresh your session.Dismiss alert. [History](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/commits/main/docs/new_task_guide.md)History. [Raw](https://github.com/EleutherAI/lm-evaluation-harness/raw/refs/heads/main/docs/new_task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#new-task-guide). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setup). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#creating-a-yaml-file). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#selecting-and-configuring-a-dataset). Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hellaswag/utils.py)):. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-local-datasets). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-template). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#basic-prompts). 
For example in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` given the name of one of the feature each. `doc_to_choice` can be directly given a list of text as option (See [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)). if a dataset feature is already a list, you can set the name of the feature as `doc_to_choice` (See [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13)). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#writing-a-prompt-with-jinja-2). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#multiple-choice-format). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#using-python-functions-for-prompts). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#importing-a-prompt-from-promptsource). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#setting-metrics). For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#optional-more-advanced-setup). For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-name--tags-registering-a-task). 
[](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs). See [./task_guide.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md) for a full list of related sub-keys. [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#configuring-python-classes). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-validity). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#versioning). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#checking-performance--equivalence). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#task-validity-checklist). [](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#submitting-your-task)."
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "lm-evaluation-harness/docs/task_guide.md at main - GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md",
        "description": "The lm-evaluation-harness is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the"
      },
      {
        "title": "LM Harness Evaluations — NVIDIA NeMo Microservices",
        "url": "https://docs.nvidia.com/nemo/microservices/25.9.0/evaluate/flows/academic-benchmarks/lm-harness.html",
        "description": "# LM Harness Evaluations#. LM Evaluation Harness supports over 60 standard academic benchmarks for LLMs, including MMLU, GSM8K, and IFEval. Use this evaluation type to benchmark general language understanding and reasoning tasks. All LM Harness evaluations use the same target structure. | `type` | Always `\"model\"` for LM Harness evaluations. The URL of the API endpoint for the model. Always `\"model\"` for LM Harness evaluations. | `url` | The URL of the API endpoint for the model. Example LM Harness Tasks by Category#. | Math & Reasoning | `gsm8k` | Grade school and advanced math word problems. Math word problems and translation tasks in multiple languages. | Multilingual Tasks | `mgsm`, `wikilingua` | Math word problems and translation tasks in multiple languages. For the full list of LM Harness tasks, see the lm-evaluation-harness tasks directory or run `python -m lm_eval --tasks list`. For the full list of LM Harness tasks, refer to tasks. These parameters control how requests are made to the model:."
      },
      {
        "title": "Evaluating LLMs — EleutherAI",
        "url": "https://www.eleuther.ai/projects/large-language-model-evaluation",
        "description": "# Evaluating LLMs. Quantifying the performance of large language models is crucial to evaluating new techniques and validating new approaches so that different model releases can be compared objectively. LLMs are generally evaluated on several benchmark datasets and given scores, which serve numeric quantities to compare across models. However, model performance is often governed by minor implementation details. Often, papers do not provide the necessary code or sufficient detail to replicate their evaluations fully. To address these problems, we introduced the LM Evaluation Harness, a unifying framework that allows any causal language model to be tested on the same exact inputs and codebase. This provides a ground-truth location to evaluate new LLMs and saves practitioners time implementing few-shot evaluations repeatedly while ensuring that their results can be compared against previous work. The LM Eval Harness currently supports several different NLP tasks and model frameworks, all with a unified interface and task versioning for reproducibility. ## Training LLMs  Next."
      },
      {
        "title": "LLM evaluation | EleutherAI lm-evaluation-harness | Disassembly",
        "url": "https://medium.com/disassembly/llm-evaluation-eleutherai-lm-evaluation-harness-cc379495d545",
        "description": "# LLM evaluation | EleutherAI lm-evaluation-harness | by tony Kuo | Disassembly | Medium. ## A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity… ### Introduction. In the lm-evaluation-harness, the config task is the major way to select the pre-config evaluation task which decides the dataset, group, types of measurements, post-process setting, etc.). When evaluating language models, it’s important to use a variety of measures that fit various aspects of the evaluation task. Here are three types of measures **lm-evaluation-harness** provided. Loglikelihood, introduced in the MMLU for evaluating large language models (LLMs), is a metric used to measure the highest probability of a token in a sequence, particularly in multi-choice tasks. Here is a simple example using `huggingface-transformers` as the model provider, `Llama-3.2–1B` as the target LLM, and the task set to mmlu_flan_n_shot_generative_stem. Image 29: A Strategic Field Guide for Generative AI and Agent Evaluation: Techniques, Metrics and Maturity…."
      },
      {
        "title": "EleutherAI/lm-evaluation-harness: v0.4.2",
        "url": "https://zenodo.org/records/10829972",
        "description": "# EleutherAI/lm-evaluation-harness: v0.4.2. # lm-eval v0.4.2 Release Notes. We are releasing a new minor version of lm-eval for PyPI users! There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:. previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs. lm_eval.tasks.include_path(\"/path/to/my/custom/tasks\"). lm_eval.simple_evaluate(model=lm, tasks=[\"arc_easy\"])`. task_manager = TaskManager() # pass include_path=\"/path/to/my/custom/tasks\" if desired. lm_eval.simple_evaluate(model=lm, tasks=[\"arc_easy\"], task_manager=task_manager)`. Previous versions of the library incorrectly reported erroneously large `stderr` scores for groups of tasks such as MMLU. We've since updated the formula to correctly aggregate Standard Error scores for groups of tasks reporting accuracies aggregated via their mean across the dataset -- see #1390 #1427 for more information. We're grateful for the community's support. **Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.1...v0.4.2. ### EleutherAI/lm-evaluation-harness-v0.4.2.zip. | EleutherAI/lm-evaluation-harness-v0.4.2.zip md5:b975dd5fdc27c997493db2b723272b63 | 2.3 MB | Preview Download |."
      },
      {
        "title": "blockdialect - lm-evaluation-harness - Stanford GitLab",
        "url": "https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness",
        "description": "This project provides a unified framework to test generative language models on a large number of different evaluation tasks."
      },
      {
        "title": "lm-evaluation-harness with LoRa fine tuned models Hugging Face",
        "url": "https://stackoverflow.com/questions/78306200/lm-evaluation-harness-with-lora-fine-tuned-models-hugging-face",
        "description": "They suggest how to use lm_eval when evaluating on a peft model: you should add in pretrained the model used for tuning and add peft=<PEFT_MODEL_NAME> into the"
      }
    ]
  }
}

▼thought

▼✓Web search, result count: 10, query: "lm-evaluation-harness vs lmms-eval vs HELM comparison evaluation framework 2025"

▼✓Web search, result count: 10, query: "lm-evaluation-harness criticism limitations issues reproducibility"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"results\": [\n    {\n      \"title\": \"Conference Talk 16: A Deep Dive on LLM Evaluation – Christian Mills\",\n      \"url\": \"https://christianjmills.com/posts/mastering-llms-course-notes/conference-talk-016/\",\n      \"description\": \"# Conference Talk 16: A Deep Dive on LLM Evaluation. In this talk, **Hailey Schoelkopf** from **Eleuther AI** provides an overview of the challenges in LLM evaluation, exploring different measurement techniques, highlighting reproducibility issues, and advocating for best practices like sharing evaluation code and using task-specific downstream evaluations. * **Slides:** A Deep Dive on LM Evaluation. * **Topic:** Deep dive into the challenges and best practices of Large Language Model (LLM) evaluation. + Maintainer of the **LM Evaluation Harness**, a widely used open-source library for evaluating LLMs. ## Challenges in LLM Evaluation. This involves summing the log probabilities of each token in Y, conditioned on the preceding tokens in X and Y. - where \\\\(\\\\log p(y\\\\_i | x, y\\\\_0, \\\\ldots, y\\\\_{i-1})\\\\) is the log probability of the \\\\(i\\\\)-th target token conditioned on the full input \\\\(x\\\\) and the preceding target tokens. * **Calculation:** Based on the average per-token log probability of the text, with lower perplexity indicating a better fit to the data.\"\n    },\n    {\n      \"title\": \"Challenges in Language Model Evaluations\",\n      \"url\": \"https://lm-evaluation-challenges.github.io/\",\n      \"description\": \"Reproducibility challenges unique to LMs, including the importance of prompting, prompt engineering, and LMs' non-robustness to implementation details. 
Data\"\n    },\n    {\n      \"title\": \"blockdialect - lm-evaluation-harness - Stanford GitLab\",\n      \"url\": \"https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness\",\n      \"description\": \"Evaluation with publicly available prompts ensures reproducibility and comparability between papers. Easy support for custom prompts and evaluation metrics.\"\n    },\n    {\n      \"title\": \"LLM evaluation and reproducibility : r/MLQuestions - Reddit\",\n      \"url\": \"https://www.reddit.com/r/MLQuestions/comments/1ppqcjd/llm_evaluation_and_reproducibility/\",\n      \"description\": \"Ok so it's a problem with how the results are reported. I am seeing a 20% difference in accuracy compared to what is shown on the leader board\"\n    },\n    {\n      \"title\": \"GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub\",\n      \"url\": \"https://github.com/EleutherAI/lm-evaluation-harness\",\n      \"description\": \"* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. 
| Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).\"\n    },\n    {\n      \"title\": \"[Literature Review] Lessons from the Trenches on Reproducible ...\",\n      \"url\": \"https://www.themoonlight.io/en/review/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models\",\n      \"description\": \"The authors present multiple case studies illustrating how lm-eval has been practically applied to improve evaluation methods, detailing various modeling\"\n    },\n    {\n      \"title\": \"The LLM Evaluation Problem Nobody Talks About - Towards AI\",\n      \"url\": \"https://pub.towardsai.net/the-llm-evaluation-problem-nobody-talks-about-5e876a12a5af\",\n      \"description\": \"Your benchmark numbers are probably not reproducible. NeMo Evaluator SDK was built specifically to fix that and it's more opinionated than\"\n    },\n    {\n      \"title\": \"Lessons from the Trenches on Reproducible Evaluation ... 
- EleutherAI\",\n      \"url\": \"https://www.eleuther.ai/papers-blog/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models\",\n      \"description\": \"First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or\"\n    },\n    {\n      \"title\": \"Lessons from the Trenches on Reproducible Evaluation of ... - arXiv\",\n      \"url\": \"https://arxiv.org/html/2405.14782v1\",\n      \"description\": \"In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.\"\n    },\n    {\n      \"title\": \"Evaluating Large Language Models: Techniques, Challenges, and ...\",\n      \"url\": \"https://medium.com/@prag-matic/evaluating-large-language-models-techniques-challenges-and-best-practices-8b3f15f13253\",\n      \"description\": \"This blog delves into the multifaceted world of LLM evaluations, exploring methodologies, detailed evaluation metrics, challenges, and emerging trends.\"\n    }\n  ]\n}"
    }
  ],
  "details": {
    "results": [
      {
        "title": "Conference Talk 16: A Deep Dive on LLM Evaluation – Christian Mills",
        "url": "https://christianjmills.com/posts/mastering-llms-course-notes/conference-talk-016/",
        "description": "# Conference Talk 16: A Deep Dive on LLM Evaluation. In this talk, **Hailey Schoelkopf** from **Eleuther AI** provides an overview of the challenges in LLM evaluation, exploring different measurement techniques, highlighting reproducibility issues, and advocating for best practices like sharing evaluation code and using task-specific downstream evaluations. * **Slides:** A Deep Dive on LM Evaluation. * **Topic:** Deep dive into the challenges and best practices of Large Language Model (LLM) evaluation. + Maintainer of the **LM Evaluation Harness**, a widely used open-source library for evaluating LLMs. ## Challenges in LLM Evaluation. This involves summing the log probabilities of each token in Y, conditioned on the preceding tokens in X and Y. - where \\(\\log p(y\\_i | x, y\\_0, \\ldots, y\\_{i-1})\\) is the log probability of the \\(i\\)-th target token conditioned on the full input \\(x\\) and the preceding target tokens. * **Calculation:** Based on the average per-token log probability of the text, with lower perplexity indicating a better fit to the data."
      },
      {
        "title": "Challenges in Language Model Evaluations",
        "url": "https://lm-evaluation-challenges.github.io/",
        "description": "Reproducibility challenges unique to LMs, including the importance of prompting, prompt engineering, and LMs' non-robustness to implementation details. Data"
      },
      {
        "title": "blockdialect - lm-evaluation-harness - Stanford GitLab",
        "url": "https://code.stanford.edu/tambe-lab/blockdialect/-/tree/00252f91a22e172e2e28a4027ee2d640fc0492a4/lm-evaluation-harness",
        "description": "Evaluation with publicly available prompts ensures reproducibility and comparability between papers. Easy support for custom prompts and evaluation metrics."
      },
      {
        "title": "LLM evaluation and reproducibility : r/MLQuestions - Reddit",
        "url": "https://www.reddit.com/r/MLQuestions/comments/1ppqcjd/llm_evaluation_and_reproducibility/",
        "description": "Ok so it's a problem with how the results are reported. I am seeing a 20% difference in accuracy compared to what is shown on the leader board"
      },
      {
        "title": "GitHub - EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of language models. · GitHub",
        "url": "https://github.com/EleutherAI/lm-evaluation-harness",
        "description": "* [2024/07] [API model](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation. We treat Huggingface as the reference implementation and provide a [script](/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vllm results against HF. | Cohere | [⌛ - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |. | [Your local inference server!](/EleutherAI/lm-evaluation-harness/blob/main/docs/API_guide.md) | ✔️ | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface)."
      },
      {
        "title": "[Literature Review] Lessons from the Trenches on Reproducible ...",
        "url": "https://www.themoonlight.io/en/review/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models",
        "description": "The authors present multiple case studies illustrating how lm-eval has been practically applied to improve evaluation methods, detailing various modeling"
      },
      {
        "title": "The LLM Evaluation Problem Nobody Talks About - Towards AI",
        "url": "https://pub.towardsai.net/the-llm-evaluation-problem-nobody-talks-about-5e876a12a5af",
        "description": "Your benchmark numbers are probably not reproducible. NeMo Evaluator SDK was built specifically to fix that and it's more opinionated than"
      },
      {
        "title": "Lessons from the Trenches on Reproducible Evaluation ... - EleutherAI",
        "url": "https://www.eleuther.ai/papers-blog/lessons-from-the-trenches-on-reproducible-evaluation-of-language-models",
        "description": "First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or"
      },
      {
        "title": "Lessons from the Trenches on Reproducible Evaluation of ... - arXiv",
        "url": "https://arxiv.org/html/2405.14782v1",
        "description": "In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers."
      },
      {
        "title": "Evaluating Large Language Models: Techniques, Challenges, and ...",
        "url": "https://medium.com/@prag-matic/evaluating-large-language-models-techniques-challenges-and-best-practices-8b3f15f13253",
        "description": "This blog delves into the multifaceted world of LLM evaluations, exploring methodologies, detailed evaluation metrics, challenges, and emerging trends."
      }
    ]
  }
}
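
Note on the "Conference Talk 16" result above: its description says a continuation's log-likelihood is the sum of per-token log probabilities, and that perplexity is based on the average per-token log probability. A minimal Python sketch of that arithmetic follows. The function names and the sample probabilities are made up for illustration and are not lm-evaluation-harness APIs; the harness's own perplexity metrics may normalize differently (for example per word or per byte).

```python
import math


def sequence_loglikelihood(token_logprobs):
    """Sum of log p(y_i | x, y_0..y_{i-1}) over the target tokens."""
    return sum(token_logprobs)


def token_perplexity(token_logprobs):
    """exp of the negative mean per-token log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities for a three-token continuation.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]

ll = sequence_loglikelihood(logprobs)   # equals log(0.5 * 0.25 * 0.5)
ppl = token_perplexity(logprobs)        # equals (1 / 0.0625) ** (1 / 3)

print(f"log-likelihood = {ll:.4f}, perplexity = {ppl:.4f}")
```

In a multiple-choice setting, one plausible use is to compute `sequence_loglikelihood` for each candidate answer and pick the largest; whether to length-normalize first is a separate design choice that this sketch does not make.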

▼✓Fetch web page content, URL: "https://zenodo.org/records/17728786"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://zenodo.org/records/17728786\",\n  \"title\": \"EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes\",\n  \"summary\": \"This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum requir…\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article>\\n          \\n\\n              \\n\\n              \\n\\n              <div id=\\\"record-info\\\" aria-label=\\\"Publication date and version number\\\">\\n                    <p><span title=\\\"Publication date\\\">\\n                        Published November 26, 2025\\n                      </span>\\n                      <span> | Version v0.4.9.2</span>\\n                    </p>\\n                    <p><span role=\\\"note\\\" aria-label=\\\"Resource type\\\">\\n                          Software\\n                        </span>\\n                      \\n\\n                      <span role=\\\"note\\\" data-tooltip=\\\"The record and files are publicly accessible.\\\" data-inverted=\\\"\\\" aria-label=\\\"Access status\\\">\\n                        \\n                          \\n                        \\n                        <span aria-label=\\\"The record and files are publicly accessible.\\\">\\n                          Open\\n                        </span>\\n                      </span>\\n                    </p>\\n                  </div>\\n              <div id=\\\"record-title-section\\\" aria-label=\\\"Record title, authors and contributors\\\">\\n      <div>\\n          <h3>Authors/Creators</h3>\\n          <ul>\\n            \\n  \\n  <li>\\n    <a data-tooltip=\\\"Language Technologies Institute, CMU\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Lintang+Sutawika%22\\\">\\n\\n      <span>Lintang Sutawika</span><sup>1</sup></a>\\n  \\n\\n  \\n    \\n      
\\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hailey+Schoelkopf%22\\\">\\n\\n      <span>Hailey Schoelkopf</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leo+Gao%22\\\">\\n\\n      <span>Leo Gao</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Baber+Abbasi%22\\\">\\n\\n      <span>Baber Abbasi</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Booz Allen Hamilton, EleutherAI\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Stella+Biderman%22\\\">\\n\\n      <span>Stella Biderman</span><sup>2</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jonathan+Tow%22\\\">\\n\\n      <span>Jonathan Tow</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22ben+fattori%22\\\">\\n\\n      <span>ben fattori</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      
<i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Charles+Lovering%22\\\">\\n\\n      <span>Charles Lovering</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22farzanehnakhaee70%22\\\">\\n\\n      <span>farzanehnakhaee70</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jason+Phang%22\\\">\\n\\n      <span>Jason Phang</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"playscape.gg\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Anish+Thite%22\\\">\\n\\n      <span>Anish Thite</span><sup>3</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Fazz%22\\\">\\n\\n      <span>Fazz</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Max Planck Institute for Software Systems: MPI SWS\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Aflah%22\\\">\\n\\n      <span>Aflah</span><sup>4</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  
<li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Niklas%22\\\">\\n\\n      <span>Niklas</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"MistralAI\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Thomas+Wang%22\\\">\\n\\n      <span>Thomas Wang</span><sup>5</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22sdtblck%22\\\">\\n\\n      <span>sdtblck</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22gakada%22\\\">\\n\\n      <span>gakada</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22nopperl%22\\\">\\n\\n      <span>nopperl</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22researcher2%22\\\">\\n\\n      <span>researcher2</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22tttyuntian%22\\\">\\n\\n      <span>tttyuntian</span></a>\\n  \\n\\n  
\\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Hitz Zentroa UPV/EHU\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Julen+Etxaniz%22\\\">\\n\\n      <span>Julen Etxaniz</span><sup>6</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"@azurro\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Chris%22\\\">\\n\\n      <span>Chris</span><sup>7</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"MIT\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22James+A.+Michaelov%22\\\">\\n\\n      <span>James A. 
Michaelov</span><sup>8</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Shinhan Securities Co.\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hanwool+Albert+Lee%22\\\">\\n\\n      <span>Hanwool Albert Lee</span><sup>9</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Janna%22\\\">\\n\\n      <span>Janna</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leonid+Sinev%22\\\">\\n\\n      <span>Leonid Sinev</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Charles University\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Zden%C4%9Bk+Kasner%22\\\">\\n\\n      <span>Zdeněk Kasner</span><sup>10</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a data-tooltip=\\\"Open Source Developer @ IBM\\\" href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Kiersten+Stokes%22\\\">\\n\\n      <span>Kiersten Stokes</span><sup>11</sup></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a 
href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Khalid%22\\\">\\n\\n      <span>Khalid</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n  <li>\\n    <a href=\\\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22KonradSzafer%22\\\">\\n\\n      <span>KonradSzafer</span></a>\\n  \\n\\n  \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n      \\n    \\n  \\n\\n  \\n  \\n    \\n      <i></i>\\n    \\n  \\n</li>\\n  \\n          </ul>\\n      </div>\\n\\n        \\n        \\n          \\n\\n\\n<section id=\\\"creators-affiliations\\\" aria-label=\\\"Affiliations for creators\\\">\\n  <ul>\\n    \\n    <li>\\n      1.\\n\\n      \\n\\n      Language Technologies Institute, CMU\\n    </li>\\n  \\n    <li>\\n      2.\\n\\n      \\n\\n      Booz Allen Hamilton, EleutherAI\\n    </li>\\n  \\n    <li>\\n      3.\\n\\n      \\n\\n      playscape.gg\\n    </li>\\n  \\n    <li>\\n      4.\\n\\n      \\n\\n      Max Planck Institute for Software Systems: MPI SWS\\n    </li>\\n  \\n    <li>\\n      5.\\n\\n      \\n\\n      MistralAI\\n    </li>\\n  \\n    <li>\\n      6.\\n\\n      \\n\\n      Hitz Zentroa UPV/EHU\\n    </li>\\n  \\n    <li>\\n      7.\\n\\n      \\n\\n      @azurro\\n    </li>\\n  \\n    <li>\\n      8.\\n\\n      \\n\\n      MIT\\n    </li>\\n  \\n    <li>\\n      9.\\n\\n      \\n\\n      Shinhan Securities Co.\\n    </li>\\n  \\n    <li>\\n      10.\\n\\n      \\n\\n      Charles University\\n    </li>\\n  \\n    <li>\\n      11.\\n\\n      \\n\\n      Open Source Developer @ IBM\\n    </li>\\n  \\n\\n    </ul>\\n</section>\\n\\n        \\n\\n    </div>\\n\\n\\n\\n  <section id=\\\"description\\\" aria-label=\\\"Record description\\\">\\n    <h2 id=\\\"description-heading\\\">Description</h2>\\n    \\n    <div>\\n      <p>This release continues our steady stream of community contributions with 
a batch of new benchmarks, expanded model support, and important fixes. A notable change: <strong>Python 3.10 is now the minimum required version</strong>.</p>\\n<h3>New Benchmarks &amp; Tasks</h3>\\n<p>A big wave of new evaluation tasks this release:</p>\\n<ul>\\n<li><strong>AIME</strong> and <strong>MATH500</strong> math reasoning benchmarks by @jannalulu in #3248, #3311</li>\\n<li><strong>BabiLong</strong> and <strong>Longbench v2</strong> for long-context evaluation by @jannalulu in #3287, #3338</li>\\n<li><strong>GraphWalks</strong> by @jannalulu in #3377</li>\\n<li><strong>ZhoBLiMP</strong>, <strong>BLiMP-NL</strong>, <strong>TurBLiMP</strong>, <strong>LM-SynEval</strong>, and <strong>BHS</strong> linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265</li>\\n<li><strong>Icelandic WinoGrande</strong> by @jmichaelov in #3277</li>\\n<li><strong>CLIcK</strong> Korean benchmark by @shing100 in #3173</li>\\n<li><strong>MMLU-Redux</strong> (generative) and Spanish translation by @luiscosio in #2705</li>\\n<li><strong>EsBBQ</strong> and <strong>CaBBQ</strong> bias benchmarks by @valleruizf in #3167</li>\\n<li><strong>EQBench</strong> in Spanish and Catalan by @priverabsc in #3168</li>\\n<li><strong>Anthropic discrim-eval</strong> by @Helw150 in #3091</li>\\n<li><strong>XNLI-VA</strong> by @FranValero97 in #3194</li>\\n<li><strong>Bangla MMLU</strong> (Titulm) by @Ismail-Hossain-1 in #3317</li>\\n<li><strong>HumanEval infilling</strong> by @its-alpesh in #3299</li>\\n<li><strong>CNN-DailyMail 3.0.0</strong> by @preordinary in #3426</li>\\n<li><strong>Global PIQA</strong> and new <code>acc_norm_bytes</code> metric by @baberabb in #3368</li>\\n</ul>\\n<h3>Fixes &amp; Improvements</h3>\\n<p><strong>Core Changes:</strong></p>\\n<ul>\\n<li><strong>Python 3.10 minimum</strong> by @jannalulu in #3337</li>\\n<li><strong>Unpinned <code>datasets</code></strong> library by @baberabb in #3316</li>\\n<li><strong>BOS token handling</strong>: Delegate to 
tokenizer; <code>add_bos_token</code> now defaults to <code>None</code> by @baberabb in #3347</li>\\n<li>Renamed <code>LOGLEVEL</code> env var to <code>LMEVAL_LOG_LEVEL</code> to avoid conflicts by @fxmarty-amd in #3418</li>\\n<li>Resolve duplicate task names with safeguards by @giuliolovisotto in #3394</li>\\n</ul>\\n<p><strong>Task Fixes:</strong></p>\\n<ul>\\n<li>Fixed MMLU-Redux to exclude samples without <code>error_type=\\\"ok\\\"</code> and display summary table by @fxmarty-amd in #3410, #3406</li>\\n<li>Fixed AIME answer extraction by @jannalulu in #3353</li>\\n<li>Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361</li>\\n<li>Fixed <code>crows_pairs</code> dataset by @jannalulu in #3378</li>\\n<li>Fixed Gemma tokenizer <code>add_bos_token</code> not updating by @DarkLight1337 in #3206</li>\\n<li>Fixed <code>lambada_multilingual_stablelm</code> by @jmichaelov, @HallerPatrick in #3294, #3222</li>\\n<li>Fixed CodeXGLUE by @gsaltintas in #3238</li>\\n<li>Pinned correct MMLUSR version by @christinaexyou in #3350</li>\\n<li>Updated <code>minerva_math</code> by @baberabb in #3259</li>\\n</ul>\\n<p><strong>Backend Fixes:</strong></p>\\n<ul>\\n<li>Fixed vLLM import errors when not installed by @fxmarty-amd in #3292</li>\\n<li>Fixed vLLM <code>data_parallel_size&gt;1</code> issue by @Dornavineeth in #3303</li>\\n<li>Resolved deprecated <code>vllm.utils.get_open_port</code> by @DarkLight1337 in #3398</li>\\n<li>Fixed GPT series model bugs by @zinccat in #3348</li>\\n<li>Fixed PIL image hashing to use actual bytes by @tboerstad in #3331</li>\\n<li>Fixed <code>additional_config</code> parsing by @brian-dellabetta in #3393</li>\\n<li>Fixed batch chunking seed handling with groupby by @slimfrkha in #3047</li>\\n<li>Fixed no-output error handling by @Oseltamivir in #3395</li>\\n<li>Replaced deprecated <code>torch_dtype</code> with <code>dtype</code> by @AbdulmalikDS in #3415</li>\\n<li>Fixed custom task config reading by 
@SkyR0ver in #3425</li>\\n</ul>\\n<h3>Model &amp; Backend Support</h3>\\n<ul>\\n<li><strong>OpenAI GPT-5</strong> support by @babyplutokurt in #3247</li>\\n<li><strong>Azure OpenAI</strong> support by @zinccat in #3349</li>\\n<li><strong>Fine-tuned Gemma3</strong> evaluation support by @LearnerSXH in #3234</li>\\n<li><strong>OpenVINO text2text</strong> models by @nikita-savelyevv in #3101</li>\\n<li><strong>Intel XPU</strong> support for HFLM by @kaixuanliu in #3211</li>\\n<li><strong>Attention head steering</strong> support by @luciaquirke in #3279</li>\\n<li>Leverage vLLM's <code>tokenizer_info</code> endpoint to avoid manual duplication by @m-misiura in #3185</li>\\n</ul>\\n<h2>What's Changed</h2>\\n<ul>\\n<li>Remove <code>trust_remote_code: True</code> from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213</li>\\n<li>Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\\n<li>Fix <code>add_bos_token</code> not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206</li>\\n<li>remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\\n<li>Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\\n<li>Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\\n<li>Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\\n<li>Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218</li>\\n<li>Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221</li>\\n<li>Add TurBLiMP by @jmichaelov in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3219</li>\\n<li>Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184</li>\\n<li>Fix unknown group key to tag in yaml config for <code>lambada_multilingual_stablelm</code> by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\\n<li>update <code>minerva_math</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259</li>\\n<li>feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173</li>\\n<li>Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\\n<li>Add support for OpenVINO text2text generation models  by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\\n<li>Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\\n<li>Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\\n<li>feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\\n<li>pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268</li>\\n<li>Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\\n<li>Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265</li>\\n<li>Add <code>acc_norm</code> metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272</li>\\n<li>Add <code>acc_norm</code> metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271</li>\\n<li>Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\\n<li>Add support for steering 
individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279</li>\\n<li>Add the Icelandic WinoGrande benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3277</li>\\n<li>Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047</li>\\n<li>[fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292</li>\\n<li>Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\\n<li>add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\\n<li>feat: Add mmlu-redux and it's spanish transaltion as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705</li>\\n<li>Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287</li>\\n<li>Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296</li>\\n<li>Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\\n<li>Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\\n<li>[fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321</li>\\n<li>Fix: VLLM model when data_parallel_size&gt;1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\\n<li>unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316</li>\\n<li>bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337</li>\\n<li>Longbench v2 by @jannalulu in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3338</li>\\n<li>Leverage vllm's <code>tokenizer_info</code> endpoint to avoid manual duplication  by @m-misiura in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\\n<li>Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\\n<li>remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343</li>\\n<li>Align <code>humaneval_64_instruct</code> task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344</li>\\n<li>Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\\n<li>[fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353</li>\\n<li>add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368</li>\\n<li>[fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378</li>\\n<li>Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\\n<li>fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\\n<li>added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349</li>\\n<li>Delegate BOS to the tokenizer; <code>add_bos_token</code> defaults to <code>None</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347</li>\\n<li>fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361</li>\\n<li>[feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377</li>\\n<li>Longbench group fix by 
@jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359</li>\\n<li>Resolve deprecation of <code>vllm.utils.get_open_port</code> by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3398</li>\\n<li>Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\\n<li>Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\\n<li>Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\\n<li>[MMLU redux] Do not use samples which do not have <code>error_type=\\\"ok\\\"</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410</li>\\n<li>fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394</li>\\n<li>Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311</li>\\n<li>[bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\\n<li>fix(tasks):pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\\n<li>Fix <code>lambada_multilingual_stablelm</code> by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294</li>\\n<li>Fix descriptions in the Moral Stories and Histoires Morales tasks. 
by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374</li>\\n<li>Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\\n<li>[fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406</li>\\n<li>Rename the conflicting environment variable <code>LOGLEVEL</code> to <code>LMEVAL_LOG_LEVEL</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418</li>\\n<li>Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\\n<li>Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\\n<li>New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\\n</ul>\\n<h2>New Contributors</h2>\\n<ul>\\n<li>@LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\\n<li>@ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\\n<li>@Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\\n<li>@babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\\n<li>@FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\\n<li>@HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\\n<li>@Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\\n<li>@nikita-savelyevv made their first contribution in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\\n<li>@weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\\n<li>@jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\\n<li>@slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\\n<li>@gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\\n<li>@valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\\n<li>@TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\\n<li>@kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\\n<li>@its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\\n<li>@priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\\n<li>@Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\\n<li>@m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\\n<li>@Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\\n<li>@zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\\n<li>@marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\\n<li>@neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\\n<li>@ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\\n<li>@Oseltamivir made their first 
contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\\n<li>@tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\\n<li>@brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\\n<li>@christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\\n<li>@AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\\n<li>@Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\\n<li>@SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\\n<li>@preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\\n</ul>\\n<p><strong>Full Changelog</strong>: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2</p>\\n    </div>\\n  </section>\\n\\n\\n<section id=\\\"record-files\\\" aria-label=\\\"Files\\\"><h2 id=\\\"files-heading\\\">Files</h2>\\n            \\n  <div href=\\\"#files-preview-accordion-panel\\\">\\n    <h3>\\n      <p><span id=\\\"preview-file-title\\\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</span>\\n        \\n      </p>\\n    </h3>\\n    \\n  </div>\\n  <div href=\\\"#files-list-accordion-panel\\\">\\n    <h3>\\n      <p>\\n        Files\\n        <small> (10.0 MB)</small>\\n        \\n      </p>\\n    </h3>\\n\\n    <div role=\\\"region\\\" id=\\\"files-list-accordion-panel\\\" aria-labelledby=\\\"files-list-accordion-trigger\\\">\\n        \\n  <table id=\\\"file-list-table\\\">\\n    <thead>\\n      <tr>\\n        <th>Name</th>\\n        <th>Size</th>\\n        <th>\\n            \\n            <a role=\\\"button\\\" href=\\\"https://zenodo.org/api/records/17728786/files-archive\\\">\\n               Download all\\n    
        </a>\\n        </th>\\n      </tr>\\n    </thead>\\n    <tbody>\\n    \\n      \\n        \\n        \\n        <tr>\\n          <td>\\n            <div>\\n              <p><a href=\\\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\\\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</a>\\n            </p></div>\\n            <small>md5:a1821ba45eee3362f49140f332ff6512\\n            \\n            </small>\\n          </td>\\n          <td>10.0 MB</td>\\n          <td>\\n            <span>\\n              \\n                <a role=\\\"button\\\" href=\\\"https://zenodo.org/records/17728786/preview/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?include_deleted=0\\\" target=\\\"preview-iframe\\\" data-file-key=\\\"EleutherAI/lm-evaluation-harness-v0.4.9.2.zip\\\">\\n                  Preview\\n                </a>\\n              \\n              <a role=\\\"button\\\" href=\\\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\\\">\\n                Download\\n              </a>\\n            </span>\\n          </td>\\n        </tr>\\n      \\n    \\n    </tbody>\\n  </table>\\n      </div>\\n  </div>\\n\\n    </section>\\n              \\n\\n  <section id=\\\"additional-details\\\" aria-label=\\\"Additional record details\\\">\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n  <h2 id=\\\"record-details-heading\\\">Additional details</h2>\\n\\n  \\n\\n  \\n\\n  \\n\\n  \\n    \\n      \\n    \\n    \\n  \\n\\n  \\n\\n  \\n\\n  \\n      \\n      \\n      \\n      \\n        \\n          \\n        \\n      \\n        \\n      \\n        \\n      \\n\\n      \\n        \\n        \\n      \\n      \\n      \\n      \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n      \\n        \\n     
 \\n\\n  </section>\\n    \\n    <div id=\\\"citations-search\\\" data-record-pids=\\\"{&quot;doi&quot;: {&quot;client&quot;: &quot;datacite&quot;, &quot;identifier&quot;: &quot;10.5281/zenodo.17728786&quot;, &quot;provider&quot;: &quot;datacite&quot;}, &quot;oai&quot;: {&quot;identifier&quot;: &quot;oai:zenodo.org:17728786&quot;, &quot;provider&quot;: &quot;oai&quot;}}\\\" data-record-parent-pids=\\\"{&quot;doi&quot;: {&quot;client&quot;: &quot;datacite&quot;, &quot;identifier&quot;: &quot;10.5281/zenodo.5371628&quot;, &quot;provider&quot;: 
&quot;datacite&quot;}}\\\" data-citations-endpoint=\\\"https://zenodo-broker.web.cern.ch/api/relationships\\\" aria-label=\\\"Record citations\\\"></div>\\n  \\n        </article></div>\"\n}"
    }
  ],
  "details": {
    "url": "https://zenodo.org/records/17728786",
    "title": "EleutherAI/lm-evaluation-harness: lm-eval v0.4.9.2 Release Notes",
    "summary": "This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum requir…",
    "content": "<div id=\"readability-page-1\" class=\"page\"><article>\n          \n\n              \n\n              \n\n              <div id=\"record-info\" aria-label=\"Publication date and version number\">\n                    <p><span title=\"Publication date\">\n                        Published November 26, 2025\n                      </span>\n                      <span> | Version v0.4.9.2</span>\n                    </p>\n                    <p><span role=\"note\" aria-label=\"Resource type\">\n                          Software\n                        </span>\n                      \n\n                      <span role=\"note\" data-tooltip=\"The record and files are publicly accessible.\" data-inverted=\"\" aria-label=\"Access status\">\n                        \n                          \n                        \n                        <span aria-label=\"The record and files are publicly accessible.\">\n                          Open\n                        </span>\n                      </span>\n                    </p>\n                  </div>\n              <div id=\"record-title-section\" aria-label=\"Record title, authors and contributors\">\n      <div>\n          <h3>Authors/Creators</h3>\n          <ul>\n            \n  \n  <li>\n    <a data-tooltip=\"Language Technologies Institute, CMU\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Lintang+Sutawika%22\">\n\n      <span>Lintang Sutawika</span><sup>1</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hailey+Schoelkopf%22\">\n\n      <span>Hailey Schoelkopf</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a 
href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leo+Gao%22\">\n\n      <span>Leo Gao</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Baber+Abbasi%22\">\n\n      <span>Baber Abbasi</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a data-tooltip=\"Booz Allen Hamilton, EleutherAI\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Stella+Biderman%22\">\n\n      <span>Stella Biderman</span><sup>2</sup></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jonathan+Tow%22\">\n\n      <span>Jonathan Tow</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22ben+fattori%22\">\n\n      <span>ben fattori</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Charles+Lovering%22\">\n\n      <span>Charles Lovering</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  <li>\n    <a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22farzanehnakhaee70%22\">\n\n      <span>farzanehnakhaee70</span></a>\n  \n\n  \n    \n      \n    \n      \n    \n      \n    \n      \n    \n  \n\n  \n  \n    \n      <i></i>\n    \n  \n</li>\n  \n  
<li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Jason+Phang%22\"><span>Jason Phang</span></a></li>\n  <li><a data-tooltip=\"playscape.gg\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Anish+Thite%22\"><span>Anish Thite</span><sup>3</sup></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Fazz%22\"><span>Fazz</span></a></li>\n  <li><a data-tooltip=\"Max Planck Institute for Software Systems: MPI SWS\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Aflah%22\"><span>Aflah</span><sup>4</sup></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Niklas%22\"><span>Niklas</span></a></li>\n  <li><a data-tooltip=\"MistralAI\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Thomas+Wang%22\"><span>Thomas Wang</span><sup>5</sup></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22sdtblck%22\"><span>sdtblck</span></a></li>\n  
<li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22gakada%22\"><span>gakada</span></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22nopperl%22\"><span>nopperl</span></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22researcher2%22\"><span>researcher2</span></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22tttyuntian%22\"><span>tttyuntian</span></a></li>\n  <li><a data-tooltip=\"Hitz Zentroa UPV/EHU\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Julen+Etxaniz%22\"><span>Julen Etxaniz</span><sup>6</sup></a></li>\n  <li><a data-tooltip=\"@azurro\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Chris%22\"><span>Chris</span><sup>7</sup></a></li>\n  <li><a data-tooltip=\"MIT\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22James+A.+Michaelov%22\"><span>James A. Michaelov</span><sup>8</sup></a></li>\n  
<li><a data-tooltip=\"Shinhan Securities Co.\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Hanwool+Albert+Lee%22\"><span>Hanwool Albert Lee</span><sup>9</sup></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Janna%22\"><span>Janna</span></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Leonid+Sinev%22\"><span>Leonid Sinev</span></a></li>\n  <li><a data-tooltip=\"Charles University\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Zden%C4%9Bk+Kasner%22\"><span>Zdeněk Kasner</span><sup>10</sup></a></li>\n  <li><a data-tooltip=\"Open Source Developer @ IBM\" href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Kiersten+Stokes%22\"><span>Kiersten Stokes</span><sup>11</sup></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22Khalid%22\"><span>Khalid</span></a></li>\n  <li><a href=\"https://zenodo.org/search?q=metadata.creators.person_or_org.name:%22KonradSzafer%22\"><span>KonradSzafer</span></a></li>\n          </ul>\n      </div>\n\n
<section id=\"creators-affiliations\" aria-label=\"Affiliations for creators\">\n  <ul>\n    <li>1. Language Technologies Institute, CMU</li>\n    <li>2. Booz Allen Hamilton, EleutherAI</li>\n    <li>3. playscape.gg</li>\n    <li>4. Max Planck Institute for Software Systems: MPI SWS</li>\n    <li>5. MistralAI</li>\n    <li>6. Hitz Zentroa UPV/EHU</li>\n    <li>7. @azurro</li>\n    <li>8. MIT</li>\n    <li>9. Shinhan Securities Co.</li>\n    <li>10. Charles University</li>\n    <li>11. Open Source Developer @ IBM</li>\n  </ul>\n</section>\n\n    </div>\n\n  <section id=\"description\" aria-label=\"Record description\">\n    <h2 id=\"description-heading\">Description</h2>\n    \n    <div>\n      <p>This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. 
A notable change: <strong>Python 3.10 is now the minimum required version</strong>.</p>\n<h3>New Benchmarks &amp; Tasks</h3>\n<p>A big wave of new evaluation tasks this release:</p>\n<ul>\n<li><strong>AIME</strong> and <strong>MATH500</strong> math reasoning benchmarks by @jannalulu in #3248, #3311</li>\n<li><strong>BabiLong</strong> and <strong>Longbench v2</strong> for long-context evaluation by @jannalulu in #3287, #3338</li>\n<li><strong>GraphWalks</strong> by @jannalulu in #3377</li>\n<li><strong>ZhoBLiMP</strong>, <strong>BLiMP-NL</strong>, <strong>TurBLiMP</strong>, <strong>LM-SynEval</strong>, and <strong>BHS</strong> linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265</li>\n<li><strong>Icelandic WinoGrande</strong> by @jmichaelov in #3277</li>\n<li><strong>CLIcK</strong> Korean benchmark by @shing100 in #3173</li>\n<li><strong>MMLU-Redux</strong> (generative) and Spanish translation by @luiscosio in #2705</li>\n<li><strong>EsBBQ</strong> and <strong>CaBBQ</strong> bias benchmarks by @valleruizf in #3167</li>\n<li><strong>EQBench</strong> in Spanish and Catalan by @priverabsc in #3168</li>\n<li><strong>Anthropic discrim-eval</strong> by @Helw150 in #3091</li>\n<li><strong>XNLI-VA</strong> by @FranValero97 in #3194</li>\n<li><strong>Bangla MMLU</strong> (Titulm) by @Ismail-Hossain-1 in #3317</li>\n<li><strong>HumanEval infilling</strong> by @its-alpesh in #3299</li>\n<li><strong>CNN-DailyMail 3.0.0</strong> by @preordinary in #3426</li>\n<li><strong>Global PIQA</strong> and new <code>acc_norm_bytes</code> metric by @baberabb in #3368</li>\n</ul>\n<h3>Fixes &amp; Improvements</h3>\n<p><strong>Core Changes:</strong></p>\n<ul>\n<li><strong>Python 3.10 minimum</strong> by @jannalulu in #3337</li>\n<li><strong>Unpinned <code>datasets</code></strong> library by @baberabb in #3316</li>\n<li><strong>BOS token handling</strong>: Delegate to tokenizer; <code>add_bos_token</code> now defaults to <code>None</code> by @baberabb in 
#3347</li>\n<li>Renamed <code>LOGLEVEL</code> env var to <code>LMEVAL_LOG_LEVEL</code> to avoid conflicts by @fxmarty-amd in #3418</li>\n<li>Resolve duplicate task names with safeguards by @giuliolovisotto in #3394</li>\n</ul>\n<p><strong>Task Fixes:</strong></p>\n<ul>\n<li>Fixed MMLU-Redux to exclude samples without <code>error_type=\"ok\"</code> and display summary table by @fxmarty-amd in #3410, #3406</li>\n<li>Fixed AIME answer extraction by @jannalulu in #3353</li>\n<li>Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361</li>\n<li>Fixed <code>crows_pairs</code> dataset by @jannalulu in #3378</li>\n<li>Fixed Gemma tokenizer <code>add_bos_token</code> not updating by @DarkLight1337 in #3206</li>\n<li>Fixed <code>lambada_multilingual_stablelm</code> by @jmichaelov, @HallerPatrick in #3294, #3222</li>\n<li>Fixed CodeXGLUE by @gsaltintas in #3238</li>\n<li>Pinned correct MMLUSR version by @christinaexyou in #3350</li>\n<li>Updated <code>minerva_math</code> by @baberabb in #3259</li>\n</ul>\n<p><strong>Backend Fixes:</strong></p>\n<ul>\n<li>Fixed vLLM import errors when not installed by @fxmarty-amd in #3292</li>\n<li>Fixed vLLM <code>data_parallel_size&gt;1</code> issue by @Dornavineeth in #3303</li>\n<li>Resolved deprecated <code>vllm.utils.get_open_port</code> by @DarkLight1337 in #3398</li>\n<li>Fixed GPT series model bugs by @zinccat in #3348</li>\n<li>Fixed PIL image hashing to use actual bytes by @tboerstad in #3331</li>\n<li>Fixed <code>additional_config</code> parsing by @brian-dellabetta in #3393</li>\n<li>Fixed batch chunking seed handling with groupby by @slimfrkha in #3047</li>\n<li>Fixed no-output error handling by @Oseltamivir in #3395</li>\n<li>Replaced deprecated <code>torch_dtype</code> with <code>dtype</code> by @AbdulmalikDS in #3415</li>\n<li>Fixed custom task config reading by @SkyR0ver in #3425</li>\n</ul>\n<h3>Model &amp; Backend Support</h3>\n<ul>\n<li><strong>OpenAI GPT-5</strong> support by 
@babyplutokurt in #3247</li>\n<li><strong>Azure OpenAI</strong> support by @zinccat in #3349</li>\n<li><strong>Fine-tuned Gemma3</strong> evaluation support by @LearnerSXH in #3234</li>\n<li><strong>OpenVINO text2text</strong> models by @nikita-savelyevv in #3101</li>\n<li><strong>Intel XPU</strong> support for HFLM by @kaixuanliu in #3211</li>\n<li><strong>Attention head steering</strong> support by @luciaquirke in #3279</li>\n<li>Leverage vLLM's <code>tokenizer_info</code> endpoint to avoid manual duplication by @m-misiura in #3185</li>\n</ul>\n<h2>What's Changed</h2>\n<ul>\n<li>Remove <code>trust_remote_code: True</code> from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213</li>\n<li>Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\n<li>Fix <code>add_bos_token</code> not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206</li>\n<li>remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\n<li>Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\n<li>Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\n<li>Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\n<li>Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218</li>\n<li>Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221</li>\n<li>Add TurBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3219</li>\n<li>Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184</li>\n<li>Fix unknown group key to 
tag in yaml config for <code>lambada_multilingual_stablelm</code> by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\n<li>update <code>minerva_math</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259</li>\n<li>feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173</li>\n<li>Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\n<li>Add support for OpenVINO text2text generation models  by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\n<li>Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\n<li>Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\n<li>feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\n<li>pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268</li>\n<li>Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\n<li>Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265</li>\n<li>Add <code>acc_norm</code> metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272</li>\n<li>Add <code>acc_norm</code> metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271</li>\n<li>Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\n<li>Add support for steering individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279</li>\n<li>Add the Icelandic WinoGrande benchmark by @jmichaelov in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3277</li>\n<li>Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047</li>\n<li>[fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292</li>\n<li>Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\n<li>add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\n<li>feat: Add mmlu-redux and it's spanish transaltion as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705</li>\n<li>Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287</li>\n<li>Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296</li>\n<li>Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\n<li>Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\n<li>[fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321</li>\n<li>Fix: VLLM model when data_parallel_size&gt;1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\n<li>unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316</li>\n<li>bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337</li>\n<li>Longbench v2 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3338</li>\n<li>Leverage vllm's <code>tokenizer_info</code> endpoint to avoid manual duplication  by @m-misiura in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\n<li>Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\n<li>remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343</li>\n<li>Align <code>humaneval_64_instruct</code> task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344</li>\n<li>Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\n<li>[fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353</li>\n<li>add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368</li>\n<li>[fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378</li>\n<li>Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\n<li>fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\n<li>added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349</li>\n<li>Delegate BOS to the tokenizer; <code>add_bos_token</code> defaults to <code>None</code> by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347</li>\n<li>fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361</li>\n<li>[feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377</li>\n<li>Longbench group fix by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359</li>\n<li>Resolve deprecation of <code>vllm.utils.get_open_port</code> by @DarkLight1337 in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3398</li>\n<li>Trim whitespace in remove_whitespace filter by @ziqing-huang in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\n<li>Fixes #3391 avoid error on no-output by @Oseltamivir in https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\n<li>Fix PIL image hashing to use actual bytes instead of object repr by @tboerstad in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\n<li>[MMLU redux] Do not use samples which do not have <code>error_type=\"ok\"</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3410</li>\n<li>fix: resolve duplicate task names and add safeguards. by @giuliolovisotto in https://github.com/EleutherAI/lm-evaluation-harness/pull/3394</li>\n<li>Add MATH500 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3311</li>\n<li>[bugfix] additional_config parsing by @brian-dellabetta in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\n<li>fix(tasks):pin correct MMLUSR version by @christinaexyou in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\n<li>Fix <code>lambada_multilingual_stablelm</code> by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3294</li>\n<li>Fix descriptions in the Moral Stories and Histoires Morales tasks. 
by @upunaprosk in https://github.com/EleutherAI/lm-evaluation-harness/pull/3374</li>\n<li>Replace deprecated torch_dtype parameter with dtype by @AbdulmalikDS in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\n<li>[fix] Fix mmlu_redux not displaying summary table + display to the user the tasks / yaml that are actually pulled by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3406</li>\n<li>Rename the conflicting environment variable <code>LOGLEVEL</code> to <code>LMEVAL_LOG_LEVEL</code> by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3418</li>\n<li>Update SGLang installation and documentation links by @Bobchenyx in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\n<li>Fix reading custom task configs by @SkyR0ver in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\n<li>New Task: Add CNN-DailyMail (3.0.0) by @preordinary in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\n</ul>\n<h2>New Contributors</h2>\n<ul>\n<li>@LearnerSXH made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234</li>\n<li>@ceferisbarov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242</li>\n<li>@Anri-Lombard made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246</li>\n<li>@babyplutokurt made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247</li>\n<li>@FranValero97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194</li>\n<li>@HallerPatrick made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222</li>\n<li>@Helw150 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091</li>\n<li>@nikita-savelyevv made their first contribution in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3101</li>\n<li>@weihao1115 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174</li>\n<li>@jannalulu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248</li>\n<li>@slimfrkha made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267</li>\n<li>@gsaltintas made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238</li>\n<li>@valleruizf made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167</li>\n<li>@TimurAysin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273</li>\n<li>@kaixuanliu made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211</li>\n<li>@its-alpesh made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299</li>\n<li>@priverabsc made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168</li>\n<li>@Dornavineeth made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303</li>\n<li>@m-misiura made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185</li>\n<li>@Ismail-Hossain-1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317</li>\n<li>@zinccat made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348</li>\n<li>@marksverdhei made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356</li>\n<li>@neoheartbeats made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388</li>\n<li>@ziqing-huang made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3408</li>\n<li>@Oseltamivir made their first contribution in 
https://github.com/EleutherAI/lm-evaluation-harness/pull/3395</li>\n<li>@tboerstad made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3331</li>\n<li>@brian-dellabetta made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3393</li>\n<li>@christinaexyou made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3350</li>\n<li>@AbdulmalikDS made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3415</li>\n<li>@Bobchenyx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3422</li>\n<li>@SkyR0ver made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3425</li>\n<li>@preordinary made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3426</li>\n</ul>\n<p><strong>Full Changelog</strong>: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.1...v0.4.9.2</p>\n    </div>\n  </section>\n\n\n<section id=\"record-files\" aria-label=\"Files\"><h2 id=\"files-heading\">Files</h2>\n            \n  <div href=\"#files-preview-accordion-panel\">\n    <h3>\n      <p><span id=\"preview-file-title\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</span>\n        \n      </p>\n    </h3>\n    \n  </div>\n  <div href=\"#files-list-accordion-panel\">\n    <h3>\n      <p>\n        Files\n        <small> (10.0 MB)</small>\n        \n      </p>\n    </h3>\n\n    <div role=\"region\" id=\"files-list-accordion-panel\" aria-labelledby=\"files-list-accordion-trigger\">\n        \n  <table id=\"file-list-table\">\n    <thead>\n      <tr>\n        <th>Name</th>\n        <th>Size</th>\n        <th>\n            \n            <a role=\"button\" href=\"https://zenodo.org/api/records/17728786/files-archive\">\n               Download all\n            </a>\n        </th>\n      </tr>\n    </thead>\n    <tbody>\n    \n      \n        \n        \n     
   <tr>\n          <td>\n            <div>\n              <p><a href=\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\">EleutherAI/lm-evaluation-harness-v0.4.9.2.zip</a>\n            </p></div>\n            <small>md5:a1821ba45eee3362f49140f332ff6512\n            \n            </small>\n          </td>\n          <td>10.0 MB</td>\n          <td>\n            <span>\n              \n                <a role=\"button\" href=\"https://zenodo.org/records/17728786/preview/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?include_deleted=0\" target=\"preview-iframe\" data-file-key=\"EleutherAI/lm-evaluation-harness-v0.4.9.2.zip\">\n                  Preview\n                </a>\n              \n              <a role=\"button\" href=\"https://zenodo.org/records/17728786/files/EleutherAI/lm-evaluation-harness-v0.4.9.2.zip?download=1\">\n                Download\n              </a>\n            </span>\n          </td>\n        </tr>\n      \n    \n    </tbody>\n  </table>\n      </div>\n  </div>\n\n    </section>\n              \n\n  <section id=\"additional-details\" aria-label=\"Additional record details\">\n\n\n\n\n\n\n\n\n\n\n\n  <h2 id=\"record-details-heading\">Additional details</h2>\n\n  \n\n  \n\n  \n\n  \n    \n      \n    \n    \n  \n\n  \n\n  \n\n  \n      \n      \n      \n      \n        \n          \n        \n      \n        \n      \n        \n      \n\n      \n        \n        \n      \n      \n      \n      \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n      \n        \n   
\n\n  </section>\n    \n    <div id=\"citations-search\" data-record-pids=\"{&quot;doi&quot;: {&quot;client&quot;: &quot;datacite&quot;, &quot;identifier&quot;: &quot;10.5281/zenodo.17728786&quot;, &quot;provider&quot;: &quot;datacite&quot;}, &quot;oai&quot;: {&quot;identifier&quot;: &quot;oai:zenodo.org:17728786&quot;, &quot;provider&quot;: &quot;oai&quot;}}\" data-record-parent-pids=\"{&quot;doi&quot;: {&quot;client&quot;: &quot;datacite&quot;, &quot;identifier&quot;: &quot;10.5281/zenodo.5371628&quot;, &quot;provider&quot;: &quot;datacite&quot;}}\" data-citations-endpoint=\"https://zenodo-broker.web.cern.ch/api/relationships\" aria-label=\"Record citations\"></div>\n  \n        </article></div>"
  }
}
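
Note on the v0.4.9.2 release notes above: they state that the `LOGLEVEL` environment variable was renamed to `LMEVAL_LOG_LEVEL` and that Python 3.10 is now the minimum version. A wrapper script that has to work across harness versions could handle both points roughly as sketched below. This is only an illustration using the standard library: `resolve_log_level` and `python_is_supported` are hypothetical helper names, not part of lm-evaluation-harness, and the fallback to the old variable is an assumption about what a launcher might want, not something the harness itself is documented to do.

```python
import os
import sys

OLD_VAR = "LOGLEVEL"          # variable name before the rename (per the release notes)
NEW_VAR = "LMEVAL_LOG_LEVEL"  # variable name after the rename (per the release notes)
MIN_PYTHON = (3, 10)          # minimum Python version stated in the release notes


def resolve_log_level(env, default="INFO"):
    """Hypothetical helper: prefer the new variable, fall back to the old one."""
    return env.get(NEW_VAR) or env.get(OLD_VAR) or default


def python_is_supported(version_info=sys.version_info):
    """Hypothetical helper: compare (major, minor) against the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    # Export the resolved value under the new name before launching an evaluation.
    os.environ[NEW_VAR] = resolve_log_level(os.environ)
    print("log level:", os.environ[NEW_VAR])
    print("python ok:", python_is_supported())
```

Passing the environment in as a plain mapping keeps the helper easy to check without touching the real process environment.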

▼thought

▼✓Fetch web page content, URL: "https://arxiv.org/html/2405.14782v1"

{
  "content": [
    {
      "type": "text",
      "text": "{\n  \"url\": \"https://arxiv.org/html/2405.14782v1\",\n  \"title\": \"Lessons from the Trenches on Reproducible Evaluation of Language Models\",\n  \"author\": \"Stella Biderman\",\n  \"summary\": \"Hailey Schoelkopf\\n\\n\\n\\n\\nLintang Sutawika\\n\\n\\n\\n\\n\\nLeo Gao\\n\\nEleutherAI\\n\\n\\nJonathan Tow\\n\\nStability AI\\n\\n\\nBaber Abbasi\\n\\nEleutherAI\\n\\n\\nAlham Fikri Aji\\n\\nMBZUAI\\n\\n\\nPawan Sasanka Ammanamanchi\\n\\nIIIT Hyderabad\\n\\n\\nSidney …\",\n  \"content\": \"<div id=\\\"readability-page-1\\\" class=\\\"page\\\"><article>\\n\\n<p>\\n<span>\\n<span>Hailey Schoelkopf\\n</span><span>\\n<span>\\n</span></span></span>\\n<span>\\n<span>Lintang Sutawika\\n</span><span>\\n<span>\\n</span></span></span>\\n<span>\\n<span>\\n<br>Leo Gao\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Jonathan Tow\\n</span><span>\\n<span>Stability AI\\n</span></span></span>\\n<span>\\n<span>Baber Abbasi\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Alham Fikri Aji\\n</span><span>\\n<span>MBZUAI\\n</span></span></span>\\n<span>\\n<span>Pawan Sasanka Ammanamanchi\\n</span><span>\\n<span>IIIT Hyderabad\\n</span></span></span>\\n<span>\\n<span>Sidney Black\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Jordan Clive\\n</span><span>\\n<span>Chattermill AI\\n</span></span></span>\\n<span>\\n<span>Anthony DiPofi\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Julen Etxaniz\\n</span><span>\\n<span>HiTZ Center - Ixa, UPV/EHU\\n</span></span></span>\\n<span>\\n<span>Benjamin Fattori\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Jessica Zosa Forde\\n</span><span>\\n<span>Brown University\\n</span></span></span>\\n<span>\\n<span>Charles Foster\\n</span><span>\\n<span>Finetune\\n</span></span></span>\\n<span>\\n<span>Mimansa Jaiswal\\n</span><span>\\n<span>University of 
Michigan\\n</span></span></span>\\n<span>\\n<span>Wilson Y. Lee\\n</span><span>\\n<span>HubSpot\\n</span></span></span>\\n<span>\\n<span>Haonan Li\\n</span><span>\\n<span>MBZUAI\\n</span>\\n<span>LibrAI\\n</span></span></span>\\n<span>\\n<span>Charles Lovering\\n</span><span>\\n<span>Kensho\\n</span></span></span>\\n<span>\\n<span>Niklas Muennighoff\\n</span><span>\\n<span>Contextual AI\\n</span></span></span>\\n<span>\\n<span>Ellie Pavlick\\n</span><span>\\n<span>HiTZ Center - Ixa, UPV/EHU\\n</span></span></span>\\n<span>\\n<span>Jason Phang\\n</span><span>\\n<span>EleutherAI\\n</span>\\n<span>New York University\\n</span></span></span>\\n<span>\\n<span>Aviya Skowron\\n</span><span>\\n<span>EleutherAI\\n</span></span></span>\\n<span>\\n<span>Samson Tan\\n</span><span>\\n<span>Amazon\\n</span></span></span>\\n<span>\\n<span>Xiangru Tang\\n</span><span>\\n<span>Yale University\\n</span></span></span>\\n<span>\\n<span>Kevin A. Wang\\n</span><span>\\n<span>HiTZ Center - Ixa, UPV/EHU\\n</span></span></span>\\n<span>\\n<span>Genta Indra Winata\\n</span><span>\\n<span>HKUST\\n</span></span></span>\\n<span>\\n<span>François Yvon\\n</span><span>\\n<span>Sorbonne University\\n</span></span></span>\\n<span>\\n<span>Andy Zou\\n</span><span>\\n<span>CMU\\n</span></span></span>\\n</p>\\n<div id=\\\"abstract\\\">\\n<h6>Abstract</h6><p id=\\\"id1.id1\\\">Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. 
Third, we present the&nbsp;<span id=\\\"id1.id1.1\\\">Language Model Evaluation Harness</span>&nbsp;(<span id=\\\"id1.id1.2\\\">lm-eval</span>): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns.</p></div>\\n<section id=\\\"S1\\\">\\n<h2>\\n<span>1 </span>Introduction</h2>\\n<div id=\\\"S1.p1\\\">\\n<p id=\\\"S1.p1.1\\\">Evaluation on shared benchmark tasks is a crucial tool used to track and communicate progress in the machine learning and language modeling communities <cite>(Ruder, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib98\\\" title=\\\"\\\">2021</a>)</cite>. Benchmarks are used to track progress toward shared community goals and to demonstrate the improvements of newly proposed methods over prior baselines. Evaluation practices thus play a crucial role in the direction of the field: inconsistencies or biases in evaluation practices can lead to skewed performance comparisons, which may influence the direction of future research and the adoption of new methods by the community <cite>(Dehghani et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib28\\\" title=\\\"\\\">2021</a>)</cite> or lead to adverse effects from deploying suboptimal or harmful models <cite>(Bender &amp; Friedman, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib10\\\" title=\\\"\\\">2018</a>)</cite> on tasks for which they are ill-suited <cite>(Raji et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib94\\\" title=\\\"\\\">2022</a>)</cite>.</p></div>\\n<div id=\\\"S1.p2\\\">\\n<p id=\\\"S1.p2.1\\\">In this work, we detail our lessons learned that have been especially beneficial to obtaining useful and rigorous findings. By sharing these lessons, we aim to foster a more robust and reliable evaluation ecosystem. 
Our contributions are threefold:</p><ol id=\\\"S1.I1\\\">\\n<li id=\\\"S1.I1.i1\\\">\\n<span>1.</span>\\n<div id=\\\"S1.I1.i1.p1\\\">\\n<p id=\\\"S1.I1.i1.p1.1\\\">We highlight several commonly faced challenges in evaluating language models, including the difficulty of assessing the correctness of natural language responses, challenges in benchmark design, and the dependence upon implementation details that are often obscured or unreported (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S2\\\" title=\\\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2</span></a>).</p></div>\\n</li>\\n<li id=\\\"S1.I1.i2\\\">\\n<span>2.</span>\\n<div id=\\\"S1.I1.i2.p1\\\">\\n<p id=\\\"S1.I1.i2.p1.1\\\">We then discuss best practices we’ve identified for communicating results and improving evaluation rigor in the language modeling community, despite (or to mitigate the impact of) these challenges (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S3\\\" title=\\\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>3</span></a>).</p></div>\\n</li>\\n<li id=\\\"S1.I1.i3\\\">\\n<span>3.</span>\\n<div id=\\\"S1.I1.i3.p1\\\">\\n<p id=\\\"S1.I1.i3.p1.1\\\">To enable researchers and engineers to easily utilize the best practices we have identified, we present&nbsp;<span id=\\\"S1.I1.i3.p1.1.1\\\">lm-eval</span>, our open-source library for reproducible evaluation of language models (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S4\\\" title=\\\"4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>4</span></a>). 
<span id=\\\"S1.I1.i3.p1.1.2\\\">lm-eval</span>’s framework defines a flexible API for both model implementation and evaluation task implementation, and then handles all of the work to <span id=\\\"S1.I1.i3.p1.1.3\\\">orchestrate</span> evaluations internally, allowing users to evaluate their chosen combinations of models + tasks, dependent on use case. We additionally present a set of case studies in which&nbsp;<span id=\\\"S1.I1.i3.p1.1.4\\\">lm-eval</span>&nbsp;was used to improve the level of rigor of a set of evaluations (Section&nbsp;<a href=\\\"https://arxiv.org/html/2405.14782v1#S5\\\" title=\\\"5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>5</span></a>).</p></div>\\n</li>\\n</ol>\\n</div>\\n</section>\\n<section id=\\\"S2\\\">\\n<h2>\\n<span>2 </span>Challenges in Evaluating Language Models</h2>\\n<section id=\\\"S2.SS1\\\">\\n<h3>\\n<span>2.1 </span>Evaluating and Scoring Natural Language Abilities</h3>\\n<div id=\\\"S2.SS1.p1\\\">\\n<p id=\\\"S2.SS1.p1.1\\\">The biggest challenge in language model evaluation is a concept we term <span id=\\\"S2.SS1.p1.1.1\\\">the Key Problem</span>: When evaluating language models, there can be many semantically equivalent but syntactically different ways of expressing the same idea. In an ideal world, we would have a way to automatically detect when two sentences express the same content but in different words. Unfortunately, our best tools for determining whether two sentences are semantically equivalent <span id=\\\"S2.SS1.p1.1.2\\\">are the very models we are seeking to evaluate</span>. This problem drives many of the approaches to LM benchmarking, and many problems in LM evaluation stem from there not being any silver bullets for solving the Key Problem.</p></div>\\n<div id=\\\"S2.SS1.p2\\\">\\n<p id=\\\"S2.SS1.p2.1\\\">In principle, this would be solvable by simply having expert human annotators score model responses for correctness. 
The main reason this is not ubiquitous is cost: performing accurate human studies is not only difficult and time-consuming but also very expensive due to fair compensation, pricing smaller actors or organizations out of performing such evaluations. Additionally, there are other reasons relying on solely human assessments must be done with caution: they can be flawed and biased, especially for complex judgments such as factuality <cite>(Hosking et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib47\\\" title=\\\"\\\">2024</a>; Xu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib127\\\" title=\\\"\\\">2023</a>; Wu &amp; Aji, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib126\\\" title=\\\"\\\">2023</a>)</cite>. Expert, trained human judgment can alleviate these issues but is inherently non-scalable.</p></div>\\n<div id=\\\"S2.SS1.p3\\\">\\n<p id=\\\"S2.SS1.p3.1\\\">To address the high costs of manual human evaluation, automated metrics are often used. These offer notable advantages in that they are (theoretically) fully reproducible, far easier and cheaper to compute, and can avoid some of the issues faced by human studies <cite>(Wei &amp; Jia, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib124\\\" title=\\\"\\\">2021</a>; Freitag et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib33\\\" title=\\\"\\\">2021</a>; Amidei et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib4\\\" title=\\\"\\\">2020</a>)</cite>. 
Automated metrics such as BLEU <cite>(Papineni et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib81\\\" title=\\\"\\\">2002</a>)</cite> and ROUGE <cite>(Lin, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib65\\\" title=\\\"\\\">2004</a>)</cite> seek to directly solve the Key Problem by measuring the distance from a generated response to a gold-standard one, for example by counting the n-gram overlap between the two texts.\\nHeuristic-based metrics such as BLEU (and its derivatives) have flaws <cite>(Callison-Burch et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib19\\\" title=\\\"\\\">2006</a>)</cite> and present reproducibility challenges <cite>(Marie et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib74\\\" title=\\\"\\\">2021</a>)</cite>, but can be useful. More recently, model-based metrics have gained momentum through evaluation methods that leverage large language models as graders <cite>(Kim et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib56\\\" title=\\\"\\\">2024</a>; Wang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib120\\\" title=\\\"\\\">2024</a>; Liu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib68\\\" title=\\\"\\\">2023b</a>)</cite>, especially as proxies for human preference evaluation <cite>(Zheng et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib133\\\" title=\\\"\\\">2023</a>)</cite>, but these are known to be flawed <cite>(Wang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib118\\\" title=\\\"\\\">2023</a>; Huang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib49\\\" title=\\\"\\\">2024</a>; Shen et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib106\\\" title=\\\"\\\">2023</a>; Zeng et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib130\\\" title=\\\"\\\">2024</a>; Hu et&nbsp;al., <a 
href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib48\\\" title=\\\"\\\">2024</a>; Liu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib69\\\" title=\\\"\\\">2023c</a>; Chen et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib20\\\" title=\\\"\\\">2024</a>)</cite> and suffer from reproducibility issues similar to those of BLEU, ROUGE, and their variants.</p></div>\\n<div id=\\\"S2.SS1.p4\\\">\\n<p id=\\\"S2.SS1.p4.1\\\">The Key Problem can alternatively be sidestepped by artificially restricting the answer space. The most prevalent way to achieve this is to reframe questions as multiple-choice problems, with a single gold target answer and a finite, static set of possible responses <cite>(Hendrycks et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib42\\\" title=\\\"\\\">2020</a>; Srivastava et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib110\\\" title=\\\"\\\">2022</a>; Liévin et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib64\\\" title=\\\"\\\">2022</a>; Lin et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib66\\\" title=\\\"\\\">2022</a>; Robinson et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib95\\\" title=\\\"\\\">2023</a>; Holtzman et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib46\\\" title=\\\"\\\">2022</a>)</cite>. 
Alternatively, when a reference answer is known, one can apply heuristic string matching to determine whether the model’s answer matches the ground truth <cite>(Dua et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib30\\\" title=\\\"\\\">2019</a>; Joshi et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib53\\\" title=\\\"\\\">2017</a>; Hendrycks et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib43\\\" title=\\\"\\\">2021</a>)</cite>.</p></div>\\n<div id=\\\"S2.SS1.p5\\\">\\n<p id=\\\"S2.SS1.p5.1\\\">This challenge does not necessarily impact other applications of language models and related technologies, such as playing games where it is easy to check that the game has ended <cite>(Romstad et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib97\\\" title=\\\"\\\">2008</a>; Silver et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib107\\\" title=\\\"\\\">2018</a>; FAIR et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib31\\\" title=\\\"\\\">2022</a>)</cite>, more constrained scientific applications <cite>(Jumper et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib54\\\" title=\\\"\\\">2021</a>; Ahdritz et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib1\\\" title=\\\"\\\">2022</a>)</cite>, or domains where we have <span id=\\\"S2.SS1.p5.1.1\\\">practically usable verifiers</span> even when the solutions are not checkable in all contexts <cite>(Biderman, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib12\\\" title=\\\"\\\">2020</a>; Biderman &amp; Raff, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib13\\\" title=\\\"\\\">2022</a>; Lewkowycz et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib59\\\" title=\\\"\\\">2022</a>)</cite>. 
In the case of LLMs, the most notable cases where such a ground-truth verifier is available are coding and mathematics problems, although the verifiers used, such as unit tests, may still break down in edge cases <cite>(Liu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib67\\\" title=\\\"\\\">2023a</a>)</cite>.</p></div>\\n</section>\\n<section id=\\\"S2.SS2\\\">\\n<h3>\\n<span>2.2 </span>Benchmark Design and Validity</h3>\\n<div id=\\\"S2.SS2.p1\\\">\\n<p id=\\\"S2.SS2.p1.1\\\">Typically, we do not care about the actual numeric score of a model on a benchmark. Instead, we desire the benchmark to be a useful proxy for some real-world phenomenon. The <span id=\\\"S2.SS2.p1.1.1\\\">validity</span> of an evaluation is the extent to which the benchmark score correlates with that phenomenon <cite>(Messick, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib76\\\" title=\\\"\\\">1994</a>)</cite>. For a recent overview of validity concerns in NLP benchmarking, see <cite>Subramonian et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib111\\\" title=\\\"\\\">2023</a>)</cite>. Also see <cite>Raji et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib93\\\" title=\\\"\\\">2021</a>); Saphra et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib101\\\" title=\\\"\\\">2023</a>); Davis (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib25\\\" title=\\\"\\\">2023</a>)</cite> for extended discussion of construct validity in LLM evaluation.</p></div>\\n<div id=\\\"S2.SS2.p2\\\">\\n<p id=\\\"S2.SS2.p2.1\\\">While validity is an ongoing problem in language model evaluation, we focus on mitigating other concerns first: as we will describe,&nbsp;<span id=\\\"S2.SS2.p2.1.1\\\">lm-eval</span>&nbsp;is designed to ensure measurements are <span id=\\\"S2.SS2.p2.1.2\\\">consistent</span> across runs and models, regardless of (construct) validity.</p></div>\\n</section>\\n<section id=\\\"S2.SS3\\\">\\n<h3>\\n<span>2.3 </span>Implementation Difficulties and (Ir)Reproducibility</h3>\\n<div id=\\\"S2.SS3.p1\\\">\\n<p id=\\\"S2.SS3.p1.1\\\">Once a benchmark has been designed, it then needs to be implemented by machine learning researchers around the world to see use in driving progress in the field. This introduces a host of new challenges that need to be addressed in order to ensure that everyone is evaluating models on a benchmark in the same fashion when comparing results. This adaptation process can introduce inconsistencies and make it difficult to draw conclusions across different implementations. Researchers must <span id=\\\"S2.SS3.p1.1.1\\\">adapt</span> it to their own workflows and libraries for the purposes of actually adopting the benchmark in their research.</p></div>\\n<section id=\\\"S2.SS3.SSS1\\\">\\n<h4>\\n<span>2.3.1 </span>“Minor” Implementation Details Matter</h4>\\n<div id=\\\"S2.SS3.SSS1.p1\\\">\\n<p id=\\\"S2.SS3.SSS1.p1.1\\\">The importance of interoperability and full reproducibility stems from the fact that language models are incredibly sensitive to precise details that may not be obvious to practitioners. 
Even minor variations in prompts, formatting, or other implementation details can significantly impact the performance and validity of evaluations <cite>(Weber et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib121\\\" title=\\\"\\\">2023</a>; Sclar et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib103\\\" title=\\\"\\\">2023</a>; Mizrahi et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib79\\\" title=\\\"\\\">2024</a>; Alzahrani et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib3\\\" title=\\\"\\\">2024</a>; Lu et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib70\\\" title=\\\"\\\">2022</a>; Webson &amp; Pavlick, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib122\\\" title=\\\"\\\">2022</a>; Min et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib78\\\" title=\\\"\\\">2022</a>)</cite>.\\nWithout access to the original evaluation code, when re-implementing evaluation procedures from scratch is required, it is nearly impossible to account for all the subtle details that can affect outcomes. As a result, these implementations are likely to diverge in ways that make it extremely difficult to ensure fair comparisons across works, even when evaluating on the same benchmark. 
Even having the prompts reported in a paper is no substitute for having access to the actual evaluation code: prompts in papers are often incorrect or difficult to map to the exact code implementation because they’ve been stylized to be human-readable.</p></div>\\n</section>\\n<section id=\\\"S2.SS3.SSS2\\\">\\n<h4>\\n<span>2.3.2 </span>Lack of Agreement About “Apples to Apples”</h4>\\n<div id=\\\"S2.SS3.SSS2.p1\\\">\\n<p id=\\\"S2.SS3.SSS2.p1.1\\\">Even assuming that benchmarks are implemented consistently across works, the question of <span id=\\\"S2.SS3.SSS2.p1.1.1\\\">how to draw fair comparisons</span> across models and methods is still difficult for LMs.</p></div>\\n<div id=\\\"S2.SS3.SSS2.p2\\\">\\n<p id=\\\"S2.SS3.SSS2.p2.1\\\">For instance, different instruction-tuned models may be trained to expect certain formats&nbsp;<cite>(Taori et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib113\\\" title=\\\"\\\">2023</a>; Sanh et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib100\\\" title=\\\"\\\">2022</a>; Wei et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib123\\\" title=\\\"\\\">2022</a>)</cite> – using these models’ intended prompt formats can make the evaluation tasks inherently different or change their difficulty, but not using these can also bias against models trained with formats not matching tasks’ “standard” prompting styles. Likewise, if an original benchmark implementation (including prompting and postprocessing) is tailored for a specific model, other models trained differently will suffer, artificially skewing perceptions of what techniques are effective.</p></div>\\n<div id=\\\"S2.SS3.SSS2.p3\\\">\\n<p id=\\\"S2.SS3.SSS2.p3.1\\\">Likewise, some questions of how to set up controlled experiments are still open–is it ideal to normalize performance and comparisons by the number of parameters? Training FLOPs? Inference cost? Must training data be held equal? 
How should models which can leverage external resources such as retrieved documents or external tools be compared? These questions are all context-dependent but can impact findings significantly. For example, <cite>Wang et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib119\\\" title=\\\"\\\">2022</a>)</cite> explore comparisons across architectures and training objectives, and choose to normalize for FLOPs, thus comparing encoder-decoder models with double the parameters to decoder-only models. Comparing results of models with equivalent training FLOPs, regardless of the allocation of those FLOPs, is commonplace (<cite>Hoffmann et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib44\\\" title=\\\"\\\">2022</a>); Peng et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib83\\\" title=\\\"\\\">2023</a>); Touvron et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib115\\\" title=\\\"\\\">2023</a>)</cite>, <span id=\\\"S2.SS3.SSS2.p3.1.1\\\">inter alia</span>). However, in a more memory-constrained setting, comparing models equi-parameter may be more logical. 
While this is not inherently problematic, as different application contexts motivate different evaluation criteria, it is common to gloss headline claims as “model X is better than model Y” or “model X doesn’t really work as advertised” without paying significant attention to how comparisons were made.</p></div>\\n</section>\\n<section id=\\\"S2.SS3.SSS3\\\">\\n<h4>\\n<span>2.3.3 </span>Comparisons with Prior Work are Expensive (and Sometimes Impossible)</h4>\\n<div id=\\\"S2.SS3.SSS3.p1\\\">\\n<p id=\\\"S2.SS3.SSS3.p1.1\\\">Setting aside the question of establishing fair comparisons between methods or models, an additional key challenge in language modeling research is that many barriers prevent thorough comparison with related work.</p></div>\\n<div id=\\\"S2.SS3.SSS3.p2\\\">\\n<p id=\\\"S2.SS3.SSS3.p2.1\\\">Many LMs developed by industrial labs, often used as reference points for benchmarks, have never been released externally <cite>(Chowdhery et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib21\\\" title=\\\"\\\">2023</a>; Hoffmann et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib44\\\" title=\\\"\\\">2022</a>)</cite>, preventing comparisons except by pulling unverified evaluation numbers from technical reports. Models that have been made available via APIs may silently differ from the published versions or otherwise be modified for deployment. Additionally, these API models are often quickly <span id=\\\"S2.SS3.SSS3.p2.1.1\\\">deprecated</span> and become inaccessible, rendering large bodies of work no longer reproducible<span id=\\\"footnote1\\\"><sup>1</sup><span><span><sup>1</sup><span>1</span>Notably, OpenAI’s <span id=\\\"footnote1.1\\\">code-davinci-002</span> model was <a href=\\\"https://platform.openai.com/docs/deprecations/base-gpt-models\\\" title=\\\"\\\">deprecated in January 2024</a>, making at minimum hundreds of research studies irreproducible.</span></span></span>. 
API access, especially for large volumes of evaluation, is quite expensive.</p></div>\\n\\n</section>\\n</section>\\n<section id=\\\"S2.SS4\\\">\\n<h3>\\n<span>2.4 </span>Fast-changing Progress and Conventions</h3>\\n<figure id=\\\"S2.F1\\\"><img alt=\\\"Refer to caption\\\" height=\\\"246\\\" id=\\\"S2.F1.g1\\\" src=\\\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/Timeline_v1.png\\\" width=\\\"598\\\">\\n<figcaption><span>Figure 1: </span>A timeline showing the relative release dates of a selection of notable benchmarks used to evaluate LMs, as compared to the release dates of BERT <cite>(Devlin et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib29\\\" title=\\\"\\\">2018</a>)</cite>, GPT-2 <cite>(Radford et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib92\\\" title=\\\"\\\">2019</a>)</cite>, GPT-3 <cite>(Brown et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>, and ChatGPT, used as approximate stand-ins for shifts in how the community uses and therefore evaluates LMs.</figcaption>\\n</figure>\\n<div id=\\\"S2.SS4.p1\\\">\\n<p id=\\\"S2.SS4.p1.1\\\">Due to the time-consuming nature of developing good benchmarks and the rapid pace of change in NLP research in the past decade, many widely used language model evaluation benchmarks do not represent the current paradigm of how language models are trained. 
This has two major impacts:</p><ul id=\\\"S2.I1\\\">\\n<li id=\\\"S2.I1.i1\\\">\\n<span>•</span>\\n<div id=\\\"S2.I1.i1.p1\\\">\\n<p id=\\\"S2.I1.i1.p1.1\\\">Benchmarks are being used for purposes they were not originally designed for or designed for validity under: for example, a large number of benchmarks have been built around fine-tuning on a known training set and closed space of labels <cite>(Wang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib117\\\" title=\\\"\\\">2019b</a>; <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib116\\\" title=\\\"\\\">a</a>)</cite>.</p></div>\\n</li>\\n<li id=\\\"S2.I1.i2\\\">\\n<span>•</span>\\n<div id=\\\"S2.I1.i2.p1\\\">\\n<p id=\\\"S2.I1.i2.p1.1\\\">There is no “ground-truth” implementation from the original benchmark authors for many of these popular benchmarks “retrofitted” to be used with prompted autoregressive LMs. In the absence of a clear standard, the community’s methodology for evaluating on these benchmarks may be fragmented or undocumented <cite>(Clark et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib23\\\" title=\\\"\\\">2018</a>; Paperno et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib80\\\" title=\\\"\\\">2016</a>)</cite>.</p></div>\\n</li>\\n</ul>\\n</div>\\n<div id=\\\"S2.SS4.p2\\\">\\n<p id=\\\"S2.SS4.p2.1\\\">To illustrate the effects of this development timeline, Figure <a href=\\\"https://arxiv.org/html/2405.14782v1#S2.F1\\\" title=\\\"Figure 1 ‣ 2.4 Fast-changing Progress and Conventions ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>1</span></a> shows how many prominent LM benchmarks were designed prior to shifts such as in-context learning and chat interaction, and therefore were not designed to take these formats and approaches into account. 
This can affect validity or difficulty in unforeseen ways.</p></div>\\n</section>\\n</section>\\n<section id=\\\"S3\\\">\\n<h2>\\n<span>3 </span>Best Practices for Language Model Evaluation</h2>\\n<div id=\\\"S3.p1\\\">\\n<p id=\\\"S3.p1.1\\\">While LM evaluation is difficult and suffers from a number of challenges as we have described, there are measures that can be taken to significantly improve current practices. We provide our high-level recommendations regarding such measures, and detail our motivations briefly for each.</p></div>\\n<section id=\\\"S3.SS0.SSS0.Px1\\\">\\n<h5>Always share your exact prompts and code</h5>\\n<div id=\\\"S3.SS0.SSS0.Px1.p1\\\">\\n<ul id=\\\"S3.I1\\\">\\n<li id=\\\"S3.I1.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I1.i1.p1\\\">\\n<p id=\\\"S3.I1.i1.p1.1\\\">If possible, the full <span id=\\\"S3.I1.i1.p1.1.1\\\">evaluation code</span>, including the exact prompts used, should be provided for reproducible evaluation runs, along with further identifiers such as links to the specific commits used. Failing this, sharing the prompts alone is often not done, but can drastically improve reproducibility.\\n</p></div>\\n</li>\\n<li id=\\\"S3.I1.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I1.i2.p1\\\">\\n<p id=\\\"S3.I1.i2.p1.1\\\">For fair comparison against other models, evaluation should be done with the same set of prompts unless there’s a good reason not to. 
<span id=\\\"S3.I1.i2.p1.1.1\\\">Prompts should not be optimized for performance on a given model but not others, and the amount of prompt engineering done should be disclosed.</span></p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px2\\\">\\n<h5>Avoid copying results from other implementations</h5>\\n<div id=\\\"S3.SS0.SSS0.Px2.p1\\\">\\n<ul id=\\\"S3.I2\\\">\\n<li id=\\\"S3.I2.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I2.i1.p1\\\">\\n<p id=\\\"S3.I2.i1.p1.1\\\">Comparing results across papers can be misleading due to a wide range of experimental differences, including prompts, sample size, metric calculation, and more <cite>(Marie et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib74\\\" title=\\\"\\\">2021</a>)</cite>.</p></div>\\n</li>\\n<li id=\\\"S3.I2.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I2.i2.p1\\\">\\n<p id=\\\"S3.I2.i2.p1.1\\\">Results should <span id=\\\"S3.I2.i2.p1.1.1\\\">not</span> be copied or reported from other papers&nbsp;<cite>(Marie, <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib73\\\" title=\\\"\\\">2022</a>)</cite> whenever possible, unless one can verify that the exact same code has been used to run the experiments in those papers. 
If such copying is unavoidable, it should be clearly marked as such and treated carefully.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px3\\\">\\n<h5>Always provide model outputs</h5>\\n<div id=\\\"S3.SS0.SSS0.Px3.p1\\\">\\n<ul id=\\\"S3.I3\\\">\\n<li id=\\\"S3.I3.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I3.i1.p1\\\">\\n<p id=\\\"S3.I3.i1.p1.1\\\">Providing model outputs alongside evaluation code can allow others to recalculate scores based on these artifacts, which can be useful for performing statistical significance testing and for assessing the impact of different evaluation metrics or scoring approaches.</p></div>\\n</li>\\n<li id=\\\"S3.I3.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I3.i2.p1\\\">\\n<p id=\\\"S3.I3.i2.p1.1\\\">Evaluation of large models or APIs can be quite costly–sharing such artifacts allows researchers without access to significant compute to participate in evaluation research.</p></div>\\n</li>\\n<li id=\\\"S3.I3.i3\\\">\\n<span>•</span>\\n<div id=\\\"S3.I3.i3.p1\\\">\\n<p id=\\\"S3.I3.i3.p1.1\\\">Finally, sharing outputs can allow results on API models to be reproduced to some extent, even if the models are subsequently deprecated.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px4\\\">\\n<h5>Perform qualitative analyses</h5>\\n<div id=\\\"S3.SS0.SSS0.Px4.p1\\\">\\n<ul id=\\\"S3.I4\\\">\\n<li id=\\\"S3.I4.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I4.i1.p1\\\">\\n<p id=\\\"S3.I4.i1.p1.1\\\">Qualitatively review a small batch of results and outputs before testing at scale: it is very easy to have bugs in your generation code, especially when working with multiple sets of benchmarks, prompts, and models of different architectures. 
Catching issues early can save a lot of time and compute, and increase confidence in results.</p></div>\\n</li>\\n<li id=\\\"S3.I4.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I4.i2.p1\\\">\\n<p id=\\\"S3.I4.i2.p1.1\\\">Quantitative scores only provide so much information. To understand why a model is scoring so well or so poorly, it is important to do some sort of qualitative error analysis. This can sometimes reveal superficial errors that are easier to correct with post-processing&nbsp;<cite>Bawden &amp; Yvon (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib8\\\" title=\\\"\\\">2023</a>)</cite>, or more fundamental errors.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S3.SS0.SSS0.Px5\\\">\\n<h5>Perform statistical significance testing</h5>\\n<div id=\\\"S3.SS0.SSS0.Px5.p1\\\">\\n<ul id=\\\"S3.I5\\\">\\n<li id=\\\"S3.I5.i1\\\">\\n<span>•</span>\\n<div id=\\\"S3.I5.i1.p1\\\">\\n<p id=\\\"S3.I5.i1.p1.1\\\">Most works on language modeling do not perform statistical significance testing <cite>(Marie et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib74\\\" title=\\\"\\\">2021</a>)</cite>. This simple addition can dramatically boost the reliability of claimed results.</p></div>\\n</li>\\n<li id=\\\"S3.I5.i2\\\">\\n<span>•</span>\\n<div id=\\\"S3.I5.i2.p1\\\">\\n<p id=\\\"S3.I5.i2.p1.1\\\">Although costly, reporting results run over more than one random seed can dramatically boost the validity and utility of results. For example, averaging across model runs <cite>(Sellam et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib104\\\" title=\\\"\\\">2022</a>)</cite>, or averaging over multiple selections of few-shot examples.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n</section>\\n<section id=\\\"S4\\\">\\n<h2>\\n<span>4 </span>The Language Model Evaluation Harness</h2>\\n<div id=\\\"S4.p1\\\">\\n<p id=\\\"S4.p1.1\\\">Informed by these practices we have built <span id=\\\"S4.p1.1.1\\\">lm-eval</span>. 
Unlike prior work on unified benchmarking libraries <cite>(Liang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib63\\\" title=\\\"\\\">2023</a>; Srivastava et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib110\\\" title=\\\"\\\">2022</a>)</cite>, lm-eval does not seek solely to prescribe which benchmarks or evaluation protocols are correct; instead, it allows users to select their desired tasks and use cases.</p></div>\\n<div id=\\\"S4.p2\\\">\\n<p id=\\\"S4.p2.1\\\">The role of <span id=\\\"S4.p2.1.1\\\">lm-eval</span>&nbsp;is to solve the <span id=\\\"S4.p2.1.2\\\">orchestration problem</span>: previously, performing thorough LM evaluations would require painstaking re-implementation of previous tasks (likely to introduce subtle methodological divergences) or the individual installation and usage of extant code for each benchmark. Our goal is to let researchers and library users install a single codebase and run their method plus selected baselines on their desired tasks in a controlled fashion. We hope to make it more effort-intensive to <span id=\\\"S4.p2.1.3\\\">not</span> follow minimum best practices for evaluation rigor.</p></div>\\n<section id=\\\"S4.SS1\\\">\\n<h3>\\n<span>4.1 </span>Design</h3>\\n<div id=\\\"S4.SS1.p1\\\">\\n<p id=\\\"S4.SS1.p1.1\\\">We provide an overview of <span id=\\\"S4.SS1.p1.1.1\\\">lm-eval</span>’s major components and design philosophy. 
At its core, <span id=\\\"S4.SS1.p1.1.2\\\">lm-eval</span>&nbsp;allows for the contribution of two types of implementations: evaluation <span id=\\\"S4.SS1.p1.1.3\\\">Tasks</span> and integrations with novel <span id=\\\"S4.SS1.p1.1.4\\\">LM</span> implementations.</p></div>\\n<section id=\\\"S4.SS1.SSS0.Px1\\\">\\n<h5>Tasks</h5>\\n<div id=\\\"S4.SS1.SSS0.Px1.p1\\\">\\n<p id=\\\"S4.SS1.SSS0.Px1.p1.1\\\"><span id=\\\"S4.SS1.SSS0.Px1.p1.1.1\\\">lm-eval</span>&nbsp;is built around modular implementations of evaluation tasks, implemented as a <span id=\\\"S4.SS1.SSS0.Px1.p1.1.2\\\">Task</span> class using a common API. This allows tasks to be collected in a common library, for new tasks to be extended or implemented easily, and for novel tasks to be easily shared reproducibly among practitioners or other library users. Users can implement tasks either via YAML-based configuration files or via subclassing the provided <span id=\\\"S4.SS1.SSS0.Px1.p1.1.3\\\">Task</span> class and providing custom code for specific methods. In Figure <a href=\\\"https://arxiv.org/html/2405.14782v1#S4.F2\\\" title=\\\"Figure 2 ‣ Tasks ‣ 4.1 Design ‣ 4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2</span></a>, we show an example of the evaluation logic packaged within a <span id=\\\"S4.SS1.SSS0.Px1.p1.1.4\\\">Task</span> class.</p></div>\\n<figure id=\\\"S4.F2\\\"><img alt=\\\"Refer to caption\\\" height=\\\"199\\\" id=\\\"S4.F2.g1\\\" src=\\\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/taskoverview.png\\\" width=\\\"598\\\">\\n<figcaption><span>Figure 2: </span>The operations performed by a <span id=\\\"S4.F2.7.1\\\">Task</span> object in <span id=\\\"S4.F2.8.2\\\">lm-eval</span>. 
Tasks are configured by YAML files or as a Python subclass, and encompass 1) a data source (using the <span id=\\\"S4.F2.9.3\\\">Datasets</span> library<cite>(Lhoest et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib60\\\" title=\\\"\\\">2021</a>)</cite>), 2) tools for defining prompts and format, 3) mapping these prompts to rendered inputs and expected output type from an <span id=\\\"S4.F2.10.4\\\">LM</span> in the form of <span id=\\\"S4.F2.11.5\\\">Requests</span>, and 4) rules for post-processing the <span id=\\\"S4.F2.12.6\\\">LM</span>’s outputs and calculating the final task metrics.</figcaption>\\n</figure>\\n<div id=\\\"S4.SS1.SSS0.Px1.p2\\\">\\n<p id=\\\"S4.SS1.SSS0.Px1.p2.1\\\">We provide a number of implementations for common tasks, and accept new tasks sourced from the community. We strive to match the paper originally introducing a benchmark dataset in its methodology, including using the same prompts if applicable. For tasks such as those introduced prior to prompted evaluation becoming the standard, we source evaluation methodology from the paper first posing the evaluation dataset as a prompted task. For example, we implement many tasks as adapted for in-context learning by <cite>Brown et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>.</p></div>\\n</section>\\n<section id=\\\"S4.SS1.SSS0.Px2\\\">\\n<h5>LMs</h5>\\n<div id=\\\"S4.SS1.SSS0.Px2.p1\\\">\\n<p id=\\\"S4.SS1.SSS0.Px2.p1.1\\\">The next core piece of <span id=\\\"S4.SS1.SSS0.Px2.p1.1.1\\\">lm-eval</span>&nbsp;is the <span id=\\\"S4.SS1.SSS0.Px2.p1.1.2\\\">LM</span> API. 
Because effective <span id=\\\"S4.SS1.SSS0.Px2.p1.1.3\\\">orchestration</span> is our core goal, we allow arbitrary software libraries or (autoregressive) language model architectures to extend a provided interface for <span id=\\\"S4.SS1.SSS0.Px2.p1.1.4\\\">LM</span> objects.</p></div>\\n<div id=\\\"S4.SS1.SSS0.Px2.p2\\\">\\n<p id=\\\"S4.SS1.SSS0.Px2.p2.1\\\">For ease of use, and compartmentalization of the model definition and external library integrations for custom models away from core evaluation logic, we assume that LMs operate upon dispatched <span id=\\\"S4.SS1.SSS0.Px2.p2.1.1\\\">Requests</span> which consist of mapping <span id=\\\"S4.SS1.SSS0.Px2.p2.1.2\\\">string inputs</span> to some <span id=\\\"S4.SS1.SSS0.Px2.p2.1.3\\\">string or probability</span> as output. We thus abstract tokenizers away within the <span id=\\\"S4.SS1.SSS0.Px2.p2.1.4\\\">LM</span> class, and treat a neural language model combined with its tokenizer as a single system being evaluated.</p></div>\\n<div id=\\\"S4.SS1.SSS0.Px2.p3\\\">\\n<p id=\\\"S4.SS1.SSS0.Px2.p3.1\\\">LMs implement a simple interface, consisting of several types of <span id=\\\"S4.SS1.SSS0.Px2.p3.1.1\\\">Requests</span> in order to be used within the library for all supported tasks.</p></div>\\n</section>\\n<section id=\\\"S4.SS1.SSS0.Px3\\\">\\n<h5>Request Types</h5>\\n<div id=\\\"S4.SS1.SSS0.Px3.p1\\\">\\n<p id=\\\"S4.SS1.SSS0.Px3.p1.1\\\">We allow for 3 core types of <span id=\\\"S4.SS1.SSS0.Px3.p1.1.1\\\">Requests</span> that may be sent to a language model, which consist of distinct types of <span id=\\\"S4.SS1.SSS0.Px3.p1.1.2\\\">measurements</span> that can be performed to observe a model’s response or latent capabilities in a prompted format. 
These are:</p></div>\\n<div id=\\\"S4.SS1.SSS0.Px3.p2\\\">\\n<ul id=\\\"S4.I1\\\">\\n<li id=\\\"S4.I1.i1\\\">\\n<span>•</span>\\n<div id=\\\"S4.I1.i1.p1\\\">\\n<p id=\\\"S4.I1.i1.p1.1\\\">(Conditional) Loglikelihoods (<span id=\\\"S4.I1.i1.p1.1.1\\\">loglikelihood, multiple_choice</span>) - computing the probability of given output string(s), conditioned on some provided input.</p></div>\\n</li>\\n<li id=\\\"S4.I1.i2\\\">\\n<span>•</span>\\n<div id=\\\"S4.I1.i2.p1\\\">\\n<p id=\\\"S4.I1.i2.p1.1\\\">Perplexities (<span id=\\\"S4.I1.i2.p1.1.1\\\">loglikelihood_rolling</span>) - measuring the average loglikelihood or probability of producing the tokens in a given dataset.</p></div>\\n</li>\\n<li id=\\\"S4.I1.i3\\\">\\n<span>•</span>\\n<div id=\\\"S4.I1.i3.p1\\\">\\n<p id=\\\"S4.I1.i3.p1.1\\\">Generation (<span id=\\\"S4.I1.i3.p1.1.1\\\">generate_until</span>) - generating text until a given stopping condition is reached, from a model conditioned on some provided input.</p></div>\\n</li>\\n</ul>\\n</div>\\n<figure id=\\\"S4.F3\\\"><img alt=\\\"Refer to caption\\\" height=\\\"295\\\" id=\\\"S4.F3.g1\\\" src=\\\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/typesoftasks.png\\\" width=\\\"598\\\">\\n<figcaption><span>Figure 3: </span>Overview of the three core <span id=\\\"S4.F3.2.1\\\">Request</span> types supported by our evaluation framework. These include (1) conditional loglikelihoods, (2) perplexities, and (3) generation-based requests.</figcaption>\\n</figure>\\n<div id=\\\"S4.SS1.SSS0.Px3.p3\\\">\\n<p id=\\\"S4.SS1.SSS0.Px3.p3.1\\\">Provided with these three primitive operations, we are able to implement the major ways in the literature that have been used to evaluate LMs (<cite>Gao et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib36\\\" title=\\\"\\\">2020</a>)</cite>, <cite>Brown et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>, <span id=\\\"S4.SS1.SSS0.Px3.p3.1.1\\\">inter alia</span>). While these high-level approaches are standard, they all contain a number of subtle implementation decisions which are often not disclosed in papers. Therefore, we include a full formal description of common implementation details involved in ours and others’ approaches within Appendix <a href=\\\"https://arxiv.org/html/2405.14782v1#A1\\\" title=\\\"Appendix A Formalizing Measurements ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>A</span></a> for completeness, which we hope will be a useful contribution to the literature.</p></div>\\n</section>\\n</section>\\n<section id=\\\"S4.SS2\\\">\\n<h3>\\n<span>4.2 </span>Addressing Challenges and Incorporating Best Practices</h3>\\n<div id=\\\"S4.SS2.p1\\\">\\n<p id=\\\"S4.SS2.p1.1\\\">Here we detail how we position <span id=\\\"S4.SS2.p1.1.1\\\">lm-eval</span>&nbsp;to address the issues mentioned in Section <a href=\\\"https://arxiv.org/html/2405.14782v1#S2\\\" title=\\\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2</span></a> and incorporate the recommendations in Section <a href=\\\"https://arxiv.org/html/2405.14782v1#S3\\\" title=\\\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>3</span></a>, in order to encourage a more robust evaluation ecosystem.</p></div>\\n<section id=\\\"S4.SS2.SSS0.Px1\\\">\\n<h5>Reproducibility</h5>\\n<div id=\\\"S4.SS2.SSS0.Px1.p1\\\">\\n<p id=\\\"S4.SS2.SSS0.Px1.p1.1\\\"><span id=\\\"S4.SS2.SSS0.Px1.p1.1.1\\\">lm-eval</span>&nbsp;encourages and enables reproducible evaluation in several ways. 
First, by providing a standardized implementation of many common tasks, practitioners can report on these tasks and ensure they are evaluating on the same prompt and implementation as other users of the library.</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px1.p2\\\">\\n<p id=\\\"S4.SS2.SSS0.Px1.p2.1\\\">Alongside task results we report a <span id=\\\"S4.SS2.SSS0.Px1.p2.1.1\\\">version</span> field, incremented each time a task must be modified in a way that affects its scoring. Therefore, in the case where task implementations have bugs or must otherwise be updated, one can still reference the version of the task used, to ensure future research can reproduce reported results.</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px1.p3\\\">\\n<p id=\\\"S4.SS2.SSS0.Px1.p3.1\\\">While this is not a panacea for the costs of comparing to prior work, and rerunning baselines oneself is advised, when prior work uses our library one can be confident that the results from prior work match what one would have gotten had one rerun it oneself using that version of the library <cite>(Beeching et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib9\\\" title=\\\"\\\">2023</a>)</cite>.</p></div>\\n</section>\\n<section id=\\\"S4.SS2.SSS0.Px2\\\">\\n<h5>Qualitative Analysis</h5>\\n<div id=\\\"S4.SS2.SSS0.Px2.p1\\\">\\n<p id=\\\"S4.SS2.SSS0.Px2.p1.1\\\"><span id=\\\"S4.SS2.SSS0.Px2.p1.1.1\\\">lm-eval</span>&nbsp;provides support for performing qualitative analysis of evaluation scores. 
In keeping with our recommended best practices, we implement the following, which allow for qualitative checks to be a core part of the evaluation workflow when using&nbsp;<span id=\\\"S4.SS2.SSS0.Px2.p1.1.2\\\">lm-eval</span>:</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px2.p2\\\">\\n<ul id=\\\"S4.I2\\\">\\n<li id=\\\"S4.I2.i1\\\">\\n<span>•</span>\\n<div id=\\\"S4.I2.i1.p1\\\">\\n<p id=\\\"S4.I2.i1.p1.1\\\">We allow for artificially <span id=\\\"S4.I2.i1.p1.1.1\\\">limit</span>ing the number of samples used for a given evaluation run, to enable code to be tested and outputs to be reviewed in small batches prior to full evaluation runs.</p></div>\\n</li>\\n<li id=\\\"S4.I2.i2\\\">\\n<span>•</span>\\n<div id=\\\"S4.I2.i2.p1\\\">\\n<p id=\\\"S4.I2.i2.p1.1\\\">Per-sample logging is supported, for post-hoc reproduction of scores or error analysis of model mistakes or of the evaluation implementation.</p></div>\\n</li>\\n</ul>\\n</div>\\n</section>\\n<section id=\\\"S4.SS2.SSS0.Px3\\\">\\n<h5>Statistical Testing</h5>\\n<div id=\\\"S4.SS2.SSS0.Px3.p1\\\">\\n<p id=\\\"S4.SS2.SSS0.Px3.p1.1\\\"><span id=\\\"S4.SS2.SSS0.Px3.p1.1.1\\\">lm-eval</span>&nbsp;reports the standard error (SE) of most supported metrics, calculated either by bootstrapping or by dividing the sample standard deviation by the square root of the sample size.</p></div>\\n<div id=\\\"S4.SS2.SSS0.Px3.p2\\\">\\n<p id=\\\"S4.SS2.SSS0.Px3.p2.1\\\">By reporting these SE calculations prominently in every evaluation run, we make it trivial for practitioners to add simple statistical measures such as confidence intervals to their results. 
While we believe more rigorous and widespread statistical testing in LM evaluation is still needed, we hope that this will spur the community to report and be more aware of statistical significance concerns by lowering the difficulty of reporting such measures.</p></div>\\n</section>\\n</section>\\n</section>\\n<section id=\\\"S5\\\">\\n<h2>\\n<span>5 </span>Case Studies</h2>\\n<div id=\\\"S5.p1\\\">\\n<p id=\\\"S5.p1.1\\\">Finally, we demonstrate <span id=\\\"S5.p1.1.1\\\">lm-eval</span>’s utility for improving evaluation rigor and understanding via case studies of its successful usage. We provide additional case studies within Appendix <a href=\\\"https://arxiv.org/html/2405.14782v1#A2\\\" title=\\\"Appendix B Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>B</span></a>.</p></div>\\n<section id=\\\"S5.SS1\\\">\\n<h3>\\n<span>5.1 </span>Prompts Massively Affect Results</h3>\\n<div id=\\\"S5.SS1.p1\\\">\\n<p id=\\\"S5.SS1.p1.1\\\">As mentioned in Section <a href=\\\"https://arxiv.org/html/2405.14782v1#S2.SS3.SSS1\\\" title=\\\"2.3.1 “Minor” Implementation Details Matter ‣ 2.3 Implementation Difficulties and (Ir)Reproducibility ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>2.3.1</span></a>, language models can be very sensitive to the specific prompt or framing of an evaluation task.</p></div>\\n<div id=\\\"S5.SS1.p2\\\">\\n<p id=\\\"S5.SS1.p2.1\\\">Here we detail how <span id=\\\"S5.SS1.p2.1.1\\\">lm-eval</span>&nbsp;has been used to study this sensitivity, and can be used to improve confidence in the comparison of scores across models. 
We focus our attention on two popular language modeling benchmarks: the ARC question answering benchmark <cite>(Clark et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib23\\\" title=\\\"\\\">2018</a>)</cite> and MMLU <cite>(Hendrycks et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib43\\\" title=\\\"\\\">2021</a>)</cite>. However, not all papers evaluate on these tasks in the same way as the original formats.</p></div>\\n<figure id=\\\"S5.T1\\\">\\n\\n<figcaption><span>Table 1: </span>Comparison of 0-shot model performance for several pretrained LMs <cite>(Black et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib16\\\" title=\\\"\\\">2022</a>; Touvron et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib115\\\" title=\\\"\\\">2023</a>; Penedo et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib82\\\" title=\\\"\\\">2023</a>; Jiang et&nbsp;al., <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib51\\\" title=\\\"\\\">2023</a>; <a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib52\\\" title=\\\"\\\">2024</a>)</cite> on ARC (Easy and Challenge subsets) and MMLU across two commonly used prompt styles, evaluated using <span id=\\\"S5.T1.3.1\\\">lm-eval</span>.</figcaption>\\n</figure>\\n<div id=\\\"S5.SS1.p3\\\">\\n<p id=\\\"S5.SS1.p3.1\\\">ARC is first adapted to the in-context learning setting by <cite>Brown et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite>, who implement the dataset as a “cloze” task: the model is prompted via <span id=\\\"S5.SS1.p3.1.1\\\">“Question: {question}\\\\nAnswer:”</span> and the likelihood of each potential completion string’s <span id=\\\"S5.SS1.p3.1.2\\\">text</span> is compared. By contrast, MMLU is implemented in <cite>Hendrycks et&nbsp;al. (<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib42\\\" title=\\\"\\\">2020</a>)</cite> by providing the model with the question text and each of the 4 possible answers preceded by an answer letter A, B, C, or D, and then scoring over the <span id=\\\"S5.SS1.p3.1.3\\\">answer letters’ probabilities</span>.</p></div>\\n<div id=\\\"S5.SS1.p4\\\">\\n<p id=\\\"S5.SS1.p4.1\\\">However, if model developers neither adopt these approaches nor disclose their exact settings, it is impossible to reliably compare stated model performance. In Table <a href=\\\"https://arxiv.org/html/2405.14782v1#S5.T1\\\" title=\\\"Table 1 ‣ 5.1 Prompts Massively Affect Results ‣ 5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\\\"><span>1</span></a>, we compare evaluation on the Easy and Challenge sets of ARC using the prompt from <cite>Brown et&nbsp;al. 
(<a href=\\\"https://arxiv.org/html/2405.14782v1#bib.bib18\\\" title=\\\"\\\">2020</a>)</cite> (“Cloze”) and using an MMLU-style answer letter with explicit multiple choice options (“MMLU-style”). We additionally compare MMLU scores between the original MMLU prompting style (“MMLU-style”) and an approach we term “Hybrid”, consisting of an MMLU-style prompt but using the <span id=\\\"S5.SS1.p4.1.1\\\">answer strings</span> instead of answer letters as the set of continuations over which we can scor\"\n}"
    }
  ],
  "details": {
    "url": "https://arxiv.org/html/2405.14782v1",
    "title": "Lessons from the Trenches on Reproducible Evaluation of Language Models",
    "author": "Stella Biderman",
    "summary": "Hailey Schoelkopf\n\n\n\n\nLintang Sutawika\n\n\n\n\n\nLeo Gao\n\nEleutherAI\n\n\nJonathan Tow\n\nStability AI\n\n\nBaber Abbasi\n\nEleutherAI\n\n\nAlham Fikri Aji\n\nMBZUAI\n\n\nPawan Sasanka Ammanamanchi\n\nIIIT Hyderabad\n\n\nSidney …",
    "content": "<div id=\"readability-page-1\" class=\"page\"><article>\n\n<p>\n<span>\n<span>Hailey Schoelkopf\n</span><span>\n<span>\n</span></span></span>\n<span>\n<span>Lintang Sutawika\n</span><span>\n<span>\n</span></span></span>\n<span>\n<span>\n<br>Leo Gao\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Jonathan Tow\n</span><span>\n<span>Stability AI\n</span></span></span>\n<span>\n<span>Baber Abbasi\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Alham Fikri Aji\n</span><span>\n<span>MBZUAI\n</span></span></span>\n<span>\n<span>Pawan Sasanka Ammanamanchi\n</span><span>\n<span>IIIT Hyderabad\n</span></span></span>\n<span>\n<span>Sidney Black\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Jordan Clive\n</span><span>\n<span>Chattermill AI\n</span></span></span>\n<span>\n<span>Anthony DiPofi\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Julen Etxaniz\n</span><span>\n<span>HiTZ Center - Ixa, UPV/EHU\n</span></span></span>\n<span>\n<span>Benjamin Fattori\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Jessica Zosa Forde\n</span><span>\n<span>Brown University\n</span></span></span>\n<span>\n<span>Charles Foster\n</span><span>\n<span>Finetune\n</span></span></span>\n<span>\n<span>Mimansa Jaiswal\n</span><span>\n<span>University of Michigan\n</span></span></span>\n<span>\n<span>Wilson Y. 
Lee\n</span><span>\n<span>HubSpot\n</span></span></span>\n<span>\n<span>Haonan Li\n</span><span>\n<span>MBZUAI\n</span>\n<span>LibrAI\n</span></span></span>\n<span>\n<span>Charles Lovering\n</span><span>\n<span>Kensho\n</span></span></span>\n<span>\n<span>Niklas Muennighoff\n</span><span>\n<span>Contextual AI\n</span></span></span>\n<span>\n<span>Ellie Pavlick\n</span><span>\n<span>HiTZ Center - Ixa, UPV/EHU\n</span></span></span>\n<span>\n<span>Jason Phang\n</span><span>\n<span>EleutherAI\n</span>\n<span>New York University\n</span></span></span>\n<span>\n<span>Aviya Skowron\n</span><span>\n<span>EleutherAI\n</span></span></span>\n<span>\n<span>Samson Tan\n</span><span>\n<span>Amazon\n</span></span></span>\n<span>\n<span>Xiangru Tang\n</span><span>\n<span>Yale University\n</span></span></span>\n<span>\n<span>Kevin A. Wang\n</span><span>\n<span>HiTZ Center - Ixa, UPV/EHU\n</span></span></span>\n<span>\n<span>Genta Indra Winata\n</span><span>\n<span>HKUST\n</span></span></span>\n<span>\n<span>François Yvon\n</span><span>\n<span>Sorbonne University\n</span></span></span>\n<span>\n<span>Andy Zou\n</span><span>\n<span>CMU\n</span></span></span>\n</p>\n<div id=\"abstract\">\n<h6>Abstract</h6><p id=\"id1.id1\">Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers. First, we provide an overview of common challenges faced in language model evaluation. Second, we delineate best practices for addressing or lessening the impact of these challenges on research. 
Third, we present the&nbsp;<span id=\"id1.id1.1\">Language Model Evaluation Harness</span>&nbsp;(<span id=\"id1.id1.2\">lm-eval</span>): an open source library for independent, reproducible, and extensible evaluation of language models that seeks to address these issues. We describe the features of the library as well as case studies in which the library has been used to alleviate these methodological concerns.</p></div>\n<section id=\"S1\">\n<h2>\n<span>1 </span>Introduction</h2>\n<div id=\"S1.p1\">\n<p id=\"S1.p1.1\">Evaluation on shared benchmark tasks is a crucial tool used to track and communicate progress in the machine learning and language modeling communities <cite>(Ruder, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib98\" title=\"\">2021</a>)</cite>. Benchmarks are used to track progress toward shared community goals and to demonstrate the improvements of newly proposed methods over prior baselines. Evaluation practices thus play a crucial role in the direction of the field: inconsistencies or biases in evaluation practices can lead to skewed performance comparisons, which may influence the direction of future research and the adoption of new methods by the community <cite>(Dehghani et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib28\" title=\"\">2021</a>)</cite> or lead to adverse effects from deploying suboptimal or harmful models <cite>(Bender &amp; Friedman, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib10\" title=\"\">2018</a>)</cite> on tasks for which they are ill-suited <cite>(Raji et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib94\" title=\"\">2022</a>)</cite>.</p></div>\n<div id=\"S1.p2\">\n<p id=\"S1.p2.1\">In this work, we detail our lessons learned that have been especially beneficial to obtaining useful and rigorous findings. By sharing these lessons, we aim to foster a more robust and reliable evaluation ecosystem. 
Our contributions are threefold:</p><ol id=\"S1.I1\">\n<li id=\"S1.I1.i1\">\n<span>1.</span>\n<div id=\"S1.I1.i1.p1\">\n<p id=\"S1.I1.i1.p1.1\">We highlight several commonly faced challenges in evaluating language models, including the difficulty of assessing the correctness of natural language responses, challenges in benchmark design, and the dependence upon implementation details that are often obscured or unreported (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S2\" title=\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2</span></a>).</p></div>\n</li>\n<li id=\"S1.I1.i2\">\n<span>2.</span>\n<div id=\"S1.I1.i2.p1\">\n<p id=\"S1.I1.i2.p1.1\">We then discuss best practices we’ve identified for communicating results and improving evaluation rigor in the language modeling community, despite (or to mitigate the impact of) these challenges (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S3\" title=\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>3</span></a>).</p></div>\n</li>\n<li id=\"S1.I1.i3\">\n<span>3.</span>\n<div id=\"S1.I1.i3.p1\">\n<p id=\"S1.I1.i3.p1.1\">To enable researchers and engineers to easily utilize the best practices we have identified, we present&nbsp;<span id=\"S1.I1.i3.p1.1.1\">lm-eval</span>, our open-source library for reproducible evaluation of language models (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S4\" title=\"4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>4</span></a>). 
<span id=\"S1.I1.i3.p1.1.2\">lm-eval</span>’s framework defines a flexible API for both model implementation and evaluation task implementation, and then handles all of the work to <span id=\"S1.I1.i3.p1.1.3\">orchestrate</span> evaluations internally, allowing users to evaluate their chosen combinations of models + tasks, dependent on use case. We additionally present a set of case studies in which&nbsp;<span id=\"S1.I1.i3.p1.1.4\">lm-eval</span>&nbsp;was used to improve the level of rigor of a set of evaluations (Section&nbsp;<a href=\"https://arxiv.org/html/2405.14782v1#S5\" title=\"5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>5</span></a>).</p></div>\n</li>\n</ol>\n</div>\n</section>\n<section id=\"S2\">\n<h2>\n<span>2 </span>Challenges in Evaluating Language Models</h2>\n<section id=\"S2.SS1\">\n<h3>\n<span>2.1 </span>Evaluating and Scoring Natural Language Abilities</h3>\n<div id=\"S2.SS1.p1\">\n<p id=\"S2.SS1.p1.1\">The biggest challenge in language model evaluation is a concept we term <span id=\"S2.SS1.p1.1.1\">the Key Problem</span>: When evaluating language models, there can be many semantically equivalent but syntactically different ways of expressing the same idea. In an ideal world, we would have a way to automatically detect when two sentences express the same content but in different words. Unfortunately, our best tools for determining whether two sentences are semantically equivalent <span id=\"S2.SS1.p1.1.2\">are the very models we are seeking to evaluate</span>. This problem drives many of the approaches to LM benchmarking, and many problems in LM evaluation stem from there not being any silver bullets for solving the Key Problem.</p></div>\n<div id=\"S2.SS1.p2\">\n<p id=\"S2.SS1.p2.1\">In principle, this would be solvable by simply having expert human annotators score model responses for correctness. 
The main reason this is not ubiquitous is cost: performing accurate human studies is not only difficult and time-consuming but also very expensive once annotators are fairly compensated, pricing smaller actors or organizations out of performing such evaluations. Additionally, human assessments alone must be relied on with caution for other reasons: they can be flawed and biased, especially for complex judgments such as factuality <cite>(Hosking et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib47\" title=\"\">2024</a>; Xu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib127\" title=\"\">2023</a>; Wu &amp; Aji, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib126\" title=\"\">2023</a>)</cite>. Expert, trained human judgment can alleviate these issues but is inherently non-scalable.</p></div>\n<div id=\"S2.SS1.p3\">\n<p id=\"S2.SS1.p3.1\">To address the high costs of manual human evaluation, automated metrics are often used. These offer notable advantages in that they are (theoretically) fully reproducible, far easier and cheaper to compute, and can avoid some of the issues faced by human studies <cite>(Wei &amp; Jia, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib124\" title=\"\">2021</a>; Freitag et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib33\" title=\"\">2021</a>; Amidei et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib4\" title=\"\">2020</a>)</cite>. 
Automated metrics such as BLEU <cite>(Papineni et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib81\" title=\"\">2002</a>)</cite> and ROUGE <cite>(Lin, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib65\" title=\"\">2004</a>)</cite> seek to directly solve the Key Problem by measuring the distance from a generated response to a gold-standard one, for example by counting the n-gram overlap between the two texts.\nHeuristic-based metrics such as BLEU (and its derivatives) have flaws <cite>(Callison-Burch et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib19\" title=\"\">2006</a>)</cite> and present reproducibility challenges <cite>(Marie et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib74\" title=\"\">2021</a>)</cite>, but can be useful. More recently, model-based metrics have gained momentum through evaluation methods that leverage large language models as a grader <cite>(Kim et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib56\" title=\"\">2024</a>; Wang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib120\" title=\"\">2024</a>; Liu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib68\" title=\"\">2023b</a>)</cite>, especially as proxies for human preference evaluation <cite>(Zheng et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib133\" title=\"\">2023</a>)</cite>, but these are known to be flawed <cite>(Wang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib118\" title=\"\">2023</a>; Huang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib49\" title=\"\">2024</a>; Shen et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib106\" title=\"\">2023</a>; Zeng et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib130\" title=\"\">2024</a>; Hu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib48\" title=\"\">2024</a>; Liu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib69\" title=\"\">2023c</a>; Chen et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib20\" title=\"\">2024</a>)</cite> and suffer from similar reproducibility issues as BLEU, ROUGE, and their variants.</p></div>\n<div id=\"S2.SS1.p4\">\n<p id=\"S2.SS1.p4.1\">The Key Problem can alternately be sidestepped by artificially restricting the answer space. The most prevalent way to achieve this is to reframe questions as multiple choice problems, with a single gold target answer and a finite, static set of possible responses <cite>(Hendrycks et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib42\" title=\"\">2020</a>; Srivastava et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib110\" title=\"\">2022</a>; Liévin et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib64\" title=\"\">2022</a>; Lin et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib66\" title=\"\">2022</a>; Robinson et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib95\" title=\"\">2023</a>; Holtzman et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib46\" title=\"\">2022</a>)</cite>. 
Alternatively, when a reference answer is known, one can perform string-matching approaches heuristically to determine whether the model’s answer matches the ground truth <cite>(Dua et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib30\" title=\"\">2019</a>; Joshi et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib53\" title=\"\">2017</a>; Hendrycks et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib43\" title=\"\">2021</a>)</cite>.</p></div>\n<div id=\"S2.SS1.p5\">\n<p id=\"S2.SS1.p5.1\">This challenge does not necessarily impact other applications of language models and related technologies, such as playing games where it is easy to check that the game has ended <cite>(Romstad et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib97\" title=\"\">2008</a>; Silver et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib107\" title=\"\">2018</a>; FAIR et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib31\" title=\"\">2022</a>)</cite>, more constrained scientific applications <cite>(Jumper et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib54\" title=\"\">2021</a>; Ahdritz et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib1\" title=\"\">2022</a>)</cite>, or domains where we have <span id=\"S2.SS1.p5.1.1\">practically usable verifiers</span> even when the solutions are not checkable in all contexts <cite>(Biderman, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib12\" title=\"\">2020</a>; Biderman &amp; Raff, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib13\" title=\"\">2022</a>; Lewkowycz et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib59\" title=\"\">2022</a>)</cite>. 
In the case of LLMs, the most notable cases where such a ground-truth verifier is available are coding and mathematics problems, although the verifiers used, such as unit tests, may still break down in edge cases <cite>(Liu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib67\" title=\"\">2023a</a>)</cite>.</p></div>\n</section>\n<section id=\"S2.SS2\">\n<h3>\n<span>2.2 </span>Benchmark Design and Validity</h3>\n<div id=\"S2.SS2.p1\">\n<p id=\"S2.SS2.p1.1\">Typically, we do not care about the actual numeric score of a model on a benchmark. Instead, we desire the benchmark to be a useful proxy for some real-world phenomenon. The <span id=\"S2.SS2.p1.1.1\">validity</span> of an evaluation is the extent to which these correlate <cite>(Messick, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib76\" title=\"\">1994</a>)</cite>. For a recent overview of validity concerns in NLP benchmarking, see <cite>Subramonian et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib111\" title=\"\">2023</a>)</cite>. Also see <cite>Raji et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib93\" title=\"\">2021</a>); Saphra et&nbsp;al. 
(<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib101\" title=\"\">2023</a>); Davis (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib25\" title=\"\">2023</a>)</cite> for extended discussion of construct validity in LLM evaluation.</p></div>\n<div id=\"S2.SS2.p2\">\n<p id=\"S2.SS2.p2.1\">While validity is an ongoing problem in language model evaluation, we focus on mitigating other concerns first: as we will describe,&nbsp;<span id=\"S2.SS2.p2.1.1\">lm-eval</span>&nbsp;is designed to ensure measurements are <span id=\"S2.SS2.p2.1.2\">consistent</span> across runs and models, regardless of (construct) validity.</p></div>\n</section>\n<section id=\"S2.SS3\">\n<h3>\n<span>2.3 </span>Implementation Difficulties and (Ir)Reproducibility</h3>\n<div id=\"S2.SS3.p1\">\n<p id=\"S2.SS3.p1.1\">Once a benchmark has been designed, it then needs to be implemented by machine learning researchers around the world to see use in driving progress in the field. This introduces a host of new challenges that need to be addressed in order to ensure that everyone is evaluating models on a benchmark in the same fashion when comparing results. This adaptation process can introduce inconsistencies and make it difficult to draw conclusions across different implementations. Researchers must <span id=\"S2.SS3.p1.1.1\">adapt</span> it to their own workflows and libraries for the purposes of actually adopting the benchmark in their research.</p></div>\n<section id=\"S2.SS3.SSS1\">\n<h4>\n<span>2.3.1 </span>“Minor” Implementation Details Matter</h4>\n<div id=\"S2.SS3.SSS1.p1\">\n<p id=\"S2.SS3.SSS1.p1.1\">The importance of interoperability and full reproducibility stems from the fact that language models are incredibly sensitive to precise details that may not be obvious to practitioners. 
Even minor variations in prompts, formatting, or other implementation details can significantly impact the performance and validity of evaluations <cite>(Weber et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib121\" title=\"\">2023</a>; Sclar et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib103\" title=\"\">2023</a>; Mizrahi et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib79\" title=\"\">2024</a>; Alzahrani et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib3\" title=\"\">2024</a>; Lu et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib70\" title=\"\">2022</a>; Webson &amp; Pavlick, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib122\" title=\"\">2022</a>; Min et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib78\" title=\"\">2022</a>)</cite>.\nWithout access to the original evaluation code, when re-implementing evaluation procedures from scratch is required, it is nearly impossible to account for all the subtle details that can affect outcomes. As a result, these implementations are likely to diverge in ways that make it extremely difficult to ensure fair comparisons across works, even when evaluating on the same benchmark. 
Even having the prompts reported in a paper is no substitute for having access to the actual evaluation code: prompts in papers are often incorrect or difficult to map to the exact code implementation because they’ve been stylized to be human-readable.</p></div>\n</section>\n<section id=\"S2.SS3.SSS2\">\n<h4>\n<span>2.3.2 </span>Lack of Agreement About “Apples to Apples”</h4>\n<div id=\"S2.SS3.SSS2.p1\">\n<p id=\"S2.SS3.SSS2.p1.1\">Even assuming that benchmarks are implemented consistently across works, the question of <span id=\"S2.SS3.SSS2.p1.1.1\">how to draw fair comparisons</span> across models and methods is still difficult for LMs.</p></div>\n<div id=\"S2.SS3.SSS2.p2\">\n<p id=\"S2.SS3.SSS2.p2.1\">For instance, different instruction-tuned models may be trained to expect certain formats&nbsp;<cite>(Taori et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib113\" title=\"\">2023</a>; Sanh et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib100\" title=\"\">2022</a>; Wei et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib123\" title=\"\">2022</a>)</cite> – using these models’ intended prompt formats can make the evaluation tasks inherently different or change their difficulty, but not using these can also bias against models trained with formats not matching tasks’ “standard” prompting styles. Likewise, if an original benchmark implementation (including prompting and postprocessing) is tailored for a specific model, other models trained differently will suffer, artificially skewing perceptions of what techniques are effective.</p></div>\n<div id=\"S2.SS3.SSS2.p3\">\n<p id=\"S2.SS3.SSS2.p3.1\">Likewise, some questions of how to set up controlled experiments are still open–is it ideal to normalize performance and comparisons by the number of parameters? Training FLOPs? Inference cost? Must training data be held equal? 
How should models which can leverage external resources such as retrieved documents or external tools be compared? These questions are all context-dependent but can impact findings significantly. For example, <cite>Wang et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib119\" title=\"\">2022</a>)</cite> explore comparisons across architectures and training objectives, and choose to normalize for FLOPs, thus comparing encoder-decoder models with double the parameters to decoder-only models. Comparing results of models with equivalent training FLOPs, regardless of the allocation of those FLOPs, is commonplace (<cite>Hoffmann et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib44\" title=\"\">2022</a>); Peng et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib83\" title=\"\">2023</a>); Touvron et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib115\" title=\"\">2023</a>)</cite>, <span id=\"S2.SS3.SSS2.p3.1.1\">inter alia</span>). However, in a more memory-constrained setting, comparing models equi-parameter may be more logical. 
While this is not inherently problematic, as different application contexts motivate different evaluation criteria, it is common to gloss headline claims as “model X is better than model y” or “model x doesn’t really work as advertised” without paying significant attention to how comparisons were made.</p></div>\n</section>\n<section id=\"S2.SS3.SSS3\">\n<h4>\n<span>2.3.3 </span>Comparisons with Prior Work are Expensive (and Sometimes Impossible)</h4>\n<div id=\"S2.SS3.SSS3.p1\">\n<p id=\"S2.SS3.SSS3.p1.1\">Setting aside the question of establishing fair comparisons between methods or models, an additional key challenge in language modeling research is that many barriers prevent thorough comparison with related work.</p></div>\n<div id=\"S2.SS3.SSS3.p2\">\n<p id=\"S2.SS3.SSS3.p2.1\">Many LMs developed by industrial labs, often used as reference points for benchmarks, have never been released externally <cite>(Chowdhery et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib21\" title=\"\">2023</a>; Hoffmann et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib44\" title=\"\">2022</a>)</cite>, preventing comparisons except by pulling unverified evaluation numbers from technical reports. Those models that have been made available via APIs may non-transparently not match the published versions or otherwise be modified for deployment. Additionally, these API models are quickly <span id=\"S2.SS3.SSS3.p2.1.1\">deprecated</span> and no longer accessible, rendering slews of work no longer reproducible<span id=\"footnote1\"><sup>1</sup><span><span><sup>1</sup><span>1</span>Notably, OpenAI’s <span id=\"footnote1.1\">code-davinci-002</span> model was <a href=\"https://platform.openai.com/docs/deprecations/base-gpt-models\" title=\"\">deprecated in January 2024</a>, making at minimum hundreds of research studies irreproducible.</span></span></span>. 
API access, especially for large volumes of evaluation, is quite expensive.</p></div>\n\n</section>\n</section>\n<section id=\"S2.SS4\">\n<h3>\n<span>2.4 </span>Fast-changing Progress and Conventions</h3>\n<figure id=\"S2.F1\"><img alt=\"Refer to caption\" height=\"246\" id=\"S2.F1.g1\" src=\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/Timeline_v1.png\" width=\"598\">\n<figcaption><span>Figure 1: </span>A timeline showing the relative release dates of a selection of notable benchmarks used to evaluate LMs, as compared to the release dates of BERT <cite>(Devlin et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib29\" title=\"\">2018</a>)</cite>, GPT-2 <cite>(Radford et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib92\" title=\"\">2019</a>)</cite>, GPT-3 <cite>(Brown et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>, and ChatGPT, used as approximate stand-ins for shifts in how the community uses and therefore evaluates LMs.</figcaption>\n</figure>\n<div id=\"S2.SS4.p1\">\n<p id=\"S2.SS4.p1.1\">Due to the time-consuming nature of developing good benchmarks and the rapid pace of change in NLP research in the past decade, many widely used language model evaluation benchmarks do not represent the current paradigm of how language models are trained. 
This has two major impacts:</p><ul id=\"S2.I1\">\n<li id=\"S2.I1.i1\">\n<span>•</span>\n<div id=\"S2.I1.i1.p1\">\n<p id=\"S2.I1.i1.p1.1\">Benchmarks are being used for purposes they were not originally designed for or designed for validity under: for example, a large number of benchmarks have been built around fine-tuning on a known training set and closed space of labels <cite>(Wang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib117\" title=\"\">2019b</a>; <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib116\" title=\"\">a</a>)</cite>.</p></div>\n</li>\n<li id=\"S2.I1.i2\">\n<span>•</span>\n<div id=\"S2.I1.i2.p1\">\n<p id=\"S2.I1.i2.p1.1\">There is no “ground-truth” implementation from the original benchmark authors for many of these popular benchmarks “retrofitted” to be used with prompted autoregressive LMs. In the absence of a clear standard, the community’s methodology for evaluating on these benchmarks may be fragmented or undocumented <cite>(Clark et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib23\" title=\"\">2018</a>; Paperno et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib80\" title=\"\">2016</a>)</cite>.</p></div>\n</li>\n</ul>\n</div>\n<div id=\"S2.SS4.p2\">\n<p id=\"S2.SS4.p2.1\">To illustrate the effects of this development timeline, Figure <a href=\"https://arxiv.org/html/2405.14782v1#S2.F1\" title=\"Figure 1 ‣ 2.4 Fast-changing Progress and Conventions ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>1</span></a> shows how many prominent LM benchmarks were designed prior to shifts such as in-context learning and chat interaction, and therefore were not designed to take these formats and approaches into account. 
This can affect validity or difficulty in unforeseen ways.</p></div>\n</section>\n</section>\n<section id=\"S3\">\n<h2>\n<span>3 </span>Best Practices for Language Model Evaluation</h2>\n<div id=\"S3.p1\">\n<p id=\"S3.p1.1\">While LM evaluation is difficult and suffers from a number of challenges as we have described, there are measures that can be taken to significantly improve current practices. We provide our high-level recommendations regarding such measures, and detail our motivations briefly for each.</p></div>\n<section id=\"S3.SS0.SSS0.Px1\">\n<h5>Always share your exact prompts and code</h5>\n<div id=\"S3.SS0.SSS0.Px1.p1\">\n<ul id=\"S3.I1\">\n<li id=\"S3.I1.i1\">\n<span>•</span>\n<div id=\"S3.I1.i1.p1\">\n<p id=\"S3.I1.i1.p1.1\">If possible, full <span id=\"S3.I1.i1.p1.1.1\">evaluation code</span> including the full prompts used should also be provided for reproducible evaluation runs, as well as further identifiers such as links to specific commits used. Failing this, sharing prompts is often not done, but can drastically improve reproducibility.\n</p></div>\n</li>\n<li id=\"S3.I1.i2\">\n<span>•</span>\n<div id=\"S3.I1.i2.p1\">\n<p id=\"S3.I1.i2.p1.1\">For fair comparison against other models, evaluation should be done with the same set of prompts unless there’s a good reason not to. 
<span id=\"S3.I1.i2.p1.1.1\">Prompts should not be optimized for performance on a given model but not others, and the amount of prompt engineering done should be disclosed.</span></p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px2\">\n<h5>Avoid copying results from other implementations</h5>\n<div id=\"S3.SS0.SSS0.Px2.p1\">\n<ul id=\"S3.I2\">\n<li id=\"S3.I2.i1\">\n<span>•</span>\n<div id=\"S3.I2.i1.p1\">\n<p id=\"S3.I2.i1.p1.1\">Comparing results across papers can be misleading due to a wide range of experimental differences, including prompts, sample size, metric calculation, and more <cite>(Marie et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib74\" title=\"\">2021</a>)</cite>.</p></div>\n</li>\n<li id=\"S3.I2.i2\">\n<span>•</span>\n<div id=\"S3.I2.i2.p1\">\n<p id=\"S3.I2.i2.p1.1\">Results should <span id=\"S3.I2.i2.p1.1.1\">not</span> be copied or reported from other papers&nbsp;<cite>(Marie, <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib73\" title=\"\">2022</a>)</cite> whenever possible, unless one can verify that the exact same code has been used to run the experiments in those papers. 
If such copying is unavoidable, it should be clearly marked as such and treated carefully.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px3\">\n<h5>Always provide model outputs</h5>\n<div id=\"S3.SS0.SSS0.Px3.p1\">\n<ul id=\"S3.I3\">\n<li id=\"S3.I3.i1\">\n<span>•</span>\n<div id=\"S3.I3.i1.p1\">\n<p id=\"S3.I3.i1.p1.1\">Providing model outputs alongside evaluation code can allow others to recalculate scores based on these artifacts, which can be useful for performing statistical significance testing and for assessing the impact of different evaluation metrics or scoring approaches.</p></div>\n</li>\n<li id=\"S3.I3.i2\">\n<span>•</span>\n<div id=\"S3.I3.i2.p1\">\n<p id=\"S3.I3.i2.p1.1\">Evaluation of large models or APIs can be quite costly–sharing such artifacts allows researchers without access to significant compute to participate in evaluation research.</p></div>\n</li>\n<li id=\"S3.I3.i3\">\n<span>•</span>\n<div id=\"S3.I3.i3.p1\">\n<p id=\"S3.I3.i3.p1.1\">Finally, sharing outputs can allow results on API models to be reproduced to some extent, even if the models are subsequently deprecated.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px4\">\n<h5>Perform qualitative analyses</h5>\n<div id=\"S3.SS0.SSS0.Px4.p1\">\n<ul id=\"S3.I4\">\n<li id=\"S3.I4.i1\">\n<span>•</span>\n<div id=\"S3.I4.i1.p1\">\n<p id=\"S3.I4.i1.p1.1\">Qualitatively review a small batch of results and outputs before testing at scale: it is very easy to have bugs in your generation code, especially when working with multiple sets of benchmarks, prompts, and models of different architectures. Catching issues early can save a lot of time and compute, and increase confidence in results.</p></div>\n</li>\n<li id=\"S3.I4.i2\">\n<span>•</span>\n<div id=\"S3.I4.i2.p1\">\n<p id=\"S3.I4.i2.p1.1\">Quantitative scores only provide so much information. 
To understand why a model is scoring so well or so poorly, it is important to do some sort of qualitative error analysis. This can sometimes reveal superficial errors that are easier to correct with post-processing&nbsp;<cite>Bawden &amp; Yvon (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib8\" title=\"\">2023</a>)</cite>, or more fundamental errors.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S3.SS0.SSS0.Px5\">\n<h5>Perform statistical significance testing</h5>\n<div id=\"S3.SS0.SSS0.Px5.p1\">\n<ul id=\"S3.I5\">\n<li id=\"S3.I5.i1\">\n<span>•</span>\n<div id=\"S3.I5.i1.p1\">\n<p id=\"S3.I5.i1.p1.1\">Most works on language modeling do not perform statistical significance testing <cite>(Marie et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib74\" title=\"\">2021</a>)</cite>. This simple addition can dramatically boost the reliability of claimed results.</p></div>\n</li>\n<li id=\"S3.I5.i2\">\n<span>•</span>\n<div id=\"S3.I5.i2.p1\">\n<p id=\"S3.I5.i2.p1.1\">Although costly, reporting results run over more than one random seed can dramatically boost the validity and utility of results. For example, averaging across model runs <cite>(Sellam et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib104\" title=\"\">2022</a>)</cite>, or averaging over multiple selections of few-shot examples.</p></div>\n</li>\n</ul>\n</div>\n</section>\n</section>\n<section id=\"S4\">\n<h2>\n<span>4 </span>The Language Model Evaluation Harness</h2>\n<div id=\"S4.p1\">\n<p id=\"S4.p1.1\">Informed by these practices we have built <span id=\"S4.p1.1.1\">lm-eval</span>. 
Unlike prior work on unified benchmarking libraries <cite>(Liang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib63\" title=\"\">2023</a>; Srivastava et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib110\" title=\"\">2022</a>)</cite>, the Evaluation Library does not seek to solely prescribe what the correct benchmark or evaluation protocols to use are, and allows users to select their desired tasks and use cases.</p></div>\n<div id=\"S4.p2\">\n<p id=\"S4.p2.1\">The role of the <span id=\"S4.p2.1.1\">lm-eval</span>&nbsp;is to solve the <span id=\"S4.p2.1.2\">orchestration problem</span>: previously, performing thorough LM evaluations would require painstaking re-implementation of previous tasks (likely to introduce subtle methodological divergences) or the individual installation and usage of extant code for each benchmark. Our goal is to make it easy to allow researchers or library users to simply install one codebase, and run their method plus selected baselines on their desired tasks in a controlled fashion. We hope to make it more effort-intensive to <span id=\"S4.p2.1.3\">not</span> follow minimum best practices for evaluation rigor.</p></div>\n<section id=\"S4.SS1\">\n<h3>\n<span>4.1 </span>Design</h3>\n<div id=\"S4.SS1.p1\">\n<p id=\"S4.SS1.p1.1\">We provide an overview of <span id=\"S4.SS1.p1.1.1\">lm-eval</span>’s major components and design philosophy. 
At its core, <span id=\"S4.SS1.p1.1.2\">lm-eval</span>&nbsp;allows for the contribution of two types of implementations: evaluation <span id=\"S4.SS1.p1.1.3\">Tasks</span> and integrations with novel <span id=\"S4.SS1.p1.1.4\">LM</span> implementations.</p></div>\n<section id=\"S4.SS1.SSS0.Px1\">\n<h5>Tasks</h5>\n<div id=\"S4.SS1.SSS0.Px1.p1\">\n<p id=\"S4.SS1.SSS0.Px1.p1.1\"><span id=\"S4.SS1.SSS0.Px1.p1.1.1\">lm-eval</span>&nbsp;is built around modular implementations of evaluation tasks, implemented as a <span id=\"S4.SS1.SSS0.Px1.p1.1.2\">Task</span> class using a common API. This allows tasks to be collected in a common library, for new tasks to be extended or implemented easily, and for novel tasks to be easily shared reproducibly among practitioners or other library users. Users can implement tasks either via YAML-based configuration files or via subclassing the provided <span id=\"S4.SS1.SSS0.Px1.p1.1.3\">Task</span> class and providing custom code for specific methods. In Figure <a href=\"https://arxiv.org/html/2405.14782v1#S4.F2\" title=\"Figure 2 ‣ Tasks ‣ 4.1 Design ‣ 4 The Language Model Evaluation Harness ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2</span></a>, we show an example of the evaluation logic packaged within a <span id=\"S4.SS1.SSS0.Px1.p1.1.4\">Task</span> class.</p></div>\n<figure id=\"S4.F2\"><img alt=\"Refer to caption\" height=\"199\" id=\"S4.F2.g1\" src=\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/taskoverview.png\" width=\"598\">\n<figcaption><span>Figure 2: </span>The operations performed by a <span id=\"S4.F2.7.1\">Task</span> object in <span id=\"S4.F2.8.2\">lm-eval</span>. 
Tasks are configured by YAML files or as a Python subclass, and encompass 1) a data source (using the <span id=\"S4.F2.9.3\">Datasets</span> library<cite>(Lhoest et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib60\" title=\"\">2021</a>)</cite>), 2) tools for defining prompts and format, 3) mapping these prompts to rendered inputs and expected output type from an <span id=\"S4.F2.10.4\">LM</span> in the form of <span id=\"S4.F2.11.5\">Requests</span>, and 4) rules for post-processing the <span id=\"S4.F2.12.6\">LM</span>’s outputs and calculating the final task metrics.</figcaption>\n</figure>\n<div id=\"S4.SS1.SSS0.Px1.p2\">\n<p id=\"S4.SS1.SSS0.Px1.p2.1\">We provide a number of implementations for common tasks, and accept new tasks sourced from the community. We strive to match the paper originally introducing a benchmark dataset in its methodology, including using the same prompts if applicable. For tasks such as those introduced prior to prompted evaluation becoming the standard, we source evaluation methodology from the paper first posing the evaluation dataset as a prompted task. For example, we implement many tasks as adapted for in-context learning by <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>.</p></div>\n</section>\n<section id=\"S4.SS1.SSS0.Px2\">\n<h5>LMs</h5>\n<div id=\"S4.SS1.SSS0.Px2.p1\">\n<p id=\"S4.SS1.SSS0.Px2.p1.1\">The next core piece of <span id=\"S4.SS1.SSS0.Px2.p1.1.1\">lm-eval</span>&nbsp;is the <span id=\"S4.SS1.SSS0.Px2.p1.1.2\">LM</span> API. 
Because effective <span id=\"S4.SS1.SSS0.Px2.p1.1.3\">orchestration</span> is our core goal, we allow arbitrary software libraries or (autoregressive) language model architectures to extend a provided interface for <span id=\"S4.SS1.SSS0.Px2.p1.1.4\">LM</span> objects.</p></div>\n<div id=\"S4.SS1.SSS0.Px2.p2\">\n<p id=\"S4.SS1.SSS0.Px2.p2.1\">For ease of use, and compartmentalization of the model definition and external library integrations for custom models away from core evaluation logic, we assume that LMs operate upon dispatched <span id=\"S4.SS1.SSS0.Px2.p2.1.1\">Requests</span> which consist of mapping <span id=\"S4.SS1.SSS0.Px2.p2.1.2\">string inputs</span> to some <span id=\"S4.SS1.SSS0.Px2.p2.1.3\">string or probability</span> as output. We thus abstract tokenizers away within the <span id=\"S4.SS1.SSS0.Px2.p2.1.4\">LM</span> class, and treat a neural language model combined with its tokenizer as a single system being evaluated.</p></div>\n<div id=\"S4.SS1.SSS0.Px2.p3\">\n<p id=\"S4.SS1.SSS0.Px2.p3.1\">LMs implement a simple interface, consisting of several types of <span id=\"S4.SS1.SSS0.Px2.p3.1.1\">Requests</span> in order to be used within the library for all supported tasks.</p></div>\n</section>\n<section id=\"S4.SS1.SSS0.Px3\">\n<h5>Request Types</h5>\n<div id=\"S4.SS1.SSS0.Px3.p1\">\n<p id=\"S4.SS1.SSS0.Px3.p1.1\">We allow for 3 core types of <span id=\"S4.SS1.SSS0.Px3.p1.1.1\">Requests</span> that may be sent to a language model, which consist of distinct types of <span id=\"S4.SS1.SSS0.Px3.p1.1.2\">measurements</span> that can be performed to observe a model’s response or latent capabilities in a prompted format. 
These are:</p></div>\n<div id=\"S4.SS1.SSS0.Px3.p2\">\n<ul id=\"S4.I1\">\n<li id=\"S4.I1.i1\">\n<span>•</span>\n<div id=\"S4.I1.i1.p1\">\n<p id=\"S4.I1.i1.p1.1\">(Conditional) Loglikelihoods (<span id=\"S4.I1.i1.p1.1.1\">loglikelihood, multiple_choice</span>) - computing the probability of given output string(s), conditioned on some provided input.</p></div>\n</li>\n<li id=\"S4.I1.i2\">\n<span>•</span>\n<div id=\"S4.I1.i2.p1\">\n<p id=\"S4.I1.i2.p1.1\">Perplexities (<span id=\"S4.I1.i2.p1.1.1\">loglikelihood_rolling</span>) - measuring the average loglikelihood or probability of producing the tokens in a given dataset.</p></div>\n</li>\n<li id=\"S4.I1.i3\">\n<span>•</span>\n<div id=\"S4.I1.i3.p1\">\n<p id=\"S4.I1.i3.p1.1\">Generation (<span id=\"S4.I1.i3.p1.1.1\">generate_until</span>) - generating text until a given stopping condition is reached, from a model conditioned on some provided input.</p></div>\n</li>\n</ul>\n</div>\n<figure id=\"S4.F3\"><img alt=\"Refer to caption\" height=\"295\" id=\"S4.F3.g1\" src=\"https://arxiv.org/html/2405.14782v1/extracted/5594376/images/typesoftasks.png\" width=\"598\">\n<figcaption><span>Figure 3: </span>Overview of the three core <span id=\"S4.F3.2.1\">Request</span> types supported by our evaluation framework. These include (1) conditional loglikelihoods, (2) perplexities, and (3) generation-based requests.</figcaption>\n</figure>\n<div id=\"S4.SS1.SSS0.Px3.p3\">\n<p id=\"S4.SS1.SSS0.Px3.p3.1\">Provided with these three primitive operations, we are able to implement the major ways in the literature that have been used to evaluate LMs (<cite>Gao et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib36\" title=\"\">2020</a>)</cite>, <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>, <span id=\"S4.SS1.SSS0.Px3.p3.1.1\">inter alia</span>). 
While these high-level approaches are standard, they all contain a number of subtle implementation decisions which are often not disclosed in papers. Therefore, we include a full formal description of common implementation details involved in ours and others’ approaches within Appendix <a href=\"https://arxiv.org/html/2405.14782v1#A1\" title=\"Appendix A Formalizing Measurements ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>A</span></a> for completeness, which we hope will be a useful contribution to the literature.</p></div>\n</section>\n</section>\n<section id=\"S4.SS2\">\n<h3>\n<span>4.2 </span>Addressing Challenges and Incorporating Best Practices</h3>\n<div id=\"S4.SS2.p1\">\n<p id=\"S4.SS2.p1.1\">Here we detail how we position <span id=\"S4.SS2.p1.1.1\">lm-eval</span>&nbsp;to address the issues mentioned in Section <a href=\"https://arxiv.org/html/2405.14782v1#S2\" title=\"2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2</span></a> and incorporate the recommendations in Section <a href=\"https://arxiv.org/html/2405.14782v1#S3\" title=\"3 Best Practices for Language Model Evaluation ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>3</span></a>, in order to encourage a more robust evaluation ecosystem.</p></div>\n<section id=\"S4.SS2.SSS0.Px1\">\n<h5>Reproducibility</h5>\n<div id=\"S4.SS2.SSS0.Px1.p1\">\n<p id=\"S4.SS2.SSS0.Px1.p1.1\"><span id=\"S4.SS2.SSS0.Px1.p1.1.1\">lm-eval</span>&nbsp;encourages and enables reproducible evaluation in several ways. 
First, by providing a standardized implementation of many common tasks, practitioners can report on these tasks and ensure they are evaluating on the same prompt and implementation as other users of the library.</p></div>\n<div id=\"S4.SS2.SSS0.Px1.p2\">\n<p id=\"S4.SS2.SSS0.Px1.p2.1\">Alongside task results we report a <span id=\"S4.SS2.SSS0.Px1.p2.1.1\">version</span> field, incremented each time a task must be modified in a way that affects its scoring. Therefore, in the case where task implementations have bugs or must otherwise be updated, one can still reference the version of the task used, to ensure future research can reproduce reported results.</p></div>\n<div id=\"S4.SS2.SSS0.Px1.p3\">\n<p id=\"S4.SS2.SSS0.Px1.p3.1\">While this is not a panacea for the costs of comparing to prior work, and rerunning baselines oneself is advised, when prior work uses our library one can be confident that the results from prior work match what one would have gotten had one rerun it oneself using that version of the library <cite>(Beeching et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib9\" title=\"\">2023</a>)</cite>.</p></div>\n</section>\n<section id=\"S4.SS2.SSS0.Px2\">\n<h5>Qualitative Analysis</h5>\n<div id=\"S4.SS2.SSS0.Px2.p1\">\n<p id=\"S4.SS2.SSS0.Px2.p1.1\"><span id=\"S4.SS2.SSS0.Px2.p1.1.1\">lm-eval</span>&nbsp;provides support for performing qualitative analysis of evaluation scores. 
In keeping with our recommended best practices, we implement the following, which allow for qualitative checks to be a core part of the evaluation workflow when using&nbsp;<span id=\"S4.SS2.SSS0.Px2.p1.1.2\">lm-eval</span>:</p></div>\n<div id=\"S4.SS2.SSS0.Px2.p2\">\n<ul id=\"S4.I2\">\n<li id=\"S4.I2.i1\">\n<span>•</span>\n<div id=\"S4.I2.i1.p1\">\n<p id=\"S4.I2.i1.p1.1\">We allow for artificially <span id=\"S4.I2.i1.p1.1.1\">limit</span>ing the amount of samples used for a given evaluation run, to enable code to be tested and outputs to be reviewed in small batches prior to full evaluation runs.</p></div>\n</li>\n<li id=\"S4.I2.i2\">\n<span>•</span>\n<div id=\"S4.I2.i2.p1\">\n<p id=\"S4.I2.i2.p1.1\">Per-sample logging is supported, for post-hoc reproduction of scores or error analysis of model mistakes or evaluation implementation.</p></div>\n</li>\n</ul>\n</div>\n</section>\n<section id=\"S4.SS2.SSS0.Px3\">\n<h5>Statistical Testing</h5>\n<div id=\"S4.SS2.SSS0.Px3.p1\">\n<p id=\"S4.SS2.SSS0.Px3.p1.1\"><span id=\"S4.SS2.SSS0.Px3.p1.1.1\">lm-eval</span>&nbsp;reports the standard error (SE) of most supported metrics, calculated by either bootstrapping or dividing the sample standard deviation by the root of the sample size.</p></div>\n<div id=\"S4.SS2.SSS0.Px3.p2\">\n<p id=\"S4.SS2.SSS0.Px3.p2.1\">By reporting these SE calculations prominently in every evaluation run, we make it trivial for practitioners to add simple statistical measures such as confidence intervals to their results. 
While we believe more rigorous and widespread statistical testing in LM evaluation is still needed, we hope that this will spur the community to report and be more aware of statistical significance concerns by lowering the difficulty of reporting such measures.</p></div>\n</section>\n</section>\n</section>\n<section id=\"S5\">\n<h2>\n<span>5 </span>Case Studies</h2>\n<div id=\"S5.p1\">\n<p id=\"S5.p1.1\">Finally, we demonstrate <span id=\"S5.p1.1.1\">lm-eval</span>’s utility for improving evaluation rigor and understanding via case studies of its successful usage. We provide additional case studies within Appendix <a href=\"https://arxiv.org/html/2405.14782v1#A2\" title=\"Appendix B Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>B</span></a>.</p></div>\n<section id=\"S5.SS1\">\n<h3>\n<span>5.1 </span>Prompts Massively Affect Results</h3>\n<div id=\"S5.SS1.p1\">\n<p id=\"S5.SS1.p1.1\">As mentioned in Section <a href=\"https://arxiv.org/html/2405.14782v1#S2.SS3.SSS1\" title=\"2.3.1 “Minor” Implementation Details Matter ‣ 2.3 Implementation Difficulties and (Ir)Reproducibility ‣ 2 Challenges in Evaluating Language Models ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>2.3.1</span></a>, language models can be very sensitive to the specific prompt or framing of an evaluation task.</p></div>\n<div id=\"S5.SS1.p2\">\n<p id=\"S5.SS1.p2.1\">Here we detail how <span id=\"S5.SS1.p2.1.1\">lm-eval</span>&nbsp;has been used to study this sensitivity, and can be used to improve confidence in the comparison of scores across models. We focus our attention on two popular language modeling benchmarks: the ARC question answering benchmark <cite>(Clark et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib23\" title=\"\">2018</a>)</cite> and MMLU <cite>(Hendrycks et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib43\" title=\"\">2021</a>)</cite>. 
However, not all papers evaluate on these tasks in the same way as the original formats.</p></div>\n<figure id=\"S5.T1\">\n\n<figcaption><span>Table 1: </span>Comparison of 0-shot model performance for several pretrained LMs <cite>(Black et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib16\" title=\"\">2022</a>; Touvron et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib115\" title=\"\">2023</a>; Penedo et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib82\" title=\"\">2023</a>; Jiang et&nbsp;al., <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib51\" title=\"\">2023</a>; <a href=\"https://arxiv.org/html/2405.14782v1#bib.bib52\" title=\"\">2024</a>)</cite> on ARC (Easy and Challenge subsets) and MMLU across two commonly used prompt styles, evaluated using <span id=\"S5.T1.3.1\">lm-eval</span>.</figcaption>\n</figure>\n<div id=\"S5.SS1.p3\">\n<p id=\"S5.SS1.p3.1\">ARC is first adapted to the in-context learning setting by <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite>, who implement the dataset as a “cloze” task: the model is prompted via <span id=\"S5.SS1.p3.1.1\">‘‘Question: {question}<math alttext=\"\\backslash\" display=\"inline\" id=\"S5.SS1.p3.1.1.m1.1\"><semantics id=\"S5.SS1.p3.1.1.m1.1a\"><mo id=\"S5.SS1.p3.1.1.m1.1.1\" xref=\"S5.SS1.p3.1.1.m1.1.1.cmml\">\\</mo><annotation-xml encoding=\"MathML-Content\" id=\"S5.SS1.p3.1.1.m1.1b\"><ci id=\"S5.SS1.p3.1.1.m1.1.1.cmml\" xref=\"S5.SS1.p3.1.1.m1.1.1\">\\</ci></annotation-xml><annotation encoding=\"application/x-tex\" id=\"S5.SS1.p3.1.1.m1.1c\">\\backslash</annotation><annotation encoding=\"application/x-llamapun\" id=\"S5.SS1.p3.1.1.m1.1d\">\\</annotation></semantics></math>nAnswer:’’</span> and the likelihood of each potential completion string’s <span id=\"S5.SS1.p3.1.2\">text</span> is compared. Comparatively, MMLU is implemented in <cite>Hendrycks et&nbsp;al. 
(<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib42\" title=\"\">2020</a>)</cite> via providing the model with the question text, each of the 4 possible answers preceded by an answer letter A, B, C, or D, and scoring over the <span id=\"S5.SS1.p3.1.3\">answer letters’ probabilities</span>.</p></div>\n<div id=\"S5.SS1.p4\">\n<p id=\"S5.SS1.p4.1\">However, if models do not adopt these approaches, or disclose their exact settings, it is impossible to reliably compare stated model performance. In Table <a href=\"https://arxiv.org/html/2405.14782v1#S5.T1\" title=\"Table 1 ‣ 5.1 Prompts Massively Affect Results ‣ 5 Case Studies ‣ Lessons from the Trenches on Reproducible Evaluation of Language Models\"><span>1</span></a>, we compare evaluation on the Easy and Challenge sets of ARC using the prompt from <cite>Brown et&nbsp;al. (<a href=\"https://arxiv.org/html/2405.14782v1#bib.bib18\" title=\"\">2020</a>)</cite> (“Cloze”) and using an MMLU-style answer letter with explicit multiple choice options (“MMLU-style”). We additionally compare MMLU scores between the original MMLU prompting style (“MMLU-style”) and an approach we term “Hybrid”, consisting of an MMLU-style prompt but using the <span id=\"S5.SS1.p4.1.1\">answer strings</span> instead of answer letters as the set of continuations over which we can scor"
  }
}

▼✓Web search, results: 10, query: "lm-evaluation-harness simple_evaluate predict_only loglikelihood generate_unt..."

▼thought

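LM Evaluation Harness: In-Depth Research Report

One-Sentence Positioning

1. Design Philosophy: Why It Is Needed

In their 2024 paper Lessons from the Trenches on Reproducible Evaluation of Language Models, the EleutherAI team systematically lays out the challenges facing LLM evaluation and the design principles behind lm-eval: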

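Core Problems

The "Key Problem" (semantically equivalent but syntactically different): the same meaning can be expressed in countless ways, and the "best tool" for judging whether two sentences are semantically equivalent is precisely the kind of model we are trying to evaluate. This pushes most benchmarks to a fallback: either recast the task as multiple choice (restricting the answer space) or rely on heuristic string matching.
"Minor" Implementation Details Matter: small changes to the prompt, the formatting, or tokenization details can significantly affect results. The paper's case study shows that the same model can score very differently on ARC under a "Cloze"-style prompt versus an "MMLU-style" prompt (Lessons from the Trenches, Section 5.1).
Cross-paper comparison is nearly impossible: different papers use different prompts, different few-shot example selection, and different metric computations. Even when they claim to evaluate "the same benchmark", what actually runs can be something entirely different.
Rapid deprecation of API models: OpenAI's code-davinci-002 was deprecated in January 2024, leaving hundreds of studies impossible to reproduce.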
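
To make the prompt-style point concrete, here is a minimal sketch of how one multiple-choice question turns into different scoring requests under the "Cloze" and "MMLU-style" formats described in the paper. The function names and the sample question are made up for illustration and are not lm-eval APIs; the exact whitespace and option formatting used by real task configs may differ.

```python
from string import ascii_uppercase


def cloze_requests(question, choices):
    """Cloze style: each answer *string* is scored as a continuation."""
    context = f"Question: {question}\nAnswer:"
    return [(context, f" {choice}") for choice in choices]


def mmlu_style_requests(question, choices):
    """MMLU style: options are listed with letters, only the *letters* are scored."""
    letters = ascii_uppercase[: len(choices)]
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
    context = f"{question}\n{options}\nAnswer:"
    return [(context, f" {letter}") for letter in letters]


# Hypothetical example document (not taken from ARC).
question = "Which gas do plants absorb from the air?"
choices = ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"]

cloze = cloze_requests(question, choices)
mmlu_style = mmlu_style_requests(question, choices)

print(cloze[1])       # context ends in "Answer:", continuation is " Carbon dioxide"
print(mmlu_style[1])  # context lists A-D options, continuation is " B"
```

The model sees a different context and is scored over a different set of continuations in each case, which is why scores from the two styles are not directly comparable.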

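How `lm-eval` Responds

Unified codebase: all tasks are implemented within one framework, so prompts, metrics, and post-processing logic stay consistent
Task versioning: the version number is incremented whenever a change to a task affects scoring, which keeps results traceable
Standard error reporting: evaluation runs report a standard error alongside most supported metrics (bootstrapped where no closed-form estimate applies), lowering the barrier to statistical significance testing
Per-sample logging: model outputs can be recorded for every sample, which helps post-hoc analysis and reproduction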
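
As an illustration of the standard-error idea, the sketch below computes the standard error of an accuracy score two ways: the closed-form estimate for a sample mean, and a bootstrap estimate that works for any metric. This is a stdlib-only sketch under simple assumptions (independent samples, made-up scores); it is not lm-eval's own implementation, and the function names are hypothetical.

```python
import math
import random


def accuracy(scores):
    return sum(scores) / len(scores)


def analytic_stderr(scores):
    """Standard error of the mean: sample std / sqrt(n)."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)


def bootstrap_stderr(scores, metric, iters=1000, seed=0):
    """Resample with replacement, recompute the metric, take the std of the estimates."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = []
    for _ in range(iters):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        estimates.append(metric(resample))
    mean = sum(estimates) / iters
    variance = sum((e - mean) ** 2 for e in estimates) / (iters - 1)
    return math.sqrt(variance)


# Made-up per-sample correctness: 70 correct answers out of 100.
scores = [1] * 70 + [0] * 30
se_analytic = analytic_stderr(scores)
se_bootstrap = bootstrap_stderr(scores, accuracy)

print(f"acc={accuracy(scores):.2f} analytic_se={se_analytic:.4f} bootstrap_se={se_bootstrap:.4f}")
```

For a plain mean the two estimates should roughly agree; the bootstrap version matters for metrics such as F1 or perplexity, where no simple formula is available.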

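2. Architecture: Three Core Abstractions

2.1 Request Types (Three Measurement Primitives)

lm-eval abstracts the measurement of model capability into three basic Request types (Lessons from the Trenches, Section 4.1):

Request type	What it measures	Typical use
`loglikelihood` / `multiple_choice`	The conditional log-probability of an output string given an input	Multiple-choice tasks such as MMLU and HellaSwag
`loglikelihood_rolling`	The log-probability of a full text with no separate context, from which perplexity is derived	Language-modeling tasks such as WikiText (LAMBADA, by contrast, is scored with `loglikelihood` on the final word)
`generate_until`	Text generated from a given input until a stop condition is met	Generative tasks such as GSM8K, BBH, and HumanEval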
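
The three primitives can be sketched on a toy model. The class below is a hypothetical bigram "language model" over whitespace tokens, written only to show what each request type returns; the method names mirror the request types, but the signatures are simplified assumptions and do not match the real lm-eval `LM` interface (which takes batches of request objects).

```python
import math

# Toy bigram table: P(next token | previous token). Values are made up.
BIGRAMS = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 1.0},
    "sat": {".": 1.0},
    "ran": {".": 1.0},
}


class ToyBigramLM:
    def __init__(self, table, floor=1e-9):
        self.table = table
        self.floor = floor  # probability assigned to unseen transitions

    def _score(self, prev, tokens):
        total = 0.0
        for tok in tokens:
            total += math.log(self.table.get(prev, {}).get(tok, self.floor))
            prev = tok
        return total

    def loglikelihood(self, context, continuation):
        """log P(continuation | context)."""
        ctx = context.split()
        prev = ctx[-1] if ctx else "<s>"
        return self._score(prev, continuation.split())

    def loglikelihood_rolling(self, text):
        """log P(text) with no separate context; the basis for perplexity."""
        return self._score("<s>", text.split())

    def generate_until(self, context, until, max_tokens=10):
        """Greedy decoding that stops at the first stop token (token-level match for simplicity)."""
        ctx = context.split()
        prev = ctx[-1] if ctx else "<s>"
        out = []
        for _ in range(max_tokens):
            dist = self.table.get(prev)
            if not dist:
                break
            prev = max(dist, key=dist.get)
            if prev in until:
                break
            out.append(prev)
        return " ".join(out)


lm = ToyBigramLM(BIGRAMS)
ll_cat = lm.loglikelihood("the", "cat")
ll_dog = lm.loglikelihood("the", "dog")
doc_ll = lm.loglikelihood_rolling("the cat sat .")
perplexity = math.exp(-doc_ll / 4)  # 4 tokens in the document
generated = lm.generate_until("the", until=["."])

print(ll_cat > ll_dog, round(perplexity, 3), repr(generated))
```

A multiple-choice task compares `loglikelihood` values across answer options, a perplexity task aggregates `loglikelihood_rolling` over documents, and a generative task post-processes the string returned by `generate_until`.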

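2.2 Task (Task Abstraction)

Each Task encapsulates four stages:

Data source: loaded through the HuggingFace datasets library
Prompt construction: supports YAML configuration (Jinja2 templates), Python subclasses, or prompts imported from PromptSource
Request mapping: renders the prompt into concrete inputs and specifies the expected output type
Post-processing and scoring: parses model output and computes metrics

A Task can be defined through a YAML configuration file (the recommended route) or by subclassing the Python Task class for custom logic (new_task_guide.md).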
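
The four stages can be sketched end to end in plain Python. Everything below is illustrative: the documents are made up, the function names loosely echo task-config concepts but are not the lm-eval API, and the "model" is a stub that simply prefers shorter continuations so the example runs without any dependencies.

```python
# Stage 1: data source (hypothetical in-memory documents instead of a HF dataset).
DOCS = [
    {"question": "What is 2 + 2?", "choices": ["4", "22", "100"], "gold": 0},
    {"question": "What is the capital of France?", "choices": ["Marseille", "Paris"], "gold": 1},
    {"question": "Which light has the shorter wavelength?", "choices": ["red", "ultraviolet"], "gold": 1},
]


# Stage 2: prompt construction.
def doc_to_text(doc):
    return f"Question: {doc['question']}\nAnswer:"


# Stage 3: request mapping, one (context, continuation) pair per answer choice.
def build_requests(doc):
    context = doc_to_text(doc)
    return [(context, f" {choice}") for choice in doc["choices"]]


# Stage 4: post-processing and scoring.
def process_results(doc, loglikelihoods):
    prediction = max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)
    return {"acc": float(prediction == doc["gold"])}


def stub_loglikelihood(context, continuation):
    """Stand-in for a real model: shorter continuations get higher scores."""
    return -float(len(continuation))


def evaluate(docs, loglikelihood_fn):
    per_doc = []
    for doc in docs:
        scores = [loglikelihood_fn(ctx, cont) for ctx, cont in build_requests(doc)]
        per_doc.append(process_results(doc, scores))
    return {"acc": sum(r["acc"] for r in per_doc) / len(per_doc)}


result = evaluate(DOCS, stub_loglikelihood)
print(result)  # the stub gets the first two documents right and the third wrong
```

Swapping `stub_loglikelihood` for a call into an actual model is the only change needed to turn the sketch into a real measurement, which is the separation of concerns the Task abstraction aims for.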

2.3 LM (Model Abstraction)

Models are abstracted behind a unified LM interface that treats the tokenizer and the neural network as a single system. A wide range of model backends is supported:

Category	Backend	Model type name
Local inference	HuggingFace transformers	`hf`
Local inference	vLLM	`vllm`
Local inference	SGLang	`sglang`
Local inference	GGUF / llama.cpp	`gguf`, `ggml`
Local inference	Mamba SSM	`mamba_ssm`
Local inference	NVIDIA NeMo	`nemo_lm`
Local inference	NVIDIA Megatron-LM	`megatron_lm`
Local inference	OpenVINO	`openvino`
Local inference	Windows ML (NPU)	`winml`
API	OpenAI Completions / Chat	`openai-completions`, `openai-chat-completions`
API	Anthropic	`anthropic-chat`, `anthropic-chat-completions`
API	LiteLLM (100+ providers)	`litellm`, `litellm-chat`
API	Textsynth	`textsynth`
API	Azure OpenAI	via `openai-chat-completions`
Local server	OpenAI-compatible API	`local-completions`, `local-chat-completions`
Special	Steered HF models	`steered`
Special	HF Multimodal / vLLM VLM	`hf-multimodal`, `vllm-vlm`

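3. Version History and Recent Developments

Key Releases

Version	Date	Key changes
v0.4.0	2023	Major refactor: config-based task creation, Jinja2 prompt design, vLLM support, new tasks
v0.4.2	2024	Fixed an error in the stderr aggregation formula for group tasks such as MMLU; introduced `TaskManager`
v0.4.3	2024	Chat templating support (`hf` and `vllm`), addressing prompt formatting for instruct models
v0.4.4	2024	Open LLM Leaderboard v2 tasks; prettier output for `lm-eval --tasks list`
v0.4.8	2025	SGLang backend support; model steering vector support
v0.4.9	2025	vLLM data parallel V1; ACPBench Hard; multimodal API model support
v0.4.9.1	2025	Quick patch release: new tasks and fixes
v0.4.9.2	Nov 2025	Python 3.10 minimum; many new benchmarks; GPT-5 support; Azure OpenAI support

v0.4.9.2 Release Notes (Zenodo)

Latest Changes, December 2025 (as noted in the README)

CLI refactor: introduces subcommands (run, ls, validate) and YAML config file support (--config)
Lightweight install: the base package no longer bundles transformers/torch; backends are installed on demand, e.g. pip install lm_eval[hf] or lm_eval[vllm]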

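4. Usage

CLI (Command Line)

# Basic usage
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8

# With a chat template (recommended for instruct models)
lm_eval --model hf \
    --model_args pretrained=mistralai/Mistral-7B-Instruct-v0.2 \
    --tasks mmlu \
    --num_fewshot 5 \
    --apply_chat_template

# API model. Chat-completion endpoints do not return prompt log-probabilities,
# so only generate_until tasks (such as gsm8k) can be scored; loglikelihood
# tasks such as mmlu will not run against them.
lm_eval --model openai-chat-completions \
    --model_args model=gpt-4-turbo \
    --tasks gsm8k \
    --num_fewshot 5 \
    --apply_chat_template

# List available tasks (subcommand form from the refactored CLI;
# older releases use: lm_eval --tasks list)
lm-eval ls tasks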

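Python API

import lm_eval

# model_args is given in the "key=value" string form, which mirrors the CLI.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag", "mmlu"],
    num_fewshot=0,
)

# Per-task metrics live under the "results" key.
print(results["results"])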
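
A small helper can flatten the returned dictionary into rows. The sketch below assumes the layout emitted by the v0.4 series, where each task maps metric keys of the form `metric,filter` (for example `acc,none`) to values, with a sibling `metric_stderr,filter` key; that layout is an assumption to verify against the installed version, `summarize` is a hypothetical helper, and the numbers in `sample` are invented placeholders rather than real scores.

```python
def summarize(results):
    """Return (task, metric, filter, value, stderr) rows from a results dict."""
    rows = []
    for task, metrics in results["results"].items():
        for key, value in metrics.items():
            # Skip non-metric fields (e.g. "alias") and the stderr entries themselves.
            if "," not in key or "_stderr," in key:
                continue
            metric, filter_name = key.split(",", 1)
            stderr = metrics.get(f"{metric}_stderr,{filter_name}")
            rows.append((task, metric, filter_name, value, stderr))
    return rows


# Invented placeholder values that only illustrate the assumed key layout.
sample = {
    "results": {
        "hellaswag": {
            "alias": "hellaswag",
            "acc,none": 0.5,
            "acc_stderr,none": 0.01,
            "acc_norm,none": 0.6,
            "acc_norm_stderr,none": 0.01,
        }
    }
}

rows = summarize(sample)
for task, metric, filter_name, value, stderr in rows:
    print(f"{task:12s} {metric:10s} {filter_name:6s} {value:.3f} +/- {stderr}")
```

Passing the real return value of `simple_evaluate` in place of `sample` should work as long as the key layout matches the assumption above.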

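Advanced Features

--predict_only: only writes out the decoded text without automatic scoring (useful for tasks that cannot be scored automatically)
--log_samples: records the model output for every sample for post-hoc analysis
--use_cache: takes a path and caches model responses there, so an interrupted run can resume without recomputing finished requests
--hf_hub_log_args: pushes results and samples to the Hugging Face Hub
W&B integration: --wandb_args project=my-project logs runs to Weights & Biases automatically
Zeno integration: visualizes evaluation results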
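
The idea behind response caching can be sketched with the standard library: hash each request, look it up in a SQLite table, and call the model only on a miss. This is a simplified illustration under stated assumptions, not the harness's actual cache format or code; `RequestCache` and `fake_model` are hypothetical names.

```python
import hashlib
import json
import sqlite3


class RequestCache:
    """Memoize model responses in SQLite, keyed by a hash of the request."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def _key(request_type, args):
        payload = json.dumps([request_type, args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, request_type, args, compute):
        key = self._key(request_type, args)
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no model call
        value = compute(*args)
        self.db.execute(
            "INSERT INTO cache (key, value) VALUES (?, ?)", (key, json.dumps(value))
        )
        self.db.commit()
        return value


calls = []


def fake_model(context, continuation):
    """Stand-in for an expensive model call; records each invocation."""
    calls.append((context, continuation))
    return -float(len(continuation))


cache = RequestCache()  # pass a file path instead of ":memory:" to persist across runs
request = ["Q: 2+2?\nA:", " 4"]
first = cache.get_or_compute("loglikelihood", request, fake_model)
second = cache.get_or_compute("loglikelihood", request, fake_model)

print(first, second, len(calls))
```

With a file-backed database, a rerun after an interruption would hit the cache for every request that had already completed.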

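5. Ecosystem Position and Comparison with Alternatives

Differences from HELM (Stanford CRFM)

Relationship to lmms-eval

Relationship to NVIDIA NeMo Evaluator

NVIDIA NeMo Evaluator integrates lm-eval directly as the backend for its Academic Benchmarks evaluation (NVIDIA NeMo Microservices docs), a sign that lm-eval has become a standard industry component.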

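6. Known Limitations and Criticisms

6.1 Evaluation Scores Are Not Real-World Capability

6.2 Prompt Sensitivity

6.3 No Multi-Node Evaluation

The hf model type does not natively support multi-node evaluation; this requires an externally hosted inference server or a custom distributed script.

6.4 vLLM Output Does Not Exactly Match HF

vLLM occasionally produces output that differs from the HuggingFace reference implementation. The project provides a comparison script for checking consistency.

6.5 Cohere API Blocked Long-Term

The Cohere API integration has long been in a blocked state because of an API bug (GitHub Issue #395).

6.6 "Evaluation Lag" in Academic Papers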

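7. Summary

lm-evaluation-harness has become the de facto standard for LLM evaluation. Its core value lies not in offering the "best benchmark" but in:

Lowering the barrier to evaluation: a single command runs 60+ benchmarks
Ensuring reproducibility: unified codebase + versioned tasks + standard error reporting
Flexibility and extensibility: new tasks via YAML configuration, with support for nearly every mainstream model backend
Community-driven development: the community keeps contributing new benchmarks and model backends (v0.4.9.2 alone had 32 new contributors)

Key Links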
