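What does the training process of today's mainstream LLMs look like? How do model capabilities improve?

How mainstream LLMs are trained: the full pipeline from base model to product-grade AI

By early 2026, the training of mainstream LLMs has settled into a fairly clear three-stage framework. Taking Richard Kelley's taxonomy as the skeleton (Training LLMs in 2026), combined with Sundeep Teki's detailed technical breakdown (The Complete Guide to Post-Training LLMs) and LLM Stats' tracking of the frontier (Post-Training in 2026), the whole process can be summarized as:

Pre-training → Capability Shaping → Alignment Shaping

The latter two are collectively called post-training, and this is where most of the current competition over model capability takes place: post-training reportedly yields a 20-40% improvement on standard benchmarks, far more than the return from simply adding pre-training compute.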

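I. Pre-training: building the "world knowledge foundation"

What it does

Autoregressive language modeling on a massive, lightly filtered text corpus: given the preceding text, predict the next token. The objective is cross-entropy loss. The output is a base model with broad syntactic, semantic, and factual associations, but it cannot hold a conversation, follow instructions, or refuse harmful requests.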
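To make the objective concrete, here is a minimal sketch of next-token cross-entropy in plain Python. The three-token vocabulary and the logit values are made up for illustration; real training computes the same quantity over batches of tensors.

```python
import math

def next_token_cross_entropy(logits_per_position, target_ids):
    """Mean cross-entropy (in nats) of the true next token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        # Numerically stable log-sum-exp: subtract the max logit first.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log softmax(logits)[target] = log_z - logits[target]
        total += log_z - logits[target]
    return total / len(target_ids)

# Hypothetical 3-token vocabulary, two positions.
uniform = next_token_cross_entropy([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
confident = next_token_cross_entropy([[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]], [1, 2])
print(uniform, confident)
```

A model that spreads probability uniformly over the vocabulary scores ln(3) here, and the loss falls toward zero as more probability mass lands on the correct next token.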

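Key parameters (representative 2024-2025 models)

Model	Parameters	Training tokens	Notes
Llama 3.1	8B / 70B / 405B	15.6 trillion	Three-stage pre-training, ending with annealing
Qwen 2	0.5B–72B	7 trillion (larger models) / 12 trillion (0.5B)	Synthetic data augmentation, about 30 languages
Gemma 2	2B / 9B / 27B	13 trillion (27B)	Knowledge distillation for the smaller models
Apple AFM	3B (on-device) + server	6.3 trillion (server)	Distillation + pruning, three-stage pre-training

These details come from Sebastian Raschka's systematic survey (New LLM Pre-training and Post-training Paradigms).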

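Staging within pre-training

Since 2024, pre-training at most labs is generally no longer a single run from start to finish; it is split into 2-3 sub-stages:

Core pre-training: large-scale general corpus, base context window (e.g. 4K-8K tokens)
Continued pre-training: raise the share of code and math data, down-weight low-quality web-crawl data
Context lengthening / annealing: use long-sequence data plus synthetic data to extend the context window to 32K-128K; or run a final "annealing" pass on a very small, high-quality data mix to improve benchmark performance

Both Apple AFM and Llama 3.1 used three-stage pre-training. Llama 3.1's context extension was itself done in six steps, growing gradually from 8K to 128K.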

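Data quality is gaining weight

A recurring theme: data quality matters more than data quantity. Meta's LIMA paper (2023) showed that supervised fine-tuning of a 65B LLaMA model on just 1,000 carefully curated instruction examples produced responses that human raters judged equivalent or preferable to GPT-4's in 43% of cases. That is a fine-tuning result rather than a pre-training one, but it illustrates the point. All mainstream teams now invest heavily in data filtering, deduplication, decontamination (removing overlap with benchmarks), and mixture optimization, rather than blindly scaling up data volume.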
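As an illustration of the decontamination step, the sketch below drops any training document that shares a word-level n-gram with a benchmark item. The function names, the whitespace tokenizer, and the default n=8 are assumptions chosen for brevity; production pipelines use their own tokenization and thresholds.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document, benchmark_items, n=8):
    """True if the document shares at least one n-gram with any benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

def decontaminate(corpus, benchmark_items, n=8):
    """Keep only documents with no n-gram overlap against the benchmark."""
    return [doc for doc in corpus if not is_contaminated(doc, benchmark_items, n)]

# Hypothetical data; n=4 only because the toy strings are short.
benchmark = ["what is the sum of two and three"]
corpus = [
    "Quiz: what is the sum of two and three ? answer five",
    "the cat sat on the mat by the door today",
]
kept = decontaminate(corpus, benchmark, n=4)
print(kept)
```

Exact n-gram matching is cheap but misses paraphrased leaks, which is one reason decontamination is usually combined with other filters.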

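II. Capability Shaping: from "can talk" to "can get things done"

Pre-training produces a general-purpose next-token predictor. Capability shaping continues training (still with a cross-entropy objective, but on more carefully selected data) to strengthen specific capabilities (Training LLMs in 2026):

Context Extension: extend the usable context window
Code / Math Corpora Training: strengthen formal reasoning with high-quality programming and math corpora
Synthetic Reasoning Training: induce multi-step reasoning with model-generated or programmatically generated reasoning chains
Multilingual Balancing: reweight data across languages
Tool-use Corpora: train the model to understand when and how to call external tools (search, code execution, APIs)

There is a feedback loop between this step and alignment: attempts to align the model often reveal new capability gaps, which are then addressed with additional training.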

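III. Alignment Shaping: from "capable model" to "product model"

This is the core and most complex part of post-training. The current consensus is a three-stage pipeline (The Complete Guide to Post-Training LLMs):

Stage 1: SFT (Supervised Fine-Tuning)

Train the model on high-quality instruction-response pairs so that it learns the format, tone, and structure of instruction following. This is the key step that turns an "autocomplete engine" into a "conversational assistant".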

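Technical options:

Full Fine-Tuning (all parameters, FP16): highest quality; the cited guide quotes 4-8 H100s for a 70B model at $1,000-5,000 per run. Weights, gradients, and Adam optimizer states in mixed precision come to roughly 16 bytes per parameter (about 1.1 TB for 70B), so that GPU count presumably assumes sharding with CPU offloading or a memory-reduced optimizer.
LoRA (Low-Rank Adaptation): freeze the base model and inject small trainable matrices, cutting trainable parameters by 90-99%. The source says a single A100 can handle a 70B model; since 16-bit weights for 70B alone take about 140 GB, on an 80 GB card that would require a quantized base model or offloading.
QLoRA: 4-bit quantization + LoRA. The QLoRA paper reports fine-tuning a 65B model on a single 48 GB GPU; a 70B model at 4 bits needs about 35 GB for the weights alone, so a single 24 GB consumer GPU is realistic for models up to roughly the 30B class, not 70B.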
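The arithmetic behind these three options can be sketched in a few lines. The 8192 x 8192 layer shape and rank 16 are hypothetical, and the memory figures cover only the terms named in the comments (no activations, KV cache, or framework overhead), so treat them as lower bounds.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a fraction of the frozen d_out x d_in matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return lora / full

def weights_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def full_finetune_gb(num_params, bytes_per_param=16):
    """Mixed-precision Adam: 2 (weights) + 2 (grads) + 12 (FP32 master copy, momentum, variance)."""
    return num_params * bytes_per_param / 1e9

frac = lora_trainable_fraction(8192, 8192, 16)  # hypothetical layer shape and rank
print(f"LoRA trains {frac:.2%} of this matrix")
print(f"70B weights, 16-bit: {weights_gb(70e9, 16):.0f} GB")
print(f"70B weights, 4-bit:  {weights_gb(70e9, 4):.0f} GB")
print(f"70B full fine-tune state: {full_finetune_gb(70e9):.0f} GB")
```

For this layer the adapter is under 0.4% of the original matrix, which is where the "90-99% fewer trainable parameters" range comes from; the exact figure depends on the rank and on which layers receive adapters.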

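The data-quality triad: accuracy, diversity, and complexity. In successful SFT datasets, math and code typically account for about 78% combined (e.g. SmolLM2: Math 39.4%, Code 38.9%), because these domains have objectively verifiable correct answers.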

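Stage 2: Preference Alignment

Teach the model to pick, among several viable responses, the one that is most helpful, harmless, and honest.

RLHF (Reinforcement Learning from Human Feedback): train a separate reward model to learn human preferences, then optimize the policy model with PPO (Proximal Policy Optimization). Powerful but expensive: human annotation costs $1-5 per comparison, and one training run needs hundreds of thousands of comparisons. OpenAI's GPT-4/5 and Google DeepMind's Gemini reportedly still use RLHF as their backbone.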
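The reward model in RLHF is commonly trained with a pairwise (Bradley-Terry style) loss on each human comparison. The sketch below shows that loss for scalar rewards; the function name is illustrative, and a real reward model would produce these scalars from a transformer head.

```python
import math

def reward_model_pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Loss shrinks as the reward model ranks the human-preferred response higher.
print(reward_model_pairwise_loss(2.0, 0.0))  # correct ranking, small loss
print(reward_model_pairwise_loss(0.0, 0.0))  # undecided, loss = ln 2
print(reward_model_pairwise_loss(0.0, 2.0))  # wrong ranking, large loss
```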

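DPO (Direct Preference Optimization): a 2023 breakthrough from Stanford, which showed mathematically that a separate reward model is unnecessary and that the policy can be optimized directly on preference data with a supervised-style loss. Compute is reported at roughly 50% of RLHF, and it avoids the instability of RL training. DPO has become the default alignment method for open-source model development and is also widely used by frontier labs (Post-Training in 2026).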
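A minimal sketch of the DPO loss for a single preference pair follows. The inputs are summed log probabilities of the chosen and rejected responses under the policy and under a frozen reference model; beta=0.1 is an illustrative value, not a recommendation.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log probabilities."""
    # Implicit rewards: how far the policy has moved from the reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), numerically stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0 and the loss is ln 2.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(at_init, improved)
```

The reward model disappears because the log-ratio against the reference plays its role, so training needs only forward passes of two language models and ordinary gradient descent.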

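Successors to DPO:

SimPO: drops even the reference model and uses the average log probability of the response as the implicit reward; scores up to 6.4 points higher than DPO on AlpacaEval 2
KTO: needs only a binary thumbs-up / thumbs-down signal, with no paired comparisons
ORPO: merges SFT and preference optimization into a single training objective, one stage in place of two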
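To show how SimPO removes the reference model, here is a sketch of its loss for one pair: the implicit reward is the length-normalized log probability, and a target margin gamma separates chosen from rejected. The beta and gamma defaults are placeholders, not tuned values.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss for one pair; logp_* are summed token log probabilities."""
    # Implicit reward = beta * average per-token log probability (no reference model).
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)), numerically stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Same per-token quality, different lengths: length normalization makes them tie.
tie = simpo_loss(-10.0, 10, -20.0, 20, beta=2.0, gamma=0.0)
# Chosen response is better per token: loss falls below the tie value.
better = simpo_loss(-5.0, 10, -20.0, 20, beta=2.0, gamma=0.0)
print(tie, better)
```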

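RLAIF + Constitutional AI (the Anthropic route): replace human feedback with AI feedback, bringing cost down from $1-5 per comparison to under $0.01. By 2025 Anthropic's constitution had reportedly grown to about 80 pages; the model critiques itself against it to generate synthetic preference data. Quality is comparable to human RLHF on most dimensions, with better consistency, but it still falls short on edge cases that require real human experience.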
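Putting the per-comparison prices quoted above side by side gives a feel for the gap. The 300,000-comparison run size is a hypothetical stand-in for "hundreds of thousands"; the unit prices are the ones stated in the text.

```python
def annotation_cost(num_comparisons, cost_per_comparison):
    """Total labeling cost in dollars for one preference dataset."""
    return num_comparisons * cost_per_comparison

comparisons = 300_000  # hypothetical run size

human_low = annotation_cost(comparisons, 1.00)   # human label at $1 per comparison
human_high = annotation_cost(comparisons, 5.00)  # human label at $5 per comparison
ai_upper = annotation_cost(comparisons, 0.01)    # AI label at the $0.01 upper bound

print(f"human: ${human_low:,.0f} to ${human_high:,.0f}")
print(f"AI feedback: under ${ai_upper:,.0f}")
print(f"ratio: at least {human_low / ai_upper:.0f}x cheaper")
```

Under these assumptions the labeling bill drops from the hundreds of thousands of dollars to the low thousands, which is why labeling cost is a significant lever at this stage.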

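Stage 3: RL with Verifiable Rewards (reasoning RL), the biggest shift since 2025

This is the newest and fastest-evolving stage of post-training, set off by DeepSeek-R1 (January 2025).

GRPO (Group Relative Policy Optimization): introduced by DeepSeek in the DeepSeekMath work (2024) and central to R1. It removes the critic (value) network, and when paired with rule-based verifiable rewards, as in DeepSeek-R1, no learned reward model is needed either. For each prompt it samples a group of responses (8-16), scores them with a verifiable reward function (is the math answer correct, does the code pass the tests), and computes each response's advantage relative to the group mean, normalized by the group standard deviation, raising the probability of high-reward responses and lowering that of low-reward ones. GRPO still keeps the PPO-style clipped objective and a KL penalty toward a reference policy; what it drops is the value model (Post-Training in 2026).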
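The group-relative advantage at the heart of GRPO fits in a few lines. This sketch covers only the advantage computation, not the clipped policy update; using the population standard deviation and the small eps are assumptions of the sketch.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sampled response relative to its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 8 responses to one prompt, scored 1 (correct) or 0 (wrong).
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])

# If every response gets the same reward, no response is preferred.
print(group_relative_advantages([1, 1, 1, 1]))
```

Because the baseline is the group's own mean, no value network has to be trained; the cost is that each prompt needs several samples.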

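Result: DeepSeek-R1 achieved reasoning performance competitive with OpenAI o1 at a reportedly far lower training cost, and self-reflection and dynamic strategy adjustment emerged spontaneously during training.

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization): proposed by ByteDance and Tsinghua in 2025 to address training instability in long chain-of-thought reasoning. It introduces four techniques (Clip-Higher, Dynamic Sampling, Token-level Loss, Overlong Reward Shaping). With Qwen2.5-32B it reaches 50 points on AIME 2024 using 50% of the training steps of DeepSeek-R1-Zero.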
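Two of the four DAPO techniques are simple enough to sketch. Dynamic Sampling discards prompts whose sampled group is all-correct or all-wrong, since such groups give zero group-relative advantage; Clip-Higher uses a wider upper clipping bound than lower bound on the probability ratio. The function names are illustrative, and the epsilon values (0.2 and 0.28) are the ones I recall from the paper, so treat them as assumptions.

```python
def keep_for_training(group_rewards):
    """Dynamic Sampling: keep a prompt only if its group has mixed outcomes."""
    return len(set(group_rewards)) > 1

def clip_higher(ratio, eps_low=0.2, eps_high=0.28):
    """Asymmetric clipping of the policy probability ratio (decoupled bounds)."""
    return max(1.0 - eps_low, min(ratio, 1.0 + eps_high))

# Hypothetical batch: prompt id -> rewards of its sampled responses.
batch = {
    "p1": [1, 1, 1, 1],  # all correct, no learning signal
    "p2": [0, 1, 0, 1],  # mixed, kept
    "p3": [0, 0, 0, 0],  # all wrong, no learning signal
}
kept = [pid for pid, rewards in batch.items() if keep_for_training(rewards)]
print(kept)
print(clip_higher(1.5), clip_higher(0.5), clip_higher(1.1))
```

The wider upper bound leaves more room to raise the probability of initially unlikely tokens, which the DAPO authors motivate as a guard against entropy collapse.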

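RLVR (Reinforcement Learning with Verifiable Rewards) is the broader paradigm: any task with an objectively verifiable answer (math, code, formal logic, structured output) can be trained with reinforcement learning using a programmatic verification signal in place of human feedback. This is cheaper, faster, and more consistent than human preference data.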
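A verifiable reward is just a function from a model response to a score. The two checkers below are simplified illustrations: one compares the last number in a response against a gold answer, the other checks that a response is valid JSON with required keys. Real verifiers are stricter about answer formats and run code in a sandbox.

```python
import json
import re

def math_reward(response, gold_answer):
    """1.0 if the last number in the response equals the gold answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(gold_answer) else 0.0

def json_reward(response, required_keys):
    """1.0 if the response parses as a JSON object containing all required keys."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    return 1.0 if all(key in obj for key in required_keys) else 0.0

print(math_reward("First 3 + 4, so the answer is 7", "7"))
print(math_reward("The answer is 8.", "7"))
print(json_reward('{"name": "x", "age": 3}', ["name", "age"]))
print(json_reward("not json", ["name"]))
```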

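IV. How do model capabilities improve? A few core drivers

1. Scaling Laws

During pre-training, model performance (measured by cross-entropy loss) improves predictably as parameter count, training data, and compute increase. This is the foundational regularity established by Kaplan et al. (2020) and Chinchilla (2022).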
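A common back-of-the-envelope companion to the scaling laws is the approximation that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch applies it to the Llama 3.1 figures from the table above; the 6·N·D rule is a standard approximation, not a number reported in this answer.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute for a dense transformer: C ~= 6 * N * D."""
    return 6 * num_params * num_tokens

def tokens_per_param(num_params, num_tokens):
    """How many training tokens each parameter 'sees'."""
    return num_tokens / num_params

llama_405b = training_flops(405e9, 15.6e12)
print(f"Llama 3.1 405B: about {llama_405b:.2e} FLOPs")
print(f"405B: {tokens_per_param(405e9, 15.6e12):.0f} tokens per parameter")
print(f"8B:   {tokens_per_param(8e9, 15.6e12):.0f} tokens per parameter")
```

The Chinchilla analysis found a compute-optimal ratio of roughly 20 tokens per parameter, so the 8B model here is trained far beyond that point, trading extra training compute for a smaller model that is cheaper to serve.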

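2. Emergent Abilities

Some abilities (multi-step reasoning, code generation, in-context learning) sit near chance level in small models and then jump once model scale passes a critical threshold. Jason Wei et al. (2022) documented the phenomenon, and Wei's follow-up list catalogs 137 such abilities (137 emergent abilities of large language models). However, the 2023 Stanford paper Are Emergent Abilities of Large Language Models a Mirage? challenged this, arguing that the apparent jumps may be an artifact of nonlinear or discontinuous evaluation metrics rather than a real discontinuity in model behavior (CSET explainer). The current academic view leans toward emergence being a multi-factor phenomenon shaped jointly by scaling, data, prompting strategy, and metric choice.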
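The metric argument can be reproduced with a toy calculation. Suppose per-token accuracy improves smoothly with scale and an answer counts as correct only if all of its tokens are right. Assuming independent token errors (a simplification), exact-match accuracy is the per-token accuracy raised to the answer length, which looks like a sudden jump.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is correct (independence assumed)."""
    return per_token_accuracy ** answer_length

# Hypothetical smooth improvement in per-token accuracy across model sizes.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
exact = [exact_match_rate(p, 10) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token {p:.2f} -> exact match on 10 tokens {em:.4f}")
```

The per-token column rises steadily while the exact-match column stays near zero and then climbs steeply, even though nothing discontinuous happened to the underlying model.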

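3. The amplifying effect of post-training

This is currently the largest source of capability gains. Post-training reportedly delivers a 20-40% improvement on standard benchmarks, well beyond what the same compute would buy if spent on scaling pre-training. This means:

The key to competitive advantage is no longer "who can pre-train the largest model" but "who can post-train a model best".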

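4. Synthetic data and self-play

The data bottleneck is being broken by synthetic data generation. Representative methods:

SPIN (Self-Play Fine-Tuning): the model improves step by step by learning to distinguish its own outputs from human text, with no extra human annotation
SPICE: grounds self-play in external documents to prevent hallucination amplification and model collapse; +8.9% on math reasoning, +9.8% on general reasoning
NVIDIA NeMo Gym: trains agent models in interactive RL environments; Nemotron 3 Super went through 1.2 million rollouts across 21 environment configurations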

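5. Knowledge Distillation

Train a small model (student) on the original training data plus the outputs of a large model (teacher). Apple AFM's 3B on-device model was distilled from a 6.4B teacher; Google's Gemma 2 2B and 9B models also used distillation. This gives small models capability close to that of large models at far lower cost than training from scratch.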
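A classic way to combine "original data plus teacher outputs" is to mix a soft-label term (KL divergence from the teacher's distribution) with the usual hard-label cross-entropy. The sketch below shows that textbook formulation for a single token position; the temperature and alpha values are illustrative, and this is not claimed to be the exact recipe Apple or Google used.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, target_id,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[target_id])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard

# Hypothetical 3-token vocabulary; alpha=1.0 isolates the soft-label term.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], target_id=2, alpha=1.0)
mismatched = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], target_id=2, alpha=1.0)
print(matched, mismatched)
```

The soft-label term carries more information per token than a one-hot label, because the teacher's full distribution tells the student which wrong answers are nearly right.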

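V. How the three frontier labs differ in RL philosophy

Lab	Core method	Characteristics
OpenAI	RLHF + Process Reward Model + large-scale RL	The most aggressive RL scaling strategy; the o1/o3 series represents the state of the art in reasoning RL
Anthropic	Constitutional AI + RLAIF	Replaces raw human preferences with a constitution of about 80 pages, emphasizing consistency and controllability
Google DeepMind	RLHF + PPO + frontier academic research	The most traditional RLHF implementation, complemented by research on reward model robustness, multi-objective optimization, and related topics

Source: The Complete Guide to Post-Training LLMs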

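VI. Frontier trends now under way

LLM Stats summarizes three directions most likely to define post-training research over the coming year (Post-Training in 2026):

Unified pipelines: ORPO has already merged SFT and preference optimization; the next step is to fold all three stages (SFT + preference alignment + reasoning RL) into a single training objective
Environment-native training: a shift from static datasets to interactive RL environments (browsers, file systems, databases, APIs); the gap between "chat models" and "agent models" will widen
Automatic curriculum generation: the model identifies its own weaknesses → generates targeted training data → verifies it with RLVR → trains → repeats. SPICE has already shown an early form of this closed loop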

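Notes / points that still need verification

The exact training recipes of the frontier labs (especially OpenAI and Anthropic), including data mixtures, reward function details, and hyperparameters, are trade secrets and have not been fully disclosed. The analysis above is based on published technical reports and papers.
The academic debate over whether emergent abilities are a "mirage" is ongoing and unresolved. This answer presents the core arguments of both sides for cross-reference.
On which task types synthetic-data self-play hits a ceiling, and whether it leads to diminishing, self-referential gains in model capability, there is currently no systematic large-scale empirical research.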

What does each stage cost, and how does data scale relate to cost?

▼thought

The user is asking about the cost breakdown of each training stage (pre-training, SFT, preference alignment, RL with verifiable rewards), and the relationship between data scale and cost. This is a follow-up to the previous detailed answer about LLM training pipelines. Let me search for specific cost data and scaling/cost relationships for each stage. I should look for: 1. Pre-training costs (compute, infrastructure, data costs) 2. SFT costs (full fine-tuning, LoRA, QLoRA) 3. RLHF/DPO/GRPO costs 4. Data costs (annotation, curation, filtering) 5. How costs scale with data size Let me search broadly.

▼✓Web search query: "LLM pre-training cost per token compute GPU 2024 2025 breakdown", results: 10

{
  "results": [
    {
      "title": "Inference Unit Economics: The True Cost Per Million Tokens - Introl",
      "url": "https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide",
      "description": "# Inference Unit Economics: The True Cost Per Million Tokens. Prices declined faster than PC compute during the microprocessor revolution or bandwidth during the dotcom boom—equivalent performance costs 10x less every year.¹ A capability that cost $20 per million tokens in late 2022 now costs $0.40.² Yet organizations still struggle to understand their true inference costs because token-level pricing obscures infrastructure realities, GPU utilization determines actual unit economics, and optimization techniques create order-of-magnitude variations in cost efficiency. **December 2025 Update:** LLM inference costs declined 10x annually—faster than PC compute or dotcom bandwidth. Self-hosted breakeven requires 50%+ GPU utilization for 7B models, 10%+ for 13B models. ## The inference pricing landscape in December 2025. Claude Opus 4.5 costs $5 per million input tokens and $25 per million output tokens—a significant reduction from Opus 4's $15/$75 pricing.⁷ GPT-4 and similar frontier models price at varied tiers, justified by capabilities that smaller models cannot replicate regardless of cost optimization."
    },
    {
      "title": "How Much Does LLM Training Cost? | Galileo",
      "url": "https://galileo.ai/blog/llm-model-training-cost",
      "description": "# How Much Does LLM Training Cost? Understanding the costs of training LLMs is essential for those looking to create and train models or simply to understand the dynamics of AI as an industry. In this article, we'll explore the cost of training LLM models, the factors that contribute to LLM training costs, and strategies to optimize them effectively. * Training frontier LLMs costs $78M-$192M+ for compute alone. * Key cost drivers have shifted—human data annotation now exceeds compute costs by up to 28x. * Fine-tuning pre-trained models costs 60-90% less than training from scratch. ## **How Much Does It Cost to Train an LLM?**. To appreciate the scale of investment involved in training LLMs, consider the verified costs of notable models. According to the Stanford AI Index Report 2025, frontier model training costs have escalated dramatically—with GPT-4's training estimated at $78-100+ million, and Gemini Ultra 1.0 reaching $192 million, representing a 287,000x increase from the cost of a Transformer model in 2017 ($670)."
    },
    {
      "title": "$ Cost of LLM continued pre-training | by Gili Nachum - Medium",
      "url": "https://medium.com/@gilinachum/cost-of-llm-continued-pre-training-0c1998cb44ec",
      "description": "Key Factors Impacting Cost. Dataset size — More data (measured in tokens) means longer training times and more compute resources needed."
    },
    {
      "title": "What is the cost of training large language models?",
      "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
      "description": "### CUDO arrow Resources arrow What is the cost of training large language models? # What is the cost of training large language models? Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to range from about $500,000 up to $4.6 million, depending on the hardware and optimization techniques used. Training OpenAI’s GPT-4 reportedly cost more than $100 million, with some estimates ranging up to $78 million in compute cost, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on data that encourages logical reasoning."
    },
    {
      "title": "AI & Machine Learning for Everyone | Do you know how much it costs to train large language models (LLM) | Facebook",
      "url": "https://www.facebook.com/groups/AIandMachineLearningforEveryone/posts/9843392082338627",
      "description": "# AI & Machine Learning for Everyone | Do you know how much it costs to train large language models (LLM) | Facebook. Training large language models costs millions. ## AI & Machine Learning for Everyone ·. Do you know how much it costs to train large language models (LLM)? Al training costs are exploding lately. The Stanford Al Index Report has just released some training numbers and they are CRAZY. Training LLMs from scratch costs millions and these numbers are expected to climb even higher with the development of new models. But why is this the case? A team of machine learning, data science, and linguistic experts is essential. Image 12: ➡️ Training and developing LLMs is incredibly expensive due to the vast computational resources required, with nodels like GPT-4 needing thousands of GPUs running for months (!). This extensive use of GPUs, combined with the need for continuous fine-tuning and experimentation, significantly drives up both the hardware and operational costs."
    },
    {
      "title": "Sebastian Raschka, PhD's Post - LinkedIn",
      "url": "https://www.linkedin.com/posts/sebastianraschka_sure-its-widely-known-that-pretraining-activity-7278071569384460288-y_NE",
      "description": "Sure, it's widely known that pretraining large language models (LLMs) is incredibly expensive, but how expensive, exactly?"
    },
    {
      "title": "How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog",
      "url": "https://developer.nvidia.com/blog/llm-inference-benchmarking-how-much-does-your-llm-inference-cost",
      "description": "This guide covers performance metrics (TTFT, latency-throughput trade-offs), infrastructure provisioning, and cost calculations per token to"
    },
    {
      "title": "The Real Price of AI: Pre-Training Vs. Inference Costs",
      "url": "https://www.ankursnewsletter.com/p/the-real-price-of-ai-pre-training",
      "description": "# The Real Price of AI: Pre-Training Vs. Inference Costs. For most companies using AI, **the ongoing cost of running models daily (inference) vastly outweighs the initial training cost**, potentially accounting for 80-90% of the total lifetime expense. Overall inference costs are rising significantly **due to wider business adoption of AI, the demand for real-time performance, the increasing complexity and size of models, and the growing volume of data processe**d. Proactively managing and **optimizing inference costs is becoming a critical roadblock** to achieving real value and return on investment from generative AI deployments. A lot of the discussion centers on the huge effort and significant amount of hardware needed for training the big large language models, or LLMs. But here’s something critical that often gets missed: for almost any company actually putting AI to work, the day-to-day running – the inference stage – tells a very different cost story."
    },
    {
      "title": "Your Guide To Inference Cost (And Make It A Margin Advantage)",
      "url": "https://www.cloudzero.com/blog/inference-cost",
      "description": "# Inference Cost Explained: How to Reduce LLM & AI Inference Spend. Learn how LLM and AI inference cost work, what drives them up at scale, and 8 proven strategies to reduce your AI inference spend without slowing down your team. Learn how LLM and AI inference cost work, what drives them up at scale, and 8 proven strategies to reduce your AI inference spend without slowing down your team. That ongoing, per-request expense is inference cost, and for organizations running AI in production, it is the number that matters most. Inference cost is the compute cost your organization pays every time an AI model generates a response — each prompt sent, each reply received. In production systems, tighter context management reduces inference cost significantly — prompt compression techniques such as summarization, keyphrase extraction, and semantic chunking can achieve 70–94% cost savings in production AI systems, according to research published in Medium citing Stanford and ACL findings."
    },
    {
      "title": "Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective",
      "url": "https://arxiv.org/html/2410.04466v3",
      "description": "# Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform."
    }
  ]
}

▼✓Web search query: "LLM SFT fine-tuning cost LoRA full parameter price estimate 2025", results: 10

{
  "results": [
    {
      "title": "A Comprehensive Guide to LLM Fine-Tuning: Methods ... - Wilson Wu",
      "url": "https://wilsonwu.me/en/blog/2025/llm-fine-tuning",
      "description": "Over the years, fine-tuning techniques have evolved from traditional full-parameter training to efficient, low-cost approaches such as LoRA,"
    },
    {
      "title": "How to fine-tune open LLMs in 2025 with Hugging Face",
      "url": "https://www.philschmid.de/fine-tune-llms-in-2025",
      "description": "# How to fine-tune open LLMs in 2025 with Hugging Face. However, for specialized applications requiring high accuracy or domain expertise, fine-tuning remains a powerful approach to achieve higher quality results than prompting alone, reduce costs by training smaller, more efficient models, and ensure reliability and consistency for specific use cases. Contrary to last years guide How to Fine-Tune LLMs in 2024 with Hugging Face this guide focuses more on optimization, distributed training and being more customizable. This means support for different PEFT methods from Full-Finetuning to QLoRA and Spectrum, optimizations for faster and more efficient training, with Flash Attention or Liger Kernels and how to scale training to multiple GPUs using DeepSpeed. If you are compeltely new to fine-tuning LLMs, I recommend you to start with the How to Fine-Tune LLMs in 2024 with Hugging Face guide and then come back to this guide. This can be a good use case for fine-tuning, as it requires a lot of domain-specific knowledge about math and how to solve math problems."
    },
    {
      "title": "How to Fine Tune your own LLM using LoRA (on a CUSTOM dataset!)",
      "url": "https://www.youtube.com/watch?v=D3pXSkGceY0",
      "description": "How to Fine Tune your own LLM using LoRA (on a CUSTOM dataset!)\nNicholas Renotte\n326000 subscribers\n1299 likes\n41222 views\n9 Jun 2025\nThat gameboy blender animation...took 6 hours to render 😅. \n\nAnyway, had a ton of fun coding this up and finally getting back to some proper ML. I've been thinking about just swapping out the pdf with some specific libraries to get an ultimate Blender bot or Manim bot (also learned it for the video...we're boosting animation quality this year!!). \n\nWould love to see what you end up fine tuning and deploying. If you need a hand just ping me I'm flying out tonight but will answer as much as i can. Love ya, thanks for watching! \n\nCode: https://github.com/nicknochnack/EndToEndLoRA\n\n👨‍💻 Sign up for the Full Stack course and use YOUTUBE50 to get 50% off:\nhttps://www.coursesfromnick.com/bundles/fullstackml\n\nOh, and don't forget to connect with me!\nLinkedIn: https://bit.ly/324Epgo\nFacebook: https://bit.ly/3mB1sZD\nGitHub: https://bit.ly/3mDJllD\nPatreon: https://bit.ly/2OCn3UW\nJoin the Discussion on Discord: https://bit.ly/3dQiZsV\n\nHappy coding!\nNick\n73 comments"
    },
    {
      "title": "A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios",
      "url": "https://arxiv.org/html/2511.00130v1",
      "description": "# A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios. The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting McCloskey and Cohen (1989). In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce scenarios. We find that LoRA provides the most effective balance, successfully instilling new skills with minimal impact on the base model’s general knowledge. An analysis of the trade-off between task accuracy and forgetting as a function of key hyperparameters such as learning rate and LoRA rank."
    },
    {
      "title": "NeurIPS Poster LoRA vs Full Fine-tuning: An Illusion of Equivalence",
      "url": "https://neurips.cc/virtual/2025/poster/115207",
      "description": "Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. But, \\emph{are their learned solutions really equivalent?} We study how LoRA and full-finetuning change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \\emph{intruder dimensions}, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find its forgetting is vastly localized to the intruder dimension -- by causally intervening on the intruder dimensions by changing their associated singular values post-fine-tuning, we show that they cause forgetting. This will be amplified during continual learning because of sequentially fine-tuning, and we show that LoRA models do accumulate intruder dimensions here tend to perform worse in this setting, emphasizing the practicality of our findings."
    },
    {
      "title": "Efficient LLM Fine-Tuning with LoRA | by Raquel Vaz, PhD | Medium",
      "url": "https://medium.com/@raquelhvaz/efficient-llm-fine-tuning-with-lora-e5edb88b64a1",
      "description": "# Efficient LLM Fine-Tuning with LoRA | by Raquel Vaz, PhD | Medium. # Efficient LLM Fine-Tuning with LoRA. It might make sense to use a large model like Gemini to **coordinate and orchestrate**, but the individual agents themselves can — and should — be **smaller, specialized models** fine-tuned for their specific tasks. This article shows how to do that using **LoRA (Low-Rank Adaptation)** — a lightweight technique for Parameter-Efficient Fine-Tuning (PEFT). LoRA allows you to adapt pre-trained language models using a minimal number of additional parameters. *   🧪 **Fine-tune large models** on low-resource infrastructure by training only a small fraction of parameters. *   🧩 **Deploy with zero inference overhead** — at runtime, LoRA adapters are merged into the original weight matrices, so model size and latency remain unchanged. Let’s illustrate it with a concrete example using LoRA on Qwen2.5–0.5B-Instruct, a 0.5 billion‑parameter, instruction‑tuned model in the Qwen2.5 family."
    },
    {
      "title": "Supervised Fine-Tuning (SFT) with LoRA on Fireworks AI - Fireworks AI Docs",
      "url": "https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning",
      "description": "# Supervised Fine-Tuning (SFT) with LoRA on Fireworks AI - Fireworks AI Docs. [Skip to main content](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#content-area). *   [Which model should I use?](https://docs.fireworks.ai/guides/recommended-models). *   [Fine Tuning Overview](https://docs.fireworks.ai/fine-tuning/finetuning-intro). *   [Deploying Fine Tuned Models](https://docs.fireworks.ai/fine-tuning/deploying-loras). *   [Fire Pass Setup](https://docs.fireworks.ai/firepass). *   [Understanding LoRA](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#understanding-lora). *   [List of Supported Models](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#list-of-supported-models). *   [Step-by-Step Guide to Fine-Tuning with Fireworks AI](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#step-by-step-guide-to-fine-tuning-with-fireworks-ai). Preparing the Dataset](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#1-preparing-the-dataset). Uploading the Dataset to Fireworks AI](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#2-uploading-the-dataset-to-fireworks-ai). Creating a Fine-Tuning Job](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#3-creating-a-fine-tuning-job). Monitoring and Managing Fine-Tuning Jobs](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#4-monitoring-and-managing-fine-tuning-jobs). Deploying the Fine-Tuned Model](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#5-deploying-the-fine-tuned-model). Best Practices and Considerations](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#6-best-practices-and-considerations). *   [Appendix](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#appendix). ## [](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#understanding-lora). ## [](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#list-of-supported-models). 
For guide on how to create deployment (dedicated or live-merge deployment), please follow the guide here: creating deployment. For guide on how to deploy a LoRA addon to an existing deployment, you can simply click the `Deploy this LoRA` button in the supervised fine tuning job details page or LoRA model details page, and follow the wizard."
    },
    {
      "title": "A comprehensive overview of everything I know about fine-tuning. : r/LocalLLaMA",
      "url": "https://www.reddit.com/r/LocalLLaMA/comments/1ilkamr/a_comprehensive_overview_of_everything_i_know",
      "description": "[Skip to main content](https://www.reddit.com/r/LocalLLaMA/comments/1ilkamr/a_comprehensive_overview_of_everything_i_know/#main-content)A comprehensive overview of everything I know about fine-tuning. [The-Silvervein](https://www.reddit.com/user/The-Silvervein/). I’ve compiled a small collection of everything I know about fine-tuning LLMs or transformer models for specific use cases. *   You don't need a 100B model for every task you have. *   You must check the licensing to see if you use the model for commercial use cases. 2.   For domain-specific needs like medical, financial, legal, etc., you need to push the model to update its knowledge => Use RAG when applicable or fine-tune the entire model. [Overview of fine-tuning LLMs](https://www.reddit.com/answers/d410a820-a6b1-4545-8e34-5a0c58fd8fb5/?q=Overview+of+fine-tuning+LLMs&source=PDP). [Examples of fine-tuning large language models](https://www.reddit.com/answers/952c2d57-6b89-4d85-aeca-bfbc0788cbef/?q=Examples+of+fine-tuning+large+language+models&source=PDP). [How to fine-tune AI models effectively](https://www.reddit.com/answers/29fb1cd7-3bf8-42ab-9b67-b3047285ad07/?q=How+to+fine-tune+AI+models+effectively&source=PDP). [Understanding fine-tuning in generative AI](https://www.reddit.com/answers/c20a5dac-ca8c-433b-acce-ef48ada1145e/?q=Understanding+fine-tuning+in+generative+AI&source=PDP). *   [Reddit reReddit: Top posts of February 9, 2025 * * *](https://www.reddit.com/posts/2025/february-9-1/global/). *   [Reddit reReddit: Top posts of February 2025 * * *](https://www.reddit.com/posts/2025/february/global/). *   [Reddit reReddit: Top posts of 2025 * * *](https://www.reddit.com/posts/2025/global/). 
"
    },
    {
      "title": "In-depth guide to fine-tuning LLMs with LoRA and QLoRA - Mercity.ai",
      "url": "https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora",
      "description": "In this blog we provide detailed explanation of how QLoRA works and how you can use it in hugging face to finetune your models."
    },
    {
      "title": "WTF is Fine-Tuning? (intro4devs) | [2025]",
      "url": "https://huggingface.co/blog/tegridydev/fine-tuning-dev-intro-2025",
      "description": "# WTF is Fine-Tuning? Full Fine-Tuning: Max Capacity + What It Is + Code Example + Use When + Pros + Cons * 2. Parameter-Efficient Fine-Tuning (PEFT): Efficiency First: Efficiency First\") + Context (2025)\") + a. Instruction Fine-Tuning: Teaching Models to Follow Commands + What It Is + Code Example + Use When + Pros + Cons * 4. Retrieval-Augmented Fine-Tuning (RAFT): External Knowledge Injection: External Knowledge Injection\") + What It Is (2025)\") + Use When + Pros + Cons * 5. Reinforcement Learning from Human Feedback (RLHF): Aligning AI with Human Preferences: Aligning AI with Human Preferences\") + What It Is (2025)\") + Use When + Pros + Cons * Wrapping Up: Matching the Right Gear to The Boss TL;DR. ### What It Is. Full fine-tuning updates all parameters of a model using your dataset, the gold standard for maximizing model performance, ensuring every layer of the model adapts to your specific requirements. Parameter-Efficient Fine-Tuning (PEFT): Efficiency First."
    }
  ]
}

▼✓Web search query: "RLHF DPO GRPO training cost comparison compute human annotation 2024 2025", results: 10

{
  "results": [
    {
      "title": "Preference Tuning LLMs: PPO, DPO, GRPO — A Simple Guide",
      "url": "https://anukriti-ranjan.medium.com/preference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090",
      "description": "[Sign in](https://medium.com/m/signin?operation=login&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=post_page---top_nav_layout_nav-----------------------global_nav------------------). [Sign in](https://medium.com/m/signin?operation=login&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=post_page---top_nav_layout_nav-----------------------global_nav------------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Fp%2F135765c87090&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&user=Anukriti+Ranjan&userId=6f45a42dbb90&source=---header_actions--135765c87090---------------------clap_footer------------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2F135765c87090&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---header_actions--135765c87090---------------------bookmark_footer------------------). [Listen](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2Fplans%3Fdimension%3Dpost_audio_button%26postId%3D135765c87090&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---header_actions--135765c87090---------------------post_audio_button------------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Fp%2F135765c87090&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&user=Anukriti+Ranjan&userId=6f45a42dbb90&source=---footer_actions--135765c87090---------------------clap_footer------------------). 
[](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Fp%2F135765c87090&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&user=Anukriti+Ranjan&userId=6f45a42dbb90&source=---footer_actions--135765c87090---------------------clap_footer------------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2F135765c87090&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---footer_actions--135765c87090---------------------bookmark_footer------------------). [What are your thoughts?](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---post_responses--135765c87090---------------------respond_sidebar------------------). [Mar 19, 2025](https://medium.com/@champagnejin0511/here-the-auther-said-that-grpo-also-has-a-reward-model-however-in-the-table-comparing-the-7eea1ed9c67d?source=post_page---post_responses--135765c87090----0-----------------------------------). [![Image 11: Anukriti Ranjan](https://miro.medium.com/v2/resize:fill:20:20/1*QXaGEeFeyWU9ZFRJWD6o0g.jpeg)](https://anukriti-ranjan.medium.com/?source=post_page---author_recirc--135765c87090----0---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). ## [Building a Sandboxed Environment for AI generated Code Execution ### When building AI agents, especially those that dynamically generate and execute analysis code, one major concern is security and stability…](https://anukriti-ranjan.medium.com/building-a-sandboxed-environment-for-ai-generated-code-execution-e1351301268a?source=post_page---author_recirc--135765c87090----0---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). 
[18 1](https://anukriti-ranjan.medium.com/building-a-sandboxed-environment-for-ai-generated-code-execution-e1351301268a?source=post_page---author_recirc--135765c87090----0---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---author_recirc--135765c87090----0-----------------explicit_signal----d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2Fe1351301268a&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fbuilding-a-sandboxed-environment-for-ai-generated-code-execution-e1351301268a&source=---author_recirc--135765c87090----0-----------------bookmark_preview----d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [![Image 13: Anukriti Ranjan](https://miro.medium.com/v2/resize:fill:20:20/1*QXaGEeFeyWU9ZFRJWD6o0g.jpeg)](https://anukriti-ranjan.medium.com/?source=post_page---author_recirc--135765c87090----1---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). Claude thinks, calls tools, reads tool…](https://anukriti-ranjan.medium.com/observability-with-the-claude-agent-sdk-1dc6bfa9c50e?source=post_page---author_recirc--135765c87090----1---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---author_recirc--135765c87090----1-----------------explicit_signal----d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). 
[![Image 15: Anukriti Ranjan](https://miro.medium.com/v2/resize:fill:20:20/1*QXaGEeFeyWU9ZFRJWD6o0g.jpeg)](https://anukriti-ranjan.medium.com/?source=post_page---author_recirc--135765c87090----2---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). ## [Decoding Agentic Workflows: Towards Practical LLM Integration ### 1.The Rise of the AI Engineering](https://anukriti-ranjan.medium.com/decoding-agentic-workflows-towards-practical-llm-integration-58ee1cb6237a?source=post_page---author_recirc--135765c87090----2---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [12](https://anukriti-ranjan.medium.com/decoding-agentic-workflows-towards-practical-llm-integration-58ee1cb6237a?source=post_page---author_recirc--135765c87090----2---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---author_recirc--135765c87090----2-----------------explicit_signal----d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2F58ee1cb6237a&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fdecoding-agentic-workflows-towards-practical-llm-integration-58ee1cb6237a&source=---author_recirc--135765c87090----2-----------------bookmark_preview----d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [![Image 17: Anukriti Ranjan](https://miro.medium.com/v2/resize:fill:20:20/1*QXaGEeFeyWU9ZFRJWD6o0g.jpeg)](https://anukriti-ranjan.medium.com/?source=post_page---author_recirc--135765c87090----3---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). 
While they…](https://anukriti-ranjan.medium.com/langchain-csv-agent-a-chain-of-function-calls-part-1-f0290e6c4bf3?source=post_page---author_recirc--135765c87090----3---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [131](https://anukriti-ranjan.medium.com/langchain-csv-agent-a-chain-of-function-calls-part-1-f0290e6c4bf3?source=post_page---author_recirc--135765c87090----3---------------------d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---author_recirc--135765c87090----3-----------------explicit_signal----d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2Ff0290e6c4bf3&operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Flangchain-csv-agent-a-chain-of-function-calls-part-1-f0290e6c4bf3&source=---author_recirc--135765c87090----3-----------------bookmark_preview----d0669147_b229_4eb5_a1ba_52c7d1f6ee81--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---read_next_recirc--135765c87090----0-----------------explicit_signal----905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---read_next_recirc--135765c87090----1-----------------explicit_signal----905f8b92_4586_4965_9746_d57a4ae39cd5--------------). 
## [KV Cache Explained: The Complete Guide to KV Cache in LLM Inference ### From PagedAttention to Distributed Caching — How KV Cache Evolved Across 5 Eras to Power Modern LLM Inference](https://luv-bansal.medium.com/the-evolution-of-kv-cache-from-simple-buffers-to-distributed-memory-systems-df51cb8ce26f?source=post_page---read_next_recirc--135765c87090----0---------------------905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---read_next_recirc--135765c87090----0-----------------explicit_signal----905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [![Image 25: Artificial Intelligence in Plain English](https://miro.medium.com/v2/resize:fill:20:20/1*9zAmnK08gUCmZX7q0McVKw@2x.png)](https://ai.plainenglish.io/?source=post_page---read_next_recirc--135765c87090----1---------------------905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [3.1K 9](https://antonrgordon.medium.com/optimizing-llm-inference-prefill-vs-decode-on-multi-gpu-nvidia-systems-91aed335d593?source=post_page---read_next_recirc--135765c87090----1---------------------905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---read_next_recirc--135765c87090----1-----------------explicit_signal----905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---read_next_recirc--135765c87090----2-----------------explicit_signal----905f8b92_4586_4965_9746_d57a4ae39cd5--------------). 
[![Image 29: Towards AI](https://miro.medium.com/v2/resize:fill:20:20/1*JyIThO-cLjlChQLb6kSlVQ.png)](https://pub.towardsai.net/?source=post_page---read_next_recirc--135765c87090----3---------------------905f8b92_4586_4965_9746_d57a4ae39cd5--------------). Learn what the KV Cache is and modern optimization techniques.](https://burakdegirmencioglu.medium.com/the-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d?source=post_page---read_next_recirc--135765c87090----3---------------------905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [15](https://burakdegirmencioglu.medium.com/the-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d?source=post_page---read_next_recirc--135765c87090----3---------------------905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [](https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fanukriti-ranjan.medium.com%2Fpreference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090&source=---read_next_recirc--135765c87090----3-----------------explicit_signal----905f8b92_4586_4965_9746_d57a4ae39cd5--------------). [](https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fbookmark%2Fp%2F9c13140b632d&operation=register&redirect=https%3A%2F%2Fpub.towardsai.net%2Fthe-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d&source=---read_next_recirc--135765c87090----3-----------------bookmark_preview----905f8b92_4586_4965_9746_d57a4ae39cd5--------------)."
    },
    {
      "title": "A Comparison of Reinforcement Learning (RL) and RLHF",
      "url": "https://intuitionlabs.ai/articles/reinforcement-learning-vs-rlhf",
      "description": "RLTHF (Targeted Human Feedback): One of the most significant cost-saving advancements in 2025 is RLTHF, which addresses the high cost of human"
    },
    {
      "title": "Group Relative Policy Optimization (GRPO) - Deep (Learning) Focus",
      "url": "https://cameronrwolfe.substack.com/p/grpo",
      "description": "*Reinforcement Learning from Human Feedback (RLHF)* trains the LLM using RL with rewards derived from a reward model trained on human preferences. We can directly train an LLM on this preference data using a direct alignment algorithm like Direct Preference Optimization (DPO), but we usually incorporate this data into RL by first using it to train a reward model. **PPO & RLHF.** Once the reward model has been trained over the preference data using this loss, the model learns how to assign a preference score to each model completion; see above. During training with RL, we generate new outputs—*or on-policy samples*—from our LLM and score them with the reward model. In this way, the RL training process guides the LLM to produce outputs that align with human preferences, as estimated by the reward model. This model was created by the same authors of DeepSeekMath [1], so R1-Zero also uses GRPO for RL training."
    },
    {
      "title": "Post-Training Matters More Than Pretraining Now: SFT, RLHF, DPO, and GRPO. | by Han HELOIR YAN, Ph.D. ☕️ | Mar, 2026 | AI Advances",
      "url": "https://ai.gopubby.com/post-training-matters-more-than-pretraining-now-sft-rlhf-dpo-and-grpo-1f9beffd391a",
      "description": "## AI Advances. Han HELOIR YAN, Ph.D. Han HELOIR YAN, Ph.D. Free link => *50 claps is the highest score! ## Written by Han HELOIR YAN, Ph.D. An AI Enthusiast & Tech Architect 🌟. Your number of claps reflects your incredible support for this article. If you work with LLMs, you’ve seen all of them. If you’re honest, you can explain maybe three with confidence. If you’re choosing between these techniques for your next project, the wrong choice doesn’t just waste compute. This article walks the full chain, so you pick the right one. If this helps you ship better AI systems:. 👏 **Clap 50 times** (yes, you can!) — Medium’s algorithm favors this, increasing visibility to others who then discover the article. ## Create an account to read the full story. If you’re new to Medium, create a new account to read this story on us. ## Published in AI Advances. Han HELOIR YAN, Ph.D. Han HELOIR YAN, Ph.D."
    },
    {
      "title": "The State of Reinforcement Learning for LLM Reasoning - Ahead of AI",
      "url": "https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training",
      "description": "[![Image 11: The State of LLM Reasoning Model Inference](https://substackcdn.com/image/fetch/$s_!IOSP!,w_140,h_140,c_fill,f_auto,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf9e2677-652a-4af1-9f57-dc0c253d2198_1448x1260.png) #### The State of LLM Reasoning Model Inference [Sebastian Raschka, PhD](https://substack.com/profile/27393275-sebastian-raschka-phd) · March 8, 2025 [Read full story](https://magazine.sebastianraschka.com/p/state-of-llm-reasoning-and-inference-scaling)](https://magazine.sebastianraschka.com/p/state-of-llm-reasoning-and-inference-scaling). The reinforcement learning (RL) training methods used to build and improve reasoning models are more or less related to the reinforcement learning with human feedback (RLHF) methodology that is used to develop and align conventional LLMs. So, I want to start with a small recap of how RLHF works before discussing reasoning-specific modification based on RL-based training. The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models. Interestingly, they find that this simple setup (vanilla PPO and a basic binary reward function based on answer correctness) is sufficient to train models that scale up in both reasoning performance and response length."
    },
    {
      "title": "Complete guide to RLHF for LLMs: How human feedback shapes modern AI",
      "url": "https://toloka.ai/blog/what-is-rlhf",
      "description": "Collect RLHF preference data Why RLHF matters What is Reinforcement Learning from Human Feedback (RLHF)?RLHF meaning Why human feedback is necessary How does RLHF work?Deep dive: The RLHF training pipeline Stage 1: Collecting preference data How to collect RLHF preference data Stage 2: Training the reward model Stage 3: Proximal Policy Optimization (PPO)Alternatives to traditional RLHF: DPO and related methods Direct Preference Optimization (DPO)Reinforcement learning for reasoning: RLVR, GRPO - rise of thinking models Reinforcement Learning from AI Feedback (RLAIF)Other emerging methods Human feedback in RLHF Collecting human feedback at scale How teams collect comparison data Modeling human preferences What RLHF can’t fix When RLHF works well in practice Predictability over raw capability Scope and iteration matter more than scale Preference clarity enables consistency Building a sustainable RLHF pipeline Treat RLHF as infrastructure, not a phase Quality control must be continuous Design for exit, not endless optimization Practical implementation considerations When a language model needs RLHF Operational costs and organizational trade-offs Failure modes in production Tools and frameworks Conclusion: RLHF as operational alignment Subscribe to Toloka news."
    },
    {
      "title": "LLM Alignment: RLHF to DPO & GRPO | MI - 超智諮詢",
      "url": "https://www.meta-intelligence.tech/en/insight-rlhf-alignment",
      "description": "1 The Complete Guide to Prompt Engineering: Master Systematic Prompt Design from Scratch and Unlock the Full Potential of Large Language Models  2 The Complete Guide to Context Engineering: From RAG to Memory Systems, Building Enterprise-Grade AI Knowledge Architectures  3 Reasoning Model Practical Guide: Enterprise Comparison and Deployment Strategies for DeepSeek R1, OpenAI o3, and Gemini 3  4 MCP (Model Context Protocol) Complete Guide: From Protocol Architecture to Hands-On Practice, Building a Universal Bridge Between AI and External Tools  5 AI Agent Interoperability Protocol Practical Guide: A2A and MCP Integration Architecture, Enterprise Deployment, and Standardization Trends  6 The Complete Guide to Agentic Workflow: From ReAct to Multi-Agent Collaboration — Building Autonomous AI Systems  7 The Complete Guide to AI Agent Development: LangGraph vs CrewAI vs AutoGen — Full Framework Comparison from Single-Agent to Multi-Agent Collaboration  8 The Complete Guide to Private LLM Deployment: From Llama to vLLM — Enterprise Architecture for Self-Hosted Large Language Models  9 The Complete Guide to LLM Evaluation: From Benchmark Leaderboards to Human Preference Alignment — Systematic Assessment Methods  10 The Complete Guide to LLM Alignment: From RLHF to DPO and GRPO — A Practical Deep Dive into Aligning Large Language Models with Human Values Current  11 The Complete Guide to MLOps: Building End-to-End Machine Learning Engineering Pipelines from Experiment Management to Model Deployment."
    },
    {
      "title": "It Takes Two: Your GRPO Is Secretly DPO - arXiv",
      "url": "https://arxiv.org/html/2510.00977v2",
      "description": "Group Relative Policy Optimization (GRPO) has emerged as a prominent reinforcement learning algorithm for post-training Large Language Models."
    },
    {
      "title": "The Evolution and Optimization of AI LLM Models Alignment.",
      "url": "https://www.youtube.com/watch?v=fEmewtw9v7k",
      "description": "... Human Feedback (RLHF). We'll explore how Direct Preference Optimization (DPO) turned a complex reinforcement learning problem into a sleek"
    },
    {
      "title": "The Complete Guide to Post-Training LLMs: SFT, RLHF, DPO & GRPO",
      "url": "https://www.sundeepteki.org/advice/the-complete-guide-to-post-training-llms-how-sft-rlhf-dpo-and-grpo-shape-llms",
      "description": "| The Complete Guide to Post-Training LLMs: How SFT, RLHF, DPO, and GRPO Shape LLMs 8/4/2026  0 Comments    Table of Contents    1. Fine-Tuning: A Critical Distinction  2.2 The Three-Stage Pipeline: SFT, Preference Alignment, and Reinforcement Learning  2.3 Why Post-Training Now Accounts for the Majority of Usable Model Capability    3. Supervised Fine-Tuning (SFT): Teaching Models to Follow Instructions  3.1 Full Fine-Tuning, LoRA, and QLoRA - Choosing Your Approach  3.2 Dataset Quality: The Accuracy-Diversity-Complexity Triad  3.3 The Dataset Composition Blueprint    4. Preference Alignment: Making Models Helpful, Harmless, and Honest  4.1 RLHF - The Original Breakthrough  4.2 DPO - Eliminating the Reward Model  4.3 RLAIF and Constitutional AI - Anthropic's Scalable Alternative    5. Reinforcement Learning: The Frontier of Reasoning Models  5.1 GRPO - DeepSeek's Paradigm Shift  5.2 DAPO and RLVR - Verifiable Rewards for Reasoning  5.3 How OpenAI, Anthropic, and Google DeepMind Approach RL Differently    6."
    }
  ]
}

▼thought

Let me get more specific cost data from some of these sources, and also search for the relationship between data scale and training cost.

▼✓Fetch web page content, URL: "https://galileo.ai/blog/llm-model-training-cost"

{
  "url": "about:blank",
  "title": "How Much Does LLM Training Cost? | Galileo"
}

▼✓Web search query: "LLM training cost per token pre-training GPU hours data scale cost relationsh...", results: 10

{
  "results": [
    {
      "title": "How Much Does LLM Training Cost? - Galileo AI",
      "url": "https://galileo.ai/blog/llm-model-training-cost",
      "description": "# How Much Does LLM Training Cost? Understanding the costs of training LLMs is essential for those looking to create and train models or simply to understand the dynamics of AI as an industry. In this article, we'll explore the cost of training LLM models, the factors that contribute to LLM training costs, and strategies to optimize them effectively. * Training frontier LLMs costs $78M-$192M+ for compute alone. * Key cost drivers have shifted—human data annotation now exceeds compute costs by up to 28x. * Fine-tuning pre-trained models costs 60-90% less than training from scratch. ## **How Much Does It Cost to Train an LLM?**. To appreciate the scale of investment involved in training LLMs, consider the verified costs of notable models. According to the Stanford AI Index Report 2025, frontier model training costs have escalated dramatically—with GPT-4's training estimated at $78-100+ million, and Gemini Ultra 1.0 reaching $192 million, representing a 287,000x increase from the cost of a Transformer model in 2017 ($670)."
    },
    {
      "title": "Pre-Training LLMs on a budget: A comparison of three optimizers",
      "url": "https://arxiv.org/html/2507.08472v1",
      "description": "# Pre-Training LLMs on a budget: A comparison of three optimizers. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours but AdamW led to the best downstream evaluation results. This study compares two of them, Lion (Chen et al., 2023) and Sophia (Liu et al., 2024), with AdamW, in the context of small models and considering a limited compute budget. Since a grid search even with only few hyperparameters and a model with approximately 3 billion parameters is not feasible given the usual limits in the compute budget, we employed the Maximal Update Parametrization (μ𝜇\\muitalic\\_μP, Yang et al., 2021) to find optimal values using small proxy models (50 million parameters) and then transfer these values unchanged to the larger target model (2.7 billion parameters)."
    },
    {
      "title": "Understanding cost, options and Technical steps to build LLM from ...",
      "url": "https://medium.com/@anilAmbharii/understanding-cost-options-and-technical-steps-to-build-llm-from-scratch-d68cb9ba7bd9",
      "description": "By simple multiplication, the estimated training cost for a 10 billion parameter model is approximately $50,000, and for a 100 billion parameter"
    },
    {
      "title": "How Long Should You Train Your Language Model? | Databricks Blog",
      "url": "https://www.databricks.com/blog/how-long-should-you-train-your-language-model",
      "description": "# How Long Should You Train Your Language Model? Accounting for Inference in Language Model Scaling Laws. Our recent paper, presented at ICML 2024, **proposes a modified scaling law to account for the cost of** ***both training and inference*****.** This blog post explains the reasoning behind our new scaling law, and then experimentally demonstrates how “overtrained” LLMs can be optimal. The “*Chinchilla”* Scaling Law is the most widely cited scaling law for LLMs. The Chinchilla paper asked the question: If you have a fixed training compute budget, how should you balance model size and training duration to produce the highest quality model? In our research, we were motivated by the idea of training smaller models on more data than the Chinchilla law suggested. By contrast, we assume a *fixed pretraining loss* (i.e. model quality) and find the model size and training duration that *minimize the total compute* over the model’s lifetime, including both training and inference (Fig. 2, Eq. 2)."
    },
    {
      "title": "What is the cost of training large language models? - CUDO Compute",
      "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
      "description": "### CUDO arrow Resources arrow What is the cost of training large language models? # What is the cost of training large language models? Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to range from about $500,000 up to $4.6 million, depending on the hardware and optimization techniques used. Training OpenAI’s GPT-4 reportedly cost more than $100 million, with some estimates ranging up to $78 million in compute cost, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on data that encourages logical reasoning."
    },
    {
      "title": "Training large language models costs millions - Facebook",
      "url": "https://www.facebook.com/groups/AIandMachineLearningforEveryone/posts/9843392082338627",
      "description": "# AI & Machine Learning for Everyone | Do you know how much it costs to train large language models (LLM) | Facebook. Training large language models costs millions. ## AI & Machine Learning for Everyone ·. Do you know how much it costs to train large language models (LLM)? Al training costs are exploding lately. The Stanford Al Index Report has just released some training numbers and they are CRAZY. Training LLMs from scratch costs millions and these numbers are expected to climb even higher with the development of new models. But why is this the case? A team of machine learning, data science, and linguistic experts is essential. Image 12: ➡️ Training and developing LLMs is incredibly expensive due to the vast computational resources required, with nodels like GPT-4 needing thousands of GPUs running for months (!). This extensive use of GPUs, combined with the need for continuous fine-tuning and experimentation, significantly drives up both the hardware and operational costs."
    },
    {
      "title": "You Need to Know About Data Costs for Training an LLM - Abaka AI",
      "url": "https://www.abaka.ai/blog/llm-data-cost",
      "description": "We will provide a detailed interpretation of the composition of LLM data costs from multiple dimensions such as data volume, data quality, and data diversity."
    },
    {
      "title": "Sebastian Raschka, PhD's Post - LinkedIn",
      "url": "https://www.linkedin.com/posts/sebastianraschka_sure-its-widely-known-that-pretraining-activity-7278071569384460288-y_NE",
      "description": "Sure, it's widely known that pretraining large language models (LLMs) is incredibly expensive, but how expensive, exactly?"
    },
    {
      "title": "Exploring the cost of training an AI model on cloud infrastructure",
      "url": "https://nebius.com/blog/posts/cost-of-training-an-ai-model-on-cloud",
      "description": "# Exploring the cost of training an AI model on cloud infrastructure. Training machine learning models can cost anywhere from tens of thousands to millions of dollars depending on model size, dataset volume and infrastructure. In this article we look at the main cost components of training in the cloud, what drives the final bill and how to optimize spending without compromising results. Even within one architecture, budgets can differ by several times depending on parameter count, dataset size, infrastructure setup and efficiency of use. cost = (training time ÷ utilization rate) × resource price × number of resources + overhead (storage, networking, orchestration). Even a smaller model such as BERT-Large can run into tens of thousands of dollars if trained from scratch in the cloud without optimizations. In the cloud this is even more visible: compute, storage, networking and service layers act as a single system and only a balanced setup keeps spending under control."
    },
    {
      "title": "Trends in Artificial Intelligence | Epoch AI",
      "url": "https://epoch.ai/trends",
      "description": "Model Performance 90% CI 10× to 900× 1 to 4 months 1 to 3 OOM](https://epoch.ai/trends#model-performance)[Compute stock growth 3.4 ×/year 6.8 months 0.53 OOM/year The total computing power of the stock of AI chips is growing at a rate of 3.4×/year. AI Companies 90% CI 3.2× to 3.7× 6.4 to 7.2 months 0.51 to 0.57 OOM](https://epoch.ai/trends#ai-companies)[Training compute 5 ×/year 5.2 months 0.7 OOM/year Training compute for frontier language models has been growing at 5× per year since 2020. Training Runs 90% CI 4× to 6× 4.6 to 6.0 months 0.6 to 0.8 OOM](https://epoch.ai/trends#training-runs)[Software progress ÷ 3.0 ×/year 7.6 months 0.5 OOM/year Pre-training compute efficiency is improving at roughly 3.0× per year. Training Runs 90% CI 2.8× to 4.4× 5.6 to 8.1 months 0.4 to 0.6 OOM](https://epoch.ai/trends#training-runs)[Largest AI data center 700,000 H100e The largest known AI data center has computing power equivalent to 700,000 NVIDIA H100 chips. Data Centers 90% CI 500k to 1M H100e](https://epoch.ai/trends#data-centers)[FLOP/s per dollar 1.37 ×/year 2.2 years 0.14 OOM/year AI chip performance per dollar has improved by 37% per year."
    }
  ]
}

▼✓Web search query: "human annotation cost RLHF data labeling per comparison 2024 2025 ScaleAI Surge", results: 8

{
  "results": [
    {
      "title": "RLHF Market Compared: Scale AI, Mercor, Turing | Lemon.io Blog",
      "url": "https://lemon.io/blog/rlhf-platforms-for-data-annotation",
      "description": "Home blog For Employers The Hiring Process RLHF Platforms: Inside the AI Data Annotation and Labeling Market. # RLHF Platforms: Inside the AI Data Annotation and Labeling Market. On the other side are the **platforms** (e.g., Surge AI, Micro1, Mercor, and Labelbox) that provide the infrastructure and services to make this possible, often using approaches like reinforcement learning from human feedback (RLHF). Drawing on Lemon.io’s experience in sourcing and onboarding data annotators for both Tier 1 and Tier 2 labs, this article shows how these platforms create value for the companies building modern AI systems. The data annotation and labeling market is projected to grow from **$1.2 billion in 2024** to **$10.2 billion by 2034,** highlighting the increasing importance of high-quality, context-driven datasets in AI systems. That’s why, similar to the data annotation and labeling market, the RLHF platform market is forecasted to grow from **$2.8 billion in 2025** to **$18.6 billion in 2034.** The table below proves how lucrative it has already become."
    },
    {
      "title": "Data labeling costs surge as compute costs rise for AI models | Daniel Kang posted on the topic | LinkedIn",
      "url": "https://www.linkedin.com/posts/daniel-kang-1223b343_human-data-is-probably-more-expensive-than-activity-7360724422187442177-Madz",
      "description": "However, as post-training techniques (SFT, RL-HF, etc.) that rely on human annotations continue to drive progress in frontier models, we believe that the majority of (marginal) cost is shifting to data. The revenue of major data labeling companies and the marginal compute cost of training of training frontier models for major AI providers in 2024. This finding highlights clear evidence: the cost of acquiring high-quality human-annotated data is rapidly outpacing the compute costs required for training state-of-the-art AI models.\" https://lnkd.in/gCJZ4t9n #AI #AITraining #GenerativeAI #LLMs #DataLabeling #ComputeCosts AI models are only as good as the data they are trained on."
    },
    {
      "title": "Data Labeling Industry Guide: How AI Startups Make Billions (2025)",
      "url": "https://o-mega.ai/articles/how-the-data-labeling-industry-works-full-insider-guide-2025",
      "description": "How AI data labeling startups like Scale, Surge & Mercor are building billion-dollar businesses by connecting expert humans with AI labs. In fact, leading AI companies like OpenAI, Google, Meta, Anthropic and others are each spending on the order of **$1 billion per year** on human-provided training data (foundationcapital.com) (foundationcapital.com). As one investor put it, *“the only way models are now learning is through net new human data”* (techcrunch.com) – meaning that continual human feedback, annotation, and instruction have become crucial for advancing AI capabilities. This demand gave rise to a **data labeling industry** that supplies AI labs with armies of human contractors (labelers or “AI tutors”) who **label, annotate, or generate data** to feed into AI model training (techcrunch.com). From labeling millions of images for computer vision, to crafting high-quality responses and comparisons for training large language models via reinforcement learning from human feedback (RLHF), these human workers provide the “ground truth” that teaches AI models how to behave."
    },
    {
      "title": "Scale AI, Surge AI, and the billion-dollar gig-work industry shaping ...",
      "url": "https://www.instagram.com/reel/DTlcSGDj2gC",
      "description": "Rlhf is better known as data labeling and just like Uber drivers and DoorDashers, data labelers are gig workers paid hourly wages without"
    },
    {
      "title": "Top 10 Human Data Labeling Providers in 2026 - Pin",
      "url": "https://www.pin.com/blog/human-data-labeling-providers",
      "description": "Top 10 human data labeling providers compared for 2026 - Scale AI, Surge AI & 8 more. The top human data labeling providers in 2026 are Scale AI, Surge AI, Labelbox, Snorkel AI, Appen, Sama, iMerit, Telus Digital, Toloka, and SuperAnnotate. For recruiters staffing in-house annotation teams, Pin’s AI sourcing scans 850M+ profiles to find domain experts faster than any traditional job board. The data collection and labeling market reached $4.89 billion in 2025 and is projected to hit $17.10 billion by 2030, growing at a 28.4% CAGR, according to Grand View Research (Nov 2024). > * **Scale AI and Surge AI lead at the top.** Both dominate frontier-model contracts; Surge is the go-to for RLHF on LLMs and Scale for multi-modal enterprise programs. > * **Sourcing annotation talent is the new bottleneck.** For teams building in-house annotation capacity, Pin is the best way to find domain experts in medicine, law, and software - scanning 850M+ profiles that gig platforms simply don’t reach."
    },
    {
      "title": "Scale AI Alternatives for Enterprise AI Teams - Gun.io",
      "url": "https://gun.io/news/2025/12/scale-ai-alternatives-for-enterprise-ai-teams",
      "description": "Meta’s $14.3 billion acquisition of a 49% stake in Scale AI has forced enterprise AI teams to reassess their data annotation partnerships. They haven’t solved access to the engineers, developers, and domain experts required for the work that actually moves models forward: RLHF ranking, code evaluation, safety red-teaming. This guide evaluates the leading Scale AI alternatives across platform capabilities, annotator quality, pricing transparency, and vendor independence. Scale AI’s transformation from neutral market leader to Meta subsidiary represents the most significant vendor risk event in data annotation history. What are the best Scale AI alternatives for enterprise annotation? Scale AI’s ownership crisis accelerated trends already in motion: quality over scale, expert annotators over crowd workers, vendor independence over platform lock-in. The question for AI/ML teams has shifted from “which vendor has scale?” to “which vendor can access the developers, engineers, and domain specialists our training data requires?”. Gun.io connects companies with vetted senior developers and engineers for AI training data annotation, code evaluation, and RLHF workflows."
    },
    {
      "title": "Top RLHF Tools and Platforms to Use in 2026",
      "url": "https://www.taskmonk.ai/blogs/rlhf-data-annotation-platforms",
      "description": "# Top RLHF Platforms in 2026 for LLM Fine-Tuning & Human Feedback. Discover the top RLHF platforms for LLM fine-tuning in 2026. Compare Taskmonk, Scale AI, Surge AI, and open-source tools for scalable human feedback and model alignment. * RLHF platforms collect human feedback at scale to fine-tune LLMs for safety, instruction following, and alignment with human values. * The best platforms combine expert annotator workforces, preference ranking tools, and an API infrastructure for SFT and reward modeling. * Commercial platforms like Taskmonk, Scale AI, Surge AI offer managed services with vetted annotators across technical domains. Reinforcement learning from human feedback trains models to generate responses that humans prefer by collecting preference data at scale, training reward models to predict human judgments, and using those reward signals to fine-tune the base model. You need annotators who can evaluate model outputs across technical domains, platforms that collect pairwise rankings without introducing bias, APIs that integrate feedback loops into training pipelines, and quality control systems that catch low-effort labels before they poison your reward model."
    },
    {
      "title": "RLHF Platforms in Biotech: Scale vs. Labelbox vs. In-House | IntuitionLabs",
      "url": "https://intuitionlabs.ai/articles/rlhf-platforms-biotech-comparison",
      "description": "# RLHF Platforms in Biotech: Scale vs. Reinforcement Learning from Human Feedback (RLHF) is an advanced AI training paradigm that leverages expert human evaluation to refine model outputs. In biotechnology and healthcare, where domain expertise and safety-critical decision-making abound, RLHF (and related human-in-the-loop labeling) is emerging as a crucial approach for training models in diagnostics, drug discovery, genomics, and clinical research. Specialized data-labeling platforms—such as Scale AI (Healthcare division), Labelbox (Healthcare), and Appen (Medical)—offer managed RLHF and annotation services, promising rapid scale, domain expertise, and regulatory compliance. This report provides an in-depth comparison of these external RLHF platforms versus in-house solutions for biotech applications. Appen offers HIPAA-aligned workflows with U.S.-based teams () and integrates data securely via APIs. In RLHF contexts, Appen emphasizes scale and diversity for language models () and has partnered with medical research (e.g. Johns Hopkins) to accelerate labeling: one case enabled **1,500+ person-hours of neuroscience annotation in a few weeks** via Appen’s crowd ()."
    }
  ]
}

▼thought

Let me get more specific details from the Galileo page and other sources. Also, let me look for specific numbers on how cost scales with data size - there's a known formula and I should look at the CUDO Compute and Sebastian Raschka posts.

▼✓Fetch web page content, URL: "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-..."

{
  "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
  "title": "What is the cost of training large language models?",
  "summary": "Explore the true cost of training large language models. Learn about the financial, computational, and environmental costs of AI's most advanced models.",
  "content": "<div id=\"readability-page-1\" class=\"page\"><div data-bglz=\"1\" data-id=\"3f5cd38\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"theme-post-content.default\">\n\t\t\t\t\t\n<p>Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. </p>\n<p>However, training LLMs requires a substantial financial investment due to their vast number of parameters and the enormous computational power needed. It typically involves high-end GPUs or specialized AI accelerators, which are costly resources to acquire and run.</p>\n<p>For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to <a href=\"https://carboncredits.com/carbon-countdown-ais-10-billion-rise-in-power-use-explodes-data-center-emission/#:~:text=The%20final%20training%20run%20of%20GPT%2D3%20is%20estimated%20to%20have%20ranged%20from%20%24500%2C000%20to%20%244.6%20million.\" target=\"_blank\" rel=\"noopener\">range from about $500,000 up to $4.6 million</a>, depending on the hardware and optimization techniques used. By comparison, newer models have pushed costs much higher.</p>\n<p>Training OpenAI’s GPT-4 reportedly <a href=\"https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51\" target=\"_blank\" rel=\"noopener\">cost more than $100 million</a>, with some estimates ranging up to <a href=\"https://hai.stanford.edu/news/ai-index-state-ai-13-charts#:~:text=One%20of%20the%20reasons%20academia,modern%20LLM%2C%20cost%20around%20%24900\" target=\"_blank\" rel=\"noopener\">$78 million in compute cost</a>, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. 
These staggering sums partly reflect why model size and complexity have exploded.</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02a.png\" alt=\"cost_of_llms_chart_1\"></p>\n<blockquote>\n<p>Training compute petaFLOP log scale. Source: <a href=\"https://hai.stanford.edu/news/ai-index-state-ai-13-charts\" target=\"_blank\" rel=\"noopener\">Paper</a></p>\n</blockquote>\n<p>In this article, we’ll explore the expenses of bringing modern generative AI and reasoning models to life, focusing on infrastructure needs, data management, and the increasingly important role of cloud computing.</p>\n<h2>What are large language models?</h2>\n<p>LLMs are AI systems designed to understand and generate language in a way that mimics human communication and reasoning. They are trained on vast datasets containing text from books, websites, code repositories, and other digital content. </p>\n<p>By learning statistical patterns in language, LLMs can generate coherent and contextually relevant text based on a given input. For example, models like GPT are trained on a huge swath of internet text and can produce writing that mirrors human style across many topics.</p>\n<p>These models typically rely on the <a href=\"https://www.cudocompute.com/topics/neural-networks/transformer-models-what-are-they-and-how-do-they-work\">transformer architecture</a>, which uses mechanisms like self-attention to consider the context of words in a sequence, allowing the model to weigh the importance of different parts of the input text relative to each other, enabling a better understanding of context and meaning. 
</p>\n<p>For instance, Google’s BERT model reads text bidirectionally – both left-to-right and right-to-left – to capture context, a significant advancement over older one-directional models, making BERT especially effective for tasks requiring deep language understanding, like question answering or sentiment analysis.</p>\n<p>As LLMs have grown, they’ve gotten better at fluent text generation and shown emergent reasoning abilities. Newer generative models such as <a href=\"https://www.cudocompute.com/blog/why-open-source-models-are-disrupting-the-ai-race\">GPT-4o-mini and DeepSeek R1</a> can perform complex tasks like solving multi-step math problems or writing code by reasoning through steps, especially when guided by techniques like chain-of-thought prompting. </p>\n<p>Given the right prompts, these models can simulate a step-by-step reasoning process, allowing them to handle more complex queries. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on <a href=\"https://arxiv.org/pdf/2110.14168\" target=\"_blank\" rel=\"noopener\">data that encourages logical reasoning</a>. </p>\n<p>Models like Anthropic’s Claude have also been trained with specific methods, like Constitutional AI, to better follow instructions and reason safely. All of these advances, however, come with increased training requirements. Next, we’ll discuss the cost of training a large language model.</p>\n<h2>Cost of training LLMs with cloud infrastructure</h2>\n<p>As AI development increasingly shifts to cloud platforms, <a href=\"https://www.cudocompute.com/blog/gpu-supply-shortage-due-to-ai-needs\">driven partly by limited availability of GPUs</a>, cloud services have become one of the most practical and scalable ways to train LLMs. 
The cloud offers on-demand access to large numbers of GPUs/TPUs and the ability to distribute training across multiple machines. </p>\n<p>Such scalability is excellent for the fluctuating demands of AI training cycles. Major providers have built massive supercomputers in the cloud to facilitate LLM training. Microsoft, for example, <a href=\"https://www.zdnet.com/article/microsoft-builds-a-supercomputer-for-openai-for-training-massive-ai-models/#:~:text=Microsoft%20said%20the%20supercomputer%20built,has%20access%20to%20Azure%20services\" target=\"_blank\" rel=\"noopener\">constructed an Azure supercomputer</a> with over 10,000 GPUs and ultra-fast networking specifically for OpenAI’s model training.</p>\n<p>However, renting such infrastructure comes at a significant cost. A recent example from NVIDIA CEO Jensen Huang illustrates the scale of resources required, stating that training the GPT-MoE-1.8T model using 25,000 Ampere-based GPUs (most likely the A100) took 3 to 5 months. Doing the same with H100 would take about 8,000 GPUs in 90 days.</p>\n<p>The above statement shows how each generation of hardware can improve training efficiency. The H100s offer substantially higher throughput than A100s, reducing the time and number of machines needed for the same task. </p>\n<p>Still, whether using 25,000 older GPUs or 8,000 newer ones, the compute cluster required is enormous, and most organizations simply cannot afford to assemble or rent such resources for training from scratch.</p>\n<p>While it is possible to train a model from scratch, most users won’t train giant LLMs from scratch due to these high costs and infrastructure hurdles. Instead, they’ll use pre-trained models provided by AI labs or open-source communities and then adapt them to their needs. 
This avoids spending millions of dollars in compute for initial training.</p>\n<p>Generally, if a company or researcher needs an LLM, there are two approaches:</p>\n<ul>\n<li><strong>Hosting your own model:</strong> Obtain a pre-trained model checkpoint, either open-source or via license, and run further training or fine-tuning on it using cloud servers that you rent or own.</li>\n<li><strong>Pay-per-token access:</strong> Use a hosted model provided through an API, like OpenAI’s or Google’s, and pay for usage, rather than handling any training yourself.</li>\n</ul>\n<p>Let’s examine each of these approaches and their cost implications.</p>\n<p>Read more: <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch\">How to build an AI</a></p>\n<h3>Hosting models in the cloud</h3>\n<p>If you choose to host and train/fine-tune models in the cloud, you might need to rent compute resources. Companies like CUDO Compute offer suites of services that support the entire machine learning lifecycle from data storage to GPU compute to deployment. </p>\n<p>The advantage of cloud-based training is convenience and scalability, as you can spin up hundreds of GPUs for a few days of training and then shut them down, only paying for what you used. You also don’t have to maintain physical hardware.</p>\n<p>When training large models with tens or hundreds of billions of parameters, such as GPT-3 or TII’s Falcon 180B, the expense goes beyond just the headline price of GPUs. 
In a cloud environment, you also need to account for supporting resources and overheads:</p>\n<ul>\n<li><strong>Virtual CPUs (vCPUs)</strong> to coordinate and feed data to the GPUs during training.</li>\n<li><strong>Memory (RAM)</strong> for holding training data batches, model activations, and other intermediate computations.</li>\n<li><strong>Storage</strong> for datasets and for saving model checkpoints (which themselves can be hundreds of gigabytes in size for big models), as well as costs for data transfer.</li>\n</ul>\n<p>Each of these components adds to the bill. Cloud providers typically charge based on the compute time for both GPU and CPU, the amount of memory provisioned, and the volume of data stored or transferred. This means training a large AI model can incur costs on multiple fronts.</p>\n<p>Efficiently using and managing all these resources helps to keep the budget under control. For example, ensuring GPUs are kept busy so you’re not paying for idle time, and not provisioning more CPU or memory than needed.</p>\n<p>Let’s break down how this might work when training an LLM on a large model on <a href=\"https://www.cudocompute.com/\">CUDO Compute</a>:</p>\n<p>At the time of writing, the cost of the <a href=\"https://www.cudocompute.com/gpu-rental/nvidia-a100\">A100</a> on <a href=\"https://www.cudocompute.com/pricing\">CUDO Compute</a> starts from $1.50 per hour. There is also a monthly commitment option of $1,125.95. When factoring in the other costs, such as vCPUs and memory needed, each is charged based on location.</p>\n<p>Using the median location from the roster, we will base our analysis on the pricing from the Los Angeles 1 location. Here is how much it costs for each resource needed:</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02a.png\" alt=\"cost_of_llms_table_1\"></p>\n<p>Multiple GPUs are advised for optimal results. 
This would be the recommended amount needed to train a Falcon 180B on CUDO Compute based on the default instance for training the same model on AWS:</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02b.png\" alt=\"cost_of_llms_table_2\"></p>\n<p>The above configuration is very similar to the default configuration used on AWS for training LLMs on the same model. To use this configuration on CUDO Compute, assuming the model is not scaled up or down, and no discounts are applied, it will total just over USD 13,000 monthly. Here is the breakdown:</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02c.png\" alt=\"cost_of_llms_table_3\"></p>\n<p>Since training an LLM from scratch will likely take months, this cost will pile up over time, particularly when training involves multiple iterations over extensive datasets. </p>\n<p>Also, as CUDO Compute is one of the cheapest cloud platforms available, bear in mind that the compute costs may be higher on other platforms. For example, using an instance with a similar configuration on AWS (ml.p4de.24xlarge) will cost over USD 23,000 per month.</p>\n<p>It’s worth noting that newer GPU hardware can change this calculus. The above costs were with NVIDIA’s A100 GPUs. The H100 GPUs offer about 2–3 times the performance of A100 for training workloads, and the <a href=\"https://www.cudocompute.com/blog/nvidia-gb200-everything-you-need-to-know\">B100 is even more improved than the H100</a>. You could use fewer B100s to train in less time, potentially lowering the total cost, but B100s are also more expensive per hour than A100s and H100s.</p>\n<p>Hosting and training large models in the cloud gives flexibility but requires careful resource planning. You will often fine-tune an existing model for a specific task (which is far cheaper) rather than train a new LLM from scratch. 
Next, we’ll look at an alternative to any training, using a pay-per-token model to use models others have already trained.</p>\n<h3>Pay-per-token access to LLMs</h3>\n<p>The high cost of training and maintaining LLMs has led to the rise of the pay-per-token (PPT) model for accessing these powerful language models. Here's how it works:</p>\n<p>Companies like OpenAI and Google AI pre-train massive LLMs on vast datasets and allow developers and businesses to use these models, such as GPT-3 or similar, without the prohibitive costs and technical challenges of training such models themselves.</p>\n<p>Users don't incur the upfront costs of training and infrastructure. Instead, they pay a fee based on the number of tokens (roughly equivalent to words or sub-words) processed by the LLM when completing tasks like text generation, translation, or code writing.</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/cost-of-llms-image-5.jpg\" alt=\"cost-of-llms-image-5\"></p>\n<p>The PPT model offers a significantly more cost-effective approach than in-house training for tasks that don't require extensive LLM usage. Users only pay for the resources they actually use.</p>\n<h3>Benefits of pay per token:</h3>\n<ul>\n<li><strong>Reduced costs:</strong> This model eliminates the upfront investment in hardware, software, and training data.</li>\n<li><strong>Scalability:</strong> Users can easily scale their LLM usage up or down based on their needs, paying only for the tokens they consume.</li>\n<li><strong>Accessibility:</strong> PPT allows a wider range of users and smaller companies to access LLMs without the prohibitive costs of in-house training.</li>\n</ul>\n<h2>Why is it so expensive to train LLMs?</h2>\n<p>There are several reasons why training large language models is extraordinarily expensive:</p>\n<ul>\n<li><strong>Massive model sizes and data:</strong> Modern LLMs are huge. 
While GPT-3 had 175 billion parameters, newer models like GPT-4 are rumored to have on the order of <a href=\"https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51\" target=\"_blank\" rel=\"noopener\">1 trillion parameters</a>. The amount of computation (floating point operations) needed to train a model scales roughly linearly with the number of parameters multiplied by the number of training examples (tokens).</li>\n</ul>\n<p>To achieve high performance, these models are trained on extremely large datasets with hundreds of billions or even trillions of tokens of text. For example, an LLM might be trained on 300-500 billion tokens of data, iterating through that corpus multiple times. </p>\n<p>The combination of a gigantic model and a gigantic dataset leads to an eye-popping number of total compute operations.</p>\n<p>It is estimated that GPT-4’s training consumed 2.1 × 10<sup>25</sup> FLOPs (21 billion petaFLOPs), and models like Gemini Ultra <a href=\"https://www.voronoiapp.com/technology/Googles-Gemini-Ultra-Cost-191M-to-Develop--1088#:~:text=For%20example%2C%20Gemini%20Ultra%20required,billion%20petaFLOPS%2C%20cost%20%2478%20million\" target=\"_blank\" rel=\"noopener\">might be around 5.0 × 10<sup>25</sup> FLOPs</a>, driving those compute bills. Simply put, scale is costly. 
Even with highly efficient software and hardware, you’re pushing an unfathomable amount of data through the model.</p>\n<ul>\n<li><strong>Expensive hardware requirements:</strong> To handle that scale of computation within a reasonable timeframe, you need fleets of high-end accelerators (GPUs or TPUs). Training runs often use thousands of GPUs working in parallel for weeks or months.</li>\n</ul>\n<p>Owning such hardware is capital-intensive, as one NVIDIA H100 GPU can cost $25k–$40k; a pod of 1000 of them would be $25–40 million in hardware alone, and renting them is expensive too, as we discussed earlier. </p>\n<p>Additionally, the electricity and cooling for running these at full tilt 24/7 are very costly. Large training runs can consume megawatt-hours of energy, and only a few companies have the infrastructure to dedicate 10,000 GPUs to a single task. </p>\n<p>Supporting hardware like high-speed networking, like InfiniBand or custom interconnects, is also needed to ensure those GPUs can communicate rapidly as they split up the training load. All of this specialized infrastructure drives up the cost significantly.</p>\n<ul>\n<li><strong>Extended training durations:</strong> Despite massive parallelism, training a reasoning model can still take weeks or months. During that entire period, the hardware must be powered and utilized.</li>\n</ul>\n<p>For instance, if you run 1,000 GPUs for one month, that’s 1,000 GPU-months of usage, which, at say $2,000 per GPU-month, would be $2 million. Now, consider some runs use 5,000 or 10,000 GPUs for several months. It’s easy to see how the cost racks up to tens of millions of dollars. Every additional epoch of training to slightly improve the model’s performance comes with that price tag. </p>\n<p>Moreover, training often isn’t a one-shot process; researchers may do multiple runs while tuning hyperparameters or testing different configurations. 
The final model you see is the product of many trial runs and experiments behind the scenes. Those experimental runs also consume a lot of compute. </p>\n<p>According to recent analysis, the total compute used in developing a new model could be <a href=\"https://arxiv.org/html/2405.21015v2\" target=\"_blank\" rel=\"noopener\">2–3 times the compute of the final training run</a>, once you account for all the failed or exploratory experiments along the way.</p>\n<ul>\n<li><strong>Advanced techniques and multiple phases:</strong> Training a competitive LLM now involves more than just one pass through a dataset. For example, after the initial pre-training where you predict the next word on huge text data, many models undergo fine-tuning phases, such as <a href=\"https://arxiv.org/pdf/2504.12501\" target=\"_blank\" rel=\"noopener\">Reinforcement Learning from Human Feedback</a> (RLHF), to align the model with human preferences and instructions.</li>\n</ul>\n<p>RLHF involves training additional models (reward models) and running <a href=\"https://www.cudocompute.com/blog/machine-learning-technique-introduction-to-reinforcement-learning\">reinforcement learning algorithms</a> that themselves require lots of computations on GPUs and <a href=\"https://arxiv.org/pdf/2409.18417\" target=\"_blank\" rel=\"noopener\">custom data generated by human annotators</a>, which is expensive. </p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02b.png\" alt=\"cost_of_llms_chart_2\"></p>\n<blockquote>\n<p>Illustration of Vickrey Feedback for RLHF. 
Source: <a href=\"https://arxiv.org/pdf/2409.18417\" target=\"_blank\" rel=\"noopener\">Paper</a>.</p>\n</blockquote>\n<p>OpenAI’s ChatGPT/GPT-4 alignment process, for instance, meant hiring human contractors to <a href=\"https://openai.com/index/chatgpt/\" target=\"_blank\" rel=\"noopener\">produce conversations and feedback</a>, and then performing iterative fine-tuning, improving the model’s usefulness and safety, but it adds to the total cost of development, both in terms of money and time, beyond the baseline pre-training. </p>\n<p>Similarly, to imbue reasoning abilities, researchers might fine-tune LLMs on complex problem-solving datasets or use techniques like <a href=\"https://arxiv.org/pdf/2201.11903\" target=\"_blank\" rel=\"noopener\">chain-of-thought prompting</a> and <a href=\"https://arxiv.org/pdf/2203.11171\" target=\"_blank\" rel=\"noopener\">self-correction</a>, which can require generating and filtering a lot of model outputs, again consuming compute.</p>\n<p>Each additional training objective or stage, be it tutoring the model on logic puzzles or making it follow a constitution of AI principles, introduces extra expense.</p>\n<ul>\n<li><strong>Data acquisition and preparation:</strong> Often overlooked in pure compute discussions is the cost of the <em>data</em> itself. High-quality datasets may need to be purchased or scraped and cleaned at scale. For example, an LLM might ingest the entire Wikipedia, huge swaths of web text, news, books, academic papers, code repositories, etc. Web crawling and data storage incur infrastructure costs.</li>\n</ul>\n<p>Additionally, if a model is to excel at, say, coding or medical questions, one might need to curate domain-specific data, which could be behind paywalls or require licensing. Companies might pay for access to archives of journals or specific private datasets. </p>\n<p>Then comes the preprocessing: filtering out problematic content, deduplicating text, formatting it for training, and so forth. 
This can involve substantial engineering effort and cloud processing power. </p>\n<p>Preparing a trillion-token corpus that is diverse, high-quality, and legally usable is a non-trivial expense, sometimes involving human annotators for labeling or verification in the loop.</p>\n<p>Pushing the limits of dataset size, model size, and compute scale is expensive because it sits at the edge of what’s technologically possible. To do so requires cutting-edge hardware, huge energy consumption, and often a team of researchers and engineers whose salaries are another factor. </p>\n<p>The result is that only a handful of companies and well-funded academic labs can afford to train the largest models. Others must be clever in using those models or finding ways to achieve results with smaller-scale systems. </p>\n<p>In the following section, we’ll discuss strategies for controlling and reducing the cost of training.</p>\n<h2>Steps to controlling the cost of training LLMs</h2>\n<p>While training large models will likely remain resource-intensive, there are several strategies that can optimize resource utilization and reduce expenses. In fact, for most organizations, the question isn’t “<em>Can we train our own GPT-4?</em>” but “<em>How can we achieve our AI goals within a reasonable budget?</em>”. </p>\n<p>Below are key approaches to managing and minimizing the costs of training and deploying LLMs:</p>\n<p><strong>1. Implement model and training optimization techniques:</strong> Optimize what you build and how you build it so you need less compute for the same result.</p>\n<ul>\n<li><strong>Model architecture selection and sizing:</strong> Carefully choose a model architecture and size that balances performance with cost. Bigger is not always better for a given task. 
If you can achieve your target accuracy with a 6-billion-parameter model, there’s no need to train a 60-billion-parameter one.</li>\n</ul>\n<p>Techniques like model pruning (removing unnecessary weights) or quantization can reduce the effective size and computation of a model without a significant loss in accuracy. Using newer architecture improvements can also get more bang for the buck, as some architectures are more parameter-efficient than others. </p>\n<p>The goal is to avoid overspending on compute capacity you don’t actually need. In some cases, <a href=\"https://www.cudocompute.com/blog/what-is-ensemble-learning\">an ensemble of multiple smaller specialized models</a> might be better than one large model.</p>\n<ul>\n<li><strong>Training data optimization:</strong> Ensure your training data is high-quality and relevant. A smaller, curated dataset can sometimes outperform a larger, noisy one. Removing redundant or low-value data can shorten training time.</li>\n</ul>\n<p>For instance, if you’re building a biomedical language model, 100GB of well-chosen medical text will train faster and yield a more accurate model in that domain than 1TB of random internet text. </p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02c.png\" alt=\"cost_of_llms_chart_3\"></p>\n<blockquote>\n<p>Amortized hardware cost plus energy cost for the final training run of frontier models. Source: <a href=\"https://arxiv.org/html/2405.21015v2\" target=\"_blank\" rel=\"noopener\">Paper</a></p>\n</blockquote>\n<p>Use data filtering to eliminate spam, duplicated content, or irrelevant information, and use augmentation to add useful diversity. Every epoch on unnecessary data is wasted GPU time, so focus on quality over sheer quantity when possible.</p>\n<ul>\n<li><strong>Knowledge distillation:</strong> This technique involves training a smaller “student” model to replicate the behavior of a larger “teacher” model. 
You first train a large model (or take an existing one), then use its outputs to guide the training of a compact model.</li>\n</ul>\n<p>The student model can often achieve near the teacher’s performance but with far fewer parameters, effectively compressing the knowledge of a huge model into a cheaper, faster model that is much easier to deploy and even continue training. </p>\n<p>Distillation was used, for example, to create smaller BERT variants and even some DeepSeek distillations, cutting down model size by 10 times or more while retaining most capabilities. For organizations, one strategy is to fine-tune a large model on your task (possibly using an API or a short-term rental on a big GPU machine), then distill it into a smaller model you can afford to train further or run in production.</p>\n<ul>\n<li><strong>Mixed-precision training:</strong> Modern hardware and software allow using lower numerical precision for calculations to speed up training. Instead of 32-bit floating point for everything, frameworks use 16-bit or even 8-bit in certain parts of the computation.</li>\n</ul>\n<p>This cuts memory usage and can double the training speed on supported hardware <em>without</em> noticeably affecting model accuracy due to techniques like loss scaling to maintain numerical stability. </p>\n<p>The latest NVIDIA GPUs and Google TPUs all support mixed precision. You can significantly reduce the compute time and cost required for each training step using mixed-precision training. 
It’s a free efficiency gain and has become standard practice in training large models.</p>\n<ul>\n<li><strong>Parameter-efficient fine-tuning:</strong> A recent development, particularly useful if you’re working with pre-trained models, is to fine-tune only a subset of the model’s parameters or use add-on adapters, instead of complete end-to-end training.</li>\n</ul>\n<p>Approaches such as <a href=\"https://arxiv.org/pdf/2106.09685\" target=\"_blank\" rel=\"noopener\">low-rank adaptation</a> (LoRA) and other adapter-based methods allow you to train just a small additional set of weights or a low-rank update to the original weights that can be merged with the model for new tasks, drastically reducing the compute and data needed to adapt a model to a domain. </p>\n<p>For example, LoRA was used to fine-tune large models like GPT-3 on specific tasks using 10x–100x less compute than tuning the entire model. By training fewer parameters, you use less GPU memory and can get away with shorter training times, all while achieving nearly the same performance as full fine-tuning. </p>\n<p>Parameter-efficient fine-tuning uses the investment already made in a pre-trained model and avoids having to reinvent the wheel. In general, always prefer fine-tuning or adapting an existing model over training from scratch when possible; it’s very much like <a href=\"https://www.cudocompute.com/blog/few-shot-learning-everything-you-need-to-know#:~:text=varied%20training%20data.-,Transfer%20learning,-%3A%20Transfer%20learning%20leverages\">transfer learning</a> that can cut requirements by orders of magnitude.</p>\n<p><strong>2. 
Optimize training configurations:</strong>\nTrain smarter, not harder – improve your training process to converge faster or use fewer resources.</p>\n<ul>\n<li><strong>Hyperparameter tuning and efficient experimentation:</strong> The choice of <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch/4-steps-to-build-an-ai-model#hyperparameters-tuning\">hyperparameters can dramatically affect how fast a model learns</a>. A well-tuned training run might reach a target accuracy in 50% fewer steps than a poorly tuned one, which directly translates to 50% less cost.</li>\n</ul>\n<p>Rather than guessing, use systematic approaches to hyperparameter search (grid search, Bayesian optimization, etc.) on smaller-scale models or subsets of data to find good settings, then scale up. </p>\n<p>Also, monitor training in real-time. If loss has plateaued early or something looks off, intervene or stop rather than let a bad configuration run to completion. Implement an early stopping strategy so that if your model’s performance on a validation set stops improving, it halts the training to avoid wasting epochs that don’t yield gains.</p>\n<p>Read more: <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch\">How to build an AI</a></p>\n<ul>\n<li><strong>Checkpointing and fault tolerance:</strong> Training runs can be long, and failures like machine crashes and spot instance termination can happen. By checkpointing regularly (saving the model state every so often), you can resume from the last checkpoint instead of starting over, thereby saving all the progress made until the interruption. This is more about avoiding unnecessary re-training costs rather than reducing the cost of the successful run, but it’s important.</li>\n<li><strong>Smaller-scale and iterative training:</strong> Often, one can pre-train on a smaller scale first to get a sense of things. 
For example, train a smaller model (say 10% of the full size) on the full data, or the full model on 10% of the data, to identify issues or get learning curve estimates. This “pilot run” approach can reveal if your learning rate is way off, or if the model is too small to ever reach your target, etc., without spending the full budget.</li>\n</ul>\n<p>Learning rate warmups, cosine decays, and other training schedule techniques can also help stabilize the training of large models, avoiding divergence that would force you to restart with different settings, which is another hidden cost.</p>\n<p><strong>4. Use advanced model strategies:</strong></p>\n<p>Use model designs that reduce per-model compute, and external resources that make the job easier.</p>\n<ul>\n<li><strong>Mixture-of-experts (MoE) models:</strong> MoE is an architecture that effectively has multiple sub-models called experts and a gating mechanism that activates only some of them for each input, meaning that at any given time, only a portion of the model’s parameters are used, rather than all of them.</li>\n</ul>\n<p>MoEs can significantly cut down the compute needed for a given parameter count – for instance, Google’s Switch Transformer (an MoE model) achieved comparable results to a dense model but with far less computation by sparsely activating experts. </p>\n<p>In that work, the researchers trained a model with 1.6 trillion parameters at the <a href=\"https://arxiv.org/pdf/2101.03961\" target=\"_blank\" rel=\"noopener\">computational cost of only a 100+ billion parameter dense model</a>. The benefit is that you get the effect of a huge model (in terms of capacity and specialization) without having to pay the full cost every time. </p>\n<p>MoEs spread out the workload and can be more efficient if implemented well. However, they add complexity; not all frameworks easily support them, and they can be tricky to get right. 
But as research advances, MoEs are becoming more common in large-scale settings. </p>\n<p>For cost-savvy training, exploring MoE architectures could allow training frontier-sized models on a smaller budget by <strong>reducing the computational load per token</strong>.</p>\n<ul>\n<li><strong>Retrieval and external knowledge integration:</strong> Another way to reduce the burden on the model itself is to give it tools or access to information during training/inference. For example, Retrieval-Augmented Generation (RAG) techniques provide the model with relevant documents fetched from an external database, so the model doesn’t need to memorize every fact; instead, it can look things up.</li>\n</ul>\n<p>DeepMind’s RETRO model demonstrated that a model with retrieval support could <a href=\"https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=But%20there%20is%20good%20news%21\" target=\"_blank\" rel=\"noopener\">match the performance of a model 25 times larger</a> that had everything baked into its parameters. By integrating a massive external knowledge base, they maintained performance with a much smaller network, which implies huge training cost <a href=\"https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=You%20can%20integrate%20a%20massive,up%20to%2025x%20smaller%20networks\" target=\"_blank\" rel=\"noopener\">savings, since 25 times fewer parameters to train means roughly 25 times less compute</a>. </p>\n<p>This approach can be thought of as augmenting the model with a form of memory or tools so it doesn’t have to do all the work internally. Likewise, allowing a model to use a calculator for arithmetic or call an API for current knowledge means you don’t have to train it to death trying to internalize those capabilities. 
</p>\n<p>In a practical sense, building a system that combines an LLM with retrieval (e.g., queries to Wikipedia or a company knowledge base) can let you use a smaller base model to achieve the same or better performance than an extremely large model without retrieval, cutting down training requirements. </p>\n<p>Many current applications, like search engine bots, use this strategy. So, from a cost perspective, invest in a slightly more complex system (model + retrieval tool) and you might avoid having to invest in a model that’s 10 times bigger and 10 times more expensive to train.</p>\n<ul>\n<li><strong>Multi-step reasoning and curriculum learning:</strong> If you want an LLM with strong reasoning, an efficient approach is to train it to encourage reasoning via multiple steps, rather than expecting the model to solve everything in one go with brute-force parameters.</li>\n</ul>\n<p>For example, techniques like chain-of-thought training involve showing the model intermediate reasoning steps for problems (like math solutions or code logic) during training. This can teach a smaller model to approach complex tasks systematically, narrowing the gap with a larger model that might implicitly learn those skills. </p>\n<p>It’s not a direct cost-saving in compute, but it can mean you reach the desired capability without an exponential increase in model size. Similarly, curriculum learning (starting with easier tasks and then increasing difficulty) can help a model learn faster and converge better, meaning fewer epochs (less cost) to achieve high performance. </p>\n<p>The overarching idea is training smarter. Guiding the model’s learning process can trim the needed compute compared to naive training on a jumble of tasks. This is especially useful for reasoning-centric models where the objective is not just fluent text, but correct and logical solutions.</p>\n<p><strong>5. 
Collaborate and use open-source ecosystems:</strong></p>\n<p>Don’t go at it alone; take advantage of community efforts, existing models, and shared resources.</p>\n<ul>\n<li><strong>Use pre-trained models and tools:</strong> Using open-source pre-trained models significantly reduces costs by eliminating the need for full-scale training from scratch. Models such as Meta’s LLaMA 2, BLOOM, and DeepSeek R1 are available with pre-trained weights, enabling companies and researchers to fine-tune these models for their specific needs at a fraction of the original training cost.</li>\n</ul>\n<p>As we showed earlier, fine-tuning a large model like LLaMA 2 (70B parameters) typically costs tens of thousands of dollars, substantially less than full-scale initial training.</p>\n<p>Additionally, the open-source ecosystem provides robust tools and libraries optimized for efficient distributed training. Frameworks such as DeepSpeed and Fully Sharded Data Parallel (FSDP) effectively manage large models across limited hardware by sharding model components, allowing greater efficiency and reduced hardware requirements. </p>\n<p>Hugging Face’s Accelerate and tools like Horovod further streamline distributed training processes. You can benefit from shared knowledge, compute resources, and reduced overall costs by tapping into community-developed resources and collaborating within research communities, like EleutherAI, LAION, and BigScience.</p>\n<p>By applying these strategies, you’ll be able to cut down the cost required to train or fine-tune LLMs. 
Some emerging efforts suggest it may be possible to reach GPT-3/4 level performance on much smaller budgets.</p>\n<p>For example, the founder of 01.ai claimed they trained a GPT-4-comparable model <a href=\"https://www.tomshardware.com/tech-industry/artificial-intelligence/chinese-company-trained-gpt-4-rival-with-just-2-000-gpus-01-ai-spent-usd3m-compared-to-openais-usd80m-to-usd100m#:~:text=advanced%20AI%20models%20using%202%2C000%20GPUs%20with%20just%20%243%20million\" target=\"_blank\" rel=\"noopener\">using only 2,000 GPUs and about $3 million of compute</a> by optimizing every aspect of the process, which is a stark contrast to the amount reportedly spent on GPT-4 by OpenAI. While such claims have yet to be fully verified, they underscore that efficiency matters. The gap between big-budget labs and smaller players can be narrowed with clever engineering.</p>\n<p>Controlling the cost of training LLMs is about making prudent choices at each step, from choosing the right model size to selecting hardware, data, and tools. Those who do so can build impressive language and reasoning models without breaking the bank. </p>\n<p>The landscape of LLM development is evolving quickly. While the cutting-edge models will always push expensive extremes, the practical know-how for cost-effective training is more available than ever. To learn more, read our guide on <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch\">how to build an AI from scratch</a>.</p>\n<p>With careful planning, training your LLM or tailoring an existing one to your needs is feasible within a reasonable budget, and CUDO Compute can help you achieve that easily. We offer cost-effective GPUs on demand and scalable GPU clusters equipped with the latest NVIDIA GPUs. You can get started with a few clicks. <a href=\"https://compute.cudo.org/\" target=\"_blank\" rel=\"noopener\">Get started</a>.</p>\n\t\t\t\t</div></div>",
  "markdown": "Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools.\n\nHowever, training LLMs requires a substantial financial investment due to their vast number of parameters and the enormous computational power needed. It typically involves high-end GPUs or specialized AI accelerators, which are costly resources to acquire and run.\n\nFor example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to [range from about $500,000 up to $4.6 million](https://carboncredits.com/carbon-countdown-ais-10-billion-rise-in-power-use-explodes-data-center-emission/#:~:text=The%20final%20training%20run%20of%20GPT%2D3%20is%20estimated%20to%20have%20ranged%20from%20%24500%2C000%20to%20%244.6%20million.), depending on the hardware and optimization techniques used. By comparison, newer models have pushed costs much higher.\n\nTraining OpenAI’s GPT-4 reportedly [cost more than $100 million](https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51), with some estimates ranging up to [$78 million in compute cost](https://hai.stanford.edu/news/ai-index-state-ai-13-charts#:~:text=One%20of%20the%20reasons%20academia,modern%20LLM%2C%20cost%20around%20%24900), and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. These staggering sums partly reflect why model size and complexity have exploded.\n\n![cost\\_of\\_llms\\_chart\\_1](https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02a.png)\n\n> Training compute petaFLOP log scale. 
Source: [Paper](https://hai.stanford.edu/news/ai-index-state-ai-13-charts)\n\nIn this article, we’ll explore the expenses of bringing modern generative AI and reasoning models to life, focusing on infrastructure needs, data management, and the increasingly important role of cloud computing.\n\n## What are large language models?\n\nLLMs are AI systems designed to understand and generate language in a way that mimics human communication and reasoning. They are trained on vast datasets containing text from books, websites, code repositories, and other digital content.\n\nBy learning statistical patterns in language, LLMs can generate coherent and contextually relevant text based on a given input. For example, models like GPT are trained on a huge swath of internet text and can produce writing that mirrors human style across many topics.\n\nThese models typically rely on the [transformer architecture](https://www.cudocompute.com/topics/neural-networks/transformer-models-what-are-they-and-how-do-they-work), which uses mechanisms like self-attention to consider the context of words in a sequence, allowing the model to weigh the importance of different parts of the input text relative to each other, enabling a better understanding of context and meaning.\n\nFor instance, Google’s BERT model reads text bidirectionally – both left-to-right and right-to-left – to capture context, a significant advancement over older one-directional models, making BERT especially effective for tasks requiring deep language understanding, like question answering or sentiment analysis.\n\nAs LLMs have grown, they’ve gotten better at fluent text generation and shown emergent reasoning abilities. 
Newer generative models such as [GPT-4o-mini and DeepSeek R1](https://www.cudocompute.com/blog/why-open-source-models-are-disrupting-the-ai-race) can perform complex tasks like solving multi-step math problems or writing code by reasoning through steps, especially when guided by techniques like chain-of-thought prompting.\n\nGiven the right prompts, these models can simulate a step-by-step reasoning process, allowing them to handle more complex queries. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on [data that encourages logical reasoning](https://arxiv.org/pdf/2110.14168).\n\nModels like Anthropic’s Claude have also been trained with specific methods, like Constitutional AI, to better follow instructions and reason safely. All of these advances, however, come with increased training requirements. Next, we’ll discuss the cost of training a large language model.\n\n## Cost of training LLMs with cloud infrastructure\n\nAs AI development increasingly shifts to cloud platforms, [driven partly by limited availability of GPUs](https://www.cudocompute.com/blog/gpu-supply-shortage-due-to-ai-needs), cloud services have become one of the most practical and scalable ways to train LLMs. The cloud offers on-demand access to large numbers of GPUs/TPUs and the ability to distribute training across multiple machines.\n\nSuch scalability is excellent for the fluctuating demands of AI training cycles. Major providers have built massive supercomputers in the cloud to facilitate LLM training. 
Microsoft, for example, [constructed an Azure supercomputer](https://www.zdnet.com/article/microsoft-builds-a-supercomputer-for-openai-for-training-massive-ai-models/#:~:text=Microsoft%20said%20the%20supercomputer%20built,has%20access%20to%20Azure%20services) with over 10,000 GPUs and ultra-fast networking specifically for OpenAI’s model training.\n\nHowever, renting such infrastructure comes at a significant cost. A recent example from NVIDIA CEO Jensen Huang illustrates the scale of resources required, stating that training the GPT-MoE-1.8T model using 25,000 Ampere-based GPUs (most likely the A100) took 3 to 5 months. Doing the same with H100 would take about 8,000 GPUs in 90 days.\n\nThe above statement shows how each generation of hardware can improve training efficiency. The H100s offer substantially higher throughput than A100s, reducing the time and number of machines needed for the same task.\n\nStill, whether using 25,000 older GPUs or 8,000 newer ones, the compute cluster required is enormous, and most organizations simply cannot afford to assemble or rent such resources for training from scratch.\n\nWhile it is possible to train a model from scratch, most users won’t train giant LLMs from scratch due to these high costs and infrastructure hurdles. Instead, they’ll use pre-trained models provided by AI labs or open-source communities and then adapt them to their needs. 
This avoids spending millions of dollars in compute for initial training.\n\nGenerally, if a company or researcher needs an LLM, there are two approaches:\n\n-   **Hosting your own model:** Obtain a pre-trained model checkpoint, either open-source or via license, and run further training or fine-tuning on it using servers that you rent or own.\n-   **Pay-per-token access:** Use a hosted model provided through an API, like OpenAI’s or Google’s, and pay for usage, rather than handling any training yourself.\n\nLet’s examine each of these approaches and their cost implications.\n\nRead more: [How to build an AI](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch)\n\n### Hosting models in the cloud\n\nIf you choose to host and train/fine-tune models in the cloud, you will need to rent compute resources. Companies like CUDO Compute offer suites of services that support the entire machine learning lifecycle from data storage to GPU compute to deployment.\n\nThe advantage of cloud-based training is convenience and scalability, as you can spin up hundreds of GPUs for a few days of training and then shut them down, only paying for what you used. You also don’t have to maintain physical hardware.\n\nWhen training large models with tens or hundreds of billions of parameters, such as GPT-3 or TII’s Falcon 180B, the expense goes beyond just the headline price of GPUs. In a cloud environment, you also need to account for supporting resources and overheads:\n\n-   **Virtual CPUs (vCPUs)** to coordinate and feed data to the GPUs during training.\n-   **Memory (RAM)** for holding training data batches, model activations, and other intermediate computations.\n-   **Storage** for datasets and for saving model checkpoints (which themselves can be hundreds of gigabytes in size for big models), as well as costs for data transfer.\n\nEach of these components adds to the bill. 
Cloud providers typically charge based on the compute time for both GPU and CPU, the amount of memory provisioned, and the volume of data stored or transferred. This means training a large AI model can incur costs on multiple fronts.\n\nEfficiently using and managing all these resources helps to keep the budget under control. For example, keep GPUs busy so you’re not paying for idle time, and avoid provisioning more CPU or memory than needed.\n\nLet’s break down how this might work when training a large model on [CUDO Compute](https://www.cudocompute.com/):\n\nAt the time of writing, the cost of the [A100](https://www.cudocompute.com/gpu-rental/nvidia-a100) on [CUDO Compute](https://www.cudocompute.com/pricing) starts from $1.50 per hour. There is also a monthly commitment option of $1,125.95. Other resources, such as vCPUs and memory, are each priced by location.\n\nUsing the median location from the roster, we will base our analysis on the pricing from the Los Angeles 1 location. Here is how much it costs for each resource needed:\n\n![cost\\_of\\_llms\\_table\\_1](https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02a.png)\n\nMultiple GPUs are advised for optimal results. The following is the recommended configuration for training Falcon 180B on CUDO Compute, based on the default instance for training the same model on AWS:\n\n![cost\\_of\\_llms\\_table\\_2](https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02b.png)\n\nThe above configuration is very similar to the default configuration used on AWS for training the same model. Using this configuration on CUDO Compute, assuming the model is not scaled up or down and no discounts are applied, will total just over USD 13,000 monthly. 
Here is the breakdown:\n\n![cost\\_of\\_llms\\_table\\_3](https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02c.png)\n\nSince training an LLM from scratch will likely take months, this cost will pile up over time, particularly when training involves multiple iterations over extensive datasets.\n\nAlso, as CUDO Compute is one of the cheapest cloud platforms available, bear in mind that the compute costs may be higher on other platforms. For example, using an instance with a similar configuration on AWS (ml.p4de.24xlarge) will cost over USD 23,000 per month.\n\nIt’s worth noting that newer GPU hardware can change this calculus. The above costs were with NVIDIA’s A100 GPUs. The H100 GPUs offer about 2–3 times the performance of A100 for training workloads, and the [B100 is even more improved than the H100](https://www.cudocompute.com/blog/nvidia-gb200-everything-you-need-to-know). You could use fewer B100s to train in less time, potentially lowering the total cost, but B100s are also more expensive per hour than A100s and H100s.\n\nHosting and training large models in the cloud gives flexibility but requires careful resource planning. You will often fine-tune an existing model for a specific task (which is far cheaper) rather than train a new LLM from scratch. Next, we’ll look at an alternative to any training, using a pay-per-token model to use models others have already trained.\n\n### Pay-per-token access to LLMs\n\nThe high cost of training and maintaining LLMs has led to the rise of the pay-per-token (PPT) model for accessing these powerful language models. Here's how it works:\n\nCompanies like OpenAI and Google AI pre-train massive LLMs on vast datasets and allow developers and businesses to use these models, such as GPT-3 or similar, without the prohibitive costs and technical challenges of training such models themselves.\n\nUsers don't incur the upfront costs of training and infrastructure. 
Instead, they pay a fee based on the number of tokens (roughly equivalent to words or sub-words) processed by the LLM when completing tasks like text generation, translation, or code writing.\n\n![cost-of-llms-image-5](https://www.cudocompute.com/wp-content/uploads/2026/04/cost-of-llms-image-5.jpg)\n\nThe PPT model offers a significantly more cost-effective approach than in-house training for tasks that don't require extensive LLM usage. Users only pay for the resources they actually use.\n\n### Benefits of pay per token:\n\n-   **Reduced costs:** This model eliminates the upfront investment in hardware, software, and training data.\n-   **Scalability:** Users can easily scale their LLM usage up or down based on their needs, paying only for the tokens they consume.\n-   **Accessibility:** PPT allows a wider range of users and smaller companies to access LLMs without the prohibitive costs of in-house training.\n\n## Why is it so expensive to train LLMs?\n\nThere are several reasons why training large language models is extraordinarily expensive:\n\n-   **Massive model sizes and data:** Modern LLMs are huge. While GPT-3 had 175 billion parameters, newer models like GPT-4 are rumored to have on the order of [1 trillion parameters](https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51). The amount of computation (floating point operations) needed to train a model scales roughly linearly with the number of parameters multiplied by the number of training examples (tokens).\n\nTo achieve high performance, these models are trained on extremely large datasets with hundreds of billions or even trillions of tokens of text. 
For example, an LLM might be trained on 300-500 billion tokens of data, iterating through that corpus multiple times.\n\nThe combination of a gigantic model and a gigantic dataset leads to an eye-popping number of total compute operations.\n\nIt is estimated that GPT-4’s training consumed 2.1 × 10^25 FLOPs (21 billion petaFLOPs), and models like Gemini Ultra [might be around 5.0 × 10^25 FLOPs](https://www.voronoiapp.com/technology/Googles-Gemini-Ultra-Cost-191M-to-Develop--1088#:~:text=For%20example%2C%20Gemini%20Ultra%20required,billion%20petaFLOPS%2C%20cost%20%2478%20million), driving those compute bills. Simply put, scale is costly. Even with highly efficient software and hardware, you’re pushing an unfathomable amount of data through the model.\n\n-   **Expensive hardware requirements:** To handle that scale of computation within a reasonable timeframe, you need fleets of high-end accelerators (GPUs or TPUs). Training runs often use thousands of GPUs working in parallel for weeks or months.\n\nOwning such hardware is capital-intensive, as one NVIDIA H100 GPU can cost $25k–$40k; a pod of 1,000 of them would be $25–40 million in hardware alone, and renting them is expensive too, as we discussed earlier.\n\nAdditionally, the electricity and cooling for running these at full tilt 24/7 are very costly. 
Large training runs can consume megawatt-hours of energy, and only a few companies have the infrastructure to dedicate 10,000 GPUs to a single task.\n\nSupporting hardware like high-speed networking, like InfiniBand or custom interconnects, is also needed to ensure those GPUs can communicate rapidly as they split up the training load. All of this specialized infrastructure drives up the cost significantly.\n\n-   **Extended training durations:** Despite massive parallelism, training a reasoning model can still take weeks or months. During that entire period, the hardware must be powered and utilized.\n\nFor instance, if you run 1,000 GPUs for one month, that’s 1,000 GPU-months of usage, which, at say $2,000 per GPU-month, would be $2 million. Now, consider some runs use 5,000 or 10,000 GPUs for several months. It’s easy to see how the cost racks up to tens of millions of dollars. Every additional epoch of training to slightly improve the model’s performance comes with that price tag.\n\nMoreover, training often isn’t a one-shot process; researchers may do multiple runs while tuning hyperparameters or testing different configurations. The final model you see is the product of many trial runs and experiments behind the scenes. Those experimental runs also consume a lot of compute.\n\nAccording to recent analysis, the total compute used in developing a new model could be [2–3 times the compute of the final training run](https://arxiv.org/html/2405.21015v2), once you account for all the failed or exploratory experiments along the way.\n\n-   **Advanced techniques and multiple phases:** Training a competitive LLM now involves more than just one pass through a dataset. 
For example, after the initial pre-training, where the model learns to predict the next token on huge text corpora, many models undergo fine-tuning phases, such as [Reinforcement Learning from Human Feedback](https://arxiv.org/pdf/2504.12501) (RLHF), to align the model with human preferences and instructions.\n\nRLHF involves training additional models (reward models) and running [reinforcement learning algorithms](https://www.cudocompute.com/blog/machine-learning-technique-introduction-to-reinforcement-learning) that themselves require a lot of GPU computation, along with [custom data generated by human annotators](https://arxiv.org/pdf/2409.18417), which is expensive.\n\n![cost\\_of\\_llms\\_chart\\_2](https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02b.png)\n\n> Illustration of Vickrey Feedback for RLHF. Source: [Paper](https://arxiv.org/pdf/2409.18417).\n\nOpenAI’s ChatGPT/GPT-4 alignment process, for instance, meant hiring human contractors to [produce conversations and feedback](https://openai.com/index/chatgpt/), and then performing iterative fine-tuning. This improved the model’s usefulness and safety, but it added to the total cost of development, in both money and time, beyond the baseline pre-training.\n\nSimilarly, to imbue reasoning abilities, researchers might fine-tune LLMs on complex problem-solving datasets or use techniques like [chain-of-thought prompting](https://arxiv.org/pdf/2201.11903) and [self-consistency](https://arxiv.org/pdf/2203.11171), which can require generating and filtering a lot of model outputs, again consuming compute.\n\nEach additional training objective or stage, be it tutoring the model on logic puzzles or making it follow a constitution of AI principles, introduces extra expense.\n\n-   **Data acquisition and preparation:** Often overlooked in pure compute discussions is the cost of the _data_ itself. High-quality datasets may need to be purchased or scraped and cleaned at scale. 
For example, an LLM might ingest the entire Wikipedia, huge swaths of web text, news, books, academic papers, code repositories, etc. Web crawling and data storage incur infrastructure costs.\n\nAdditionally, if a model is to excel at, say, coding or medical questions, one might need to curate domain-specific data, which could be behind paywalls or require licensing. Companies might pay for access to archives of journals or specific private datasets.\n\nThen comes the preprocessing: filtering out problematic content, deduplicating text, formatting it for training, and so forth. This can involve substantial engineering effort and cloud processing power.\n\nPreparing a trillion-token corpus that is diverse, high-quality, and legally usable is a non-trivial expense, sometimes involving human annotators for labeling or verification in the loop.\n\nPushing the limits of dataset size, model size, and compute scale is expensive because it sits at the edge of what’s technologically possible. To do so requires cutting-edge hardware, huge energy consumption, and often a team of researchers and engineers whose salaries are another factor.\n\nThe result is that only a handful of companies and well-funded academic labs can afford to train the largest models. Others must be clever in using those models or finding ways to achieve results with smaller-scale systems.\n\nIn the following section, we’ll discuss strategies for controlling and reducing the cost of training.\n\n## Steps to controlling the cost of training LLMs\n\nWhile training large models will likely remain resource-intensive, there are several strategies that can optimize resource utilization and reduce expenses. In fact, for most organizations, the question isn’t “_Can we train our own GPT-4?_” but “_How can we achieve our AI goals within a reasonable budget?_”.\n\nBelow are key approaches to managing and minimizing the costs of training and deploying LLMs:\n\n**1\\. 
Implement model and training optimization techniques:** Optimize what you build and how you build it so you need less compute for the same result.\n\n-   **Model architecture selection and sizing:** Carefully choose a model architecture and size that balances performance with cost. Bigger is not always better for a given task. If you can achieve your target accuracy with a 6-billion-parameter model, there’s no need to train a 60-billion-parameter one.\n\nTechniques like model pruning (removing unnecessary weights) or quantization can reduce the effective size and computation of a model without a significant loss in accuracy. Newer architectures can also deliver more per unit of compute, as some are more parameter-efficient than others.\n\nThe goal is to avoid overspending on compute capacity you don’t actually need. In some cases, [an ensemble of multiple smaller specialized models](https://www.cudocompute.com/blog/what-is-ensemble-learning) might be better than one large model.\n\n-   **Training data optimization:** Ensure your training data is high-quality and relevant. A smaller, curated dataset can sometimes outperform a larger, noisy one. Removing redundant or low-value data can shorten training time.\n\nFor instance, if you’re building a biomedical language model, 100GB of well-chosen medical text will typically train faster and yield a more accurate model in that domain than 1TB of random internet text.\n\n![cost\\_of\\_llms\\_chart\\_3](https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02c.png)\n\n> Amortized hardware cost plus energy cost for the final training run of frontier models. Source: [Paper](https://arxiv.org/html/2405.21015v2)\n\nFiltering data to eliminate spam, duplicated content, or irrelevant information, and augmenting it to add useful diversity, are both important steps. 
Every epoch on unnecessary data is wasted GPU time, so focus on quality over sheer quantity when possible.\n\n-   **Knowledge distillation:** This technique involves training a smaller “student” model to replicate the behavior of a larger “teacher” model. You first train a large model (or take an existing one), then use its outputs to guide the training of a compact model.\n\nThe student model can often achieve near the teacher’s performance but with far fewer parameters, effectively compressing the knowledge of a huge model into a cheaper, faster model that is much easier to deploy and even continue training.\n\nDistillation was used, for example, to create smaller BERT variants and even some DeepSeek distillations, cutting down model size by 10 times or more while retaining most capabilities. For organizations, one strategy is to fine-tune a large model on your task (possibly using an API or a short-term rental on a big GPU machine), then distill it into a smaller model you can afford to train further or run in production.\n\n-   **Mixed-precision training:** Modern hardware and software allow using lower numerical precision for calculations to speed up training. Instead of 32-bit floating point for everything, frameworks use 16-bit or even 8-bit in certain parts of the computation.\n\nThis cuts memory usage and can double the training speed on supported hardware _without_ noticeably affecting model accuracy due to techniques like loss scaling to maintain numerical stability.\n\nThe latest NVIDIA GPUs and Google TPUs all support mixed precision. You can significantly reduce the compute time and cost required for each training step using mixed-precision training. 
It’s a free efficiency gain and has become standard practice in training large models.\n\n-   **Parameter-efficient fine-tuning:** A recent development, particularly useful if you’re working with pre-trained models, is to fine-tune only a subset of the model’s parameters or use add-on adapters, instead of complete end-to-end training.\n\nApproaches such as [low-rank adaptation](https://arxiv.org/pdf/2106.09685) (LoRA) and other adapter-based methods allow you to train just a small additional set of weights or a low-rank update to the original weights that can be merged with the model for new tasks, drastically reducing the compute and data needed to adapt a model to a domain.\n\nFor example, the LoRA paper reports that adapting GPT-3 175B this way cuts the number of trainable parameters by roughly 10,000 times and the GPU memory requirement by about 3 times compared with full fine-tuning. By training fewer parameters, you use less GPU memory and can get away with shorter training times, all while achieving nearly the same performance as full fine-tuning.\n\nParameter-efficient fine-tuning uses the investment already made in a pre-trained model and avoids having to reinvent the wheel. In general, always prefer fine-tuning or adapting an existing model over training from scratch when possible; it is a form of [transfer learning](https://www.cudocompute.com/blog/few-shot-learning-everything-you-need-to-know#:~:text=varied%20training%20data.-,Transfer%20learning,-%3A%20Transfer%20learning%20leverages) that can cut requirements by orders of magnitude.\n\n**2\\. Optimize training configurations:** Train smarter, not harder – improve your training process to converge faster or use fewer resources.\n\n-   **Hyperparameter tuning and efficient experimentation:** The choice of [hyperparameters can dramatically affect how fast a model learns](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch/4-steps-to-build-an-ai-model#hyperparameters-tuning). 
A well-tuned training run might reach a target accuracy in 50% fewer steps than a poorly tuned one, which directly translates to 50% less cost.\n\nRather than guessing, use systematic approaches to hyperparameter search (grid search, Bayesian optimization, etc.) on smaller-scale models or subsets of data to find good settings, then scale up.\n\nAlso, monitor training in real-time. If loss has plateaued early or something looks off, intervene or stop rather than let a bad configuration run to completion. Implement an early stopping strategy so that if your model’s performance on a validation set stops improving, it halts the training to avoid wasting epochs that don’t yield gains.\n\nRead more: [How to build an AI](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch)\n\n-   **Checkpointing and fault tolerance:** Training runs can be long, and failures like machine crashes and spot instance termination can happen. By checkpointing regularly (saving the model state every so often), you can resume from the last checkpoint instead of starting over, thereby saving all the progress made until the interruption. This is more about avoiding unnecessary re-training costs rather than reducing the cost of the successful run, but it’s important.\n-   **Smaller-scale and iterative training:** Often, one can pre-train on a smaller scale first to get a sense of things. For example, train a smaller model (say 10% of the full size) on the full data, or the full model on 10% of the data, to identify issues or get learning curve estimates. 
This “pilot run” approach can reveal if your learning rate is way off, or if the model is too small to ever reach your target, etc., without spending the full budget.\n\nYou can also use learning rate warmups, cosine decays, and other training schedule tricks to stabilize the training of large models. This avoids divergence that would force you to restart with different settings, which is another hidden cost.\n\n**4\\. Use advanced model strategies:**\n\nUse model designs that reduce per-model compute and external resources that make the job easier.\n\n-   **Mixture-of-experts (MoE) models:** MoE is an architecture that effectively has multiple sub-models called experts and a gating mechanism that activates only some of them for each input, meaning that at any given time, only a portion of the model’s parameters are used, rather than all of them.\n\nMoEs can significantly cut down the compute needed for a given parameter count – for instance, Google’s Switch Transformer (an MoE model) achieved comparable results to a dense model but with far less computation by sparsely activating experts.\n\nIn that research, a model with 1.6 trillion parameters was trained at the [computational cost of only a 100+ billion parameter dense model](https://arxiv.org/pdf/2101.03961). The benefit is that you get the effect of a huge model (in terms of capacity and specialization) without having to pay the full cost every time.\n\nMoEs spread out the workload and can be more efficient if implemented well. However, they add complexity; not all frameworks easily support them, and they can be tricky to get right. 
But as research advances, MoEs are becoming more common in large-scale settings.\n\nFor cost-savvy training, exploring MoE architectures could allow training frontier-sized models on a smaller budget by **reducing the computational load per token**.\n\n-   **Retrieval and external knowledge integration:** Another way to reduce the burden on the model itself is to give it tools or access to information during training/inference. For example, Retrieval-Augmented Generation (RAG) techniques provide the model with relevant documents fetched from an external database, so the model doesn’t need to memorize every fact; instead, it can look things up.\n\nDeepMind’s RETRO model demonstrated that a model with retrieval support could [match the performance of a model 25 times larger](https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=But%20there%20is%20good%20news%21) that had everything baked into its parameters. By integrating a massive external knowledge base, they maintained performance with a much smaller network, which implies large training cost [savings, since 25 times fewer parameters to train means roughly 25 times less compute](https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=You%20can%20integrate%20a%20massive,up%20to%2025x%20smaller%20networks).\n\nThis approach can be thought of as augmenting the model with a form of memory or tools so it doesn’t have to do all the work internally. Likewise, allowing a model to use a calculator for arithmetic or call an API for current knowledge means you don’t have to train it to death trying to internalize those capabilities.\n\nIn a practical sense, building a system that combines an LLM with retrieval (e.g., queries to Wikipedia or a company knowledge base) can let you use a smaller base model to achieve the same or better performance than an extremely large model without retrieval, cutting down training requirements.\n\nMany current applications, like search engine bots, use this strategy. 
So, from a cost perspective, invest in a slightly more complex system (model + retrieval tool) and you might avoid having to invest in a model that’s 10 times bigger and 10 times more expensive to train.\n\n-   **Multi-step reasoning and curriculum learning:** If you want an LLM with strong reasoning, an efficient approach is to train it to encourage reasoning via multiple steps, rather than expecting the model to solve everything in one go with brute-force parameters.\n\nFor example, techniques like chain-of-thought training involve showing the model intermediate reasoning steps for problems (like math solutions or code logic) during training. This can teach a smaller model to approach complex tasks systematically, narrowing the gap with a larger model that might implicitly learn those skills.\n\nIt’s not a direct cost-saving in compute, but it can mean you reach the desired capability without an exponential increase in model size. Similarly, curriculum learning (starting with easier tasks and then increasing difficulty) can help a model learn faster and converge better, meaning fewer epochs (less cost) to achieve high performance.\n\nThe overarching idea is training smarter. Guiding the model’s learning process can trim the needed compute compared to naive training on a jumble of tasks. This is especially useful for reasoning-centric models where the objective is not just fluent text, but correct and logical solutions.\n\n**5\\. Collaborate and use open-source ecosystems:**\n\nDon’t go at it alone; take advantage of community efforts, existing models, and shared resources.\n\n-   **Use pre-trained models and tools:** Using open-source pre-trained models significantly reduces costs by eliminating the need for full-scale training from scratch. 
Models such as Meta’s LLaMA 2, BLOOM, and DeepSeek R1 are available with pre-trained weights, enabling companies and researchers to fine-tune these models for their specific needs at a fraction of the original training cost.\n\nAs we showed earlier, fine-tuning a large model like LLaMA 2 (70B parameters) typically costs tens of thousands of dollars, substantially less than full-scale initial training.\n\nAdditionally, the open-source ecosystem provides robust tools and libraries optimized for efficient distributed training. Frameworks such as DeepSpeed and Fully Sharded Data Parallel (FSDP) effectively manage large models across limited hardware by sharding model components, allowing greater efficiency and reduced hardware requirements.\n\nHugging Face’s Accelerate and tools like Horovod further streamline distributed training processes. You can benefit from shared knowledge, compute resources, and reduced overall costs by tapping into community-developed resources and collaborating within research communities, like EleutherAI, LAION, and BigScience.\n\nBy applying these strategies, you’ll be able to cut down the cost required to train or fine-tune LLMs. Some emerging efforts have shown it’s possible to reach GPT-3/4 level performance on much smaller budgets.\n\nFor example, the founder of 01.ai claimed they trained a GPT-4-comparable model [using only 2,000 GPUs and about $3 million of compute](https://www.tomshardware.com/tech-industry/artificial-intelligence/chinese-company-trained-gpt-4-rival-with-just-2-000-gpus-01-ai-spent-usd3m-compared-to-openais-usd80m-to-usd100m#:~:text=advanced%20AI%20models%20using%202%2C000%20GPUs%20with%20just%20%243%20million) by optimizing every aspect of the process, which is a stark contrast to the amount reportedly spent on GPT-4 by OpenAI. While such claims have yet to be fully verified, they underscore that efficiency matters. 
The gap between big-budget labs and smaller players can be narrowed with clever engineering.\n\nControlling the cost of training LLMs is about making prudent choices at each step, from model size and hardware to data and tools. Those who do so can build impressive language and reasoning models without breaking the bank.\n\nThe landscape of LLM development is evolving quickly. While the cutting-edge models will always push expensive extremes, the practical know-how for cost-effective training is more available than ever. To learn how, read our guide on [how to build an AI from scratch](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch).\n\nWith careful planning, training your LLM or tailoring an existing one to your needs is feasible within a reasonable budget, and CUDO Compute can help you achieve that easily. We offer cost-effective GPUs on demand and scalable GPU clusters equipped with the latest NVIDIA GPUs. You can get started with a few clicks. [Get started](https://compute.cudo.org/)."
}
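
Quick check of the cost arithmetic quoted in the article above, as a minimal sketch. The $2,000 per GPU-month rate and the 2–3x development multiplier come from the article; the function names are hypothetical, and applying the compute multiplier directly to dollars assumes cost scales linearly with compute.

```python
def run_cost_usd(num_gpus: int, months: float, usd_per_gpu_month: float) -> float:
    """Cost of one training run: GPU-months used times the price per GPU-month."""
    return num_gpus * months * usd_per_gpu_month


def development_cost_range(final_run_usd: float, low: float = 2.0, high: float = 3.0):
    """Total development cost if all experiments together use 2-3x the final run.

    Assumption: dollars scale linearly with compute, so the article's
    compute multiplier is applied directly to the final-run cost.
    """
    return final_run_usd * low, final_run_usd * high


# The article's example: 1,000 GPUs for one month at $2,000 per GPU-month.
final_run = run_cost_usd(1_000, 1, 2_000)
print(f"${final_run:,.0f}")  # $2,000,000

# Including failed and exploratory runs (2-3x the final run).
low, high = development_cost_range(final_run)
print(f"${low:,.0f} to ${high:,.0f}")  # $4,000,000 to $6,000,000
```

With the same helper, 10,000 GPUs for three months (the "several months" figure is an assumed value here) comes to $60 million, consistent with the article's "tens of millions" claim.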

▼✓Web search query: "Sebastian Raschka LLM pretraining cost breakdown per stage GPU hours compute ...", results: 8

{
  "results": [
    {
      "title": "Sebastian Raschka, PhD's Post - LinkedIn",
      "url": "https://www.linkedin.com/posts/sebastianraschka_sure-its-widely-known-that-pretraining-activity-7278071569384460288-y_NE",
      "description": "Math: - The total number of GPU hours needed is 184,320 hours. - The cost of running one A100 instance per hour is approximately $33. - Each"
    },
    {
      "title": "GPU requirements and cost for training 7B LLaMA 2 model - Facebook",
      "url": "https://www.facebook.com/groups/artificialintelligence.lt/posts/8519734168087012",
      "description": "AI Lithuania. Example: GPU Requirements & Cost for training 7B Llama 2. As per the post by Dr. Sebastian Raschka, the 7B Llama 2 model costs about $760,000 to pretrain and took a total of 184,320 GPU hours to train. The math: the total number of GPU hours needed is 184,320; the cost of running one A100 instance per hour is approximately $33; each instance has 8 A100 GPUs. That’s 184320 / 8 * 33 ~ $760,000. https://vitalflux.com/llm-gpu-memory-requirements-examples/ (LLM Training & GPU Memory Requirements: Examples - Analytics Yogi). Comment on the post (translated from Lithuanian): an A100 definitely doesn't cost $33 per hour.."
    },
    {
      "title": "An updated back-of-the-envelope calculation of LLM pretraining ...",
      "url": "https://x.com/rasbt/status/1872299621596475524",
      "description": "- The cost of running one A100 instance per hour is approximately $33. - Each instance has 8 A100 GPUs. That's 184320 / 8 * 33 = $760,000."
    },
    {
      "title": "flops-analysis.ipynb - rasbt/LLMs-from-scratch - GitHub",
      "url": "https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/02_performance-analysis/flops-analysis.ipynb",
      "description": "FLOPs (Floating Point Operations Per Second) measure the computational complexity of neural network models by counting the number of floating-point operations"
    },
    {
      "title": "New LLM Pre-training and Post-training Paradigms - Ahead of AI",
      "url": "https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training",
      "description": "Build a Large Language Model (from Scratch) is a highly focused book dedicated to coding LLMs from the ground up in PyTorch, covering everything from pre-training to post-training—arguably the best way to truly understand LLMs. Machine Learning Q and AI is a great book for those who are already familiar with the basics; it dives into intermediate and advanced concepts covering deep neural networks, vision transformers, multi-GPU training paradigms, LLMs, and many more. Interestingly, they also used Qwen models (although they didn't specify details, I assume they mean previous generation Qwen models) to synthesize additional pre-training data. Apple's approach to pre-training and post-training is relatively comprehensive, likely because the stakes are very high (the model is deployed on millions, if not billions, of devices). For the third pre-training stage, the researchers trained the model on a small but high-quality mix, which they found helps improve the performance on benchmark datasets. Interestingly, like Apple's AFM model, Llama 3 also implemented a 3-stage pre-training process."
    },
    {
      "title": "Building LLMs from the Ground Up: A 3-hour Coding Workshop",
      "url": "https://magazine.sebastianraschka.com/p/building-llms-from-the-ground-up",
      "description": "# Building LLMs from the Ground Up: A 3-hour Coding Workshop. If you’d like to spend a few hours this weekend to dive into Large Language Models (LLMs) and understand how they work, I've prepared a 3-hour coding workshop presentation on implementing, training, and using LLMs. Below, you'll find a table of contents to get an idea of what this video covers (the video itself has clickable chapter marks, allowing you to jump directly to topics of interest):. 2:17 – Part 1: Intro to LLMs. 10:48 – Part 2: Understanding LLM input data. 1:45:12 – Part 5.2: Pretrained weights via LitGPT. It's a slight departure from my usual text-based content, but the last time I did this a few months ago, it was so well-received that I thought it might be nice to do another one! Build an LLM from Scratch book. Build an LLM from Scratch GitHub repository. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book."
    },
    {
      "title": "Sebastian Raschka, PhD (@rasbt)",
      "url": "https://substack.com/@rasbt/note/c-190575668",
      "description": "# Sebastian Raschka, PhD (@rasbt): \"The LLM eras: 202x Pre-training (foundation) 2022 RLHF + PPO 2023 LoRA SFT 2024 Mid-Training 2025 RLVR + GRPO 2026 Inference-time scaling? In itself it’s a training technique, but you train the model to output more tokens, which is a form of inference scaling."
    },
    {
      "title": "State of LLMs 2026: RLVR, GRPO, Inference Scaling - YouTube",
      "url": "https://www.youtube.com/watch?v=K5WPr5dtne0",
      "description": ""
    }
  ]
}
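
The back-of-the-envelope figure repeated in these search results can be reproduced directly. This is a minimal sketch using only the numbers in the snippets (184,320 GPU hours, 8 A100s per instance, about $33 per instance-hour); the function name is hypothetical.

```python
def pretraining_cost_usd(gpu_hours: float, gpus_per_instance: int,
                         usd_per_instance_hour: float) -> float:
    """Convert GPU hours into instance hours, then price them per instance-hour."""
    instance_hours = gpu_hours / gpus_per_instance
    return instance_hours * usd_per_instance_hour


# Figures quoted in the snippets for the 7B Llama 2 estimate.
cost = pretraining_cost_usd(184_320, 8, 33.0)
print(f"${cost:,.0f}")  # $760,320

# The $33 is per 8-GPU instance, so the implied per-GPU rate is much lower.
print(33.0 / 8)  # 4.125
```

The per-GPU rate matters when reading the translated comment in the Facebook result: $33 is the hourly price of an 8-GPU instance, not of a single A100.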

▼✓Web search query: "LLM training compute FLOPs scaling law Chinchilla optimal tokens parameters c...", results: 8

{
  "results": [
    {
      "title": "irhum.github.io - Thoughts on Chinchilla",
      "url": "https://irhum.github.io/blog/chinchilla",
      "description": "The answer is in the scaling law literature (Kaplan et al., 2020, Sec 6.1): the “optimal” \\((N\\_{\\text{opt}}, D\\_{\\text{opt}})\\) are the ones that produce a model that achieves the lowest loss on a validation set of the pretraining data, subject to the fixed cost constraint (the green star above). One intuitive (and important) conclusion from the 1:1 scaling of model size and data means, if you want a compute optimal model that’s 2x large, you need to train it on 2x many tokens. This is enormously expensive, and is the core reason model sizes peaked around early-2022 (pre-Chinchilla): we’re only just now doing training runs with \\(C\\) large enough that models of that size (500B+) are *compute optimal*, and future model size scaling will remain slower (compared to pre-Chinchilla) because of this quadratic factor. Suppose the compute optimal model for an initial \\(2.66\\times 10^{21}\\) FLOPs budget is \\(N=2.8\\text{B}\\) params trained on \\(D=156\\text{B}\\) tokens. This work also fitted scaling laws between compute \\(C\\), and model size \\(N\\) and number of tokens \\(D\\)."
    },
    {
      "title": "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws",
      "url": "https://gonzoml.substack.com/p/beyond-chinchilla-optimal-accounting",
      "description": "# Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Continuing the discussion on optimal training regimes for Large Language Models (LLMs) with a focus on inference and moving beyond Chinchilla's guidelines. *Inference* is a crucial phase in the life cycle of most models, akin to the deployment and maintenance phases in software engineering where the software spends most of its life. The first major breakthrough in finding optimal regimes (in terms of computational budget, model size, and number of training tokens) was achieved in the study on **Chinchilla** (). Returning to the article, the authors update Chinchilla's scaling laws to include mass inference and advocate for **longer training of smaller models**. The task in the Chinchilla study was *“to find optimal N (parameters) and Dtr (tokens) within a given computational budget that minimize pre-training loss”*. For example, for a model with Chinchilla-7B quality level and an inference need of 1011 tokens, it's more optimal to train a 6B model on 1.18x the original data."
    },
    {
      "title": "Chinchilla Scaling Laws for Large Language Models (LLMs) - Medium",
      "url": "https://medium.com/@raniahossam/chinchilla-scaling-laws-for-large-language-models-llms-40c434e4e1c1",
      "description": "1,400B (1.4T) tokens should be used to train a data-optimal LLM of size 70B parameters. So, we need around 20 text tokens per parameter. Get"
    },
    {
      "title": "Chinchilla Scaling Laws: Compute-Optimal LLM Training - Interactive",
      "url": "https://mbrenndoerfer.com/writing/chinchilla-scaling-laws-compute-optimal-llm-training",
      "description": "Learn how DeepMind's Chinchilla scaling laws revolutionized LLM training by proving models should use 20 tokens per parameter for"
    },
    {
      "title": "How Long Should You Train Your Language Model? | Databricks Blog",
      "url": "https://www.databricks.com/blog/how-long-should-you-train-your-language-model",
      "description": "# How Long Should You Train Your Language Model? Accounting for Inference in Language Model Scaling Laws. Our recent paper, presented at ICML 2024, **proposes a modified scaling law to account for the cost of** ***both training and inference*****.** This blog post explains the reasoning behind our new scaling law, and then experimentally demonstrates how “overtrained” LLMs can be optimal. The “*Chinchilla”* Scaling Law is the most widely cited scaling law for LLMs. The Chinchilla paper asked the question: If you have a fixed training compute budget, how should you balance model size and training duration to produce the highest quality model? In our research, we were motivated by the idea of training smaller models on more data than the Chinchilla law suggested. By contrast, we assume a *fixed pretraining loss* (i.e. model quality) and find the model size and training duration that *minimize the total compute* over the model’s lifetime, including both training and inference (Fig. 2, Eq. 2)."
    },
    {
      "title": "[PDF] Training Compute-Optimal Large Language Models",
      "url": "https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf",
      "description": "We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more more data. Model Size (# Parameters) Training Tokens LaMDA [52] 137 Billion 768 Billion GPT-3 [6] 175 Billion 300 Billion Jurassic [30] 178 Billion 300 Billion Gopher [38] 280 Billion 300 Billion MT-NLG 530B [48] 530 Billion 270 Billion Chinchilla 70 Billion 1.4 Trillion is typically only feasible to train these large models once, accurately estimating the best model hyperparameters for a given compute budget is critical [51]. From these curves, we extracted the envelope of minimal loss per FLOP, and we used these points to estimate the optimal model size (center) for a given compute budget and the optimal number of training tokens (right). 3 Estimating the optimal parameter/training tokens allocation We present three different approaches to answer the question driving our research: Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?"
    },
    {
      "title": "Accounting for Inference in Language Model Scaling Laws - arXiv",
      "url": "https://arxiv.org/html/2401.00448v2",
      "description": "In this paper, we modify the Chinchilla scaling laws to account for inference costs by calculating the optimal parameter and training token"
    },
    {
      "title": "\"Training Compute-Optimal Large Language Models\", Hoffmann et ...",
      "url": "https://www.reddit.com/r/mlscaling/comments/trwkck/training_computeoptimal_large_language_models",
      "description": "We verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens. Not only does Chinchilla outperform its"
    }
  ]
}
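
The Chinchilla rule of thumb in these results (about 20 training tokens per parameter, e.g. 70B parameters on 1.4T tokens) can be turned into a small calculator. This is a sketch under two assumptions that are not spelled out in the snippets: training compute is approximated as C ≈ 6·N·D FLOPs, a common estimate for dense transformers, and the 20:1 ratio is treated as exact. All function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # rule of thumb quoted in the search results


def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C = 6 * N * D."""
    return 6 * n_params * n_tokens


def compute_optimal_from_budget(flops_budget: float):
    """Invert C = 6 * N * (20 * N) to get the optimal (params, tokens) pair."""
    n_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    return n_params, chinchilla_tokens(n_params)


tokens = chinchilla_tokens(70e9)        # 1.4 trillion tokens for a 70B model
flops = training_flops(70e9, tokens)
print(f"{flops:.2e}")                   # 5.88e+23
```

Because D grows in proportion to N under this rule, compute grows with the square of model size, which is the "quadratic factor" the first search result refers to.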

▼thought

Good, I now have comprehensive data. Let me also get the specific Galileo article for the "28x" human data annotation exceeding compute claim, and the specific calculation formulas.

▼✓Fetch web page content, URL: "https://www.abaka.ai/blog/llm-data-cost"

{
  "url": "https://www.abaka.ai/blog/llm-data-cost",
  "title": "Abaka AI | Abaka AI - AI Data Annotation & Solution - Your Data Partner In The AI Industry",
  "summary": "Abaka AI offers data collection, data cleaning, data annotation, and high-quality datasets for world-class Automobile AI, Generative AI, and Embodied AI industry leaders.",
  "content": "<div id=\"readability-page-1\" class=\"page\"><section><article data-v-8def9226=\"\"><!--[--><h2 lexical-key=\"1\" id=\"llm-data-cost-breakdown-all-you-need-to-know-about-data-costs-for-training-an-llm\" data-v-6ee1d66a=\"\"><span>LLM Data Cost Breakdown: All You Need to Know About Data Costs for Training an LLM</span></h2><h2 lexical-key=\"3\" id=\"_1-introduction\" data-v-4b2aea56=\"\"><span>1. Introduction</span></h2><p lexical-key=\"5\" data-v-091cadd3=\"\"><span>With the rapid development of Large Language Models (LLMs), more and more enterprises are considering applying LLMs to their actual businesses. However, in the process of implementing LLMs, data cost is often an important factor that cannot be ignored. As decision-makers, understanding the data requirements and related costs of LLM training at various stages is crucial for the successful implementation of the project.</span></p><p lexical-key=\"7\" data-v-091cadd3=\"\"><span>ABAKA AI will take you on an in-depth exploration of the three key stages of LLM training: Pre-training, Supervised Fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), analyzing the data requirement characteristics of each stage and their impact on costs. 
We will provide a detailed interpretation of the composition of LLM data costs from multiple dimensions such as data volume, data quality, and data diversity, as well as how to optimize data investment while ensuring model performance.</span></p><p lexical-key=\"9\" data-v-091cadd3=\"\"><span>Whether you are a corporate executive just starting to explore LLM applications, or a technical leader who has already made achievements in the AI field, we will provide you with a comprehensive and practical LLM data cost assessment framework to help you navigate AI implementation decisions with ease, using our past experience to help you organize a framework for calculating costs.</span></p><h2 lexical-key=\"11\" id=\"_2-pre-training-stage\" data-v-4b2aea56=\"\"><span>2. Pre-training Stage</span></h2><h3 lexical-key=\"13\" id=\"_21-dataset-scale-estimation\"><span>2.1. Dataset Scale Estimation</span></h3><p lexical-key=\"15\" data-v-091cadd3=\"\"><span>Estimating the required pre-training dataset size given a computational budget C is the first step in implementing an LLM project. This process involves different Scaling Laws, the most famous of which are OpenAI's Scaling Law and DeepMind's Chinchilla Law.</span></p><p lexical-key=\"17\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->OpenAI's research published in 2020 proposed the initial Scaling Laws, indicating a power-law relationship between model performance and model parameter count, dataset size, and computational resources. 
However, the Chinchilla Law proposed by DeepMind in 2022 revised this, arguing that for a fixed compute budget the model parameter count and the training data volume should be scaled up in roughly equal proportion.</span></p><p lexical-key=\"19\" data-v-091cadd3=\"\"><b><strong>OpenAI Scaling Law</strong></b><span>:</span></p><p lexical-key=\"22\" data-v-091cadd3=\"\"><span>$L ( N, D )=\\left[ \\left( \\frac{N_{c}} {N} \\right)^{\\frac{\\alpha_{N}} {\\alpha_{D}}}+\\frac{D_{c}} {D} \\right]^{\\alpha_{D}}$</span></p><p lexical-key=\"24\" data-v-091cadd3=\"\"><b><strong>DeepMind Scaling Law</strong></b><span>:</span></p><p lexical-key=\"27\" data-v-091cadd3=\"\"><span>$\\hat{L} ( N, D ) \\triangleq E+\\frac{A} {N^{\\alpha}}+\\frac{B} {D^{\\beta}}$</span></p><p lexical-key=\"29\" data-v-091cadd3=\"\"><span>Both formulas express the model's loss ($L$ or $\\hat{L}$) as a function of the model parameter count ($N$) and the dataset size ($D$).</span></p><p lexical-key=\"31\" data-v-091cadd3=\"\"><span>The two formulas reflect different ways of modeling LLM scaling behavior. In practice, we often need to balance model size against data volume. For example, to reduce inference costs, we can use a smaller model trained on more data. Research by Hoffmann et al. 
[1] shows that under a fixed computational budget, a well-trained small model may perform better than an undertrained large model. Specifically, if we originally planned to train an 8B parameter model but want to reduce inference costs, we can consider replacing it with a model with fewer parameters (such as 7B) while increasing the amount of training data. This approach may not only maintain or even improve model performance but also significantly reduce deployment and operational costs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"33\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->The first step of data budgeting is crucial, determining the size of the model and the size of the pre-train dataset needed. ABAKA AI can build high-quality datasets for you, while having more stock data that can precisely match more suitable data according to your needs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/1.png\" alt=\"\n    Data scraping capabilities of ABAKA AI\n  \" title=\"\n    Data scraping capabilities of ABAKA AI\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    Data scraping capabilities of ABAKA AI\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><h3 lexical-key=\"37\" id=\"_22-multi-domain-data-ratio\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.2. Multi-domain Data Ratio<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"39\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->The pre-training corpus can include various types of text data, such as web pages, academic materials, books, and relevant texts from different fields, such as legal documents, annual financial reports, medical textbooks, and other domain-specific data. 
In the pre-training stage, LLMs learn broad knowledge from massive unlabeled text data and store it in model parameters, thus acquiring a certain level of language understanding and generation capabilities.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"41\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->A general pre-training corpus is a large-scale dataset composed of a large amount of text from different domains and sources. Research by Liu, Yang et al. [2] divides general data into eight major categories: web pages, language text, books, academic materials, code, parallel corpora, social media, and encyclopedias. In the pre-training process of the model, the diversity and quality of data are crucial, so careful design of the ratio of these different categories of data is needed when constructing the pre-training dataset.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Web data<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Web data is one of the most widely used sources of pre-training data. The data usually exists in Hypertext Markup Language (HTML) format, showing certain structural features, and is rich in topics, covering content from different fields and disciplines. However, web data may also contain noise and low-quality content, so careful screening and cleaning are required.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Language text<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Language text data mainly consists of two parts. 
The first part is electronic text data built based on a wide range of sources of written and oral language, usually presented in the form of large corpora of specific languages; the second part is electronic text data built based on relevant written materials in various fields or topics. For example, FinGLM covers annual reports of some listed companies from 2019 to 2021. This type of data belongs to language text materials in the financial field.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"3\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Books<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Book data is also one of the common data types in pre-training corpora. Compared with web pages, books have longer text content and higher data quality, both of which help improve the performance of large language models. Book data provides knowledge with both depth and breadth, allowing models to improve understanding ability and knowledge reserve while learning deeper contextual information.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"4\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Academic materials<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Academic material data refers to text data related to academic fields, including but not limited to academic papers, journal articles, conference papers, research reports, patents, etc. These data are written and published by experts and scholars in academia, with high professionalism and academic rigor. Including them in pre-training corpora can provide more accurate and professional information, helping models understand terminology and knowledge within academic fields. Academic literature, papers, and textbooks provide examples of professional and technical language use, as well as the latest scientific discoveries. 
This type of data is particularly important for improving model performance in professional fields.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"5\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Code<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: The code data category refers to text information containing programming languages, such as Python, Java, C++, and other code snippets. Its purpose is to help models better understand programming languages and code structures. Code datasets can not only enhance programming capabilities but may also improve logical reasoning abilities. This type of data enables LLMs to understand and generate code in various programming languages, providing support for software development and code analysis tasks.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"6\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Parallel corpora<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Parallel corpus data refers to a collection of text or sentence pairs in different languages. These text pairs are translations of each other, where one text is in the source language (e.g., English) and the corresponding text is in the target language (e.g., Chinese). The introduction of parallel corpus data is crucial for improving the machine translation capabilities and cross-lingual task performance of large language models.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"7\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Social media<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Social media data refers to text content collected from various media platforms, mainly including user-generated posts, comments, and conversations between users, reflecting informal, colloquial language use. It contains a large amount of slang, new words, and diverse expressions. 
Although social media data may contain harmful information such as bias, discrimination, and violence, it is still crucial for the pre-training of large language models. This is because social media data is beneficial for models to learn expressive abilities in conversational communication and capture social trends, user behavior patterns, etc.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"8\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Encyclopedia<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Encyclopedia data refers to text information extracted from encyclopedias, online encyclopedia websites, or other knowledge databases. Data from online encyclopedia websites is written and edited by experts, volunteers, or community contributors, with a certain degree of authority and reliability. Due to its easy accessibility, it is included in pre-training corpora at a higher frequency, becoming a cornerstone for enhancing the knowledge base of large language models.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"68\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Reasonably configuring this pre-training data can significantly improve the performance and applicability of LLMs. The quality and diversity of data are often more important than the sheer volume of data. 
Based on the need for high-quality, multi-domain data ratios, ABAKA AI carefully considers the characteristics and value of each type of data when designing pre-training datasets, adjusting the ratio according to your specific needs to help you achieve high-quality and precise pre-training dataset ratios, reducing model training costs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/image.png\" alt=\"\n    The distribution of data types in the corpora used for pre-training by different models\n  \" title=\"\n    The distribution of data types in the corpora used for pre-training by different models\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    The distribution of data types in the corpora used for pre-training by different models\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"72\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->This image shows the distribution of data types in the corpora used by different models during pre-training. Each pie chart represents a model and indicates the proportions of various data types. Different data types are distinguished by different colors, including web pages, code, encyclopedias, books, academic materials, social media, language text, and diverse data.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"74\" id=\"_23-training-data-acquisition\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.3. Training Data Acquisition<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"76\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Although open-source datasets provide a foundation for model training, many truly valuable and unique data often do not appear in public channels. 
Therefore, targeted crawling of data from specific domains or sources has become a key strategy for improving model performance and competitiveness, and acquiring this kind of data is essential. For high-quality training data acquisition, ABAKA AI can provide you with deeper insights, higher timeliness, and more unique data in targeted acquisitions, helping you improve model performance and accuracy in vertical domains and enhance the model's understanding of the latest information and trends.</span></p><p lexical-key=\"78\" data-v-091cadd3=\"\"><span>Channels for targeted data acquisition usually include data crawling, commercial database subscriptions, and data cooperation and exchange. All of these except web crawling are too case-specific to generalize, so this section discusses only data crawling. Data crawling places modest demands on infrastructure, so the following calculations consider only development costs.</span></p><p lexical-key=\"80\" data-v-091cadd3=\"\"><span>Even before development starts, the more important step is choosing suitable data sources. Crawling from suitable data sources can significantly improve the model's performance in specific domains. 
After determining the data source, the development and crawling costs mainly come from the following aspects:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Development cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"86\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{dev} = (S_{dev} × D_{initial}) + (S_{dev} × D_{update})$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"88\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $D_{initial}$ and $D_{update}$ are the time for initial development and updating the crawling code after website updates, respectively. The complexity of the website, verification mechanisms, request complexity, etc. will all affect the development time.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"2\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Maintenance cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"94\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{ ops} = S_{ops} × D_{crawl} × α$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"96\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Maintenance costs may not be full-time, so a coefficient $α (0 &lt; α ≤ 1)$ can be introduced to represent the actual proportion of maintenance time needed. If the data needs continuous updating or the crawling period is very long, then maintenance personnel intervention is needed to keep the crawler running normally and respond to website changes. 
If the crawler system uses a distributed strategy, more maintenance support may be needed.</span></p><ol start=\"3\" data-v-9ffa60ee=\"\"><li value=\"3\"><b><strong>IP proxy pool</strong></b><span>：</span></li></ol><p lexical-key=\"102\" data-v-091cadd3=\"\"><span>$Budget_{ip} = (\\frac{N_{req}}{N_{req\\_per\\_ip}}) × C_{ip}$</span></p><p lexical-key=\"104\" data-v-091cadd3=\"\"><span>Where $N_{req}$ is the total number of requests, $N_{req\\_per\\_ip}$ is the number of requests each IP can handle, and $C_{ip}$ is the unit price of each IP. Factors such as the website's IP restriction policy, total data volume, IP quality, IP geographic location requirements, and proxy type will affect the price.</span></p><ol start=\"4\" data-v-9ffa60ee=\"\"><li value=\"4\"><b><strong>Crawling material cost</strong></b><span>：</span></li></ol><p lexical-key=\"110\" data-v-091cadd3=\"\"><span>$Budget_{mat} = C_{mem} × N_{mem} × (D_{crawl} / D_{mem\\_validity})$</span></p><p lexical-key=\"112\" data-v-091cadd3=\"\"><span>Where $C_{mem}$ is the price of one membership for one validity period, $N_{mem}$ is the number of memberships required, and $D_{mem\\_validity}$ is the validity period of a membership (in days). Factors such as membership level and concurrency strategy will affect the final budget. 
If the target website requires registration or membership to download, then this cost needs to be considered.</span></p><p lexical-key=\"114\" data-v-091cadd3=\"\"><span>Summing the four components gives the total:</span></p><p lexical-key=\"116\" data-v-091cadd3=\"\"><span>$Budget_{total} = S_{dev} × (D_{initial} + D_{update}) + S_{ops} × D_{crawl} × α + (N_{req} / N_{req\\_per\\_ip}) × C_{ip} + (C_{mem} × N_{mem} × D_{crawl}) / D_{mem\\_validity}$</span></p><p lexical-key=\"118\" data-v-091cadd3=\"\"><span>Generally speaking, a vertical domain website costs between 15,00 and 15,000 USD depending on the difficulty, with large social networking sites costing more. ABAKA AI can provide you with deeper insights, higher timeliness, and more distinctive, higher-quality data, and reduce the total acquisition cost by 70%, helping you train excellent large language models in various dimensions.</span></p><p lexical-key=\"122\" data-v-091cadd3=\"\"><span>A large amount of high-quality LLM pre-training data exists in the form of PDFs or scanned images. Because layouts and formats are diverse and scan quality varies, building datasets from this data is challenging: the content first has to be converted into a format such as markdown. 
The core problems mainly focus on two aspects: extracting content information and layout information (including body text, titles, figure captions, images, tables, formulas) and handling the relationships between layout elements.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"124\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->When processing multiple open-source datasets, ABAKA AI observed several excellent open-source solutions, such as PP-StructureV2, Marker, Vary, and Nougat, but they each have shortcomings. PP-StructureV2 cannot identify LaTeX format content and lacks necessary post-processing steps; Marker covers fewer languages and doesn't handle figures well; Nougat has limited support for multi-column data and can identify limited languages, while Vary / Vary-toy consumes more computational resources.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"126\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Based on these situations, ABAKA AI, as a member of the Multimodal Art Projection (M-A-P) team, fully participated in building the completely open-source large language model MAP-Neo, which also open-sourced the Document Convert Pipeline. 
This pipeline better balances performance and computational overhead, while the decoupling between modules brings better interpretability and makes it easier to upgrade, add, and replace modules, providing a more flexible, efficient, and CPU-friendly solution.</span></p><figure><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/2.png\" alt=\"\n  \" title=\"\n  \" data-v-18814304=\"\"><figcaption>\n  </figcaption></figure><p lexical-key=\"130\" data-v-091cadd3=\"\"><span>In addition to using models for conversion, many vendors provide similar services, such as mathpix, Doc2x, Paodin PDFlux, pix2text, X Information, X Xun Cloud Large Model Knowledge Engine Document Parsing, etc. We therefore provide two ways to calculate costs below:</span></p><ol data-v-9ffa60ee=\"\"><li value=\"1\"><b><strong>Self-built conversion service cost</strong></b><span>：</span></li></ol><p lexical-key=\"136\" data-v-091cadd3=\"\"><span>$Budget_{convert} = (\\frac{N_{pages}}{R_{process}}) × C_{node} × (1 + F_{complexity}) + C_{integration}$</span></p><p lexical-key=\"138\" data-v-091cadd3=\"\"><span>Where $N_{pages}$ is the total number of pages, $R_{process}$ is the number of pages processed per node per day, $C_{node}$ is the price per node per day, and $F_{complexity}$ is the document complexity factor ($0 ≤ F_{complexity} ≤ 1$). 
Generally speaking, the layout and fonts of magazines and newspapers will be more complex, while literature and patents will have richer images and tables. These factors need to be considered when specifying budgets. $C_{integration}$ is the cost of deployment, updating strategies/models, and maintenance. This part of the cost will vary greatly depending on the task.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"2\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Third-party service cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"144\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{convert} = \\sum_{i=1}^{n} C_{tier,i} \\times N_{pages,i} + C_{integration}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"146\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $n$ is the number of price tiers, $C_{tier,i}$ is the price per page for the i-th tier, $N_{pages,i}$ is the number of pages in the i-th tier, $C_{integration}$ is the cost of API integration and maintenance.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"148\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->The choice between these methods depends on multiple factors, including the number and type of documents, required conversion quality, availability of internal resources, and budget constraints. In fact, in most cases, easy data is converted using one's own servers, while difficult data uses commercial-grade services.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"150\" id=\"_25-training-data-cleaning\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.5. 
Training Data Cleaning</span></h3><p lexical-key=\"152\" data-v-091cadd3=\"\"><span>Although the raw data obtained through web crawling, document conversion, and open-source datasets provides a foundation for model training, it usually contains noise, errors, biases, and false information, all of which reduce training effectiveness. Data cleaning is therefore a key step in improving model performance and reliability. To obtain high-quality data, ABAKA AI can provide you with cleaner and more refined data cleaning, significantly improving data quality, thereby enhancing the model's performance on specific tasks, strengthening the model's ability to understand complex patterns, and reducing misleading learning due to data issues.</span></p><figure><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/3.png\" alt=\"\n    Fineweb data cleaning pipeline\n  \" title=\"\n    Fineweb data cleaning pipeline\n  \" data-v-18814304=\"\"><figcaption>\n    Fineweb data cleaning pipeline\n  </figcaption></figure><p lexical-key=\"156\" data-v-091cadd3=\"\"><span>Before cleaning starts, the more important step is formulating an appropriate cleaning strategy. This requires a thorough understanding of data characteristics, model requirements, and potential data quality issues, and the strategy should account for factors such as data scale, complexity, and domain characteristics. For cost estimation, take as an example the Matrix dataset of the MAP-Neo large model, on which ABAKA AI worked together with Ge Zhang et al. [3]: Matrix released 4.7T tokens of data and is arguably one of the highest-quality and largest-scale bilingual datasets. 
The general approach to data cleaning for the Matrix dataset follows the principle of \"from coarse to fine\" and \"from simple to complex\". We can divide the cleaning steps into the following main stages:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Heuristic filtering<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：Heuristic rule filtering is the first line of defense, aimed at quickly identifying and deleting low-quality data. This step has low computational cost but can significantly reduce the amount of data for subsequent processing. Filtering criteria include: URL; blacklist word table; gibberish text filter; document length; proportion of special characters; proportion of short, continuous, or incomplete lines; repeated words; n-grams or paragraphs. The filtering thresholds are based on statistical analysis of large document samples. Heuristic rules can effectively identify and remove low-quality data, preventing low-quality pre-training corpora from affecting model performance. As the team used composite data from multiple sources, based on data diversity, the team specially designed cleaning methods and tailored rules for each method to maintain consistency in data quality.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Data deduplication<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：Many studies have shown that repetitive text may lead to a decline in model performance, making deduplication a key step in corpus processing (although this point is somewhat controversial, more repetitive data may precisely indicate that this part of the data is of high quality, which is an important feature. 
For example, FineWeb's view is that more deduplication does not necessarily mean better performance; if deduplication is performed across dumps, performance may actually be worse).</span><br><span>a. </span><b><strong>Exact duplication</strong></b><span>：Exact document deduplication checks whether an entire text is completely identical to another text; if it is, the duplicate is deleted. Because of the large amount of data, clusters must be used for processing, and out-of-memory problems may still occur. In practice, we store text data in batches in different storage buckets and then process the data in each bucket in turn to remove duplicates.</span><br><span>b. </span><b><strong>Near-duplicate</strong></b><span>：For near-duplicates, we use the MinHash LSH deduplication method to remove as many as possible. It is particularly suitable for web data and is widely used for similarity search and duplicate detection in large datasets. It handles the very common case where the text content is essentially the same but the scattered template blocks of the web pages differ. The principle of MinHash is to represent a set with a small number of hash values, which can then be used to estimate the Jaccard similarity between two sets. The computational cost of this step is still quite high.</span><br><span>c. 
</span><b><strong>Similar Line</strong></b><span>：To handle the same content appearing multiple times within a text, a direct method is to split the text into lines using specific delimiters and then compare the similarity between lines. If two lines are similar, the later one is deleted.</span><br><span>d. In addition, paragraph deduplication and substring deduplication were also performed to achieve better results.</span></li><li value=\"3\"><b><strong>Quality screening</strong></b><span>：After data cleaning, FineWeb-Edu used the Llama3-70B-Instruct model to score the data and trained a BERT-like classification model on those scores. The classification model was then used to filter the data, greatly improving data quality. 
In addition to using models for data quality screening, many developers use fasttext models for language identification when cleaning CC datasets.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/4.png\" alt=\"\n    On the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n  \" title=\"\n    On the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    On the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png\" alt=\"\n    Deduplication did not show the expected performance improvement in this experiment\n  \" title=\"\n    Deduplication did not show the expected performance improvement in this experiment\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    Deduplication did not show the expected performance improvement in this experiment\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"185\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Based on the above steps, we can calculate the cost of data cleaning:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Engineer debugging and rule determination cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"191\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{engineer} = S_{eng} \\times (T_{rules} + 
T_{debug})$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"193\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $S_{eng}$ is the developer's daily salary (USD/day), and $T_{rules}$ and $T_{debug}$ are the times (in days) required for formulating the rules and for debugging and optimizing them.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"2\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Storage costs<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"199\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{storage} = C_{storage} \\times V_{data} \\times T_{retention}$, where $C_{storage}$ is the storage cost per TB per month, $V_{data}$ is the total data volume (TB), and $T_{retention}$ is the data retention time (months).<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Computation costs<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"205\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{compute} = \\sum_{i=1}^{n} [C_i \\times \\frac{V_{data,i}}{R_i} \\times (1 + \\beta_i \\times (F_{comm} + F_{ops}))]$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"207\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $i$ indexes the processing stages (1 to $n$), $C_i$ is the unit cost of computing resources for the i-th stage (USD/day), $V_{data,i}$ is the data volume for the i-th stage (TB), $R_i$ is the processing rate for the i-th stage (TB/day), $\\beta_i$ is a binary indicator showing whether the i-th stage uses cluster processing (0 for single-node 
processing, 1 for cluster processing), $F_{comm}$ and $F_{ops}$ are the communication and operational overheads of using clusters. Using clusters is troublesome and costly, so we use heuristic filtering as the first step.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"4\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"4\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Quality screening<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"213\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{quality} = C_{train} \\times T_{training} + C_{data_annotation} + C_{inference} \\times \\frac{V_{data}}{R_{inference}}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"215\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $C_{train}$ and $C_{inference}$ are the computational costs for training and inference, which usually differ significantly in price, $T_{training}$ is the training time (days), $C_{data_annotation}$ is the annotation cost, $\\frac{V_{data}}{R_{inference}}$ indicates the time needed to complete inference for all data.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"217\" id=\"_26-data-cost-calculation\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.6. Data Cost Calculation<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"219\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->High-quality data processing comes at a cost. From data acquisition to the final cleaning process, each step involves complex computations and human resource investments, all of which translate into actual costs. 
This chapter will combine ABAKA AI's previous content and rich experience to provide you with some feasible ideas, hoping to help you calculate data costs when implementing LLMs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"221\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Based on the data processing flow described earlier, we can roughly divide data costs into the following main categories: storage costs, data acquisition costs, data conversion costs, and data cleaning costs. We hope to help you establish an intuitive budget system through ABAKA AI's past rich experience:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Storage costs<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：In this field, data scale is far larger than general projects, with pre-training datasets reaching PB levels. Single machines cannot meet such large-scale data storage needs, and projects also have high bandwidth requirements. Therefore, distributed storage is generally used. Distributed storage facilitates horizontal expansion, can meet growing storage needs, and has data backup and fault tolerance mechanisms, ensuring high data reliability. Multi-node parallel read and write can also improve I/O performance. Generally, the capacity price of distributed storage is about 85 USD/T (NVME + HDD), meaning 1PB of available storage space costs about 85,000 USD. 
Adding security redundancy, network equipment, and security equipment brings the cost to about 99,000 USD/PB.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Data acquisition<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：All historical data from a well-known large website can be estimated at around 42,500-70,500 USD, with incremental updates costing about 14,000 USD annually. For vertical domain websites, it could be anywhere from 4,200 to 14,000 USD. Video websites are three to five times more expensive than ordinary websites (bandwidth, storage), and overseas websites are two to three times more expensive (overseas proxies, overseas servers, compliance). Assuming you need to crawl 8 mainstream social media and news websites + 15 vertical domain websites (such as code, mathematics, finance), a budget of 706,000 USD would be appropriate.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"3\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Document information extraction<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：Based on ABAKA AI's experience, using the pipeline developed by ABAKA AI for document conversion is more cost-effective and flexible. If using consumer-grade GPUs for conversion, the cost per page is about 0.000035 USD, far lower than mathpix's 0.025 / 0.01 USD per page. Of course, we now see many good domestic manufacturers working in this area, and we look forward to better models and cheaper prices from domestic service providers. 
All in all, including the time for Gap and debugging, estimate about 14,000 USD for every 10,000,000 pages of documents (80% using your own model + 20% using third-party services).<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"4\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Data cleaning<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：The cost of this step mainly depends on how many data sources there are and their domains. When processing very dirty data, ABAKA AI used over 1,000 cores for about a month, adding many special rules to obtain higher quality data, with a data retention rate of less than 1%. Therefore, this part of the data can be calculated as follows:<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"236\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$S_{\\text{eng}} + \\frac{V_{\\text{data}}}{100\\text{T}} \\times C_{\\text{base}}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"238\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->That is, the data cleaning cost for each domain consists of two weeks' salary for an algorithm engineer + 2,800 USD for cleaning every 100T, assuming the data cleaning cost increases linearly with data volume when the cluster is set up. For example, like Fineweb-edu using Llama3-70B and Bert-like models, the price is also quite affordable, just slightly increase the cost per 100T.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"240\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->In summary, preparing pre-training data for LLMs is a complex and costly process. It involves multiple stages, including data acquisition, storage, document information extraction, and data cleaning, each requiring careful planning and substantial investment. 
The quality and diversity of data are crucial to the model's final performance, so each stage should be optimized as much as possible within budget constraints. At the same time, we find that the value of experienced algorithm engineers cannot be overlooked. Their experience and expertise can help teams avoid many potential pitfalls and detours. In LLM projects, the cost of taking detours due to human resource issues is often surprisingly high, potentially leading to a waste of considerable time and resources.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h2 lexical-key=\"242\" id=\"_3-sft-rlhf-stages\" data-v-4b2aea56=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->3. SFT &amp; RLHF Stages<!--]--><!--]--><!--]--></span><!--]--><!--]--></h2><p lexical-key=\"244\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->In the training process of large language models (LLMs), Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two closely connected key stages. Although these two stages differ in technical implementation and specific objectives, they share significant similarities in terms of data requirements and cost composition. In this chapter, we combine these two stages for discussion, primarily because their core costs are concentrated on data annotation and requirement definition, a characteristic that results in many commonalities in data preparation and cost estimation.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"246\" id=\"_31-characteristics-of-sft-datasets\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->3.1. Characteristics of SFT Datasets<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"248\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->SFT datasets consist of a series of text pairs, including \"instruction input\" and \"answer output\". 
\"Instruction input\" represents requests made by humans to the model, covering various types such as classification, summarization, rewriting, etc. \"Answer output\" is the response generated by the model based on the instruction, meeting human expectations. There are four methods to construct instruction fine-tuning datasets: manual creation; model generation, such as using the Self-Instruct method; collecting and improving existing open-source datasets; and combining the above three methods.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png\" alt=\"\n    Different ways to build SFT datasets\n  \" title=\"\n    Different ways to build SFT datasets\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    Different ways to build SFT datasets\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"252\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->There are generally two approaches to constructing artificially generated datasets. The first approach involves directly creating instruction text sets according to given requirements and rules by company employees, volunteers, annotation platform staff, and others. Whether designing instruction sets, writing annotation guidelines, or conducting actual data annotation and quality control, it requires a significant investment of human time and effort. For example, the creation of the Databricks-dolly-15k dataset involved thousands of Databricks employees who generated over 15,000 records across multiple instruction categories. The second approach involves scraping human-generated real question-and-answer data from web pages and standardizing it into an instruction format. 
Examples include datasets like InstructionWild v2, LCCC, and Zhihu-KOL, which are built by aggregating and organizing content from social chats, code-related Q&amp;As, and other sources.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"254\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->In ABAKA AI's past practice, the first approach has been the more common way to construct datasets. Liu, Yang, et al. [2] likewise argue that datasets constructed this way are cleaner and of higher quality because professional annotators process and review them. After human processing, these datasets are also more interpretable and more consistent with human understanding. Researchers keep flexible control over the training samples and can adjust them for different tasks, which makes the datasets more versatile.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/7.png\" alt=\"\n    ABAKA AI possesses high-quality finished datasets across multiple domains\n  \" title=\"\n    ABAKA AI possesses high-quality finished datasets across multiple domains\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    ABAKA AI possesses high-quality finished datasets across multiple domains\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><h3 lexical-key=\"258\" id=\"_32-characteristics-of-rlhf-datasets\"><!--[--><!--[--><",
  "markdown": "## LLM Data Cost Breakdown: All You Need to Know About Data Costs for Training an LLM\n\n## 1\\. Introduction\n\nWith the rapid development of Large Language Models (LLMs), more and more enterprises are considering applying LLMs to their actual businesses. However, in the process of implementing LLMs, data cost is often an important factor that cannot be ignored. As decision-makers, understanding the data requirements and related costs of LLM training at various stages is crucial for the successful implementation of the project.\n\nABAKA AI will take you on an in-depth exploration of the three key stages of LLM training: Pre-training, Supervised Fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), analyzing the data requirement characteristics of each stage and their impact on costs. We will provide a detailed interpretation of the composition of LLM data costs from multiple dimensions such as data volume, data quality, and data diversity, as well as how to optimize data investment while ensuring model performance.\n\nWhether you are a corporate executive just starting to explore LLM applications, or a technical leader who has already made achievements in the AI field, we will provide you with a comprehensive and practical LLM data cost assessment framework to help you navigate AI implementation decisions with ease, using our past experience to help you organize a framework for calculating costs.\n\n## 2\\. Pre-training Stage\n\n### 2.1. Dataset Scale Estimation\n\nEstimating the required pre-training dataset size given a computational budget C is the first step in implementing an LLM project. This process involves different Scaling Laws, the most famous of which are OpenAI's Scaling Law and DeepMind's Chinchilla Law.\n\nOpenAI's research published in 2020 proposed the initial Scaling Laws, indicating a power-law relationship between model performance and model parameter count, dataset size, and computational resources. 
However, the Chinchilla Law proposed by DeepMind in 2022 revised this, arguing that for compute-optimal training the model parameter count and the training data volume should be scaled up in roughly equal proportion (about 20 training tokens per parameter), which implies far more data per parameter than earlier practice.\n\n****OpenAI Scaling Law****:\n\n$L ( N, D )=\\\\left\\[ \\\\left( \\\\frac{N\\_{c}} {N} \\\\right)^{\\\\frac{\\\\alpha\\_{N}} {\\\\alpha\\_{D}}}+\\\\frac{D\\_{c}} {D} \\\\right\\]^{\\\\alpha\\_{D}}$\n\n****DeepMind Scaling Law****:\n\n$\\\\hat{L} ( N, D ) \\\\triangleq E+\\\\frac{A} {N^{\\\\alpha}}+\\\\frac{B} {D^{\\\\beta}}$\n\nThe formulas relate the model's loss ($L$ or $\\\\hat{L}$, where lower is better) to the model parameter count ($N$) and the dataset size ($D$).\n\nThese two formulas represent different understandings and modeling methods of LLM scaling behavior. In practical applications, we often need to balance model size against data volume. For example, to reduce inference costs, we can consider using smaller models with more data. Research by Hoffmann et al. \\[1\\] shows that under a fixed computational budget, a well-trained small model may perform better than an undertrained large model. Specifically, if we originally planned to train an 8B parameter model but want to reduce inference costs, we can consider replacing it with a model with fewer parameters (such as 7B) while increasing the amount of training data. This approach may maintain or even improve model performance while also reducing deployment and operational costs.\n\nThis first step of data budgeting is crucial because it determines both the size of the model and the size of the pre-training dataset needed. ABAKA AI can build high-quality datasets for you and also holds a stock of existing data that can be matched to your needs.\n\n![\nData scraping capabilities of ABAKA AI\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/1.png \"\nData scraping capabilities of ABAKA AI\n\")\n\nData scraping capabilities of ABAKA AI\n\n### 2.2. 
Multi-domain Data Ratio\n\nThe pre-training corpus can include various types of text data, such as web pages, academic materials, books, and relevant texts from different fields, such as legal documents, annual financial reports, medical textbooks, and other domain-specific data. In the pre-training stage, LLMs learn broad knowledge from massive unlabeled text data and store it in model parameters, thus acquiring a certain level of language understanding and generation capabilities.\n\nA general pre-training corpus is a large-scale dataset composed of a large amount of text from different domains and sources. Research by Liu, Yang et al. \\[2\\] divides general data into eight major categories: web pages, language text, books, academic materials, code, parallel corpora, social media, and encyclopedias. In the pre-training process of the model, the diversity and quality of data are crucial, so careful design of the ratio of these different categories of data is needed when constructing the pre-training dataset.\n\n1.  ****Web data****: Web data is one of the most widely used sources of pre-training data. The data usually exists in Hypertext Markup Language (HTML) format, showing certain structural features, and is rich in topics, covering content from different fields and disciplines. However, web data may also contain noise and low-quality content, so careful screening and cleaning are required.\n2.  ****Language text****: Language text data mainly consists of two parts. The first part is electronic text data built based on a wide range of sources of written and oral language, usually presented in the form of large corpora of specific languages; the second part is electronic text data built based on relevant written materials in various fields or topics. For example, FinGLM covers annual reports of some listed companies from 2019 to 2021. This type of data belongs to language text materials in the financial field.\n3.  
****Books****: Book data is also one of the common data types in pre-training corpora. Compared with web pages, books have longer text content and higher data quality, both of which help improve the performance of large language models. Book data provides knowledge with both depth and breadth, allowing models to improve understanding ability and knowledge reserve while learning deeper contextual information.\n4.  ****Academic materials****: Academic material data refers to text data related to academic fields, including but not limited to academic papers, journal articles, conference papers, research reports, patents, etc. These data are written and published by experts and scholars in academia, with high professionalism and academic rigor. Including them in pre-training corpora can provide more accurate and professional information, helping models understand terminology and knowledge within academic fields. Academic literature, papers, and textbooks provide examples of professional and technical language use, as well as the latest scientific discoveries. This type of data is particularly important for improving model performance in professional fields.\n5.  ****Code****: The code data category refers to text information containing programming languages, such as Python, Java, C++, and other code snippets. Its purpose is to help models better understand programming languages and code structures. Code datasets can not only enhance programming capabilities but may also improve logical reasoning abilities. This type of data enables LLMs to understand and generate code in various programming languages, providing support for software development and code analysis tasks.\n6.  ****Parallel corpora****: Parallel corpus data refers to a collection of text or sentence pairs in different languages. These text pairs are translations of each other, where one text is in the source language (e.g., English) and the corresponding text is in the target language (e.g., Chinese). 
The introduction of parallel corpus data is crucial for improving the machine translation capabilities and cross-lingual task performance of large language models.\n7.  ****Social media****: Social media data refers to text content collected from various media platforms, mainly including user-generated posts, comments, and conversations between users, reflecting informal, colloquial language use. It contains a large amount of slang, new words, and diverse expressions. Although social media data may contain harmful information such as bias, discrimination, and violence, it is still crucial for the pre-training of large language models. This is because social media data is beneficial for models to learn expressive abilities in conversational communication and capture social trends, user behavior patterns, etc.\n8.  ****Encyclopedia****: Encyclopedia data refers to text information extracted from encyclopedias, online encyclopedia websites, or other knowledge databases. Data from online encyclopedia websites is written and edited by experts, volunteers, or community contributors, with a certain degree of authority and reliability. Due to its easy accessibility, it is included in pre-training corpora at a higher frequency, becoming a cornerstone for enhancing the knowledge base of large language models.\n\nReasonably configuring this pre-training data can significantly improve the performance and applicability of LLMs. The quality and diversity of data are often more important than the sheer volume of data. 
Based on the need for high-quality, multi-domain data ratios, ABAKA AI carefully considers the characteristics and value of each type of data when designing pre-training datasets, adjusting the ratio according to your specific needs to help you achieve high-quality and precise pre-training dataset ratios, reducing model training costs.\n\n![\nThe distribution of data types in the corpora used for pre-training by different models\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/image.png \"\nThe distribution of data types in the corpora used for pre-training by different models\n\")\n\nThe distribution of data types in the corpora used for pre-training by different models\n\nThis image shows the distribution of data types in the corpora used by different models during pre-training. Each pie chart represents a model and indicates the proportions of various data types. Different data types are distinguished by different colors, including web pages, code, encyclopedias, books, academic materials, social media, language text, and diverse data.\n\n### 2.3. Training Data Acquisition\n\nAlthough open-source datasets provide a foundation for model training, many truly valuable and unique data often do not appear in public channels. Therefore, targeted crawling of data from specific domains or sources has become a key strategy for improving model performance and competitiveness. The acquisition of this part of data is very necessary. In terms of high-quality training data acquisition, ABAKA AI can provide you with deeper insights, higher timeliness, and more unique data in targeted acquisitions, helping you improve model performance and accuracy in vertical domains and enhance the model's understanding of the latest information and trends.\n\nChannels for targeted data acquisition usually include data crawling, commercial database subscriptions, data cooperation and exchange, etc. 
Except for web crawlers, other channels are too customized, so this section only discusses data crawling. Data crawling does not place high demands on infrastructure, so the calculations below focus on development and operating costs rather than infrastructure.\n\nBefore development starts, the more important task is choosing suitable data sources. Crawling from suitable data sources can significantly improve the model's performance in specific domains. After determining the data source, the development and crawling costs mainly come from the following aspects:\n\n1.  ****Development cost****：\n\n$Budget\\_{dev} = (S\\_{dev} × D\\_{initial}) + (S\\_{dev} × D\\_{update})$\n\nWhere $S\\_{dev}$ is the developer's daily salary, and $D\\_{initial}$ and $D\\_{update}$ are the time (in days) for initial development and for updating the crawling code after website updates, respectively. The complexity of the website, verification mechanisms, request complexity, etc. will all affect the development time.\n\n2.  ****Maintenance cost****：\n\n$Budget\\_{ops} = S\\_{ops} × D\\_{crawl} × α$\n\nHere $S\\_{ops}$ is the daily salary of the maintenance staff and $D\\_{crawl}$ is the crawling duration in days. Maintenance may not be a full-time job, so a coefficient $α (0 < α ≤ 1)$ can be introduced to represent the actual proportion of maintenance time needed. If the data needs continuous updating or the crawling period is very long, then maintenance personnel intervention is needed to keep the crawler running normally and respond to website changes. If the crawler system uses a distributed strategy, more maintenance support may be needed.\n\n3.  ****IP proxy pool****：\n\n$Budget\\_{ip} = (\\\\frac{N\\_{req}}{N\\_{req\\_per\\_ip}}) × C\\_{ip}$\n\nWhere $N\\_{req}$ is the total number of requests, $N\\_{req\\_per\\_ip}$ is the number of requests each IP can handle, and $C\\_{ip}$ is the unit price of each IP. Factors such as the website's IP restriction policy, total data volume, IP quality, IP geographic location requirements, proxy type, etc. will affect the price.\n\n4.  
****Crawling material cost****：\n\n$Budget\\_{mat} = C\\_{mem} × N\\_{mem} × (D\\_{crawl} / D\\_{mem\\_validity})$\n\nWhere $C\\_{mem}$ is the price of one membership, $N\\_{mem}$ is the number of memberships required, $D\\_{crawl}$ is the crawling duration (in days), and $D\\_{mem\\_validity}$ is the validity period of a membership (in days). Factors such as membership level, concurrency strategy, etc. will affect the final budget. This cost only needs to be considered if the target website requires registration or membership to download.\n\nSo overall:\n\n$Budget\\_{total} = S\\_{dev} × (D\\_{initial} + D\\_{update}) + S\\_{ops} × D\\_{crawl} × α + (N\\_{req} / N\\_{req\\_per\\_ip}) × C\\_{ip} + (C\\_{mem} × N\\_{mem} × D\\_{crawl}) / D\\_{mem\\_validity}$\n\nGenerally speaking, a vertical domain website costs between 1,500 and 15,000 USD depending on the difficulty, with large social networking sites costing more. ABAKA AI can provide you with deeper insights, higher timeliness, more unique, and higher quality data, and reduce the total acquisition cost by 70%, helping you train excellent large language models in various dimensions.\n\nA large amount of high-quality LLM pre-training data exists in the form of PDFs or scanned images. Because layouts and formats are diverse and scan quality varies, building datasets from this data is challenging: the content must first be converted into a format such as markdown. The core problems fall into two areas: extracting content and layout information (including body text, titles, figure captions, images, tables, formulas) and handling the relationships between layout elements.\n\nWhen processing multiple open-source datasets, ABAKA AI observed several excellent open-source solutions, such as PP-StructureV2, Marker, Vary, and Nougat, but they each have shortcomings. 
PP-StructureV2 cannot identify LaTeX format content and lacks necessary post-processing steps; Marker covers fewer languages and doesn't handle figures well; Nougat has limited support for multi-column data and recognizes only a limited set of languages, while Vary / Vary-toy consumes more computational resources.\n\nGiven these limitations, ABAKA AI, as a member of the Multimodal Art Projection (M-A-P) team, fully participated in building the completely open-source large language model MAP-Neo, which also open-sourced the Document Convert Pipeline. This pipeline can better balance performance and computational overhead, while the decoupling between modules brings better interpretability and makes it easier to upgrade, add, and replace different modules, providing a more flexible, efficient, and CPU-friendly solution.\n\n![\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/2.png \"\n\")\n\nIn addition to using models for conversion, many vendors provide similar services, such as mathpix, Doc2x, Paodin PDFlux, pix2text, X Information, X Xun Cloud Large Model Knowledge Engine Document Parsing, etc. Therefore, we provide two ways to calculate costs below:\n\n1.  ****Self-built conversion service cost****：\n\n$Budget\\_{convert} = (\\\\frac{N\\_{pages}}{R\\_{process}}) × C\\_{node} × (1 + F\\_{complexity}) + C\\_{integration}$\n\nWhere $N\\_{pages}$ is the total number of pages, $R\\_{process}$ is the number of pages processed per node per day, $C\\_{node}$ is the price per node per day, $F\\_{complexity}$ is the document complexity factor ($0 ≤ F\\_{complexity} ≤ 1$). Generally speaking, the layout and fonts of magazines and newspapers will be more complex, while literature and patents will have richer images and tables. These factors need to be considered when setting budgets. $C\\_{integration}$ is the cost of deployment, updating strategies/models, and maintenance. This part of the cost will vary greatly depending on the task.\n\n2.  
****Third-party service cost****：\n\n$Budget\\_{convert} = \\\\sum\\_{i=1}^{n} C\\_{tier,i} \\\\times N\\_{pages,i} + C\\_{integration}$\n\nWhere $n$ is the number of price tiers, $C\\_{tier,i}$ is the price per page for the i-th tier, $N\\_{pages,i}$ is the number of pages in the i-th tier, $C\\_{integration}$ is the cost of API integration and maintenance.\n\nThe choice between these methods depends on multiple factors, including the number and type of documents, required conversion quality, availability of internal resources, and budget constraints. In practice, easy data is usually converted on one's own servers, while difficult data is sent to commercial-grade services.\n\n### 2.5. Training Data Cleaning\n\nAlthough the raw data obtained through web crawling, document conversion, and open-source datasets provides a foundation for model training, this data usually contains noise, errors, biases, and false information, which will reduce the training effectiveness of the model. Therefore, data cleaning becomes a key step in improving model performance and reliability. To obtain high-quality data, ABAKA AI can provide you with cleaner and more refined data cleaning, significantly improving data quality, thereby enhancing the model's performance on specific tasks, strengthening the model's ability to understand complex patterns, and reducing misleading learning due to data issues.\n\n![\nFineweb data cleaning pipeline\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/3.png \"\nFineweb data cleaning pipeline\n\")\n\nFineweb data cleaning pipeline\n\nBefore starting cleaning, it is even more important to formulate an appropriate cleaning strategy. This requires a thorough understanding of data characteristics, model requirements, and potential data quality issues. The formulation of cleaning strategies should consider factors such as data scale, complexity, domain characteristics, etc. 
In terms of cost estimation, take the Matrix dataset of the MAP-Neo large model, which ABAKA AI built together with Ge Zhang et al. \\[3\\], as an example. The Matrix dataset released 4.7T tokens of data and is arguably one of the highest-quality and largest-scale bilingual datasets. The general approach to data cleaning for the Matrix dataset follows the principle of \"from coarse to fine\" and \"from simple to complex\". We can divide the cleaning steps into the following main stages:\n\n1.  ****Heuristic filtering****：Heuristic rule filtering is the first line of defense, aimed at quickly identifying and deleting low-quality data. This step has low computational cost but can significantly reduce the amount of data for subsequent processing. Filtering criteria include: URL; blacklist word table; gibberish text filter; document length; proportion of special characters; proportion of short, continuous, or incomplete lines; repeated words, n-grams, or paragraphs. The filtering thresholds are based on statistical analysis of large document samples. Heuristic rules can effectively identify and remove low-quality data, preventing low-quality pre-training corpora from affecting model performance. Because the team used composite data from multiple sources, it designed dedicated cleaning methods for each source and tailored the rules of each method to keep data quality consistent.\n2.  ****Data deduplication****：Many studies have shown that repetitive text may lead to a decline in model performance, making deduplication a key step in corpus processing (although this point is somewhat controversial: heavily repeated data may indicate that the content is of high quality, which is itself an important feature. For example, Fineweb's view is that more deduplication does not necessarily mean better performance; if deduplication is performed across dumps, performance may actually be worse).  \n    a. 
****Exact duplication****：Exact document deduplication checks whether an entire text is completely identical to another text. If it is, the duplicate is deleted. Due to the large amount of data, clusters must be used for processing, and memory insufficiency problems may also occur. In practice, we store text data in batches in different storage buckets, then process the data in each storage bucket in turn to remove duplicates.  \n    b. ****Near-duplicate****：For near-duplicates, we use the MinHash LSH deduplication method to remove them as much as possible, which is particularly suitable for web data and is widely used in similarity search and duplicate detection in large datasets. It can handle very common scenarios where the text content is basically the same, but the scattered template blocks of web pages are different. The principle of MinHash is to represent a set with smaller hash values, and then these hash values can be used to estimate the Jaccard similarity between two sets. The computational cost of this step is still quite high.  \n    c. ****Similar Line****：To solve the problem of the same content appearing multiple times in the text, a direct method is to divide the text into multiple lines using specific delimiters, and then compare the similarity between each line. If they are similar, subsequent lines are deleted.  \n    d. In addition, paragraph deduplication and substring deduplication were also performed to achieve better results.\n3.  ****Quality screening****：After data cleaning, Fineweb-edu used the Llama3-70B-Instruct model to score the data and trained a BERT-like classification model. The classification model was then used to filter the data, greatly improving data quality. 
In addition to using models for data quality screening, many developers use fastText models for language identification when cleaning CC datasets.\n\n![\nOn the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/4.png \"\nOn the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n\")\n\nOn the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n\n![\nDeduplication did not show the expected performance improvement in this experiment\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png \"\nDeduplication did not show the expected performance improvement in this experiment\n\")\n\nDeduplication did not show the expected performance improvement in this experiment\n\nBased on the above steps, we can calculate the cost of data cleaning:\n\n1.  ****Engineer debugging and rule determination cost****：\n\n$Budget\\_{engineer} = S\\_{eng} \\\\times (T\\_{rules} + T\\_{debug})$\n\nWhere $S\\_{eng}$ is the developer's daily salary (USD/day), and $T\\_{rules}$ and $T\\_{debug}$ are the time required for formulating rules and for debugging and optimizing them (days).\n\n2.  ****Storage costs****：\n\n$Budget\\_{storage} = C\\_{storage} \\\\times V\\_{data} \\\\times T\\_{retention}$\n\nWhere $C\\_{storage}$ is the storage cost per TB per month, $V\\_{data}$ is the total data volume (TB), $T\\_{retention}$ is the data retention time (months).\n\n3.  
****Computation costs****：\n\n$Budget\\_{compute} = \\\\sum\\_{i=1}^{n} \\[C\\_i \\\\times \\\\frac{V\\_{data,i}}{R\\_i} \\\\times (1 + \\\\beta\\_i \\\\times (F\\_{comm} + F\\_{ops}))\\]$\n\nWhere $i$ represents the processing stage (1 to n), $C\\_i$ is the unit cost of computing resources for the i-th stage (USD/day), $V\\_{data,i}$ is the data volume for the i-th stage (TB), $R\\_i$ is the processing rate for the i-th stage (TB/day), $\\\\beta\\_i$ is a binary indicator showing whether the i-th stage uses cluster processing (0 for single-node processing, 1 for cluster processing), $F\\_{comm}$ and $F\\_{ops}$ are the communication and operational overheads of using clusters. Using clusters is troublesome and costly, so we use heuristic filtering as the first step.\n\n4.  ****Quality screening****：\n\n$Budget\\_{quality} = C\\_{train} \\\\times T\\_{training} + C\\_{data\\_annotation} + C\\_{inference} \\\\times \\\\frac{V\\_{data}}{R\\_{inference}}$\n\nWhere $C\\_{train}$ and $C\\_{inference}$ are the computational costs for training and inference, which usually differ significantly in price, $T\\_{training}$ is the training time (days), $C\\_{data\\_annotation}$ is the annotation cost, $\\\\frac{V\\_{data}}{R\\_{inference}}$ indicates the time needed to complete inference for all data.\n\n### 2.6. Data Cost Calculation\n\nHigh-quality data processing comes at a cost. From data acquisition to the final cleaning process, each step involves complex computations and human resource investments, all of which translate into actual costs. This chapter will combine ABAKA AI's previous content and rich experience to provide you with some feasible ideas, hoping to help you calculate data costs when implementing LLMs.\n\nBased on the data processing flow described earlier, we can roughly divide data costs into the following main categories: storage costs, data acquisition costs, data conversion costs, and data cleaning costs. 
We hope to help you establish an intuitive budget system through ABAKA AI's past rich experience:\n\n1.  ****Storage costs****：In this field, data scale is far larger than general projects, with pre-training datasets reaching PB levels. Single machines cannot meet such large-scale data storage needs, and projects also have high bandwidth requirements. Therefore, distributed storage is generally used. Distributed storage facilitates horizontal expansion, can meet growing storage needs, and has data backup and fault tolerance mechanisms, ensuring high data reliability. Multi-node parallel read and write can also improve I/O performance. Generally, the capacity price of distributed storage is about 85 USD/TB (NVMe + HDD), meaning 1PB of available storage space costs about 85,000 USD. Adding security redundancy, network equipment, and security equipment, the cost will approach 99,000 USD/PB.\n2.  ****Data acquisition****：All historical data from a well-known large website can be estimated at around 42,500-70,500 USD, with incremental updates costing about 14,000 USD annually. For vertical domain websites, it could be anywhere from 4,200 to 14,000 USD. Video websites are three to five times more expensive than ordinary websites (bandwidth, storage), and overseas websites are two to three times more expensive (overseas proxies, overseas servers, compliance). Assuming you need to crawl 8 mainstream social media and news websites + 15 vertical domain websites (such as code, mathematics, finance), a budget of 706,000 USD would be appropriate.\n3.  ****Document information extraction****：Based on ABAKA AI's experience, using ABAKA AI's developed Pipeline for document conversion is more cost-effective and flexible. If using consumer-grade GPUs for conversion, the cost per page is about 0.000035 USD, far lower than mathpix's 0.025 / 0.01 USD per page. 
Of course, we now see many good domestic vendors working in this area, and we look forward to better models and cheaper prices from domestic service providers. All in all, including the time for Gap and debugging, estimate about 14,000 USD for every 10,000,000 pages of documents (80% using your own model + 20% using third-party services).\n4.  ****Data cleaning****：The cost of this step mainly depends on how many data sources there are and their domains. When processing very dirty data, ABAKA AI used over 1,000 cores for about a month, adding many special rules to obtain higher quality data, with a data retention rate of less than 1%. Therefore, this part of the cost can be estimated as follows:\n\n$S\\_{\\\\text{eng}} + \\\\frac{V\\_{\\\\text{data}}}{100\\\\text{T}} \\\\times C\\_{\\\\text{base}}$\n\nThat is, the data cleaning cost for each domain consists of two weeks' salary for an algorithm engineer + 2,800 USD for cleaning every 100T, assuming the data cleaning cost increases linearly with data volume once the cluster is set up. Model-based quality screening, as in Fineweb-edu with Llama3-70B and BERT-like models, is also quite affordable and only slightly increases the cost per 100T.\n\nIn summary, preparing pre-training data for LLMs is a complex and costly process. It involves multiple stages, including data acquisition, storage, document information extraction, and data cleaning, each requiring careful planning and substantial investment. The quality and diversity of data are crucial to the model's final performance, so each stage should be optimized as much as possible within budget constraints. At the same time, we find that the value of experienced algorithm engineers cannot be overlooked. Their experience and expertise can help teams avoid many potential pitfalls and detours. In LLM projects, the cost of taking detours due to human resource issues is often surprisingly high, potentially leading to a waste of considerable time and resources.\n\n## 3\\. 
SFT & RLHF Stages\n\nIn the training process of large language models (LLMs), Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two closely connected key stages. Although these two stages differ in technical implementation and specific objectives, they share significant similarities in terms of data requirements and cost composition. In this chapter, we discuss the two stages together, primarily because their core costs are concentrated on data annotation and requirement definition, which gives them much in common in data preparation and cost estimation.\n\n### 3.1. Characteristics of SFT Datasets\n\nSFT datasets consist of a series of text pairs, including \"instruction input\" and \"answer output\". \"Instruction input\" represents requests made by humans to the model, covering various types such as classification, summarization, rewriting, etc. \"Answer output\" is the response generated by the model based on the instruction, meeting human expectations. There are four methods to construct instruction fine-tuning datasets: manual creation; model generation, such as using the Self-Instruct method; collecting and improving existing open-source datasets; and combining the above three methods.\n\n![\nDifferent ways to build SFT datasets\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png \"\nDifferent ways to build SFT datasets\n\")\n\nDifferent ways to build SFT datasets\n\nThere are generally two approaches to constructing human-generated datasets. The first approach involves directly creating instruction text sets according to given requirements and rules by company employees, volunteers, annotation platform staff, and others. Whether designing instruction sets, writing annotation guidelines, or conducting actual data annotation and quality control, it requires a significant investment of human time and effort. 
For example, the creation of the Databricks-dolly-15k dataset involved thousands of Databricks employees who generated over 15,000 records across multiple instruction categories. The second approach involves scraping human-generated real question-and-answer data from web pages and standardizing it into an instruction format. Examples include datasets like InstructionWild\\_v2, LCCC, and Zhihu-KOL, which construct datasets by aggregating and organizing content from social chats, code-related Q&As, and other sources.\n\nIn ABAKA AI's past practices, the first approach has been more commonly used to construct datasets. Meanwhile, Liu, Yang, et al. \\[2\\] believe that datasets constructed in this manner are of higher quality and cleaner due to processing and review by professional annotators. After human processing, these datasets become more interpretable and more consistent with human understanding. Researchers have flexible control over the training samples and can adjust them according to different tasks, making them more versatile.\n\n![\nABAKA AI possesses high-quality finished datasets across multiple domains\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/7.png \"\nABAKA AI possesses high-quality finished datasets across multiple domains\n\")\n\nABAKA AI possesses high-quality finished datasets across multiple domains\n\n### 3.2. Characteristics of RLHF Datasets\n\nRLHF datasets are collections of instructions that provide preference evaluations for multiple responses to the same input prompt. Typically, they consist of instruction pairs with different responses, including feedback from humans or other models. This setup reflects the relative preferences of humans or models for different responses in a given task or context. 
The feedback information in preference datasets is usually expressed through voting, ranking, scoring, or other forms of comparison.\n\nPreference datasets are primarily used in the alignment phase of large models, aiming to help align model outputs more closely with human preferences and expectations. Alignment with human preferences is mainly reflected in three aspects: helpfulness (the ability to follow instructions), honesty (avoiding fabrication of information), and safety (avoiding the generation of illegal or harmful information).\n\n![\nDifferent ways to build the RLHF dataset\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/8.png \"\nDifferent ways to build the RLHF dataset\n\")\n\nDifferent ways to build the RLHF dataset\n\nRLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback) both utilize reinforcement learning methods to optimize models using feedback signals. In addition to fine-tuning with instruction datasets, preference datasets can be used to train reward models. Subsequently, the Proximal Policy Optimization (PPO) algorithm can be applied for further fine-tuning based on feedback from the reward model.\n\n### 3.3. Data Cost Calculation\n\nIn the SFT and RLHF stages, data costs primarily come from the following aspects:\n\n1.  ****Rule Design Cost****：\n\n$Budget\\_{analysis} = T\\_{total} \\\\times (R\\_{expert} \\\\times S\\_{expert} + R\\_{engineer} \\\\times S\\_{engineer} + R\\_{user} \\\\times S\\_{user})$\n\nIn this, $R\\_{x} (0 < R\\_{x} ≤ 1)$ represents the participation ratio. Algorithm engineers ($S\\_{engineer}$) understand the model's capability boundaries, domain experts ($S\\_{expert}$) provide professional knowledge and insights, and users ($S\\_{user}$) offer frontline usage scenarios and requirement feedback. This step is both necessary and important. 
Carefully designed rules can significantly improve data quality, directly affecting model performance, and good rule design can increase annotation efficiency and reduce rework rates. Although the detailed rule design process may increase initial costs, its value far exceeds these expenses. It not only improves data and model quality but also brings long-term benefits to the entire project and organization.\n\n2.  ****Instruction Dataset Construction Cost****：\n\n$Budget\\_{instruction} = \\\\frac{N\\_{instructions}}{R\\_{creation\\_speed}} \\\\times S\\_{annotator} + \\\\frac{N\\_{instructions} \\\\times R\\_{review}}{R\\_{review\\_speed}} \\\\times S\\_{reviewer}$\n\nWhere: $N\\_{instructions}$ is the total number of instructions, $R\\_{creation\\_speed}$ is the number of instructions an annotator can produce per hour, $S\\_{annotator}$ is the average hourly wage of annotators, $R\\_{review}$ is the review sampling rate, $S\\_{reviewer}$ and $R\\_{review\\_speed}$ are the average hourly wage of reviewers and the number of instructions a reviewer can review per hour, respectively.\n\n3.  ****Cost of building RLHF dataset****：\n\n$Budget\\_{RLHF} = T\\_{generation} \\\\times C\\_{GPU\\_cluster} + \\\\frac{N\\_{instructions} \\\\times \\\\alpha}{R\\_{ranking\\_speed}} \\\\times S\\_{annotator} + Budget\\_{review}$\n\nThe first part is the inference cost for generating responses, and the second part is the cost of manual annotation. The choice of annotation method and strategy greatly affects $\\\\alpha$, the number of annotation judgments needed per instruction. For example, if each instruction has $N\\_{responses}$ responses that need to be compared pairwise, then the cost of manual annotation is:\n\n$\\\\frac{N\\_{instructions} \\\\times C(N\\_{responses}, 2)}{R\\_{ranking\\_speed}} \\\\times S\\_{annotator}$\n\nIf it's a rating system, then $R\\_{ranking\\_speed}$ would significantly increase, so choosing an appropriate evaluation method is a key factor in constructing RLHF datasets. 
It not only affects data quality but also directly determines the cost structure. The choice and orientation of review strategies will likewise significantly impact costs. Considering the complexity of these factors and their interactions, it is indeed challenging to provide a universal cost formula, which is why we have not presented a specific formula.\n\nIn practice, it's often necessary to validate and optimize evaluation and audit strategies through small-scale pilot tests before expanding to the full dataset. This iterative approach not only helps optimize costs but also continuously improves data quality and annotation guidelines throughout the process.\n\nBased on ABAKA AI's past experience, assuming we collect 1,000 IMO-level math problems, considering the demand is already well-established, the main costs will be concentrated on annotation and auditing. The cost for annotators is 20 USD per hour, with an estimated rate of one problem per hour. Including other expenses, the budget is estimated at 28,000 USD. However, if we adopt ABAKA AI's RLHF data construction method, utilizing modern proof tools like LEAN, the processing efficiency would be much higher than response construction, handling approximately 4-6 pairs per hour.\n\nAt this point, we have established a comprehensive evaluation system that allows us to assess data prices according to requirements.\n\nFor example, if the boss wants the model to possess knowledge in a specific domain, or even become state-of-the-art in that field, we can choose CPT to add knowledge. Based on D-CPT Law\\[4\\] and REGMIX\\[5\\], we can calculate that approximately 100B of domain data might be needed. We can crawl 12 target websites to cover 70B of data, and the remaining 30B can be filtered from public datasets. After CPT, we can add a few thousand SFT data points. 
The data portion might cost around 42,000 USD, including: approximately 28,000 USD for data crawling from 12 websites + about 2,800 USD for downloading and filtering several dozen TB of data using the deepseek math method + constructing SFT data at about 6 USD per entry, totaling 12,000 USD for 2,000 entries.\n\n| Cost Item | Cost / USD |\n| --- | --- |\n| Pre-training from scratch | 140,000 - 7 million |\n| CPT | 70,000 - 1 million |\n| SFT | 5,600 - 140,000 per domain |\n| RLHF data | 1,400 - 56,000 per domain |\n\nThe above estimates are based on current market data and ABAKA's years of industry experience, providing the most common budget range framework to help you more intuitively estimate overall data cost expenses.\n\nABAKA AI can reduce costs by 40%-60% at various stages based on the above framework. In the process of building high-quality training datasets, ABAKA provides professional solutions based on rich data processing experience. The intelligent data engineering platform Abaka Forge Platform and highly specialized, standardized data processing services offered by ABAKA AI empower the construction of training data, helping you train LLMs using high-quality datasets and enabling you to better understand the resources and investment required for your project.\n\n## 4\\. Reference\n\n1.  Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. \"Training Compute-Optimal Large Language Models.\" arXiv, March 29, 2022. [http://arxiv.org/abs/2203.15556](http://arxiv.org/abs/2203.15556 \"Training Compute-Optimal Large Language Models\").\n2.  Liu, Yang, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. \"Datasets for Large Language Models: A Comprehensive Survey.\" arXiv, February 27, 2024. [http://arxiv.org/abs/2402.18041](http://arxiv.org/abs/2402.18041 \"Datasets for Large Language Models: A Comprehensive Survey.\").\n3.  
Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, et al. \"MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.\" arXiv, June 2, 2024. [http://arxiv.org/abs/2405.19327](http://arxiv.org/abs/2405.19327 \"MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.\").\n4.  Que, Haoran, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, et al. \"D-CPT Law: Domain-Specific Continual Pre-Training Scaling Law for Large Language Models.\" arXiv, June 3, 2024. [http://arxiv.org/abs/2406.01375](http://arxiv.org/abs/2406.01375 \"D-CPT Law: Domain-Specific Continual Pre-Training Scaling Law for Large Language Models.\").\n5.  Liu, Qian, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. \"RegMix: Data Mixture as Regression for Language Model Pre-Training.\" arXiv, July 1, 2024. [http://arxiv.org/abs/2407.01492](http://arxiv.org/abs/2407.01492 \"RegMix: Data Mixture as Regression for Language Model Pre-Training.\")."
}
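
The crawling budget in the fetched article is a sum of four terms (development, maintenance, IP proxy pool, membership material). A minimal Python sketch of that sum follows. The function name, argument names, and all sample numbers are assumptions made for illustration; only the structure of the formula comes from the article.

```python
def crawl_budget(s_dev, d_initial, d_update, s_ops, d_crawl, alpha,
                 n_req, n_req_per_ip, c_ip, c_mem, n_mem, d_mem_validity):
    """Evaluate the article's Budget_total formula term by term.

    Salaries are per day, durations are in days, and alpha is the
    fraction of maintenance time actually needed (0 < alpha <= 1).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    dev = s_dev * (d_initial + d_update)              # development cost
    ops = s_ops * d_crawl * alpha                     # maintenance cost
    ip = (n_req / n_req_per_ip) * c_ip                # IP proxy pool
    mat = c_mem * n_mem * (d_crawl / d_mem_validity)  # memberships
    return {"dev": dev, "ops": ops, "ip": ip, "mat": mat,
            "total": dev + ops + ip + mat}


# Hypothetical inputs, chosen only to exercise the formula.
example = crawl_budget(
    s_dev=400, d_initial=10, d_update=5,
    s_ops=300, d_crawl=30, alpha=0.5,
    n_req=1_000_000, n_req_per_ip=1_000, c_ip=0.5,
    c_mem=20, n_mem=10, d_mem_validity=30,
)
print(example)
```

Returning the individual terms alongside the total makes it easy to see which component dominates for a given site.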

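The article's pairwise-comparison term for RLHF annotation, N_instructions × C(N_responses, 2) / R_ranking_speed × S_annotator, can be sketched the same way. The function name and the ranking speed of 30 comparisons per hour are assumptions; the 20 USD hourly rate is the figure the article quotes for annotators.

```python
from math import comb


def pairwise_annotation_cost(n_instructions, n_responses,
                             ranking_speed, s_annotator):
    """Manual cost when every pair of responses per instruction is compared.

    ranking_speed is comparisons per annotator-hour; s_annotator is USD/hour.
    """
    comparisons = n_instructions * comb(n_responses, 2)
    hours = comparisons / ranking_speed
    return hours * s_annotator


# 1,000 instructions with 4 responses each give 6 pairs per instruction.
cost = pairwise_annotation_cost(1_000, 4, ranking_speed=30, s_annotator=20)
print(cost)
```

Because C(N, 2) grows quadratically with the number of responses, doubling the responses per prompt roughly quadruples the annotation hours, which is why the article stresses choosing the evaluation method carefully.
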
Web search query: "LLM pre-training FLOPs calculation formula \"C ≈ 6ND\" cost per GPU hour traini...", results: 8

{
  "results": [
    {
      "title": "The FLOPs Calculus of Language Model Training - Medium",
      "url": "https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4",
      "description": "Without further ado, meet the Transformer FLOPs Equation: C ≈ 6ND. A slightly more sophisticated version of the equation expresses the"
    },
    {
      "title": "How Long Does It Take to Train the LLM From Scratch? | Towards Data Science",
      "url": "https://towardsdatascience.com/how-long-does-it-take-to-train-the-llm-from-scratch-a1adb194c624",
      "description": "The goal is to quantify the computational requirements for processing data and updating model parameters during training in terms of **FLOPs** (floating point operations). To sum up, training FLOPs for the transformer model of size N and dataset of P tokens can be estimated as:. However, the theoretical maximum FLOPS for a GPU is often less relevant in practice when training Large Language Models. For example, the authors of the LLaMA 3 paper reported an MFU of 38%, or 380 teraflops of throughput per GPU, when training with 16,000 GPUs. Reported TFLOPs throughput per GPU training Llama3 models as reported in the paper for different configurations. The total FLOPs required to train a model of this size can be calculated as follows:. Once you know the FLOPS per GPU in the training setup, you can calculate the total GPU hours required to train a model of a given size and dataset."
    },
    {
      "title": "How Much Does LLM Training Cost? - Galileo AI",
      "url": "https://galileo.ai/blog/llm-model-training-cost",
      "description": "# How Much Does LLM Training Cost? Understanding the costs of training LLMs is essential for those looking to create and train models or simply to understand the dynamics of AI as an industry. In this article, we'll explore the cost of training LLM models, the factors that contribute to LLM training costs, and strategies to optimize them effectively. * Training frontier LLMs costs $78M-$192M+ for compute alone. * Key cost drivers have shifted—human data annotation now exceeds compute costs by up to 28x. * Fine-tuning pre-trained models costs 60-90% less than training from scratch. ## **How Much Does It Cost to Train an LLM?**. To appreciate the scale of investment involved in training LLMs, consider the verified costs of notable models. According to the Stanford AI Index Report 2025, frontier model training costs have escalated dramatically—with GPT-4's training estimated at $78-100+ million, and Gemini Ultra 1.0 reaching $192 million, representing a 287,000x increase from the cost of a Transformer model in 2017 ($670)."
    },
    {
      "title": "Cost of Training Large Language Models: A Simple Calculation | Ashish Kulkarni posted on the topic | LinkedIn",
      "url": "https://www.linkedin.com/posts/ashishakulkarni_llm-pretraining-foundationmodels-activity-7373222564543807488-haqO",
      "description": "Software optimizations are the key: NVIDIA Dynamo -> provides disaggregated serving splitting prefill and decode phases of inference onto separate GPUs at scale NVFP4 -> quantizes the model weights to 4-bit floating point executed on the GPU Tensor Cores CUDA Graphs -> reduce CPU overhead That's how we supercharge NVIDIA Data Center into a true AI Factory! Disaggregated serving splits these phases: - Prefill runs on high-compute GPUs - Decode runs on memory-optimized hardware - Each phase gets what it needs The results are striking: - 6.4x throughput improvements - 20x reduction in latency variance - 15-40% infrastructure cost reduction Microsoft's implementation showed 1.4x higher throughput at 20% lower cost."
    },
    {
      "title": "Optimize for inference too, not just training FLOPs | MatX",
      "url": "https://matx.com/research/lifetime_llm_cost",
      "description": "# Optimize for inference too, not just training FLOPs. Large Language Models (LLMs) have been shaped by the scaling laws, as established by Kaplan et al., 2020 and Hoffmann et al., 2022. They guide us to design models that optimize for training cost but often overlook inference costs. During training, KV computation is usually cheaper than the rest of the model, but during inference, loading the KV cache becomes the dominant expense. Since training is compute-bound, the training FLOPs budget effectively determines the optimal model size and the number of training tokens needed to maximize model performance. While the loading of model parameters can be amortized by using larger batch sizes, the KV cache grows both with the **batch size** and the **sequence length**, making memory bandwidth a potential bottleneck (Pope et al., 2022). * Cost to load the full KV cache: 69GB×HOI≈41 TFLOPs. In this case, the cost of loading the KV cache in FLOP-equivalents is 20 times greater than the compute cost of performing the forward pass."
    },
    {
      "title": "What is the cost of training large language models? - CUDO Compute",
      "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
      "description": "### CUDO arrow Resources arrow What is the cost of training large language models? # What is the cost of training large language models? Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to range from about $500,000 up to $4.6 million, depending on the hardware and optimization techniques used. Training OpenAI’s GPT-4 reportedly cost more than $100 million, with some estimates ranging up to $78 million in compute cost, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on data that encourages logical reasoning."
    },
    {
      "title": "LLM training",
      "url": "https://www.glennklockwood.com/garden/LLM-training",
      "description": "There are three ways in which **training** a model can be divided across GPU nodes:. * partition the training batch and give each GPU node its own subset of the training dataset (a minibatch). * scales very well since multiple copies of the model are training in parallel, but may increase the time to train a model (convergence time) since training data may be less randomized as a result of partitioning. * requires moderate rewriting the training code to include communication within each epoch. For example, training a large language model across multiple DGX nodes likely involves tensor parallelism within the DGX node (since it has NVLink which makes the communication fast), pipeline parallelism across 16 DGX nodes, and data parallelism to accelerate training by scaling to a thousand DGX nodes. The ZeRO-DP paper (2020) states that a trillion-parameter model using a stateful optimizer (like Adam) requires 16 TiB of GPU memory at 16-bit precision."
    },
    {
      "title": "Understanding the Performance and Estimating the Cost of LLM ...",
      "url": "https://arxiv.org/html/2408.04693v1",
      "description": "In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU."
    }
  ]
}

▼thought

Now I have comprehensive data to answer the user's question about costs per stage and the relationship between data scale and cost. Let me organize this. Key data points:

Pre-training costs:
- GPT-3 (175B): $500K-$4.6M
- Llama 2 7B: ~$760K (Sebastian Raschka calculation: 184,320 A100 GPU hours × $33/hr ÷ 8 GPUs)
- GPT-4: $78M-$100M+
- Gemini Ultra: $191M
- The FLOPs formula: C ≈ 6ND (compute ≈ 6 × parameters × tokens)
- Chinchilla optimal: tokens ≈ 20 × parameters

Data costs:
- Storage: ~$85,000/PB
- Data crawling: $42,500-$70,500 per large website
- Document extraction: ~$14,000 per 10M pages
- Data cleaning: engineer salary + ~$2,800 per 100TB

SFT costs:
- Full fine-tuning 70B: $1,000-$5,000
- LoRA: much cheaper
- Data annotation: ~$6/entry for SFT data

RLHF costs:
- Human annotation: $1-5/comparison
- Scale AI, Surge AI: billion dollar industry
- RLAIF: <$0.01/comparison
- Human data costs can exceed compute by up to 28x

Post-training (SFT+RLHF) costs compared to pre-training:
- Fine-tuning is 60-90% cheaper than training from scratch

The relationship between data and cost:
- Training compute ≈ 6ND (linear in both params and tokens)
- So doubling data roughly doubles compute cost
- But with Chinchilla, optimal tokens grow linearly with parameters, so cost is quadratic in model scale
- Data preparation costs: storage is linear, crawling is more complex, cleaning costs scale sub-linearly after setup

Let me compose a comprehensive answer.

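Cost Breakdown of Each LLM Training Stage: Compute, Data, and People

Taking the training stages from the previous answer one at a time, the sources and magnitudes of cost differ enormously. The core trend: pre-training is still the largest compute expense, but human data annotation in post-training is growing faster than compute. According to Galileo's summary, which draws on the Stanford AI Index Report 2025 for its training cost figures, human annotation cost can exceed compute cost by up to 28x in some stages Galileo: LLM Training Cost.

I. Pre-training: the largest single compute expense

1.1 The core formula

Pre-training compute follows a simple approximation The FLOPs Calculus of Language Model Training:

C ≈ 6 × N × D

where N is the number of model parameters and D is the number of training tokens. C is the total number of floating-point operations for the forward and backward passes of a transformer: roughly 2ND for the forward pass and 4ND for the backward pass.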
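As a quick check of the formula above, here is a minimal sketch that plugs in the GPT-3 and Llama 2 7B parameter and token counts quoted in this answer. The function name `training_flops` is illustrative, not from any library, and the result is only the 6ND approximation (it ignores attention FLOPs and any recomputation).

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs using C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


# Parameter and token counts as quoted in this answer.
gpt3_flops = training_flops(175e9, 300e9)
llama2_7b_flops = training_flops(7e9, 2e12)

print(f"GPT-3:      {gpt3_flops:.2e} FLOPs")       # 3.15e+23
print(f"Llama 2 7B: {llama2_7b_flops:.2e} FLOPs")  # 8.40e+22
```

Note that the larger model is not automatically the more expensive one per parameter: Llama 2 7B is trained on far more tokens per parameter than GPT-3, which is the "overtraining" pattern discussed later.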

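1.2 Estimated costs of representative models

Model	Parameters	Training tokens	Estimated compute cost	Notes
Transformer (2017)	n/a	n/a	~$900	Baseline
GPT-3	175B	300B	$500K–$4.6M	Depends on hardware and optimization
Llama 2 7B	7B	2T	~$760K	184,320 A100 GPU-hours Sebastian Raschka
GPT-4	~1.76T (MoE, unconfirmed)	Undisclosed	$78M–$100M+	2.1×10²⁵ FLOPs
Gemini Ultra	Undisclosed	n/a	~$191M	5.0×10²⁵ FLOPs CUDO Compute

Note: these figures cover only the compute for the final training run. Once exploratory experiments, failed hyperparameter attempts, and ablations are included, total R&D compute is typically 2-3x that of the final run CUDO Compute.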

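1.3 Data scale vs. compute cost: what the Chinchilla result implies

DeepMind's Chinchilla paper (2022) found that, for a given compute budget C, the optimal model size N and training token count D should roughly satisfy D ≈ 20 × N Chinchilla Scaling Laws.

What does this mean?

To train a model 2x larger, you also need 2x more training data to "feed" it
Since C ≈ 6ND, total compute becomes 4x (2 × 2)
In other words, under the compute-optimal assumption, doubling model size roughly quadruples compute cost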
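The three steps above can be written out as a short derivation. This is only a restatement of the two approximations already quoted (C ≈ 6ND and D ≈ 20N); the constant 20 is the rule-of-thumb ratio, not an exact value.

```latex
\begin{align}
C &\approx 6ND, \qquad D \approx 20N \\
\Rightarrow\quad C(N) &\approx 6N \cdot 20N = 120N^{2} \\
\frac{C(2N)}{C(N)} &\approx \frac{120\,(2N)^{2}}{120\,N^{2}} = 4
\end{align}
```

So under the compute-optimal assumption, cost grows with the square of model size, which is where the "2x model, 4x compute" rule comes from.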

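This is also why model sizes have grown more slowly since 2022: Chinchilla showed that scaling the model without scaling the data wastes money. The current trend is in fact "overtraining": training a relatively small model on more data than Chinchilla recommends, because inference accounts for a larger share of cost over the model's lifetime Databricks: Inference in Scaling Laws.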

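1.4 The hidden cost of data preparation

Beyond compute, acquiring and cleaning pre-training data is a significant expense in its own right. The figures below come from ABAKA AI's hands-on experience ABAKA: LLM Data Cost:

Cost item	Estimated magnitude
Distributed storage	~$85,000/PB (about $99,000/PB including redundancy and networking equipment)
Crawling the full historical data of one large website	$42,500–$70,500
Document extraction (PDF → Markdown)	~$14,000 per 10 million pages
Data cleaning (per 100 TB)	Two weeks of engineer salary + ~$2,800 in compute
Quality filtering (LLM scoring + classifier filtering)	Additional inference compute + annotation cost

For a crawling setup covering "8 mainstream websites + 15 vertical-domain websites", the data acquisition budget is about $706,000. Preparing a high-quality bilingual pre-training dataset of several trillion tokens from scratch (such as MAP-Neo's 4.7T-token Matrix dataset) may cost several million dollars on the data side alone, excluding model training.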

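II. SFT (supervised fine-tuning): relatively cheap, but data quality sets the ceiling

2.1 Compute cost

SFT needs far less compute than pre-training: typically a few hours to a few days rather than weeks or months. The figures quoted in the previous answer:

Method	Typical cost for a 70B model	Hardware
Full Fine-Tuning	$1,000–$5,000 per run	4-8 H100s
LoRA	$50–$500 per run	Single A100 / H100
QLoRA	$20–$100 per run	Single 24GB consumer GPU

Note: the hardware column repeats the previous answer's figures and should be read as optimistic. The 16-bit weights of a 70B model alone take about 140GB, and 4-bit weights about 35GB, so the single-GPU entries assume quantized or offloaded base weights; the QLoRA paper itself reports fine-tuning a 65B model on a single 48GB GPU.

Total SFT compute is on the order of 1%-5% of pre-training compute. Fine-tuning an existing base model typically costs 60-90% less than training from scratch Galileo: LLM Training Cost.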

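2.2 Data annotation cost

The bottleneck for SFT is not compute but obtaining high-quality instruction-response pairs. Annotating high-quality SFT data costs about $6 per example at domain-expert level, and can be tiered into easy / medium / hard ABAKA: LLM Data Cost.

Taking a 2,000-example, high-quality, domain-specific SFT dataset as an example:

2,000 examples × $6 per example = $12,000 (annotation only, excluding guideline design and review)
Adding domain-expert guideline design and multiple review rounds, the total SFT data cost for one domain falls between $5,600 and $140,000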

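III. Preference alignment (RLHF / DPO): human annotation is the real cost driver

3.1 The full RLHF cost chain

Traditional RLHF stacks three layers of cost:

Layer 1: collecting human preference data

Each pairwise comparison: $1–$5
A typical RLHF training run needs hundreds of thousands of comparisons
Top AI companies (OpenAI, Google, Meta, Anthropic) each spend an estimated $1 billion per year on human-annotated data O-Mega: Data Labeling Industry Guide
The data labeling market is projected to grow from $1.2 billion in 2024 to $10.2 billion in 2034 Lemon.io: RLHF Market

Layer 2: reward model training

Requires additional GPU compute to train a separate reward model
Typically on the order of $10K–$100K, depending on scale

Layer 3: PPO reinforcement learning

Requires running the policy model, the reward model, and the reference model at the same time
Training is unstable and often takes several attempts
Compute cost can reach 5-10x that of SFT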

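3.2 Cost comparison across alignment methods

Method	Annotation cost per comparison	Compute cost	Overall cost
RLHF (PPO)	$1–$5	Highest (reward model training + multiple RL rounds)	Tens to hundreds of millions of dollars at frontier scale
DPO	$1–$5	~50% of RLHF	Same annotation cost, roughly half the compute
RLAIF / Constitutional AI	<$0.01	Medium	Annotation cost reduced 100-500x
RLVR / GRPO	$0 (programmatic verification)	Medium	Removes human preference annotation entirely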

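IV. Reasoning RL (GRPO / RLVR): a change in cost structure

The core advantage of GRPO and RLVR is not only performance but a fundamental change in cost structure.

In traditional RLHF, human annotation dominates the cost (up to 28x the compute cost). GRPO, when trained with verifiable reward functions (is the math answer correct, does the code pass its tests), replaces human preference data entirely. This means:

Zero human annotation cost
Reward signals that are immediate, consistent, and scalable without a labeling bottleneck
Training compute roughly comparable to the RL phase of RLHF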
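To make "verifiable reward" concrete, here is a minimal sketch of the two pieces involved: a programmatic reward that checks an answer against a reference, and the group-relative advantage that GRPO computes from several sampled responses to the same prompt. The function names, the exact-match check, and the use of the population standard deviation are assumptions for illustration; real pipelines use more robust answer parsing and a full policy-gradient update on top of these advantages.

```python
from statistics import mean, pstdev


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Hypothetical checker: 1.0 for an exact match after trimming, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    No learned value model or human label is involved: the baseline is just
    the mean reward of the group sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "42".
samples = ["42", "41", " 42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # correct answers get positive advantage, wrong ones negative
```

The marginal cost of each reward here is one function call, which is the source of the cost difference against a $1–$5 human comparison.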

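This is part of why DeepSeek could report competitive reasoning performance at a training budget widely described as far below that of OpenAI o1 (o1's actual budget has not been disclosed). The RLVR paradigm is shifting the post-training race from "who can hire more annotators" back to "who has better verifiable task design and RL engineering".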

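V. The full relationship between data scale and cost

Putting all the stages together, the relationship between data scale and cost is multi-layered and non-linear:

5.1 Pre-training: linear growth, amplified by scaling laws

Compute cost ≈ 6 × N × D × (GPU price per second / effective GPU FLOPs per second)

N and D each drive cost linearly, but the Chinchilla-optimal D = 20N means total cost is proportional to N²
10x model size ≈ 100x compute cost (under the compute-optimal assumption)
In practice most labs choose overtraining (D >> 20N), so the data-side multiplier is larger than 20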
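The cost formula above can be turned into a rough dollar estimate. The sketch below is a back-of-the-envelope calculation, not a pricing tool: the 312 TFLOP/s figure is the A100's peak dense BF16 throughput, the 40% utilization (MFU) is an assumed value, and the $33/hour price for an 8-GPU instance is the one used in the Llama 2 7B estimate quoted earlier. The function name is illustrative.

```python
def training_cost_usd(
    n_params: float,
    n_tokens: float,
    peak_flops_per_gpu: float,
    mfu: float,
    usd_per_gpu_hour: float,
) -> tuple[float, float]:
    """Return (gpu_hours, usd) for a run costing about 6 * N * D FLOPs."""
    total_flops = 6.0 * n_params * n_tokens
    effective_flops_per_sec = peak_flops_per_gpu * mfu  # what one GPU actually sustains
    gpu_hours = total_flops / effective_flops_per_sec / 3600.0
    return gpu_hours, gpu_hours * usd_per_gpu_hour


# Llama 2 7B: 7B parameters, 2T tokens, A100 peak, assumed 40% MFU, $33/hr per 8 GPUs.
gpu_hours, cost = training_cost_usd(7e9, 2e12, 312e12, 0.40, 33.0 / 8)
print(f"{gpu_hours:,.0f} GPU-hours, about ${cost:,.0f}")

# Compute-optimal scaling (D = 20N): 10x the parameters costs about 100x.
_, small = training_cost_usd(1e9, 20e9, 312e12, 0.40, 33.0 / 8)
_, large = training_cost_usd(10e9, 200e9, 312e12, 0.40, 33.0 / 8)
print(f"cost ratio for 10x model size: {large / small:.0f}x")
```

With these assumptions the estimate lands close to the 184,320 GPU-hours and ~$760K quoted in the table, which suggests the 6ND rule and a 40% MFU are a reasonable first approximation for that run. Changing the assumed MFU or GPU price shifts the result proportionally.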

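5.2 Data preparation: fixed overhead plus linear increments

Total data preparation cost = fixed infrastructure cost + crawling/collection cost(D) + cleaning/deduplication cost(D)

Storage cost: essentially linear ($/TB × number of TB)
Crawling cost: each new data source has a fixed development cost ($1,500-$15,000 per website), after which the incremental cost is low
Cleaning cost: strong economies of scale. Once the cluster is set up, each additional 100 TB costs only ~$2,800 to clean
This diminishing marginal cost means the larger the dataset, the lower the average cost per token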
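The "fixed overhead plus linear increment" shape can be illustrated with a small calculation. In this sketch the storage rate ($85 per TB, from ~$85,000/PB) and the cleaning rate ($2,800 per 100 TB) come from the figures above, while the $500,000 fixed cost is a made-up placeholder for pipeline and crawler development; engineer salaries and extraction costs are deliberately left out.

```python
def data_prep_cost_usd(
    terabytes: float,
    fixed_usd: float,
    storage_usd_per_tb: float = 85.0,          # ~$85,000 per PB
    cleaning_usd_per_100tb: float = 2800.0,    # compute only, excludes salaries
) -> float:
    """Total cost = fixed setup cost + costs that scale linearly with data volume."""
    variable = terabytes * storage_usd_per_tb + (terabytes / 100.0) * cleaning_usd_per_100tb
    return fixed_usd + variable


# Hypothetical fixed cost of $500,000 for pipeline and crawler development.
for tb in (100, 1_000, 10_000):
    total = data_prep_cost_usd(tb, fixed_usd=500_000)
    print(f"{tb:>6} TB: total ${total:>12,.0f}, average ${total / tb:,.0f} per TB")
```

The total keeps rising, but the average cost per TB falls as the fixed cost is spread over more data, which is the economy of scale described in the list above.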

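5.3 Post-training: quality over quantity, non-linear cost

SFT/RLHF cost = fixed guideline design + annotation cost (number of examples × difficulty factor) + review cost + compute cost

SFT datasets typically contain thousands to hundreds of thousands of examples, far smaller than pre-training corpora (billions to trillions of tokens)
But each example carries very high value density: 1,000 carefully selected SFT examples can outperform 100,000 low-quality ones
RLHF pairwise comparison cost grows quadratically with the number of responses per prompt: K responses require C(K,2) = K(K−1)/2 comparisons if every pair is labeled
This makes RLHF annotation cost highly sensitive to the "responses generated per prompt" parameter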
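The sensitivity to responses per prompt is easy to see numerically. This minimal sketch assumes every pair of responses is labeled as a separate comparison and uses the low end of the $1–$5 price range; the prompt count of 10,000 and the function name are illustrative.

```python
from math import comb


def pairwise_labeling_cost(
    n_prompts: int, responses_per_prompt: int, usd_per_comparison: float
) -> tuple[int, float]:
    """Return (comparisons, usd) when all C(K, 2) pairs per prompt are labeled."""
    comparisons = n_prompts * comb(responses_per_prompt, 2)
    return comparisons, comparisons * usd_per_comparison


for k in (2, 4, 8):
    comparisons, usd = pairwise_labeling_cost(10_000, k, usd_per_comparison=1.0)
    print(f"K={k}: {comparisons:>7,} comparisons, ${usd:,.0f}")
# K=2:  10,000 comparisons, $10,000
# K=4:  60,000 comparisons, $60,000
# K=8: 280,000 comparisons, $280,000
```

Going from 2 to 8 responses per prompt multiplies the labeling bill by 28, not 4. Asking annotators to rank all K responses in one pass is a common way to get the same pairs for less labor, at the cost of harder individual tasks.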

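VI. Summary: how the cost structure is shifting

The analysis above points in one clear direction:

Stage	Then (2020-2022)	Now (2024-2026)	Main cost driver
Pre-training	Compute absolutely dominant	Compute still largest, but data preparation spending has surged	GPU clusters + data engineering teams
SFT	Small-scale annotation	Rising cost of domain-expert annotation	Scarcity of human experts
Preference alignment	RLHF + expensive human annotation	DPO cuts compute, RLAIF cuts annotation, GRPO removes annotation	Shift from people to programmatic verification
Reasoning enhancement	Did not exist	RLVR / GRPO becoming standard	Ability to design verifiable tasks

The most important trend: in post-training, a shift is under way from "paying people to annotate" to "paying to design training environments that can be verified automatically". This is both a cost optimization and a way past the capability ceiling, because programmatic verification covers domains with objective standards such as math, code, and formal reasoning, which are exactly where model capabilities are currently being contested.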

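Points still to be verified

Actual human annotation spending at the top labs is a trade secret. The $1 billion per year figure comes from estimates by investors and industry analysts (citing Foundation Capital's analysis), not from official disclosures.
01.ai's claim of training a GPT-4-class model for $3 million still lacks independent verification and is disputed within the industry; this answer does not cite it as established fact.
How well GRPO and other RLVR methods work in non-verifiable domains (such as creative writing and open-ended dialogue) is still being researched, and there is no large-scale public comparison yet.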

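What does the training process of today's mainstream LLMs look like? How are model capabilities improved?

How mainstream LLMs are trained: the full path from raw model to product-grade AI

Pre-training → Capability Shaping → Alignment Shaping

I. Pre-training: building the "world knowledge foundation"

What it does

Key parameters (representative 2024-2025 models)

Model	Parameters	Training tokens	Notes
Llama 3.1	8B / 70B / 405B	15.6 trillion	Three-stage pre-training with final annealing
Qwen 2	0.5B–72B	7 trillion (larger models) / 12 trillion (0.5B)	Synthetic data augmentation, about 30 languages
Gemma 2	2B / 9B / 27B	13 trillion (27B)	Knowledge distillation for the smaller models
Apple AFM	3B (on-device) + server	6.3 trillion (server)	Distillation + pruning, three-stage pre-training

These details come from Sebastian Raschka's survey New LLM Pre-training and Post-training Paradigms.

Staging inside pre-training

Since 2024, pre-training is generally no longer a single uninterrupted run; it is split into 2-3 sub-stages:

Core pre-training: large-scale general corpus, base context window (for example 4K-8K tokens)
Continued pre-training: raise the share of code/math data and down-weight low-quality web crawl data
Context lengthening / annealing: use long-sequence and synthetic data to extend the context window to 32K-128K; or run a final "annealing" pass on a very small, high-quality data mix to improve benchmark results

Apple AFM and Llama 3.1 both use three-stage pre-training. Llama 3.1's context extension goes from 8K to 128K gradually, in six steps.

Data quality carries growing weight

II. Capability Shaping: from "can talk" to "can do things"

Pre-training produces a general next-token predictor. Capability shaping continues training (still with a cross-entropy objective, but on more carefully selected data) to strengthen specific capabilities Training LLMs in 2026:

Context Extension: extend the usable context window
Code / Math Corpora Training: strengthen formal reasoning with high-quality code and math corpora
Synthetic Reasoning Training: induce multi-step reasoning with model-generated or programmatically generated reasoning chains
Multilingual Balancing: reweight data across languages
Tool-use Corpora: train the model to understand when and how to call external tools (search, code execution, APIs)

This step forms a feedback loop with alignment: attempts to align the model often reveal new capability gaps, which are then filled with further training.

III. Alignment Shaping: from "capable model" to "product model"

This is the most central and most complex part of post-training. The current consensus is a three-stage pipeline The Complete Guide to Post-Training LLMs:

Stage 1: SFT (Supervised Fine-Tuning)

Train the model on high-quality instruction-response pairs so it learns the format, tone, and structure of following instructions. This is the key step that turns an "autocomplete engine" into a "conversational assistant".

Technical options:

Full Fine-Tuning (all parameters, FP16): highest quality; 4-8 H100s for a 70B model, at $1,000-5,000 per run
LoRA (Low-Rank Adaptation): freezes the base model and injects small trainable matrices, cutting trainable parameters by 90-99%; cited as runnable for 70B on a single A100
QLoRA: 4-bit quantization + LoRA; cited as able to fine-tune 70B on a single 24GB consumer GPU, although 4-bit 70B weights alone take about 35GB and the QLoRA paper itself reports 65B on a single 48GB GPU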
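The parameter reduction LoRA achieves follows directly from the shapes of its two low-rank factors. The sketch below counts trainable parameters for a single weight matrix; the 8192 × 8192 projection and rank 16 are hypothetical values chosen for illustration, and the function names are not from any library.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def lora_reduction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of trainable parameters saved versus updating the full matrix."""
    full = d_out * d_in
    return 1.0 - lora_trainable_params(d_out, d_in, rank) / full


# Hypothetical 8192 x 8192 projection matrix adapted with rank 16.
print(lora_trainable_params(8192, 8192, 16))    # 262144
print(f"{lora_reduction(8192, 8192, 16):.2%}")  # 99.61%
```

The frozen base weights still have to sit in GPU memory during training, which is why LoRA shrinks optimizer and gradient memory much more than it shrinks total memory, and why QLoRA additionally quantizes the base weights.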

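Stage 2: Preference Alignment

Teaches the model to choose the most helpful / harmless / honest answer among several viable ones.

Successors to DPO:

SimPO: drops even the reference model and uses the response's average log probability as an implicit reward; up to 6.4 points higher than DPO on AlpacaEval 2
KTO: needs only binary thumbs-up / thumbs-down signals rather than paired comparisons
ORPO: merges SFT and preference optimization into a single training objective, one stage instead of two

Stage 3: RL with Verifiable Rewards (reasoning RL): the biggest change since 2025

This is the newest and fastest-evolving stage of post-training, set off by DeepSeek-R1 (January 2025).

Result: DeepSeek-R1 achieved reasoning performance competitive with OpenAI o1 at a far lower reported training cost, and self-reflection and dynamic strategy adjustment emerged without being explicitly trained.

IV. How are model capabilities improved? The main drivers

1. Scaling Laws

2. Emergent Abilities

3. The amplifying effect of post-training

This is currently the largest source of capability gains. Post-training can bring a 20-40% improvement on standard benchmarks, far more than spending the same compute on further pre-training scale. This means:

The key to competitive advantage is no longer "who can pre-train the largest model" but "who can post-train the model best".

4. Synthetic data and self-play

Synthetic data generation is breaking through the data bottleneck. Representative methods:

SPIN (Self-Play Fine-Tuning): the model improves step by step by learning to distinguish its own outputs from human text, with no additional human annotation
SPICE: grounds self-play in external documents to prevent hallucination amplification and model collapse; +8.9% on math reasoning and +9.8% on general reasoning
NVIDIA NeMo Gym: trains agent models in interactive RL environments; Nemotron 3 Super went through 1.2 million rollouts across 21 environment configurations

5. Knowledge Distillation

V. How the three frontier labs differ in RL philosophy

Lab	Core method	Characteristics
OpenAI	RLHF + Process Reward Model + large-scale RL	The most aggressive RL scaling strategy; the o1/o3 series represents the state of the art in reasoning RL
Anthropic	Constitutional AI + RLAIF	Replaces raw human preferences with a constitution of about 80 pages, emphasizing consistency and controllability
Google DeepMind	RLHF + PPO + academic frontier research	The most traditional RLHF implementation, complemented by research on reward model robustness, multi-objective optimization, and related topics

Source: The Complete Guide to Post-Training LLMs

VI. Frontier trends under way

LLM Stats summarizes the three directions most likely to define post-training research over the next year Post-Training in 2026:

Unified pipelines: ORPO has already merged SFT and preference optimization; the next step is merging all three stages (SFT + preference alignment + reasoning RL) into a single training objective
Environment-native training: a move from static datasets to interactive RL environments (browsers, file systems, databases, APIs); the gap between "chat models" and "agent models" will widen
Automatic curriculum generation: the model identifies its own weaknesses → generates targeted training data → verifies it with RLVR → trains → repeats. SPICE already shows an early form of this loop

Additional notes / points still to be verified

The specific training recipes of the frontier labs (especially OpenAI and Anthropic), including data mixes, reward function details, and hyperparameters, are trade secrets and have not been fully published. The analysis above is based on published technical reports and papers.
The academic debate over whether emergent abilities are a "mirage" continues and remains unresolved. This answer presents the core arguments of both sides for cross-reference.
There is no systematic, large-scale empirical study yet of which task types hit a ceiling under synthetic-data self-play, or of whether it causes model capabilities to stagnate through self-referential training.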

What does each stage cost, and how does data scale relate to cost?

▼thought

▼✓Web search query: "LLM pre-training cost per token compute GPU 2024 2025 breakdown", results: 10

{
  "results": [
    {
      "title": "Inference Unit Economics: The True Cost Per Million Tokens - Introl",
      "url": "https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide",
      "description": "# Inference Unit Economics: The True Cost Per Million Tokens. Prices declined faster than PC compute during the microprocessor revolution or bandwidth during the dotcom boom—equivalent performance costs 10x less every year.¹ A capability that cost $20 per million tokens in late 2022 now costs $0.40.² Yet organizations still struggle to understand their true inference costs because token-level pricing obscures infrastructure realities, GPU utilization determines actual unit economics, and optimization techniques create order-of-magnitude variations in cost efficiency. **December 2025 Update:** LLM inference costs declined 10x annually—faster than PC compute or dotcom bandwidth. Self-hosted breakeven requires 50%+ GPU utilization for 7B models, 10%+ for 13B models. ## The inference pricing landscape in December 2025. Claude Opus 4.5 costs $5 per million input tokens and $25 per million output tokens—a significant reduction from Opus 4's $15/$75 pricing.⁷ GPT-4 and similar frontier models price at varied tiers, justified by capabilities that smaller models cannot replicate regardless of cost optimization."
    },
    {
      "title": "How Much Does LLM Training Cost? | Galileo",
      "url": "https://galileo.ai/blog/llm-model-training-cost",
      "description": "# How Much Does LLM Training Cost? Understanding the costs of training LLMs is essential for those looking to create and train models or simply to understand the dynamics of AI as an industry. In this article, we'll explore the cost of training LLM models, the factors that contribute to LLM training costs, and strategies to optimize them effectively. * Training frontier LLMs costs $78M-$192M+ for compute alone. * Key cost drivers have shifted—human data annotation now exceeds compute costs by up to 28x. * Fine-tuning pre-trained models costs 60-90% less than training from scratch. ## **How Much Does It Cost to Train an LLM?**. To appreciate the scale of investment involved in training LLMs, consider the verified costs of notable models. According to the Stanford AI Index Report 2025, frontier model training costs have escalated dramatically—with GPT-4's training estimated at $78-100+ million, and Gemini Ultra 1.0 reaching $192 million, representing a 287,000x increase from the cost of a Transformer model in 2017 ($670)."
    },
    {
      "title": "$ Cost of LLM continued pre-training | by Gili Nachum - Medium",
      "url": "https://medium.com/@gilinachum/cost-of-llm-continued-pre-training-0c1998cb44ec",
      "description": "Key Factors Impacting Cost. Dataset size — More data (measured in tokens) means longer training times and more compute resources needed."
    },
    {
      "title": "What is the cost of training large language models?",
      "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
      "description": "### CUDO arrow Resources arrow What is the cost of training large language models? # What is the cost of training large language models? Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to range from about $500,000 up to $4.6 million, depending on the hardware and optimization techniques used. Training OpenAI’s GPT-4 reportedly cost more than $100 million, with some estimates ranging up to $78 million in compute cost, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on data that encourages logical reasoning."
    },
    {
      "title": "AI & Machine Learning for Everyone | Do you know how much it costs to train large language models (LLM) | Facebook",
      "url": "https://www.facebook.com/groups/AIandMachineLearningforEveryone/posts/9843392082338627",
      "description": "# AI & Machine Learning for Everyone | Do you know how much it costs to train large language models (LLM) | Facebook. Training large language models costs millions. ## AI & Machine Learning for Everyone ·. Do you know how much it costs to train large language models (LLM)? Al training costs are exploding lately. The Stanford Al Index Report has just released some training numbers and they are CRAZY. Training LLMs from scratch costs millions and these numbers are expected to climb even higher with the development of new models. But why is this the case? A team of machine learning, data science, and linguistic experts is essential. Image 12: ➡️ Training and developing LLMs is incredibly expensive due to the vast computational resources required, with nodels like GPT-4 needing thousands of GPUs running for months (!). This extensive use of GPUs, combined with the need for continuous fine-tuning and experimentation, significantly drives up both the hardware and operational costs."
    },
    {
      "title": "Sebastian Raschka, PhD's Post - LinkedIn",
      "url": "https://www.linkedin.com/posts/sebastianraschka_sure-its-widely-known-that-pretraining-activity-7278071569384460288-y_NE",
      "description": "Sure, it's widely known that pretraining large language models (LLMs) is incredibly expensive, but how expensive, exactly?"
    },
    {
      "title": "How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog",
      "url": "https://developer.nvidia.com/blog/llm-inference-benchmarking-how-much-does-your-llm-inference-cost",
      "description": "This guide covers performance metrics (TTFT, latency-throughput trade-offs), infrastructure provisioning, and cost calculations per token to"
    },
    {
      "title": "The Real Price of AI: Pre-Training Vs. Inference Costs",
      "url": "https://www.ankursnewsletter.com/p/the-real-price-of-ai-pre-training",
      "description": "# The Real Price of AI: Pre-Training Vs. Inference Costs. For most companies using AI, **the ongoing cost of running models daily (inference) vastly outweighs the initial training cost**, potentially accounting for 80-90% of the total lifetime expense. Overall inference costs are rising significantly **due to wider business adoption of AI, the demand for real-time performance, the increasing complexity and size of models, and the growing volume of data processe**d. Proactively managing and **optimizing inference costs is becoming a critical roadblock** to achieving real value and return on investment from generative AI deployments. A lot of the discussion centers on the huge effort and significant amount of hardware needed for training the big large language models, or LLMs. But here’s something critical that often gets missed: for almost any company actually putting AI to work, the day-to-day running – the inference stage – tells a very different cost story."
    },
    {
      "title": "Your Guide To Inference Cost (And Make It A Margin Advantage)",
      "url": "https://www.cloudzero.com/blog/inference-cost",
      "description": "# Inference Cost Explained: How to Reduce LLM & AI Inference Spend. Learn how LLM and AI inference cost work, what drives them up at scale, and 8 proven strategies to reduce your AI inference spend without slowing down your team. Learn how LLM and AI inference cost work, what drives them up at scale, and 8 proven strategies to reduce your AI inference spend without slowing down your team. That ongoing, per-request expense is inference cost, and for organizations running AI in production, it is the number that matters most. Inference cost is the compute cost your organization pays every time an AI model generates a response — each prompt sent, each reply received. In production systems, tighter context management reduces inference cost significantly — prompt compression techniques such as summarization, keyphrase extraction, and semantic chunking can achieve 70–94% cost savings in production AI systems, according to research published in Medium citing Stanford and ACL findings."
    },
    {
      "title": "Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective",
      "url": "https://arxiv.org/html/2410.04466v3",
      "description": "# Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform."
    }
  ]
}

▼✓Web search query: "LLM SFT fine-tuning cost LoRA full parameter price estimate 2025", results: 10

{
  "results": [
    {
      "title": "A Comprehensive Guide to LLM Fine-Tuning: Methods ... - Wilson Wu",
      "url": "https://wilsonwu.me/en/blog/2025/llm-fine-tuning",
      "description": "Over the years, fine-tuning techniques have evolved from traditional full-parameter training to efficient, low-cost approaches such as LoRA,"
    },
    {
      "title": "How to fine-tune open LLMs in 2025 with Hugging Face",
      "url": "https://www.philschmid.de/fine-tune-llms-in-2025",
      "description": "# How to fine-tune open LLMs in 2025 with Hugging Face. However, for specialized applications requiring high accuracy or domain expertise, fine-tuning remains a powerful approach to achieve higher quality results than prompting alone, reduce costs by training smaller, more efficient models, and ensure reliability and consistency for specific use cases. Contrary to last years guide How to Fine-Tune LLMs in 2024 with Hugging Face this guide focuses more on optimization, distributed training and being more customizable. This means support for different PEFT methods from Full-Finetuning to QLoRA and Spectrum, optimizations for faster and more efficient training, with Flash Attention or Liger Kernels and how to scale training to multiple GPUs using DeepSpeed. If you are compeltely new to fine-tuning LLMs, I recommend you to start with the How to Fine-Tune LLMs in 2024 with Hugging Face guide and then come back to this guide. This can be a good use case for fine-tuning, as it requires a lot of domain-specific knowledge about math and how to solve math problems."
    },
    {
      "title": "How to Fine Tune your own LLM using LoRA (on a CUSTOM dataset!)",
      "url": "https://www.youtube.com/watch?v=D3pXSkGceY0",
      "description": "How to Fine Tune your own LLM using LoRA (on a CUSTOM dataset!)\nNicholas Renotte\n326000 subscribers\n1299 likes\n41222 views\n9 Jun 2025\nThat gameboy blender animation...took 6 hours to render 😅. \n\nAnyway, had a ton of fun coding this up and finally getting back to some proper ML. I've been thinking about just swapping out the pdf with some specific libraries to get an ultimate Blender bot or Manim bot (also learned it for the video...we're boosting animation quality this year!!). \n\nWould love to see what you end up fine tuning and deploying. If you need a hand just ping me I'm flying out tonight but will answer as much as i can. Love ya, thanks for watching! \n\nCode: https://github.com/nicknochnack/EndToEndLoRA\n\n👨‍💻 Sign up for the Full Stack course and use YOUTUBE50 to get 50% off:\nhttps://www.coursesfromnick.com/bundles/fullstackml\n\nOh, and don't forget to connect with me!\nLinkedIn: https://bit.ly/324Epgo\nFacebook: https://bit.ly/3mB1sZD\nGitHub: https://bit.ly/3mDJllD\nPatreon: https://bit.ly/2OCn3UW\nJoin the Discussion on Discord: https://bit.ly/3dQiZsV\n\nHappy coding!\nNick\n73 comments"
    },
    {
      "title": "A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios",
      "url": "https://arxiv.org/html/2511.00130v1",
      "description": "# A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios. The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting McCloskey and Cohen (1989). In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce scenarios. We find that LoRA provides the most effective balance, successfully instilling new skills with minimal impact on the base model’s general knowledge. An analysis of the trade-off between task accuracy and forgetting as a function of key hyperparameters such as learning rate and LoRA rank."
    },
    {
      "title": "NeurIPS Poster LoRA vs Full Fine-tuning: An Illusion of Equivalence",
      "url": "https://neurips.cc/virtual/2025/poster/115207",
      "description": "Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. But, \\emph{are their learned solutions really equivalent?} We study how LoRA and full-finetuning change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \\emph{intruder dimensions}, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find its forgetting is vastly localized to the intruder dimension -- by causally intervening on the intruder dimensions by changing their associated singular values post-fine-tuning, we show that they cause forgetting. This will be amplified during continual learning because of sequentially fine-tuning, and we show that LoRA models do accumulate intruder dimensions here tend to perform worse in this setting, emphasizing the practicality of our findings."
    },
    {
      "title": "Efficient LLM Fine-Tuning with LoRA | by Raquel Vaz, PhD | Medium",
      "url": "https://medium.com/@raquelhvaz/efficient-llm-fine-tuning-with-lora-e5edb88b64a1",
      "description": "# Efficient LLM Fine-Tuning with LoRA | by Raquel Vaz, PhD | Medium. # Efficient LLM Fine-Tuning with LoRA. It might make sense to use a large model like Gemini to **coordinate and orchestrate**, but the individual agents themselves can — and should — be **smaller, specialized models** fine-tuned for their specific tasks. This article shows how to do that using **LoRA (Low-Rank Adaptation)** — a lightweight technique for Parameter-Efficient Fine-Tuning (PEFT). LoRA allows you to adapt pre-trained language models using a minimal number of additional parameters. *   🧪 **Fine-tune large models** on low-resource infrastructure by training only a small fraction of parameters. *   🧩 **Deploy with zero inference overhead** — at runtime, LoRA adapters are merged into the original weight matrices, so model size and latency remain unchanged. Let’s illustrate it with a concrete example using LoRA on Qwen2.5–0.5B-Instruct, a 0.5 billion‑parameter, instruction‑tuned model in the Qwen2.5 family."
    },
    {
      "title": "Supervised Fine-Tuning (SFT) with LoRA on Fireworks AI - Fireworks AI Docs",
      "url": "https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning",
      "description": "# Supervised Fine-Tuning (SFT) with LoRA on Fireworks AI - Fireworks AI Docs. [Skip to main content](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#content-area). *   [Which model should I use?](https://docs.fireworks.ai/guides/recommended-models). *   [Fine Tuning Overview](https://docs.fireworks.ai/fine-tuning/finetuning-intro). *   [Deploying Fine Tuned Models](https://docs.fireworks.ai/fine-tuning/deploying-loras). *   [Fire Pass Setup](https://docs.fireworks.ai/firepass). *   [Understanding LoRA](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#understanding-lora). *   [List of Supported Models](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#list-of-supported-models). *   [Step-by-Step Guide to Fine-Tuning with Fireworks AI](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#step-by-step-guide-to-fine-tuning-with-fireworks-ai). Preparing the Dataset](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#1-preparing-the-dataset). Uploading the Dataset to Fireworks AI](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#2-uploading-the-dataset-to-fireworks-ai). Creating a Fine-Tuning Job](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#3-creating-a-fine-tuning-job). Monitoring and Managing Fine-Tuning Jobs](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#4-monitoring-and-managing-fine-tuning-jobs). Deploying the Fine-Tuned Model](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#5-deploying-the-fine-tuned-model). Best Practices and Considerations](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#6-best-practices-and-considerations). *   [Appendix](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#appendix). ## [](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#understanding-lora). ## [](https://docs.fireworks.ai/fine-tuning/supervised-fine-tuning#list-of-supported-models). 
Full reference here: [Create dataset](https://docs.fireworks.ai/api-reference/create-dataset). For guide on how to create deployment (dedicated or live-merge deployment), please follow the guide here: [creating deployment](https://docs.fireworks.ai/guides/ondemand-deployments). For guide on how to deploy a LoRA addon to an existing deployment, you can simply click the `Deploy this LoRA` button in the supervised fine tuning job details page or LoRA model details page, and follow the wizard."
    },
    {
      "title": "A comprehensive overview of everything I know about fine-tuning. : r/LocalLLaMA",
      "url": "https://www.reddit.com/r/LocalLLaMA/comments/1ilkamr/a_comprehensive_overview_of_everything_i_know",
      "description": "A comprehensive overview of everything I know about fine-tuning, posted by [The-Silvervein](https://www.reddit.com/user/The-Silvervein/). I’ve compiled a small collection of everything I know about fine-tuning LLMs or transformer models for specific use cases. *   You don't need a 100B model for every task you have. *   You must check the licensing to see if you use the model for commercial use cases. 2.   For domain-specific needs like medical, financial, legal, etc., you need to push the model to update its knowledge => Use RAG when applicable or fine-tune the entire model."
    },
    {
      "title": "In-depth guide to fine-tuning LLMs with LoRA and QLoRA - Mercity.ai",
      "url": "https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora",
      "description": "In this blog we provide detailed explanation of how QLoRA works and how you can use it in hugging face to finetune your models."
    },
    {
      "title": "WTF is Fine-Tuning? (intro4devs) | [2025]",
      "url": "https://huggingface.co/blog/tegridydev/fine-tuning-dev-intro-2025",
      "description": "WTF is Fine-Tuning? Outline: Full Fine-Tuning: Max Capacity (What It Is, Code Example, Use When, Pros, Cons); Parameter-Efficient Fine-Tuning (PEFT): Efficiency First (Context); Instruction Fine-Tuning: Teaching Models to Follow Commands (What It Is, Code Example, Use When, Pros, Cons); Retrieval-Augmented Fine-Tuning (RAFT): External Knowledge Injection (What It Is, Use When, Pros, Cons); Reinforcement Learning from Human Feedback (RLHF): Aligning AI with Human Preferences (What It Is, Use When, Pros, Cons); Wrapping Up: Matching the Right Gear to The Boss; TL;DR. What It Is: Full fine-tuning updates all parameters of a model using your dataset, the gold standard for maximizing model performance, ensuring every layer of the model adapts to your specific requirements."
    }
  ]
}

▼✓Web search query: "RLHF DPO GRPO training cost comparison compute human annotation 2024 2025", results: 10

{
  "results": [
    {
      "title": "Preference Tuning LLMs: PPO, DPO, GRPO — A Simple Guide",
      "url": "https://anukriti-ranjan.medium.com/preference-tuning-llms-ppo-dpo-grpo-a-simple-guide-135765c87090",
      "description": "Medium article by Anukriti Ranjan."
    },
    {
      "title": "A Comparison of Reinforcement Learning (RL) and RLHF",
      "url": "https://intuitionlabs.ai/articles/reinforcement-learning-vs-rlhf",
      "description": "RLTHF (Targeted Human Feedback): One of the most significant cost-saving advancements in 2025 is RLTHF, which addresses the high cost of human"
    },
    {
      "title": "Group Relative Policy Optimization (GRPO) - Deep (Learning) Focus",
      "url": "https://cameronrwolfe.substack.com/p/grpo",
      "description": "*Reinforcement Learning from Human Feedback (RLHF)* trains the LLM using RL with rewards derived from a reward model trained on human preferences. We can directly train an LLM on this preference data using a direct alignment algorithm like Direct Preference Optimization (DPO), but we usually incorporate this data into RL by first using it to train a reward model. **PPO & RLHF.** Once the reward model has been trained over the preference data using this loss, the model learns how to assign a preference score to each model completion; see above. During training with RL, we generate new outputs—*or on-policy samples*—from our LLM and score them with the reward model. In this way, the RL training process guides the LLM to produce outputs that align with human preferences, as estimated by the reward model. This model was created by the same authors of DeepSeekMath [1], so R1-Zero also uses GRPO for RL training."
    },
    {
      "title": "Post-Training Matters More Than Pretraining Now: SFT, RLHF, DPO, and GRPO. | by Han HELOIR YAN, Ph.D. ☕️ | Mar, 2026 | AI Advances",
      "url": "https://ai.gopubby.com/post-training-matters-more-than-pretraining-now-sft-rlhf-dpo-and-grpo-1f9beffd391a",
      "description": "By Han HELOIR YAN, Ph.D., published in AI Advances. If you work with LLMs, you’ve seen all of them. If you’re honest, you can explain maybe three with confidence. If you’re choosing between these techniques for your next project, the wrong choice doesn’t just waste compute. This article walks the full chain, so you pick the right one."
    },
    {
      "title": "The State of Reinforcement Learning for LLM Reasoning - Ahead of AI",
      "url": "https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training",
      "description": "The reinforcement learning (RL) training methods used to build and improve reasoning models are more or less related to the reinforcement learning with human feedback (RLHF) methodology that is used to develop and align conventional LLMs. So, I want to start with a small recap of how RLHF works before discussing reasoning-specific modification based on RL-based training. The connection between RLHF and reasoning comes from how the DeepSeek team applied a similar RL-based approach (with GRPO) to train the reasoning capabilities of their R1 and R1-Zero models. Interestingly, they find that this simple setup (vanilla PPO and a basic binary reward function based on answer correctness) is sufficient to train models that scale up in both reasoning performance and response length."
    },
    {
      "title": "Complete guide to RLHF for LLMs: How human feedback shapes modern AI",
      "url": "https://toloka.ai/blog/what-is-rlhf",
      "description": "Contents: Collect RLHF preference data; Why RLHF matters; What is Reinforcement Learning from Human Feedback (RLHF)?; RLHF meaning; Why human feedback is necessary; How does RLHF work?; Deep dive: The RLHF training pipeline; Stage 1: Collecting preference data; How to collect RLHF preference data; Stage 2: Training the reward model; Stage 3: Proximal Policy Optimization (PPO); Alternatives to traditional RLHF: DPO and related methods; Direct Preference Optimization (DPO); Reinforcement learning for reasoning: RLVR, GRPO - rise of thinking models; Reinforcement Learning from AI Feedback (RLAIF); Other emerging methods; Human feedback in RLHF; Collecting human feedback at scale; How teams collect comparison data; Modeling human preferences; What RLHF can’t fix; When RLHF works well in practice; Predictability over raw capability; Scope and iteration matter more than scale; Preference clarity enables consistency; Building a sustainable RLHF pipeline; Treat RLHF as infrastructure, not a phase; Quality control must be continuous; Design for exit, not endless optimization; Practical implementation considerations; When a language model needs RLHF; Operational costs and organizational trade-offs; Failure modes in production; Tools and frameworks; Conclusion: RLHF as operational alignment."
    },
    {
      "title": "LLM Alignment: RLHF to DPO & GRPO | MI - 超智諮詢",
      "url": "https://www.meta-intelligence.tech/en/insight-rlhf-alignment",
      "description": "The Complete Guide to LLM Alignment: From RLHF to DPO and GRPO — A Practical Deep Dive into Aligning Large Language Models with Human Values."
    },
    {
      "title": "It Takes Two: Your GRPO Is Secretly DPO - arXiv",
      "url": "https://arxiv.org/html/2510.00977v2",
      "description": "Group Relative Policy Optimization (GRPO) has emerged as a prominent reinforcement learning algorithm for post-training Large Language Models."
    },
    {
      "title": "The Evolution and Optimization of AI LLM Models Alignment.",
      "url": "https://www.youtube.com/watch?v=fEmewtw9v7k",
      "description": "... Human Feedback (RLHF). We'll explore how Direct Preference Optimization (DPO) turned a complex reinforcement learning problem into a sleek"
    },
    {
      "title": "The Complete Guide to Post-Training LLMs: SFT, RLHF, DPO & GRPO",
      "url": "https://www.sundeepteki.org/advice/the-complete-guide-to-post-training-llms-how-sft-rlhf-dpo-and-grpo-shape-llms",
      "description": "The Complete Guide to Post-Training LLMs: How SFT, RLHF, DPO, and GRPO Shape LLMs. 8/4/2026. Table of Contents: 1. Fine-Tuning: A Critical Distinction; 2.2 The Three-Stage Pipeline: SFT, Preference Alignment, and Reinforcement Learning; 2.3 Why Post-Training Now Accounts for the Majority of Usable Model Capability; 3. Supervised Fine-Tuning (SFT): Teaching Models to Follow Instructions; 3.1 Full Fine-Tuning, LoRA, and QLoRA - Choosing Your Approach; 3.2 Dataset Quality: The Accuracy-Diversity-Complexity Triad; 3.3 The Dataset Composition Blueprint; 4. Preference Alignment: Making Models Helpful, Harmless, and Honest; 4.1 RLHF - The Original Breakthrough; 4.2 DPO - Eliminating the Reward Model; 4.3 RLAIF and Constitutional AI - Anthropic's Scalable Alternative; 5. Reinforcement Learning: The Frontier of Reasoning Models; 5.1 GRPO - DeepSeek's Paradigm Shift; 5.2 DAPO and RLVR - Verifiable Rewards for Reasoning; 5.3 How OpenAI, Anthropic, and Google DeepMind Approach RL Differently; 6."
    }
  ]
}

▼thought

Let me get more specific cost data from some of these sources, and also search for the relationship between data scale and training cost.

▼✓Fetch page body, URL: "https://galileo.ai/blog/llm-model-training-cost"

{
  "url": "about:blank",
  "title": "How Much Does LLM Training Cost? | Galileo"
}

▼✓Web search query: "LLM training cost per token pre-training GPU hours data scale cost relationsh...", results: 10

{
  "results": [
    {
      "title": "How Much Does LLM Training Cost? - Galileo AI",
      "url": "https://galileo.ai/blog/llm-model-training-cost",
      "description": "Understanding the costs of training LLMs is essential for those looking to create and train models or simply to understand the dynamics of AI as an industry. In this article, we'll explore the cost of training LLM models, the factors that contribute to LLM training costs, and strategies to optimize them effectively. Key points: Training frontier LLMs costs $78M-$192M+ for compute alone. Key cost drivers have shifted: human data annotation now exceeds compute costs by up to 28x. Fine-tuning pre-trained models costs 60-90% less than training from scratch. How Much Does It Cost to Train an LLM? To appreciate the scale of investment involved in training LLMs, consider the verified costs of notable models. According to the Stanford AI Index Report 2025, frontier model training costs have escalated dramatically, with GPT-4's training estimated at $78-100+ million, and Gemini Ultra 1.0 reaching $192 million, representing a 287,000x increase from the cost of a Transformer model in 2017 ($670)."
    },
    {
      "title": "Pre-Training LLMs on a budget: A comparison of three optimizers",
      "url": "https://arxiv.org/html/2507.08472v1",
      "description": "For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours but AdamW led to the best downstream evaluation results. This study compares two of them, Lion (Chen et al., 2023) and Sophia (Liu et al., 2024), with AdamW, in the context of small models and considering a limited compute budget. Since a grid search even with only few hyperparameters and a model with approximately 3 billion parameters is not feasible given the usual limits in the compute budget, we employed the Maximal Update Parametrization (μP, Yang et al., 2021) to find optimal values using small proxy models (50 million parameters) and then transfer these values unchanged to the larger target model (2.7 billion parameters)."
    },
    {
      "title": "Understanding cost, options and Technical steps to build LLM from ...",
      "url": "https://medium.com/@anilAmbharii/understanding-cost-options-and-technical-steps-to-build-llm-from-scratch-d68cb9ba7bd9",
      "description": "By simple multiplication, the estimated training cost for a 10 billion parameter model is approximately $50,000, and for a 100 billion parameter"
    },
    {
      "title": "How Long Should You Train Your Language Model? | Databricks Blog",
      "url": "https://www.databricks.com/blog/how-long-should-you-train-your-language-model",
      "description": "Accounting for Inference in Language Model Scaling Laws. Our recent paper, presented at ICML 2024, proposes a modified scaling law to account for the cost of both training and inference. This blog post explains the reasoning behind our new scaling law, and then experimentally demonstrates how “overtrained” LLMs can be optimal. The “Chinchilla” Scaling Law is the most widely cited scaling law for LLMs. The Chinchilla paper asked the question: If you have a fixed training compute budget, how should you balance model size and training duration to produce the highest quality model? In our research, we were motivated by the idea of training smaller models on more data than the Chinchilla law suggested. By contrast, we assume a fixed pretraining loss (i.e. model quality) and find the model size and training duration that minimize the total compute over the model’s lifetime, including both training and inference (Fig. 2, Eq. 2)."
    },
    {
      "title": "What is the cost of training large language models? - CUDO Compute",
      "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
      "description": "Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to range from about $500,000 up to $4.6 million, depending on the hardware and optimization techniques used. Training OpenAI’s GPT-4 reportedly cost more than $100 million, with some estimates ranging up to $78 million in compute cost, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on data that encourages logical reasoning."
    },
    {
      "title": "Training large language models costs millions - Facebook",
      "url": "https://www.facebook.com/groups/AIandMachineLearningforEveryone/posts/9843392082338627",
      "description": "Training large language models costs millions. Do you know how much it costs to train large language models (LLMs)? AI training costs are exploding lately. The Stanford AI Index Report has just released some training numbers and they are CRAZY. Training LLMs from scratch costs millions and these numbers are expected to climb even higher with the development of new models. But why is this the case? A team of machine learning, data science, and linguistic experts is essential. Training and developing LLMs is incredibly expensive due to the vast computational resources required, with models like GPT-4 needing thousands of GPUs running for months (!). This extensive use of GPUs, combined with the need for continuous fine-tuning and experimentation, significantly drives up both the hardware and operational costs."
    },
    {
      "title": "You Need to Know About Data Costs for Training an LLM - Abaka AI",
      "url": "https://www.abaka.ai/blog/llm-data-cost",
      "description": "We will provide a detailed interpretation of the composition of LLM data costs from multiple dimensions such as data volume, data quality, and data diversity."
    },
    {
      "title": "Sebastian Raschka, PhD's Post - LinkedIn",
      "url": "https://www.linkedin.com/posts/sebastianraschka_sure-its-widely-known-that-pretraining-activity-7278071569384460288-y_NE",
      "description": "Sure, it's widely known that pretraining large language models (LLMs) is incredibly expensive, but how expensive, exactly?"
    },
    {
      "title": "Exploring the cost of training an AI model on cloud infrastructure",
      "url": "https://nebius.com/blog/posts/cost-of-training-an-ai-model-on-cloud",
      "description": "# Exploring the cost of training an AI model on cloud infrastructure. Training machine learning models can cost anywhere from tens of thousands to millions of dollars depending on model size, dataset volume and infrastructure. In this article we look at the main cost components of training in the cloud, what drives the final bill and how to optimize spending without compromising results. Even within one architecture, budgets can differ by several times depending on parameter count, dataset size, infrastructure setup and efficiency of use. cost = (training time ÷ utilization rate) × resource price × number of resources + overhead (storage, networking, orchestration). Even a smaller model such as BERT-Large can run into tens of thousands of dollars if trained from scratch in the cloud without optimizations. In the cloud this is even more visible: compute, storage, networking and service layers act as a single system and only a balanced setup keeps spending under control."
    },
    {
      "title": "Trends in Artificial Intelligence | Epoch AI",
      "url": "https://epoch.ai/trends",
      "description": "Compute stock growth: the total computing power of the stock of AI chips is growing at a rate of 3.4×/year (doubling time 6.8 months, 0.53 OOM/year; 90% CI 3.2× to 3.7×). Training compute: training compute for frontier language models has been growing at 5× per year since 2020 (doubling time 5.2 months, 0.7 OOM/year; 90% CI 4× to 6×). Software progress: pre-training compute efficiency is improving at roughly 3.0× per year (doubling time 7.6 months, 0.5 OOM/year; 90% CI 2.8× to 4.4×). Largest AI data center: the largest known AI data center has computing power equivalent to 700,000 NVIDIA H100 chips (90% CI 500k to 1M H100e). FLOP/s per dollar: AI chip performance per dollar has improved by 37% per year (1.37×/year, doubling time 2.2 years, 0.14 OOM/year)."
    }
  ]
}

▼✓Web search query: "human annotation cost RLHF data labeling per comparison 2024 2025 ScaleAI Surge", results: 8

{
  "results": [
    {
      "title": "RLHF Market Compared: Scale AI, Mercor, Turing | Lemon.io Blog",
      "url": "https://lemon.io/blog/rlhf-platforms-for-data-annotation",
      "description": "Home blog For Employers The Hiring Process RLHF Platforms: Inside the AI Data Annotation and Labeling Market. # RLHF Platforms: Inside the AI Data Annotation and Labeling Market. On the other side are the **platforms** (e.g., Surge AI, Micro1, Mercor, and Labelbox) that provide the infrastructure and services to make this possible, often using approaches like reinforcement learning from human feedback (RLHF). Drawing on Lemon.io’s experience in sourcing and onboarding data annotators for both Tier 1 and Tier 2 labs, this article shows how these platforms create value for the companies building modern AI systems. The data annotation and labeling market is projected to grow from **$1.2 billion in 2024** to **$10.2 billion by 2034,** highlighting the increasing importance of high-quality, context-driven datasets in AI systems. That’s why, similar to the data annotation and labeling market, the RLHF platform market is forecasted to grow from **$2.8 billion in 2025** to **$18.6 billion in 2034.** The table below proves how lucrative it has already become."
    },
    {
      "title": "Data labeling costs surge as compute costs rise for AI models | Daniel Kang posted on the topic | LinkedIn",
      "url": "https://www.linkedin.com/posts/daniel-kang-1223b343_human-data-is-probably-more-expensive-than-activity-7360724422187442177-Madz",
      "description": "However, as post-training techniques (SFT, RL-HF, etc.) that rely on human annotations continue to drive progress in frontier models, we believe that the majority of (marginal) cost is shifting to data. The revenue of major data labeling companies and the marginal compute cost of training frontier models for major AI providers in 2024. This finding highlights clear evidence: the cost of acquiring high-quality human-annotated data is rapidly outpacing the compute costs required for training state-of-the-art AI models.\" https://lnkd.in/gCJZ4t9n #AI #AITraining #GenerativeAI #LLMs #DataLabeling #ComputeCosts AI models are only as good as the data they are trained on."
    },
    {
      "title": "Data Labeling Industry Guide: How AI Startups Make Billions (2025)",
      "url": "https://o-mega.ai/articles/how-the-data-labeling-industry-works-full-insider-guide-2025",
      "description": "How AI data labeling startups like Scale, Surge & Mercor are building billion-dollar businesses by connecting expert humans with AI labs. In fact, leading AI companies like OpenAI, Google, Meta, Anthropic and others are each spending on the order of **$1 billion per year** on human-provided training data (foundationcapital.com) (foundationcapital.com). As one investor put it, *“the only way models are now learning is through net new human data”* (techcrunch.com) – meaning that continual human feedback, annotation, and instruction have become crucial for advancing AI capabilities. This demand gave rise to a **data labeling industry** that supplies AI labs with armies of human contractors (labelers or “AI tutors”) who **label, annotate, or generate data** to feed into AI model training (techcrunch.com). From labeling millions of images for computer vision, to crafting high-quality responses and comparisons for training large language models via reinforcement learning from human feedback (RLHF), these human workers provide the “ground truth” that teaches AI models how to behave."
    },
    {
      "title": "Scale AI, Surge AI, and the billion-dollar gig-work industry shaping ...",
      "url": "https://www.instagram.com/reel/DTlcSGDj2gC",
      "description": "Rlhf is better known as data labeling and just like Uber drivers and DoorDashers, data labelers are gig workers paid hourly wages without"
    },
    {
      "title": "Top 10 Human Data Labeling Providers in 2026 - Pin",
      "url": "https://www.pin.com/blog/human-data-labeling-providers",
      "description": "Top 10 human data labeling providers compared for 2026 - Scale AI, Surge AI & 8 more. The top human data labeling providers in 2026 are Scale AI, Surge AI, Labelbox, Snorkel AI, Appen, Sama, iMerit, Telus Digital, Toloka, and SuperAnnotate. For recruiters staffing in-house annotation teams, Pin’s AI sourcing scans 850M+ profiles to find domain experts faster than any traditional job board. The data collection and labeling market reached $4.89 billion in 2025 and is projected to hit $17.10 billion by 2030, growing at a 28.4% CAGR, according to Grand View Research (Nov 2024). > * **Scale AI and Surge AI lead at the top.** Both dominate frontier-model contracts; Surge is the go-to for RLHF on LLMs and Scale for multi-modal enterprise programs. > * **Sourcing annotation talent is the new bottleneck.** For teams building in-house annotation capacity, Pin is the best way to find domain experts in medicine, law, and software - scanning 850M+ profiles that gig platforms simply don’t reach."
    },
    {
      "title": "Scale AI Alternatives for Enterprise AI Teams - Gun.io",
      "url": "https://gun.io/news/2025/12/scale-ai-alternatives-for-enterprise-ai-teams",
      "description": "Meta’s $14.3 billion acquisition of a 49% stake in Scale AI has forced enterprise AI teams to reassess their data annotation partnerships. They haven’t solved access to the engineers, developers, and domain experts required for the work that actually moves models forward: RLHF ranking, code evaluation, safety red-teaming. This guide evaluates the leading Scale AI alternatives across platform capabilities, annotator quality, pricing transparency, and vendor independence. Scale AI’s transformation from neutral market leader to Meta subsidiary represents the most significant vendor risk event in data annotation history. What are the best Scale AI alternatives for enterprise annotation? Scale AI’s ownership crisis accelerated trends already in motion: quality over scale, expert annotators over crowd workers, vendor independence over platform lock-in. The question for AI/ML teams has shifted from “which vendor has scale?” to “which vendor can access the developers, engineers, and domain specialists our training data requires?”. Gun.io connects companies with vetted senior developers and engineers for AI training data annotation, code evaluation, and RLHF workflows."
    },
    {
      "title": "Top RLHF Tools and Platforms to Use in 2026",
      "url": "https://www.taskmonk.ai/blogs/rlhf-data-annotation-platforms",
      "description": "# Top RLHF Platforms in 2026 for LLM Fine-Tuning & Human Feedback. Discover the top RLHF platforms for LLM fine-tuning in 2026. Compare Taskmonk, Scale AI, Surge AI, and open-source tools for scalable human feedback and model alignment. * RLHF platforms collect human feedback at scale to fine-tune LLMs for safety, instruction following, and alignment with human values. * The best platforms combine expert annotator workforces, preference ranking tools, and an API infrastructure for SFT and reward modeling. * Commercial platforms like Taskmonk, Scale AI, Surge AI offer managed services with vetted annotators across technical domains. Reinforcement learning from human feedback trains models to generate responses that humans prefer by collecting preference data at scale, training reward models to predict human judgments, and using those reward signals to fine-tune the base model. You need annotators who can evaluate model outputs across technical domains, platforms that collect pairwise rankings without introducing bias, APIs that integrate feedback loops into training pipelines, and quality control systems that catch low-effort labels before they poison your reward model."
    },
    {
      "title": "RLHF Platforms in Biotech: Scale vs. Labelbox vs. In-House | IntuitionLabs",
      "url": "https://intuitionlabs.ai/articles/rlhf-platforms-biotech-comparison",
      "description": "# RLHF Platforms in Biotech: Scale vs. Reinforcement Learning from Human Feedback (RLHF) is an advanced AI training paradigm that leverages expert human evaluation to refine model outputs. In biotechnology and healthcare, where domain expertise and safety-critical decision-making abound, RLHF (and related human-in-the-loop labeling) is emerging as a crucial approach for training models in diagnostics, drug discovery, genomics, and clinical research. Specialized data-labeling platforms—such as Scale AI (Healthcare division), Labelbox (Healthcare), and Appen (Medical)—offer managed RLHF and annotation services, promising rapid scale, domain expertise, and regulatory compliance. This report provides an in-depth comparison of these external RLHF platforms versus in-house solutions for biotech applications. Appen offers HIPAA-aligned workflows with U.S.-based teams () and integrates data securely via APIs. In RLHF contexts, Appen emphasizes scale and diversity for language models () and has partnered with medical research (e.g. Johns Hopkins) to accelerate labeling: one case enabled **1,500+ person-hours of neuroscience annotation in a few weeks** via Appen’s crowd ()."
    }
  ]
}

▼thought

▼✓Fetch web page content, URL: "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-..."

{
  "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
  "title": "What is the cost of training large language models?",
  "summary": "Explore the true cost of training large language models. Learn about the financial, computational, and environmental costs of AI's most advanced models.",
  "content": "<div id=\"readability-page-1\" class=\"page\"><div data-bglz=\"1\" data-id=\"3f5cd38\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"theme-post-content.default\">\n\t\t\t\t\t\n<p>Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. </p>\n<p>However, training LLMs requires a substantial financial investment due to their vast number of parameters and the enormous computational power needed. It typically involves high-end GPUs or specialized AI accelerators, which are costly resources to acquire and run.</p>\n<p>For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to <a href=\"https://carboncredits.com/carbon-countdown-ais-10-billion-rise-in-power-use-explodes-data-center-emission/#:~:text=The%20final%20training%20run%20of%20GPT%2D3%20is%20estimated%20to%20have%20ranged%20from%20%24500%2C000%20to%20%244.6%20million.\" target=\"_blank\" rel=\"noopener\">range from about $500,000 up to $4.6 million</a>, depending on the hardware and optimization techniques used. By comparison, newer models have pushed costs much higher.</p>\n<p>Training OpenAI’s GPT-4 reportedly <a href=\"https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51\" target=\"_blank\" rel=\"noopener\">cost more than $100 million</a>, while one estimate puts its training compute alone at about <a href=\"https://hai.stanford.edu/news/ai-index-state-ai-13-charts#:~:text=One%20of%20the%20reasons%20academia,modern%20LLM%2C%20cost%20around%20%24900\" target=\"_blank\" rel=\"noopener\">$78 million</a>, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. 
These staggering sums partly reflect why model size and complexity have exploded.</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02a.png\" alt=\"cost_of_llms_chart_1\"></p>\n<blockquote>\n<p>Training compute petaFLOP log scale. Source: <a href=\"https://hai.stanford.edu/news/ai-index-state-ai-13-charts\" target=\"_blank\" rel=\"noopener\">Paper</a></p>\n</blockquote>\n<p>In this article, we’ll explore the expenses of bringing modern generative AI and reasoning models to life, focusing on infrastructure needs, data management, and the increasingly important role of cloud computing.</p>\n<h2>What are large language models?</h2>\n<p>LLMs are AI systems designed to understand and generate language in a way that mimics human communication and reasoning. They are trained on vast datasets containing text from books, websites, code repositories, and other digital content. </p>\n<p>By learning statistical patterns in language, LLMs can generate coherent and contextually relevant text based on a given input. For example, models like GPT are trained on a huge swath of internet text and can produce writing that mirrors human style across many topics.</p>\n<p>These models typically rely on the <a href=\"https://www.cudocompute.com/topics/neural-networks/transformer-models-what-are-they-and-how-do-they-work\">transformer architecture</a>, which uses mechanisms like self-attention to consider the context of words in a sequence, allowing the model to weigh the importance of different parts of the input text relative to each other, enabling a better understanding of context and meaning. 
</p>\n<p>For instance, Google’s BERT model reads text bidirectionally – both left-to-right and right-to-left – to capture context, a significant advancement over older one-directional models, making BERT especially effective for tasks requiring deep language understanding, like question answering or sentiment analysis.</p>\n<p>As LLMs have grown, they’ve gotten better at fluent text generation and shown emergent reasoning abilities. Newer generative models such as <a href=\"https://www.cudocompute.com/blog/why-open-source-models-are-disrupting-the-ai-race\">GPT-4o-mini and DeepSeek R1</a> can perform complex tasks like solving multi-step math problems or writing code by reasoning through steps, especially when guided by techniques like chain-of-thought prompting. </p>\n<p>Given the right prompts, these models can simulate a step-by-step reasoning process, allowing them to handle more complex queries. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on <a href=\"https://arxiv.org/pdf/2110.14168\" target=\"_blank\" rel=\"noopener\">data that encourages logical reasoning</a>. </p>\n<p>Models like Anthropic’s Claude have also been trained with specific methods, like Constitutional AI, to better follow instructions and reason safely. All of these advances, however, come with increased training requirements. Next, we’ll discuss the cost of training a large language model.</p>\n<h2>Cost of training LLMs with cloud infrastructure</h2>\n<p>As AI development increasingly shifts to cloud platforms, <a href=\"https://www.cudocompute.com/blog/gpu-supply-shortage-due-to-ai-needs\">driven partly by limited availability of GPUs</a>, cloud services have become one of the most practical and scalable ways to train LLMs. 
The cloud offers on-demand access to large numbers of GPUs/TPUs and the ability to distribute training across multiple machines. </p>\n<p>Such scalability is excellent for the fluctuating demands of AI training cycles. Major providers have built massive supercomputers in the cloud to facilitate LLM training. Microsoft, for example, <a href=\"https://www.zdnet.com/article/microsoft-builds-a-supercomputer-for-openai-for-training-massive-ai-models/#:~:text=Microsoft%20said%20the%20supercomputer%20built,has%20access%20to%20Azure%20services\" target=\"_blank\" rel=\"noopener\">constructed an Azure supercomputer</a> with over 10,000 GPUs and ultra-fast networking specifically for OpenAI’s model training.</p>\n<p>However, renting such infrastructure comes at a significant cost. A recent example from NVIDIA CEO Jensen Huang illustrates the scale of resources required, stating that training the GPT-MoE-1.8T model using 25,000 Ampere-based GPUs (most likely the A100) took 3 to 5 months. Doing the same with H100 would take about 8,000 GPUs in 90 days.</p>\n<p>The above statement shows how each generation of hardware can improve training efficiency. The H100s offer substantially higher throughput than A100s, reducing the time and number of machines needed for the same task. </p>\n<p>Still, whether using 25,000 older GPUs or 8,000 newer ones, the compute cluster required is enormous, and most organizations simply cannot afford to assemble or rent such resources for training from scratch.</p>\n<p>While it is possible to train a model from scratch, most users won’t train giant LLMs from scratch due to these high costs and infrastructure hurdles. Instead, they’ll use pre-trained models provided by AI labs or open-source communities and then adapt them to their needs. 
This avoids spending millions of dollars in compute for initial training.</p>\n<p>Generally, if a company or researcher needs an LLM, there are two approaches:</p>\n<ul>\n<li><strong>Hosting your own model:</strong> Obtain a pre-trained model checkpoint, either open-source or via license, and run further training or fine-tuning on it using cloud servers that you rent or own.</li>\n<li><strong>Pay-per-token access:</strong> Use a hosted model provided through an API, like OpenAI’s or Google’s, and pay for usage, rather than handling any training yourself.</li>\n</ul>\n<p>Let’s examine each of these approaches and their cost implications.</p>\n<p>Read more: <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch\">How to build an AI</a></p>\n<h3>Hosting models in the cloud</h3>\n<p>If you choose to host and train/fine-tune models in the cloud, you might need to rent compute resources. Companies like CUDO Compute offer suites of services that support the entire machine learning lifecycle from data storage to GPU compute to deployment. </p>\n<p>The advantage of cloud-based training is convenience and scalability, as you can spin up hundreds of GPUs for a few days of training and then shut them down, only paying for what you used. You also don’t have to maintain physical hardware.</p>\n<p>When training large models with tens or hundreds of billions of parameters, such as GPT-3 or TII’s Falcon 180B, the expense goes beyond just the headline price of GPUs. 
In a cloud environment, you also need to account for supporting resources and overheads:</p>\n<ul>\n<li><strong>Virtual CPUs (vCPUs)</strong> to coordinate and feed data to the GPUs during training.</li>\n<li><strong>Memory (RAM)</strong> for holding training data batches, model activations, and other intermediate computations.</li>\n<li><strong>Storage</strong> for datasets and for saving model checkpoints (which themselves can be hundreds of gigabytes in size for big models), as well as costs for data transfer.</li>\n</ul>\n<p>Each of these components adds to the bill. Cloud providers typically charge based on the compute time for both GPU and CPU, the amount of memory provisioned, and the volume of data stored or transferred. This means training a large AI model can incur costs on multiple fronts.</p>\n<p>Efficiently using and managing all these resources helps to keep the budget under control. For example, ensuring GPUs are kept busy so you’re not paying for idle time, and not provisioning more CPU or memory than needed.</p>\n<p>Let’s break down how this might work when training an LLM on a large model on <a href=\"https://www.cudocompute.com/\">CUDO Compute</a>:</p>\n<p>At the time of writing, the cost of the <a href=\"https://www.cudocompute.com/gpu-rental/nvidia-a100\">A100</a> on <a href=\"https://www.cudocompute.com/pricing\">CUDO Compute</a> starts from $1.50 per hour. There is also a monthly commitment option of $1,125.95. When factoring in the other costs, such as vCPUs and memory needed, each is charged based on location.</p>\n<p>Using the median location from the roster, we will base our analysis on the pricing from the Los Angeles 1 location. Here is how much it costs for each resource needed:</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02a.png\" alt=\"cost_of_llms_table_1\"></p>\n<p>Multiple GPUs are advised for optimal results. 
This would be the recommended amount needed to train a Falcon 180B on CUDO Compute based on the default instance for training the same model on AWS:</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02b.png\" alt=\"cost_of_llms_table_2\"></p>\n<p>The above configuration is very similar to the default configuration used on AWS for training LLMs on the same model. To use this configuration on CUDO Compute, assuming the model is not scaled up or down, and no discounts are applied, it will total just over USD 13,000 monthly. Here is the breakdown:</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02c.png\" alt=\"cost_of_llms_table_3\"></p>\n<p>Since training an LLM from scratch will likely take months, this cost will pile up over time, particularly when training involves multiple iterations over extensive datasets. </p>\n<p>Also, as CUDO Compute is one of the cheapest cloud platforms available, bear in mind that the compute costs may be higher on other platforms. For example, using an instance with a similar configuration on AWS (ml.p4de.24xlarge) will cost over USD 23,000 per month.</p>\n<p>It’s worth noting that newer GPU hardware can change this calculus. The above costs were with NVIDIA’s A100 GPUs. The H100 GPUs offer about 2–3 times the performance of A100 for training workloads, and the <a href=\"https://www.cudocompute.com/blog/nvidia-gb200-everything-you-need-to-know\">B100 is even more improved than the H100</a>. You could use fewer B100s to train in less time, potentially lowering the total cost, but B100s are also more expensive per hour than A100s and H100s.</p>\n<p>Hosting and training large models in the cloud gives flexibility but requires careful resource planning. You will often fine-tune an existing model for a specific task (which is far cheaper) rather than train a new LLM from scratch. 
Next, we’ll look at an alternative to any training, using a pay-per-token model to use models others have already trained.</p>\n<h3>Pay-per-token access to LLMs</h3>\n<p>The high cost of training and maintaining LLMs has led to the rise of the pay-per-token (PPT) model for accessing these powerful language models. Here's how it works:</p>\n<p>Companies like OpenAI and Google AI pre-train massive LLMs on vast datasets and allow developers and businesses to use these models, such as GPT-3 or similar, without the prohibitive costs and technical challenges of training such models themselves.</p>\n<p>Users don't incur the upfront costs of training and infrastructure. Instead, they pay a fee based on the number of tokens (roughly equivalent to words or sub-words) processed by the LLM when completing tasks like text generation, translation, or code writing.</p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/cost-of-llms-image-5.jpg\" alt=\"cost-of-llms-image-5\"></p>\n<p>The PPT model offers a significantly more cost-effective approach than in-house training for tasks that don't require extensive LLM usage. Users only pay for the resources they actually use.</p>\n<h3>Benefits of pay per token:</h3>\n<ul>\n<li><strong>Reduced costs:</strong> This model eliminates the upfront investment in hardware, software, and training data.</li>\n<li><strong>Scalability:</strong> Users can easily scale their LLM usage up or down based on their needs, paying only for the tokens they consume.</li>\n<li><strong>Accessibility:</strong> PPT allows a wider range of users and smaller companies to access LLMs without the prohibitive costs of in-house training.</li>\n</ul>\n<h2>Why is it so expensive to train LLMs?</h2>\n<p>There are several reasons why training large language models is extraordinarily expensive:</p>\n<ul>\n<li><strong>Massive model sizes and data:</strong> Modern LLMs are huge. 
While GPT-3 had 175 billion parameters, newer models like GPT-4 are rumored to have on the order of <a href=\"https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51\" target=\"_blank\" rel=\"noopener\">1 trillion parameters</a>. The amount of computation (floating point operations) needed to train a model scales roughly linearly with the number of parameters multiplied by the number of training examples (tokens).</li>\n</ul>\n<p>To achieve high performance, these models are trained on extremely large datasets with hundreds of billions or even trillions of tokens of text. For example, an LLM might be trained on 300-500 billion tokens of data, iterating through that corpus multiple times. </p>\n<p>The combination of a gigantic model and a gigantic dataset leads to an eye-popping number of total compute operations.</p>\n<p>It is estimated that GPT-4’s training consumed 2.1 × 10<sup>25</sup> FLOPs (21 billion petaFLOPs), and models like Gemini Ultra <a href=\"https://www.voronoiapp.com/technology/Googles-Gemini-Ultra-Cost-191M-to-Develop--1088#:~:text=For%20example%2C%20Gemini%20Ultra%20required,billion%20petaFLOPS%2C%20cost%20%2478%20million\" target=\"_blank\" rel=\"noopener\">might be around 5.0 × 10<sup>25</sup> FLOPs</a>, driving those compute bills. Simply put, scale is costly. 
Even with highly efficient software and hardware, you’re pushing an unfathomable amount of data through the model.</p>\n<ul>\n<li><strong>Expensive hardware requirements:</strong> To handle that scale of computation within a reasonable timeframe, you need fleets of high-end accelerators (GPUs or TPUs). Training runs often use thousands of GPUs working in parallel for weeks or months.</li>\n</ul>\n<p>Owning such hardware is capital-intensive, as one NVIDIA H100 GPU can cost $25k–$40k; a pod of 1000 of them would be $25–40 million in hardware alone, and renting them is expensive too, as we discussed earlier. </p>\n<p>Additionally, the electricity and cooling for running these at full tilt 24/7 are very costly. Large training runs can consume megawatt-hours of energy, and only a few companies have the infrastructure to dedicate 10,000 GPUs to a single task. </p>\n<p>Supporting hardware like high-speed networking, like InfiniBand or custom interconnects, is also needed to ensure those GPUs can communicate rapidly as they split up the training load. All of this specialized infrastructure drives up the cost significantly.</p>\n<ul>\n<li><strong>Extended training durations:</strong> Despite massive parallelism, training a reasoning model can still take weeks or months. During that entire period, the hardware must be powered and utilized.</li>\n</ul>\n<p>For instance, if you run 1,000 GPUs for one month, that’s 1,000 GPU-months of usage, which, at say $2,000 per GPU-month, would be $2 million. Now, consider some runs use 5,000 or 10,000 GPUs for several months. It’s easy to see how the cost racks up to tens of millions of dollars. Every additional epoch of training to slightly improve the model’s performance comes with that price tag. </p>\n<p>Moreover, training often isn’t a one-shot process; researchers may do multiple runs while tuning hyperparameters or testing different configurations. 
The final model you see is the product of many trial runs and experiments behind the scenes. Those experimental runs also consume a lot of compute. </p>\n<p>According to recent analysis, the total compute used in developing a new model could be <a href=\"https://arxiv.org/html/2405.21015v2\" target=\"_blank\" rel=\"noopener\">2–3 times the compute of the final training run</a>, once you account for all the failed or exploratory experiments along the way.</p>\n<ul>\n<li><strong>Advanced techniques and multiple phases:</strong> Training a competitive LLM now involves more than just one pass through a dataset. For example, after the initial pre-training where you predict the next word on huge text data, many models undergo fine-tuning phases, such as <a href=\"https://arxiv.org/pdf/2504.12501\" target=\"_blank\" rel=\"noopener\">Reinforcement Learning from Human Feedback</a> (RLHF), to align the model with human preferences and instructions.</li>\n</ul>\n<p>RLHF involves training additional models (reward models) and running <a href=\"https://www.cudocompute.com/blog/machine-learning-technique-introduction-to-reinforcement-learning\">reinforcement learning algorithms</a> that themselves require lots of computations on GPUs and <a href=\"https://arxiv.org/pdf/2409.18417\" target=\"_blank\" rel=\"noopener\">custom data generated by human annotators</a>, which is expensive. </p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02b.png\" alt=\"cost_of_llms_chart_2\"></p>\n<blockquote>\n<p>Illustration of Vickrey Feedback for RLHF. 
Source: <a href=\"https://arxiv.org/pdf/2409.18417\" target=\"_blank\" rel=\"noopener\">Paper</a>.</p>\n</blockquote>\n<p>OpenAI’s ChatGPT/GPT-4 alignment process, for instance, meant hiring human contractors to <a href=\"https://openai.com/index/chatgpt/\" target=\"_blank\" rel=\"noopener\">produce conversations and feedback</a>, and then performing iterative fine-tuning, improving the model’s usefulness and safety, but it adds to the total cost of development, both in terms of money and time, beyond the baseline pre-training. </p>\n<p>Similarly, to imbue reasoning abilities, researchers might fine-tune LLMs on complex problem-solving datasets or use techniques like <a href=\"https://arxiv.org/pdf/2201.11903\" target=\"_blank\" rel=\"noopener\">chain-of-thought prompting</a> and <a href=\"https://arxiv.org/pdf/2203.11171\" target=\"_blank\" rel=\"noopener\">self-correction</a>, which can require generating and filtering a lot of model outputs, again consuming compute.</p>\n<p>Each additional training objective or stage, be it tutoring the model on logic puzzles or making it follow a constitution of AI principles, introduces extra expense.</p>\n<ul>\n<li><strong>Data acquisition and preparation:</strong> Often overlooked in pure compute discussions is the cost of the <em>data</em> itself. High-quality datasets may need to be purchased or scraped and cleaned at scale. For example, an LLM might ingest the entire Wikipedia, huge swaths of web text, news, books, academic papers, code repositories, etc. Web crawling and data storage incur infrastructure costs.</li>\n</ul>\n<p>Additionally, if a model is to excel at, say, coding or medical questions, one might need to curate domain-specific data, which could be behind paywalls or require licensing. Companies might pay for access to archives of journals or specific private datasets. </p>\n<p>Then comes the preprocessing: filtering out problematic content, deduplicating text, formatting it for training, and so forth. 
This can involve substantial engineering effort and cloud processing power. </p>\n<p>Preparing a trillion-token corpus that is diverse, high-quality, and legally usable is a non-trivial expense, sometimes involving human annotators for labeling or verification in the loop.</p>\n<p>Pushing the limits of dataset size, model size, and compute scale is expensive because it sits at the edge of what’s technologically possible. To do so requires cutting-edge hardware, huge energy consumption, and often a team of researchers and engineers whose salaries are another factor. </p>\n<p>The result is that only a handful of companies and well-funded academic labs can afford to train the largest models. Others must be clever in using those models or finding ways to achieve results with smaller-scale systems. </p>\n<p>In the following section, we’ll discuss strategies for controlling and reducing the cost of training.</p>\n<h2>Steps to controlling the cost of training LLMs</h2>\n<p>While training large models will likely remain resource-intensive, there are several strategies that can optimize resource utilization and reduce expenses. In fact, for most organizations, the question isn’t “<em>Can we train our own GPT-4?</em>” but “<em>How can we achieve our AI goals within a reasonable budget?</em>”. </p>\n<p>Below are key approaches to managing and minimizing the costs of training and deploying LLMs:</p>\n<p><strong>1. Implement model and training optimization techniques:</strong> Optimize what you build and how you build it so you need less compute for the same result.</p>\n<ul>\n<li><strong>Model architecture selection and sizing:</strong> Carefully choose a model architecture and size that balances performance with cost. Bigger is not always better for a given task. 
If you can achieve your target accuracy with a 6-billion-parameter model, there’s no need to train a 60-billion-parameter one.</li>\n</ul>\n<p>Techniques like model pruning (removing unnecessary weights) or quantization (storing weights at lower numerical precision) can reduce the effective size and computation of a model without a significant loss in accuracy. Using newer architecture improvements can also get more bang for the buck, as some architectures are more parameter-efficient than others. </p>\n<p>The goal is to avoid overspending on compute capacity you don’t actually need. In some cases, <a href=\"https://www.cudocompute.com/blog/what-is-ensemble-learning\">an ensemble of multiple smaller specialized models</a> might be better than one large model.</p>\n<ul>\n<li><strong>Training data optimization:</strong> Ensure your training data is high-quality and relevant. A smaller, curated dataset can sometimes outperform a larger, noisy one. Removing redundant or low-value data can shorten training time.</li>\n</ul>\n<p>For instance, if you’re building a biomedical language model, 100 GB of well-chosen medical text will typically train faster and yield a more accurate model in that domain than 1 TB of random internet text. </p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02c.png\" alt=\"cost_of_llms_chart_3\"></p>\n<blockquote>\n<p>Amortized hardware cost plus energy cost for the final training run of frontier models. Source: <a href=\"https://arxiv.org/html/2405.21015v2\" target=\"_blank\" rel=\"noopener\">Paper</a></p>\n</blockquote>\n<p>Use data filtering to eliminate spam, duplicated content, and irrelevant information, and use augmentation to add useful diversity. Every epoch on unnecessary data is wasted GPU time, so focus on quality over sheer quantity when possible.</p>\n<ul>\n<li><strong>Knowledge distillation:</strong> This technique involves training a smaller “student” model to replicate the behavior of a larger “teacher” model. 
You first train a large model (or take an existing one), then use its outputs to guide the training of a compact model.</li>\n</ul>\n<p>The student model can often achieve near the teacher’s performance but with far fewer parameters, effectively compressing the knowledge of a huge model into a cheaper, faster model that is much easier to deploy and even continue training. </p>\n<p>Distillation was used, for example, to create smaller BERT variants and even some DeepSeek distillations, cutting down model size by 10 times or more while retaining most capabilities. For organizations, one strategy is to fine-tune a large model on your task (possibly using an API or a short-term rental on a big GPU machine), then distill it into a smaller model you can afford to train further or run in production.</p>\n<ul>\n<li><strong>Mixed-precision training:</strong> Modern hardware and software allow using lower numerical precision for calculations to speed up training. Instead of 32-bit floating point for everything, frameworks use 16-bit or even 8-bit in certain parts of the computation.</li>\n</ul>\n<p>This cuts memory usage and can double the training speed on supported hardware <em>without</em> noticeably affecting model accuracy due to techniques like loss scaling to maintain numerical stability. </p>\n<p>The latest NVIDIA GPUs and Google TPUs all support mixed precision. You can significantly reduce the compute time and cost required for each training step using mixed-precision training. 
It’s a free efficiency gain and has become standard practice in training large models.</p>\n<ul>\n<li><strong>Parameter-efficient fine-tuning:</strong> A recent development, particularly useful if you’re working with pre-trained models, is to fine-tune only a subset of the model’s parameters or use add-on adapters, instead of complete end-to-end training.</li>\n</ul>\n<p>Approaches such as <a href=\"https://arxiv.org/pdf/2106.09685\" target=\"_blank\" rel=\"noopener\">low-rank adaptation</a> (LoRA) and other adapter-based methods allow you to train just a small additional set of weights or a low-rank update to the original weights that can be merged with the model for new tasks, drastically reducing the compute and data needed to adapt a model to a domain. </p>\n<p>For example, LoRA was used to fine-tune large models like GPT-3 on specific tasks using 10x–100x less compute than tuning the entire model. By training fewer parameters, you use less GPU memory and can get away with shorter training times, all while achieving nearly the same performance as full fine-tuning. </p>\n<p>Parameter-efficient fine-tuning uses the investment already made in a pre-trained model and avoids having to reinvent the wheel. In general, always prefer fine-tuning or adapting an existing model over training from scratch when possible; it’s very much like <a href=\"https://www.cudocompute.com/blog/few-shot-learning-everything-you-need-to-know#:~:text=varied%20training%20data.-,Transfer%20learning,-%3A%20Transfer%20learning%20leverages\">transfer learning</a> that can cut requirements by orders of magnitude.</p>\n<p><strong>2. 
Optimize training configurations:</strong>\nTrain smarter, not harder – improve your training process to converge faster or use fewer resources.</p>\n<ul>\n<li><strong>Hyperparameter tuning and efficient experimentation:</strong> The choice of <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch/4-steps-to-build-an-ai-model#hyperparameters-tuning\">hyperparameters can dramatically affect how fast a model learns</a>. A well-tuned training run might reach a target accuracy in 50% fewer steps than a poorly tuned one, which directly translates to 50% less cost.</li>\n</ul>\n<p>Rather than guessing, use systematic approaches to hyperparameter search (grid search, Bayesian optimization, etc.) on smaller-scale models or subsets of data to find good settings, then scale up. </p>\n<p>Also, monitor training in real-time. If loss has plateaued early or something looks off, intervene or stop rather than let a bad configuration run to completion. Implement an early stopping strategy so that if your model’s performance on a validation set stops improving, it halts the training to avoid wasting epochs that don’t yield gains.</p>\n<p>Read more: <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch\">How to build an AI</a></p>\n<ul>\n<li><strong>Checkpointing and fault tolerance:</strong> Training runs can be long, and failures like machine crashes and spot instance termination can happen. By checkpointing regularly (saving the model state every so often), you can resume from the last checkpoint instead of starting over, thereby saving all the progress made until the interruption. This is more about avoiding unnecessary re-training costs rather than reducing the cost of the successful run, but it’s important.</li>\n<li><strong>Smaller-scale and iterative training:</strong> Often, one can pre-train on a smaller scale first to get a sense of things. 
For example, train a smaller model (say 10% of the full size) on the full data, or the full model on 10% of the data, to identify issues or get learning curve estimates. This “pilot run” approach can reveal whether your learning rate is way off or whether the model is too small to ever reach your target, without spending the full budget.</li>\n</ul>\n<p>You can also use learning rate warmups, cosine decay schedules, and other training schedule techniques. These help stabilize the training of large models and avoid divergence that would force you to restart with different settings, which is another hidden cost.</p>\n<p><strong>4. Use advanced model strategies:</strong></p>\n<p>Use model designs that reduce per-model compute and external resources that make the job easier.</p>\n<ul>\n<li><strong>Mixture-of-experts (MoE) models:</strong> MoE is an architecture that has multiple sub-models called experts and a gating mechanism that activates only some of them for each input, meaning that at any given time, only a portion of the model’s parameters are used, rather than all of them.</li>\n</ul>\n<p>MoEs can significantly cut down the compute needed for a given parameter count – for instance, Google’s Switch Transformer (an MoE model) achieved comparable results to a dense model but with far less computation by sparsely activating experts. </p>\n<p>In that work, a model with 1.6 trillion parameters was trained at the <a href=\"https://arxiv.org/pdf/2101.03961\" target=\"_blank\" rel=\"noopener\">computational cost of only a 100+ billion parameter dense model</a>. The benefit is that you get the effect of a huge model (in terms of capacity and specialization) without having to pay the full cost every time. </p>\n<p>MoEs spread out the workload and can be more efficient if implemented well. However, they add complexity; not all frameworks easily support them, and they can be tricky to get right. 
But as research advances, MoEs are becoming more common in large-scale settings. </p>\n<p>For cost-savvy training, exploring MoE architectures could allow training frontier-sized models on a smaller budget by <strong>reducing the computational load per token</strong>.</p>\n<ul>\n<li><strong>Retrieval and external knowledge integration:</strong> Another way to reduce the burden on the model itself is to give it tools or access to information during training/inference. For example, Retrieval-Augmented Generation (RAG) techniques provide the model with relevant documents fetched from an external database, so the model doesn’t need to memorize every fact; instead, it can look things up.</li>\n</ul>\n<p>DeepMind’s RETRO model demonstrated that a model with retrieval support could <a href=\"https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=But%20there%20is%20good%20news%21\" target=\"_blank\" rel=\"noopener\">match the performance of a model 25 times larger</a> that had everything baked into its parameters. By integrating a massive external knowledge base, they maintained performance with a much smaller network, which implies huge training cost savings, since <a href=\"https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=You%20can%20integrate%20a%20massive,up%20to%2025x%20smaller%20networks\" target=\"_blank\" rel=\"noopener\">25 times fewer parameters to train means roughly 25 times less compute</a>. </p>\n<p>This approach can be thought of as augmenting the model with a form of memory or tools so it doesn’t have to do all the work internally. Likewise, allowing a model to use a calculator for arithmetic or call an API for current knowledge means you don’t have to train it to death trying to internalize those capabilities. 
</p>\n<p>In a practical sense, building a system that combines an LLM with retrieval (e.g., queries to Wikipedia or a company knowledge base) can let you use a smaller base model to achieve the same or better performance than an extremely large model without retrieval, cutting down training requirements. </p>\n<p>Many current applications, like search engine bots, use this strategy. So, from a cost perspective, invest in a slightly more complex system (model + retrieval tool) and you might avoid having to invest in a model that’s 10 times bigger and 10 times more expensive to train.</p>\n<ul>\n<li><strong>Multi-step reasoning and curriculum learning:</strong> If you want an LLM with strong reasoning, an efficient approach is to train it to encourage reasoning via multiple steps, rather than expecting the model to solve everything in one go with brute-force parameters.</li>\n</ul>\n<p>For example, techniques like chain-of-thought training involve showing the model intermediate reasoning steps for problems (like math solutions or code logic) during training. This can teach a smaller model to approach complex tasks systematically, narrowing the gap with a larger model that might implicitly learn those skills. </p>\n<p>It’s not a direct cost-saving in compute, but it can mean you reach the desired capability without an exponential increase in model size. Similarly, curriculum learning (starting with easier tasks and then increasing difficulty) can help a model learn faster and converge better, meaning fewer epochs (less cost) to achieve high performance. </p>\n<p>The overarching idea is training smarter. Guiding the model’s learning process can trim the needed compute compared to naive training on a jumble of tasks. This is especially useful for reasoning-centric models where the objective is not just fluent text, but correct and logical solutions.</p>\n<p><strong>5. 
Collaborate and use open-source ecosystems:</strong></p>\n<p>Don’t go at it alone; take advantage of community efforts, existing models, and shared resources.</p>\n<ul>\n<li><strong>Use pre-trained models and tools:</strong> Using open-source pre-trained models significantly reduces costs by eliminating the need for full-scale training from scratch. Models such as Meta’s LLaMA 2, BLOOM, and DeepSeek R1 are available with pre-trained weights, enabling companies and researchers to fine-tune these models for their specific needs at a fraction of the original training cost.</li>\n</ul>\n<p>As we showed earlier, fine-tuning a large model like LLaMA 2 (70B parameters) typically costs tens of thousands of dollars, substantially less than full-scale initial training.</p>\n<p>Additionally, the open-source ecosystem provides robust tools and libraries optimized for efficient distributed training. Frameworks such as DeepSpeed and Fully Sharded Data Parallel (FSDP) effectively manage large models across limited hardware by sharding model components, allowing greater efficiency and reduced hardware requirements. </p>\n<p>Hugging Face’s Accelerate and tools like Horovod further streamline distributed training processes. You can benefit from shared knowledge, compute resources, and reduced overall costs by tapping into community-developed resources and collaborating within research communities, like EleutherAI, LAION, and BigScience.</p>\n<p>By applying these strategies, you’ll be able to cut down the cost required to train or fine-tune LLMs. 
Some emerging efforts suggest it may be possible to reach GPT-3/4 level performance on much smaller budgets.</p>\n<p>For example, the founder of 01.ai claimed they trained a GPT-4-comparable model <a href=\"https://www.tomshardware.com/tech-industry/artificial-intelligence/chinese-company-trained-gpt-4-rival-with-just-2-000-gpus-01-ai-spent-usd3m-compared-to-openais-usd80m-to-usd100m#:~:text=advanced%20AI%20models%20using%202%2C000%20GPUs%20with%20just%20%243%20million\" target=\"_blank\" rel=\"noopener\">using only 2,000 GPUs and about $3 million of compute</a> by optimizing every aspect of the process, which is a stark contrast to the amount reportedly spent on GPT-4 by OpenAI. While such claims have yet to be fully verified, they underscore that efficiency matters. The gap between big-budget labs and smaller players can be narrowed with clever engineering.</p>\n<p>Controlling the cost of training LLMs is about making prudent choices at each step, from model size to hardware, data, and tools. Those who do so can build impressive language and reasoning models without breaking the bank. </p>\n<p>The landscape of LLM development is evolving quickly. While the cutting-edge models will always push expensive extremes, the practical know-how for cost-effective training is more available than ever. To learn more, read our guide on <a href=\"https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch\">how to build an AI from scratch</a>.</p>\n<p>With careful planning, training your LLM or tailoring an existing one to your needs is feasible within a reasonable budget, and CUDO Compute can help you achieve that easily. We offer cost-effective GPUs on demand and scalable GPU clusters equipped with the latest NVIDIA GPUs. You can get started with a few clicks. <a href=\"https://compute.cudo.org/\" target=\"_blank\" rel=\"noopener\">Get started</a>.</p>\n\t\t\t\t</div></div>",
  "markdown": "Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools.\n\nHowever, training LLMs requires a substantial financial investment due to their vast number of parameters and the enormous computational power needed. It typically involves high-end GPUs or specialized AI accelerators, which are costly resources to acquire and run.\n\nFor example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to [range from about $500,000 up to $4.6 million](https://carboncredits.com/carbon-countdown-ais-10-billion-rise-in-power-use-explodes-data-center-emission/#:~:text=The%20final%20training%20run%20of%20GPT%2D3%20is%20estimated%20to%20have%20ranged%20from%20%24500%2C000%20to%20%244.6%20million.), depending on the hardware and optimization techniques used. By comparison, newer models have pushed costs much higher.\n\nTraining OpenAI’s GPT-4 reportedly [cost more than $100 million](https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51), with some estimates ranging up to [$78 million in compute cost](https://hai.stanford.edu/news/ai-index-state-ai-13-charts#:~:text=One%20of%20the%20reasons%20academia,modern%20LLM%2C%20cost%20around%20%24900), and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. These staggering sums partly reflect why model size and complexity have exploded.\n\n![cost\\_of\\_llms\\_chart\\_1](https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02a.png)\n\n> Training compute petaFLOP log scale. 
Source: [Paper](https://hai.stanford.edu/news/ai-index-state-ai-13-charts)\n\nIn this article, we’ll explore the expenses of bringing modern generative AI and reasoning models to life, focusing on infrastructure needs, data management, and the increasingly important role of cloud computing.\n\n## What are large language models?\n\nLLMs are AI systems designed to understand and generate language in a way that mimics human communication and reasoning. They are trained on vast datasets containing text from books, websites, code repositories, and other digital content.\n\nBy learning statistical patterns in language, LLMs can generate coherent and contextually relevant text based on a given input. For example, models like GPT are trained on a huge swath of internet text and can produce writing that mirrors human style across many topics.\n\nThese models typically rely on the [transformer architecture](https://www.cudocompute.com/topics/neural-networks/transformer-models-what-are-they-and-how-do-they-work), which uses mechanisms like self-attention to consider the context of words in a sequence, allowing the model to weigh the importance of different parts of the input text relative to each other, enabling a better understanding of context and meaning.\n\nFor instance, Google’s BERT model reads text bidirectionally – both left-to-right and right-to-left – to capture context, a significant advancement over older one-directional models, making BERT especially effective for tasks requiring deep language understanding, like question answering or sentiment analysis.\n\nAs LLMs have grown, they’ve gotten better at fluent text generation and shown emergent reasoning abilities. 
Newer generative models such as [GPT-4o-mini and DeepSeek R1](https://www.cudocompute.com/blog/why-open-source-models-are-disrupting-the-ai-race) can perform complex tasks like solving multi-step math problems or writing code by reasoning through steps, especially when guided by techniques like chain-of-thought prompting.\n\nGiven the right prompts, these models can simulate a step-by-step reasoning process, allowing them to handle more complex queries. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on [data that encourages logical reasoning](https://arxiv.org/pdf/2110.14168).\n\nModels like Anthropic’s Claude have also been trained with specific methods, like Constitutional AI, to better follow instructions and reason safely. All of these advances, however, come with increased training requirements. Next, we’ll discuss the cost of training a large language model.\n\n## Cost of training LLMs with cloud infrastructure\n\nAs AI development increasingly shifts to cloud platforms, [driven partly by limited availability of GPUs](https://www.cudocompute.com/blog/gpu-supply-shortage-due-to-ai-needs), cloud services have become one of the most practical and scalable ways to train LLMs. The cloud offers on-demand access to large numbers of GPUs/TPUs and the ability to distribute training across multiple machines.\n\nSuch scalability is excellent for the fluctuating demands of AI training cycles. Major providers have built massive supercomputers in the cloud to facilitate LLM training. 
Microsoft, for example, [constructed an Azure supercomputer](https://www.zdnet.com/article/microsoft-builds-a-supercomputer-for-openai-for-training-massive-ai-models/#:~:text=Microsoft%20said%20the%20supercomputer%20built,has%20access%20to%20Azure%20services) with over 10,000 GPUs and ultra-fast networking specifically for OpenAI’s model training.\n\nHowever, renting such infrastructure comes at a significant cost. NVIDIA CEO Jensen Huang recently illustrated the scale of resources required, stating that training the GPT-MoE-1.8T model using 25,000 Ampere-based GPUs (most likely the A100) took 3 to 5 months. Doing the same with H100s would take about 8,000 GPUs and 90 days.\n\nThis comparison shows how each generation of hardware can improve training efficiency. The H100s offer substantially higher throughput than A100s, reducing the time and number of machines needed for the same task.\n\nStill, whether using 25,000 older GPUs or 8,000 newer ones, the compute cluster required is enormous, and most organizations simply cannot afford to assemble or rent such resources for training from scratch.\n\nBecause of these high costs and infrastructure hurdles, most users won’t train giant LLMs from scratch. Instead, they’ll use pre-trained models provided by AI labs or open-source communities and then adapt them to their needs. 
This avoids spending millions of dollars in compute for initial training.\n\nGenerally, if a company or researcher needs an LLM, there are two approaches:\n\n-   **Hosting your own model:** Obtain a pre-trained model checkpoint, either open-source or via license, and run further training or fine-tuning on it using servers that you rent or own.\n-   **Pay-per-token access:** Use a hosted model provided through an API, like OpenAI’s or Google’s, and pay for usage, rather than handling any training yourself.\n\nLet’s examine each of these approaches and their cost implications.\n\nRead more: [How to build an AI](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch)\n\n### Hosting models in the cloud\n\nIf you choose to host and train/fine-tune models in the cloud, you will need to rent compute resources. Companies like CUDO Compute offer suites of services that support the entire machine learning lifecycle from data storage to GPU compute to deployment.\n\nThe advantage of cloud-based training is convenience and scalability, as you can spin up hundreds of GPUs for a few days of training and then shut them down, only paying for what you used. You also don’t have to maintain physical hardware.\n\nWhen training large models with tens or hundreds of billions of parameters, such as GPT-3 or the Technology Innovation Institute’s Falcon 180B, the expense goes beyond just the headline price of GPUs. In a cloud environment, you also need to account for supporting resources and overheads:\n\n-   **Virtual CPUs (vCPUs)** to coordinate and feed data to the GPUs during training.\n-   **Memory (RAM)** for holding training data batches, model activations, and other intermediate computations.\n-   **Storage** for datasets and for saving model checkpoints (which themselves can be hundreds of gigabytes in size for big models), as well as costs for data transfer.\n\nEach of these components adds to the bill. 
Cloud providers typically charge based on the compute time for both GPU and CPU, the amount of memory provisioned, and the volume of data stored or transferred. This means training a large AI model can incur costs on multiple fronts.\n\nEfficiently using and managing all these resources helps to keep the budget under control. For example, keep GPUs busy so you’re not paying for idle time, and avoid provisioning more CPU or memory than needed.\n\nLet’s break down how this might work when training a large model on [CUDO Compute](https://www.cudocompute.com/):\n\nAt the time of writing, the cost of the [A100](https://www.cudocompute.com/gpu-rental/nvidia-a100) on [CUDO Compute](https://www.cudocompute.com/pricing) starts from $1.50 per hour. There is also a monthly commitment option of $1,125.95. The other resources needed, such as vCPUs and memory, are each priced by location.\n\nUsing the median location from the roster, we will base our analysis on the pricing from the Los Angeles 1 location. Here is how much it costs for each resource needed:\n\n![cost\\_of\\_llms\\_table\\_1](https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02a.png)\n\nMultiple GPUs are advised for optimal results. This is the recommended configuration for training Falcon 180B on CUDO Compute, based on the default instance for training the same model on AWS:\n\n![cost\\_of\\_llms\\_table\\_2](https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02b.png)\n\nThe above configuration is very similar to the default configuration used on AWS for training the same model. Running this configuration on CUDO Compute, assuming it is not scaled up or down and no discounts are applied, totals just over USD 13,000 per month. 
Here is the breakdown:\n\n![cost\\_of\\_llms\\_table\\_3](https://www.cudocompute.com/wp-content/uploads/2026/04/Table_02c.png)\n\nSince training an LLM from scratch will likely take months, this cost will pile up over time, particularly when training involves multiple iterations over extensive datasets.\n\nAlso, as CUDO Compute is one of the cheapest cloud platforms available, bear in mind that the compute costs may be higher on other platforms. For example, using an instance with a similar configuration on AWS (ml.p4de.24xlarge) will cost over USD 23,000 per month.\n\nIt’s worth noting that newer GPU hardware can change this calculus. The above costs assume NVIDIA’s A100 GPUs. H100 GPUs offer about 2–3 times the performance of the A100 for training workloads, and the [B100 improves further on the H100](https://www.cudocompute.com/blog/nvidia-gb200-everything-you-need-to-know). You could use fewer B100s to train in less time, potentially lowering the total cost, but B100s are also more expensive per hour than A100s and H100s.\n\nHosting and training large models in the cloud gives flexibility but requires careful resource planning. You will often fine-tune an existing model for a specific task (which is far cheaper) rather than train a new LLM from scratch. Next, we’ll look at an alternative that avoids training altogether: paying per token to use models others have already trained.\n\n### Pay-per-token access to LLMs\n\nThe high cost of training and maintaining LLMs has led to the rise of the pay-per-token (PPT) model for accessing these powerful language models. Here's how it works:\n\nCompanies like OpenAI and Google AI pre-train massive LLMs on vast datasets and allow developers and businesses to use these models, such as GPT-3 or similar, without the prohibitive costs and technical challenges of training such models themselves.\n\nUsers don't incur the upfront costs of training and infrastructure. 
Instead, they pay a fee based on the number of tokens (roughly equivalent to words or sub-words) processed by the LLM when completing tasks like text generation, translation, or code writing.\n\n![cost-of-llms-image-5](https://www.cudocompute.com/wp-content/uploads/2026/04/cost-of-llms-image-5.jpg)\n\nThe PPT model offers a significantly more cost-effective approach than in-house training for tasks that don't require extensive LLM usage. Users only pay for the resources they actually use.\n\n### Benefits of pay per token:\n\n-   **Reduced costs:** This model eliminates the upfront investment in hardware, software, and training data.\n-   **Scalability:** Users can easily scale their LLM usage up or down based on their needs, paying only for the tokens they consume.\n-   **Accessibility:** PPT allows a wider range of users and smaller companies to access LLMs without the prohibitive costs of in-house training.\n\n## Why is it so expensive to train LLMs?\n\nThere are several reasons why training large language models is extraordinarily expensive:\n\n-   **Massive model sizes and data:** Modern LLMs are huge. While GPT-3 had 175 billion parameters, newer models like GPT-4 are rumored to have on the order of [1 trillion parameters](https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,51). The amount of computation (floating point operations) needed to train a model scales roughly linearly with the number of parameters multiplied by the number of training examples (tokens).\n\nTo achieve high performance, these models are trained on extremely large datasets with hundreds of billions or even trillions of tokens of text. 
For example, an LLM might be trained on 300-500 billion tokens of data, iterating through that corpus multiple times.\n\nThe combination of a gigantic model and a gigantic dataset leads to an eye-popping number of total compute operations.\n\nIt is estimated that GPT-4’s training consumed 2.1 × 10^25 FLOPs (21 billion petaFLOPs), and models like Gemini Ultra [might be around 5.0 × 10^25 FLOPs](https://www.voronoiapp.com/technology/Googles-Gemini-Ultra-Cost-191M-to-Develop--1088#:~:text=For%20example%2C%20Gemini%20Ultra%20required,billion%20petaFLOPS%2C%20cost%20%2478%20million), driving those compute bills. Simply put, scale is costly. Even with highly efficient software and hardware, you’re pushing an unfathomable amount of data through the model.\n\n-   **Expensive hardware requirements:** To handle that scale of computation within a reasonable timeframe, you need fleets of high-end accelerators (GPUs or TPUs). Training runs often use thousands of GPUs working in parallel for weeks or months.\n\nOwning such hardware is capital-intensive, as one NVIDIA H100 GPU can cost $25k–$40k; a pod of 1,000 of them would be $25–40 million in hardware alone. Renting them is expensive too, as we discussed earlier.\n\nAdditionally, the electricity and cooling for running these at full tilt 24/7 are very costly. 
Large training runs can consume megawatt-hours of energy, and only a few companies have the infrastructure to dedicate 10,000 GPUs to a single task.\n\nSupporting hardware like high-speed networking, like InfiniBand or custom interconnects, is also needed to ensure those GPUs can communicate rapidly as they split up the training load. All of this specialized infrastructure drives up the cost significantly.\n\n-   **Extended training durations:** Despite massive parallelism, training a reasoning model can still take weeks or months. During that entire period, the hardware must be powered and utilized.\n\nFor instance, if you run 1,000 GPUs for one month, that’s 1,000 GPU-months of usage, which, at say $2,000 per GPU-month, would be $2 million. Now, consider some runs use 5,000 or 10,000 GPUs for several months. It’s easy to see how the cost racks up to tens of millions of dollars. Every additional epoch of training to slightly improve the model’s performance comes with that price tag.\n\nMoreover, training often isn’t a one-shot process; researchers may do multiple runs while tuning hyperparameters or testing different configurations. The final model you see is the product of many trial runs and experiments behind the scenes. Those experimental runs also consume a lot of compute.\n\nAccording to recent analysis, the total compute used in developing a new model could be [2–3 times the compute of the final training run](https://arxiv.org/html/2405.21015v2), once you account for all the failed or exploratory experiments along the way.\n\n-   **Advanced techniques and multiple phases:** Training a competitive LLM now involves more than just one pass through a dataset. 
For example, after the initial pre-training where you predict the next word on huge text data, many models undergo fine-tuning phases, such as [Reinforcement Learning from Human Feedback](https://arxiv.org/pdf/2504.12501) (RLHF), to align the model with human preferences and instructions.\n\nRLHF involves training additional models (reward models) and running [reinforcement learning algorithms](https://www.cudocompute.com/blog/machine-learning-technique-introduction-to-reinforcement-learning) that themselves require lots of computations on GPUs and [custom data generated by human annotators](https://arxiv.org/pdf/2409.18417), which is expensive.\n\n![cost\\_of\\_llms\\_chart\\_2](https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02b.png)\n\n> Illustration of Vickrey Feedback for RLHF. Source: [Paper](https://arxiv.org/pdf/2409.18417).\n\nOpenAI’s ChatGPT/GPT-4 alignment process, for instance, meant hiring human contractors to [produce conversations and feedback](https://openai.com/index/chatgpt/), and then performing iterative fine-tuning, improving the model’s usefulness and safety, but it adds to the total cost of development, both in terms of money and time, beyond the baseline pre-training.\n\nSimilarly, to imbue reasoning abilities, researchers might fine-tune LLMs on complex problem-solving datasets or use techniques like [chain-of-thought prompting](https://arxiv.org/pdf/2201.11903) and [self-correction](https://arxiv.org/pdf/2203.11171), which can require generating and filtering a lot of model outputs, again consuming compute.\n\nEach additional training objective or stage, be it tutoring the model on logic puzzles or making it follow a constitution of AI principles, introduces extra expense.\n\n-   **Data acquisition and preparation:** Often overlooked in pure compute discussions is the cost of the _data_ itself. High-quality datasets may need to be purchased or scraped and cleaned at scale. 
For example, an LLM might ingest the entire Wikipedia, huge swaths of web text, news, books, academic papers, code repositories, etc. Web crawling and data storage incur infrastructure costs.\n\nAdditionally, if a model is to excel at, say, coding or medical questions, one might need to curate domain-specific data, which could be behind paywalls or require licensing. Companies might pay for access to archives of journals or specific private datasets.\n\nThen comes the preprocessing: filtering out problematic content, deduplicating text, formatting it for training, and so forth. This can involve substantial engineering effort and cloud processing power.\n\nPreparing a trillion-token corpus that is diverse, high-quality, and legally usable is a non-trivial expense, sometimes involving human annotators for labeling or verification in the loop.\n\nPushing the limits of dataset size, model size, and compute scale is expensive because it sits at the edge of what’s technologically possible. To do so requires cutting-edge hardware, huge energy consumption, and often a team of researchers and engineers whose salaries are another factor.\n\nThe result is that only a handful of companies and well-funded academic labs can afford to train the largest models. Others must be clever in using those models or finding ways to achieve results with smaller-scale systems.\n\nIn the following section, we’ll discuss strategies for controlling and reducing the cost of training.\n\n## Steps to controlling the cost of training LLMs\n\nWhile training large models will likely remain resource-intensive, there are several strategies that can optimize resource utilization and reduce expenses. In fact, for most organizations, the question isn’t “_Can we train our own GPT-4?_” but “_How can we achieve our AI goals within a reasonable budget?_”.\n\nBelow are key approaches to managing and minimizing the costs of training and deploying LLMs:\n\n**1\\. 
Implement model and training optimization techniques:** Optimize what you build and how you build it so you need less compute for the same result.\n\n-   **Model architecture selection and sizing:** Carefully choose a model architecture and size that balances performance with cost. Bigger is not always better for a given task. If you can achieve your target accuracy with a 6-billion-parameter model, there’s no need to train a 60-billion-parameter one.\n\nTechniques like model pruning (removing unnecessary weights) or quantization can reduce the effective size and computation of a model without a significant loss in accuracy. Using newer architecture improvements can also get more bang for the buck, as some architectures are more parameter-efficient than others.\n\nThe goal is to avoid overspending on compute capacity you don’t actually need. In some cases, [an ensemble of multiple smaller specialized models](https://www.cudocompute.com/blog/what-is-ensemble-learning) might be better than one large model.\n\n-   **Training data optimization:** Ensure your training data is high-quality and relevant. A smaller, curated dataset can sometimes outperform a larger, noisy one. Removing redundant or low-value data can shorten training time.\n\nFor instance, if you’re building a biomedical language model, 100GB of well-chosen medical text will train faster and yield a more accurate model in that domain than 1TB of random internet text.\n\n![cost\\_of\\_llms\\_chart\\_3](https://www.cudocompute.com/wp-content/uploads/2026/04/Chart_02c.png)\n\n> Amortized hardware cost plus energy cost for the final training run of frontier models. Source: [Paper](https://arxiv.org/html/2405.21015v2)\n\nFiltering data to eliminate spam, duplicated content, or irrelevant information, and augmenting it to add useful diversity, are both important steps. 
Every epoch on unnecessary data is wasted GPU time, so focus on quality over sheer quantity when possible.\n\n-   **Knowledge distillation:** This technique involves training a smaller “student” model to replicate the behavior of a larger “teacher” model. You first train a large model (or take an existing one), then use its outputs to guide the training of a compact model.\n\nThe student model can often achieve near the teacher’s performance but with far fewer parameters, effectively compressing the knowledge of a huge model into a cheaper, faster model that is much easier to deploy and even continue training.\n\nDistillation was used, for example, to create smaller BERT variants and even some DeepSeek distillations, cutting down model size by 10 times or more while retaining most capabilities. For organizations, one strategy is to fine-tune a large model on your task (possibly using an API or a short-term rental on a big GPU machine), then distill it into a smaller model you can afford to train further or run in production.\n\n-   **Mixed-precision training:** Modern hardware and software allow using lower numerical precision for calculations to speed up training. Instead of 32-bit floating point for everything, frameworks use 16-bit or even 8-bit in certain parts of the computation.\n\nThis cuts memory usage and can double the training speed on supported hardware _without_ noticeably affecting model accuracy due to techniques like loss scaling to maintain numerical stability.\n\nThe latest NVIDIA GPUs and Google TPUs all support mixed precision. You can significantly reduce the compute time and cost required for each training step using mixed-precision training. 
It’s a free efficiency gain and has become standard practice in training large models.\n\n-   **Parameter-efficient fine-tuning:** A recent development, particularly useful if you’re working with pre-trained models, is to fine-tune only a subset of the model’s parameters or use add-on adapters, instead of complete end-to-end training.\n\nApproaches such as [low-rank adaptation](https://arxiv.org/pdf/2106.09685) (LoRA) and other adapter-based methods allow you to train just a small additional set of weights or a low-rank update to the original weights that can be merged with the model for new tasks, drastically reducing the compute and data needed to adapt a model to a domain.\n\nFor example, LoRA was used to fine-tune large models like GPT-3 on specific tasks using 10x–100x less compute than tuning the entire model. By training fewer parameters, you use less GPU memory and can get away with shorter training times, all while achieving nearly the same performance as full fine-tuning.\n\nParameter-efficient fine-tuning uses the investment already made in a pre-trained model and avoids having to reinvent the wheel. In general, always prefer fine-tuning or adapting an existing model over training from scratch when possible; it’s very much like [transfer learning](https://www.cudocompute.com/blog/few-shot-learning-everything-you-need-to-know#:~:text=varied%20training%20data.-,Transfer%20learning,-%3A%20Transfer%20learning%20leverages) that can cut requirements by orders of magnitude.\n\n**2\\. Optimize training configurations:** Train smarter, not harder – improve your training process to converge faster or use fewer resources.\n\n-   **Hyperparameter tuning and efficient experimentation:** The choice of [hyperparameters can dramatically affect how fast a model learns](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch/4-steps-to-build-an-ai-model#hyperparameters-tuning). 
A well-tuned training run might reach a target accuracy in 50% fewer steps than a poorly tuned one, which directly translates to 50% less cost.\n\nRather than guessing, use systematic approaches to hyperparameter search (grid search, Bayesian optimization, etc.) on smaller-scale models or subsets of data to find good settings, then scale up.\n\nAlso, monitor training in real-time. If loss has plateaued early or something looks off, intervene or stop rather than let a bad configuration run to completion. Implement an early stopping strategy so that if your model’s performance on a validation set stops improving, it halts the training to avoid wasting epochs that don’t yield gains.\n\nRead more: [How to build an AI](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch)\n\n-   **Checkpointing and fault tolerance:** Training runs can be long, and failures like machine crashes and spot instance termination can happen. By checkpointing regularly (saving the model state every so often), you can resume from the last checkpoint instead of starting over, thereby saving all the progress made until the interruption. This is more about avoiding unnecessary re-training costs rather than reducing the cost of the successful run, but it’s important.\n-   **Smaller-scale and iterative training:** Often, one can pre-train on a smaller scale first to get a sense of things. For example, train a smaller model (say 10% of the full size) on the full data, or the full model on 10% of the data, to identify issues or get learning curve estimates. 
This “pilot run” approach can reveal whether your learning rate is way off or the model is too small to ever reach your target, without spending the full budget.\n\nYou can also use learning rate warmups, cosine decays, and other training schedule tricks to help stabilize the training of large models, avoiding divergence that would force you to restart with different settings, which is another hidden cost.\n\n**4\\. Use advanced model strategies:**\n\nUse model designs that reduce per-model compute and external resources that make the job easier.\n\n-   **Mixture-of-experts (MoE) models:** MoE is an architecture that effectively has multiple sub-models called experts and a gating mechanism that activates only some of them for each input, meaning that at any given time, only a portion of the model’s parameters are used, rather than all of them.\n\nMoEs can significantly cut down the compute needed for a given parameter count – for instance, Google’s Switch Transformer (an MoE model) achieved comparable results to a dense model but with far less computation by sparsely activating experts.\n\nIn that research, a model with 1.6 trillion parameters was trained at the [computational cost of only a 100+ billion parameter dense model](https://arxiv.org/pdf/2101.03961). The benefit is that you get the effect of a huge model (in terms of capacity and specialization) without having to pay the full cost every time.\n\nMoEs spread out the workload and can be more efficient if implemented well. However, they add complexity; not all frameworks easily support them, and they can be tricky to get right. 
But as research advances, MoEs are becoming more common in large-scale settings.\n\nFor cost-savvy training, exploring MoE architectures could allow training frontier-sized models on a smaller budget by **reducing the computational load per token**.\n\n-   **Retrieval and external knowledge integration:** Another way to reduce the burden on the model itself is to give it tools or access to information during training/inference. For example, Retrieval-Augmented Generation (RAG) techniques provide the model with relevant documents fetched from an external database, so the model doesn’t need to memorize every fact; instead, it can look things up.\n\nDeepMind’s RETRO model demonstrated that a model with retrieval support could [match the performance of a model 25 times larger](https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=But%20there%20is%20good%20news%21) that had everything baked into its parameters. By integrating a massive external knowledge base, they maintained performance with a much smaller network, which implies huge training cost savings: [25 times fewer parameters to train is roughly 25 times less compute](https://www.thedecoding.net/p/deepmind-retro-smaller-llm#:~:text=You%20can%20integrate%20a%20massive,up%20to%2025x%20smaller%20networks).\n\nThis approach can be thought of as augmenting the model with a form of memory or tools so it doesn’t have to do all the work internally. Likewise, allowing a model to use a calculator for arithmetic or call an API for current knowledge means you don’t have to train it to death trying to internalize those capabilities.\n\nIn a practical sense, building a system that combines an LLM with retrieval (e.g., queries to Wikipedia or a company knowledge base) can let you use a smaller base model to achieve the same or better performance than an extremely large model without retrieval, cutting down training requirements.\n\nMany current applications, like search engine bots, use this strategy. 
So, from a cost perspective, invest in a slightly more complex system (model + retrieval tool) and you might avoid having to invest in a model that’s 10 times bigger and 10 times more expensive to train.\n\n-   **Multi-step reasoning and curriculum learning:** If you want an LLM with strong reasoning, an efficient approach is to train it to encourage reasoning via multiple steps, rather than expecting the model to solve everything in one go with brute-force parameters.\n\nFor example, techniques like chain-of-thought training involve showing the model intermediate reasoning steps for problems (like math solutions or code logic) during training. This can teach a smaller model to approach complex tasks systematically, narrowing the gap with a larger model that might implicitly learn those skills.\n\nIt’s not a direct cost-saving in compute, but it can mean you reach the desired capability without an exponential increase in model size. Similarly, curriculum learning (starting with easier tasks and then increasing difficulty) can help a model learn faster and converge better, meaning fewer epochs (less cost) to achieve high performance.\n\nThe overarching idea is training smarter. Guiding the model’s learning process can trim the needed compute compared to naive training on a jumble of tasks. This is especially useful for reasoning-centric models where the objective is not just fluent text, but correct and logical solutions.\n\n**5\\. Collaborate and use open-source ecosystems:**\n\nDon’t go at it alone; take advantage of community efforts, existing models, and shared resources.\n\n-   **Use pre-trained models and tools:** Using open-source pre-trained models significantly reduces costs by eliminating the need for full-scale training from scratch. 
Models such as Meta’s LLaMA 2, BLOOM, and DeepSeek R1 are available with pre-trained weights, enabling companies and researchers to fine-tune these models for their specific needs at a fraction of the original training cost.\n\nAs we showed earlier, fine-tuning a large model like LLaMA 2 (70B parameters) typically costs tens of thousands of dollars, substantially less than full-scale initial training.\n\nAdditionally, the open-source ecosystem provides robust tools and libraries optimized for efficient distributed training. Frameworks such as DeepSpeed and Fully Sharded Data Parallel (FSDP) effectively manage large models across limited hardware by sharding model components, allowing greater efficiency and reduced hardware requirements.\n\nHugging Face’s Accelerate and tools like Horovod further streamline distributed training processes. You can benefit from shared knowledge, compute resources, and reduced overall costs by tapping into community-developed resources and collaborating within research communities, like EleutherAI, LAION, and BigScience.\n\nBy applying these strategies, you’ll be able to cut down the cost required to train or fine-tune LLMs. Some emerging efforts have shown it’s possible to reach GPT-3/4 level performance on much smaller budgets.\n\nFor example, the founder of 01.ai claimed they trained a GPT-4-comparable model [using only 2,000 GPUs and about $3 million of compute](https://www.tomshardware.com/tech-industry/artificial-intelligence/chinese-company-trained-gpt-4-rival-with-just-2-000-gpus-01-ai-spent-usd3m-compared-to-openais-usd80m-to-usd100m#:~:text=advanced%20AI%20models%20using%202%2C000%20GPUs%20with%20just%20%243%20million) by optimizing every aspect of the process, which is a stark contrast to the amount reportedly spent on GPT-4 by OpenAI. While such claims have yet to be fully verified, they underscore that efficiency matters. 
The gap between big-budget labs and smaller players can be narrowed with clever engineering.\n\nControlling the cost of training LLMs is about making prudent choices at each step, from choosing the right model size, hardware, data, and tools. Those who do so can build impressive language and reasoning models without breaking the bank.\n\nThe landscape of LLM development is evolving quickly. While the cutting-edge models will always push expensive extremes, the practical know-how for cost-effective training is more available than ever. If you need to learn how, read our guide on [how to build an AI from scratch to learn more](https://www.cudocompute.com/topics/how-to-build-an-ai-project-from-scratch).\n\nWith careful planning, training your LLM or tailoring an existing one to your needs is feasible within a reasonable budget, and CUDO Compute can help you achieve that easily. We offer cost-effective GPUs on demand and scalable GPU clusters equipped with the latest NVIDIA GPUs. You can get started with a few clicks. [Get started](https://compute.cudo.org/)."
}

▼✓Web search query: "Sebastian Raschka LLM pretraining cost breakdown per stage GPU hours compute ...", results: 8

{
  "results": [
    {
      "title": "Sebastian Raschka, PhD's Post - LinkedIn",
      "url": "https://www.linkedin.com/posts/sebastianraschka_sure-its-widely-known-that-pretraining-activity-7278071569384460288-y_NE",
      "description": "Math: - The total number of GPU hours needed is 184,320 hours. - The cost of running one A100 instance per hour is approximately $33. - Each"
    },
    {
      "title": "GPU requirements and cost for training 7B LLaMA 2 model - Facebook",
      "url": "https://www.facebook.com/groups/artificialintelligence.lt/posts/8519734168087012",
      "description": "# AI Lithuania | ## Example: GPU Requirements & Cost for training 7B Llama 2 | Facebook. GPU requirements and cost for training 7B LLaMA 2 model. Summarized by AI from the post below. ## AI Lithuania ·. #### Example: GPU Requirements & Cost for training 7B Llama 2. As per the post – 7B Llama 2 model costs about $760,000 to pretrain – by Dr. Sebastian Raschka, it took a total number of 184,320 GPU hours to train this model. The following is the math:. *   The total number of GPU hours needed is 184,320 hours. *   The cost of running one A100 instance per hour is approximately $33. *   Each instance has 8 A100 GPUs. That’s 184320 / 8 * 33 ~ $760,000. https://vitalflux.com/llm-gpu-memory-requirements-examples/. Image 2: LLM Training & GPU Memory Requirements: Examples - Analytics Yogi. vitalflux.com LLM Training & GPU Memory Requirements: Examples - Analytics Yogi. A100 valandai tikrai ne $33 kainuoja.."
    },
    {
      "title": "An updated back-of-the-envelope calculation of LLM pretraining ...",
      "url": "https://x.com/rasbt/status/1872299621596475524",
      "description": "- The cost of running one A100 instance per hour is approximately $33. - Each instance has 8 A100 GPUs. That's 184320 / 8 * 33 = $760,000."
    },
    {
      "title": "flops-analysis.ipynb - rasbt/LLMs-from-scratch - GitHub",
      "url": "https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/02_performance-analysis/flops-analysis.ipynb",
      "description": "FLOPs (Floating Point Operations Per Second) measure the computational complexity of neural network models by counting the number of floating-point operations"
    },
    {
      "title": "New LLM Pre-training and Post-training Paradigms - Ahead of AI",
      "url": "https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training",
      "description": "Build a Large Language Model (from Scratch) is a highly focused book dedicated to coding LLMs from the ground up in PyTorch, covering everything from pre-training to post-training—arguably the best way to truly understand LLMs. Machine Learning Q and AI is a great book for those who are already familiar with the basics; it dives into intermediate and advanced concepts covering deep neural networks, vision transformers, multi-GPU training paradigms, LLMs, and many more. Interestingly, they also used Qwen models (although they didn't specify details, I assume they mean previous generation Qwen models) to synthesize additional pre-training data. Apple's approach to pre-training and post-training is relatively comprehensive, likely because the stakes are very high (the model is deployed on millions, if not billions, of devices). For the third pre-training stage, the researchers trained the model on a small but high-quality mix, which they found helps improve the performance on benchmark datasets. Interestingly, like Apple's AFM model, Llama 3 also implemented a 3-stage pre-training process."
    },
    {
      "title": "Building LLMs from the Ground Up: A 3-hour Coding Workshop",
      "url": "https://magazine.sebastianraschka.com/p/building-llms-from-the-ground-up",
      "description": "# Building LLMs from the Ground Up: A 3-hour Coding Workshop. If you’d like to spend a few hours this weekend to dive into Large Language Models (LLMs) and understand how they work, I've prepared a 3-hour coding workshop presentation on implementing, training, and using LLMs. Below, you'll find a table of contents to get an idea of what this video covers (the video itself has clickable chapter marks, allowing you to jump directly to topics of interest):. 2:17 – Part 1: Intro to LLMs. 10:48 – Part 2: Understanding LLM input data. 1:45:12 – Part 5.2: Pretrained weights via LitGPT. It's a slight departure from my usual text-based content, but the last time I did this a few months ago, it was so well-received that I thought it might be nice to do another one! Build an LLM from Scratch book. Build an LLM from Scratch GitHub repository. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book."
    },
    {
      "title": "Sebastian Raschka, PhD (@rasbt)",
      "url": "https://substack.com/@rasbt/note/c-190575668",
      "description": "# Sebastian Raschka, PhD (@rasbt): \"The LLM eras: 202x Pre-training (foundation) 2022 RLHF + PPO 2023 LoRA SFT 2024 Mid-Training 2025 RLVR + GRPO 2026 Inference-time scaling? In itself it’s a training technique, but you train the model to output more tokens, which is a form of inference scaling."
    },
    {
      "title": "State of LLMs 2026: RLVR, GRPO, Inference Scaling - YouTube",
      "url": "https://www.youtube.com/watch?v=K5WPr5dtne0",
      "description": ""
    }
  ]
}

▼✓Web search query: "LLM training compute FLOPs scaling law Chinchilla optimal tokens parameters c...", results: 8

{
  "results": [
    {
      "title": "irhum.github.io - Thoughts on Chinchilla",
      "url": "https://irhum.github.io/blog/chinchilla",
      "description": "The answer is in the scaling law literature (Kaplan et al., 2020, Sec 6.1): the “optimal” \\((N\\_{\\text{opt}}, D\\_{\\text{opt}})\\) are the ones that produce a model that achieves the lowest loss on a validation set of the pretraining data, subject to the fixed cost constraint (the green star above). One intuitive (and important) conclusion from the 1:1 scaling of model size and data means, if you want a compute optimal model that’s 2x large, you need to train it on 2x many tokens. This is enormously expensive, and is the core reason model sizes peaked around early-2022 (pre-Chinchilla): we’re only just now doing training runs with \\(C\\) large enough that models of that size (500B+) are *compute optimal*, and future model size scaling will remain slower (compared to pre-Chinchilla) because of this quadratic factor. Suppose the compute optimal model for an initial \\(2.66\\times 10^{21}\\) FLOPs budget is \\(N=2.8\\text{B}\\) params trained on \\(D=156\\text{B}\\) tokens. This work also fitted scaling laws between compute \\(C\\), and model size \\(N\\) and number of tokens \\(D\\)."
    },
    {
      "title": "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws",
      "url": "https://gonzoml.substack.com/p/beyond-chinchilla-optimal-accounting",
      "description": "# Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Continuing the discussion on optimal training regimes for Large Language Models (LLMs) with a focus on inference and moving beyond Chinchilla's guidelines. *Inference* is a crucial phase in the life cycle of most models, akin to the deployment and maintenance phases in software engineering where the software spends most of its life. The first major breakthrough in finding optimal regimes (in terms of computational budget, model size, and number of training tokens) was achieved in the study on **Chinchilla** (). Returning to the article, the authors update Chinchilla's scaling laws to include mass inference and advocate for **longer training of smaller models**. The task in the Chinchilla study was *“to find optimal N (parameters) and Dtr (tokens) within a given computational budget that minimize pre-training loss”*. For example, for a model with Chinchilla-7B quality level and an inference need of 1011 tokens, it's more optimal to train a 6B model on 1.18x the original data."
    },
    {
      "title": "Chinchilla Scaling Laws for Large Language Models (LLMs) - Medium",
      "url": "https://medium.com/@raniahossam/chinchilla-scaling-laws-for-large-language-models-llms-40c434e4e1c1",
      "description": "1,400B (1.4T) tokens should be used to train a data-optimal LLM of size 70B parameters. So, we need around 20 text tokens per parameter. Get"
    },
    {
      "title": "Chinchilla Scaling Laws: Compute-Optimal LLM Training - Interactive",
      "url": "https://mbrenndoerfer.com/writing/chinchilla-scaling-laws-compute-optimal-llm-training",
      "description": "Learn how DeepMind's Chinchilla scaling laws revolutionized LLM training by proving models should use 20 tokens per parameter for"
    },
    {
      "title": "How Long Should You Train Your Language Model? | Databricks Blog",
      "url": "https://www.databricks.com/blog/how-long-should-you-train-your-language-model",
      "description": "# How Long Should You Train Your Language Model? Accounting for Inference in Language Model Scaling Laws. Our recent paper, presented at ICML 2024, **proposes a modified scaling law to account for the cost of** ***both training and inference*****.** This blog post explains the reasoning behind our new scaling law, and then experimentally demonstrates how “overtrained” LLMs can be optimal. The “*Chinchilla”* Scaling Law is the most widely cited scaling law for LLMs. The Chinchilla paper asked the question: If you have a fixed training compute budget, how should you balance model size and training duration to produce the highest quality model? In our research, we were motivated by the idea of training smaller models on more data than the Chinchilla law suggested. By contrast, we assume a *fixed pretraining loss* (i.e. model quality) and find the model size and training duration that *minimize the total compute* over the model’s lifetime, including both training and inference (Fig. 2, Eq. 2)."
    },
    {
      "title": "[PDF] Training Compute-Optimal Large Language Models",
      "url": "https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf",
      "description": "We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more more data. Model Size (# Parameters) Training Tokens LaMDA [52] 137 Billion 768 Billion GPT-3 [6] 175 Billion 300 Billion Jurassic [30] 178 Billion 300 Billion Gopher [38] 280 Billion 300 Billion MT-NLG 530B [48] 530 Billion 270 Billion Chinchilla 70 Billion 1.4 Trillion is typically only feasible to train these large models once, accurately estimating the best model hyperparameters for a given compute budget is critical [51]. From these curves, we extracted the envelope of minimal loss per FLOP, and we used these points to estimate the optimal model size (center) for a given compute budget and the optimal number of training tokens (right). 3 Estimating the optimal parameter/training tokens allocation We present three different approaches to answer the question driving our research: Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?"
    },
    {
      "title": "Accounting for Inference in Language Model Scaling Laws - arXiv",
      "url": "https://arxiv.org/html/2401.00448v2",
      "description": "In this paper, we modify the Chinchilla scaling laws to account for inference costs by calculating the optimal parameter and training token"
    },
    {
      "title": "\"Training Compute-Optimal Large Language Models\", Hoffmann et ...",
      "url": "https://www.reddit.com/r/mlscaling/comments/trwkck/training_computeoptimal_large_language_models",
      "description": "We verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens. Not only does Chinchilla outperform its"
    }
  ]
}
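
The search snippets above can be condensed into a small worked calculation. The sketch below is only an illustration, not code from any of the cited sources: it assumes the widely used approximation C ≈ 6·N·D for training FLOPs (which none of the snippets states explicitly) and the roughly 20 tokens-per-parameter rule of thumb quoted in the results. The function names `training_flops` and `compute_optimal` are hypothetical.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def compute_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget into (N, D) under a fixed tokens-per-parameter ratio.

    With C = 6 * N * D and D = r * N, it follows that N = sqrt(C / (6 * r)).
    The default r = 20 is the rule of thumb from the snippets, not a fitted value.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Chinchilla as reported in the snippets: 70B parameters, 1.4T tokens.
chinchilla_flops = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal(chinchilla_flops)
print(f"budget = {chinchilla_flops:.3e} FLOPs")
print(f"N = {n_opt:.3e} params, D = {d_opt:.3e} tokens")
```

Under these assumptions the Chinchilla configuration is recovered from its own budget, because 1.4T / 70B is exactly 20 tokens per parameter. Inference-aware variants such as the Databricks result would push the ratio higher than 20.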

▼thought

Good, I now have comprehensive data. Let me also get the specific Galileo article for the "28x" human data annotation exceeding compute claim, and the specific calculation formulas.

▼✓Fetch web page content, URL: "https://www.abaka.ai/blog/llm-data-cost"

{
  "url": "https://www.abaka.ai/blog/llm-data-cost",
  "title": "Abaka AI | Abaka AI - AI Data Annotation & Solution - Your Data Partner In The AI Industry",
  "summary": "Abaka AI offers data collection, data cleaning, data annotation, and high-quality datasets for world-class Automobile AI, Generative AI, and Embodied AI industry leaders.",
  "content": "<div id=\"readability-page-1\" class=\"page\"><section><article data-v-8def9226=\"\"><!--[--><h2 lexical-key=\"1\" id=\"llm-data-cost-breakdown-all-you-need-to-know-about-data-costs-for-training-an-llm\" data-v-6ee1d66a=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->LLM Data Cost Breakdown: All You Need to Know About Data Costs for Training an LLM<!--]--><!--]--><!--]--></span><!--]--><!--]--></h2><h2 lexical-key=\"3\" id=\"_1-introduction\" data-v-4b2aea56=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->1. Introduction<!--]--><!--]--><!--]--></span><!--]--><!--]--></h2><p lexical-key=\"5\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->With the rapid development of Large Language Models (LLMs), more and more enterprises are considering applying LLMs to their actual businesses. However, in the process of implementing LLMs, data cost is often an important factor that cannot be ignored. As decision-makers, understanding the data requirements and related costs of LLM training at various stages is crucial for the successful implementation of the project.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"7\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->ABAKA AI will take you on an in-depth exploration of the three key stages of LLM training: Pre-training, Supervised Fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), analyzing the data requirement characteristics of each stage and their impact on costs. 
We will provide a detailed interpretation of the composition of LLM data costs from multiple dimensions such as data volume, data quality, and data diversity, as well as how to optimize data investment while ensuring model performance.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"9\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Whether you are a corporate executive just starting to explore LLM applications, or a technical leader who has already made achievements in the AI field, we will provide you with a comprehensive and practical LLM data cost assessment framework to help you navigate AI implementation decisions with ease, using our past experience to help you organize a framework for calculating costs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h2 lexical-key=\"11\" id=\"_2-pre-training-stage\" data-v-4b2aea56=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2. Pre-training Stage<!--]--><!--]--><!--]--></span><!--]--><!--]--></h2><h3 lexical-key=\"13\" id=\"_21-dataset-scale-estimation\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.1. Dataset Scale Estimation<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"15\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Estimating the required pre-training dataset size given a computational budget C is the first step in implementing an LLM project. This process involves different Scaling Laws, the most famous of which are OpenAI's Scaling Law and DeepMind's Chinchilla Law.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"17\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->OpenAI's research published in 2020 proposed the initial Scaling Laws, indicating a power-law relationship between model performance and model parameter count, dataset size, and computational resources. 
However, the Chinchilla Law proposed by DeepMind in 2022 revised this, arguing that model size and training data should be scaled in equal proportion, which works out to roughly 20 training tokens per parameter.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"19\" data-v-091cadd3=\"\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->OpenAI Scaling Law<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"22\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$L ( N, D )=\\left[ \\left( \\frac{N_{c}} {N} \\right)^{\\frac{\\alpha_{N}} {\\alpha_{D}}}+\\frac{D_{c}} {D} \\right]^{\\alpha_{D}}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"24\" data-v-091cadd3=\"\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->DeepMind Scaling Law<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"27\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$\\hat{L} ( N, D ) \\triangleq E+\\frac{A} {N^{\\alpha}}+\\frac{B} {D^{\\beta}}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"29\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Both formulas express the model's loss ($L$ or $\\hat{L}$) as a function of the model parameter count ($N$) and the dataset size ($D$).<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"31\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->These two formulas represent different understandings and modeling methods of LLM scaling behavior. In practical applications, we often need to balance model size against data volume. For example, to reduce inference costs, we can consider using smaller models with more data. Research by Hoffmann et al. 
[1] shows that under a fixed computational budget, a well-trained small model may perform better than an undertrained large model. Specifically, if we originally planned to train an 8B parameter model but want to reduce inference costs, we can consider replacing it with a model with fewer parameters (such as 7B) while increasing the amount of training data. This approach may not only maintain or even improve model performance but also significantly reduce deployment and operational costs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"33\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->The first step of data budgeting is crucial, determining the size of the model and the size of the pre-train dataset needed. ABAKA AI can build high-quality datasets for you, while having more stock data that can precisely match more suitable data according to your needs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/1.png\" alt=\"\n    Data scraping capabilities of ABAKA AI\n  \" title=\"\n    Data scraping capabilities of ABAKA AI\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    Data scraping capabilities of ABAKA AI\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><h3 lexical-key=\"37\" id=\"_22-multi-domain-data-ratio\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.2. Multi-domain Data Ratio<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"39\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->The pre-training corpus can include various types of text data, such as web pages, academic materials, books, and relevant texts from different fields, such as legal documents, annual financial reports, medical textbooks, and other domain-specific data. 
In the pre-training stage, LLMs learn broad knowledge from massive unlabeled text data and store it in model parameters, thus acquiring a certain level of language understanding and generation capabilities.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"41\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->A general pre-training corpus is a large-scale dataset composed of a large amount of text from different domains and sources. Research by Liu, Yang et al. [2] divides general data into eight major categories: web pages, language text, books, academic materials, code, parallel corpora, social media, and encyclopedias. In the pre-training process of the model, the diversity and quality of data are crucial, so careful design of the ratio of these different categories of data is needed when constructing the pre-training dataset.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Web data<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Web data is one of the most widely used sources of pre-training data. The data usually exists in Hypertext Markup Language (HTML) format, showing certain structural features, and is rich in topics, covering content from different fields and disciplines. However, web data may also contain noise and low-quality content, so careful screening and cleaning are required.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Language text<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Language text data mainly consists of two parts. 
The first part is electronic text data built based on a wide range of sources of written and oral language, usually presented in the form of large corpora of specific languages; the second part is electronic text data built based on relevant written materials in various fields or topics. For example, FinGLM covers annual reports of some listed companies from 2019 to 2021. This type of data belongs to language text materials in the financial field.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"3\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Books<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Book data is also one of the common data types in pre-training corpora. Compared with web pages, books have longer text content and higher data quality, both of which help improve the performance of large language models. Book data provides knowledge with both depth and breadth, allowing models to improve understanding ability and knowledge reserve while learning deeper contextual information.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"4\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Academic materials<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Academic material data refers to text data related to academic fields, including but not limited to academic papers, journal articles, conference papers, research reports, patents, etc. These data are written and published by experts and scholars in academia, with high professionalism and academic rigor. Including them in pre-training corpora can provide more accurate and professional information, helping models understand terminology and knowledge within academic fields. Academic literature, papers, and textbooks provide examples of professional and technical language use, as well as the latest scientific discoveries. 
This type of data is particularly important for improving model performance in professional fields.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"5\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Code<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: The code data category refers to text information containing programming languages, such as Python, Java, C++, and other code snippets. Its purpose is to help models better understand programming languages and code structures. Code datasets can not only enhance programming capabilities but may also improve logical reasoning abilities. This type of data enables LLMs to understand and generate code in various programming languages, providing support for software development and code analysis tasks.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"6\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Parallel corpora<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Parallel corpus data refers to a collection of text or sentence pairs in different languages. These text pairs are translations of each other, where one text is in the source language (e.g., English) and the corresponding text is in the target language (e.g., Chinese). The introduction of parallel corpus data is crucial for improving the machine translation capabilities and cross-lingual task performance of large language models.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"7\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Social media<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Social media data refers to text content collected from various media platforms, mainly including user-generated posts, comments, and conversations between users, reflecting informal, colloquial language use. It contains a large amount of slang, new words, and diverse expressions. 
Although social media data may contain harmful information such as bias, discrimination, and violence, it is still crucial for the pre-training of large language models. This is because social media data is beneficial for models to learn expressive abilities in conversational communication and capture social trends, user behavior patterns, etc.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"8\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Encyclopedia<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->: Encyclopedia data refers to text information extracted from encyclopedias, online encyclopedia websites, or other knowledge databases. Data from online encyclopedia websites is written and edited by experts, volunteers, or community contributors, with a certain degree of authority and reliability. Due to its easy accessibility, it is included in pre-training corpora at a higher frequency, becoming a cornerstone for enhancing the knowledge base of large language models.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"68\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Reasonably configuring this pre-training data can significantly improve the performance and applicability of LLMs. The quality and diversity of data are often more important than the sheer volume of data. 
Based on the need for high-quality, multi-domain data ratios, ABAKA AI carefully considers the characteristics and value of each type of data when designing pre-training datasets, adjusting the ratio according to your specific needs to help you achieve high-quality and precise pre-training dataset ratios, reducing model training costs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/image.png\" alt=\"\n    The distribution of data types in the corpora used for pre-training by different models\n  \" title=\"\n    The distribution of data types in the corpora used for pre-training by different models\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    The distribution of data types in the corpora used for pre-training by different models\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"72\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->This image shows the distribution of data types in the corpora used by different models during pre-training. Each pie chart represents a model and indicates the proportions of various data types. Different data types are distinguished by different colors, including web pages, code, encyclopedias, books, academic materials, social media, language text, and diverse data.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"74\" id=\"_23-training-data-acquisition\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.3. Training Data Acquisition<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"76\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Although open-source datasets provide a foundation for model training, many truly valuable and unique data often do not appear in public channels. 
Therefore, targeted crawling of data from specific domains or sources has become a key strategy for improving model performance and competitiveness. The acquisition of this part of data is very necessary. In terms of high-quality training data acquisition, ABAKA AI can provide you with deeper insights, higher timeliness, and more unique data in targeted acquisitions, helping you improve model performance and accuracy in vertical domains and enhance the model's understanding of the latest information and trends.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"78\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Channels for targeted data acquisition usually include data crawling, commercial database subscriptions, data cooperation and exchange, etc. Except for web crawlers, other channels are too customized, so this section only discusses the relevant content of data crawling. Data crawling does not have high requirements for infrastructure, so in the following calculations, we only consider development costs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"80\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Before development, more importantly, is to choose suitable data sources. Crawling from suitable data sources can significantly improve the model's performance in specific domains. 
After determining the data source, the development and crawling costs mainly come from the following aspects:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Development cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"86\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{dev} = (S_{dev} × D_{initial}) + (S_{dev} × D_{update})$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"88\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $D_{initial}$ and $D_{update}$ are the time for initial development and updating the crawling code after website updates, respectively. The complexity of the website, verification mechanisms, request complexity, etc. will all affect the development time.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"2\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Maintenance cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"94\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{ ops} = S_{ops} × D_{crawl} × α$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"96\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Maintenance costs may not be full-time, so a coefficient $α (0 &lt; α ≤ 1)$ can be introduced to represent the actual proportion of maintenance time needed. If the data needs continuous updating or the crawling period is very long, then maintenance personnel intervention is needed to keep the crawler running normally and respond to website changes. 
If the crawler system uses a distributed strategy, more maintenance support may be needed.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"3\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"3\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->IP proxy pool<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"102\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{ip} = (\\frac{N_{req}}{N_{req\\_per\\_ip}}) × C_{ip}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"104\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $N_{req}$ is the total number of requests, $N_{req\\_per\\_ip}$ is the number of requests each IP can handle, and $C_{ip}$ is the unit price of each IP. Factors such as the website's IP restriction policy, total data volume, IP quality, IP geographic location requirements, proxy type, etc. will affect the price.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"4\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"4\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Crawling material cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"110\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{mat} = C_{mem} × N_{mem} × (D_{crawl} / D_{mem\\_validity})$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"112\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $C_{mem}$ is the price of one membership, $N_{mem}$ is the number of memberships required, $D_{crawl}$ is the crawling period (in days), and $D_{mem\\_validity}$ is the validity period of a membership (in days). Factors such as membership level, concurrency strategy, etc. will affect the final budget. 
If the target website requires registration or membership to download, then this cost needs to be considered.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"114\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->So overall:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"116\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{total} = S_{dev} × (D_{initial} + D_{update}) + S_{ops} × D_{crawl} × α + (N_{req} / N_{req\\_per\\_ip}) × C_{ip} + (C_{mem} × N_{mem} × D_{crawl}) / D_{mem\\_validity}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"118\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Generally speaking, crawling a vertical domain website costs between 1,500 and 15,000 USD depending on the difficulty, with large social networking sites costing more. ABAKA AI can provide you with deeper insights, higher timeliness, more unique, and higher quality data, and reduce the total acquisition cost by 70%, helping you train excellent large language models in various dimensions.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"122\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->A large amount of high-quality LLM pre-training data exists in the form of PDFs or scanned images. Due to the diversity of layouts and formats and the varying quality of scanned images, utilizing this data to build datasets is a challenging task, requiring the conversion of this content into data formats like markdown for use. 
The core problems mainly focus on two aspects: extracting content information and layout information (including body text, titles, figure captions, images, tables, formulas) and handling the relationships between layout elements.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"124\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->When processing multiple open-source datasets, ABAKA AI observed several excellent open-source solutions, such as PP-StructureV2, Marker, Vary, and Nougat, but they each have shortcomings. PP-StructureV2 cannot identify LaTeX format content and lacks necessary post-processing steps; Marker covers fewer languages and doesn't handle figures well; Nougat has limited support for multi-column data and can identify limited languages, while Vary / Vary-toy consumes more computational resources.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"126\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Based on these situations, ABAKA AI, as a member of the Multimodal Art Projection (M-A-P) team, fully participated in building the completely open-source large language model MAP-Neo, which also open-sourced the Document Convert Pipeline. 
This pipeline can better balance performance and computational overhead, while the decoupling between modules brings better interpretability and makes it easier to upgrade, add, and replace different modules, providing a more flexible, efficient, and CPU-friendly solution.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/2.png\" alt=\"\n  \" title=\"\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"130\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->In addition to using models for conversion, many vendors provide similar services, such as mathpix, Doc2x, Paodin PDFlux, pix2text, X Information, X Xun Cloud Large Model Knowledge Engine Document Parsing, etc. Therefore, we provide two ways to calculate costs below:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Self-built conversion service cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"136\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{convert} = (\\frac{N_{pages}}{R_{process}}) × C_{node} × (1 + F_{complexity}) + C_{integration}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"138\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $N_{pages}$ is the total number of documents, $R_{process}$ is the number of documents processed per node per day, $C_{node}$ is the price per node per day, $F_{complexity}$ is the document complexity factor ($0 ≤ F_{complexity} ≤ 1$). 
Generally speaking, the layout and fonts of magazines and newspapers will be more complex, while literature and patents will have richer images and tables. These factors need to be considered when specifying budgets. $C_{integration}$ is the cost of deployment, updating strategies/models, and maintenance. This part of the cost will vary greatly depending on the task.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"2\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Third-party service cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"144\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{convert} = \\sum_{i=1}^{n} C_{tier,i} \\times N_{pages,i} + C_{integration}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"146\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $n$ is the number of price tiers, $C_{tier,i}$ is the price per page for the i-th tier, $N_{pages,i}$ is the number of pages in the i-th tier, $C_{integration}$ is the cost of API integration and maintenance.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"148\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->The choice between these methods depends on multiple factors, including the number and type of documents, required conversion quality, availability of internal resources, and budget constraints. In fact, in most cases, easy data is converted using one's own servers, while difficult data uses commercial-grade services.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"150\" id=\"_25-training-data-cleaning\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.5. 
Training Data Cleaning<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"152\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Although the raw data obtained through web crawling, document conversion, and open-source datasets provides a foundation for model training, this data usually contains noise, errors, biases, and false information, which will reduce the training effectiveness of the model. Therefore, data cleaning becomes a key step in improving model performance and reliability. To obtain high-quality data, ABAKA AI can provide you with cleaner and more refined data cleaning, significantly improving data quality, thereby enhancing the model's performance on specific tasks, strengthening the model's ability to understand complex patterns, and reducing misleading learning due to data issues.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/3.png\" alt=\"\n    Fineweb data cleaning pipeline\n  \" title=\"\n    Fineweb data cleaning pipeline\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    Fineweb data cleaning pipeline\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"156\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->More important than the cleaning itself is formulating an appropriate cleaning strategy beforehand. This requires a thorough understanding of data characteristics, model requirements, and potential data quality issues. The strategy should take into account factors such as data scale, complexity, and domain characteristics. For cost estimation, take as an example the Matrix dataset of the MAP-Neo model, built with the participation of ABAKA AI and Ge Zhang et al. [3]. Matrix released 4.7T tokens of data and is arguably one of the highest-quality and largest bilingual datasets. 
The general approach to data cleaning for the Matrix dataset follows the principle of \"from coarse to fine\" and \"from simple to complex\". We can divide the cleaning steps into the following main stages:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Heuristic filtering<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：Heuristic rule filtering is the first line of defense, aimed at quickly identifying and deleting low-quality data. This step has low computational cost but can significantly reduce the amount of data for subsequent processing. Filtering criteria include: URL; blacklist word table; gibberish text filter; document length; proportion of special characters; proportion of short, continuous, or incomplete lines; repeated words; n-grams or paragraphs. The filtering thresholds are based on statistical analysis of large document samples. Heuristic rules can effectively identify and remove low-quality data, preventing low-quality pre-training corpora from affecting model performance. As the team used composite data from multiple sources, based on data diversity, the team specially designed cleaning methods and tailored rules for each method to maintain consistency in data quality.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Data deduplication<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：Many studies have shown that repetitive text may lead to a decline in model performance, making deduplication a key step in corpus processing (although this point is somewhat controversial, more repetitive data may precisely indicate that this part of the data is of high quality, which is an important feature. 
For example, Fineweb's view is that more deduplication does not necessarily mean better performance; if deduplication is performed across dumps, performance may actually be worse).<!--]--><!--]--><!--]--></span><br><span><!--[--><!--[--><!--[-->a. <!--]--><!--]--><!--]--></span><b><!--[--><strong><!--[--><!--[--><!--[-->Exact duplication<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：Exact document deduplication is a method used to evaluate whether an entire text is completely identical to another text. If found to be completely identical, the duplicate is deleted. Due to the large amount of data, clusters must be used for processing, and memory insufficiency problems may also occur. In practice, we store text data in batches in different storage buckets. Then process the data in each storage bucket in turn to remove duplicates.<!--]--><!--]--><!--]--></span><br><span><!--[--><!--[--><!--[-->b. <!--]--><!--]--><!--]--></span><b><!--[--><strong><!--[--><!--[--><!--[-->Near-duplicate<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：For near-duplicates, we use the MinHash LSH deduplication method to remove them as much as possible, which is particularly suitable for web data and is widely used in similarity search and duplicate detection in large datasets. It can handle very common scenarios where the text content is basically the same, but the scattered template blocks of web pages are different. The principle of MinHash is to represent a set with smaller hash values, and then these hash values can be used to estimate the Jaccard similarity between two sets. The computational cost of this step is still quite high.<!--]--><!--]--><!--]--></span><br><span><!--[--><!--[--><!--[-->c. 
<!--]--><!--]--><!--]--></span><b><!--[--><strong><!--[--><!--[--><!--[-->Similar Line<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：To handle the same content appearing multiple times within a text, a direct method is to split the text into lines using specific delimiters and then compare the lines for similarity. When two lines are similar, the later one is deleted.<!--]--><!--]--><!--]--></span><br><span><!--[--><!--[--><!--[-->d. In addition, paragraph deduplication and substring deduplication were also performed to achieve better results.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"3\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Quality screening<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：After data cleaning, FineWeb-Edu used the Llama3-70B-Instruct model to score the data and then trained a BERT-like classification model on those scores. The classification model was then used to filter the data, greatly improving data quality. 
In addition to using models for data quality screening, many developers use fasttext models for language identification when cleaning CC datasets.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/4.png\" alt=\"\n    On the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n  \" title=\"\n    On the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    On the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png\" alt=\"\n    Deduplication did not show the expected performance improvement in this experiment\n  \" title=\"\n    Deduplication did not show the expected performance improvement in this experiment\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    Deduplication did not show the expected performance improvement in this experiment\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"185\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Based on the above steps, we can calculate the cost of data cleaning:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Engineer debugging and rule determination cost<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"191\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{engineer} = S_{eng} \\times (T_{rules} + 
T_{debug})$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"193\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $S_{eng}$ is the engineer's daily salary (USD/day), and $T_{rules}$ and $T_{debug}$ are the time (days) required for formulating the rules and for debugging and optimizing them.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"2\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Storage costs<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"199\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{storage} = C_{storage} \\times V_{data} \\times T_{retention}$, where $C_{storage}$ is the storage cost per TB per month, $V_{data}$ is the total data volume (TB), and $T_{retention}$ is the data retention time (months).<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"3\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"3\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Computation costs<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"205\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{compute} = \\sum_{i=1}^{n} [C_i \\times \\frac{V_{data,i}}{R_i} \\times (1 + \\beta_i \\times (F_{comm} + F_{ops}))]$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"207\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $i$ represents the processing stage (1 to n), $C_i$ is the unit cost of computing resources for the i-th stage (USD/day), $V_{data,i}$ is the data volume for the i-th stage (TB), $R_i$ is the processing rate for the i-th stage (TB/day), $\\beta_i$ is a binary indicator showing whether the i-th stage uses cluster processing (0 for single-node 
processing, 1 for cluster processing), $F_{comm}$ and $F_{ops}$ are the communication and operational overheads of using clusters. Because cluster processing is operationally complex and costly, heuristic filtering runs first to shrink the data volume before any cluster stage.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol start=\"4\" data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"4\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Quality screening<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"213\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{quality} = C_{train} \\times T_{training} + C_{annotation} + C_{inference} \\times \\frac{V_{data}}{R_{inference}}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"215\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Where $C_{train}$ and $C_{inference}$ are the computational costs per day for training and inference, which usually differ significantly in price, $T_{training}$ is the training time (days), $C_{annotation}$ is the data annotation cost, and $\\frac{V_{data}}{R_{inference}}$ is the time needed to complete inference over all data.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"217\" id=\"_26-data-cost-calculation\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->2.6. Data Cost Calculation<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"219\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->High-quality data processing comes at a cost. From data acquisition to the final cleaning process, each step involves complex computations and human resource investments, all of which translate into actual costs. 
This chapter will combine ABAKA AI's previous content and rich experience to provide you with some feasible ideas, hoping to help you calculate data costs when implementing LLMs.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"221\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->Based on the data processing flow described earlier, we can roughly divide data costs into the following main categories: storage costs, data acquisition costs, data conversion costs, and data cleaning costs. We hope to help you establish an intuitive budget system through ABAKA AI's past rich experience:<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><ol data-v-9ffa60ee=\"\"><!--[--><!--[--><li value=\"1\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Storage costs<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：In this field, data scale is far larger than general projects, with pre-training datasets reaching PB levels. Single machines cannot meet such large-scale data storage needs, and projects also have high bandwidth requirements. Therefore, distributed storage is generally used. Distributed storage facilitates horizontal expansion, can meet growing storage needs, and has data backup and fault tolerance mechanisms, ensuring high data reliability. Multi-node parallel read and write can also improve I/O performance. Generally, the capacity price of distributed storage is about 85 USD/T (NVME + HDD), meaning 1PB of available storage space costs about 85,000 USD. 
With security redundancy, network equipment, and security equipment added, the cost approaches 99,000 USD/PB.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"2\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Data acquisition<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：All historical data from a well-known large website can be estimated at around 42,500-70,500 USD, with incremental updates costing about 14,000 USD annually. For vertical domain websites, it could be anywhere from 4,200 to 14,000 USD. Video websites are three to five times more expensive than ordinary websites (bandwidth, storage), and overseas websites are two to three times more expensive (overseas proxies, overseas servers, compliance). Assuming you need to crawl 8 mainstream social media and news websites + 15 vertical domain websites (such as code, mathematics, finance), a budget of 706,000 USD would be appropriate.<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"3\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Document information extraction<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：Based on ABAKA AI's experience, using the pipeline developed by ABAKA AI for document conversion is more cost-effective and flexible. With consumer-grade GPUs, the conversion cost per page is about 0.000035 USD, far lower than Mathpix's 0.025 / 0.01 USD per page. Many capable domestic vendors are now also working in this area, and we look forward to better models and lower prices from domestic service providers. 
All in all, including the time for Gap and debugging, estimate about 14,000 USD for every 10,000,000 pages of documents (80% using your own model + 20% using third-party services).<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><li value=\"4\"><!--[--><!--[--><b><!--[--><strong><!--[--><!--[--><!--[-->Data cleaning<!--]--><!--]--><!--]--></strong><!--]--></b><span><!--[--><!--[--><!--[-->：The cost of this step mainly depends on how many data sources there are and their domains. When processing very dirty data, ABAKA AI used over 1,000 cores for about a month, adding many special rules to obtain higher quality data, with a data retention rate of less than 1%. Therefore, this part of the cost can be calculated as follows:<!--]--><!--]--><!--]--></span><!--]--><!--]--></li><!--]--><!--]--></ol><p lexical-key=\"236\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->$Budget_{\\text{clean}} = S_{\\text{eng}} + \\frac{V_{\\text{data}}}{100\\text{T}} \\times C_{\\text{base}}$<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"238\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->That is, the data cleaning cost for each domain consists of two weeks' salary for an algorithm engineer ($S_{\\text{eng}}$) plus 2,800 USD ($C_{\\text{base}}$) for every 100T cleaned, assuming that, once the cluster is set up, the cleaning cost increases linearly with data volume. Model-based quality screening in the style of FineWeb-Edu (Llama3-70B plus a BERT-like classifier) is also quite affordable; it only slightly increases the cost per 100T.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"240\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->In summary, preparing pre-training data for LLMs is a complex and costly process. It involves multiple stages, including data acquisition, storage, document information extraction, and data cleaning, each requiring careful planning and substantial investment. 
The quality and diversity of data are crucial to the model's final performance, so each stage should be optimized as much as possible within budget constraints. At the same time, we find that the value of experienced algorithm engineers cannot be overlooked. Their experience and expertise can help teams avoid many potential pitfalls and detours. In LLM projects, the cost of taking detours due to human resource issues is often surprisingly high, potentially leading to a waste of considerable time and resources.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h2 lexical-key=\"242\" id=\"_3-sft-rlhf-stages\" data-v-4b2aea56=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->3. SFT &amp; RLHF Stages<!--]--><!--]--><!--]--></span><!--]--><!--]--></h2><p lexical-key=\"244\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->In the training process of large language models (LLMs), Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two closely connected key stages. Although these two stages differ in technical implementation and specific objectives, they share significant similarities in terms of data requirements and cost composition. In this chapter, we combine these two stages for discussion, primarily because their core costs are concentrated on data annotation and requirement definition, a characteristic that results in many commonalities in data preparation and cost estimation.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><h3 lexical-key=\"246\" id=\"_31-characteristics-of-sft-datasets\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->3.1. Characteristics of SFT Datasets<!--]--><!--]--><!--]--></span><!--]--><!--]--></h3><p lexical-key=\"248\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->SFT datasets consist of a series of text pairs, including \"instruction input\" and \"answer output\". 
\"Instruction input\" represents requests made by humans to the model, covering various types such as classification, summarization, rewriting, etc. \"Answer output\" is the response generated by the model based on the instruction, meeting human expectations. There are four methods to construct instruction fine-tuning datasets: manual creation; model generation, such as using the Self-Instruct method; collecting and improving existing open-source datasets; and combining the above three methods.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png\" alt=\"\n    Different ways to build SFT datasets\n  \" title=\"\n    Different ways to build SFT datasets\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    Different ways to build SFT datasets\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><p lexical-key=\"252\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->There are generally two approaches to constructing artificially generated datasets. The first approach involves directly creating instruction text sets according to given requirements and rules by company employees, volunteers, annotation platform staff, and others. Whether designing instruction sets, writing annotation guidelines, or conducting actual data annotation and quality control, it requires a significant investment of human time and effort. For example, the creation of the Databricks-dolly-15k dataset involved thousands of Databricks employees who generated over 15,000 records across multiple instruction categories. The second approach involves scraping human-generated real question-and-answer data from web pages and standardizing it into an instruction format. 
Examples include datasets like InstructionWild_v2, LCCC, and Zhihu-KOL, which construct datasets by aggregating and organizing content from social chats, code-related Q&amp;As, and other sources.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><p lexical-key=\"254\" data-v-091cadd3=\"\"><!--[--><!--[--><span><!--[--><!--[--><!--[-->In ABAKA AI's past practices, the first approach has been more commonly used to construct datasets. Meanwhile, Liu, Yang, et al. [2] believe that datasets constructed in this manner are of higher quality and cleaner due to processing and review by professional annotators. After human processing, these datasets are also more interpretable and more consistent with human understanding. Researchers have flexible control over the training samples and can adjust them according to different tasks, making them more versatile.<!--]--><!--]--><!--]--></span><!--]--><!--]--></p><figure><!--[--><img src=\"https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/7.png\" alt=\"\n    ABAKA AI possesses high-quality finished datasets across multiple domains\n  \" title=\"\n    ABAKA AI possesses high-quality finished datasets across multiple domains\n  \" data-v-18814304=\"\"><figcaption><!--[--><!--[--><!--[-->\n    ABAKA AI possesses high-quality finished datasets across multiple domains\n  <!--]--><!--]--><!--]--></figcaption><!--]--></figure><h3 lexical-key=\"258\" id=\"_32-characteristics-of-rlhf-datasets\"><!--[--><!--[--><",
  "markdown": "## LLM Data Cost Breakdown: All You Need to Know About Data Costs for Training an LLM\n\n## 1\\. Introduction\n\nWith the rapid development of Large Language Models (LLMs), more and more enterprises are considering applying LLMs to their actual businesses. However, in the process of implementing LLMs, data cost is often an important factor that cannot be ignored. As decision-makers, understanding the data requirements and related costs of LLM training at various stages is crucial for the successful implementation of the project.\n\nABAKA AI will take you on an in-depth exploration of the three key stages of LLM training: Pre-training, Supervised Fine-tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), analyzing the data requirement characteristics of each stage and their impact on costs. We will provide a detailed interpretation of the composition of LLM data costs from multiple dimensions such as data volume, data quality, and data diversity, as well as how to optimize data investment while ensuring model performance.\n\nWhether you are a corporate executive just starting to explore LLM applications, or a technical leader who has already made achievements in the AI field, we will provide you with a comprehensive and practical LLM data cost assessment framework to help you navigate AI implementation decisions with ease, using our past experience to help you organize a framework for calculating costs.\n\n## 2\\. Pre-training Stage\n\n### 2.1. Dataset Scale Estimation\n\nEstimating the required pre-training dataset size given a computational budget C is the first step in implementing an LLM project. This process involves different Scaling Laws, the most famous of which are OpenAI's Scaling Law and DeepMind's Chinchilla Law.\n\nOpenAI's research published in 2020 proposed the initial Scaling Laws, indicating a power-law relationship between model performance and model parameter count, dataset size, and computational resources. 
However, the Chinchilla Law proposed by DeepMind in 2022 revised this, arguing that for compute-optimal training the model parameter count and the number of training tokens should be scaled in equal proportion (roughly 20 training tokens per parameter), which implies far more data than the earlier law suggested.\n\n**OpenAI Scaling Law**:\n\n$L(N, D) = \\left[ \\left( \\frac{N_{c}}{N} \\right)^{\\frac{\\alpha_{N}}{\\alpha_{D}}} + \\frac{D_{c}}{D} \\right]^{\\alpha_{D}}$\n\n**DeepMind Scaling Law**:\n\n$\\hat{L}(N, D) \\triangleq E + \\frac{A}{N^{\\alpha}} + \\frac{B}{D^{\\beta}}$\n\nBoth formulas express the model's loss ($L$ or $\\hat{L}$, lower is better) as a function of model parameter count ($N$) and dataset size ($D$).\n\nThese two formulas represent different understandings and modeling methods of LLM scaling behavior. In practical applications, we often need to balance between model size and data volume. For example, to reduce inference costs, we can consider using smaller models with more data. Research by Hoffmann et al. \\[1\\] shows that under a fixed computational budget, a well-trained small model may perform better than an undertrained large model. Specifically, if we originally planned to train an 8B parameter model but want to reduce inference costs, we can consider replacing it with a model with fewer parameters (such as 7B) while increasing the amount of training data. This approach may not only maintain or even improve model performance but also significantly reduce deployment and operational costs.\n\nThis first step of data budgeting is crucial because it determines both the size of the model and the size of the pre-training dataset needed. ABAKA AI can build high-quality datasets for you and also holds stock data that can be matched to your needs.\n\n![Data scraping capabilities of ABAKA AI](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/1.png \"Data scraping capabilities of ABAKA AI\")\n\nData scraping capabilities of ABAKA AI\n\n### 2.2. 
Multi-domain Data Ratio\n\nThe pre-training corpus can include various types of text data, such as web pages, academic materials, and books, as well as domain-specific texts such as legal documents, annual financial reports, and medical textbooks. In the pre-training stage, LLMs learn broad knowledge from massive unlabeled text data and store it in model parameters, thus acquiring a certain level of language understanding and generation capabilities.\n\nA general pre-training corpus is a large-scale dataset composed of a large amount of text from different domains and sources. Research by Liu, Yang et al. \\[2\\] divides general data into eight major categories: web pages, language text, books, academic materials, code, parallel corpora, social media, and encyclopedias. In the pre-training process of the model, the diversity and quality of data are crucial, so careful design of the ratio of these different categories of data is needed when constructing the pre-training dataset.\n\n1.  **Web data**: Web data is one of the most widely used sources of pre-training data. The data usually exists in Hypertext Markup Language (HTML) format, showing certain structural features, and is rich in topics, covering content from different fields and disciplines. However, web data may also contain noise and low-quality content, so careful screening and cleaning are required.\n2.  **Language text**: Language text data mainly consists of two parts. The first part is electronic text data built from a wide range of sources of written and oral language, usually presented in the form of large corpora of specific languages; the second part is electronic text data built from relevant written materials in various fields or topics. For example, FinGLM covers annual reports of some listed companies from 2019 to 2021. This type of data belongs to language text materials in the financial field.\n3.  
**Books**: Book data is also one of the common data types in pre-training corpora. Compared with web pages, books have longer text content and higher data quality, both of which help improve the performance of large language models. Book data provides knowledge with both depth and breadth, allowing models to improve understanding ability and knowledge reserve while learning deeper contextual information.\n4.  **Academic materials**: Academic material data refers to text data related to academic fields, including but not limited to academic papers, journal articles, conference papers, research reports, patents, etc. These data are written and published by experts and scholars in academia, with high professionalism and academic rigor. Including them in pre-training corpora can provide more accurate and professional information, helping models understand terminology and knowledge within academic fields. Academic literature, papers, and textbooks provide examples of professional and technical language use, as well as the latest scientific discoveries. This type of data is particularly important for improving model performance in professional fields.\n5.  **Code**: The code data category refers to text information containing programming languages, such as Python, Java, C++, and other code snippets. Its purpose is to help models better understand programming languages and code structures. Code datasets can not only enhance programming capabilities but may also improve logical reasoning abilities. This type of data enables LLMs to understand and generate code in various programming languages, providing support for software development and code analysis tasks.\n6.  **Parallel corpora**: Parallel corpus data refers to a collection of text or sentence pairs in different languages. These text pairs are translations of each other, where one text is in the source language (e.g., English) and the corresponding text is in the target language (e.g., Chinese). 
The introduction of parallel corpus data is crucial for improving the machine translation capabilities and cross-lingual task performance of large language models.\n7.  ****Social media****: Social media data refers to text content collected from various media platforms, mainly including user-generated posts, comments, and conversations between users, reflecting informal, colloquial language use. It contains a large amount of slang, new words, and diverse expressions. Although social media data may contain harmful information such as bias, discrimination, and violence, it is still crucial for the pre-training of large language models. This is because social media data is beneficial for models to learn expressive abilities in conversational communication and capture social trends, user behavior patterns, etc.\n8.  ****Encyclopedia****: Encyclopedia data refers to text information extracted from encyclopedias, online encyclopedia websites, or other knowledge databases. Data from online encyclopedia websites is written and edited by experts, volunteers, or community contributors, with a certain degree of authority and reliability. Due to its easy accessibility, it is included in pre-training corpora at a higher frequency, becoming a cornerstone for enhancing the knowledge base of large language models.\n\nReasonably configuring this pre-training data can significantly improve the performance and applicability of LLMs. The quality and diversity of data are often more important than the sheer volume of data. 
Based on the need for high-quality, multi-domain data ratios, ABAKA AI carefully considers the characteristics and value of each type of data when designing pre-training datasets, adjusting the ratio according to your specific needs to help you achieve high-quality and precise pre-training dataset ratios, reducing model training costs.\n\n![\nThe distribution of data types in the corpora used for pre-training by different models\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/image.png \"\nThe distribution of data types in the corpora used for pre-training by different models\n\")\n\nThe distribution of data types in the corpora used for pre-training by different models\n\nThis image shows the distribution of data types in the corpora used by different models during pre-training. Each pie chart represents a model and indicates the proportions of various data types. Different data types are distinguished by different colors, including web pages, code, encyclopedias, books, academic materials, social media, language text, and diverse data.\n\n### 2.3. Training Data Acquisition\n\nAlthough open-source datasets provide a foundation for model training, many truly valuable and unique data often do not appear in public channels. Therefore, targeted crawling of data from specific domains or sources has become a key strategy for improving model performance and competitiveness. The acquisition of this part of data is very necessary. In terms of high-quality training data acquisition, ABAKA AI can provide you with deeper insights, higher timeliness, and more unique data in targeted acquisitions, helping you improve model performance and accuracy in vertical domains and enhance the model's understanding of the latest information and trends.\n\nChannels for targeted data acquisition usually include data crawling, commercial database subscriptions, data cooperation and exchange, etc. 
Except for web crawlers, other channels are too customized, so this section only discusses the relevant content of data crawling. Data crawling does not have high requirements for infrastructure, so in the following calculations, we only consider development costs.\n\nBefore development, more importantly, is to choose suitable data sources. Crawling from suitable data sources can significantly improve the model's performance in specific domains. After determining the data source, the development and crawling costs mainly come from the following aspects:\n\n1.  ****Development cost****：\n\n$Budget\\_{dev} = (S\\_{dev} × D\\_{initial}) + (S\\_{dev} × D\\_{update})$\n\nWhere $D\\_{initial}$ and $D\\_{update}$ are the time for initial development and updating the crawling code after website updates, respectively. The complexity of the website, verification mechanisms, request complexity, etc. will all affect the development time.\n\n2.  ****Maintenance cost****：\n\n$Budget\\_{ ops} = S\\_{ops} × D\\_{crawl} × α$\n\nMaintenance costs may not be full-time, so a coefficient $α (0 < α ≤ 1)$ can be introduced to represent the actual proportion of maintenance time needed. If the data needs continuous updating or the crawling period is very long, then maintenance personnel intervention is needed to keep the crawler running normally and respond to website changes. If the crawler system uses a distributed strategy, more maintenance support may be needed.\n\n3.  ****IP proxy pool****：\n\n$Budget\\_{ip} = (\\\\frac{N\\_{req}}{N\\_{req\\_per\\_ip}}) × C\\_{ip}$\n\nWhere $N\\_{req}$ is the total number of requests, $N\\_{req\\_per\\_ip}$ is the number of requests each IP can handle, and $C\\_{ip}$ is the unit price of each IP. Factors such as the website's IP restriction policy, total data volume, IP quality, IP geographic location requirements, proxy type, etc. will affect the price.\n\n4.  
****Crawling material cost****：\n\n$Budget\\_{mat} = C\\_{mem} × N\\_{mem} × (D\\_{crawl} / D\\_{mem\\_validity})$\n\nWhere $C\\_{mem}$ is the unit price of a membership, $N\\_{mem}$ is the number of memberships required, $D\\_{crawl}$ is the crawling duration (in days), and $D\\_{mem\\_validity}$ is the validity period of a membership (in days). Factors such as membership level, concurrency strategy, etc. will affect the final budget. This cost only applies if the target website requires registration or membership to download.\n\nSo overall:\n\n$Budget\\_{total} = S\\_{dev} × (D\\_{initial} + D\\_{update}) + S\\_{ops} × D\\_{crawl} × α + (N\\_{req} / N\\_{req\\_per\\_ip}) × C\\_{ip} + (C\\_{mem} × N\\_{mem} × D\\_{crawl}) / D\\_{mem\\_validity}$\n\nGenerally speaking, crawling a vertical domain website costs between 1,500 and 15,000 USD depending on the difficulty, with large social networking sites costing more. ABAKA AI can provide you with deeper insights, higher timeliness, more unique, and higher quality data, and reduce the total acquisition cost by 70%, helping you train excellent large language models in various dimensions.\n\nA large amount of high-quality LLM pre-training data exists in the form of PDFs or scanned images. Due to the diversity of layouts and formats and the varying quality of scanned images, utilizing this data to build datasets is a challenging task, requiring the conversion of this content into data formats like markdown for use. The core problems mainly focus on two aspects: extracting content information and layout information (including body text, titles, figure captions, images, tables, formulas) and handling the relationships between layout elements.\n\nWhen processing multiple open-source datasets, ABAKA AI observed several excellent open-source solutions, such as PP-StructureV2, Marker, Vary, and Nougat, but they each have shortcomings. 
PP-StructureV2 cannot identify LaTeX format content and lacks necessary post-processing steps; Marker covers fewer languages and doesn't handle figures well; Nougat has limited support for multi-column data and can identify limited languages, while Vary / Vary-toy consumes more computational resources.\n\nBased on these situations, ABAKA AI, as a member of the Multimodal Art Projection (M-A-P) team, fully participated in building the completely open-source large language model MAP-Neo, which also open-sourced the Document Convert Pipeline. This pipeline can better balance performance and computational overhead, while the decoupling between modules brings better interpretability and makes it easier to upgrade, add, and replace different modules, providing a more flexible, efficient, and CPU-friendly solution.\n\n![\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/2.png \"\n\")\n\nIn addition to using models for conversion, many vendors provide similar services, such as mathpix, Doc2x, Paodin PDFlux, pix2text, X Information, X Xun Cloud Large Model Knowledge Engine Document Parsing, etc. Therefore, we provide two ways to calculate costs below:\n\n1.  ****Self-built conversion service cost****：\n\n$Budget\\_{convert} = (\\\\frac{N\\_{pages}}{R\\_{process}}) × C\\_{node} × (1 + F\\_{complexity}) + C\\_{integration}$\n\nWhere $N\\_{pages}$ is the total number of documents, $R\\_{process}$ is the number of documents processed per node per day, $C\\_{node}$ is the price per node per day, $F\\_{complexity}$ is the document complexity factor ($0 ≤ F\\_{complexity} ≤ 1$). Generally speaking, the layout and fonts of magazines and newspapers will be more complex, while literature and patents will have richer images and tables. These factors need to be considered when specifying budgets. $C\\_{integration}$ is the cost of deployment, updating strategies/models, and maintenance. This part of the cost will vary greatly depending on the task.\n\n2.  
****Third-party service cost****：\n\n$Budget\\_{convert} = \\\\sum\\_{i=1}^{n} C\\_{tier,i} \\\\times N\\_{pages,i} + C\\_{integration}$\n\nWhere $n$ is the number of price tiers, $C\\_{tier,i}$ is the price per page for the i-th tier, $N\\_{pages,i}$ is the number of pages in the i-th tier, $C\\_{integration}$ is the cost of API integration and maintenance.\n\nThe choice between these methods depends on multiple factors, including the number and type of documents, required conversion quality, availability of internal resources, and budget constraints. In fact, in most cases, easy data is converted using one's own servers, while difficult data uses commercial-grade services.\n\n### 2.5. Training Data Cleaning\n\nAlthough the raw data obtained through web crawling, document conversion, and open-source datasets provides a foundation for model training, this data usually contains noise, errors, biases, and false information, which will reduce the training effectiveness of the model. Therefore, data cleaning becomes a key step in improving model performance and reliability. To obtain high-quality data, ABAKA AI can provide you with cleaner and more refined data cleaning, significantly improving data quality, thereby enhancing the model's performance on specific tasks, strengthening the model's ability to understand complex patterns, and reducing misleading learning due to data issues.\n\n![\nFineweb data cleaning pipeline\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/3.png \"\nFineweb data cleaning pipeline\n\")\n\nFineweb data cleaning pipeline\n\nBefore starting cleaning, more importantly, is to formulate appropriate cleaning strategies. This requires a thorough understanding of data characteristics, model requirements, and potential data quality issues. The formulation of cleaning strategies should consider factors such as data scale, complexity, domain characteristics, etc. 
In terms of cost estimation, taking the Matrix dataset of the MAP-Neo large model jointly participated by ABAKA AI and Ge Zhang et al. \\[3\\] as an example, the Matrix dataset released 4.7T tokens of data, which can be said to be one of the highest quality and largest scale bilingual datasets. The general approach to data cleaning for the Matrix dataset follows the principle of \"from coarse to fine\" and \"from simple to complex\". We can divide the cleaning steps into the following main stages:\n\n1.  ****Heuristic filtering****：Heuristic rule filtering is the first line of defense, aimed at quickly identifying and deleting low-quality data. This step has low computational cost but can significantly reduce the amount of data for subsequent processing. Filtering criteria include: URL; blacklist word table; gibberish text filter; document length; proportion of special characters; proportion of short, continuous, or incomplete lines; repeated words; n-grams or paragraphs. The filtering thresholds are based on statistical analysis of large document samples. Heuristic rules can effectively identify and remove low-quality data, preventing low-quality pre-training corpora from affecting model performance. As the team used composite data from multiple sources, based on data diversity, the team specially designed cleaning methods and tailored rules for each method to maintain consistency in data quality.\n2.  ****Data deduplication****：Many studies have shown that repetitive text may lead to a decline in model performance, making deduplication a key step in corpus processing (although this point is somewhat controversial, more repetitive data may precisely indicate that this part of the data is of high quality, which is an important feature. For example, Fineweb's view is that more deduplication does not necessarily mean better performance; if deduplication is performed across dumps, performance may actually be worse).  \n    a. 
****Exact duplication****：Exact document deduplication is a method used to evaluate whether an entire text is completely identical to another text. If found to be completely identical, the duplicate is deleted. Due to the large amount of data, clusters must be used for processing, and memory insufficiency problems may also occur. In practice, we store text data in batches in different storage buckets. Then process the data in each storage bucket in turn to remove duplicates.  \n    b. ****Near-duplicate****：For near-duplicates, we use the MinHash LSH deduplication method to remove them as much as possible, which is particularly suitable for web data and is widely used in similarity search and duplicate detection in large datasets. It can handle very common scenarios where the text content is basically the same, but the scattered template blocks of web pages are different. The principle of MinHash is to represent a set with smaller hash values, and then these hash values can be used to estimate the Jaccard similarity between two sets. The computational cost of this step is still quite high.  \n    c. ****Similar Line****：To solve the problem of the same content appearing multiple times in the text, a direct method is to divide the text into multiple lines using specific delimiters, and then compare the similarity between each line. If they are similar, subsequent lines are deleted.  \n    d. In addition, paragraph deduplication and substring deduplication were also performed to achieve better results.\n3.  ****Quality screening****：After data cleaning, Fineweb-edu used the LLama3-70B-Instruct model to score the data and trained a Bert-like classification model. The classification model was then used to filter the data, greatly improving data quality. 
In addition to using models for data quality screening, many developers use fastText models for language identification when cleaning Common Crawl (CC) datasets.\n\n![\nOn the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/4.png \"\nOn the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n\")\n\nOn the left is the retention rate for processing English data and on the right is the retention rate for Chinese\n\n![\nDeduplication did not show the expected performance improvement in this experiment\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png \"\nDeduplication did not show the expected performance improvement in this experiment\n\")\n\nDeduplication did not show the expected performance improvement in this experiment\n\nBased on the above steps, we can calculate the cost of data cleaning:\n\n1.  ****Engineer debugging and rule determination cost****：\n\n$Budget\\_{engineer} = S\\_{eng} \\\\times (T\\_{rules} + T\\_{debug})$\n\nWhere $S\\_{eng}$ is the developer's daily salary (USD/day), and $T\\_{rules}$ and $T\\_{debug}$ are the time (in days) required for formulating the rules and for debugging and optimizing them.\n\n2.  ****Storage costs****：\n\n$Budget\\_{storage} = C\\_{storage} \\\\times V\\_{data} \\\\times T\\_{retention}$\n\nWhere $C\\_{storage}$ is the storage cost per TB per month, $V\\_{data}$ is the total data volume (TB), $T\\_{retention}$ is the data retention time (months).\n\n3.  
****Computation costs****：\n\n$Budget\\_{compute} = \\\\sum\\_{i=1}^{n} \\[C\\_i \\\\times \\\\frac{V\\_{data,i}}{R\\_i} \\\\times (1 + \\\\beta\\_i \\\\times (F\\_{comm} + F\\_{ops}))\\]$\n\nWhere $i$ represents the processing stage (1 to n), $C\\_i$ is the unit cost of computing resources for the i-th stage (USD/day), $V\\_{data,i}$ is the data volume for the i-th stage (TB), $R\\_i$ is the processing rate for the i-th stage (TB/day), $\\\\beta\\_i$ is a binary indicator showing whether the i-th stage uses cluster processing (0 for single-node processing, 1 for cluster processing), $F\\_{comm}$ and $F\\_{ops}$ are the communication and operational overheads of using clusters. Using clusters is troublesome and costly, so we use heuristic filtering as the first step.\n\n4.  ****Quality screening****：\n\n$Budget\\_{quality} = C\\_{train} \\\\times T\\_{training} + C\\_{data\\_annotation} + C\\_{inference} \\\\times \\\\frac{V\\_{data}}{R\\_{inference}}$\n\nWhere $C\\_{train}$ and $C\\_{inference}$ are the computational costs for training and inference, which usually differ significantly in price, $T\\_{training}$ is the training time (days), $C\\_{data\\_annotation}$ is the annotation cost, $\\\\frac{V\\_{data}}{R\\_{inference}}$ indicates the time needed to complete inference for all data.\n\n### 2.6. Data Cost Calculation\n\nHigh-quality data processing comes at a cost. From data acquisition to the final cleaning process, each step involves complex computations and human resource investments, all of which translate into actual costs. This chapter will combine ABAKA AI's previous content and rich experience to provide you with some feasible ideas, hoping to help you calculate data costs when implementing LLMs.\n\nBased on the data processing flow described earlier, we can roughly divide data costs into the following main categories: storage costs, data acquisition costs, data conversion costs, and data cleaning costs. 
We hope to help you establish an intuitive budget system through ABAKA AI's past rich experience:\n\n1.  ****Storage costs****：In this field, data scale is far larger than general projects, with pre-training datasets reaching PB levels. Single machines cannot meet such large-scale data storage needs, and projects also have high bandwidth requirements. Therefore, distributed storage is generally used. Distributed storage facilitates horizontal expansion, can meet growing storage needs, and has data backup and fault tolerance mechanisms, ensuring high data reliability. Multi-node parallel read and write can also improve I/O performance. Generally, the capacity price of distributed storage is about 85 USD/TB (NVMe + HDD), meaning 1PB of available storage space costs about 85,000 USD. Adding safety redundancy, network equipment, and security equipment, the cost will approach 99,000 USD/PB.\n2.  ****Data acquisition****：All historical data from a well-known large website can be estimated at around 42,500-70,500 USD, with incremental updates costing about 14,000 USD annually. For vertical domain websites, it could be anywhere from 4,200 to 14,000 USD. Video websites are three to five times more expensive than ordinary websites (bandwidth, storage), and overseas websites are two to three times more expensive (overseas proxies, overseas servers, compliance). Assuming you need to crawl 8 mainstream social media and news websites + 15 vertical domain websites (such as code, mathematics, finance), a budget of 706,000 USD would be appropriate.\n3.  ****Document information extraction****：Based on ABAKA AI's experience, using ABAKA AI's developed Pipeline for document conversion is more cost-effective and flexible. If using consumer-grade GPUs for conversion, the cost per page is about 0.000035 USD, far lower than Mathpix's 0.025 / 0.01 USD per page. 
Of course, we now see many good domestic manufacturers trying in this area, and we look forward to better models and cheaper prices from domestic service providers. All in all, including the time for Gap and debugging, estimate about 14,000 USD for every 10,000,000 pages of documents (80% using your own model + 20% using third-party services).\n4.  ****Data cleaning****：The cost of this step mainly depends on how many data sources there are and their domains. When processing very dirty data, ABAKA AI used over 1,000 cores for about a month, adding many special rules to obtain higher quality data, with a data retention rate of less than 1%. Therefore, this part of the data can be calculated as follows:\n\n$S\\_{\\\\text{eng}} + \\\\frac{V\\_{\\\\text{data}}}{100\\\\text{T}} \\\\times C\\_{\\\\text{base}}$\n\nThat is, the data cleaning cost for each domain consists of two weeks' salary for an algorithm engineer + 2,800 USD for cleaning every 100T, assuming the data cleaning cost increases linearly with data volume when the cluster is set up. For example, like Fineweb-edu using Llama3-70B and Bert-like models, the price is also quite affordable, just slightly increase the cost per 100T.\n\nIn summary, preparing pre-training data for LLMs is a complex and costly process. It involves multiple stages, including data acquisition, storage, document information extraction, and data cleaning, each requiring careful planning and substantial investment. The quality and diversity of data are crucial to the model's final performance, so each stage should be optimized as much as possible within budget constraints. At the same time, we find that the value of experienced algorithm engineers cannot be overlooked. Their experience and expertise can help teams avoid many potential pitfalls and detours. In LLM projects, the cost of taking detours due to human resource issues is often surprisingly high, potentially leading to a waste of considerable time and resources.\n\n## 3\\. 
SFT & RLHF Stages\n\nIn the training process of large language models (LLMs), Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two closely connected key stages. Although these two stages differ in technical implementation and specific objectives, they share significant similarities in terms of data requirements and cost composition. In this chapter, we combine these two stages for discussion, primarily because their core costs are concentrated on data annotation and requirement definition, a characteristic that results in many commonalities in data preparation and cost estimation.\n\n### 3.1. Characteristics of SFT Datasets\n\nSFT datasets consist of a series of text pairs, including \"instruction input\" and \"answer output\". \"Instruction input\" represents requests made by humans to the model, covering various types such as classification, summarization, rewriting, etc. \"Answer output\" is the response generated by the model based on the instruction, meeting human expectations. There are four methods to construct instruction fine-tuning datasets: manual creation; model generation, such as using the Self-Instruct method; collecting and improving existing open-source datasets; and combining the above three methods.\n\n![\nDifferent ways to build SFT datasets\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/5.png \"\nDifferent ways to build SFT datasets\n\")\n\nDifferent ways to build SFT datasets\n\nThere are generally two approaches to constructing artificially generated datasets. The first approach involves directly creating instruction text sets according to given requirements and rules by company employees, volunteers, annotation platform staff, and others. Whether designing instruction sets, writing annotation guidelines, or conducting actual data annotation and quality control, it requires a significant investment of human time and effort. 
For example, the creation of the Databricks-dolly-15k dataset involved thousands of Databricks employees who generated over 15,000 records across multiple instruction categories. The second approach involves scraping human-generated real question-and-answer data from web pages and standardizing it into an instruction format. Examples include datasets like InstructionWild, v2LCCC, and Zhihu-KOL, which construct datasets by aggregating and organizing content from social chats, code-related Q&As, and other sources.\n\nIn ABAKA AI's past practices, the first approach has been more commonly used to construct datasets. Meanwhile, Liu, Yang, et al. \\[2\\] believe that datasets constructed in this manner are of higher quality and cleaner due to processing and review by professional annotators. After human processing, these datasets become more interpretable and more consistent with human understanding, thus increasing their explainability. Researchers have flexible control over the training samples and can adjust them according to different tasks, making them more versatile.\n\n![\nABAKA AI possesses high-quality finished datasets across multiple domains\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/7.png \"\nABAKA AI possesses high-quality finished datasets across multiple domains\n\")\n\nABAKA AI possesses high-quality finished datasets across multiple domains\n\n### 3.2. Characteristics of RLHF Datasets\n\nRLHF datasets are collections of instructions that provide preference evaluations for multiple responses to the same input prompt. Typically, they consist of instruction pairs with different responses, including feedback from humans or other models. This setup reflects the relative preferences of humans or models for different responses in a given task or context. 
The feedback information in preference datasets is usually expressed through voting, ranking, scoring, or other forms of comparison.\n\nPreference datasets are primarily used in the alignment phase of large models, aiming to help align model outputs more closely with human preferences and expectations. Alignment with human preferences is mainly reflected in three aspects: helpfulness (the ability to follow instructions), honesty (avoiding fabrication of information), and safety (avoiding the generation of illegal or harmful information).\n\n![\nDifferent ways to build the RLHF dataset\n](https://doxhub.s3.us-east-1.amazonaws.com/abaka/20240909/LLMDataCost/8.png \"\nDifferent ways to build the RLHF dataset\n\")\n\nDifferent ways to build the RLHF dataset\n\nRLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback) both utilize reinforcement learning methods to optimize models using feedback signals. In addition to fine-tuning with instruction datasets, preference datasets can be used to train reward models. Subsequently, the Proximal Policy Optimization (PPO) algorithm can be applied for further fine-tuning based on feedback from the reward model.\n\n### 3.3. Data Cost Calculation\n\nIn the SFT and RLHF stages, data costs primarily come from the following aspects:\n\n1.  ****Rule Design Cost****：\n\n$Budget\\_{analysis} = T\\_{total} \\\\times (R\\_{expert} \\\\times S\\_{expert} + R\\_{engineer} \\\\times S\\_{engineer} + R\\_{user} \\\\times S\\_{user})$\n\nHere $T\\_{total}$ is the total time spent on rule design, $R\\_{x} (0 < R\\_{x} ≤ 1)$ is the participation ratio of each role, and $S\\_{x}$ is that role's salary per unit of time. Algorithm engineers ($S\\_{engineer}$) understand the model's capability boundaries, domain experts ($S\\_{expert}$) provide professional knowledge and insights, and users ($S\\_{user}$) offer frontline usage scenarios and requirement feedback. This step is both necessary and important. 
Carefully designed rules can significantly improve data quality, directly affecting model performance, and good rule design can increase annotation efficiency and reduce rework rates. Although the detailed rule design process may increase initial costs, its value far exceeds these expenses. It not only improves data and model quality but also brings long-term benefits to the entire project and organization.\n\n2.  ****Instruction Dataset Construction Cost****：\n\n$Budget\\_{instruction} = \\\\frac{N\\_{instructions}}{R\\_{creation\\_speed}} \\\\times S\\_{annotator} + \\\\frac{N\\_{instructions} \\\\times R\\_{review}}{R\\_{review\\_speed}} \\\\times S\\_{reviewer}$\n\nWhere: $N\\_{instructions}$ is the total number of instructions, $R\\_{creation\\_speed}$ is the number of instructions an annotator can produce per hour, $S\\_{annotator}$ is the average hourly wage of annotators, $R\\_{review}$ is the review sampling rate, $S\\_{reviewer}$ and $R\\_{review\\_speed}$ are the average hourly wage of reviewers and the number of instructions a reviewer can review per hour, respectively.\n\n3.  ****Cost of building RLHF dataset****：\n\n$Budget\\_{RLHF} = T\\_{generation} \\\\times C\\_{GPU\\_cluster} + \\\\frac{N\\_{instructions} \\\\times \\\\alpha}{R\\_{ranking\\_speed}} \\\\times S\\_{annotator} + Budget\\_{review}$\n\nThe first part is the inference cost for generating responses, the second part is the cost of manual annotation, and the third part is the review cost. The choice of annotation method and strategy greatly affects $\\\\alpha$, the number of judgments required per instruction. For example, if each instruction has $N\\_{responses}$ responses that need to be compared pairwise, then $\\\\alpha = C(N\\_{responses}, 2)$ and the cost of manual annotation becomes:\n\n$\\\\frac{N\\_{instructions} \\\\times C(N\\_{responses}, 2)}{R\\_{ranking\\_speed}} \\\\times S\\_{annotator}$\n\nIf it's a rating system, then $R\\_{ranking\\_speed}$ would significantly increase, so choosing an appropriate evaluation method is a key factor in constructing RLHF datasets. 
It not only affects data quality but also directly determines the cost structure. The choice and orientation of review strategies will likewise significantly impact costs. Considering the complexity of these factors and their interactions, it is indeed challenging to provide a universal cost formula, which is why we have not presented a specific formula.\n\nIn practice, it's often necessary to validate and optimize evaluation and audit strategies through small-scale pilot tests before expanding to the full dataset. This iterative approach not only helps optimize costs but also continuously improves data quality and annotation guidelines throughout the process.\n\nBased on ABAKA AI's past experience, assuming we collect 1,000 IMO-level math problems, considering the demand is already well-established, the main costs will be concentrated on annotation and auditing. The cost for annotators is 20 USD per hour, with an estimated rate of one problem per hour. Including other expenses, the budget is estimated at 28,000 USD. However, if we adopt ABAKA AI's RLHF data construction method, utilizing modern proof tools like LEAN, the processing efficiency would be much higher than response construction, handling approximately 4-6 pairs per hour.\n\nAt this point, we have established a comprehensive evaluation system that allows us to assess data prices according to requirements.\n\nFor example, if the boss wants the model to possess knowledge in a specific domain, or even become state-of-the-art in that field, we can choose CPT to add knowledge. Based on D-CPT Law\\[4\\] and REGMIX\\[5\\], we can calculate that approximately 100B of domain data might be needed. We can crawl 12 target websites to cover 70B of data, and the remaining 30B can be filtered from public datasets. After CPT, we can add a few thousand SFT data points. 
The data portion might cost around 42,000 USD, including: approximately 28,000 USD for data crawling from 12 websites + about 2,800 USD for downloading and filtering several dozen TB of data using the deepseek math method + constructing SFT data at about 6 USD per entry, totaling 12,000 USD for 2,000 entries.\n\nCost Item\n\nCost / USD\n\nPre-training from scratch\n\n140,000 - 7 million\n\nCPT\n\n70,000 - 1 million\n\nSFT\n\n5,600 - 140,000 per domain\n\nRLHF data\n\n1,400 - 56,000 per domain\n\nThe above estimates are based on current market data and ABAKA's years of industry experience, providing the most common budget range framework to help you more intuitively estimate overall data cost expenses.\n\nABAKA AI can reduce costs by 40%-60% at various stages based on the above framework. In the process of building high-quality training datasets, ABAKA provides professional solutions based on rich data processing experience. The intelligent data engineering platform Abaka Forge Platform and highly specialized, standardized data processing services offered by ABAKA AI empower the construction of training data, helping you train LLMs using high-quality datasets and enabling you to better understand the resources and investment required for your project.\n\n## 4\\. Reference\n\n1.  Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. \"Training Compute-Optimal Large Language Models.\" arXiv, March 29, 2022. [http://arxiv.org/abs/2203.15556](http://arxiv.org/abs/2203.15556 \"Training Compute-Optimal Large Language Models\").\n2.  Liu, Yang, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin. \"Datasets for Large Language Models: A Comprehensive Survey.\" arXiv, February 27, 2024. [http://arxiv.org/abs/2402.18041](http://arxiv.org/abs/2402.18041 \"Datasets for Large Language Models: A Comprehensive Survey.\").\n3.  
Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, et al. \"MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.\" arXiv, June 2, 2024. [http://arxiv.org/abs/2405.19327](http://arxiv.org/abs/2405.19327 \"MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.\").\n4.  Que, Haoran, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, et al. \"D-CPT Law: Domain-Specific Continual Pre-Training Scaling Law for Large Language Models.\" arXiv, June 3, 2024. [http://arxiv.org/abs/2406.01375](http://arxiv.org/abs/2406.01375 \"D-CPT Law: Domain-Specific Continual Pre-Training Scaling Law for Large Language Models.\").\n5.  Liu, Qian, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. \"RegMix: Data Mixture as Regression for Language Model Pre-Training.\" arXiv, July 1, 2024. [http://arxiv.org/abs/2407.01492](http://arxiv.org/abs/2407.01492 \"RegMix: Data Mixture as Regression for Language Model Pre-Training.\")."
}

▼✓Web search query: "LLM pre-training FLOPs calculation formula \"C ≈ 6ND\" cost per GPU hour traini...", results: 8

{
  "results": [
    {
      "title": "The FLOPs Calculus of Language Model Training - Medium",
      "url": "https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-language-model-training-3b19c1f025e4",
      "description": "Without further ado, meet the Transformer FLOPs Equation: C ≈ 6ND. A slightly more sophisticated version of the equation expresses the"
    },
    {
      "title": "How Long Does It Take to Train the LLM From Scratch? | Towards Data Science",
      "url": "https://towardsdatascience.com/how-long-does-it-take-to-train-the-llm-from-scratch-a1adb194c624",
      "description": "The goal is to quantify the computational requirements for processing data and updating model parameters during training in terms of **FLOPs** (floating point operations). To sum up, training FLOPs for the transformer model of size N and dataset of P tokens can be estimated as:. However, the theoretical maximum FLOPS for a GPU is often less relevant in practice when training Large Language Models. For example, the authors of the LLaMA 3 paper reported an MFU of 38%, or 380 teraflops of throughput per GPU, when training with 16,000 GPUs. Reported TFLOPs throughput per GPU training Llama3 models as reported in the paper for different configurations. The total FLOPs required to train a model of this size can be calculated as follows:. Once you know the FLOPS per GPU in the training setup, you can calculate the total GPU hours required to train a model of a given size and dataset."
    },
    {
      "title": "How Much Does LLM Training Cost? - Galileo AI",
      "url": "https://galileo.ai/blog/llm-model-training-cost",
      "description": "# How Much Does LLM Training Cost? Understanding the costs of training LLMs is essential for those looking to create and train models or simply to understand the dynamics of AI as an industry. In this article, we'll explore the cost of training LLM models, the factors that contribute to LLM training costs, and strategies to optimize them effectively. * Training frontier LLMs costs $78M-$192M+ for compute alone. * Key cost drivers have shifted—human data annotation now exceeds compute costs by up to 28x. * Fine-tuning pre-trained models costs 60-90% less than training from scratch. ## **How Much Does It Cost to Train an LLM?**. To appreciate the scale of investment involved in training LLMs, consider the verified costs of notable models. According to the Stanford AI Index Report 2025, frontier model training costs have escalated dramatically—with GPT-4's training estimated at $78-100+ million, and Gemini Ultra 1.0 reaching $192 million, representing a 287,000x increase from the cost of a Transformer model in 2017 ($670)."
    },
    {
      "title": "Cost of Training Large Language Models: A Simple Calculation | Ashish Kulkarni posted on the topic | LinkedIn",
      "url": "https://www.linkedin.com/posts/ashishakulkarni_llm-pretraining-foundationmodels-activity-7373222564543807488-haqO",
      "description": "Software optimizations are the key: NVIDIA Dynamo -> provides disaggregated serving splitting prefill and decode phases of inference onto separate GPUs at scale NVFP4 -> quantizes the model weights to 4-bit floating point executed on the GPU Tensor Cores CUDA Graphs -> reduce CPU overhead That's how we supercharge NVIDIA Data Center into a true AI Factory! Disaggregated serving splits these phases: - Prefill runs on high-compute GPUs - Decode runs on memory-optimized hardware - Each phase gets what it needs The results are striking: - 6.4x throughput improvements - 20x reduction in latency variance - 15-40% infrastructure cost reduction Microsoft's implementation showed 1.4x higher throughput at 20% lower cost."
    },
    {
      "title": "Optimize for inference too, not just training FLOPs | MatX",
      "url": "https://matx.com/research/lifetime_llm_cost",
      "description": "# Optimize for inference too, not just training FLOPs. Large Language Models (LLMs) have been shaped by the scaling laws, as established by Kaplan et al., 2020 and Hoffmann et al., 2022. They guide us to design models that optimize for training cost but often overlook inference costs. During training, KV computation is usually cheaper than the rest of the model, but during inference, loading the KV cache becomes the dominant expense. Since training is compute-bound, the training FLOPs budget effectively determines the optimal model size and the number of training tokens needed to maximize model performance. While the loading of model parameters can be amortized by using larger batch sizes, the KV cache grows both with the **batch size** and the **sequence length**, making memory bandwidth a potential bottleneck (Pope et al., 2022). * Cost to load the full KV cache: 69GB×HOI≈41 TFLOPs. In this case, the cost of loading the KV cache in FLOP-equivalents is 20 times greater than the compute cost of performing the forward pass."
    },
    {
      "title": "What is the cost of training large language models? - CUDO Compute",
      "url": "https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models",
      "description": "### CUDO arrow Resources arrow What is the cost of training large language models? # What is the cost of training large language models? Large language models (LLMs) like OpenAI’s GPT series and Google’s BERT have become foundational technologies powering many applications, from automated customer service to advanced research tools. For example, while the original 2017 Transformer that introduced the core LLM architecture cost only about $900 to train, the compute cost for training GPT-3 (175 billion parameters) was estimated in 2020 to range from about $500,000 up to $4.6 million, depending on the hardware and optimization techniques used. Training OpenAI’s GPT-4 reportedly cost more than $100 million, with some estimates ranging up to $78 million in compute cost, and Google’s Gemini Ultra model is estimated to have cost $191 million in training compute. The emergence of reasoning in LLMs is partly a byproduct of scale, as larger models trained on more data tend to develop more advanced capabilities, and partly due to specialized training or fine-tuning on data that encourages logical reasoning."
    },
    {
      "title": "LLM training",
      "url": "https://www.glennklockwood.com/garden/LLM-training",
      "description": "There are three ways in which **training** a model can be divided across GPU nodes:. * partition the training batch and give each GPU node its own subset of the training dataset (a minibatch). * scales very well since multiple copies of the model are training in parallel, but may increase the time to train a model (convergence time) since training data may be less randomized as a result of partitioning. * requires moderate rewriting the training code to include communication within each epoch. For example, training a large language model across multiple DGX nodes likely involves tensor parallelism within the DGX node (since it has NVLink which makes the communication fast), pipeline parallelism across 16 DGX nodes, and data parallelism to accelerate training by scaling to a thousand DGX nodes. The ZeRO-DP paper (2020) states that a trillion-parameter model using a stateful optimizer (like Adam) requires 16 TiB of GPU memory at 16-bit precision."
    },
    {
      "title": "Understanding the Performance and Estimating the Cost of LLM ...",
      "url": "https://arxiv.org/html/2408.04693v1",
      "description": "In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU."
    }
  ]
}


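Cost breakdown of each LLM training stage: compute, data, and people

I. Pre-training: the largest single compute expense

1.1 Core compute formula

The compute required for pre-training is governed by one compact formula (The FLOPs Calculus of Language Model Training):

C ≈ 6 × N × D

where N is the number of model parameters and D is the number of training tokens. C approximates the total floating-point operations for the forward and backward passes of a transformer: roughly 2ND for the forward pass and 4ND for the backward pass.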
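
As a quick sanity check, the formula can be evaluated directly. This is a minimal sketch: the function name `training_flops` is an assumption of this example, and the inputs are the parameter and token counts quoted for GPT-3 and Llama 2 7B in section 1.2.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute via C ≈ 6 * N * D (in FLOPs)."""
    return 6.0 * n_params * n_tokens


# Inputs taken from the table in section 1.2 of this answer.
gpt3 = training_flops(175e9, 300e9)      # 175B parameters, 300B tokens
llama2_7b = training_flops(7e9, 2e12)    # 7B parameters, 2T tokens

print(f"GPT-3:      {gpt3:.2e} FLOPs")
print(f"Llama 2 7B: {llama2_7b:.2e} FLOPs")
```

The estimate ignores attention-specific terms and any recomputation, so it is best treated as an order-of-magnitude figure.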

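1.2 Actual costs of representative models

Model	Parameters	Training tokens	Estimated compute cost	Notes
Transformer (2017)	n/a	n/a	~$900	Baseline
GPT-3	175B	300B	$500K–$4.6M	Depends on hardware and optimizations
Llama 2 7B	7B	2T	~$760K	184,320 A100 GPU-hours (Sebastian Raschka)
GPT-4	~1.76T (MoE, unconfirmed)	Undisclosed	$78M–$100M+	2.1×10²⁵ FLOPs
Gemini Ultra	Undisclosed	n/a	~$191M	5.0×10²⁵ FLOPs (CUDO Compute)

Note: these figures cover only the compute for the final training run. Once exploratory experiments, failed hyperparameter attempts, and ablations are included, total R&D compute is typically 2-3 times that of the final run (CUDO Compute).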
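
The Llama 2 7B row can be cross-checked against C ≈ 6ND. The sketch below backs out the implied hardware utilization and price per GPU-hour from the quoted figures. The A100 peak of 312 TFLOP/s (dense BF16) is an assumption taken from NVIDIA's published specification, not from the sources above, and the function names are illustrative.

```python
A100_PEAK_FLOPS = 312e12  # assumed peak dense BF16 throughput of one A100, in FLOP/s


def implied_mfu(total_flops: float, gpu_hours: float,
                peak_flops: float = A100_PEAK_FLOPS) -> float:
    """Fraction of peak throughput implied by a FLOP count and a GPU-hour count."""
    achieved_flops_per_second = total_flops / (gpu_hours * 3600.0)
    return achieved_flops_per_second / peak_flops


def implied_price_per_gpu_hour(total_cost_usd: float, gpu_hours: float) -> float:
    """Average USD per GPU-hour implied by a total cost."""
    return total_cost_usd / gpu_hours


flops = 6.0 * 7e9 * 2e12                      # C ≈ 6ND for Llama 2 7B
mfu = implied_mfu(flops, 184_320)             # 184,320 A100 GPU-hours from the table
price = implied_price_per_gpu_hour(760_000, 184_320)

# Applying the 2-3x R&D multiplier from the note to the final-run cost.
rd_low, rd_high = 2 * 760_000, 3 * 760_000

print(f"Implied MFU: {mfu:.1%}, implied price: ${price:.2f}/GPU-hour")
print(f"Total R&D compute estimate: ${rd_low:,} to ${rd_high:,}")
```

An implied utilization of roughly 40% is in the same range as the 38% MFU reported for Llama 3 in the search results above, which suggests the table's figures are internally consistent.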

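1.3 Data scale vs. compute cost: what the Chinchilla law implies

DeepMind's Chinchilla paper (2022) found that, for a given compute budget C, the optimal model size N and training data size D should satisfy D ≈ 20 × N (Chinchilla Scaling Laws).

What does this mean?

If you want to train a model twice as large, you also need twice as much training data to "feed" it
Since C ≈ 6ND, total compute becomes 4 times larger (2 × 2)
In other words: under the compute-optimal assumption, doubling model size roughly quadruples compute cost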
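
The same relationship can be run in the other direction: given a compute budget, solve for the compute-optimal N and D. This is a minimal sketch assuming exactly D = 20N (the rule of thumb quoted above); the function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20.0  # rule of thumb D ≈ 20 N


def optimal_compute(n_params: float) -> float:
    """Compute for a model trained at D = 20 N: C = 6 * N * (20 N) = 120 N^2."""
    return 6.0 * n_params * (TOKENS_PER_PARAM * n_params)


def chinchilla_optimal(compute_budget: float) -> tuple[float, float]:
    """Return (N, D) that satisfy both C = 6 N D and D = 20 N."""
    n_params = math.sqrt(compute_budget / (6.0 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


# Doubling the model from 7B to 14B parameters at the optimal ratio.
ratio = optimal_compute(14e9) / optimal_compute(7e9)

# Round trip: recover N and D from the 7B model's budget.
n, d = chinchilla_optimal(optimal_compute(7e9))

print(f"Compute ratio when N doubles: {ratio:.1f}x")
print(f"Recovered N = {n:.2e} parameters, D = {d:.2e} tokens")
```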

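1.4 Hidden costs of data preparation

Beyond compute, acquiring and cleaning pre-training data is itself a significant expense. The figures below come from ABAKA AI's hands-on experience (ABAKA: LLM Data Cost):

Cost item	Order-of-magnitude estimate
Distributed storage	~$85,000/PB (about $99,000/PB including redundancy and networking equipment)
Crawling the full history of one large website	$42,500–$70,500
Document extraction (PDF → Markdown)	~$14,000 per 10 million pages
Data cleaning (per 100 TB)	Two weeks of engineer salary + ~$2,800 in compute
Quality filtering (LLM scoring + classifier filtering)	Additional inference compute + annotation cost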

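II. SFT (supervised fine-tuning): relatively cheap, but data quality sets the ceiling

2.1 Compute cost

SFT needs far less compute than pre-training: typically hours to days rather than weeks or months. Figures cited in the previous answer:

Method	Typical cost for a 70B model	Hardware
Full Fine-Tuning	$1,000–$5,000 per run	4-8 H100s
LoRA	$50–$500 per run	Single A100 / H100
QLoRA	$20–$100 per run	Single 24 GB consumer GPU

The hardware column looks optimistic at 70B scale: 4-bit weights alone take about 35 GB, more than a 24 GB card holds, so the QLoRA row fits smaller models or needs offloading. SFT compute is on the order of 1%-5% of pre-training compute. Fine-tuning an existing base model typically costs 60-90% less than training from scratch (Galileo: LLM Training Cost).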

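2.2 Data annotation cost

Take building a high-quality, domain-specific SFT dataset of 2,000 entries as an example:

2,000 entries × $6 per entry = $12,000 (annotation only, excluding guideline design and review)
Adding guideline design by domain experts and multiple review rounds, total SFT data cost for one domain falls between $5,600 and $140,000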

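III. Preference alignment (RLHF / DPO): human annotation is the real cost monster

3.1 The full RLHF cost chain

Traditional RLHF stacks three layers of cost:

Layer 1: collecting human preference data

Each pairwise comparison: $1–$5
A typical RLHF training run needs hundreds of thousands of comparisons
Top AI companies (OpenAI, Google, Meta, Anthropic) are each estimated to spend about $1 billion per year on human-annotated data (O-Mega: Data Labeling Industry Guide)
The data labeling market is projected to grow from $1.2 billion in 2024 to $10.2 billion in 2034 (Lemon.io: RLHF Market)

Layer 2: reward model training

Requires additional GPU compute to train a separate reward model
Typically on the order of $10K–$100K, depending on scale

Layer 3: PPO reinforcement learning

Requires running the policy model, the reward model, and the reference model at the same time
Training is unstable and often takes several attempts
Compute cost can reach 5-10 times that of SFT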

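3.2 Cost comparison of alignment methods

Method	Annotation cost per comparison	Compute cost	Overall cost scale
RLHF (PPO)	$1–$5	Highest (reward model training + multiple RL rounds)	Tens to hundreds of millions of USD
DPO	$1–$5	~50% of RLHF	Same annotation cost, half the compute
RLAIF / Constitutional AI	<$0.01	Medium	Annotation cost reduced 100-500×
RLVR / GRPO	$0 (programmatic verification)	Medium	Human preference annotation eliminated entirely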
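
To make the per-comparison prices concrete, the sketch below totals annotation spend for one training run. The count of 300,000 comparisons is a hypothetical value chosen for illustration (the text only says "hundreds of thousands"), and the RLAIF price uses the table's upper bound of $0.01.

```python
def annotation_cost(n_comparisons: int, cost_per_comparison: float) -> float:
    """Total annotation spend in USD for a given number of pairwise comparisons."""
    return n_comparisons * cost_per_comparison


N_COMPARISONS = 300_000  # hypothetical; not a figure from the sources

human_low = annotation_cost(N_COMPARISONS, 1.0)    # human labels at $1 each
human_high = annotation_cost(N_COMPARISONS, 5.0)   # human labels at $5 each
ai_upper = annotation_cost(N_COMPARISONS, 0.01)    # RLAIF at the $0.01 upper bound
rlvr = annotation_cost(N_COMPARISONS, 0.0)         # programmatic verification

print(f"Human (RLHF / DPO): ${human_low:,.0f} to ${human_high:,.0f}")
print(f"RLAIF (upper bound): ${ai_upper:,.0f}")
print(f"RLVR / GRPO: ${rlvr:,.0f}")
print(f"Human vs. RLAIF: {human_low / ai_upper:.0f}x to {human_high / ai_upper:.0f}x")
```

The ratio reproduces the 100-500× reduction in the table, which follows directly from dividing $1–$5 by $0.01.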

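IV. Reasoning RL (GRPO / RLVR): a revolution in cost structure

The core advantage of GRPO and RLVR is not just performance but a fundamental change in cost structure.

Zero human annotation cost
Reward signals are immediate, consistent, and scale without limit
Training compute is roughly comparable to the RL stage of RLHF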
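
A rough sketch of why no human labels are needed: the reward comes from a program that checks the answer, and each sampled response is scored relative to the other samples for the same prompt. The exact-match checker, the function names, and the use of the population standard deviation are assumptions of this example, not a description of any particular lab's implementation.

```python
import statistics


def exact_match_reward(model_answer: str, reference: str) -> float:
    """Programmatic verifier: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group (the GRPO-style baseline).

    eps avoids division by zero when every sample in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical samples for one prompt whose verified answer is "42".
samples = ["42", " 41", "42 ", "7"]
rewards = [exact_match_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group mean, no separate value model is trained, which is one reason the compute profile differs from PPO-based RLHF.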

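V. The full relationship between data scale and cost

Putting all the stages together, data scale and cost are related in a multi-layered, non-linear way:

5.1 Pre-training: linear growth, amplified by scaling laws

Compute cost ≈ 6 × N × D × (price per GPU-hour / effective FLOPs per GPU-hour)

N and D each drive cost linearly, but the Chinchilla-optimal D = 20N makes total cost proportional to N²
10 times the model size ≈ 100 times the compute cost (under the compute-optimal assumption)
In practice, most labs choose to overtrain (D >> 20N), which makes the data-side coefficient larger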
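
The cost formula can be turned into a small calculator. The price and throughput constants below are hypothetical placeholders, not quotes from any provider; only the ratios between scenarios are meaningful.

```python
def pretraining_cost_usd(n_params: float, n_tokens: float,
                         price_per_gpu_hour: float,
                         effective_flops_per_gpu: float) -> float:
    """Dollar cost = (6 N D / effective FLOP/s per GPU) converted to GPU-hours, times price."""
    gpu_seconds = 6.0 * n_params * n_tokens / effective_flops_per_gpu
    return gpu_seconds / 3600.0 * price_per_gpu_hour


PRICE = 2.0          # hypothetical USD per GPU-hour
THROUGHPUT = 4.0e14  # hypothetical effective FLOP/s per GPU (peak × utilization)

base = pretraining_cost_usd(7e9, 20 * 7e9, PRICE, THROUGHPUT)          # 7B at D = 20N
scaled = pretraining_cost_usd(70e9, 20 * 70e9, PRICE, THROUGHPUT)      # 70B at D = 20N
overtrained = pretraining_cost_usd(7e9, 200 * 7e9, PRICE, THROUGHPUT)  # 7B at D = 200N

print(f"10x model at D = 20N: {scaled / base:.0f}x the cost")
print(f"Same 7B model at D = 200N: {overtrained / base:.0f}x the cost")
```

Overtraining a small model by 10× costs 10× more to train, which is still far less than the 100× of scaling the model itself, and the smaller model is cheaper to serve afterwards.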

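5.2 Data preparation: fixed overhead plus a linear increment

Total data preparation cost = fixed infrastructure cost + crawling/collection cost(D) + cleaning/deduplication cost(D)

Storage cost: essentially linear ($/TB × number of TB)
Crawling cost: each new data source carries a fixed development cost ($1,500-$15,000 per website), after which incremental cost is low
Cleaning cost: strong economies of scale. Once the cluster is set up, each additional 100 TB costs only ~$2,800 to clean
These declining marginal costs mean that the larger the dataset, the lower the average cost per token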
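
The amortization effect can be illustrated with the unit prices quoted in this answer. The fixed infrastructure cost, the number of sites, and the per-site development cost used below are hypothetical inputs, and 1 PB is taken as 1,000 TB.

```python
STORAGE_USD_PER_TB = 85.0        # from ~$85,000/PB, assuming 1 PB = 1,000 TB
CLEANING_USD_PER_100TB = 2800.0  # compute only, as quoted in the text


def data_prep_cost(volume_tb: float, n_sites: int,
                   fixed_infra_usd: float, dev_cost_per_site_usd: float) -> float:
    """Fixed infrastructure + per-site crawler development + linear storage and cleaning."""
    storage = STORAGE_USD_PER_TB * volume_tb
    crawling = dev_cost_per_site_usd * n_sites
    cleaning = CLEANING_USD_PER_100TB * volume_tb / 100.0
    return fixed_infra_usd + storage + crawling + cleaning


# Hypothetical scenario: $100K of fixed infrastructure, 12 sites at $5K each.
scenario = dict(n_sites=12, fixed_infra_usd=100_000.0, dev_cost_per_site_usd=5_000.0)

small = data_prep_cost(100.0, **scenario) / 100.0      # average USD per TB at 100 TB
large = data_prep_cost(1_000.0, **scenario) / 1_000.0  # average USD per TB at 1 PB

print(f"Average cost at 100 TB: ${small:,.0f}/TB")
print(f"Average cost at 1 PB:   ${large:,.0f}/TB")
```

In this sketch the marginal cost per TB is constant; the average falls only because the fixed terms are spread over more data.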

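5.3 Post-training: quality over quantity, non-linear cost

SFT/RLHF cost = fixed guideline-design cost + annotation cost (number of entries × difficulty factor) + review cost + compute cost

SFT datasets usually hold thousands to hundreds of thousands of entries, far fewer than pre-training (billions to trillions of tokens)
But each entry carries very high value density: 1,000 curated SFT entries may outperform 100,000 low-quality ones
RLHF pairwise-comparison cost grows quadratically with the number of responses per prompt: comparing every pair among K responses takes C(K,2) = K(K−1)/2 comparisons
As a result, RLHF annotation cost is highly sensitive to how many responses are generated per prompt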
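
The sensitivity to responses per prompt is easy to tabulate. In this sketch `k` is the number of responses per prompt, every pair is assumed to be compared, and the 10,000 prompts at $1 per comparison are hypothetical inputs.

```python
import math


def pairwise_comparisons(k_responses: int) -> int:
    """Number of distinct pairs among k responses: C(k, 2) = k (k - 1) / 2."""
    return math.comb(k_responses, 2)


def preference_annotation_cost(n_prompts: int, k_responses: int,
                               cost_per_comparison: float) -> float:
    """Total USD if every pair of responses is compared for every prompt."""
    return n_prompts * pairwise_comparisons(k_responses) * cost_per_comparison


for k in (2, 4, 8):
    pairs = pairwise_comparisons(k)
    cost = preference_annotation_cost(10_000, k, 1.0)
    print(f"k = {k}: {pairs} comparisons per prompt, ${cost:,.0f} total")
```

Going from 2 to 8 responses per prompt multiplies the comparison count by 28, which is why many pipelines rank or subsample pairs rather than label all of them.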

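VI. Summary: the major shifts in cost structure

The analysis above points to a clear direction of change:

Stage	Past (2020-2022)	Now (2024-2026)	Main cost driver
Pre-training	Compute absolutely dominant	Compute still largest, but data-preparation spending is surging	GPU clusters + data engineering teams
SFT	Small-scale annotation	Rising cost of domain-expert annotation	Scarcity of human experts
Preference alignment	RLHF + expensive human annotation	DPO cuts compute, RLAIF cuts annotation cost, GRPO eliminates annotation	Shift from people to programmatic verification
Reasoning enhancement	Did not exist	RLVR / GRPO becoming standard	Ability to design verifiable tasks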

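Points still to be verified

Actual human-annotation spending at the top labs is a trade secret. The $1 billion per year figure comes from estimates by investors and industry analysts (citing Foundation Capital's analysis), not official disclosures.
01.ai's claim of training a GPT-4-class model for $3 million still lacks independent verification and is disputed in the industry; this answer does not cite it as established fact.
How well RLVR methods such as GRPO work in non-verifiable domains (such as creative writing and open-ended dialogue) is still being researched, and no large-scale public comparison exists yet.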
