
The 10^25 FLOPs Tipping Point: Navigating Systemic Risk and Compliance Under the EU AI Act

By FG


What is the EU AI Act's general approach to regulating AI?

The European Union's Artificial Intelligence Act is the world's first comprehensive legal framework for AI. It uses a tiered, risk-based approach, which means the rules and obligations an AI system must follow are directly related to its potential for causing harm. The higher the risk, the stricter the rules.


What's the difference between an AI "model" and an AI "system"?

This is a critical distinction in the AI Act, designed to assign responsibility across the AI value chain.

  • A General-Purpose AI Model (GPM) is the foundational engine. It's a model trained on a large amount of data that can perform a wide range of different tasks. Think of Large Language Models (LLMs) like GPT-4 or Llama 3 as GPMs.

  • A General-Purpose AI System (GPAIS) is the broader application that a user interacts with. It's an AI system built on top of a GPM, often with additional components like a user interface.

This separation is important because the "provider" of the GPM has different legal obligations than the "provider" or "deployer" of the final AI system that uses that model.


What are the four risk tiers in the EU AI Act?

The Act's architecture is built on a four-tiered classification of risk:

  1. Unacceptable Risk: These are practices considered a clear threat to people's safety and rights, and they are banned completely. This includes things like social scoring by governments or manipulative AI techniques.

  2. High Risk: This category covers AI systems used in critical areas where failure could have severe consequences, such as in recruitment, credit scoring, medical devices, or critical infrastructure. These systems face very strict requirements.

  3. Limited Risk: Systems like chatbots or deepfakes fall into this category. They mainly have transparency obligations, meaning users must be told they are interacting with an AI or viewing generated content.

  4. Minimal Risk: This includes the vast majority of AI systems, like AI in video games or spam filters. These are largely unregulated.


What is "systemic risk" for AI models?

Systemic risk is a special concept designed to capture the most powerful and influential GPMs whose capabilities could have broad, societal-level impacts. It's defined as a risk that is specific to the "high-impact capabilities" of these models, with the potential for significant negative effects on public health, safety, security, or fundamental rights.


What is the 10^25 FLOPs threshold and what does it do?

The 10^25 FLOPs threshold is a clear, quantitative trigger established by the Act to identify models that are presumed to pose systemic risk.

A GPM is presumed to have "high-impact capabilities" when the total amount of computation used for its training, measured in floating point operations (FLOPs), is greater than 10^25.

This threshold doesn't mean the model is automatically deemed dangerous. Instead, it acts as a procedural trigger that shifts the burden of proof.

  • For models below the threshold, the regulator must prove the model is risky.

  • For models above the threshold, the provider is automatically presumed to be in the systemic risk category and must either comply with the strict rules or present a strong, evidence-based case to regulators arguing why their model does not actually pose systemic risks.

Essentially, crossing the 10^25 FLOPs line starts a mandatory regulatory conversation.


What is a FLOP in the context of AI training?

A Floating-Point Operation (FLOP) is the basic unit of computational work for training AI models. It represents a single simple calculation (like an addition or multiplication) on numbers with decimal points. Training a modern frontier model takes an astronomical number of these operations, on the order of 10^23 to 10^26 in total.

It's important not to confuse FLOPs (a measure of total work) with FLOPS (FLOPs per second), which is a measure of a computer's processing speed. The AI Act's 10^25 threshold refers to the total number of FLOPs used over the entire training process.
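
To make the scale concrete, here is a back-of-the-envelope comparison in Python. The 10^15 FLOPS peak rate is a hypothetical round figure for a single modern accelerator, used only for illustration:

```python
# Rough illustration: FLOPs (total work) vs. FLOPS (speed).
# The peak rate below is a hypothetical round number, not a real product spec.
THRESHOLD_TOTAL_FLOPS = 1e25       # EU AI Act systemic-risk presumption (total work)
PEAK_RATE_FLOPS_PER_SEC = 1e15     # hypothetical accelerator speed: 10^15 FLOPs per second

seconds_needed = THRESHOLD_TOTAL_FLOPS / PEAK_RATE_FLOPS_PER_SEC
years_needed = seconds_needed / (60 * 60 * 24 * 365)

print(f"{seconds_needed:.1e} seconds ≈ {years_needed:,.0f} years on a single device")
# -> 1.0e+10 seconds ≈ 317 years, which is why frontier training runs spread
#    the work across thousands of accelerators running for weeks or months.
```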


How is the total training compute (FLOPs) calculated for an AI model?

Accurately calculating the total FLOPs is complex and often relies on estimations. The most common methods are:

  1. The 6ND Heuristic: For Transformer models, a widely used formula is C ≈ 6 × N × D, where:

    • C is the total training compute in FLOPs.

    • N is the number of parameters in the model.

    • D is the number of tokens in the training dataset.

  2. Hardware-Based Estimation: An alternative is to use the formula:

    • C = Training Time × # of GPUs × Peak FLOPS × Utilization Rate

    • This method is also uncertain because the "utilization rate" (the percentage of a GPU's peak performance that is actually achieved) is very difficult to estimate accurately.

This difficulty in calculation creates a "verification gap" for regulators, placing a huge burden on data science teams to keep meticulous internal records of their computational expenses.
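
A minimal sketch of both estimation methods in Python; the parameter count and token count are chosen so the 6ND estimate matches the Llama 3 (70B) figure cited in the next section, while the GPU count, peak rate, and utilization are purely illustrative assumptions:

```python
def flops_6nd(n_params: float, n_tokens: float) -> float:
    """6ND heuristic for dense Transformers: C ≈ 6 × N × D (total training FLOPs)."""
    return 6.0 * n_params * n_tokens

def flops_hardware(train_seconds: float, n_gpus: int,
                   peak_flops_per_sec: float, utilization: float) -> float:
    """Hardware-based estimate: C = time × GPUs × peak FLOPS × utilization rate."""
    return train_seconds * n_gpus * peak_flops_per_sec * utilization

# Illustrative run: a 70B-parameter model trained on 15T tokens (the hardware
# figures below are hypothetical and chosen so the two estimates roughly agree).
c_param = flops_6nd(n_params=70e9, n_tokens=15e12)             # ≈ 6.3e24 FLOPs
c_hw = flops_hardware(train_seconds=23 * 24 * 3600,            # ~23 days of training
                      n_gpus=8_000,
                      peak_flops_per_sec=1e15,                 # hypothetical peak
                      utilization=0.4)                         # hard to pin down precisely

print(f"6ND estimate:      {c_param:.2e} FLOPs")
print(f"Hardware estimate: {c_hw:.2e} FLOPs")
print("Presumed systemic risk" if max(c_param, c_hw) > 1e25 else "Below the 1e25 threshold")
```

Note how sensitive the hardware-based figure is to the assumed utilization rate: moving it from 0.4 to 0.5 shifts the estimate by 25%, which is exactly the kind of verification gap described above.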


How does the 10^25 FLOPs threshold compare to current AI models? 🤖

The threshold was set to capture the "frontier" of AI development around the time the Act was drafted (e.g., GPT-4 class models). As of mid-2025, over 30 publicly announced models have already crossed this line.

Models Below the Systemic Risk Threshold

  • GPT-3 (175B): 3.14 × 10^23 FLOPs

  • PaLM (540B): 2.56 × 10^24 FLOPs

  • Llama 3 (70B): 6.3 × 10^24 FLOPs

The EU AI Act Systemic Risk Threshold

  • 1.0 × 10^25 FLOPs

Models At or Above the Systemic Risk Threshold

  • Inflection-2: 1.0 × 10^25 FLOPs

  • Mistral Large: 1.1 × 10^25 FLOPs

  • Claude 3 Opus: 1.6 × 10^25 FLOPs

  • GPT-4: 2.1 × 10^25 FLOPs

  • Llama 3.1 (405B): 3.8 × 10^25 FLOPs

  • Gemini 1.0 Ultra: 5.0 × 10^25 FLOPs

  • Grok-3: 4.6 × 10^26 FLOPs


What specific obligations do providers of systemic risk models have?

Providers of GPMs with systemic risk must first meet all the baseline requirements for GPMs (like providing technical documentation and a summary of training data). On top of that, they face a much more demanding set of obligations under Article 55 of the Act:

  • State-of-the-Art Model Evaluation: They must perform thorough model evaluations, including internal and external adversarial testing (red-teaming) to proactively find dangerous capabilities and vulnerabilities.

  • Systemic Risk Assessment and Mitigation: They must establish a continuous process to identify, analyze, and manage foreseeable systemic risks throughout the model's entire lifecycle.

  • Serious Incident Tracking and Reporting: They must track and document any "serious incidents" that could lead to significant harm, and report them to the AI Office and other relevant authorities without undue delay.

  • Cybersecurity Protection: They must ensure an adequate level of cybersecurity to protect the model and its physical infrastructure from being stolen, compromised, or maliciously manipulated.


What are the "Codes of Practice" and why are they important?

The Codes of Practice are a form of co-regulation where the EU's AI Office facilitates a process for industry stakeholders to develop detailed, state-of-the-art guidance on how to comply with the Act's obligations.

Adhering to an official Code of Practice is voluntary, but it creates a powerful legal safe harbor called a "presumption of conformity". This means regulators will assume a provider who follows the code is compliant with the Act, which reduces their legal risk and administrative burden. Because of this strong incentive, these codes are expected to become a de facto mandatory standard for the industry.


What is the "compliance cliff" for developers who modify existing AI models?

The "compliance cliff" refers to a situation where a developer who is simply using or modifying a pre-existing GPM can suddenly become legally reclassified as the "provider" of a new systemic risk model, inheriting a massive set of expensive obligations.This happens when the computational resources used for the modification cross certain thresholds. For example:

If a data science team takes an open-source model that is below the 10^25 FLOPs systemic risk threshold and performs a large-scale fine-tuning that pushes the model's cumulative compute (original training + fine-tuning) over the 10^25 FLOPs line, that team's organization instantly becomes the provider of a new systemic risk GPM. They are then responsible for the full compliance burden for the entire model, even the parts they didn't build.

This turns the decision to fine-tune a powerful model from a simple technical choice into a major strategic and financial risk.
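
A minimal sketch of the pre-fine-tuning check this implies; the base-model figure and the fine-tuning plan below are hypothetical, and in practice the original training compute would come from the upstream provider's documentation:

```python
SYSTEMIC_RISK_THRESHOLD = 1e25   # total training FLOPs that triggers the presumption

def cumulative_compute(base_model_flops: float, finetune_flops: float) -> float:
    """Cumulative compute = upstream pre-training FLOPs + our fine-tuning FLOPs."""
    return base_model_flops + finetune_flops

# Hypothetical scenario: an open-weights base model just under the line, plus a
# planned large-scale fine-tune whose compute is estimated with the 6ND heuristic.
base_flops = 9.2e24                      # illustrative figure from the provider's docs
finetune_flops = 6.0 * 70e9 * 2e12       # 70B params further trained on 2T tokens ≈ 8.4e23

total = cumulative_compute(base_flops, finetune_flops)
if total > SYSTEMIC_RISK_THRESHOLD:
    print(f"{total:.2e} FLOPs: this fine-tune would make us the provider of a "
          "new systemic-risk GPM, with the full Article 55 obligations.")
else:
    print(f"{total:.2e} FLOPs: still below the presumption threshold; baseline GPM rules apply.")
```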


What are the different roles and responsibilities in the AI value chain?

The Act creates a cascade of responsibilities down the value chain.

Original GPM Provider (Below 10^25 FLOPs)

  • Description: Develops a foundational GPM not presumed to have systemic risk.

  • Key Responsibilities: Provide technical documentation, copyright policy, and information for downstream providers.

Original Systemic Risk GPM Provider (Above 10^25 FLOPs)

  • Description: Develops a GPM presumed to have systemic risk.

  • Key Responsibilities: All baseline obligations PLUS model evaluations, adversarial testing, risk assessment, serious incident reporting, and cybersecurity.

Downstream Deployer (Integration Only)

  • Description: Integrates a third-party GPM into an AI system without significant modification.

  • Key Responsibilities: Responsible for the final AI system (e.g., high-risk rules if applicable), but not GPM provider obligations.

Downstream Modifier (Creates New Systemic Risk GPM)

  • Description: Fine-tunes or modifies a GPM, causing it to cross a systemic risk threshold.

  • Key Responsibilities: Full provider obligations for the entire newly created model. They cannot limit their liability to just the modification they performed.


Why did regulators choose FLOPs as the main metric for systemic risk?

Regulators chose training compute as a metric for several practical reasons:

  • Quantifiability: FLOPs are a quantifiable, single-dimensional metric, unlike abstract concepts like "capability."

  • Early Measurability: Compute can be estimated before a model is even trained, allowing for early planning.

  • Correlation with Cost: Compute costs tens or hundreds of millions of dollars at this scale, allowing regulators to narrowly target the few organizations that can afford it.

  • Empirical Basis in Scaling Laws: The choice is grounded in the "scaling laws" phenomenon, which shows a predictable relationship between compute, performance, and the emergence of new abilities.


What are the main criticisms of using FLOPs as a proxy for risk?

While practical, the FLOPs-as-risk approach is criticized for being a crude and increasingly unreliable measure.

  • Algorithmic & Architectural Efficiency: The link between compute and capability is weakening as more efficient model architectures (like Mixture-of-Experts) are developed that achieve high performance with less compute.

  • The Primacy of Data Quality: The quality of the training data can have a greater impact on a model's capabilities than the sheer volume of computation. The FLOPs metric is blind to this.

  • Post-Training Capability Enhancements: A model's real-world risk can be significantly changed by downstream modifications like fine-tuning or Retrieval-Augmented Generation (RAG) that don't add to its training FLOP count.

  • Verification Challenges & Loopholes: The difficulty of accurately calculating FLOPs creates opportunities for "creative compliance," such as using model distillation to create smaller, unregulated-but-still-powerful models for the EU market.

  • Risk is Contextual, Not Computational: The most significant risks often come from how a model is used, not how big it is. A lower-compute model used for a high-stakes purpose could be far riskier than a high-compute model used for entertainment.


What is a practical, three-phase compliance playbook for data science teams? 📋

Phase 1: Audit and Classification

  1. Inventory AI Systems: Conduct a thorough audit of every AI model and system in use or development.

  2. Determine Your Role: For each, define if you are a provider, modifier, or deployer under the Act.

  3. Establish Compute Estimation: Create a rigorous, documented process for calculating the training compute for any GPM you provide or significantly modify (a minimal sketch of one such record follows this list).

  4. Perform Risk Categorization: Assess each system against the Act's risk tiers, paying special attention to the 10^25 FLOPs threshold.
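
Building on step 3, one possible shape for an auditable compute-log entry is sketched below; the schema, field names, and figures are illustrative assumptions rather than a prescribed format:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class ComputeRecord:
    """One auditable entry in an internal training-compute log (illustrative schema)."""
    model_name: str
    run_id: str
    n_params: float              # model parameters (N)
    n_tokens: float              # training tokens (D)
    gpu_hours: float             # device-hours actually consumed by the run
    peak_flops_per_sec: float    # vendor-stated peak of the accelerator used
    assumed_utilization: float   # documented assumption, e.g. 0.35-0.45
    recorded_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    @property
    def flops_6nd(self) -> float:
        return 6.0 * self.n_params * self.n_tokens

    @property
    def flops_hardware(self) -> float:
        return self.gpu_hours * 3600 * self.peak_flops_per_sec * self.assumed_utilization

# Illustrative entry: both estimates land around 6.3e24 FLOPs, below the 1e25 line.
record = ComputeRecord("internal-70b-base", "run-0427", n_params=70e9, n_tokens=15e12,
                       gpu_hours=4.4e6, peak_flops_per_sec=1e15, assumed_utilization=0.4)
print(json.dumps({**asdict(record),
                  "flops_6nd": record.flops_6nd,
                  "flops_hardware": record.flops_hardware}, indent=2))
```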

Phase 2: Implementing "Compliance by Design" in MLOps

  1. Institute Robust Data Governance: Implement systems for immutable, end-to-end data provenance and lineage. Integrate automated data quality and bias checks into training workflows.

  2. Mandate "Living" Technical Documentation: Use tools to automate the generation of documentation that is continuously updated from codebases, model cards, and experiment logs.

  3. Integrate Lifecycle Risk Management: Embed risk assessment into every stage of the model lifecycle. Make adversarial testing (red-teaming) a mandatory pre-deployment step for any model nearing the systemic risk threshold.
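
As a sketch of the red-teaming gate in step 3, here is an illustrative pre-deployment check; the 80% "nearing the threshold" margin and the helper names are internal policy assumptions, not requirements taken from the Act:

```python
SYSTEMIC_RISK_THRESHOLD = 1e25
NEARING_MARGIN = 0.8   # internal policy choice: treat >= 80% of the line as "nearing"

def deployment_gate(estimated_training_flops: float, has_redteam_report: bool) -> bool:
    """Return True if deployment may proceed under this illustrative internal policy."""
    nearing = estimated_training_flops >= NEARING_MARGIN * SYSTEMIC_RISK_THRESHOLD
    if nearing and not has_redteam_report:
        print("Blocked: adversarial-testing (red-teaming) evidence is required before release.")
        return False
    return True

assert deployment_gate(6.3e24, has_redteam_report=False) is True    # well below the line
assert deployment_gate(9.0e24, has_redteam_report=False) is False   # nearing it, no report
assert deployment_gate(9.0e24, has_redteam_report=True) is True     # nearing it, evidence on file
```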

Phase 3: Strategic and Organizational Adaptation

  1. Re-evaluate "Build vs. Buy vs. Fine-tune": The choice of how to source AI is now a major regulatory decision. Rigorously vet vendors for their compliance documentation and conduct a formal risk assessment before any large-scale fine-tuning to avoid crossing the "compliance cliff."

  2. Establish Cross-Functional Governance: Form a dedicated AI governance committee with representatives from Legal, Compliance, Security, and business units.

  3. Champion AI Literacy: Promote training programs across the organization to ensure everyone understands the new regulatory landscape, not just the technical teams.


What is the overall impact of the EU AI Act on the field of AI?

The EU AI Act, and particularly its 10^25 FLOPs threshold, signals the end of the "move fast and break things" era for artificial intelligence. It ushers in a new period of structured, process-heavy development that prioritizes safety, transparency, and accountability.

For data scientists and ML engineers, this means their role is fundamentally changing. It is no longer just a technical pursuit but a socio-technical risk management discipline. The new imperative is clear: move deliberately, manage risk proactively, and document everything. While this introduces new overhead, it also marks the maturation of the field, aiming to ensure that the powerful tools being built are worthy of public trust.