
Stripe's Foundation Model (1/3): A New Architecture for Financial AI

By FG

What is Stripe's Payments Foundation Model (PFM)?

Stripe's Payments Foundation Model (PFM) is a large-scale, general-purpose AI model designed to understand the fundamental patterns, or "grammar," of financial transactions. Announced in May 2025, it represents a major architectural shift in how AI is applied to payments.

Instead of using many separate, specialized machine learning models for individual tasks (like fraud, disputes, or authorizations), the PFM is a single, unified transformer-based model. It learns from a massive dataset of tens of billions of data points through self-supervised learning. Its primary output is a universal "behavioral embedding"—a rich, numerical representation of a transaction's context—that can then be used to power a whole suite of different financial applications.


What was the "old" way of doing machine learning at Stripe?

Before the PFM, Stripe's machine learning stack, like much of the financial industry's, was based on a collection of siloed, task-specific models.

  • Model Types: The workhorses were models like XGBoost (a form of gradient-boosted trees) and Logistic Regression.

  • Feature Engineering: For each specific problem, data science teams would manually identify relevant signals and hand-craft features to capture them. For example, they might create rules based on transaction velocity or the distance between a shipping address and an IP address.

  • Siloed Tasks: A dedicated model was trained for each product—one for fraud detection in Radar, another for chargeback prediction in Smart Disputes, and another for optimizing payment authorizations.

While effective, this approach was labor-intensive, reactive, and difficult to scale. The intelligence of the system was limited by the human expertise required to constantly engineer new features to keep up with changing fraud patterns.
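To make the contrast concrete, here is a minimal sketch of that legacy pattern: hand-crafted signals (velocity, geo-distance, and so on) fed into a task-specific gradient-boosted classifier. The column names, thresholds, and synthetic data are hypothetical illustrations, not Stripe's actual pipeline.

```python
# Illustrative sketch of the legacy, feature-engineered approach: manually
# engineered signals fed to one supervised, task-specific model.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5_000
txns = pd.DataFrame({
    "amount": rng.lognormal(3.0, 1.0, n),
    "ship_ip_km": rng.exponential(50.0, n),        # distance: shipping address -> IP geolocation
    "card_txns_last_hour": rng.poisson(1.0, n),    # per-card velocity, precomputed upstream
    "is_fraud": rng.binomial(1, 0.02, n),          # a labeled outcome is required for each task
})

# Hand-crafted transforms a data science team would iterate on for *this* task only.
X = pd.DataFrame({
    "amount_log": np.log1p(txns["amount"]),
    "ship_ip_km": txns["ship_ip_km"],
    "velocity_1h": txns["card_txns_last_hour"],
    "high_velocity_flag": (txns["card_txns_last_hour"] > 5).astype(int),
})
y = txns["is_fraud"]

model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X, y)
fraud_scores = model.predict_proba(X)[:, 1]        # one model, one task, one feature set
```

Every new product or fraud pattern meant repeating this loop: invent features, collect labels, train and maintain yet another model.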


What is the core architectural shift the PFM represents? 🏗️

The PFM marks a strategic pivot from an incremental, feature-driven paradigm to a generative, representation-learning paradigm. This is a fundamental change in philosophy.

Instead of teaching many small models to recognize specific, pre-defined patterns of "good" or "bad" behavior, Stripe has built a single large model to learn the deep, contextual structure of all payments.

Here is a breakdown of the two approaches:

Legacy ML Stack (Pre-PFM)

  • Model Type: Gradient Boosting (XGBoost), Logistic Regression

  • Feature Engineering: Manual, hand-crafted features for each task

  • Training Paradigm: Supervised, task-specific training

  • Core Logic: Learns correlations between pre-defined features and labels

  • Data Requirements: Requires labeled data for each specific task

  • Scalability: Scales per individual model; bottlenecks with sequential data

  • Generalizability: Low; models are siloed and task-specific

  • Maintenance Overhead: High; requires continuous feature engineering

Payments Foundation Model (PFM)

  • Model Type: Transformer-based Neural Network

  • Feature Engineering: Learned representations from raw data

  • Training Paradigm: Self-supervised pre-training on a general corpus

  • Core Logic: Learns contextual relationships within data sequences

  • Data Requirements: Utilizes a vast corpus of unlabeled transaction data

  • Scalability: Scales horizontally; designed for parallel processing

  • Generalizability: High; universal embeddings are reusable across tasks

  • Maintenance Overhead: Lower; proactive adaptation through continuous retraining


How would a transformer "tokenize" a financial transaction?

Tokenization is the first step in feeding data to a transformer. It involves converting the raw data of a transaction into a sequence of numerical tokens. Since a financial transaction contains a mix of data types, a hybrid tokenization strategy is the most likely approach:

  • Categorical Features: High-cardinality features like a Card BIN or Merchant Category Code (MCC) are likely mapped to a learned embedding vector. This allows the model to learn a dense, meaningful representation for each category (e.g., that different BINs from the same bank are related).

  • Numerical Features: Continuous values like the transaction amount or discrete values like a timestamp are more challenging. They are likely converted into tokens through techniques like binning the values into discrete buckets or encoding the number digit by digit in a positional (base-N) representation.

  • Textual Features: Some fields, like a merchant's name or descriptor, can be treated as natural language and processed using a standard subword tokenizer like Byte-Pair Encoding (BPE).

This process turns each transaction into a "sentence" of tokens, and a user's entire history becomes a long sequence of these sentences.
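Here is a minimal sketch of what such a hybrid tokenizer could look like. The field names, vocabularies, and bin edges are hypothetical, and the text handling is a stand-in for a real subword tokenizer such as BPE; it only illustrates the mapping from raw fields to a token "sentence."

```python
# Hypothetical hybrid tokenizer: categorical -> vocab ids, numerical -> bucket
# tokens, text -> subword-style tokens. Not Stripe's actual scheme.
MCC_VOCAB = {"5411": 0, "5732": 1, "7995": 2}       # merchant category codes -> token ids
BIN_VOCAB = {"411111": 10, "550000": 11}            # card BIN prefixes -> token ids
AMOUNT_BUCKETS = [1, 5, 10, 50, 100, 500, 1000]     # bucket edges for the amount, in dollars
AMOUNT_BASE_ID = 100                                # id offsets keep token ranges from colliding
TEXT_BASE_ID = 200

def bucketize_amount(amount: float) -> int:
    """Numerical feature -> discrete bucket token (binning strategy)."""
    for i, edge in enumerate(AMOUNT_BUCKETS):
        if amount <= edge:
            return AMOUNT_BASE_ID + i
    return AMOUNT_BASE_ID + len(AMOUNT_BUCKETS)

def tokenize_text(descriptor: str) -> list[int]:
    """Stand-in for a subword tokenizer (e.g., BPE): hash words into a small id space."""
    return [TEXT_BASE_ID + (hash(w.lower()) % 50) for w in descriptor.split()]

def tokenize_transaction(txn: dict) -> list[int]:
    """Turn one transaction into a 'sentence' of token ids."""
    tokens = [
        MCC_VOCAB.get(txn["mcc"], len(MCC_VOCAB)),   # categorical -> vocab id (unseen -> OOV id)
        BIN_VOCAB.get(txn["card_bin"], 12),          # categorical -> vocab id
        bucketize_amount(txn["amount"]),             # numerical -> bucket token
    ]
    tokens += tokenize_text(txn["descriptor"])       # textual -> subword-style tokens
    return tokens

# A user's history becomes one long sequence of these "sentences".
history = [
    {"mcc": "5411", "card_bin": "411111", "amount": 23.40, "descriptor": "ACME GROCERY"},
    {"mcc": "7995", "card_bin": "411111", "amount": 1.00,  "descriptor": "GAMINGSITE TOPUP"},
]
sequence = [tok for txn in history for tok in tokenize_transaction(txn)]
```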


How does the PFM use self-supervised learning (SSL)?

A key innovation of the PFM is its use of self-supervised learning (SSL). This is critical because it avoids the primary bottleneck in traditional financial ML: the need for huge amounts of manually labeled data (e.g., "fraud" vs. "not fraud"). With SSL, the model learns from the inherent structure of the data itself by solving a "pretext task."

Plausible pretext tasks for the PFM include:

  • Masked Feature Modeling: The model is given a transaction sequence where some features are randomly hidden or "masked." Its goal is to predict the original, uncorrupted values of these masked features based on the surrounding context. To do this well, it must learn the deep relationships between all the features.

  • Contrastive Learning: The model learns to tell the difference between similar and dissimilar transaction sequences. This forces it to learn representations that are sensitive to meaningful behavioral differences while ignoring superficial noise.

  • Next Transaction Prediction: Given a user's history, the model's objective is to predict their next transaction. This naturally encourages the model to learn temporal patterns and behavioral tendencies.

This SSL approach allows the PFM to be in a "constantly retraining loop," letting it adapt to new fraud patterns and consumer behaviors without slow and expensive manual relabeling campaigns.
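The sketch below illustrates the first of those pretext tasks, masked feature modeling, on token sequences like the ones produced above. The architecture sizes, masking rate, and training setup are illustrative assumptions, not Stripe's; the point is the shape of the objective: no fraud labels appear anywhere.

```python
# Minimal masked-feature-modeling sketch: hide random tokens, train a small
# transformer encoder to reconstruct them from context.
import torch
import torch.nn as nn

VOCAB_SIZE, MASK_ID, D_MODEL = 512, 0, 64

class TinyPFM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)   # predicts the original token at each position

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))      # contextual representation per position
        return self.head(h)

model = TinyPFM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(1, VOCAB_SIZE, (32, 40))      # batch of 32 transaction sequences, no labels
mask = torch.rand(tokens.shape) < 0.15               # hide ~15% of features at random
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = model(corrupted)
# Loss is computed only at masked positions: predict the hidden values from context.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
optimizer.step()
```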


What are "behavioral embeddings" and why are they so powerful?

The primary output of the PFM is not a simple score or prediction but a behavioral embedding vector. This is a dense, high-dimensional numerical representation that encapsulates the full semantic and behavioral context of a transaction. It distills hundreds of subtle temporal, geographical, and behavioral signals into a single, rich representation.

These embeddings are powerful because they are universal and reusable. They form a foundational intelligence layer that can be leveraged across Stripe's entire product ecosystem. This has profound strategic implications:

  • Centralized Intelligence: It centralizes the most computationally expensive part of the ML lifecycle—representation learning—into a single foundation model.

  • Accelerated Innovation: Downstream product teams can then build much simpler, more specialized models on top of these rich embeddings to achieve state-of-the-art performance for their specific use cases (like fraud, disputes, or authorizations).

This "embedding-first" approach dramatically speeds up the development of new AI-powered features, as teams no longer need to build complex models from scratch each time.


How do these embeddings lead to dramatic performance improvements?

The rich context captured in the embeddings allows downstream models to see patterns that were previously invisible. The most striking example is in card-testing fraud detection.

Card testing is a scheme in which fraudsters validate stolen card numbers by making many small, rapid-fire transactions. A traditional model that looks at each transaction in isolation would struggle to see this, as each individual purchase might look harmless.

A transformer model using these embeddings, however, can analyze the entire sequence. Its self-attention mechanism can identify the subtle, long-range dependencies that characterize a card-testing attack—like a high velocity of transactions from a new device, correlations between seemingly unrelated IP addresses, and sequential probing of card numbers. It learns the "shape" of a fraudulent sequence. This is why Stripe reported its detection rate for this type of fraud jumped from 59% to 97% almost overnight when the PFM was deployed.
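As a rough illustration of why sequence-level scoring changes the picture, the sketch below assumes per-transaction embeddings are already available and uses a small self-attention encoder to score an entire history at once. The model sizes and pooling choice are assumptions for illustration, not Stripe's design.

```python
# Sequence-level scoring sketch: self-attention reads the whole history, so
# velocity and ordering patterns (the "shape" of a card-testing run) become
# visible in a way per-transaction models miss.
import torch
import torch.nn as nn

D = 64

class SequenceScorer(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, 1)

    def forward(self, txn_embeddings):               # shape: (batch, seq_len, D)
        h = self.encoder(txn_embeddings)             # attention mixes signals across the sequence
        pooled = h.mean(dim=1)                       # summarize the whole history
        return torch.sigmoid(self.head(pooled))      # probability the *sequence* is an attack

scorer = SequenceScorer()
history = torch.randn(1, 50, D)                      # 50 recent transactions from one card/device
attack_probability = scorer(history)
```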
