Label · Roster · Match

EVE-3

Every match simulator assumes something about what a team is. The cheap answer is a label: a token, a lookup in an embedding table. That is the weak version, and it works until the league adds new teams. EVE-3 takes the other answer. A team is the eleven players on the pitch, and the match is what they make.

About this site

This site documents EVE-3, the transformer that serves match simulations at sim.nwslnotebook.com. The model is trained on roughly 2.5 million NWSL events between 2016 and 2026: passes, tackles, shots, clearances. Given the first ten events of a real match and the two starting lineups, it generates the remaining ninety minutes one event at a time.
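The rollout described above can be sketched as a generic autoregressive loop. Everything here (the `model` callable, the event-token names, the stopping condition) is an illustrative stand-in for exposition, not EVE-3's actual interface.

```python
import random

def generate_match(model, seed_events, home_roster, away_roster, max_events=1200):
    """Autoregressively extend the opening events of a real match.

    `model(events, rosters)` is assumed to return a dict mapping each
    candidate next-event token to its probability -- a hypothetical
    stand-in for EVE-3's real interface.
    """
    events = list(seed_events)
    rosters = (home_roster, away_roster)
    while len(events) < max_events and events[-1] != "FULL_TIME":
        probs = model(events, rosters)                     # next-event distribution
        tokens, weights = zip(*probs.items())
        events.append(random.choices(tokens, weights)[0])  # sample, don't argmax
    return events

# Toy stand-in model: always ends the match on the next event.
toy = lambda events, rosters: {"FULL_TIME": 1.0}
seed = ["KICK_OFF"] + ["PASS"] * 9
match = generate_match(toy, seed,
                       [f"home_{i}" for i in range(11)],
                       [f"away_{i}" for i in range(11)])
```

Sampling rather than taking the argmax is what makes repeated runs of the same matchup diverge: the distribution, conditioned on the rosters, is the model's answer.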

What makes EVE-3 worth documenting is what it does not use. There is no team embedding. There is no learned vector that says "Portland" or "Gotham." The only thing that tells the model which team is acting is the roster of eleven players on each side of the pitch, and every prediction is conditioned on that roster. Swap a player. The match changes.

The Essay is the argument. Why no single representation of a team is enough, and why EVE-3 makes the choice it does. The Log is the dated record of how the model was built. Thirty-two entries across seven eras, from standing up the event pipeline in February to the identity change in April that produced the current checkpoint. The Attention Depth, Apparatus, and Capacity tabs are interactive explanations of how a transformer ingests soccer events. The Simulate tab is live: pick a matchup, generate, and watch the model produce a match event by event.

Attention Depth

Watch how multi-head attention encodes relational structure beyond simple bigram statistics. Each colored stream is an attention head learning different patterns from the same event sequence.

Sequence Length 40
Num Heads 4
Context Depth 0.60
Compare Models
Playback
Position 0
Speed 1.0

What you're seeing: A synthetic event sequence rendered as tokens. Each attention head (colored arc) attends to different prior events. Markov sees only the last event. Feedforward sees all but has no temporal order. Multi-head attention discovers compositional relational structure — the thing Tom's paper undersells.

The Encoding Apparatus

Flusser: the apparatus programs the possible. Compare what the model sees (pre-discretized tokens) vs. what the game actually contains (continuous flow). The gap is what the encoding decided before training began.

Spatial Resolution 10
Event Vocabulary 41
Time Granularity 9
Left: Continuous match flow — positions, velocities, off-ball movement. Right: What the model actually receives — discretized grid cells, labeled event tokens. The gap between them is the information the apparatus discards. Adjust the resolution to see how much of the model's "knowledge" is decided by the encoding rather than learned.
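The discretization the panel contrasts can be made concrete with a toy tokenizer. The 10×10 grid, the pitch dimensions, and the token format below are illustrative assumptions, not the real encoding.

```python
def to_token(x, y, event_type, grid=10, pitch=(105.0, 68.0)):
    """Collapse a continuous pitch position into one of grid*grid cells.

    Everything finer than the cell size -- velocity, body shape,
    off-ball runs -- is discarded before training ever starts.
    """
    col = min(int(x / pitch[0] * grid), grid - 1)
    row = min(int(y / pitch[1] * grid), grid - 1)
    return f"{event_type}@{col},{row}"

# Two passes two metres apart land in the same cell: after encoding,
# the model literally cannot distinguish them.
a = to_token(50.0, 30.0, "PASS")
b = to_token(52.0, 30.0, "PASS")
```

Raising the grid resolution shrinks the cells and moves information from "decided by the apparatus" to "available for learning", at the cost of a larger spatial vocabulary.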
Memorization vs. Generalization

Is the model learning abstract structure or memorizing frequent subsequences? Watch how a small vocabulary + regular domain might make a large model unnecessary.

Vocab Size 114
Parameters (M) 33.0
Domain Regularity 0.56
Training Events (M) 2.5
The question: With 55.8% of events being passes and only 114 tokens, the effective combinatorial space is far smaller than 114^128. The red region shows what the model can memorize. The green region shows what requires genuine generalization. When red covers most of the space, the model might just be a lookup table.
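The shrinkage the panel gestures at can be checked with a back-of-envelope entropy calculation using the panel's own figures (114 tokens, 55.8% passes). The assumption that the remaining tokens share the leftover probability equally is purely illustrative.

```python
import math

vocab = 114     # token vocabulary from the panel
p_pass = 0.558  # share of events that are passes

# Entropy of a uniform distribution vs. a skewed one where a single
# token dominates and the rest share the remainder equally.
uniform_bits = math.log2(vocab)
rest = (1 - p_pass) / (vocab - 1)
skewed_bits = -(p_pass * math.log2(p_pass)
                + (vocab - 1) * rest * math.log2(rest))

# 2 ** entropy is the "effective vocabulary": how many equally likely
# tokens would produce the same uncertainty per step.
effective_vocab = 2 ** skewed_bits
```

Under this toy assumption the per-step uncertainty drops from about 6.8 bits to about 4 bits, i.e. an effective vocabulary of roughly 16 tokens rather than 114, which is why the memorizable (red) region can cover so much of the space.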
Question · Hypothesis · Verdict

Log

A trained model is the terminus of a chain of decisions. Which features go in. Which objective you minimize. Which optimizer you trust. Which checkpoint you pick. Change any of them, and you get a different model with different knowledge. This log records the chain that produced EVE-3. Thirty-two entries, seven eras, February through April.

team = roster(11 players)

The load-bearing decision, the one that named the model EVE-3, was removing the team identity embedding. A team is no longer a row in a lookup table. It is the set of eleven players attending to the event stream. Everything below is the work it took to make that equation functional in the weights, not just drawn in the architecture diagram.

Entity Shortcut

team_id → cheap gradient path

Cross-attention to twenty-two player embeddings is work. A single lookup in a team-identity table is not. The model will take the shortcut unless the shortcut is taken away.
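The cost asymmetry shows up in the shape of the two computations. This is a minimal single-head sketch in plain Python lists, not EVE-3's architecture; the names and dimensions are illustrative.

```python
import math

def roster_attention(event_query, player_embeddings):
    """The expensive path: one head of cross-attention from an event
    to the on-pitch player embeddings. Real models use matrices; this
    shows only the shape of the work."""
    d = len(event_query)
    scores = [sum(q * k for q, k in zip(event_query, p)) / math.sqrt(d)
              for p in player_embeddings]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Weighted mix of the player embeddings.
    return [sum(w * p[i] for w, p in zip(weights, player_embeddings))
            for i in range(d)]

def team_lookup(team_id, table):
    """The cheap path: one row, no players, no work."""
    return table[team_id]
```

With both paths available, gradient descent has no reason to pay for the attention; removing `team_lookup` from the inputs is what forces the roster path to carry the signal.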

Expansion Asymmetry

∄ history(Denver, Boston)

A model that learns teams as entities has no row for a team that has never played. A model that learns teams as rosters inherits whatever its signed players already know.

Pipeline Bug

possession_run_len_bin → <UNK>

A feature the tokenizer expected was silently replaced with <UNK> in every training run since EVE-1. The model had been learning without the signal it was designed to learn from. The bug predates the model.

Local and Global

val_loss ↓ ≠ rollout ↑

Competitive per-step prediction compounds into bad full-match rollouts. A training horizon of four events never sees the compounding. The gradient cannot teach what the loss cannot see.

Silent Collapse

cross_scale → 0

Left alone, the model drives its roster cross-attention toward zero. The cheap path wins by default. A capability offered in the architecture is not the same as a capability defended in the loss.
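One generic way to defend such a capability in the loss (an assumption for illustration, not necessarily the fix the log records) is an auxiliary penalty that prices the collapse of the learned gate:

```python
def gate_penalty(cross_scale, floor=0.1, weight=1.0):
    """Auxiliary loss term: zero while the cross-attention gate stays
    above `floor`, growing quadratically as it collapses toward zero.
    Added to the main objective so the cheap path has a price.
    `floor` and `weight` are illustrative hyperparameters."""
    shortfall = max(0.0, floor - cross_scale)
    return weight * shortfall ** 2
```

A penalty can only keep the pathway alive; it cannot make the pathway informative. That is why the stronger move, described under The Test, is removing the shortcut entirely.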

The Test

swap Δ = 0.029 · expansion ✓

Remove team_id. Retrain. Swap a player and measure whether the simulation changes. EVE-3.4 is the first EVE where the measurement comes back positive.
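The swap test itself is simple to state in code. Here `simulate` is assumed to return a scalar summary of one generated match (for example, home expected goals); the helper and its arguments are hypothetical, but the delta it returns is the kind of number the log reports.

```python
def swap_sensitivity(simulate, lineup, bench_player, slot, n_runs=100):
    """Average a match summary over repeated simulations, swap one
    player into `slot`, re-average, and report the absolute change."""
    base = sum(simulate(lineup) for _ in range(n_runs)) / n_runs
    swapped = list(lineup)
    swapped[slot] = bench_player
    alt = sum(simulate(swapped) for _ in range(n_runs)) / n_runs
    return abs(alt - base)

# A model that ignores the roster fails the test: delta is exactly zero.
ignores_roster = lambda lineup: 1.4
delta = swap_sensitivity(ignores_roster,
                         [f"p{i}" for i in range(11)], "sub", slot=9)
```

A nonzero delta is necessary, not sufficient: it shows the roster is load-bearing, and the expansion check then asks whether the effect generalizes to teams with no history.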

The raw entries live in paper/scientific-log.md. This page shows a curated public view; the markdown is the canonical record.

The Formal Argument

Why No Single Representation of a Team Suffices

An essay on names, labels, embeddings, rosters, and what it takes to represent a team well enough to simulate it.

Label Is Not Identity

Every representational system can compress a team into some token. A three-letter code. A row in a database. A one-hot vector over a fixed league. A learned embedding in a neural network. In that weak sense, identifying a team is banal. A league table places twelve teams side by side, each a string, each distinct from the others. A fixtures list pairs them off. A ranking orders them. No one disputes that a team can be labeled.

The strong question is different. Can one label preserve the way a team plays? Can it carry the tactical rhythm, the personnel of the week, the form of the season, the adjustment a manager makes at halftime? Once the question is posed that way, the label starts to fracture.

The basic pressure is easy to state. Some representations are optimized for identification. They are evaluated by whether they distinguish one team from another. Others are optimized for simulation. They are evaluated by what they produce when you run them. Tokens distinguish. Rosters act. A league table is a grid of tokens. A simulated match is a record of what rosters did. That difference is not a defect in the framework. It is part of the point. The same string, "Portland Thorns," names something different in 2013, in 2018, and in 2025. Each is the same club legally and commercially. None of them would play the same match.

Names Outrun Labels

Name and Season

Start with the humble team name. "Washington Spirit" is a fixed string. The proposition it names changes with the year, with the coach, with the roster at that specific moment. Any formal reconstruction has to smuggle those parameters back in: a context record that casual use quietly relies on and structured systems must make explicit. The string is not the whole message. It never was.

Continuity and Rebrand

Something similar happens when a club reorganizes. "Chicago Red Stars" becomes "Chicago Stars" in the middle of a decade. "Kansas City Current" inherits a city but not a franchise history. The literal parse is not false. It is simply too thin to capture what the continuity did to the identity. A fan can model this, but only by adding hidden parameters for history, venue, ownership, and supporter culture. The bare name is not enough.

Expansion

There is also the case where a name is assigned before anything has been played under it. "Denver Summit" in its first season names a team with no matches in its history. The name exists. The pattern the name should predict does not exist yet. Classical identification wants a fixed reference. A new team admits borderline cases: roster without record, future without past.

The lesson is not that names are irrational. It is that their power lies partly in being underdetermined by the string and overdetermined by use.

Statistics Alternates Between Elegant Compilation and Hard Refusal

The Success Case

Some representations of a team compile beautifully. Possession percentage. Expected goals per match. Pass completion rate. A season-long average compresses thousands of actions into a scalar, and many useful questions are answered by that scalar. Once you have seen enough examples like that, it is tempting to think a team is just a bundle of rates. The temptation fails.

The Identity Boundary

The scoreline of a specific match is perfectly precise as data and yet impossible to predict from rates alone. That is not an inconvenience or a missing feature. The match happens at a specific moment, with specific players, against a specific opponent, in a specific state. Average rates describe what tends to happen. They cannot say who was fouled in the 67th minute, or whose substitution changed the shape, or which player ran into the channel that produced the assist. Statistics describe tendencies; no finite set of scalars can settle a specific match in full.

Existence Without Recipe

A softer but equally important gap appears in what scalars imply. You can show that one team outperforms another on average without producing a single prediction about what they will do when they meet. The comparison is secured before the recipe. Match-level simulation demands more: not which team tends to win, but what this team, tonight, will do on this pass in this minute against this defensive shape.

The team-as-rate gives you a description without an instruction. Simulation needs both.

Some Representations Think in Sets

A lineup can be encoded as an ordered list or as an unordered set. The list is more specific in one way: slot one is the goalkeeper, slot four is a center back, slot nine is a striker. The set is more general in the other direction. It says who is playing without pinning them to positions that shift during a match.

A set is not a list with ordering removed. It is a different kind of object, evaluated by different operations. Two lineups with the same players in different slots are two lists and one set. A team that rotates its midfielders across a match has one set and many effective lists. The cognitive machine you choose to represent the team commits you, in advance, to which comparisons are easy and which require extra work.

The same information can sit in two different representations and make radically different inferences easy or hard. Serializing a set into a canonical list preserves the players and loses the flexibility. The representation is the machine.
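The list/set distinction is concrete in any language that has both types; the player names below are placeholders.

```python
# Two lineups: the same eleven players, two different slot assignments.
lineup_a = ["gk", "rb", "cb1", "cb2", "lb", "dm", "cm", "am", "rw", "st", "lw"]
lineup_b = ["gk", "rb", "cb2", "cb1", "lb", "cm", "dm", "am", "rw", "st", "lw"]

# As lists (ordered): two different objects. Slot comparisons are easy;
# "same personnel?" requires extra work.
as_lists_equal = lineup_a == lineup_b

# As sets (unordered): one object. Personnel comparison is free;
# positional structure is gone.
as_sets_equal = set(lineup_a) == set(lineup_b)
```

Choosing one of these types is choosing, in advance, which of the two questions the representation answers for free.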

Teams Encode Through Players

A team looks at first like a self-contained unit. You see a logo, a city, a coach. The system seems legible from the outside. But the closer you look, the less self-sufficient the unit becomes. A team is its players, plus the formation they are in, plus the match state, plus the opponent they are facing, plus what happened in the last three minutes. Remove any one and the team, as a predictive thing, stops existing.

The point becomes unmistakable over a season. The same team, on paper, plays differently because its captain is injured. The same name, with a new striker, presses higher. The same kit, on the road in August, plays a different shape than at home in October. A team is not a self-reading string. It is a relation between its constituents, its state, and its context.

A team is not downstream of its players. They co-arise.

Learned Models Translate Teams Into Geometry

The twenty-first-century twist on this argument is that we now build large representational machines empirically, and we can watch what they do with a team.

A neural network trained on match events is a stack of learned operations. Nothing in its specification says "team" or "style" or "personnel." The features we give it are a list of decisions: what to tokenize, what to embed, what to throw away. Some of these decisions name a team by a learned vector. Others name a team by the players attending to the event stream. The same training data routed through these two choices is two different models with two different internal representations of a team.

When a model is trained to reproduce match events with a team-identity embedding in the input, the network will use whichever signal is cheapest. A single lookup is cheaper than attention across a roster of players. The shortcut wins by default. The roster is architecturally present and operationally absent. The model carries a representation of the team that was not the representation we thought we were teaching it.

This is not a criticism of neural networks. It is a general fact about any learned or compiled representational system. Encode, store, and reason are three different operations, and they can quietly disagree. What goes in is not always what gets used. The architecture imposes its own geometry, and its only form of reasoning is reasoning inside that geometry. The only reliable way to make a representation load-bearing is to remove every cheaper alternative to it.

No Final Team, Only Families of Representation

The examples do not show that representing a team is hopeless. On the contrary. We do it constantly. League tables, box scores, xG maps, tactical diagrams, scouting reports, match simulations. We move fluently between a name, a rate, a lineup, a formation, a simulation. What fails is the stronger fantasy of a single representation that preserves every relevant feature of a team across every use.

Names lose tactical structure when abbreviated. Statistics lose individual identity when averaged. Rosters lose formation when listed as sets. Formations lose personnel when drawn as shapes. Learned models lose the thing they were trained on the moment training passes through them, because the substrate has its own geometry and insists on being heard.

This is an older intuition made specific. To identify a team is to be in a symbolic relation to it, not in a truth relation with it. A representation is a choice about which relation to emphasize. The present essay is the same claim made particular, one pressure point at a time, one representation at a time.

That is why the strongest sentence here is not a slogan about labels. It is the working equation:

team = roster + formation + match state + native inference rules

Once roster, formation, match state, and native inference are admitted into the story, a universal team label remains possible. A universal simulation-preserving representation does not. The issue is not a temporary engineering limitation. It is structural. The world contains too many ways to identify, describe, and predict a team, and too many ways for a team to act and to change.

A team is never just a name. It is players, plus context, plus what they do next.
Eleven Preregistered Benchmarks

Results

Each benchmark answers a question registered before the run. Every result is published, pass or fail. Confidence intervals come from a ten-thousand-resample bootstrap over the unit of independence.
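A percentile bootstrap over the unit of independence can be sketched as follows. The percentile method and the toy mean statistic are assumptions here; the benchmarks' exact variant is documented in the methods files.

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_resamples=10_000, alpha=0.05, seed=0):
    """Resample `values` with replacement, recompute the statistic each
    time, and read the (1 - alpha) interval off the empirical
    distribution of resampled statistics."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(stat([values[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_resamples))
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Resampling at the unit of independence (for example, matches rather than individual events) keeps the interval honest when events within a match are correlated.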

The preregistration is frozen at experiments/evals/release-benchmarks/preregistration.md. Per-benchmark write-ups, raw outputs, and methods details are in experiments/evals/release-benchmarks/.

Run Simulations

Same teams, same season — different match every time. Run multiple simulations to see the space of possibilities.

Home Team
Away Team
Parallel Simulations 1
Playback
Speed 1.0

How it works: EVE-3 reads the first 10 events of a real match, then generates the rest autoregressively — one event at a time, ~1,200 total. Each team is identified by the 11 players on the pitch, not a team ID — different rosters produce different simulations for the same matchup.