Every match simulator assumes something about what a team is. The cheap answer is a label: a token, a lookup in an embedding table. That is the weak version, and it works until the league adds new teams. EVE-3 takes the other answer. A team is the eleven players on the pitch, and the match is what they make.
This site documents EVE-3, the transformer that serves match simulations at sim.nwslnotebook.com. The model is trained on roughly 2.5 million NWSL events between 2016 and 2026: passes, tackles, shots, clearances. Given the first ten events of a real match and the two starting lineups, it generates the remaining ninety minutes one event at a time.
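The generation loop described above can be sketched in a few lines. Everything here is a stand-in: `next_event_distribution`, the event strings, and the toy model are hypothetical, not EVE-3's actual interface.

```python
import random

def simulate_match(model, seed_events, home_roster, away_roster, max_events=2000):
    """Autoregressive rollout: condition on the seed events and both rosters,
    then sample one event at a time until full time (hypothetical interface)."""
    events = list(seed_events)                    # the first ten real events
    while len(events) < max_events:
        # Every prediction is conditioned on the two rosters, never a team id.
        probs = model.next_event_distribution(events, home_roster, away_roster)
        event = random.choices(list(probs), weights=list(probs.values()))[0]
        events.append(event)
        if event == "FULL_TIME":
            break
    return events

class ToyModel:
    """Stand-in for the transformer: always calls full time immediately."""
    def next_event_distribution(self, events, home_roster, away_roster):
        return {"PASS": 0.0, "FULL_TIME": 1.0}

sim = simulate_match(ToyModel(), ["KICK_OFF"], ["p"] * 11, ["q"] * 11)
```

The point of the shape, not the toy: the rosters enter the conditioning at every step, so changing one player changes every subsequent prediction.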
What makes EVE-3 worth documenting is what it does not use. There is no team embedding. There is no learned vector that says "Portland" or "Gotham." The only thing that tells the model which team is acting is the roster of eleven players on each side of the pitch, and every prediction is conditioned on that roster. Swap a player. The match changes.
The Essay is the argument. Why no single representation of a team is enough, and why EVE-3 makes the choice it does. The Log is the dated record of how the model was built. Thirty-two entries across seven eras, from standing up the event pipeline in February to the identity change in April that produced the current checkpoint. The Attention Depth, Apparatus, and Capacity tabs are interactive explanations of how a transformer ingests soccer events. The Simulate tab is live: pick a matchup, generate, and watch the model produce a match event by event.
A trained model is the terminus of a chain of decisions. Which features go in. Which objective you minimize. Which optimizer you trust. Which checkpoint you pick. Change any of them, and you get a different model with different knowledge. This log records the chain that produced EVE-3. Thirty-two entries, seven eras, February through April.
The load-bearing decision, the one that named the model EVE-3, was removing the team identity embedding. A team is no longer a row in a lookup table. It is the set of eleven players attending to the event stream. Everything below is the work it took to make that equation functional in the weights, not just drawn in the architecture diagram.
Cross-attention to twenty-two player embeddings is work. A single lookup in a team-identity table is not. The model will take the shortcut unless the shortcut is taken away.
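The asymmetry is visible in miniature. Below is a toy single-head sketch in numpy; the dimensions, the single head, and the fourteen-row team table are illustrative, not the production layer.

```python
import numpy as np

def roster_cross_attention(event_q, player_embs):
    """One attention head, toy form: the current event (query) attends over
    the 22 player embeddings (keys doubling as values)."""
    d = event_q.shape[-1]
    scores = player_embs @ event_q / np.sqrt(d)   # one score per player, (22,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over the roster
    return weights @ player_embs                  # roster-weighted context, (16,)

rng = np.random.default_rng(0)
event_q = rng.normal(size=16)
players = rng.normal(size=(22, 16))               # 11 per side, 22 on the pitch

ctx = roster_cross_attention(event_q, players)    # the expensive path: real work

team_table = rng.normal(size=(14, 16))            # a team-identity table
ctx_cheap = team_table[3]                         # the shortcut: a single lookup
```

The first path moves when any of the twenty-two embeddings moves; the second is a constant row regardless of who is on the pitch.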
A model that learns teams as entities has no row for a team that has never played. A model that learns teams as rosters inherits whatever its signed players already know.
A feature the tokenizer expected was silently replaced with <UNK> in every training run since EVE-1. The model had been learning without the signal it was designed to learn from. The bug predates the model.
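A guard of the following shape would have surfaced the bug on the first run; the vocabulary, the token names, and the one-percent threshold are illustrative, not EVE-3's actual tokenizer.

```python
from collections import Counter

UNK = "<UNK>"

def encode(tokens, vocab):
    """Map feature tokens to ids, falling back to <UNK> for unknowns."""
    return [vocab.get(t, vocab[UNK]) for t in tokens]

def check_unk_rate(encoded, vocab, max_unk_rate=0.01):
    """Fail loudly if a feature is being silently replaced with <UNK>.
    The 1% threshold is an illustrative choice, not EVE-3's actual value."""
    counts = Counter(encoded)
    rate = counts[vocab[UNK]] / max(len(encoded), 1)
    if rate > max_unk_rate:
        raise ValueError(f"<UNK> rate {rate:.1%} exceeds {max_unk_rate:.0%}: "
                         "a feature the tokenizer expects is probably missing")
    return rate

vocab = {"<UNK>": 0, "PASS": 1, "SHOT": 2, "TACKLE": 3}
clean = encode(["PASS", "SHOT", "TACKLE"], vocab)
bad = encode(["PASS", "PROGRESSIVE_CARRY"], vocab)  # unknown token becomes <UNK>
```

The silent fallback is the trap; the check turns a quiet substitution into a loud failure before any gradient step depends on it.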
Competitive per-step prediction compounds into bad full-match rollouts. A training horizon of four events never sees the compounding. The gradient cannot teach what the loss cannot see.
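The mismatch between the two regimes can be made concrete; `model.nll` and `model.sample` below are hypothetical stand-ins for the real interface.

```python
def teacher_forced_loss(model, events, horizon=4):
    """Training-style objective: each of the last `horizon` events is predicted
    from REAL history, so a mistake at step t never contaminates step t+1."""
    total = 0.0
    for t in range(len(events) - horizon, len(events)):
        total += model.nll(context=events[:t], target=events[t])
    return total / horizon

def rollout(model, seed, steps):
    """Inference-style rollout: each prediction conditions on the model's OWN
    prior outputs, so errors compound. The loss above never sees this regime."""
    history = list(seed)
    for _ in range(steps):
        history.append(model.sample(context=history))
    return history[len(seed):]

class ConstantModel:
    """Stand-in model: constant per-step loss, constant sampled event."""
    def nll(self, context, target):
        return 1.0
    def sample(self, context):
        return "PASS"

m = ConstantModel()
train_view = teacher_forced_loss(m, ["a", "b", "c", "d", "e", "f"])
eval_view = rollout(m, ["a"], steps=3)
```

A full match is a rollout thousands of steps long; a four-event teacher-forced window gives the gradient no view of that regime at all.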
Left alone, the model drives its roster cross-attention toward zero. The cheap path wins by default. A capability offered in the architecture is not the same as a capability defended in the loss.
Remove team_id. Retrain. Swap a player and measure whether the simulation changes. EVE-3.4 is the first EVE where the measurement comes back positive.
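The swap measurement can be phrased as a distance between the event-type distributions of two rollouts that differ by one starter. The Jensen-Shannon form and the toy distributions below are an illustration, not the published metric.

```python
import math
from collections import Counter

def event_distribution(events):
    """Empirical distribution over event types in one rollout."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two event-type distributions.
    Zero means the swap changed nothing: the identity signal is inert."""
    keys = set(p) | set(q)
    def kl(a, b):
        return sum(a[k] * math.log(a[k] / max(b.get(k, 0.0), 1e-12))
                   for k in keys if a.get(k, 0.0) > 0)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical rollouts: the same fixture, one starter swapped.
base = event_distribution(["PASS"] * 80 + ["SHOT"] * 10 + ["TACKLE"] * 10)
swap = event_distribution(["PASS"] * 70 + ["SHOT"] * 20 + ["TACKLE"] * 10)
```

"Comes back positive" in this framing means the divergence is reliably above zero across seeds, not that any single number crosses a magic threshold.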
The raw entries live in paper/scientific-log.md. This page shows a curated public view; the markdown is the canonical record.
An essay on names, labels, embeddings, rosters, and what it takes to represent a team well enough to simulate it.
Every representational system can compress a team into some token. A three-letter code. A row in a database. A one-hot vector over a fixed league. A learned embedding in a neural network. In that weak sense, identifying a team is banal. A league table places twelve teams side by side, each a string, each distinct from the others. A fixtures list pairs them off. A ranking orders them. No one disputes that a team can be labeled.
The strong question is different. Can one label preserve the way a team plays? Can it carry the tactical rhythm, the personnel of the week, the form of the season, the adjustment a manager makes at halftime? Once the question is posed that way, the label starts to fracture.
The basic pressure is easy to state. Some representations are optimized for identification. They are evaluated by whether they distinguish one team from another. Others are optimized for simulation. They are evaluated by what they produce when you run them. Tokens distinguish. Rosters act. A league table is a grid of tokens. A simulated match is a record of what rosters did. That difference is not a defect in the framework. It is part of the point. The same string, "Portland Thorns," names something different in 2013, in 2018, and in 2025. Each is the same club legally and commercially. None of them would play the same match.
Start with the humble team name. "Washington Spirit" is a fixed string. The proposition it names changes with the year, with the coach, with the roster at that specific moment. Any formal reconstruction has to smuggle those parameters back in: a context record that casual use quietly relies on and structured systems must make explicit. The string is not the whole message. It never was.
Something similar happens when a club reorganizes. "Chicago Red Stars" becomes "Chicago Stars" in the middle of a decade. "Kansas City Current" inherits a city but not a franchise history. The literal parse is not false. It is simply too thin to capture what the continuity did to the identity. A fan can model this, but only by adding hidden parameters for history, venue, ownership, and supporter culture. The bare name is not enough.
There is also the case where a name is assigned before anything has been played under it. "Denver Summit" in its first season names a team with no matches in its history. The name exists. The pattern the name should predict does not exist yet. Classical identification wants a fixed reference. A new team admits borderline cases: roster without record, future without past.
The lesson is not that names are irrational. It is that their power lies partly in being underdetermined by the string and overdetermined by use.
Some representations of a team compile beautifully. Possession percentage. Expected goals per match. Pass completion rate. A season-long average compresses thousands of actions into a scalar, and many useful questions are answered by that scalar. Once you have seen enough examples like that, it is tempting to think a team is just a bundle of rates. The temptation fails.
The scoreline of a specific match is perfectly precise as data and yet impossible to predict from rates alone. That is not an inconvenience or a missing feature. The match happens at a specific moment, with specific players, against a specific opponent, in a specific state. Average rates describe what tends to happen. They cannot say who was fouled in the 67th minute, or whose substitution changed the shape, or which player ran into the channel that produced the assist. Statistics describe tendencies; a specific match is a fact that no finite bundle of scalars can settle in full.
A softer but equally important gap appears in what scalars imply. You can show that one team outperforms another on average without producing a single prediction about what they will do when they meet. The comparison is secured without any recipe for the match itself. Match-level simulation demands more: not which team tends to win, but what this team, tonight, will do on this pass in this minute against this defensive shape.
A lineup can be encoded as an ordered list or as an unordered set. The list is more specific in one way: slot one is the goalkeeper, slot four is a center back, slot nine is a striker. The set is more general in the other direction. It says who is playing without pinning them to positions that shift during a match.
A set is not a list with ordering removed. It is a different kind of object, evaluated by different operations. Two lineups with the same players in different slots are two lists and one set. A team that rotates its midfielders across a match has one set and many effective lists. The representational machine you choose for the team commits you, in advance, to which comparisons are easy and which require extra work.
The same information can sit in two different representations and make radically different inferences easy or hard. Serializing a set into a canonical list preserves the players and loses the flexibility. The representation is the machine.
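The difference is concrete in a few lines of Python; the four-slot lineup and the player labels are placeholders for the eleven.

```python
# An ordered lineup: the index is a slot (0 = goalkeeper, and so on).
list_a = ["Keeper", "Back", "Mid", "Striker"]  # four slots for brevity
list_b = ["Keeper", "Mid", "Back", "Striker"]  # same players, two slots swapped

assert list_a != list_b             # two lists
assert set(list_a) == set(list_b)   # one set

# Serializing the set into a canonical list keeps membership, invents an order.
canonical = sorted(set(list_b))
assert set(canonical) == set(list_a)   # the players survive
assert canonical != list_a             # the slot structure does not
```

Equality, membership, and ordering are different operations; which of them is cheap is decided the moment you pick the container.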
A team looks at first like a self-contained unit. You see a logo, a city, a coach. The system seems legible from the outside. But the closer you look, the less self-sufficient the unit becomes. A team is its players, plus the formation they are in, plus the match state, plus the opponent they are facing, plus what happened in the last three minutes. Remove any one and the team, as a predictive thing, stops existing.
The point becomes unmistakable over a season. The same team, on paper, plays differently because its captain is injured. The same name, with a new striker, presses higher. The same kit, on the road in August, plays a different shape than at home in October. A team is not a self-reading string. It is a relation between its constituents, its state, and its context.
The twenty-first-century twist on this argument is that we now build large representational machines empirically, and we can watch what they do with a team.
A neural network trained on match events is a stack of learned operations. Nothing in its specification says "team" or "style" or "personnel." The features we give it are a list of decisions: what to tokenize, what to embed, what to throw away. Some of these decisions name a team by a learned vector. Others name a team by the players attending to the event stream. The same training data routed through these two choices is two different models with two different internal representations of a team.
When a model is trained to reproduce match events with a team-identity embedding in the input, the network will use whichever signal is cheapest. A single lookup is cheaper than attention across a roster of players. The shortcut wins by default. The roster is architecturally present and operationally absent. The model carries a representation of the team that was not the representation we thought we were teaching it.
This is not a criticism of neural networks. It is a general fact about any learned or compiled representational system. Encode, store, and reason are three different operations, and they can quietly disagree. What goes in is not always what gets used. The architecture imposes its own geometry, and its only form of reasoning is reasoning inside that geometry. The only reliable way to make a representation load-bearing is to remove every cheaper alternative to it.
The examples do not show that representing a team is hopeless. On the contrary. We do it constantly. League tables, box scores, xG maps, tactical diagrams, scouting reports, match simulations. We move fluently between a name, a rate, a lineup, a formation, a simulation. What fails is the stronger fantasy of a single representation that preserves every relevant feature of a team across every use.
Names lose tactical structure when abbreviated. Statistics lose individual identity when averaged. Rosters lose formation when listed as sets. Formations lose personnel when drawn as shapes. Learned models lose part of what they were trained on the moment the training signal passes through them, because the substrate has its own geometry and insists on being heard.
This is an older intuition made specific. To identify a team is to be in a symbolic relation to it, not in a truth relation with it. A representation is a choice about which relation to emphasize. The present essay is the same claim made particular, one pressure point at a time, one representation at a time.
That is why the strongest sentence here is not a slogan about labels. It is the working equation: a team is the eleven players attending to the event stream, and the match is what they make.
Once roster, formation, match state, and native inference are admitted into the story, a universal team label remains possible. A universal simulation-preserving representation does not. The issue is not a temporary engineering limitation. It is structural. The world contains too many ways to identify, describe, predict, and enact a team, and too many ways for a team to become something new.
Each benchmark answers a question registered before the run. Every result is published, pass or fail. Confidence intervals come from a ten-thousand-resample bootstrap over the unit of independence.
The preregistration is frozen at experiments/evals/release-benchmarks/preregistration.md.
Per-benchmark write-ups, raw outputs, and methodological details are in experiments/evals/release-benchmarks/.
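The bootstrap described above can be sketched as a percentile interval over resampled units. The scores and the helper below are illustrative; the resample count and the ninety-five percent interval match the text.

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling whole units
    (e.g. matches, the unit of independence) with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-match benchmark scores for one model checkpoint.
scores = [0.61, 0.58, 0.64, 0.59, 0.62, 0.60, 0.57, 0.63]
lo, hi = bootstrap_ci(scores)
```

Resampling at the unit of independence matters: resampling individual events from correlated matches would shrink the interval and overstate certainty.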