Interfere is an exploration of how a causal Transformer internally represents and reasons about the transition rules of a Game-of-Life-like system.
We train a deliberately overparameterized 25M-parameter Transformer on multiple variants of Conway’s Game of Life (GOL) and use mechanistic interpretability tools to probe what the model treats as first-class features.
The goal is not to solve GOL efficiently, but to study how a Transformer behaves when a simple, deterministic rule does exist—without giving it the right inductive biases.
- Train a Transformer to predict t+1 from (rules + t).
- Encode both the update rule and the 2D grid state as tokens (the rule dynamics are sketched after this list).
- Use attention patterns, linear probes, and PCA directions to understand:
  - What structure emerges?
  - Which features are linearly accessible?
  - Which properties does the model prioritize internally?
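For reference, here is a minimal sketch of the underlying dynamics, assuming the 18 rule bits are birth/survival flags indexed by live-neighbor count 0–8 (the standard encoding for life-like cellular automata). The bit layout, the wrap-around boundary handling, and the function name are illustrative assumptions, not the repo's actual code.

```python
import numpy as np

def step(grid: np.ndarray, rule_bits: np.ndarray) -> np.ndarray:
    """One life-like update. grid is (H, W) with 1 = alive, 0 = dead.

    rule_bits is a length-18 0/1 vector: bits 0-8 are birth flags and
    bits 9-17 are survival flags, indexed by live-neighbor count
    (assumed layout; the repo's encoding may differ).
    """
    birth, survive = rule_bits[:9], rule_bits[9:]
    # Live-neighbor counts; wrap-around (toroidal) boundaries are an assumption.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    born = (grid == 0) & (birth[neighbors] == 1)
    survives = (grid == 1) & (survive[neighbors] == 1)
    return (born | survives).astype(grid.dtype)

# Conway's Game of Life is B3/S23 in this encoding.
gol = np.zeros(18, dtype=np.int64)
gol[[3, 9 + 2, 9 + 3]] = 1
```

The model never sees a function like this; it only sees the 18 bits and the flattened grids as tokens.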
Each sequence is laid out as:
`18 rule bits, <SEP>, H×W (t), <SEP>, H×W (t+1)`
- Rules: 18 “physics bits” encoding different GOL-style update rules
- State: Binary grid, flattened row-major
- Positioning: Standard positional encoding + 2D RoPE on Q/K (added post-hoc)
- Training: MLM-style masking after `<SEP>` (see the construction sketch below)
- Scale: ~0.6B tokens, trained to near-perfect overfitting on GOL
- Hardware: bf16 on 8× H200 GPUs
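For concreteness, here is a rough sketch of how a single training sequence might be assembled and masked under the layout above. The token ids, the `-100` ignore-label convention, and the option to mask only part of the t+1 segment are illustrative assumptions, not the actual pipeline.

```python
import numpy as np

SEP, MASK = 2, 3  # illustrative ids; cells and rule bits use 0/1

def build_sequence(rule_bits, grid_t, grid_t1, mask_prob=1.0, rng=None):
    """Lay out [18 rule bits, <SEP>, H*W cells at t, <SEP>, H*W cells at t+1]
    and mask the t+1 segment MLM-style so the model must reconstruct it."""
    rng = rng or np.random.default_rng()
    tokens = np.concatenate([
        np.asarray(rule_bits),            # 18 "physics bits"
        [SEP],
        np.asarray(grid_t).reshape(-1),   # state at t, row-major
        [SEP],
        np.asarray(grid_t1).reshape(-1),  # state at t+1 (prediction targets)
    ]).astype(np.int64)

    labels = np.full_like(tokens, -100)        # -100 = ignored by the loss
    t1_start = 18 + 1 + np.asarray(grid_t).size + 1
    masked = rng.random(tokens.size - t1_start) < mask_prob
    labels[t1_start:][masked] = tokens[t1_start:][masked]
    tokens[t1_start:][masked] = MASK
    return tokens, labels
```

With `mask_prob=1.0` every t+1 cell is masked, which is a strict reading of "MLM-style masking after `<SEP>`"; a lower value would give a more conventional MLM objective.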
Model
- 8 layers, d_model = 512
- 8 attention heads / layer
- MLP dim = 2048
- Pre-LayerNorm, GELU (SwiGLU planned)
- Attention restricted to the prefix when predicting t+1
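As a quick sanity check, these hyperparameters roughly account for the 25M-parameter figure. The estimate below ignores embeddings, biases, and LayerNorms, and assumes a standard GELU MLP with two projections.

```python
n_layers, d_model, d_mlp = 8, 512, 2048

attn = 4 * d_model * d_model       # W_q, W_k, W_v, W_o:    1,048,576
mlp = 2 * d_model * d_mlp          # up + down projection:  2,097,152
total = n_layers * (attn + mlp)    # 8 * 3,145,728 = 25,165,824
print(f"{total / 1e6:.1f}M parameters (excluding embeddings)")  # 25.2M
```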
Hypotheses
- Dead vs alive tokens should be linearly separable.
- Local (3×3 neighborhood) attention should emerge.
- Some layers should specialize in rule-prefix processing.
- Linear probes + PCA should reveal which system properties are most salient.
Findings
- Early layers (L0–L1) develop strong local attention.
- ~5/8 heads attend locally; others attend far away.
- Striation patterns likely arise from the 2D RoPE setup (to be ablated).
- By late layers (L7), locality largely disappears.
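One way to quantify this locality is to measure, per head, how much attention mass a masked t+1 cell token places on the 3×3 neighborhood of the same cell in the t segment. The sketch below assumes attention weights for one sample are available as a (heads, seq, seq) array and that the sequence follows the layout above; the offsets and function name are illustrative.

```python
import numpy as np

def local_attention_fraction(attn, H, W, rule_len=18):
    """Average fraction of attention that each t+1 cell token places on the
    3x3 neighborhood of the corresponding cell in the t segment.

    attn: (n_heads, seq_len, seq_len) attention weights for one sample,
          with rows softmax-normalized.
    Assumed layout: [18 rule bits, <SEP>, H*W cells (t), <SEP>, H*W cells (t+1)].
    """
    t_start = rule_len + 1            # index of the first cell of the t grid
    t1_start = t_start + H * W + 1    # index of the first cell of the t+1 grid
    fractions = np.zeros(attn.shape[0])
    for c in range(H * W):            # row-major cell index
        cy, cx = divmod(c, W)
        keys = [
            t_start + (cy + dy) * W + (cx + dx)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if 0 <= cy + dy < H and 0 <= cx + dx < W
        ]
        # Rows already sum to 1, so this is directly a fraction of attention mass.
        fractions += attn[:, t1_start + c, keys].sum(axis=1)
    return fractions / (H * W)        # per-head average over all cells
```

Heads with a fraction near 1 would be the "local" heads described above; late-layer heads would be expected to score low.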
- Linear probes work, but less well than expected.
- The model does not cleanly encode exact neighbor counts.
- This aligns with GOL’s logic: thresholds matter more than exact values.
- Later layers are not consistently more linearly accessible.
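The kind of probe behind these observations can be sketched as a linear classifier from per-cell hidden states to a target property (alive/dead, or the live-neighbor count). How the activations are collected depends on the model code, so the arrays here are assumed inputs.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(acts, targets, seed=0):
    """acts:    (n_cells, d_model) hidden states at cell positions, one layer.
    targets: (n_cells,) property to decode, e.g. alive/dead or neighbor count."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts, targets, test_size=0.2, random_state=seed
    )
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# High accuracy for alive/dead but mediocre accuracy for exact neighbor counts
# would match the observation that thresholds, not counts, are what is encoded.
```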
- The top PCA direction in early layers cleanly separates alive vs dead cells.
- Unsurprising, but confirms this is a first-class feature.
- Other layers may encode more abstract structure (future work).
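The claim about the top PCA direction can be checked directly: project per-cell activations onto PC1 and see how well a single cut separates alive from dead cells; a score near 1.0 in early layers would match the finding. As before, the activation and label arrays are assumed to be gathered elsewhere.

```python
import numpy as np
from sklearn.decomposition import PCA

def pc1_alive_dead_accuracy(acts, alive):
    """acts:  (n_cells, d_model) activations at cell positions for one layer.
    alive: (n_cells,) binary labels, 1 = alive, 0 = dead."""
    pc1 = PCA(n_components=1).fit_transform(acts).ravel()
    mu_alive, mu_dead = pc1[alive == 1].mean(), pc1[alive == 0].mean()
    cut = 0.5 * (mu_alive + mu_dead)
    # Classify by which side of the midpoint a cell falls on along PC1.
    pred = pc1 > cut if mu_alive > mu_dead else pc1 < cut
    return float((pred == (alive == 1)).mean())
```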
Why GOL?
- GOL is trivially solvable by a CNN, but that's not the point.
- Transformers are routinely used as general function approximators.
- We want to understand:
  - What structure a Transformer discovers when inductive biases are misaligned.
  - How macroscopic behavior emerges from token-level computation.
This setup stays close to the Chinchilla frontier (within this toy domain) and asks:
What does a Transformer learn when a simple rule exists, but we don’t tell it how to look?