Interfere

Interfere is an exploration of how a causal Transformer internally represents and reasons about the transition rules of a Game-of-Life-like system.
We train a deliberately overparameterized 25M-parameter Transformer on multiple variants of Conway’s Game of Life (GOL) and use mechanistic interpretability tools to probe what the model treats as first-class features.

The goal is not to solve GOL efficiently, but to study how a Transformer behaves when a simple, deterministic rule does exist—without giving it the right inductive biases.


(rough) idea

  • Train a Transformer to predict t+1 from (rules + t).
  • Encode both the update rule and the 2D grid state as tokens.
  • Probe attention patterns, linear probes, and PCA directions to understand:
    • What structure emerges?
    • What features are linearly accessible?
    • Which properties the model prioritizes internally?
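
To make "GOL-style update rules" concrete, below is a minimal numpy sketch of a parameterized life-like stepper. The 18-bit decoding shown (9 birth bits + 9 survival bits, one per live-neighbor count) and the zero-padded boundary are assumptions for illustration; the repo's actual encoding may differ.

```python
import numpy as np
from scipy.signal import convolve2d

# Hypothetical decoding of the 18 "physics bits": bits 0-8 say whether a dead
# cell with that many live neighbors is born; bits 9-17 say whether a live
# cell with that many live neighbors survives.
KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])

def step(grid: np.ndarray, rule_bits: np.ndarray) -> np.ndarray:
    """One update of a life-like rule encoded as 18 bits (zero-padded borders)."""
    birth, survive = rule_bits[:9], rule_bits[9:]
    neighbors = convolve2d(grid, KERNEL, mode="same", boundary="fill")
    born = (grid == 0) & birth[neighbors].astype(bool)
    kept = (grid == 1) & survive[neighbors].astype(bool)
    return (born | kept).astype(np.int8)

# Standard Conway rule B3/S23: birth on 3 neighbors, survival on 2 or 3.
conway = np.zeros(18, dtype=np.int8)
conway[3] = 1
conway[9 + 2] = conway[9 + 3] = 1
```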

Data & training setup

Each sequence is laid out as:

```
<BOS>, 18 rule bits, <SEP>, H×W grid (t), <SEP>, H×W grid (t+1)
```

  • Rules: 18 “physics bits” encoding different GOL-style update rules
  • State: Binary grid, flattened row-major
  • Positioning: Standard positional encoding + 2D RoPE on Q/K (added post-hoc)
  • Training: MLM-style masking after <SEP>
  • Scale: ~0.6B tokens; trained until it fits GOL near-perfectly (deliberate overfitting)
  • Hardware: bf16 on 8× H200 GPUs
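
As a concrete sketch, a training example could be assembled and masked as below. The token IDs, the BOS token, and the policy of masking the entire t+1 grid are assumptions for illustration; the repo's actual vocabulary and masking schedule may differ.

```python
import numpy as np

# Hypothetical vocabulary: 0/1 for cell and rule-bit tokens, plus specials.
BOS, SEP, MASK = 2, 3, 4

def make_sequence(rule_bits, grid_t, grid_t1):
    """<BOS>, 18 rule bits, <SEP>, grid(t) row-major, <SEP>, grid(t+1)."""
    return np.concatenate([
        [BOS], rule_bits,
        [SEP], grid_t.ravel(),    # row-major flattening
        [SEP], grid_t1.ravel(),
    ]).astype(np.int64)

def mask_after_sep(seq, n_rule=18, hw=None):
    """MLM-style objective: mask the t+1 grid, predict the original tokens."""
    hw = hw if hw is not None else (len(seq) - n_rule - 3) // 2
    start = 1 + n_rule + 1 + hw + 1        # first token after the second <SEP>
    inputs, targets = seq.copy(), np.full_like(seq, -100)  # -100 = ignore in loss
    targets[start:] = seq[start:]
    inputs[start:] = MASK
    return inputs, targets
```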

Model

  • 8 layers, d_model=512
  • 8 attention heads / layer
  • MLP dim = 2048
  • Pre-LayerNorm, GELU (SwiGLU planned)
  • Attention restricted to the prefix (rules + grid at t) when predicting t+1
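
These dimensions land at roughly 25M parameters (≈3.15M per layer × 8). A plain PyTorch sketch of the architecture follows; it omits the 2D RoPE on Q/K and the prefix-restricted attention, and the vocabulary size is a placeholder.

```python
import torch.nn as nn

D_MODEL, N_LAYERS, N_HEADS, D_MLP = 512, 8, 8, 2048
VOCAB = 8  # placeholder: 0/1 cell tokens plus a few specials

encoder_layer = nn.TransformerEncoderLayer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    dim_feedforward=D_MLP,
    activation="gelu",
    norm_first=True,   # pre-LayerNorm
    batch_first=True,
)
model = nn.Sequential(
    nn.Embedding(VOCAB, D_MODEL),
    nn.TransformerEncoder(encoder_layer, num_layers=N_LAYERS),
    nn.Linear(D_MODEL, VOCAB),  # per-token logits for the MLM-style objective
)
```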

What we expected

  • Dead vs alive tokens should be linearly separable.
  • Local (3×3 neighborhood) attention should emerge.
  • Some layers should specialize in rule-prefix processing.
  • Linear probes + PCA should reveal which system properties are most salient.

What we observed

Local structure

  • Early layers (L0–L1) develop strong local attention.
  • Roughly 5 of 8 heads attend locally; the rest attend to distant positions.
  • Striation patterns likely arise from the 2D RoPE setup (to be ablated).
  • By late layers (L7), locality largely disappears.
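
One way to quantify this is the attention-weighted distance between query and key grid cells. A sketch, assuming per-head attention maps have been captured (e.g. via forward hooks) and that the grid(t) tokens occupy a known contiguous span:

```python
import numpy as np

def mean_attention_distance(attn, H, W, grid_start):
    """Attention-weighted Chebyshev distance between grid cells for one head.

    attn: (seq, seq) softmax attention weights for a single head.
    grid_start: sequence index of the first grid(t) token.
    A strictly 3x3-local head scores at most 1.
    """
    idx = np.arange(H * W)
    rows, cols = idx // W, idx % W
    dist = np.maximum(np.abs(rows[:, None] - rows[None, :]),
                      np.abs(cols[:, None] - cols[None, :]))
    g = attn[grid_start:grid_start + H * W, grid_start:grid_start + H * W]
    g = g / g.sum(axis=-1, keepdims=True)   # renormalize over grid keys only
    return float((g * dist).sum(axis=-1).mean())
```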

Linear probes

  • Linear probes work, but less cleanly than expected.
  • The model does not cleanly encode exact neighbor counts.
  • This aligns with GOL’s logic: thresholds matter more than exact values.
  • Later layers are not consistently more linearly accessible.
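
A typical probe of this kind fits a logistic regression from per-cell hidden states to a cell property. The arrays below (activations and live-neighbor counts) are assumed to be pre-extracted; this is a generic sketch, not the repo's probing code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(acts, labels):
    """acts: (n_cells, d_model) hidden states at grid(t) positions, one layer;
    labels: e.g. exact live-neighbor counts (0-8); chance is ~1/9 for 9 classes."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```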

PCA

  • The top PCA direction in early layers cleanly separates alive vs dead cells.
  • Unsurprising, but it confirms that cell aliveness is a first-class feature.
  • Other layers may encode more abstract structure (future work).
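
This is easy to check directly: center the activations, take the top right-singular vector, and compare the projections of alive vs dead cells. A minimal numpy sketch, again assuming pre-extracted activations and labels:

```python
import numpy as np

def pc1_alive_gap(acts, alive):
    """acts: (n_cells, d_model) early-layer activations; alive: boolean labels.
    Returns the alive-dead mean gap along the top principal component."""
    X = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # rows of vt = PCs
    proj = X @ vt[0]
    return float(proj[alive].mean() - proj[~alive].mean())
```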

Why use a Transformer?

  • GOL is trivially solvable by a CNN—but that’s not the point.
  • Transformers are routinely used as general function approximators.
  • We want to understand:
    • What structure a Transformer discovers when inductive biases are misaligned.
    • How macroscopic behavior emerges from token-level computation.

This setup stays close to the Chinchilla frontier (within this toy domain) and asks:

What does a Transformer learn when a simple rule exists, but we don’t tell it how to look?

About

Can you interfere with a model's understanding of the world?
