An Architectural Approach to Continual Learning Without Catastrophic Forgetting
Русский | Paper (LaTeX source)
Catastrophic forgetting occurs when neural networks trained on sequential tasks lose previously acquired knowledge. FCD (Frozen Core Decomposition) solves this through an architectural approach: task-specific weights are generated via a Tucker-style factorization with mode-3 contraction (often called Tucker-2 parameterization), and the shared core is frozen after the first task.
Note on terminology: We do not decompose each weight matrix $W_t$ separately. Instead, we parameterize a family of task-indexed weights via a shared 3-way core $S$ and factor matrices $U, V$, generating $W_t$ by contracting $S$ with the task vector $v_t$ along mode 3.
Key formula:

$$W_t = U\,(S \times_3 v_t)\,V, \qquad (S \times_3 v_t)_{ab} = \sum_{j=1}^{k} S_{abj}\, v_{t,j}$$

Where:
- $U \in \mathbb{R}^{d_{in} \times r}$, $V \in \mathbb{R}^{r \times d_{out}}$ — factor matrices
- $S \in \mathbb{R}^{r \times r \times k}$ — core tensor (frozen after task 1)
- $v_t \in \mathbb{R}^k$ — task-specific coefficient vector
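As a concrete illustration of the contraction, here is a minimal PyTorch sketch (not the library's `FCDModel` code; the dimensions mirror the defaults listed later in this README):

```python
import torch

# Dimensions as in the default configuration (input 784, hidden 256, r=32, k=16)
d_in, d_out, r, k = 784, 256, 32, 16

U = torch.randn(d_in, r)    # factor matrix, shared across tasks
S = torch.randn(r, r, k)    # core tensor, frozen after task 1
V = torch.randn(r, d_out)   # factor matrix, shared across tasks
v_t = torch.randn(k)        # task-specific coefficient vector

# Mode-3 contraction S x_3 v_t collapses the task axis into an (r, r) matrix
S_t = torch.einsum('abk,k->ab', S, v_t)

# Task-specific weight W_t = U (S x_3 v_t) V with shape (d_in, d_out)
W_t = U @ S_t @ V
assert W_t.shape == (d_in, d_out)
```

Only the k numbers in `v_t` are stored per task; `U`, `S`, and `V` are shared.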
- 🧊 Core Freezing — primary mechanism preventing forgetting
- 📐 Tucker-style mode-3 contraction — memory-efficient weight generation
- 🎯 Separation Loss — orthogonalization of task vectors (see the sketch after this list)
- 📊 Near-zero forgetting — <1% on all benchmarks
- 💾 Memory efficient — O(k) extra parameters per task (O(Tk) total) vs O(T×N) for storing separate networks
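A minimal sketch of one way to implement the separation loss listed above, assuming a pairwise squared-cosine penalty between task vectors (the repo's `fcd.losses.separation_loss` may differ in detail):

```python
import torch
import torch.nn.functional as F

def separation_penalty(task_vectors: torch.Tensor) -> torch.Tensor:
    """Mean squared cosine similarity over all distinct pairs of task vectors.

    task_vectors: (T, k) tensor; the penalty is 0 when all rows are orthogonal.
    """
    v = F.normalize(task_vectors, dim=1)                     # unit-norm rows
    gram = v @ v.t()                                         # (T, T) pairwise cosines
    off_diag = gram - torch.eye(v.size(0), device=v.device)
    denom = max(v.size(0) * (v.size(0) - 1), 1)              # number of ordered pairs
    return (off_diag ** 2).sum() / denom

# Total loss during task t: cross-entropy + lambda_sep * separation_penalty(v[:t+1])
```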
Split MNIST:
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 96.1 ± 0.4% | 0.2 ± 0.2% |
| HAT | 82.9 ± 4.1% | 19.3 ± 5.1% |
| PackNet | 56.6 ± 2.6% | 40.6 ± 2.3% |
| DER++ | 56.6 ± 2.2% | 52.5 ± 2.9% |
| EWC | 57.0 ± 3.4% | 52.2 ± 4.3% |
| Fine-tuning | 55.8 ± 1.9% | 54.0 ± 2.4% |
Permuted MNIST:
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 82.2 ± 0.4% | 0.2 ± 0.1% |
| EWC | 65.7 ± 2.7% | 2.6 ± 1.4% |
| HAT | 49.2 ± 7.1% | 35.5 ± 7.1% |
| Fine-tuning | 30.3 ± 2.9% | 68.6 ± 3.2% |
| PackNet | 25.4 ± 1.6% | 42.0 ± 2.3% |
| DER++ | 18.0 ± 6.6% | 32.8 ± 26.9% |
Split CIFAR-100 (MLP):
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD (Ours) | 50.5 ± 0.3% | 0.1 ± 0.1% |
| HAT | 27.1 ± 1.7% | 25.5 ± 1.7% |
| EWC | 15.9 ± 0.2% | 19.5 ± 0.8% |
| PackNet | 14.4 ± 0.2% | 40.1 ± 0.2% |
| Fine-tuning | 14.2 ± 0.1% | 43.9 ± 0.3% |
| DER++ | 13.7 ± 0.1% | 39.9 ± 0.5% |
Note: All methods use the same MLP architecture for fair comparison. HAT and PackNet were designed for CNNs.
FCD can be used as a lightweight adapter on top of frozen pretrained backbones. Results on Split CIFAR-100 (5 tasks, 20 classes each, 10 epochs):
| Method | Accuracy | Forgetting |
|---|---|---|
| ResNet-18 + FCD Adapter | 59.8% | 0.3% |
| ResNet-18 + Fine-tuning | 16.2% | 61.3% |
Key insight: FCD achieves a 200× reduction in forgetting (0.3% vs 61.3%) while maintaining 3.7× higher accuracy. The frozen backbone provides rich features, and the FCD adapter learns task-specific projections without catastrophic interference.
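Conceptually, the adapter setup looks like the sketch below. This is a simplified illustration rather than the repo's `FCDResNet18`; the 512-d ResNet-18 feature dimension is standard, but the single-layer factorized head and the parameter names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torchvision

class FrozenBackboneFCDHead(nn.Module):
    """Frozen ResNet-18 feature extractor + task-conditioned factorized head (sketch)."""

    def __init__(self, num_classes=20, feat_dim=512, r=64, k=32, num_tasks=5):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()            # expose the 512-d pooled features
        for p in backbone.parameters():
            p.requires_grad = False            # the backbone never changes
        self.backbone = backbone

        # Shared factorization (frozen after the first task in FCD)
        self.U = nn.Parameter(0.02 * torch.randn(feat_dim, r))
        self.S = nn.Parameter(0.02 * torch.randn(r, r, k))
        self.V = nn.Parameter(0.02 * torch.randn(r, num_classes))
        # One small coefficient vector per task
        self.task_vectors = nn.Parameter(0.02 * torch.randn(num_tasks, k))

    def forward(self, x, task_id: int):
        feats = self.backbone(x)                                       # (B, 512)
        S_t = torch.einsum('abk,k->ab', self.S, self.task_vectors[task_id])
        W_t = self.U @ S_t @ self.V                                    # (512, num_classes)
        return feats @ W_t                                             # task-specific logits
```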
To address the MLP limitation on CIFAR-100, we implement FCDCNN—a custom 4-block CNN (32→64→128→256 channels) trained from scratch with FCD adapter. Results on Split CIFAR-100 (10 tasks, 10 classes each, 20 epochs):
| Method | Accuracy | Forgetting |
|---|---|---|
| FCD + CNN | 57.6% | 1.3% |
| Fine-tuning CNN | 17.1% | 72.7% |
Key insight: FCD reduces forgetting by 56× (1.3% vs 72.7%) while achieving 3.4× higher accuracy. Unlike the pretrained ResNet-18, here the CNN is trained from scratch, showing FCD's core freezing works effectively without transfer learning.
FCD can be applied to Large Language Models as an alternative to LoRA for continual fine-tuning without forgetting.
GPT-2 (124M params), 3 sequential tasks, 5 epochs each:
| Configuration | Forgetting | Trainable Parameters |
|---|---|---|
| Soft FCD (separation loss) | 5.4% | 848K (0.7% of model) |
| Hard FCD (freeze core) | 0% | 576 (<0.01% of model) |
| Standard LoRA | ~30% | 768K (0.6% of model) |
Note: % of model = fraction of trainable parameters out of total (124M). FCD and LoRA use comparable parameter counts (~0.6-0.7%), but FCD preserves previous tasks.
Key advantages over LoRA:
- Multi-task in one adapter: switch tasks via `model.set_task(id)`
- No forgetting: previous tasks are preserved when learning new ones
- Same memory: similar parameter count to LoRA
Scales to: DeepSeek, LLaMA 2/3, Mistral, Qwen.
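For a LLaMA-style checkpoint, the same adapter config should only need different target-module names (`q_proj`/`v_proj` are the standard projection names in HuggingFace LLaMA implementations); whether `apply_fcd_to_model` handles such models without further changes is an assumption, sketched here:

```python
from transformers import AutoModelForCausalLM
from fcd.llm_adapter import FCDLoRAConfig, apply_fcd_to_model

# Hypothetical sketch: adapt a LLaMA-style model instead of GPT-2
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated checkpoint

config = FCDLoRAConfig(
    r=8,                                  # adapter rank
    task_rank=16,                         # task vector dimension
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA blocks
)
model = apply_fcd_to_model(base_model, config)
```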
Ablation study:
| Configuration | Accuracy | Forgetting |
|---|---|---|
| Full FCD | 96.1 ± 0.4% | 0.2 ± 0.2% |
| Without separation loss | 95.8 ± 0.5% | 0.2 ± 0.2% |
| Without core freezing | 93.2 ± 4.8% | 6.7 ± 6.1% |
| Minimal (no sep, no freeze) | 92.2 ± 5.2% | 8.1 ± 6.6% |
Key finding: core freezing is essential; without it, forgetting rises from 0.2% to 6.7%. The separation loss adds only a marginal accuracy improvement (96.1% vs 95.8%).
Scalability: what happens when the number of tasks T exceeds the task-vector dimension k (here k = 16)?
| Tasks T | Accuracy | Forgetting |
|---|---|---|
| 5 | 98.4% | 0.0% |
| 10 | 97.6% | 0.5% |
| 16 | 97.6% | 0.9% |
| 20 | 96.6% | 1.3% |
The method degrades gracefully — even at T=20 with k=16, accuracy stays above 96% with only 1.3% forgetting.
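The gradual degradation has a simple linear-algebra explanation: at most k task vectors in $\mathbb{R}^k$ can be exactly orthogonal, and for T > k the Welch bound gives the smallest possible worst-case overlap, which stays modest at these sizes. A small back-of-the-envelope calculation:

```python
import math

k = 16
for T in (5, 10, 16, 20):
    if T <= k:
        bound = 0.0  # up to k task vectors can be exactly orthogonal in R^k
    else:
        # Welch bound: T unit vectors in R^k must have max pairwise |cos| >= this
        bound = math.sqrt((T - k) / (k * (T - 1)))
    print(f"T={T:2d}: unavoidable max |cos(v_i, v_j)| >= {bound:.3f}")
# T=20, k=16 gives ~0.115: a small overlap, consistent with the mild accuracy drop above
```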
Memory footprint (total parameter counts):
| Tasks T | Baseline (params) | FCD (params) | Savings |
|---|---|---|---|
| 5 | 1,006,850 | 34,136 | 96.6% |
| 10 | 2,013,700 | 34,216 | 98.3% |
| 20 | 4,027,400 | 34,376 | 99.1% |
Note: in this repository's implementation, the per-task overhead includes the task vector $v_t$; from the table, the total grows by 16 = k parameters per task (e.g. 34,216 − 34,136 = 80 over 5 additional tasks), while the factors and core are counted once.
```bash
git clone https://github.com/infosave2007/fcd.git
cd fcd

# Install the library (recommended)
python -m pip install .

# Optional: benchmark/text extras
python -m pip install ".[benchmarks]"

# For development/editable installs (requires a recent pip)
python -m pip install --upgrade pip
python -m pip install -e .
```

Requirements: Python 3.9+, PyTorch 2.0+
Quick start (Split MNIST):

```python
from fcd import FCDModel, train_continual
from benchmarks import get_split_mnist

# Load benchmark
tasks = get_split_mnist()

# Create model
model = FCDModel(
    input_dim=784,
    hidden_dim=256,
    num_classes=2,
    task_rank=16,   # k: task vector dimension
    core_rank=32    # r: factorization rank
)

# Train on all tasks sequentially
results = train_continual(model, tasks, epochs=100)
print(f"Average Accuracy: {results['avg_accuracy']*100:.1f}%")
print(f"Average Forgetting: {results['avg_forgetting']*100:.1f}%")
```

Vision backbone (frozen ResNet-18 + FCD adapter) on Split CIFAR-100:

```python
from fcd import FCDResNet18, train_fcd_backbone
from fcd.backbone import get_cifar100_for_cnn  # or use a custom loader

# Load CIFAR-100 split into 10 tasks
tasks = get_cifar100_for_cnn(n_tasks=10)

# Create ResNet-18 with FCD adapter (backbone frozen, only the adapter trains)
model = FCDResNet18(
    num_classes=10,       # classes per task
    task_rank=32,
    core_rank=64,
    pretrained=True,
    freeze_backbone=True
)

# Train on all tasks
results = train_fcd_backbone(model, tasks, epochs=20, device='mps')
# Typical result: ~60% accuracy, <1% forgetting
```

FCD-LoRA adapter for continual fine-tuning of HuggingFace LLMs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from fcd.llm_adapter import FCDLoRAConfig, apply_fcd_to_model

# Load any HuggingFace model
base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # or deepseek, llama
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Apply FCD adapters (similar to LoRA, but with multi-task support)
config = FCDLoRAConfig(
    r=8,                                  # Adapter rank
    task_rank=16,                         # Task vector dimension
    target_modules=["c_attn", "c_proj"],  # Layers to adapt
)
model = apply_fcd_to_model(base_model, config)

# `train` and `generate` below stand for user-defined training / generation loops

# Train task 0
model.set_task("sentiment")
train(model, sentiment_data)
model.freeze_task("sentiment", freeze_core=False)  # soft FCD

# Train task 1 (no forgetting of task 0!)
model.set_task("qa")
train(model, qa_data)
model.freeze_task("qa")

# Inference: switch between tasks
model.set_task("sentiment")
generate(model, "This movie was")     # Uses sentiment knowledge
model.set_task("qa")
generate(model, "Question: What is")  # Uses QA knowledge
```

Training procedure:

```text
1. Initialize U, V, S randomly
2. For each task t = 0, 1, ..., T-1:
   a. Initialize v_t orthogonally (Gram-Schmidt)
   b. Train with L = L_CE + λ_sep * L_sep
   c. If t == 0: freeze U, V, S (core freezing)
   d. Freeze v_t
```
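A schematic PyTorch sketch of this schedule, reusing the `separation_penalty` helper sketched earlier. The attribute names (`U`, `V`, `S`, `task_vectors` as an `nn.ParameterList`, `set_task`) are illustrative assumptions; the repo's `train_continual` implements the real logic.

```python
import torch
import torch.nn.functional as F

def train_continually(model, tasks, lambda_sep=0.5, epochs=100, lr=0.01):
    """Sketch of the FCD schedule: train task by task, freeze the shared core
    (U, V, S) after task 0, and freeze each task vector when its task is done.
    `model.task_vectors` is assumed to be an nn.ParameterList, one vector per task."""
    for t, task_loader in enumerate(tasks):
        model.set_task(t)  # assumed: selects which v_t the model contracts with
        optimizer = torch.optim.SGD(
            [p for p in model.parameters() if p.requires_grad], lr=lr
        )
        for _ in range(epochs):
            for x, y in task_loader:
                logits = model(x)
                v_so_far = torch.stack([model.task_vectors[i] for i in range(t + 1)])
                loss = F.cross_entropy(logits, y) \
                    + lambda_sep * separation_penalty(v_so_far)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        if t == 0:
            # Core freezing: the shared factorization never changes again
            for p in (model.U, model.V, model.S):
                p.requires_grad_(False)
        # Freeze the finished task's vector as well
        model.task_vectors[t].requires_grad_(False)
```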
| vs Method | FCD Advantage |
|---|---|
| HAT | Simpler architecture (no attention masks), higher accuracy |
| PackNet | No pruning degradation, maintains capacity |
| DER++ | No sample storage required, O(Tk) vs O(buffer) |
| EWC | Hard isolation vs soft constraints |
The main result tables above are the source of truth for accuracy/forgetting. This table summarizes practical properties of the implementations in this repo.
| Method | Task ID at inference | Replay buffer |
|---|---|---|
| Fine-tuning | No | No |
| EWC | No | No |
| DER++ | No | Yes |
| HAT | Yes | No |
| PackNet | No* | No |
| FCD | Yes | No |
*Original PackNet typically requires a task-specific mask at inference; the current implementation keeps only a single global mask.
```
fcd/
├── python/
│   ├── fcd/                              # Core library
│   │   ├── model.py                      # FCDModel, FactorizedCoreTensor
│   │   ├── backbone.py                   # FCDResNet18, FCDViT adapters
│   │   ├── llm_adapter.py                # FCD-LoRA for LLMs (GPT-2, LLaMA, etc.)
│   │   ├── losses.py                     # separation_loss, total_loss
│   │   ├── training.py                   # train_task, train_continual
│   │   └── baselines.py                  # EWC, HAT, PackNet, DER++
│   ├── benchmarks/                       # Data loaders
│   │   └── loaders.py                    # Split MNIST, Permuted MNIST, CIFAR-100
│   ├── examples/                         # Example scripts
│   ├── full_comparison.py                # Run all methods
│   ├── vision_backbone_benchmark.py      # ResNet/ViT + FCD
│   ├── llm_fcd_example.py                # GPT-2/LLaMA + FCD-LoRA
│   └── ablation_study.py                 # Ablation experiments
├── FCD.tex                               # Paper source
├── LICENSE
└── README.md
```
```bash
# Full comparison (all methods, all benchmarks)
cd python
python full_comparison.py

# Fast sanity check (quick run, skips CIFAR-100)
python full_comparison.py --smoke

# Force a device
python full_comparison.py --device cpu
python full_comparison.py --device cuda   # NVIDIA GPU (requires CUDA-enabled PyTorch)
python full_comparison.py --device mps    # macOS GPU via Metal (MPS)

# Control what to run
python full_comparison.py --skip-cifar
python full_comparison.py --runs 5

# Ablation study
python ablation_study.py

# Vision backbones (ResNet-18, ViT) on Split CIFAR-100
python vision_backbone_benchmark.py --device mps
python vision_backbone_benchmark.py --skip-vit   # ResNet only (faster)

# Individual benchmarks
cd examples
python split_mnist_benchmark.py
python permuted_mnist_benchmark.py
```

Notes:
- If you don't pass `--device`, the scripts auto-select `mps` → `cuda` → `cpu`.
- `--smoke` is intended for quick "does it run?" checks; for paper-style numbers, run with the default epochs and multiple `--runs`.
| Parameter | Default | Description |
|---|---|---|
| r (core_rank) | 32 | Factorization rank |
| k (task_rank) | 16 | Task vector dimension |
| λ_sep | 0.5 | Separation loss weight |
| lr | 0.01 | Learning rate |
| epochs | 100 | Epochs per task |
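For reference, a sketch of how these defaults map onto the API from the quick start; whether `train_continual` accepts `lr` and `lambda_sep` keyword arguments under exactly these names is an assumption (see `fcd/training.py`):

```python
from fcd import FCDModel, train_continual
from benchmarks import get_split_mnist

tasks = get_split_mnist()

model = FCDModel(
    input_dim=784,
    hidden_dim=256,
    num_classes=2,
    core_rank=32,   # r: factorization rank
    task_rank=16,   # k: task vector dimension
)

# lr / lambda_sep keyword names are assumptions mirroring the table of defaults above
results = train_continual(model, tasks, epochs=100, lr=0.01, lambda_sep=0.5)
```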
Limitations:
- Task identity must be known at inference (task-aware setting)
- The task-vector dimension k bounds how many task vectors can be perfectly orthogonal; beyond k tasks the method degrades gracefully (see the scalability results above)
Resolved: CNN architecture for CIFAR-100 is now implemented (FCDCNN), achieving 57.6% accuracy with 1.3% forgetting.
```bibtex
@article{kirichenko2025fcd,
  author = {Kirichenko, Oleg},
  title  = {Frozen Core Decomposition: An Architectural Approach to Continual Learning Without Catastrophic Forgetting},
  year   = {2025},
  url    = {https://github.com/infosave2007/fcd}
}
```

If you find this project useful, please consider starring the repository.
MIT License
Oleg Kirichenko — urevich55@gmail.com