Achyuta Rajaram*, Sarah Schwettmann, Jacob Andreas, Arthur Conmy
*Indicates primary author; direct correspondence to achyuta@mit.edu
Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LLaVA-NeXT, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that these features are causal by using them to perform targeted edits on the model's output. To increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
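As a quick illustration of what "linearly decodable" means here, the hedged sketch below fits a logistic-regression probe on cached residual-stream activations at image-token positions; the file names, array shapes, and probe settings are placeholders rather than the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder inputs: cached residual-stream activations at image-token positions
# and their ImageNet class labels. The file names and shapes are assumptions.
X = np.load("activations.npy")  # [n_images, d_model]
y = np.load("labels.npy")       # [n_images]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=2000)  # a linear probe over the residual stream
probe.fit(X_train, y_train)
print("linear probe accuracy:", probe.score(X_test, y_test))
```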
requirements.txt should be comprehensive, but the main dependencies are:
BauKit (https://github.com/davidbau/baukit), PyTorch, and tqdm.
A single-file script for training SAEs for LLaVA-NeXT on an 8-GPU host node can be found in /sae/sae-trainer.py, alongside several evaluation scripts. Remember to download the ShareGPT4V dataset: https://sharegpt4v.github.io/.
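For orientation, here is a minimal sketch of the kind of sparse autoencoder the trainer fits over residual-stream activations; the class name, dimensions, and L1 coefficient below are illustrative assumptions, and /sae/sae-trainer.py is the authoritative implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete ReLU dictionary over residual-stream activations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # shared pre-encoder bias

    def forward(self, x: torch.Tensor):
        # x: [batch, d_model] activations pulled from a chosen layer of the VLLM
        f = F.relu(self.encoder(x - self.b_dec))  # sparse feature activations
        x_hat = self.decoder(f) + self.b_dec      # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()
```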
A hackable single-file script for performing interventions on Hugging Face models with KV caching is under /steering/steering_rollouts.py. Be sure to enable KV caching: this project generated nearly a billion tokens!
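The sketch below shows the generic pattern for this kind of intervention: add a steering vector to one layer's residual stream via a forward hook, then generate with use_cache=True so the KV cache is reused across decoding steps. The model name, layer index, and steering vector are placeholders (a text-only LM is used for brevity); steering_rollouts.py is the actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model: a text-only LM keeps the sketch short; the repo targets LLaVA-NeXT.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

layer_idx = 16  # which layer's residual stream to edit (assumption)
steer = torch.zeros(model.config.hidden_size, device=model.device, dtype=model.dtype)
# ...fill `steer` with a feature direction (e.g. an SAE decoder column) times a coefficient.

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; the first element is the hidden states.
    return (output[0] + steer,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
try:
    ids = tok("A photo of", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=64, use_cache=True)  # KV caching enabled
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unaffected
```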
The SAE weights for several layers are stored under this HuggingFace repo. Example code for loading the weights and performing inference is under /sae/evaluation/eval_sae_damage.py; the training data is the ShareGPT4V dataset, as previously mentioned. The raw results of the Mechanical Turk experiments are under /steering/.
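A hedged sketch of pulling a checkpoint from the Hub and running it over activations is shown below, reusing the SparseAutoencoder class from the sketch above; the repo id, filename, dimensions, and state-dict keys are placeholders, so defer to /sae/evaluation/eval_sae_damage.py for the real loading code.

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholders: substitute the actual HuggingFace repo id and per-layer checkpoint name.
ckpt_path = hf_hub_download(repo_id="<sae-weights-repo>", filename="layer_16.pt")
state_dict = torch.load(ckpt_path, map_location="cpu")

sae = SparseAutoencoder(d_model=4096, d_hidden=4096 * 16)  # dimensions are assumptions
sae.load_state_dict(state_dict)  # the real checkpoint's keys may differ from this sketch
sae.eval()

with torch.no_grad():
    acts = torch.randn(8, 4096)  # stand-in for residual-stream activations
    recon, feats = sae(acts)
    print("mean active features per token:", (feats > 0).float().sum(dim=-1).mean().item())
```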
