Ujjwal Upadhyay*, Mukul Ranjan*, Zhiqiang Shen, Mohamed Elhoseiny
*Equal Contribution
- 📖 Overview
- 🌟 Key Highlights
- 📊 SpookyBench Dataset
- 🎯 Benchmark Results
- 📸 Task Examples
- 🔬 Temporal Encoding Framework
- ⚙️ Installation & Usage
- 📜 Citation
🤖 Current Video Vision-Language Models (Video-VLMs) excel at spatial understanding but suffer from ⏰ "time blindness": a critical inability to process purely temporal patterns. While humans effortlessly recognize information encoded in temporal sequences with 98% accuracy, state-of-the-art models including GPT-4o, Gemini 2.0, and Qwen-VL achieve 0% accuracy on the same tasks.
We introduce 👻 SpookyBench, the first benchmark designed to isolate and evaluate pure temporal understanding by encoding information exclusively through temporal sequences of noise-like frames. This exposes a fundamental limitation in current video understanding architectures that over-rely on frame-level spatial features.
✅ First benchmark to isolate purely temporal reasoning without spatial shortcuts
✅ 451 carefully crafted videos across 4 distinct categories (Words, Shapes, Objects, Dynamic Scenes)
✅ Striking performance gap: Humans 98% vs. All AI models 0%
✅ Comprehensive evaluation: 15+ state-of-the-art models tested including GPT-4o, Gemini, Qwen-VL
✅ Novel temporal encoding framework using opposing motion patterns
✅ Cross-architecture failure: Limitation persists across model scales and designs
🚀 SpookyBench reveals that current Video-VLMs are fundamentally "time-blind" despite impressive performance on standard benchmarks! ⏰👁️
Our benchmark contains 451 videos distributed across four temporal pattern categories:
| Category | Total Videos | Description |
|---|---|---|
| Text | 210 (46.6%) | English words encoded through temporal noise patterns |
| Object Images | 156 (34.6%) | Single objects encoded using temporal animation |
| Dynamic Scenes | 57 (12.6%) | Video depth maps with temporal motion patterns |
| Shapes | 28 (6.2%) | Geometric patterns encoded through temporal sequences |
| Total | 451 | Comprehensive temporal understanding evaluation |
📌 Each video appears as random noise in individual frames, but reveals meaningful content when viewed as a temporal sequence.
Our evaluation reveals a shocking performance disparity:
| Evaluator | Accuracy | Gap vs. humans |
|---|---|---|
| 👥 Humans | 98.0% ± 0.6% | |
| 🤖 All Video-VLMs | 0.0% | 98 percentage points |
| Model Family | Models Tested | Performance |
|---|---|---|
| Closed-Source | GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro | 0% across all |
| Open-Source Large | Qwen2.5-VL-72B, InternVL2.5-78B, InternVL2-40B | 0% across all |
| Open-Source Mid | Video-LLaVA, LLaVA-NeXT-Video, TimeChat | 0% across all |
| Specialized | TimeChat, VideoGPT+, VILA | 0% across all |
📌 Key Finding: The limitation is architectural, not a matter of scale, training, or prompting strategy.
Examples of temporal patterns in SpookyBench: individual frames appear as noise, but temporal sequences reveal words, shapes, and objects that humans can easily recognize. For more examples, visit our project webpage.
Our unique encoding method creates temporal patterns through opposing motion:
- Foreground pixels: Move in one direction (e.g., up/left)
- Background pixels: Move in opposite direction (e.g., down/right)
- Human perception: Groups pixels by motion direction, revealing content
- AI models: Fail to leverage temporal motion cues
# Simplified temporal encoding algorithm (illustrative sketch, not the exact generation code)
import numpy as np

def encode_frame(content_mask, fg_noise, bg_noise, velocity, t):
    fg = np.roll(fg_noise, velocity * t, axis=1)    # foreground noise drifts one way (e.g., left)
    bg = np.roll(bg_noise, -velocity * t, axis=1)   # background noise drifts the opposite way (e.g., right)
    return np.where(content_mask, fg, bg)           # composite: content pixels take the foreground field

| Metric | Purpose |
|---|---|
| Basic SNR | Measures signal-to-noise ratio in temporal patterns |
| Perceptual SNR | Incorporates human visual sensitivity weighting |
| Temporal Coherence | Quantifies motion consistency over time |
| Motion Contrast | Measures foreground-background motion differentiation |
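As a quick check of the sketch above (the name encode_frame and its parameters come from that sketch, not from the released generation code), the following builds a rectangular content mask, renders a short clip, and confirms that any single frame is essentially uncorrelated with the hidden content:

import numpy as np

rng = np.random.default_rng(0)
H, W, T, velocity = 64, 64, 60, 2

mask = np.zeros((H, W), dtype=bool)      # binary content mask (a rectangle standing in for a word/shape)
mask[20:44, 16:48] = True

fg_noise = rng.integers(0, 2, (H, W)).astype(float)   # independent binary noise fields
bg_noise = rng.integers(0, 2, (H, W)).astype(float)

frames = [encode_frame(mask, fg_noise, bg_noise, velocity, t) for t in range(T)]

# Spatially the content is invisible: correlation between a single frame and the mask is ~0.
r = np.corrcoef(frames[0].ravel(), mask.ravel())[0, 1]
print(f"frame/mask correlation: {r:.3f}")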
You can download the dataset from Hugging Face using wget and then unzip the file.
wget https://huggingface.co/datasets/timeblindness/spooky-bench/resolve/main/spooky_bench.zip
unzip spooky_bench.zip
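Alternatively, a minimal sketch using the huggingface_hub Python client to fetch the same archive (the extraction directory below is just an example):

import zipfile
from huggingface_hub import hf_hub_download

# Download spooky_bench.zip from the dataset repo into the local HF cache, then extract it.
zip_path = hf_hub_download(
    repo_id="timeblindness/spooky-bench",
    filename="spooky_bench.zip",
    repo_type="dataset",
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("spooky_bench")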
git clone https://github.com/TimeBlindness/time-blindness.git
cd time-blindness

# For closed-source models (GPT-4o, Gemini)
cd eval/closed_models
pip install -r requirements.txt
# Set up API keys in .env
OPENAI_API_KEY=your_openai_api_key_here
GOOGLE_API_KEY=your_gemini_api_key_here

cd eval/closed_models
python eval_gpt4o.py \
--dataset /path/to/spooky_bench/SpookyBenchDatasets \
--csv /path/to/metadata.csv \
--output ./results \
--categories words shapes \
--use_cot \
--sample_size 10

cd eval/closed_models
python eval_gemini.py \
--dataset /path/to/spooky_bench/SpookyBenchDatasets \
--csv /path/to/metadata.csv \
--output ./results \
--categories words \
--use_cot \
--sample_size 10

See instructions in eval/qwen/README.md
See instructions in eval/internvl/README.md
See instructions in eval/MovieChatVideo/README.md
See instructions in the author's original repo TimeChat. More detailed instructions will be updated later.
See instructions in the author's original repo VideoLLaMA3. More detailed instructions will be updated later.
See instructions in the author's original repo MiniGPT4-video. More detailed instructions will be updated later.
See instructions in the author's original repo Video-ChatGPT. More detailed instructions will be updated later.
See instructions in the author's original repo VideoGPT-plus. More detailed instructions will be updated later.
See instructions in the author's original repo VILA. More detailed instructions will be updated later.
See instructions in the author's original repo ShareGPT4Video. More detailed instructions will be updated later.
See instructions in the author's original repo VideoLLaMA2. More detailed instructions will be updated later.
See instructions in the author's original repo Video-LLaVA. More detailed instructions will be updated later.
See instructions in the author's original repo LLaVA-NeXT-Video. More detailed instructions will be updated later.
- --dataset: Path to the SpookyBench dataset directory
- --csv: Path to the metadata CSV file
- --output: Directory to save evaluation results
- --categories: Categories to evaluate (words, shapes, images, videos)
- --use_cot: Use chain-of-thought prompting for more detailed reasoning
- --sample_size: Number of videos to sample per category
- --model: Model name/version to use (specific to each evaluator)
Expected Data Folder Structure
data_path/
├── images/
│ ├── video1.mp4
│ ├── video2.mp4
│ └── ...
├── shapes/
│ ├── video1.mp4
│ ├── video2.mp4
│ └── ...
├── videos/
│ ├── video1.mp4
│ ├── video2.mp4
│ └── ...
└── words/
├── video1.mp4
├── video2.mp4
└── ...

Example Usage:
# Run human evaluation interface
python human_eval_interface.py --data_path /path/to/spooky_bench_data
python human_eval_interface.py --data_path ./data --output_dir ./annotations --port 7861
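A small, hypothetical helper (not part of the repository) to verify that the directory passed via --data_path follows the layout above:

from pathlib import Path

def check_layout(data_path):
    # Count .mp4 files under each expected category folder.
    for category in ("images", "shapes", "videos", "words"):
        folder = Path(data_path) / category
        n = len(list(folder.glob("*.mp4"))) if folder.is_dir() else 0
        print(f"{category:>7}: {n:4d} video(s){'' if n else '  <- missing or empty'}")

check_layout("/path/to/spooky_bench_data")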
- Closed-Source: GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro
- Open-Source: Qwen2-VL, Qwen2.5-VL, InternVL2, InternVL2.5, Video-LLaVA, TimeChat, LLaVA-NeXT-Video
- Specialized: InternVideo2.5, LongVLM, Momentor, Grounded-VideoLLM
| Category | Accuracy | Perceptibility Rating |
|---|---|---|
| Text | 98.9% ± 0.7% | 4.8 ± 0.0 |
| Shapes | 98.2% ± 2.5% | 4.8 ± 0.1 |
| Object Images | 98.2% ± 1.1% | 4.6 ± 0.1 |
| Dynamic Scenes | 94.3% ± 3.1% | 4.3 ± 0.1 |
See instructions in the Qwen2-VL-Finetune repo. More detailed instructions will be updated later. We have provided the JSON file in the finetune directory.
SpookyBench/
├── eval/
│ ├── closed_models/
│ │ ├── eval_gpt4o.py
│ │ ├── eval_gemini.py
│ │ └── requirements.txt
│ ├── qwen/
│ │ ├── run_qwen.py
│ │ └── requirements.txt
│ ├── internvl/
│ │ └── run_internvl.py
│ └── video_llava/
│ └── run_video_llava.py
├── dataset/
│ ├── SpookyBenchDatasets/
│ │ ├── words/
│ │ ├── shapes/
│ │ ├── objects/
│ │ └── videos/
│ └── metadata.csv
├── human_eval/
│ └── human_eval_interface.py
├── static/
│ └── images/
│ ├── timeblind_logo.svg
│ └── spooky_examples.png
└── README.md
- Over-reliance on spatial features: Models process individual frames first, then attempt temporal integration
- Lack of motion-based segregation: Cannot perform figure-ground separation based on motion patterns (a minimal sketch of such segregation follows the lists below)
- Insufficient temporal integration: Current architectures treat temporal information as secondary
- Missing biological inspiration: Human visual system uses distributed temporal processing mechanisms
- Need for temporal-first processing: Future models should treat temporal understanding as primary
- Motion contrast analysis required: Models need sophisticated motion segregation capabilities
- Longer temporal integration windows: Extended temporal attention mechanisms necessary
- Distributed temporal representations: Following biological principles of temporal processing
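To make the motion-based segregation point concrete, here is a minimal illustrative sketch (assuming frames produced by the encode_frame sketch from the temporal encoding section, with the same velocity): grouping pixels by which of the two opposing shifts they follow recovers the hidden content mask, which is exactly the grouping current models fail to perform.

import numpy as np

def segregate_by_motion(frames, velocity):
    # Score each pixel by whether it tracks the foreground shift or the background
    # shift between consecutive frames; positive scores mark content pixels.
    score = np.zeros_like(frames[0], dtype=float)
    for prev, cur in zip(frames[:-1], frames[1:]):
        fg_err = np.abs(cur - np.roll(prev, velocity, axis=1))    # residual if the pixel moved with the foreground
        bg_err = np.abs(cur - np.roll(prev, -velocity, axis=1))   # residual if the pixel moved with the background
        score += bg_err - fg_err
    return score > 0

# recovered = segregate_by_motion(frames, velocity)   # ~= the original content mask, up to boundary pixels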
- Add support in VLMEvalKit
- Add support in lmms-eval
- Add Python code for generating animations in batch
For questions or collaborations, please contact:
- Ujjwal Upadhyay: ujjwalupadhyay8@gmail.com
- Mukul Ranjan: mukul.ranjan@mbzuai.ac.ae
This project is licensed under the MIT License - see the LICENSE file for details.
Exposing the temporal reasoning gap between humans and machines 🧠⚡🤖


