This document outlines the ongoing work in analyzing comics using data science techniques, focusing on multimodal modeling for content understanding and querying.
Cosmo Page Stream Segmentation - https://huggingface.co/RichardScottOZ/cosmo-v4
- trained on a 25K-page subset of more modern comics from multiple companies, spanning multiple genres and styles (not manga)
Closure Lite Simple - https://huggingface.co/RichardScottOZ/comics-analysis-closure-lite-simple
- composition and structural analysis
- Version 1 (Closure Lite Framework):
- Aim: To combine page image and VLM text for querying, with panel and reading order recognition.
- Key Choice: Utilized 384-dimensional "lite" embeddings to allow for testing on retail GPUs.
- Model: Based on the RichardScottOZ/comics-analysis-closure-lite-simple model, trained as a context-denoise variant.
- Version 2 (Current Focus): Initiating work towards a new, more robust modeling framework, particularly for Page Stream Segmentation.
The Closure Lite Framework is a multimodal fusion model designed to analyze images, text, and panel reading order.
- Objective: Generate multimodal fused embeddings that are queryable for similarity (see the storage and query sketch after this list).
- Embeddings are efficiently stored in Zarr format (e.g., the 80K "perfect match subset" takes ~500MB).
- An interface (Flask-based) exists for querying.
- Design Principle: Usable on a reasonable retail GPU (hence 384-dim embeddings).
- Model Variants: base, denoise, context, context-denoise.
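
As a rough illustration of the storage and querying described above, the sketch below reads 384-dim fused embeddings from a Zarr array and runs a cosine-similarity lookup. The file name and array layout are assumptions, not the project's actual store.

```python
# Minimal sketch: cosine-similarity lookup over fused page embeddings stored in Zarr.
# "fused_embeddings.zarr" and the array layout are illustrative, not the project's store.
import numpy as np
import zarr

z = zarr.open("fused_embeddings.zarr", mode="r")      # shape (n_pages, 384), float32

def query(vec: np.ndarray, top_k: int = 10):
    """Return (indices, scores) of the top_k most similar pages by cosine similarity."""
    emb = z[:]                                         # small enough at this scale to load fully
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    vec = vec / np.linalg.norm(vec)
    scores = emb @ vec
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```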
- Primary Data: Comic pages from various sources (Amazon/Comixology, Dark Horse, Calibre-managed collections like Humble Bundle and DriveThruComics).
- Future Data: Exploring new Humble Bundles and potential scraping (e.g., Neon Ichigan).
- Initial Processing:
- Conversion: Comic files (PDF, CBZ, etc.) are converted into individual page images (a minimal sketch follows this list).
- VLM Text Extraction: Vision-Language Models (VLMs) are used to extract text from each page; this is preferred over general OCR, which struggles with comic-specific challenges.
- Panel Detection: Faster R-CNN is used to detect panel bounding boxes and coordinates.
- DataSpec Integration: All extracted data is joined into a `DataSpec` format for modeling.
- Dataset Size: Currently working with a "Calibre" subset of ~330K pages (compared to ~805K Amazon pages).
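
A minimal sketch of the conversion step, assuming CBZ archives are zip files of images and using PyMuPDF for PDFs; paths and output naming are illustrative.

```python
# Minimal conversion sketch: CBZ archives (zips of images) and PDFs to per-page files.
# Uses PyMuPDF (fitz) for PDFs; paths and naming are illustrative.
import zipfile
from pathlib import Path

import fitz  # PyMuPDF

def cbz_to_pages(cbz_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(cbz_path) as z:
        for name in sorted(z.namelist()):
            if name.lower().endswith((".jpg", ".jpeg", ".png", ".webp")):
                (out / Path(name).name).write_bytes(z.read(name))

def pdf_to_pages(pdf_path: str, out_dir: str, dpi: int = 150) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)                 # rasterise the page
        pix.save(str(out / f"page_{i:04d}.png"))
```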
- Find Comics: Gather digital comic files.
- Convert to Pages: Extract images from containers (PDF, CBZ, etc.).
- VLM Analysis: Use VLMs to get panel text for each page.
- R-CNN Panel Detection: Get panel bounding boxes and coordinates.
- Data Integration: Join data into `DataSpec` (e.g., using `coco_to_dataspec`); an illustrative join is sketched after this list.
- Perfect Match Subset: Create high-quality subsets for training.
- Model Training: Train the Closure Lite model (using `closure_lite_dataset`, `closure_lite_simple_framework`, `train_closure_lite_simple_with_list`).
- Embedding Generation: Run the trained model to generate queryable embeddings.
- Querying: Access embeddings via code or a dedicated interface.
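
The join step can be pictured roughly as below. This is only an illustration of combining COCO-style panel boxes with per-page VLM text, not the actual `coco_to_dataspec` implementation, and the VLM JSON field names are assumptions.

```python
# Illustrative join of COCO-style panel detections with per-page VLM text.
# This is NOT the project's coco_to_dataspec; the VLM JSON field names are assumptions.
import json
from collections import defaultdict

def join_panels_and_text(coco_path: str, vlm_path: str) -> dict:
    with open(coco_path) as f:
        coco = json.load(f)                            # COCO-style detections
    with open(vlm_path) as f:
        vlm = json.load(f)                             # e.g. {"page_0001.png": {"panels": [...]}}

    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    boxes = defaultdict(list)
    for ann in coco["annotations"]:
        boxes[id_to_file[ann["image_id"]]].append(ann["bbox"])   # [x, y, w, h]

    records = {}
    for page, bxs in boxes.items():
        texts = vlm.get(page, {}).get("panels", [])
        records[page] = {
            "panel_boxes": sorted(bxs, key=lambda b: (b[1], b[0])),  # rough reading order
            "panel_texts": texts,
            "perfect_match": len(bxs) == len(texts),   # same panel count from both sources
        }
    return records
```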
- Goal: Ensure text alignment with the correct panels (same number of panels from R-CNN and VLM).
- Challenge: Initial alignment between Faster R-CNN and the various VLM runs was only ~25%, indicating a need to improve this step or consider alternative approaches (e.g., using OCR alongside the Faster R-CNN panels).
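
Continuing the illustrative records structure from the join sketch above, the alignment rate is just the fraction of pages where both sources report the same panel count:

```python
# Alignment check over the illustrative records above: fraction of pages where the
# detector and the VLM agree on panel count (the "perfect match" criterion).
def alignment_rate(records: dict) -> float:
    matched = sum(1 for r in records.values() if r["perfect_match"])
    return matched / max(len(records), 1)

# records = join_panels_and_text("panels_coco.json", "vlm_pages.json")
# print(alignment_rate(records))   # an initial run reportedly sat around 0.25
```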
- UMAP embeddings revealed distinct clusters.

- Landscape Cluster: A "ribbon cluster" was identified as a "landscape cluster" (pages with landscape orientation), which is less common in comics.

- Data Noise: An adjoining part of this cluster was found to be an error, containing extracted pages from non-comic content (e.g., computing books), highlighting the need for better data filtering.
- Clustering of vision-only embeddings showed distinct patterns (a minimal projection/clustering sketch follows below).

- Cluster 8: Identified as "all single panel very dark images" and "heavy text one page images," with a few multi-panel pages. A disconnected group of "all white images" formed a monochromatic sub-cluster.
- Aspect Ratios: Vision-only clustering also picked out significant differences in panel aspect ratios.
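
A minimal sketch of this kind of inspection, assuming vision-only embeddings saved as a NumPy array and using umap-learn plus HDBSCAN; the clustering choice is an assumption and the parameters are generic defaults, not those behind the reported clusters.

```python
# Minimal projection/clustering sketch over vision-only page embeddings.
# The .npy path is hypothetical and HDBSCAN is an assumed clustering choice.
import numpy as np
import umap        # umap-learn
import hdbscan

emb = np.load("vision_embeddings.npy")                 # (n_pages, 384)
proj = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(emb)
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(proj)

for cluster in np.unique(labels):
    idx = np.where(labels == cluster)[0]
    print(f"cluster {cluster}: {len(idx)} pages")       # inspect e.g. the landscape "ribbon"
```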

- Survey: "One missing piece in Vision and Language: A Survey on Comics Understanding" (https://arxiv.org/abs/2409.09502v1)
- Neural Networks & Transformers: "Investigating Neural Networks and Transformer Models for Enhanced Comic Decoding" (https://dl.acm.org/doi/10.1007/978-3-031-70645-5_10)
- Image Indexing: "Digital Comics Image Indexing Based on Deep Learning" (https://www.researchgate.net/publication/326137469_Digital_Comics_Image_Indexing_Based_on_Deep_Learning)
- Synthesis of Graphic Novels: "A Deep Learning Pipeline for the Synthesis of Graphic Novels" (https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_52.pdf)
- ComicsPAP: https://github.com/llabres/ComicsPAP
- Papers:
  - ComicsPAP paper: https://www.researchgate.net/publication/389748978_ComicsPAP_understanding_comic_strips_by_picking_the_correct_panel
  - https://huggingface.co/VLR- ComicsPap models
  - CoMix repo: https://github.com/emanuelevivoli/CoMix
  - "A Deep Learning Pipeline for the Synthesis of Graphic Novels": https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_52.pdf
- Dialogue transcription requires a robust panel detection pipeline.
- Refer to the main CoMix repository for the Faster R-CNN panel detection and processing to produce COCO-style files (a generic inference sketch follows this list).
- The code in `detections_2000ad` contains adaptations for inference, distinct from evaluation.
- Detailed functions are in https://github.com/RichardScottOZ/CoMix/tree/main/benchmarks/detections.
- Technical Stack: PyTorch (CUDA 11.8 recommended), transformers, xarray and zarr.
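
For orientation, a generic torchvision Faster R-CNN inference sketch is below; the CoMix detections code differs in detail, and the checkpoint path and two-class head (background + panel) are assumptions.

```python
# Generic torchvision Faster R-CNN inference sketch for panel boxes; the CoMix
# detections code differs in detail, and the checkpoint path is hypothetical.
import torch
from torchvision.io import ImageReadMode, read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(num_classes=2)          # background + panel (assumed)
model.load_state_dict(torch.load("panel_fasterrcnn.pth", map_location="cpu"))
model.eval()

img = convert_image_dtype(read_image("page_0001.png", mode=ImageReadMode.RGB), torch.float)
with torch.no_grad():
    pred = model([img])[0]

keep = pred["scores"] > 0.7
panel_boxes = pred["boxes"][keep].tolist()               # [x1, y1, x2, y2] per panel
```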
Extensive testing has been conducted with various Vision-Language Models for text extraction and analysis (a minimal local-extraction sketch follows the list):
- Gemma 4B: Handles the basics, but fails perpetually on some pages (reason not yet understood; Fourier analysis has been suggested).
- Gemma 12B: Can handle some of Gemma 4B's failures.
- Mistral 3.1: Generally identifies "character A and character B"; can handle Gemma failures.
- Qwen 2.5 VL Instruct: Failed on the same cases as Gemma 4B.
- Phi4: Not very good.
- Llama 11B: Failed to process.
- Gemini 1.5 Flash: Capable of handling missing/null captions and characters.
- Gemini 2.5 Flash Lite: Good performance, but 8x more expensive than Gemma 4B (which can run locally). Experienced quite a few connection errors with Google. On the hardest 660 samples, roughly 1 in 3 calls errored and 1 in 5 returned malformed JSON.
- GPT-4.1 nano: Reported "unsupported image type," indicating lower capability than Google's models.
- Meta Llama 4 Scout: Much better, with 300/500 successes on the remaining images, and better cost efficiency (0.08/0.30 vs 0.10/0.40 for Gemini 2.5 Flash Lite). However, encountered significant problems with the GMI Cloud provider.
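
A minimal sketch of local extraction with a small VLM, assuming the transformers image-text-to-text pipeline and the google/gemma-3-4b-it checkpoint; the prompt and output handling are illustrative, and the exact call signature may vary by transformers version.

```python
# Sketch of local per-page extraction with a small VLM (Gemma 3 4B assumed);
# prompt and output handling are illustrative.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it", device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "page_0001.png"},
        {"type": "text", "text": "Extract the dialogue and captions for each panel as JSON."},
    ],
}]
out = pipe(text=messages, max_new_tokens=512, return_full_text=False)
print(out[0]["generated_text"])                          # raw model text, hopefully JSON
```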
- To Generate: Panel embeddings (P), Page embeddings (E_page), Reading order embeddings.
- Dataset Coverage: Amazon perfect matches (212K pages), CalibreComics perfect matches (80K), and a combined dataset of all high-quality samples.
- Technical Approach: Batch Processing Script.
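
A hedged sketch of such a batch script, writing fused embeddings to the Zarr store used for querying; the model loading and forward-output key are placeholders rather than the real `closure_lite_*` interfaces.

```python
# Batch-processing sketch: run the trained model over page batches and append fused
# embeddings to Zarr. The forward-output key is a placeholder, not the real API.
import torch
import zarr

def generate_embeddings(model, dataloader, out_path="fused_embeddings.zarr", dim=384):
    z = zarr.open(out_path, mode="w", shape=(0, dim), chunks=(4096, dim), dtype="f4")
    model.eval()
    with torch.no_grad():
        for batch in dataloader:                         # page images, panel boxes, panel texts
            fused = model(batch)["page_embedding"]       # hypothetical output key
            z.append(fused.cpu().numpy().astype("float32"))
```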
- Humble Bundle: If you download everything, you also get your book and RPG collections, which will happily go through the PDF and EPUB extraction and end up as page images.
- The framework handled this: the 'deep learning / Linux / programming' book images cluster as their own small group connected to a ribbon, so they are not ideal to include, but it is a useful result. What if you put RPG art in? Most likely you would obtain lots of single-page 'panels', which would cluster in the 'no/low text' UMAP sections.
- Refer to documentation/future_work/.
- Need to work out page classes within a comic that work for 'modern comics', as opposed to old public domain comics
- Cover, credits and indicia, splash, cover interior for trades, back matter text, back matter art, previews etc.
- single issues
- trade paperbacks
- omnibuses, collections that are larger
- anthologies - art/narrative changes 'inside' a comic - e.g. 2000AD
- anthologies inside omnibuses
- Trying BART for the above, as there are many VLM JSON files with some signal of what pages might contain (e.g., in the 'overall summary' field).
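
One possible BART route is zero-shot classification of the VLM 'overall summary' text with facebook/bart-large-mnli; the candidate labels below are illustrative, and this is only one way the idea could be wired up.

```python
# One possible BART route: zero-shot classification of the VLM "overall summary" text
# with facebook/bart-large-mnli. The candidate labels are illustrative.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["cover", "credits and indicia", "splash page", "story page",
          "back matter text", "back matter art", "preview"]

summary = "A painted wraparound image with the series logo and issue number."  # from a VLM JSON
result = classifier(summary, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))
```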
- Importance: PSS is crucial for the next version of the project, feeding page type markers into multimodal fusion.
- CoSMo Model:
- GitHub Repository (includes arXiv paper link).
- Specifically designed for PSS.
- Uses Qwen 2.5 VL 32B for OCR.
- Consideration: Is Gemma as good/cost-effective for OCR?
- Current Status: No pre-trained model available for direct testing; requires training a new one.
- notebook in Data Analysis - comic training
- try first-pass classes (see the classifier-head sketch after this list)
- cover, credits, advertisements, art (previews, internal covers, etc., hopefully eventually), text, back_cover - though back_cover is often ads or previews, so it might not be a useful class
- Needed for everything - page / panel / characters
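
A sketch of wiring the first-pass classes listed above into an image-classification head with transformers; the ViT backbone is an assumed stand-in, not the project's chosen model.

```python
# Sketch of wiring the first-pass classes into an image-classification head;
# the ViT backbone is an assumed stand-in, not the project's chosen model.
from transformers import AutoImageProcessor, AutoModelForImageClassification

CLASSES = ["cover", "credits", "advertisements", "art", "text", "back_cover"]
id2label = dict(enumerate(CLASSES))
label2id = {v: k for k, v in id2label.items()}

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(CLASSES), id2label=id2label, label2id=label2id,
)
```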
- Discussion on Deepseek OCR references Qwen here too: https://news.ycombinator.com/item?id=43549072
- note bounding box commentary
- also https://pyimagesearch.com/2025/06/09/object-detection-and-visual-grounding-with-qwen-2-5/
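
A hedged sketch of prompting Qwen 2.5 VL for panel bounding boxes along the lines of the visual-grounding article above; the model size, prompt wording, and processor usage are assumptions and may need adjusting per transformers version.

```python
# Hedged sketch of visual grounding with Qwen 2.5 VL: ask for panel bounding boxes as JSON.
# Model size, prompt wording, and processor usage are assumptions.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

img = Image.open("page_0001.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Return the bounding box of each comic panel as JSON "
                             "objects with 'bbox_2d' and 'label' keys."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[img], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```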

