This document outlines the ongoing work in analyzing comics using data science techniques, focusing on multimodal modeling for content understanding and querying.
Cosmo Page Stream Segmentation - https://huggingface.co/RichardScottOZ/cosmo-v4
- trained on a 25K-page subset of more modern comics from multiple companies, spanning multiple genres and styles (not manga)
Closure Lite Simple - https://huggingface.co/RichardScottOZ/comics-analysis-closure-lite-simple
- composition and structural analysis
- Version 1 (Closure Lite Framework):
- Aim: To combine page image and VLM text for querying, with panel and reading order recognition.
- Key Choice: Utilized 384-dimensional "lite" embeddings to allow for testing on retail GPUs.
- Model: Based on the RichardScottOZ/comics-analysis-closure-lite-simple model, trained as a context-denoise variant.
- Version 2 (Current Focus): Initiating work towards a new, more robust modeling framework, particularly for Page Stream Segmentation.
The Closure Lite Framework is a multimodal fusion model designed to analyze images, text, and panel reading order.
- Objective: Generate multimodal fused embeddings that are queryable for similarity (see the storage and query sketch after this list).
- Embeddings are efficiently stored in Zarr format (e.g., the 80K "perfect match subset" takes ~500MB).
- An interface (Flask-based) exists for querying.
- Design Principle: Usable on a reasonable retail GPU (hence 384-dim embeddings).
- Model Variants: base, denoise, context, context-denoise.
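
As a rough illustration of the storage and querying described above, the sketch below reads 384-dim fused embeddings from a Zarr array and runs a cosine-similarity lookup. The file name and array layout are assumptions, not the project's actual store.

```python
# Minimal sketch: cosine-similarity lookup over fused page embeddings stored in Zarr.
# "fused_embeddings.zarr" and the array layout are illustrative, not the project's store.
import numpy as np
import zarr

z = zarr.open("fused_embeddings.zarr", mode="r")      # shape (n_pages, 384), float32

def query(vec: np.ndarray, top_k: int = 10):
    """Return (indices, scores) of the top_k most similar pages by cosine similarity."""
    emb = z[:]                                         # small enough at this scale to load fully
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    vec = vec / np.linalg.norm(vec)
    scores = emb @ vec
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```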
- Primary Data: Comic pages from various sources (Amazon/Comixology, Dark Horse, Calibre-managed collections like Humble Bundle and DriveThruComics).
- Future Data: Exploring new Humble Bundles and potential scraping (e.g., Neon Ichigan).
- Initial Processing:
- Conversion: Comic files (PDF, CBZ, etc.) are converted into individual page images (a minimal sketch follows this list).
- VLM Text Extraction: Vision-Language Models (VLMs) are used to extract text from each page; this is preferred over general OCR, which struggles with comic-specific challenges.
- Panel Detection: Faster R-CNN is used to detect panel bounding boxes and coordinates.
- DataSpec Integration: All extracted data is joined into a `DataSpec` format for modeling.
- Dataset Size: Currently working with a "Calibre" subset of ~330K pages (compared to ~805K Amazon pages).
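
A minimal sketch of the conversion step, assuming CBZ archives are zip files of images and using PyMuPDF for PDFs; paths and output naming are illustrative.

```python
# Minimal conversion sketch: CBZ archives (zips of images) and PDFs to per-page files.
# Uses PyMuPDF (fitz) for PDFs; paths and naming are illustrative.
import zipfile
from pathlib import Path

import fitz  # PyMuPDF

def cbz_to_pages(cbz_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(cbz_path) as z:
        for name in sorted(z.namelist()):
            if name.lower().endswith((".jpg", ".jpeg", ".png", ".webp")):
                (out / Path(name).name).write_bytes(z.read(name))

def pdf_to_pages(pdf_path: str, out_dir: str, dpi: int = 150) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)                 # rasterise the page
        pix.save(str(out / f"page_{i:04d}.png"))
```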
- Find Comics: Gather digital comic files.
- Convert to Pages: Extract images from containers (PDF, CBZ, etc.).
- VLM Analysis: Use VLMs to get panel text for each page.
- R-CNN Panel Detection: Get panel bounding boxes and coordinates.
- Data Integration: Join data into `DataSpec` (e.g., using `coco_to_dataspec`); an illustrative join is sketched after this list.
- Perfect Match Subset: Create high-quality subsets for training.
- Model Training: Train the Closure Lite model (using `closure_lite_dataset`, `closure_lite_simple_framework`, `train_closure_lite_simple_with_list`).
- Embedding Generation: Run the trained model to generate queryable embeddings.
- Querying: Access embeddings via code or a dedicated interface.
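
The join step can be pictured roughly as below. This is only an illustration of combining COCO-style panel boxes with per-page VLM text, not the actual `coco_to_dataspec` implementation, and the VLM JSON field names are assumptions.

```python
# Illustrative join of COCO-style panel detections with per-page VLM text.
# This is NOT the project's coco_to_dataspec; the VLM JSON field names are assumptions.
import json
from collections import defaultdict

def join_panels_and_text(coco_path: str, vlm_path: str) -> dict:
    with open(coco_path) as f:
        coco = json.load(f)                            # COCO-style detections
    with open(vlm_path) as f:
        vlm = json.load(f)                             # e.g. {"page_0001.png": {"panels": [...]}}

    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    boxes = defaultdict(list)
    for ann in coco["annotations"]:
        boxes[id_to_file[ann["image_id"]]].append(ann["bbox"])   # [x, y, w, h]

    records = {}
    for page, bxs in boxes.items():
        texts = vlm.get(page, {}).get("panels", [])
        records[page] = {
            "panel_boxes": sorted(bxs, key=lambda b: (b[1], b[0])),  # rough reading order
            "panel_texts": texts,
            "perfect_match": len(bxs) == len(texts),   # same panel count from both sources
        }
    return records
```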
- Goal: Ensure text alignment with the correct panels (same number of panels from R-CNN and VLM).
- Challenge: Initial alignment between Faster R-CNN and the various VLM runs was only ~25%, indicating a need to improve this step or consider alternative approaches (e.g., using OCR alongside the Faster R-CNN panels).
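
Continuing the illustrative records structure from the join sketch above, the alignment rate is just the fraction of pages where both sources report the same panel count:

```python
# Alignment check over the illustrative records above: fraction of pages where the
# detector and the VLM agree on panel count (the "perfect match" criterion).
def alignment_rate(records: dict) -> float:
    matched = sum(1 for r in records.values() if r["perfect_match"])
    return matched / max(len(records), 1)

# records = join_panels_and_text("panels_coco.json", "vlm_pages.json")
# print(alignment_rate(records))   # an initial run reportedly sat around 0.25
```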
- UMAP embeddings revealed distinct clusters.

- Landscape Cluster: A "ribbon cluster" was identified as a "landscape cluster" (pages with landscape orientation), which is less common in comics.

- Data Noise: An adjoining part of this cluster was found to be an error, containing extracted pages from non-comic content (e.g., computing books), highlighting the need for better data filtering.
- Clustering of vision-only embeddings showed distinct patterns (a minimal projection/clustering sketch follows below).

- Cluster 8: Identified as "all single panel very dark images" and "heavy text one page images," with a few multi-panel pages. A disconnected group of "all white images" formed a monochromatic sub-cluster.
- Aspect Ratios: Vision-only clustering also picked out significant differences in panel aspect ratios.
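
A minimal sketch of this kind of inspection, assuming vision-only embeddings saved as a NumPy array and using umap-learn plus HDBSCAN; the clustering choice is an assumption and the parameters are generic defaults, not those behind the reported clusters.

```python
# Minimal projection/clustering sketch over vision-only page embeddings.
# The .npy path is hypothetical and HDBSCAN is an assumed clustering choice.
import numpy as np
import umap        # umap-learn
import hdbscan

emb = np.load("vision_embeddings.npy")                 # (n_pages, 384)
proj = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(emb)
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(proj)

for cluster in np.unique(labels):
    idx = np.where(labels == cluster)[0]
    print(f"cluster {cluster}: {len(idx)} pages")       # inspect e.g. the landscape "ribbon"
```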

- Survey: "One missing piece in Vision and Language: A Survey on Comics Understanding" (https://arxiv.org/abs/2409.09502v1)
- Neural Networks & Transformers: "Investigating Neural Networks and Transformer Models for Enhanced Comic Decoding" (https://dl.acm.org/doi/10.1007/978-3-031-70645-5_10)
- Image Indexing: "Digital Comics Image Indexing Based on Deep Learning" (https://www.researchgate.net/publication/326137469_Digital_Comics_Image_Indexing_Based_on_Deep_Learning)
- Synthesis of Graphic Novels: "A Deep Learning Pipeline for the Synthesis of Graphic Novels" (https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_52.pdf)
- ComicsPAP: https://github.com/llabres/ComicsPAP
- Papers:
  - ComicsPAP paper: https://www.researchgate.net/publication/389748978_ComicsPAP_understanding_comic_strips_by_picking_the_correct_panel
  - https://huggingface.co/VLR- ComicsPap models
  - CoMix repo: https://github.com/emanuelevivoli/CoMix
  - "A Deep Learning Pipeline for the Synthesis of Graphic Novels": https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_52.pdf
- Dialogue transcription requires a robust panel detection pipeline.
- Refer to the main CoMix repository for the Faster R-CNN panel detection and processing to produce COCO-style files (a generic inference sketch follows this list).
- The code in `detections_2000ad` contains adaptations for inference, distinct from evaluation.
- Detailed functions are in https://github.com/RichardScottOZ/CoMix/tree/main/benchmarks/detections.
- Technical Stack: PyTorch (CUDA 11.8 recommended), transformers, xarray and zarr.
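
For orientation, a generic torchvision Faster R-CNN inference sketch is below; the CoMix detections code differs in detail, and the checkpoint path and two-class head (background + panel) are assumptions.

```python
# Generic torchvision Faster R-CNN inference sketch for panel boxes; the CoMix
# detections code differs in detail, and the checkpoint path is hypothetical.
import torch
from torchvision.io import ImageReadMode, read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(num_classes=2)          # background + panel (assumed)
model.load_state_dict(torch.load("panel_fasterrcnn.pth", map_location="cpu"))
model.eval()

img = convert_image_dtype(read_image("page_0001.png", mode=ImageReadMode.RGB), torch.float)
with torch.no_grad():
    pred = model([img])[0]

keep = pred["scores"] > 0.7
panel_boxes = pred["boxes"][keep].tolist()               # [x1, y1, x2, y2] per panel
```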
Extensive testing has been conducted with various Vision-Language Models for text extraction and analysis (a minimal local-extraction sketch follows the list):
- Gemma 4B: Handles the basics, but fails perpetually on some pages (reason not yet understood; Fourier analysis has been suggested).
- Gemma 12B: Can handle some of Gemma 4B's failures.
- Mistral 3.1: Generally identifies "character A and character B"; can handle Gemma failures.
- Qwen 2.5 VL Instruct: Failed on the same cases as Gemma 4B.
- Phi4: Not very good.
- Llama 11B: Failed to process.
- Gemini 1.5 Flash: Capable of handling missing/null captions and characters.
- Gemini 2.5 Flash Lite: Good performance, but 8x more expensive than Gemma 4B (which can run locally). Experienced quite a few connection errors with Google. On the hardest 660 samples, roughly 1 in 3 calls errored and 1 in 5 returned malformed JSON.
- GPT-4.1 nano: Reported "unsupported image type," indicating lower capability than Google's models.
- Meta Llama 4 Scout: Much better, with 300/500 successes on the remaining images, and better cost efficiency (0.08/0.30 vs 0.10/0.40 for Gemini 2.5 Flash Lite). However, encountered significant problems with the GMI Cloud provider.
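
A minimal sketch of local extraction with a small VLM, assuming the transformers image-text-to-text pipeline and the google/gemma-3-4b-it checkpoint; the prompt and output handling are illustrative, and the exact call signature may vary by transformers version.

```python
# Sketch of local per-page extraction with a small VLM (Gemma 3 4B assumed);
# prompt and output handling are illustrative.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it", device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "page_0001.png"},
        {"type": "text", "text": "Extract the dialogue and captions for each panel as JSON."},
    ],
}]
out = pipe(text=messages, max_new_tokens=512, return_full_text=False)
print(out[0]["generated_text"])                          # raw model text, hopefully JSON
```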
- To Generate: Panel embeddings (P), Page embeddings (E_page), Reading order embeddings.
- Dataset Coverage: Amazon perfect matches (212K pages), CalibreComics perfect matches (80K), and a combined dataset of all high-quality samples.
- Technical Approach: Batch Processing Script.
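
A hedged sketch of such a batch script, writing fused embeddings to the Zarr store used for querying; the model loading and forward-output key are placeholders rather than the real `closure_lite_*` interfaces.

```python
# Batch-processing sketch: run the trained model over page batches and append fused
# embeddings to Zarr. The forward-output key is a placeholder, not the real API.
import torch
import zarr

def generate_embeddings(model, dataloader, out_path="fused_embeddings.zarr", dim=384):
    z = zarr.open(out_path, mode="w", shape=(0, dim), chunks=(4096, dim), dtype="f4")
    model.eval()
    with torch.no_grad():
        for batch in dataloader:                         # page images, panel boxes, panel texts
            fused = model(batch)["page_embedding"]       # hypothetical output key
            z.append(fused.cpu().numpy().astype("float32"))
```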
- Humble Bundle: If you download everything, you also get your book and RPG collections, which will happily go through the PDF and EPUB extraction and end up as page images.
- The framework handled this: the 'deep learning / Linux / programming' book images cluster as their own small group connected to a ribbon, so they are not ideal to include, but it is a useful result. What if you put RPG art in? Most likely you would obtain lots of single-page 'panels', which would cluster in the 'no/low text' UMAP sections.
- Refer to documentation/future_work/.
- Need to work out page classes within a comic that work for 'modern comics', as opposed to old public domain comics
- Cover, credits and indicia, splash, cover interior for trades, back matter text, back matter art, previews etc.
- single issues
- trade paperbacks
- omnibuses, collections that are larger
- anthologies - art/narrative changes 'inside' a comic - e.g. 2000AD
- anthologies inside omnibuses
- Trying BART for the above, as there are many VLM JSON files with some signal of what pages might contain (e.g., in the 'overall summary' field).
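
One possible BART route is zero-shot classification of the VLM 'overall summary' text with facebook/bart-large-mnli; the candidate labels below are illustrative, and this is only one way the idea could be wired up.

```python
# One possible BART route: zero-shot classification of the VLM "overall summary" text
# with facebook/bart-large-mnli. The candidate labels are illustrative.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["cover", "credits and indicia", "splash page", "story page",
          "back matter text", "back matter art", "preview"]

summary = "A painted wraparound image with the series logo and issue number."  # from a VLM JSON
result = classifier(summary, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))
```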
- Importance: PSS is crucial for the next version of the project, feeding page type markers into multimodal fusion.
- CoSMo Model:
- GitHub Repository (includes arXiv paper link).
- Specifically designed for PSS.
- Uses Qwen 2.5 VL 32B for OCR.
- Consideration: Is Gemma as good/cost-effective for OCR?
- Current Status: No pre-trained model available for direct testing; requires training a new one.
- notebook in Data Analysis - comic training
- try first-pass classes (see the classifier-head sketch after this list)
- cover, credits, advertisements, art (previews, internal covers, etc., hopefully eventually), text, back_cover - though back_cover is often ads or previews, so it might not be a useful class
- Needed for everything - page / panel / characters
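
A sketch of wiring the first-pass classes listed above into an image-classification head with transformers; the ViT backbone is an assumed stand-in, not the project's chosen model.

```python
# Sketch of wiring the first-pass classes into an image-classification head;
# the ViT backbone is an assumed stand-in, not the project's chosen model.
from transformers import AutoImageProcessor, AutoModelForImageClassification

CLASSES = ["cover", "credits", "advertisements", "art", "text", "back_cover"]
id2label = dict(enumerate(CLASSES))
label2id = {v: k for k, v in id2label.items()}

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(CLASSES), id2label=id2label, label2id=label2id,
)
```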
- Discussion on Deepseek OCR references Qwen here too: https://news.ycombinator.com/item?id=43549072
- note bounding box commentary
- also https://pyimagesearch.com/2025/06/09/object-detection-and-visual-grounding-with-qwen-2-5/
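
A hedged sketch of prompting Qwen 2.5 VL for panel bounding boxes along the lines of the visual-grounding article above; the model size, prompt wording, and processor usage are assumptions and may need adjusting per transformers version.

```python
# Hedged sketch of visual grounding with Qwen 2.5 VL: ask for panel bounding boxes as JSON.
# Model size, prompt wording, and processor usage are assumptions.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

img = Image.open("page_0001.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Return the bounding box of each comic panel as JSON "
                             "objects with 'bbox_2d' and 'label' keys."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[img], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```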

