Comic Analysis: Project Overview

This document outlines the ongoing work in analyzing comics using data science techniques, focusing on multimodal modeling for content understanding and querying.

Models Available

CoSMo Page Stream Segmentation - https://huggingface.co/RichardScottOZ/cosmo-v4

1. Project Goals & Evolution

  • Version 1 (Closure Lite Framework):
    • Aim: To combine page image and VLM text for querying, with panel and reading order recognition.
    • Key Choice: Utilized 384-dimensional "lite" embeddings to allow for testing on retail GPUs.
    • Model: Based on the RichardScottOZ/comics-analysis-closure-lite-simple model, trained as a context-denoise variant.
  • Version 2 (Current Focus): Initiating work towards a new, more robust modeling framework, particularly for Page Stream Segmentation.

2. Core Framework: Closure Lite

The Closure Lite Framework is a multimodal fusion model designed to analyze images, text, and panel reading order.

  • Objective: Generate multimodal fused embeddings that are queryable for similarity.
    • Embeddings are stored efficiently in Zarr format (e.g., the 80K-page "perfect match" subset takes ~500 MB).
    • An interface (Flask-based) exists for querying; a minimal query sketch appears at the end of this section.
  • Design Principle: Usable on a reasonable retail GPU (hence 384-dim embeddings).
  • Model Variants: base, denoise, context, context-denoise.
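
To make the querying concrete, here is a minimal similarity-query sketch. The store path and the `embeddings`/`page_ids` array names are assumptions for illustration, not the repo's actual layout:

```python
import numpy as np
import zarr

# Open the store of fused 384-dim page embeddings (names/paths are assumed).
store = zarr.open("closure_lite_embeddings.zarr", mode="r")
emb = np.asarray(store["embeddings"])      # shape (n_pages, 384)
page_ids = np.asarray(store["page_ids"])   # parallel array of page identifiers

# L2-normalise once so dot products become cosine similarities.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

def top_k_similar(query_idx: int, k: int = 5):
    """Return the k pages most similar to the page at query_idx."""
    sims = emb @ emb[query_idx]
    order = np.argsort(-sims)[1 : k + 1]   # drop the query page itself
    return list(zip(page_ids[order], sims[order]))

print(top_k_similar(0))
```

A Flask interface can wrap a function like this behind an HTTP endpoint.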

3. Data Sources & Preparation

  • Primary Data: Comic pages from various sources (Amazon/Comixology, Dark Horse, Calibre-managed collections like Humble Bundle and DriveThruComics).
  • Future Data: Exploring new Humble Bundles and potential scraping (e.g., Neon Ichigan).
  • Initial Processing:
    • Conversion: Comic files (PDF, CBZ, etc.) are converted into individual page images (see the CBZ sketch after this list).
    • VLM Text Extraction: Vision-Language Models (VLMs) are used to extract text from each page; they are preferred over general OCR because comic lettering poses challenges for conventional OCR.
    • Panel Detection: Faster R-CNN is used to detect panel bounding boxes and coordinates.
    • DataSpec Integration: All extracted data is joined into a DataSpec format for modeling.
  • Dataset Size: Currently working with a "Calibre" subset of ~330K pages (compared to ~805K Amazon pages).
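
Since CBZ containers are plain zip archives, the conversion step for that format needs nothing beyond the standard library; a minimal sketch (PDFs additionally need a renderer such as pdf2image, not shown):

```python
import zipfile
from pathlib import Path

def cbz_to_pages(cbz_path: str, out_dir: str) -> list[Path]:
    """Extract page images from a CBZ (zip) container in reading order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = []
    with zipfile.ZipFile(cbz_path) as zf:
        # CBZ readers rely on lexicographic filename order for page order.
        names = sorted(n for n in zf.namelist()
                       if n.lower().endswith((".jpg", ".jpeg", ".png", ".webp")))
        for i, name in enumerate(names):
            target = out / f"page_{i:04d}{Path(name).suffix.lower()}"
            target.write_bytes(zf.read(name))
            pages.append(target)
    return pages
```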

4. Basic Workflow

  1. Find Comics: Gather digital comic files.
  2. Convert to Pages: Extract images from containers (PDF, CBZ, etc.).
  3. VLM Analysis: Use VLMs to get panel text for each page.
  4. R-CNN Panel Detection: Get panel bounding boxes and coordinates.
  5. Data Integration: Join data into DataSpec (e.g., using coco_to_dataspec); a hypothetical join sketch follows this list.
  6. Perfect Match Subset: Create high-quality subsets for training.
  7. Model Training: Train the Closure Lite model (using closure_lite_dataset, closure_lite_simple_framework, train_closure_lite_simple_with_list).
  8. Embedding Generation: Run the trained model to generate queryable embeddings.
  9. Querying: Access embeddings via code or a dedicated interface.
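
The actual DataSpec schema lives in the repo's coco_to_dataspec code; the sketch below is only a hypothetical illustration of step 5, assuming COCO-format panel detections and one VLM JSON file per page (all field names are invented for illustration):

```python
import json
from pathlib import Path

def join_page_records(coco_json: str, vlm_dir: str) -> list[dict]:
    """Join COCO panel detections with per-page VLM text into page records."""
    coco = json.loads(Path(coco_json).read_text())
    boxes_by_image = {}
    for ann in coco["annotations"]:
        boxes_by_image.setdefault(ann["image_id"], []).append(ann["bbox"])
    records = []
    for img in coco["images"]:
        vlm_path = Path(vlm_dir) / (Path(img["file_name"]).stem + ".json")
        vlm = json.loads(vlm_path.read_text()) if vlm_path.exists() else {}
        records.append({
            "page": img["file_name"],
            "panels": boxes_by_image.get(img["id"], []),   # [x, y, w, h] boxes
            "panel_texts": vlm.get("panels", []),          # VLM text per panel
        })
    return records
```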

5. Perfect Match Dataset Notes

  • Goal: Ensure text alignment with the correct panels (the same number of panels from Faster R-CNN and the VLM).
  • Challenge: Initial alignment between Faster R-CNN and the various VLM runs was only ~25%, indicating a need to improve this step or consider alternative approaches (e.g., running OCR on the Faster R-CNN detections).
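
The perfect-match rule reduces to a panel-count check; a sketch over the hypothetical joined records above:

```python
def perfect_match_subset(records: list[dict]) -> list[dict]:
    """Keep only pages where the detector and the VLM agree on panel count."""
    return [r for r in records
            if r["panels"] and len(r["panels"]) == len(r["panel_texts"])]
```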

6. Key Results from Closure Lite (80K Calibre Subset, Context-Denoise)

Panel Similarity

  • An early model showed panel similarities, with the model cutting things off at 12 panels.

Page Similarity

  • The model demonstrated understanding of page composition and structure.

Clustering (UMAP Embeddings)

  • UMAP embeddings revealed distinct clusters (a projection sketch follows this subsection).
  • Landscape Cluster: A "ribbon cluster" was identified as a "landscape cluster" (pages with landscape orientation), which is less common in comics.
  • Data Noise: An adjoining part of this cluster was found to be an error, containing extracted pages from non-comic content (e.g., computing books), highlighting the need for better data filtering.
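
The projection itself is a standard umap-learn call; a minimal sketch, assuming the Zarr layout from the earlier query example:

```python
import numpy as np
import umap  # pip install umap-learn
import zarr

emb = np.asarray(zarr.open("closure_lite_embeddings.zarr", mode="r")["embeddings"])

# Project the 384-dim fused embeddings to 2-D for scatter plots / clustering.
reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(emb)  # shape (n_pages, 2)
```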

Ablation Study (Vision-Only Embeddings)

  • Clustering of vision-only embeddings showed distinct patterns.
  • Cluster 8: Identified as "all single panel very dark images" and "heavy text one page images," with a few multi-panel pages. A disconnected group of "all white images" formed a monochromatic sub-cluster.
  • Aspect Ratios: Vision-only clustering also picked out significant differences in panel aspect ratios.
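
For reference, per-panel aspect ratio is just box width over height; a tiny sketch assuming COCO-style [x, y, w, h] boxes:

```python
def panel_aspect_ratios(boxes: list[list[float]]) -> list[float]:
    """COCO boxes are [x, y, w, h]; ratio > 1 is wide/landscape, < 1 is tall."""
    return [w / h for (_x, _y, w, h) in boxes if h > 0]
```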

7. Relevant Research & Future Tasks

Relevant Research

Tasks of Interest

  • Dialogue transcription, which necessitates a robust panel detection pipeline.

8. Pipeline Details

  • Refer to the main CoMix repository for the Faster R-CNN panel detection and the processing used to produce COCO-style files.
  • The code in detections_2000ad contains adaptations for inference, distinct from evaluation; a generic inference sketch follows this list.
  • Detailed functions are in https://github.com/RichardScottOZ/CoMix/tree/main/benchmarks/detections.
  • Technical Stack: PyTorch (CUDA 11.8 recommended), transformers, xarray and zarr.
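
The detection adaptations themselves live in the CoMix code above; purely as a shape-of-the-API illustration, here is a generic torchvision Faster R-CNN inference sketch (the COCO-pretrained default weights are a placeholder; real panel detection needs the CoMix-trained checkpoint):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Placeholder weights: swap in the CoMix panel-detection checkpoint in practice.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def detect_panels(image_path: str, score_thresh: float = 0.5) -> list[list[float]]:
    """Run detection on one page image; return [x1, y1, x2, y2] boxes."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    out = model([img])[0]
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep].tolist()
```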

9. VLM Model Experiments & Possibilities

Extensive testing has been conducted with various Vision-Language Models for text extraction and analysis (a generic extraction sketch follows this list):

  • Gemma 4B: Handles the basics, but fails persistently on some pages (reason not yet understood; Fourier analysis has been suggested as a diagnostic).
  • Gemma 12B: Can handle some of Gemma 4B's failures.
  • Mistral 3.1: Generally identifies "character A and character B," and can handle the Gemma failures.
  • Qwen 2.5 VL Instruct: Failed on the same cases as Gemma 4B.
  • Phi-4: Not very good.
  • Llama 11B: Failed to process.
  • Gemini Flash 1.5: Capable of handling missing/null captions and characters.
  • Gemini Flash 2.5 Lite: Good performance, but 8x more expensive than Gemma 4B (which can run locally). Experienced quite a few connection errors with Google. On the hardest 660 samples, roughly one third hit errors and one fifth returned malformed JSON.
  • GPT-4.1 Nano: Reported "unsupported image type," indicating lower capability than Google's models.
  • Meta Llama 4 Scout: Much better, succeeding on 300 of the 500 remaining images at better cost (0.08/0.3 vs 0.10/0.40 for Gemini Flash 2.5 Lite). However, it encountered significant problems with the GMI Cloud provider.
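
For the locally runnable models, extraction can go through the transformers image-text-to-text pipeline (available in recent transformers releases); the model name, prompt, and output handling below are assumptions for illustration:

```python
from transformers import pipeline

# Model choice is an assumption; any VLM this pipeline supports would do.
vlm = pipeline("image-text-to-text", model="google/gemma-3-4b-it")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "page_0001.jpg"},
        {"type": "text", "text": "Transcribe the dialogue in each panel as JSON."},
    ],
}]
result = vlm(text=messages, max_new_tokens=512)
print(result[0]["generated_text"])
```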

10. Embeddings Generation Strategy

  • To Generate: Panel embeddings (P), Page embeddings (E_page), Reading order embeddings.
  • Dataset Coverage: Amazon perfect matches (212K pages), CalibreComics perfect matches (80K), and a combined dataset of all high-quality samples.
  • Technical Approach: Batch Processing Script.
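
A batch-processing sketch, assuming the zarr v2 API and a trained Closure Lite model whose forward pass returns one fused vector per page (the call signature is an assumption):

```python
import torch
import zarr
from torch.utils.data import DataLoader

def generate_embeddings(model, dataset, out_path: str, dim: int = 384,
                        batch_size: int = 64) -> None:
    """Stream fused page embeddings for a whole dataset into a Zarr array."""
    store = zarr.open(out_path, mode="w")
    emb = store.create_dataset("embeddings", shape=(len(dataset), dim),
                               chunks=(1024, dim), dtype="f4")
    loader = DataLoader(dataset, batch_size=batch_size)
    model.eval()
    offset = 0
    with torch.no_grad():
        for batch in loader:
            vecs = model(batch).cpu().numpy().astype("f4")  # (B, dim)
            emb[offset:offset + len(vecs)] = vecs
            offset += len(vecs)
```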

11. Domain capability

  • Humble Bundle: If you download everything, you also get your book and RPG collections, which happily go through the PDF and EPUB extraction and end up as page images.
  • The framework handled this: the 'deep learning / linux / programming' book images cluster as their own small group connected to a ribbon, so not ideal to include, but a useful result. What if you put RPG art in? Most likely lots of single-page 'panels' would be obtained, and they would cluster in the 'no/low text' UMAP sections.

12. Future Research

Version 2

1. Paratext

  • Need to work out page classes inside a comic that will work for 'modern comics', as opposed to old public-domain comics.
  • Cover, credits and indicia, splash, cover interior for trades, back matter text, back matter art, previews, etc.

Problems

  • single issues
  • trade paperbacks
  • omnibuses and larger collections
  • anthologies - art/narrative changes 'inside' a comic - e.g. 2000AD
    • anthologies inside omnibuses

a. Zero shot classification

  • Trying BART for the classes above, since the many existing VLM JSON files carry some signal of what pages might include (in the 'overall summary' field, for example).
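
A minimal sketch with the standard transformers zero-shot pipeline, using the paratext classes from above as candidate labels (the summary text here is invented; in practice it would come from a VLM JSON's 'overall summary' field):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

labels = ["cover", "credits and indicia", "splash page", "story page",
          "back matter text", "back matter art", "preview"]

summary = "A single large illustration with the series title and issue number."
result = classifier(summary, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))
```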

b. Page Stream Segmentation (PSS) & CoSMo

  • Importance: PSS is crucial for the next version of the project, feeding page type markers into multimodal fusion.
  • CoSMo Model:
    • GitHub Repository (includes arXiv paper link).
    • Specifically designed for PSS.
    • Uses Qwen 2.5 VL 32B for OCR.
    • Consideration: Is Gemma as good/cost-effective for OCR?
    • Current Status: No pre-trained model available for direct testing; requires training a new one.
  • Notebook in Data Analysis - comic training:
    • Try first-pass classes.
    • cover, credits, advertisements, art (previews, internal covers, etc., hopefully eventually), text, back_cover (often ads or previews, so it might not be a useful classification).

2. OCR