
Transform fiction novels into user guided visual narratives


Download for iPhone

TestFlight Beta Apache 2.0 Website


What is visibl?

visibl transforms any fiction novel into a personalized cinematic experience. It's a new kind of audiobook player that generates visual scenes in real-time as you listen, letting you guide the artistic direction of your own journey through the story.

Free for iPhone. No production studios. No waiting. Just your books, visualized instantly.

Read on to see how it works.


The Problem

Reading is dying. The average American reads 12 minutes per day, while spending 2.5 hours on TikTok and Instagram. Long-form narrative content can't compete with the dopamine hit of short-form video.

But the stories themselves aren't the problem - it's the medium. People still crave narrative (Netflix has 200M+ subscribers); they just won't read text for 10 hours when they could watch instead.

Our hypothesis: reading can be made as engaging and immersive as TikTok and Instagram.

The Solution

visibl is a pipeline that converts any fiction novel into a synchronized audio-visual experience in real time. No human intervention, no production costs - just automated scene generation from text.

Think of it as a compiler that takes a novel as input and outputs a graphic novel.


Technical Architecture

The pipeline consists of several stages that transform text into synchronized visual content:

TL;DR: use RAG and graph data models to create detailed image prompts for a diffusion image model.

1. Entity Extraction via NER

Text to Structured Data

Using a lightweight language model, we extract characters and locations from text chunks. This creates the foundational scene graph that drives all visualization.

Implementation Details
  • Model: deepseek-v3
  • Processing: ~512-token chunks with no overlap
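
The chunking step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `chunk_tokens` is hypothetical, and tokens are approximated by whitespace-separated words, whereas a real pipeline would count tokens with the model's own tokenizer.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int = 512) -> List[str]:
    """Split text into consecutive, non-overlapping chunks of at most chunk_size tokens.

    Assumption: a "token" is a whitespace-separated word. This only
    approximates the token counts a language model would see.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    # Step through the token list in strides of chunk_size, so chunks never overlap.
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# Tiny demonstration with a chunk size of 2 instead of 512.
demo_chunks = chunk_tokens("a b c d e", chunk_size=2)
```

Because there is no overlap, an entity mention that straddles a chunk boundary may be split; the later alias-resolution stage works on full chapter text, which may help recover such cases.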

Example Input:

In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.

"Whenever you feel like criticizing any one," he told me, "just remember that all the people in this world haven't had the advantages that you've had."

Example Output (JSON):

["zelda", "gatsby", "my father", "narrator"]

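Note: From The Great Gatsby. The narrator and his father are the entities in the excerpt shown; "zelda" and "gatsby" do not appear in this excerpt and presumably come from the rest of the ~512-token chunk.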


2. Alias Resolution & Entity Linking

Maintaining Consistency

A multi-pass reasoning model groups references to the same entity ("Harry" = "Potter" = "The Boy Who Lived"). This is critical for maintaining visual consistency across scenes.

Implementation Details
  • Two-stage process: intra-chapter then inter-chapter resolution
  • Model: gpt-5-mini
  • Processing: Full chapter text with NER list in system message

Example Input:

["zelda", "gatsby", "my father", "narrator"]

Example Output:

{
  "characters": [
    {
      "name": "zelda fitzgerald",
      "aliases": ["zelda"]
    },
    {
      "name": "jay gatsby",
      "aliases": ["gatsby", "mr. gatsby"]
    },
    {
      "name": "narrator's father",
      "aliases": ["my father", "father"]
    },
    {
      "name": "nick carraway",
      "aliases": ["nick", "narrator", "mr. carraway"]
    }
  ]
}

Note: Reasoning model identifies that "narrator" refers to Nick Carraway and groups all related aliases.
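
One way downstream stages might consume this output is through a flat alias lookup. The sketch below assumes the JSON shape shown above; the helper name `build_alias_index` and the first-owner-wins rule for duplicate aliases are illustrative assumptions, not part of the documented pipeline.

```python
from typing import Dict, List


def build_alias_index(resolved: Dict[str, List[dict]]) -> Dict[str, str]:
    """Map every alias (and each canonical name) to its canonical entity name."""
    index: Dict[str, str] = {}
    for entity in resolved["characters"]:
        canonical = entity["name"]
        for alias in [canonical, *entity["aliases"]]:
            # Assumption: if two entities claim the same alias, the first one keeps it.
            index.setdefault(alias.lower(), canonical)
    return index


resolved = {
    "characters": [
        {"name": "jay gatsby", "aliases": ["gatsby", "mr. gatsby"]},
        {"name": "nick carraway", "aliases": ["nick", "narrator", "mr. carraway"]},
    ]
}
alias_index = build_alias_index(resolved)
```

With this index, any mention found in a later chunk can be resolved to one canonical entity before its visual description is looked up.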

Cross-Chapter Continuity Example:

{
  "gatsby's house": {
    "appearsIn": [
      {
        "0": {
          "name": "the house",
          "confidence": "high",
          "reason": "Aliases match (my neighbor's house / my neighbor's mansion / the house) — same primary residence in both chapters."
        }
      },
      {
        "1": {
          "name": "gatsby's",
          "confidence": "high",
          "reason": "Clear identity: 'gatsby's house' in Chapter 2 corresponds to 'gatsby's' in Chapter 1 (same residence/name/alias)."
        }
      }
    ],
    "firstAppearance": 0,
    "allAliases": [
      "my neighbor's house",
      "the house",
      "gatsby's house",
      "gatsby's"
    ]
  }
}

Note: The model tracks entity appearances across chapters, maintaining continuity even when references change (e.g., "my neighbor's house" → "gatsby's house").
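
The record above could be assembled once the reasoning model has decided which per-chapter entities match. The following is a sketch under assumptions: `merge_appearances` is a hypothetical helper, the split of aliases between chapters in the example is guessed for illustration, and the confidence and reason fields are omitted because they come from the model.

```python
from typing import List, Tuple


def merge_appearances(
    canonical: str, appearances: List[Tuple[int, str, List[str]]]
) -> dict:
    """Fold already-matched per-chapter entities into one cross-chapter record.

    Each appearance is (chapter_index, name_in_chapter, aliases_in_chapter).
    """
    if not appearances:
        raise ValueError("at least one appearance is required")
    ordered = sorted(appearances, key=lambda a: a[0])
    all_aliases: List[str] = []
    for _, name, aliases in ordered:
        for alias in [*aliases, name]:
            if alias not in all_aliases:  # de-duplicate while keeping first-seen order
                all_aliases.append(alias)
    return {
        canonical: {
            "appearsIn": [{str(ch): {"name": name}} for ch, name, _ in ordered],
            "firstAppearance": ordered[0][0],
            "allAliases": all_aliases,
        }
    }


record = merge_appearances(
    "gatsby's house",
    [
        (1, "gatsby's", ["gatsby's house", "gatsby's"]),
        (0, "the house", ["my neighbor's house", "the house"]),
    ],
)
```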


3. Property Tuple Generation

Dynamic Entity State

Entities aren't static - we track state changes through property tuples (for a character: appearance, injuries). This enables visual progression throughout the narrative.

Implementation Details
  • First pass: Entity + full chapter text → generate comprehensive tuples for single entity
  • Second pass: Compare with previous chapter tuples using reasoning model
  • Determines deprecated tuples (e.g., costume changes, injuries healing)
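  • Tuple structure: (entity, property, value); in the JSON examples below these appear as the keys "character", "relationship", and "property", respectively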
  • Tuple Model: deepseek-v3
  • Reasoning Model: gpt-5-mini

Example Input:

[raw chapter text] +
{
  "name": "tom buchanan",
  "aliases": ["tom buchanan", "tom"]
}

Example Output (Initial Extraction):

[
  {
    "character": "tom buchanan",
    "relationship": "hair_color",
    "property": "straw-haired"
  },
  {
    "character": "tom buchanan",
    "relationship": "facial_features",
    "property": "hard mouth"
  },
  {
    "character": "tom buchanan",
    "relationship": "facial_features",
    "property": "shining, arrogant eyes"
  },
  {
    "character": "tom buchanan",
    "relationship": "wearing",
    "property": "riding clothes"
  },
  {
    "character": "tom buchanan",
    "relationship": "age",
    "property": "thirty"
  }
]

Cross-Chapter Reconciliation:

{
  "tom buchanan": {
    "sourceChapter": 1,
    "included": [
      {"relationship": "gender", "property": "male"},
      {"relationship": "build", "property": "sturdy"},
      {"relationship": "hair_color", "property": "straw-haired"},
      {"relationship": "facial_features", "property": "hard mouth"},
      {"relationship": "facial_features", "property": "shining, arrogant eyes"},
      {"relationship": "wearing", "property": "riding clothes"},
      {"relationship": "age", "property": "thirty"}
    ],
    "dropped": [],
    "reasoning": "No properties from previous chapter contradicted. Retained gender, permanent features (hair, face), build, age, and clothing."
  }
}

Note: System tracks which properties persist vs. change between chapters, essential for maintaining character continuity.
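
The bookkeeping around the second pass can be sketched as below. In the pipeline described here the reasoning model decides which tuples are deprecated; this sketch takes that decision as an input, and the names `reconcile` and `deprecated` are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

PropertyTuple = Dict[str, str]  # {"relationship": ..., "property": ...}


def reconcile(
    previous: List[PropertyTuple],
    current: List[PropertyTuple],
    deprecated: Set[Tuple[str, str]],
) -> Dict[str, List[PropertyTuple]]:
    """Carry previous-chapter tuples forward unless deprecated, then add new ones."""
    included: List[PropertyTuple] = []
    dropped: List[PropertyTuple] = []
    for t in previous:
        key = (t["relationship"], t["property"])
        (dropped if key in deprecated else included).append(t)
    for t in current:
        if t not in included:  # avoid duplicating a tuple that was carried forward
            included.append(t)
    return {"included": included, "dropped": dropped}


previous = [
    {"relationship": "gender", "property": "male"},
    {"relationship": "wearing", "property": "riding clothes"},
]
current = [{"relationship": "hair_color", "property": "straw-haired"}]

# Hypothetical decision: the character has changed clothes in this chapter.
result = reconcile(previous, current, deprecated={("wearing", "riding clothes")})
```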


4. Entity Prompt Generation

Tuples to Visual Description

Converts entity property tuples into rich visual descriptions ready for diffusion models.

Implementation Details
  • Entity tuples → visual description generation
  • Only requires tuples from prior steps
  • Model: fine-tuned gpt-4.1

Example Input:

[
  {
    "character": "tom buchanan",
    "relationship": "hair_color",
    "property": "straw-haired"
  },
  {
    "character": "tom buchanan",
    "relationship": "facial_features",
    "property": "hard mouth"
  },
  {
    "character": "tom buchanan",
    "relationship": "facial_features",
    "property": "shining, arrogant eyes"
  },
  {
    "character": "tom buchanan",
    "relationship": "wearing",
    "property": "riding clothes"
  },
  {
    "character": "tom buchanan",
    "relationship": "age",
    "property": "thirty"
  }
]

Example Output:

{
  "character": "tom buchanan",
  "description": "Tom Buchanan is a thirty-year-old man with a sturdy, athletic build that commands physical presence. His straw-blonde hair is slightly tousled, appearing as if recently windblown from outdoor activity. His face is dominated by a hard, set mouth that suggests stubbornness and privilege, complemented by shining eyes that radiate unapologetic arrogance. These piercing eyes seem to constantly appraise others with undisguised superiority. Currently dressed in fitted riding clothes—likely consisting of tailored breeches, knee-high leather boots, and a crisp riding jacket—his attire speaks to both his wealth and athletic pursuits. His forehead might show faint weathering from outdoor sports, and his posture carries the natural authority of someone accustomed to dominance. The sunlight catches golden highlights in his hair as he stands with the squared shoulders of a former collegiate athlete."
}

Generated Image:

[Image: Tom Buchanan]

Note: The model expands sparse tuples into cinematically rich descriptions optimized for visual generation.
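
Before the description model sees them, the tuples need to be serialized in some form. The sketch below shows one plausible rendering as a compact bulleted block; the function name and the exact text layout are assumptions, since the source does not say how the tuples are passed to the fine-tuned model.

```python
from typing import Dict, List


def tuples_to_prompt_input(tuples: List[Dict[str, str]]) -> str:
    """Render one entity's tuples as a bulleted text block for a description model."""
    if not tuples:
        raise ValueError("at least one tuple is required")
    names = {t["character"] for t in tuples}
    if len(names) != 1:
        raise ValueError("all tuples must describe the same entity")
    lines = [f"Entity: {tuples[0]['character']}"]
    # One bullet per tuple, preserving extraction order.
    lines += [f"- {t['relationship']}: {t['property']}" for t in tuples]
    return "\n".join(lines)


prompt_input = tuples_to_prompt_input(
    [
        {"character": "tom buchanan", "relationship": "hair_color", "property": "straw-haired"},
        {"character": "tom buchanan", "relationship": "age", "property": "thirty"},
    ]
)
```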


5. Scene Decomposition

Narrative to Timeline

Chunks text into ~15-second scenes synchronized with audio narration. Handles pacing, transitions, and narrative structure.

Implementation Details
  • Processing: ~2048-token chunks with no overlap
  • Creative LLM chunks text into storyboard-style scenes
  • Timing embeddings for audio synchronization
  • Model: deepseek-v3

Example Input:

[raw chapter text with timestamps]

Example Output (Scene Data):

{
  "scene_number": 2,
  "description": "A young Nick Carraway stands in a well-appointed study with his father, a distinguished older gentleman. Sunlight streams through french windows as Nick listens intently to his father's advice. The father places a hand on Nick's shoulder in a moment of paternal wisdom.",
  "startTime": 35.1,
  "endTime": 70.6,
  "characters": {
    "nick carraway": "- Early 30s\n- Male\n- Lean, wiry build\n- Clean-shaven face\n- Sharp, angular features\n- Defined jawline\n- High cheekbones\n- Bright, intelligent eyes\n- Straight eyebrows\n- Short, neatly combed hair with a precise side part\n- Medium brown hair, slightly sun-bleached at the temples\n- Straight-backed posture\n- Wears well-tailored suits in muted tones\n- Crisp collars\n- Carefully knotted ties\n- Smooth but strong-looking hands\n- Long fingers",
    "nick carraway's father": "- Male\n- Late 50s\n- Strong, square jawline\n- Neatly trimmed gray hair, slightly receding at the temples\n- Gentle weathering of middle age\n- Faint smile lines around eyes and mouth\n- Wears round, wire-rimmed glasses\n- Upright posture\n- Wears a well-tailored charcoal gray wool three-piece suit\n- Crisp white dress shirt\n- Muted patterned tie secured with a simple tie pin\n- Polished black oxford shoes\n- Carries a mahogany-tipped walking stick"
  },
  "locations": {
    "the room": "- Walls are crimson in color\n- Soft, radiant light fills the room\n- A luxurious long couch at the center, upholstered in deep red\n- Rich wood paneling\n- French windows letting in bright afternoon light"
  },
  "viewpoint": {
    "setting": "bright afternoon light, rich wood paneling",
    "placement": "two-shot with Nick in foreground",
    "shot_type": "medium close-up",
    "mood": "reflective, nostalgic",
    "technical": "85mm f/4, warm color temperature, 9:16 aspect ratio"
  }
}

Final Prompt (after fine-tuning):

"Envision a young male character in his early 30s, with a lean, wiry build and a crisp, clean-shaven face. He possesses sharp, angular features, highlighted by a defined jawline and high cheekbones. His bright, intelligent eyes shine under straight eyebrows and his short, neatly combed hair, with a precise side part, is medium brown, slightly sun-bleached at the temples. His posture is straight-backed and he is dressed in a well-tailored suit in muted tones. Beside him stands an older, distinguished gentleman, his late 50s bearing a gentle weathering of middle age. He has a strong, square jawline, and neatly trimmed gray hair, slightly receding at the temples. Round, wire-rimmed glasses sit on his face and he wears a well-tailored charcoal gray wool three-piece suit, with a crisp white dress shirt and a muted patterned tie secured with a simple tie pin. His shoes are polished black oxfords and he carries a mahogany-tipped walking stick. They are inside a well-appointed study bathed in bright afternoon light, rich wood paneling serving as the background. The room is painted crimson, and a sense of warmth radiates from all surfaces. Both men have been captured in a nostalgic, reflective mood."

Generated Scene:

[Image: Nick and his father]

Note: System creates cinematic scenes with precise timing, character descriptions, and camera direction - essentially automated storyboarding.
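
During playback, the player has to find the scene that matches the current audio position. A minimal sketch of that lookup, assuming scenes are sorted by startTime and do not overlap, is shown below. The function name `scene_at` is hypothetical, and the first scene's timing is invented for the example; only scene 2's times come from the data above.

```python
from bisect import bisect_right
from typing import List, Optional


def scene_at(scenes: List[dict], position: float) -> Optional[dict]:
    """Return the scene whose [startTime, endTime) interval contains position.

    Assumes scenes are sorted by startTime and do not overlap.
    """
    starts = [s["startTime"] for s in scenes]
    # Index of the last scene that starts at or before the playback position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and position < scenes[i]["endTime"]:
        return scenes[i]
    return None


scenes = [
    {"scene_number": 1, "startTime": 0.0, "endTime": 35.1},  # hypothetical timing
    {"scene_number": 2, "startTime": 35.1, "endTime": 70.6},
]
current_scene = scene_at(scenes, 40.0)
```

Binary search keeps the lookup cheap even for long books, which matters if it runs on every playback tick.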


6. Real-time Image Synthesis

Visual Generation

Finally, the assembled prompt is passed to a diffusion model.

Implementation Details
  • Model: imagen4

Example Input (Prompt):

"Envision a young male character in his early 30s, with a lean, wiry build and a crisp, clean-shaven face. He possesses sharp, angular features, highlighted by a defined jawline and high cheekbones. His bright, intelligent eyes shine under straight eyebrows and his short, neatly combed hair, with a precise side part, is medium brown, slightly sun-bleached at the temples. His posture is straight-backed and he is dressed in a well-tailored suit in muted tones. Beside him stands an older, distinguished gentleman, his late 50s bearing a gentle weathering of middle age. He has a strong, square jawline, and neatly trimmed gray hair, slightly receding at the temples. Round, wire-rimmed glasses sit on his face and he wears a well-tailored charcoal gray wool three-piece suit, with a crisp white dress shirt and a muted patterned tie secured with a simple tie pin. His shoes are polished black oxfords and he carries a mahogany-tipped walking stick. They are inside a well-appointed study bathed in bright afternoon light, rich wood paneling serving as the background. The room is painted crimson, and a sense of warmth radiates from all surfaces. Both men have been captured in a nostalgic, reflective mood."

Example Output:

[Image: Generated scene]

Note: Diffusion model generates high-quality images from detailed prompts in real-time during audiobook playback.


7. Style Transfer via ControlNet

User-Directed Aesthetics

Allows artistic control while maintaining structural accuracy. Users can define visual style without breaking narrative coherence.

Implementation Details
  • An LLM converts the user's input into instructions that a ControlNet model can accept
  • Model: seededit-3

Example User Input:

"Wes Anderson film"

LLM Conditioning Output:

"Transform this image into a scene that belongs in the world of Wes Anderson films, with symmetrical compositions, pastel color palettes, and whimsical atmosphere fully adapted to that universe."

Input Image:

[Image: Original scene]

ControlNet Output:

[Image: Wes Anderson styled scene]

Note: ControlNet preserves the scene composition and character positions while completely transforming the visual style to match the user's creative direction.


Key Features

Public Domain Library

Pre-processed classical literature ready for immediate visualization. No copyright issues, instant access.

5-Minute Processing

Import any novel -> Entity extraction -> Scene graph generation -> Ready for playback. Most books process in under 5 minutes.

Ambient Display Mode

iOS homescreen album art updates with story-relevant imagery during playback. Maintains engagement without active watching - designed for the "second screen" generation.

Style Transfer Control

ControlNet implementation allows users to define visual aesthetics while maintaining narrative accuracy. Not just filters - actual artistic direction.


Open Problems

These aren't just bugs - they're fundamental challenges in automated storytelling:

Character Consistency

  • Challenge: Diffusion models lack persistent identity mechanisms
  • Exploring: Face embedding injection, 3D model generation

Entity Coverage

  • Current: Characters and locations only
  • Challenge: Objects and abstract concepts need different handling
  • Impact: Missing crucial story elements (the One Ring, the Elder Wand)

Temporal Reasoning

  • Current: Static property snapshots at chapter boundaries
  • Challenge: Need continuous state tracking for smooth transitions
  • Proposed: Move to proper graph database with temporal queries
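
To make the proposal concrete, here is a rough in-memory sketch of what a temporal query could look like, without any graph database. Everything in it is an assumption for illustration: the `TimedTuple` type, the use of audio time in seconds as the validity axis, and the example timings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TimedTuple:
    entity: str
    prop: str
    value: str
    valid_from: float                 # audio time (seconds) when the fact starts to hold
    valid_to: Optional[float] = None  # None means the fact still holds


def state_at(facts: List[TimedTuple], entity: str, t: float) -> List[TimedTuple]:
    """Return the facts about entity that hold at time t."""
    return [
        f for f in facts
        if f.entity == entity
        and f.valid_from <= t
        and (f.valid_to is None or t < f.valid_to)
    ]


facts = [
    TimedTuple("tom buchanan", "hair_color", "straw-haired", valid_from=0.0),
    # Hypothetical timing: the riding clothes stop applying after 600 seconds.
    TimedTuple("tom buchanan", "wearing", "riding clothes", valid_from=0.0, valid_to=600.0),
]
early = state_at(facts, "tom buchanan", 100.0)
late = state_at(facts, "tom buchanan", 700.0)
```

Attaching validity intervals to each tuple would let state change mid-chapter instead of only at chapter boundaries.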

Scene Generation

  • Current: Film-style linear scenes
  • Challenge: Literature uses flashbacks, parallel narratives, internal monologues
  • Needed: Multi-track timeline with narrative device recognition

Audio Pipeline

  • Current: Requires existing audiobook files (M4B)
  • Challenge: TTS quality vs. professional narration
  • Future: Multi-voice synthesis with emotion modeling

Repository Structure

visibl-audiobooks/
├── README.md              # This file
├── visibl-swift/          # iOS client
│   └── README.md          # iOS specific instructions
└── visibl-server/         # Pipeline server
    └── README.md          # Server specific instructions

Contributing

We need help solving hard problems at the intersection of NLP, computer vision, and narrative understanding.

Priority Areas

  • Model Optimization: Quantization, pruning, mobile deployment
  • Character Consistency: Novel approaches to identity preservation
  • Graph Systems: Temporal knowledge graphs for narrative
  • Scene Understanding: Better narrative structure detection

About the Author

Moe Adham - Engineer and entrepreneur who co-founded two companies now listed on NASDAQ. Specializes in graph data, AI systems, and distributed computing. Open source contributor to Bitcoin, Linux, and Tor.

Learn more: moeadham.com


License

Apache 2.0 - See LICENSE

