YouTube Transcript Processor

Python 3.11+ · License: MIT · LangGraph · Claude AI · Notion API

Automated system that extracts YouTube video transcripts, processes them into "verbatim core" summaries using Claude AI, and saves them to Notion with proper formatting.

Features

  • ✅ Extracts transcripts from YouTube videos (no API key needed)
  • ✅ Fetches real video titles using yt-dlp
  • ✅ AI-powered section segmentation using Claude
  • ✅ "Verbatim core" summarization style (preserves key quotes, removes filler)
  • ✅ Quality validation before saving
  • ✅ Notion integration with toggle blocks for easy navigation
  • ✅ LangGraph orchestration with checkpointing for recovery
  • ✅ Idempotent node design for efficient resume
  • ✅ State management to avoid duplicate processing
  • ✅ Handles videos of any length (tested with videos over 96 minutes)

Project Structure

youtube_transcript_summarizer/
├── src/
│   ├── main.py                 # CLI entry point
│   ├── config.py               # Configuration management
│   ├── orchestrator.py         # LangGraph workflow
│   ├── nodes/                  # Workflow nodes
│   │   ├── extract.py          # Node 1: Transcript extraction
│   │   ├── segment.py          # Node 2: Section segmentation
│   │   ├── summarize.py        # Node 3: Verbatim core summarization
│   │   ├── qa.py               # Node 4: Quality validation
│   │   ├── format_notion.py    # Node 5: Notion formatting
│   │   └── save_notion.py      # Node 6: Notion API integration
│   └── utils/
│       └── state_manager.py    # State persistence
├── skills/
│   └── verbatim_core_skill.md  # Summarization instructions for Claude
├── .env                        # Your API keys (create from .env.example)
├── .env.example                # API key template
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Setup Instructions

1. Install Dependencies

# Ensure you're in the project directory
cd youtube_transcript_summarizer

# Activate virtual environment (if using .venv)
source .venv/bin/activate  # On macOS/Linux
# OR
.venv\Scripts\activate  # On Windows

# Install dependencies
pip install -r requirements.txt

2. Configure API Keys

Create a .env file in the project root:

cp .env.example .env

Edit .env and add your API keys:

# Get from: https://console.anthropic.com
ANTHROPIC_API_KEY=sk-ant-your-key-here

# Get from: https://www.notion.so/my-integrations
NOTION_API_KEY=secret_your-key-here
NOTION_PARENT_PAGE_ID=your-page-id-here

# Model (optional - defaults to claude-sonnet-4-5)
ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
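A minimal sketch of what a configuration check like src/config.py might do with the variables above. The helper here is illustrative, not the project's actual code; it assumes only the three required keys named in this README:

```python
import os

# The three keys this README requires; ANTHROPIC_MODEL is optional.
REQUIRED = ["ANTHROPIC_API_KEY", "NOTION_API_KEY", "NOTION_PARENT_PAGE_ID"]

def validate_config(env=os.environ):
    """Return the list of missing required keys; an empty list means valid."""
    return [key for key in REQUIRED if not env.get(key)]

# With only the Anthropic key set, both Notion keys are reported missing.
print(validate_config({"ANTHROPIC_API_KEY": "sk-ant-..."}))
# → ['NOTION_API_KEY', 'NOTION_PARENT_PAGE_ID']
```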

3. Set Up Notion

  1. Go to https://www.notion.so/my-integrations
  2. Create a new integration
  3. Copy the "Internal Integration Token" → This is your NOTION_API_KEY
  4. Create or choose a Notion page where you want summaries saved
  5. Share that page with your integration (click "Share" → "Invite" → select your integration)
  6. Copy the page ID from the URL → This is your NOTION_PARENT_PAGE_ID
    • Example URL: https://www.notion.so/workspace/Page-Title-abc123def456
    • Page ID: abc123def456

4. Verify Configuration

python -m src.config

Should output: ✅ Configuration valid!

Usage

Process a Single Video

python -m src.main process "https://www.youtube.com/watch?v=VIDEO_ID"

Example:

python -m src.main process "https://www.youtube.com/watch?v=DpD8QB-6Pc8"

Resume a Failed Workflow

If processing fails midway, you can resume from the last successful checkpoint. The workflow uses idempotent nodes that skip already-completed steps:

python -m src.main resume "https://www.youtube.com/watch?v=VIDEO_ID"

Example output:

⏭️  Transcript already extracted, skipping...
⏭️  Sections already segmented, skipping...
⏭️  Summaries already generated, skipping...
💾 Saving to Notion...  ← Only the failed node runs!

This saves both time and API costs by avoiding redundant work.

Check Status

View processing history:

python -m src.main status

Get Help

python -m src.main help

Workflow Steps

The system processes videos through 6 stages:

  1. Extract Transcript - Downloads transcript from YouTube + fetches video title with yt-dlp
  2. Segment Sections - AI identifies natural section boundaries (titles + timestamps only)
  3. Summarize Sections - Applies "verbatim core" style to each section
  4. QA Validation - Checks formatting and quality
  5. Format for Notion - Converts markdown to Notion blocks with toggles
  6. Save to Notion - Creates page under specified parent

Key Design Pattern: Each node checks if its output already exists in state and skips processing if found. This enables efficient resume behavior and reduces API costs.
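The skip-if-done pattern can be sketched without LangGraph specifics. State is a plain dict here, and the node and key names are illustrative, not the project's actual code:

```python
def extract_transcript(state: dict) -> dict:
    """Node 1: fetch the transcript, unless a previous run already did."""
    if state.get("transcript") is not None:
        print("⏭️  Transcript already extracted, skipping...")
        return state  # idempotent: completed work is never redone
    state["transcript"] = "...fetched transcript text..."  # real fetch elided
    return state

# First run does the work; a resume with the same state skips it.
state = {}
state = extract_transcript(state)
state = extract_transcript(state)  # prints the skip message
```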

Verbatim Core Style

The summarization follows these principles:

  • ✅ Preserve exact quotes for key concepts and technical terms
  • ✅ Remove filler words (um, uh, you know, like)
  • ✅ Organize into toggle-ready sections
  • ✅ Use nested bullets for clarity
  • ✅ Capture technical jargon, numbers, and memorable phrases

See skills/verbatim_core_skill.md for complete formatting rules.
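Filler-word removal of the kind listed above can be sketched with a word-boundary regex. The word list and function are illustrative only, not the actual skill-file rules (note a naive pattern like this would also strip legitimate uses of "like"):

```python
import re

# Filler word plus any surrounding commas, so punctuation is cleaned up too.
FILLERS = r",?\s*\b(um|uh|you know|like)\b,?"

def strip_fillers(text: str) -> str:
    """Drop common filler words, then collapse leftover runs of spaces."""
    cleaned = re.sub(FILLERS, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_fillers("So, um, the model is, you know, stateless."))
# → So the model is stateless.
```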

Cost Estimates

  • Transcript extraction: Free (uses youtube-transcript-api)
  • Claude AI processing: ~$1.50 per video (depends on length)
  • Notion API: Free
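The ~$1.50 figure is roughly consistent with per-million-token pricing. A back-of-envelope estimator — the rates and token counts below are assumptions for illustration; check current Anthropic pricing before relying on them:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """USD cost given token counts and $/million-token rates (rates assumed)."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# e.g. a long video: ~300k input tokens across calls, ~40k output tokens
print(f"${estimate_cost(300_000, 40_000):.2f}")  # → $1.50
```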

Troubleshooting

"Module not found" errors

# Ensure virtual environment is activated
source .venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt

"Configuration Error"

# Check your .env file exists and has all required keys
cat .env

# Verify configuration
python -m src.config

"Notion API unauthorized"

  1. Verify NOTION_API_KEY is correct
  2. Ensure the parent page is shared with your integration
  3. Verify NOTION_PARENT_PAGE_ID is correct

"No transcript available"

The video may not have captions/subtitles. Try a different video.

Video already processed

Remove the video from processed_videos.json to reprocess it.
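A hedged sketch of removing an entry from processed_videos.json. The file's exact schema is an assumption here (a JSON object keyed by video ID), as is the helper name:

```python
import json
from pathlib import Path

def forget_video(video_id: str, path: str = "processed_videos.json") -> bool:
    """Remove video_id from the processed-state file; True if it was present."""
    p = Path(path)
    state = json.loads(p.read_text()) if p.exists() else {}
    removed = state.pop(video_id, None) is not None
    p.write_text(json.dumps(state, indent=2))
    return removed
```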

Learning Objectives

This project teaches:

  • ✅ LangGraph state machines and workflow orchestration
  • ✅ Error recovery patterns with checkpointing
  • ✅ Multi-step validation and quality gates
  • ✅ External API integration (YouTube, Claude, Notion)
  • ✅ State persistence and idempotency
  • ✅ Prompt engineering with skill files
  • ✅ Structured outputs from LLMs

Recent Improvements

  • Idempotent nodes - Efficient resume that skips completed work
  • Real video titles - Uses yt-dlp to fetch actual YouTube titles
  • Long video support - Fixed segmentation to handle 90+ minute videos
  • Improved error handling - Better debugging with detailed error messages

Future Enhancements

  • Parallel section processing for speed
  • Channel monitoring (Phase 2)
  • GitHub Actions automation
  • Web UI with Streamlit
  • Multi-language support
  • Migration to CrewAI (planned)

License

MIT

Questions?

Check the implementation plan in plans/youtube_transcript_processor_implementation_plan.md for detailed architecture and learning notes.
