An automated system that extracts YouTube video transcripts, processes them into "verbatim core" summaries using Claude, and saves them to Notion with proper formatting.
- ✅ Extracts transcripts from YouTube videos (no API key needed)
- ✅ Fetches real video titles using yt-dlp
- ✅ AI-powered section segmentation using Claude
- ✅ "Verbatim core" summarization style (preserves key quotes, removes filler)
- ✅ Quality validation before saving
- ✅ Notion integration with toggle blocks for easy navigation
- ✅ LangGraph orchestration with checkpointing for recovery
- ✅ Idempotent node design for efficient resume
- ✅ State management to avoid duplicate processing
- ✅ Handles videos of any length (tested on videos of 96+ minutes)
```
youtube_transcript_summarizer/
├── src/
│   ├── main.py               # CLI entry point
│   ├── config.py             # Configuration management
│   ├── orchestrator.py       # LangGraph workflow
│   ├── nodes/                # Workflow nodes
│   │   ├── extract.py        # Node 1: Transcript extraction
│   │   ├── segment.py        # Node 2: Section segmentation
│   │   ├── summarize.py      # Node 3: Verbatim core summarization
│   │   ├── qa.py             # Node 4: Quality validation
│   │   ├── format_notion.py  # Node 5: Notion formatting
│   │   └── save_notion.py    # Node 6: Notion API integration
│   └── utils/
│       └── state_manager.py  # State persistence
├── skills/
│   └── verbatim_core_skill.md  # Summarization instructions for Claude
├── .env                      # Your API keys (create from .env.example)
├── .env.example              # API key template
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
```bash
# Ensure you're in the project directory
cd youtube_transcript_summarizer

# Activate the virtual environment (if using .venv)
source .venv/bin/activate   # macOS/Linux
# OR
.venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```bash
cp .env.example .env
```

Edit `.env` and add your API keys:
```
# Get from: https://console.anthropic.com
ANTHROPIC_API_KEY=sk-ant-your-key-here

# Get from: https://www.notion.so/my-integrations
NOTION_API_KEY=secret_your-key-here
NOTION_PARENT_PAGE_ID=your-page-id-here

# Model (optional - defaults to claude-sonnet-4-5)
ANTHROPIC_MODEL=claude-sonnet-4-5-20250929
```

To get your Notion credentials:

- Go to https://www.notion.so/my-integrations
- Create a new integration
- Copy the "Internal Integration Token" → this is your `NOTION_API_KEY`
- Create or choose a Notion page where you want summaries saved
- Share that page with your integration (click "Share" → "Invite" → select your integration)
- Copy the page ID from the URL → this is your `NOTION_PARENT_PAGE_ID`
  - Example URL: `https://www.notion.so/workspace/Page-Title-abc123def456`
  - Page ID: `abc123def456`
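To confirm the integration can reach your page, a quick sanity check with the official `notion-client` SDK looks roughly like this (a minimal sketch; the page title and toggle text are placeholders, not the project's actual blocks):

```python
import os
from notion_client import Client  # pip install notion-client

notion = Client(auth=os.environ["NOTION_API_KEY"])

# Create a child page under the shared parent page, with one toggle block.
page = notion.pages.create(
    parent={"page_id": os.environ["NOTION_PARENT_PAGE_ID"]},
    properties={"title": {"title": [{"text": {"content": "Connection test"}}]}},
    children=[{
        "object": "block",
        "type": "toggle",
        "toggle": {
            "rich_text": [{"type": "text", "text": {"content": "Hello from the summarizer"}}],
        },
    }],
)
print("Created:", page["url"])
```

If this fails with an authorization error, the parent page is almost certainly not shared with the integration yet.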
Verify your configuration:

```bash
python -m src.config
```

Expected output: `✅ Configuration valid!`
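Under the hood, the validation amounts to loading `.env` and checking the required keys, along the lines of this sketch (the key names follow the `.env` template above; the exact code in `src/config.py` may differ):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "NOTION_API_KEY", "NOTION_PARENT_PAGE_ID"]

def validate_config() -> None:
    load_dotenv()  # read .env from the project root
    missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
    if missing:
        raise SystemExit(f"❌ Missing keys in .env: {', '.join(missing)}")
    print("✅ Configuration valid!")

if __name__ == "__main__":
    validate_config()
```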
Process a video:

```bash
python -m src.main process "https://www.youtube.com/watch?v=VIDEO_ID"
```

Example:

```bash
python -m src.main process "https://www.youtube.com/watch?v=DpD8QB-6Pc8"
```

If processing fails midway, you can resume from the last successful checkpoint. The workflow uses idempotent nodes that skip already-completed steps:
```bash
python -m src.main resume "https://www.youtube.com/watch?v=VIDEO_ID"
```

Example output:

```
⏭️ Transcript already extracted, skipping...
⏭️ Sections already segmented, skipping...
⏭️ Summaries already generated, skipping...
💾 Saving to Notion...   ← Only the failed node runs!
```
This saves both time and API costs by avoiding redundant work.
View processing history:

```bash
python -m src.main status
```

List all available commands:

```bash
python -m src.main help
```

The system processes videos through six stages:
1. Extract Transcript - Downloads the transcript from YouTube and fetches the video title with yt-dlp (see the sketch after this list)
2. Segment Sections - AI identifies natural section boundaries (titles + timestamps only)
3. Summarize Sections - Applies the "verbatim core" style to each section
4. QA Validation - Checks formatting and quality
5. Format for Notion - Converts markdown to Notion blocks with toggles
6. Save to Notion - Creates the page under the specified parent
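Stage 1 leans on two libraries that need no YouTube API key. A minimal sketch of the transcript-plus-title fetch (the helper name is illustrative, and the classic `get_transcript` interface from pre-1.0 versions of youtube-transcript-api is assumed):

```python
import yt_dlp  # pip install yt-dlp
from youtube_transcript_api import YouTubeTranscriptApi  # pip install youtube-transcript-api

def fetch_transcript_and_title(video_id: str) -> tuple[str, str]:
    """Illustrative helper: return (title, full transcript text) for a video."""
    # youtube-transcript-api returns a list of {"text", "start", "duration"} snippets.
    snippets = YouTubeTranscriptApi.get_transcript(video_id)
    transcript = " ".join(s["text"] for s in snippets)

    # yt-dlp can read metadata without downloading the video itself.
    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        info = ydl.extract_info(f"https://www.youtube.com/watch?v={video_id}", download=False)

    return info["title"], transcript
```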
Key Design Pattern: Each node checks if its output already exists in state and skips processing if found. This enables efficient resume behavior and reduces API costs.
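In LangGraph terms, the pattern looks roughly like this sketch (the state keys and node bodies are simplified assumptions, not the exact code in `src/orchestrator.py`):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class VideoState(TypedDict, total=False):
    url: str
    transcript: str

def download_transcript(url: str) -> str:
    """Stub standing in for the real youtube-transcript-api call."""
    return "(transcript text)"

def extract_node(state: VideoState) -> dict:
    # Idempotency check: if the output already exists in state, skip the work.
    if state.get("transcript"):
        print("⏭️ Transcript already extracted, skipping...")
        return {}
    return {"transcript": download_transcript(state["url"])}

builder = StateGraph(VideoState)
builder.add_node("extract", extract_node)
# ...the segment / summarize / qa / format / save nodes are added the same way...
builder.add_edge(START, "extract")
builder.add_edge("extract", END)

# The checkpointer persists state after every node, which is what makes resume possible.
graph = builder.compile(checkpointer=MemorySaver())
graph.invoke(
    {"url": "https://www.youtube.com/watch?v=VIDEO_ID"},
    config={"configurable": {"thread_id": "VIDEO_ID"}},
)
```

Because a resumed run replays the graph with the checkpointed state already populated, each node's existence check turns the replay into a series of cheap skips until the first unfinished stage.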
The summarization follows these principles:
- ✅ Preserve exact quotes for key concepts and technical terms
- ✅ Remove filler words (um, uh, you know, like)
- ✅ Organize into toggle-ready sections
- ✅ Use nested bullets for clarity
- ✅ Capture technical jargon, numbers, and memorable phrases
See `skills/verbatim_core_skill.md` for the complete formatting rules.
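The skill file travels to Claude as part of the prompt. A minimal sketch of a section-summarization call with the Anthropic SDK (the prompt layout is an assumption; the real summarize node may structure it differently):

```python
import os
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The verbatim-core rules act as the system prompt.
skill = open("skills/verbatim_core_skill.md", encoding="utf-8").read()

def summarize_section(section_text: str) -> str:
    response = client.messages.create(
        model=os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-5"),
        max_tokens=2000,
        system=skill,
        messages=[{
            "role": "user",
            "content": f"Summarize this transcript section:\n\n{section_text}",
        }],
    )
    return response.content[0].text
```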
Approximate costs:

- Transcript extraction: Free (uses youtube-transcript-api)
- Claude AI processing: ~$1.50 per video (depends on length)
- Notion API: Free
If dependencies fail to import:

```bash
# Ensure the virtual environment is activated
source .venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt
```

If configuration validation fails:

```bash
# Check that your .env file exists and has all required keys
cat .env

# Verify configuration
python -m src.config
```

If saving to Notion fails:

- Verify `NOTION_API_KEY` is correct
- Ensure the parent page is shared with your integration
- Verify `NOTION_PARENT_PAGE_ID` is correct
If no transcript is found: the video may not have captions/subtitles. Try a different video.
If a video is skipped as already processed: remove it from processed_videos.json to reprocess it.
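If you'd rather script the removal, something like this works, assuming processed_videos.json maps video IDs to their run records (the file's actual layout may differ):

```python
import json

VIDEO_ID = "DpD8QB-6Pc8"  # the video you want to reprocess

with open("processed_videos.json") as f:
    processed = json.load(f)

processed.pop(VIDEO_ID, None)  # drop the entry if present

with open("processed_videos.json", "w") as f:
    json.dump(processed, f, indent=2)
```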
This project teaches:
- ✅ LangGraph state machines and workflow orchestration
- ✅ Error recovery patterns with checkpointing
- ✅ Multi-step validation and quality gates
- ✅ External API integration (YouTube, Claude, Notion)
- ✅ State persistence and idempotency
- ✅ Prompt engineering with skill files
- ✅ Structured outputs from LLMs
Recent improvements:
- ✅ Idempotent nodes - Efficient resume that skips completed work
- ✅ Real video titles - Uses yt-dlp to fetch actual YouTube titles
- ✅ Long video support - Fixed segmentation to handle 90+ minute videos
- ✅ Improved error handling - Better debugging with detailed error messages
Future enhancements:
- Parallel section processing for speed
- Channel monitoring (Phase 2)
- GitHub Actions automation
- Web UI with Streamlit
- Multi-language support
- Migration to CrewAI (planned)
License: MIT
Check the implementation plan in `plans/youtube_transcript_processor_implementation_plan.md` for detailed architecture and learning notes.