A Python script collection that extracts data from YouTube using Playwright. The scripts support processing both videos and channels, saving data in CSV and JSON formats.
- Extracts video metadata:
- Title
- Description
- Channel name
- View count
- Like count
- Upload date
- Duration
- Keywords
- And more...
- Captures top 5 comments for each video
- Supports processing multiple videos
- Tracks processed videos to avoid duplicates
- CSV output compatible with Excel
- Extracts channel information:
- Channel description
- Social media links (with resolved redirect URLs)
- Location
- Join date
- Subscriber count
- Video count
- View count
- Saves data in JSON format
- Python 3.7+
- Playwright
- A stable internet connection
- Clone the repository:
git clone https://github.com/xkchok/yt-data-scraper.git
cd yt-data-scraper- Create and activate a virtual environment:
pip install uv
uv sync- Install dependencies:
uv add playwright
uvx playwright install chromium- Add your video IDs to the
video_idslist invideo.py:
video_ids = [
'c2tuxS3Pcto', # Example video ID
'kxs9Su_mbpU', # Another video ID
'mvkbCZfwWzA' # Another video ID
]- Run the script:
uv run video.pyThe video scraper will:
- Create a timestamped CSV file for the data
- Process each video and save its data immediately
- Track processed videos in
processed_videos.txt - Skip any previously processed videos
- Edit the channel URLs in
channel.py:
channels = [
"https://www.youtube.com/@The_FirstTake",
# Add more channels here
]- Run the script:
uv run channel.pyThe channel scraper will:
- Visit each channel's "About" page
- Extract available information
- Save the data to a JSON file
video_data_[timestamp].csv: Contains the extracted video data in CSV formatprocessed_videos.txt: Tracks which videos have been processed and when
channel_data_[timestamp].json: Contains the extracted channel data with this structure:
{
"scrape_time": "2024-03-21T12:34:56.789012",
"channels": [
{
"channel_name": "ChannelName",
"url": "https://youtube.com/@ChannelName",
"description": "Channel description...",
"social_links": [
{
"title": "Link title",
"text": "Link text",
"url": "Actual URL (not YouTube redirect)"
}
],
"subscribers": "1.2M subscribers",
"views": "100M views",
"country": "Country name",
"joined": "Joined date"
}
]
}- Videos are processed one at a time to avoid rate limiting
- Failed videos won't be marked as processed
- All text fields are cleaned and formatted for CSV compatibility
- Some channels may have different amounts of information available
Both scripts include error handling for:
- Network issues
- Missing elements
- Loading failures
- File writing errors
Failed items will be reported but won't stop the scripts from processing other items.
Feel free to submit issues and enhancement requests!