A comprehensive, unified crawler system for collecting and analyzing news articles from Heise.de and Chip.de.
- 🎯 Quick Links
- ✨ Key Features
- 🌍 Purpose & Functionality
- 🚀 Installation & Setup
- 🛠 Usage
- 🐳 Docker Deployment
- 🏗 Database Schema
- 📊 Streamlit Features
- 📂 Project Structure
- 🔧 Management with Docker Tools
- ❗ Troubleshooting
- 🗂️ Examples & Screenshots
- 📜 License
- 🙋 About Us
- 📖 Quick Start Guide - Get started in 5 minutes
- ⚙️ Setup Guide - New centralized configuration and auto-refresh features
- 🏗️ Architecture - System architecture and data flow
- 🐳 Docker Setup - Deployment with Docker
The News Mining System is designed to automatically extract and store news articles from multiple sources. The main objectives are:
- 📡 Data Collection - Capture historical news articles from Heise.de and Chip.de
- 🏛 Structured Storage - Articles from both sources in separate PostgreSQL tables
- 🔍 Metadata Extraction - Capture title, author, category, keywords, word count and more
- 🔄 Incremental Crawling - Duplicate detection and storage of only new articles
- 🔔 Notifications - Email notifications for errors during the crawling process
- 🎨 Enhanced Terminal Output - Use of PyFiglet for better readability
- 📤 Data Export - Export as CSV, JSON, XLSX with source filtering
- 🖥 API - Provision of statistics and complete datasets
- 📈 Analytics - Detailed analysis of authors, categories and time trends
- 🔍 Article Search - Search all articles with advanced filter options
- 🎯 Unified Dashboard - One Streamlit application for both sources
- 🤖 Discord Bot - Real-time statistics for both sources in Discord
- 📊 Extensive Visualizations - Over 20 different charts, graphs and representations
- 🕸️ Author Networks - Visualization of connections between authors
- 📈 Trend Analysis - Time-based analysis and predictions
An API endpoint is also provided that exposes the crawled data and statistics.
🔹 Python 3.8+ (recommended: Python 3.11)
🔹 PostgreSQL 13+ (local or remote)
🔹 Git (for cloning the repository)
🔹 pip3 (Python Package Manager)
Optional:
- 🐳 Docker & Docker Compose (for containerized deployment)
- 🎮 Discord Bot Token (for Discord integration)
- 🤖 Google API Key (for AI analysis)
git clone https://github.com/SchBenedikt/datamining.git
cd datamining
Install all required Python libraries:
pip3 install -r requirements.txt
For the Streamlit application (advanced visualizations):
cd visualization
pip3 install -r requirements_streamlit.txt
cd ..
Create a .env file in the root directory with the following variables:
# Database Configuration
DB_NAME=your_database_name
DB_USER=your_database_user
DB_PASSWORD=your_database_password
DB_HOST=localhost
DB_PORT=5432
# Email Notifications (optional)
EMAIL_USER=your_email@example.com
EMAIL_PASSWORD=your_app_password
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
ALERT_EMAIL=recipient@example.com
# Discord Bot (optional)
DISCORD_TOKEN=your_discord_bot_token
CHANNEL_ID=your_discord_channel_id
# Google AI (optional, for advanced analysis)
GOOGLE_API_KEY=your_google_api_key
Notes:
- For Gmail, use an App Password
- Get Discord Token from the Discord Developer Portal
- Create Google API Key in the Google Cloud Console
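The crawlers read these variables at startup. The following is only a minimal sketch of how the database settings could be loaded, assuming `python-dotenv` and `psycopg2` are installed (both are assumptions here, not a statement about the repository's actual code):

```python
# Sketch only: load .env values and open a PostgreSQL connection.
# Assumes python-dotenv and psycopg2-binary are installed.
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    host=os.getenv("DB_HOST", "localhost"),
    port=os.getenv("DB_PORT", "5432"),
)
print("Connected to:", conn.get_dsn_parameters()["dbname"])
```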
Create the PostgreSQL database:
# Open PostgreSQL console
psql -U postgres
# Create database
CREATE DATABASE your_database_name;
# Exit
\q
The required tables will be created automatically when the crawlers start for the first time.
Manual table creation (optional):
-- Heise table
CREATE TABLE IF NOT EXISTS heise (
id SERIAL PRIMARY KEY,
title TEXT,
url TEXT UNIQUE,
date TEXT,
author TEXT,
category TEXT,
keywords TEXT,
word_count INTEGER,
editor_abbr TEXT,
site_name TEXT
);
-- Chip table
CREATE TABLE IF NOT EXISTS chip (
id SERIAL PRIMARY KEY,
url TEXT UNIQUE,
title TEXT,
author TEXT,
date TEXT,
keywords TEXT,
description TEXT,
type TEXT,
page_level1 TEXT,
page_level2 TEXT,
page_level3 TEXT,
page_template TEXT
);
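Because the url column is declared UNIQUE in both tables, incremental crawling can let the database reject duplicates. A hedged illustration of that idea with `psycopg2` (not the project's actual insert code):

```python
# Illustration only: insert an article and skip it silently if the URL already exists.
import psycopg2

conn = psycopg2.connect(dbname="your_database_name", user="your_database_user",
                        password="your_database_password", host="localhost", port=5432)

article = {"title": "Example article", "url": "https://www.heise.de/example",
           "date": "2025-10-02", "author": "Jane Doe", "category": "IT",
           "keywords": "example", "word_count": 123, "editor_abbr": "jd",
           "site_name": "heise online"}

with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO heise (title, url, date, author, category, keywords,
                           word_count, editor_abbr, site_name)
        VALUES (%(title)s, %(url)s, %(date)s, %(author)s, %(category)s, %(keywords)s,
                %(word_count)s, %(editor_abbr)s, %(site_name)s)
        ON CONFLICT (url) DO NOTHING
        """,
        article,
    )
```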
cd heise
python3 main.py
Example Terminal Output:
[INFO] Crawling URL: https://www.heise.de/newsticker/archiv/2025/10
[INFO] Found articles (total): 55
2025-10-02 10:30:15 [INFO] Processing 16 articles for day 2025-10-02
2025-10-02 10:30:15 [INFO] 2025-10-02T20:00:00 - article-name
If fewer than 10 articles per day are found, an email will be sent.
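The project ships its own notification code in heise/notification.py; the snippet below is only an illustrative sketch of how such a threshold alert could look with `smtplib`, reusing the SMTP settings from the .env file:

```python
# Sketch only: send an alert if fewer than 10 articles were found for a day.
import os
import smtplib
from email.message import EmailMessage

def alert_if_too_few(day: str, article_count: int, threshold: int = 10) -> None:
    if article_count >= threshold:
        return
    msg = EmailMessage()
    msg["Subject"] = f"Heise crawler: only {article_count} articles on {day}"
    msg["From"] = os.getenv("EMAIL_USER")
    msg["To"] = os.getenv("ALERT_EMAIL")
    msg.set_content(f"Found {article_count} articles on {day}, expected at least {threshold}.")

    with smtplib.SMTP(os.getenv("SMTP_SERVER", "smtp.gmail.com"),
                      int(os.getenv("SMTP_PORT", "587"))) as smtp:
        smtp.starttls()
        smtp.login(os.getenv("EMAIL_USER"), os.getenv("EMAIL_PASSWORD"))
        smtp.send_message(msg)
```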
cd heise
python3 current_crawler.py
Example Terminal Output:
[INFO] Crawling URL: https://www.heise.de/newsticker/archiv/2025/10
[INFO] Found articles (total): 55
2025-10-02 10:35:00 [INFO] Current crawl cycle completed.
2025-10-02 10:35:00 [INFO] Waiting 300 seconds until next crawl.
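The log output above suggests a simple poll-and-sleep loop. A minimal sketch of that pattern is shown below; the real implementation lives in heise/current_crawler.py, and the crawl_once callable is a hypothetical placeholder:

```python
# Sketch of a poll-and-sleep live crawler loop; crawl_once is supplied by the caller.
import logging
import time

def run_live_crawler(crawl_once, interval_seconds: int = 300) -> None:
    """Run crawl_once() forever, sleeping interval_seconds between cycles."""
    while True:
        try:
            crawl_once()  # fetch and store any new articles
            logging.info("Current crawl cycle completed.")
        except Exception:
            logging.exception("Crawl cycle failed")
        logging.info("Waiting %d seconds until next crawl.", interval_seconds)
        time.sleep(interval_seconds)
```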
cd chip
python3 main.py
cd chip
python3 current_crawler.py
Start the interactive Streamlit dashboard with support for both sources:
cd visualization
streamlit run streamlit_app.py
The dashboard will open at http://localhost:8501.
Start the Discord bot for real-time statistics updates:
cd heise
python3 bot.py
The bot provides:
- Total article count for both sources
- Today's article count for both sources
- Author statistics
- Updates every 10 minutes
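A hedged sketch of how such a periodic status update could be built with discord.py's task loop; the bot's real code is in heise/bot.py, and fetch_article_counts is a hypothetical placeholder:

```python
# Sketch only: post article statistics to a channel every 10 minutes with discord.py.
import os

import discord
from discord.ext import tasks

intents = discord.Intents.default()
client = discord.Client(intents=intents)

def fetch_article_counts() -> dict:
    # Hypothetical placeholder: query the heise and chip tables here.
    return {"heise": 0, "chip": 0}

@tasks.loop(minutes=10)
async def post_stats():
    channel = client.get_channel(int(os.getenv("CHANNEL_ID")))
    counts = fetch_article_counts()
    await channel.send(f"Heise: {counts['heise']} articles | Chip: {counts['chip']} articles")

@client.event
async def on_ready():
    if not post_stats.is_running():
        post_stats.start()

client.run(os.getenv("DISCORD_TOKEN"))
```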
The API server starts automatically when running heise/main.py. Statistics can be retrieved here:
http://127.0.0.1:6600/stats
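The statistics can also be consumed programmatically, for example with `requests`; the exact response fields depend on heise/api.py and are not documented here:

```python
# Sketch only: fetch crawler statistics from the local API endpoint.
import requests

response = requests.get("http://127.0.0.1:6600/stats", timeout=10)
response.raise_for_status()
print(response.json())  # field names depend on the API implementation
```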
Manual API start:
cd heise
python3 api.py
You can export data for each source as CSV, JSON, or XLSX files.
Export Heise articles:
cd heise
python3 export_articles.py
Export Chip articles:
cd chip
python3 export_articles.py
Exported articles are saved in the data/ directory.
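If you prefer to export directly from Python instead of using the provided scripts, here is a sketch with pandas, assuming SQLAlchemy and openpyxl are available (these libraries are assumptions, not requirements stated by the project):

```python
# Sketch only: export the heise table to CSV/JSON/XLSX with pandas + SQLAlchemy.
import os

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    f"postgresql://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}"
    f"@{os.getenv('DB_HOST', 'localhost')}:{os.getenv('DB_PORT', '5432')}/{os.getenv('DB_NAME')}"
)

df = pd.read_sql("SELECT * FROM heise", engine)
df.to_csv("data/heise_articles.csv", index=False)
df.to_json("data/heise_articles.json", orient="records", force_ascii=False)
df.to_excel("data/heise_articles.xlsx", index=False)  # requires openpyxl
```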
docker-compose up -d
# Start Heise Archive Crawler
docker-compose up -d heise-archive-crawler
# Start Chip Live Crawler
docker-compose up -d chip-live-crawler
# Start Streamlit Dashboard
docker-compose up -d streamlit-dashboard
# Start Discord Bot
docker-compose up -d discord-bot
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f heise-live-crawler
# Stop all services
docker-compose down
# Specific service
docker-compose stop heise-archive-crawler
After starting, the Streamlit dashboard is available at:
http://localhost:8501
The database uses two separate tables for better organization:
| Column | Type | Description |
|---|---|---|
| id | SERIAL | Unique ID |
| title | TEXT | Article title |
| url | TEXT | Article URL (unique) |
| date | TEXT | Publication date |
| author | TEXT | Author(s) |
| category | TEXT | Category |
| keywords | TEXT | Keywords |
| word_count | INT | Word count |
| editor_abbr | TEXT | Editor abbreviation |
| site_name | TEXT | Website name |
| Column | Type | Description |
|---|---|---|
| id | SERIAL | Unique ID |
| url | TEXT | Article URL (unique) |
| title | TEXT | Article title |
| author | TEXT | Author(s) |
| date | TEXT | Publication date |
| keywords | TEXT | Keywords |
| description | TEXT | Article description |
| type | TEXT | Article type |
| page_level1 | TEXT | Page level 1 |
| page_level2 | TEXT | Page level 2 |
| page_level3 | TEXT | Page level 3 |
| page_template | TEXT | Page template |
Note: The Streamlit dashboard merges data from both tables for a unified view.
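One way such a merge can be expressed is a UNION ALL over the shared columns, tagging each row with its source. This is only a sketch (not necessarily how streamlit_app.py does it), and the connection string is a placeholder:

```python
# Sketch only: combine both tables into one DataFrame with a source column.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/your_database_name")

query = """
    SELECT 'heise' AS source, title, url, author, date, keywords FROM heise
    UNION ALL
    SELECT 'chip'  AS source, title, url, author, date, keywords FROM chip
"""
articles = pd.read_sql(query, engine)
print(articles.groupby("source").size())
```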
The dashboard offers over 20 different features and visualizations:
- Author Networks (🕸️) - Interactive network graphs showing connections between authors
- Keyword Analysis (🔑) - Frequency distribution of key keywords
- Word Clouds - Visual representation of most common terms
- Time Analysis (📅) - Article publications over time
- Trend Analysis - Predictions and pattern recognition
- AI Analysis (🤖) - Topic Modeling, Sentiment Analysis
- Sentiment Analysis - Article sentiment analysis
- Topic Clustering - Automatic topic grouping
- Content Recommendations - Find similar articles
- Performance Metrics (⚡) - System statistics
- Source Filter - Show Heise, Chip, or both
- Search Function (🔍) - Full-text search in articles
- Date Range Filter - Time-based filtering
- Category Filter - Filter by category
- Author Filter - Filter by author
- Export Function - CSV, Excel, JSON
- SQL Queries (🔧) - Execute custom queries
- Cache Management - Clear data cache
- CSV export with source info
- Excel export (.xlsx)
- JSON export
- SQL export
- Filtered exports possible
📂 datamining/
├── 📂 heise/ # Heise crawlers and related scripts
│ ├── 📄 main.py # Archive crawler (backwards)
│ ├── 📄 current_crawler.py # Live crawler (every 5 minutes)
│ ├── 📄 bot.py # Discord bot
│ ├── 📄 api.py # API functionalities
│ ├── 📄 notification.py # Email notifications
│ ├── 📄 export_articles.py # Export functionality
│ ├── 📄 test_notification.py # Notification test
│ └── 📂 templates/ # HTML templates
│ ├── 📄 news_feed.html
│ └── 📄 query.html
├── 📂 chip/ # Chip crawlers and related scripts
│ ├── 📄 main.py # Archive crawler (forwards)
│ ├── 📄 current_crawler.py # Live crawler (every 10 minutes)
│ ├── 📄 notification.py # Email notifications
│ └── 📄 export_articles.py # Export functionality
├── 📂 visualization/ # Unified Streamlit dashboard
│ ├── 📄 streamlit_app.py # Main Streamlit application
│ └── 📄 requirements_streamlit.txt # Streamlit dependencies
├── 📂 data/ # Export directory
├── 📂 docker/ # Docker configurations (if present)
├── 📄 docker-compose.yml # Docker Compose configuration
├── 📄 Dockerfile # Docker image definition
├── 📄 requirements.txt # Python dependencies
├── 📄 .env # Environment variables (create manually)
├── 📄 .gitignore # Git ignore file
├── 📄 README.md # This file
├── 📄 QUICKSTART.md # Quick start guide
├── 📄 ARCHITECTURE.md # System architecture
├── 📄 DOCKER_SETUP.md # Docker setup guide
├── 📄 SECURITY.md # Security guidelines
└── 📄 LICENSE # License (GNU GPL)
For centralized management of your Docker containers, we recommend the following third-party solutions:
Installation:
docker volume create portainer_data
docker run -d \
-p 9000:9000 \
--name portainer \
--restart always \
-v /var/run/docker.sock:/var/run/docker.sock \
-v portainer_data:/data \
portainer/portainer-ce:latest
Access: http://localhost:9000
Features:
- Web-based GUI for container management
- View logs in real-time
- Start/stop/pause containers
- Resource monitoring
- Stack management (Docker Compose)
- User-friendly
Installation:
docker run -d \
-p 5001:5001 \
--name dockge \
--restart unless-stopped \
-v /var/run/docker.sock:/var/run/docker.sock \
-v dockge_data:/app/data \
louislam/dockge:1
Access: http://localhost:5001
Features:
- Modern alternative to Portainer
- Docker Compose focused
- Simple user interface
- Live logs
Installation:
docker volume create yacht
docker run -d \
-p 8000:8000 \
--name yacht \
--restart unless-stopped \
-v /var/run/docker.sock:/var/run/docker.sock \
-v yacht:/config \
selfhostedpro/yacht
Access: http://localhost:8000
Features:
- Self-hosted Docker management
- Template-based
- Clean UI
Solution:
- Check the .env file for correct database credentials
- Make sure PostgreSQL is running:
# macOS
brew services list
# Linux
sudo systemctl status postgresql
- Test the connection:
psql -U $DB_USER -d $DB_NAME -h $DB_HOST
Solution:
- Check if tables contain data:
SELECT COUNT(*) FROM heise;
SELECT COUNT(*) FROM chip;
- Clear Streamlit cache with the "🔄 Clear Cache" button
- Restart the Streamlit app
Solution:
- For Gmail: Use an App Password
- Test the notification function:
cd heise
python3 test_notification.py
- Check SMTP settings in .env
Solution:
- Check DISCORD_TOKEN and CHANNEL_ID in .env
- Make sure the bot has the right permissions
- Check bot logs for errors
Solution:
- Check Docker logs:
docker-compose logs
- Make sure all ports are available
- Check the .env file
Solution: Run a crawler to create the table:
cd heise
python3 main.py
(with Tableau and DeepNote, as of March 2025)
We have also generated some graphs with Deepnote (❗ based on a random sample of 10,000 rows only ❗)
Also check out the data/Datamining_Heise web crawler-3.twb file with an excerpt of analyses.
This program is licensed under the GNU GENERAL PUBLIC LICENSE.
See LICENSE for more details.
This project was programmed by the two of us within a few days and is under continuous development:
Don't hesitate to contact us if you have questions, feedback, or just want to say hello!
📧 Email: server@schächner.de
🌐 Website:
The idea for our Heise News Crawler comes from David Kriesel and his presentation "Spiegel Mining" at 33c3.
Happy Crawling! 🎉