
🗞️ Unified News Mining System

A comprehensive, unified crawler system for collecting and analyzing news articles from Heise.de and Chip.de.



🌍 Purpose & Functionality

The News Mining System is designed to automatically extract and store news articles from multiple sources. The main objectives are:

  • 📡 Data Collection - Captures historical news articles from Heise.de and Chip.de
  • 🏛 Structured Storage - Stores articles from each source in a separate PostgreSQL table
  • 🔍 Metadata Extraction - Captures title, author, category, keywords, word count, and more
  • 🔄 Incremental Crawling - Detects duplicates and stores only new articles
  • 🔔 Notifications - Sends email notifications for errors during the crawling process
  • 🎨 Enhanced Terminal Output - Uses PyFiglet for better readability
  • 📤 Data Export - Exports as CSV, JSON, or XLSX with source filtering
  • 🖥 API - Serves statistics and complete datasets
  • 📈 Analytics - Detailed analysis of authors, categories, and time trends
  • 🔍 Article Search - Searches all articles with advanced filter options
  • 🎯 Unified Dashboard - One Streamlit application for both sources
  • 🤖 Discord Bot - Real-time statistics for both sources in Discord
  • 📊 Extensive Visualizations - Over 20 different charts, graphs, and representations
  • 🕸️ Author Networks - Visualizes connections between authors
  • 📈 Trend Analysis - Time-based analysis and predictions

An API endpoint is also provided that serves the crawled data and statistics.


🚀 Installation & Setup

1️⃣ Prerequisites

🔹 Python 3.8+ (recommended: Python 3.11)

🔹 PostgreSQL 13+ (local or remote)

🔹 Git (for cloning the repository)

🔹 pip3 (Python Package Manager)

Optional:

  • 🐳 Docker & Docker Compose (for containerized deployment)
  • 🎮 Discord Bot Token (for Discord integration)
  • 🤖 Google API Key (for AI analysis)

2️⃣ Clone Repository

git clone https://github.com/SchBenedikt/datamining.git
cd datamining

3️⃣ Install Dependencies

Install all required Python libraries:

pip3 install -r requirements.txt

For the Streamlit application (advanced visualizations):

cd visualization
pip3 install -r requirements_streamlit.txt
cd ..

4️⃣ Configure Environment Variables

Create a .env file in the root directory with the following variables:

# Database Configuration
DB_NAME=your_database_name
DB_USER=your_database_user
DB_PASSWORD=your_database_password
DB_HOST=localhost
DB_PORT=5432

# Email Notifications (optional)
EMAIL_USER=your_email@example.com
EMAIL_PASSWORD=your_app_password
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
ALERT_EMAIL=recipient@example.com

# Discord Bot (optional)
DISCORD_TOKEN=your_discord_bot_token
CHANNEL_ID=your_discord_channel_id

# Google AI (optional, for advanced analysis)
GOOGLE_API_KEY=your_google_api_key

Note: Only the database variables are required; the email, Discord, and Google AI settings are optional.
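
These variables are read by the crawlers at startup. For reference, a minimal sketch of loading them and opening a database connection, assuming python-dotenv and psycopg2 (common choices for a Python/PostgreSQL project; the actual crawler code may differ):

# Minimal sketch: load .env and connect to PostgreSQL.
# Assumes python-dotenv and psycopg2 are installed; not the actual crawler code.
import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    host=os.getenv("DB_HOST", "localhost"),
    port=os.getenv("DB_PORT", "5432"),
)
print("Database connection established.")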

5️⃣ Database Setup

Create the PostgreSQL database:

# Open PostgreSQL console
psql -U postgres

# Create database
CREATE DATABASE your_database_name;

# Exit
\q

The required tables will be created automatically when the crawlers start for the first time.

Manual table creation (optional):

-- Heise table
CREATE TABLE IF NOT EXISTS heise (
    id SERIAL PRIMARY KEY,
    title TEXT,
    url TEXT UNIQUE,
    date TEXT,
    author TEXT,
    category TEXT,
    keywords TEXT,
    word_count INTEGER,
    editor_abbr TEXT,
    site_name TEXT
);

-- Chip table
CREATE TABLE IF NOT EXISTS chip (
    id SERIAL PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    author TEXT,
    date TEXT,
    keywords TEXT,
    description TEXT,
    type TEXT,
    page_level1 TEXT,
    page_level2 TEXT,
    page_level3 TEXT,
    page_template TEXT
);
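
The UNIQUE constraint on the url column is what enables incremental crawling: an article that is already stored can simply be skipped at insert time. A minimal sketch of this duplicate detection (the actual insert logic in the crawlers may differ):

# Sketch: store an article only if its URL is not already present.
# conn is an open psycopg2 connection (see the configuration sketch above).
with conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO heise (title, url, date, author)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (url) DO NOTHING
        """,
        ("Example title", "https://www.heise.de/example", "2025-10-02", "Jane Doe"),
    )
conn.commit()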

🛠 Usage

Start Crawlers

Heise Archive Crawler (crawls backwards from newest to oldest)

cd heise
python3 main.py

Example Terminal Output:

[INFO] Crawling URL: https://www.heise.de/newsticker/archiv/2025/10
[INFO] Found articles (total): 55
2025-10-02 10:30:15 [INFO] Processing 16 articles for day 2025-10-02
2025-10-02 10:30:15 [INFO] 2025-10-02T20:00:00 - article-name

If fewer than 10 articles are found for a day, an email notification is sent.
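
Conceptually, the check looks like the following sketch (send_email_notification is a hypothetical stand-in for the helper in heise/notification.py):

# Sketch: alert when a day yields suspiciously few articles.
def send_email_notification(subject: str, body: str) -> None:
    print(f"[ALERT] {subject}: {body}")  # hypothetical stand-in for heise/notification.py

MIN_ARTICLES_PER_DAY = 10
articles_for_day = ["article-1", "article-2"]  # dummy data for the sketch

if len(articles_for_day) < MIN_ARTICLES_PER_DAY:
    send_email_notification(
        subject=f"Only {len(articles_for_day)} articles found",
        body="The Heise archive crawler found fewer articles than expected.",
    )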

Heise Live Crawler (checks every 5 minutes for new articles)

cd heise
python3 current_crawler.py

Example Terminal Output:

[INFO] Crawling URL: https://www.heise.de/newsticker/archiv/2025/10
[INFO] Found articles (total): 55
2025-10-02 10:35:00 [INFO] Current crawl cycle completed.
2025-10-02 10:35:00 [INFO] Waiting 300 seconds until next crawl.
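
Under the hood this is a simple polling loop; a minimal sketch (crawl_new_articles is a hypothetical stand-in for the real crawl step):

# Sketch: poll for new articles every 5 minutes, as current_crawler.py does.
import time

def crawl_new_articles() -> None:
    print("Checking the archive page for new articles...")  # hypothetical stand-in

while True:
    crawl_new_articles()
    print("Waiting 300 seconds until next crawl.")
    time.sleep(300)  # 5 minutes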

Chip Archive Crawler (crawls from page 1 upwards)

cd chip
python3 main.py

Chip Live Crawler (checks every 10 minutes for new articles)

cd chip
python3 current_crawler.py

Streamlit Dashboard

Start the interactive Streamlit dashboard with support for both sources:

cd visualization
streamlit run streamlit_app.py

The dashboard will open at http://localhost:8501.
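
If you want to extend the dashboard, a page boils down to very little code. A minimal sketch that shows the article count per source, assuming the .env configuration from the setup section (this is not the actual streamlit_app.py):

# Minimal Streamlit sketch: article counts per source.
import os
import pandas as pd
import psycopg2
import streamlit as st
from dotenv import load_dotenv

load_dotenv()
conn = psycopg2.connect(
    dbname=os.getenv("DB_NAME"), user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"), host=os.getenv("DB_HOST", "localhost"),
)

st.title("Unified News Mining Dashboard")
for table in ("heise", "chip"):
    count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", conn)["n"][0]
    st.metric(label=f"{table} articles", value=int(count))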


Discord Bot

Start the Discord bot for real-time statistics updates:

cd heise
python3 bot.py

The bot provides:

  • Total article count for both sources
  • Today's article count for both sources
  • Author statistics
  • Updates every 10 minutes
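
A hedged sketch of how such an update loop can be built with discord.py's tasks extension (illustrative only; the real logic lives in heise/bot.py):

# Sketch: post statistics to a channel every 10 minutes.
import os
import discord
from discord.ext import tasks
from dotenv import load_dotenv

load_dotenv()
client = discord.Client(intents=discord.Intents.default())

@tasks.loop(minutes=10)
async def post_stats():
    channel = client.get_channel(int(os.getenv("CHANNEL_ID")))
    if channel:
        await channel.send("Total articles: ...")  # the real bot fills in DB statistics

@client.event
async def on_ready():
    post_stats.start()

client.run(os.getenv("DISCORD_TOKEN"))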

API Endpoints

The API server starts automatically when running heise/main.py. Statistics can be retrieved here:

http://127.0.0.1:6600/stats
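
For example, from Python (assuming the server is running locally and returns JSON; the exact response fields are not documented here):

import requests

# Fetch crawl statistics from the locally running API server.
stats = requests.get("http://127.0.0.1:6600/stats", timeout=10).json()
print(stats)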

Manual API start:

cd heise
python3 api.py

Export Data

You can export data for each source as CSV, JSON, or XLSX files.

Export Heise articles:

cd heise
python3 export_articles.py

Export Chip articles:

cd chip
python3 export_articles.py

Exported articles are saved in the data/ directory.
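
As an illustration of what such an export involves, a sketch with pandas (not the actual export_articles.py; .xlsx output additionally requires openpyxl):

# Sketch: dump a table to CSV, JSON, and XLSX with pandas.
import pandas as pd

df = pd.read_sql("SELECT * FROM heise", conn)  # conn as in the configuration sketch
df.to_csv("data/heise_articles.csv", index=False)
df.to_json("data/heise_articles.json", orient="records", force_ascii=False)
df.to_excel("data/heise_articles.xlsx", index=False)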


🐳 Docker Deployment

Start all services with one command

docker-compose up -d

Manage individual services

# Start Heise Archive Crawler
docker-compose up -d heise-archive-crawler

# Start Chip Live Crawler
docker-compose up -d chip-live-crawler

# Start Streamlit Dashboard
docker-compose up -d streamlit-dashboard

# Start Discord Bot
docker-compose up -d discord-bot

View logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f heise-live-crawler

Stop services

# Stop all services
docker-compose down

# Specific service
docker-compose stop heise-archive-crawler

Access Dashboard

After starting, the Streamlit dashboard is available at:

http://localhost:8501

🏗 Database Schema

The database uses two separate tables for better organization:

Heise Table

Column       Type     Description
id           SERIAL   Unique ID
title        TEXT     Article title
url          TEXT     Article URL (unique)
date         TEXT     Publication date
author       TEXT     Author(s)
category     TEXT     Category
keywords     TEXT     Keywords
word_count   INTEGER  Word count
editor_abbr  TEXT     Editor abbreviation
site_name    TEXT     Website name

Chip Table

Column         Type    Description
id             SERIAL  Unique ID
url            TEXT    Article URL (unique)
title          TEXT    Article title
author         TEXT    Author(s)
date           TEXT    Publication date
keywords       TEXT    Keywords
description    TEXT    Article description
type           TEXT    Article type
page_level1    TEXT    Page level 1
page_level2    TEXT    Page level 2
page_level3    TEXT    Page level 3
page_template  TEXT    Page template

Note: The Streamlit dashboard merges data from both tables for a unified view.
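
One way to build such a unified view is a UNION over the columns both tables share, tagged with the source; a sketch of the idea (the dashboard's actual query may differ):

# Sketch: merge both tables over their shared columns.
import pandas as pd

query = """
    SELECT url, title, author, date, keywords, 'heise' AS source FROM heise
    UNION ALL
    SELECT url, title, author, date, keywords, 'chip' AS source FROM chip
"""
unified = pd.read_sql(query, conn)  # conn as in the configuration sketch
print(unified["source"].value_counts())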


📊 Streamlit Features

The dashboard offers over 20 different features and visualizations:

📈 Visualizations

  • Author Networks (🕸️) - Interactive network graphs showing connections between authors
  • Keyword Analysis (🔑) - Frequency distribution of key keywords
  • Word Clouds - Visual representation of most common terms
  • Time Analysis (📅) - Article publications over time
  • Trend Analysis - Predictions and pattern recognition
  • AI Analysis (🤖) - Topic Modeling, Sentiment Analysis
  • Sentiment Analysis - Article sentiment analysis
  • Topic Clustering - Automatic topic grouping
  • Content Recommendations - Find similar articles
  • Performance Metrics (⚡) - System statistics

🔧 Interactive Features

  • Source Filter - Show Heise, Chip, or both
  • Search Function (🔍) - Full-text search in articles
  • Date Range Filter - Time-based filtering
  • Category Filter - Filter by category
  • Author Filter - Filter by author
  • Export Function - CSV, Excel, JSON
  • SQL Queries (🔧) - Execute custom queries
  • Cache Management - Clear data cache

📥 Export Options

  • CSV export with source info
  • Excel export (.xlsx)
  • JSON export
  • SQL export
  • Filtered exports possible

📂 Project Structure

📂 datamining/
├── 📂 heise/                          # Heise crawlers and related scripts
│   ├── 📄 main.py                     # Archive crawler (backwards)
│   ├── 📄 current_crawler.py          # Live crawler (every 5 minutes)
│   ├── 📄 bot.py                      # Discord bot
│   ├── 📄 api.py                      # API functionalities
│   ├── 📄 notification.py             # Email notifications
│   ├── 📄 export_articles.py          # Export functionality
│   ├── 📄 test_notification.py        # Notification test
│   └── 📂 templates/                  # HTML templates
│       ├── 📄 news_feed.html
│       └── 📄 query.html
├── 📂 chip/                           # Chip crawlers and related scripts
│   ├── 📄 main.py                     # Archive crawler (forwards)
│   ├── 📄 current_crawler.py          # Live crawler (every 10 minutes)
│   ├── 📄 notification.py             # Email notifications
│   └── 📄 export_articles.py          # Export functionality
├── 📂 visualization/                  # Unified Streamlit dashboard
│   ├── 📄 streamlit_app.py            # Main Streamlit application
│   └── 📄 requirements_streamlit.txt  # Streamlit dependencies
├── 📂 data/                           # Export directory
├── 📂 docker/                         # Docker configurations (if present)
├── 📄 docker-compose.yml              # Docker Compose configuration
├── 📄 Dockerfile                      # Docker image definition
├── 📄 requirements.txt                # Python dependencies
├── 📄 .env                            # Environment variables (create manually)
├── 📄 .gitignore                      # Git ignore file
├── 📄 README.md                       # This file
├── 📄 QUICKSTART.md                   # Quick start guide
├── 📄 ARCHITECTURE.md                 # System architecture
├── 📄 DOCKER_SETUP.md                 # Docker setup guide
├── 📄 SECURITY.md                     # Security guidelines
└── 📄 LICENSE                         # License (GNU GPL)

🔧 Management with Docker Tools

For centralized management of your Docker containers, we recommend the following third-party solutions:

🏆 Portainer (Recommended)

Installation:

docker volume create portainer_data

docker run -d \
  -p 9000:9000 \
  --name portainer \
  --restart always \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v portainer_data:/data \
  portainer/portainer-ce:latest

Access: http://localhost:9000

Features:

  • Web-based GUI for container management
  • View logs in real-time
  • Start/stop/pause containers
  • Resource monitoring
  • Stack management (Docker Compose)
  • User-friendly interface

🎨 Dockge (Alternative)

Installation:

docker run -d \
  -p 5001:5001 \
  --name dockge \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v dockge_data:/app/data \
  louislam/dockge:1

Access: http://localhost:5001

Features:

  • Modern alternative to Portainer
  • Docker Compose focused
  • Simple user interface
  • Live logs

🚢 Yacht

Installation:

docker volume create yacht

docker run -d \
  -p 8000:8000 \
  --name yacht \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v yacht:/config \
  selfhostedpro/yacht

Access: http://localhost:8000

Features:

  • Self-hosted Docker management
  • Template-based
  • Clean UI

❗ Troubleshooting

Problem: Database connection error

Solution:

  1. Check .env file for correct database credentials
  2. Make sure PostgreSQL is running:
    # macOS
    brew services list
    
    # Linux
    sudo systemctl status postgresql
  3. Test the connection:
    psql -U $DB_USER -d $DB_NAME -h $DB_HOST

Problem: No data in Streamlit dashboard

Solution:

  1. Check if tables contain data:
    SELECT COUNT(*) FROM heise;
    SELECT COUNT(*) FROM chip;
  2. Clear Streamlit cache with the "🔄 Clear Cache" button
  3. Restart the Streamlit app

Problem: Email notifications not working

Solution:

  1. For Gmail: Use an App Password
  2. Test the notification function:
    cd heise
    python3 test_notification.py
  3. Check SMTP settings in .env

Problem: Discord bot not responding

Solution:

  1. Check DISCORD_TOKEN and CHANNEL_ID in .env
  2. Make sure the bot has the right permissions
  3. Check bot logs for errors

Problem: Docker containers not starting

Solution:

  1. Check Docker logs:
    docker-compose logs
  2. Make sure all ports are available
  3. Check the .env file

Problem: "Table does not exist"

Solution: Run a crawler to create the table:

cd heise
python3 main.py

🗂️ Examples & Screenshots

(Created with Tableau and Deepnote, as of March 2025.)


Deepnote:

We have also generated some graphs with Deepnote (❗ based on a random sample of only 10,000 rows ❗).


Also check out the data/Datamining_Heise web crawler-3.twb file, which contains an excerpt of the analyses.


📜 License

This program is licensed under the GNU General Public License.

See LICENSE for more details.


🙋 About Us

This project was programmed by the two of us within a few days and is under continuous development.

📬 Contact

Don't hesitate to contact us if you have questions, feedback, or just want to say hello!

📧 Email: server@schächner.de


💖 Special Thanks

The idea for our Heise News Crawler comes from David Kriesel and his 33c3 talk "SpiegelMining".


Happy Crawling! 🎉
