Analyze a website's privacy policy end-to-end: auto-discover the policy URL, fetch clean text (HTTP first, Selenium fallback), chunk the content, evaluate it via an LLM with a structured rubric, and aggregate category scores into an overall score with strengths, risks, red flags, and recommendations.
Part of Happy Hacking Space - A community-driven organization focused on security, AI, and software development.
- Auto-discovery: Common paths → robots.txt/sitemaps → footer links.
- HTTP-first extraction: trafilatura (clean text) or BeautifulSoup fallback; Selenium for dynamic pages.
- Structured scoring (JSON): Per-category (0–10) scores + rationales; aggregated to a 0–100 overall score in `scoring.py`.
- Configurable chunking: Paragraph-aware recursive splitting; `--max-chunks` hard cap to control cost/latency.
- Simple CLI: Choose `summary`, `detailed`, or `full` reports.
privacy-policy-analyzer/
├── src/
│ ├── __init__.py
│ ├── main.py # Main CLI application
│ └── analyzer/
│ ├── __init__.py
│ ├── prompts.py # LLM prompts for analysis
│ └── scoring.py # Scoring algorithms
├── docs/ # Documentation
│ ├── index.md
│ ├── user-guide.md
│ ├── api.md
│ ├── contributing.md
│ └── changelog.md
├── .github/
│ └── workflows/ # CI/CD pipelines
│ ├── ci.yaml
│ └── release.yml
├── pyproject.toml # Project configuration
├── requirements.txt # Legacy requirements
├── .env.example # Environment template
├── .gitignore
├── LICENSE
└── README.md
- Python 3.10.11 or higher
- An OpenAI API key
- (Optional) Chrome/Chromium on the machine (Selenium fallback; driver auto-installs)
```bash
# Clone the repository
git clone https://github.com/HappyHackingSpace/privacy-policy-analyzer.git
cd privacy-policy-analyzer

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate   # On macOS/Linux
# or
.venv\Scripts\activate      # On Windows
```

Or, with pip:

```bash
# Clone the repository
git clone https://github.com/HappyHackingSpace/privacy-policy-analyzer.git
cd privacy-policy-analyzer

# Create virtual environment
python -m venv venv
source venv/bin/activate   # On macOS/Linux
# or
venv\Scripts\activate      # On Windows

# Install dependencies
pip install -e .
```

Copy `.env.example` → `.env` and set your credentials:

```env
OPENAI_API_KEY=sk-************************
# Optional (overrides default):
OPENAI_MODEL=gpt-4o
```
```bash
# Using uv
uv run python src/main.py --url https://www.example.com/ --report summary

# Using pip
python src/main.py --url https://www.example.com/ --report summary
```

Detailed report:

```bash
uv run python src/main.py --url https://www.example.com/ --report detailed
```

- HTTP only (faster/cleaner when available):

```bash
uv run python src/main.py --url https://www.example.com/ --fetch http --report detailed
```

- Selenium only (for heavily dynamic pages):

```bash
uv run python src/main.py --url https://www.example.com/ --fetch selenium --report detailed
```

Skip discovery and analyze a known policy URL directly:

```bash
uv run python src/main.py --url https://www.example.com/legal/privacy --no-discover --report detailed
```

Tune chunking:

```bash
# Larger chunks = fewer requests (cheaper/faster), but slightly coarser analysis
uv run python src/main.py --url https://example.com --chunk-size 3500 --chunk-overlap 350 --max-chunks 30 --report summary
```

Options:

- `--url` (required): Site homepage or direct privacy policy URL.
- `--model` (default: env `OPENAI_MODEL` or `gpt-4o`): OpenAI chat model name.
- `--fetch` (default: `auto`): `auto` | `http` | `selenium`.
- `--no-discover`: Analyze the given URL without discovery.
- `--chunk-size` (default: 3500) and `--chunk-overlap` (default: 350).
- `--max-chunks` (default: 30): Hard cap; tail chunks are merged to keep requests bounded.
- `--report` (default: `summary`): `summary` | `detailed` | `full`.
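The chunking behavior described by `--chunk-size`, `--chunk-overlap`, and `--max-chunks` can be sketched like this. This is a simplified illustration, not the project's actual splitter; function and parameter names are hypothetical:

```python
def chunk_text(text: str, chunk_size: int = 3500, overlap: int = 350,
               max_chunks: int = 30) -> list[str]:
    """Greedy paragraph-aware chunking sketch: pack whole paragraphs up to
    chunk_size characters, carry `overlap` characters into the next chunk
    for context, and merge any tail beyond `max_chunks` into the last
    chunk so the number of LLM requests stays bounded."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # carry trailing overlap forward
        current = (current + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    if len(chunks) > max_chunks:  # hard cap: merge the tail chunks
        chunks[max_chunks - 1] = "\n\n".join(chunks[max_chunks - 1:])
        del chunks[max_chunks:]
    return chunks
```

Larger `chunk_size` values mean fewer requests at the cost of coarser per-chunk analysis, which is the trade-off the CLI flags expose.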
- `summary`: overall score, confidence, top strengths/risks, red-flags count.
- `detailed`: adds per-category scores (0–10), rationales, deduped red flags, recommendations.
- `full`: includes all per-chunk JSON items along with the aggregated report.
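The aggregation idea (per-category 0–10 scores rolled up to a 0–100 overall score) can be sketched as below. The real `scoring.py` may weight categories or handle missing data differently; this is an unweighted illustration:

```python
def aggregate_scores(category_scores: dict[str, float]) -> float:
    """Average per-category 0-10 scores and scale to a 0-100 overall score."""
    if not category_scores:
        return 0.0
    avg = sum(category_scores.values()) / len(category_scores)
    return round(avg * 10, 1)
```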
- Determinism: For consistent runs, pin `--fetch http` or `--fetch selenium` and/or use `--no-discover` with a fixed policy URL.
- International sites: The HTTP client sets `Accept-Language: en-US,en;q=0.9` to reduce locale variance.
- Selenium: Ensure Chrome/Chromium exists; `chromedriver-autoinstaller` will fetch a matching driver automatically.
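The `--fetch` dispatch logic (HTTP first, Selenium only as a fallback) can be sketched as below. The fetcher callables stand in for the real trafilatura/BeautifulSoup and Selenium paths; the function name and signature are illustrative:

```python
def fetch_clean_text(url: str, http_fetch, selenium_fetch, mode: str = "auto") -> str:
    """Dispatch sketch for the --fetch modes: pin one path explicitly, or in
    "auto" mode try the fast HTTP path first and fall back to Selenium only
    when HTTP yields no usable text."""
    if mode == "http":
        return http_fetch(url)
    if mode == "selenium":
        return selenium_fetch(url)
    text = http_fetch(url)  # auto: HTTP first (trafilatura/BeautifulSoup)
    return text if text and text.strip() else selenium_fetch(url)
```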
- `ImportError: lxml.html.clean ...`: Ensure `lxml[html_clean]` is installed (it's included in `requirements.txt`).
- Very low or inconsistent scores: Try `--fetch selenium` or analyze the explicit policy URL with `--no-discover`. Some sites serve different content per region/session.
This tool provides automated analysis heuristics and LLM-generated assessments. Treat results as decision support, not legal advice. Always review the original policy and consult qualified counsel for compliance-critical use cases.
- 📖 Full Documentation - Complete user guide and API reference
- 🚀 Quick Start Guide - Get up and running quickly
- 🔧 API Reference - Detailed API documentation
- 🤝 Contributing - How to contribute to the project
This project is part of Happy Hacking Space, a community-driven organization focused on:
- 🔒 Security Research - Tools and techniques for security professionals
- 🤖 AI & Machine Learning - Practical AI applications and research
- 💻 Software Development - Open source tools and libraries
- 🌐 Community Building - Bringing together developers, researchers, and security experts
- vulnerable-target - Intentionally vulnerable environments for security training
- events - Community events directory
- site - Happy Hacking Space website
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- 🐛 Bug Reports: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Contact: Happy Hacking Space
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with ❤️ by the Happy Hacking Space community
- Powered by OpenAI's GPT models
- Uses trafilatura for web content extraction
- Inspired by the need for better privacy policy transparency