Skip to content

A command-line tool that automatically fetches, analyzes, and scores website privacy policies using AI to highlight strengths, risks, and recommendations.

License

Notifications You must be signed in to change notification settings

HappyHackingSpace/privacy-policy-analyzer

Privacy Policy Analyzer

License: MIT Python 3.10.11+ GitHub stars GitHub forks

Analyze a website's privacy policy end-to-end: auto-discover the policy URL, fetch clean text (HTTP first, Selenium fallback), chunk the content, evaluate it via an LLM with a structured rubric, and aggregate category scores into an overall score with strengths, risks, red flags, and recommendations.

Part of Happy Hacking Space - A community-driven organization focused on security, AI, and software development.

Features

  • Auto-discovery: Common paths → robots.txt/sitemaps → footer links.
  • HTTP-first extraction: trafilatura (clean text) or BeautifulSoup fallback; Selenium for dynamic pages.
  • Structured scoring (JSON): Per-category (0–10) scores + rationales; aggregated to 0–100 overall in scoring.py.
  • Configurable chunking: Paragraph-aware recursive splitting; --max-chunks hard cap to control cost/latency.
  • Simple CLI: Choose summary, detailed, or full reports.

Project Layout

privacy-policy-analyzer/
├── src/
│   ├── __init__.py
│   ├── main.py                    # Main CLI application
│   └── analyzer/
│       ├── __init__.py
│       ├── prompts.py             # LLM prompts for analysis
│       └── scoring.py             # Scoring algorithms
├── docs/                          # Documentation
│   ├── index.md
│   ├── user-guide.md
│   ├── api.md
│   ├── contributing.md
│   └── changelog.md
├── .github/
│   └── workflows/                 # CI/CD pipelines
│       ├── ci.yaml
│       └── release.yml
├── pyproject.toml                 # Project configuration
├── requirements.txt               # Legacy requirements
├── .env.example                   # Environment template
├── .gitignore
├── LICENSE
└── README.md

Requirements

  • Python 3.10.11 or higher
  • An OpenAI API key
  • (Optional) Chrome/Chromium on the machine (Selenium fallback; driver auto-installs)

Installation

Using uv (Recommended)

# Clone the repository
git clone https://github.com/HappyHackingSpace/privacy-policy-analyzer.git
cd privacy-policy-analyzer

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On macOS/Linux
# or
.venv\Scripts\activate     # On Windows

Using pip

# Clone the repository
git clone https://github.com/HappyHackingSpace/privacy-policy-analyzer.git
cd privacy-policy-analyzer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

# Install dependencies
pip install -e .

Configuration

Copy .env.example.env and set your credentials:

OPENAI_API_KEY=sk-************************
# Optional (overrides default):
OPENAI_MODEL=gpt-4o

Usage

Quick start (auto-discovery, summary report)

# Using uv
uv run python src/main.py --url https://www.example.com/ --report summary

# Using pip
python src/main.py --url https://www.example.com/ --report summary

Detailed JSON (category scores, rationales, red flags)

uv run python src/main.py --url https://www.example.com/ --report detailed

Force a specific fetch method

  • HTTP only (faster/cleaner when available):
uv run python src/main.py --url https://www.example.com/ --fetch http --report detailed
  • Selenium only (for heavily dynamic pages):
uv run python src/main.py --url https://www.example.com/ --fetch selenium --report detailed

Skip auto-discovery (analyze a known policy URL as-is)

uv run python src/main.py --url https://www.example.com/legal/privacy --no-discover --report detailed

Tuning chunking and cost/latency

# Larger chunks = fewer requests (cheaper/faster), but slightly coarser analysis
uv run python src/main.py --url https://example.com --chunk-size 3500 --chunk-overlap 350 --max-chunks 30 --report summary

CLI Options (summary)

  • --url (required): Site homepage or direct privacy policy URL.
  • --model (default: env OPENAI_MODEL or gpt-4o): OpenAI chat model name.
  • --fetch (default: auto): auto | http | selenium.
  • --no-discover: Analyze the given URL without discovery.
  • --chunk-size (default: 3500) and --chunk-overlap (default: 350).
  • --max-chunks (default: 30): Hard cap; tail chunks are merged to keep requests bounded.
  • --report (default: summary): summary | detailed | full.

Output

  • summary: overall score, confidence, top strengths/risks, red-flags count.
  • detailed: adds per-category scores (0–10), rationales, deduped red flags, recommendations.
  • full: includes all per-chunk JSON items along with the aggregated report.

Notes & Tips

  • Determinism: For consistent runs, pin --fetch http or --fetch selenium and/or use --no-discover with a fixed policy URL.
  • International sites: The HTTP client sets Accept-Language: en-US,en;q=0.9 to reduce locale variance.
  • Selenium: Ensure Chrome/Chromium exists; chromedriver-autoinstaller will fetch a matching driver automatically.

Troubleshooting

  • ImportError: lxml.html.clean ... Ensure lxml[html_clean] is installed (it’s included in requirements.txt).
  • Very low or inconsistent scores Try --fetch selenium or analyze the explicit policy URL with --no-discover. Some sites serve different content per region/session.

Security & Ethics

This tool provides automated analysis heuristics and LLM-generated assessments. Treat results as decision support, not legal advice. Always review the original policy and consult qualified counsel for compliance-critical use cases.

Documentation

About Happy Hacking Space

This project is part of Happy Hacking Space, a community-driven organization focused on:

  • 🔒 Security Research - Tools and techniques for security professionals
  • 🤖 AI & Machine Learning - Practical AI applications and research
  • 💻 Software Development - Open source tools and libraries
  • 🌐 Community Building - Bringing together developers, researchers, and security experts

Other Projects

  • vulnerable-target - Intentionally vulnerable environments for security training
  • events - Community events directory
  • site - Happy Hacking Space website

Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Support

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with ❤️ by the Happy Hacking Space community
  • Powered by OpenAI's GPT models
  • Uses trafilatura for web content extraction
  • Inspired by the need for better privacy policy transparency

About

A command-line tool that automatically fetches, analyzes, and scores website privacy policies using AI to highlight strengths, risks, and recommendations.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •