Enhance README with features and usage details #18
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,136 +1,214 @@ | ||||||||||
| # Redback Ethics Asset Scanner | ||||||||||
|
|
||||||||||
| The **Asset Scanner** is a Python-based tool for detecting sensitive information (PII, secrets, credentials, etc.) in documents and media. | ||||||||||
| It is designed for educational use in cybersecurity and ethics modules. | ||||||||||
| The **Asset Scanner** is a Python-based tool for detecting sensitive information (PII, secrets, credentials, etc.) in documents, code, and media. Designed for educational use in cybersecurity and ethics modules, the scanner helps students and professionals identify and mitigate risks associated with the exposure of sensitive data. | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 🛠️ Key Features | ||||||||||
|
|
||||||||||
| - **Hybrid Detection**: Combines **Microsoft Presidio**'s NLP-based entity recognition with **custom regex patterns** from `patterns.json`. | ||||||||||
| - **OCR Capabilities**: Scans text within images and PDFs using Optical Character Recognition (OCR) via `ocr_engine.py`. | ||||||||||
| - **Risk Assessment**: Categorizes findings into _Low_, _Medium_, or _High_ risk levels with references to compliance frameworks (e.g., GDPR, Privacy Act). | ||||||||||
| - **Flexible Input Handling**: Supports directories, individual files, and various formats, including `.txt`, `.docx`, `.pdf`, `.png`, `.jpg`, and more. | ||||||||||
| - **Actionable Reports**: Provides detailed mitigation tips and compliance recommendations for each detected risk. | ||||||||||
| - **Command-Line Interface**: Easy-to-use CLI with options to customize pattern files, output format, and verbosity. | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 📂 Project Structure | ||||||||||
|
|
||||||||||
| - `scanner.py` – Main entry point for scanning files and generating reports. | ||||||||||
| - `scan_media.py` – Scans image/PDF inputs using OCR (`ocr_engine.py`). | ||||||||||
| - `file_handler.py` – Handles input files and preprocessing. | ||||||||||
| - `ocr_engine.py` – OCR engine wrapper for text extraction from images. | ||||||||||
| - `reporter.py` – Builds structured scan results and output reports. | ||||||||||
| - `patterns.json` – Regex patterns for detecting sensitive items. | ||||||||||
| - `risk_rules.json` – Maps detected patterns to risk levels, compliance references, and remediation tips. | ||||||||||
| | File/Directory | Description | | ||||||||||
| |-----------------------|------------------------------------------------------------------------------------------------------------| | ||||||||||
| | `scanner.py` | Main entry point for scanning files and generating reports. | | ||||||||||
| | `scan_media.py` | Handles scanning of media files (images, PDFs) using OCR (`ocr_engine.py`). | | ||||||||||
| | `file_handler.py` | Manages file discovery and preprocessing (parsing `.docx`, `.txt`, etc.). | | ||||||||||
| | `ocr_engine.py` | OCR engine wrapper for extracting text from images and PDFs. | | ||||||||||
| | `reporter.py` | Builds structured scan results and outputs reports. | | ||||||||||
| | `patterns.json` | Regex patterns for detecting sensitive items (AWS keys, emails, etc.). | | ||||||||||
| | `risk_rules.json` | Maps detected patterns to risk levels, compliance references, and remediation tips. | | ||||||||||
| | `requirements.txt` | Lists the Python dependencies required to run the scanner. | | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## ⚙️ Setup | ||||||||||
|
|
||||||||||
| 1. Clone the repository: | ||||||||||
| 1. **Clone the Repository**: | ||||||||||
| ```bash | ||||||||||
| git clone https://github.com/<your-repo>/redback-ethics.git | ||||||||||
| cd redback-ethics/asset-scanner | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 2. Create and activate a virtual environment: | ||||||||||
| 2. **Create and Activate a Virtual Environment**: | ||||||||||
| ```bash | ||||||||||
| python3 -m venv .venv | ||||||||||
| source .venv/bin/activate | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 3. Install dependencies: | ||||||||||
| 3. **Install Dependencies**: | ||||||||||
| ```bash | ||||||||||
| pip install -r requirements.txt | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 4. **Install OCR Dependencies (Optional)**: | ||||||||||
| - For PDF/image support: | ||||||||||
| ```bash | ||||||||||
| sudo apt install poppler-utils | ||||||||||
| pip install pdf2image pytesseract | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 🚀 Usage | ||||||||||
|
|
||||||||||
| To scan a document: | ||||||||||
| ### Scan a Single File: | ||||||||||
| ```bash | ||||||||||
| python scanner.py --file "/path/to/document.docx" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| To scan an image or PDF (OCR enabled): | ||||||||||
| ### Scan an Image or PDF (OCR): | ||||||||||
| ```bash | ||||||||||
| python scan_media.py --file "/path/to/image_or_pdf" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| To scan a directory: | ||||||||||
| ### Scan a Directory Recursively: | ||||||||||
| ```bash | ||||||||||
| python scanner.py --root "/path/to/folder" | ||||||||||
| ``` | ||||||||||
| OR | ||||||||||
| if you run scanner.py standalone without any --file or --root arguments, you will be prompted | ||||||||||
| to enter a directory at runtime | ||||||||||
|
|
||||||||||
| Output will include: | ||||||||||
| - Detected matches with line context | ||||||||||
| - Risk level (from `risk_rules.json`) | ||||||||||
| - Mitigation tips and relevant compliance frameworks | ||||||||||
| ### Interactive Mode: | ||||||||||
| Running `scanner.py` without arguments prompts you to specify a directory or file at runtime: | ||||||||||
| ```bash | ||||||||||
| python scanner.py | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| Output Includes: | ||||||||||
| - Detected matches with line numbers | ||||||||||
| - Risk levels (_Low_, _Medium_, _High_) | ||||||||||
| - Mitigation tips and compliance references | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## ⚡ Command-Line Arguments | ||||||||||
| ## ⚡ Command-Line Interface (CLI) | ||||||||||
|
|
||||||||||
| The scanner supports several arguments to control input and behaviour: | ||||||||||
| | Argument | Type | Description | Example | | ||||||||||
| |---------------|-----------|---------------------------------------------------------------|-------------------------------------------| | ||||||||||
| | `--file` | Path | Scan a single file or multiple files. | `python scanner.py --file "/path/to/doc"` | | ||||||||||
| | `--root` | Path | Recursively scan all files in a directory. | `python scanner.py --root "/path/to/"` | | ||||||||||
| | `--patterns` | Path | Custom path to `patterns.json`. | `--patterns ./configs/patterns.json` | | ||||||||||
| | `--out` | Path | Path to save structured scan results (e.g., `.json`, `.txt`). | `--out results.json` | | ||||||||||
| | `--ext` | List | Filter by file extensions (_default: .txt, .docx, .pdf_). | `--ext .txt .md` | | ||||||||||
| | `--no-console`| Flag | Suppress console output. Only write to the output file. | `--no-console` | | ||||||||||
|
||||||||||
Copilot
AI
Jan 15, 2026
The example compliance reference is incomplete. According to risk_rules.json lines 14-19, the aws_access_key actually includes multiple compliance references: 'Privacy Act 1988 (Cth) — APP 11', 'Privacy Act 1988 (Cth) — Notifiable Data Breaches (NDB) scheme, Part IIIC', and 'GDPR Art. 32 — Security of processing'. The tip should also be 'Rotate immediately; revoke if exposed; move to a secrets manager; purge from history.'
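To make the correction concrete, the entry described in this comment could be mirrored as below. This is only a sketch: the key names `compliance` and `tip` and the helper `describe_finding` are assumptions made for illustration, not the actual schema of `risk_rules.json`. Only the reference strings and the tip text are taken from the comment.

```python
# Hypothetical in-memory mirror of one risk_rules.json entry.
# Key names ("compliance", "tip") are assumed; the values are quoted
# from the review comment.
RISK_RULES = {
    "aws_access_key": {
        "compliance": [
            "Privacy Act 1988 (Cth) — APP 11",
            "Privacy Act 1988 (Cth) — Notifiable Data Breaches (NDB) scheme, Part IIIC",
            "GDPR Art. 32 — Security of processing",
        ],
        "tip": (
            "Rotate immediately; revoke if exposed; "
            "move to a secrets manager; purge from history."
        ),
    }
}


def describe_finding(pattern_name, rules=RISK_RULES):
    """Return every compliance reference and the tip for a pattern."""
    rule = rules[pattern_name]
    return {
        "pattern": pattern_name,
        "compliance": list(rule["compliance"]),  # all references, not just one
        "tip": rule["tip"],
    }


if __name__ == "__main__":
    finding = describe_finding("aws_access_key")
    for reference in finding["compliance"]:
        print(reference)
    print(finding["tip"])
```

A README example generated from a structure like this would list all three references rather than a single one.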
Copilot
AI
Jan 15, 2026
The emoji '🥼' (lab coat) is unusual for an 'Example' section. Consider using a more standard emoji like '📝' (memo) or '💡' (light bulb) for consistency with other section headers.
| ## 🥼 Example | |
| ## 📝 Example |
Copilot
AI
Jan 15, 2026
The example shows email risk as 'Medium', but according to risk_rules.json line 3, emails are classified as 'Low' risk, not 'Medium'.
| "risk": "Medium", | |
| "risk": "Low", |
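A mismatch of this kind can be caught mechanically by comparing each documented finding with the rules. The following is a minimal sketch, assuming the rules map a pattern name to a dict with a `level` key and that a documented finding carries `pattern` and `risk` keys. All of these names, and the function `find_risk_mismatches`, are hypothetical.

```python
def find_risk_mismatches(documented_findings, rules):
    """Return (pattern, documented_risk, expected_risk) for every disagreement."""
    mismatches = []
    for finding in documented_findings:
        expected = rules.get(finding["pattern"], {}).get("level")
        # Patterns missing from the rules are skipped rather than flagged.
        if expected is not None and finding["risk"] != expected:
            mismatches.append((finding["pattern"], finding["risk"], expected))
    return mismatches


# Assumed shapes: the real risk_rules.json and README example may differ.
rules = {"email": {"level": "Low"}}
readme_example = [{"pattern": "email", "risk": "Medium"}]

print(find_risk_mismatches(readme_example, rules))
# [('email', 'Medium', 'Low')]
```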
Copilot
AI
Jan 15, 2026
The README references a CODE_OF_CONDUCT.md file, but this file does not exist in the asset-scanner directory. This will result in a broken link for users viewing the asset-scanner README.
| Please adhere to our [Code of Conduct](CODE_OF_CONDUCT.md). | |
| Please adhere to our [Code of Conduct](../CODE_OF_CONDUCT.md). |
Copilot
AI
Jan 15, 2026
The README references a LICENSE file with a relative link, but no LICENSE file exists in the asset-scanner directory. This will result in a broken link for users viewing the asset-scanner README.
| This project is licensed under the [MIT License](LICENSE). | |
| See the `LICENSE` file for full details. | |
| This project is licensed under the MIT License. | |
| See the `LICENSE` file in the repository root for full details. |
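Broken relative links such as the `CODE_OF_CONDUCT.md` and `LICENSE` cases can be detected before merging. The sketch below is one possible check, not part of the repository: the function name `broken_relative_links` and the simplified link regex are assumptions, and it only handles inline Markdown links of the form `[text](target)`.

```python
import re
from pathlib import Path

# Simplified matcher for inline Markdown links: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything starting with "scheme:" (https:, mailto:, ...) is not a local file.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(readme_path):
    """Return relative link targets that do not exist next to the README."""
    readme = Path(readme_path)
    text = readme.read_text(encoding="utf-8")
    broken = []
    for target in LINK_RE.findall(text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor
        path_part = target.split("#", 1)[0]  # drop any trailing anchor
        if not (readme.parent / path_part).exists():
            broken.append(target)
    return broken


if __name__ == "__main__":
    import sys

    for link in broken_relative_links(sys.argv[1]):
        print(f"broken link: {link}")
```

Links are resolved relative to the README's own directory, which is why `LICENSE` fails from inside `asset-scanner` while `../LICENSE` would resolve if the file sits in the repository root.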
The documented default extensions for `--ext` are incorrect. According to scanner.py line 185, the actual default is `['.txt', '.json']`, not `.txt, .docx, .pdf`. The scanner.py line 51 defines `DEFAULT_TARGET_EXTS` with a broader list, but these are not the defaults for the `--ext` parameter.
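One way to keep the documented default in sync is to read it from the argument parser instead of typing it by hand. The sketch below is a stand-in, not the real `scanner.py`: `build_parser` is a hypothetical helper, and only the `--ext` flag and its `['.txt', '.json']` default are taken from this comment.

```python
import argparse


def build_parser():
    """Illustrative stand-in for the scanner CLI; only --ext is modelled."""
    parser = argparse.ArgumentParser(description="Asset scanner (sketch)")
    parser.add_argument(
        "--ext",
        nargs="+",                  # accepts one or more extensions
        default=[".txt", ".json"],  # default reported in the review comment
        help="File extensions to scan",
    )
    return parser


parser = build_parser()

# The value the README should document as the default:
print(parser.get_default("ext"))
# ['.txt', '.json']

# An explicit --ext replaces the default rather than extending it:
print(parser.parse_args(["--ext", ".txt", ".md"]).ext)
# ['.txt', '.md']
```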