206 changes: 142 additions & 64 deletions asset-scanner/README.md
@@ -1,136 +1,214 @@
# Redback Ethics Asset Scanner

The **Asset Scanner** is a Python-based tool for detecting sensitive information (PII, secrets, credentials, etc.) in documents and media.
It is designed for educational use in cybersecurity and ethics modules.
The **Asset Scanner** is a Python-based tool for detecting sensitive information (PII, secrets, credentials, etc.) in documents, code, and media. Designed for educational use in cybersecurity and ethics modules, the scanner helps students and professionals identify and mitigate risks associated with the exposure of sensitive data.

---

## 🛠️ Key Features

- **Hybrid Detection**: Combines **Microsoft Presidio**'s NLP-based entity recognition with **custom regex patterns** from `patterns.json`.
- **OCR Capabilities**: Scans text within images and PDFs using Optical Character Recognition (OCR) via `ocr_engine.py`.
- **Risk Assessment**: Categorizes findings into _Low_, _Medium_, or _High_ risk levels with references to compliance frameworks (e.g., GDPR, Privacy Act).
- **Flexible Input Handling**: Supports directories, individual files, and various formats, including `.txt`, `.docx`, `.pdf`, `.png`, `.jpg`, and more.
- **Actionable Reports**: Provides detailed mitigation tips and compliance recommendations for each detected risk.
- **Command-Line Interface**: Easy-to-use CLI with options to customize pattern files, output format, and verbosity.

---

## 📂 Project Structure

- `scanner.py` – Main entry point for scanning files and generating reports.
- `scan_media.py` – Scans image/PDF inputs using OCR (`ocr_engine.py`).
- `file_handler.py` – Handles input files and preprocessing.
- `ocr_engine.py` – OCR engine wrapper for text extraction from images.
- `reporter.py` – Builds structured scan results and output reports.
- `patterns.json` – Regex patterns for detecting sensitive items.
- `risk_rules.json` – Maps detected patterns to risk levels, compliance references, and remediation tips.
| File/Directory | Description |
|-----------------------|------------------------------------------------------------------------------------------------------------|
| `scanner.py` | Main entry point for scanning files and generating reports. |
| `scan_media.py` | Handles scanning of media files (images, PDFs) using OCR (`ocr_engine.py`). |
| `file_handler.py` | Manages file discovery and preprocessing (parsing `.docx`, `.txt`, etc.). |
| `ocr_engine.py` | OCR engine wrapper for extracting text from images and PDFs. |
| `reporter.py` | Builds structured scan results and outputs reports. |
| `patterns.json` | Regex patterns for detecting sensitive items (AWS keys, emails, etc.). |
| `risk_rules.json` | Maps detected patterns to risk levels, compliance references, and remediation tips. |
| `requirements.txt` | Lists the Python dependencies required to run the scanner. |

---

## ⚙️ Setup

1. Clone the repository:
1. **Clone the Repository**:
```bash
git clone https://github.com/<your-repo>/redback-ethics.git
cd redback-ethics/asset-scanner
```

2. Create and activate a virtual environment:
2. **Create and Activate a Virtual Environment**:
```bash
python3 -m venv .venv
source .venv/bin/activate
```

3. Install dependencies:
3. **Install Dependencies**:
```bash
pip install -r requirements.txt
```

4. **Install OCR Dependencies (Optional)**:
- For PDF/image support:
```bash
sudo apt install poppler-utils
pip install pdf2image pytesseract
```

---

## 🚀 Usage

To scan a document:
### Scan a Single File:
```bash
python scanner.py --file "/path/to/document.docx"
```

To scan an image or PDF (OCR enabled):
### Scan an Image or PDF (OCR):
```bash
python scan_media.py --file "/path/to/image_or_pdf"
```

To scan a directory:
### Scan a Directory Recursively:
```bash
python scanner.py --root "/path/to/folder"
```
OR
if you run scanner.py standalone without any --file or --root arguments, you will be prompted
to enter a directory at runtime

Output will include:
- Detected matches with line context
- Risk level (from `risk_rules.json`)
- Mitigation tips and relevant compliance frameworks
### Interactive Mode:
Running `scanner.py` without arguments prompts you to specify a directory or file at runtime:
```bash
python scanner.py
```

Output Includes:
- Detected matches with line numbers
- Risk levels (_Low_, _Medium_, _High_)
- Mitigation tips and compliance references
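
The recursive directory scan described in this Usage section can be pictured with a small sketch. This is not the code in `file_handler.py`; the function name `discover_files` and its parameters are hypothetical and only illustrate how a root folder might be walked and filtered by extension.

```python
from pathlib import Path


def discover_files(root, exts):
    """Return files under `root` (recursively) whose suffix is in `exts`.

    Hypothetical helper for illustration; the real scanner may walk
    directories differently.
    """
    wanted = {ext.lower() for ext in exts}
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


if __name__ == "__main__":
    # List candidate files below the current directory.
    for path in discover_files(".", [".txt", ".docx", ".pdf"]):
        print(path)
```

Comparing suffixes in lower case means `REPORT.PDF` and `report.pdf` are treated the same way, which is an assumption of this sketch rather than documented scanner behaviour.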

---

## ⚡ Command-Line Arguments
## ⚡ Command-Line Interface (CLI)

The scanner supports several arguments to control input and behaviour:
| Argument | Type | Description | Example |
|---------------|-----------|---------------------------------------------------------------|-------------------------------------------|
| `--file`      | Path      | Scan a single file or multiple files.                         | `python scanner.py --file "/path/to/doc"` |
| `--root` | Path | Recursively scan all files in a directory. | `python scanner.py --root "/path/to/"` |
| `--patterns` | Path | Custom path to `patterns.json`. | `--patterns ./configs/patterns.json` |
| `--out` | Path | Path to save structured scan results (e.g., `.json`, `.txt`). | `--out results.json` |
| `--ext` | List | Filter by file extensions (_default: .txt, .docx, .pdf_). | `--ext .txt .md` |
Copilot AI Jan 15, 2026

The documented default extensions for --ext are incorrect. According to scanner.py line 185, the actual default is ['.txt', '.json'], not .txt, .docx, .pdf. The scanner.py line 51 defines DEFAULT_TARGET_EXTS with a broader list, but these are not the defaults for the --ext parameter.

Suggested change
| `--ext` | List | Filter by file extensions (_default: .txt, .docx, .pdf_). | `--ext .txt .md` |
| `--ext` | List | Filter by file extensions (_default: .txt, .json_). | `--ext .txt .md` |

Copilot uses AI. Check for mistakes.
| `--no-console`| Flag | Suppress console output. Only write to the output file. | `--no-console` |
Copilot AI Jan 15, 2026

The --no-console flag is documented in the CLI table, but this argument does not exist in scanner.py's argument parser (lines 173-190). This feature is not implemented.

Suggested change
| `--no-console`| Flag | Suppress console output. Only write to the output file. | `--no-console` |


| Argument | Type | Description | Example |
|----------|------|-------------|---------|
| `--file` | Path | Scan a single file (e.g., `.docx`, `.pdf`, `.png`). | `python scanner.py --file "/path/to/document.docx"` |
| `--root` | Path | Recursively scan all files within a directory. | `python scanner.py --root "/path/to/folder"` |
| `--patterns` | Path | Custom path to `patterns.json`. Useful if you want to override defaults. | `python scanner.py --file test.docx --patterns ./configs/patterns.json` |
| `--out` | Path | File to write structured scan results (JSON or text depending on implementation). | `python scanner.py --root ./docs --out results.json` |
| `--no-console` | Flag | Suppress console output. Results will only be written to the output file. | `python scanner.py --root ./docs --no-console --out results.json` |
### Example Usage:
- **Scanning a File**: `python scanner.py --file /example/path/file.docx`
- **Full Folder Scan**: `python scanner.py --root "./sensitive_files"`
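
As a rough illustration of the arguments listed in the CLI table, the sketch below builds an `argparse` parser. It is not the parser in `scanner.py`: the function name `build_parser` and the `--patterns` default are assumptions. The `--ext` default follows the review note that `scanner.py` uses `['.txt', '.json']`, and `--no-console` is left out because the review reports it is not implemented.

```python
import argparse


def build_parser():
    """Illustrative parser; the real scanner.py may differ in defaults and help text."""
    parser = argparse.ArgumentParser(description="Asset scanner CLI (sketch)")
    parser.add_argument("--file", help="Path to a single file to scan")
    parser.add_argument("--root", help="Directory to scan recursively")
    # Assumed default: a patterns.json next to the script.
    parser.add_argument("--patterns", default="patterns.json",
                        help="Path to the regex pattern file")
    parser.add_argument("--out", help="File to write structured results to")
    # Default taken from the review comment, not verified against scanner.py.
    parser.add_argument("--ext", nargs="+", default=[".txt", ".json"],
                        help="File extensions to include")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Using `nargs="+"` lets `--ext .txt .md` be passed as a space-separated list, matching the example column in the table.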

### Common Usage Examples
---

Scan one file:
```bash
python scanner.py --file "/Users/alice/Documents/report.docx"
## 📋 Customization

### `patterns.json`
Defines custom regex patterns for sensitive data detection. Each entry includes:
- `pattern`: The regex string to match.
- `risk`: The associated risk level (_Low_, _Medium_, or _High_).
- `description`: A brief explanation of what the pattern detects.

Example:
```json
{
"aws_access_key": {
"pattern": "\\bAKIA[0-9A-Z]{16}\\b",
"risk": "High",
"description": "AWS Access Key ID"
}
}
```
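
To make the `patterns.json` schema concrete, here is a minimal sketch of how entries in that shape could be compiled and applied line by line. The names `compile_patterns` and `scan_text` are hypothetical and the pattern file is inlined so the snippet is self-contained; the actual matching logic in `scanner.py` may work differently.

```python
import json
import re

# Same shape as the patterns.json example; inlined for a self-contained sketch.
PATTERNS_JSON = r"""
{
  "aws_access_key": {
    "pattern": "\\bAKIA[0-9A-Z]{16}\\b",
    "risk": "High",
    "description": "AWS Access Key ID"
  }
}
"""


def compile_patterns(raw_json):
    """Map each pattern name to (compiled regex, risk level)."""
    entries = json.loads(raw_json)
    return {
        name: (re.compile(entry["pattern"]), entry["risk"])
        for name, entry in entries.items()
    }


def scan_text(text, compiled):
    """Return one finding per regex match, with a 1-based line number."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, (regex, risk) in compiled.items():
            for match in regex.finditer(line):
                findings.append({
                    "pattern": name,
                    "line": line_no,
                    "match": match.group(0),
                    "risk": risk,
                })
    return findings


if __name__ == "__main__":
    sample = "Email: alice@example.com\nAWS Key: AKIAIOSFODNN7EXAMPLE"
    for finding in scan_text(sample, compile_patterns(PATTERNS_JSON)):
        print(finding)
```

The `\b` word boundaries in the pattern keep it from matching inside a longer alphanumeric token, which is one simple way to cut down on false positives.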

Recursively Scan Directory:
```bash
python scanner.py --root "/Users/alice/Documents/sensitive_documents"
### `risk_rules.json`
Maps patterns to risk levels, mitigation tips, and compliance frameworks:
- `level`: _Low_, _Medium_, or _High_
- `tip`: A recommended action for addressing the risk.
- `compliance`: Legal/regulatory references (e.g., GDPR).

Example:
```json
{
"aws_access_key": {
"level": "High",
"tip": "Rotate immediately; revoke if exposed.",
"compliance": ["GDPR Art. 33 — Data Breach Notification"]
Comment on lines +138 to +139
Copilot AI Jan 15, 2026

The example compliance reference is incomplete. According to risk_rules.json lines 14-19, the aws_access_key actually includes multiple compliance references: 'Privacy Act 1988 (Cth) — APP 11', 'Privacy Act 1988 (Cth) — Notifiable Data Breaches (NDB) scheme, Part IIIC', and 'GDPR Art. 32 — Security of processing'. The tip should also be 'Rotate immediately; revoke if exposed; move to a secrets manager; purge from history.'

}
}
```
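
The mapping from a detected pattern to its `level`, `tip`, and `compliance` entries can be sketched as a dictionary lookup. The helper name `attach_risk` and the `"Unknown"` fallback are assumptions for illustration, and the rule values below are abbreviated from the README example rather than copied from the real `risk_rules.json`, which the review notes contains a longer tip and more compliance references.

```python
# Abbreviated rule in the documented risk_rules.json shape (illustrative values).
RISK_RULES = {
    "aws_access_key": {
        "level": "High",
        "tip": "Rotate immediately; revoke if exposed.",
        "compliance": ["GDPR Art. 33"],
    }
}


def attach_risk(finding, rules):
    """Return a copy of `finding` enriched with level, tip, and compliance.

    Patterns without a rule get the hypothetical level "Unknown".
    """
    rule = rules.get(finding["pattern"], {})
    return {
        **finding,
        "risk": rule.get("level", "Unknown"),
        "tip": rule.get("tip", ""),
        "compliance": list(rule.get("compliance", [])),
    }


if __name__ == "__main__":
    finding = {"pattern": "aws_access_key", "line": 3, "match": "AKIAIOSFODNN7EXAMPLE"}
    print(attach_risk(finding, RISK_RULES))
```

Keeping the rules in a separate file from the regexes means a risk level or tip can be changed without touching the detection patterns.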

---

## 🛡️ Configuration
- **`patterns.json`**: Defines regex patterns for items like emails, API keys, driver’s licence numbers, etc.
Each entry specifies:
- `pattern`: regex string
- `risk`: risk level
- `description`: human-readable explanation
- **`risk_rules.json`**: Associates each pattern with:
- `level`: severity (Low/Medium/High)
- `tip`: recommended mitigation
- `compliance`: legal/regulatory references
You can extend these files to detect new types of data.
---
## 📝 Example
Scanning a document containing:
## 🥼 Example
Copilot AI Jan 15, 2026

The emoji '🥼' (lab coat) is unusual for an 'Example' section. Consider using a more standard emoji like '📝' (memo) or '💡' (light bulb) for consistency with other section headers.

Suggested change
## 🥼 Example
## 📝 Example


### Input File:
```
Email: alice@example.com
Email: john.doe@example.com
Password: "SuperSecret123"
AWS Key: AKIAIOSFODNN7EXAMPLE
```
Would output:
### Output:
In **Console**:
```
[Email] -> Medium Risk
Tip: Mask or obfuscate emails in logs/code unless strictly required.
Compliance: Privacy Act 1988 (Cth) — APP 11

[Password] -> High Risk
Tip: Remove hard-coded passwords; rotate immediately; use env vars or a vault.
Tip: Remove hard-coded passwords; rotate immediately.
Compliance: GDPR Art. 32 — Security of processing

[AWS Key] -> High Risk
Tip: Rotate immediately; revoke if exposed.
Compliance: GDPR Art. 33 — Data Breach Notification
```
In **JSON Report**:
```json
[
{
"pattern": "email",
"file": "example.docx",
"line": 1,
"match": "john.doe@example.com",
"risk": "Medium",
Copilot AI Jan 15, 2026

The example shows email risk as 'Medium', but according to risk_rules.json line 3, emails are classified as 'Low' risk, not 'Medium'.

Suggested change
"risk": "Medium",
"risk": "Low",

"tip": "Mask or obfuscate emails in logs/code..."
},
{
"pattern": "aws_access_key",
"file": "example.docx",
"line": 3,
"match": "AKIAIOSFODNN7EXAMPLE",
"risk": "High",
"tip": "Rotate immediately; revoke if exposed..."
}
]
```
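
A report in the JSON shape shown above is easy to post-process. The sketch below counts findings per risk level and reports the highest level present. The function name `summarise` and the `RISK_ORDER` ranking are assumptions, and the sample data mirrors the README example (including the email risk level that the review flags as inconsistent with `risk_rules.json`).

```python
import json
from collections import Counter

# Assumed ordering of the three documented risk levels.
RISK_ORDER = {"Low": 0, "Medium": 1, "High": 2}

SAMPLE_REPORT = """
[
  {"pattern": "email", "file": "example.docx", "line": 1,
   "match": "john.doe@example.com", "risk": "Medium"},
  {"pattern": "aws_access_key", "file": "example.docx", "line": 3,
   "match": "AKIAIOSFODNN7EXAMPLE", "risk": "High"}
]
"""


def summarise(report_json):
    """Return (counts per risk level, highest risk level or None)."""
    findings = json.loads(report_json)
    counts = Counter(finding["risk"] for finding in findings)
    highest = max(counts, key=RISK_ORDER.__getitem__, default=None)
    return dict(counts), highest


if __name__ == "__main__":
    counts, highest = summarise(SAMPLE_REPORT)
    print("Findings per level:", counts)
    print("Highest level:", highest)
```

A summary like this could be used to fail a CI job when any High finding appears, although that workflow is not part of the documented scanner.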

---

## 🔒 Notes
## 🔒 Notes and Limitations

1. **False Positives**: The scanner uses regex and NLP. Carefully tune `patterns.json` to reduce mismatches.
2. **Performance**: Large files or directories with OCR can be resource-intensive. Use efficient hardware.
3. **File Type Support**: By default, only common formats are supported. Extend `file_handler.py` for additional types.

For serious cases (e.g., accidental secret leaks), follow the [Security Policy](SECURITY.md).

---

## 🌟 Contributing

We welcome contributions! Fork the repository, create a feature branch, and submit a pull request.
Please adhere to our [Code of Conduct](CODE_OF_CONDUCT.md).
Copilot AI Jan 15, 2026

The README references a CODE_OF_CONDUCT.md file, but this file does not exist in the asset-scanner directory. This will result in a broken link for users viewing the asset-scanner README.

Suggested change
Please adhere to our [Code of Conduct](CODE_OF_CONDUCT.md).
Please adhere to our [Code of Conduct](../CODE_OF_CONDUCT.md).


---

- Regex-based scanning may produce **false positives**; tune `patterns.json` to your needs.
## 📄 License
This project is licensed under the [MIT License](LICENSE).
See the `LICENSE` file for full details.
Comment on lines +213 to +214
Copilot AI Jan 15, 2026

The README references a LICENSE file with a relative link, but no LICENSE file exists in the asset-scanner directory. This will result in a broken link for users viewing the asset-scanner README.

Suggested change
This project is licensed under the [MIT License](LICENSE).
See the `LICENSE` file for full details.
This project is licensed under the MIT License.
See the `LICENSE` file in the repository root for full details.
