136 changes: 136 additions & 0 deletions Asset-Assessment-Scanner-V1/README.md
@@ -0,0 +1,136 @@
# Redback Ethics Asset Scanner

The **Asset Scanner** is a Python-based tool for detecting sensitive information (PII, secrets, credentials, etc.) in documents and media.
It is designed for educational use in cybersecurity and ethics modules.

---

## 📂 Project Structure

- `scanner.py` – Main entry point for scanning files and generating reports.
- `scan_media.py` – Scans image/PDF inputs using OCR (`ocr_engine.py`).
- `file_handler.py` – Handles input files and preprocessing.
- `ocr_engine.py` – OCR engine wrapper for text extraction from images.
- `reporter.py` – Builds structured scan results and output reports.
- `patterns.json` – Regex patterns for detecting sensitive items.
- `risk_rules.json` – Maps detected patterns to risk levels, compliance references, and remediation tips.

---

## ⚙️ Setup

1. Clone the repository:
```bash
git clone https://github.com/<your-repo>/redback-ethics.git
cd redback-ethics/asset-scanner
```

2. Create and activate a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

---

## 🚀 Usage

To scan a document:
```bash
python scanner.py --file "/path/to/document.docx"
```

To scan an image or PDF (OCR enabled):
```bash
python scan_media.py --file "/path/to/image_or_pdf"
```

To scan a directory:
```bash
python scanner.py --root "/path/to/folder"
```
Alternatively, if you run `scanner.py` without a `--file` or `--root` argument, you will be prompted to enter a directory at runtime.

Output will include:
- Detected matches with line context
- Risk level (from `risk_rules.json`)
- Mitigation tips and relevant compliance frameworks

---

## ⚡ Command-Line Arguments

The scanner supports several arguments to control input and behaviour:

| Argument | Type | Description | Example |
|----------|------|-------------|---------|
| `--file` | Path | Scan a single file (e.g., `.docx`, `.pdf`, `.png`). | `python scanner.py --file "/path/to/document.docx"` |
| `--root` | Path | Recursively scan all files within a directory. | `python scanner.py --root "/path/to/folder"` |
| `--patterns` | Path | Custom path to `patterns.json`. Useful if you want to override defaults. | `python scanner.py --file test.docx --patterns ./configs/patterns.json` |
| `--out` | Path | File to write structured scan results (JSON or text depending on implementation). | `python scanner.py --root ./docs --out results.json` |
| `--no-console` | Flag | Suppress console output. Results will only be written to the output file. | `python scanner.py --root ./docs --no-console --out results.json` |

### Common Usage Examples

Scan one file:
```bash
python scanner.py --file "/Users/alice/Documents/report.docx"
```

Recursively scan a directory:
```bash
python scanner.py --root "/Users/alice/Documents/sensitive_documents"
```
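
A minimal `argparse` sketch consistent with the flags documented above (the shipped `scanner.py` may wire these up differently):

```python
import argparse

def parse_args():
    # Flags mirror the table above; help text and defaults are illustrative assumptions.
    parser = argparse.ArgumentParser(description="Scan files for sensitive information.")
    parser.add_argument("--file", help="Scan a single file (e.g. .docx, .pdf, .png).")
    parser.add_argument("--root", help="Recursively scan all files under this directory.")
    parser.add_argument("--patterns", default="patterns.json",
                        help="Path to a custom patterns.json.")
    parser.add_argument("--out", help="Write structured scan results to this file.")
    parser.add_argument("--no-console", action="store_true",
                        help="Suppress console output; write results to --out only.")
    return parser.parse_args()
```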

---

## 🛡️ Configuration

- **`patterns.json`**: Defines regex patterns for items like emails, API keys, driver’s licence numbers, etc.
  Each entry specifies:
  - `pattern`: regex string
  - `risk`: risk level
  - `description`: human-readable explanation

- **`risk_rules.json`**: Associates each pattern with:
  - `level`: severity (Low/Medium/High)
  - `tip`: recommended mitigation
  - `compliance`: legal/regulatory references

You can extend these files to detect new types of data.
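
As a rough sketch of how these configs might be consumed (field names follow the structure described above; `risk_rules.json` is assumed to be keyed by pattern name, and the shipped `scanner.py` may differ):

```python
import json
import re

def load_patterns(path="patterns.json"):
    # Each entry carries "pattern", "risk" and "description", as described above.
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {name: (re.compile(entry["pattern"]), entry) for name, entry in raw.items()}

def scan_text(text, patterns, rules):
    # "rules" is assumed to be risk_rules.json keyed by pattern name,
    # with "level", "tip" and "compliance" fields.
    findings = []
    for name, (regex, entry) in patterns.items():
        for match in regex.finditer(text):
            finding = {"type": name, "match": match.group(0), "risk": entry["risk"]}
            finding.update(rules.get(name, {}))
            findings.append(finding)
    return findings
```

A new entry added to `patterns.json` (with a matching `risk_rules.json` record) is picked up automatically by a loader like this.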

---

## 📝 Example

Scanning a document containing:

```
Email: alice@example.com
Password: "SuperSecret123"
```

Would output:

```
[Email] -> Medium Risk
Tip: Mask or obfuscate emails in logs/code unless strictly required.
Compliance: Privacy Act 1988 (Cth) — APP 11

[Password] -> High Risk
Tip: Remove hard-coded passwords; rotate immediately; use env vars or a vault.
Compliance: GDPR Art. 32 — Security of processing
```

---

## 🔒 Notes

- Regex-based scanning may produce **false positives**; tune `patterns.json` to your needs.
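
For instance, the `credit_card` entry in `patterns.json` recommends a Luhn check in code; a small post-filter along these lines (a sketch, not part of the shipped scanner) removes most random digit runs that merely look like card numbers:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        # Double every second digit from the right; fold results above 9 back into one digit.
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        checksum += d
    return checksum % 10 == 0

# Example: keep only candidate matches that pass the checksum.
candidates = ["4111111111111111", "1234567812345678"]
print([c for c in candidates if luhn_valid(c)])  # ['4111111111111111']
```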
30 changes: 30 additions & 0 deletions Asset-Assessment-Scanner-V1/file_handler.py
@@ -0,0 +1,30 @@
import os
from docx import Document
# Import extract_text_from_file for PDF and image support
from scan_media import extract_text_from_file

def find_files(directory, exts=None):
    # Walk the directory tree and collect files matching the given extensions (all files if none given).
    exts = exts or []
    matches = []
    for dirpath, _, filenames in os.walk(directory):
        for fn in filenames:
            if not exts or any(fn.lower().endswith(e) for e in exts):
                matches.append(os.path.join(dirpath, fn))
    return matches

def read_file(path):
    # Dispatch on file type: DOCX via python-docx, PDFs/images via OCR, everything else as plain text.
    lower_path = path.lower()
    if lower_path.endswith('.docx'):
        try:
            doc = Document(path)
            return '\n'.join([p.text for p in doc.paragraphs])
        except Exception as e:
            return f"[Error reading DOCX: {e}]"
    elif lower_path.endswith('.pdf') or lower_path.endswith(('.png', '.jpg', '.jpeg', '.tiff', '.tif', '.bmp', '.webp')):
        try:
            return extract_text_from_file(path)
        except Exception as e:
            return f"[Error extracting text from media: {e}]"
    else:
        with open(path, encoding="utf-8", errors="ignore") as f:
            return f.read()
92 changes: 92 additions & 0 deletions Asset-Assessment-Scanner-V1/ocr_engine.py
@@ -0,0 +1,92 @@
from __future__ import annotations
from dataclasses import dataclass
from typing import List, Optional, Tuple
from pathlib import Path
import re

import numpy as np
from PIL import Image
import pytesseract
import cv2

try:
    from pdf2image import convert_from_path
    PDF2IMAGE_AVAILABLE = True
except Exception:
    PDF2IMAGE_AVAILABLE = False

@dataclass
class OCRConfig:
    dpi: int = 300
    deskew: bool = True
    binarize: bool = True
    oem: int = 3
    psm: int = 3
    lang: str = "eng"

def _to_cv(img: Image.Image) -> np.ndarray:
    return cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)

def _to_pil(arr: np.ndarray) -> Image.Image:
    return Image.fromarray(cv2.cvtColor(arr, cv2.COLOR_BGR2RGB))

def _normalize_dpi(img: Image.Image, target_dpi: int) -> Image.Image:
    # Upscale low-DPI images so Tesseract has enough resolution to work with.
    dpi = img.info.get("dpi", (target_dpi, target_dpi))[0]
    if dpi < target_dpi:
        scale = target_dpi / dpi
        new_size = (int(img.width * scale), int(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
        img.info["dpi"] = (target_dpi, target_dpi)
    return img

def _deskew(cv_img: np.ndarray) -> np.ndarray:
    # Estimate the text skew angle from the thresholded foreground and rotate to correct it.
    gray = cv2.cvtColor(cv_img, cv2.COLOR_BGR2GRAY)
    gray = cv2.bitwise_not(gray)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0))
    if coords.size == 0:
        return cv_img
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = cv_img.shape[:2]
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    rotated = cv2.warpAffine(cv_img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated

def _binarize(cv_img: np.ndarray) -> np.ndarray:
    # Adaptive thresholding copes better with uneven lighting than a global threshold.
    gray = cv2.cvtColor(cv_img, cv2.COLOR_BGR2GRAY)
    thr = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 35, 11)
    return cv2.cvtColor(thr, cv2.COLOR_GRAY2BGR)

def preprocess_image(img: Image.Image, cfg: OCRConfig) -> Image.Image:
    # Apply DPI normalisation, then optional deskew and binarisation, before OCR.
    img = _normalize_dpi(img, cfg.dpi)
    cv_img = _to_cv(img)
    if cfg.deskew:
        cv_img = _deskew(cv_img)
    if cfg.binarize:
        cv_img = _binarize(cv_img)
    return _to_pil(cv_img)

def _tesseract_args(cfg: OCRConfig) -> str:
    return f"--oem {cfg.oem} --psm {cfg.psm}"

def ocr_image(img: Image.Image, cfg: Optional[OCRConfig] = None) -> str:
    cfg = cfg or OCRConfig()
    img_p = preprocess_image(img, cfg)
    text = pytesseract.image_to_string(img_p, lang=cfg.lang, config=_tesseract_args(cfg))
    return text.strip()

def pdf_to_images(pdf_path: str | Path, dpi: int = 300) -> List[Image.Image]:
    if not PDF2IMAGE_AVAILABLE:
        raise RuntimeError("pdf2image not available or poppler missing.")
    return convert_from_path(str(pdf_path), dpi=dpi)

def ocr_pdf(pdf_path: str | Path, cfg: Optional[OCRConfig] = None) -> Tuple[str, List[str]]:
    cfg = cfg or OCRConfig()
    pages = pdf_to_images(pdf_path, dpi=cfg.dpi)
    page_texts = [ocr_image(p, cfg) for p in pages]
    return "\n".join(page_texts), page_texts
102 changes: 102 additions & 0 deletions Asset-Assessment-Scanner-V1/patterns.json
@@ -0,0 +1,102 @@
{
"email": {
"pattern": "[a-zA-Z0-9+._%-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,63}",
"risk": "Medium",
"description": "Email address"
},
"aws_access_key": {
"pattern": "\\bAKIA[0-9A-Z]{16}\\b",
"risk": "High",
"description": "AWS Access Key"
},
"aws_secret_access_key": {
"pattern": "(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])",
"risk": "High",
"description": "AWS Secret Access Key (40-char base64-like)"
},
"gcp_service_account_key": {
"pattern": "-----BEGIN PRIVATE KEY-----[\\s\\S]+?-----END PRIVATE KEY-----",
"risk": "High",
"description": "GCP Service Account Private Key"
},
"azure_client_secret": {
"pattern": "(?i)(?:\\bclient[-_ ]?secret\\b|\\bazure[-_ ]?secret\\b|\\bapp[-_ ]?registration[-_ ]?secret\\b)\\s*[:=]\\s*['\"]?[A-Za-z0-9+/_\\-=]{20,128}['\"]?",
"risk": "High",
"description": "Azure client secret only when labelled"
},
"ssh_private_key": {
"pattern": "-----BEGIN (?:RSA|DSA|EC|OPENSSH) PRIVATE KEY-----[\\s\\S]+?-----END (?:RSA|DSA|EC|OPENSSH) PRIVATE KEY-----",
"risk": "High",
"description": "SSH Private Key"
},
"jwt_secret": {
"pattern": "\\b[A-Za-z0-9_-]{10,}\\.([A-Za-z0-9_-]{10,})\\.([A-Za-z0-9_-]{10,})\\b",
"risk": "High",
"description": "JWT token (header.payload.signature)"
},
"api_token": {
"pattern": "(?i)(?:\\bapi[-_ ]?token\\b|\\bapi[-_ ]?key\\b|\\baccess[-_ ]?token\\b|\\bsecret\\b)\\s*[:=]\\s*['\"]?[A-Za-z0-9._\\-]{20,}['\"]?|\\bAuthorization\\s*:\\s*Bearer\\s+[A-Za-z0-9._\\-]{20,}\\b",
"risk": "Medium",
"description": "Generic API token / key when explicitly labelled or in an Authorization header"
},
"password": {
"pattern": "(?i)\\bpassword\\s*[:=]\\s*['\"][^'\"\\r\\n]+['\"]",
"risk": "High",
"description": "Hard-coded password in labelled field"
},
"credit_card": {
"pattern": "\\b(?:4\\d{12}(?:\\d{3})?|5[1-5]\\d{14}|3[47]\\d{13}|6(?:011|5\\d{2})\\d{12})\\b",
"risk": "High",
"description": "Common card brands (Luhn check recommended in code)"
},
"ssn": {
"pattern": "\\b\\d{3}-\\d{2}-\\d{4}\\b",
"risk": "High",
"description": "US Social Security Number"
},
"phone_number": {
"pattern": "\\b04\\d{2}\\s?\\d{3}\\s?\\d{3}\\b",
"risk": "Medium",
"description": "Australian mobile number (04## ### ###)"
},
"ip_address": {
"pattern": "\\b(?:(?:25[0-5]|2[0-4]\\d|1\\d\\d|\\d?\\d)\\.){3}(?:25[0-5]|2[0-4]\\d|1\\d\\d|\\d?\\d)\\b",
"risk": "Low",
"description": "IPv4 address (0–255 octets)"
},
"database_connection_string": {
"pattern": "(?i)\\b(?:jdbc:[^\\s'\";]+|postgresql://[^\\s'\";]+|mysql://[^\\s'\";]+|mongodb:(?:\\+srv)?:[^\\s'\";]+)\\b",
"risk": "High",
"description": "Database connection string"
},
"tfn": {
"pattern": "\\b\\d{3}\\s?\\d{3}\\s?\\d{3}\\b",
"risk": "High",
"description": "Australian Tax File Number (apply checksum in code)"
},
"medicare_number": {
"pattern": "\\b\\d{4}\\s?\\d{5}\\s?\\d{1}(?:\\s?\\d)?\\b",
"risk": "High",
"description": "Medicare card number (10 digits + optional 1-digit IRN)"
},
"drivers_licence_number": {
"pattern": "(?i)\\bdriver'?s?\\s*licen[cs]e(?:\\s*(?:no\\.?|number|#))?\\s*[:#-]?\\s*([A-Z0-9]{6,10})\\b",
"risk": "High",
"description": "AUS driver’s licence number only when explicitly labelled"
},
"address_au": {
"pattern": "(?is)\\b\\d{1,5}\\s+[A-Za-z][A-Za-z’'\\-\\. ]+\\s+(?:St|Street|Rd|Road|Ave|Avenue|Blvd|Boulevard|Dr|Drive|Ln|Lane|Ct|Court|Pl|Place|Pde|Parade|Ter|Terrace|Way)\\b(?:,\\s*[A-Za-z][A-Za-z ’'\\-]+)?(?:,\\s*(?:VIC|NSW|QLD|SA|WA|TAS|ACT|NT))?(?:\\s+\\d{4})?(?!.{0,200}(?:\\bfull[_\\s-]?name\\b|\\bname\\b|[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,63}|\\+?[1-9]\\d{8,14}|\\bTFN\\b|\\bMedicare\\b|licen[cs]e|driver))",
"risk": "Low",
"description": "Australian street address (standalone)"
},
"address_au_with_pii": {
"pattern": "(?is)\\b\\d{1,5}\\s+[A-Za-z][A-Za-z’'\\-\\. ]+\\s+(?:St|Street|Rd|Road|Ave|Avenue|Blvd|Boulevard|Dr|Drive|Ln|Lane|Ct|Court|Pl|Place|Pde|Parade|Ter|Terrace|Way)\\b(?:,\\s*[A-Za-z][A-Za-z ’'\\-]+)?(?:,\\s*(?:VIC|NSW|QLD|SA|WA|TAS|ACT|NT))?(?:\\s+\\d{4})?(?=.{0,200}(?:\\bfull[_\\s-]?name\\b|\\bname\\b|[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,63}|\\+?[1-9]\\d{8,14}|\\bTFN\\b|\\bMedicare\\b|licen[cs]e|driver))",
"risk": "High",
"description": "Australian street address near other identifiers (name/email/phone/ID)"
},
"name_full": {
"pattern": "(?i)\\b(?:full[_\\s-]?name|name|first[_\\s-]?name|last[_\\s-]?name)\\s*[:=]\\s*['\"]?[A-Z][a-z]+(?:[ -][A-Z][a-z]+){1,3}['\"]?",
"risk": "Low",
"description": "Full name in a labelled field"
}
}