Static malware analysis pipeline integrating PE-header feature engineering, entropy profiling, import-based behavioral signals, and ML classification (Random Forest + CNN/RNN-ready features). Includes PE header parser, feature extractor, dataset builder, and baseline malware-vs-benign classifier.

atharimran728/malware-static-analysis-ML-engine

Malware Static Analysis Engine (PE Feature Engineering + ML Classifier)

This project implements a full static-analysis workflow for Windows PE malware detection.
It combines PE header parsing, entropy-based section profiling, API import analysis, and a machine learning classifier trained on engineered features.

Core Components

1. PE Header Parser (PE_header_parser.py)

  • Extracts raw PE metadata.
  • Dumps section names, virtual/raw sizes, and entropy.
  • Enumerates DLL imports and API calls.
  • Designed for analyst-side sample inspection.
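The section entropy mentioned above is typically Shannon entropy over the raw section bytes. The following is a minimal sketch of that calculation, not the code from `PE_header_parser.py`; the function name `shannon_entropy` is a hypothetical choice for illustration.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return Shannon entropy in bits per byte, from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(data).values()
    )

constant_entropy = shannon_entropy(b"\x00" * 1024)    # a single repeated byte
uniform_entropy = shannon_entropy(bytes(range(256)))  # every byte value once
print(constant_entropy, uniform_entropy)
```

Values close to 8.0 indicate near-random content, which is why compressed or encrypted sections tend to stand out in an entropy dump.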

2. Static Malware Classifier (static_malware_classifier.py)

Feature engineering pipeline that extracts:

  • File size
  • Section entropy statistics (mean/max/std)
  • Suspicious API usage (VirtualAlloc, WriteProcessMemory, CreateRemoteThread, etc.)
  • DLL diversity
  • Compile timestamp (year)
  • Import volume and behavioral indicators

The script then exports the dataset, trains a Random Forest, and prints a classification report.
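As a rough illustration of how the features listed above could be derived once a PE file has been parsed, here is a minimal sketch. It is not taken from `static_malware_classifier.py`: the helper `build_feature_row`, its input shapes, exact-name API matching, and the use of the population standard deviation are all assumptions. The suspicious API set contains only the three names given in this README.

```python
import statistics

# Only the three APIs named in this README; the script's full list is not shown here.
SUSPICIOUS_APIS = {"VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"}

def build_feature_row(file_size, section_entropies, imports, compile_year):
    """Hypothetical helper: turn already-parsed PE data into one feature dict.

    imports maps a DLL name to the list of API names imported from it.
    """
    api_names = [api for apis in imports.values() for api in apis]
    has_sections = bool(section_entropies)
    return {
        "file_size": file_size,
        "mean_entropy": statistics.fmean(section_entropies) if has_sections else 0.0,
        "max_entropy": max(section_entropies, default=0.0),
        # Population standard deviation is an assumption about the estimator.
        "entropy_std": statistics.pstdev(section_entropies) if has_sections else 0.0,
        "import_count": len(api_names),
        # Exact-name matching is an assumption; variants such as suffixed names are ignored.
        "suspicious_api_count": sum(api in SUSPICIOUS_APIS for api in api_names),
        # DLL names are case-insensitive on Windows, so normalize before counting.
        "unique_dll_count": len({dll.lower() for dll in imports}),
        "compile_year": compile_year,
    }

# Placeholder inputs for illustration only.
row = build_feature_row(
    file_size=4096,
    section_entropies=[6.0, 7.0, 8.0],
    imports={
        "KERNEL32.dll": ["VirtualAlloc", "CreateFileW", "WriteProcessMemory"],
        "USER32.dll": ["MessageBoxW"],
    },
    compile_year=2020,
)
print(row)
```

The keys mirror the column names of `pe_features.csv`, minus `label`, which depends on the directory the sample came from rather than on the file itself.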

3. Dataset (pe_features.csv)

Structure:

file_size, mean_entropy, max_entropy, entropy_std,
import_count, suspicious_api_count, unique_dll_count,
compile_year, label
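A file with this column layout can be produced with the standard `csv` module. The sketch below writes one row to an in-memory buffer and reads it back; the helper name `write_rows`, the label encoding (1 = malware, 0 = benign), and the row values are assumptions for illustration, not contents of the real `pe_features.csv`.

```python
import csv
import io

COLUMNS = [
    "file_size", "mean_entropy", "max_entropy", "entropy_std",
    "import_count", "suspicious_api_count", "unique_dll_count",
    "compile_year", "label",
]

def write_rows(rows, handle):
    """Write feature dicts as CSV with the header order shown above."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

# Placeholder values for illustration only, not taken from pe_features.csv.
example_row = {
    "file_size": 4096, "mean_entropy": 7.0, "max_entropy": 8.0,
    "entropy_std": 0.8165, "import_count": 4, "suspicious_api_count": 2,
    "unique_dll_count": 2, "compile_year": 2020,
    "label": 1,  # assumed encoding: 1 = malware, 0 = benign
}

buffer = io.StringIO()
write_rows([example_row], buffer)
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0])
```

Note that `csv.DictReader` returns every field as a string, so numeric columns need converting before they are passed to a model.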

Sample count in this demo dataset:

  • Malware: 1
  • Benign: 1
    (Expandable: the script supports processing full directories of samples.)
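Growing the dataset amounts to walking the two sample directories and attaching a label to each file. A minimal sketch of that step follows, assuming the `malware_samples/` and `benign_samples/` layout from the project structure; the function name `collect_labeled_samples` and the label encoding are hypothetical, and the demo uses a temporary directory with dummy files rather than real samples.

```python
import tempfile
from pathlib import Path

def collect_labeled_samples(malware_dir, benign_dir):
    """Return (path, label) pairs; 1 = malware, 0 = benign (assumed encoding)."""
    samples = []
    for directory, label in ((malware_dir, 1), (benign_dir, 0)):
        # Sort for a deterministic order across runs and platforms.
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                samples.append((path, label))
    return samples

# Demo with dummy two-byte files in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    malware_dir = Path(root, "malware_samples")
    benign_dir = Path(root, "benign_samples")
    malware_dir.mkdir()
    benign_dir.mkdir()
    (malware_dir / "a.exe").write_bytes(b"MZ")
    (benign_dir / "b.exe").write_bytes(b"MZ")
    labeled = [
        (path.name, label)
        for path, label in collect_labeled_samples(malware_dir, benign_dir)
    ]

print(labeled)
```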

How It Works

  1. Parse PE

    python PE_header_parser.py
  2. Extract Features + Train Model

    python static_malware_classifier.py
  3. Outputs

    • pe_features.csv: engineered dataset
    • Classification metrics
    • Full import/section dump for each sample

Why This Matters

Static analysis is the first line of triage in SOC and IR workflows.
This project automates the extraction of structural and behavioral signals directly from the binary, so no sample ever has to be executed.

SOC/IR Applications:

  • Quick risk scoring of suspicious executables
  • Detecting anomalous entropy patterns (packing/obfuscation)
  • Identifying malware-like import behavior
  • Building ML-assisted pre-sandbox triage engines
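As one example of the entropy-based triage mentioned above, sections whose entropy exceeds a cutoff can be flagged for a closer look. This is a heuristic sketch only: the function `flag_high_entropy_sections`, the 7.0 bits-per-byte threshold, and the section values are assumptions, and high entropy alone does not prove packing.

```python
def flag_high_entropy_sections(section_entropy, threshold=7.0):
    """Return names of sections whose entropy exceeds the threshold.

    section_entropy maps a section name to its entropy in bits per byte.
    The default threshold is an assumed rule of thumb, not a value from this repo.
    """
    return sorted(
        name for name, entropy in section_entropy.items() if entropy > threshold
    )

# Placeholder values for illustration only.
flagged = flag_high_entropy_sections({".text": 6.2, ".rsrc": 4.1, "UPX1": 7.9})
print(flagged)
```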

This repo is structured so it can be expanded into:

  • CNN/RNN models using byte sequences
  • Hybrid static+dynamic classifiers
  • YARA-SVM hybrid detection pipeline

Project Structure

PE_header_parser.py
static_malware_classifier.py
pe_features.csv
malware_samples/
benign_samples/
README.md
