Implement binary morphing framework with advanced analysis, devirtualization, and mutation capabilities #1
[PATCH] r2morph: Add Phase 2 binary analysis and devirtualization framework
This patch series implements Phase 2 of the r2morph binary morphing framework,
adding comprehensive binary analysis capabilities and devirtualization support.
The implementation focuses on analyzing virtualized and obfuscated binaries,
providing tools for reverse engineering protected code, and supporting advanced
mutation techniques for research and security analysis purposes.
Core additions:
Symbolic Execution Engine
The symbolic execution framework provides comprehensive path exploration and
constraint solving capabilities for complex binary analysis. The implementation
integrates with the angr binary analysis platform while maintaining independence
through a bridge architecture that allows fallback to native implementations.
The path explorer implements a worklist-based algorithm with configurable
exploration strategies including depth-first, breadth-first, and coverage-guided
exploration. State management handles symbolic memory, registers, and constraints
with copy-on-write optimization for memory efficiency during path explosion.
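As a rough sketch of the exploration loop (the state object, its `step` expansion function, and the `new_coverage` attribute are hypothetical stand-ins, not the actual r2morph API):
```python
import collections
from typing import Callable, List

class PathExplorer:
    """Worklist-based explorer; states and step() are hypothetical stand-ins."""

    def __init__(self, initial_state, step: Callable, strategy: str = "bfs"):
        self.step = step                       # expands one state into successors
        self.strategy = strategy
        self.worklist = collections.deque([initial_state])
        self.finished: List = []

    def _pick(self):
        if self.strategy == "dfs":
            return self.worklist.pop()         # LIFO: depth-first
        if self.strategy == "coverage":
            # coverage-guided: prefer the state that reached the most new blocks
            best = max(self.worklist, key=lambda s: s.new_coverage)
            self.worklist.remove(best)
            return best
        return self.worklist.popleft()         # FIFO: breadth-first

    def run(self, max_states: int = 10_000):
        explored = 0
        while self.worklist and explored < max_states:
            state = self._pick()
            successors = self.step(state)      # symbolically execute one block
            if not successors:
                self.finished.append(state)    # path terminated (ret or unsat)
            self.worklist.extend(successors)
            explored += 1
        return self.finished
```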
Constraint solving utilizes Z3 as the primary SMT solver with support for
bitvector operations, memory models, and mixed Boolean-arithmetic (MBA)
expressions. The solver can handle complex obfuscation patterns including
opaque predicates and virtualized arithmetic operations.
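For example, an always-true branch condition can be confirmed as an opaque predicate by asking Z3 whether its negation is satisfiable; the predicate below is a classic always-even pattern chosen purely for illustration:
```python
from z3 import BitVec, Not, Solver, URem, unsat

# If the negation of a recovered branch condition is unsatisfiable, the
# branch is an opaque predicate and its untaken side is dead code.
x = BitVec("x", 32)
predicate = URem(x * (x + 1), 2) == 0      # x*(x+1) is always even

s = Solver()
s.add(Not(predicate))                      # look for a counterexample
if s.check() == unsat:
    print("opaque predicate: condition always holds")
else:
    print("genuine branch, counterexample:", s.model())
```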
Integration with the Syntia framework enables semantic learning from instruction
sequences, allowing the system to recognize equivalent code patterns and generate
simplified representations of complex obfuscated constructs.
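The Syntia API itself is not shown here, but the underlying idea of I/O-based semantic learning can be sketched independently: sample random inputs, record the obfuscated handler's outputs, and keep the candidate expression that matches every sample. The handler and candidate set below are toy stand-ins:
```python
import random

MASK = 0xFFFFFFFF

def obfuscated_handler(x, y):
    # toy stand-in for a VM handler; (x ^ y) + 2*(x & y) is an MBA form of x + y
    return ((x ^ y) + 2 * (x & y)) & MASK

CANDIDATES = {
    "x + y": lambda x, y: (x + y) & MASK,
    "x - y": lambda x, y: (x - y) & MASK,
    "x ^ y": lambda x, y: (x ^ y) & MASK,
    "x & y": lambda x, y: (x & y) & MASK,
}

samples = [(random.getrandbits(32), random.getrandbits(32)) for _ in range(64)]
observed = [obfuscated_handler(x, y) for x, y in samples]

for name, fn in CANDIDATES.items():
    if all(fn(x, y) == out for (x, y), out in zip(samples, observed)):
        print("handler is semantically equivalent to", name)   # prints "x + y"
```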
VM Handler Analysis and Devirtualization
The devirtualization engine implements a multi-stage analysis pipeline for
identifying and reversing virtual machine-based code protection schemes. The
system handles various virtualization architectures including stack-based,
register-based, and hybrid virtual machines.
VM Handler Identification Pipeline
The dispatcher detection phase analyzes control flow graphs to identify
characteristic patterns of VM dispatchers, including indirect jumps with
large successor counts and table-based instruction decoding. Handler table
discovery examines memory references from dispatcher code to locate handler
address tables, validating entries by attempting disassembly and checking
for valid code patterns.
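A simplified illustration of the dispatcher-candidate scan, using Capstone rather than the project's own analysis objects; the real pipeline would additionally confirm candidates via CFG successor counts:
```python
from capstone import Cs, CS_ARCH_X86, CS_MODE_64
from capstone.x86 import X86_OP_MEM, X86_OP_REG

def find_dispatcher_candidates(code: bytes, base: int):
    """Flag indirect jumps (jmp reg / jmp [table + reg*8]) as dispatcher
    candidates; a full pipeline also checks CFG successor counts."""
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    md.detail = True
    hits = []
    for insn in md.disasm(code, base):
        if insn.mnemonic != "jmp":
            continue
        op = insn.operands[0]
        if op.type in (X86_OP_REG, X86_OP_MEM):
            hits.append(insn.address)
    return hits

# jmp qword ptr [rax*8 + 0x402000] -- a typical table-based dispatch
code = bytes.fromhex("ff24c500204000")
print([hex(a) for a in find_dispatcher_candidates(code, 0x401000)])
```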
Handler extraction reads the discovered tables to enumerate all VM handlers,
performing basic validation to filter out invalid entries. Each handler
undergoes individual analysis including instruction sequence extraction,
pattern matching against known operation types, and confidence scoring.
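A rough sketch of the table-reading and validation step, assuming a flat file-to-memory mapping and illustrative pointer width and thresholds:
```python
import struct
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

def read_handler_table(image: bytes, table_off: int, count: int, image_base: int):
    """Read 64-bit handler pointers and keep those that disassemble cleanly;
    the flat mapping and the 4-instruction threshold are simplifications."""
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    handlers = []
    for i in range(count):
        (ptr,) = struct.unpack_from("<Q", image, table_off + i * 8)
        file_off = ptr - image_base
        if not 0 <= file_off < len(image):
            continue                            # pointer outside the image
        insns = list(md.disasm(image[file_off:file_off + 32], ptr))
        if len(insns) >= 4:                     # crude "looks like code" check
            handlers.append(ptr)
    return handlers
```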
Iterative Deobfuscation Process
The iterative simplification engine applies multiple deobfuscation techniques
in coordinated passes until convergence or maximum iteration limits. Each
pass applies MBA solving to simplify mixed Boolean-arithmetic expressions,
followed by control flow analysis to identify and eliminate dead code,
opaque predicates, and unnecessary control flow transfers.
Checkpoint creation occurs at regular intervals, allowing rollback when
simplification passes fail to make progress or introduce errors. The system
maintains metrics on simplification effectiveness and automatically adjusts
strategies based on observed results.
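In outline, the convergence loop might look like the following; the IR object and the pass callables are hypothetical placeholders for the real pipeline components:
```python
import copy

def simplify_until_fixpoint(ir, passes, max_iterations=20):
    """Run deobfuscation passes until none reports progress, restoring a
    checkpoint whenever a pass fails; ir and the passes are placeholders."""
    for _ in range(max_iterations):
        progressed = False
        for run_pass in passes:
            checkpoint = copy.deepcopy(ir)      # snapshot before each pass
            try:
                changed = run_pass(ir)          # pass mutates ir, reports progress
            except Exception:
                ir = checkpoint                 # pass broke something: roll back
                continue
            progressed = progressed or changed
        if not progressed:
            break                               # convergence reached
    return ir
```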
Mixed Boolean-arithmetic (MBA) solving identifies arithmetic expressions that
have been obfuscated through Boolean operations, applying algebraic
simplification rules to recover the original arithmetic intent. The solver
handles common MBA patterns including linear expressions, polynomial
representations, and nested Boolean compositions.
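A concrete instance of such a pattern, checked here with Z3 rather than the project's own solver, shows why the rewrite is sound:
```python
from z3 import BitVecs, Solver, unsat

# (x ^ y) + 2*(x & y) is a common MBA rewrite of plain addition; proving the
# absence of a counterexample over 32-bit vectors justifies the simplification.
x, y = BitVecs("x y", 32)
obfuscated = (x ^ y) + 2 * (x & y)

s = Solver()
s.add(obfuscated != x + y)             # search for any differing input
assert s.check() == unsat              # none exists: the rewrite to x + y is sound
print("MBA pattern simplifies to x + y")
```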
Control Flow Obfuscation Simplification
Control flow flattening recovery implements pattern recognition for dispatcher-based
control flow obfuscation. The system identifies characteristic patterns including
state variable manipulation, indirect jumps through switch tables, and artificial
basic block splitting.
The simplification process reconstructs original control flow by analyzing state
transitions, identifying natural basic block boundaries, and rebuilding direct
control flow edges. Dead code elimination removes artificially introduced code
that serves no functional purpose beyond obfuscation.
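The edge-rebuilding step can be sketched as a lookup from recovered state constants to block addresses; both input maps are hypothetical products of earlier analysis stages:
```python
def rebuild_edges(state_to_block, block_to_next_states):
    """Turn dispatcher state transitions back into direct CFG edges.

    state_to_block: state constant -> basic-block address (from the switch table)
    block_to_next_states: block address -> state constants it writes before
    returning to the dispatcher. Both are hypothetical analysis outputs."""
    edges = []
    for block_addr, next_states in block_to_next_states.items():
        for state in next_states:
            target = state_to_block.get(state)
            if target is not None:
                edges.append((block_addr, target))   # dispatcher hop removed
    return edges

# toy example: three blocks chained 0x1000 -> 0x1040 -> 0x1080
blocks = {0x11: 0x1000, 0x22: 0x1040, 0x33: 0x1080}
writes = {0x1000: [0x22], 0x1040: [0x33], 0x1080: []}
print(rebuild_edges(blocks, writes))
```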
Binary Reconstruction and Rewriting
The binary rewriter handles the complex task of applying deobfuscation results
back to the original binary format. The system supports multiple binary formats
including PE, ELF, and Mach-O, handling format-specific requirements for
section layouts, relocation tables, and metadata updates.
Patch generation creates precise binary modifications to replace obfuscated
code with simplified equivalents. Address space management fits new code into
existing binary layouts, which may require code cave utilization or section
expansion.
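A minimal in-place patching sketch using r2pipe, assuming the simplified code already fits within the original bytes and padding the remainder with NOPs:
```python
import r2pipe

def apply_inplace_patch(binary_path, addr, new_bytes, original_len):
    """Overwrite original_len bytes at addr with new_bytes, padding the rest
    with NOPs so later instructions keep their addresses."""
    if len(new_bytes) > original_len:
        raise ValueError("patch does not fit in place; a code cave is needed")
    padded = new_bytes + b"\x90" * (original_len - len(new_bytes))  # x86 NOP
    r2 = r2pipe.open(binary_path, flags=["-w"])                     # open writable
    r2.cmd(f"wx {padded.hex()} @ {addr:#x}")                        # write hex at addr
    r2.quit()
```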
Metadata updates ensure that all binary format structures remain consistent
after modification, including import tables, exception handling data, and
debug information. Symbol table reconstruction updates function boundaries
and entry points to reflect the simplified code structure.
Anti-Analysis Technique Detection and Bypass
The anti-analysis framework provides detection and countermeasures for common
analysis evasion techniques. The implementation focuses on transparent
operation, allowing analysis to proceed unimpeded while the target binary
believes it is running in an ordinary, unmonitored environment.
Detection capabilities include identification of debugger presence checks,
virtual machine detection routines, sandbox environment detection, and timing-based
analysis detection. The system maintains a comprehensive database of known
detection patterns while supporting runtime learning of new techniques.
Bypass implementations provide runtime redirection of detection routines,
API hooking for environment spoofing, and registry/filesystem virtualization
to present convincing fake environments. The system operates at multiple
levels including user-mode API hooking, kernel-level redirection, and
hardware-assisted virtualization when available.
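As one illustration of the user-mode layer, a Frida script (embedded JavaScript driven from Python) can force kernel32!IsDebuggerPresent to report that no debugger is attached; the target process name is a placeholder, and the export-lookup helper may differ between Frida versions:
```python
import sys
import frida

JS = """
const isDebuggerPresent = Module.findExportByName('kernel32.dll', 'IsDebuggerPresent');
Interceptor.attach(isDebuggerPresent, {
    onLeave(retval) {
        retval.replace(0);   // always report "no debugger attached"
    }
});
"""

session = frida.attach("protected_target.exe")   # placeholder process name
script = session.create_script(JS)
script.load()
sys.stdin.read()                                 # keep the hook alive
```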
Dynamic Instrumentation Framework
The Frida-based instrumentation system provides comprehensive runtime analysis
capabilities including function hooking, code coverage tracking, and behavioral
monitoring. The framework integrates closely with the static analysis components
to provide hybrid analysis capabilities.
Runtime hook management allows selective interception of function calls with
configurable filtering based on module, function name, or address ranges.
Code coverage tracking provides real-time feedback on analysis completeness
and helps guide symbolic execution path selection.
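One possible shape for such a filter, with glob patterns for module and function names plus an optional address range; the field names are illustrative, not the actual r2morph interface:
```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional, Tuple

@dataclass
class HookFilter:
    """Selects which functions get instrumented; every field is optional."""
    module: Optional[str] = None                 # glob, e.g. "libcrypto*"
    name: Optional[str] = None                   # glob, e.g. "EVP_*"
    addr_range: Optional[Tuple[int, int]] = None

    def matches(self, module: str, name: str, addr: int) -> bool:
        if self.module and not fnmatch(module, self.module):
            return False
        if self.name and not fnmatch(name, self.name):
            return False
        if self.addr_range and not self.addr_range[0] <= addr < self.addr_range[1]:
            return False
        return True

# hook every EVP_* routine in OpenSSL, nothing else
flt = HookFilter(module="libcrypto*", name="EVP_*")
print(flt.matches("libcrypto.so.3", "EVP_EncryptUpdate", 0x7F0012345678))  # True
```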
The instrumentation framework supports both automatic and manual hook placement,
with automatic placement guided by static analysis results and manual placement
supporting researcher-directed investigation of specific behaviors.
Enhanced Validation and Benchmarking
The validation framework provides comprehensive testing and benchmarking
capabilities for all analysis components. The system includes unit tests
for individual algorithms, integration tests for complete analysis pipelines,
and performance benchmarks for scalability assessment.
Regression testing validates analysis results against known samples with
verified ground truth, ensuring that changes to analysis algorithms do not
introduce false positives or reduce detection capabilities. The framework
supports automated test case generation through fuzzing and mutation testing.
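A hedged sketch of such a regression check using pytest; the entry point, sample paths, and JSON keys are hypothetical:
```python
import json
import pytest

from r2morph.pipeline import analyze_binary   # hypothetical entry point

# Each sample ships a ground-truth JSON listing the artifacts a correct
# analysis must recover; paths and keys here are illustrative only.
SAMPLES = ["tests/samples/custom_vm.json", "tests/samples/flattened.json"]

@pytest.mark.parametrize("meta_path", SAMPLES)
def test_handlers_match_ground_truth(meta_path):
    with open(meta_path) as fh:
        meta = json.load(fh)
    result = analyze_binary(meta["binary"])
    # no lost detections, no new false positives
    assert set(result["handlers"]) == set(meta["expected_handlers"])
```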
Performance benchmarking provides detailed metrics on analysis speed, memory
usage, and scalability characteristics. The system supports both synthetic
benchmarks and real-world sample analysis with comprehensive profiling data
collection.
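A simple timing and Python-heap measurement helper along these lines could back the benchmarks; allocations made by native components such as radare2 or Z3 would need an external profiler:
```python
import time
import tracemalloc

def profile_run(analysis_fn, *args, repeat=3):
    """Best-of-N wall time plus peak Python-heap usage."""
    best = float("inf")
    tracemalloc.start()
    for _ in range(repeat):
        start = time.perf_counter()
        analysis_fn(*args)
        best = min(best, time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"best_seconds": best, "peak_heap_bytes": peak}
```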
Integration and CLI Enhancements
CLI integration adds the 'analyze-enhanced' command providing access to all
Phase 2 capabilities through a unified interface. The command supports
configurable analysis depth, output format selection, and parallel processing
for large sample sets.
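The exact flags are not fixed here, but a plausible argparse registration for the subcommand, covering the depth, output format, and parallelism options mentioned above, might look like this:
```python
import argparse

def register_analyze_enhanced(subparsers):
    p = subparsers.add_parser("analyze-enhanced",
                              help="run the full Phase 2 analysis pipeline")
    p.add_argument("binary", nargs="+", help="one or more target binaries")
    p.add_argument("--depth", choices=["quick", "standard", "deep"],
                   default="standard", help="analysis depth")
    p.add_argument("--format", choices=["text", "json"], default="text",
                   help="output format")
    p.add_argument("--jobs", type=int, default=1,
                   help="parallel workers for large sample sets")
    return p

parser = argparse.ArgumentParser(prog="r2morph")
register_analyze_enhanced(parser.add_subparsers(dest="command"))
print(parser.parse_args(["analyze-enhanced", "a.exe", "--depth", "deep"]))
```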
The enhanced analysis pipeline integrates all components into a coherent
workflow, automatically selecting appropriate analysis techniques based on
binary characteristics and user requirements. Progress reporting provides
real-time feedback on analysis status with detailed logging for debugging
and research purposes.
Testing and Validation
The implementation includes comprehensive testing covering both individual
component functionality and integrated analysis pipelines. Unit tests validate
algorithmic correctness for core analysis functions including symbolic execution,
constraint solving, and pattern matching.
Integration tests verify end-to-end analysis functionality using synthetic
test cases and real-world samples. The test suite includes samples protected
by various commercial and custom protection schemes to ensure broad compatibility.
Performance testing validates scalability and resource usage characteristics
under various load conditions. Memory usage profiling ensures efficient
operation even with large binaries and complex analysis requirements.
All components include proper error handling with graceful degradation when
optional dependencies are unavailable or analysis components fail. The system
maintains backward compatibility with existing Phase 1 functionality while
providing enhanced capabilities when Phase 2 components are available.
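A typical guard for such graceful degradation, using angr as the optional dependency (the flag name and wrapper function are illustrative):
```python
try:
    import angr
    HAS_ANGR = True
except ImportError:          # optional dependency missing: degrade gracefully
    angr = None
    HAS_ANGR = False

def explore_paths(binary_path, find_addr):
    if not HAS_ANGR:
        raise RuntimeError("symbolic exploration requires the optional angr extra")
    proj = angr.Project(binary_path, auto_load_libs=False)
    simgr = proj.factory.simulation_manager(proj.factory.entry_state())
    simgr.explore(find=find_addr)
    return simgr.found
```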
Signed-off-by: anhed0nic <anhed0nic.esq@gmail.com>