dedup

An extremely fast and efficient duplicate file finder written in Rust that delivers accurate results while minimizing disk I/O. Available for Linux and macOS; Windows support is planned.

Files are compared using cryptographically secure hashing to ensure accuracy.

Optionally, duplicate files can be replaced with hardlinks to save disk space.


Features

  • Multi-stage filtering: size grouping -> partial hash (8KB) -> full hash
  • Parallel processing with rayon
  • BLAKE3 hashing (fast, cryptographically secure)
  • Hardlink replacement with dry-run support
  • Human-readable and JSON output formats
  • Flexible include/exclude glob patterns for filtering files

Installation

Cargo

cargo install dedup-cli

Homebrew

brew install denizariyan/tap/dedup

From Source

cargo build --release

The binary will be at target/release/dedup.

Usage

See CLI Options for all available options.

# Scan current directory, report duplicates
dedup

# Scan specific directory
dedup /path/to/directory

# Output as JSON
dedup --format json

# Report duplicates with exit code
dedup --action report-exit-code

# Dry-run replacing duplicates with hardlinks
dedup --action hardlink --dry-run

# Skip files by pattern
dedup -e "*.log" -e "*.tmp" -e "node_modules"

# Use an exclude file (gitignore-style, one pattern per line)
dedup --exclude-file .gitignore

# Only scan image files
dedup --include "*.jpg" --include "*.png"

# Use an include file
dedup --include-file patterns.txt

# Scan all images except those in a backup folder; if a file matches both an
# include and an exclude pattern, the exclude takes precedence (see the sketch below)
dedup -i "*.jpg" -e "backup"
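
A minimal sketch of that precedence rule, using the globset crate (the names here are illustrative assumptions, not dedup's actual implementation):

use globset::{Glob, GlobSet, GlobSetBuilder};
use std::path::Path;

fn build_set(patterns: &[&str]) -> GlobSet {
    let mut builder = GlobSetBuilder::new();
    for p in patterns {
        builder.add(Glob::new(p).expect("invalid glob"));
    }
    builder.build().expect("failed to build glob set")
}

// A path is scanned only if no exclude matches it and, when include
// patterns are given, at least one include matches it.
fn should_scan(path: &Path, includes: &GlobSet, excludes: &GlobSet) -> bool {
    if excludes.is_match(path) {
        return false; // exclude always wins
    }
    includes.is_empty() || includes.is_match(path)
}

fn main() {
    let includes = build_set(&["*.jpg"]);
    let excludes = build_set(&["backup/**"]);
    assert!(should_scan(Path::new("photos/cat.jpg"), &includes, &excludes));
    assert!(!should_scan(Path::new("backup/cat.jpg"), &includes, &excludes));
}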

CLI Options

All options can be used in combination.

Option                  Short   Description
--format <FORMAT>       -f      Output format: human (default), json, or quiet
--action <ACTION>       -a      Action: none (default), report-exit-code, or hardlink
--min-size <BYTES>      -s      Skip files smaller than this size
--max-size <BYTES>      -S      Skip files larger than this size
--exclude <PATTERN>     -e      Glob pattern to exclude files or directories (can be used multiple times)
--exclude-file <PATH>           File containing exclude patterns (gitignore-style)
--include <PATTERN>     -i      Glob pattern to include files (can be used multiple times); has no effect on directories
--include-file <PATH>           File containing include patterns
--verbose               -v      Show detailed output with file paths
--jobs <N>              -j      Number of threads to use (defaults to CPU core count)
--dry-run                       Preview hardlink changes without modifying files
--no-progress                   Disable progress bars

Benchmarks

Reference benchmark results for a 10GB dataset with various duplicate ratios and file size distributions can be found below. For more details, see benchmark docs.

In all tested scenarios, dedup outperforms the other duplicate file finders tested, especially on slower disks, where the multi-stage filtering and parallel processing shine by minimizing time spent waiting on disk I/O.

Slow Disk (~500 MB/s read/write)

[Bar graph: slow disk benchmark results]

Fast Disk (~1.75 GB/s read/write)

[Bar graph: fast disk benchmark results]

How It Works

The tool uses a multi-stage pipeline to minimize disk I/O and thereby reduce runtime:

  1. Scan: Walk the directory tree, collecting file paths and sizes.
  2. Size grouping: Group files by size; a file with a unique size cannot have a duplicate and is skipped.
  3. Partial hash: For the remaining candidates, hash only the first 8KB and group by this partial hash.
  4. Full hash: For files with matching partial hashes, compute the full content hash to confirm duplicates.

This approach avoids reading entire file contents for most files.

Example:

1000 files
    ↓ size grouping
  200 candidates (800 unique sizes skipped)
    ↓ partial hash (8KB each)
   50 candidates (150 ruled out by differing first 8KB)
    ↓ full hash
   20 confirmed duplicates
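
A condensed sketch of stages 2 through 4 in Rust, using the blake3 crate (function names are illustrative, not dedup's actual internals):

use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

// Stage 2: group by size and keep only groups with more than one file;
// a file with a unique size cannot have a duplicate.
fn size_groups(files: Vec<(PathBuf, u64)>) -> Vec<Vec<PathBuf>> {
    let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for (path, size) in files {
        by_size.entry(size).or_default().push(path);
    }
    by_size.into_values().filter(|g| g.len() > 1).collect()
}

// Stage 3: hash only the first 8KB of a file.
fn partial_hash(path: &Path) -> io::Result<blake3::Hash> {
    let mut buf = Vec::with_capacity(8192);
    File::open(path)?.take(8192).read_to_end(&mut buf)?;
    Ok(blake3::hash(&buf))
}

// Stage 4: hash the full contents, streaming so large files are
// never held in memory all at once.
fn full_hash(path: &Path) -> io::Result<blake3::Hash> {
    let mut hasher = blake3::Hasher::new();
    io::copy(&mut File::open(path)?, &mut hasher)?;
    Ok(hasher.finalize())
}

Only groups whose full hashes collide are reported as duplicates.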

Hardlinking

When using --action hardlink, duplicate files are replaced with hardlinks to a single copy.

Note that hardlinking discards per-file metadata: duplicates that are replaced lose their own ownership and permissions, since all links share the metadata of the single remaining copy. Other deduplication methods are planned, but hardlinking is the only option for now.

If you are packaging the deduplicated files later, consider using a hardlink-aware archiver like tar to benefit from space savings.

Use --dry-run --verbose first to preview what would change.
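
In outline, the replacement can work like this (a sketch, not dedup's actual code; linking under a temporary name and renaming over the duplicate keeps the swap atomic on POSIX filesystems):

use std::fs;
use std::io;
use std::path::Path;

// Replace `duplicate` with a hardlink to `original`, or just report
// what would happen when `dry_run` is set.
fn hardlink_duplicate(original: &Path, duplicate: &Path, dry_run: bool) -> io::Result<()> {
    if dry_run {
        println!("would link {} -> {}", duplicate.display(), original.display());
        return Ok(());
    }
    // Create the link under a temporary name first, then rename it over
    // the duplicate, so the duplicate never disappears mid-operation.
    let tmp = duplicate.with_extension("dedup.tmp");
    fs::hard_link(original, &tmp)?;
    fs::rename(&tmp, duplicate)
}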

Output Formats

Human (default)

Duplicate Report
  Groups: 3
  Total duplicate files: 12
  Wasted space: 45.2 MB

Quiet

Suppresses all output. Useful for scripting in combination with --action report-exit-code.

JSON

{
  "stats": {
    "duplicate_groups": 3,
    "duplicate_files": 12,
    "wasted_bytes": 47412224
  },
  "groups": [
    {
      "size": 15804074,
      "files": ["/path/to/file1.jpg", "/path/to/file2.jpg"]
    }
  ]
}
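
If you consume this JSON from Rust, the schema above maps onto serde structs along these lines (a sketch assuming the serde crate with its derive feature, plus serde_json):

use serde::Deserialize;

#[derive(Deserialize)]
struct Report {
    stats: Stats,
    groups: Vec<Group>,
}

#[derive(Deserialize)]
struct Stats {
    duplicate_groups: u64,
    duplicate_files: u64,
    wasted_bytes: u64,
}

#[derive(Deserialize)]
struct Group {
    size: u64,
    files: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // e.g. dedup --format json | this-program
    let report: Report = serde_json::from_reader(std::io::stdin())?;
    println!("{} duplicate files wasting {} bytes",
        report.stats.duplicate_files, report.stats.wasted_bytes);
    Ok(())
}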

Limitations

  • Because hardlinks are the only deduplication method currently supported, only files within the same filesystem can be deduplicated (see the sketch below)
  • Symlinks are ignored
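
The same-filesystem requirement can be checked up front: two paths can only be hardlinked if they report the same device ID (a Unix-only sketch using the standard library):

use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Hardlinks cannot cross filesystems: both paths must live on the
// same device for --action hardlink to work between them.
fn same_filesystem(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(fs::metadata(a)?.dev() == fs::metadata(b)?.dev())
}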

License

MIT License. See LICENSE file for details.
