logo

KnödelGrep

KnödelGrep is a high-performance command-line tool for searching and indexing text files. It is designed to handle large datasets efficiently, providing fast search capabilities with minimal resource consumption.

Features

  • Extremely fast searches, skipping files that do not contain your search query
  • Efficient search algorithm (bloom filter pre-checks)
  • Customizable tokenization (not yet)
  • Low memory footprint
  • Low storage overhead for indexes

How it Works

KnödelGrep uses a custom tokenization strategy to break down text into manageable pieces (tokens) for indexing and searching.
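
As a rough sketch of the idea (the function names and exact behaviour here are illustrative, not the repository's actual code), splitting on the default separators and length bounds from the help output below could look like this:

package main

import (
	"fmt"
	"strings"
)

// tokenize splits a line on the configured separator characters and keeps
// only tokens within the configured length bounds. The separators and the
// 3..32 defaults mirror the flags listed in the help output further down.
func tokenize(line, separators string, minLen, maxLen int) []string {
	fields := strings.FieldsFunc(line, func(r rune) bool {
		return strings.ContainsRune(separators, r)
	})
	var tokens []string
	for _, f := range fields {
		if len(f) >= minLen && len(f) <= maxLen {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

func main() {
	line := "2024-05-01 10:42:07 app.worker critical error: queue timeout"
	fmt.Println(tokenize(line, ",:.@|/ ", 3, 32))
	// Output: [2024-05-01 app worker critical error queue timeout]
}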

Each token is then stored in a bloom filter, which is saved to a .knodel file for quick access during searches.

This approach allows KnödelGrep to quickly determine the presence of tokens in large datasets without needing to read the files directly. If the bloom filter indicates that a token may be present, KnödelGrep performs a full file search to print the matching lines.
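
To make the index/pre-check cycle concrete, here is a self-contained sketch built around a toy bloom filter (the real .knodel on-disk format, hashing scheme, and sizing are implementation details of the repository and will differ):

package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a deliberately simple stand-in for the per-file filters
// KnödelGrep stores in .knodel files.
type bloomFilter struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash functions
}

func newBloomFilter(m, k uint64) *bloomFilter {
	return &bloomFilter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// positions derives k bit positions from two FNV hashes (double hashing).
func (b *bloomFilter) positions(token string) []uint64 {
	h1 := fnv.New64a()
	h1.Write([]byte(token))
	x := h1.Sum64()
	h2 := fnv.New64()
	h2.Write([]byte(token))
	y := h2.Sum64() | 1 // odd step so positions spread over the filter
	pos := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		pos[i] = (x + i*y) % b.m
	}
	return pos
}

func (b *bloomFilter) add(token string) {
	for _, p := range b.positions(token) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// mayContain returns false only when the token is definitely absent;
// a true result can be a false positive.
func (b *bloomFilter) mayContain(token string) bool {
	for _, p := range b.positions(token) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// Index time: every token of a file goes into that file's filter.
	f := newBloomFilter(1<<16, 8)
	for _, tok := range []string{"critical", "error", "timeout"} {
		f.add(tok)
	}

	// Search time: only files whose filter reports all query tokens as
	// possibly present are actually opened and grepped line by line.
	fmt.Println(f.mayContain("critical"), f.mayContain("error")) // true true
	fmt.Println(f.mayContain("banana"))                          // almost certainly false
}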

SSD + HDD & RAM disk

KnödelGrep is designed to work efficiently on both SSDs and HDDs, but the performance characteristics differ. If you have a lot of data on an HDD, consider moving the .knodel index files to an SSD: every search starts by reading all the index files from disk, so faster index reads significantly speed up the initial filtering step. If you have enough RAM available, putting the index files on a RAM disk gives the best possible performance. When the index files live on an SSD or a RAM disk, pass the -knodel-ssd flag to use more threads and read multiple bloom filter files in parallel, which speeds up the search.
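
A minimal sketch of how that flag could translate into parallelism (the function names and on-disk details below are placeholders, not the repository's actual API):

package main

import (
	"fmt"
	"path/filepath"
	"runtime"
	"strings"
	"sync"
)

// filterMayContainAll and dataFileFor are placeholders for this sketch: the
// first stands in for loading a .knodel bloom filter and testing every query
// token against it, the second for mapping an index file back to its data file.
func filterMayContainAll(indexPath string, tokens []string) bool { return true }

func dataFileFor(indexPath string) string { return strings.TrimSuffix(indexPath, ".knodel") }

// candidateFiles checks every .knodel index under knodelPath and returns the
// data files whose filters report all query tokens as possibly present. The
// ssd flag only changes how many filters are read concurrently.
func candidateFiles(knodelPath string, queryTokens []string, ssd bool) []string {
	indexes, _ := filepath.Glob(filepath.Join(knodelPath, "*.knodel"))

	workers := 1 // a spinning disk prefers sequential reads
	if ssd {
		workers = runtime.NumCPU() // SSD / RAM disk: read many filters at once
	}

	jobs := make(chan string)
	var (
		mu   sync.Mutex
		hits []string
		wg   sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for idx := range jobs {
				if filterMayContainAll(idx, queryTokens) {
					mu.Lock()
					hits = append(hits, dataFileFor(idx))
					mu.Unlock()
				}
			}
		}()
	}
	for _, idx := range indexes {
		jobs <- idx
	}
	close(jobs)
	wg.Wait()
	return hits
}

func main() {
	fmt.Println(candidateFiles("/path/to/knodel/files", []string{"critical", "error"}, true))
}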

Why

Well, because I need to search large text files quickly, and with terabytes of data on disk I want to avoid reading unnecessary files. Using KnödelGrep I can quickly filter out files that do not contain the search tokens, significantly speeding up the search process.

But there is a catch: KnödelGrep is useless if you only need to search once, because the indexing process takes time. If you search the same dataset multiple times (ideally 3 or more), KnödelGrep starts to pay off.

Also, if your data is constantly changing, KnödelGrep is not the right tool for you, because you will need to re-index the data every time it changes.

Last but not least, if you have a small dataset (less than a few GBs), KnödelGrep may not be the best choice, because the indexing overhead may outweigh the benefits of faster searches. The same applies if you have only a few files to search (fewer than 50).

Why not use existing tools?

Well, because existing tools like grep, ack, ag, and rg are great for searching text files, but they can be slow when dealing with very large datasets and repetitive searches. If you search the same dataset multiple times, these tools read the files from disk every time, which can be very slow. KnödelGrep solves this problem by indexing the data first and then using the index to detect which files may contain the search tokens, avoiding unnecessary reads from disk.

Consider KnödelGrep if your data matches the following criteria:

  • Large datasets (hundreds of GBs to TBs)
  • Frequent searches within the same dataset
  • Relatively static data (not changing often, or not changing at all)
  • Text-based files (logs, codebases, documents, etc.)
  • Multiple files (more than 50 files, ideally thousands or more)
  • Willingness to trade off some index size and indexing time for faster searches
  • Your time is more valuable than ~5% of your storage space

False positive rate

KnödelGrep uses bloom filters to index tokens, which means that there is a possibility of false positives during searches.

The false positive rate can be configured during the indexing process, so you can trade index size for search accuracy.

A lower false positive rate will result in a larger index size, but fewer false positives during searches. Conversely, a higher false positive rate will result in a smaller index size, but more false positives during searches.

The default false positive rate is set to 0.3%, which provides a good balance for most use cases.
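
For intuition, standard bloom filter theory (not KnödelGrep's actual sizing code, which may differ) says a filter needs about -ln(p) / (ln 2)² bits per distinct token for a target false positive rate p, so the 0.3% default works out to roughly 12 bits, or about 1.5 bytes, per distinct token:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Standard bloom filter sizing: for a target false positive rate p,
	// bits per element m/n = -ln(p) / (ln 2)^2 and the optimal number of
	// hash functions k = -ln(p) / ln 2.
	p := 0.003 // the 0.3% default
	bitsPerToken := -math.Log(p) / (math.Ln2 * math.Ln2)
	hashes := -math.Log(p) / math.Ln2
	fmt.Printf("p=%.4f -> %.1f bits (%.2f bytes) per distinct token, ~%.0f hashes\n",
		p, bitsPerToken, bitsPerToken/8, hashes)
	// Output: p=0.0030 -> 12.1 bits (1.51 bytes) per distinct token, ~8 hashes
}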

In any case, a false positive just means that KnödelGrep reads a file that does not contain the search tokens, which is exactly what grep would do anyway, so the overall performance gain from using KnödelGrep is still significant. It's always a trade-off between index size, indexing time, and search performance.

To specify a different false positive rate, use the -fpr flag during indexing mode.

Help

Usage of knodelgrep:
  -fpr float
    	False positive rate for bloom filter (0.0001-99.9999) % (default 0.3)
  -input string
    	Path to input file or directory
  -knodel-path string
    	Path to knodel files (index/bloom filters). Defaults to input directory if not specified
  -knodel-ssd
    	Path to knodel files is on SSD storage. Default false (HDD)
  -max-token-len int
    	Maximum token length (default 32)
  -min-token-len int
    	Minimum token length (default 3)
  -mode string
    	Mode: 'index' or 'search' (default "search")
  -query string
    	Search query (for 'search' mode)
  -separators string
    	Token separators (default ",:.@|/ ")
  -verbose
    	Enable verbose output

Example Usage

To index a directory of text files:

knodelgrep -mode index -input /path/to/text/files/directory/

To index a directory and use an SSD for the knodel files:

knodelgrep -mode index -input /path/to/text/files/directory/ -knodel-path /path/to/knodel/files/ -knodel-ssd=true

To search for "critical error" in the indexed files:

knodelgrep -mode search -input /path/to/text/files/directory/ -query "critical error"

Note: You may need to deduplicate files and split large files into smaller chunks to optimize indexing and searching performance. You can use tools like dedupe for deduplication.

KnödelGrep name

The name "KnödelGrep" combines "Knödel" and "Grep". Knödel, also known as "Canederli" in Italian, are traditional dumplings popular in Central European cuisine.

I would like to tell you the real meaning of the name, but I don't know it myself; it just popped into my mind while I was thinking of a name for this project.

The "Grep" part of the name is derived from the Unix command-line utility "grep", the same tool that today wasted 8 hours of my life while searching for a needle in a haystack of text files.

Contributing

Contributions are welcome! If you have ideas for improvements or new features, feel free to open an issue or submit a pull request.

Please keep in mind that this project does not aim to be a general-purpose database or search engine, but only an alternative to grep/rg for very specific use cases.
