KnödelGrep is a high-performance command-line tool for searching and indexing text files. It is designed to handle large datasets efficiently, providing fast search capabilities with minimal resource consumption.
- Extremely fast searches, skipping files that do not contain your search query
- Efficient search algorithm (bloom filter pre-checks)
- Customizable tokenization (not yet implemented)
- Low memory footprint
- Low storage overhead for indexes
KnödelGrep uses a custom tokenization strategy to break down text into manageable pieces (tokens) for indexing and searching.
Each token is added to a bloom filter, which is then saved to a .knodel file for quick access during searches.
This approach lets KnödelGrep quickly rule out files that cannot contain the search tokens, without reading them directly. If the bloom filter indicates that a token may be present, KnödelGrep performs a full file search to print the matching lines.
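A minimal sketch of this flow in Go, assuming one bloom filter per input file, the third-party package github.com/bits-and-blooms/bloom/v3, and the default settings from the usage section below (separators ",:.@|/ ", token length 3-32, 0.3% false positive rate). KnödelGrep's actual implementation may differ:

```go
// Illustrative sketch only; not KnödelGrep's actual source.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"

	"github.com/bits-and-blooms/bloom/v3" // assumed bloom filter package
)

const separators = ",:.@|/ " // default separator set

// tokenize splits a line on the separator set and keeps tokens within
// the default length bounds (3 to 32 bytes).
func tokenize(line string) []string {
	fields := strings.FieldsFunc(line, func(r rune) bool {
		return strings.ContainsRune(separators, r)
	})
	kept := fields[:0]
	for _, t := range fields {
		if len(t) >= 3 && len(t) <= 32 {
			kept = append(kept, t)
		}
	}
	return kept
}

// buildIndex streams one input file, adds every token to a bloom filter,
// and writes the filter to a sidecar .knodel file.
func buildIndex(path string) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	// Capacity is a guess for the sketch; 0.003 is the default 0.3% FPR.
	filter := bloom.NewWithEstimates(1_000_000, 0.003)
	sc := bufio.NewScanner(in)
	for sc.Scan() {
		for _, tok := range tokenize(sc.Text()) {
			filter.Add([]byte(tok))
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}

	out, err := os.Create(path + ".knodel")
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = filter.WriteTo(out)
	return err
}

// mayContain is the search-time pre-check: a file is opened and scanned
// line by line only if every query token might be present in its filter.
func mayContain(filter *bloom.BloomFilter, query string) bool {
	for _, tok := range tokenize(query) {
		if !filter.Test([]byte(tok)) {
			return false // definitely absent: skip this file entirely
		}
	}
	return true // possibly present: fall back to a full grep-style scan
}

func main() {
	if err := buildIndex(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The payoff is in the pre-check: a negative bloom filter answer is definitive, so files whose filters reject any query token are never opened at all.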
KnödelGrep is designed to work on both SSDs and HDDs, but the performance characteristics vary. If you have a lot of data on an HDD, I suggest moving the .knodel index files to an SSD: all the index files are read from disk on every search, so this significantly speeds up the initial filtering step. If you have enough RAM available, you can put the index files on a RAM disk for the best possible performance. With a RAM disk or an SSD, you can pass the '-knodel-ssd' flag to use more threads and read multiple bloom filter files in parallel, speeding up the search further.
Well, because I need to search large text files quickly, and within terabytes of data on disk I want to avoid reading unnecessary files. With KnödelGrep I can quickly filter out the files that do not contain the search tokens, which significantly speeds up the search.
But there is a catch: KnödelGrep is useless if you only need to search once, because the indexing process takes time. If you search the same dataset multiple times (ideally three or more), KnödelGrep starts to pay off.
Also, if your data is constantly changing, KnödelGrep is not the right tool for you, because you will need to re-index the data every time it changes.
Last but not least, if you have a small dataset (less than a few GBs), KnödelGrep may not be the best choice, because the indexing overhead may outweigh the benefits of faster searches; the same applies if you have only a few files to search (< 50 files).
Well, because existing tools like grep, ack, ag, and rg are great for searching text files, but they can be slow on very large datasets and repeated searches.
Also, if you search the same dataset multiple times, these tools read the files from disk every time, which can be very slow. My tool solves this problem by indexing the data first and then using the index to detect which files might contain the search tokens, avoiding unnecessary reads from disk.
- Large datasets (hundreds of GBs to TBs)
- Frequent searches within the same dataset
- Relatively static data (not changing often, or not changing at all)
- Text-based files (logs, codebases, documents, etc.)
- Multiple files (more than 50 files, ideally thousands or more)
- Willingness to trade-off some index size and indexing time for faster searches
- Your time is more valuable than ~5% of your storage space
KnödelGrep uses bloom filters to index tokens, which means that there is a possibility of false positives during searches.
The false positive rate can be configured during the indexing process, so you can trade index size for search accuracy.
A lower false positive rate will result in a larger index size, but fewer false positives during searches. Conversely, a higher false positive rate will result in a smaller index size, but more false positives during searches.
The default false positive rate is set to 0.3%, which provides a good balance for most use cases.
In any case, a false positive just means that KnödelGrep reads a file that does not contain the search tokens, which is exactly what grep does for every file. Don't worry: the overall performance gain from using KnödelGrep is still significant. It's always a trade-off between index size, indexing time, and search performance.
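For intuition on how the rate maps to index size: a standard bloom filter needs about m = -n·ln(p) / (ln 2)² bits for n distinct tokens at false positive rate p (this assumes KnödelGrep uses a classic bloom filter; the exact on-disk overhead may differ). A quick sketch:

```go
package main

import (
	"fmt"
	"math"
)

// bitsPerToken returns how many filter bits each distinct token costs
// at false positive rate p: m/n = -ln(p) / (ln 2)^2.
func bitsPerToken(p float64) float64 {
	return -math.Log(p) / (math.Ln2 * math.Ln2)
}

func main() {
	for _, p := range []float64{0.03, 0.003, 0.0003} { // 3%, 0.3% (default), 0.03%
		fmt.Printf("FPR %.2f%% -> %.1f bits (~%.1f bytes) per token\n",
			p*100, bitsPerToken(p), bitsPerToken(p)/8)
	}
}
```

At the default 0.3% this works out to roughly 12 bits (about 1.5 bytes) per distinct token, and every 10x tightening of the rate costs about 5 extra bits per token.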
To specify a different false positive rate, use the -fpr flag in index mode.
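For example, to build the index with a tighter 0.1% false positive rate:

knodelgrep -mode index -input /path/to/text/files/directory/ -fpr 0.1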
Usage of knodelgrep:
-fpr float
False positive rate for bloom filter (0.0001-99.9999) % (default 0.3)
-input string
Path to input file or directory
-knodel-path string
Path to knodel files (index/bloom filters). Defaults to input directory if not specified
-knodel-ssd
Path to knodel files is on SSD storage. Default false (HDD)
-max-token-len int
Maximum token length (default 32)
-min-token-len int
Minimum token length (default 3)
-mode string
Mode: 'index' or 'search' (default "search")
-query string
Search query (for 'search' mode)
-separators string
Token separators (default ",:.@|/ ")
-verbose
Enable verbose output
To index a directory of text files:
knodelgrep -mode index -input /path/to/text/files/directory/

To index a directory and use an SSD for the knodel files:

knodelgrep -mode index -input /path/to/text/files/directory/ -knodel-path /path/to/knodel/files/ -knodel-ssd=true

To search for "critical error" in the indexed files:

knodelgrep -mode search -input /path/to/text/files/directory/ -query "critical error"

Note: You may need to deduplicate files and split large files into smaller chunks to optimize indexing and searching performance. You can use tools like dedupe for deduplication.
The name "KnödelGrep" combines "Knödel" and "Grep". Knödel, also known as "Canederli" in Italian, are traditional dumplings popular in Central European cuisine.
I would like to tell you the real meaning of the name, but I don't know it myself; it just popped into my mind when I was thinking about a name for this project.
The "Grep" part of the name is derived from the Unix command-line utility "grep", the same tool that today wasted 8 hours of my life while searching for a needle in a haystack of text files.
Contributions are welcome! If you have ideas for improvements or new features, feel free to open an issue or submit a pull request.
Please keep in mind that this project does not aim to be a general-purpose database or search engine; it is only an alternative to grep/rg for very specific use cases.
