diff --git a/_partials/_since_0_1_0.md b/_partials/_since_0_1_0.md
new file mode 100644
index 0000000000..5c7a119a24
--- /dev/null
+++ b/_partials/_since_0_1_0.md
@@ -0,0 +1 @@
+Since [pg_textsearch v0.1.0](https://github.com/timescale/pg_textsearch/releases/tag/v0.1.0)
diff --git a/use-timescale/extensions/pg-textsearch.md b/use-timescale/extensions/pg-textsearch.md
index c430e35e67..9879aedc00 100644
--- a/use-timescale/extensions/pg-textsearch.md
+++ b/use-timescale/extensions/pg-textsearch.md
@@ -7,6 +7,7 @@ products: [cloud, self_hosted]
 ---

 import EA1125 from "versionContent/_partials/_early_access_11_25.mdx";
+import SINCE010 from "versionContent/_partials/_since_0_1_0.mdx";
 import IntegrationPrereqs from "versionContent/_partials/_integration-prereqs.mdx";

 # Optimize full text search with BM25
@@ -27,13 +28,12 @@ matches. `pg_textsearch` implements the following:
 This page shows you how to install `pg_textsearch`, configure BM25 indexes, and optimize your search capabilities using the following best practice:

-* **Memory planning**: size your `index_memory_limit` based on corpus vocabulary and document count
 * **Language configuration**: choose appropriate text search configurations for your data language
 * **Hybrid search**: combine with pgvector or pgvectorscale for applications requiring both semantic and keyword search
 * **Query optimization**: use score thresholds to filter low-relevance results
 * **Index monitoring**: regularly check index usage and memory consumption

- this preview release is designed for development and staging environments. It is not recommended for use with hypertables.
+ this preview release is designed for development and staging environments.

 ## Prerequisites

@@ -124,42 +124,76 @@ Use efficient query patterns to leverage BM25 ranking and optimize search perfor

 1. **Perform ranked searches using the distance operator**

    ```sql
-   SELECT name, description,
-          description <@> to_bm25query('ergonomic work', 'products_search_idx') as score
+   SELECT name, description, description <@> to_bm25query('ergonomic work', 'products_search_idx') as score
    FROM products
-   ORDER BY description <@> to_bm25query('ergonomic work', 'products_search_idx')
+   ORDER BY score
    LIMIT 3;
    ```

+   You see something like:
+
+   ```sql
+   name | description | score
+   ----------------------------+-----------------------------------------------------------------------------------+---------------------
+   Ergonomic Mouse | Wireless mouse with ergonomic design to reduce wrist strain during long work sessions | -1.8132977485656738
+   Mechanical Keyboard | Durable mechanical switches with RGB backlighting for gaming and productivity | 0
+   Standing Desk | Adjustable height desk for better posture and productivity throughout the workday | 0
+   ```
+
 1. **Filter results by score threshold**

    ```sql
-   SELECT name,
-          description <@> to_bm25query('wireless', 'products_search_idx') as score
+   SELECT name, description <@> to_bm25query('wireless', 'products_search_idx') as score
    FROM products
-   WHERE description <@> to_bm25query('wireless', 'products_search_idx') < -2.0;
+   WHERE description <@> to_bm25query('wireless', 'products_search_idx') < -0.5;
+   ```
+
+   You see something like:
+
+   ```sql
+   name | score
+   ----------------+---------------------
+   Ergonomic Mouse | -0.9066488742828369
    ```

 1. **Combine with standard SQL operations**

    ```sql
-   SELECT category, name,
-          description <@> to_bm25query('ergonomic', 'products_search_idx') as score
+   SELECT category, name, description <@> to_bm25query('ergonomic', 'products_search_idx') as score
    FROM products
    WHERE price < 500
-     AND description <@> to_bm25query('ergonomic', 'products_search_idx') < -1.0
+     AND description <@> to_bm25query('ergonomic', 'products_search_idx') < -0.5
    ORDER BY description <@> to_bm25query('ergonomic', 'products_search_idx')
    LIMIT 5;
    ```

+   You see something like:
+
+   ```sql
+   category | name | score
+   -------------+-----------------+---------------------
+   Electronics | Ergonomic Mouse | -0.9066488742828369
+   ```
+
 1. **Verify index usage with EXPLAIN**

    ```sql
    EXPLAIN SELECT * FROM products
-   ORDER BY description <@> to_bm25query('wireless keyboard', 'products_search_idx')
+   ORDER BY description <@> to_bm25query('ergonomic', 'products_search_idx')
    LIMIT 5;
    ```

+   You see something like:
+
+   ```sql
+   QUERY PLAN
+   --------------------------------------------------------------------------------------------
+   Limit (cost=8.55..8.56 rows=3 width=140)
+     -> Sort (cost=8.55..8.56 rows=3 width=140)
+          Sort Key: ((description <@> 'products_search_idx:ergonomic'::bm25query))
+          -> Seq Scan on products (cost=0.00..8.53 rows=3 width=140)
+   ```
+

 You have optimized your search queries for BM25 ranking.

@@ -181,10 +215,21 @@ Combine `pg_textsearch` with `pgvector` or `pgvectorscale` to build powerful hyb
      id serial PRIMARY KEY,
      title text,
      content text,
-     embedding vector(1536) -- OpenAI ada-002 embedding dimension
+     embedding vector(3) -- Using 3 dimensions for this example; use 1536 for OpenAI ada-002
    );
    ```

+1. **Insert sample data**
+
+   ```sql
+   INSERT INTO articles (title, content, embedding) VALUES
+   ('Database Query Optimization', 'Learn how to optimize database query performance using indexes and query planning', '[0.1, 0.15, 0.2]'),
+   ('Performance Tuning Guide', 'A comprehensive guide to performance tuning in distributed systems and databases', '[0.12, 0.18, 0.25]'),
+   ('Introduction to Indexing', 'Understanding how database indexes improve query performance and data retrieval', '[0.09, 0.14, 0.19]'),
+   ('Advanced SQL Techniques', 'Master advanced SQL techniques for complex data analysis and reporting', '[0.5, 0.6, 0.7]'),
+   ('Data Warehousing Basics', 'Getting started with data warehousing and analytical query processing', '[0.8, 0.9, 0.85]');
+   ```
+
 1. **Create indexes for both search types**

    ```sql
@@ -223,7 +268,19 @@ Combine `pg_textsearch` with `pgvector` or `pgvectorscale` to build powerful hyb
    LEFT JOIN keyword_search k ON a.id = k.id
    WHERE v.id IS NOT NULL OR k.id IS NOT NULL
    ORDER BY combined_score DESC
-   LIMIT 10;
+   LIMIT 10;
+   ```
+
+   You see something like:
+
+   ```sql
+   id | title | combined_score
+   ----+----------------------------+--------------------
+   3 | Introduction to Indexing | 0.0325224748810153
+   1 | Database Query Optimization| 0.0322664584959667
+   2 | Performance Tuning Guide | 0.0320020481310804
+   5 | Data Warehousing Basics | 0.0310096153846154
+   4 | Advanced SQL Techniques | 0.0310096153846154
    ```

 1. **Adjust relative weights for different search types**
@@ -257,6 +314,18 @@ Combine `pg_textsearch` with `pgvector` or `pgvectorscale` to build powerful hyb
    LIMIT 10;
    ```

+   You see something like:
+
+   ```sql
+   id | title | combined_score
+   ----+----------------------------+--------------------
+   3 | Introduction to Indexing | 0.0163141195134849
+   2 | Performance Tuning Guide | 0.0160522273425499
+   1 | Database Query Optimization| 0.0160291438979964
+   4 | Advanced SQL Techniques | 0.0155528846153846
+   5 | Data Warehousing Basics | 0.0154567307692308
+   ```
+

 You have implemented hybrid search combining semantic and keyword search.

@@ -267,27 +336,37 @@ Customize `pg_textsearch` behavior for your specific use case and data character

-1. **Configure the memory limit**
+1. **Configure memory and performance settings**
+
+   To manage memory usage, you control when the in-memory index spills to disk segments. When the memtable reaches the
+   threshold, it automatically flushes to a segment at transaction commit.

-   The size of the memtable depends primarily on the number of distinct terms in your corpus. A corpus with longer
-   documents or more varied vocabulary requires more memory per document.
    ```sql
-   -- Set memory limit per index (default 64MB)
-   SET pg_textsearch.index_memory_limit = '128MB';
+   -- Set memtable spill threshold (default 800000 posting entries, ~8MB segments)
+   SET pg_textsearch.memtable_spill_threshold = 1000000;
+
+   -- Set bulk load spill threshold (default 100000 terms per transaction)
+   SET pg_textsearch.bulk_load_threshold = 150000;
+
+   -- Set default query limit when no LIMIT clause is present (default 1000)
+   SET pg_textsearch.default_limit = 5000;
    ```
+
 1. **Configure language-specific text processing**

-   ```sql
-   -- French language configuration
-   CREATE INDEX products_fr_idx ON products_fr
-   USING pg_textsearch(description)
-   WITH (text_config='french');
+   You can create multiple BM25 indexes on the same column with different language configurations:

-   -- Simple tokenization without stemming
+   ```sql
+   -- Create an additional index with simple tokenization (no stemming)
    CREATE INDEX products_simple_idx ON products
-   USING pg_textsearch(description)
+   USING bm25(description)
    WITH (text_config='simple');
+
+   -- Example: French language configuration for a French products table
+   -- CREATE INDEX products_fr_idx ON products_fr
+   -- USING bm25(description)
+   -- WITH (text_config='french');
    ```

 1. **Tune BM25 parameters**
@@ -310,7 +389,7 @@ Customize `pg_textsearch` behavior for your specific use case and data character
    - View detailed index information
      ```sql
-     SELECT bm25_debug_dump_index('products_search_idx');
+     SELECT bm25_dump_index('products_search_idx');
      ```


@@ -320,12 +399,7 @@ caching and pagination to improve user experience with large result sets.

 ## Current limitations

-This preview release focuses on core BM25 functionality. It has the following limitations:
-
-* **Memory-only storage**: indexes are limited by `pg_textsearch.index_memory_limit` (default 64MB)
-* **No phrase queries**: cannot search for exact multi-word phrases yet
-
-These limitations will be addressed in upcoming releases with disk-based segments and expanded query capabilities.
+This preview release focuses on core BM25 functionality. In this release, you cannot search for exact multi-word phrases.

 [bm25-wiki]: https://en.wikipedia.org/wiki/Okapi_BM25
 [connect-using-psql]: /integrations/:currentVersion:/psql/#connect-to-your-service
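
The `combined_score` values added to the hybrid search steps can be checked by hand. The query that produces them sits outside these hunks, so the following is only a sketch under stated assumptions: the printed numbers are consistent with reciprocal rank fusion using a constant of 60, and the reweighted output with weights of 0.7 and 0.3. The function name `rrf_score`, the constant, the weights, and the per-list ranks are all inferred from the sample output, not taken from the page.

```python
def rrf_score(ranks, weights=None, k=60):
    """Weighted reciprocal rank fusion: sum of weight / (k + rank) over the result lists.

    `ranks` holds one 1-based rank per result list. k=60 is an assumption
    inferred from the sample output, not a value stated in the diff.
    """
    if weights is None:
        weights = [1.0] * len(ranks)
    return sum(w / (k + r) for w, r in zip(weights, ranks))


# Hypothetical (rank in first list, rank in second list) per article id,
# chosen because they reproduce the printed scores in both sample outputs.
assumed_ranks = {3: (1, 2), 1: (3, 1), 2: (2, 3), 4: (4, 5), 5: (5, 4)}

# Equal weights: compare with the first "You see something like" table.
equal = {doc_id: rrf_score(r) for doc_id, r in assumed_ranks.items()}

# Assumed 0.7 / 0.3 weights: compare with the second table.
weighted = {doc_id: rrf_score(r, weights=(0.7, 0.3)) for doc_id, r in assumed_ranks.items()}

for doc_id in sorted(weighted, key=weighted.get, reverse=True):
    print(doc_id, equal[doc_id], weighted[doc_id])
```

With equal weights, articles 4 and 5 receive the same score because they swap ranks 4 and 5 between the two lists, which matches the tie in the first table. The weights separate them in the second table.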