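The common_terms query in Elasticsearch (written as common in the query DSL) is a specialized full-text query designed to reduce the performance cost of stop words, such as 'the' and 'is' in English, without removing them from the search. It does this by splitting the query terms into two groups, low-frequency terms and high-frequency terms, and treating the two groups differently. Note that this query was deprecated in Elasticsearch 7.3.0 in favor of the match query and removed in 8.0, so the description here applies to earlier versions.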
Working Principle
When querying a text field, the common_terms query divides the query terms into two categories:
- High-frequency terms: These are words that appear frequently across the document set. For example, in English, they may include 'the', 'is', 'at', etc.
- Low-frequency terms: These words appear less frequently in the document set.
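The split described above can be sketched in a few lines of Python. This is a simplified model rather than Elasticsearch's actual implementation: the helper name split_by_frequency, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter

def split_by_frequency(query_terms, docs, cutoff_frequency):
    """Split query terms into (low_freq, high_freq) by document frequency.

    Assumption for this sketch: a cutoff below 1 is a fraction of the
    documents, a cutoff of 1 or more is an absolute document count.
    """
    # Count in how many documents each term occurs (document frequency).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))

    if cutoff_frequency < 1:
        threshold = cutoff_frequency * len(docs)
    else:
        threshold = cutoff_frequency

    low, high = [], []
    for term in query_terms:
        (high if doc_freq[term] > threshold else low).append(term)
    return low, high

# Hypothetical four-document corpus.
docs = [
    "the quick brown fox is fast",
    "the lazy dog is asleep",
    "the fox is in the den",
    "a test of the search engine",
]
low, high = split_by_frequency(["the", "is", "fox", "test"], docs, cutoff_frequency=0.5)
print("low-frequency:", low)
print("high-frequency:", high)
```

With a cutoff of 0.5 on this tiny corpus, 'the' and 'is' occur in more than half of the documents and land in the high-frequency group, while 'fox' and 'test' stay in the low-frequency group.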
The query then proceeds in two stages:
- First stage: Only low-frequency terms are considered. These terms typically carry higher information content and effectively distinguish document relevance.
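- Second stage: A second query is run for the high-frequency terms, but scores are computed only for documents that already matched in the first stage. High-frequency terms therefore refine the ranking without adding documents to the result set, which avoids scoring the very large number of documents that contain only common words. If the query consists entirely of high-frequency terms, it is executed as a single query in which, by default, all terms must match.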
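As a rough illustration of the two stages, the following Python sketch lets low-frequency terms select the candidate documents and lets high-frequency terms contribute only to the score of those candidates. It is a toy model: the function name common_terms_search, the sample documents, and the weights 2.0 and 0.5 are assumptions, and real Elasticsearch scoring uses its configured similarity rather than simple hit counts.

```python
def common_terms_search(low_terms, high_terms, docs):
    """Toy model: low-frequency terms select documents, high-frequency
    terms only add to the score of documents already selected."""
    results = []
    for doc_id, text in enumerate(docs):
        tokens = set(text.lower().split())
        low_hits = sum(1 for t in low_terms if t in tokens)
        if low_hits == 0:
            continue  # not matched in the first stage, so never scored
        high_hits = sum(1 for t in high_terms if t in tokens)
        # Arbitrary illustrative weights: rare terms count more than common ones.
        score = 2.0 * low_hits + 0.5 * high_hits
        results.append((doc_id, score))
    # Highest score first; ties keep their original document order.
    return sorted(results, key=lambda r: r[1], reverse=True)

# Hypothetical corpus and term groups.
docs = [
    "the quick brown fox is fast",
    "the lazy dog is asleep",
    "the fox is in the den",
    "a test of the search engine",
]
results = common_terms_search(["fox", "test"], ["the", "is"], docs)
for doc_id, score in results:
    print(doc_id, score, docs[doc_id])
```

The second document contains both common words but no low-frequency term, so it is skipped entirely. That skip is where the performance benefit comes from.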
Configuration Example
Configuring the common_terms query in Elasticsearch can be done as follows:
```json
{
  "query": {
    "common": {
      "body": {
        "query": "this is a test",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "and",
        "high_freq_operator": "or"
      }
    }
  }
}
```
In this example:
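- body: The field to query.
- query: The user's query text.
- cutoff_frequency: The threshold that separates high-frequency from low-frequency terms. A value below 1 is read as a fraction of the documents (0.001 means 0.1%), and a value of 1 or more is read as an absolute document count. Terms whose document frequency exceeds the threshold are treated as high-frequency, and the rest as low-frequency.
- low_freq_operator: Set to and, meaning a document must contain all low-frequency terms to match.
- high_freq_operator: Set to or, meaning the high-frequency terms are optional and any of them that matches contributes to the score.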
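When the request is assembled in application code, the same body can be built as a plain dictionary. The helper below is a hypothetical convenience function, not part of any Elasticsearch client library. Its default values are illustrative choices, and it only constructs and prints the JSON body without sending it anywhere.

```python
import json

def build_common_query(field, text, cutoff_frequency=0.001,
                       low_freq_operator="or", high_freq_operator="or"):
    """Return a search request body containing a single common terms query."""
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                    "low_freq_operator": low_freq_operator,
                    "high_freq_operator": high_freq_operator,
                }
            }
        }
    }

# Reproduces the example above: all low-frequency terms are required.
request_body = build_common_query("body", "this is a test", low_freq_operator="and")
print(json.dumps(request_body, indent=2))
```

Keeping the field name as a parameter makes it straightforward to reuse the helper for other text fields, such as a title field.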
Advantages and Use Cases
The main advantage of the common_terms query is that it handles queries containing many common words efficiently without sacrificing much precision. Unlike removing stop words outright, it keeps common words available for scoring, so they can still influence the ranking of documents that matched on rarer terms. This is particularly useful for applications with large volumes of natural-language text, such as news sites, blogs, and social media, where common words would otherwise match a large share of the index.
In summary, Elasticsearch's common_terms query improves query performance and accuracy by efficiently handling high-frequency stop words, making it particularly suitable for search environments with large-scale text data.