PostgreSQL® Full-Text Search

Full-text Search is a PostgreSQL® feature that facilitates the indexing of natural language text documents and the identification of indexed documents that match a given query. Matching documents can be sorted based on their relevance to the query, and document excerpts can be generated with the matching terms highlighted. A set of SQL data types, operators, and functions is provided to assist with the indexing, querying, and ranking of documents.

PostgreSQL uses the term document to mean any fragment of natural language text: essentially, strings containing human-readable words separated by whitespace and punctuation. Documents are often stored as text columns, but they can also be generated dynamically, such as by concatenating multiple columns together (even from multiple tables).

Some examples of natural language text are blog posts (such as the one you’re reading now), books, essays, user comments, forum posts, social media messages, emails, newsgroup and chat messages, newspaper and magazine articles, and product descriptions in a catalog. These types of documents are typical candidates for full-text search indexing.

Note that not all human-readable strings contain natural language. For example, usernames, passwords, and URLs are often human-readable, but don’t typically contain natural language.

What’s Not to Like About LIKE?

The text data type has several operators for performing basic string pattern matching, notably LIKE/ILIKE (SQL wildcard matches, case-sensitive and case-insensitive, respectively), SIMILAR TO (SQL regex), and ~ (POSIX regex).

While it is possible to use these operators to perform very basic searches, pattern matching has several limitations that make it less than ideal for implementing useful searches. These operators lack linguistic support—such as understanding the structure of text (including punctuation), recognizing word variants and synonyms, and ignoring frequently used words. They have no ability to rank results based on relevance to the query, and, critically, they can be slow due to limited indexing support.

To explore some of these limitations, let’s look at some typical requirements for a comprehensive search feature of an application/website:

  • Users enter search terms that will be converted into queries against the database, and the results are displayed back to the user.
  • Search should be case-insensitive.
  • Words in the query should match variants (such as suffixes) of that word in the document, e.g. ‘cat’ should match ‘cats’ (and vice-versa).
  • Documents with related words/synonyms should be found, e.g. documents containing ‘feline’ or ‘kitten’ should be found when searching for ‘cat’.
  • Phrases can be searched for (often surrounded by double-quotes, e.g. "the fat black cat").
  • Users can flag certain words to be excluded (e.g., by prefixing the word with a dash: cat -fat).
  • The results are ordered by some sort of relevance metric relating to the user’s query, e.g. if one document contains the word ‘cat’ (or variants thereof) multiple times, and another document only mentions ‘cat’ once, the first document is ranked higher.

These requirements have been kept intentionally vague, as they often depend on the specifics of the application.

For the example queries below, a table along the following lines is assumed (the table and column names here are illustrative):
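CREATE TABLE posts (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title text NOT NULL,
    body  text NOT NULL
);

INSERT INTO posts (title, body) VALUES
    ('Cats', 'The fat black cat jumped over the lazy dog.'),
    ('Kittens', 'A kitten is a juvenile cat.'),
    ('Electronics', 'The cathode of a diode is marked with a stripe.');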

Were the application to perform searches using the ILIKE operator (which does case-insensitive matching), a user’s query for ‘cat’ could be rewritten in SQL to resemble ILIKE '%cat%'. This would find every document containing the string ‘cat’. However, this query will also find documents with words containing the substring ‘cat’. Some matches, such as ‘cats’, would be considered relevant to the query. Others, such as ‘cathode’, ‘catalog’, ‘deduplication’, and ‘predicate’, are probably irrelevant results.
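
Against the illustrative posts table above, such a query might look like this:

SELECT title FROM posts WHERE body ILIKE '%cat%';
-- matches the 'cat' rows, but also the irrelevant 'cathode' row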

Trying to narrow down the query with ILIKE '% cat %' doesn’t improve things either. While words such as ‘catacombs’ and ‘deduplication’ will no longer match, ‘cats’ won’t match either. Sentences starting or ending with ‘cat’ are also missed, and punctuation other than spaces causes problems too: ‘cat!’, ‘cat,’, and ‘cat.’ are all ignored.

ILIKE also doesn’t provide any assistance for matching synonyms or related words. If the application wanted to find documents containing ‘feline’, ‘kitten’, and ‘cat’ for the query ‘cat’, additional ILIKE operators would need to be OR’d together in the SQL. Some words can have many variants, which makes this a laborious approach.

Lastly, ILIKE only returns a boolean indicating if a string matches the pattern. It does not provide a scoring metric that can be used to rank results by relevance.

Similar problems exist with the regex operators. While they are more powerful than LIKE, and certain shortcomings of LIKE can be fixed with creative regex patterns, they still perform pattern matching, and have the same fundamental limitations.

To work around the limitations of these operators, you would likely end up tediously reimplementing large parts of PostgreSQL’s built-in full-text search! Instead of doing that, let’s explore what PostgreSQL has to offer.

Full Text Search Fundamentals

Getting Documents Ready for Search

PostgreSQL’s full-text search does not operate directly on documents stored using the text data type. Instead, documents need to be first converted into the tsvector data type, which is a format that is optimized for search.

To convert a document stored as text into a tsvector, the to_tsvector function should be used:
to_tsvector([ config regconfig, ] document text) → tsvector

The to_tsvector function performs a number of processing steps on the document text. At a very high level, to_tsvector first breaks the document down into words using a parser. Each word is then looked up against one or more dictionaries. A dictionary is a mapping of words to their normalized forms. These normalized forms are called lexemes. If the word matches a dictionary entry, that entry’s lexeme is added to the tsvector.

The resulting tsvector is an alphabetically sorted set of lexemes present in the source document. Each lexeme in the tsvector also includes position information. This is a list of integers representing the location of each source word. This position information is required for phrase searching and is useful for ordering matches based on proximity ranking.

The process of normalizing words into lexemes is dictionary-specific, but this almost always includes case-folding (converting UPPERCASE and Title-Case words into lowercase) and the removal of suffixes to reduce words to their root form (e.g. ‘cats’ becomes ‘cat’). Application-specific dictionaries can be created, and these can be used to customize the normalization process (e.g. to map domain-specific synonyms and phrases to a common term, such as ‘kitten’ and ‘feline’ to ‘cat’). This normalization process makes it fast and easy to find all variants of a word without needing to specify each variant in the query.

Dictionaries also typically don’t contain entries for extremely common words (such as ‘the’ and ‘and’). These common words are called stop words and aren’t terribly useful for searching. By skipping them, the generated tsvector will be smaller, which improves performance.

The choice of parser and dictionaries is called a configuration and can be specified via the config parameter to to_tsvector. This parameter is optional, but it is recommended to always specify it to ensure the right configuration is used. If omitted, the global default_text_search_config value will be used. Various default configurations exist (\dF in psql will show a list), though you will likely want to customize the configuration.

See Section 20.11.2 and Section 12.1.3 of the PostgreSQL documentation for more details on the default_text_search_config option and text search configuration in general.

Here is an example of calling to_tsvector on a fragment of text using the built-in english configuration:
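SELECT to_tsvector('english', 'Hello world! The cats and the dog jumped. Hello to the dog and the world.');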

This returns the following tsvector:
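'cat':4 'dog':7,12 'hello':1,9 'jump':8 'world':2,15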

All the lexemes have been transformed to lower case, and several words have been reduced to their root (‘cats’ → ‘cat’). The punctuation marks (‘!’ and ‘.’) and stop words (‘the’, ‘and’, and ‘to’) have been removed. Words that exist multiple times in the input (‘dog’, ‘hello’, and ‘world’) have more than one position entry.

Creating custom configurations and dictionaries is outside the scope of this article.

Writing Queries

PostgreSQL’s full-text search uses a specialized mini query language (DSL) that permits writing complex and advanced queries combining boolean logic, grouping, and phrase/proximity search. A query written in this DSL is stored using the tsquery data type. Queries stored as tsquery can be evaluated against a document’s tsvector to determine if the document matches the query.

The DSL consists of single lexemes combined using operators. Parentheses can be used to group operators and lexemes to force precedence/binding. See Section 8.11.2 of the PostgreSQL documentation for detailed documentation on the tsquery DSL.

Symbol  Operator                                      Arity   Precedence
&       Boolean AND                                   Binary  Low
|       Boolean OR                                    Binary  Lowest
!       Boolean NOT                                   Unary   Highest
<N>     Followed By (where N is a positive number)    Binary  High
<->     Followed By (equivalent to <1>)               Binary  High

Boolean AND tests that both lexemes are present in the document. A match will be found if both lexemes exist.

Boolean OR tests that one or both lexemes are present in the document. A match will be found if either lexeme exists.

Boolean NOT tests that the lexeme is not present in the document. A match will be found if the lexeme does not exist.

Followed By tests that both lexemes are present in the document and that the distance between them matches the given value. A match will be found when the two lexemes exist and are exactly N words apart.

A selection of functions exist to convert query text into a tsquery: to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery.

Each of these functions will perform normalization and stop word removal on the input query text, using the same process as to_tsvector. Parsing of the query text depends on the function.

to_tsquery

to_tsquery([ config regconfig, ] querytext text) → tsquery

to_tsquery understands the full DSL. It expects the input to be well-formed and follow the tsquery syntax. Any syntax errors will become SQL exceptions.
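
For example:

SELECT to_tsquery('english', 'cats & (fat | !lazy)');
→ 'cat' & ( 'fat' | !'lazi' )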

plainto_tsquery

plainto_tsquery([ config regconfig, ] querytext text) → tsquery

plainto_tsquery returns a query that tests if all the non-stop-word lexemes are present in the document. It parses the query text using the same process as to_tsvector. The resulting lexemes are combined using the Boolean AND operator.
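
For example:

SELECT plainto_tsquery('english', 'The Fat Cats');
→ 'fat' & 'cat'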

phraseto_tsquery

phraseto_tsquery([ config regconfig, ] querytext text) → tsquery

phraseto_tsquery returns a query that tests if the given phrase exists in the document. This is useful for “phrase search”. It behaves similarly to plainto_tsquery, except that the lexemes are combined using the Followed By operator. If stop words appear between words in the query, the generated Followed By distances account for them, for example:
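SELECT phraseto_tsquery('english', 'Jumped Over The Lazy Dogs');
→ 'jump' <3> 'lazi' <-> 'dog'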

In this example, ‘Over’ and ‘The’ are considered stop words. ‘Lazy’ is the 3rd word after ‘Jumped’, hence the Followed By operator between lexemes ‘jump’ and ‘lazi’ has a distance value of 3. ‘Dogs’ immediately follows ‘Lazy’, so the Followed By operator between lexemes ‘lazi’ and ‘dog’ has a distance of 1.

websearch_to_tsquery

websearch_to_tsquery([ config regconfig, ] querytext text) → tsquery

websearch_to_tsquery works a bit differently from the other ...to_tsquery functions. It understands an alternate syntax that is a basic variant of what is commonly available on web search engines. It combines parts of plainto_tsquery and phraseto_tsquery, while also understanding some basic operators.

Unquoted phrases will be converted to lexemes combined using Boolean AND (as in plainto_tsquery). Double-quoted phrases will be converted to lexemes combined using Followed By (as in phraseto_tsquery). or and - (a dash) are converted to Boolean OR and Boolean NOT, respectively.
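
For example:

SELECT websearch_to_tsquery('english', '"fat cat" or dog -lazy');
→ 'fat' <-> 'cat' | 'dog' & !'lazi'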

websearch_to_tsquery is likely the best built-in function to use when generating a tsquery from user input.

Note that “phrase search” queries that use Followed By operators, such as those generated by phraseto_tsquery and websearch_to_tsquery, can sometimes return unexpected results. See Inexact Phrase Search below for details.

Evaluating Queries

A tsquery can be evaluated against a tsvector, testing if the document matches the query. Evaluation is done using the @@ SQL operator. This operator takes a tsquery and a tsvector as operands (either order), and returns a Boolean indicating whether a match occurred.

tsvector @@ tsquery → boolean
tsquery @@ tsvector → boolean

For example, here a document is searched for a word using a single-lexeme tsquery:
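SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog')
       @@ to_tsquery('english', 'Dogs');
→ true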

Here, ‘Dogs’ matched ‘dog’ because ‘dogs’ in the query is normalized to the lexeme ‘dog’, which also exists in the document.

As expected, if the tsquery doesn’t match the tsvector, the @@ operator will return false:
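SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog')
       @@ to_tsquery('english', 'cats');
→ false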

Switching the query to something slightly more complex, here a document is searched for a phrase:
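SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog')
       @@ phraseto_tsquery('english', 'lazy dogs');
→ true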

And a match is found, as expected.

Storing and Indexing Documents

So far, we’ve seen how to convert documents into tsvector, query text into tsquery, and how to combine them together to determine if a document matches a query. However, the examples have only worked with a single document, and always included the document text inline.

The more typical use case for full-text search is to search an entire table’s worth of documents in one go:
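-- posts is the illustrative table from earlier
SELECT id, title
FROM posts
WHERE to_tsvector('english', body) @@ websearch_to_tsquery('english', 'cat');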

Without the appropriate indexes, this will run to_tsvector over the contents of the column on every SELECT. Running it over every row of a large table for every search will result in abysmal performance.

Adding an expression index is one way to improve performance. The details of how to do this are left as an exercise for the reader (see Section 12.2.2. Creating Indexes of the PostgreSQL documentation). However, we consider the approach outlined below, using a separate, generated tsvector column, to have a notable benefit over a pure expression-index approach: better performance. When an index reports a potential match, to_tsvector may need to be re-run over the original text to verify it, whereas a stored tsvector column can be read directly.

Since tsvector is a data type, a column of type tsvector can be created on a table, and the results of to_tsvector are stored in this column. Hence, it is typical to store the documents’ tsvector alongside the original:
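-- body_tsv is an arbitrary column name; posts is the illustrative table from earlier
ALTER TABLE posts
    ADD COLUMN body_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;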

Marking the tsvector column as GENERATED (PostgreSQL 12 and later) simplifies things for client applications when they perform an INSERT or UPDATE. For older versions of PostgreSQL without generated column support, two convenience trigger functions are included that generate the tsvector from other columns in a table (see Section 12.4.3. Triggers for Automatic Updates of the PostgreSQL documentation). Of course, a custom trigger function can be used as well. The specific pros and cons of GENERATED columns compared to BEFORE INSERT OR UPDATE triggers are outside the scope of this article. However, we consider generated columns easier to grok, and they should be used unless their limitations force the use of trigger functions.

An index can then be created over the tsvector column to speed up queries:
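-- GIN is the index type typically used for tsvector columns
CREATE INDEX posts_body_tsv_idx ON posts USING GIN (body_tsv);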

Once the tsvector column is set up, it can be used in queries like normal:
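SELECT id, title
FROM posts
WHERE body_tsv @@ websearch_to_tsquery('english', 'cat');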

And that’s all there is to it for basic text search. To summarize, there are 3 main steps involved:

  1. Convert each document into a tsvector using to_tsvector. Typically, this is done once on INSERT/UPDATE and the resulting tsvector is stored in a column alongside the original document. Indexes on this column speed up searches.
  2. Query text is converted to a tsquery using one of the ...to_tsquery functions.
  3. Evaluate the tsquery against the document tsvectors using the @@ operator.

Some Gotchas

Casting Text to tsvector/tsquery

It is possible to cast text directly to tsvector and tsquery. However, this probably won’t do what you expect!

Like most PostgreSQL data types, tsvector and tsquery have a compact binary format and a textual representation. The cast from text expects a string in the tsvector/tsquery textual representation format (see Section 8.11. Text Search Types of the PostgreSQL documentation). If the string is well-formed, the cast will succeed, but no stop word removal, word normalization, or other preprocessing steps will occur—it is assumed that the lexemes in the input have already been normalized.
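
For example, compare a direct cast against to_tsvector:

SELECT 'The Fat Cats'::tsvector;
→ 'Cats' 'Fat' 'The'

SELECT to_tsvector('english', 'The Fat Cats');
→ 'cat':3 'fat':2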

Inexact Phrase Search

Phrase searches may behave unexpectedly when the query text includes stop words.

In the following example, the built-in english configuration considers ‘Over’ and ‘The’ to be stop words, and phraseto_tsquery replaces them with a Followed By with Distance 3 (<3>) operator. As a result, this query will match any document that has the lexeme ‘jump’ followed by any two words, followed by ‘lazi’, for example:
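SELECT phraseto_tsquery('english', 'Jumped Over The Lazy');
→ 'jump' <3> 'lazi'

SELECT to_tsvector('english', 'The quick brown fox jumped past a lazy dog')
       @@ phraseto_tsquery('english', 'Jumped Over The Lazy');
→ true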

Here, “Jumped Over The Lazy” matches “jumped past a lazy dog”. The match is successful because the tsvector contains the lexemes ‘jump’ and ‘lazi’, at positions 5 and 8, respectively – a distance of 3 words.

Limitations

Limit                                                       Value                   Behaviour
Size of each individual lexeme                              2047 bytes              SQL ERROR raised
Total size of a tsvector, including lexemes and             1048575 bytes (~1MiB)   SQL ERROR raised
  positional information (binary/on-disk size; for
  details see ts_type.h)
Total number of lexemes per tsvector                        2^64                    Unknown (size limit is reached first)
Total number of position entries per lexeme per tsvector    255                     Position entries after the 255th are discarded
Maximum value of a lexeme position entry                    16383                   Values greater than 16383 are rounded to 16383
Total number of lexemes and operators (nodes) per tsquery   32768                   SQL ERROR raised

There are a number of limitations to be aware of when using PostgreSQL full-text search.

Unless you’re trying to index an article about the world’s longest English word (the chemical name for Titin, which is ~189819 letters long!), it’s unlikely that you’ll hit the lexeme size limit. However, indexing large fragments of text may hit the tsvector size limit of ~1MiB. Note that this limit is based on the size of the binary representation (used to store a tsvector in memory and on disk). A typical workaround is to split the document into chunks (e.g. for books, split by chapter).

If your application intends to use phrase search, pay careful attention to the tsvector limits around lexeme positions. No error or warning is emitted when either of these limits is reached while generating the tsvector. A tsvector can only store 255 position entries per lexeme. If a lexeme occurs more than 255 times in a document, only the first 255 positions will be stored. Each of these position entries also has a maximum value, which is capped at 16383. If a lexeme occurs after the 16383rd position in the document, its position will be recorded as 16383. Note that stop words, while not included in the tsvector, still influence lexeme positions.

Total Number of Position Entries per Lexeme

In the following example, the word ‘fox’ is repeated 300 times in the source document, yet only 255 positions are recorded in the tsvector (unnest is a built-in utility function that expands a tsvector into a set of rows so that various queries can be made against it):
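SELECT lexeme, array_length(positions, 1) AS position_count
FROM unnest(to_tsvector('english', repeat('fox ', 300)));
→ ('fox', 255)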

Maximum Value of a Lexeme Position Entry

Here, a document, consisting of ‘hello’, followed by 20000 copies of ‘the’, and ending with ‘world’, is converted to a tsvector. ‘the’ was chosen as the filler because it is a stop word. While a lexeme for ‘the’ is not present in the tsvector, it still influences the lexeme positions of ‘hello’ and ‘world’. ‘world’ is the 20002nd word in the document, yet because the maximum position value limit is reached, PostgreSQL rounds the position value to 16383:
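SELECT to_tsvector('english', 'hello ' || repeat('the ', 20000) || 'world');
→ 'hello':1 'world':16383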

Phrase Search and Large Document Limitations Example

Phrase search can work unreliably on large fragments of text because lexeme position information will either be wrong (rounded to 16383) or missing entirely, as shown in the following example:
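-- positions discarded: 'hello' appears 300 times, 'world' is at position 301
SELECT to_tsvector('english', repeat('hello ', 300) || 'world')
       @@ phraseto_tsquery('english', 'hello world') AS phrase_matches;
→ false

-- positions rounded: 20000 stop words push 'hello' and 'world' past 16383
SELECT to_tsvector('english', repeat('the ', 20000) || 'hello world')
       @@ phraseto_tsquery('english', 'hello world') AS phrase_matches;
→ false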

In both the ‘positions discarded’ and ‘positions rounded’ tests, phrase_matches is false because the Followed By operator (as generated by phraseto_tsquery) will check the tsvector and, based on the available information, incorrectly conclude that there is no instance where a ‘world’ lexeme directly follows a ‘hello’ (i.e. has a position difference of 1).

The positions discarded phrase test fails because only the first 255 position entries of the ‘hello’ lexeme are retained in the tsvector. The last recorded ‘hello’ position is 255, yet ‘world’ has a position of 301, a difference of 46.

The positions rounded phrase test fails because the 20000 instances of the ‘the’ stop word influence the position values of the ‘hello’ and ‘world’ lexemes. Both position values overflow and get rounded to 16383, and hence have a difference of 0.

Give PostgreSQL Full-Text Search a Try

The full-text search features built into PostgreSQL are more than capable of providing basic text search functionality to an application. Developers often reach first for heavyweight solutions, such as OpenSearch. If you’re already using PostgreSQL as your application data store, try the built-in text search first before adding additional dependencies to your application. It is easier to keep the indexed documents in sync with the originals when they’re stored alongside each other, and generated columns simplify this even further. Full-text search queries can be combined with other SQL operators, which enables powerful querying, filtering, and sorting capabilities, all within one database.

If you are interested in trying out PostgreSQL’s full-text search then check out Instaclustr’s managed PostgreSQL, which has full support for full-text search. You can try it without entering a credit card with our free trial today by visiting our console.