Feb 2019

There are several ways to fuzzy search in Postgres. There's the ILIKE operator for case-insensitive searching, or the pg_trgm extension for trigram matching for more precise similarity comparisons. However ILIKE is slow because it scans the entire table, and pg_trgm isn't very accurate, and neither support multiple words or phrases.

The best way to do a fuzzy search in PostgreSQL is to use the full-text search utilities. They're fast, built-in, and don't require any extra extensions. It's also much more powerful than the other options and can fuzzy search with stemming and other normalization included.

Full-text search requires tsvector columns which are generated from the text you want to search, using the to_tsvector function which takes a string and returns a tsvector object. Queries use the @@ operator to compare that column with a ts_query object generated by the to_tsquery function.

Here's an example:

-- create a table with a tsvector column
CREATE TABLE movies (title text, tsv tsvector);

-- insert sample data
INSERT INTO movies (title) VALUES
('12 Angry Men'),
('American Beauty'),
('American History X'),
('Back to the Future'),
('Casablanca'),
('City of God'),
('Fight Club'),
('Forrest Gump'),
('Gladiator'),
('Good Will Hunting'),
('Goodfellas'),
('Inception'),
('Interstellar'),
('Jurassic Park'),
('Life Is Beautiful'),
('Once Upon a Time in America'),
('Pulp Fiction'),
('Saving Private Ryan'),
('Spirited Away'),
('Star Wars: A New Hope'),
('Terminator 2: Judgment Day'),
('The Dark Knight'),
('The Godfather'),
('The Good, the Bad and the Ugly'),
('The Green Mile'),
('The Incredibles'),
('The Lion King'),
('The Lord of the Rings: The Return of the King'),
('The Matrix'),
('The Shawshank Redemption'),
('The Usual Suspects'),
('Whiplash');

-- generate the tsvector column
UPDATE movies
SET tsv = to_tsvector(title);

-- Search for "good"
SELECT title
FROM movies, to_tsquery('good') as q
WHERE tsv @@ q;

PG12+ supports generated columns which can keep the tsvector column updated automatically. A GIN index can be added to speed up the search but it's not necessary for small tables.

The to_tsquery function with a single word will perform an exact match, but it can also be used to perform a prefix search by appending :* to the end of the word, or a phrase search by using the & operator to join multiple words together. All of these can be combined to create a fuzzy search query.

SELECT title
FROM movies, to_tsquery('good:* & bad:*') as q
WHERE tsv @@ q

Instead of manually constructing this search string, it can be automatically normalized by using to_tsvector and combining the terms with some string functions:

SELECT string_agg(lexeme || ':*', ' & ' order by positions)
FROM unnest(to_tsvector('good bad'))

This SQL gets the normalized text from the to_tsvector function, splits it into individual tokens, appends :* to each, then joins them with & to create a single string. The example above takes good bad and produces good:* & bad:* which you can then use directly with to_tsquery to get fuzzy matching with the normalization intact.

Combine it all into a CTE so it can be called easily as stored procedure or function:

WITH search AS (
    SELECT to_tsquery(string_agg(lexeme || ':*', ' & ' order by positions)) AS query
    FROM unnest(to_tsvector('enter your search query here'))
)
SELECT title
FROM movies, search
WHERE movies.tsv @@ search.query

Comments

No comments posted.