The tsearch2 Reference

Brandon Craig Rhodes
30 June 2003

This ancient document is now of only historical interest. The prototype "tsearch2" module that it discusses is now distributed with PostgreSQL itself, and you can read about its modern form in the chapter of the PostgreSQL manual on "Full Text Search". I enjoyed working on the project; my (small) contribution was, after the full-text search operators were already working, to help document them and to suggest simpler and more uniform names for many of the built-in types and functions.

This Reference documents the user types and functions of the tsearch2 module for PostgreSQL. An introduction to the module is provided by the tsearch2 Guide, a companion document to this one. You can retrieve a beta copy of the tsearch2 module from the GiST for PostgreSQL page — look under the section entitled Development History for the current version.

Vectors and Queries

Vectors and queries both store lexemes, but for different purposes. A tsvector stores the lexemes of the words that are parsed out of a document, and can also remember the position of each word. A tsquery specifies a boolean condition among lexemes.

Any of the following functions with a configuration argument can use either an integer id or textual ts_name to select a configuration; if the argument is omitted, then the current configuration is used. For more information on the current configuration, read the next section on Configurations.

Vector Operations

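to_tsvector( [configuration,] document TEXT) RETURNS tsvector: Parses the document into tokens, reduces each token to a lexeme using the current or specified configuration, and returns a tsvector listing the lexemes together with their positions in the document.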

Query Operations

to_tsquery( [configuration,] querytext text) RETURNS tsquery: Parses a query, which should be single words separated by the boolean operators “&” (and), “|” (or), and “!” (not), which can be grouped using parentheses. Each word is reduced to a lexeme using the current or specified configuration.
querytree(query tsquery) RETURNS text: Returns a textual representation of the given query.
text::tsquery RETURNS tsquery: Casting text directly to a tsquery lets you inject lexemes into a query with whatever positions and position weight flags you choose to specify. The text should be formatted the way a query is printed in the output of a SELECT. See the Casting section in the Guide for details.
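
To make the idea of "a boolean condition among lexemes" concrete, here is a minimal Python sketch of how such a query could be evaluated against the set of lexemes in one document. It is not the module's implementation. The names tokenize and matches are hypothetical, the precedence (“!” binds tightest, then “&”, then “|”) is an assumption, and words are compared as-is rather than being reduced to lexemes by a configuration.

```python
import re

# One token is either a single operator/parenthesis or a run of word characters.
TOKEN = re.compile(r"\s*([&|!()]|[^\s&|!()]+)")

def tokenize(querytext):
    """Split query text into operator, parenthesis, and word tokens."""
    querytext = querytext.strip()
    tokens, pos = [], 0
    while pos < len(querytext):
        match = TOKEN.match(querytext, pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def matches(querytext, lexemes):
    """Return True if the set of lexemes satisfies the boolean query."""
    tokens = tokenize(querytext)
    index = 0

    def peek():
        return tokens[index] if index < len(tokens) else None

    def advance():
        nonlocal index
        token = tokens[index]
        index += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "|":
            advance()
            right = parse_and()  # always parse the right side to consume it
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            advance()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() == "!":
            advance()
            return not parse_not()
        if peek() == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return value
        if peek() is None or peek() in ("&", "|", ")"):
            raise ValueError("expected a word")
        return advance() in lexemes

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return result
```

For example, matches("fat & (rat | cat)", {"fat", "cat"}) is true, while matches("fat & !cat", {"fat", "cat"}) is false because the document contains the excluded lexeme.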

Configurations

A configuration specifies all of the equipment necessary to transform a document into a tsvector: the parser that breaks its text into tokens, and the dictionaries which then transform each token into a lexeme. Every call to to_tsvector() (described above) uses a configuration to perform its processing. Three configurations come with tsearch2:

default — Indexes words and numbers, using the en_stem English Snowball stemmer for Latin-alphabet words and the simple dictionary for all others.
default_russian — Indexes words and numbers, using the en_stem English Snowball stemmer for Latin-alphabet words and the ru_stem Russian Snowball dictionary for all others.
simple — Processes both words and numbers with the simple dictionary, which neither discards any stop words nor alters them.

The tsearch2 module initially chooses your current configuration by looking for your current locale in the locale field of the pg_ts_cfg table described below. You can manipulate the current configuration yourself with these functions:

set_curcfg( id INT | ts_name TEXT ) RETURNS VOID: Set the current configuration used by to_tsvector and to_tsquery.
show_curcfg() RETURNS INT4: Returns the integer id of the current configuration.

Each configuration is defined by a record in the pg_ts_cfg table:

create table pg_ts_cfg (
	id		int not null primary key,
	ts_name		text not null,
	prs_name	text not null,
	locale		text
);

The id and ts_name are unique values which identify the configuration; the prs_name specifies which parser the configuration uses. Once this parser has split document text into tokens, the type of each resulting token — or, more specifically, the type's lex_alias as specified in the parser's lexem_type() table — is searched for together with the configuration's ts_name in the pg_ts_cfgmap table:

create table pg_ts_cfgmap (
	ts_name		text not null,
	lex_alias	text not null,
	dict_name	text[],
	primary key (ts_name,lex_alias)
);

Those tokens whose types are not listed are discarded. The remaining tokens are assigned integer positions, starting with 1 for the first token in the document, and turned into lexemes with the help of the dictionaries whose names are given in the dict_name array for their type. These dictionaries are tried in order: the first one to return a lexeme for the token is used, and the token is discarded if no dictionary returns a lexeme for it.
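
The steps described above can be sketched in Python as follows. This is a toy model rather than the module's code: toy_parse, the token type aliases word, number and other, the synonyms dictionary, and the CFGMAP mapping (standing in for pg_ts_cfgmap rows) are all hypothetical. It also assumes, following the order of the description, that a token dropped by every dictionary still uses up its position number.

```python
import re

TOKEN = re.compile(
    r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<other>[^A-Za-z0-9]+)"
)

def toy_parse(document):
    """Hypothetical parser: yield (type_alias, text) for each token."""
    for match in TOKEN.finditer(document):
        yield (match.lastgroup, match.group())

def synonyms(token):
    """Hypothetical dictionary: knows a few words, returns None otherwise."""
    return {"cats": "cat", "ran": "run"}.get(token.lower())

def simple(token):
    """Stand-in for the simple dictionary: fold to lowercase."""
    return token.lower()

# Plays the role of pg_ts_cfgmap for one configuration:
# token type alias -> ordered list of dictionaries to try.
CFGMAP = {
    "word": [synonyms, simple],
    "number": [simple],
}

def to_vector(document, cfgmap=CFGMAP):
    """Return {lexeme: [positions]} following the steps described above."""
    # Step 1: discard tokens whose type is not listed in the map.
    kept = [(alias, text) for alias, text in toy_parse(document)
            if alias in cfgmap]
    vector = {}
    # Step 2: number the remaining tokens, starting with 1.
    for position, (alias, text) in enumerate(kept, start=1):
        # Step 3: try each dictionary in order; the first lexeme is used.
        for dictionary in cfgmap[alias]:
            lexeme = dictionary(text)
            if lexeme is not None:
                vector.setdefault(lexeme, []).append(position)
                break
        # If no dictionary returned a lexeme, the token is simply dropped.
    return vector
```

With the map above, to_vector("Cats ran 42 times") records cat at position 1, run at 2, 42 at 3 and times at 4; the whitespace tokens never receive positions because their type is not listed.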

Parsers

Each parser is defined by a record in the pg_ts_parser table:

create table pg_ts_parser (
	prs_id		int not null primary key,
	prs_name	text not null,
	prs_start	oid not null,
	prs_getlexem	oid not null,
	prs_end		oid not null,
	prs_headline	oid not null,
	prs_lextype	oid not null,
	prs_comment	text
);

The prs_id and prs_name uniquely identify the parser, while prs_comment usually describes its name and version for the reference of users. The other items identify the low-level functions which make the parser operate, and are only of interest to someone writing a parser of their own.

The tsearch2 module comes with one parser, named default, which is suitable for parsing most plain text and HTML documents.

Each parser argument below must designate a parser with either an integer prs_id or a textual prs_name; the current parser is used when this argument is omitted.

CREATE FUNCTION set_curprs(parser) RETURNS VOID: Selects a current parser which will be used when any of the following functions are called without a parser as an argument.
CREATE FUNCTION lexem_type( [ parser ] ) RETURNS SETOF lexemtype: Returns a table which defines and describes each kind of token the parser may produce as output. For each token type the table gives the lexid with which the parser will label each token of that type, the alias which names the token type, and a short description descr for the user to read.
CREATE FUNCTION parse( [ parser, ] document TEXT ) RETURNS SETOF lexemtype: Parses the given document and returns a series of records, one for each token produced by parsing. Each token includes a lexid giving its type and a lexem which gives its content.

Dictionaries

Dictionaries take textual tokens as input, usually those produced by a parser, and return lexemes which are usually some reduced form of the token. Among the dictionaries which come installed with tsearch2 are:

simple simply folds uppercase letters to lowercase before returning the word.
en_stem runs an English Snowball stemmer on each word, attempting to reduce the various forms of a verb or noun to a single recognizable form.
ru_stem runs a Russian Snowball stemmer on each word.

Each dictionary is defined by an entry in the pg_ts_dict table:

CREATE TABLE pg_ts_dict (
	dict_id		int not null primary key,
	dict_name	text not null,
	dict_init	oid,
	dict_initoption	text,
	dict_lemmatize	oid not null,
	dict_comment	text
);

The dict_id and dict_name serve as unique identifiers for the dictionary. The meaning of the dict_initoption varies among dictionaries, but for the built-in Snowball dictionaries it specifies a file from which stop words should be read. The dict_comment is a human-readable description of the dictionary. The other fields are internal function identifiers useful only to developers trying to implement their own dictionaries.

The argument named dictionary in each of the following functions should be either an integer dict_id or a textual dict_name identifying which dictionary should be used for the operation; if omitted then the current dictionary is used.

CREATE FUNCTION set_curdict(dictionary) RETURNS VOID: Selects a current dictionary for use by functions that do not select a dictionary explicitly.
CREATE FUNCTION lexize( [ dictionary, ] word text) RETURNS TEXT[]: Reduces a single word to a lexeme. Note that lexemes are arrays of zero or more strings, since in some languages there might be several base words from which an inflected form could arise.

Ranking

Ranking attempts to measure how relevant documents are to particular queries by inspecting the number of times each search word appears in the document, and whether different search terms occur near each other. Note that this information is only available in unstripped vectors — ranking functions will only return a useful result for a tsvector which still has position information!

Both of these ranking functions take an integer normalization option that specifies whether a document's length should impact its rank. This is often desirable, since a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. The option can have the values:

0 (the default) ignores document length.
1 divides the rank by the logarithm of the length.
2 divides the rank by the length itself.
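
As a rough illustration of the three options, the following Python sketch applies each one to a raw rank. The function name normalize is hypothetical. The text does not say which logarithm is used or how very short documents are handled, so the natural logarithm and the requirement of a length of at least 2 are assumptions made only to keep the arithmetic well defined.

```python
import math

def normalize(rank, length, normalization=0):
    """Adjust a raw rank for document length (illustrative sketch only)."""
    if length < 2:
        # Assumption: avoid dividing by log(1) == 0 or by a zero length.
        raise ValueError("length must be at least 2 in this sketch")
    if normalization == 0:
        return rank                      # ignore document length
    if normalization == 1:
        return rank / math.log(length)   # assumed: natural logarithm
    if normalization == 2:
        return rank / length             # divide by the length itself
    raise ValueError("normalization must be 0, 1 or 2")
```

Given the same raw rank, options 1 and 2 both score a hundred-word document above a thousand-word one, and option 2 penalizes length more sharply than option 1.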

The two ranking functions currently available are:

CREATE FUNCTION rank( [ weights float4[], ] vector tsvector, query tsquery, [ normalization int4 ] ) RETURNS float4

This is the ranking function from the old version of OpenFTS, and offers the ability to weight word instances more heavily depending on how you have classified them. The weights specify how heavily to weight each category of word:

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

Often weights are used to mark words from special areas of the document, like the title or an initial abstract, and make them more or less important than words in the document body.

CREATE FUNCTION rank_cd( [ K int4, ] vector tsvector, query tsquery, [ normalization int4 ] ) RETURNS float4

This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's “Relevance Ranking for One to Three Term Queries” in the 1999 Information Processing and Management. The value K is one of the values from their formula, and defaults to K=4. The examples in their paper use K=16; roughly speaking, K states how far apart two search terms can fall before the formula begins penalizing them for lack of proximity.

Headlines

CREATE FUNCTION headline( [ id int4, | ts_name text, ] document text, query tsquery, [ options text ] ) RETURNS text

Every form of the headline() function accepts a document along with a query, and returns one or more ellipsis-separated excerpts from the document in which terms from the query are highlighted. The configuration with which to parse the document can be specified by either its id or ts_name; if none is specified then the current configuration is used instead.

An options string, if provided, should be a comma-separated list of one or more ‘option=value’ pairs. The available options are:

StartSel, StopSel: the strings with which query words appearing in the document should be delimited to distinguish them from other excerpted words.
MaxWords, MinWords: limits on the longest and shortest headlines you will accept.
ShortWord: prevents your headline from beginning or ending with a word which has this many characters or fewer. The default value of 3 should eliminate most English conjunctions and articles.

Any unspecified options receive these defaults:

StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3
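
The way an options string combines with these defaults can be sketched in Python as below. The function parse_headline_options is a hypothetical illustration, not part of tsearch2. It assumes that option names are matched case-sensitively and that values contain no commas, both of which may differ from what the module itself accepts.

```python
# Defaults as listed above.
DEFAULTS = {
    "StartSel": "<b>",
    "StopSel": "</b>",
    "MaxWords": 35,
    "MinWords": 15,
    "ShortWord": 3,
}

def parse_headline_options(options=""):
    """Merge a comma-separated 'option=value' string over the defaults."""
    settings = dict(DEFAULTS)
    for pair in options.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate an empty string or a stray comma
        name, separator, value = pair.partition("=")
        name = name.strip()
        if not separator or name not in DEFAULTS:
            raise ValueError("unrecognized option: " + pair)
        value = value.strip()
        # Numeric options are stored as integers, delimiters as strings.
        if isinstance(DEFAULTS[name], int):
            settings[name] = int(value)
        else:
            settings[name] = value
    return settings
```

For instance, parse_headline_options("StartSel=<i>, StopSel=</i>, MaxWords=20") changes the two delimiters and the upper word limit while MinWords and ShortWord keep their defaults of 15 and 3.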