As the amount and importance of information stored in email messages is increasing in people's everyday lives, searching through those messages is becoming ever more important. At the same time, mobile clients add their own restrictions on what can be done on the client side. The ever-diversifying mail client software also tests the limits of the IMAP protocol and current server implementations.
Furthermore, the IMAP protocol requires some rather complicated and expensive searching capabilities. For example, the protocol requires arbitrary substring matching. Some newer mobile clients (e.g. Apple iOS) rely on this functionality.
Without a high-performance index, Dovecot must fall back to a slow sequential search through all messages (default behavior). If storage latencies are high, this searching may not be completed in a reasonable time, or resource utilization may be too large, especially in mailboxes with large messages.
Dovecot maintains these FTS indexing engines:
Name | Description |
---|---|
Dovecot Pro FTS | Dovecot native, object storage optimized driver. Only available as part of Dovecot Pro. |
fts_solr plugin | Interface to Apache Solr; stores data remotely. |
fts_flatcurve plugin | Xapian based driver; stores data locally. |
When an FTS indexing driver is not present, searches use a slow sequential search through all message data. This is expensive in both computation and time, so it is desirable to pre-index data so that searches can be executed against this index.
There is a subtle but important distinction between searching through message headers and searching through message bodies.
Searching through message bodies (via the standard IMAP 'SEARCH TEXT/BODY' commands) makes use of the FTS indexes.
On the other hand, searching through message headers benefits from Dovecot's standard index and cache files (dovecot.index
and dovecot.index.cache
), which often contain the necessary information. It is possible to redirect header searches to FTS indexes via a configuration option (fts_search_add_missing
).
Triggers for FTS indexing are configurable: indexing can be started on demand when searching, automatically when new messages arrive, or as a batch job.
By default the FTS indexes are updated only while searching, so neither LDA/LMTP nor an IMAP 'APPEND' command updates the indexes immediately. This means that if a user has received a lot of mail since the last indexing (i.e., the last search operation), it may take a while to index all the new mails before replying to the search command. Dovecot sends periodic "* OK Indexed n% of the mailbox" updates which can be caught by client implementations to implement a progress bar.
Updating the FTS index as messages arrive makes for a more responsive user experience, especially for users who don’t search often, but have a lot of mail. On the other hand, it increases overall system load regardless of whether or not the indexes will ever be used by the user.
Dovecot splits the full text search functionality into two parts: a common tokenization library (lib-language) and a driver/indexing engine responsible for persistently storing the tokens produced by the common library.
Some of the FTS drivers do their own internal tokenization, although it's possible to configure them to use the lib-language tokenization as well.
See Tokenization for more details about configuring the tokenization.
All drivers are implemented as plugins that extend the base fts plugin's functionality.
fts
Default | [None] |
---|---|
Value | Named List Filter |
Configures the FTS driver used for fts plugin indexing. If not specified, FTS is disabled. The filter name refers to the fts_driver setting.
Example:
fts solr {
# ...
}
fts_autoindex
Default | no |
---|---|
Value | boolean |
See Also |
If enabled, mail is indexed as it is delivered or appended.
This can be overridden at the mailbox level, e.g. autoindexing can be disabled for selected mailboxes:
Example:
fts_autoindex = yes
# ...
mailbox trash {
special_use = Trash
fts_autoindex = no
}
mailbox spam {
special_use = Junk
fts_autoindex = no
}
mailbox storage/* {
fts_autoindex = no
}
fts_autoindex_max_recent_msgs
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
To exclude infrequently accessed mailboxes from automatic indexing, set this value to the maximum number of Recent flagged messages that may exist in the mailbox.
A value of 0 means this setting is ignored.
Mailboxes with more Recent flagged messages than this value will not be autoindexed, even though they get deliveries or appends. This is useful for, e.g., inactive Junk folders.
Any folder excluded from automatic indexing is still indexed if a search on it is performed.
Example:
fts_autoindex_max_recent_msgs = 999
fts_decoder_driver
Default | [None] |
---|---|
Value | string |
Allowed Values | script tika |
Optional setting. If set, decode attachments to plaintext using the selected service and index the resulting plaintext.
fts_decoder_script_socket_path
Default | [None] |
---|---|
Value | string |
Changes | |
Name of the script service used to decode the attachments.
See the decode2text.sh
script included in Dovecot for how to use this.
Example:
fts_decoder_driver = script
fts_decoder_script_socket_path = decode2text
service decode2text {
executable = script /usr/lib/dovecot/decode2text.sh
user = vmail
unix_listener decode2text {
mode = 0666
}
}
fts_decoder_tika_url
Default | [None] |
---|---|
Value | string |
Changes | |
URL for Apache Tika decoder for attachments.
Example:
fts_decoder_driver = tika
fts_decoder_tika_url = http://tikahost:9998/tika/
fts_driver
Default | [None] |
---|---|
Value | string |
Allowed Values | dovecot solr flatcurve |
Configures the FTS driver used for fts plugin indexing. The fts named list filter refers to this setting.
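Example (a sketch; the filter name itself selects the driver, and the driver-specific settings inside the block are omitted):
fts flatcurve {
# the filter name implies fts_driver = flatcurve
}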
fts_header_excludes
Default | [None] |
---|---|
Value | Boolean List |
The list of headers to include or exclude.
Includes take precedence over excludes: if a header matches both, it is indexed.
A * at the end of a header name matches anything starting with that header name.
Example:
fts_header_excludes {
Received = yes
DKIM-* = yes
X-* = yes
Comments = yes
}
fts_header_includes {
X-Spam-Status = yes
Comments = yes
}
In the example above, Received headers, all DKIM-* headers and all experimental X-* headers are excluded, with the following exceptions: Comments and X-Spam-Status are indexed anyway, as they match both the excludes and includes lists.
Example:
fts_header_excludes {
* = yes
}
fts_header_includes {
From = yes
To = yes
Cc = yes
Bcc = yes
Subject = yes
Message-ID = yes
In-* = yes
X-CustomApp-* = yes
}
In the example above, only the headers explicitly listed in the includes are indexed.
fts_header_includes
Default | [None] |
---|---|
Value | Boolean List |
See Also |
fts_message_max_size
Default | [None] |
---|---|
Value | size |
Changes | |
Maximum body size that is processed by fts. 0
means unlimited.
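Example (the limit shown is illustrative only, not a recommendation):
fts_message_max_size = 10M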
fts_search_add_missing
Default | body-search-only |
---|---|
Value | string |
Allowed Values | body-search-only yes |
Should missing mails be added to FTS indexes before search?
With body-search-only this is done only when the search query requests searching message bodies, i.e. header searches do not update the FTS index. The unindexed mails are then searched without FTS, i.e. either by getting the headers from dovecot.index.cache or by opening the emails if the headers aren't in cache. This may be a useful optimization if the user's client only uses header searches.
INFO
Only the yes
option guarantees consistent search results. Otherwise it's
possible that the search results will be different depending on whether the
search was performed via FTS index or not.
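Example (a sketch; as noted above, yes guarantees consistent results at the cost of indexing before header-only searches):
fts_search_add_missing = yes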
fts_search_read_fallback
Default | yes |
---|---|
Value | boolean |
If FTS lookup or indexing fails, fall back to searching without FTS (i.e. possibly opening all emails). This may timeout for large mailboxes and/or slow storage.
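Example (a sketch; setting this to no makes such searches fail instead of falling back):
fts_search_read_fallback = no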
fts_search_timeout
Default | 30s |
---|---|
Value | time |
When the full text search driver detects that the index isn't up-to-date, the indexer is told to index the messages and is given this much time to do so. If the time limit is reached, an error is returned indicating that the search timed out while waiting for indexing to complete:
NO [INUSE] Timeout while waiting for indexing to finish
A value of 0
means no timeout.
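Example (the value is illustrative; a longer timeout may suit large mailboxes or slow storage):
fts_search_timeout = 60s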
language
Default | <textcat dir> |
---|---|
Value | Named List Filter |
Dependencies | |
See Also |
Defines a language to be used in tokenization.
At least one language must be specified, and exactly one language must be flagged as the default using language_default = yes.
The default language is used when language recognition fails.
The filters used for stemming and stopwords are language dependent.
TIP
For better performance it's recommended to synchronize this setting with the
textcat configuration file; see textcat_config_path
.
Example:
language en {
default = yes
}
language de {
}
language_default
Default | no |
---|---|
Value | boolean |
Dependencies | |
See Also |
The language marked as default will be used when language detection cannot identify the proper language of the text being processed.
Exactly one language must be marked with this flag.
language_filter_normalizer_icu_id
Default | Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\x20] Remove |
---|---|
Value | string |
Description of the normalizing/transliterating rules to use.
See Normalizer Format for syntax.
language_filter_stopwords_dir
Default | [None] |
---|---|
Value | string |
See Also |
Path to the directory containing stopword files. The files inside the directory have names of the form stopwords_<lang>.txt.
See Languages for the list of stopword files that are currently distributed with Dovecot.
More languages can be obtained from:
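Example (the path is an assumption; point it to wherever the stopword files are installed):
language_filter_stopwords_dir = /usr/share/dovecot/stopwords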
language_filters
Default | [None] |
---|---|
Value | Boolean List |
See Also |
The list of filters to apply.
See Filter Configuration for configuration information.
language_tokenizer_address_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizer_generic_algorithm
Default | simple |
---|---|
Value | string |
Allowed Values | simple tr29 |
See Also |
Defines the method for finding word boundaries.
Value | Description |
---|---|
simple |
A faster algorithm that works well for many texts, especially those using Latin alphabets, but mishandles some corner cases. |
tr29 |
Implements a version of the Unicode Technical Report 29 word boundary lookup. It may work better with texts containing e.g. Katakana or Hebrew characters, but no single algorithm works for all existing languages. |
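Example (selecting the tr29 algorithm):
language_tokenizer_generic_algorithm = tr29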
language_tokenizer_generic_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizer_generic_wb5a
Default | no |
---|---|
Value | boolean |
See Also |
Enables Unicode TR29 rule WB5a in the tr29 tokenizer. It splits prefixed contractions from the base word, e.g. l'homme -> l and homme.
Together with a language-specific stopword list, unnecessary contractions can thus be filtered away. This is disabled by default and only works with the tr29 algorithm.
Enable by declaring:
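# WB5a only works together with the tr29 algorithm
language_tokenizer_generic_algorithm = tr29
language_tokenizer_generic_wb5a = yes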
language_tokenizer_kuromoji_icu_id
Default | Any-NFKC |
---|---|
Value | string |
See Also |
Description of the normalizing/transliterating rules to use. See Normalizer Format for syntax.
Defaults to Any-NFKC, which works well for CJK text mixed with Latin-alphabet languages. It transforms CJK characters to full-width encoding and Latin ones to half-width. The NFKC transformation is described above.
WARNING
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
language_tokenizer_kuromoji_split_compounds
Default | yes |
---|---|
Value | boolean |
See Also | |
Advanced setting; this should not normally be changed.
This setting enables search mode
in the Atilika Kuromoji library. The
setting defaults to enabled and should not be changed unless there is a
compelling reason.
WARNING
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
language_tokenizer_kuromoji_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizers
Default | generic email-address |
---|---|
Value | Boolean List |
See Also |
The list of tokenizers to use.
See Tokenizer Configuration for configuration information.
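Example (a sketch; the kuromoji tokenizer name is inferred from the language_tokenizer_kuromoji_* settings and may only be available in Dovecot Pro):
language_tokenizers = generic email-address kuromoji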
textcat_config_path
Default | <textcat dir> |
---|---|
Value | string |
See Also |
Path to the textcat/exttextcat configuration file, which lists the supported languages.
It is recommended to change this to point to a minimal configuration that supports only the languages listed in language.
Doing this improves language detection performance during indexing and also makes the detection more accurate.
Example:
textcat_config_path = /usr/share/libexttextcat/fpdb.conf
Missing mails are always added to the FTS indexes when an IMAP SEARCH command attempts to access them.
Automatic FTS indexing can also be done during mail delivery, IMAP APPEND and other ways of adding mails to mailboxes using fts_autoindex
.
Indexing can also be triggered manually:
doveadm index -u user@domain -q INBOX
When FTS lookup or indexing fails, Dovecot falls back on using the built-in search, which has no indexes for mail bodies.
This could end up opening all the mails in the mailbox, which often isn't wanted.
To disable this fallback, set fts_search_read_fallback = no.
Attachments can be indexed either via a script that translates the attachment to UTF-8 plaintext or Apache Tika server.
Dovecot keeps track of indexed messages in the dovecot.index files
. If this becomes out of sync with the actual FTS indexes (either too many or too few mails), you'll need to do a rescan and then index missing mails:
doveadm fts rescan -u user@domain
doveadm index -u user@domain -q '*'
Note that currently most FTS drivers don't implement the rescan. Instead, they simply delete all the FTS indexes. This may change in future versions.
Language names are given as ISO 639-1 alpha 2 codes.
Stemming support indicates whether the snowball
filter can be used.
Stopwords support indicates whether a stopwords file is distributed with Dovecot.
Currently supported languages:
Language Code | Language | Stemming | Stopwords |
---|---|---|---|
da | Danish | Yes | Yes |
de | German | Yes | Yes |
en | English | Yes | Yes |
es | Spanish | Yes | Yes |
fi | Finnish | Yes | Yes |
fr | French | Yes | Yes |
it | Italian | Yes | Yes |
ja | Japanese (Requires Dovecot Pro) | No | No |
nl | Dutch | Yes | Yes |
no | Norwegian (Bokmal & Nynorsk detected) | Yes | Yes |
pt | Portuguese | Yes | Yes |
ro | Romanian | Yes | Yes |
ru | Russian | Yes | Yes |
sv | Swedish | Yes | Yes |
tr | Turkish | Yes | Yes |
Dovecot contains tokenization support that can be used by FTS drivers.
The lib-language tokenization library works in the following way:
1. Language detection: When indexing, Dovecot attempts to detect the language of the text. If detection fails, the default language is used. When searching, the search is done using all the configured languages.
2. Tokenization: The text is split into tokens (individual words).
3. Filtering: Tokens are normalized, e.g. lowercased or stemmed.
4. Stopwords: A configurable list of words that are not indexed.
The language setting declares the languages that need to be detected.
At least one language must be listed.
The language marked with language_default = yes is used in case detection fails.
Each added language makes the indexing and searching slightly slower, so it's recommended not to add languages unnecessarily. Language detection performance can be improved by limiting the number of languages available to textcat; see textcat_config_path.
Example:
language en {
default = yes
}
language de {
}
The filters and tokenizers are created in the order they are declared in their respective settings in the configuration file. They form a chain, where the first filter or tokenizer is the parent or grandparent of the rest. The direction of the data flow needs some special attention.
In filters, the data flows from parent to child, so tokens are first passed to the grandparent of all filters and then further down the chain. For some filtering chains the order is important. E.g. the snowball stemmer wants all input in lower case, so the filter lower casing the tokens will need to be listed before it.
In tokenizers, however, the data flows from child to parent. This means that the tokenizer listed last gets the data first.
So, for filters data flows "left to right" through the filters listed in the configuration. In tokenizers the order is "right to left".
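As an illustration (a sketch; the filter list here is just an example, while the tokenizer list is the default):
# filters: tokens flow left to right (normalizer-icu -> snowball -> stopwords)
language_filters = normalizer-icu snowball stopwords
# tokenizers: data flows right to left (email-address sees the input before generic)
language_tokenizers = generic email-address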
Base64 sequences are looked for in the tokenization buffer and skipped when detected.
A base64 sequence is detected by:
- a character in the leader-characters set,
- followed by a run of characters in the base64-characters set, at least minimum-run-length long,
- followed by a character in the trailer-characters set,
where:
- leader-characters are: [ \t\r\n=:;?]
- base64-characters are: [0-9A-Za-z/+]
- trailer-characters are: [ \t\r\n=:;?]
- minimum-run-length is: 50
- minimum-run-count is: 1
Thus, even single 50-character runs of characters in the base64 set are recognized as base64 and ignored in indexing.
If a base64 sequence happens to be split across different chunks of data, part of it might not be detected as base64. In this case, the undetected base64 fragment is still indexed. However, this happens rarely enough that it does not significantly impact the quality of the filter.
So far the above rule seems to give good results in base64 indexing avoidance. It also performs well in removing base64 fragments inside headers, like ARC-Seal, DKIM-Signature, X-SG-EID, X-SG-ID, including header-encoded parts (e.g. =?us-ascii?Q?...?=
sequences).
Filters affect how data is indexed.
They are configured through language_filters
.
Example:
language_filters = normalizer-icu snowball stopwords
language en {
language_filters = lowercase snowball english-possessive stopwords
}
Available filters:
lowercase
Changes all text to lower case. Supports UTF-8 when Dovecot is compiled with libicu and the library is installed; otherwise only ASCII characters are lowercased.
stopwords
Filter certain common and short words, which are usually useless for searching.
WARNING
Using stopwords with multiple languages configured WILL cause some searches to fail. The recommended solution is to NOT use the stopword filter when multiple languages are present in the configuration.
snowball
Stemming tries to convert words to a common base form. A simple example is converting cars
to car
(in English).
This stemmer is based on the Snowball stemmer library.
normalizer-icu
Normalize text using libicu. This is potentially very resource intensive.
WARNING
There is a caveat for the Norwegian language:
The default normalizer filter does not modify U+00F8 (Latin Small Letter O with Stroke). In some configurations it might be desirable to rewrite it to, e.g., o. The same goes for the upper-case version. This can be done by passing a modified language_filter_normalizer_icu_id setting to the normalizer filter.
Similar cases can exist for other languages as well.
language_filter_normalizer_icu_id
Default | Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\x20] Remove |
---|---|
Value | string |
Description of the normalizing/transliterating rules to use.
See Normalizer Format for syntax.
english-possessive
Remove trailing 's
from English possessive form tokens. Any trailing single '
characters are already removed by tokenizing, whether this filter is used or not.
The snowball
filter also removes possessive suffixes from English, so if using snowball
this filter is not needed.
TIP
snowball
likely produces better results, so this filter is advisable only when snowball
is not available or cannot be used due to extreme CPU performance requirements.
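A sketch of a filter chain that relies on english-possessive instead of snowball (assuming snowball is unavailable; the chain otherwise mirrors the earlier language en example):
language en {
language_filters = lowercase english-possessive stopwords
}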
contractions
Removes certain contractions that can prefix words. The idea is to only index the part of the token that conveys the core meaning.
Only works with French, so the language of the input needs to be recognized by textcat as French.
It filters qu', c', d', l', m', n', s' and t'.
Do not use at the same time as the generic tokenizer with both the tr29 algorithm and WB5a (language_tokenizer_generic_wb5a) enabled, as they would both split the contractions.
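A sketch of a French-specific chain using this filter (the exact filter order is an assumption):
language fr {
language_filters = lowercase contractions stopwords
}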
Tokenizers affect how input data is parsed.
Available tokenizers:
generic
Input data, such as email text and headers, needs to be divided into words suitable for indexing and searching. The generic tokenizer does this.
language_tokenizer_generic_algorithm
Default | simple |
---|---|
Value | string |
Allowed Values | simple tr29 |
See Also |
Defines the method for finding word boundaries.
Value | Description |
---|---|
simple |
A faster algorithm that works well for many texts, especially those using Latin alphabets, but mishandles some corner cases. |
tr29 |
Implements a version of the Unicode Technical Report 29 word boundary lookup. It may work better with texts containing e.g. Katakana or Hebrew characters, but no single algorithm works for all existing languages. |
language_tokenizer_generic_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizer_generic_wb5a
Default | no |
---|---|
Value | boolean |
See Also |
Enables Unicode TR29 rule WB5a in the tr29 tokenizer. It splits prefixed contractions from the base word, e.g. l'homme -> l and homme.
Together with a language-specific stopword list, unnecessary contractions can thus be filtered away. This is disabled by default and only works with the tr29 algorithm.
Enable by declaring:
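# WB5a only works together with the tr29 algorithm
language_tokenizer_generic_algorithm = tr29
language_tokenizer_generic_wb5a = yes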
email-address
This tokenizer preserves email addresses as complete search tokens by bypassing the generic tokenizer when it finds an address. It only works as intended if it is listed after the other tokenizers.
language_tokenizer_address_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of token, before an arbitrary cut off is made.