fts plugin

See also

See FTS (Full Text Search) for an overview of the Dovecot Full Text Search (FTS) system.

FTS languages

Language names are given as ISO 639-1 alpha 2 codes.

Stemming support indicates whether the snowball filter can be used.

Stopwords support indicates whether a stopwords file is distributed with Dovecot.

Currently supported languages:

Language Code  Language                                       Stemming  Stopwords
da             Danish                                         Yes       Yes
de             German                                         Yes       Yes
en             English                                        Yes       Yes
es             Spanish                                        Yes       Yes
fi             Finnish                                        Yes       Yes
fr             French                                         Yes       Yes
it             Italian                                        Yes       Yes
ja             Japanese (Requires separate Kuromoji license)  No        No
nl             Dutch                                          Yes       Yes
no             Norwegian (Bokmal & Nynorsk detected)          Yes       Yes
pt             Portuguese                                     Yes       Yes
ro             Romanian                                       Yes       Yes
ru             Russian                                        Yes       Yes
sv             Swedish                                        Yes       Yes
tr             Turkish                                        Yes       Yes

See also FTS Tokenization

Settings

fts_autoindex

If enabled, index mail as it is delivered or appended.
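
Example (a minimal sketch; it assumes an FTS backend is already configured elsewhere):

plugin {
  fts_autoindex = yes
}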

fts_autoindex_exclude
  • Default: <empty>

  • Values: String

List a mailbox in this setting to exclude it from automatic indexing.

To exclude additional mailboxes, append sequential numbers to the setting name (fts_autoindex_exclude2, fts_autoindex_exclude3, and so on).

Use either mailbox names or special-use flags (e.g. \Trash).

For example:

plugin {
  fts_autoindex_exclude = \Junk
  fts_autoindex_exclude2 = \Trash
  fts_autoindex_exclude3 = External Accounts/*
}

See also

fts_autoindex

fts_autoindex_max_recent_msgs

New in version v2.2.9.

To exclude infrequently accessed mailboxes from automatic indexing, set this value to the maximum number of \Recent-flagged messages a mailbox may contain and still be autoindexed.

A value of 0 means to ignore this setting.

Mailboxes with more \Recent-flagged messages than this value are not autoindexed, even if they receive deliveries or appends. This is useful for, e.g., inactive Junk folders.

Folders excluded from automatic indexing are still indexed when a search is performed on them.

Example:

plugin {
  fts_autoindex_max_recent_msgs = 999
}

See also

fts_autoindex

fts_decoder
  • Default: <empty>

  • Values: String

New in version v2.1.

Decode attachments to plaintext using this service and index the resulting plaintext.

See the decode2text.sh script included in Dovecot for how to use this.

Example:

plugin {
  fts_decoder = decode2text
}

service decode2text {
  executable = script /usr/lib/dovecot/decode2text.sh
  user = vmail
  unix_listener decode2text {
    mode = 0666
  }
}

fts_enforced
  • Default: no

  • Values: yes, no, body

New in version v2.2.19.

Whether FTS indexes are required to perform a search. This controls how header searches are handled and what happens when an error occurs.

When searching message bodies, Dovecot always attempts to update the FTS index to contain any missing mails before the search is performed.

no

Searching message headers won’t update FTS indexes. For header searches, the FTS indexes are used for the mails that are already indexed, while unindexed mails are searched via dovecot.index.cache (or by opening the emails if the headers aren’t in the cache).

If FTS lookup or indexing fails, both header and body searches fall back to searching without FTS (i.e. possibly opening all emails). This may time out for large mailboxes and/or slow storage.

yes

Searching from message headers updates FTS indexes, the same way as searching from body does. If FTS lookup or indexing fails, the search fails.

body

Searching from message headers won’t update FTS indexes (the same behavior as with no). If FTS lookup or indexing fails, the search fails.

New in version v2.3.7.

Note that only the yes value guarantees consistent search results. In other cases it’s possible that the search results will be different depending on whether the search was performed via FTS index or not.
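
Example (a minimal sketch, assuming the FTS backend is reliable enough that a failed search is preferable to a slow non-FTS fallback):

plugin {
  fts_enforced = yes
}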

fts_filters
  • Default: <empty>

  • Values: String

The list of filters to apply.

Language specific filter chains can be specified with fts_filters_<lang> (e.g. fts_filters_en).

Available filters:

lowercase

Change all text to lower case. UTF-8 is supported when Dovecot is compiled with libicu and the library is installed; otherwise only ASCII characters are lowercased.

stopwords

Filter certain common and short words, which are usually useless for searching.

Settings:

stopwords_dir

Path to the directory containing stopword files. Stopword files are looked up in <path>/stopwords_<lang>.txt.

See FTS languages for the list of stopword files that are currently distributed with Dovecot.

More languages can be obtained from Apache Lucene, Snowball stemmer, or https://github.com/stopwords-iso/.

snowball

Stemming tries to convert words to a common base form. A simple example is converting “cars” to “car” (in English).

This stemmer is based on the Snowball stemmer library.

See FTS languages

normalizer-icu

Normalize text using libicu. This is potentially very resource intensive.

Note

Caveat for Norwegian: The default normalizer filter does not modify U+00F8 (Latin Small Letter O with Stroke). In some configurations it might be desirable to rewrite it to e.g. o. Same goes for the upper case version. This can be done by passing a modified id setting to the normalizer filter. Similar cases can exist for other languages as well.

Settings:

id

Description of the normalizing/transliterating rules to use.

  • See Normalizer Format for syntax.

  • Defaults to Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\\x20] Remove

english-possessive

Remove trailing 's from English possessive form tokens. Any trailing single ' characters are already removed by tokenizing, whether this filter is used or not.

The snowball filter also removes possessive suffixes from English, so this filter is not needed when snowball is in use. snowball likely produces better results, so this filter is advisable only when snowball is not available or cannot be used due to extreme CPU performance requirements.

contractions

Removes certain contractions that can prefix words. The idea is to only index the part of the token that conveys the core meaning.

Only works with French, so the language of the input needs to be recognized by textcat as French.

It filters “qu’”, “c’”, “d’”, “l’”, “m’”, “n’”, “s’” and “t’”.

Do not use this filter together with the generic tokenizer configured with algorithm=tr29 wb5a=yes.

Example:

plugin {
  fts_filters = normalizer-icu snowball stopwords
  fts_filters_en = lowercase snowball english-possessive stopwords
}

See also

FTS Tokenization

fts_header_excludes
  • Default: <empty>

  • Values: String

New in version v2.3.18.

The lists of headers to include (fts_header_includes) or exclude (fts_header_excludes) when indexing.

  • The default is the pre-existing behavior, i.e. index all headers.

  • includes take precedence over excludes: if a header matches both, it is indexed.

  • The terms are case insensitive.

  • An asterisk * at the end of a header name matches anything starting with that header name.

  • The asterisk can only be used at the end of the header name. Prefix and infix usage of asterisk are not supported.

Example:

plugin {
  fts_header_excludes = Received DKIM-* X-* Comments
  fts_header_includes = X-Spam-Status Comments
}

  • Received headers, all DKIM- headers and all X- experimental headers are excluded, with the following exceptions:

    • Comments and X-Spam-Status are indexed anyway, as they match both the excludes and includes lists.

  • All other headers are indexed.

Example:

plugin {
  fts_header_excludes = *
  fts_header_includes = From To Cc Bcc Subject Message-ID In-* X-CustomApp-*
}

  • No headers are indexed, except those specified in the includes.

fts_header_includes
  • Default: <empty>

  • Values: String

New in version v2.3.18.

fts_index_timeout

When the full text search backend detects that the index isn’t up to date, the indexer is told to index the messages and is given this much time to do so. If the time limit is reached, the search fails with an error indicating that it timed out while waiting for indexing to complete:

NO [INUSE] Timeout while waiting for indexing to finish

A value of 0 means no timeout.
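
Example (a sketch only; the 60s value and its time-unit suffix are assumptions and should be checked against the installed Dovecot version):

plugin {
  fts_index_timeout = 60s
}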

fts_language_config
  • Default: <textcat dir>

  • Values: String

Path to the textcat/exttextcat configuration file, which lists the supported languages.

It is recommended to point this setting at a minimal configuration file that supports only the languages listed in fts_languages.

Doing this improves language detection performance during indexing and also makes the detection more accurate.

Example:

plugin {
  fts_language_config = /usr/share/libexttextcat/fpdb.conf
}

See also

fts_languages

fts_languages
  • Default: <empty>

  • Values: String

A space-separated list of languages that the full text search should detect.

At least one language must be specified.

The language listed first is the default and is used when language recognition fails.

The filters used for stemming and stopwords are language dependent.

Note

For better performance it’s recommended to synchronize this setting with the textcat configuration file; see fts_language_config.

Example:

plugin {
  fts_languages = en de
}

fts_tika
  • Default: <empty>

  • Values: String

New in version v2.2.13.

URL for Apache Tika decoder for attachments.

Example:

plugin {
  fts_tika = http://tikahost:9998/tika/
}

fts_tokenizers
  • Default: generic email-address

  • Values: String

The list of tokenizers to use.

This setting can be overridden for specific languages by using fts_tokenizers_<lang> (e.g. fts_tokenizers_en).

List of tokenizers:

generic

Input data, such as email text and headers, need to be divided into words suitable for indexing and searching. The generic tokenizer does this.

Settings:

maxlen

Maximum length of token, before an arbitrary cut off is made. Defaults to FTS_DEFAULT_TOKEN_MAX_LENGTH. The default is probably OK.

algorithm

Accepted values are simple or tr29. This defines the method used to find word boundaries. simple is faster and works for many texts, especially those using Latin alphabets, but leaves corner cases unhandled. tr29 implements a version of the Unicode Technical Report 29 word boundary rules. It might work better with, e.g., texts containing Katakana or Hebrew characters, but no single algorithm can handle all existing languages. The default is simple.

wb5a

Enables Unicode TR29 rule WB5a in the tr29 algorithm. It splits a prefixing contraction from the base word, e.g. “l’homme” → “l” “homme”. Together with a language-specific stopword list, unnecessary contractions can thus be filtered away. This is disabled by default and only works with the tr29 algorithm. Enable it with fts_tokenizer_generic = algorithm=tr29 wb5a=yes.

email-address

This tokenizer preserves email addresses as complete search tokens, by bypassing the generic tokenizer, when it finds an address. It will only work as intended if it is listed after other tokenizers.

kuromoji

Important

The kuromoji tokenizer is a part of OX Dovecot Pro only.

This tokenizer is used for Japanese text. It uses the Atilika Kuromoji tokenizer library and performs NFKC normalization before tokenization. NFKC normalization unifies half-width and full-width characters, for example:

  • transforms half-width Katakana letters to full-width

  • transforms full-width numbers to half-width

  • transforms special compatibility characters (e.g. ① is transformed to 1, and ㍻ to 平成)

Settings:

maxlen

Maximum length of token, before an arbitrary cut off is made. The default value for the kuromoji tokenizer is 1024.

kuromoji_split_compounds

This setting enables “search mode” in the Atilika Kuromoji library. It defaults to enabled (i.e. 1) and should not be changed unless there is a compelling reason. To disable it, set the value to 0.

Note

If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.

id

Description of the normalizing/transliterating rules to use. See Normalizer Format for syntax. Defaults to Any-NFKC, which works well for CJK text mixed with Latin-alphabet languages. It transforms CJK characters to full-width encoding and Latin ones to half-width. The NFKC transformation is described above.

Note

If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.

We use the predefined set of stopwords recommended by Atilika. These stopwords were generated by tokenizing Japanese Wikipedia and have been reviewed by us. The same set is included in the Apache Lucene and Solr projects and is used by many Japanese search implementations.
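
Example (a sketch of the default tokenizer chain with the generic tokenizer's algorithm set explicitly; the kuromoji tokenizer is not shown, and email-address is listed last as described above):

plugin {
  fts_tokenizers = generic email-address
  fts_tokenizer_generic = algorithm=simple
}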

See also

FTS Tokenization