fts plugin
See also
See FTS (Full Text Search) for an overview of the Dovecot Full Text Search (FTS) system.
FTS languages
Language names are given as ISO 639-1 alpha 2 codes.
Stemming support indicates whether the snowball filter can be used.
Stopwords support indicates whether a stopwords file is distributed with Dovecot.
Currently supported languages:
Language Code | Language | Stemming | Stopwords |
---|---|---|---|
da | Danish | Yes | Yes |
de | German | Yes | Yes |
en | English | Yes | Yes |
es | Spanish | Yes | Yes |
fi | Finnish | Yes | Yes |
fr | French | Yes | Yes |
it | Italian | Yes | Yes |
ja | Japanese (Requires separate Kuromoji license) | No | No |
nl | Dutch | Yes | Yes |
no | Norwegian (Bokmal & Nynorsk detected) | Yes | Yes |
pt | Portuguese | Yes | Yes |
ro | Romanian | Yes | Yes |
ru | Russian | Yes | Yes |
sv | Swedish | Yes | Yes |
tr | Turkish | Yes | Yes |
See also FTS Tokenization
Settings
- fts_autoindex_exclude
Default: <empty>
Values: String
To exclude a mailbox from automatic indexing, list it in this setting.
To exclude additional mailboxes, add sequential numbers to the end of the setting name.
Use either mailbox names or special-use flags (e.g. \Trash).
For example:
plugin {
  fts_autoindex_exclude = \Junk
  fts_autoindex_exclude2 = \Trash
  fts_autoindex_exclude3 = External Accounts/*
}
- fts_autoindex_max_recent_msgs
Default: 0
Values: Unsigned integer
New in version v2.2.9.
To exclude infrequently accessed mailboxes from automatic indexing, set this value to the maximum number of \Recent flagged messages that exist in the mailbox.
A value of 0 means to ignore this setting.
Mailboxes with more \Recent flagged messages than this value will not be autoindexed, even though they get deliveries or appends. This is useful for, e.g., inactive Junk folders.
Any folders excluded from automatic indexing will still be indexed if a search on them is performed.
Example:
plugin {
  fts_autoindex_max_recent_msgs = 999
}
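As a rough illustration of the rule described for this setting, the following Python sketch models only the effect of fts_autoindex_max_recent_msgs on the autoindexing decision. The function name and its parameters are hypothetical and not part of Dovecot; other settings that influence autoindexing are deliberately left out.

```python
def is_autoindexed(recent_count: int, max_recent_msgs: int) -> bool:
    """Hypothetical model of the fts_autoindex_max_recent_msgs check.

    recent_count    -- number of \\Recent flagged messages in the mailbox
    max_recent_msgs -- configured value of fts_autoindex_max_recent_msgs
    """
    # A value of 0 means the setting is ignored.
    if max_recent_msgs == 0:
        return True
    # Mailboxes with more \Recent messages than the limit are skipped.
    return recent_count <= max_recent_msgs


if __name__ == "__main__":
    for count in (10, 999, 1000):
        print(count, is_autoindexed(count, 999))
```

With the example value of 999, a mailbox holding exactly 999 \Recent messages is still autoindexed, while one with 1000 is skipped until a search is performed on it.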
- fts_decoder
Default: <empty>
Values: String
New in version v2.1.
Decode attachments to plaintext using this service and index the resulting plaintext.
See the decode2text.sh script included in Dovecot for how to use this.
Example:
plugin {
  fts_decoder = decode2text
}

service decode2text {
  executable = script /usr/lib/dovecot/decode2text.sh
  user = vmail
  unix_listener decode2text {
    mode = 0666
  }
}
- fts_enforced
Default: no
Values: yes, no, body
New in version v2.2.19.
Whether FTS indexes are required to perform a search. This controls what happens when searching headers and what happens in error situations.
When searching the message body, an update of the FTS index is always attempted, so that it contains any missing mails before the search is performed.
no
Searching message headers won’t update FTS indexes. For header searches, the FTS indexes are used for the mails that are already indexed, and the unindexed mails are searched via dovecot.index.cache (or by opening the emails if the headers aren’t in the cache).
If FTS lookup or indexing fails, both header and body searches fall back to searching without FTS (i.e. possibly opening all emails). This may time out for large mailboxes and/or slow storage.
yes
Searching message headers updates FTS indexes, the same way as searching the body does. If FTS lookup or indexing fails, the search fails.
body
Searching message headers won’t update FTS indexes (the same behavior as with no). If FTS lookup or indexing fails, the search fails.
New in version v2.3.7.
Note that only the yes value guarantees consistent search results. With the other values, search results may differ depending on whether the search was performed via the FTS index or not.
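The behavior of the three fts_enforced values can be summarized in a small decision function. This is a Python sketch derived only from the descriptions of no, yes and body given for this setting; `SearchPlan` and `plan_search` are hypothetical names used for illustration and do not correspond to Dovecot internals.

```python
from typing import NamedTuple


class SearchPlan(NamedTuple):
    update_index_first: bool  # is the FTS index updated before searching?
    on_fts_failure: str       # "fallback" (search without FTS) or "fail"


def plan_search(fts_enforced: str, searches_body: bool) -> SearchPlan:
    """Hypothetical summary of the documented fts_enforced behavior."""
    if fts_enforced not in ("no", "yes", "body"):
        raise ValueError(f"unknown fts_enforced value: {fts_enforced!r}")
    # Body searches always attempt an index update; header searches
    # only do so with fts_enforced = yes.
    update_index_first = searches_body or fts_enforced == "yes"
    # Only "no" falls back to a non-FTS search when FTS fails.
    on_fts_failure = "fallback" if fts_enforced == "no" else "fail"
    return SearchPlan(update_index_first, on_fts_failure)


if __name__ == "__main__":
    for value in ("no", "yes", "body"):
        for body in (False, True):
            print(value, "body" if body else "header", plan_search(value, body))
```

The sketch makes the trade-off visible: no never fails a search but may be slow and inconsistent, while yes and body prefer an error over a fallback scan.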
- fts_filters
Default: <empty>
Values: String
The list of filters to apply.
Language specific filter chains can be specified with fts_filters_<lang> (e.g. fts_filters_en).
Available filters:
lowercase
Change all text to lower case. Supports UTF-8 when compiled with libicu and the library is installed. Otherwise only ASCII characters are lowercased.
stopwords
Filter certain common and short words, which are usually useless for searching.
Settings:
stopwords_dir
Path to the directory containing stopword files. Stopword files are looked up in <path>/stopwords_<lang>.txt.
See FTS languages for the list of stopword files that are currently distributed with Dovecot.
More languages can be obtained from Apache Lucene, Snowball stemmer, or https://github.com/stopwords-iso/.
snowball
Stemming tries to convert words to a common base form. A simple example is converting “cars” to “car” (in English).
This stemmer is based on the Snowball stemmer library.
See FTS languages
normalizer-icu
Normalize text using libicu. This is potentially very resource intensive.
Note
Caveat for Norwegian: The default normalizer filter does not modify U+00F8 (Latin Small Letter O with Stroke). In some configurations it might be desirable to rewrite it to e.g. o. The same goes for the upper case version. This can be done by passing a modified id setting to the normalizer filter. Similar cases can exist for other languages as well.
Settings:
id
Description of the normalizing/transliterating rules to use.
See Normalizer Format for syntax.
Defaults to Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\\x20] Remove
english-possessive
Remove trailing 's from English possessive form tokens. Any trailing single ' characters are already removed by tokenizing, whether this filter is used or not.
The snowball filter also removes possessive suffixes from English, so if using snowball this filter is not needed. snowball likely produces better results, so this filter is advisable only when snowball is not available or cannot be used due to extreme CPU performance requirements.
contractions
Removes certain contractions that can prefix words. The idea is to only index the part of the token that conveys the core meaning.
Only works with French, so the language of the input needs to be recognized by textcat as French.
It filters “qu’”, “c’”, “d’”, “l’”, “m’”, “n’”, “s’” and “t’”.
Do not use at the same time as the generic tokenizer with algorithm=tr29 wb5a=yes.
Example:
plugin {
  fts_filters = normalizer-icu snowball stopwords
  fts_filters_en = lowercase snowball english-possessive stopwords
}
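To see what the default normalizer-icu rule set does, and why U+00F8 survives it, the following Python sketch approximates the default id using only the standard `unicodedata` module. It is a stand-in under stated assumptions, not libicu: `str.lower()` takes the place of Any-Lower, and the function name is hypothetical.

```python
import unicodedata


def approx_default_normalizer(token: str) -> str:
    """Approximate 'Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\\x20] Remove'.

    Assumption: Python's str.lower() is used instead of ICU's Any-Lower.
    """
    lowered = token.lower()                              # Any-Lower (approximation)
    decomposed = unicodedata.normalize("NFKD", lowered)  # NFKD
    return "".join(
        ch
        for ch in decomposed
        # Drop nonspacing marks (general category Mn) and U+0020 spaces.
        if unicodedata.category(ch) != "Mn" and ch != " "
    )


if __name__ == "__main__":
    # U+00E9 decomposes into "e" plus a combining accent, which is removed.
    print(approx_default_normalizer("Caf\u00e9"))
    # U+00D8 / U+00F8 have no decomposition, so only the case changes.
    print(approx_default_normalizer("\u00d8l"))
```

Because U+00F8 has no decomposition into a base letter plus a combining mark, removing nonspacing marks cannot turn it into o; that requires an explicit extra rule in a modified id.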
- fts_header_excludes
Default: <empty>
Values: String
New in version v2.3.18.
The list of headers to, respectively, include or exclude.
The default is the preexisting behavior, i.e. index all headers.
includes take precedence over excludes: if a header matches both, it is indexed.
The terms are case insensitive.
An asterisk * at the end of a header name matches anything starting with that header name.
The asterisk can only be used at the end of the header name. Prefix and infix usage of asterisk are not supported.
Example:
plugin {
  fts_header_excludes = Received DKIM-* X-* Comments
  fts_header_includes = X-Spam-Status Comments
}
Received headers, all DKIM- headers and all X- experimental headers are excluded, with the following exceptions:
Comments and X-Spam-Status are indexed anyway, as they match both excludes and includes lists.
All other headers are indexed.
Example:
plugin {
  fts_header_excludes = *
  fts_header_includes = From To Cc Bcc Subject Message-ID In-* X-CustomApp-*
}
No headers are indexed, except those specified in the includes.
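The matching rules for the header include and exclude lists (case insensitive names, trailing asterisk only, includes winning over excludes) can be sketched in a few lines of Python. The function names are hypothetical and the code is an illustration of the documented rules, not Dovecot's implementation.

```python
from typing import Sequence


def _matches(header: str, patterns: Sequence[str]) -> bool:
    """Case-insensitive match; a trailing '*' matches any suffix."""
    name = header.lower()
    for pattern in patterns:
        pat = pattern.lower()
        if pat.endswith("*"):
            if name.startswith(pat[:-1]):
                return True
        elif name == pat:
            return True
    return False


def is_header_indexed(
    header: str, includes: Sequence[str], excludes: Sequence[str]
) -> bool:
    """Hypothetical model: includes take precedence over excludes."""
    if _matches(header, includes):
        return True
    return not _matches(header, excludes)


if __name__ == "__main__":
    excludes = "Received DKIM-* X-* Comments".split()
    includes = "X-Spam-Status Comments".split()
    for name in ("Received", "DKIM-Signature", "X-Mailer",
                 "X-Spam-Status", "Comments", "Subject"):
        print(name, is_header_indexed(name, includes, excludes))
```

Run against the first example configuration, the sketch reports Received, DKIM-Signature and X-Mailer as not indexed, and X-Spam-Status, Comments and Subject as indexed. With both lists empty, every header is indexed, which mirrors the default.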
- fts_index_timeout
Default: 0
Values: Unsigned integer
When the full text search backend detects that the index isn’t up-to-date, the indexer is told to index the messages and is given this much time to do so. If this time limit is reached, an error is returned, indicating that the search timed out while waiting for the indexing to complete:
NO [INUSE] Timeout while waiting for indexing to finish
A value of 0 means no timeout.
- fts_language_config
Default: <textcat dir>
Values: String
Path to the textcat/exttextcat configuration file, which lists the supported languages.
It is recommended to point this to a minimal version of the configuration that supports only the languages listed in fts_languages.
Doing this improves language detection performance during indexing and also makes the detection more accurate.
Example:
plugin {
  fts_language_config = /usr/share/libexttextcat/fpdb.conf
}
- fts_languages
Default: <empty>
Values: String
A space-separated list of languages that the full text search should detect.
At least one language must be specified.
The language listed first is the default and is used when language recognition fails.
The filters used for stemming and stopwords are language dependent.
Note
For better performance it’s recommended to synchronize this setting with the textcat configuration file; see fts_language_config.
Example:
plugin {
  fts_languages = en de
}
- fts_stopwords_workaround
Default: auto
Values: yes, no, auto
New in version v2.3.20.
When both multiple languages and stopwords are configured, stopwords in combination with other terms do not always produce the desired result.
The recommended solution is to disable stopwords and then reindex the mailboxes with FTS (otherwise the results will be incorrect).
Purely as a temporary measure, the workaround changes the way the queries are generated, mitigating the issue (but not resolving it entirely).
The workaround can be forced on (yes) or off (no). With the default setting auto, the workaround is enabled if both of the following hold:
multiple languages are configured for the user
at least one of the languages has the stopword filter configured
With the setting auto the workaround is disabled automatically as soon as the stopword filter is removed.
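The conditions for the auto value can be written out as a short Python sketch. It assumes, as the introduction of this setting suggests, that both conditions must hold together. The function name and the `filters_by_language` mapping (language code to list of filter names) are hypothetical representations chosen for illustration.

```python
from typing import Mapping, Sequence


def stopwords_workaround_enabled(
    setting: str, filters_by_language: Mapping[str, Sequence[str]]
) -> bool:
    """Hypothetical model of fts_stopwords_workaround = yes | no | auto."""
    if setting == "yes":
        return True
    if setting == "no":
        return False
    if setting != "auto":
        raise ValueError(f"unknown fts_stopwords_workaround value: {setting!r}")
    # auto: enabled only with multiple languages AND at least one
    # language whose filter chain contains the stopwords filter.
    multiple_languages = len(filters_by_language) > 1
    any_stopwords = any(
        "stopwords" in filters for filters in filters_by_language.values()
    )
    return multiple_languages and any_stopwords


if __name__ == "__main__":
    config = {
        "en": ["lowercase", "snowball", "english-possessive", "stopwords"],
        "de": ["normalizer-icu", "snowball"],
    }
    print(stopwords_workaround_enabled("auto", config))
```

Removing stopwords from every filter chain in the mapping makes the function return False for auto, matching the statement that the workaround is disabled once the stopword filter is removed.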
- fts_tika
Default: <empty>
Values: String
New in version v2.2.13.
URL for Apache Tika decoder for attachments.
Example:
plugin {
  fts_tika = http://tikahost:9998/tika/
}
- fts_tokenizers
Default: generic email-address
Values: String
The list of tokenizers to use.
This setting can be overridden for specific languages by using fts_tokenizers_<lang> (e.g. fts_tokenizers_en).
List of tokenizers:
generic
Input data, such as email text and headers, need to be divided into words suitable for indexing and searching. The generic tokenizer does this.
Settings:
maxlen
Maximum length of token, before an arbitrary cut off is made. Defaults to FTS_DEFAULT_TOKEN_MAX_LENGTH. The default is probably OK.
algorithm
Accepted values are simple or tr29. It defines the method for looking for word boundaries. simple is faster and will work for many texts, especially those using Latin alphabets, but leaves corner cases. tr29 implements a version of the Unicode Technical Report 29 word boundary lookup. It might work better with e.g. texts containing Katakana or Hebrew characters, but it is not possible to use a single algorithm for all existing languages. The default is simple.
wb5a
Unicode TR29 rule WB5a setting for the tr29 tokenizer. Splits prefixing contracted words from the base word, e.g. “l’homme” → “l” “homme”. Together with a language specific stopword list, unnecessary contractions can thus be filtered away. This is disabled by default and only works with the TR29 algorithm. Enable with fts_tokenizer_generic = algorithm=tr29 wb5a=yes.
email-address
This tokenizer preserves email addresses as complete search tokens, by bypassing the generic tokenizer, when it finds an address. It will only work as intended if it is listed after other tokenizers.
kuromoji
Important
The kuromoji tokenizer is a part of OX Dovecot Pro only.
This tokenizer is used for Japanese text. It uses the Atilika Kuromoji tokenizer library and performs NFKC normalization before tokenization. NFKC normalization covers half-width and full-width character normalization, such as:
transform half-width Katakana letters to full-width
transform full-width number letters to half-width
transform special compatibility letters (e.g. the circled digit ① (U+2460) is transformed to 1, and the square era name ㍻ (U+337B) to 平成)
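The NFKC transformations listed here can be reproduced with Python's standard `unicodedata` module. This is only a demonstration of Unicode NFKC itself, not of the kuromoji tokenizer; the helper name `nfkc` is hypothetical, and the sample characters are written as escapes so that they survive copy and paste.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (compatibility composition)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    samples = {
        "half-width Katakana": "\uff76\uff80\uff76\uff85",  # -> full-width Katakana
        "full-width digits": "\uff11\uff12\uff13",          # -> ASCII "123"
        "circled digit one": "\u2460",                      # -> "1"
        "square era name Heisei": "\u337b",                 # -> two Kanji
    }
    for label, text in samples.items():
        result = nfkc(text)
        # Print code points so the effect is visible in any terminal.
        print(label, [hex(ord(ch)) for ch in text], "->",
              [hex(ord(ch)) for ch in result])
```

Since indexed tokens and search terms pass through the same normalization, a query typed with full-width digits can match text stored with ASCII digits, and the other way round.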
Settings:
maxlen
Maximum length of token, before an arbitrary cut off is made. The default value for the kuromoji tokenizer is 1024.
kuromoji_split_compounds
This setting enables “search mode” in the Atilika Kuromoji library. The setting defaults to enabled (i.e. 1) and should not be changed unless there is a compelling reason. To disable, set the value to 0.
Note
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
id
Description of the normalizing/transliterating rules to use. See Normalizer Format for syntax. Defaults to Any-NFKC, which is quite good for CJK text mixed with Latin alphabet languages. It transforms CJK characters to full-width encoding and transforms Latin ones to half-width. The NFKC transformation is described above.
Note
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
We use the predefined set of stopwords which is recommended by Atilika. Those stopwords are reasonable and they have been made by tokenizing Japanese Wikipedia and have been reviewed by us. This set of stopwords is also included in the Apache Lucene and Solr projects and it is used by many Japanese search implementations.