As the amount and importance of information stored in email messages is increasing in people's everyday lives, searching through those messages is becoming ever more important. At the same time, mobile clients add their own restrictions on what can be done on the client side. The ever-diversifying mail client software also tests the limits of the IMAP protocol and current server implementations.
Furthermore, the IMAP protocol requires some rather complicated and expensive searching capabilities. For example, the protocol requires arbitrary substring matching. Some newer mobile clients (e.g. Apple iOS) rely on this functionality.
Without a high-performance index, Dovecot must fall back to a slow sequential search through all messages (default behavior). If storage latencies are high, this searching may not be completed in a reasonable time, or resource utilization may be too large, especially in mailboxes with large messages.
Dovecot maintains these FTS indexing engines:
Name | Description |
---|---|
Dovecot Pro FTS | Dovecot native, object storage optimized driver. Only available as part of Dovecot Pro. |
fts_solr plugin | Interface to Apache Solr; stores data remotely. |
fts_flatcurve plugin | Xapian based driver; stores data locally. |
When an FTS indexing driver is not present, searches use a slow sequential search through all message data. This is expensive in both computation and time, so it is desirable to pre-index data so that searches can be executed against this index.
There is a subtle but important distinction between searching through message headers and searching through message bodies.
Searching through message bodies (via the standard IMAP 'SEARCH TEXT/BODY' commands) makes use of the FTS indexes.
On the other hand, searching through message headers benefits from Dovecot's standard index and cache files (dovecot.index
and dovecot.index.cache
), which often contain the necessary information. It is possible to redirect header searches to FTS indexes via a configuration option (fts_search_add_missing
).
Triggers for FTS indexing are configurable: indexing can be started on demand when searching, automatically when new messages arrive, or as a batch job.
By default the FTS indexes are updated only while searching, so neither LDA/LMTP nor an IMAP 'APPEND' command updates the indexes immediately. This means that if a user has received a lot of mail since the last indexing (i.e., the last search operation), it may take a while to index all the new mails before replying to the search command. Dovecot sends periodic "* OK Indexed n% of the mailbox" updates which can be caught by client implementations to implement a progress bar.
Updating the FTS index as messages arrive makes for a more responsive user experience, especially for users who don’t search often, but have a lot of mail. On the other hand, it increases overall system load regardless of whether or not the indexes will ever be used by the user.
Dovecot splits the full text search functionality into two parts: a common tokenization library (lib-language) and a driver/indexing engine responsible for persistently storing the tokens produced by the common library.
Some of the FTS drivers do their own internal tokenization, although it's possible to configure them to use the lib-language tokenization as well.
See Tokenization for more details about configuring the tokenization.
All drivers are implemented as plugins that extend the base fts plugin's functionality.
fts
Default | [None] |
---|---|
Value | Named List Filter |
Configures the FTS driver used for fts plugin indexing. If not specified, FTS is disabled. The filter name refers to the fts_driver setting.
Example:
fts solr {
# ...
}
fts_autoindex
Default | no |
---|---|
Value | boolean |
See Also |
If enabled, mail is indexed as it is delivered or appended.
This can be overridden at the mailbox level, e.g. autoindexing can be disabled for selected mailboxes:
Example:
fts_autoindex = yes
# ...
mailbox trash {
special_use = Trash
fts_autoindex = no
}
mailbox spam {
special_use = Junk
fts_autoindex = no
}
mailbox storage/* {
fts_autoindex = no
}
fts_autoindex_max_recent_msgs
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
To exclude infrequently accessed mailboxes from automatic indexing, set this value to the maximum number of Recent flagged messages that may exist in the mailbox.
A value of 0 means this setting is ignored.
Mailboxes with more Recent flagged messages than this value will not be autoindexed, even though they get deliveries or appends. This is useful for, e.g., inactive Junk folders.
Any folder excluded from automatic indexing is still indexed if a search on it is performed.
Example:
fts_autoindex_max_recent_msgs = 999
fts_decoder_driver
Default | [None] |
---|---|
Value | string |
Allowed Values | script tika |
Optional setting. If set, decode attachments to plaintext using the selected service and index the resulting plaintext.
fts_decoder_script_socket_path
Default | [None] |
---|---|
Value | string |
Changes | |
Name of the script service used to decode the attachments.
See the decode2text.sh
script included in Dovecot for how to use this.
Example:
fts_decoder_driver = script
fts_decoder_script_socket_path = decode2text
service decode2text {
executable = script /usr/lib/dovecot/decode2text.sh
user = vmail
unix_listener decode2text {
mode = 0666
}
}
fts_decoder_tika_url
Default | [None] |
---|---|
Value | string |
Changes | |
URL for Apache Tika decoder for attachments.
Example:
fts_decoder_driver = tika
fts_decoder_tika_url = http://tikahost:9998/tika/
fts_driver
Default | [None] |
---|---|
Value | string |
Allowed Values | dovecot solr flatcurve |
Configures the FTS driver used for fts plugin indexing. The fts named list filter refers to this setting.
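Example (a sketch; the filter name itself selects the driver, and the driver-specific settings inside the block are omitted):
fts flatcurve {
# the filter name implies fts_driver = flatcurve
}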
fts_header_excludes
Default | [None] |
---|---|
Value | Boolean List |
The list of headers to include or exclude.
Includes take precedence over excludes: if a header matches both, it is indexed.
A * at the end of a header name matches anything starting with that header name.
Example:
fts_header_excludes {
Received = yes
DKIM-* = yes
X-* = yes
Comments = yes
}
fts_header_includes {
X-Spam-Status = yes
Comments = yes
}
In the example above, Received headers, all DKIM-* headers and all experimental X-* headers are excluded, with the following exceptions: Comments and X-Spam-Status are indexed anyway, as they match both the excludes and includes lists.
Example:
fts_header_excludes {
* = yes
}
fts_header_includes {
From = yes
To = yes
Cc = yes
Bcc = yes
Subject = yes
Message-ID = yes
In-* = yes
X-CustomApp-* = yes
}
In the example above, only the headers explicitly listed in the includes are indexed.
fts_header_includes
Default | [None] |
---|---|
Value | Boolean List |
See Also |
fts_message_max_size
Default | [None] |
---|---|
Value | size |
Changes | |
Maximum body size that is processed by fts. 0
means unlimited.
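Example (the limit shown is illustrative only, not a recommendation):
fts_message_max_size = 10M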
fts_search_add_missing
Default | body-search-only |
---|---|
Value | string |
Allowed Values | body-search-only yes |
Should missing mails be added to FTS indexes before search?
With body-search-only this is done only when the search query requests searching message bodies, i.e. header searches do not update the FTS index. The unindexed mails are then searched without FTS, i.e. either by getting the headers from dovecot.index.cache or by opening the emails if the headers aren't in cache. This may be a useful optimization if the user's client only uses header searches.
INFO
Only the yes
option guarantees consistent search results. Otherwise it's
possible that the search results will be different depending on whether the
search was performed via FTS index or not.
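Example (a sketch; as noted above, yes guarantees consistent results at the cost of indexing before header-only searches):
fts_search_add_missing = yes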
fts_search_read_fallback
Default | yes |
---|---|
Value | boolean |
If FTS lookup or indexing fails, fall back to searching without FTS (i.e. possibly opening all emails). This may timeout for large mailboxes and/or slow storage.
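Example (a sketch; setting this to no makes such searches fail instead of falling back):
fts_search_read_fallback = no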
fts_search_timeout
Default | 30s |
---|---|
Value | time |
When the full text search driver detects that the index isn't up-to-date, the indexer is told to index the messages and is given this much time to do so. If the time limit is reached, an error is returned indicating that the search timed out while waiting for indexing to complete:
NO [INUSE] Timeout while waiting for indexing to finish
A value of 0
means no timeout.
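Example (the value is illustrative; a longer timeout may suit large mailboxes or slow storage):
fts_search_timeout = 60s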
language
Default | <textcat dir> |
---|---|
Value | Named List Filter |
Dependencies | |
See Also |
Defines a language to be used in tokenization.
At least one language must be specified, and exactly one language must be flagged as the default using language_default = yes.
The default language is used when language recognition fails.
The filters used for stemming and stopwords are language dependent.
TIP
For better performance it's recommended to synchronize this setting with the
textcat configuration file; see textcat_config_path
.
Example:
language en {
default = yes
}
language de {
}
language_default
Default | no |
---|---|
Value | boolean |
Dependencies | |
See Also |
The language marked as default will be used when language detection cannot identify the proper language of the text being processed.
Exactly one language must be marked with this flag.
language_filter_normalizer_icu_id
Default | Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\x20] Remove |
---|---|
Value | string |
Description of the normalizing/transliterating rules to use.
See Normalizer Format for syntax.
language_filter_stopwords_dir
Default | [None] |
---|---|
Value | string |
See Also |
Path to the directory containing stopword files. The files inside the directory have names of the form stopwords_<lang>.txt.
See Languages for the list of stopword files that are currently distributed with Dovecot.
More languages can be obtained from:
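Example (the path is an assumption; point it to wherever the stopword files are installed):
language_filter_stopwords_dir = /usr/share/dovecot/stopwords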
language_filters
Default | [None] |
---|---|
Value | Boolean List |
See Also |
The list of filters to apply.
See Filter Configuration for configuration information.
language_tokenizer_address_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizer_generic_algorithm
Default | simple |
---|---|
Value | string |
Allowed Values | simple tr29 |
See Also |
Defines the method for finding word boundaries.
Value | Description |
---|---|
simple |
A faster algorithm that works well for many texts, especially those using Latin alphabets, but mishandles some corner cases. |
tr29 |
Implements a version of the Unicode Technical Report 29 word boundary lookup. It may work better with texts containing e.g. Katakana or Hebrew characters, but no single algorithm works for all existing languages. |
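Example (selecting the tr29 algorithm):
language_tokenizer_generic_algorithm = tr29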
language_tokenizer_generic_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizer_generic_wb5a
Default | no |
---|---|
Value | boolean |
See Also |
Enables Unicode TR29 rule WB5a in the tr29 tokenizer. It splits prefixed contractions from the base word, e.g. l'homme -> l and homme.
Together with a language-specific stopword list, unnecessary contractions can thus be filtered away. This is disabled by default and only works with the tr29 algorithm.
Enable by declaring:
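# WB5a only works together with the tr29 algorithm
language_tokenizer_generic_algorithm = tr29
language_tokenizer_generic_wb5a = yes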
language_tokenizer_kuromoji_icu_id
Default | Any-NFKC |
---|---|
Value | string |
See Also |
Description of the normalizing/transliterating rules to use. See Normalizer Format for syntax.
Defaults to Any-NFKC, which works well for CJK text mixed with Latin-alphabet languages. It transforms CJK characters to full-width encoding and Latin ones to half-width. The NFKC transformation is described above.
WARNING
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
language_tokenizer_kuromoji_split_compounds
Default | yes |
---|---|
Value | boolean |
See Also | |
Advanced setting; this should not normally be changed.
This setting enables search mode
in the Atilika Kuromoji library. The
setting defaults to enabled and should not be changed unless there is a
compelling reason.
WARNING
If this setting is changed, existing FTS indexes will produce unexpected results. The FTS indexes should be recreated in this case.
language_tokenizer_kuromoji_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizers
Default | generic email-address |
---|---|
Value | Boolean List |
See Also |
The list of tokenizers to use.
See Tokenizer Configuration for configuration information.
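Example (a sketch; the kuromoji tokenizer name is inferred from the language_tokenizer_kuromoji_* settings and may only be available in Dovecot Pro):
language_tokenizers = generic email-address kuromoji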
textcat_config_path
Default | <textcat dir> |
---|---|
Value | string |
See Also |
Path to the textcat/exttextcat configuration file, which lists the supported languages.
It is recommended to change this to point to a minimal configuration that supports only the languages listed in language.
Doing this improves language detection performance during indexing and also makes the detection more accurate.
Example:
textcat_config_path = /usr/share/libexttextcat/fpdb.conf
Missing mails are always added to the FTS indexes when an IMAP SEARCH command attempts to access them.
Automatic FTS indexing can also be done during mail delivery, IMAP APPEND and other ways of adding mails to mailboxes using fts_autoindex
.
Indexing can also be triggered manually:
doveadm index -u user@domain -q INBOX
When FTS lookup or indexing fails, Dovecot falls back on using the built-in search, which has no indexes for mail bodies.
This could end up opening all the mails in the mailbox, which often isn't wanted.
To disable this fallback, set fts_search_read_fallback = no.
Attachments can be indexed either via a script that translates the attachment to UTF-8 plaintext or Apache Tika server.
Dovecot keeps track of indexed messages in the dovecot.index files
. If this becomes out of sync with the actual FTS indexes (either too many or too few mails), you'll need to do a rescan and then index missing mails:
doveadm fts rescan -u user@domain
doveadm index -u user@domain -q '*'
Note that currently most FTS drivers don't implement the rescan. Instead, they simply delete all the FTS indexes. This may change in future versions.
Language names are given as ISO 639-1 alpha 2 codes.
Stemming support indicates whether the snowball
filter can be used.
Stopwords support indicates whether a stopwords file is distributed with Dovecot.
Currently supported languages:
Language Code | Language | Stemming | Stopwords |
---|---|---|---|
da | Danish | Yes | Yes |
de | German | Yes | Yes |
en | English | Yes | Yes |
es | Spanish | Yes | Yes |
fi | Finnish | Yes | Yes |
fr | French | Yes | Yes |
it | Italian | Yes | Yes |
ja | Japanese (Requires Dovecot Pro) | No | No |
nl | Dutch | Yes | Yes |
no | Norwegian (Bokmal & Nynorsk detected) | Yes | Yes |
pt | Portuguese | Yes | Yes |
ro | Romanian | Yes | Yes |
ru | Russian | Yes | Yes |
sv | Swedish | Yes | Yes |
tr | Turkish | Yes | Yes |
Dovecot contains tokenization support that can be used by FTS drivers.
The lib-language tokenization library works in the following way:
1. Language detection: When indexing, Dovecot attempts to detect the language of the text. If detection fails, the default language is used. When searching, the search is done using all the configured languages.
2. Tokenization: The text is split into tokens (individual words).
3. Filtering: Tokens are normalized, e.g. lowercased or stemmed.
4. Stopwords: A configurable list of words that are not indexed.
The language setting declares the languages that need to be detected.
At least one language must be listed.
The language marked with language_default = yes is used in case detection fails.
Each added language makes the indexing and searching slightly slower, so it's recommended not to add languages unnecessarily. Language detection performance can be improved by limiting the number of languages available to textcat; see textcat_config_path.
Example:
language en {
default = yes
}
language de {
}
The filters and tokenizers are created in the order they are declared in their respective settings in the configuration file. They form a chain, where the first filter or tokenizer is the parent or grandparent of the rest. The direction of the data flow needs some special attention.
In filters, the data flows from parent to child, so tokens are first passed to the grandparent of all filters and then further down the chain. For some filtering chains the order is important. E.g. the snowball stemmer wants all input in lower case, so the filter lower casing the tokens will need to be listed before it.
In tokenizers, however, the data flows from child to parent. This means that the tokenizer listed last gets the data first.
So, for filters data flows "left to right" through the filters listed in the configuration. In tokenizers the order is "right to left".
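As an illustration (a sketch; the filter list here is just an example, while the tokenizer list is the default):
# filters: tokens flow left to right (normalizer-icu -> snowball -> stopwords)
language_filters = normalizer-icu snowball stopwords
# tokenizers: data flows right to left (email-address sees the input before generic)
language_tokenizers = generic email-address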
Base64 sequences are looked for in the tokenization buffer and skipped when detected.
A base64 sequence is detected by:
- a character in the leader-characters set,
- followed by a run of characters in the base64-characters set, at least minimum-run-length long,
- followed by a character in the trailer-characters set,
where:
- leader-characters are: [ \t\r\n=:;?]
- base64-characters are: [0-9A-Za-z/+]
- trailer-characters are: [ \t\r\n=:;?]
- minimum-run-length is: 50
- minimum-run-count is: 1
Thus, even single 50-character runs of characters in the base64 set are recognized as base64 and ignored in indexing.
If a base64 sequence happens to be split across different chunks of data, part of it might not be detected as base64. In this case, the undetected base64 fragment is still indexed. However, this happens rarely enough that it does not significantly impact the quality of the filter.
So far the above rule seems to give good results in base64 indexing avoidance. It also performs well in removing base64 fragments inside headers, like ARC-Seal, DKIM-Signature, X-SG-EID, X-SG-ID, including header-encoded parts (e.g. =?us-ascii?Q?...?=
sequences).
Filters affect how data is indexed.
They are configured through language_filters
.
Example:
language_filters = normalizer-icu snowball stopwords
language en {
language_filters = lowercase snowball english-possessive stopwords
}
Available filters:
lowercase
Changes all text to lower case. Supports UTF-8 when Dovecot is compiled with libicu and the library is installed; otherwise only ASCII characters are lowercased.
stopwords
Filter certain common and short words, which are usually useless for searching.
WARNING
Using stopwords with multiple languages configured WILL cause some searches to fail. The recommended solution is to NOT use the stopword filter when multiple languages are present in the configuration.
snowball
Stemming tries to convert words to a common base form. A simple example is converting cars
to car
(in English).
This stemmer is based on the Snowball stemmer library.
normalizer-icu
Normalize text using libicu. This is potentially very resource intensive.
WARNING
There is a caveat for the Norwegian language:
The default normalizer filter does not modify U+00F8 (Latin Small Letter O with Stroke). In some configurations it might be desirable to rewrite it to, e.g., o. The same goes for the upper-case version. This can be done by passing a modified language_filter_normalizer_icu_id setting to the normalizer filter.
Similar cases can exist for other languages as well.
language_filter_normalizer_icu_id
Default | Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\x20] Remove |
---|---|
Value | string |
Description of the normalizing/transliterating rules to use.
See Normalizer Format for syntax.
english-possessive
Remove trailing 's
from English possessive form tokens. Any trailing single '
characters are already removed by tokenizing, whether this filter is used or not.
The snowball
filter also removes possessive suffixes from English, so if using snowball
this filter is not needed.
TIP
snowball
likely produces better results, so this filter is advisable only when snowball
is not available or cannot be used due to extreme CPU performance requirements.
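A sketch of a filter chain that relies on english-possessive instead of snowball (assuming snowball is unavailable; the chain otherwise mirrors the earlier language en example):
language en {
language_filters = lowercase english-possessive stopwords
}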
contractions
Removes certain contractions that can prefix words. The idea is to only index the part of the token that conveys the core meaning.
Only works with French, so the language of the input needs to be recognized by textcat as French.
It filters qu', c', d', l', m', n', s' and t'.
Do not use at the same time as the generic tokenizer with both the tr29 algorithm and WB5a (language_tokenizer_generic_wb5a) enabled, as they would both split the contractions.
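A sketch of a French-specific chain using this filter (the exact filter order is an assumption):
language fr {
language_filters = lowercase contractions stopwords
}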
Tokenizers affect how input data is parsed.
Available tokenizers:
generic
Input data, such as email text and headers, needs to be divided into words suitable for indexing and searching. The generic tokenizer does this.
language_tokenizer_generic_algorithm
Default | simple |
---|---|
Value | string |
Allowed Values | simple tr29 |
See Also |
Defines the method for finding word boundaries.
Value | Description |
---|---|
simple |
A faster algorithm that works well for many texts, especially those using Latin alphabets, but mishandles some corner cases. |
tr29 |
Implements a version of the Unicode Technical Report 29 word boundary lookup. It may work better with texts containing e.g. Katakana or Hebrew characters, but no single algorithm works for all existing languages. |
language_tokenizer_generic_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of a token; longer tokens are arbitrarily cut off at this length.
language_tokenizer_generic_wb5a
Default | no |
---|---|
Value | boolean |
See Also |
Enables Unicode TR29 rule WB5a in the tr29 tokenizer. It splits prefixed contractions from the base word, e.g. l'homme -> l and homme.
Together with a language-specific stopword list, unnecessary contractions can thus be filtered away. This is disabled by default and only works with the tr29 algorithm.
Enable by declaring:
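# WB5a only works together with the tr29 algorithm
language_tokenizer_generic_algorithm = tr29
language_tokenizer_generic_wb5a = yes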
email-address
This tokenizer preserves email addresses as complete search tokens by bypassing the generic tokenizer when it finds an address. It only works as intended if it is listed after the other tokenizers.
language_tokenizer_address_token_maxlen
Default | [None] |
---|---|
Value | unsigned integer |
See Also |
Maximum length of token, before an arbitrary cut off is made.