.. _plugin-fts:

==========
fts plugin
==========

.. seealso:: See :ref:`fts` for an overview of the Dovecot Full Text Search
   (FTS) system.

.. _fts_languages:

FTS languages
^^^^^^^^^^^^^

Language names are given as ISO 639-1 alpha 2 codes.

Stemming support indicates whether the ``snowball`` filter can be used.

Stopwords support indicates whether a stopwords file is distributed with
Dovecot.

Currently supported languages:

+---------------+---------------------------------------+----------+-----------+
| Language Code | Language                              | Stemming | Stopwords |
+===============+=======================================+==========+===========+
| da            | Danish                                | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| de            | German                                | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| en            | English                               | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| es            | Spanish                               | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| fi            | Finnish                               | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| fr            | French                                | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| it            | Italian                               | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| ja            | Japanese                              | No       | No        |
|               | (Requires separate Kuromoji license)  |          |           |
+---------------+---------------------------------------+----------+-----------+
| nl            | Dutch                                 | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| no            | Norwegian (Bokmal & Nynorsk detected) | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| pt            | Portuguese                            | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| ro            | Romanian                              | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| ru            | Russian                               | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| sv            | Swedish                               | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+
| tr            | Turkish                               | Yes      | Yes       |
+---------------+---------------------------------------+----------+-----------+

See also :ref:`fts_tokenization`

Settings
^^^^^^^^

.. dovecot_plugin:setting:: fts_autoindex
   :default: no
   :plugin: fts
   :seealso: @fts_autoindex_exclude;dovecot_plugin, @fts_autoindex_max_recent_msgs;dovecot_plugin
   :values: @boolean

   If enabled, index mail as it is delivered or appended.

.. dovecot_plugin:setting:: fts_autoindex_exclude
   :plugin: fts
   :seealso: @fts_autoindex;dovecot_plugin
   :values: @string

   To exclude a mailbox from automatic indexing, list it in this setting.

   To exclude additional mailboxes, add sequential numbers to the end of the
   setting name.

   Use either mailbox names or special-use flags (e.g. ``\Trash``).

   For example:

   .. code-block:: none

      plugin {
        fts_autoindex_exclude = \Junk
        fts_autoindex_exclude2 = \Trash
        fts_autoindex_exclude3 = External Accounts/*
      }

.. dovecot_plugin:setting:: fts_autoindex_max_recent_msgs
   :added: v2.2.9
   :default: 0
   :plugin: fts
   :seealso: @fts_autoindex;dovecot_plugin
   :values: @uint

   To exclude infrequently accessed mailboxes from automatic indexing, set
   this value to the maximum number of ``\Recent`` flagged messages that may
   exist in the mailbox.

   A value of ``0`` means to ignore this setting.

   Mailboxes with more ``\Recent`` flagged messages than this value will not
   be autoindexed, even though they get deliveries or appends. This is useful
   for, e.g., inactive Junk folders.

   Any folders excluded from automatic indexing will still be indexed if a
   search on them is performed.

   Example:

   .. code-block:: none

      plugin {
        fts_autoindex_max_recent_msgs = 999
      }

.. dovecot_plugin:setting:: fts_decoder
   :added: v2.1
   :plugin: fts
   :values: @string

   Decode attachments to plaintext using this service and index the resulting
   plaintext.

   See the ``decode2text.sh`` script included in Dovecot for how to use this.

   Example:

   .. code-block:: none

      plugin {
        fts_decoder = decode2text
      }

      service decode2text {
        executable = script /usr/lib/dovecot/decode2text.sh
        user = vmail
        unix_listener decode2text {
          mode = 0666
        }
      }

.. dovecot_plugin:setting:: fts_enforced
   :added: v2.2.19
   :default: no
   :plugin: fts
   :values: yes, no, body

   Require FTS indexes to perform a search? This controls what to do when
   searching headers and what to do in error situations.

   When searching the message body, the FTS index is always updated (or an
   update is at least attempted) to contain any missing mails before the
   search is performed.

   ``no``
      Searching from message headers won't update FTS indexes. For header
      searches, the FTS indexes are used for searching the mails that are
      already in them, but the unindexed mails are searched via
      dovecot.index.cache (or by opening the emails if the headers aren't in
      cache).

      If FTS lookup or indexing fails, both header and body searches fall
      back to searching without FTS (i.e. possibly opening all emails). This
      may time out for large mailboxes and/or slow storage.

   ``yes``
      Searching from message headers updates FTS indexes, the same way as
      searching from body does. If FTS lookup or indexing fails, the search
      fails.

   ``body``
      Searching from message headers won't update FTS indexes (the same
      behavior as with ``no``). If FTS lookup or indexing fails, the search
      fails.

      .. versionadded:: v2.3.7

   Note that only the ``yes`` value guarantees consistent search results. In
   other cases it's possible that the search results will be different
   depending on whether the search was performed via FTS index or not.

.. dovecot_plugin:setting:: fts_filters
   :plugin: fts
   :seealso: @fts_tokenization
   :values: @string

   The list of filters to apply.
   Language specific filter chains can be specified with
   ``fts_filters_<lang>`` (e.g. ``fts_filters_en``).

   Available filters:

   ``lowercase``
      Change all text to lower case. Supports UTF8, when compiled with libicu
      and the library is installed. Otherwise only ASCII characters are
      lowercased.

   ``stopwords``
      Filter certain common and short words, which are usually useless for
      searching.

      Settings:

      ``stopwords_dir``
         Path to the directory containing stopword files. Stopword files are
         looked up in ``<path>/stopwords_<lang>.txt``.

      See :ref:`fts_languages` for the list of stopword files that are
      currently distributed with Dovecot.

      More languages can be obtained from Apache Lucene, the Snowball stemmer
      project, or https://github.com/stopwords-iso/.

   ``snowball``
      Stemming tries to convert words to a common base form. A simple example
      is converting “cars” to “car” (in English).

      This stemmer is based on the Snowball stemmer library.

      See :ref:`fts_languages`

   ``normalizer-icu``
      Normalize text using libicu. This is potentially very resource
      intensive.

      .. note::

         Caveat for Norwegian: The default normalizer filter does not modify
         ``U+00F8`` (Latin Small Letter O with Stroke). In some configurations
         it might be desirable to rewrite it to e.g. ``o``. The same goes for
         the upper case version. This can be done by passing a modified ``id``
         setting to the normalizer filter.

         Similar cases can exist for other languages as well.

      Settings:

      ``id``
         Description of the normalizing/transliterating rules to use.

         * See `Normalizer Format`_ for syntax.
         * Defaults to ``Any-Lower; NFKD; [: Nonspacing Mark :] Remove; [\\x20] Remove``

   ``english-possessive``
      Remove trailing ``'s`` from English possessive form tokens. Any trailing
      single ``'`` characters are already removed by tokenizing, whether this
      filter is used or not.

      The ``snowball`` filter also removes possessive suffixes from English,
      so if using ``snowball`` this filter is not needed.
      ``snowball`` likely produces better results, so this filter is advisable
      only when ``snowball`` is not available or cannot be used due to extreme
      CPU performance requirements.

   ``contractions``
      Removes certain contractions that can prefix words. The idea is to only
      index the part of the token that conveys the core meaning.

      Only works with French, so the language of the input needs to be
      recognized by textcat as French.

      It filters “qu'”, “c'”, “d'”, “l'”, “m'”, “n'”, “s'” and “t'”.

      Do not use at the same time as ``generic`` tokenizer with
      ``algorithm=tr29 wb5a=yes``.

   Example:

   .. code-block:: none

      plugin {
        fts_filters = normalizer-icu snowball stopwords
        fts_filters_en = lowercase snowball english-possessive stopwords
      }

.. _`Normalizer Format`: https://unicode-org.github.io/icu/userguide/transforms/general/#transliterator-identifiers

.. dovecot_plugin:setting:: fts_header_excludes
   :added: v2.3.18
   :plugin: fts
   :values: @string

   The list of headers to, respectively, include or exclude.

   - The default is the preexisting behavior, i.e. index all headers.
   - ``includes`` take precedence over ``excludes``: if a header matches both,
     it is indexed.
   - The terms are case insensitive.
   - An asterisk ``*`` at the end of a header name matches anything starting
     with that header name.
   - The asterisk can only be used at the end of the header name.
     Prefix and infix usage of asterisk are not supported.

   Example:

   .. code-block:: none

      plugin {
        fts_header_excludes = Received DKIM-* X-* Comments
        fts_header_includes = X-Spam-Status Comments
      }

   - ``Received`` headers, all ``DKIM-`` headers and all ``X-`` experimental
     headers are excluded, with the following exceptions:

     - ``Comments`` and ``X-Spam-Status`` are indexed anyway, as they match
       **both** ``excludes`` and ``includes`` lists.

   - All other headers are indexed.
   Example::

      plugin {
        fts_header_excludes = *
        fts_header_includes = From To Cc Bcc Subject Message-ID In-* X-CustomApp-*
      }

   - No headers are indexed, except those specified in the ``includes``.

.. dovecot_plugin:setting:: fts_header_includes
   :added: v2.3.18
   :plugin: fts
   :seealso: @fts_header_excludes;dovecot_plugin
   :values: @string

.. dovecot_plugin:setting:: fts_index_timeout
   :default: 0
   :plugin: fts
   :values: @uint

   When the full text search backend detects that the index isn't up-to-date,
   the indexer is told to index the messages and is given this much time to do
   so. If this time limit is reached, an error is returned, indicating that
   the search timed out while waiting for the indexing to complete:

   ``NO [INUSE] Timeout while waiting for indexing to finish``

   A value of ``0`` means no timeout.

.. dovecot_plugin:setting:: fts_language_config
   :default: !
   :plugin: fts
   :seealso: @fts_languages;dovecot_plugin
   :values: @string

   Path to the textcat/exttextcat configuration file, which lists the
   supported languages.

   It is recommended to change this to point to a minimal version of the
   configuration that supports only the languages listed in
   :dovecot_plugin:ref:`fts_languages`.

   Doing this improves language detection performance during indexing and also
   makes the detection more accurate.

   Example:

   .. code-block:: none

      plugin {
        fts_language_config = /usr/share/libexttextcat/fpdb.conf
      }

.. dovecot_plugin:setting:: fts_languages
   :plugin: fts
   :seealso: @fts_language_config;dovecot_plugin
   :values: @string

   A space-separated list of languages that the full text search should
   detect.

   At least one language must be specified.

   The language listed first is the default and is used when language
   recognition fails.

   The filters used for stemming and stopwords are language dependent.

   .. note::

      For better performance it's recommended to synchronize this setting
      with the textcat configuration file; see
      :dovecot_plugin:ref:`fts_language_config`.

   Example:

   .. code-block:: none

      plugin {
        fts_languages = en de
      }

.. dovecot_plugin:setting:: fts_stopwords_workaround
   :added: v2.3.20
   :plugin: fts
   :default: auto
   :values: yes, no, auto

   When both multiple languages and stopwords are configured, stopwords in
   combination with other terms do not always produce the desired result.

   The recommended solution is to disable stopwords AND reindex the mailboxes
   for FTS (otherwise the results will be incorrect).

   Exclusively as a temporary measure, the workaround changes the way the
   queries are generated, mitigating the issue (but not resolving it
   entirely).

   The workaround can be forced on (``yes``) or off (``no``).

   With the default setting ``auto``, the workaround is enabled if both of
   the following hold:

   - multiple languages are configured for the user
   - at least one of the languages has the stopword filter configured

   With the setting ``auto`` the workaround is disabled automatically as soon
   as the stopword filter is removed.

.. dovecot_plugin:setting:: fts_tika
   :added: v2.2.13
   :plugin: fts
   :values: @string

   URL for the Apache Tika decoder for attachments.

   Example:

   .. code-block:: none

      plugin {
        fts_tika = http://tikahost:9998/tika/
      }

.. dovecot_plugin:setting:: fts_tokenizers
   :default: generic email-address
   :plugin: fts
   :seealso: @fts_tokenization
   :values: @string

   The list of tokenizers to use.

   This setting can be overridden for specific languages by using
   ``fts_tokenizers_<lang>`` (e.g. ``fts_tokenizers_en``).

   List of tokenizers:

   ``generic``
      Input data, such as email text and headers, needs to be divided into
      words suitable for indexing and searching. The generic tokenizer does
      this.

      Settings:

      ``maxlen``
         Maximum length of token, before an arbitrary cut off is made.
         Defaults to ``FTS_DEFAULT_TOKEN_MAX_LENGTH``. The default is probably
         OK.

      ``algorithm``
         Accepted values are ``simple`` or ``tr29``. It defines the method for
         looking for word boundaries. ``simple`` is faster and will work for
         many texts, especially those using Latin alphabets, but leaves corner
         cases.
         ``tr29`` implements a version of the Unicode Technical Report 29 word
         boundary lookup. It might work better with e.g. texts containing
         Katakana or Hebrew characters, but it is not possible to use a single
         algorithm for all existing languages. The default is ``simple``.

      ``wb5a``
         Unicode TR29 rule WB5a setting for the tr29 tokenizer. Splits
         prefixing contracted words from the base word, e.g. “l'homme” → “l”
         “homme”. Together with a language specific stopword list, unnecessary
         contractions can thus be filtered away. This is disabled by default
         and only works with the TR29 algorithm. Enable with
         ``fts_tokenizer_generic = algorithm=tr29 wb5a=yes``.

   ``email-address``
      This tokenizer preserves email addresses as complete search tokens, by
      bypassing the generic tokenizer, when it finds an address. It will only
      work as intended if it is listed **after** other tokenizers.

   ``kuromoji``
      .. important:: The kuromoji tokenizer is a part of OX Dovecot Pro only.

      This tokenizer is used for Japanese text. It utilizes the Atilika
      Kuromoji tokenizer library to tokenize Japanese text. It also does NFKC
      normalization before tokenization. NFKC normalization unifies half-width
      and full-width character forms, for example:

      * transform half-width Katakana letters to full-width
      * transform full-width number letters to half-width
      * transform special letters (e.g. ``①`` is transformed to ``1``, and
        ``㍻`` to ``平成``)

      Settings:

      ``maxlen``
         Maximum length of token, before an arbitrary cut off is made.
         The default value for the kuromoji tokenizer is ``1024``.

      ``kuromoji_split_compounds``
         This setting enables “search mode” in the Atilika Kuromoji library.
         The setting defaults to enabled (i.e. ``1``) and should not be
         changed unless there is a compelling reason. To disable, set the
         value to ``0``.

         .. note::

            If this setting is changed, existing FTS indexes will produce
            unexpected results. The FTS indexes should be recreated in this
            case.
      ``id``
         Description of the normalizing/transliterating rules to use. See
         `Normalizer Format`_ for syntax. Defaults to ``Any-NFKC``, which is
         quite good for CJK text mixed with Latin alphabet languages. It
         transforms CJK characters to full-width encoding and transforms Latin
         ones to half-width. The NFKC transformation is described above.

         .. note::

            If this setting is changed, existing FTS indexes will produce
            unexpected results. The FTS indexes should be recreated in this
            case.

      We use the predefined set of stopwords that is recommended by Atilika.
      These stopwords are reasonable; they were produced by tokenizing
      Japanese Wikipedia and have been reviewed by us. This set of stopwords
      is also included in the Apache Lucene and Solr projects and is used by
      many Japanese search implementations.
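
   Example (a minimal sketch, not a recommended configuration: the ``en fr``
   language list and the French filter chain are assumptions chosen for
   illustration). It keeps ``email-address`` listed after ``generic``, and
   enables the TR29 algorithm with WB5a for the generic tokenizer. The
   ``contractions`` filter is left out of the French chain, because it should
   not be used together with ``algorithm=tr29 wb5a=yes``:

   .. code-block:: none

      plugin {
        fts_languages = en fr
        fts_tokenizers = generic email-address
        fts_tokenizer_generic = algorithm=tr29 wb5a=yes
        fts_filters_fr = lowercase snowball stopwords
      }

   If tokenizer or filter settings are changed on a system that already has
   FTS indexes, reindexing is likely needed before search results become
   consistent again.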