Did You Mean?: Search Suggestions Administration

Introduction

As of 3.7, the work for Did You Mean enables search suggestions for a search comprising a single word within a single search class. For the purposes of suggestions, a search class in Evergreen is a keyword, title, author, series, or subject.

As of 3.11, search suggestions are offered for phrases and multi-word search within a single search class. Search suggestions can also leverage variant headings (4xx fields) in Authority records in this latest version, as long as search terms in these fields are in the symspell dictionary for the specified search class. Quoted phrases in search inputs require strict term order and adjacency for the phrase portion of the suggestion generated for the phrase(s), while unquoted search inputs do not require strict order and adjacency.

Search suggestions are available in the public catalog (both TPAC and Bootstrap versions), the Children’s OPAC (KPAC), and the Angular Staff Catalog.

Future iterations of this project are planning to add cross class and other search suggestion mechanisms.

Several search suggestion ordering mechanisms have been added, and are described below in the Library Settings section. The relative weights of each suggestion ordering mechanism can be adjusted to prioritize different suggestion routes. Each Evergreen organization will need to determine the best configuration of weights and suggestion ordering settings.

Search suggestions are based on existing bibliographic data, and are offered for potentially correctable spelling mistakes. A new set of tables have been added to collect bibliographic data and build an internal dictionary of potential search suggestions. When a catalog search meets criteria for offering suggestions, this dictionary is used to generate the suggestions.

The end user will be shown a configurable number of suggestions, hyperlinked to execute a new search based on that suggestion. Any search options such as Format that were initially set will be carried over to the new search.

Evergreen’s existing use of search term stemming has not been altered as a consequence of this work.

The Library Settings that were previously used to control the global behavior of search suggestions have been moved to search class configuration fields. This was done because the data in each search class benefits from different setting values. These settings are documented below.

As in the previous iteration of Did You Mean, search suggestions are globally using Damerau-Levenshtein distance as their highest-weighted setting. Damerau-Levenshtein includes insertion, deletion, substitution, and transposition. This is not a setting that can be changed, but the current implementation supports natural language word matching for Western languages.

Search Results Display

In all cases, search suggestions will be offered for potentially correctable spelling mistakes if a search retrieves fewer than a configured number of results; and potential suggested terms appear at least a configurable number of times within the bibliographic data. Both of these thresholds are configured via Library Settings described below.

For examples of where suggestions display in various public catalog interfaces, please see the documentation in the Did You Mean? section of the OPAC documentation.

Search suggestions in the Staff Catalog appear at the bottom of the search area. Below is an example of a single word suggestion:

Search suggestions in the Staff Catalog

Below is an image of a multi word suggestion using an authority cross-reference:

Authority record based suggestion

Below is an image of a phrase suggestion:

Phrase search suggestion

Authority record-based suggestions and phrase suggestions are also available in the public catalog.

Administration

Did You Mean Configuration

To configure Did You Mean search settings, navigate to Administration → Server Administration → MARC Search / Facet Classes.

The screenshot below shows the new fields on MARC Search / Facet Classes to configure search suggestions. Please note that in the interest of visual spacing, the screenshot is deliberately not showing several extant MARC Search / Facet Class columns. These columns are still present and visible by default in this view.

MARC Search/Facet class configuration

The new configuration fields are described in the table below.

Field Name Description Default value

Low result threshold for suggestions

If a search results in this number or fewer results, and there are correctable spelling mistakes, a suggested search may be provided. If you want all searches to generate suggestions, you can set this to an artificially high number, but it’s possible that this will generate less-useful suggestions.

0 (only searches with no hits)

Max suggestions

If this is set to -1, the system will provide the best suggestion (dependent on the weights of various suggestion mechanisms) if and only if the term is considered misspelled based on the Minimum required uses setting; if this is set to 1 or more, that is the maximum number of suggestions that will be provided; if this is set to 0, no suggestions will be provided. All values other than 0 only provide suggestions that meet the Minimum required uses threshold, and only when the Maximum search result count threshold is not passed. The maximum recommended setting for this is 3, since suggestions become rapidly less useful beyond that point.

-1

Perform variant heading authority suggestion cross-reference

When set to “Yes”, search suggestions will be offered from authority record 4xx (See From Tracing) fields. Suggestions are only offered for fields in the same search class, e.g., subject heading suggestions for subject search.

Yes

Minimum bib record suggestion threshold

The number of indexed bibliographic strings in which a spelling suggestion must appear in order to be offered to a user. Suggestions must appear in the bib data.

1

Suggestion SOUNDEX weight

Controls the relative weight of the scaled soundex component. Setting this to 1 can improve suggestions for catalogs that are primarily English.

0 (off)

Suggestion PG Trigram weight

Controls the relative weight of the scaled pg_trgm component. Setting this to 1 can significantly improve suggestions for most catalogs.

0 (off)

Suggestion keyboard distance weight

Controls the relative weight of the scaled keyboard distance component. This option can have a negative impact on suggestions and a value greater than 0 is not recommended for most catalogs.

0 (off)

Retain case in suggestions

Present search suggestions in the same casing as the original user-input search terms.

Yes

Avoid alternate suggestions on correctly spelled words

If set to Yes, correctly spelled words will not have suggestions offered even if a potential suggestion may exist in the bibliographic or authority data.

No

Symspell suggestion calculation verbosity

A setting used to control the internal behavior of the SymSpell algorithm. It allows tuning the balance between performance and suggestion generation, and is set to provide the widest range of suggestion generation by default.

2

Maximum average word edit distance

Suggestions that have an average per-word edit distance larger than this are discarded.

2

Maximum suggestions per word

The maximum number of suggestions offered for each individual word in the search phrase.

5

The three similarity measures, Pg_trgm (Tri-gram), Soundex, and keyboard distance weight, are calculated by comparing the user’s search input to each potential suggestion. The configured numerical values for Pg_trgm, Soundex, and keyboard distance are multipliers for each similarity measure. For example, setting the Pg_trgm weight to 2 will double the raw score for that similarity measure.

The final order of a group of potential suggestions is determined first by the Damerau-Levenshtein edit distance, and then by the summed value of the weighting measures, each multiplied by its score weight. If suggestions coming from a particular corpus are shown to benefit from giving additional consideration to one or more of the measures, their weighting score can be increased.

Empirical testing and existing research shows that increasing the weight of any similarity measure beyond 1 is not useful in a reasonable, representative set of bibliographic records, and that a multiplier of 1 for Pg_trgm and Soundex is ideal for primarily-English catalogs, but all data sets vary.

Internal flags

The suggestion mechanism primarily uses a SymSpell implementation in Evergreen’s Postgres database. The SymSpell edit distance and prefix key length are controlled by two internal global flags, symspell.prefix_length and symspell.max_edit_distance. A full dictionary rebuild is required if either of these flags are changed.

The SymSpell algorithm mandates the use of the Damerau-Levenshtein algorithm which includes insertion, deletion, substitution, and transposition cost calculations. While the original plan was to make use of the built-in Postgres implementation of the Levenshtein edit distance algorithm, results of partner testing led us to replace the built-in option with an external Damerau-Levenshtein implementation.

A recommended set of values for the SymSpell settings is 6 for symspell.prefix_length and 3 for symspell.max_edit_distance.

This set of values is known to provide a very good balance between accuracy and resource consumption based on empirical testing of the algorithm and analysis of English language texts. For further explanation of why these settings are recommended, please see this article and the embedded links to benchmarks and later improvements.