Use Advanced Linguistics with Basistech

The Fusion Advanced Linguistics Package embeds Basis Technology’s leading-edge Rosette natural language processing tools for multilingual text analysis. To improve search recall, Rosette Base Linguistics (RBL) handles the unique linguistic phenomena of more than 30 Asian and European languages. Rosette Entity Extractor (REX) identifies named entities such as people, locations, and organizations, allowing you to quickly refine your search, remove noise, and increase search relevance.

Using Named Entities (REX)

REX extracts named entities in 21 languages, including English, Chinese (traditional and simplified), and German. In English, it extracts 29 entity types and over 450 subtypes out of the box. This includes the following entity types (along with their associated subtypes):

  • PERSON

  • LOCATION

  • ORGANIZATION

  • PRODUCT

  • TITLE

  • NATIONALITY

  • RELIGION

Other languages may support a different set of entity types. For full details on supported languages and entity types, see the Entity Extractor and Linker technical specifications.

In this tutorial, we will extract named entities from English news articles.

Create Application

To begin, create a new application called "entities".

Configuration

Edit Solr Configuration

We will begin by adding the Basis library elements to the solrconfig.xml file. We will also add a new update processor to perform the entity extraction.

  1. Navigate to System > Solr Config to edit the solrconfig.xml file.

  2. In the <lib/> directive section, add the following lines:

    <lib dir="/opt/basistech/rex-je/lib" regex=".*\.jar" />
    <lib dir="/opt/basistech/solr/lib/" regex=".*\.jar" />
    Important
    For Fusion 4.x users, the dir paths are the local REX installation path.
  3. In the <updateRequestProcessorChain/> section, add the following lines after the existing processor chains:

    <updateRequestProcessorChain name="rex">
        <processor class="com.basistech.rosette.solr.EntityExtractorUpdateProcessorFactory">
            <str name="rootDirectory">/opt/basistech/rex-je</str>
            <str name="fields">text_eng</str>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory"/>
        <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

    Note the reference to a field called text_eng. We will create this field through the Fusion UI in the next step.

  4. Save your changes to solrconfig.xml.

Define Fields

The data file we will use, eng_docs.csv, contains two fields:

  • title - an article headline

  • article_text - the text content of the article

We will index these two static fields and also define a set of dynamic fields to hold the extracted entities. To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:

Field name Field type Other options

title

string

Use default options.

text_eng

text_en

Use default options.

text_eng_REX_*

string

Create this field as a dynamic field by clicking the Dynamic checkbox. Click the Multivalued checkbox. Leave other options as defaults.

Create a new field

Note
Be sure to save each field after creating it.

Indexing Data

Create Indexing Pipeline

  1. Navigate to Indexing > Indexing Pipelines.

  2. Click Add and create a new pipeline called test-entities.

  3. Select the Field Mapping stage.

  4. In the Field Translations section, add a new row with source article_text and target text_eng. Set the Operation to move.

    Field translations

The CSV file we will upload contains a field title, but since this matches the title field that we created earlier, there is no need to map it. Leave the Unmapped Fields section with its default option, keep.

  1. Select the Solr Indexer stage.

  2. In the Additional Update Request Parameters section, add a new row with parameter name update.chain and value rex.

    Additional update request parameters

  3. Save the new pipeline.

Create Datasource

In this step, we will upload and index our documents from the data file.

  1. Navigate to Indexing > Datasources.

  2. Click Add and select File Upload from the dropdown menu.

  3. Enter eng_docs for the Datasource ID. Alternatively, use a name you prefer.

  4. Select test-entities for the Pipeline ID.

  5. In the File Upload field, choose the sample file eng_docs.csv and click Upload File.

    The File ID field will be automatically populated. Leave all other values as their defaults.

    File upload

  6. Save the new datasource. The form will refresh, adding a set of buttons at the top.

  7. Click Run and then Start. When the job is finished, you will see "Success" in the popup form.

Querying Data

  1. Navigate to Querying > Query Workbench. The default query is *:*, which should bring up three documents.

  2. For the document with title "SpaceX Successfully Launches its First Crewed Spaceflight", select Show fields. You will see a number of entities listed under the text_eng_REX_* fieldnames.

    Querying data

  3. Search on these multivalued fields. For example, set your query to text_eng_REX_LOCATION:"New York" to return the article that contains a mention of New York.

Customization (Advanced)

When setting up the Solr configuration, you specified the rootDirectory and fields options in your processor chain. REX provides a number of other configuration options you can set to control how entities are extracted. For example, if you are finding false positives, you can set parameters instructing REX to return only entities above a confidence threshold. The confidence threshold is a value between 0 and 1 and applies to entities extracted by the statistical model. We recommend starting with a low value, around 0.2. In your solrconfig.xml file, add the options calculateConfidence and confidenceThreshold to your processor chain definition:

<updateRequestProcessorChain name="rex">
    <processor class="com.basistech.rosette.solr.EntityExtractorUpdateProcessorFactory">
        <str name="rootDirectory">/opt/basistech/rex-je</str>
        <str name="fields">text_eng</str>
        <str name="calculateConfidence">True</str>
        <str name="confidenceThreshold">0.2</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Save the changes, re-index your data, and perform the same query on *:*. Note that for the SpaceX article, "Falcon", is now correctly omitted from the list of LOCATIONs.

Customization

If there is a particular entity you want to make sure is extracted or rejected, or if you wish to create a custom entity type, REX also supports gazetteers and regular expressions.

Gazetteers

A gazetteer is a UTF-8 text file in which the first line is the entity type. It is followed by the names of entities you wish to extract, separated by newlines, in the language of your documents. Comments can be prefixed by the # symbol. Create a file spacecraft_gaz.txt with the following lines:

SPACECRAFT
ISS
International Space Station
Vostok
Soyuz
Dragon
Crew Dragon
Cargo Dragon
Voyager
Apollo

Regular expressions

REX uses the Tcl regex format. Create a file, zulu_time_regex.xml, file with the following lines:

<regexps>
    <regexp lang="eng" type="ZULU_TIME">(?i)\m(?:[01]?\d|2[0-4])(?:[0-5]\d) (?:UTC|GMT)</regexp>
</regexps>

This regular expression will extract as entity type ZULU_TIME all spans that consist of a 4-digit military time unit followed by the time zone designator UTC or GMT.

Example

To instruct REX to use the gazetteer and regex file, edit your solrconfig.xml file. The addGazetteers option takes four parameters:

  • language

  • file

  • accept (True) or reject (False)

  • case-sensitive (True or False)

For example, with <str name="addGazetteers">eng,/path/to/spacecraft_gaz.txt,True,True</str>:

language file accept case-sensitive

eng

/path/to/spacecraft_gaz.txt

True

True

The addRegularExpressions option takes two parameters:

  • file

  • accept (True) or reject (False)

For example, with <str name="addRegularExpressions">/path/to/zulu_time_regex.xml,True</str>:

file accept

/path/to/zulu_time_regex.xml

True

The result in the solrconfig.xml file:

<updateRequestProcessorChain name="rex">
    <processor class="com.basistech.rosette.solr.EntityExtractorUpdateProcessorFactory">
        <str name="rootDirectory">/opt/basistech/rex-je</str>
        <str name="fields">text_eng</str>
        <str name="calculateConfidence">True</str>
        <str name="confidenceThreshold">0.2</str>
        <str name="addGazetteers">eng,/path/to/spacecraft_gaz.txt,True,True</str>
        <str name="addRegularExpressions">/path/to/zulu_time_regex.xml,True</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Now, when you re-index your data and search *:*, the SpaceX document will have new entities listed in the text_eng_REX_SPACECRAFT and text_eng_REX_ZULU_TIME dynamic fields.

Note
Additional Fusion deployment configurations are needed to use the REX gazetteer and regex options.

Using Multilingual Search (RBL)

RBL provides a set of linguistic tools to prepare your data for analysis. Language-specific models provide base forms (lemmas) of words, parts-of-speech tagging, compound components, normalized tokens, stems, and roots.

In this tutorial, we will index and query headlines in English, Chinese, and German to demonstrate the linguistics capabilities of RBL: lemmatization, tokenization, and decompounding.

Create Application

To begin, create a new application called "multilingual".

Configuration

Edit Solr Configuration

We will begin by adding the Basis library elements to the solrconfig.xml file.

  1. Navigate to System > Solr Config to edit the solrconfig.xml file.

  2. In the <lib/> directive section, add the following lines:

    <lib dir="/opt/basistech/solr/lib/" regex=".*\.jar" />
    Important
    For Fusion 4.x users, the dir path is the local REX installation path.
  3. Save your changes to solrconfig.xml.

Edit Schema

Add a fieldType element for each language to be processed by the application. The fieldType element includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter. The language attribute is set to the language code, equal to the ISO 639-3 code in most cases. The rootDirectory points to the RBL directory.

  1. Navigate to System > Solr Config to edit the managed-schema file.

  2. In the fieldType section, add the following new field types: basis_english, basis_chinese, and basis_german.

    <fieldtype class="solr.TextField" name="basis_english">
        <analyzer type="index">
            <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="eng" rootDirectory="/opt/basistech/rbl-je"/>
            <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="eng" rootDirectory="/opt/basistech/rbl-je"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="eng" query="true" rootDirectory="/opt/basistech/rbl-je"/>
            <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="eng" query="true" rootDirectory="/opt/basistech/rbl-je"/>
        </analyzer>
    </fieldtype>
    
    <fieldtype class="solr.TextField" name="basis_chinese">
        <analyzer type="index">
            <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="zho" rootDirectory="/opt/basistech/rbl-je"/>
            <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="zho" rootDirectory="/opt/basistech/rbl-je"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="zho" query="true" rootDirectory="/opt/basistech/rbl-je"/>
            <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="zho" query="true" rootDirectory="/opt/basistech/rbl-je"/>
        </analyzer>
    </fieldtype>
    
    <fieldtype class="solr.TextField" name="basis_german">
        <analyzer type="index">
            <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="deu" rootDirectory="/opt/basistech/rbl-je"/>
            <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="deu" rootDirectory="/opt/basistech/rbl-je"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="deu" query="true" rootDirectory="/opt/basistech/rbl-je"/>
            <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="deu" query="true" rootDirectory="/opt/basistech/rbl-je"/>
        </analyzer>
    </fieldtype>
Note
You can incorporate any additional Solr filters you need, such as the Solr lowercase filter. However, filters should be added into the chain after the Base Linguistics token filter. If you modify the token stream too significantly before RBL, you degrade its ability to analyze the text.
  1. Save your changes to managed-schema.

Define Fields

The data file we will use, multilingual_headlines.csv, contains fields for headlines in three languages: eng_headline, zho_headline, and deu_headline. The analysis chain requires a field definition with a type attribute that maps to the fieldType you defined in the schema.

To create new fields, navigate to Collections > Fields and click Add a Field. Create the following fields:

Field name Field type Other options

text_eng

basis_english

Use default options.

text_zho

basis_chinese

Use default options.

text_deu

basis_german

Use default options.

Note
Be sure to save each field after creating it.

Indexing Data

Create Indexing Pipeline

  1. Navigate to Indexing > Indexing Pipelines.

  2. Click Add and create a new pipeline called test-multilingual.

  3. Select the Field Mapping stage.

  4. In the Field Translations section, add three new rows:

    Field name Field name Operation

    eng_headline

    text_eng

    move

    zho_headline

    text_zho

    move

    deu_headline

    text_deu

    move

  5. Save the new pipeline.

Create Datasource

In this step, we will upload and index our documents from the data file.

  1. Navigate to Indexing > Datasources.

  2. Click Add and select File Upload from the dropdown menu.

  3. Enter multilingual_headlines for the Datasource ID. Alternatively, use a name you prefer.

  4. Select test-multilingual for the Pipeline ID.

  5. In the File Upload field, choose the sample file multilingual_headlines.csv and click Upload File.

    The File ID field will be automatically populated. Leave all other values as their defaults.

  6. Save the new datasource. The form will refresh, adding a set of buttons at the top.

  7. Click Run and then Start. When the job is finished, you will see "Success" in the popup form.

Querying Data

  1. Navigate to Querying > Query Workbench. The default query is *:*, which should bring up ten documents.

  2. Follow the examples in the subsections below to see how Fusion’s Advanced Linguistics capabilities can improve your search results.

Lemmatization

A "lemma" is the canonical form of a word, or the version of a word that you find in the dictionary. For example, the lemma of "mice" is "mouse". The words "speaks", "speaking", "spoke", and "spoken" all share the same lemma: "speak".

With RBL, you can perform searches by lemma, thus increasing your search results. This example demonstrates this practice with the words "knife" and "knives" below.

  1. For ease of viewing results, select the Display Fields dropdown and enter text_eng in the Description field.

  2. Enter the query text_eng:knife in the search box.

Two documents are returned. One of the headlines, "Iowa City Man Arrested After Alleged Altercation with a Knife", is an exact match on the query term knife. With a standard Solr text field type, this would be the only result returned. However, the special type basis_english we configured allows the search engine to recognize "knives" as a form of "knife". Therefore, the article "The Best Ways to Sharpen Kitchen Knives at Home" is also returned.

RBL can significantly reduce your dependence on creating, maintaining, and using large synonym lists.

Tokenization

Tokenization is the process of separating a piece of text into smaller units called "tokens". Tokens can be words, characters, or subwords, depending on how they are defined and analyzed. The RBL tokenizer first determines sentence boundaries, then segments each sentence into individual tokens. The most useful tokens are often words, though they may also be numbers or other characters.

In some languages like Chinese and Japanese, word tokens are not separated by whitespace, and words can consist of one, two, or more characters. For example, the tokens in 我喜歡貓 (I like cats) are 我 (I), 喜歡 (like), and 貓 (cats). RBL uses statistical models to identify token boundaries, allowing for more accurate search results.

  1. For ease of viewing results, select the Display Fields dropdown and enter text_zho in the Description field.

  2. Enter the query text_zho:美國 (United States) in the search box.

The document "美國男子染疫無呼吸道症狀但難以說話行走" ("U.S. Man has no Respiratory Symptoms but has Difficulty Talking and Walking") is returned, as 美國 (United States) is a token in the headline.

With a standard Solr text field type, this headline would be naively tokenized with one character per token. Therefore, a search for (beautiful) would trigger a false positive match, even though it is not a word in this context. However, with the advanced analytics we have configured here, the query text_zho:美 will correctly return zero results.

Compounds

RBL can decompose Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components. The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form. RBL allows Solr to index and query on these components, increasing recall of search results.

  1. For ease of viewing results, select the Display Fields dropdown and enter text_deu in the Description field.

  2. Enter the query text_deu:Land in the search box.

The result "Mitarbeiter sitzen in ihren Heimatländern fest" ("Employees are Stuck in Their Home Countries") returned.

This headline contains the compound "Heimatländern" (home countries). A search on Land (country) with a standard Solr text field type would not trigger a match. However, because RBL performs decompounding with lemmatization, searching on Heimat or Land will return a result.

Customization (Advanced)

When setting up the Solr configuration, you specified the language and rootDirectory options in your field type definition. This will be sufficient for most use cases. However, RBL does provide more options to control the behavior of the tokenizer and analyzer. For example, the default tokenization does not consider URLs. As a result, http://lucidworks.com will be tokenized as http, lucidworks, and com.

If you wish to recognize URLs, you can add the option urls="true" to the tokenizer in your field type definition:

<fieldtype class="solr.TextField" name="basis_english">
    <analyzer type="index">
        <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="eng" rootDirectory="/opt/basistech/rbl-je" urls="true"/>
        <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="eng" rootDirectory="/opt/basistech/rbl-je"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="eng" query="true" rootDirectory="/opt/basistech/rbl-je" urls="true"/>
        <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="eng" query="true" rootDirectory="/opt/basistech/rbl-je"/>
    </analyzer>
</fieldtype>

This will instruct RBL to consider URLs as a single token, for example: http://lucidworks.com.

To see a list of all options, consult the full RBL documentation.