Indexing of Sophora Documents

Different Types of Solr Collections

Sophora creates two Solr collections automatically: the default and the default-live collection. The default collection contains all documents in their current working version. Even deleted and offline documents are part of that collection. The default-live collection contains all currently published documents in their last live version.

Additional collections can easily be configured by using index configuration documents. Please refer to the corresponding section for further information.

For each index configuration, two collections are created. One collection contains the working versions of the included documents, and the other contains only the last live version of each included document. The latter has the suffix -live. In contrast to the embedded Solr indexer, the live collection of an index configuration is always created implicitly.

Besides these two collection types, there is a third type: offline collections. Only documents that had a published version and are now in the state offline are stored in an offline collection. These collections are described in detail in the corresponding section.

Using the Right Solr Collections

Any Sophora client connected to a Sophora Primary Server can potentially search in any collection. Searches in the Sophora DeskClient are also typically carried out in Solr. We recommend only using live collections in live contexts in order to exclude unwanted access to non-published content. To show a preview in the DeskClient or MobileClient, the corresponding web app needs to access a collection with "working" (non-published) versions.

Any Sophora client connected to a Sophora Staging Server can only search in live collections.
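The naming convention and the recommendation above can be sketched as a small helper. This is only an illustration in Python: the function name collection_for and the context labels "live" and "preview" are assumptions made for this example and are not part of Sophora.

```python
LIVE_SUFFIX = "-live"  # suffix of live collections, as described above


def collection_for(index_name: str, context: str) -> str:
    """Return the name of the Solr collection a client should query.

    index_name: base name of the index, e.g. "default" (hypothetical parameter)
    context: "live" for delivery, "preview" for working versions (assumed labels)
    """
    if context == "live":
        # Live contexts should only ever see published versions.
        return index_name + LIVE_SUFFIX
    if context == "preview":
        # Previews need the working (possibly non-published) versions.
        return index_name
    raise ValueError(f"unknown context: {context!r}")
```

A delivery web app would call collection_for("default", "live"), while a preview web app connected to the Primary Server would call collection_for("default", "preview").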

Queuing and Prioritization of Indexing Tasks

Whenever a document is changed, it must be updated in all SolrCloud collections.

Some changes have an effect on multiple documents, e.g. changes to the name of a structure node, the label of a select value or an inherited property. In the worst case, all documents have to be re-indexed. (Technical note: Bulk changes result in a DerivedDocumentChangedEvent, which can also be received by a regular IServerEventListener. A DerivedDocumentChangedEvent contains the UUIDs of all affected documents and the cause of the change.)

Changes made by editors during the processing of a bulk change should be published to Solr without any delay. To accomplish this, all instances of the Sophora Indexing Service share the same Redis queues for indexing tasks. Each queue is assigned a priority to distinguish between low and high priority tasks. The queue polling is based on these priorities.

The priority of the indexing tasks is based on the following rules:

Changes made by editors always have a higher priority than bulk changes.
Indexing tasks of the default and default-live collections have a higher priority than indexing tasks of custom collections.
Older indexing tasks have a higher priority than newer ones (FIFO).
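The three rules can be illustrated with a sort key. The sketch below is not the Sophora implementation: it replaces the shared Redis queues with a single in-memory heap, and it assumes that the rules apply in the order listed (editor changes first, then default collections, then age). The class name IndexingQueue and its methods are hypothetical.

```python
import heapq
import itertools

DEFAULT_COLLECTIONS = {"default", "default-live"}


class IndexingQueue:
    """In-memory stand-in for the prioritized indexing queues."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # arrival order, used for FIFO

    def put(self, uuid, collection, bulk):
        # Smaller tuples are polled first.
        key = (
            1 if bulk else 0,                               # editor changes before bulk changes
            0 if collection in DEFAULT_COLLECTIONS else 1,  # default collections before custom ones
            next(self._counter),                            # older tasks before newer ones
        )
        heapq.heappush(self._heap, (key, uuid, collection))

    def pop(self):
        _key, uuid, collection = heapq.heappop(self._heap)
        return uuid, collection

    def __len__(self):
        return len(self._heap)
```

With this key, an editor change queued in the middle of a large bulk change is polled before the remaining bulk tasks, which is the behaviour the rules are meant to guarantee.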

HTTP API to Index Documents

It is possible to trigger the indexing of documents via HTTP requests to the Sophora Indexing Service.

The documents to be re-indexed can be specified by the following criteria:

one of its IDs (UUID, SophoraId, ExternalId)
a date range
a query

It is also possible to re-index a whole collection.

Further documentation about the API endpoints is available under the http(s) root path of any running Sophora Indexing Service instance.

Synchronization by using the Change Registry

If the Sophora Indexing Service must be synchronized with the Sophora Primary Server, e.g. after an outage or after connection issues, it checks the Change Registry for all changes that have occurred since the last Sophora event it processed.

This approach ensures that no changes are lost. Even derived changes or the complete deletion of a document can be re-indexed in this way.
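The catch-up idea can be sketched as follows. The structure of the real Change Registry is not described on this page, so everything here is an assumption for illustration: the Change record, its sequence number, and the dictionary that stands in for a Solr collection are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    sequence: int          # hypothetical ordering number of the change
    uuid: str              # UUID of the changed document
    deleted: bool = False  # True if the document was deleted completely


def catch_up(changes, last_processed, index):
    """Replay every change newer than last_processed into index.

    index is a dict (uuid -> sequence of the last indexed change) that
    stands in for a Solr collection. Returns the new last processed sequence.
    """
    for change in sorted(changes, key=lambda c: c.sequence):
        if change.sequence <= last_processed:
            continue  # already handled before the outage
        if change.deleted:
            index.pop(change.uuid, None)  # remove deleted documents from the index
        else:
            index[change.uuid] = change.sequence  # (re-)index the document
        last_processed = change.sequence
    return last_processed
```

Because the replay starts from the last processed position rather than from the current state, deletions that happened during the outage are also applied.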

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.