Introduction
Search results in a DeskClient can be ordered by relevance (see Sophora User Guide). The order is determined by Sophora's search engine (Solr). It is based on how often the search term appears in a document (term frequency) and how rare the search term is across all documents (inverse document frequency). However, the search engine rates properties on purely lexical grounds, whereas users often expect an order that reflects the domain of the content.
For example, if a photographer searches for 'light' and has a document type that catalogs his spotlights, he might expect to find one of these rather than a picture whose description mentions light colors. A document that contains the search term in the property 'photography:equipment-category' should therefore be more relevant. This feature makes that possible.
Sophora allows the administrator to define which properties are more important. This page describes how this can be configured and which possibilities exist.
To use this feature, set sophora.solr.iquerysearch.enabled to true. See Server configuration for details.
Configuration
Solr relevance is configured in the configuration section of the administration view. Open Configuration >> Solr-Relevance by double-clicking it. The table shows the current configuration; each row defines the relevance boost for a single property. For general handling of configuration documents please refer here.
The key of a row can be either a Sophora property name or a Solr field name. A property name might be 'sophora-content:title'. A corresponding Solr field name would be 'sophora-content_title_t'. If a Sophora property is mapped to multiple Solr fields, only one of them will be boosted (otherwise the boost would be applied multiple times). A text field is preferred over a string field. If neither exists, an arbitrary field will be used. To have full control over which Solr field is boosted, you can enter its name directly as the key.
Example | Syntax | Description |
---|---|---|
sophora-content:title | namespace ':' name | A Sophora property name which should be boosted. It will be mapped to a Solr field. The boost applies to all document types which contain a property with this name. |
sophora-content_title_t | anything without a colon | A Solr field which should be boosted. |
A row can have multiple values. Each value defines a boost for the field denoted by the key. A value is either a number, or a term followed by an equals sign '=' and a number. The first format boosts the field when it contains the search term; the second format boosts documents whose field contains the given term. In both cases the number is a multiplier of the default score given to the defined field.
Example | Syntax | Description |
---|---|---|
27.5 | digit [.digit] | A number by which the field should be boosted. |
sophora=8 | text '=' digit [.digit] | Documents where the field contains the given text will be boosted by the given number. |
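The key and value formats above can be illustrated with a small parser. This is only a sketch of the syntax described in the two tables; the names Boost, is_property_name and parse_value are hypothetical and do not reflect how Sophora parses the configuration internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Boost:
    factor: float               # multiplier of the field's default score
    term: Optional[str] = None  # None: boost matches of the search term itself


def is_property_name(key: str) -> bool:
    """Keys with a colon are Sophora property names, all others Solr field names."""
    return ":" in key


def parse_value(value: str) -> Boost:
    """Parse a value such as '27.5' or 'sophora=8' (hypothetical helper)."""
    # Split at the last '=' so that the term itself may contain other characters.
    term, sep, number = value.strip().rpartition("=")
    if sep:
        return Boost(float(number), term)
    return Boost(float(number))


print(is_property_name("sophora-content:title"))
print(parse_value("27.5"))
print(parse_value("sophora=8"))
```

Splitting at the last equals sign keeps terms that contain a colon, such as 'sophora-content-nt:story=100', intact.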
Examples
Here are some examples of what can be boosted.
Use case | Example Key | Example Value | Description |
---|---|---|---|
The title is more important. | sophora-content:title | 3 | If the search term is contained in the title of a document it would be more relevant than if the search term was found in another property. |
Documents which are online are more important. | sophora:isOnline | true=20 | If a document has the value 'true' in its property 'sophora:isOnline', the relevance is boosted slightly (because almost every document contains the online property, its base relevance is very low). |
Documents of the type 'Story' are most important and images more important than other documents. | primaryType_s | sophora-content-nt:story=100 sophora-extension-nt:image=50 | The key is given as a Solr field name because it is a pre-defined system property that is not configured in Sophora. Two values are set for this key: the type 'Story' will be boosted by 100 and images by 50. As explained in the section on experimenting with queries, this does not mean that stories end up twice as relevant as images. |
Newer documents are more important. | _val_ | recip(ms(NOW, sophora_modificationDate_dt), 3.16e-11, 1, 1)=100 | For the special key '_val_' a Solr function can be defined. The example function calculates a reciprocal value based on the difference of now and the modification date of the document in milliseconds (for details see Date Boosting in the Solr wiki). |
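To get a feeling for the date function in the last example, the following sketch reproduces Solr's recip(x, m, a, b) function, which is defined as a / (m * x + b), in plain Python. The helper recency and the sample ages are illustrative assumptions; the parameters 3.16e-11, 1, 1 are taken from the example row.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x: float, m: float, a: float, b: float) -> float:
    """Solr's reciprocal function: a / (m * x + b)."""
    return a / (m * x + b)


def recency(age_days: float) -> float:
    """Value of recip(ms(NOW, date), 3.16e-11, 1, 1) for a document of the given age."""
    return recip(age_days * MS_PER_DAY, 3.16e-11, 1, 1)


for days in (0, 30, 365.25, 3652.5):
    print(f"{days:>8} days old: {recency(days):.3f}")
```

With these parameters a document modified just now gets the value 1, and a document that is about one year old gets roughly half of that, because 3.16e-11 is approximately the reciprocal of the number of milliseconds in a year.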
Getting good weights
You know which properties you want to boost and have entered them in the configuration, but the search results are not sorted as you expected. Although a property got a high number in your configuration, documents matching it stay in the second half of the results. Here are some tips on how to tune the weights of the different properties.
It is important to know that the score of a search result is largely based on the relative frequency of the term and the field. A word that exists in almost every document will not get a high score and may therefore need a higher boost than a very rare word that exactly matches a special property. In addition, the score is normalized for each query depending on the number of searched terms and fields, so the same document receives different scores in different queries. The absolute value of a score therefore means nothing; a score is always relative to the other documents in the same search result. For one query a score of 3 might be the best match, while for another query the top result might get a 7. Because of this, Sophora uses multiplicative boosts instead of additive ones.
Displaying the relevancy score
To see the score calculated by the search engine, you can use the web interface at http://<sophoraserver>:1196/solr/ (1196 is the default Solr port; you might have changed it). Go to default >> Query to run a search on the default index (as the DeskClient does). The query must be entered in the field labeled q. A query consists of a field name followed by a colon and the field value, like myField_s:myValue. The field fl defines which fields are listed in the result. Add the value score here to see the relevancy score of each result. To see all fields plus the score, use *,score. To see only the Sophora ID and the score, use sophora_id_s,score.
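The same query can also be sent without the web form. The following sketch only builds the request URL from the q and fl parameters described above. It assumes that the index is reachable through Solr's standard select handler under the core name 'default'; the function name build_select_url and the host 'localhost' are placeholders, so adjust them to your installation.

```python
from urllib.parse import urlencode


def build_select_url(host: str, query: str, fields: str = "*,score",
                     debug: bool = False, port: int = 1196,
                     core: str = "default") -> str:
    """Build a Solr select URL (path and core name are assumptions)."""
    params = {"q": query, "fl": fields}
    if debug:
        # Corresponds to the debugQuery checkbox in the web interface.
        params["debugQuery"] = "true"
    return f"http://{host}:{port}/solr/{core}/select?{urlencode(params)}"


print(build_select_url("localhost", "myField_s:myValue", "sophora_id_s,score"))
```

urlencode takes care of escaping the colon and comma, which is easy to forget when the URL is typed by hand.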
Solr can show detailed information on how the score of a document is calculated. To see this output, select the checkbox debugQuery. After executing a query, an explanation section is displayed below the normal result. As an example, here is the explanation of the first result of the query +(searchField_t:London) OR sophora-content_title_t:London^5:
<lst name="explain">
<str name="ffafe039-5c7e-4342-8a80-f58395a502fb">
5.588295 = (MATCH) sum of:
0.11100232 = (MATCH) weight(searchField_t:london in 275) [DefaultSimilarity], result of:
0.11100232 = score(doc=275,freq=1.0 = termFreq=1.0
), product of:
0.14546329 = queryWeight, product of:
4.0698404 = idf(docFreq=12, maxDocs=280)
0.035741765 = queryNorm
0.7630951 = fieldWeight in 275, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.0698404 = idf(docFreq=12, maxDocs=280)
0.1875 = fieldNorm(doc=275)
5.4772925 = (MATCH) weight(sophora-content_title_t:london^5.0 in 275) [DefaultSimilarity], result of:
5.4772925 = score(doc=275,freq=1.0 = termFreq=1.0
), product of:
0.9893637 = queryWeight, product of:
5.0 = boost
5.536177 = idf(docFreq=2, maxDocs=280)
0.035741765 = queryNorm
5.536177 = fieldWeight in 275, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
5.536177 = idf(docFreq=2, maxDocs=280)
1.0 = fieldNorm(doc=275)
</str>
You can see that the score is 5.588295, which is the sum of 0.11100232 for finding the term in searchField_t and 5.4772925 for matching the boosted relevance field. The boost was defined for a match in the property 'sophora-content:title', which scored 5.536177 for containing the search term once (fieldWeight). The queryWeight was boosted by 5 as defined in the query. The contribution of the boosted field is the product of these two weights (0.9893637 × 5.536177 = 5.4772925).
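The arithmetic of the explanation can be checked by hand. The sketch below recomputes both parts of the score from the values printed in the output above; the function name term_score is made up for this illustration, and the formula is simply the product structure shown in the explanation (queryWeight = boost × idf × queryNorm, fieldWeight = tf × idf × fieldNorm).

```python
def term_score(idf: float, query_norm: float, field_norm: float,
               tf: float = 1.0, boost: float = 1.0) -> float:
    """Score of one query clause as shown in the explain output."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


# Values copied from the explain output for document 275.
search_field = term_score(idf=4.0698404, query_norm=0.035741765, field_norm=0.1875)
title = term_score(idf=5.536177, query_norm=0.035741765, field_norm=1.0, boost=5.0)

print(f"searchField_t part: {search_field:.6f}")
print(f"title part:         {title:.6f}")
print(f"total:              {search_field + title:.6f}")
```

Note that idf enters the product twice, once in the queryWeight and once in the fieldWeight, so a rare term gains much more from the same boost than a common one.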
Experimenting with queries
To make a simple query, use the field searchField_t. To boost the relevance of a match, optional parts are added to the query. The normal query is enclosed in parentheses and prefixed with a plus sign. After this, the optional relevance boosting parts are added with 'OR'. The boost of a field is appended with a caret '^'. In the example above the field 'sophora-content_title_t' with the value 'London' is boosted by 5.
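The query structure described above can be assembled programmatically when you want to try many weights. This is a minimal sketch for experimenting, not the code Sophora uses to build its queries; build_query and escape_colons are hypothetical helpers, and the search term itself is not escaped.

```python
from typing import List, Optional, Tuple


def escape_colons(value: str) -> str:
    """Colons inside a field value must be escaped with a backslash."""
    return value.replace(":", "\\:")


def build_query(term: str,
                boosts: List[Tuple[str, Optional[str], float]]) -> str:
    """Mandatory search part plus optional boosted parts joined with OR.

    Each boost is (solr_field, value, factor); a value of None means the
    field is matched against the search term itself.
    """
    parts = [f"+(searchField_t:{term})"]
    for field, value, factor in boosts:
        match = term if value is None else escape_colons(value)
        parts.append(f"{field}:{match}^{factor:g}")
    return " OR ".join(parts)


print(build_query("London", [("sophora-content_title_t", None, 5)]))
print(build_query("London", [("primaryType_s", "sophora-content-nt:story", 10)]))
```

The second call shows why the escaping matters: without the backslash, Solr would read 'sophora-content-nt' as a field name.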
To tune the score of your filter, first create a simple query: searchField_t:London searches for the word 'London'. Look at the scores in the search result. In this example they are:
No. | ID | Type | Score |
---|---|---|---|
1. | london100 | Story | 1.0131925 |
2. | londonprints102 | Image | 0.7164353 |
3. | page106 | Teletext | 0.7079898 |
4. | page104 | Teletext | 0.50659627 |
5. | londonprints100 | Image | 0.50659627 |
Now we add a boost for documents of the type 'Story': +(searchField_t:London) OR primaryType_s:sophora-content-nt\:story^10 searches for the word 'London' and boosts stories by 10. The order is the same as before because the only story was already the top-rated document. But the scores have changed:
No. | ID | Type | Score |
---|---|---|---|
1. | london100 | Story | 3.722851 |
2. | londonprints102 | Image | 0.039715305 |
3. | page106 | Teletext | 0.039247133 |
4. | page104 | Teletext | 0.028082963 |
5. | londonprints100 | Image | 0.028082963 |
But now we want to search mainly for images, so we add another weight for images. The complete query looks like this: +(searchField_t:London) OR primaryType_s:sophora-content-nt\:story^10 OR primaryType_s:sophora-extension-nt\:image^20. This does not lift the images to the top, although the boost is twice the story boost, but it allows the last result to climb some ranks:
No. | ID | Type | Score |
---|---|---|---|
1. | london100 | Story | 1.615343 |
2. | londonprints102 | Image | 1.1135606 |
3. | londonprints100 | Image | 1.103466 |
4. | page106 | Teletext | 0.01702931 |
5. | page104 | Teletext | 0.0121851815 |
Looking into the details shows that the story got a boost of 2.3499033 and the images a boost of 1.6186434. Why is the actual story boost larger than the image boost, when the defined image boost is twice the amount of the story boost? Let's have a look at the score calculation:
[story]
2.3499033 = (MATCH) weight(primaryType_s:sophora-content-nt:story^10.0 in 105) [DefaultSimilarity], result of:
2.3499033 = score(doc=105,freq=1.0 = termFreq=1.0
), product of:
0.6468367 = queryWeight, product of:
10.0 = boost
3.632916 = idf(docFreq=34, maxDocs=487)
0.01780489 = queryNorm
3.632916 = fieldWeight in 105, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
3.632916 = idf(docFreq=34, maxDocs=487)
1.0 = fieldNorm(doc=105)
[image]
1.6186434 = (MATCH) weight(primaryType_s:sophora-extension-nt:image^20.0 in 5) [DefaultSimilarity], result of:
1.6186434 = score(doc=5,freq=1.0 = termFreq=1.0
), product of:
0.75920707 = queryWeight, product of:
20.0 = boost
2.1320183 = idf(docFreq=156, maxDocs=487)
0.01780489 = queryNorm
2.1320183 = fieldWeight in 5, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
2.1320183 = idf(docFreq=156, maxDocs=487)
1.0 = fieldNorm(doc=5)
The boost was applied correctly, but the weight of the match is different. The boosted queryWeight of the image (0.7592) is actually higher than that of the story (0.6468). But the fieldWeight of a story in this example is approximately 3.6, whereas an image only gets 2.1. As the output shows, this value depends on how many documents of that type exist (34 stories against 156 images). Matching one of the rare stories therefore counts as a better match than finding the word in one of the many images, because the target to hit is much smaller.
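The idf values in the output can be reproduced with the formula of Lucene's classic TF-IDF similarity, idf = 1 + ln(maxDocs / (docFreq + 1)). The sketch below uses it to estimate how large the image boost must be before the image part outweighs the story part. It is a rough estimate only: it looks at the two boosted clauses in isolation and ignores both the text score and the fact that queryNorm changes when the boosts change.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """idf as used by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


idf_story = classic_idf(34, 487)   # 34 stories in 487 documents
idf_image = classic_idf(156, 487)  # 156 images in 487 documents

# With tf = 1 and fieldNorm = 1 a clause scores boost * idf^2 * queryNorm,
# so the boosts must compensate the squared idf ratio.
break_even = (idf_story / idf_image) ** 2

print(f"idf story: {idf_story:.4f}, idf image: {idf_image:.4f}")
print(f"image boost must exceed {break_even:.2f} x story boost")
```

For a story boost of 10 this gives a break-even image boost of about 29, which fits the observation that a weight of 30 lifts the images to the top.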
For our example query, a weight of 30 would be sufficient to lift the images to the top of the results. To give them a much higher score than the stories, a value of at least 50 is necessary (scores 1.35 against 0.80).
Further information about the relevance in Solr can be found in the Solr relevancy FAQ.