Solr Schema

Last modified by Vincent Massol on 2024/02/26 17:57

 XWiki
 Implementation
 Completed
 
 

The Solr schema configuration file used in XWiki

Description

Introduction

The goal of this document is to determine the Solr schema that is best suited for the XWiki model.

Solr is a search platform that is built on top of Lucene. It supports full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. It uses a plugin architecture that allows the administrator to set up different types of analysers by combining tokenizers, filters, etc. All of this is done declaratively using XML configuration files.

In XWiki (model) we have a few entity types. There are wikis which have spaces which have documents. A document can have objects and attachments. A document can also define a class. At the same time we like to say that in XWiki "everything is a document" because everything revolves around documents. The document is the central notion. We can query the database (using HQL or XWQL) for any of the previously mentioned entities but what should a Solr query return (semantically)? In other words:

  • are we searching for an object without caring about the document that holds the object? Same for an object property.
  • how often are we searching for an attachment without caring about the document that holds the attachment?
  • are we searching for a class or for the document that defines that class?
  • are we searching for a wiki without caring about the documents it contains? Same for a space.

Test Cases

Before we start designing the Solr schema let's see some examples of queries that we would like to perform on top of Solr. In other words, what do we search for?

  1. TC1: Match all blog posts that contain the word 'wiki' and that are tagged with 'CSS'.
  2. TC2: Match all blog posts which have the word 'john' in any field and have the status published and are of the category 'News' or 'XWiki' and highlight 'john' in all results.
  3. TC3: Match all documents which have the word 'john' in any field except user documents and highlight 'john' in all results.
  4. TC4: Be able to create a UI that would allow us to filter the search results by criteria which correspond to fields in a class that defines the type of these documents. Same for sorting.
  5. TC5: Match all files that contain the word 'wiki' and that are attached to blog post documents.
  6. TC6: Match all documents that have at least one text/plain (media type) attachment that contains the word 'wiki'.
  7. TC7: Match all documents that have the word 'wiki' but don't look in attachments nor in Space.Article_image property, and also match all attachments that contain the word 'wiki' in their name.
  8. TC8: Match all blog posts that contain the word 'wiki' and that have been published between two given dates.
  9. TC9: Suppose a document can have one TestClass object, with a 'steps' property, and multiple TestExecutionClass objects, with 'tester' and 'comment' properties. Find all tests that contain 'wiki' in the steps to reproduce and that have been executed by 'mflorea' with a comment containing 'failed'.

TODO

Add more test cases!

Indexing a Single Entity vs. Multiple Entities

Unlike a relational database, a Solr/Lucene index has a single 'table'. So normally you index a single entity type. Each row in the index represents an entity of that type. As a consequence the result of a Solr query is semantically a list of entities of that type. In our case the first choice for the entity type is (naturally) the document.

If you want to index more entity types (e.g. index attachments and objects separately, not as part of a document) then, since there is only one 'table' in the index, you need to add a 'type' column that specifies the type of entity you have on each row (e.g. type=document, type=attachment, type=object etc.). The result of a Solr query is now, semantically, a list of different entity types, unless you filter by a specific type.

Let's imagine what happens if we want to search for blog posts that have a specific tag. With the first approach the problem we have is that XWiki documents can store dynamic structured data. We cannot hard-code the name of the class, 'Blog.BlogPostClass', or the name of the property, 'tag', in the Solr schema because they are defined by the user at runtime: users can create their own data types in XWiki. Fortunately Solr supports dynamic fields, but we still have another problem: structured data is hierarchical and any attempt to index it flat, on the same row, will suffer from some information loss. For instance, suppose you have multiple objects of the same type on the same document and you want to find the object whose properties (at least 2) satisfy some constraints. You want the matching properties to be from the same object, not from different objects on the same document, and a flat row cannot express that.

The second approach, to index multiple entities, has its own complexity because the information is spread on multiple rows:

  • one row with type=document for the blog post document
  • one row with type=object for the blog post object
  • one row with type=object for the tag object

In a relational database when you have the information spread in multiple places (tables) you do joins. Fortunately (you would say) Solr supports joins. In this particular case we would have to perform 2 joins which means:

index X index X index

where X represents the cartesian product. The document name would be the join key. Pretty complex even before trying to write this in Solr query syntax.

Another option is to use field collapsing to group search results. This is somewhat similar to the GROUP BY clause in SQL. We could, for instance, group the object/attachment results based on the document that holds them. Grouping cannot replace a join though. If you filter the rows before the grouping then you cannot be sure they correspond to the same entity (document). If you filter the groups then you cannot easily access the initial rows that were used to build the group (object and attachment rows). Moreover, grouping is not performed in the query statement but using the API (the Query Manager in our case).

When we draw the line, the question becomes: is it worth indexing more entities separately instead of indexing just documents (with info about their objects and attachments) considering the complexity that it brings in writing Solr queries? Do we search for objects and attachments alone as separate entities often enough to justify this complexity? Can we accept the information limitations that come with indexing only documents?

A Mixed Approach

The best solution could be to mix the single entity approach with the multiple entities one. In order to avoid joins we have to duplicate information. This means that for each entity we have to index (and duplicate) information about other entities that are related. We must decide from the start if we care about the Solr index size or not. The good news is that we could make the indexing of each entity optional/configurable. For instance if you know that you'll be searching only for documents then you can disable the indexing of the attachments, objects and object properties as separate entities. Of course, they will be indexed on document rows, with a bit of information loss though, because we lose some entity relations when we transform the tree/hierarchical structure from XWiki to the flat structure from Solr.
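To make the mixed approach more concrete, here is a minimal Python sketch of how one document could be flattened into several index rows that share the wiki, space and name fields and differ by type. The input dictionary layout, the function name to_rows and the two flags are hypothetical and exist only for illustration; this is not the actual indexer code.

```python
def to_rows(doc, index_objects=True, index_attachments=True):
    """Flatten one document into index rows (hypothetical data layout)."""
    shared = {"wiki": doc["wiki"], "space": doc["space"], "name": doc["name"]}
    # The document row duplicates object and attachment data to avoid joins.
    rows = [{
        **shared,
        "type": "DOCUMENT",
        "class": [obj["class"] for obj in doc["objects"]],
        "filename": [att["filename"] for att in doc["attachments"]],
    }]
    # Separate rows per entity are optional, which keeps the index small
    # when only documents are searched.
    if index_objects:
        rows += [{**shared, "type": "OBJECT", "class": obj["class"], "number": obj["number"]}
                 for obj in doc["objects"]]
    if index_attachments:
        rows += [{**shared, "type": "ATTACHMENT", "filename": att["filename"]}
                 for att in doc["attachments"]]
    return rows


post = {
    "wiki": "xwiki", "space": "Blog", "name": "Hello",
    "objects": [{"class": "Blog.BlogPostClass", "number": 0},
                {"class": "XWiki.TagClass", "number": 0}],
    "attachments": [{"filename": "todo.txt"}],
}
rows = to_rows(post)
```

With everything enabled the sample post produces one document row, two object rows and one attachment row; with both flags disabled only the document row remains.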

Let's see the fields for each entity type.

Shared Fields

Some fields need to be shared by all indexed entities. The wiki, space and name information is shared because each indexed entity is in our case either a document or held by a document.

Name | Description
id |
type | The type of entity that is indexed. E.g. document, attachment, object, object property etc.
wiki |
space | Deprecated since 7.2, use the spaces multivalued field instead. The local space reference. For a document A.B.C.Page the value of this field is A.B.C. This field is analyzed and thus used for free text search.
spaces | Since 7.2. The space names. E.g. for a document A.B.C.Page the value is ['A', 'B', 'C']. This field is analyzed and thus mostly used for free text search.
space_exact | We index the local space reference (e.g. A.B\.1.C) verbatim for exact matching.
space_facet | Since 7.2. We also need a dedicated field for hierarchical faceting on nested spaces. This field is used to implement a 'facet.prefix'-based drill down. E.g. for a document A.B.C.Page this field will hold ['0/A.', '1/A.B.', '2/A.B.C.'].
space_prefix | Since 7.2. This field is used to match descendant documents. A query such as space_prefix:A.B will match the documents from space A.B and all its descendants (like A.B.C). This is possible because this field holds the local references of all the ancestor spaces of a document (i.e. all the prefixes of the space reference). E.g. for a document A.B.C.Page this field will hold ['A', 'A.B', 'A.B.C']. As a consequence, searching for space_prefix:A.B will match A.B.C.Page. NOTE: We don't use the PathHierarchyTokenizer because it doesn't support specifying an escaping character. We compute the values ourselves at index time as a workaround.
name | The document name. This field is analyzed and thus mostly used for free text search.
name_exact | We also need to store the document name verbatim for faceting (exact matching). This facet is useful for attachments and objects for instance.
locale |
locales | The list of locales covered by this entry. Dynamically determined from the list of enabled locales and the various locales of the document.
language | Contains only the language part of the locale.
hidden | Whether the entity is hidden or not. Only documents can be made hidden explicitly. The attachments, objects and object properties are hidden if the document that holds them is hidden.
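
The space-related values described above can be computed from the list of space names at index time. The following Python sketch shows one way to do it; the function names are hypothetical, and the escaping rule (a backslash in front of dots and backslashes inside a space name) is an assumption based on the A.B\.1.C example. It also assumes the document is in at least one space.

```python
def escape(space_name):
    # Assumption: dots and backslashes inside a space name are backslash-escaped.
    return space_name.replace("\\", "\\\\").replace(".", "\\.")


def space_fields(space_names):
    """Compute the space-related field values for a non-empty list of space names."""
    # All the prefixes of the local space reference: A, A.B, A.B.C
    prefixes = [".".join(escape(name) for name in space_names[:depth + 1])
                for depth in range(len(space_names))]
    return {
        "spaces": list(space_names),
        "space_exact": prefixes[-1],
        "space_prefix": prefixes,
        # "<depth>/<prefix>." supports a facet.prefix-based drill down.
        "space_facet": [f"{depth}/{prefix}." for depth, prefix in enumerate(prefixes)],
    }


fields = space_fields(["A", "B", "C"])
```

Because the prefixes are computed before indexing, a plain exact match such as space_prefix:A.B is enough to find descendants, and no tokenizer needs to know about the escape character.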

Document Fields

First of all we need to index the document title, content and meta data.

Name | Description
fullname | I don't think we need this field, do we?
title_* | The localized title, indexed based on the document locale. E.g. title_ro
title_sort | We need a dedicated field for sort because analyzed fields cannot be used for sort.
doccontent_* | The rendered document content (transformations are not executed). E.g. doccontent_pt_BR. NOTE: The reason we added the 'doc' prefix instead of keeping just 'content' is because we wanted to be able to use a different boost value for the document content than for the object (objcontent) and the attachment content (attcontent, see the 'qf' parameter in solrconfig.xml).
doccontentraw_* |
version | We need to index the document version (revision, e.g. '2.4') to be able to detect when the index is not up to date (not in sync with the database). This check is performed at XWiki startup for instance (see IndexerJob#addMissing).
comment_* | The localized version summary. A brief description of the changes made in the latest version. E.g. comment_en
doclocale | Contains the technical locale of the document (i.e. empty for default entry).
author | The last author. This field is used for faceting (exact matching).
author_display | The last author, this time analyzed and thus used for free text search.
author_display_sort |
creator | The document creator, stored verbatim for faceting (exact matching).
creator_display | The document creator, this time analyzed and thus used for free text search.
date |
creationDate |

Then, in order to avoid joins, we need to index the objects. We try to make the structured data flat using dynamic fields.

Name | Description
class/object | The type of objects stored by this document. E.g. [Blog.BlogPostClass, XWiki.TagClass, ..]
objcontent_* | This field collects the values from all the properties of all the objects found on the indexed document. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
object.aSpace.aClass_* | Dynamic multiValued field indexing the entire content of the objects of the specified type. All values are indexed as localized text, using the document locale. E.g. object.XWiki.TagClass_fr
property.aSpace.aClass.aPropertyName_* | Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.Blog.BlogPostClass.published_boolean, property.Blog.BlogPostClass.publishDate_date, property.Blog.BlogPostClass.category_string, property.Blog.BlogPostClass.summary_en
property.aSpace.aClass.aPropertyName_sort* | Dedicated field for sorting on property values. We need this because Solr doesn't support sorting on multiValued fields. E.g. property.Blog.BlogPostClass.publishDate_sortDate

Notice that we index the property name only in the objcontent field (mixed with the property value). We don't have a dedicated field for this, i.e. apart from objcontent the object property names appear only in the field names and not in the field values. Do we need to index the property names? We index the class names because we want to filter documents of a given type. Is there a real use case when we need to find documents that have objects with a given property?

Non-string XObject properties should be indexed based on their type. This means we'll be able to write type-specific constraints in Solr query (e.g. ranges) for Boolean, Number (int, long, float, double) and Date properties:

property.Blog.BlogPostClass.publishDate:[NOW-1MONTH TO NOW]

We can achieve this by suffixing the field name with the type name: property.Blog.BlogPostClass.publishDate_date. But in order to use just the field name in the Solr query we need Dynamic Field Aliases.

Note that only the String and the TextArea properties should be indexed as localized text (depending on the document locale). For the rest of the string-based properties (Access Right, List of Users, List of Groups, DBList, etc.) we should use the "string" Solr field type to index/store the property value verbatim in order to be able to perform exact matches on these properties. For StaticList, we need to index the raw value (what is saved in the database) as string (so verbatim), and the display value (what is specified in the XClass) as localized text (so analysed).
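
The rules above can be summarised as a mapping from the XClass property type to the suffix of the dynamic field. Here is a rough Python sketch of that mapping. The function name and the property type names used as keys are hypothetical, and the numeric suffixes (int, long, float, double) are assumptions; only the boolean, date, string and locale suffixes appear in the examples of this document.

```python
# Hypothetical mapping from property type to dynamic field suffix.
TYPE_SUFFIX = {
    "Boolean": "boolean",
    "Date": "date",
    "Integer": "int",     # assumed suffix
    "Long": "long",       # assumed suffix
    "Float": "float",     # assumed suffix
    "Double": "double",   # assumed suffix
}


def field_names(class_name, property_name, property_type, locale):
    """Return the dynamic field names used to index one property value."""
    base = f"property.{class_name}.{property_name}"
    if property_type in ("String", "TextArea"):
        # Free text: indexed as localized text, based on the document locale.
        return [f"{base}_{locale}"]
    if property_type == "StaticList":
        # Raw value verbatim, display value as localized text.
        return [f"{base}_string", f"{base}_{locale}"]
    # Other string-based types (users, groups, DBList, ...) are stored verbatim.
    return [f"{base}_{TYPE_SUFFIX.get(property_type, 'string')}"]


names = field_names("Blog.BlogPostClass", "publishDate", "Date", "en")
```

Keeping the type in the field name is what lets Solr apply range queries to dates and numbers while still matching list values exactly.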

A problem with dynamic fields is that we can get invalid field names. The documentation says:

 field names should consist of alphanumeric or underscore characters only and not start with a digit.  This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed.  Names with both leading and trailing underscores (e.g. _version_) are reserved.

The class name will surely have a dot and it can also contain other invalid characters (if it's not a standard XWiki class).

Another problem is that Solr doesn't support dynamic fields as default fields, i.e. as fields that are matched when you search for free text (without field:value in the query). This is not a problem for the search results, as dynamic fields like object.* and property.* are copied and aggregated in objcontent which is a default field. The issue is that we can't know exactly which XClass property was matched; we just know that the searched free text was found inside an object.

When searching for documents we should also take the attachments into account.

Name | Description
filename | The name of files attached to this document. E.g. ['todo.txt', 'image.png']
filename_exact | We also need to store the file names verbatim (without analysing them) for exact/prefix matching.
mimetype | The list of attachment media types. E.g. ['text/plain', 'image/png']
attauthor | The absolute references of the users that uploaded the last version of each of the document attachments. Used for faceting (exact matching). E.g. ['math:XWiki.mflorea', 'gang:XWiki.vmassol', 'xwiki:XWiki.evalica']
attauthor_display | Same as attauthor but indexes the real user name instead of the reference (alias) and it is used for free text search. E.g. ['Thomas Mortagne', 'Florea Marius Dumitru']
attdate | The dates when the attachments have been uploaded (their last version).
attcontent_* | The content of each attachment, indexed based on the document locale. E.g. attcontent_en : ['content of first attachment', 'content of second attachment']
attsize | The size of each attachment in bytes.

All the attachment fields I listed are multivalued. The problem we have with this solution is that the relation between the fields of the same attachment is kept only in the form of the value index (e.g. 3rd attachment size corresponds to 3rd attachment name) which can't be used in queries. In other words, we won't be able to query for documents that have a text/plain file which contains a given word. We will be able to query for documents that have a text/plain file and a file (not necessarily the same!) which contains the given word.
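
The information loss is easy to reproduce outside Solr. The Python sketch below simulates a document row with two attachments and compares a per-field match (what a query on the flat row can express) with a per-attachment match (what TC6 actually asks for). The sample data and both function names are made up for illustration.

```python
# A document row with two attachments, stored as parallel multivalued fields.
row = {
    "filename": ["notes.txt", "report.pdf"],
    "mimetype": ["text/plain", "application/pdf"],
    "attcontent_en": ["meeting agenda", "all about the wiki"],
}


def matches_flat(row, media_type, word):
    # Each constraint is checked against its whole multivalued field, independently.
    has_type = media_type in row["mimetype"]
    has_word = any(word in content.split() for content in row["attcontent_en"])
    return has_type and has_word


def matches_same_attachment(row, media_type, word):
    # Both constraints must hold for the same attachment (same value index).
    return any(mime == media_type and word in content.split()
               for mime, content in zip(row["mimetype"], row["attcontent_en"]))


flat = matches_flat(row, "text/plain", "wiki")
exact = matches_same_attachment(row, "text/plain", "wiki")
```

Here the flat check reports a match even though the text/plain file does not contain 'wiki', while the per-attachment check correctly rejects the document.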

Another problem is that Solr / Lucene doesn't tell us the index of the value that has been matched from a multivalued field like attcontent so we won't know which attachment has been matched (e.g. if Solr would tell us that the 2nd value from attcontent is matched then we would know the 2nd attachment is matched).

Other solutions for indexing the attachments inside the document rows are:

  • use a dynamic field, e.g. attachment.image.png_*, but we'll hit invalid field names immediately because the file name can contain almost any character
  • aggregate all the information from each attachment in a static multivalued field (e.g. attachment: ['data of 1st attachment', 'data of 2nd attachment', ..])
  • aggregate the information from all attachments in a static single valued field

None of these solutions fixes the problems we mentioned above. When the relation between the attachment fields is important (i.e. we're looking for an attachment that must have 2 or more fields matching some criteria), we have to search for attachments only (type:ATTACHMENT).

Object Fields

Name | Description
class | The object type. E.g. Blog.BlogPostClass
number | The object number, identifies an object when there are multiple objects of the same type on a document.
objcontent_* | This field collects the values from all the properties of the indexed object. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
property.aName_* | Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.published_boolean, property.publishDate_date, property.category_string, property.summary_en

Attachment Fields

Name | Description
filename | The attachment file name. E.g. ['todo.txt']
filename_exact | We also need to store the attachment file name verbatim (without analysing it) for exact/prefix matching.
filename_sort | The attachment file name used for sorting.
mimetype | The attachment media type. E.g. ['text/plain']
attversion | We need to index the attachment version (revision) to be able to detect when the Solr index is out of date (not in sync with the database). E.g. 1.2
attauthor | The absolute reference of the user that uploaded the last version of the attachment. Used for faceting (exact matching). E.g. ['gang:XWiki.vmassol']
attauthor_display | The real name of the user that uploaded the last version of the attachment. Used for free text search. E.g. ['Ecaterina Moraru']
attauthor_display_sort | Same as attauthor_display but used for sorting (single valued).
attdate | The date when the last version of the attachment was uploaded.
attdate_sort | We need a dedicated field for sort because the corresponding field is multiValued (attdate is reused on document rows, see above, and a document can have multiple attachments) and Solr doesn't support sorting on multiValued fields.
attcontent_* | The content of the last version of the attachment, indexed based on the document locale. E.g. attcontent_en
attsize | The size, in bytes, of the last version of the attachment.
attsize_sort | Needed for sort because attsize is multiValued. See attdate_sort.

Encoding Dynamic Field Names

We need to support special characters in dynamic field names. One solution is to use an encoding scheme similar to URL-encoding. We cannot use URL-encoding directly because '+' (plus) and '%' (percent) have special meaning in Solr query syntax. Also, we don't want to encode Unicode letters.

E.g. "Somé Spâce.Bob's Claß" would be encoded as "Somé$20Spâce.Bob$27s$20Claß"
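
A minimal Python sketch of such an encoding is shown below. It keeps Unicode letters, digits, '_' and '.', and replaces every other character with '$' followed by the hexadecimal value of each of its UTF-8 bytes. The function names, the choice of UTF-8 and the exact set of characters left unencoded are assumptions; the example above only fixes the results for the space and the apostrophe.

```python
def encode_field_name(name):
    """Encode characters that are not letters, digits, '_' or '.' as $XX."""
    encoded = []
    for char in name:
        if char.isalnum() or char in "._":
            encoded.append(char)
        else:
            # '$' itself is encoded too ($24), so decoding stays unambiguous.
            encoded.append("".join(f"${byte:02X}" for byte in char.encode("utf-8")))
    return "".join(encoded)


def decode_field_name(encoded):
    """Reverse encode_field_name()."""
    raw = bytearray()
    position = 0
    while position < len(encoded):
        if encoded[position] == "$":
            raw.append(int(encoded[position + 1:position + 3], 16))
            position += 3
        else:
            raw.extend(encoded[position].encode("utf-8"))
            position += 1
    return raw.decode("utf-8")


encoded = encode_field_name("Somé Spâce.Bob's Claß")
```

Using '$' as the escape marker avoids both '%' and '+', which is the reason plain URL-encoding was ruled out.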

Also, it would be nice to be able to extract the class and property reference from a field name in order to display the location where the search text has been found. We can't use the default class / property reference serialization syntax because '\' and '^' have special meaning in the Solr query syntax. One solution is to implement a simple serialization syntax that uses only '.' as entity separator and where the dot is escaped by repeating it.

E.g. "wiki:Some\.Space.My\.Class^color" would be serialized as "wiki.Some..Space.My..Class.color"
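
The dot-doubling syntax can be sketched in a few lines of Python. The function names are hypothetical and the input is assumed to be the list of entity names (wiki, space, class, property) rather than a serialized XWiki reference.

```python
def serialize(names):
    """Join entity names with '.', escaping dots inside a name by doubling them."""
    return ".".join(name.replace(".", "..") for name in names)


def parse(serialized):
    """Split a serialized reference back into entity names."""
    names, current, position = [], [], 0
    while position < len(serialized):
        if serialized.startswith("..", position):
            current.append(".")           # escaped dot, part of the current name
            position += 2
        elif serialized[position] == ".":
            names.append("".join(current))  # single dot: entity separator
            current = []
            position += 1
        else:
            current.append(serialized[position])
            position += 1
    names.append("".join(current))
    return names


serialized = serialize(["wiki", "Some.Space", "My.Class", "color"])
```

Note that this simple scheme becomes ambiguous when a name is empty or starts or ends with a dot (three dots in a row can be read in two ways), so a real implementation would need an extra rule for those cases.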

Dynamic Field Aliases

We have a few dynamic fields, such as object.* and property.*, that are multilingual fields so they are indexed in multiple languages. We need support for dynamic aliases (for dynamic fields) so that we can write:

object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text AND object.XWiki.TagClass:news

and it will be expanded into

object:Blog.BlogPostClass AND
(property.Blog.BlogPostClass.title_en:text OR property.Blog.BlogPostClass.title_fr:text OR ...) AND
(object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR ...)
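
The expansion can be prototyped as a plain string rewrite. The Python sketch below is deliberately naive: it uses a regular expression instead of a real query parser, it ignores quoted phrases and ranges, and it would also expand fields that already carry a type suffix. The function name and the explicit list of locales are assumptions made for the example.

```python
import re

# Matches "object.<class>:<value>" and "property.<class>.<name>:<value>",
# but not the plain "object:<class>" filter.
ALIAS = re.compile(r"\b((?:object|property)\.[^\s:()]+):([^\s()]+)")


def expand_aliases(query, locales):
    """Replace each aliased dynamic field with an OR over its localized fields."""
    def expand(match):
        field, value = match.group(1), match.group(2)
        alternatives = " OR ".join(f"{field}_{locale}:{value}" for locale in locales)
        return f"({alternatives})"
    return ALIAS.sub(expand, query)


expanded = expand_aliases(
    "object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text "
    "AND object.XWiki.TagClass:news",
    ["en", "fr"],
)
```

With the two locales en and fr the result has the same shape as the expanded query shown above, minus the trailing "OR ..." placeholders.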

Faceting on Object Properties

We need to be able to add facets on an XObject property using the Query Module API:

#set ($discard = $query.bindValue('facet.field', ['someOtherField', 'property.Test.TestClass.staticList1_string']))

The 'string' suffix means the property was indexed/stored verbatim (without being analysed). Read above to understand why we suffix the field name with the data type. The facet can be triggered with this query:

object:Test.TestClass

Sorting on Object Properties

We should also be able to sort the document search results based on a property value using the Query Module API:

#set ($discard = $query.bindValue('sort', 'property.Test.TestClass.staticList1_sortString asc'))

The 'sortString' suffix is the dynamic type that is used for sorting. Other types are 'sortBoolean', 'sortInt', 'sortLong', 'sortDouble', 'sortFloat' and 'sortDate'. Note that Solr doesn't support sorting on multivalued fields. The documentation says:

 Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer).

If you try to sort on a multivalued field you'll get:

Caused by: org.apache.solr.common.SolrException: can not sort on multivalued field: property.Test.TestClass.staticList1_string
    at org.apache.solr.schema.SchemaField.checkSortability(SchemaField.java:155)

That's why we need dedicated 'sortXXX' fields that are single valued. The consequence is that only the last value of a property is used for sorting (you can have multiple values either because the property supports multiple selection or because there are multiple objects of the same type on the indexed document). Note that XObject properties are indexed using multivalued dynamic fields (we cannot know beforehand what properties a user-defined XClass will have and if a property supports multiple selection or if a document can have multiple objects of a given type).
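
The "last value wins" behaviour follows directly from how the two fields are filled at index time, as the following Python sketch illustrates. The row is modelled as a plain dictionary and the function name is hypothetical; the real indexer works on Solr input documents.

```python
def index_property(row, field, sort_field, value):
    """Index one property value in a multivalued field and its single-valued sort field."""
    row.setdefault(field, []).append(value)  # multivalued: every value is kept
    row[sort_field] = value                  # single valued: overwritten each time


row = {}
# Two values, e.g. from two objects of the same type on the same document.
for value in ["beta", "alpha"]:
    index_property(
        row,
        "property.Test.TestClass.staticList1_string",
        "property.Test.TestClass.staticList1_sortString",
        value,
    )
```

After indexing, the _string field holds both values while the _sortString field holds only 'alpha', so the document is sorted as if 'beta' did not exist.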

Another option for sorting on fields that have multiple values could be to use a function but I can't find one that returns a single value from a multiValued field.

#set ($discard = $query.bindValue('sort', 'aFunctionThatSelectsOneValue(property.Test.TestClass.staticList1_string) asc'))

Test Results

  1. TC1: Match all blog posts that contain the word 'wiki' and that are tagged with 'CSS'.
    object.Blog.BlogPostClass:wiki AND property.XWiki.TagClass.tags:CSS
    For example:
    (posts that contain the word ‘FOSDEM’, without the tag constraint)
    http://www.xwiki.org/xwiki/rest/wikis/query?q=object.Blog.BlogPostClass:FOSDEM
  2. TC2: Match all blog posts which have the word 'john' in any field and have the status published and are of the category 'News' or 'XWiki' and highlight 'john' in all results.
    object.Blog.BlogPostClass:john AND property.Blog.BlogPostClass.published:1 AND property.Blog.BlogPostClass.category:(News OR XWiki)
    For example:
    (posts that contain the word ‘wiki’, have been published and are of the category ‘News’ or ‘XWiki’)
    http://www.xwiki.org/xwiki/rest/wikis/query?q=object.Blog.BlogPostClass:wiki%20AND%20property.Blog.BlogPostClass.published:1%20AND%20property.Blog.BlogPostClass.category:(News%20OR%20XWiki)
  3. TC3: Match all documents which have the word 'john' in any field except user documents and highlight 'john' in all results.
    john AND -object:XWiki.XWikiUsers
    For example:
    http://www.xwiki.org/xwiki/rest/wikis/query?q=john%20AND%20-object:XWiki.XWikiUsers
  4. TC4: Be able to create a UI that would allow us to filter the search results by criteria which correspond to fields in a class that defines the type of these documents. Same for sorting.
    filter/sort by property.aSpace.aClass.aPropertyName
  5. TC5: Match all files that contain the word 'wiki' and that are attached to blog post documents.
    wiki AND type:ATTACHMENT AND object:Blog.BlogPostClass
    For example:
    http://www.xwiki.org/xwiki/rest/wikis/query?q=object:Blog.BlogPostClass%20and%20fosdem%20and%20
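
The example URLs above carry the Solr query percent-encoded in the q parameter. As a convenience, here is a small Python sketch that builds such a URL with urllib.parse.quote. The base URL is the one used in the examples; the function name is hypothetical, and keeping ':' and parentheses unencoded is a choice made only to reproduce the URLs as written above.

```python
from urllib.parse import quote

BASE_URL = "http://www.xwiki.org/xwiki/rest/wikis/query"


def rest_query_url(solr_query):
    """Build the REST search URL for a Solr query string."""
    # Spaces become %20; ':' and parentheses are left readable.
    return f"{BASE_URL}?q={quote(solr_query, safe=':()')}"


url = rest_query_url("john AND -object:XWiki.XWikiUsers")
```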
