1 {{toc/}}
2
3 = Introduction =
4
5 The goal of this document is to determine the Solr schema that is best suited for the XWiki model.
6
7 [[Solr>>http://lucene.apache.org/solr/]] is a search platform that is built on top of Lucene. It supports full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. It uses a plugin architecture that allows the administrator to set up different types of analyzers by combining tokenizers, filters, etc. All of this is done declaratively using XML configuration files.
8
9 In [[XWiki (model)>>platform:DevGuide.DataModel]] we have a few entity types. There are **wikis** which have **spaces** which have **documents**. A document can have **objects** and **attachments**. A document can also define a **class**. At the same time we like to say that in XWiki "everything is a document" because everything revolves around documents. The document is the central notion. We can query the database (using HQL or XWQL) for any of the previously mentioned entities but what should a Solr query return (semantically)? In other words:
10
11 * are we searching for an object without caring about the document that holds the object? Same for an object property.
12 * how often are we searching for an attachment without caring about the document that holds the attachment?
13 * are we searching for a class or for the document that defines that class?
14 * are we searching for a wiki without caring about the documents it contains? Same for a space.
15
16 = Test Cases =
17
18 Before we start designing the Solr schema let's see some examples of queries that we would like to perform on top of Solr. In other words, what do we search for?
19
20 {{velocity}}
21 #set ($testCases = [
22 "Match all blog posts that contain the word 'wiki' and that are tagged with 'CSS'.",
23 "Match all blog posts which have the word 'john' in any field and have the status published and are of the category 'News' or 'XWiki' and highlight 'john' in all results.",
24 "Match all documents which have the word 'john' in any field except user documents and highlight 'john' in all results.",
25 "Be able to create a UI that allows us to filter the search results by criteria that correspond to fields in a class that defines the type of these documents. Same for sorting.",
26 "Match all files that contain the word 'wiki' and that are attached to blog post documents.",
27 "Match all documents that have at least one text/plain (media type) attachment that contains the word 'wiki'.",
28 "Match all documents that have the word 'wiki' but don't look in attachments nor in Space.Article_image property, and also match all attachments that contain the word 'wiki' in their name.",
29 "Match all blog posts that contain the word 'wiki' and that have been published between two given dates.",
30 "Suppose a document can have one TestClass object, with a 'steps' property, and multiple TestExecutionClass objects, with 'tester' and 'comment' properties. Find all tests that contain 'wiki' in the steps to reproduce and that have been executed by 'mflorea' with a comment containing 'failed'."
31 ])
32 #foreach ($testCase in $testCases)
33 1. TC${mathtool.add($foreach.index, 1)}: $testCase
34 #end
35 {{/velocity}}
36
37 {{todo}}
38 Add more test cases!
39 {{/todo}}
40
41 = Indexing a Single Entity vs. Multiple Entities =
42
43 Unlike a relational database, a Solr/Lucene index has a single 'table'. So normally you index a single entity type. Each row in the index represents an entity of that type. As a consequence the result of a Solr query is semantically a list of entities of that type. In our case the first choice for the entity type is (naturally) the **document**.
44
45 If you want to index more entity types (e.g. index attachments and objects **separately**, not as part of a document) then, since there is only one 'table' in the index, you need to add a 'type' column that specifies the type of entity you have on each row (e.g. type=document, type=attachment, type=object etc.). The result of a Solr query is now, semantically, a list of different entity types, unless you filter by a specific type.
46
47 Let's imagine what happens if we want to search for blog posts that have a specific tag. With the first approach, the problem is that XWiki documents can store **dynamic** structured data. We cannot hard-code the name of the class, 'Blog.BlogPostClass', or the name of the property, 'tag', in the Solr schema because they are defined by the user at runtime: users can create their own data types in XWiki. Fortunately Solr supports dynamic fields, but we still have another problem: structured data is hierarchical and any attempt to index it flat, on the same row, will suffer from some information loss. For instance, suppose you have multiple objects of the same type on the same document and you want to find the object whose properties (at least 2) satisfy some constraints. You want the matching properties to come from the same object, not from different objects on the same document, but a flat row cannot express that.
48
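The information loss described above can be reproduced outside Solr. The following Python sketch is only an illustration: the ##Test.TestExecutionClass## name, the helper functions and the exact-match comparison are assumptions made for the example, not the way XWiki or Solr actually index and analyze values.

{{code language="python"}}
def flatten(objects):
    """Flatten XObjects into one row of multivalued fields (hypothetical layout)."""
    row = {}
    for obj in objects:
        for prop, value in obj["properties"].items():
            field = f"property.{obj['class']}.{prop}"
            row.setdefault(field, []).append(value)
    return row


def matches_flat(row, constraints):
    """True if every constraint is satisfied by some value of the flat row."""
    return all(value in row.get(field, []) for field, value in constraints.items())


def matches_same_object(objects, class_name, constraints):
    """True only if a single object satisfies all the constraints."""
    return any(
        obj["class"] == class_name
        and all(obj["properties"].get(p) == v for p, v in constraints.items())
        for obj in objects
    )


# Two executions of the same test, stored as two objects on one document.
objects = [
    {"class": "Test.TestExecutionClass",
     "properties": {"tester": "mflorea", "comment": "passed"}},
    {"class": "Test.TestExecutionClass",
     "properties": {"tester": "vmassol", "comment": "failed"}},
]
row = flatten(objects)

flat_hit = matches_flat(row, {
    "property.Test.TestExecutionClass.tester": "mflorea",
    "property.Test.TestExecutionClass.comment": "failed",
})
object_hit = matches_same_object(
    objects, "Test.TestExecutionClass", {"tester": "mflorea", "comment": "failed"}
)
print(flat_hit, object_hit)
{{/code}}

The flat row reports a match although no single execution was both performed by 'mflorea' and commented 'failed', while the per-object check correctly rejects the document.
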
49 The second approach, to index multiple entities, has its own complexity because the information is spread on multiple rows:
50
51 * one row with type=document for the blog post document
52 * one row with type=object for the blog post object
53 * one row with type=object for the tag object
54
55 In a relational database when you have the information spread in multiple places (tables) you do joins. Fortunately (you would say) [[Solr supports joins>>http://wiki.apache.org/solr/Join]]. In this particular case we would have to perform 2 joins which means:
56
57 {{code language="none"}}
58 index X index X index
59 {{/code}}
60
61 where X represents the Cartesian product. The document name would be the join key. This is pretty complex even before trying to write it in [[Solr query syntax>>http://wiki.apache.org/solr/SolrQuerySyntax]].
62
63 Another option is to use [[field collapsing>>https://wiki.apache.org/solr/FieldCollapsing]] to group search results. This is somewhat similar to the GROUP BY clause in SQL. We could, for instance, group the object/attachment results based on the document that holds them. Grouping cannot replace a join though. If you filter the rows before the grouping then you cannot be sure they correspond to the same entity (document). If you filter the groups then you cannot easily access the initial rows that were used to build the group (object and attachment rows). Moreover, grouping is not performed in the query statement but using the API (the Query Manager in our case).
64
65 When we draw the line, the question becomes: is it worth indexing more entities **separately** instead of indexing just documents (with info about their objects and attachments) considering the complexity that it brings in writing Solr queries? Do we search for objects and attachments alone as separate entities often enough to justify this complexity? Can we accept the information limitations that come with indexing only documents?
66
67 = A Mixed Approach =
68
69 The best solution could be to mix the single entity approach with the multiple entities one. In order to avoid joins we have to duplicate information. This means that for each entity we have to index (and duplicate) information about other entities that are related. We must decide from the start if we care about the Solr index size or not. The good news is that we could make the indexing of each entity optional/configurable. For instance if you know that you'll be searching only for documents then you can disable the indexing of the attachments, objects and object properties as separate entities. Of course, they will be indexed on document rows, with a bit of information loss though, because we lose some entity relations when we transform the tree/hierarchical structure from XWiki to the flat structure from Solr.
70
71 Let's see the fields for each entity type.
72
73 == Shared Fields ==
74
75 Some fields need to be shared by all indexed entities. The //wiki//, //space// and //name// information is shared because each indexed entity is in our case either a document or held by a document.
76
77 |=Name|=Description
78 |id|\\
79 |type|The type of entity that is indexed. E.g. document, attachment, object, object property etc.
80 |wiki|\\
81 |space|{{warning}}Deprecated since 7.2, use the ##spaces## multivalued field instead.{{/warning}} The local space reference. For a document ##A.B.C.Page## the value of this field is ##A.B.C##. This field is analyzed and thus used for free text search.
82 |spaces|{{info}}Since 7.2{{/info}} The space names. E.g. for a document ##A.B.C.Page## the value is ['A', 'B', 'C']. This field is analyzed and thus mostly used for free text search.
83 |space_exact|We index the local space reference (e.g. ##A.B\.1.C##) verbatim for exact matching.
84 |space_facet|{{info}}Since 7.2{{/info}} We also need a dedicated field for [[hierarchical faceting>>https://wiki.apache.org/solr/HierarchicalFaceting]] on nested spaces. This field is used to implement a 'facet.prefix'-based drill down. E.g. for a document ##A.B.C.Page## this field will hold ['0/A.', '1/A.B.', '2/A.B.C.'].
85 |space_prefix|{{info}}Since 7.2{{/info}} This field is used to match descendant documents. A query such as ##space_prefix:A.B## will match the documents from space ##A.B## and all its descendants (like ##A.B.C##). This is possible because this field holds the local references of all the ancestor spaces of a document (i.e. all the prefixes of the space reference). E.g. for a document ##A.B.C.Page## this field will hold ['A', 'A.B', 'A.B.C']. As a consequence, searching for ##space_prefix:A.B## will match ##A.B.C.Page##. NOTE: We don't use the ##PathHierarchyTokenizer## because it doesn't support specifying an escaping character. We compute the values ourselves at index time as a workaround.
86 |name|The document name. This field is analyzed and thus mostly used for free text search.
87 |name_exact|We also need to store the document name verbatim for faceting (exact matching). This facet is useful for attachments and objects for instance.
88 |locale|\\
89 |locales|The list of locales covered by this entry. Dynamically determined from the list of enabled locales and the various locales of the document.
90 |language|Contains only the language part of the locale
91 |hidden|Whether the entity is hidden or not. Only documents can be made hidden explicitly. The attachments, objects and object properties are hidden if the document that holds them is hidden.
92
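The values of the space-related fields in the table above can be derived from the local space reference. The following Python sketch shows one way to compute them at index time. The function names are hypothetical, and the way escaped dots are carried over into the ##space_prefix## and ##space_facet## values is an assumption; only the unescaped ##A.B.C## examples come from the table.

{{code language="python"}}
def split_space_reference(reference):
    """Split a local space reference on unescaped dots, e.g. 'A.B\\.1.C'."""
    names, current, i = [], [], 0
    while i < len(reference):
        char = reference[i]
        if char == "\\" and i + 1 < len(reference):
            # A backslash escapes the next character (typically a dot).
            current.append(reference[i + 1])
            i += 2
        elif char == ".":
            names.append("".join(current))
            current = []
            i += 1
        else:
            current.append(char)
            i += 1
    names.append("".join(current))
    return names


def escape(name):
    """Escape backslashes and dots so that a name can be joined with '.'."""
    return name.replace("\\", "\\\\").replace(".", "\\.")


def space_fields(reference):
    """Compute the space-related field values for one document (sketch)."""
    names = split_space_reference(reference)
    prefixes = [
        ".".join(escape(name) for name in names[: depth + 1])
        for depth in range(len(names))
    ]
    return {
        "spaces": names,
        "space_exact": reference,
        "space_prefix": prefixes,
        "space_facet": [f"{depth}/{prefix}." for depth, prefix in enumerate(prefixes)],
    }


fields = space_fields("A.B.C")
print(fields["space_prefix"])
print(fields["space_facet"])
print(space_fields(r"A.B\.1.C")["spaces"])
{{/code}}

Computing the prefixes by hand, rather than with a tokenizer, is what makes it possible to honor the escaped dot in a space name such as ##B\.1##.
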
93 == Document Fields ==
94
95 First of all we need to index the document title, content and meta data.
96
97 |=Name|=Description
98 |fullname|{{error}}I don't think we need this field, do we?{{/error}}
99 |title_*|The localized title, indexed based on the document locale. E.g. title_ro
100 |title_sort|We need a dedicated field for sort because analyzed fields cannot be used for sort.
101 |doccontent_*|The rendered document content (transformations are not executed). E.g. doccontent_pt_BR . NOTE: The reason we added the 'doc' prefix instead of keeping just 'content' is because we wanted to be able to use a different boost value for the document content than for the object (##objcontent##) and the attachment content (##attcontent##, see the 'qf' parameter in ##solrconfig.xml##).
102 |doccontentraw_*|\\
103 |version|We need to index the document version (revision, e.g. '2.4') to be able to detect when the index is not up to date (not in sync with the database). This check is performed at XWiki startup for instance (see IndexerJob#addMissing).
104 |comment_*|The localized version summary. A brief description of the changes made in the latest version. E.g. comment_en
105 |doclocale|Contains the technical locale of the document (i.e. empty for default entry)
106 |author|The last author. This field is used for faceting (exact matching).
107 |author_display|The last author, this time analyzed and thus used for free text search.
108 |author_display_sort|\\
109 |creator|The document creator, stored verbatim for faceting (exact matching)
110 |creator_display|The document creator, this time analyzed and thus used for free text search.
111 |date|\\
112 |creationDate|\\
113
114 Then, in order to avoid joins, we need to index the objects. We try to make the structured data flat using dynamic fields.
115
116 |=Name|=Description
117 |class/object|The type of objects stored by this document. E.g. [Blog.BlogPostClass, XWiki.TagClass, ..]
118 |objcontent_*|This field collects the values from all the properties of all the objects found on the indexed document. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
119 |object.aSpace.aClass_*|Dynamic multiValued field indexing the entire content of the objects of the specified type. All values are indexed as localized text, using the document locale. E.g. object.XWiki.TagClass_fr
120 |property.aSpace.aClass.aPropertyName_*|Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.Blog.BlogPostClass.published_boolean, property.Blog.BlogPostClass.publishDate_date, property.Blog.BlogPostClass.category_string, property.Blog.BlogPostClass.summary_en
121 |property.aSpace.aClass.aPropertyName_sort*|Dedicated field for sorting on property values. We need this because Solr doesn't support sorting on multiValued fields. E.g. property.Blog.BlogPostClass.publishDate_sortDate
122
123 Notice that we index the property name only in the ##objcontent## field (mixed with the property value). We don't have a dedicated field for this, i.e. the object property names appear only on the field names and not on the field values. Do we need to index the property names? We index the class names because we want to filter documents of a given type. Is there a real use case when we need to find documents that have objects with a given property?
124
125 Non-string XObject properties should be indexed based on their type. This means we'll be able to write type-specific constraints in Solr queries (e.g. ranges) for Boolean, Number (int, long, float, double) and Date properties:
126
127 {{code language="none"}}
128 property.Blog.BlogPostClass.publishDate:[NOW-1MONTH TO NOW]
129 {{/code}}
130
131 We can achieve this by suffixing the field name with the type name: ##property.Blog.BlogPostClass.publishDate_date##. But in order to use just the field name in the Solr query we need [[Dynamic Field Aliases>>#HDynamicFieldAliases]].
132
133 Note that only the String and the TextArea properties should be indexed as localized text (depending on the document locale). For the rest of the string-based properties (Access Right, List of Users, List of Groups, DBList, etc.) we should use the "string" Solr field type to index/store the property value verbatim in order to be able to perform exact matches on these properties. For StaticList, we need to index the raw value (what is saved in the database) as string (so verbatim), and the display value (what is specified in the XClass) as localized text (so analysed).
134
135 A problem with dynamic fields is that we can get invalid field names. The documentation says:
136
137 > field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. Names with both leading and trailing underscores (e.g. _version_) are reserved.
138
139 The class name will surely have a dot and it can also contain other invalid characters (if it's not a standard XWiki class).
140
141 Another problem is that Solr doesn't support dynamic fields as default fields, i.e. as fields that are matched when you search for free text (without field:value in the query). This is not a problem for the search results, as dynamic fields like ##object.*## and ##property.*## are copied and aggregated in ##objcontent##, which is a default field. The issue is that we can't know exactly which XClass property was matched; we just know that the searched free text was found inside an object.
142
143 When searching for documents we should also take the attachments into account.
144
145 |=Name|=Description
146 |filename|The name of files attached to this document. E.g. ['todo.txt', 'image.png']
147 |filename_exact|We also need to store the file names verbatim (without analyzing them) for exact/prefix matching.
148 |mimetype|The list of attachment media types. E.g. ['text/plain', 'image/png']
149 |attauthor|The absolute references of the users that uploaded the last version of each of the document attachments. Used for faceting (exact matching). E.g. ['math:XWiki.mflorea', 'gang:XWiki.vmassol', 'xwiki:XWiki.evalica']
150 |attauthor_display|Same as ##attauthor## but indexes the real user name instead of the reference (alias) and it is used for free text search. E.g. ['Thomas Mortagne', 'Florea Marius Dumitru']
151 |attdate|The dates when the attachments have been uploaded (their last version).
152 |attcontent_*|The content of each attachment, indexed based on the document locale. E.g. attcontent_en : ['content of first attachment', 'content of second attachment']
153 |attsize|The size of each attachment in bytes.
154
155 All the attachment fields listed above are multivalued. The problem with this solution is that the relation between the fields of the same attachment is kept only in the form of the value index (e.g. the 3rd attachment size corresponds to the 3rd attachment name), which can't be used in queries. In other words, we won't be able to query for documents that have a text/plain file which contains a given word. We will only be able to query for documents that have a text/plain file and a file (not necessarily the same one!) which contains the given word.
156
157 Another problem is that Solr / Lucene doesn't tell us the index of the value that has been matched from a multivalued field like ##attcontent## so we won't know which attachment has been matched (e.g. if Solr would tell us that the 2nd value from ##attcontent## is matched then we would know the 2nd attachment is matched).
158
159 Other solutions for indexing the attachments inside the document rows are:
160
161 * use a dynamic field, e.g. attachment.image.png_*, but we'll hit invalid field names immediately because the file name can contain almost any character
162 * aggregate all the information from each attachment in a static multivalued field (e.g. attachment: ['data of 1st attachment', 'data of 2nd attachment', ..])
163 * aggregate the information from all attachments in a static single valued field
164
165 None of these solutions fixes the problems mentioned above. If the relation between the attachment fields is important (i.e. we're looking for an attachment that must have 2 or more fields matching some criteria), we can instead search for attachments only (##type:ATTACHMENT##).
166
167 == Object Fields ==
168
169 |=Name|=Description
170 |class|The object type. E.g. ##Blog.BlogPostClass##
171 |number|The object number, identifies an object when there are multiple objects of the same type on a document.
172 |objcontent_*|This field collects the values from all the properties of the indexed object. It uses the "propertyName : propertyValue" format. This field is analyzed based on the document locale. E.g. objcontent_ro
173 |property.aName_*|Dynamic multiValued field indexing the value of the specified property. For static lists, we index both the raw value (what is saved in the database) and the display value (what the user sees, which is specified in the XClass). Property values are indexed based on their type. E.g. property.published_boolean, property.publishDate_date, property.category_string, property.summary_en
174
175 == Attachment Fields ==
176
177 |=Name|=Description
178 |filename|The attachment file name. E.g. ['todo.txt']
179 |filename_exact|We also need to store the attachment file name verbatim (without analyzing it) for exact/prefix matching.
180 |filename_sort|The attachment file name used for sorting
181 |mimetype|The attachment media type. E.g. ['text/plain']
182 |attversion|We need to index the attachment version (revision) to be able to detect when the Solr index is out of date (not in sync with the database). E.g. 1.2
183 |attauthor|The absolute reference of the user that uploaded the last version of the attachment. Used for faceting (exact matching). E.g. ['gang:XWiki.vmassol']
184 |attauthor_display|The real name of the user that uploaded the last version of the attachment. Used for free text search. E.g. ['Ecaterina Moraru']
185 |attauthor_display_sort|Same as ##attauthor_display## but used for sorting (single valued).
186 |attdate|The date when the last version of the attachment was uploaded.
187 |attdate_sort|We need a dedicated field for sort because the corresponding field is multiValued (##attdate## is reused on document rows, see above, and a document can have multiple attachments) and Solr doesn't support sorting on multiValued fields
188 |attcontent_*|The content of the last version of the attachment, indexed based on the document locale. E.g. attcontent_en
189 |attsize|The size, in bytes, of the last version of the attachment
190 |attsize_sort|Needed for sort because ##attsize## is multiValued. See ##attdate_sort##.
191
192 == Encoding Dynamic Field Names ==
193
194 We need to support special characters in dynamic field names. One solution is to use an encoding scheme similar to URL-encoding. We cannot use URL-encoding directly because '+' (plus) and '%' (percent) have special meaning in the Solr query syntax. Also, we don't want to encode Unicode letters.
195
196 {{code language="none"}}
197 E.g. "Somé Spâce.Bob's Claß" would be encoded as "Somé$20Spâce.Bob$27s$20Claß"
198 {{/code}}
199
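A minimal Python sketch of such an encoding is given below. It reproduces the example above, but the details are assumptions: the function names are hypothetical, letters, digits, '_' and '.' are left untouched, and every other character is written as '$' followed by the two-digit hexadecimal value of each of its UTF-8 bytes.

{{code language="python"}}
import re


def encode_field_name(name):
    """Encode everything except letters, digits, '_' and '.' as $XX (sketch)."""
    encoded = []
    for char in name:
        if char.isalnum() or char in "._":
            # Unicode letters and digits are kept as they are.
            encoded.append(char)
        else:
            # One $XX group per UTF-8 byte, similar to URL-encoding.
            encoded.extend(f"${byte:02X}" for byte in char.encode("utf-8"))
    return "".join(encoded)


def decode_field_name(encoded):
    """Reverse encode_field_name by decoding each run of $XX groups."""
    def decode_run(match):
        return bytes.fromhex(match.group(0).replace("$", "")).decode("utf-8")

    return re.sub(r"(?:\$[0-9A-F]{2})+", decode_run, encoded)


print(encode_field_name("Somé Spâce.Bob's Claß"))
{{/code}}

Because '$' itself is not a letter or digit, it is always encoded (as ##$24##), so under these assumptions the encoding can be reversed without ambiguity.
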
200 Also, it would be nice to be able to extract the class and property reference from a field name in order to display the location where the search text has been found. We can't use the default class / property reference serialization syntax because '\' and '^' have special meaning in the Solr query syntax. One solution is to implement a simple serialization syntax that uses only '.' as entity separator and escapes the dot by repeating it.
201
202 {{code language="none"}}
203 E.g. "wiki:Some\.Space.My\.Class^color" would be serialized as "wiki.Some..Space.My..Class.color"
204 {{/code}}
205
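The dot-doubling syntax can be sketched in a few lines of Python. This is an illustration under assumptions: the reference is passed as a list of already unescaped entity names (wiki, space, class, property), and the function names are hypothetical.

{{code language="python"}}
def serialize_reference(names):
    """Join entity names with '.', escaping dots inside a name by doubling them."""
    return ".".join(name.replace(".", "..") for name in names)


def parse_reference(serialized):
    """Split on single dots, reading '..' as one literal dot (left to right)."""
    names, current, i = [], [], 0
    while i < len(serialized):
        if serialized[i] == ".":
            if i + 1 < len(serialized) and serialized[i + 1] == ".":
                # Doubled dot: a literal dot that belongs to the current name.
                current.append(".")
                i += 2
            else:
                # Single dot: entity separator.
                names.append("".join(current))
                current = []
                i += 1
        else:
            current.append(serialized[i])
            i += 1
    names.append("".join(current))
    return names


names = ["wiki", "Some.Space", "My.Class", "color"]
serialized = serialize_reference(names)
print(serialized)
print(parse_reference(serialized))
{{/code}}

Note that this left-to-right parsing cannot distinguish a name that ends with a dot from a following name that starts with one (both produce three consecutive dots), so such names would need special handling.
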
206 == Dynamic Field Aliases ==
207
208 We have a few dynamic fields, such as ##object.*## and ##property.*##, that are multilingual fields so they are indexed in multiple languages. We need support for dynamic aliases (for dynamic fields) so that we can write:
209
210 {{code language="none"}}
211 object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text AND object.XWiki.TagClass:news
212 {{/code}}
213
214 and it will be expanded into
215
216 {{code language="none"}}
217 object:Blog.BlogPostClass AND
218 (property.Blog.BlogPostClass.title_en:text OR property.Blog.BlogPostClass.title_fr:text OR ...) AND
219 (object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR ...)
220 {{/code}}
221
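The expansion above can be approximated with a simple rewrite of the query string. The Python sketch below is not the actual implementation: the regular expression, the function name and the list of locales are assumptions, and it only handles plain ##field:value## terms (no phrases, ranges or typed suffixes).

{{code language="python"}}
import re

# Matches 'object.<class>:value' and 'property.<class>.<name>:value' terms.
# The plain 'object:<class>' filter has no dot after 'object' and is left alone.
ALIAS_TERM = re.compile(r"((?:object|property)\.[^\s:()]+):([^\s()]+)")


def expand_aliases(query, locales):
    """Replace each aliased term with an OR over the localized fields."""
    def expand(match):
        field, value = match.group(1), match.group(2)
        terms = [f"{field}_{locale}:{value}" for locale in locales]
        return "(" + " OR ".join(terms) + ")"

    return ALIAS_TERM.sub(expand, query)


query = (
    "object:Blog.BlogPostClass AND "
    "property.Blog.BlogPostClass.title:text AND "
    "object.XWiki.TagClass:news"
)
print(expand_aliases(query, ["en", "fr"]))
{{/code}}

A real solution would more likely hook into the Solr query parser than rewrite the query string, since a regular expression cannot reliably handle quoted phrases or nested expressions.
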
222 == Faceting on Object Properties ==
223
224 We need to be able to add facets on an XObject property using the [[Query Module>>extensions:Extension.Query Module]] API:
225
226 {{code language="none"}}
227 #set ($discard = $query.bindValue('facet.field', ['someOtherField', 'property.Test.TestClass.staticList1_string']))
228 {{/code}}
229
230 The 'string' suffix means the property was indexed/stored verbatim (without being analysed). Read above to understand why we suffix the field name with the data type. The facet can be triggered with this query:
231
232 {{code language="none"}}
233 object:Test.TestClass
234 {{/code}}
235
236 == Sorting on Object Properties ==
237
238 We should also be able to sort the document search results based on a property value using the [[Query Module>>extensions:Extension.Query Module]] API:
239
240 {{code language="none"}}
241 #set ($discard = $query.bindValue('sort', 'property.Test.TestClass.staticList1_sortString asc'))
242 {{/code}}
243
244 The 'sortString' suffix is the dynamic type that is used for sorting. Other types are 'sortBoolean', 'sortInt', 'sortLong', 'sortDouble', 'sortFloat' and 'sortDate'. Note that Solr doesn't support sorting on multivalued fields. The [[documentation>>http://wiki.apache.org/solr/CommonQueryParameters#sort]] says:
245
246 > Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer).
247
248 If you try to sort on a multivalued field you'll get:
249
250 {{code language="none"}}
251 Caused by: org.apache.solr.common.SolrException: can not sort on multivalued field: property.Test.TestClass.staticList1_string
252 at org.apache.solr.schema.SchemaField.checkSortability(SchemaField.java:155)
253 {{/code}}
254
255 That's why we need dedicated 'sortXXX' fields that are single valued. The consequence is that only the last value of a property is used for sorting (you can have multiple values either because the property supports multiple selection or because there are multiple objects of the same type on the indexed document). Note that XObject properties are indexed using multivalued dynamic fields (we cannot know beforehand what properties a user-defined XClass will have and if a property supports multiple selection or if a document can have multiple objects of a given type).
256
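The rule that only the last value is kept for sorting can be sketched as follows. This Python illustration is built on assumptions: the function name is hypothetical, and so is the mapping from Python types to the 'sortXXX' suffixes (for example, Python does not distinguish int from long, so ##sortLong## is used for all integers).

{{code language="python"}}
from datetime import datetime

# Checked in order: bool must come before int because bool is a subclass of int.
SORT_SUFFIXES = [
    (bool, "sortBoolean"),
    (int, "sortLong"),
    (float, "sortDouble"),
    (datetime, "sortDate"),
    (str, "sortString"),
]


def sort_field(class_name, property_name, values):
    """Return the (field name, value) pair of the single valued sort field.

    'values' holds all the values indexed for the property on one document,
    possibly coming from several objects. Only the last one is kept.
    """
    if not values:
        return None
    last = values[-1]
    for value_type, suffix in SORT_SUFFIXES:
        if isinstance(last, value_type):
            return (f"property.{class_name}.{property_name}_{suffix}", last)
    raise TypeError(f"Unsupported property value type: {type(last).__name__}")


print(sort_field("Test.TestClass", "staticList1", ["b", "a"]))
{{/code}}

With this rule a document whose property holds both 'b' and 'a' is sorted as if it only had 'a', which is the price paid for keeping the sort field single valued.
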
257 Another option for sorting on fields that have multiple values could be to use a [[function>>http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function]] but I can't find one that returns a single value from a multiValued field.
258
259 {{code language="none"}}
260 #set ($discard = $query.bindValue('sort', 'aFunctionThatSelectsOneValue(property.Test.TestClass.staticList1_string) asc'))
261 {{/code}}
262
263 == Test Results ==
264
265 {{velocity}}
266 #set ($testResults = [
267 "object.Blog.BlogPostClass:wiki AND property.XWiki.TagClass.tags:CSS
268 For example:
269 (posts contain the word ‘FOSDEM’ with no tags)
270 http://www.xwiki.org/xwiki/rest/wikis/query?q=object.Blog.BlogPostClass:FOSDEM
271 ",
272 "object.Blog.BlogPostClass:john AND property.Blog.BlogPostClass.published:1 AND property.Blog.BlogPostClass.category:(News OR XWiki)
273 For example:
274 (posts contain the word ‘wiki’, have been published and are of the category ‘News’ or ‘XWiki’)
275 http://www.xwiki.org/xwiki/rest/wikis/query?q=object.Blog.BlogPostClass:wiki%20AND%20property.Blog.BlogPostClass.published:1%20AND%20property.Blog.BlogPostClass.category:(News%20OR%20XWiki)
276 ",
277 "john AND -object:XWiki.XWikiUsers
278 For example:
279 http://www.xwiki.org/xwiki/rest/wikis/query?q=john%20AND%20-object:XWiki.XWikiUsers
280 ",
281 "filter/sort by property.aSpace.aClass.aPropertyName",
282 "wiki AND type:ATTACHMENT AND object:Blog.BlogPostClass
283 For example:
284 http://www.xwiki.org/xwiki/rest/wikis/query?q=object:Blog.BlogPostClass%20and%20fosdem%20and%20
285 "
286 ])
287 #foreach ($testResult in $testResults)
288 1. TC${mathtool.add($foreach.index, 1)}: $testCases.get($foreach.index)(((
289 {{code language="none"}}
290 $testResult
291 {{/code}}
292 )))
293 #end
294 {{/velocity}}
