{{error}}
This has now been implemented as part of the [[SOLR implementation>>Proposal.SearchSolr]] since SOLR supports both embedded and distributed instances.
{{/error}}

= Abstract =

With the growth of information, search engines play a pivotal role in finding the appropriate information and thereby reducing human effort. For various reasons, data can be distributed over several machines. The moment data is distributed over several machines, a modified version of the search engine is required that can search the data on all of them; this modified version is known as a "distributed search engine".
When it comes to enterprise wikis like XWiki, an organization can have multiple installations, each dedicated to a specific department or role, so making distributed search a feature would be quite helpful.

Distributed search can be achieved in two ways:

1. Each server maintains its own index and the main server searches all indexes remotely.
1. The main server maintains a global index of all its children servers and searches on that index.

In both methods, only the main server has the authority to access the other servers' indexes (of course, a child server can still access its own index).
Both methods have their own pros and cons, and their behavior changes with the amount of data being indexed.
In this project I will implement both methods without disturbing the current state of the art of XWiki. XWiki currently uses Lucene to index and search documents. Using XWiki's existing Lucene-based search architecture, distributed searching can be achieved with Java RMI (Remote Method Invocation) and Lucene APIs such as ParallelMultiSearcher, MultiSearcher and RemoteSearchable to search remote indexes.

Once both methods are implemented, I will evaluate them with JUnit tests in terms of response time.

This project also includes customizing search by document type, i.e. the search process is restricted to user-specified document types.

= Project Description =

The main intent of this project is to support multiple installations of XWiki and to provide an interface to search the data on all servers. Typically, organizations have multiple XWiki installations, each dedicated to a department or role; searching each server individually takes a long time, and users are not that patient. Our distributed search searches all the servers concurrently.

Distributed search can be achieved in two ways:

1. Each server maintains its own index and the main server searches all indexes remotely.
1. The main server maintains a global index of all its children servers and searches on that index.

In both methods, only the main server has the authority to access the other servers' indexes. Both methods have their own pros and cons, and their behavior depends on the amount of data being indexed. The modified version of the search engine should also support customized search depending on document type.

== A tentative project architecture or an approach to it ==

* I want to do this project without disturbing the current architecture of XWiki.
* XWiki uses Lucene to index and search documents. I will be working on the two methods proposed. My tentative approach is written below; it may change after some discussions with my mentor.

1. When indexes are distributed: the main server searches the remote indexes of the children servers using Lucene classes such as MultiSearcher, ParallelMultiSearcher and RemoteSearchable. Java RMI (Remote Method Invocation) will be used to search the indexes remotely.
1. When a global index needs to be maintained, the children servers notify the main server about documents being updated or created. The main server collects the content of each updated or new document. Since there is no update function in Lucene, the previous entry for the document is deleted from the index and a new entry for that document is added. RMI will be used for communication between the children servers and the main server.
1. In both cases, a search call on a child server will call the main server's search.
1. To restrict the search to user-specified document types, an extra field is added to each document's index entry to store its file type. A boolean query can be used to restrict the search process to certain document types.

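The type-restriction idea in the last point can be sketched in plain Java. This is only a simulation of the indexing side of the idea (an extra "type" field per document, AND-ed into the query), not actual Lucene code; the document names and types are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TypeFilteredSearch {
    // Each indexed document carries an extra "type" field, as proposed above.
    record Doc(String name, String type, String content) {}

    // Restrict matches to user-specified document types, analogous to
    // AND-ing a type clause into a Lucene BooleanQuery.
    static List<String> search(List<Doc> index, String term, Set<String> types) {
        List<String> hits = new ArrayList<>();
        for (Doc d : index) {
            if (types.contains(d.type()) && d.content().contains(term)) {
                hits.add(d.name());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = List.of(
            new Doc("Main.WebHome", "wiki", "xwiki search"),
            new Doc("report.pdf", "pdf", "xwiki search"),
            new Doc("notes.doc", "msword", "unrelated"));
        // Search for "search" but only in PDF documents.
        System.out.println(search(index, "search", Set.of("pdf")));
    }
}
```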
= An expected project plan and timeline with milestones =

This is my tentative timeline; I plan to work more or less according to it. It might change after discussions with my mentor.

Till May 26th ~-~- will pick a few issues from JIRA and create patches for them
Week 1: Finalize the plan after thorough discussions with the mentor
Week 2-4: Implementation of Method I (when indexes are distributed)
Week 5: Break
Week 6: Testing and evaluation of Method I in terms of response time, possibly under different constraints and conditions
Mid-term evaluation: Submit the code of Method I along with the evaluation results
Week 7-9: Implementation of Method II (main server maintaining the global index)
Week 10: Testing and evaluation of Method II using the same metrics as mentioned above
Week 11-12: Implementation of customized search depending on document types; code documentation and user documentation
Week 13-14: Backup weeks, to be used for further improvements
Final evaluation: Complete code, evaluation results, code and user documentation

= Implementation details =

Distributed search can be achieved in many ways, of which we are trying two to find out which suits XWiki better.
Method I: each installation maintains its own index and the master node searches all these indexes through remote calls.

Method II: the master node maintains a global index of all the installations and searches on that local global index. The data (information) required to build and maintain the global index is collected from each individual installation through remote calls.
This is how both methods are being implemented.

== Method I ==

Currently, [[Lucene>>http://lucene.apache.org]] APIs are used in XWiki to achieve text-based search. Lucene has some default support for searching remote indexes: APIs like RemoteSearchable, ParallelMultiSearcher and MultiSearcher help us search an index remotely.

Searching remote indexes is nothing but distributed search over multiple machines. The above APIs help us achieve a simple distributed search, but there are a few more things to be taken care of.

In order to make sure that the current XWiki architecture is not disturbed while implementing distributed search, the existing Lucene plugin has been extended. It has been modified in such a way that distributed search can be enabled or disabled in the xwiki.cfg file.

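As a rough illustration of the RMI transport involved, here is a self-contained, single-process sketch using only the JDK's java.rmi. The RemoteIndex interface, the SlaveIndex class and the binding name "luceneIndex" are hypothetical and are not the plugin's actual classes; a real deployment would run the registry on each slave's configured port:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

public class RemoteSearchSketch {
    // Hypothetical remote interface a slave's index could expose over RMI.
    public interface RemoteIndex extends Remote {
        String search(String query) throws RemoteException;
    }

    static class SlaveIndex implements RemoteIndex {
        public String search(String query) { return "hits-for:" + query; }
    }

    public static void main(String[] args) throws Exception {
        // In a real deployment each slave runs its own registry on the port
        // from xwiki.plugins.lucene.port; here everything is in one process.
        SlaveIndex impl = new SlaveIndex();
        Registry registry = LocateRegistry.createRegistry(5388);
        RemoteIndex stub = (RemoteIndex) UnicastRemoteObject.exportObject(impl, 0);
        registry.rebind("luceneIndex", stub);

        // The master node looks up the stub and invokes search() remotely.
        RemoteIndex remote = (RemoteIndex)
            LocateRegistry.getRegistry("localhost", 5388).lookup("luceneIndex");
        System.out.println(remote.search("xwiki"));

        // Unexport both remote objects so the JVM can exit cleanly.
        UnicastRemoteObject.unexportObject(impl, true);
        UnicastRemoteObject.unexportObject(registry, true);
    }
}
```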
**In short, what the new code does:**

1. As soon as XWiki is started, the Lucene plugin is initialized; during initialization, each machine makes its index remotely available.

2. During the initialization of the Lucene plugin on the master node (machine), a parallel searcher is created over all the remotely available indexes. These two steps achieve a simple distributed search, but index updates would not be reflected in the search process.

3. Whenever there is an index update, the index reader that was made remotely available in step 1 is closed and opened again, and the newly opened index reader is made remotely available once more. With this, index updates on remote machines are reflected in the search process. But what if one of the machines goes down during the process?

4. Whenever there is an error while performing a distributed search, the parallel searcher from step 2 is redefined, i.e. instead of searching 'n' machines (as earlier) it will search only 'n-m' machines. At some point the machines that went down will be started again. How do we search on these machines again?

5. A thread keeps checking whether the searcher is set to search over all the slaves defined in xwiki.cfg. If it is not searching on all the slaves, it tries to redefine the searcher from step 2, as done in step 4.

6. I have modified the existing XWiki.Results to make it compatible with both distributed and normal search.

7. Modified xwiki.cfg.vm to add a few options required for distributed search.

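Steps 4 and 5 above can be sketched as a small plain-Java simulation. The slave addresses are made up, and a real implementation would reopen the remote searchables rather than just tracking address strings:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SearcherWatchdog {
    // Slaves configured in xwiki.cfg (hypothetical addresses).
    static final List<String> CONFIGURED = List.of("10.0.0.1:5388", "10.0.0.2:5388");
    // Slaves the parallel searcher is currently defined over.
    static Set<String> active = new LinkedHashSet<>(CONFIGURED);

    // Step 4: on a search failure, redefine the searcher without the dead slave.
    static void onSearchError(String deadSlave) {
        active.remove(deadSlave);
    }

    // Step 5: the watchdog compares the active set against the configured list
    // and tries to re-add any slave that has come back (here it always succeeds).
    static void watchdogPass() {
        for (String slave : CONFIGURED) {
            if (!active.contains(slave)) {
                active.add(slave); // in reality: reopen the remote searchable first
            }
        }
    }

    public static void main(String[] args) {
        onSearchError("10.0.0.2:5388");
        System.out.println(active); // searcher now spans n-1 slaves
        watchdogPass();
        System.out.println(active); // full set restored
    }
}
```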
The modified version of xwiki.cfg is, for information, listed below:

* Whether **distributed search** should be **enabled or not** (true/false).

{{code}}
xwiki.plugins.lucene.distributedSearch=true
{{/code}}

* **Hostname (IP)** of the machine. This information could have been collected implicitly using some Java code, but sometimes a machine has more than one network interface and the user may want to use only one of them, so it is specified explicitly.

{{code}}
xwiki.plugins.lucene.hostname=172.16.17.11
{{/code}}

* **Port** on which the index should be made remotely available.

{{code}}
xwiki.plugins.lucene.port=5388
{{/code}}

* **IP of the master node.** (Even though this information is not used in the current method, it will be used in the next one.)

{{code}}
xwiki.plugins.lucene.master=172.16.17.11
{{/code}}

* **List of slaves**: the IP of each slave along with the port on which its index is remotely available. This information is used by the master to search the remotely available indexes.

{{code}}
xwiki.plugins.lucene.slavelist=172.16.17.69:5288,172.16.45.11:5388,172.16.17.164:5388,172.16.17.110:5338,172.16.17.169:8388
{{/code}}

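Since xwiki.cfg is a standard Java properties file, the plugin can read these options with java.util.Properties. A minimal sketch, using the property values from the examples above:

```java
import java.io.StringReader;
import java.util.List;
import java.util.Properties;

public class LuceneConfig {
    public static void main(String[] args) throws Exception {
        // The xwiki.cfg entries shown above, loaded as standard Java properties.
        String cfg = String.join("\n",
            "xwiki.plugins.lucene.distributedSearch=true",
            "xwiki.plugins.lucene.master=172.16.17.11",
            "xwiki.plugins.lucene.slavelist=172.16.17.69:5288,172.16.45.11:5388");

        Properties p = new Properties();
        p.load(new StringReader(cfg));

        boolean enabled = Boolean.parseBoolean(
            p.getProperty("xwiki.plugins.lucene.distributedSearch"));
        // The slave list is a comma-separated set of host:port pairs.
        List<String> slaves = List.of(
            p.getProperty("xwiki.plugins.lucene.slavelist").split(","));

        System.out.println(enabled);
        System.out.println(slaves);
    }
}
```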
**Code for Method I has been committed here: [[https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearch/]]. Read the README file before starting to test it.**

== Method II ==

In the first method the master node searches the remotely available indexes, whereas in this method the master node is expected to maintain a global index of all the installations and search on that global index.

Once again, Lucene APIs like RemoteSearchable are used to achieve this task.

**In short, what the new code does:**

**On slave installations:**

1. Each installation maintains its index, but with a few changes:

* A new field is added to each Lucene document to store the time stamp of when the document was indexed.
* Earlier, the content of a document was only indexed but not stored; now the content is also stored.

2. The index built is made remotely available for the master node to access.

**On the master node:**

Just as each installation maintains its local index, the master node maintains a global index. In addition to the global index, it also maintains a small index that holds information about the slaves.

The index created to store information about the slaves is very small, typically on the order of kilobytes, so it is stored in the Jetty work space by default.

3. Maintaining information about slaves:

* The master node keeps track of the time stamp of the last document it has indexed. If no documents have been indexed yet, the time stamp of the last indexed document is assumed to be 0.
* This index is made remotely available; it is used while updating the global index and by the slaves to update their local indexes.

4. The master node opens a connection to each slave index (in **//sequential//** order) and fires a **range query**.

The range query is formed in such a way that it asks for the recently indexed documents, i.e. the time stamp of the last indexed document is retrieved from the index created and maintained in step 3, and documents with a newer time stamp are asked to be returned.

The documents returned are used to update the global index.

Then the time stamp of the last indexed document is updated in the index of step 3.

**Once again, on slave installations:**

It is not ideal to keep maintaining a local index while also maintaining a global index. The only reason to maintain the local index in step 1 is to help update the global index.

5. During the process of updating its local index, a slave node connects to the master node and gets the time stamp of the last indexed document.

It then deletes the documents that were indexed on or before the retrieved time stamp.

Now the local index is also clean and we have only one global index.

**NOTE:** Whenever there is an update on the wiki, first the local index gets updated, then the global index is updated, and then the local index is cleaned.

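The timestamp-driven flow of steps 4-5 and the NOTE above can be simulated with plain Java maps. This is a sketch of the bookkeeping only; real code would run Lucene range queries and deletes over the actual indexes, and the document name and time stamps are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GlobalIndexSync {
    // Slave-side local index: document name -> time stamp when it was indexed.
    static Map<String, Long> localIndex = new LinkedHashMap<>();
    // Master-side global index, plus the time stamp of the last pulled document
    // (kept in the small "slaves info" index in the real design).
    static Map<String, Long> globalIndex = new LinkedHashMap<>();
    static long lastIndexed = 0; // 0 when nothing has been indexed yet

    // Step 4: the master's range query pulls documents newer than lastIndexed.
    static void masterPull() {
        for (Map.Entry<String, Long> e : localIndex.entrySet()) {
            if (e.getValue() > lastIndexed) {
                globalIndex.put(e.getKey(), e.getValue());
            }
        }
        lastIndexed = globalIndex.values().stream()
                                 .max(Long::compare).orElse(lastIndexed);
    }

    // Step 5: the slave cleans entries indexed on or before the master's stamp.
    static void slaveClean() {
        localIndex.values().removeIf(ts -> ts <= lastIndexed);
    }

    public static void main(String[] args) {
        localIndex.put("Main.WebHome", 100L); // local index is updated first...
        masterPull();                         // ...then the global index...
        slaveClean();                         // ...then the local index is cleaned
        System.out.println(globalIndex.keySet());
        System.out.println(localIndex.isEmpty());
    }
}
```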
6. The global index can be rebuilt simply by rebuilding the individual local indexes.

7. Once again, xwiki.cfg.vm is modified to add two more options:

* If the current installation is the master node, where the global index should be stored.

{{code}}
xwiki.plugins.lucene.globalindexdir=/usr/local/xwiki/global_lucene
{{/code}}

* Port on which the slaves information (step 3) should be/is made available.

{{code}}
xwiki.plugins.lucene.slavesinfoport = 1100
{{/code}}

**Code for Method II has been committed here: [[https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearchMethod2]]. Read the README file before starting to test it.**
