Last modified by Vincent Massol on 2024/02/26 17:52

Show last authors
1 {{error}}
2 This has now been implemented as part of the [[SOLR implementation>>Proposal.SearchSolr]] since SOLR supports both embedded and distributed instances.
3 {{/error}}
4
5 = Abstract =
6
7 With the growth of information, the role played by search engines is quite pivotal in finding the appropriate information and there by decreasing the human efforts. Due to several reasons the data can be distributed over several machines. The moment data is distributed over several machines a modified version of search engine is required which can search the data on all the machines and the modified version is known as "distributed search engine".
8 When it comes to enterprise wiki's like XWiki, an organization can have multiple installation with each installation dedicated to specific department/role, therefore making distributed search a feature of it would be quite helpful.
9
10 Distributed search can be achieved in two ways, 1.Each server maintains it's own index and main server searches all indexes remotely. 1.The main server maintains the global index of all its children servers and searches on that index.
11 In both methods only main server gets authority to access other servers indexes(Of course children server can access it's own index).
12 \ Both methods 1 and 2 have their own pros and cons, and their behavior changes with the amount of data being indexed.
13 In this project I will be implementing both the methods mentioned above, with out disturbing the current state of the art of XWiki. XWiki currently uses "Lucene" to index and search the documents. Using the existing search engine(lucene) architecture of XWiki, distributed searching can be achieved using java RMI(Remote Method Invocation) technology and some lucene api's like ParallelMultiSearcher,MultiSearcher and RemoteSearchable to search remote indexes.
14
15 Once both the methods mentioned above are implemented, I will be evaluating them using JUint testing in terms of response time.
16
17 This project also includes customizing search depending on document types, i.e search process is restricted to user specified document types.
18
19 = Project Description =
20
21
22 The main intent of this project is to support multiple installations of XWiki, and provide an interface to search the data in all servers. Searching each single server takes longer time and also the users are not so patient. Typically organizations have multiple XWiki installations, each one dedicated to a department/role. Our Distributed search searches all the servers concurrently.
23
24 Distributed search can be achieved in two ways,
25
26 Each server maintains it's own index and main server searches all indexes remotely.
27
28 The main server maintains the global index of all its children servers and searches on that index.
29
30 In both the methods only main server has the authority to access other server's indexes. Both methods 1 and 2 have their own pros and cons, and their behavior depends on the amount of data being indexed. Also the modified version of search engine should support customized search depending on the documentation types.
31
32 == A tentative project architecture or an approach to it ==
33
34 * I want to do this project with out distributing the current architecture of XWiki.
35 * XWiki uses lucene to index and search the documents. I will be working on the 2 methods proposed. My tentative approach is written below, and it can be changed after some discussions with my mentor.
36
37 1. When indexes are distributed :: Main server searches the remote indexes of children server using lucene classes like MultiSearcher,ParallelMultiSearcher,RemoteSearchable,etc. Java RMI(Remote Method Invocation) will be used to search the indexes remotly.
38 1. When a global index needs to be maintained, the children servers intimate the main server about the document being updated/created. The main server collects the content of that updated/new document. Since there is no update function in lucene, the previous entry for the document in the index is deleted and a new index entry for that particular document is added to the old index. RMI will be used for communication between children servers and main server.
39 1. In both the cases, a search call to children servers, will call the main server search.
40 1. Also, to restrict the search to user specified document type, an extra field is added to the index of document to store the the file type of the document. A boolean query can be used to restrict search process to certain document types.
41
42 = An expected project plan and timeline with milestones =
43
44 This is my tentative timeline, I plan to work more or less according to this timeline. This timeline might change after discussions with my mentor
45
46 Till May26th ~-~- Will pick few issues from JIRA and create patches to them
47 Week 1: Finalize the plan, after thorough discussions with the mentor
48 Week 2-4: Implementation of method-I(when indexes are distributed)
49 Week 5: Break
50 Week 6: Testing and Evaluation of Method-I in terms of response time, possibly with different constraints, conditions
51 Mid-term Evaluation : Submit the code of Method-I and also the evaluation results
52 Week 7-9: Implementation of method-II(main server maintaining the global index)
53 Week 10: Testing and evaluation of method-II using the same metrics as mentioned above
54 Week 11-12: Implementation of customized search depending on document types, Code documentation and User documentation
55 Week 13-14: Back up weeks. Will be used for further improvements
56 Final Evaluation: Complete Code, evaluation results, Code and User documentation
57
58 = Implementation details =
59
60 Distributed search can be achieved in many ways, of which we are trying two methods to find out which suits the better to XWiki.
61 Method I is, each installation maintains its own index and the master node search all these indexes through remote calls.
62
63 Method II is, master node maintains a global index of all the installations and searches on the local global index. The data(information) required to build/maintain the global index is collected from each individual installation through remote calls.
64 This is how both the method methods are being implementated.
65
66 == Method I ==
67
68 Currently [[Lucene>>http://lucene.apache.org]] API's are being used in XWiki to achieve text based search. Lucene by default has some support for searching remote indexes. API's like RemoteSearchable, ParallelMultiSearcher and MultiSearcher helps us to search an index remotly.
69
70 Searching remote indexes is nothing but distributed search over multiple machines. The above API's help is to achieve a simple distributed search, but there are few more things to be taken care of.
71
72 In order to make sure that the current XWiki architecture is not disturbed while implementing the distributed search, existing lucene plugin has been extended to achieve distributed search. It has been modified in such a way that either distribued search can be enable or disabled in xwiki.cfg file.
73
74 **In short what the new code does:**
75
76 1. As soon is XWiki is started, lucene plugin is initiated, during the process of initialization, the each machine makes is index remote available.
77
78 2. A Parallel searcher is created over all the remotely available indexes, during the initialization of lucene plugin on Master node(machine). These 2 steps will achieve a simple (% class="nfakPe" %) distributed(%%) (% class="nfakPe" %) search(%%). But Index updates would not be reflected in the search process.
79
80 3. As and when there is an index update, index reader which was made remotely available earlier in step 1 is closed and is opened again. The newly opened index reader is made remotely available once again. With this Index updates on remote machines are reflected in (% class="nfakPe" %) search(%%) process. But what if one the machines go down during the process ??
81
82 4. When ever there is an error while performing (% class="nfakPe" %) distributed(%%) (% class="nfakPe" %) search(%%), the parallel searcher in step 2 is redefined, i.e instead of searching 'n' machines (earlier) it will (% class="nfakPe" %) search(%%) only 'n-m' machines. At some point the machines went down will be started again. How to (% class="nfakPe" %) search(%%) on these machines again?
83
84 5. A thread keeps on checking if searcher is set to (% class="nfakPe" %) search(%%) over all the slaves defined in xwiki.cfg or not . If it is not searching on all the slaves, It will try to redefine the searcher in step 2 as did in step 4.
85
86 6. I have modified existing XWiki.Results to make compatible to both (% class="nfakPe" %) distributed(%%)/normal (% class="nfakPe" %) search(%%).
87
88 7. Modified xwiki.cfg.vm, to have fews options required for (% class="nfakPe" %) distributed(%%) (% class="nfakPe" %) search(%%).
89
90 The Modified version of xwiki.cfg as for information listed below
91
92 * Whether **distributed search** should be **enabled or not** (true/false).
93
94 {{code}}
95 xwiki.plugins.lucene.distributedSearch=true
96 {{/code}}
97
98 * **Hostname ( IP )** of the machine, this information could have been implicitly collected using some java code, but some times a machine might have more than one connection and the user would like to use only one of them. Therefore it is taken explicitly.
99
100 {{code}}
101 xwiki.plugins.lucene.hostname=172.16.17.11
102 {{/code}}
103
104 * **Port** on which index should be made remotely available.
105
106 {{code}}
107 xwiki.plugins.lucene.port=5388
108 {{/code}}
109
110 * **IP of the Master node.** ( Even though this info is not used in the current method, it will be used in the next method)
111
112 {{code}}
113 xwiki.plugins.lucene.master=172.16.17.11
114 {{/code}}
115
116 * **List of slaves** ip along with the port on which their indexes are remotely available. This info is used by Master to search the remotely available indexes.
117
118 {{code}}
119 xwiki.plugins.lucene.slavelist=172.16.17.69:5288,172.16.45.11:5388,172.16.17.164:5388,172.16.17.110:5338,172.16.17.169:8388
120 {{/code}}
121
122 **Code for Method I has been committed here [[https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearch/]] read the README file before starting to test it.** \
123
124 == Method II ==
125
126 In the first method the master node searches the remotely available indexes, where in this method the master node is expected to maintain a global index of all the installations and search on the global index.
127
128 Once again lucene API like RemoteSearcable is used to achieve this task.
129
130 **In short what the new code does :**
131
132 **On Slave installations :**
133
134 1. Each installation maintains its index but with few changes.
135
136 * A new field is added to each lucene document, to store the time stamp of the when the document has been indexed.
137 * Earlier content of the document is just indexed but not stored, now the content is also stored.
138
139 2. The index build is made remotely available for the Master node to access it.
140
141 **On the Master Node :**
142
143 Similar to each installation maintains it's local index, the master node maintains a global index. In addition to global index it also maintains small index which has information about the slaves.
144
145 The index which created to store information about slaves is of very small size typically in the order of kilobytes, so it is stored in the work space of jetty by default.
146
147 3. Maintaining information about slaves:
148
149 * The master node keeps track of the time stamp of last document it has indexed. If no documents were indexed till then time stamp of last indexed document is assumed to be 0.
150 * This index is made remotely available which is used while updating the global index and is used by slaves to update the local index.
151
152 4. The master node opens a connection to each slave index ( in (% class="p" %)**//sequential//**(%%) order ), and fires a **range query** .
153
154 The range query is formed in such a way that it asks for the recently indexed documents i.e, the time stamp of last indexed document is retrieved from the index created & maintained in step 3 and documents whose time stamp is above the recent time stamp are asked to be returned.
155
156 The documents returned are used to update the global index.
157
158 Again the the time stamp of last indexed documents is updated in the index of Step 3.
159
160 **Once again on Slave installations :**
161
162 It is not idle to again maintain local index while maintaining a global index. The reason to maintain local index in Step 1 is just to help updating the global index.
163
164 5. During the process of updating the local index, slave nodes connects to the mode and gets the time stamp of last indexed document.
165
166 And then it deletes the documents that where indexed on or before the above retrieved time stamp.
167
168 Now the local index is also clean and we have only one global index.
169
170 **NOTE :** When ever ther is an update on wiki first the local index gets updated, then the global index is updated and then local index is cleaned.
171
172 6. The global index can be rebuild just by rebuilding the individual local indexes.
173
174 7. Once again xwiki.cfg.vm is modified to have 2 more options
175
176 * If the current installation is the master node where should be global index stored.
177
178 {{code}}
179 xwiki.plugins.lucene.globalindexdir=/usr/local/xwiki/global_lucene
180 {{/code}}
181
182 * Port on which slaves information(Step 3) should be/is made available on.
183
184 {{code}}
185 xwiki.plugins.lucene.slavesinfoport = 1100
186 {{/code}}
187
188 **Code for Method II has been committed here [[https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearchMethod2]] read the README file before starting to test it.**

Get Connected