Last modified by Vincent Massol on 2024/02/26 17:52

Hide last authors
Vincent Massol 31.1 1 {{error}}
Vincent Massol 31.4 2 This has now been implemented as part of the [[SOLR implementation>>Proposal.SearchSolr]] since SOLR supports both embedded and distributed instances.
Vincent Massol 31.1 3 {{/error}}
4
Ecaterina Moraru (Valica) 29.1 5 = Abstract =
SaiKrishna 4.1 6
Ecaterina Moraru (Valica) 30.3 7 With the growth of information, the role played by search engines is quite pivotal in finding the appropriate information and there by decreasing the human efforts. Due to several reasons the data can be distributed over several machines. The moment data is distributed over several machines a modified version of search engine is required which can search the data on all the machines and the modified version is known as "distributed search engine".
8 When it comes to enterprise wiki's like XWiki, an organization can have multiple installation with each installation dedicated to specific department/role, therefore making distributed search a feature of it would be quite helpful.
SaiKrishna 15.1 9
Ecaterina Moraru (Valica) 30.3 10 Distributed search can be achieved in two ways, 1.Each server maintains it's own index and main server searches all indexes remotely. 1.The main server maintains the global index of all its children servers and searches on that index.
11 In both methods only main server gets authority to access other servers indexes(Of course children server can access it's own index).
12 \ Both methods 1 and 2 have their own pros and cons, and their behavior changes with the amount of data being indexed.
13 In this project I will be implementing both the methods mentioned above, with out disturbing the current state of the art of XWiki. XWiki currently uses "Lucene" to index and search the documents. Using the existing search engine(lucene) architecture of XWiki, distributed searching can be achieved using java RMI(Remote Method Invocation) technology and some lucene api's like ParallelMultiSearcher,MultiSearcher and RemoteSearchable to search remote indexes.
SaiKrishna 4.1 14
Ecaterina Moraru (Valica) 30.3 15 Once both the methods mentioned above are implemented, I will be evaluating them using JUint testing in terms of response time.
SaiKrishna 15.1 16
Ecaterina Moraru (Valica) 30.3 17 This project also includes customizing search depending on document types, i.e search process is restricted to user specified document types.
SaiKrishna 4.1 18
Ecaterina Moraru (Valica) 29.1 19 = Project Description =
SaiKrishna 15.1 20
21
Ecaterina Moraru (Valica) 30.3 22 The main intent of this project is to support multiple installations of XWiki, and provide an interface to search the data in all servers. Searching each single server takes longer time and also the users are not so patient. Typically organizations have multiple XWiki installations, each one dedicated to a department/role. Our Distributed search searches all the servers concurrently.
SaiKrishna 4.1 23
Ecaterina Moraru (Valica) 30.3 24 Distributed search can be achieved in two ways,
SaiKrishna 4.1 25
Ecaterina Moraru (Valica) 30.3 26 Each server maintains it's own index and main server searches all indexes remotely.
SaiKrishna 2.1 27
Ecaterina Moraru (Valica) 30.3 28 The main server maintains the global index of all its children servers and searches on that index.
Ecaterina Moraru (Valica) 29.1 29
Ecaterina Moraru (Valica) 30.3 30 In both the methods only main server has the authority to access other server's indexes. Both methods 1 and 2 have their own pros and cons, and their behavior depends on the amount of data being indexed. Also the modified version of search engine should support customized search depending on the documentation types.
Ecaterina Moraru (Valica) 29.1 31
32 == A tentative project architecture or an approach to it ==
33
dward 1.1 34 * I want to do this project with out distributing the current architecture of XWiki.
35 * XWiki uses lucene to index and search the documents. I will be working on the 2 methods proposed. My tentative approach is written below, and it can be changed after some discussions with my mentor.
36
37 1. When indexes are distributed :: Main server searches the remote indexes of children server using lucene classes like MultiSearcher,ParallelMultiSearcher,RemoteSearchable,etc. Java RMI(Remote Method Invocation) will be used to search the indexes remotly.
38 1. When a global index needs to be maintained, the children servers intimate the main server about the document being updated/created. The main server collects the content of that updated/new document. Since there is no update function in lucene, the previous entry for the document in the index is deleted and a new index entry for that particular document is added to the old index. RMI will be used for communication between children servers and main server.
39 1. In both the cases, a search call to children servers, will call the main server search.
40 1. Also, to restrict the search to user specified document type, an extra field is added to the index of document to store the the file type of the document. A boolean query can be used to restrict search process to certain document types.
41
Ecaterina Moraru (Valica) 29.1 42 = An expected project plan and timeline with milestones =
SaiKrishna 15.1 43
Ecaterina Moraru (Valica) 30.3 44 This is my tentative timeline, I plan to work more or less according to this timeline. This timeline might change after discussions with my mentor
SaiKrishna 4.1 45
Ecaterina Moraru (Valica) 30.3 46 Till May26th ~-~- Will pick few issues from JIRA and create patches to them
47 Week 1: Finalize the plan, after thorough discussions with the mentor
48 Week 2-4: Implementation of method-I(when indexes are distributed)
49 Week 5: Break
50 Week 6: Testing and Evaluation of Method-I in terms of response time, possibly with different constraints, conditions
51 Mid-term Evaluation : Submit the code of Method-I and also the evaluation results
52 Week 7-9: Implementation of method-II(main server maintaining the global index)
53 Week 10: Testing and evaluation of method-II using the same metrics as mentioned above
54 Week 11-12: Implementation of customized search depending on document types, Code documentation and User documentation
55 Week 13-14: Back up weeks. Will be used for further improvements
56 Final Evaluation: Complete Code, evaluation results, Code and User documentation
SaiKrishna 15.1 57
Ecaterina Moraru (Valica) 30.3 58 = Implementation details =
SaiKrishna 15.1 59
Ecaterina Moraru (Valica) 30.3 60 Distributed search can be achieved in many ways, of which we are trying two methods to find out which suits the better to XWiki.
61 Method I is, each installation maintains its own index and the master node search all these indexes through remote calls.
SaiKrishna 3.1 62
Ecaterina Moraru (Valica) 30.3 63 Method II is, master node maintains a global index of all the installations and searches on the local global index. The data(information) required to build/maintain the global index is collected from each individual installation through remote calls.
64 This is how both the method methods are being implementated.
SaiKrishna 4.1 65
Ecaterina Moraru (Valica) 29.1 66 == Method I ==
SaiKrishna 2.1 67
Ecaterina Moraru (Valica) 30.3 68 Currently [[Lucene>>http://lucene.apache.org]] API's are being used in XWiki to achieve text based search. Lucene by default has some support for searching remote indexes. API's like RemoteSearchable, ParallelMultiSearcher and MultiSearcher helps us to search an index remotly.
SaiKrishna 6.1 69
Ecaterina Moraru (Valica) 30.3 70 Searching remote indexes is nothing but distributed search over multiple machines. The above API's help is to achieve a simple distributed search, but there are few more things to be taken care of.
SaiKrishna 15.1 71
Ecaterina Moraru (Valica) 30.3 72 In order to make sure that the current XWiki architecture is not disturbed while implementing the distributed search, existing lucene plugin has been extended to achieve distributed search. It has been modified in such a way that either distribued search can be enable or disabled in xwiki.cfg file.
SaiKrishna 15.1 73
Ecaterina Moraru (Valica) 30.3 74 **In short what the new code does:**
SaiKrishna 6.1 75
Ecaterina Moraru (Valica) 30.3 76 1. As soon is XWiki is started, lucene plugin is initiated, during the process of initialization, the each machine makes is index remote available.
SaiKrishna 15.1 77
Ecaterina Moraru (Valica) 30.3 78 2. A Parallel searcher is created over all the remotely available indexes, during the initialization of lucene plugin on Master node(machine). These 2 steps will achieve a simple (% class="nfakPe" %) distributed(%%) (% class="nfakPe" %) search(%%). But Index updates would not be reflected in the search process.
SaiKrishna 15.1 79
Ecaterina Moraru (Valica) 30.3 80 3. As and when there is an index update, index reader which was made remotely available earlier in step 1 is closed and is opened again. The newly opened index reader is made remotely available once again. With this Index updates on remote machines are reflected in (% class="nfakPe" %) search(%%) process. But what if one the machines go down during the process ??
Ecaterina Moraru (Valica) 29.1 81
Ecaterina Moraru (Valica) 30.3 82 4. When ever there is an error while performing (% class="nfakPe" %) distributed(%%) (% class="nfakPe" %) search(%%), the parallel searcher in step 2 is redefined, i.e instead of searching 'n' machines (earlier) it will (% class="nfakPe" %) search(%%) only 'n-m' machines. At some point the machines went down will be started again. How to (% class="nfakPe" %) search(%%) on these machines again?
Ecaterina Moraru (Valica) 29.1 83
Ecaterina Moraru (Valica) 30.3 84 5. A thread keeps on checking if searcher is set to (% class="nfakPe" %) search(%%) over all the slaves defined in xwiki.cfg or not . If it is not searching on all the slaves, It will try to redefine the searcher in step 2 as did in step 4.
Ecaterina Moraru (Valica) 29.1 85
Ecaterina Moraru (Valica) 30.3 86 6. I have modified existing XWiki.Results to make compatible to both (% class="nfakPe" %) distributed(%%)/normal (% class="nfakPe" %) search(%%).
SaiKrishna 15.1 87
Ecaterina Moraru (Valica) 29.1 88 7. Modified xwiki.cfg.vm, to have fews options required for (% class="nfakPe" %) distributed(%%) (% class="nfakPe" %) search(%%).
SaiKrishna 9.1 89
Ecaterina Moraru (Valica) 30.3 90 The Modified version of xwiki.cfg as for information listed below
SaiKrishna 15.1 91
Ecaterina Moraru (Valica) 29.1 92 * Whether **distributed search** should be **enabled or not** (true/false).
93
94 {{code}}
SaiKrishna 14.1 95 xwiki.plugins.lucene.distributedSearch=true
Ecaterina Moraru (Valica) 29.1 96 {{/code}}
97
98 * **Hostname ( IP )** of the machine, this information could have been implicitly collected using some java code, but some times a machine might have more than one connection and the user would like to use only one of them. Therefore it is taken explicitly.
99
100 {{code}}
SaiKrishna 15.1 101 xwiki.plugins.lucene.hostname=172.16.17.11
Ecaterina Moraru (Valica) 29.1 102 {{/code}}
103
104 * **Port** on which index should be made remotely available.
105
106 {{code}}
SaiKrishna 26.1 107 xwiki.plugins.lucene.port=5388
Ecaterina Moraru (Valica) 29.1 108 {{/code}}
109
110 * **IP of the Master node.** ( Even though this info is not used in the current method, it will be used in the next method)
111
112 {{code}}
SaiKrishna 26.1 113 xwiki.plugins.lucene.master=172.16.17.11
Ecaterina Moraru (Valica) 29.1 114 {{/code}}
115
116 * **List of slaves** ip along with the port on which their indexes are remotely available. This info is used by Master to search the remotely available indexes.
117
118 {{code}}
SaiKrishna 26.1 119 xwiki.plugins.lucene.slavelist=172.16.17.69:5288,172.16.45.11:5388,172.16.17.164:5388,172.16.17.110:5338,172.16.17.169:8388
Ecaterina Moraru (Valica) 29.1 120 {{/code}}
SaiKrishna 8.1 121
Ecaterina Moraru (Valica) 29.1 122 **Code for Method I has been committed here [[https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearch/]] read the README file before starting to test it.** \
SaiKrishna 11.1 123
Ecaterina Moraru (Valica) 30.3 124 == Method II ==
SaiKrishna 15.1 125
Ecaterina Moraru (Valica) 30.3 126 In the first method the master node searches the remotely available indexes, where in this method the master node is expected to maintain a global index of all the installations and search on the global index.
SaiKrishna 11.1 127
Ecaterina Moraru (Valica) 30.3 128 Once again lucene API like RemoteSearcable is used to achieve this task.
SaiKrishna 11.1 129
Ecaterina Moraru (Valica) 30.3 130 **In short what the new code does :**
SaiKrishna 11.1 131
Ecaterina Moraru (Valica) 29.1 132 **On Slave installations :**
SaiKrishna 15.1 133
Ecaterina Moraru (Valica) 30.3 134 1. Each installation maintains its index but with few changes.
SaiKrishna 15.1 135
SaiKrishna 11.1 136 * A new field is added to each lucene document, to store the time stamp of the when the document has been indexed.
137 * Earlier content of the document is just indexed but not stored, now the content is also stored.
138
Ecaterina Moraru (Valica) 30.3 139 2. The index build is made remotely available for the Master node to access it.
SaiKrishna 15.1 140
Ecaterina Moraru (Valica) 29.1 141 **On the Master Node :**
SaiKrishna 11.1 142
Ecaterina Moraru (Valica) 30.3 143 Similar to each installation maintains it's local index, the master node maintains a global index. In addition to global index it also maintains small index which has information about the slaves.
SaiKrishna 15.1 144
Ecaterina Moraru (Valica) 30.3 145 The index which created to store information about slaves is of very small size typically in the order of kilobytes, so it is stored in the work space of jetty by default.
SaiKrishna 15.1 146
Ecaterina Moraru (Valica) 30.3 147 3. Maintaining information about slaves:
SaiKrishna 15.1 148
SaiKrishna 12.1 149 * The master node keeps track of the time stamp of last document it has indexed. If no documents were indexed till then time stamp of last indexed document is assumed to be 0.
Ecaterina Moraru (Valica) 29.1 150 * This index is made remotely available which is used while updating the global index and is used by slaves to update the local index.
SaiKrishna 11.1 151
Ecaterina Moraru (Valica) 30.4 152 4. The master node opens a connection to each slave index ( in (% class="p" %)**//sequential//**(%%) order ), and fires a **range query** .
SaiKrishna 11.1 153
Ecaterina Moraru (Valica) 30.3 154 The range query is formed in such a way that it asks for the recently indexed documents i.e, the time stamp of last indexed document is retrieved from the index created & maintained in step 3 and documents whose time stamp is above the recent time stamp are asked to be returned.
SaiKrishna 15.1 155
Ecaterina Moraru (Valica) 30.3 156 The documents returned are used to update the global index.
SaiKrishna 15.1 157
SaiKrishna 12.1 158 Again the the time stamp of last indexed documents is updated in the index of Step 3.
SaiKrishna 15.1 159
Ecaterina Moraru (Valica) 29.1 160 **Once again on Slave installations :**
SaiKrishna 12.1 161
Ecaterina Moraru (Valica) 30.3 162 It is not idle to again maintain local index while maintaining a global index. The reason to maintain local index in Step 1 is just to help updating the global index.
SaiKrishna 15.1 163
Ecaterina Moraru (Valica) 30.3 164 5. During the process of updating the local index, slave nodes connects to the mode and gets the time stamp of last indexed document.
SaiKrishna 12.1 165
Ecaterina Moraru (Valica) 30.3 166 And then it deletes the documents that where indexed on or before the above retrieved time stamp.
SaiKrishna 15.1 167
Ecaterina Moraru (Valica) 30.3 168 Now the local index is also clean and we have only one global index.
SaiKrishna 15.1 169
Ecaterina Moraru (Valica) 30.3 170 **NOTE :** When ever ther is an update on wiki first the local index gets updated, then the global index is updated and then local index is cleaned.
SaiKrishna 12.1 171
Ecaterina Moraru (Valica) 30.3 172 6. The global index can be rebuild just by rebuilding the individual local indexes.
SaiKrishna 12.1 173
Ecaterina Moraru (Valica) 30.3 174 7. Once again xwiki.cfg.vm is modified to have 2 more options
SaiKrishna 15.1 175
SaiKrishna 12.1 176 * If the current installation is the master node where should be global index stored.
Ecaterina Moraru (Valica) 29.1 177
178 {{code}}
SaiKrishna 26.1 179 xwiki.plugins.lucene.globalindexdir=/usr/local/xwiki/global_lucene
Ecaterina Moraru (Valica) 29.1 180 {{/code}}
181
SaiKrishna 13.1 182 * Port on which slaves information(Step 3) should be/is made available on.
Ecaterina Moraru (Valica) 29.1 183
184 {{code}}
SaiKrishna 26.1 185 xwiki.plugins.lucene.slavesinfoport = 1100
Ecaterina Moraru (Valica) 29.1 186 {{/code}}
SaiKrishna 13.1 187
Ecaterina Moraru (Valica) 29.1 188 **Code for Method II has been committed here [[https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearchMethod2]] read the README file before starting to test it.**

Get Connected