Distributed XWiki Search Engine

Last modified by Vincent Massol on 2013/11/13 17:06

 Requirements
 Dropped

Description

This has now been implemented as part of the SOLR implementation since SOLR supports both embedded and distributed instances.

Abstract

With the growth of information, the role played by search engines is quite pivotal in finding the appropriate information and there by decreasing the human efforts. Due to several reasons the data can be distributed over several machines. The moment data is distributed over several machines a modified version of search engine is required which can search the data on all the machines and the modified version is known as "distributed search engine".
When it comes to enterprise wiki's like XWiki, an organization can have multiple installation with each installation dedicated to specific department/role, therefore making distributed search a feature of it would be quite helpful. 

Distributed search can be achieved in two ways, 1.Each server maintains it's own index and main server searches all indexes remotely. 1.The main server maintains the global index of all its children servers and searches on that index.
In both methods only main server gets authority to access other servers indexes(Of course children server can access it's own index).
\ Both methods 1 and 2 have their own pros and cons, and their behavior changes with the amount of data being indexed.
In this project I will be implementing both the methods mentioned above, with out disturbing the current state of the art of XWiki. XWiki currently uses "Lucene" to index and search the documents. Using the existing search engine(lucene) architecture of XWiki, distributed searching can be achieved using java RMI(Remote Method Invocation) technology and some lucene api's like ParallelMultiSearcher,MultiSearcher and RemoteSearchable to search remote indexes. 

Once both the methods mentioned above are implemented, I will be evaluating them using JUint testing in terms of response time. 

This project also includes customizing search depending on document types, i.e search process is restricted to user specified document types. 

Project Description

The main intent of this project is to support multiple installations of XWiki, and provide an interface to search the data in all servers. Searching each single server takes longer time and also the users are not so patient. Typically organizations have multiple XWiki installations, each one dedicated to a department/role. Our Distributed search searches all the servers concurrently. 

Distributed search can be achieved in two ways, 

Each server maintains it's own index and main server searches all indexes remotely. 

The main server maintains the global index of all its children servers and searches on that index. 

In both the methods only main server has the authority to access other server's indexes. Both methods 1 and 2 have their own pros and cons, and their behavior depends on the amount of data being indexed. Also the modified version of search engine should support customized search depending on the documentation types. 

A tentative project architecture or an approach to it

  • I want to do this project with out distributing the current architecture of XWiki.
  • XWiki uses lucene to index and search the documents. I will be working on the 2 methods proposed. My tentative approach is written below, and it can be changed after some discussions with my mentor.
  1. When indexes are distributed :: Main server searches the remote indexes of children server using lucene classes like MultiSearcher,ParallelMultiSearcher,RemoteSearchable,etc. Java RMI(Remote Method Invocation) will be used to search the indexes remotly.
  2. When a global index needs to be maintained, the children servers intimate the main server about the document being updated/created. The main server collects the content of that updated/new document. Since there is no update function in lucene, the previous entry for the document in the index is deleted and a new index entry for that particular document is added to the old index. RMI will be used for communication between children servers and main server.
  3. In both the cases, a search call to children servers, will call the main server search.
  4. Also, to restrict the search to user specified document type, an extra field is added to the index of document to store the the file type of the document. A boolean query can be used to restrict search process to certain document types.

An expected project plan and timeline with milestones

This is my tentative timeline, I plan to work more or less according to this timeline. This timeline might change after discussions with my mentor 

Till May26th -- Will pick few issues from JIRA and create patches to them
Week 1: Finalize the plan, after thorough discussions with the mentor
Week 2-4: Implementation of method-I(when indexes are distributed)
Week 5: Break
Week 6: Testing and Evaluation of Method-I in terms of response time, possibly with different constraints, conditions
Mid-term Evaluation : Submit the code of Method-I and also the evaluation results
Week 7-9: Implementation of method-II(main server maintaining the global index)
Week 10: Testing and evaluation of method-II using the same metrics as mentioned above
Week 11-12: Implementation of customized search depending on document types, Code documentation and User documentation
Week 13-14: Back up weeks. Will be used for further improvements
Final Evaluation: Complete Code, evaluation results, Code and User documentation

Implementation details

Distributed search can be achieved in many ways, of which we are trying two methods to find out which suits the better to XWiki.
Method I is, each installation maintains its own index and the master node search all these indexes through remote calls. 

Method II is, master node maintains a global index of all the installations and searches on the local global index. The data(information) required to build/maintain the global index is collected from each individual installation through remote calls.
This is how both the method methods are being implementated. 

Method I

Currently Lucene API's are being used in XWiki to achieve text based search. Lucene by default has some support for searching remote indexes. API's like RemoteSearchable, ParallelMultiSearcher and MultiSearcher helps us to search an index remotly. 

Searching remote indexes is nothing but distributed search over multiple machines. The above API's help is to achieve a simple distributed search, but there are few more things to be taken care of. 

In order to make sure that the current XWiki architecture is not disturbed while implementing the distributed search, existing lucene plugin has been extended to achieve distributed search. It has been modified in such a way that either distribued search can be enable or disabled in xwiki.cfg file. 

In short what the new code does: 

  1. As soon is XWiki is started, lucene plugin is initiated, during the process of initialization, the each machine makes is index remote available. 

2. A Parallel searcher is created over all the remotely available indexes, during the initialization of lucene plugin on Master node(machine). These 2 steps will achieve a simple  distributed  search. But Index updates would not be reflected in the search process. 

3. As and when there is an index update, index reader which was made remotely available earlier in step 1 is closed and is opened again. The newly opened index reader is made remotely available once again. With this Index updates on remote machines are reflected in  search process. But what if one the machines go down during the process ?? 

4. When ever there is an error while performing  distributed  search, the parallel searcher in step 2 is redefined, i.e instead of searching 'n' machines (earlier) it will  search only 'n-m' machines. At some point the machines went down will be started again. How to  search on these machines again? 

5. A thread keeps on checking if searcher is set to  search over all the slaves defined in xwiki.cfg or not . If it is not searching on all the slaves, It will try to redefine the searcher in step 2 as did in step 4. 

6. I have modified existing XWiki.Results to make compatible to both  distributed/normal  search.

7. Modified xwiki.cfg.vm, to have fews options required for  distributed  search.

The Modified version of xwiki.cfg as for information listed below

  • Whether distributed search should be enabled or not (true/false).
xwiki.plugins.lucene.distributedSearch=true
  • Hostname ( IP ) of the machine, this information could have been implicitly collected using some java code, but some times a machine might have more than one connection and the user would like to use only one of them. Therefore it is taken explicitly.
xwiki.plugins.lucene.hostname=172.16.17.11
  • Port on which index should be made remotely available.
xwiki.plugins.lucene.port=5388
  • IP of the Master node. ( Even though this info is not used in the current method, it will be used in the next method)
xwiki.plugins.lucene.master=172.16.17.11
  • List of slaves ip along with the port on which their indexes are remotely available. This info is used by Master to search the remotely available indexes.
xwiki.plugins.lucene.slavelist=172.16.17.69:5288,172.16.45.11:5388,172.16.17.164:5388,172.16.17.110:5338,172.16.17.169:8388

Code for Method I has been committed here https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearch/ read the README file before starting to test it. \

Method II

In the first method the master node searches the remotely available indexes, where in this method the master node is expected to maintain a global index of all the installations and search on the global index.

Once again lucene API like RemoteSearcable is used to achieve this task.

In short what the new code does :

On Slave installations :

  1. Each installation maintains its index but with few changes.
  • A new field is added to each lucene document, to store the time stamp of the when the document has been indexed.
  • Earlier content of the document is just indexed but not stored, now the content is also stored.

2. The index build is made remotely available for the Master node to access it. 

On the Master Node :

Similar to each installation maintains it's local index, the master node maintains a global index. In addition to global index it also maintains small index which has information about the slaves. 

The index which created to store information about slaves is of very small size typically in the order of kilobytes, so it is stored in the work space of jetty by default. 

3. Maintaining information about slaves: 

  • The master node keeps track of the time stamp of last document it has indexed. If no documents were indexed till then time stamp of last indexed document is assumed to be 0.
  • This index is made remotely available which is used while updating the global index and is used by slaves to update the local index.

4. The master node opens a connection to each slave index ( in sequential order ), and fires a range query .

The range query is formed in such a way that it asks for the recently indexed documents i.e, the time stamp of last indexed document is retrieved from the index created & maintained in step 3 and documents whose time stamp is above the recent time stamp are asked to be returned. 

The documents returned are used to update the global index. 

Again the the time stamp of last indexed documents is updated in the index of Step 3.

Once again on Slave installations :

It is not idle to again maintain local index while maintaining a global index. The reason to maintain local index in Step 1 is just to help updating the global index. 

5. During the process of updating the local index, slave nodes connects to the mode and gets the time stamp of last indexed document. 

And then it deletes the documents that where indexed on or before the above retrieved time stamp. 

Now the local index is also clean and we have only one global index. 

NOTE : When ever ther is an update on wiki first the local index gets updated, then the global index is updated and then local index is cleaned. 

6. The global index can be rebuild just by rebuilding the individual local indexes. 

7. Once again xwiki.cfg.vm is modified to have 2 more options 

  • If the current installation is the master node where should be global index stored.
xwiki.plugins.lucene.globalindexdir=/usr/local/xwiki/global_lucene
  • Port on which slaves information(Step 3) should be/is made available on.
xwiki.plugins.lucene.slavesinfoport = 1100

Code for Method II has been committed here https://svn.xwiki.org/svnroot/xwiki/sandbox/plugins/distributedSearchMethod2 read the README file before starting to test it.


 

Tags: gsoc
Created by Ecaterina Moraru (Valica) on 2013/11/06 13:02
    

Get Connected