Planned Features and Architecture Overview

Last modified by Vincent Massol on 2024/05/23 14:23

This document supplements the high-level description of the WAISE - Wiki AI Search Engine by providing more details regarding the implemented features from a user's point of view, the organization into modules and the functionality that is implemented in the respective modules.

Functionality

The main outcome of the WAISE project is the so-called WAISE server that provides the features that are described in the following and a benchmark of different LLMs. See WAISE - Wiki AI Search Engine for a higher-level description of the project's goals.

In a first version, WAISE provides the ability to ask questions that are answered by an LLM using content that is indexed by the WAISE server as context using the so-called retrieval-augmented generation pattern. The context only consists of content that the current user can access, taking access controls into account. Further features are added as time permits. To find suitable context for a user query, both question and the indexed content is embedded with an embedding model that allows finding semantically similar content without needing an exact content match.

To fulfill these needs, there are two user interfaces:

  1. The web-based chat UI that allows the user to ask questions. This chat UI may be embedded in an external application like a content management system.
  2. An web-based administration UI that is part of the WAISE server and only accessible by logging in on the WAISE server.

Further, a Matrix bot is developed that also allows question-answering using the WAISE server similar to the web-based chat UI. The chat bot will also allow to summarize discussions and possibly store this summary.

In the following, we explain the features that are provided by the chat and administration UI in more detail. Some features are marked as "optional", those features will be implemented if time permits. Some of them are also experimental and require further research to understand if they can be implemented at all.

Chat UI

The chat UI's main purpose is to enter a question that is answered using context information. The answer includes links to the used context information such that the user can refer to these sources for further information and for fact-checking the answer. Due to the way large language models work, it is possible that the answer is incorrect and not backed by the sources. The chat UI contains warnings to let the user know about such limitations.

The chat UI is easy to embed in another application by loading a single JavaScript file and providing a configuration either as global object or as parameters to an initialization method. This configuration allows controlling certain features like showing advanced options that might confuse non-technical users.

Optional: after displaying the answer, an automatic fact-checking process runs in the background, indicated by a spinning animation. The fact-checking compares the sources to the answer using an LLM to automatically determine how well the answer is backed by the source. After the fact-checking finishes, the result is displayed in the UI with an icon that summarizes the confidence but also additional textual information that explains in which way the source and the answer match or disagree.

The chat UI also provides a way to ask follow-up questions. Experimentation needs to show if it makes sense to add new context information for follow-up questions or if follow-up questions will be answered using the same context as the original question. For advanced users, this might also be a choice that is offered in the UI.

Optional: the chat UI might also provide an option to save the conversion into, e.g., a wiki page when being integrated in XWiki.

If enabled in the configuration, the chat UI allows a selection between several models that could have access to different knowledge or that could be based on different LLMs, as configured in the administration UI.

Chat usage can be subject to rate limits, in which case the chat UI displays an error when the rate limit is exceeded.

For UI mockups, see WAISE UI.

Custom UIs can be developed on top of a JavaScript library that powers the chat UI. A customized XWiki-specific UI will be developed based on this that features, e.g., an integration with XWiki's WYSIWYG editor similar to the LLM Application that exists as a prototype of an LLM integration in XWiki.

Administration UI

The administration UI has three main purposes:

  1. Controlling which content is indexed for the use by the LLM
  2. Managing the chat and embedding models that are available in the system, defining API endpoints, keys and available models
  3. Controlling how this content is exposed to the chat UI and the matrix bot, i.e., which LLMs are available with what knowledge as context

The administration UI allows the creation and configuration of "collections" that consist of several documents that are either added by the source application from which these documents originate through a REST API or that are crawled using a configurable crawler. The availability of crawlers depends on the installation of additional extensions that could be developed as the need arises.

As an example, a collection could consist of all document of an XWiki installation while another collection could be a crawl of a public documentation repository or consist of all JIRA issues of a JIRA server.

There are various setting for a collection like how to split documents into parts in case a document is too long for a LLM or how access to the documents in the collection should be checked to ensure that users of the chat UI cannot access documents they shouldn't be able to see. Available access control methods may depend on the installation of additional extensions.

The WAISE server exposes an OpenAI-compatible API with several models that can be used with any OpenAI-compatible client. The administration UI allows defining these models, including, at least options which collection(s) should be used for context and which LLM shall be used to answer the question.

Technology and Installation

We plan to implement the project as Java components using XWiki as framework but designed in a way that it can be easily used as search appliance for another application. The search appliance consists of a regular XWiki installation that features several extensions to provide the WAISE server implementation that can be deployed using a Docker container. The UI of XWiki thereby serves as administration UI of WAISE. It is also possible to install the WAISE extensions on an existing XWiki installation.

To implement the index, a vector database is required. In preliminary experiments, we found that Solr, which is already used in XWiki for search, works reasonably well as a vector database for storing the chunks. We therefore plan to start with Solr but allow replacing the vector database if desired by developing an extension.

The advantage of Solr is also that it can be embedded in XWiki, thus eliminating the need of a separate Solr instance. If performance requires it, running Solr separately also on a separate server is supported by XWiki. Solr also supports clustering to distribute the load. Other vector databases could also be considered if scalability of Solr should turn out to be a bottleneck.

It is out of scope of WAISE to actually run an LLM. Instead, WAISE supports any server that implements an OpenAI-compatible API. Such a server for running freely available LLMs on-premise is provided by the LocalAI open source project. We're also considering other implementations like vllm which might be more scalable and more optimized for a multi-user use case. The server with the LLM needs to be installed separately or, if desired, a provider that offers an OpenAI-compatible API can be used.

Module Structure and Data Representation

The functionality of LLM integration in XWiki that is provided by WAISE consists of two independent parts:

  1. The content index that provides a knowledge base that is used for retrieval-augmented generation.
  2. The chat functionality, which could use the knowledge base but can also work independent of it, also for different tasks like summarizing user-provided content.

Both parts need access to models (either chat or embedding models) that can be provided by an OpenAI-compatible API. As frequently both are available from the same API with the same credentials, we provide a single module that serves both use cases.

Independent of these two parts, there is a framework for benchmarking and the matrix bot that interface with these modules using the provided REST APIs.

In a first structure we thus plan a rough separation into three parts:

  1. Chat for everything related to the chat UI and configuration around it like a database of useful prompts for common tasks.
  2. Model for everything related to embedding and chat models
  3. Knowledge for everything related to the index

For technical reasons, each of these modules has an API and a UI sub-module, where the API modules are JAR extensions that contain Java-based interfaces and implementations of them while the UI modules are XAR extensions contain the user interfaces. For both the chat and the knowledge module, we provide XWiki-specific sub-modules that provide deeper integration of the chat in XWiki and indexing of an XWiki installation. To make it easier to replace Solr, a separate module for the Solr-based vector index is provided which can be uninstalled when a different vector database is used.

When the need arises, additional modules might be introduced to, e.g., provide extra functionality that can be optionally installed, allow more fine-grained control which components are used, or resolve dependency cycles.

The following diagram provides a rough overview of the proposed module structure apart from benchmarking and the matrix bot, more details for every module will be provided in the following sections.

application-llm-model-apiapplication-llm-chat-uiapplication-llm-chat-xwikiapplication-llm-model-ui
  • Administration UI for managing which models from which API endpoints with which API keys are available in XWiki
  • Administration UI which models are exposed via the OpenAI-compatible API in XWiki to which users (as proxy)
Administration UI for managing which models from which API endpoints with which API keys are available in XWikiAdministration UI which models are exposed via the OpenAI-compatible API in XWiki to which users (as proxy) Administration UI for managing which models from which API endpoints with which API keys are available in XWikiAdministration UI which models are exposed via the OpenAI-compatible API in XWiki to which users (as proxy) Administration UI for managing which models from which API endpoints with which API keys are available in XWiki Administration UI which models are exposed via the OpenAI-compatible API in XWiki to which users (as proxy)
  • API for external models
  • Implementation of this API for OpenAI-compatible APIs
  • Implementation of the REST API in XWiki
  • Java APIs for managing related configuration options
API for external modelsImplementation of this API for OpenAI-compatible APIs Implementation of the REST API in XWiki Java APIs for managing related configuration options
  • Generic chat UI, exposes a JavaScript file that can be used to embed the chat UI in other applications. Uses the OpenAI-API in XWiki as interface
  • Implements a prompt database
Generic chat UI, exposes a JavaScript file that can be used to embed the chat UI in other applications. Uses the OpenAI-API in XWiki as interface Implements a prompt database
Use
Use
Use
Use
  • Integration of the chat UI in XWiki
  • Provides further integration, e.g., with XWiki's editor
Integration of the chat UI in XWikiProvides further integration, e.g., with XWiki's editor Integration of the chat UI in XWiki Provides further integration, e.g., with XWiki's editor
Use
Use
Use
Use
Use
Use
Use
Use
Use
Use
application-llm-knowledge-ui
Administration UI for
  • Managing the index
  • Controlling access to the index
  • Managing crawlers
Administration UI forManaging the indexControlling access to the indexManaging crawlers
application-llm-knowledge-api
  • Java APIs for the index
  • REST APIs for managing the index (for external applications)
  • Crawler APIs
  • Indexing scheduler and queues
  • Implementation of RAG
Java APIs for the indexREST APIs for managing the index (for external applications)Crawler APIsIndexing scheduler and queuesImplementation of RAG Java APIs for the indexREST APIs for managing the index (for external applications) Crawler APIsIndexing scheduler and queues Implementation of RAG
application-llm-knowledge-solr
  • Uses Solr to store embeddings
Uses Solr to store embeddings
Use
Use
application-llm-chat-api
Use
Use
  • API for prompt database
  • Potentially further APIs specific for the chat use case, like storing conversations
API for prompt databasePotentially further APIs specific for the chat use case, like storing conversations
application-llm-knowledge-xwiki
Use
Use
  • Crawler/indexer for local XWiki installation
Crawler/indexer for local XWiki installation

Models

The models module has two main tasks:

  1. Expose the configured "upstream" LLM models as Java APIs for the use by other components, e.g., to compute embeddings
  2. Expose configured "downstream" LLM models as OpenAI-compatible APIs for use by the chat interface or other interfaces

The models module can be imagined as a proxy. The following diagram shows a possible flow between an incoming request and a response:

:ChatResource
:ChatResource:ChatResource
completions
completions completions
StreamingOutput
StreamingOutput StreamingOutput StreamingOutput
:ModelManager
:ModelManager:ModelManager
getChatModel
getChatModel getChatModel
ChatModel
ChatModel ChatModel
:ChatModel
:ChatModel
loop
loop
chatRequestFilters
chatRequestFilters
:ChatRequestFilterManager
:ChatRequestFilterManager
getFilters
getFilters
List<ChatRequestFilter>
List<ChatRequestFilter>
ChatRequest
ChatRequest ChatRequest
filter
filter
:ChatRequestFilter
:ChatRequestFilter
processStreaming
processStreaming processStreaming
ChatResponseStream
ChatResponseStream
:LLMAPIClient
:LLMAPIClient
requestStreaming
requestStreaming
ChatResponseStream
ChatResponseStream

The ModelManager provides access to the models that are configured for API access via the admin user interface. The ChatModel API is the Java API for accessing a chat model. It provides methods for streaming and non-streaming access. Before actually performing the request, it passes the request through each ChatRequestFilter filter for this model. The applied filters and their behavior are configurable in the admin user interface. Filters provide an extensible mechanism for the following purposes:

  1. Authorization of the request based on the current user
  2. Rate limiting (optional feature)
  3. Logging and statistics (optional feature)
  4. Retrieval-augmented generation (implemented in the knowledge index) by modifying the request

If the request is allowed, it is passed to the actual LLM API client or rather the specific implementation of that interface for the kind of model. This could, e.g., be an OpenAI API client. Any filter can deny a request by throwing an exception with an error message that will be forwarded to the client.

The response is returned as a stream for streaming requests to display each generated token as soon as possible in the user interface to minimize delays as much as possible.

Logging (Optional)

The Logging and Feedback mechanism aims to capture user interactions, system performance, and error events to provide insights into system optimization and user experience enhancements.

The WAISE server maintains logs that capture important events such as:

  • User queries and corresponding responses from the LLM
  • Any errors or exceptions that occur
  • API calls made to third-party services
  • Access logs including unauthorized access attempts

Administrators can access these logs via the Administration UI, and filtering options are available to easily locate specific entries.

For logging purposes, a filtering mechanism that would wrap the response stream in the sequence diagram shown above could be imagined.

Knowledge Index

The knowledge index is responsible for storing the contents and providing access to it including access control. It also provides a REST API to manage its contents and crawlers for indexing content. LLMs have a limited context window, typically 4k tokens at the time of writing where 3 words roughly correspond to 4 tokens. The whole prompt and the answer need to fit into this context window. Therefore,  simply adding arbitrarily large documents to the prompt is not an option. Further, documents might cover several topics, making semantic similarity search more difficult. The knowledge index thus first splits each document into parts, in the following also called chunks, that fit into the context size of the LLM. Each chunk is then embedded using an embedding model and the resulting vector of numbers is stored in a vector index. This index allows fast similarity search with a search query that is embedded using the same embedding model.

The index is divided in collections. Each collection can store a set of documents that might come from a different source. Rights for managing the collection are managed per collection, i.e., an XWiki user can have the right to manage a collection.

Collections should have at least the following settings:

  • The name of the collection
  • The authorization method including parameters like the URL of an API that can be used to check whether a user has access to a document in the collection.
  • The chunking method including parameters
  • The embedding method including parameters

To ease the implementation, a collection could probably correspond to an XWiki document so AWM can be used to manage collections and standard XWiki access rights apply to collections.

Each document that is stored in a collection has the following properties:

  • An identifier that is unique within the collection, like the document reference in XWiki
  • A language
  • A title for display
  • An URL that can be used to access the document, like the URL of the XWiki page (for linking results)
  • The collection it belongs to
  • Optional: Additional metadata, e.g., stored as JSON string to allow the indexer to store arbitrary data
  • The chunks, or rather the vector database stores the above properties for every chunk

Optional: Full documents are stored also outside the vector store to support, e.g., easy re-indexing after chunking or embedding settings have been changed.

Optional: The index stores documents with several chunking configurations, e.g., with different chunk sizes, to allow, e.g., using larger chunks when the LLM supports a larger context size.

Optional: Information about next/previous chunks could be stored to allow, e.g., retrieving more context around a chunk if the model's context size permits it.

REST API

A REST API is provided to manage collections. This allows an external application to index its contents both initially as well as when the content is updated or deleted. The API can be used with API tokens, the scope and implementation of these tokens needs to be defined.

Crawlers

Crawlers provide an alternative to the REST API for indexing content: Instead of requiring the application to send its content to the knowledge index, a crawler component actively retrieves the content of the application, both for an initial indexing run and for continuously updating the index after a change. As part of WAISE, a crawler for the local XWiki instance on which WAISE is installed is developed. Optionally, a simple web crawler could be provided. We're also exploring the use of libraries like Apache ManifoldCF to connect to many applications. As part of the WAISE project, remote crawlers most likely won't support authentication and will thus only crawl public content.

Chunking Techniques

We plan to implement and experiment with different chunking methods.

  • Section-based Chunking: Use headers or other clear section delimiters to break the document into sections. This respects the inherent structure of the document.
  • Token-based Chunking: Further divide each section into chunks based on token limits. If a section exceeds, e.g., a tenth of the context length (so we can have several chunks in the response), split it into smaller chunks with a certain overlap to ensure no context is lost.

Chunking could also take additional syntax information into account, e.g., paragraphs, sentences, ... depending on the used syntax of the content. Chunking methods could be specific to certain syntaxes or languages.

Embedding

The knowledge index uses embedding models of the Models module for embedding the generated chunks. A queue stores indexing requests in order to prevent slow embedding models from slowing down indexing tasks.

Vector Index

A generic vector index API interface is provided as part of the knowledge index that can be implemented in separate extensions. An extension with an implementation for Solr is provided as part of WAISE.

Retrieval-Augmented Generation

The knowledge index provides a ChatRequestFilter filter that augments a chat request with context that is retrieved by embedding the request and searching for the most similar chunks. It also provides an administration UI to configure the used collections.

To provide the context for a chat request, the following steps are performed:

  1. Embed the user's question using the same embedding algorithm as the chunks.
  2. Retrieve chunks that are semantically similar from the index, filtering out chunks that the user must not access.
  3. Combine the retrieved chunks into a prompt that is forwarded to the LLM. This prompt should have a form like:
    [Introduction, explaining the format of the chunks] [1]: Content of chunk 1. [2]: Content of chunk 2. [Instruction to answer the following question using the information of the provided chunks while citing the used chunks like [1]] [Input Question]
    The text of the different parts needs to be determined experimentally also taking the different models into account that might need extra care to really follow the provided instructions. Further, ordering of the chunks needs to be carefully chosen to take into account that LLMs pay less attention to content in the middle of the prompt. The number of chunks needs to be determined such that a) the total size of the prompt still fits into the context length that is supported by the LLM while also leaving room for the answer b) the query time is not too long as longer prompts increase the computational cost of the LLM.

Before sending the answer back to the user, citation placeholders need to be replaced by actual links. The LLM module could provide a response filtering mechanism for this purpose.

The following diagram shows the flow of the information from the document processing to the query processing:

1697729764297-621.png

Authentication and Authorization

Whenever a chunk is read from the vector database, WAISE needs to check if the user who used the chat is actually allowed to access this chunk. The challenge here is that the user of the chat UI might be authenticated in another application that is not the WAISE server and that also the content might originate from another application. Any right check thus needs to query the other application if the user may access the chunk. For this, WAISE first needs to know who the user is. It is not as simple as passing the user's id in the request as this information could be forged by an attacker to access contents in the name of another user.

There are several scenarios that we consider:

  1. The chat is embedded in Application A with a user authenticated in Application A and the user wants to access content that originates from
    1. Application A.
    2. another Application B.
  2. The chat is embedded in an XWiki installation that also has the WAISE extensions installed and wants to access content that is from
    1. this XWiki installation.
    2. Application A and B.

We consider the sub-scenarios a as must-haves and b as nice-to-haves.

Scenario 2.a is the easiest scenario as in this case regular XWiki authentication and authorization can be used. For scenario 2.b we additionally need a way to associate external accounts with the XWiki account. A special case that we consider is that all applications use the same single sign-on system.

To handle scenario 1, we're evaluating two options:

  1. Map the external user to a user in WAISE, creating the user account if it doesn't exist using, e.g., OpenID Connect.
  2. Let the external application create a token (such as JSON Web Token, JWT) that WAISE forwards when checking rights.

For the first option, a possible flow could be:

  1. App A uses a secret key to sign a JWT that contains the user name.
  2. App A forwards this token to the WAISE server in a header.
  3. The WAISE server forwards this token when checking rights for the results from the index together with the document ids.
  4. App A verifies the signature of the token and checks rights based on the contained user name.

In this scenario, the WAISE server could also know a public key that corresponds to the secret key of App A and use that to verify the validity of the signature of the token to check that the request is actually coming from App A. This could also allow the WAISE server to associate further accounts with this user account if it knew that user X of App A corresponds to user Y of App B because there was some authentication flow that verified this.

In any case, it must not be possible for an attacker to fake the user information that is provided by the chat integration, i.e., it must not be possible for an attacker to construct a request that tells WAISE that it comes from user X on App A as this would compromise the security of the content of App A. This property needs to hold if the attacker has a (non-admin) user account Y on App A, i.e., if the attacker can use the chat feature and obtain valid authentication information for his own user. This is the reason for the use signed tokens.

See the Authentication architecture for more details on a possible authentication implementation.

Chat

The chat module provides a client-side JavaScript API and chat user interface that is independent of XWiki. Further, an integration of the chat user interface in XWiki is provided. The chat module also contains a prompt database that can be used in the chat UI to perform common requests more easily. For this prompt database, the chat API module provides a REST interface.

Optionally, convenience features like a chat history or user-specific prompts could be implemented this module.

In general, the chat communicates with the WAISE server using the OpenAI compatible API. However, for extra features additional REST APIs are provided by the chat API and used by the chat UI.

Feedback (Optional)

The chat UI could allow users to provide feedback on the quality of the answers received. This data is sent to the administrator of the WAISE server to be analyzed in order to improve the performance of the LLM.

Fact Checking (Optional)

Once an answer is generated, an optional automatic fact-checking process runs in the background, indicated by a spinning animation (as previously noted in the Chat UI section). This could use another LLM model trained specifically for fact verification or a carefully designed prompt for fact-checking. The result of the fact-checking process is displayed alongside the answer in the Chat UI. An icon summarizing the confidence level of the fact-checking is shown, along with a textual description explaining how well the answer aligns with the source material. We're not aware of public implementations of this kind of fact-checking, thus first a prototype to verify the feasibility of this approach needs to be developed before it can be integrated in a future version of the WAISE server.

Personalization (Optional)

The chat interface could offer personalization options such as a default (system) prompt that captures user preferences such as response style, language, ... It could also be explored how to generate this personalization automatically, e.g., based on the interaction history and feedback.

Evaluation Framework

We provide a framework that allows automatically indexing a collection of documents and then running a set of given queries against WAISE. During this, different statistics like inference speed or memory consumption are measured if possible.

We're considering scoring these answers using a powerful LLM in a first step and for repeated testing, e.g., with different chunking strategies or embedding models.

All test runs will be stored in a database for later comparison.

A sample or also all answers for a final run are rated by a human to verify the reliability of the LLM-based scoring.

As this implementation framework is independent of the core WAISE sever, it could be implemented, e.g., in Python to have access to data analysis tools that allow to easily filter and visualize the results.

Matrix Bot

The architecture of the bot will be decided once the questions around authentication have been solved as authentication will be a major part of the matrix bot. The matrix bot will use the same APIs as the chat UI to communicate with WAISE.

Get Connected