Evaluation Methodology

Last modified by Paul Panțiru on 2024/09/13 10:39

As part of the WAISE project, we evaluate LLMs.

Many existing evaluations already cover various aspects of LLMs. The main objective of our evaluation is therefore narrower: to evaluate how suitable self-hosted LLMs are for specific tasks in the context of knowledge management, and for XWiki technical support questions in particular.

Apart from performance on the tasks, we want to measure the energy cost of running the LLMs.

Further, for every LLM an ethics rating will be provided that evaluates the license of the model, license of training data, license of training and inference software, and possibly other criteria.

The goal of this evaluation is to give users of the LLM application a guide to choosing models based on different criteria, like the performance of the LLM on certain tasks, the available computational/financial budget, and the licensing of the model, its code and its training data. This guide should also inform users about typical problems that they should expect when using the LLM for the evaluated tasks.

Based on the use cases we consider for WAISE, we will evaluate the following tasks:

  • generating typical content
  • content summarization
  • question answering (based on provided context)

All tasks will be evaluated in German, French and English, as these languages are the most relevant for XWiki.

Evaluation Process

  1. Select LLMs
  2. Select/generate Tasks
  3. Implement a benchmarking framework
  4. Run the LLMs on the tasks
  5. Evaluate the results automatically and manually
  6. Generate interesting visualizations
  7. Repeat (with the lessons learned from the evaluation)

In the following, we provide details for each of these steps.

Selecting LLMs

We will consider at least 5 "open" LLMs that at least permit most commercial use and can be self-hosted.

Additionally, we'll include top-performing closed models like (currently) GPT-4o as a baseline.

Regarding the selection of "open" models, evaluating "all" models is impossible. Instead, we will select models based on the following criteria that aim to select the top performing LLMs in different categories:

  • Top performers in existing benchmarks and on the LMSYS Chatbot Arena Leaderboard. We select top performers at several parameter sizes (like 7b, 13b, 70b), avoid selecting too many similar models, and prefer models trained by "trusted"/well-known entities.
  • Availability and usage at providers like OpenRouter, as this should be a good indication of which models others consider interesting
  • Interesting properties, like:
    • open/permissive license that fulfills the Open Source AI Definition, open training dataset
    • language support beyond English (in particular French, German)
    • large context window
    • specific training for RAG or summarization

Select/Generate Tasks

We consider three categories of tasks:

  • content summarization
  • question answering (based on provided context)
  • generating typical content

Both for summarization and for question answering we use the same selection of documents from xwiki.org. They are selected to a) cover a broad range of topics by including the user and admin guide and b) cover some topics like authentication in depth to ensure that relevant context is selected even if many documents cover similar topics.

All documents are downloaded. The main content is then extracted from the HTML document and converted to Markdown, and URLs are converted to absolute URLs. This ensures that the LLM sees a familiar syntax and can provide working links in its responses.
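The URL step can be illustrated with a minimal sketch. It assumes the conversion to Markdown has already happened and that links use the inline `[text](target)` form. The function name `absolutize_links` and the example URLs are hypothetical and not part of the actual framework.

```python
import re
from urllib.parse import urljoin

# Matches inline Markdown links and images: [text](target) or ![alt](target).
# Simplification: targets with titles, spaces or nested parentheses are left untouched.
LINK_PATTERN = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)(\))")


def absolutize_links(markdown: str, page_url: str) -> str:
    """Rewrite relative link targets so that they resolve against page_url."""

    def replace(match: re.Match) -> str:
        prefix, target, suffix = match.groups()
        # urljoin leaves targets that are already absolute unchanged.
        return f"{prefix}{urljoin(page_url, target)}{suffix}"

    return LINK_PATTERN.sub(replace, markdown)


if __name__ == "__main__":
    # Hypothetical page URL and content, for illustration only.
    page = "https://www.example.org/docs/admin/Authentication"
    print(absolutize_links("See [LDAP](../user/LDAP) and [the FAQ](/faq#login).", page))
```

Resolving links against the URL of the page they came from keeps them valid once the text is detached from that page, which is what allows a model to quote them in an answer.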

Summarization

For summarization, we ask the LLMs to summarize the provided documents. Some documents could be too long for the context window of some of the LLMs. To accommodate this, we will group the documents into categories based on their length and only use documents that fit into the context window of the respective LLM. We won't use techniques like a summary of summaries for long documents, because we want to evaluate the performance of the LLM directly rather than the performance of the summarization technique.
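As a rough sketch of this filtering, the following assumes a crude estimate of four characters per token and three hypothetical length categories. The category bounds, the reserved token budget and all names are illustrative assumptions; a real run would count tokens with each model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    name: str
    text: str


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token), an assumption for this sketch."""
    return max(1, len(text) // 4)


# Hypothetical length categories as (label, upper bound in estimated tokens).
CATEGORIES = [("short", 2_000), ("medium", 8_000), ("long", 32_000)]


def categorize(doc: Document) -> str:
    """Return the label of the first category whose bound the document fits under."""
    tokens = estimate_tokens(doc.text)
    for label, limit in CATEGORIES:
        if tokens <= limit:
            return label
    return "very long"


def documents_for_model(docs, context_window, reserved_tokens=1_000):
    """Keep only documents that fit, leaving room for the prompt and the summary."""
    budget = context_window - reserved_tokens
    return [doc for doc in docs if estimate_tokens(doc.text) <= budget]
```

Reporting results per length category keeps models with different context windows comparable on the documents they can all process.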

Question Answering Based on Context

We will use a mixture of different kinds of questions and expected answers:

  • Questions generated by LLMs that were provided the full, relevant documentation with LLM-generated answers. These should be the easiest to answer as both the questions and answers are straightforward to derive from the context. Questions are reviewed by an XWiki expert to ensure that they are relevant and answers are fact-checked and, if necessary, corrected by an XWiki expert.
  • Manually written questions with answers generated by LLMs that were provided the full, relevant documentation. These should still be easy to answer as the answer can be derived from the context. Again, the generated answers are fact-checked and, if necessary, corrected by an XWiki expert.
  • Real XWiki technical support questions that have been asked by users, with manually written answers. These should be the most difficult to answer. While we ensure that the context contains the expected answer, it might not be straightforward to derive it.

Additionally, we automatically translate all questions and expected answers to German and French to evaluate the ability to answer questions in a language different from the context documents.
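One possible way to represent these items is sketched below. The class names, the field names and the `translate` callback are assumptions made for illustration. They do not describe the data format the framework actually uses.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Origin(Enum):
    """The three kinds of questions described above."""

    GENERATED = "LLM-generated question, LLM-generated answer"
    MANUAL_QUESTION = "manually written question, LLM-generated answer"
    SUPPORT = "real support question, manually written answer"


@dataclass(frozen=True)
class QAItem:
    question: str
    expected_answer: str
    origin: Origin
    language: str = "en"


def translated(item: QAItem, language: str, translate) -> QAItem:
    """Return a copy of item in another language.

    translate is a caller-supplied function (text, language) -> text,
    for example a wrapper around a machine translation service.
    """
    return replace(
        item,
        question=translate(item.question, language),
        expected_answer=translate(item.expected_answer, language),
        language=language,
    )
```

Keeping the origin on every item makes it possible to report results separately for each kind of question and for each language.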

Generating Typical Content

In contrast to question answering based on context, no separate context documents are provided for this category of tasks. Instead, all necessary information is contained in the prompt itself. In this category, we will primarily evaluate content transformation tasks, like extracting structured data from text, transforming keywords into text, or adapting texts to different audiences.

Implement a Benchmark Framework

As most existing software for evaluating LLMs is written in Python, we decided to implement an evaluation framework in Python. We use Snakemake to automate the different steps of the evaluation pipeline. This allows us to have meaningful, fully automated evaluation of the results.

Run the LLMs on the tasks

The benchmark framework also supports indexing the context documents in WAISE using the REST API that is provided by the Index for the LLM Application. The benchmark framework then runs each task on the different LLMs.

Evaluate the results automatically and manually

We use a mixture of automated and manual evaluation. The aim of the automated evaluation is to have a fully automated pipeline that makes it easy to test changes in different parts of the code, like chunking, embedding or prompt formulation. We use deepeval to evaluate the results of the LLMs as it provides many LLM-powered evaluation metrics that fit our use case. Additionally, we will use manual review to ensure that the automated evaluation works as expected and matches human intuition.
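One simple way to check that automated scores match human judgment is to compare both on the same set of responses. The sketch below assumes that both kinds of scores lie between 0 and 1 and that 0.5 separates acceptable from unacceptable responses. The threshold and the function names are illustrative assumptions.

```python
def agreement_rate(automated, manual, threshold=0.5):
    """Fraction of items where both scores fall on the same side of threshold."""
    if len(automated) != len(manual) or not automated:
        raise ValueError("score lists must be non-empty and of equal length")
    same = sum((a >= threshold) == (m >= threshold) for a, m in zip(automated, manual))
    return same / len(automated)


def disagreements(ids, automated, manual, threshold=0.5):
    """IDs of the items where automated and manual judgment differ."""
    return [
        item_id
        for item_id, a, m in zip(ids, automated, manual)
        if (a >= threshold) != (m >= threshold)
    ]
```

The items returned by `disagreements` are the ones worth inspecting first, since they point either to a weakness of the metric or to an ambiguous task.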

For summarization, the main criteria are:

  • The length of the summary is appropriate
  • The summary contains all important facts from the text
  • The summary doesn't contain any additional facts
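Of these criteria, only the length can be checked without an LLM. A minimal sketch, assuming that a summary is appropriate when its word count lies between 5% and 30% of the source, could look as follows. These bounds are placeholders, not values used in the evaluation.

```python
def length_ratio(summary: str, source: str) -> float:
    """Word count of the summary relative to the word count of the source."""
    source_words = len(source.split())
    if source_words == 0:
        raise ValueError("source text is empty")
    return len(summary.split()) / source_words


def length_is_appropriate(summary, source, min_ratio=0.05, max_ratio=0.3):
    """True if the summary is neither too short nor too long (assumed bounds)."""
    return min_ratio <= length_ratio(summary, source) <= max_ratio
```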

For all other tasks, the main criteria are:

  • The response answers the question/follows the instructions
  • The response is backed by the context (if there is any)
  • The response contains the correct answer
  • The response doesn't contain any inaccurate additional statements (additional information or instructions that are correct but weren't really needed can be tolerated)

Generate interesting visualizations

We use Matplotlib for generating visualizations as it is a nicely integrated part of the Python data analysis frameworks.

Our aim with these visualizations is to show differences between the different LLMs in the execution of different categories of tasks.
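A grouped bar chart is one way to show such differences. The sketch below assumes that the results are available as (model, task, score) tuples. This input format and the function names are assumptions, and the real pipeline may store results differently.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(results):
    """Average score per (model, task) from (model, task, score) tuples."""
    grouped = defaultdict(list)
    for model, task, score in results:
        grouped[(model, task)].append(score)
    return {key: mean(values) for key, values in grouped.items()}


def plot_scores(results, path):
    """Write a grouped bar chart: one group per task, one bar per model."""
    # Imported here so that mean_scores stays usable without Matplotlib installed.
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    averages = mean_scores(results)
    models = sorted({model for model, _ in averages})
    tasks = sorted({task for _, task in averages})
    width = 0.8 / len(models)

    fig, ax = plt.subplots()
    for i, model in enumerate(models):
        heights = [averages.get((model, task), 0.0) for task in tasks]
        positions = [j + i * width for j in range(len(tasks))]
        ax.bar(positions, heights, width=width, label=model)

    # Center the tick labels under each group of bars.
    ax.set_xticks([j + width * (len(models) - 1) / 2 for j in range(len(tasks))])
    ax.set_xticklabels(tasks)
    ax.set_ylabel("mean score")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_scores(results, "scores.png")` writes the chart to a file, which fits a pipeline step that runs without a display.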

Repeat

As soon as we see the first results, we will certainly notice areas for improvement. We thus expect several iterations of:

  • Adding additional tasks to better differentiate the capabilities of the different models
  • Adding or changing evaluation metrics, to, e.g., better capture differences that we see as humans, but that aren't well captured by the metrics we already have
  • Improving the LLM Application, e.g., with better prompts to improve the results
  • Other improvements whose necessity becomes apparent only after seeing the results
