Performance and Costs Measurements

Last modified by Ludovic Dubost on 2023/06/07 14:39

The goal is to gather some performance and costs measurement of running LLMs locally.

Available setups

  • CPU1: VMWare VM with 8 (or more) CPUs and 32GB RAM, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
    • LocalAI CPU Mode
      • gpt4all and other models
    • llama.cpp CPU mode
      • llama 7b and 13b
    • Cost: 100 Euros / month based mostly on RAM
  • GPU1: OVH Public Cloud with 1 GPU NVIDIA Tesla 100VS 32Gb RAM and 14 CPUs and 54Gb of RAM
    • FastChat
    • llama.cpp GPU mode
    • LocalAI GPU mode (not working yet)
    • Cost: 900 Euros / month based mostly on GPU / GPU RAM, 1,8 Euros / hour

An Nvidia A100 could provide possibly better performance at a similar costs.

Other hostings could provide a V100 a lower cost.

Measurements

CPU

ModelTool usedMachineQueryNumber of Parallel QueriesAverage tokens/secondsTotal TimeTime per Query% GPU and RAM Used% CPU and RAM UsedNotes
llama 7bllama.cpp 4-bit quantizationCPU1"What is the capital of France ?"1413s13sN/A400% CPUThis can probably parallelized with more CPUs
           
           
           

GPU

ModelTool usedMachineQueryNumber of Parallel QueriesAverage tokens/secondsTotal TimeTime per Query% GPU and RAM Used% CPU and RAM UsedNotes
llama 7bllama.cpp 4-bit quantizationGPU1"What is the capital of France ?"150 tokens / sec< 3 seconds< 3 seconds40% 10Gb400% around 10Gb

Times and memory to verify

Lower on GPU but high on CPU

There seems to be little room for parallelization on the GPU

mpt-7b-chatFastChat OpenAI API - FP16GPU1"What is the capital of France ?"153 tokens / sec2.17s2.17s80% 13Gb100% ?Alternating between high GPU and high CPU
mpt-7b-chatFastChat OpenAI API with shell script to run in // with 0 seconds sleep - FP16GPU1"What is the capital of France ?"413 tokens / sec10s10s80% 13GbuncheckedAlternating between high GPU and high CPU
mpt-7b-chatFastChat OpenAI API with shell script to run in // with 1 seconds sleep between calls - FP16GPU1"What is the capital of France ?"420 to 50 tokens / sec10s2s to 8s depending on call80-100% 13GbuncheckedAlternating between high GPU and high CPU
mpt-7b-chatFastChat OpenAI API with shell script to run in // with 1 seconds sleep between calls - FP16GPU1"What is the capital of France ?"105 to 10 tokens / sec30s10s to 23s depending on call80-100% 13GbuncheckedAlternating between high GPU and high CPU
           
           

GPU Costs and Theoritical Performance

  • NVidia Testa V100: ? tensors, 5120 CUDA cores,
  • NVidia Tesla H100 SXM, PCIe, NVL (similar to V100 but newer): 640 tensors, 7256 CUDA cores, 312 teraflops (unsure about this value)
  • NVidia Tesla A100 (different architecture, possible for different needs / Training): 432 Tensors, 6912 CUDA cores
    • Could be more efficient for inference/LLM. Tests would be required to verify this.

Cost Analysis

A first cost analysis which would require some review is available in the following spreadsheet: 

AI-Server-Costs.xlsx

A summary is that at this point running LLMs locally can be similar to the cost of ChatGPT using GPT-4 and more expensive than GPT-3 but with a much smaller model, which does not guarantee the same quality of results. Note that Microsoft also provides Azure OpenAI (including in Europe) at a similar pricing than OpenAI.

Extrapolation of costs with large models could make it much more expensive to run local LLMs, at least using rented software.

However the NVidia A100 could have significant more inference performance than the V100 at similar costs. According to some inference benchmark, there is a 2,5x speed increase (https://cdn-prod.scdn6.secure.raxcdn.com/static/media/DAM_521f2ecd-92a5-41ac-a46c-b3db24f03ca3.pdf). Other data gives 60% better performance. Data from NVidia says 7x performance for FP16 (https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/).

It also shows that running LLMs is highly energy-consuming and therefore environmental effects need to be measured.

It is unclear if the reason for this is either:

  • More effective Hardware architecture
  • Better software to make better use of the hardware
  • Access to lower cost infrastructure from Microsoft by OpenAI
  • Dumping by OpenAI/Microsoft

    An analysis of the actual costs of inference would be needed to check this. This article is an interesting source
    It stated in Feb 2023 that ChatGPT (for GPT-3) uses around 3617 HGX A100 server (28936 GPUs). This estimates the cost per query to 0.36 cents (for 2k tokens), which is roughly at cost for GPT-3.

"Our model is built from the ground up on a per-inference basis, but it lines up with Sam Altman’s tweet and an interview he did recently. We assume that OpenAI used a GPT-3 dense model architecture with a size of 175 billion parameters, hidden dimension of 16k, sequence length of 4k, average tokens per response of 2k, 15 responses per user, 13 million daily active users, FLOPS utilization rates 2x higher than FasterTransformer at <2000ms latency, int8 quantization, 50% hardware utilization rates due to purely idle time, and $1 cost per GPU hour."

Get Connected