Performance and Cost Measurements
The goal is to gather performance and cost measurements for running LLMs locally.
Available setups
- CPU1: VMware VM with 8 (or more) CPUs and 32 GB RAM, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
  - LocalAI CPU mode
    - gpt4all and other models
  - llama.cpp CPU mode
    - llama 7b and 13b
  - Cost: 100 Euros / month, based mostly on RAM
- GPU1: OVH Public Cloud with 1 NVIDIA Tesla V100S GPU (32 GB GPU RAM), 14 CPUs and 54 GB of RAM
  - FastChat
  - llama.cpp GPU mode
  - LocalAI GPU mode (not working yet)
  - Cost: 900 Euros / month, based mostly on GPU / GPU RAM; 1.8 Euros / hour
An NVIDIA A100 could possibly provide better performance at a similar cost.
Other hosting providers could offer a V100 at a lower cost.
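The two GPU1 prices can be compared with a quick break-even calculation. The sketch below assumes that 900 Euros is a flat monthly price and 1.8 Euros is a pay-as-you-go hourly price for the same machine. That reading is an assumption, and the function names are only illustrative.

```python
# Figures taken from the GPU1 entry above.
MONTHLY_FLAT_EUR = 900.0   # assumed to be a flat monthly price
HOURLY_EUR = 1.8           # assumed to be a pay-as-you-go price per hour


def break_even_hours(monthly_flat: float, hourly: float) -> float:
    """Hours of use per month above which the flat monthly price is cheaper."""
    return monthly_flat / hourly


def cheapest_billing(hours_used: float) -> str:
    """Return which billing mode costs less for a given number of hours per month."""
    hourly_total = hours_used * HOURLY_EUR
    return "hourly" if hourly_total < MONTHLY_FLAT_EUR else "monthly"


if __name__ == "__main__":
    hours = break_even_hours(MONTHLY_FLAT_EUR, HOURLY_EUR)
    print(f"Break-even: {hours:.0f} hours/month")
    for used in (160, 730):
        print(used, "hours/month ->", cheapest_billing(used))
```

Under these assumptions, a GPU used only during working hours would be cheaper on hourly billing, while an always-on inference server would be cheaper on the monthly price.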
Measurements
CPU
| Model | Tool used | Machine | Query | Number of parallel queries | Average tokens/second | Total time | Time per query | % GPU and GPU RAM used | % CPU and RAM used | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| llama 7b | llama.cpp 4-bit quantization | CPU1 | "What is the capital of France ?" | 1 | 4 | 13s | 13s | N/A | 400% CPU | This can probably be parallelized with more CPUs |
GPU
| Model | Tool used | Machine | Query | Number of parallel queries | Average tokens/second | Total time | Time per query | % GPU and GPU RAM used | % CPU and RAM used | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| llama 7b | llama.cpp 4-bit quantization | GPU1 | "What is the capital of France ?" | 1 | 50 | < 3s | < 3s | 40%, 10 GB | 400%, around 10 GB | Times and memory still to be verified. Usage is lower on the GPU but high on the CPU. There seems to be little room for parallelization on the GPU. |
| mpt-7b-chat | FastChat OpenAI API, FP16 | GPU1 | "What is the capital of France ?" | 1 | 53 | 2.17s | 2.17s | 80%, 13 GB | 100%, ? | Alternating between high GPU and high CPU usage |
| mpt-7b-chat | FastChat OpenAI API, FP16, called from a shell script running queries in parallel with 0 seconds sleep between calls | GPU1 | "What is the capital of France ?" | 4 | 13 | 10s | 10s | 80%, 13 GB | unchecked | Alternating between high GPU and high CPU usage |
| mpt-7b-chat | FastChat OpenAI API, FP16, called from a shell script running queries in parallel with 1 second sleep between calls | GPU1 | "What is the capital of France ?" | 4 | 20 to 50 | 10s | 2s to 8s depending on call | 80-100%, 13 GB | unchecked | Alternating between high GPU and high CPU usage |
| mpt-7b-chat | FastChat OpenAI API, FP16, called from a shell script running queries in parallel with 1 second sleep between calls | GPU1 | "What is the capital of France ?" | 10 | 5 to 10 | 30s | 10s to 23s depending on call | 80-100%, 13 GB | unchecked | Alternating between high GPU and high CPU usage |
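The parallel measurements were taken with a shell script that is not reproduced here. As a rough illustration of the same idea, the Python sketch below starts several queries a fixed delay apart and records the time and tokens/second of each call. It is a stand-in, not the original script: `run_staggered` and `fake_query` are hypothetical names, and `fake_query` only simulates a request. In a real run it would be replaced by a function that calls the server and returns the number of generated tokens.

```python
import threading
import time
from typing import Callable, List, Tuple


def run_staggered(query_fn: Callable[[], int], n_queries: int,
                  sleep_s: float) -> Tuple[float, List[Tuple[float, float]]]:
    """Start n_queries threads, sleep_s apart, and time each call.

    query_fn performs one request and returns the number of generated tokens.
    Returns (total wall time, [(seconds, tokens_per_second) per query]).
    """
    results: List[Tuple[float, float]] = [(0.0, 0.0)] * n_queries

    def worker(i: int) -> None:
        start = time.perf_counter()
        tokens = query_fn()
        elapsed = time.perf_counter() - start
        results[i] = (elapsed, tokens / elapsed if elapsed > 0 else 0.0)

    threads = []
    t0 = time.perf_counter()
    for i in range(n_queries):
        t = threading.Thread(target=worker, args=(i,))
        t.start()
        threads.append(t)
        if i < n_queries - 1:
            time.sleep(sleep_s)  # delay between launching consecutive calls
    for t in threads:
        t.join()
    return time.perf_counter() - t0, results


def fake_query() -> int:
    """Stand-in for a real API call: pretends to generate 20 tokens in 0.1 s."""
    time.sleep(0.1)
    return 20


if __name__ == "__main__":
    total, per_query = run_staggered(fake_query, n_queries=4, sleep_s=0.05)
    print(f"total: {total:.2f}s")
    for seconds, tps in per_query:
        print(f"{seconds:.2f}s, {tps:.0f} tokens/s")
```

Reporting per-call times separately from the total wall time matches the "Total time" and "Time per query" columns of the table.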
GPU Costs and Theoretical Performance
- NVIDIA Tesla V100: ? tensor cores, 5120 CUDA cores
  - Tensor performance: 125 teraflops - https://www.nvidia.com/en-us/data-center/v100/
- NVIDIA Tesla H100 SXM, PCIe, NVL (similar to V100 but newer): 640 tensor cores, 7256 CUDA cores, 312 teraflops (unsure about this value)
  - NVL FP64 Tensor Core: 134 teraflops
  - NVL FP16 Tensor Core: 3,958 teraflops
  - https://www.nvidia.com/en-us/data-center/h100/
- NVIDIA Tesla A100 (different architecture, possibly for different needs / training): 432 tensor cores, 6912 CUDA cores
  - Could be more efficient for inference/LLM. Tests would be required to verify this.
Cost Analysis
A first cost analysis, which still needs review, is available in the following spreadsheet:
In summary, running LLMs locally can currently cost about as much as ChatGPT using GPT-4, and more than GPT-3, while relying on a much smaller model that does not guarantee the same quality of results. Note that Microsoft also provides Azure OpenAI (including in Europe) at pricing similar to OpenAI's.
Extrapolating these costs to larger models could make running local LLMs much more expensive, at least on rented hardware.
However, the NVIDIA A100 could offer significantly better inference performance than the V100 at a similar cost. According to one inference benchmark, there is a 2.5x speed increase (https://cdn-prod.scdn6.secure.raxcdn.com/static/media/DAM_521f2ecd-92a5-41ac-a46c-b3db24f03ca3.pdf). Other data gives 60% better performance. Data from NVIDIA claims 7x performance for FP16 (https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/).
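To see what these speedup figures would mean for cost, the sketch below converts a monthly rental price and a sustained throughput into a cost per million generated tokens. It is a back-of-the-envelope illustration under strong assumptions: a 30-day month, the single-query 50 tokens/second measured for llama 7b on GPU1, the 900 Euros / month GPU1 price reused unchanged for the A100, and 100% utilization by default. Real throughput gains depend on the model and software and would need to be measured.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def eur_per_million_tokens(monthly_eur: float, tokens_per_s: float,
                           utilization: float = 1.0) -> float:
    """Rental cost per million generated tokens at a given average utilization."""
    tokens_per_month = tokens_per_s * SECONDS_PER_MONTH * utilization
    return monthly_eur / tokens_per_month * 1_000_000


if __name__ == "__main__":
    monthly_eur = 900.0       # GPU1 rental price from "Available setups"
    v100_tokens_per_s = 50.0  # llama 7b, single query, from the GPU table

    # Speedup factors quoted for the A100; 1.0 is the measured V100 baseline.
    scenarios = [("V100 measured", 1.0), ("A100 +60%", 1.6),
                 ("A100 2.5x", 2.5), ("A100 7x (FP16 claim)", 7.0)]
    for label, speedup in scenarios:
        cost = eur_per_million_tokens(monthly_eur, v100_tokens_per_s * speedup)
        print(f"{label:22s} {cost:6.2f} EUR per million tokens")
```

The `utilization` parameter matters as much as the speedup: a server that is idle half of the time doubles the cost per token.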
The cost analysis also shows that running LLMs is highly energy-consuming, so the environmental effects need to be measured.
It is unclear whether OpenAI's lower prices are due to:
- More effective hardware architecture
- Better software that makes better use of the hardware
- Access to lower-cost infrastructure from Microsoft by OpenAI
- Dumping by OpenAI/Microsoft
An analysis of the actual costs of inference would be needed to check this. This article is an interesting source.
It stated in Feb 2023 that ChatGPT (for GPT-3) uses around 3617 HGX A100 servers (28936 GPUs). It estimates the cost per query at 0.36 cents (for 2k tokens), which would mean GPT-3 is priced roughly at cost.
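The 0.36 cents figure can be roughly reproduced from the article's own numbers. The sketch below is a simplified check, not the article's model. It takes the GPU count stated above together with three assumptions from the article (13 million daily active users, 15 responses per user, $1 per GPU hour), and it assumes the whole fleet is billed 24 hours a day.

```python
# Inputs taken from the article's figures and stated assumptions.
gpus = 28936
servers = 3617
usd_per_gpu_hour = 1.0
daily_active_users = 13_000_000
responses_per_user = 15


def cost_per_query_cents(gpus: int, usd_per_gpu_hour: float,
                         daily_users: int, responses_per_user: int) -> float:
    """Daily GPU bill divided by the number of daily queries, in US cents."""
    daily_cost_usd = gpus * usd_per_gpu_hour * 24  # fleet billed around the clock
    daily_queries = daily_users * responses_per_user
    return daily_cost_usd / daily_queries * 100


if __name__ == "__main__":
    print("GPUs per server:", gpus // servers)
    cents = cost_per_query_cents(gpus, usd_per_gpu_hour,
                                 daily_active_users, responses_per_user)
    print(f"Cost per query: {cents:.2f} cents")
```

This gives 8 GPUs per HGX server and about 0.36 cents per query, consistent with the article's estimate.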
"Our model is built from the ground up on a per-inference basis, but it lines up with Sam Altman’s tweet and an interview he did recently. We assume that OpenAI used a GPT-3 dense model architecture with a size of 175 billion parameters, hidden dimension of 16k, sequence length of 4k, average tokens per response of 2k, 15 responses per user, 13 million daily active users, FLOPS utilization rates 2x higher than FasterTransformer at <2000ms latency, int8 quantization, 50% hardware utilization rates due to purely idle time, and $1 cost per GPU hour."