Performance and Costs Measurements

Last modified by Ludovic Dubost on 2023/06/07 14:39

The goal is to gather performance and cost measurements for running LLMs locally.

Available setups
Measurements
- CPU
- GPU
GPU Costs and Theoretical Performance
Cost Analysis

Available setups

CPU1: VMware VM with 8 (or more) CPUs and 32 GB RAM, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
- LocalAI CPU Mode
  - gpt4all and other models
- llama.cpp CPU mode
  - llama 7b and 13b
- Cost: 100 Euros / month based mostly on RAM
GPU1: OVH Public Cloud with 1 GPU NVIDIA Tesla V100S with 32 GB of GPU RAM, 14 CPUs and 54 GB of RAM
- FastChat
- llama.cpp GPU mode
- ~~LocalAI GPU mode (not working yet)~~
- Cost: 900 Euros / month based mostly on GPU / GPU RAM, 1,8 Euros / hour

An NVIDIA A100 could possibly provide better performance at a similar cost.

Other hosting providers could offer a V100 at a lower cost.
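
The two prices quoted for GPU1 can be compared with a small break-even calculation. This is only a sketch: it assumes that the 900 Euros / month and 1.8 Euros / hour figures are two alternative billing modes for the same machine, which the figures above do not state explicitly. The function names and the 730 hours per month constant are illustrative.

```python
# Assumption: an average month has 365 * 24 / 12 = 730 hours.
HOURS_PER_MONTH = 730


def break_even_hours(monthly_eur: float, hourly_eur: float) -> float:
    """Hours of use per month at which hourly billing costs as much as monthly billing."""
    return monthly_eur / hourly_eur


def cheaper_option(hours_used: float,
                   monthly_eur: float = 900.0,
                   hourly_eur: float = 1.8) -> str:
    """Return which billing mode is cheaper for a given number of hours used."""
    return "hourly" if hours_used * hourly_eur < monthly_eur else "monthly"


if __name__ == "__main__":
    hours = break_even_hours(900.0, 1.8)
    share = hours / HOURS_PER_MONTH
    print(f"Break-even: {hours:.0f} hours/month ({share:.0%} of the month)")
```

Under these assumptions the break-even point is about 500 hours per month, so hourly billing would only pay off for a server that is switched off roughly a third of the time.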

Measurements

CPU

Model	Tool used	Machine	Query	Number of Parallel Queries	Average tokens/second	Total Time	Time per Query	% GPU and RAM Used	% CPU and RAM Used	Notes
llama 7b	llama.cpp 4-bit quantization	CPU1	"What is the capital of France ?"	1	4 tokens / sec	13s	13s	N/A	400% CPU	This can probably be parallelized with more CPUs

GPU

Model	Tool used	Machine	Query	Number of Parallel Queries	Average tokens/second	Total Time	Time per Query	% GPU and RAM Used	% CPU and RAM Used	Notes
llama 7b	llama.cpp 4-bit quantization	GPU1	"What is the capital of France ?"	1	50 tokens / sec	< 3 seconds	< 3 seconds	40% 10 GB	400% around 10 GB	Times and memory to verify. Lower on GPU but high on CPU. There seems to be little room for parallelization on the GPU.
mpt-7b-chat	FastChat OpenAI API - FP16	GPU1	"What is the capital of France ?"	1	53 tokens / sec	2.17s	2.17s	80% 13 GB	100% ?	Alternating between high GPU and high CPU
mpt-7b-chat	FastChat OpenAI API with shell script to run in parallel with 0 seconds sleep - FP16	GPU1	"What is the capital of France ?"	4	13 tokens / sec	10s	10s	80% 13 GB	unchecked	Alternating between high GPU and high CPU
mpt-7b-chat	FastChat OpenAI API with shell script to run in parallel with 1 second sleep between calls - FP16	GPU1	"What is the capital of France ?"	4	20 to 50 tokens / sec	10s	2s to 8s depending on call	80-100% 13 GB	unchecked	Alternating between high GPU and high CPU
mpt-7b-chat	FastChat OpenAI API with shell script to run in parallel with 1 second sleep between calls - FP16	GPU1	"What is the capital of France ?"	10	5 to 10 tokens / sec	30s	10s to 23s depending on call	80-100% 13 GB	unchecked	Alternating between high GPU and high CPU
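
The parallel runs above were driven by a shell script that is not reproduced on this page. As a rough illustration of the same idea, the sketch below launches several queries with a configurable delay between starts and reports per-query and aggregate tokens per second. The `send` callable is hypothetical: it stands for whatever code posts the prompt to the server and returns the number of generated tokens. All function names are illustrative, not part of FastChat or llama.cpp.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def timed_call(send: Callable[[str], int], prompt: str, delay_s: float) -> Tuple[int, float]:
    """Wait delay_s, then run one query and return (tokens generated, seconds taken)."""
    time.sleep(delay_s)
    start = time.perf_counter()
    n_tokens = send(prompt)  # hypothetical: performs the request, returns a token count
    return n_tokens, time.perf_counter() - start


def run_parallel(send: Callable[[str], int], prompt: str, n_queries: int,
                 stagger_s: float = 1.0) -> Tuple[List[Tuple[int, float]], float]:
    """Start n_queries queries, stagger_s seconds apart, and wait for all of them."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_queries) as pool:
        futures = [pool.submit(timed_call, send, prompt, i * stagger_s)
                   for i in range(n_queries)]
        results = [f.result() for f in futures]
    return results, time.perf_counter() - wall_start


def summarize(results: List[Tuple[int, float]], total_time_s: float) -> Dict[str, float]:
    """Compute the figures used in the table: total time and tokens per second."""
    per_query = [n / t for n, t in results]
    return {
        "total_time_s": total_time_s,
        "min_tokens_per_s": min(per_query),
        "max_tokens_per_s": max(per_query),
        "aggregate_tokens_per_s": sum(n for n, _ in results) / total_time_s,
    }


if __name__ == "__main__":
    def fake_send(prompt: str) -> int:
        """Stand-in for a real request: pretends to generate 10 tokens in 50 ms."""
        time.sleep(0.05)
        return 10

    results, total = run_parallel(fake_send, "What is the capital of France ?", 4, stagger_s=0.0)
    print(summarize(results, total))
```

To measure a real server, `fake_send` would be replaced by a function that calls the server's API and reads the completion token count from the response. The aggregate figure is the useful one for cost estimates, since the per-query rate drops as more queries share the GPU.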

GPU Costs and Theoretical Performance

NVIDIA Tesla V100: ? Tensor Cores, 5120 CUDA cores
- Tensor performance: 125 teraflops
- https://www.nvidia.com/en-us/data-center/v100/
NVIDIA Tesla H100 SXM, PCIe, NVL (similar to V100 but newer): 640 Tensor Cores, 7256 CUDA cores, 312 teraflops (unsure about this value)
- NVL FP64 Tensor Core: 134 teraflops
- NVL FP16 Tensor Core: 3,958 teraflops
- https://www.nvidia.com/en-us/data-center/h100/
NVIDIA Tesla A100 (different architecture, possibly for different needs / training): 432 Tensor Cores, 6912 CUDA cores
- Could be more efficient for inference/LLM. Tests would be required to verify this.

Cost Analysis

A first cost analysis, which still requires review, is available in the following spreadsheet:

AI-Server-Costs.xlsx

In summary, at this point the cost of running LLMs locally can be similar to that of ChatGPT using GPT-4 and higher than that of GPT-3, while using a much smaller model, which does not guarantee the same quality of results. Note that Microsoft also provides Azure OpenAI (including in Europe) at pricing similar to OpenAI's.
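
To make such comparisons easier to reproduce, the monthly costs and single-query throughputs from the tables above can be converted into a cost per million generated tokens. This is a minimal sketch, not the method used in the spreadsheet: it assumes a 30-day month and a `utilization` factor (the share of time the machine is actually generating tokens), both of which are illustrative.

```python
# Assumption: a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 3600


def eur_per_million_tokens(monthly_eur: float, tokens_per_s: float,
                           utilization: float = 1.0) -> float:
    """Cost in Euros of one million generated tokens at a sustained throughput."""
    tokens_per_month = tokens_per_s * SECONDS_PER_MONTH * utilization
    return monthly_eur / tokens_per_month * 1_000_000


if __name__ == "__main__":
    # Figures taken from the setups and measurement tables on this page.
    print(f"GPU1 (900 EUR/month, 50 tokens/s): "
          f"{eur_per_million_tokens(900.0, 50.0):.2f} EUR per million tokens")
    print(f"CPU1 (100 EUR/month, 4 tokens/s): "
          f"{eur_per_million_tokens(100.0, 4.0):.2f} EUR per million tokens")
    # A machine that is busy only half of the time costs twice as much per token.
    print(f"GPU1 at 50% utilization: "
          f"{eur_per_million_tokens(900.0, 50.0, utilization=0.5):.2f} EUR per million tokens")
```

With full utilization, which is optimistic, this gives roughly 7 Euros per million tokens on GPU1 and close to 10 Euros on CPU1 for a 7b model. Real costs depend heavily on how busy the machine is kept.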

Extrapolating these costs to larger models suggests that running local LLMs could become much more expensive, at least when using rented hardware.

However, the NVIDIA A100 could have significantly more inference performance than the V100 at a similar cost. According to one inference benchmark, there is a 2.5x speed increase (https://cdn-prod.scdn6.secure.raxcdn.com/static/media/DAM_521f2ecd-92a5-41ac-a46c-b3db24f03ca3.pdf). Other data gives 60% better performance. Data from NVIDIA claims 7x performance for FP16 (https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/).

It also shows that running LLMs is highly energy-consuming and therefore environmental effects need to be measured.

It is unclear whether the lower cost of the OpenAI offering is due to:

- More effective hardware architecture
- Better software to make better use of the hardware
- Access to lower cost infrastructure from Microsoft by OpenAI
- Dumping by OpenAI/Microsoft

An analysis of the actual costs of inference would be needed to check this. This article is an interesting source. It stated in Feb 2023 that ChatGPT (for GPT-3) uses around 3617 HGX A100 servers (28936 GPUs). It estimates the cost per query at 0.36 cents (for 2k tokens), which is roughly at cost for GPT-3.

"Our model is built from the ground up on a per-inference basis, but it lines up with Sam Altman’s tweet and an interview he did recently. We assume that OpenAI used a GPT-3 dense model architecture with a size of 175 billion parameters, hidden dimension of 16k, sequence length of 4k, average tokens per response of 2k, 15 responses per user, 13 million daily active users, FLOPS utilization rates 2x higher than FasterTransformer at <2000ms latency, int8 quantization, 50% hardware utilization rates due to purely idle time, and $1 cost per GPU hour."
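
The figures quoted above can be cross-checked with simple arithmetic. The sketch below is a simplification, not the article's own model: it assumes 8 GPUs per HGX A100 server (consistent with 3617 servers and 28936 GPUs), that the whole fleet is paid for 24 hours a day at $1 per GPU hour, and that every daily active user sends 15 queries per day.

```python
SERVERS = 3617
GPUS_PER_SERVER = 8            # assumption, consistent with 28936 GPUs in total
USD_PER_GPU_HOUR = 1.0         # from the quoted assumptions
DAILY_ACTIVE_USERS = 13_000_000
RESPONSES_PER_USER = 15


def cost_per_query_cents(servers: int = SERVERS,
                         gpus_per_server: int = GPUS_PER_SERVER,
                         usd_per_gpu_hour: float = USD_PER_GPU_HOUR,
                         daily_users: int = DAILY_ACTIVE_USERS,
                         responses_per_user: int = RESPONSES_PER_USER) -> float:
    """Daily fleet cost divided by daily query count, in US cents."""
    gpus = servers * gpus_per_server
    daily_cost_usd = gpus * 24 * usd_per_gpu_hour
    daily_queries = daily_users * responses_per_user
    return daily_cost_usd / daily_queries * 100


if __name__ == "__main__":
    print(f"GPUs: {SERVERS * GPUS_PER_SERVER}")
    print(f"Cost per query: {cost_per_query_cents():.3f} cents")
```

Under these assumptions the fleet costs about $694,000 per day for 195 million queries, which gives about 0.36 cents per query and matches the figure reported from the article.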
