GPU-Benchmarks-on-LLM-Inference

Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference? 🧐

Description

Use llama.cpp to benchmark LLaMA 3 inference speed on various GPUs rented from RunPod, as well as on a 13-inch M1 MacBook Air, a 14-inch M1 Max MacBook Pro, an M2 Ultra Mac Studio, and a 16-inch M3 Max MacBook Pro.

Overview

Average speed (tokens/s) when generating 1024 tokens with LLaMA 3 on each GPU. Higher is better.

| GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
| --- | --- | --- | --- | --- |
| 3070 8GB | 70.94 | OOM | OOM | OOM |
| 3080 10GB | 106.40 | OOM | OOM | OOM |
| 3080 Ti 12GB | 106.71 | OOM | OOM | OOM |
| 4070 Ti 12GB | 82.21 | OOM | OOM | OOM |
| 4080 16GB | 106.22 | 40.29 | OOM | OOM |
| RTX 4000 Ada 20GB | 58.59 | 20.85 | OOM | OOM |
| 3090 24GB | 111.74 | 46.51 | OOM | OOM |
| 4090 24GB | 127.74 | 54.34 | OOM | OOM |
| RTX 5000 Ada 32GB | 89.87 | 32.67 | OOM | OOM |
| 3090 24GB * 2 | 108.07 | 47.15 | 16.29 | OOM |
| 4090 24GB * 2 | 122.56 | 53.27 | 19.06 | OOM |
| RTX A6000 48GB | 102.22 | 40.25 | 14.58 | OOM |
| RTX 6000 Ada 48GB | 130.99 | 51.97 | 18.36 | OOM |
| A40 48GB | 88.95 | 33.95 | 12.08 | OOM |
| L40S 48GB | 113.60 | 43.42 | 15.31 | OOM |
| RTX 4000 Ada 20GB * 4 | 56.14 | 20.58 | 7.33 | OOM |
| A100 PCIe 80GB | 138.31 | 54.56 | 22.11 | OOM |
| A100 SXM 80GB | 133.38 | 53.18 | 24.33 | OOM |
| H100 PCIe 80GB | 144.49 | 67.79 | 25.01 | OOM |
| 3090 24GB * 4 | 104.94 | 46.40 | 16.89 | OOM |
| 4090 24GB * 4 | 117.61 | 52.69 | 18.83 | OOM |
| RTX 5000 Ada 32GB * 4 | 82.73 | 31.94 | 11.45 | OOM |
| 3090 24GB * 6 | 101.07 | 45.55 | 16.93 | 5.82 |
| 4090 24GB * 8 | 116.13 | 52.12 | 18.76 | 6.45 |
| RTX A6000 48GB * 4 | 93.73 | 38.87 | 14.32 | 4.74 |
| RTX 6000 Ada 48GB * 4 | 118.99 | 50.25 | 17.96 | 6.06 |
| A40 48GB * 4 | 83.79 | 33.28 | 11.91 | 3.98 |
| L40S 48GB * 4 | 105.72 | 42.48 | 14.99 | 5.03 |
| A100 PCIe 80GB * 4 | 117.30 | 51.54 | 22.68 | 7.38 |
| A100 SXM 80GB * 4 | 97.70 | 45.45 | 19.60 | 6.92 |
| H100 PCIe 80GB * 4 | 118.14 | 62.90 | 26.20 | 9.63 |
| M1 7-Core GPU 8GB | 9.72 | OOM | OOM | OOM |
| M1 Max 32-Core GPU 64GB | 34.49 | 18.43 | 4.09 | OOM |
| M2 Ultra 76-Core GPU 192GB | 76.28 | 36.25 | 12.13 | 4.71 |
| M3 Max 40-Core GPU 64GB | 50.74 | 22.39 | 7.53 | OOM |

Average prompt-evaluation speed (tokens/s) for a 1024-token prompt with LLaMA 3 on each GPU. Higher is better.

| GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
| --- | --- | --- | --- | --- |
| 3070 8GB | 2283.62 | OOM | OOM | OOM |
| 3080 10GB | 3557.02 | OOM | OOM | OOM |
| 3080 Ti 12GB | 3556.67 | OOM | OOM | OOM |
| 4070 Ti 12GB | 3653.07 | OOM | OOM | OOM |
| 4080 16GB | 5064.99 | 6758.90 | OOM | OOM |
| RTX 4000 Ada 20GB | 2310.53 | 2951.87 | OOM | OOM |
| 3090 24GB | 3865.39 | 4239.64 | OOM | OOM |
| 4090 24GB | 6898.71 | 9056.26 | OOM | OOM |
| RTX 5000 Ada 32GB | 4467.46 | 5835.41 | OOM | OOM |
| 3090 24GB * 2 | 4004.14 | 4690.50 | 393.89 | OOM |
| 4090 24GB * 2 | 8545.00 | 11094.51 | 905.38 | OOM |
| RTX A6000 48GB | 3621.81 | 4315.18 | 466.82 | OOM |
| RTX 6000 Ada 48GB | 5560.94 | 6205.44 | 547.03 | OOM |
| A40 48GB | 3240.95 | 4043.05 | 239.92 | OOM |
| L40S 48GB | 5908.52 | 2491.65 | 649.08 | OOM |
| RTX 4000 Ada 20GB * 4 | 3369.24 | 4366.64 | 306.44 | OOM |
| A100 PCIe 80GB | 5800.48 | 7504.24 | 726.65 | OOM |
| A100 SXM 80GB | 5863.92 | 681.47 | 796.81 | OOM |
| H100 PCIe 80GB | 7760.16 | 10342.63 | 984.06 | OOM |
| 3090 24GB * 4 | 4653.93 | 5713.41 | 350.06 | OOM |
| 4090 24GB * 4 | 9609.29 | 12304.19 | 898.17 | OOM |
| RTX 5000 Ada 32GB * 4 | 6530.78 | 2877.66 | 541.54 | OOM |
| 3090 24GB * 6 | 5153.05 | 5952.55 | 739.40 | 927.23 |
| 4090 24GB * 8 | 9706.82 | 11818.92 | 1336.26 | 1890.48 |
| RTX A6000 48GB * 4 | 5340.10 | 6448.85 | 539.20 | 792.23 |
| RTX 6000 Ada 48GB * 4 | 9679.55 | 12637.94 | 714.93 | 1270.39 |
| A40 48GB * 4 | 4841.98 | 5931.06 | 263.36 | 900.79 |
| L40S 48GB * 4 | 9008.27 | 2541.61 | 634.05 | 1478.83 |
| A100 PCIe 80GB * 4 | 8889.35 | 11670.74 | 978.06 | 1733.41 |
| A100 SXM 80GB * 4 | 7782.25 | 674.11 | 539.08 | 1834.16 |
| H100 PCIe 80GB * 4 | 11560.23 | 15612.81 | 1133.23 | 2420.10 |
| M1 7-Core GPU 8GB | 87.26 | OOM | OOM | OOM |
| M1 Max 32-Core GPU 64GB | 355.45 | 418.77 | 33.01 | OOM |
| M2 Ultra 76-Core GPU 192GB | 1023.89 | 1202.74 | 117.76 | 145.82 |
| M3 Max 40-Core GPU 64GB | 678.04 | 751.49 | 62.88 | OOM |

Model

Thanks to shawwn for LLaMA model weights (7B, 13B, 30B, 65B): llama-dl. Access LLaMA 2 from Meta AI. Access LLaMA 3 from Meta Llama 3 on Hugging Face or my Hugging Face repos: Xiongjie Dai.

Usage

Build

  • For NVIDIA GPUs, build with cuBLAS to get BLAS acceleration on the CUDA cores of your NVIDIA GPU (see the note on newer build flags after this list):

    !make clean && LLAMA_CUBLAS=1 make -j
  • For Apple Silicon, Metal is enabled by default:

    !make clean && make -j
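
Note that llama.cpp's build system changes over time; on recent versions the LLAMA_CUBLAS Makefile flag has been replaced by GGML_CUDA and CMake is the recommended path. A minimal sketch, assuming a current checkout (adjust the flags to match your tree; Metal remains on by default for Apple Silicon):

    # CUDA build with CMake (flag names used by newer llama.cpp versions)
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j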

Text Completion

Use the argument -ngl 0 to run inference on the CPU only, and -ngl 10000 to ensure that all layers are offloaded to the GPU.

!./main -ngl 10000 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf --color --temp 1.1 --repeat_penalty 1.1 -c 0 -n 1024 -e -s 0 -p """\
First Citizen:\n\n\
Before we proceed any further, hear me speak.\n\n\
\n\n\
All:\n\n\
Speak, speak.\n\n\
\n\n\
First Citizen:\n\n\
You are all resolved rather to die than to famish?\n\n\
\n\n\
All:\n\n\
Resolved. resolved.\n\n\
\n\n\
First Citizen:\n\n\
First, you know Caius Marcius is chief enemy to the people.\n\n\
\n\n\
All:\n\n\
We know't, we know't.\n\n\
\n\n\
First Citizen:\n\n\
Let us kill him, and we'll have corn at our own price. Is't a verdict?\n\n\
\n\n\
All:\n\n\
No more talking on't; let it be done: away, away!\n\n\
\n\n\
Second Citizen:\n\n\
One word, good citizens.\n\n\
\n\n\
First Citizen:\n\n\
We are accounted poor citizens, the patricians good. What authority surfeits on would relieve us: if they would yield us but the superfluity, \
while it were wholesome, we might guess they relieved us humanely; but they think we are too dear: the leanness that afflicts us, the object of \
our misery, is as an inventory to particularise their abundance; our sufferance is a gain to them Let us revenge this with our pikes, \
ere we become rakes: for the gods know I speak this in hunger for bread, not in thirst for revenge.\n\n\
\n\n\
"""

Note: For Apple Silicon, check the recommendedMaxWorkingSetSize value in the output to see how much memory can be allocated to the GPU while maintaining performance. Currently only about 70% of unified memory can be allocated to the GPU on a 32GB M1 Max, and roughly 78% is usable by the GPU on machines with more memory. (Source: https://developer.apple.com/videos/play/tech-talks/10580/?time=346) To utilize the whole memory, use -ngl 0 so that only the CPU is used for inference. (Thanks to: ggerganov/llama.cpp#1826)
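
On recent macOS releases the wired-memory limit can reportedly be raised with a sysctl so that more of the unified memory is available to Metal. A sketch, assuming macOS Sonoma or later exposes the iogpu.wired_limit_mb key (the value is in MiB and resets on reboot):

    # allow up to ~56 GB of a 64 GB machine to be wired for the GPU (assumption: Sonoma-era sysctl key)
    sudo sysctl iogpu.wired_limit_mb=57344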

Chat template for LLaMA 3 🦙🦙🦙

!./main -ngl 10000 -m ./models/8B-v3-instruct/ggml-model-Q4_K_M.gguf --color -c 0 -n -2 -e -s 0 --mirostat 2 -i --no-display-prompt --keep -1 \
-r '<|eot_id|>' -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
--in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
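
Newer llama.cpp builds can apply this template automatically in conversation mode, which avoids typing the special tokens by hand. A minimal sketch, assuming a recent version where the main binary is named llama-cli and ships a built-in llama3 chat template:

    # interactive chat with the built-in LLaMA 3 template applied automatically
    ./llama-cli -ngl 10000 -m ./models/8B-v3-instruct/ggml-model-Q4_K_M.gguf -cnv --chat-template llama3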

Benchmark

!./llama-bench -p 512,1024,4096,8192 -n 512,1024,4096,8192 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf
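
llama-bench prints one row per prompt/generation length. For a quicker sanity check, or a CPU-only baseline, the sweep can be narrowed; a sketch using flags from current llama.cpp (the sizes and repetition count here are arbitrary):

    # one short prompt-processing and one short generation test, averaged over 3 runs, CPU only
    ./llama-bench -m ./models/8B-v3/ggml-model-Q4_K_M.gguf -p 512 -n 128 -r 3 -ngl 0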

Total VRAM Requirements

| Model | Quantized size (Q4_K_M) | Original size (F16) |
| --- | --- | --- |
| 8B | 4.58 GiB | 14.96 GiB |
| 70B | 39.59 GiB | 131.42 GiB |

You can estimate the VRAM requirement with this tool: LLM RAM Calculator
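
As a rough cross-check, the file sizes above follow parameters × bits-per-weight ÷ 8, and actual VRAM use adds the KV cache plus some runtime overhead on top. A back-of-envelope sketch (the parameter count and bits-per-weight figures are approximate assumptions):

    # LLaMA 3 70B: ~70.6e9 parameters; Q4_K_M is roughly 4.85 bits/weight, F16 is 16
    awk 'BEGIN { p = 70.6e9;
                 printf "Q4_K_M: %.1f GiB\n", p * 4.85 / 8 / 2^30;
                 printf "F16:    %.1f GiB\n", p * 16   / 8 / 2^30 }'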

Perplexity table on LLaMA 3 70B

Lower perplexity is better. (Credit: dranger003)

| Quantization | Size (GiB) | Perplexity (wiki.test) | Delta (FP16) |
| --- | --- | --- | --- |
| IQ1_S | 14.29 | 9.8655 +/- 0.0625 | 248.51% |
| IQ1_M | 15.60 | 8.5193 +/- 0.0530 | 201.94% |
| IQ2_XXS | 17.79 | 6.6705 +/- 0.0405 | 135.64% |
| IQ2_XS | 19.69 | 5.7486 +/- 0.0345 | 103.07% |
| IQ2_S | 20.71 | 5.5215 +/- 0.0318 | 95.05% |
| Q2_K_S | 22.79 | 5.4334 +/- 0.0325 | 91.94% |
| IQ2_M | 22.46 | 4.8959 +/- 0.0276 | 72.35% |
| Q2_K | 24.56 | 4.7763 +/- 0.0274 | 68.73% |
| IQ3_XXS | 25.58 | 3.9671 +/- 0.0211 | 40.14% |
| IQ3_XS | 27.29 | 3.7210 +/- 0.0191 | 31.45% |
| Q3_K_S | 28.79 | 3.6502 +/- 0.0192 | 28.95% |
| IQ3_S | 28.79 | 3.4698 +/- 0.0174 | 22.57% |
| IQ3_M | 29.74 | 3.4402 +/- 0.0171 | 21.53% |
| Q3_K_M | 31.91 | 3.3617 +/- 0.0172 | 18.75% |
| Q3_K_L | 34.59 | 3.3016 +/- 0.0168 | 16.63% |
| IQ4_XS | 35.30 | 3.0310 +/- 0.0149 | 7.07% |
| IQ4_NL | 37.30 | 3.0261 +/- 0.0149 | 6.90% |
| Q4_K_S | 37.58 | 3.0050 +/- 0.0148 | 6.15% |
| Q4_K_M | 39.60 | 2.9674 +/- 0.0146 | 4.83% |
| Q5_K_S | 45.32 | 2.8843 +/- 0.0141 | 1.89% |
| Q5_K_M | 46.52 | 2.8656 +/- 0.0139 | 1.23% |
| Q6_K | 53.91 | 2.8441 +/- 0.0138 | 0.47% |
| Q8_0 | 69.83 | 2.8316 +/- 0.0138 | 0.03% |
| F16 | 131.43 | 2.8308 +/- 0.0138 | 0.00% |

Benchmarks

TG means "text-generation," and PP means "prompt processing." # for total generated/processing tokens. OOM means out of memory. Average speed in tokens/s.

LLaMA 3 🦙🦙🦙:

NVIDIA Gaming GPUs (OS: Ubuntu 22.04.2 LTS, PyTorch 2.2.0, Python 3.10, CUDA 12.1.1 on RunPod; snapshots from May 2024)

| GPU | Model | tg 512 | tg 1024 | tg 4096 | tg 8192 | pp 512 | pp 1024 | pp 4096 | pp 8192 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3070 8GB | 8B Q4_K_M | 72.79 | 70.94 | 67.01 | 61.64 | 2402.51 | 2283.62 | 1826.59 | 1419.97 |
| 3080 10GB | 8B Q4_K_M | 109.57 | 106.40 | 98.67 | 89.90 | 3728.86 | 3557.02 | 2852.06 | 2232.21 |
| 3080 Ti 12GB | 8B Q4_K_M | 110.60 | 106.71 | 98.34 | 88.63 | 3690.30 | 3556.67 | 2947.11 | 2381.52 |
| 4070 Ti 12GB | 8B Q4_K_M | 83.50 | 82.21 | 78.59 | 73.46 | 3936.29 | 3653.07 | 2729.71 | 2019.71 |
| 4080 16GB | 8B Q4_K_M | 108.15 | 106.22 | 100.44 | 93.71 | 5389.74 | 5064.99 | 3790.96 | 2882.03 |
| | 8B F16 | 40.58 | 40.29 | 39.44 | OOM | 7246.97 | 6758.90 | 4720.22 | OOM |
| 3090 24GB | 8B Q4_K_M | 115.42 | 111.74 | 97.31 | 87.49 | 4030.40 | 3865.39 | 3169.91 | 2527.40 |
| | 8B F16 | 47.40 | 46.51 | 44.79 | 42.62 | 4444.65 | 4239.64 | 3410.47 | 2667.14 |
| 4090 24GB | 8B Q4_K_M | 130.58 | 127.74 | 119.44 | 110.66 | 7138.99 | 6898.71 | 5265.68 | 4039.68 |
| | 8B F16 | 54.84 | 54.34 | 52.63 | 50.88 | 9382.00 | 9056.26 | 6531.36 | 4744.18 |
| 3090 24GB * 2 | 8B Q4_K_M | 111.67 | 108.07 | 99.60 | 90.77 | 3336.37 | 4004.14 | 4013.34 | 3433.59 |
| | 8B F16 | 47.72 | 47.15 | 45.56 | 43.61 | 4122.66 | 4690.50 | 4788.60 | 3851.37 |
| | 70B Q4_K_M | 16.57 | 16.29 | 15.36 | 14.34 | 357.32 | 393.89 | 379.52 | 338.82 |
| 4090 24GB * 2 | 8B Q4_K_M | 124.65 | 122.56 | 114.32 | 106.18 | 7003.51 | 8545.00 | 8422.04 | 6895.68 |
| | 8B F16 | 53.64 | 53.27 | 51.64 | 49.83 | 9177.92 | 11094.51 | 10329.29 | 8067.29 |
| | 70B Q4_K_M | 19.22 | 19.06 | 18.54 | 17.92 | 839.43 | 905.38 | 846.38 | 723.24 |
| 3090 24GB * 4 | 8B Q4_K_M | 108.66 | 104.94 | 97.09 | 88.35 | 3742.66 | 4653.93 | 5826.91 | 4913.40 |
| | 8B F16 | 47.07 | 46.40 | 44.76 | 42.81 | 4608.40 | 5713.41 | 6596.17 | 5361.52 |
| | 70B Q4_K_M | 17.07 | 16.89 | 16.24 | 15.39 | 300.79 | 350.06 | 367.75 | 331.37 |
| 4090 24GB * 4 | 8B Q4_K_M | 120.32 | 117.61 | 110.52 | 103.13 | 6748.96 | 9609.29 | 12491.10 | 10993.75 |
| | 8B F16 | 53.10 | 52.69 | 51.00 | 49.21 | 8750.57 | 12304.19 | 15143.84 | 12919.74 |
| | 70B Q4_K_M | 19.80 | 18.83 | 18.35 | 17.66 | 834.74 | 898.17 | 839.97 | 718.01 |
| 3090 24GB * 6 | 8B Q4_K_M | 104.17 | 101.07 | 94.06 | 85.93 | 3359.99 | 5153.05 | 7690.65 | 7084.44 |
| | 8B F16 | 46.23 | 45.55 | 43.99 | 42.15 | 3875.97 | 5952.55 | 9437.91 | 8780.49 |
| | 70B Q4_K_M | 17.09 | 16.93 | 16.32 | 15.45 | 456.95 | 739.40 | 786.79 | 695.44 |
| | 70B F16 | 5.85 | 5.82 | 5.76 | 5.53 | 579.00 | 927.23 | 998.79 | 813.99 |
| 4090 24GB * 8 | 8B Q4_K_M | 118.09 | 116.13 | 108.37 | 100.95 | 6172.06 | 9706.82 | 15089.45 | 13802.08 |
| | 8B F16 | 52.51 | 52.12 | 50.39 | 48.72 | 7889.26 | 11818.92 | 16462.18 | 14300.98 |
| | 70B Q4_K_M | 18.94 | 18.76 | 18.23 | 17.57 | 812.95 | 1336.26 | 1488.36 | 1320.36 |
| | 70B F16 | 6.47 | 6.45 | 6.39 | 6.31 | 1183.87 | 1890.48 | 2311.43 | 1995.85 |

NVIDIA Professional GPUs (OS: Ubuntu 22.04.2 LTS, PyTorch 2.2.0, Python 3.10, CUDA 12.1.1 on RunPod; snapshots from May 2024)

| GPU | Model | tg 512 | tg 1024 | tg 4096 | tg 8192 | pp 512 | pp 1024 | pp 4096 | pp 8192 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RTX 4000 Ada 20GB | 8B Q4_K_M | 59.15 | 58.59 | 55.94 | 52.39 | 2451.93 | 2310.53 | 1798.01 | 1337.15 |
| | 8B F16 | 20.92 | 20.85 | 20.50 | 20.01 | 3121.67 | 2951.87 | 2200.58 | 1557.00 |
| RTX 5000 Ada 32GB | 8B Q4_K_M | 91.39 | 89.87 | 85.01 | 80.00 | 4761.12 | 4467.46 | 3272.94 | 2422.33 |
| | 8B F16 | 32.84 | 32.67 | 32.04 | 31.27 | 6160.57 | 5835.41 | 4008.30 | 2808.89 |
| RTX A6000 48GB | 8B Q4_K_M | 105.39 | 102.22 | 94.82 | 86.73 | 3780.55 | 3621.81 | 2917.23 | 2292.61 |
| | 8B F16 | 40.71 | 40.25 | 39.14 | 37.73 | 4511.02 | 4315.18 | 3365.79 | 2566.46 |
| | 70B Q4_K_M | 14.71 | 14.58 | 14.09 | 13.42 | 482.19 | 466.82 | 404.61 | 340.73 |
| RTX 6000 Ada 48GB | 8B Q4_K_M | 133.44 | 130.99 | 120.74 | 111.57 | 5791.74 | 5560.94 | 4495.19 | 3542.57 |
| | 8B F16 | 52.32 | 51.97 | 50.21 | 48.79 | 6663.13 | 6205.44 | 4969.46 | 3915.81 |
| | 70B Q4_K_M | 18.52 | 18.36 | 17.80 | 16.97 | 565.98 | 547.03 | 481.59 | 419.76 |
| A40 48GB | 8B Q4_K_M | 91.27 | 88.95 | 83.10 | 76.45 | 3324.98 | 3240.95 | 2586.50 | 2013.34 |
| | 8B F16 | 34.26 | 33.95 | 33.06 | 31.93 | 4203.75 | 4043.05 | 3069.98 | 2295.02 |
| | 70B Q4_K_M | 11.60 | 12.08 | 11.68 | 11.26 | 209.38 | 239.92 | 268.89 | 291.13 |
| L40S 48GB | 8B Q4_K_M | 115.55 | 113.60 | 105.50 | 97.98 | 6035.24 | 5908.52 | 4335.18 | 3192.70 |
| | 8B F16 | 43.69 | 43.42 | 42.22 | 41.05 | 2253.93 | 2491.65 | 2887.70 | 3312.16 |
| | 70B Q4_K_M | 15.46 | 15.31 | 14.92 | 14.45 | 673.63 | 649.08 | 542.29 | 446.48 |
| RTX 4000 Ada 20GB * 4 | 8B Q4_K_M | 56.64 | 56.14 | 53.58 | 50.19 | 2413.07 | 3369.24 | 4404.45 | 3733.15 |
| | 8B F16 | 20.65 | 20.58 | 20.24 | 19.74 | 3220.21 | 4366.64 | 5366.39 | 4323.70 |
| | 70B Q4_K_M | 7.36 | 7.33 | 7.12 | 6.84 | 282.28 | 306.44 | 290.70 | 243.45 |
| A100 PCIe 80GB | 8B Q4_K_M | 140.62 | 138.31 | 127.22 | 117.60 | 5981.04 | 5800.48 | 4959.84 | 4083.37 |
| | 8B F16 | 54.84 | 54.56 | 53.02 | 51.24 | 7741.34 | 7504.24 | 6137.54 | 4849.11 |
| | 70B Q4_K_M | 22.31 | 22.11 | 20.93 | 19.53 | 744.12 | 726.65 | 653.20 | 573.95 |
| A100 SXM 80GB | 8B Q4_K_M | 135.04 | 133.38 | 125.09 | 115.92 | 5947.64 | 5863.92 | 5121.60 | 4137.08 |
| | 8B F16 | 53.49 | 53.18 | 52.03 | 50.52 | 603.76 | 681.47 | 866.13 | 1323.07 |
| | 70B Q4_K_M | 24.61 | 24.33 | 22.91 | 21.32 | 817.58 | 796.81 | 714.07 | 625.66 |
| H100 PCIe 80GB | 8B Q4_K_M | 145.55 | 144.49 | 136.06 | 126.83 | 8125.45 | 7760.16 | 6423.31 | 5185.03 |
| | 8B F16 | 68.03 | 67.79 | 65.97 | 63.55 | 10815.51 | 10342.63 | 8106.53 | 6191.45 |
| | 70B Q4_K_M | 25.03 | 25.01 | 23.82 | 22.39 | 1012.73 | 984.06 | 863.37 | 741.52 |
| RTX 5000 Ada 32GB * 4 | 8B Q4_K_M | 84.07 | 82.73 | 78.45 | 74.11 | 4671.34 | 6530.78 | 8004.94 | 6790.82 |
| | 8B F16 | 32.10 | 31.94 | 31.32 | 30.58 | 2427.96 | 2877.66 | 3836.89 | 5235.00 |
| | 70B Q4_K_M | 11.51 | 11.45 | 11.24 | 10.94 | 502.37 | 541.54 | 504.23 | 424.29 |
| RTX A6000 48GB * 4 | 8B Q4_K_M | 96.48 | 93.73 | 87.72 | 80.88 | 3712.99 | 5340.10 | 7126.45 | 6438.82 |
| | 8B F16 | 39.34 | 38.87 | 37.81 | 36.51 | 4508.60 | 6448.85 | 8327.16 | 7298.18 |
| | 70B Q4_K_M | 14.44 | 14.32 | 13.91 | 13.32 | 496.08 | 539.20 | 511.22 | 434.31 |
| | 70B F16 | 4.76 | 4.74 | 4.70 | 4.63 | 510.31 | 792.23 | 751.37 | 748.06 |
| RTX 6000 Ada 48GB * 4 | 8B Q4_K_M | 121.21 | 118.99 | 110.65 | 103.18 | 6640.86 | 9679.55 | 11734.85 | 10278.14 |
| | 8B F16 | 50.61 | 50.25 | 48.69 | 47.18 | 8953.30 | 12637.94 | 13971.34 | 11702.36 |
| | 70B Q4_K_M | 18.13 | 17.96 | 17.49 | 16.89 | 656.61 | 714.93 | 697.10 | 612.54 |
| | 70B F16 | 6.08 | 6.06 | 6.01 | 5.94 | 864.12 | 1270.39 | 1363.75 | 1182.28 |
| A40 48GB * 4 | 8B Q4_K_M | 85.91 | 83.79 | 78.56 | 72.70 | 3321.27 | 4841.98 | 6442.38 | 5742.84 |
| | 8B F16 | 33.60 | 33.28 | 32.42 | 31.38 | 4144.88 | 5931.06 | 7544.92 | 6516.60 |
| | 70B Q4_K_M | 11.99 | 11.91 | 11.60 | 11.17 | 236.86 | 263.36 | 300.57 | 312.31 |
| | 70B F16 | 3.99 | 3.98 | 3.95 | 3.90 | 610.51 | 900.79 | 893.28 | 735.16 |
| L40S 48GB * 4 | 8B Q4_K_M | 107.53 | 105.72 | 98.59 | 92.20 | 6125.69 | 9008.27 | 10566.97 | 9017.90 |
| | 8B F16 | 42.70 | 42.48 | 41.33 | 40.19 | 2211.45 | 2541.61 | 3093.33 | 4336.81 |
| | 70B Q4_K_M | 15.12 | 14.99 | 14.63 | 14.17 | 591.05 | 634.05 | 605.66 | 541.67 |
| | 70B F16 | 5.05 | 5.03 | 4.99 | 4.94 | 1042.13 | 1478.83 | 1427.77 | 1150.63 |
| A100 PCIe 80GB * 4 | 8B Q4_K_M | 119.28 | 117.30 | 110.75 | 103.87 | 6076.58 | 8889.35 | 12724.54 | 11803.39 |
| | 8B F16 | 51.63 | 51.54 | 50.20 | 48.73 | 8088.79 | 11670.74 | 16025.11 | 14269.17 |
| | 70B Q4_K_M | 22.91 | 22.68 | 21.41 | 19.96 | 771.28 | 978.06 | 1138.60 | 1043.15 |
| | 70B F16 | 7.40 | 7.38 | 7.23 | 7.06 | 1172.14 | 1733.41 | 1846.36 | 1592.37 |
| A100 SXM 80GB * 4 | 8B Q4_K_M | 99.73 | 97.70 | 92.09 | 86.27 | 4850.88 | 7782.25 | 12242.53 | 11535.66 |
| | 8B F16 | 45.53 | 45.45 | 44.33 | 43.09 | 626.75 | 674.11 | 1003.37 | 1612.05 |
| | 70B Q4_K_M | 19.87 | 19.60 | 18.48 | 17.19 | 468.86 | 539.08 | 712.08 | 802.23 |
| | 70B F16 | 6.95 | 6.92 | 6.77 | 6.58 | 1233.31 | 1834.16 | 1972.48 | 1699.56 |
| H100 PCIe 80GB * 4 | 8B Q4_K_M | 123.08 | 118.14 | 113.12 | 110.34 | 8054.58 | 11560.23 | 16128.27 | 14682.97 |
| | 8B F16 | 64.00 | 62.90 | 61.45 | 59.72 | 11107.40 | 15612.81 | 20561.03 | 17762.96 |
| | 70B Q4_K_M | 26.40 | 26.20 | 24.60 | 23.68 | 1048.29 | 1133.23 | 1088.99 | 950.92 |
| | 70B F16 | 9.67 | 9.63 | 9.46 | 9.23 | 1681.45 | 2420.10 | 2437.53 | 2031.77 |

Apple Silicon (snapshots in May 2024)

| GPU | Model | tg 512 | tg 1024 | tg 4096 | tg 8192 | pp 512 | pp 1024 | pp 4096 | pp 8192 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 7-Core GPU 8GB | 8B Q4_K_M | 10.20 | 9.72 | 11.77 | OOM | 94.48 | 87.26 | 96.53 | OOM |
| M1 Max 32-Core GPU 64GB | 8B Q4_K_M | 35.73 | 34.49 | 31.18 | 26.84 | 408.23 | 355.45 | 329.84 | 302.92 |
| | 8B F16 | 18.75 | 18.43 | 16.33 | 15.03 | 517.34 | 418.77 | 374.09 | 351.46 |
| | 70B Q4_K_M | 4.34 | 4.09 | 4.09 | 3.71 | 34.96 | 33.01 | 32.64 | 30.97 |
| M2 Ultra 76-Core GPU 192GB | 8B Q4_K_M | 78.81 | 76.28 | 64.58 | 54.13 | 994.04 | 1023.89 | 979.47 | 913.55 |
| | 8B F16 | 36.90 | 36.25 | 33.67 | 30.68 | 1175.40 | 1202.74 | 1194.21 | 1103.44 |
| | 70B Q4_K_M | 12.48 | 12.13 | 10.75 | 9.34 | 118.79 | 117.76 | 109.53 | 108.57 |
| | 70B F16 | 4.76 | 4.71 | 4.48 | 4.23 | 147.58 | 145.82 | 133.75 | 135.15 |
| M3 Max 40-Core GPU 64GB | 8B Q4_K_M | 48.97 | 50.74 | 44.21 | 36.12 | 693.32 | 678.04 | 573.09 | 505.32 |
| | 8B F16 | 22.04 | 22.39 | 20.72 | 18.74 | 769.84 | 751.49 | 609.97 | 515.15 |
| | 70B Q4_K_M | 7.65 | 7.53 | 6.58 | 5.60 | 70.19 | 62.88 | 64.90 | 61.96 |

Conclusion

Different GPUs deliver broadly similar performance for the same model size and quantization. Using multiple NVIDIA GPUs may slightly reduce text-generation speed, but it still boosts prompt-processing speed.
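
For reference, llama.cpp splits a model across visible GPUs automatically, but the split can be steered. A sketch assuming two visible CUDA devices and a hypothetical 70B model path (flags as in current llama.cpp):

    # restrict the run to two GPUs and split the layers evenly between them
    CUDA_VISIBLE_DEVICES=0,1 ./main -ngl 10000 -ts 1,1 -m ./models/70B-v3/ggml-model-Q4_K_M.gguf -n 128 -p "Hello"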

Buy NVIDIA gaming GPUs to save money. Buy professional GPUs for business workloads. Buy a Mac if you want a machine that sits on your desk, saves energy, stays quiet, needs little maintenance, and is more fun. 😇

If you find this information helpful, please give me a star. ⭐️ Feel free to contact me if you have any advice. Thank you. 🤗
