ggml llama: align structs for memory optimization on 64-bit platforms #7267

Open

GermanAizek wants to merge 2 commits into master

Conversation

GermanAizek
Contributor

  • ggml_type_traits_t (80 -> 72 bytes)
  • llama_batch (72 -> 64 bytes)
  • llama_model_params (56 -> 48 bytes)
  • hash_node (32 -> 24 bytes)
  • ggml_compute_state (32 -> 24 bytes)
  • gguf_tensor_info (88 -> 80 bytes)

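These size reductions come from reordering struct members so that narrow fields are grouped together instead of being interleaved with 8-byte-aligned pointers. Below is a minimal, hypothetical sketch (the structs are made up for illustration, not the actual ggml/llama.cpp definitions) of how reordering alone removes padding on a typical 64-bit (LP64) target:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical structs for illustration only, not the real ggml/llama types. */
struct unpacked {            /* narrow field first forces padding            */
    bool     flag;           /* 1 byte + 7 bytes padding before 'data'       */
    void    *data;           /* 8 bytes, must be 8-byte aligned              */
    int32_t  n;              /* 4 bytes + 4 bytes tail padding               */
};                           /* sizeof == 24 on x86_64                       */

struct reordered {           /* same members, widest first                   */
    void    *data;           /* 8 bytes                                      */
    int32_t  n;              /* 4 bytes                                      */
    bool     flag;           /* 1 byte + 3 bytes tail padding                */
};                           /* sizeof == 16 on x86_64                       */

int main(void) {
    printf("unpacked: %zu bytes, reordered: %zu bytes\n",
           sizeof(struct unpacked), sizeof(struct reordered));
    return 0;
}
```

Each struct in the list above shrinks by exactly 8 bytes, which is consistent with removing one pointer-sized padding slot per struct in this way.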

github-actions bot commented May 13, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 539 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8655.79ms p(95)=21820.54ms fails=, finish reason: stop=479 truncated=60
  • Prompt processing (pp): avg=113.8tk/s p(95)=552.59tk/s
  • Token generation (tg): avg=47.06tk/s p(95)=45.86tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=afad05d15c3ee4d1339640840b98e036796a00ff

[Charts: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 539 iterations]

@slaren
Collaborator

slaren commented May 14, 2024

The changes to the llama.h public structs are effectively an API-breaking change for no real benefit. The other structs are less sensitive since they are internal to ggml, but I don't see how this is worth the risk of introducing bugs.
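To make the compatibility concern concrete: reordering the fields of a struct declared in a public header changes every field's offset, so anything built against the old layout (FFI bindings, plugins, previously compiled objects) keeps using the old offsets and silently reads the wrong values, and source that relies on positional aggregate initialization can also break. A small sketch with hypothetical stand-in structs (not the real llama.h definitions):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for a public parameter struct, before and after a
 * reorder done purely to shrink padding. The field names are illustrative. */
struct params_v1 {
    bool         use_mmap;      /* offset 0                     */
    const float *tensor_split;  /* offset 8 (after 7-byte pad)  */
    uint32_t     seed;          /* offset 16                    */
};                              /* sizeof == 24 on x86_64       */

struct params_v2 {              /* same fields, reordered       */
    const float *tensor_split;  /* offset 0                     */
    uint32_t     seed;          /* offset 8                     */
    bool         use_mmap;      /* offset 12                    */
};                              /* sizeof == 16 on x86_64       */

int main(void) {
    /* Every offset moves, which is the ABI-breaking part even when freshly
     * recompiled code using designated initializers is unaffected. */
    printf("use_mmap offset: v1=%zu v2=%zu\n",
           offsetof(struct params_v1, use_mmap),
           offsetof(struct params_v2, use_mmap));
    printf("tensor_split offset: v1=%zu v2=%zu\n",
           offsetof(struct params_v1, tensor_split),
           offsetof(struct params_v2, tensor_split));
    return 0;
}
```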

@mofosyne added the refactoring and Review Complexity : Medium labels May 14, 2024
@USBhost

USBhost commented May 14, 2024

What I find interesting is that "sample time" has regressed by about 0.02 ms for me. Not that it means much, but it is still interesting.

Master:
llama_print_timings:      sample time =       7.35 ms /   100 runs   (    0.07 ms per token, 13607.29 tokens per second)
llama_print_timings:      sample time =       7.29 ms /   100 runs   (    0.07 ms per token, 13711.78 tokens per second)
llama_print_timings:      sample time =       7.30 ms /   100 runs   (    0.07 ms per token, 13704.26 tokens per second)
llama_print_timings:      sample time =       7.59 ms /   100 runs   (    0.08 ms per token, 13173.49 tokens per second)
llama_print_timings:      sample time =       7.24 ms /   100 runs   (    0.07 ms per token, 13821.70 tokens per second)

PR:
llama_print_timings:      sample time =       8.61 ms /   100 runs   (    0.09 ms per token, 11611.70 tokens per second)
llama_print_timings:      sample time =       8.49 ms /   100 runs   (    0.08 ms per token, 11773.02 tokens per second)
llama_print_timings:      sample time =       8.69 ms /   100 runs   (    0.09 ms per token, 11514.10 tokens per second)
llama_print_timings:      sample time =       8.59 ms /   100 runs   (    0.09 ms per token, 11646.87 tokens per second)
llama_print_timings:      sample time =       8.65 ms /   100 runs   (    0.09 ms per token, 11566.04 tokens per second)
llama_print_timings:      sample time =       8.58 ms /   100 runs   (    0.09 ms per token, 11650.94 tokens per second) 

@GermanAizek
Contributor Author

What I find interesting is that "sample time" has regressed by about 0.02 ms for me. Not that it means much, but it is still interesting.

Which compiler did you build with, and on which platform?

@USBhost

USBhost commented May 14, 2024

What I find interesting is that "sample time" has regressed by about 0.02 ms for me. Not that it means much, but it is still interesting.

Which compiler did you build with, and on which platform?

usbhost@IONA
------------
OS: Proxmox VE 8.2.2 x86_64
Host: Super Server 0123456789
Kernel: 6.8.4-3-pve
Uptime: 5 hours, 33 mins
Packages: 1133 (dpkg)
Shell: bash 5.2.15
Resolution: 1024x768
Terminal: /dev/pts/0
CPU: AMD EPYC 7F72 (48) @ 3.200GHz
GPU: NVIDIA RTX A6000
GPU: NVIDIA RTX A4000
GPU: NVIDIA RTX A4000
Memory: 2538MiB / 257591MiB

main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

./llama-bench -m /mnt/36TB/AI/llama-3-8B-Instruct-More-abliterated/ggml-model-f16.gguf -t 24 -r 20 -pg 512,128

Master:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512 | 58.32 ± 0.49 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | tg128 | 8.73 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512+tg128 | 27.00 ± 0.09 |

build: 9f77348 (2886)

PR:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512 | 59.02 ± 0.47 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | tg128 | 8.71 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512+tg128 | 26.89 ± 0.08 |

build: 0d5473cd (2887)

@GermanAizek
Contributor Author

GermanAizek commented May 14, 2024

@USBhost, I have:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
$ uname -a
Linux debian-laptop 6.1.0-21-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           6.7Gi       3.2Gi       1.2Gi        15Mi       2.6Gi       3.5Gi
Swap:          5.0Gi       2.4Gi       2.6Gi
$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 17
model name      : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
stepping        : 0
microcode       : 0x810100b
cpu MHz         : 1437.121
cache size      : 512 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso div0
bogomips        : 3992.26
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
$ glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon Vega 8 Graphics (raven, LLVM 15.0.6, DRM 3.49, 6.1.0-21-amd64) (0x15dd)
    Version: 22.3.6
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 524 MB, largest block: 524 MB
    VBO free aux. memory - total: 3340 MB, largest block: 3340 MB
    Texture free memory - total: 524 MB, largest block: 524 MB
    Texture free aux. memory - total: 3340 MB, largest block: 3340 MB
    Renderbuffer free memory - total: 524 MB, largest block: 524 MB
    Renderbuffer free aux. memory - total: 3340 MB, largest block: 3340 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 1024 MB
    Total available memory: 4438 MB
    Currently available dedicated video memory: 524 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon Vega 8 Graphics (raven, LLVM 15.0.6, DRM 3.49, 6.1.0-21-amd64)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 22.3.6
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.3.6
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.3.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

master:

$ ./simple ../../models/models/ggml-model-f16.gguf "Hello my name is"

...

main: n_len = 32, n_ctx = 2048, n_kv_req = 32

<s> Hello my name is Danielle. I am 21 years old. I am a full time student at a local community college. I am majoring

main: decoded 27 tokens in 877.91 s, speed: 0.03 t/s

llama_print_timings:        load time =   67847.49 ms
llama_print_timings:      sample time =       0.99 ms /    28 runs   (    0.04 ms per token, 28311.43 tokens per second)
llama_print_timings: prompt eval time =   32733.53 ms /     5 tokens ( 6546.71 ms per token,     0.15 tokens per second)
llama_print_timings:        eval time =  877862.89 ms /    27 runs   (32513.44 ms per token,     0.03 tokens per second)
llama_print_timings:       total time =  945758.49 ms /    32 tokens

PR:

$ ./simple ../../models/models/ggml-model-f16.gguf "Hello my name is"

...

main: n_len = 32, n_ctx = 2048, n_kv_req = 32

<s> Hello my name is Danielle. I am 21 years old. I am a full time student at a local community college. I am majoring

main: decoded 27 tokens in 878.38 s, speed: 0.03 t/s

llama_print_timings:        load time =   67512.84 ms
llama_print_timings:      sample time =       0.97 ms /    28 runs   (    0.03 ms per token, 28747.43 tokens per second)
llama_print_timings: prompt eval time =   32895.30 ms /     5 tokens ( 6579.06 ms per token,     0.15 tokens per second)
llama_print_timings:        eval time =  878320.06 ms /    27 runs   (32530.37 ms per token,     0.03 tokens per second)
llama_print_timings:       total time =  945892.14 ms /    32 tokens

@github-actions bot added the examples, server, and ggml labels May 20, 2024