ggml llama: align structs for memory optimization on 64-bit platforms #7267

Open

GermanAizek wants to merge 2 commits into master

Conversation

GermanAizek
Contributor

  • ggml_type_traits_t (80 -> 72 bytes)
  • llama_batch (72 -> 64 bytes)
  • llama_model_params (56 -> 48 bytes)
  • hash_node (32 -> 24 bytes)
  • ggml_compute_state (32 -> 24 bytes)
  • gguf_tensor_info (88 -> 80 bytes)

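These size reductions come from reordering struct members so that narrow fields are grouped together instead of being interleaved with 8-byte-aligned pointers. Below is a minimal, hypothetical sketch (the structs are made up for illustration, not the actual ggml/llama.cpp definitions) of how reordering alone removes padding on a typical 64-bit (LP64) target:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical structs for illustration only, not the real ggml/llama types. */
struct unpacked {            /* narrow field first forces padding            */
    bool     flag;           /* 1 byte + 7 bytes padding before 'data'       */
    void    *data;           /* 8 bytes, must be 8-byte aligned              */
    int32_t  n;              /* 4 bytes + 4 bytes tail padding               */
};                           /* sizeof == 24 on x86_64                       */

struct reordered {           /* same members, widest first                   */
    void    *data;           /* 8 bytes                                      */
    int32_t  n;              /* 4 bytes                                      */
    bool     flag;           /* 1 byte + 3 bytes tail padding                */
};                           /* sizeof == 16 on x86_64                       */

int main(void) {
    printf("unpacked: %zu bytes, reordered: %zu bytes\n",
           sizeof(struct unpacked), sizeof(struct reordered));
    return 0;
}
```

Each struct in the list above shrinks by exactly 8 bytes, which is consistent with removing one pointer-sized padding slot per struct in this way.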

github-actions bot commented May 13, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 539 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8655.79ms p(95)=21820.54ms fails=, finish reason: stop=479 truncated=60
  • Prompt processing (pp): avg=113.8tk/s p(95)=552.59tk/s
  • Token generation (tg): avg=47.06tk/s p(95)=45.86tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=afad05d15c3ee4d1339640840b98e036796a00ff

[Charts: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 539 iterations]

@slaren
Collaborator

slaren commented May 14, 2024

The changes to the llama.h public structs are effectively an API-breaking change for no real benefit. The other structs are less sensitive since they are internal to ggml, but I don't see how this is worth the risk of introducing bugs.
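To make the compatibility concern concrete: reordering the fields of a struct declared in a public header changes every field's offset, so anything built against the old layout (FFI bindings, plugins, previously compiled objects) keeps using the old offsets and silently reads the wrong values, and source that relies on positional aggregate initialization can also break. A small sketch with hypothetical stand-in structs (not the real llama.h definitions):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for a public parameter struct, before and after a
 * reorder done purely to shrink padding. The field names are illustrative. */
struct params_v1 {
    bool         use_mmap;      /* offset 0                     */
    const float *tensor_split;  /* offset 8 (after 7-byte pad)  */
    uint32_t     seed;          /* offset 16                    */
};                              /* sizeof == 24 on x86_64       */

struct params_v2 {              /* same fields, reordered       */
    const float *tensor_split;  /* offset 0                     */
    uint32_t     seed;          /* offset 8                     */
    bool         use_mmap;      /* offset 12                    */
};                              /* sizeof == 16 on x86_64       */

int main(void) {
    /* Every offset moves, which is the ABI-breaking part even when freshly
     * recompiled code using designated initializers is unaffected. */
    printf("use_mmap offset: v1=%zu v2=%zu\n",
           offsetof(struct params_v1, use_mmap),
           offsetof(struct params_v2, use_mmap));
    printf("tensor_split offset: v1=%zu v2=%zu\n",
           offsetof(struct params_v1, tensor_split),
           offsetof(struct params_v2, tensor_split));
    return 0;
}
```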

@mofosyne added the refactoring and Review Complexity : Medium labels May 14, 2024
@USBhost

USBhost commented May 14, 2024

What I find interesting is that "sample time" has regressed by about 0.02 ms for me. Not that it means much, but it is still interesting.

Master:
llama_print_timings:      sample time =       7.35 ms /   100 runs   (    0.07 ms per token, 13607.29 tokens per second)
llama_print_timings:      sample time =       7.29 ms /   100 runs   (    0.07 ms per token, 13711.78 tokens per second)
llama_print_timings:      sample time =       7.30 ms /   100 runs   (    0.07 ms per token, 13704.26 tokens per second)
llama_print_timings:      sample time =       7.59 ms /   100 runs   (    0.08 ms per token, 13173.49 tokens per second)
llama_print_timings:      sample time =       7.24 ms /   100 runs   (    0.07 ms per token, 13821.70 tokens per second)

PR:
llama_print_timings:      sample time =       8.61 ms /   100 runs   (    0.09 ms per token, 11611.70 tokens per second)
llama_print_timings:      sample time =       8.49 ms /   100 runs   (    0.08 ms per token, 11773.02 tokens per second)
llama_print_timings:      sample time =       8.69 ms /   100 runs   (    0.09 ms per token, 11514.10 tokens per second)
llama_print_timings:      sample time =       8.59 ms /   100 runs   (    0.09 ms per token, 11646.87 tokens per second)
llama_print_timings:      sample time =       8.65 ms /   100 runs   (    0.09 ms per token, 11566.04 tokens per second)
llama_print_timings:      sample time =       8.58 ms /   100 runs   (    0.09 ms per token, 11650.94 tokens per second) 

@GermanAizek
Contributor Author

What I find interesting is that "sample time" has regressed by about 0.02 ms for me. Not that it means much, but it is still interesting.

Which compiler did you build with, and on which platform?

@USBhost

USBhost commented May 14, 2024

What I find interesting is that "sample time" has regressed by about 0.02 ms for me. Not that it means much, but it is still interesting.

Which compiler did you build with, and on which platform?

usbhost@IONA
------------
OS: Proxmox VE 8.2.2 x86_64
Host: Super Server 0123456789
Kernel: 6.8.4-3-pve
Uptime: 5 hours, 33 mins
Packages: 1133 (dpkg)
Shell: bash 5.2.15
Resolution: 1024x768
Terminal: /dev/pts/0
CPU: AMD EPYC 7F72 (48) @ 3.200GHz
GPU: NVIDIA RTX A6000
GPU: NVIDIA RTX A4000
GPU: NVIDIA RTX A4000
Memory: 2538MiB / 257591MiB

main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

./llama-bench -m /mnt/36TB/AI/llama-3-8B-Instruct-More-abliterated/ggml-model-f16.gguf -t 24 -r 20 -pg 512,128

Master:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512 | 58.32 ± 0.49 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | tg128 | 8.73 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512+tg128 | 27.00 ± 0.09 |

build: 9f77348 (2886)

PR:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512 | 59.02 ± 0.47 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | tg128 | 8.71 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 24 | pp512+tg128 | 26.89 ± 0.08 |

build: 0d5473cd (2887)

@GermanAizek
Contributor Author

GermanAizek commented May 14, 2024

@USBhost, I have:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
$ uname -a
Linux debian-laptop 6.1.0-21-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           6.7Gi       3.2Gi       1.2Gi        15Mi       2.6Gi       3.5Gi
Swap:          5.0Gi       2.4Gi       2.6Gi
$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 17
model name      : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
stepping        : 0
microcode       : 0x810100b
cpu MHz         : 1437.121
cache size      : 512 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso div0
bogomips        : 3992.26
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]
$ glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon Vega 8 Graphics (raven, LLVM 15.0.6, DRM 3.49, 6.1.0-21-amd64) (0x15dd)
    Version: 22.3.6
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 524 MB, largest block: 524 MB
    VBO free aux. memory - total: 3340 MB, largest block: 3340 MB
    Texture free memory - total: 524 MB, largest block: 524 MB
    Texture free aux. memory - total: 3340 MB, largest block: 3340 MB
    Renderbuffer free memory - total: 524 MB, largest block: 524 MB
    Renderbuffer free aux. memory - total: 3340 MB, largest block: 3340 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 1024 MB
    Total available memory: 4438 MB
    Currently available dedicated video memory: 524 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon Vega 8 Graphics (raven, LLVM 15.0.6, DRM 3.49, 6.1.0-21-amd64)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 22.3.6
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.3.6
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.3.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

master:

$ ./simple ../../models/models/ggml-model-f16.gguf "Hello my name is"

...

main: n_len = 32, n_ctx = 2048, n_kv_req = 32

<s> Hello my name is Danielle. I am 21 years old. I am a full time student at a local community college. I am majoring

main: decoded 27 tokens in 877.91 s, speed: 0.03 t/s

llama_print_timings:        load time =   67847.49 ms
llama_print_timings:      sample time =       0.99 ms /    28 runs   (    0.04 ms per token, 28311.43 tokens per second)
llama_print_timings: prompt eval time =   32733.53 ms /     5 tokens ( 6546.71 ms per token,     0.15 tokens per second)
llama_print_timings:        eval time =  877862.89 ms /    27 runs   (32513.44 ms per token,     0.03 tokens per second)
llama_print_timings:       total time =  945758.49 ms /    32 tokens

PR:

$ ./simple ../../models/models/ggml-model-f16.gguf "Hello my name is"

...

main: n_len = 32, n_ctx = 2048, n_kv_req = 32

<s> Hello my name is Danielle. I am 21 years old. I am a full time student at a local community college. I am majoring

main: decoded 27 tokens in 878.38 s, speed: 0.03 t/s

llama_print_timings:        load time =   67512.84 ms
llama_print_timings:      sample time =       0.97 ms /    28 runs   (    0.03 ms per token, 28747.43 tokens per second)
llama_print_timings: prompt eval time =   32895.30 ms /     5 tokens ( 6579.06 ms per token,     0.15 tokens per second)
llama_print_timings:        eval time =  878320.06 ms /    27 runs   (32530.37 ms per token,     0.03 tokens per second)
llama_print_timings:       total time =  945892.14 ms /    32 tokens

@github-actions bot added the examples, server, and ggml labels May 20, 2024