Evaluate using Profile-Guided Optimization (PGO) and Post Link Optimization (PLO) for ast-grep #738
Replies: 2 comments 1 reply
-
O!M!G! Thanks @zamazan4ik for your heroic adventure exploring PGO! I'm amazed by your detailed work 🥇 I'm not available now, but I will definitely look at the post later! Thanks and Best Wishes, Herrington
-
Sorry for the late reply! I am not too familiar with advanced techniques like PGO/PLO. My intuition is that PGO uses some code examples to build a better machine-code layout? If so, the training input code should be as typical as possible to benefit the most use cases. Currently, the cargo bench code is some randomly chosen code: it is too arbitrary to represent common use cases. Regardless of my poor benchmark cases, your analysis is really deep and insightful! Let me learn more about PGO!
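(For readers landing here: the intuition above is essentially right — PGO compiles the program twice, using a profile recorded from a representative workload to guide inlining, branch prediction, and code layout. A minimal sketch of the manual flow with plain `rustc` flags, which is what cargo-pgo automates; the binary name and paths are placeholders:)

```shell
# Sketch of manual Rust PGO (cargo-pgo automates these steps); paths are placeholders.
# 1. Build with instrumentation that records execution counts.
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# 2. Run a *representative* workload -- the quality of the training input
#    determines how well the final binary is optimized.
./target/release/my-app typical-input/

# 3. Merge the raw .profraw files into a single .profdata file.
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# 4. Rebuild, letting the compiler use the measured hot/cold information.
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release
```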
-
Hi!
Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are available here. According to the tests, PGO helps achieve better performance in many cases across many applications. Because of this, I think trying to optimize ast-grep with PGO is a good idea.
I already did some benchmarks and want to share my results here.
Test environment

- `tree-sitter` (C dependency build); `CFLAGS` are `-O3`
- ast-grep `main` branch on commit `76d845162185bed4a5b9a22de456f26e976af0a6`
Benchmark
For benchmark purposes, I used two scenarios: the built-in `cargo bench` benchmarks and a handcrafted `sg scan` scenario (both described below). All PGO and PLO optimizations are done with cargo-pgo.
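For readers unfamiliar with the tool, the overall cargo-pgo flow (commands as I understand them from the cargo-pgo README; the instrumented-binary path and training input below are placeholders) is roughly:

```shell
# Hedged sketch of the cargo-pgo workflow; paths and the training input are placeholders.
cargo install cargo-pgo              # one-time setup (also needs the llvm-tools-preview rustup component)
cargo pgo build                      # build an instrumented binary
./target/<target-triple>/release/sg scan -r python.yml training-project/  # gather profiles on a workload
cargo pgo optimize build             # rebuild using the gathered profiles
cargo pgo bolt build --with-pgo      # optional: apply LLVM BOLT (PLO) on top of PGO
```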
For the Release build, the built-in benchmarks were run with `cargo bench -p benches`. The PGO instrumentation phase is done with `cargo pgo bench -- -p benches`, and the PGO-optimized benches with `cargo pgo optimize bench -- -p benches`.

The handcrafted scenario is scanning for a simple Python rule in some Python project. The command under test is `taskset -c 0 sg scan -r python.yml feast/` (`taskset -c 0` is used to reduce OS scheduler noise). The `feast` directory contains the https://github.com/feast-dev/feast repo (the `master` branch at commit `052182bcca046e35456674fc7d524825882f4b35`). The PGO training phase is done on another project, PyPy (https://github.com/mozillazg/pypy, `master` branch, commit `5306d9822d91412b224f529ae1aec485bf93dc86`), with the same Python rule. The Release build is done with `cargo build --release`; the PGO-instrumented build with `cargo pgo build` + `CFLAGS="-O3 -fprofile-generate=tr_%m_%p.profraw"`; the PGO-optimized build with `cargo pgo optimize build` + `CFLAGS="-O3 -fprofile-use=tr.profdata"` (the `profdata` file is generated by `llvm-profdata merge` from the `profraw` files produced during the instrumentation phase). I used this trick because I wanted to optimize the C dependency with PGO too, and `cargo-pgo`
does not support this scenario out of the box.

The Python rule in `python.yml` (just a copy-paste from the official website):

All tests are done on the same machine, multiple times (with `hyperfine`), with the same background "noise" (as much as I can guarantee, of course).

Results
Let's begin with the built-in benchmarks:
Results for the handcrafted scenario of running ast-grep scan on the Feast project (in `hyperfine` format), where:

- `sg_release_clang` - Release build
- `sg_optimized_with_tree` - Release build + PGO optimization
- `sg_optimized_with_tree_bolt_optimized` - Release build + PGO optimization + PLO optimization (via LLVM BOLT)

For reference, I also post performance results from the instrumentation phases.
Release build:
PGO instrumented run:
LLVM BOLT instrumented run:
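(For reproducibility: a hyperfine comparison like the ones above can be produced roughly like this — a sketch, with binary names matching the build variants described in this post:)

```shell
# Sketch of the hyperfine comparison; binary names are the build variants from this post.
taskset -c 0 hyperfine --warmup 3 \
  'sg_release_clang scan -r python.yml feast/' \
  'sg_optimized_with_tree scan -r python.yml feast/' \
  'sg_optimized_with_tree_bolt_optimized scan -r python.yml feast/'
```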
According to the tests above, I see measurable improvements from PGO.
Further steps
I can suggest the following action points:
Here are some examples of how PGO optimization is integrated in other projects:
- `configure` script

I have some examples of how PGO information looks in the documentation:
Regarding LLVM BOLT integration, I have the following examples: