Optimizing Chromium with BOLT
Intro
As you might already know, BOLT helps achieve peak performance on top of compiler’s best effort, i.e. over both PGO and LTO. This post will cover the necessary steps to experiment with optimizing Chromium with BOLT.
Plan
We’ll tackle this problem in the following steps:
- produce pre-bolt chromium binary with preserved (static) relocations
- collect the profile
- optimize the binary
- get performance data
Building Chromium
I followed the steps here: Checking out and building Chromium on Linux. Things worked for me on this revision:
commit 2570036418a961586b01697e1f79b9d9380fd103 (origin/main, origin/HEAD)
Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Date: Mon Jan 2 22:34:43 2023 +0000
Roll Skia from 809e328ed55c to 697f9b541a0e (1 revision)
We’d need to enable the “official” build configuration which includes PGO:
# First fetch gclient config, tip https://stackoverflow.com/a/35382199
gclient config https://chromium.googlesource.com/chromium/src.git
gclient sync
gclient runhooks
Then add "checkout_pgo_profiles": True
to custom_vars in the gclient config, see here
vim .gclient
gclient runhooks
If the runhooks
step complains about invalid gs credentials, run the gsutil config
command suggested in the output.
(Note: gsutil
lives in depot_tools folder). Click on the link, log in to your Google account, paste the authorization code. Use 0 as project-id (worked for me).
Then repeat gclient runhooks
step.
Set gn variables, see here
gn args out/Default
We need to disable Clang CFI as BOLT doesn’t understand added symbols (see bug):
Put the following to the file:
is_debug = false
is_official_build = true
symbol_level = 0
is_cfi = false
And finally build with:
autoninja -C out/Default chrome
Chromium binary has a very large text section:
Build | .text size |
---|---|
Release | 210MB |
Official | 171MB |
Pre-BOLT Chromium binary
In order to enable function reordering, one of the most important optimizations, the input binary needs to have .text relocations preserved by the linker.
I had to tweak build configuration to actually achieve this:
diff --git a/build/config/BUILDCONFIG.gn b/build/config/BUILDCONFIG.gn
index a373e55a33..7960db4c81 100644
--- a/build/config/BUILDCONFIG.gn
+++ b/build/config/BUILDCONFIG.gn
@@ -352,8 +352,15 @@ default_compiler_configs = [
"//build/config/compiler/pgo:default_pgo_flags",
"//build/config/coverage:default_coverage",
"//build/config/sanitizers:default_sanitizer_flags",
+ "//build/config/compiler:emit-relocs",
]
if (is_win) {
default_compiler_configs += [
"//build/config/win:default_cfg_compiler",
diff --git a/build/config/compiler/BUILD.gn b/build/config/compiler/BUILD.gn
index 3eede98ae4..8cbb5c8d79 100644
--- a/build/config/compiler/BUILD.gn
+++ b/build/config/compiler/BUILD.gn
@@ -233,6 +233,15 @@ config("no_unresolved_symbols") {
}
}
+# Emit relocs for BOLT
+config("emit-relocs") {
+ if (!using_sanitizer && is_linux) {
+ ldflags = [
+ "-Wl,--emit-relocs",
+ ]
+ }
+}
+
# compiler ---------------------------------------------------------------------
#
With that, rerunning autoninja -C out/Default chrome
produces the binary with text relocations:
$ readelf -We out/Default/chrome | grep rela.text
[32] .rela.text RELA 0000000000000000 11065398 ba63780 18 I 45 16 8
Collecting the profile
The steps are described on BOLT’s GitHub page: Step 1: Collect Profile.
It took me several iterations to figure out the sampling rate which can be adjusted with -c/-F flags for perf record.
I ended up with sampling frequency of 10Khz (-F 10000
) to increase the quantity of collected profile.
The representative workload is borrowed from PGO build guide here: Run representative benchmarks to produce profiles
The profiling invocation looks like this:
perf record -e cycles:u -j any,u -o perf.data -- \
vpython3 tools/perf/run_benchmark speedometer2 \
--assert-gpu-compositing --browser=exact \
--browser-executable=out/Default/chrome
Converting the profile
In order to simplify the use of profile (i.e. avoid converting it multiple times), we’ll use perf2bolt to convert raw perf data to a profile:
perf2bolt out/Default/chrome -perfdata=perf.data -o perf.fdata -strict=0
Optimizing the binary
Again, we’re following BOLT’s manual: Step 3: Optimize with BOLT. There are a couple of Chromium’s quirks:
- The optimized binary needs to be at the same path as the original (i.e. in
out/Default/
folder) - Chromium employs V8 as a JS VM. It still has the same issue with code pointers described in this post: Optimizing NodeJS with BOLT.
Here’s the command line that I used:
llvm-bolt out/Default/chrome -o out/Default/chrome.bolt \
-data=perf.fdata -dyno-stats \
-reorder-blocks=ext-tsp -reorder-functions=hfsort \
-split-functions -split-all-cold -split-eh \
-skip-funcs=Builtins_.\*
BOLT optimization dynostats:
11566179 : executed forward branches (-8.2%)
748444 : taken forward branches (-65.7%)
3764854 : executed backward branches (+37.6%)
1565996 : taken backward branches (+3.4%)
486354 : executed unconditional branches (-48.7%)
1231833 : all function calls (=)
483079 : indirect calls (=)
139827 : PLT calls (=)
107491548 : executed instructions (-0.9%)
29806370 : executed load instructions (-0.0%)
11393188 : executed store instructions (-0.0%)
93934 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
15817387 : total branches (-2.8%)
2800794 : taken branches (-39.7%)
13016593 : non-taken conditional branches (+11.9%)
2314440 : taken conditional branches (-37.4%)
15331033 : all conditional branches (-0.0%)
Perf testing
Testing system:
- Intel ADL i7-12700K
- Ubuntu 22.04
For performance testing I used the same Speedometer2 invocation (sorry, no proper train/test split):
vpython3 tools/perf/run_benchmark speedometer2 \
--assert-gpu-compositing --browser=exact \
--browser-executable=out/Default/chrome
Speedometer2 Runs per minute
Build | Runs per minute |
---|---|
Official with CFI | 391.575 ± 17.901 |
Official sans CFI | 381.923 ± 11.436 |
Official sans CFI with BOLT | 404.009 ± 13.378 |
Uarch metrics
Running under perf
to see the change in uarch metrics, narrowing down to just P-cores (Golden Cove):
taskset -c 0-15 perf stat -e instructions,cycles,L1-icache-misses,iTLB-misses -- \
vpython3 tools/perf/run_benchmark speedometer2 \
--assert-gpu-compositing --browser=exact \
--browser-executable=out/Default/chrome
Official sans CFI | Official sans CFI with BOLT | reduction with BOLT, % | |
---|---|---|---|
seconds | 25.18 | 23.78 | 5.9% |
instructions | 288,747,937,775 | 287,258,920,991 | 0.5% |
cycles | 144,095,296,652 | 139,002,624,616 | 3.7% |
L1-icache-misses | 4,703,384,066 | 3,893,492,873 | 20.8% |
iTLB-misses | 42,881,805 | 32,524,602 | 31.8% |
Static code vs JIT code
Chromium incorporates V8 JIT, so part of the benchmark execution time is spent in dynamically generated code which BOLT can’t see or improve. BOLT’s effect is restricted to just the code inside the binary.
perf report --sort dso
shows the percentage of samples belonging to jitted code:
Samples: 9M of event 'cpu_core/cycles:u/', Event count (approx.): 9834176
Overhead Shared Object
66.58% chrome
22.26% [JIT] tid 3489774
5.16% python3
3.53% libc.so.6
1.14% vpython
...
If we exclude kernel code, system libraries and benchmarking harness overhead, the proportions become roughly 75% in chrome binary and 25% in jitted code. This means that BOLT’s effect for just the optimized binary is ~1.3x higher than for the complete benchmark.
Summary
BOLT is effective in optimizing large code footprint applications, both server- and client-side. Chromium appears to be a good candidate as it exhibits elevated icache misses per 1000 instructions (MPKI) of about 16. BOLT shows moderate speedups for Chromium of about 6% wall clock-wise, or 4% cycles-wise, driven by 21% icache miss reduction and 32% iTLB miss reduction.
Future work
- Confirm if CPU frontend is a bottleneck for Chromium using top-down methodology.
- Check if huge pages help further improve performance (
-hugify
). - Improve profiling coverage, bringing the function coverage closer to PGO levels.
- Carefully measure the resulting performance, following benchmarking best practices and rigorous approach to statistics.
- Support Clang CFI in BOLT.