Intro

This short note is about PGO (profile-guided optimization) and FDO (feedback-directed optimization) as implemented in the LLVM codebase today (Jul’23). Despite impressive performance wins from PGO/FDO and strides in making profiling faster, easier, and more accessible, knowledge of these techniques has been slow to spread beyond the compiler developers who work on them. This note is my humble attempt to fix that.

PGO vs FDO

What’s up with these two terms? Is there a material difference between them? To me, there isn’t, but the terms are used to distinguish the different implementations we’ll cover in this note.

Instrumentation vs sampling profile

But before we get to the implementations, there’s an important question of what kind of profile is used to drive the optimizations. Profiling methods differ in their overhead, the quality of the resulting profile, and their ease of use.

There are several ways to collect the profile:

  1. Instrumentation: the program is modified so that a profile is recorded during its execution. Instrumentation code can be added either statically, by the compiler or external tools, or dynamically, during the program’s execution. The difference between the two boils down to overhead (static is lower) and coverage of dynamically generated code and code in shared libraries (static instrumentation can’t see either). For optimization purposes, the only practical approach is static instrumentation, and even then the overhead is high enough to interfere with the behavior of time-sensitive applications (e.g. web servers).

  2. Sampling: an execution profile is collected from an unmodified program by leveraging the profiling capabilities of the execution environment (OS, hardware, or both). The primary mechanism is Linux perf, which interrupts the program at regular intervals and records the current IP and/or call stack (a minimal perf invocation is sketched after this list). Modern CPUs enhance sampling in two important ways:
    • they provide programmable events on which to interrupt the program (e.g. instructions, cycles, branches, calls), which improves profile quality,
    • they have out-of-band profiling extensions (e.g. Intel LBR) that accumulate taken branches in a circular buffer, which greatly enhances both the quantity of profile information (multiple taken branches per sample) and its quality, since the recorded branches provide limited tracing information (they form an execution trace).
  3. Tracing: the full control flow is recorded during program execution, and the profile is aggregated from the trace. Tracing is the most flexible and the most expensive option; its overhead can be reduced with hardware support (e.g. Intel PT or ARM ETM). Although I mention it here, it’s seldom used in production environments due to its prohibitive runtime cost. On the flip side, it might be the only profiling mechanism available (e.g. for ARM).
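
To make the sampling option concrete, here’s a minimal sketch of LBR-based collection with Linux perf; the binary name is a placeholder, and the best event to sample on varies by CPU vendor and model:

```shell
# Sample user-space cycles; -b captures LBR (last taken branches) with each
# sample. The application name and input are hypothetical.
perf record -e cycles:u -b -- ./myapp representative-input

# Inspect the collected samples.
perf report
```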

Hybrid approaches are also possible, specifically ones that supplement sampling information with better code anchors added via instrumentation (see CSSPGO below).

PGO

The term PGO in the LLVM codebase almost always implies an instrumentation-based approach.

Clang Front-end PGO

This form of PGO is perhaps the best known and is extensively covered in the Clang Compiler User’s Manual.

Instrumentation is inserted at the source level and needs to be preserved throughout the optimization pipeline. The profile is mapped back to the source using debug line information. Profile accuracy loss is relatively high, since this mapping can’t account for a function being inlined into multiple call sites or for some control-flow transformations (unrolling, tail duplication, CFG simplifications).

Clang flags: -fprofile-instr-generate/-fprofile-instr-use.

LLVM CMake options: LLVM_BUILD_INSTRUMENTED.
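
To make the flags concrete, a minimal end-to-end cycle might look like the sketch below (myapp.c and the profile names are placeholders):

```shell
# 1. Build with front-end instrumentation.
clang -O2 -fprofile-instr-generate myapp.c -o myapp

# 2. Run a representative workload; LLVM_PROFILE_FILE controls where the raw
#    profile is written.
LLVM_PROFILE_FILE=myapp.profraw ./myapp

# 3. Merge and index raw profiles.
llvm-profdata merge -output=myapp.profdata myapp.profraw

# 4. Rebuild with the profile applied.
clang -O2 -fprofile-instr-use=myapp.profdata myapp.c -o myapp
```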

IR PGO

IR PGO is also well covered by published documentation, in the same chapter of the User’s Manual.

Instrumentation is inserted at the LLVM IR level. The tradeoff versus FE PGO is that instrumentation is inserted later in the compilation pipeline, so it’s better preserved throughout it. However, transformations done before instrumentation must be repeated in the optimized build to ensure the profile matches.

Flags: -fprofile-generate/-fprofile-use.

LLVM CMake options: LLVM_BUILD_INSTRUMENTED=IR, LLVM_ENABLE_IR_PGO.
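
The workflow mirrors the FE PGO sketch above, just with the IR-level flags; by default -fprofile-generate writes default_%m.profraw-style files into the current directory (an output directory can be passed as the flag’s argument):

```shell
clang -O2 -fprofile-generate myapp.c -o myapp
./myapp                                   # writes default_<hash>.profraw
llvm-profdata merge -output=ir.profdata default_*.profraw
clang -O2 -fprofile-use=ir.profdata myapp.c -o myapp
```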

CSIR PGO

From the Clang Compiler User’s Manual:

The difference is that the instrumentation is performed after inlining so that the resulted profile has a better context sensitive information.

CSIR instrumentation is inserted even later, after inlining, so that the profile can match code duplicated by inlining. As it stands, it imposes even tighter control over optimizations – specifically, inlining decisions must be the same in the instrumented and optimized builds to ensure the profile matches.

Flags: -fcs-profile-generate/-fprofile-use.

LLVM CMake options: LLVM_BUILD_INSTRUMENTED=CSIR.

Example of how to use IR PGO and CSIR PGO together: link.
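
In sketch form (mirroring the example in the User’s Manual; file and directory names are placeholders), the combined flow adds a second collect-and-recompile round on top of IR PGO:

```shell
# Round 1: regular IR PGO.
clang -O2 -fprofile-generate=prof_ir/ myapp.c -o myapp
./myapp
llvm-profdata merge -output=ir.profdata prof_ir/

# Round 2: re-instrument after inlining, feeding the first profile back in.
clang -O2 -fprofile-use=ir.profdata -fcs-profile-generate=prof_cs/ \
  myapp.c -o myapp
./myapp
llvm-profdata merge -output=cs.profdata prof_cs/ ir.profdata

# Final build with the combined profile.
clang -O2 -fprofile-use=cs.profdata myapp.c -o myapp
```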

Context-Sensitive Sample PGO (CSSPGO) (sampling enhanced by instrumentation)

RFC introducing CSSPGO:

We propose a full context-sensitive sample profiling infrastructure that utilizes both LBR and call stack samples at the same time to synthesize a profile with a full context sensitivity. The key advantage is that rather than relying on previous inlining or a separate profile, the profile collected with the new approach will have full calling contexts recovered from both inlined and not inlined call sites. To achieve an accurate post-inline profile, a separate profile is no longer needed. Instead, the post-inline profile can be directly derived from adjusting the input profile based on all inline decisions. The richer context-sensitive profile also enables better inline decisions.

How to generate a profile for CSSPGO: link.

How to use CSSPGO: link.
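
At a high level, the flow looks roughly like the sketch below; the binary name and the sampling event are placeholders, and perf call-stack collection details vary by setup:

```shell
# Build with pseudo probes – lightweight anchors that survive optimizations.
clang -O2 -g -fpseudo-probe-for-profiling myapp.c -o myapp

# Collect LBR samples together with call stacks (-g).
perf record -e cycles:u -b -g -- ./myapp

# Synthesize a context-sensitive sample profile.
llvm-profgen --binary=./myapp --perfdata=perf.data --output=myapp.prof

# Use the profile; pseudo-probe flags must match the profiled build.
clang -O2 -g -fpseudo-probe-for-profiling -fprofile-sample-use=myapp.prof \
  myapp.c -o myapp
```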

FDO

FDO in the LLVM context is shorthand for AutoFDO – a sample-based optimization workflow developed at Google, initially for use with GCC. It was eventually ported to the LLVM codebase, but the name stuck.

AutoFDO/SampleFDO

google/autofdo

AutoFDO uses sample information and maps the profile back to the IR level using debug line information.

Usage instructions: Using sampling profilers.
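
A minimal sketch of that workflow (binary and profile names are placeholders; create_llvm_prof comes from the google/autofdo repo and reads perf.data from the current directory by default):

```shell
# Profile a regular optimized binary built with line-table debug info.
clang -O2 -gline-tables-only myapp.c -o myapp
perf record -e cycles:u -b -- ./myapp

# Convert perf samples into an LLVM sample profile.
create_llvm_prof --binary=./myapp --out=myapp.prof

# Recompile with the sample profile.
clang -O2 -gline-tables-only -fprofile-sample-use=myapp.prof myapp.c -o myapp
```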

Control Flow-Sensitive Sample AutoFDO (FS-AFDO)

Control Flow Sensitive AutoFDO (FS-AFDO)

The sample profile is attached at the lower MachineIR level (see MIRSampleProfile.h).

Usage instructions can be found in Clang’s User’s Manual.
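
I won’t reproduce the full recipe here; roughly, and treating the exact flag spelling as my assumption (check the manual), it amounts to rebuilding an AutoFDO-optimized binary with flow-sensitive discriminators enabled and repeating the profile collection:

```shell
# Rebuild the AutoFDO-optimized binary with FS discriminators enabled.
# -enable-fs-discriminator is an internal LLVM option; the spelling here is
# my assumption – consult the User's Manual for the current recipe.
clang -O2 -gline-tables-only -fprofile-sample-use=autofdo.prof \
  -mllvm -enable-fs-discriminator myapp.c -o myapp
# Then re-profile this binary, regenerate the sample profile as in the
# AutoFDO flow above, and rebuild once more with the new profile.
```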

PLO: Post-link profile-guided optimizations

Post-link optimizers are by nature “context-sensitive”, since they work on the final binary, after all inlining and other transformations. They therefore target the same performance opportunity as CSIR PGO/CSSPGO. But:

  • CSSPGO uses a single profile, and its accuracy is lower than that of the profile used for post-link optimizations.
  • CSIR PGO uses a second profile and is supposed to completely subsume post-link optimizations at the cost of a second recompilation, but surprisingly it still leaves some opportunity for BOLT: example.

Post-link optimizers can be stacked on top of PGO and provide extra performance benefits, or be used as a faster alternative to CSIR PGO.

BOLT

BOLT is a post-link binary optimization and layout tool: BOLT.

CGO’19 paper.

BOLT can work with both sample and instrumentation profiles. The profile is mapped to the same binary that was used for profiling, ensuring a perfect match. This allows BOLT to perform the most aggressive layout optimizations at the binary level, with knowledge of branch and call distances, achieving peak performance.

BOLT’s scalability has been improved over the years – see the Lightning BOLT paper.

Using BOLT: GitHub repo.
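
A typical sampled-profile flow, sketched under the assumption that the binary is linked with relocations preserved (BOLT wants --emit-relocs for maximum gains); names are placeholders:

```shell
# Link with relocations so BOLT can freely rearrange code.
clang -O2 -Wl,--emit-relocs myapp.c -o myapp

# Collect LBR samples and convert them to BOLT's profile format.
perf record -e cycles:u -j any,u -- ./myapp
perf2bolt -p perf.data -o myapp.fdata ./myapp

# Optimize the binary: basic block and function reordering plus splitting.
llvm-bolt ./myapp -o ./myapp.bolt --data=myapp.fdata \
  --reorder-blocks=ext-tsp --reorder-functions=hfsort \
  --split-functions --split-all-cold
```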

Using BOLT for Clang using CMake: Advanced Builds.

Propeller

Propeller: A frame work for Post Link Optimizations

Propeller is a link-time layout optimizer designed to address the scalability issues of BOLT. It performs profile-guided code reordering at link time and therefore relies on the compiler to perform function splitting (since the linker can only see sections).

Propeller is fully integrated into Clang/LLVM.
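
In rough outline (the compiler flags are real; the profile-conversion step uses tooling from google/autofdo whose exact invocation varies by version, so treat it as an approximation):

```shell
# 1. Build with basic-block labels so samples can be mapped to basic blocks.
clang -O2 -gline-tables-only -fbasic-block-sections=labels myapp.c -o myapp

# 2. Collect LBR samples.
perf record -e cycles:u -b -- ./myapp

# 3. Convert the profile into a basic-block cluster file (invocation is an
#    approximation – see google/autofdo for the current tooling).
create_llvm_prof --format=propeller --binary=./myapp --out=cluster.txt

# 4. Rebuild: each listed basic-block cluster gets its own section, which
#    the linker then reorders.
clang -O2 -fbasic-block-sections=list=cluster.txt myapp.c -o myapp
```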