
Forecasting Fuzzer Performance with Static Analysis

Intro

Lately I have invested a lot of time into getting really good at writing custom fuzzers with LibAFL, which, for those who are unaware, is a Rust-based library for building your own fuzzers. This is not my first fuzzing article, so if you are completely new to fuzzing, check out my previous Fuzz4All article. Anyway, back to the subject: a fuzzer in LibAFL is made of 3 main components:

  1. Scheduler is responsible for selecting the next test case.
  2. Feedback drives the fuzzer towards inputs that explore new program space.
  3. Mutator defines how the fuzzer modifies the input.

If you click through the links for each of those components, you will quickly realize there are a bunch of options we can mix and match, or we can even bring our own. A natural question to ask is: “Why and when do we choose which component?” Luckily for us, there is this stellar paper from EURECOM.
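
Before diving into the paper, here is a minimal, library-agnostic sketch in Python of how those three components cooperate in a fuzzing loop. All names here are made up for illustration; this is not LibAFL’s API.

```python
import random

# Toy fuzzing loop illustrating the three roles. Names are hypothetical,
# not LibAFL's API; a real fuzzer executes a target and collects coverage maps.

def fuzz_loop(corpus, execute, scheduler, mutator, feedback, iterations=10_000):
    for _ in range(iterations):
        seed = scheduler(corpus)            # Scheduler: pick the next test case
        candidate = mutator(seed)           # Mutator: derive a new input from it
        observation = execute(candidate)    # Run the target, observe e.g. coverage
        if feedback(observation):           # Feedback: is this input interesting?
            corpus.append(candidate)        # Keep it for future mutation rounds
    return corpus

# Deliberately simplistic example components.
seen_coverage = set()

def random_scheduler(corpus):
    return random.choice(corpus)

def bitflip_mutator(data: bytes) -> bytes:
    out = bytearray(data)
    if out:
        i = random.randrange(len(out))
        out[i] ^= 1 << random.randrange(8)
    return bytes(out)

def novelty_feedback(coverage: frozenset) -> bool:
    new = not coverage <= seen_coverage     # did we see any new coverage point?
    seen_coverage.update(coverage)
    return new
```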

Static Analysis-Based Features

Static analysis is an approach for collecting information about a program without running it. An example is computing the cyclomatic complexity of a function: the number of linearly independent paths through the function’s source code, computed from its control-flow graph.
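
As a quick illustration, cyclomatic complexity falls straight out of the CFG via the standard formula M = E - N + 2P (edges, nodes, connected components); a minimal sketch:

```python
# Cyclomatic complexity from a control-flow graph: M = E - N + 2P,
# where E = edges, N = nodes, P = connected components (1 for a single function).
def cyclomatic_complexity(num_edges: int, num_nodes: int, num_components: int = 1) -> int:
    return num_edges - num_nodes + 2 * num_components

# A function with a single if/else: 4 basic blocks, 4 edges -> complexity 2.
assert cyclomatic_complexity(num_edges=4, num_nodes=4) == 2
```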

In this approach, the authors collect 59 static features (a feature is just a piece of information that describes the program). These features describe characteristics of the data-flow graph (DFG), the control-flow graph (CFG), and the instruction types found in LLVM’s intermediate representation.

On top of the extracted features, the authors build a model that, given those 59 static features for a program we have not seen or fuzzed before, predicts the most promising fuzzer ensemble. For the complete source code go here. The actual model is a random forest.
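
To make that concrete, here is a hedged sketch of such a model using scikit-learn; the training data, labels, and hyperparameters below are placeholders of my own, not the paper’s actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data: one row of 59 static features per fuzzed program,
# and a label naming the fuzzer configuration that performed best on it.
X_train = np.random.rand(40, 59)                       # stand-in feature matrix
y_train = np.random.choice(["cmplog", "ngram4", "value-profile"], size=40)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For a new, never-fuzzed program we extract the same 59 features ...
x_new = np.random.rand(1, 59)
# ... and ask the model which configuration looks most promising.
print(model.predict(x_new)[0])
print(model.predict_proba(x_new))   # per-class scores, useful for ranking ensembles
```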

This is simple: yep, no neural networks, nothing super fancy. However, it is hardly explainable.

Fuzzer Building Blocks

As mentioned before, a fuzzer consists of 3 basic building blocks; here is a list of the options used:

Scheduler

The scheduler has two goals:

  1. Power schedule, which assigns power to a test case from the corpus; test cases with more power undergo more trials. The research uses two techniques from AFLFast, fast and explore, plus cov-accounting from TortoiseFuzz (a simplified sketch of a power schedule follows this list).
  2. Mutation selection, whose goal is to pick the next corpus entry to mutate. Here the options are the weighted scheduler from AFL++ and the rand-scheduler (a random pick from the corpus).
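
To give an intuition for what a power schedule does, here is a deliberately simplified sketch inspired by AFLFast’s fast schedule; it is not the exact formula, just the shape of the idea.

```python
# Simplified, AFLFast-inspired power schedule (intuition only, not the exact
# "fast" formula): seeds exercising rarely-hit paths get more energy, and the
# energy grows the more often a seed has already been picked, up to a hard cap.
def assign_power(times_scheduled: int, path_frequency: int,
                 base_energy: int = 16, cap: int = 1024) -> int:
    energy = base_energy * (2 ** times_scheduled) // max(path_frequency, 1)
    return min(max(energy, 1), cap)

# A seed on a rare path gets far more trials than one on a hot path.
print(assign_power(times_scheduled=3, path_frequency=2))    # high energy
print(assign_power(times_scheduled=3, path_frequency=500))  # low energy
```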

Feedback

Once we run a mutated test case, we need to decide whether to save it back to the corpus. Here we use fuzzers with ngram feedback, more specifically ngram8 and ngram4, and with context-sensitive coverage feedback (naive-ctx), both from AFL-SENSITIVE. On top of these techniques we include the value-profile technique (originally from libFuzzer), which rewards the fuzzer when a comparison instruction sees novelty in its operands.
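
To illustrate what ngram feedback changes compared to plain edge coverage, here is a toy sketch of my own (not the instrumentation AFL-SENSITIVE or LibAFL actually emit): the coverage key is derived from the last N basic blocks instead of a single edge.

```python
from collections import deque

# Toy ngram coverage: instead of recording single edges (prev_block, cur_block),
# we hash the window of the last N basic blocks, so the same edge reached via
# different recent histories counts as new coverage.
class NgramCoverage:
    def __init__(self, n: int = 4):
        self.window = deque(maxlen=n)
        self.seen = set()

    def on_basic_block(self, block_id: int) -> bool:
        """Return True when this N-block history has never been seen before."""
        self.window.append(block_id)
        key = hash(tuple(self.window))
        if key in self.seen:
            return False
        self.seen.add(key)
        return True

cov = NgramCoverage(n=4)
for block in [1, 2, 3, 4, 2, 3, 4, 5]:
    cov.on_basic_block(block)
print(len(cov.seen))   # number of distinct 4-block histories observed
```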

Mutator

This defines how the fuzzer mutates the input. Beyond the baseline havoc mutator, the options evaluated here are cmplog and grimoire, both of which come up again in the Results section.

Baseline

A naive fuzzer serves as the baseline; it employs a basic setup inspired by AFL and consists of:

  • a naive scheduler, which prefers small and fast test cases
  • naive feedback, which tracks coverage and hit counts
  • a naive mutator, which uses havoc mutations: a stack of different mutation types including bit flips, byte flips, arithmetic operations, token replacement, and more (see the sketch below).
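
For intuition, here is a toy sketch of a havoc-style stacked mutation; it is far simpler than AFL’s or LibAFL’s real havoc stage, and the token list is a made-up placeholder.

```python
import random

# Havoc-style mutation: apply a random stack of simple mutations to one input.
def bit_flip(data: bytearray):
    i = random.randrange(len(data)); data[i] ^= 1 << random.randrange(8)

def byte_flip(data: bytearray):
    i = random.randrange(len(data)); data[i] ^= 0xFF

def add_sub(data: bytearray):
    i = random.randrange(len(data)); data[i] = (data[i] + random.randint(-35, 35)) % 256

def token_replace(data: bytearray, tokens=(b"GET", b"PNG", b"\x00\x00")):
    tok = random.choice(tokens)
    i = random.randrange(len(data))
    data[i:i + len(tok)] = tok

MUTATIONS = [bit_flip, byte_flip, add_sub, token_replace]

def havoc(seed: bytes, max_stack: int = 8) -> bytes:
    data = bytearray(seed) or bytearray(b"\x00")
    for _ in range(random.randint(1, max_stack)):
        random.choice(MUTATIONS)(data)   # stack several mutations on one input
    return bytes(data)

print(havoc(b"hello fuzzing world"))
```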

Feature Extraction

We use static analysis to extract features primarily from LLVM IR.

Feature Types

Generic Features

Generic features include information like binary size, the number of basic blocks, and the average nesting level of the source code.

Instruction Types

Here LLVM IR comes into play: we extract the frequency of each LLVM instruction type, such as alloca, binary-operator, branch, call, cmp, load, and store.
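
As a rough illustration of this feature type, the sketch below counts opcodes in a textual .ll dump with a naive regex; the real extraction runs inside LLVM, so treat this purely as a toy.

```python
import re
from collections import Counter

# Naive opcode counter over textual LLVM IR (a .ll file produced with
# `clang -S -emit-llvm`). Real feature extraction walks the IR with LLVM
# itself; this regex-based version only illustrates what the feature captures.
OPCODES = {"alloca", "br", "call", "icmp", "fcmp", "load", "store",
           "add", "sub", "mul", "ret", "getelementptr"}

def instruction_histogram(ll_source: str) -> Counter:
    counts = Counter()
    for line in ll_source.splitlines():
        # Instructions look like "%x = add i32 ..." or "store i32 ...".
        m = re.match(r"\s*(?:%[\w.]+\s*=\s*)?([a-z_]+)\b", line)
        if m and m.group(1) in OPCODES:
            counts[m.group(1)] += 1
    return counts

sample = """
%1 = alloca i32
store i32 0, i32* %1
%2 = load i32, i32* %1
%3 = icmp eq i32 %2, 42
br i1 %3, label %then, label %else
"""
print(instruction_histogram(sample))
```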

Type of Instruction Operand

For every LLVM instruction we collect the type and count of every operand.

CFG and DDG

From the CFG we extract features such as the existence of cycles (loop-back edges), the length of the shortest path, and the average length of the shortest paths.

For the DDG (data-dependence graph), we compute the number of nodes and edges. This information should be useful for fuzzers with more fine-grained feedback, such as the ngram fuzzer, which rewards traversing more diverse paths.
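
Here is a small sketch of those graph features using networkx on a toy CFG; the library choice and the example graph are my own, not the authors’ tooling.

```python
import networkx as nx

# Toy CFG of one function: nodes are basic blocks, edges are control transfers.
cfg = nx.DiGraph([("entry", "loop"), ("loop", "body"), ("body", "loop"),
                  ("loop", "exit")])

# Distances (in edges) from the entry block to every reachable block.
dist_from_entry = nx.shortest_path_length(cfg, source="entry")

features = {
    # Loop-back edges mean the CFG contains cycles.
    "has_cycle": not nx.is_directed_acyclic_graph(cfg),
    # Length of the shortest path from entry to the exit block.
    "shortest_entry_exit": dist_from_entry["exit"],
    # Average shortest-path length from entry over all reachable blocks.
    "avg_shortest_from_entry": sum(dist_from_entry.values()) / len(dist_from_entry),
    # For a data-dependence graph we would simply record node and edge counts.
    "num_nodes": cfg.number_of_nodes(),
    "num_edges": cfg.number_of_edges(),
}
print(features)
```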

APIs

These are API calls for memory allocation (malloc/free) and security-sensitive functions like memcpy and strcmp. Such calls should play well with the cov-accounting fuzzer, which prioritises functions containing them.

Comparisons

Comparison instructions are grouped by their LLVM IR operand types (for example comparisons of integers, floats, or vectors), and we compute the frequency of each group. Comparisons tend to be a bottleneck and often block a fuzzer, so we also expect these features to be predictive for cmplog and for value-profile feedback.

Initial Set of Seeds

Seeds are independent of a given program, but they have a significant impact on the fuzzer’s performance, so including them in the analysis gives a more honest picture. Here we collect features like the function and branch coverage achieved by running the initial inputs.

Predictions

Yep, a random forest, not the most fortunate choice, but whatever (I share my thoughts on it in the Notes section, since they are more philosophical and deeply reflect my biases towards certain approaches, biases shaped by education and experience).

Fuzzer Composition

I already gave a somewhat exhaustive view of how we can set up a fuzzer, so there are a lot of components we can mix and match (120 different combinations for those who like exact numbers). Running an exhaustive fuzzing campaign for every combination, across multiple binaries, is not really feasible (yes, people train LLMs, but fuzzing scales worse). Because of that, the authors run single-technique fuzzers, which use 2 baseline components and 1 changed component. This should help determine whether a technique is useful or not.

Results

Here I will give a summary of the findings and of which kinds of features mattered most in which scenarios.

Schedulers

The explore, fast, and weighted schedulers work well for larger programs, larger inputs, and higher initial coverage. Their performance is influenced by how many corpus entries they have to schedule.

Cov-accounting tends to work better when security-sensitive APIs are used (this is expected).

The rand-scheduler behaves as it should: it is not influenced by any feature.

Feedback

Complex feedback mechanisms like ngram4 and ngram8 perform poorly on larger programs, but they manage to explore more paths within a function, which is exactly what ngram fuzzers are designed to do. Naive-ctx performance degrades for larger inputs, which again makes sense since it is designed to distinguish between different caller stacks. In general, these strategies give a fuzzer more complex coverage feedback than simple edge coverage, and the consensus in the community is that too much information overwhelms the fuzzer’s corpus, resulting in degraded performance.

The value-profile fuzzer excels for programs with more branches (lots of comparisons, and hence lots of if-else statements).

Mutators

Cmplog and grimoire fuzzers work better when the initial seed coverage is low. Why is that? For cmplog, the idea is that it mutates the input based on observed comparisons: if a comparison failed and a branch was not taken, cmplog rewrites the input so the unexplored branch gets taken. This makes cmplog better suited for programs that have fuzz blockers right at the beginning. Grimoire, on the other hand, is better suited for programs that take structured input.
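
Here is a toy illustration of that input-to-state idea (heavily simplified, not AFL++’s actual cmplog implementation): log the operands of a failed comparison, find the operand we control inside the input, and overwrite it with the value the program expected.

```python
# Toy input-to-state replacement (the idea behind cmplog / REDQUEEN, heavily
# simplified). Suppose runtime logging told us the target executed something
# like `if (memcmp(buf, "MAGIC", 5) == 0)` and the comparison failed.
def cmplog_mutate(seed: bytes, observed: bytes, expected: bytes) -> bytes:
    pos = seed.find(observed)          # locate the bytes we control in the input
    if pos == -1:
        return seed                    # operand did not come straight from input
    return seed[:pos] + expected + seed[pos + len(observed):]

seed = b"HELLO world"
# Comparison log entry: input bytes "HELLO" were compared against "MAGIC".
mutated = cmplog_mutate(seed, observed=b"HELLO", expected=b"MAGIC")
print(mutated)   # b"MAGIC world" -> now takes the previously unexplored branch
```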

Notes

Here I lightly touch on topics not covered in depth, but which are important.

LTO

Every one of the programs is compiled with LLVM’s LTO (Link Time Optimization), for which we need to write a fuzzing harness (the driver for our fuzzer). The harness has a huge impact on the fuzzer’s performance: it makes some parts of the program unreachable, so we need to prune the features (decrement the frequencies) that come from those unreachable parts.
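
A hedged sketch of what that pruning could look like, assuming we have a call graph and per-function feature counts; the graph library, function names, and data layout are placeholders of mine, not the paper’s code.

```python
import networkx as nx
from collections import Counter

# Prune features contributed by functions the harness can never reach.
# The call graph and per-function counts below are illustrative placeholders.
call_graph = nx.DiGraph([("LLVMFuzzerTestOneInput", "parse_header"),
                         ("parse_header", "memcpy_wrapper"),
                         ("legacy_cli_main", "huge_unreachable_parser")])

per_function_features = {
    "parse_header":            Counter(cmp=12, load=30),
    "memcpy_wrapper":          Counter(call=2),
    "huge_unreachable_parser": Counter(cmp=400, load=900),
}

# Everything transitively callable from the harness entry point is reachable.
reachable = {"LLVMFuzzerTestOneInput"} | nx.descendants(call_graph,
                                                        "LLVMFuzzerTestOneInput")

pruned = Counter()
for func, counts in per_function_features.items():
    if func in reachable:          # drop counts from functions the harness misses
        pruned.update(counts)
print(pruned)                      # feature counts without the unreachable parser
```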

Correlation Analysis

The research includes a bunch of statistical tests to determine how strongly each feature correlates with the different fuzzer components. These tests include Cohen’s d, Spearman’s rank correlation, and Kendall’s τ. However, I do not find this part super helpful, since many of these tests were built under assumptions that are rarely true, and history has shown us repeatedly that they are flawed (see the never-ending debate about p-values). The authors argue that a simple Multiple Linear Regression (which was used by Wolf, 2022) is not suitable since it is affected by outliers. While true, this can be mitigated with a bit of feature preprocessing: instead of using raw frequencies we can work with proportions. For example, for comparison operators, instead of 10 integer comparisons, 40 float comparisons, and 50 vector comparisons we would have 0.1, 0.4, and 0.5 (I just made up this example; imagine the magnitudes are way, way greater), or we can standardize the values. Adopting a more Bayesian approach would also be helpful, since features are likely correlated and we can use heavier-tailed distributions to cope with outliers.
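
Here is a tiny sketch of the preprocessing I have in mind: turning raw comparison counts into proportions, or alternatively standardizing them.

```python
import numpy as np

# Raw comparison counts per program (rows): integer, float, vector comparisons.
counts = np.array([[10, 40, 50],
                   [2000, 100, 400]], dtype=float)

# Option 1: relative frequencies, so programs of very different sizes
# become comparable (each row sums to 1).
proportions = counts / counts.sum(axis=1, keepdims=True)

# Option 2: standardization (zero mean, unit variance per feature column),
# which tames the influence of extreme magnitudes on a linear model.
standardized = (counts - counts.mean(axis=0)) / counts.std(axis=0)

print(proportions)
print(standardized)
```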

Final Thoughts

What is the takeaway? There is no “free lunch” in fuzzing: each technique is better suited to different types of programs, or even to different states a program can be in. One promising avenue for leveraging different fuzzers is Collaborative Fuzzing, whose main idea is to run multiple fuzzers that exchange information, potentially combining the best of each individual. Combine that with machine learning and modern AI, and we have a new and exciting field.