Lately I have invested a lot of time into getting really good at writing custom fuzzers with LibAFL, which, for those who are unfamiliar, is a Rust library for building custom fuzzers. This is not my first fuzzing article, so if you are completely new to fuzzing, check out my previous Fuzz4All article. Anyway, back to the subject: a fuzzer in LibAFL is made of 3 main components:
If you click on the links for each of those components, you will quickly realize we have a bunch of options that we can mix and match or even bring our own. A natural question to ask is: “Why and when do we choose which component?” Luckily for us, there is this stellar paper from EURECOM.
Static analysis is an approach to collecting information about a program without running it. An example would be computing the cyclomatic complexity of a function, i.e. the number of linearly independent paths through the function's source code, which we compute from its control-flow graph.
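To make that concrete, here is a minimal sketch (my own, not from the paper) that computes cyclomatic complexity from a control-flow graph built with networkx; the toy graph and node names are purely illustrative.

```python
# Minimal sketch: cyclomatic complexity M = E - N + 2P computed from a CFG
# modeled as a networkx DiGraph. The toy graph below is made up.
import networkx as nx

def cyclomatic_complexity(cfg: nx.DiGraph) -> int:
    e = cfg.number_of_edges()
    n = cfg.number_of_nodes()
    p = nx.number_weakly_connected_components(cfg)  # usually 1 for a single function
    return e - n + 2 * p

# Toy CFG: entry -> branch -> (then | else) -> exit
cfg = nx.DiGraph()
cfg.add_edges_from([
    ("entry", "branch"),
    ("branch", "then"),
    ("branch", "else"),
    ("then", "exit"),
    ("else", "exit"),
])
print(cyclomatic_complexity(cfg))  # 5 edges - 5 nodes + 2 = 2 independent paths
```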
In this approach, the authors collect 59 static features (a feature is just a piece of information that describes the program). These features capture characteristics of the data-flow graph (DFG), the control-flow graph (CFG), and the instruction types found in LLVM's intermediate representation.
On top of the extracted features, we create a model that, given those 59 static features for a program we have not seen/fuzzed before, predicts the most promising fuzzer ensemble. For the complete source code, go here. The actual model is a random forest.
This is simple—yep, no neural networks, nothing super fancy—however it is hardly explainable.
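If you want a feel for what such a setup could look like, here is a toy sketch with scikit-learn (random placeholder data and labels, my assumptions, not the paper's pipeline): 59 static features in, a predicted fuzzer configuration out.

```python
# Toy sketch: a random forest that maps 59 static features to the most
# promising fuzzer ensemble. Data and labels here are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 59))            # 59 static features per already-fuzzed program
y = rng.integers(0, 4, size=200)     # label = index of the best fuzzer ensemble

model = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())   # sanity-check generalization

model.fit(X, y)
unseen = rng.random((1, 59))         # features of a program we have not fuzzed
print(model.predict(unseen))         # predicted fuzzer ensemble for it
```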

As mentioned before, a fuzzer consists of 3 basic building blocks; here is a list of the options used:
The scheduler has two goals:
Once we run a mutated test case, we need to decide whether to save it back to the corpus. Here we use fuzzers with ngram feedback, more specifically ngram8 and ngram4, as well as context-sensitive coverage feedback (naive-ctx), both from AFL-SENSITIVE. On top of these techniques we include the value-profile technique (originally from libFuzzer), which rewards the fuzzer when a comparison instruction sees novel values in its operands.
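To show the intuition behind ngram feedback, here is a conceptual sketch (mine, not LibAFL's actual implementation): the coverage key is derived from the last N executed edges rather than a single edge, so reaching the same edge through a different recent path still counts as novelty.

```python
# Conceptual sketch of ngram-N coverage: the feedback key is a hash over the
# last N executed edges, so the same edge reached through a different recent
# path still counts as new coverage. Not LibAFL's actual implementation.
from collections import deque

class NgramCoverage:
    def __init__(self, n: int, map_size: int = 1 << 16):
        self.window = deque(maxlen=n)     # sliding window of the last N edge ids
        self.hits = bytearray(map_size)   # coverage map indexed by the ngram hash
        self.map_size = map_size

    def on_edge(self, edge_id: int) -> bool:
        """Record an executed edge; return True if this ngram is novel."""
        self.window.append(edge_id)
        key = 0
        for e in self.window:             # fold the window into a single map index
            key = (key * 31 + e) % self.map_size
        is_new = self.hits[key] == 0
        self.hits[key] = 1
        return is_new

cov = NgramCoverage(n=4)
print(any(cov.on_edge(e) for e in [3, 17, 3, 42, 17]))  # True: novel ngrams seen
```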
This defines how the fuzzer mutates the input. Here the options are:
A naive fuzzer serves as a baseline; it employs a basic setup inspired by AFL and consists of:
We use static analysis to extract features primarily from LLVM IR.
Generic features include information like binary size, the number of basic blocks, and the average nesting level of the source code.
Here LLVM IR comes into play: we extract the frequency of each LLVM instruction, such as alloca, binary-operator, branch, call, cmp, load, and store.
For every LLVM instruction we collect the type and count of every operand.
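As a rough illustration (token-based and my own, not the paper's tooling, which would walk the IR through LLVM's APIs), here is how instruction frequencies could be counted from a textual .ll file.

```python
# Rough sketch: count instruction opcodes from a textual LLVM IR (.ll) file.
# A real pipeline would walk the IR through LLVM's APIs; this token-based
# approximation only shows the kind of frequency features being collected.
from collections import Counter

OPCODES = {"alloca", "br", "call", "icmp", "fcmp", "load", "store",
           "add", "sub", "mul", "getelementptr", "ret"}

def instruction_histogram(ll_path: str) -> Counter:
    counts = Counter()
    with open(ll_path) as f:
        for line in f:
            tokens = line.split()
            # instructions look like "%x = add i32 ..." or "store i32 ..., ..."
            for tok in tokens[:3]:
                if tok in OPCODES:
                    counts[tok] += 1
                    break
    return counts

# instruction_histogram("target.ll") -> Counter({'load': 120, 'store': 80, ...})
```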
From the CFG we extract features like the existence of cycles (loop-back edges), the length of the shortest path, and the average length of the shortest paths.
For the DDG (data dependence graph), we compute the number of nodes and edges. This information should be useful for fuzzers with more fine-grained feedback, such as the ngram fuzzer, which rewards traversing more diverse paths.
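Here is a minimal sketch of the CFG/DDG features above, again using networkx; the graphs and the entry/exit arguments are whatever your extraction produces, so treat them as assumptions.

```python
# Minimal sketch of the CFG/DDG features above; graphs, entry, and exits are
# whatever your extraction produces (here assumed to be networkx DiGraphs).
import networkx as nx

def graph_features(cfg: nx.DiGraph, ddg: nx.DiGraph, entry, exits) -> dict:
    # entry-to-exit shortest path lengths in the CFG
    lengths = [nx.shortest_path_length(cfg, entry, x)
               for x in exits if nx.has_path(cfg, entry, x)]
    return {
        "cfg_has_cycle": not nx.is_directed_acyclic_graph(cfg),  # loop-back edges
        "cfg_shortest_path": min(lengths) if lengths else 0,
        "cfg_avg_shortest_path": sum(lengths) / len(lengths) if lengths else 0,
        "ddg_nodes": ddg.number_of_nodes(),
        "ddg_edges": ddg.number_of_edges(),
    }
```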
These are API calls for memory allocation (malloc/free) and security-sensitive functions like memcpy and strcmp. Such calls should play well with the cov-accounting fuzzer, which prioritises functions containing them.
Comparison instructions are grouped by their LLVM IR operand types (for example integer, float, or vector comparisons), and we compute the frequency of each group. Comparisons tend to be a bottleneck and often block a fuzzer, so we also expect these features to be predictive for cmplog and value-profile feedback.
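A rough, regex-based sketch (mine, not the paper's extraction code) of the last two feature groups: counting security-sensitive calls and grouping comparisons by operand type directly from textual LLVM IR.

```python
# Rough regex sketch over textual LLVM IR: frequency of security-sensitive
# calls and of comparisons grouped by operand type. Illustrative only.
import re
from collections import Counter

SENSITIVE = ("malloc", "free", "memcpy", "strcmp")

def call_and_cmp_features(ll_text: str) -> Counter:
    feats = Counter()
    for line in ll_text.splitlines():
        for fn in SENSITIVE:
            if re.search(rf"call .*@{fn}\b", line):       # e.g. "call i8* @malloc(...)"
                feats[f"call_{fn}"] += 1
        m = re.search(r"\b(icmp|fcmp)\s+\w+\s+(<[^>]+>|\w+)", line)
        if m:                                             # "icmp eq i32 %a, %b"
            kind = ("vector" if m.group(2).startswith("<")
                    else "float" if m.group(1) == "fcmp" else "int")
            feats[f"cmp_{kind}"] += 1
    return feats
```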
Seeds are independent of a given program, but they have a significant impact on the fuzzer's performance, so including them in the analysis gives a more honest picture. Here we collect features like the function and branch coverage obtained by running the initial inputs.
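One way to gather such seed features, sketched below with afl-showmap (the tool choice, paths, and feeding input via stdin are my assumptions, not necessarily the authors' setup), is to replay each initial input and union the covered map entries.

```python
# Hedged sketch: replay each initial seed under afl-showmap and union the
# covered map entries. Paths, the target binary, and feeding input via stdin
# are my assumptions, not necessarily the authors' setup.
import glob
import subprocess

def seed_edge_coverage(target: str, seed_dir: str) -> int:
    covered = set()
    for seed in sorted(glob.glob(f"{seed_dir}/*")):
        with open(seed, "rb") as fh:
            subprocess.run(["afl-showmap", "-q", "-o", "/tmp/trace", "--", target],
                           stdin=fh, check=False)
        with open("/tmp/trace") as trace:       # lines look like "edge_id:hit_count"
            covered.update(line.split(":")[0] for line in trace if ":" in line)
    return len(covered)

# seed_edge_coverage("./target_instrumented", "corpus/seeds")
```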
Yep, Random Forest here, not the most fortunate choice, but whatever (I share my thoughts on it in the Notes section, since they are more philosophical and deeply reflect my biases towards certain approaches, biases shaped by education and experience).
I already gave a somewhat exhaustive view of how we can set up a fuzzer, so there are a lot of components we can mix and match (120 different combinations, for those who like exact numbers). Running an exhaustive fuzzing campaign for every combination across multiple binaries is not really feasible (yes, people train LLMs, but fuzzing scales worse). Because of that, the authors run single-technique fuzzers, where 2 components stay at the baseline and 1 is changed. This should help determine whether a technique is useful or not.
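To illustrate why (with made-up component lists that do not reproduce the paper's exact 120 combinations), compare the exhaustive grid with the single-technique setup:

```python
# Toy illustration of the combinatorial blow-up; the component lists are
# made up and do not reproduce the paper's exact 120 combinations.
from itertools import product

schedulers = ["explore", "fast", "weighted", "cov-accounting", "rand"]
feedbacks  = ["edge", "ngram4", "ngram8", "naive-ctx", "value-profile"]
mutators   = ["havoc", "cmplog", "grimoire", "mopt"]

exhaustive = list(product(schedulers, feedbacks, mutators))
print(len(exhaustive))          # 5 * 5 * 4 = 100 full campaigns (toy numbers)

# single-technique: keep the baseline and swap exactly one component
baseline = ("explore", "edge", "havoc")
single_technique = (
    [(s, baseline[1], baseline[2]) for s in schedulers]
    + [(baseline[0], f, baseline[2]) for f in feedbacks]
    + [(baseline[0], baseline[1], m) for m in mutators]
)
print(len(single_technique))    # 5 + 5 + 4 = 14 campaigns instead of 100
```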
Here I will give a summary of the findings and of which kinds of features mattered more in which scenarios.
The explore, fast, and weighted schedulers work well for larger programs, larger inputs, and higher initial coverage. Their performance is influenced by how many corpus entries the scheduler has to schedule.
Cov-accounting tends to work better when security-sensitive APIs are used (this is expected).
Rand-scheduler behaves as it should—it is not influenced by any feature.
Complex feedback mechanisms like ngram4 and ngram8 perform poorly for larger programs, but they manage to explore more paths within a function, which is exactly what ngram fuzzers are designed to do. The performance of naive-ctx degrades for larger inputs, which again makes sense since it is designed to distinguish between different call stacks. In general, these strategies give a fuzzer more complex coverage feedback than simple edge coverage, and the consensus in the community is that too much information overwhelms the fuzzer's corpus, resulting in degraded performance.
The value-profile fuzzer excels for programs with more branches (a lot of comparisons, which translate into if-else statements).
Cmplog and grimoire fuzzers work better when the initial seed coverage is low. Why is that? For cmplog, the idea is that it mutates the input based on observed comparisons: if a comparison failed and we did not take a branch, it rewrites the input so that the unexplored branch is taken. This makes cmplog better suited for programs with fuzz blockers right at the beginning. Grimoire, on the other hand, is better suited for programs that take structured input.
Here I lightly touch on topics that were not covered in depth but are still important.
Every one of the programs is compiled with LLVM's LTO (Link Time Optimization), and for each we need to write a fuzzing harness (the driver for our fuzzer). The harness has a huge impact on the fuzzer's performance; it can make some parts of the program unreachable, so we need to prune the features (decrement the frequencies) that come from those unreachable parts.
The research includes a bunch of statistical tests to determine how strongly each feature is correlated with the different fuzzer components. These tests include Cohen's d, Spearman's rank correlation, and Kendall's tau. However, I do not find this part super helpful, since many of these tests were built under assumptions that are rarely true, and history has repeatedly shown that they are flawed (see the never-ending debate about p-values). The authors argue that a simple Multiple Linear Regression (which was used by Wolf, 2022) is not suitable since it is affected by outliers. While true, this can be mitigated with a bit of feature preprocessing, where instead of raw frequencies we work with distributions. For example, for comparison operators, instead of 10 integer comparisons, 40 float comparisons, and 50 vector comparisons we would have 0.1, 0.4, and 0.5 (this is just a small example; imagine the magnitudes are way, way greater), or we could standardize the values. Adopting a more Bayesian approach would also be helpful, since features are likely correlated and we can use heavier-tailed distributions to handle outliers.
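A small sketch of the preprocessing I have in mind (numbers are made up): turn raw frequencies into proportions or z-scores before correlating them with fuzzer performance, so outliers dominate less.

```python
# Made-up numbers: normalize raw frequencies into proportions or z-scores
# before correlating them with fuzzer performance, so outliers dominate less.
import numpy as np
from scipy.stats import spearmanr, zscore

# raw comparison counts per program: [int, float, vector]
raw = np.array([
    [10, 40, 50],
    [900, 20, 5],      # outlier-ish program dominated by integer comparisons
    [30, 30, 40],
    [5, 60, 35],
])
proportions = raw / raw.sum(axis=1, keepdims=True)   # e.g. 0.1, 0.4, 0.5
standardized = zscore(raw, axis=0)                   # per-feature z-scores

coverage = np.array([0.42, 0.61, 0.38, 0.47])        # toy fuzzer performance
rho, p = spearmanr(proportions[:, 0], coverage)
print(rho, p)
```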
What is the takeaway? There is no "free lunch" in fuzzing: each technique is better suited to different types of programs, or even to different states a program is in. One promising avenue for leveraging different fuzzers is Collaborative Fuzzing, where the main idea is to run multiple fuzzers that exchange information, potentially combining the best of each individual one. Combine that with machine learning and modern AI, and we have a new and exciting field.