Future of fuzzing

Recently, we [have been seeing fuzzer crashes that are found but then somehow disappear](https://github.com/near/nayduck/issues/36). If this bug keeps happening, it makes our fuzzing infra much less useful.

We also [don’t support cargo-bolero yet](https://github.com/near/nayduck/pull/33), which makes it harder than necessary to add a new fuzz target.

This issue is for investigating the potential ways forward and choosing what to do. My personal preference leans towards using ClusterFuzz now and making nayduck interrupt its workers later.

The current situation is:
- Nayduck runs on 25 VMs (~$2-3k/mo)
- Nayduck actually uses these VMs only ~1 hr/day on average
- During the rest of the time, our homegrown fuzzer runner runs fuzzing
- Our homegrown fuzzer runner pauses and resumes fuzzing whenever a nayduck test wants to run
- We only support cargo-fuzz fuzz targets, which means adding a fuzz target is a mess
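
As a rough back-of-the-envelope check of the figures above, the sketch below splits the monthly VM bill between nayduck and fuzzing by wall-clock time. The cost range and the ~1 hr/day figure come from the list; attributing cost purely by time share is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Tuple

# Figures quoted in the list above.
MONTHLY_COST_USD = (2000, 3000)   # "~$2-3k/mo" for the VM fleet
NAYDUCK_HOURS_PER_DAY = 1.0       # "~1hr/day average"


def utilization(hours_per_day: float) -> float:
    """Fraction of each day the VMs spend running nayduck tests."""
    return hours_per_day / 24.0


def cost_split(monthly_cost: float, hours_per_day: float) -> Tuple[float, float]:
    """Split a monthly cost into (nayduck share, fuzzing share).

    Assumption: cost is attributed purely by wall-clock time share.
    """
    nayduck = monthly_cost * utilization(hours_per_day)
    return nayduck, monthly_cost - nayduck


if __name__ == "__main__":
    print(f"nayduck utilization: {utilization(NAYDUCK_HOURS_PER_DAY):.1%}")
    for cost in MONTHLY_COST_USD:
        nayduck, fuzzing = cost_split(cost, NAYDUCK_HOURS_PER_DAY)
        print(f"${cost}/mo: nayduck ~${nayduck:.0f}, fuzzing ~${fuzzing:.0f}")
```

Under that assumption, nayduck accounts for roughly 4% of the paid VM time, so nearly all of the current bill is effectively already paying for fuzzing.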

### Keep the status quo

- **Pro:** Least amount of work
- **Con:** We keep losing some fuzzer artifacts, which means missing potentially-S0 issues
- **Con:** We keep not supporting cargo-bolero fuzz targets (support could be implemented with some work, but that is no longer the status quo)
- **Con:** When we replace nayduck with something less in-house, the fuzzer will keep us on the old infrastructure until we switch to ClusterFuzz

### Keep the status quo but fix the disappearing reproducers issue

- **Pro:** Least changes to infrastructure
- **Pro/Con:** It is hard to estimate the amount of work needed to fix it. This could end up being either a pro or a con.
- **Con:** Supporting cargo-bolero fuzz targets is an additional ~2-week project on top of the fix
- **Con:** When we replace nayduck with something less in-house, the fuzzer will keep us on the old infrastructure until we switch to ClusterFuzz

### Rewriting the current fuzzer infra to be more resilient

- **Pro:** We can keep using the same machines as nayduck
- **Pro:** We take advantage of this change to start supporting cargo-bolero fuzz targets
- **Con:** 1-2 months of engineering time to implement it in Rust, building on the experience of writing the current Python runner
- **Con:** We don’t know yet how much of the disappearing-artifacts issue is due to the fuzzer runner itself vs. interactions with nayduck going wrong
- **Con:** When we replace nayduck with something less in-house, the fuzzer will keep us on the old infrastructure until we switch to ClusterFuzz

### Using ClusterFuzz

- **Pro:** Supported by Google, so we’re pretty sure it’d work well
- **Pro:** We take advantage of this change to start supporting cargo-bolero fuzz targets (camshaft/bolero#98 describes a way to do that)
- **Con:** Around 1 month of engineering time to deploy it with a proper build pipeline
- **Con:** Additional infra expenses, as it could not run alongside nayduck (around $2k/mo to get the same amount of fuzzing as on the nayduck machines)

### Using ClusterFuzz *and* making nayduck interrupt its workers when not actually using them

- **Pro:** ClusterFuzz is supported by Google, so we’re pretty sure it’d work well
- **Pro:** We take advantage of this change to start supporting cargo-bolero fuzz targets
- **Pro:** It’d probably become even less expensive than the current situation, as ClusterFuzz uses pre-emptible machines
- **Con:** The most engineering effort, as we don’t have (m?)any people knowing nayduck well enough to actually implement the interruption

### Running both nayduck and ClusterFuzz on top of nested virtualization VMs

- **Pro:** Same cost as today, plus the fuzzer would be supported by Google
- **Pro:** We take advantage of this change to start supporting cargo-bolero fuzz targets
- **Con:** The amount of work is hard to guess, as ClusterFuzz seems to attempt to create its own GCP VMs, so running it on top of a non-directly-GCP VM might be hard
- **Con:** It’s unknown how well just setting CPU prio etc. would work for both nayduck and the fuzzer, as today the fuzzer gets a full SIGSTOP when nayduck is running a test
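
To make the last point concrete, here is a minimal sketch contrasting the two mechanisms: fully suspending a process with SIGSTOP/SIGCONT versus only lowering its CPU priority. This is not the actual runner code; the helper names are hypothetical, and it assumes a Unix host where the fuzzer runs as a child process.

```python
import os
import signal
import subprocess


def pause(proc: subprocess.Popen) -> None:
    """Fully suspend the process. SIGSTOP cannot be caught or ignored,
    so the process gets no CPU time at all until it is resumed."""
    os.kill(proc.pid, signal.SIGSTOP)


def resume(proc: subprocess.Popen) -> None:
    """Let a previously stopped process continue running."""
    os.kill(proc.pid, signal.SIGCONT)


def lower_priority(proc: subprocess.Popen, niceness: int = 19) -> None:
    """Alternative to pausing: keep the process runnable but at the lowest
    scheduling priority. It still competes for CPU, memory and I/O."""
    os.setpriority(os.PRIO_PROCESS, proc.pid, niceness)
```

With `pause`, a test gets the whole machine to itself; with `lower_priority`, the fuzzer keeps running whenever the scheduler finds spare time, which is exactly the part whose effect on test timing is unknown.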

## Current status: implementing "Using ClusterFuzz" solution

Still missing:
- [x] need to build fuzzers on each new commit rather than once a day
- [ ] need to integrate the nayduck fuzzers’ corpus into clusterfuzz
- [ ] verify that the release process does document building new ondemand fuzzers
- [ ] can we move the workflows from github actions to buildkite?
