Future of fuzzing #37

@Ekleog-NEAR

Recently, we have been seeing fuzzer crashes that are found but then somehow disappear. If this keeps happening, it makes our fuzzing infra much less useful.

We also don’t support cargo-bolero yet, which makes it harder than necessary to add a new fuzz target.

This issue is for investigating the potential ways forward and choosing what to do. My personal preference leans towards using ClusterFuzz now and making nayduck interrupt its workers later.

The current situation is:

  • Nayduck runs on 25 VMs (~$2-3k/mo)
  • Nayduck actually uses these VMs only ~1 hr/day on average
  • During the rest of the time, our homegrown fuzzer runner runs fuzzing
  • Our homegrown fuzzer runner pauses and resumes fuzzing whenever a nayduck test wants to run
  • We only support cargo-fuzz fuzz targets, which means adding a fuzz target is a mess
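
For readers unfamiliar with the pause/resume mechanism described above, here is a minimal sketch of how a runner could stop and continue a fuzzer process with SIGSTOP/SIGCONT on a Unix host. It is not the actual runner code: the `PausableFuzzer` class and its method names are hypothetical, and a real runner would also need to track the fuzzer's child processes and its artifacts directory.

```python
import os
import signal
import subprocess
import sys


class PausableFuzzer:
    """Hypothetical wrapper around one fuzzer process (not the real runner)."""

    def __init__(self, argv):
        # Start the fuzzer as a direct child so we can signal and wait on it.
        self.proc = subprocess.Popen(argv)

    def pause(self):
        # SIGSTOP cannot be caught or ignored, so the fuzzer stops using CPU.
        self.proc.send_signal(signal.SIGSTOP)
        # Block until the kernel reports a state change for the child;
        # WUNTRACED makes waitpid also return for stopped children.
        _, status = os.waitpid(self.proc.pid, os.WUNTRACED)
        return os.WIFSTOPPED(status)

    def resume(self):
        # SIGCONT lets a stopped process run again.
        self.proc.send_signal(signal.SIGCONT)

    def shutdown(self):
        # A stopped process only acts on SIGTERM once it is continued,
        # so send SIGCONT as well before waiting for the exit status.
        self.proc.terminate()
        self.proc.send_signal(signal.SIGCONT)
        return self.proc.wait()
```

A runner built this way would call `pause()` when a nayduck test is scheduled on the VM and `resume()` once it finishes. Note that this only signals the direct child; a fuzzer that forks workers would need to be signalled as a process group.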

Keep the status quo

  • Pro: Least amount of work
  • Con: We keep losing some fuzzer artifacts, which means missing potentially-S0 issues
  • Con: We keep not supporting cargo-bolero fuzz targets (though we could implement cargo-bolero support with some work)
  • Con: When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz

Keep the status quo but fix the disappearing reproducers issue

  • Pro: Least changes to infrastructure
  • Pro/Con: It is hard to estimate the amount of work needed to fix it. This could end up being either a pro or a con.
  • Con: Supporting cargo-bolero fuzz targets is a ~2 weeks additional project on top of the fix
  • Con: When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz

Rewriting the current fuzzer infra to be more resilient

  • Pro: We can keep using the same machines as nayduck
  • Pro: We take advantage of this change to start supporting cargo-bolero fuzz targets
  • Con: 1-2 months of engineering time to implement it in Rust, based on the experience of writing the current Python runner
  • Con: We don’t know yet how much of the issue with disappearing artifacts is due to the infra runner vs interactions with nayduck going wrong
  • Con: When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz

Using ClusterFuzz

  • Pro: Supported by Google, so we’re pretty sure it’d work well
  • Pro: We take advantage of this change to start supporting cargo-bolero fuzz targets (camshaft/bolero#98, "Running cargo-bolero jobs on ClusterFuzz", describes a way to do that)
  • Con: Around 1 month of engineering time to deploy it with a proper build pipeline
  • Con: Additional expenses for the infra as it could not run alongside nayduck (around $2k/mo to have the same amount of fuzzing as nayduck)

Using ClusterFuzz and making nayduck interrupt its workers when not actually using them

  • Pro: ClusterFuzz is supported by Google, so we’re pretty sure it’d work well
  • Pro: We take advantage of this change to start supporting cargo-bolero fuzz targets
  • Pro: It’d probably become even less expensive than the current situation, as ClusterFuzz uses pre-emptible machines
  • Con: The most engineering effort, as we don’t have (m?)any people knowing nayduck well enough to actually implement the interruption

Running both nayduck and ClusterFuzz on top of nested virtualization VMs

  • Pro: Same cost as today, plus fuzzer would be supported by Google
  • Pro: We take advantage of this change to start supporting cargo-bolero fuzz targets
  • Con: The amount of work is hard to guess, as ClusterFuzz seems to attempt to create its own GCP VMs, so running it on top of a non-directly-GCP VM might be hard
  • Con: It’s unknown how well just setting CPU priority etc. would work for both nayduck and the fuzzer, as today the fuzzer gets a full SIGSTOP when nayduck is running a test
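
To make the "CPU priority instead of SIGSTOP" idea concrete, here is a minimal sketch of starting a fuzzer at the lowest scheduling priority on Linux. The `start_low_priority` helper is hypothetical and not part of nayduck or ClusterFuzz. Niceness only influences the CPU scheduler; it does not limit memory or disk I/O, which is part of why it is unclear whether this would be enough.

```python
import os
import subprocess
import sys

# 19 is the highest niceness (lowest CPU priority) on Linux.
LOWEST_PRIORITY = 19


def start_low_priority(argv):
    """Hypothetical helper: start a process and lower its CPU priority."""
    proc = subprocess.Popen(argv)
    # Raising niceness (lowering priority) needs no special privileges.
    os.setpriority(os.PRIO_PROCESS, proc.pid, LOWEST_PRIORITY)
    return proc
```

With this approach the fuzzer keeps running while a nayduck test executes and simply yields CPU time to it, whereas SIGSTOP guarantees the fuzzer consumes no CPU at all during the test.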

Current status: implementing the "Using ClusterFuzz" solution

Still missing:

  • Need to build fuzzers on each new commit rather than once a day
  • Need to integrate the nayduck fuzzers’ corpus into ClusterFuzz
  • Verify that the release process documents building new ondemand fuzzers
  • Can we move the workflows from GitHub Actions to Buildkite?

Metadata

Labels

C-housekeeping (Category: Refactoring, cleanups, code quality); GroomedA