This issue is for investigating the potential ways forward and choosing what to do. My personal preference leans towards using ClusterFuzz now and making nayduck interrupt its workers later.
The current situation is:
Nayduck runs on 25 VMs (~$2-3k/mo)
Nayduck actually uses these VMs only ~1 hr/day on average
During the rest of the time, our homegrown fuzzer runner runs fuzzing
Our homegrown fuzzer runner pauses and resumes fuzzing whenever a nayduck test wants to run
We only support cargo-fuzz fuzz targets, which means adding a fuzz target is a mess
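The pause/resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual runner: the `PausableFuzzer` class and its method names are hypothetical, and it assumes a POSIX system where the fuzzer runs as a direct child process that can be frozen with SIGSTOP and thawed with SIGCONT.

```python
import os
import signal
import subprocess
import sys


class PausableFuzzer:
    """Hypothetical wrapper: runs a fuzzer as a child process that can be frozen."""

    def __init__(self, argv):
        # Start the fuzzer command (any long-running process works here).
        self.proc = subprocess.Popen(argv)

    def pause(self):
        # SIGSTOP cannot be caught or ignored, so the fuzzer stops using CPU.
        os.kill(self.proc.pid, signal.SIGSTOP)

    def resume(self):
        # SIGCONT lets the stopped process continue where it left off.
        os.kill(self.proc.pid, signal.SIGCONT)

    def stop(self):
        # A stopped process only acts on SIGTERM once it has been continued.
        self.resume()
        self.proc.terminate()
        self.proc.wait()
```

A scheduler would call `pause()` when a nayduck test is about to start and `resume()` once it finishes. Anything the fuzzer holds only in memory at `stop()` time is lost, so a real runner would also need to persist artifacts before terminating.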
Keep the status quo
Pro: Least amount of work
Con: We keep losing some fuzzer artifacts, which means missing potentially-S0 issues. We can implement cargo-bolero support with some work
Con: We keep not supporting cargo-bolero fuzz targets
Con: When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz
Keep the status quo but fix the disappearing reproducers issue
Pro: Least changes to infrastructure
Pro/Con: It is hard to estimate the amount of work needed to fix it. This could end up being either a pro or a con.
Con: Supporting cargo-bolero fuzz targets is an additional ~2-week project on top of the fix
Con: When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz
Rewriting the current fuzzer infra to be more resilient
Pro: We can keep using the same machines as nayduck
Pro: We take advantage of this change to start supporting cargo-bolero fuzz targets
Con: 1-2 months of engineering time to implement it in Rust, an estimate based on the experience of writing the current Python runner
Con: We don’t know yet how much of the issue with disappearing artifacts is due to the infra runner vs interactions with nayduck going wrong
Con: When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz
Using ClusterFuzz
Pro: Supported by Google, so we’re pretty sure it’d work well
Con: Around 1 month of engineering time to deploy it with a proper build pipeline
Con: Additional expenses for the infra as it could not run alongside nayduck (around $2k/mo to have the same amount of fuzzing as nayduck)
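To make the cost trade-off concrete, here is a small back-of-the-envelope calculation using only the figures quoted in this issue (~$2-3k/mo for the nayduck VMs, ~1 hr/day of real nayduck usage, ~$2k/mo extra for ClusterFuzz). The function names are illustrative, and the numbers are the rough estimates from above, not measured billing data.

```python
HOURS_PER_DAY = 24


def utilization(busy_hours_per_day: float) -> float:
    """Fraction of the day the VMs are actually used by nayduck."""
    return busy_hours_per_day / HOURS_PER_DAY


def monthly_total(current_range, extra):
    """Add a fixed extra monthly cost to a (low, high) cost range in USD."""
    low, high = current_range
    return (low + extra, high + extra)


nayduck_cost = (2000, 3000)  # USD/month, rough estimate from this issue
clusterfuzz_extra = 2000     # USD/month, rough estimate from this issue

print(f"nayduck utilization: {utilization(1):.1%}")    # 4.2%
print(monthly_total(nayduck_cost, clusterfuzz_extra))  # (4000, 5000)
```

With these estimates, running ClusterFuzz next to an always-on nayduck fleet would put the total at roughly $4-5k/mo, while nayduck itself needs the machines for only about 4% of the day.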
Using ClusterFuzz and making nayduck interrupt its workers when not actually using them
Pro: ClusterFuzz is supported by Google, so we’re pretty sure it’d work well
Pro: We take advantage of this change to start supporting cargo-bolero fuzz targets
Pro: It’d probably become even less expensive than the current situation, as ClusterFuzz uses pre-emptible machines
Con: The most engineering effort, as we don’t have (m?)any people knowing nayduck well enough to actually implement the interruption
Running both nayduck and ClusterFuzz on top of nested virtualization VMs
Pro: Same cost as today, plus fuzzer would be supported by Google
Pro: We take advantage of this change to start supporting cargo-bolero fuzz targets
Con: The amount of work is hard to guess: ClusterFuzz seems to attempt to create its own GCP VMs, so running it on top of a VM that is not directly a GCP VM might be hard
Con: It’s unknown how well just lowering CPU priority etc. would work for both nayduck and the fuzzer, as today the fuzzer gets a full SIGSTOP while nayduck is running a test
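For comparison with the SIGSTOP approach, a priority-based setup could look something like the sketch below. It assumes a Unix host and uses a hypothetical helper, `start_low_priority`, that launches the fuzzer and then raises its niceness to the maximum, so the scheduler prefers other work (such as a nayduck test) when the CPU is contended. Unlike SIGSTOP, this does not guarantee the fuzzer stops consuming CPU, memory, or I/O, which is exactly the open question above.

```python
import os
import subprocess
import sys

LOWEST_PRIORITY = 19  # highest niceness value, i.e. lowest CPU priority


def start_low_priority(argv):
    """Start a child process and give it the lowest CPU scheduling priority."""
    proc = subprocess.Popen(argv)
    # Raising the niceness of our own child needs no special privileges.
    os.setpriority(os.PRIO_PROCESS, proc.pid, LOWEST_PRIORITY)
    return proc
```

Whether this is enough would have to be measured, for example by comparing nayduck test durations with the fuzzer niced versus fully stopped.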
Current status: implementing "Using ClusterFuzz" solution
Still missing:
need to build fuzzers on each new commit rather than once a day
need to integrate the nayduck fuzzers’ corpus into ClusterFuzz
verify that the release process does document building new ondemand fuzzers
can we move the workflows from GitHub Actions to Buildkite?
Recently, we have been seeing fuzzer crashes that are found but then somehow disappear. If this keeps happening, it makes our fuzzing infra much less useful.
We also don’t support cargo-bolero yet, which makes it harder than necessary to add a new fuzz target.