Test old Linux kernels using Qemu#280
Conversation
02f8a3c to
a9d1e70
Compare
…e cascades during testing Throttle thresholds for constrainted qemu environments
…ing ulimit directly on Linux
| // If we can not retrieve pidfd, the system does not support waitid(P_PIDFD) | ||
| return false | ||
| } | ||
| defer { try? FileDescriptor(rawValue: selfPidfd).close() } |
There was a problem hiding this comment.
Can you use closeAfter instead?
There was a problem hiding this comment.
I don't think that it would work because we would lose the errno comparison if the close of the file descriptor fails, which we want to be just a best effort here.
Thread.swift — WorkQueue.dequeue() now uses while queue.isEmpty && !isShuttingDown so spurious pthread_cond_wait wakeups (which POSIX permits, and FreeBSD exercises frequently) no longer permanently kill the worker thread. shutdown() sets isShuttingDown = true before signaling so the thread exits cleanly. Subprocess+BSD.swift — source.resume() now happens before peekIfExited(). The AtomicCounter prevents double-resume in the case where both the DispatchSource event handler and the backup peek fire for the same exit. The backup is still necessary because kqueue may not deliver NOTE_EXIT retroactively if the process was already a zombie at registration time.
Thread.swift — WorkQueue.dequeue() now uses while queue.isEmpty && !isShuttingDown with an explicit isShuttingDown flag, so spurious pthread_cond_wait wakeups (permitted by POSIX, common on FreeBSD) no longer permanently kill the single worker thread. Subprocess+BSD.swift — Replaced DispatchSource NOTE_EXIT entirely with blocking waitid(WEXITED | WNOWAIT) dispatched to DispatchQueue.global(). This closes the race at its root: the kernel holds waitid until the process exits regardless of whether it was already a zombie at call time, the zombie is preserved for reapProcess, and GCD's thread pool handles concurrent subprocess waits without serialisation.
| // clone3(CLONE_PIDFD) allocates a pidfd before exec runs. | ||
| // If exec fails we retry with the next candidate path, so | ||
| // close the pidfd here to avoid leaking it across retries. | ||
| if processDescriptor > 0 { |
There was a problem hiding this comment.
To be pedantic, I think this should be >= 0 or != 0 because technically 0 is in the range of valid file descriptors.
There was a problem hiding this comment.
Yes, this makes sense that it could be conceptually zero, so I've updated it to be >= 0.
There was a problem hiding this comment.
I had to revert this to > 0 because it was causing some of the macOS tests to fail.
There was a problem hiding this comment.
Why not update the tests?
There was a problem hiding this comment.
It's because the tests are verifying correct behaviour. Allowing zero has an impact to macOS.
There was a problem hiding this comment.
(nit) I've added a .invalidDescriptor in b11e683 which you can use here: if processDescriptor != .invalidDescriptor. But the original way is correct too.
There was a problem hiding this comment.
Thanks, I've switched it over to use this constant and it's passing on macOS now.
| // If exec fails we retry with the next candidate path, so | ||
| // close the pidfd here to avoid leaking it across retries. | ||
| if processDescriptor > 0 { | ||
| try? FileDescriptor(rawValue: processDescriptor).close() |
There was a problem hiding this comment.
We shouldn't swallow the close error here. We can convert it to SubprocessError and throw it directly.
There was a problem hiding this comment.
I've wrapped any errors thrown here in SubprocessError in the newest revision.
| // process, and if Subprocess.run() then throws, | ||
| // the caller's task group cascades cancellation to | ||
| // other live processes. | ||
| try? AsyncIO.shared.cancelAsyncIO(for: processIdentifier) |
There was a problem hiding this comment.
Should we make cancelAsyncIO recognize DEL failure as harmless here rather than using try?? Same as above I worry that we accidentally swallow unintended errors.
There was a problem hiding this comment.
I've removed this bit of Linux-specific knowledge in the comment and the attempt to discard any errors thrown here. The epoll_ctl(DEL) case is now fully encapsulated, see comments further down for more detail.
| // when processIdentifier.close() closes it, so a failed DEL here is | ||
| // never permanent. Propagating a DEL error as a monitoring failure | ||
| // would trigger onCleanup → SIGKILL against an already-dead process. | ||
| _ = epoll_ctl( |
There was a problem hiding this comment.
Should we specifically ignore DEL as opposed to ignore all errors?
There was a problem hiding this comment.
According to the Linux manual the only other errors, other than the ENOENT observed during older kernel testing, would be various flavours of programming errors, such as bad file descriptor, or already closed. I've captured the rc and errno to form an assert that only gets activated in debug/test binaries so that in testing we can gather details about the programming error, and it has no effect in production.
Assert programming errors on epoll_ctl(DEL) Rethrow file descriptor closure errors Remove Linux assumption in Configuration leading to swallowing of cancellation errors
| } | ||
|
|
||
| return (continuationList, nil) | ||
| assert(delRC == 0 || errno == ENOENT, "epoll_ctl(DEL) failed unexpectedly: \(errno)") |
There was a problem hiding this comment.
Subprocess doesn't really use assert. We should follow the old format and resume with an error to propagate the error. If we want to only ignore ENOENT we can do something like
if delRC != 0 && errno != ENOENT {
let error = SubprocessError.failedToMonitor(
withUnderlyingError: Errno(rawValue: epollErrno)
)
return (continuationList, error)
}
return (continuationList, nil)
There was a problem hiding this comment.
Thanks, I've adopted this approach.
| // clone3(CLONE_PIDFD) allocates a pidfd before exec runs. | ||
| // If exec fails we retry with the next candidate path, so | ||
| // close the pidfd here to avoid leaking it across retries. | ||
| if processDescriptor > 0 { |
There was a problem hiding this comment.
(nit) I've added a .invalidDescriptor in b11e683 which you can use here: if processDescriptor != .invalidDescriptor. But the original way is correct too.
| @@ -1 +1 @@ | |||
| 6.2.0 No newline at end of file | |||
| 6.3.2 | |||
There was a problem hiding this comment.
Just curious: what's our policy to update this file when new Swift is released? Do we always try to use the current release?
There was a problem hiding this comment.
Honestly, I end up deleting it in my working tree half the time because it causes problems when using a different version.
This is really meant for "apps" which build with one canonical version only, not libraries which build with multiple. [/rant]
There was a problem hiding this comment.
In term of the policy of swift-subprocess and the swift version file, it's up to the code owners here what to do with it.
The swift version file is there to record that at least one person, and ideally the CI too, has verified that the project can be developed on a specific toolchain version with a degree of confidence that the package builds, tests will build and pass, the language server works, and things like code formatting work without extraneous diffs. The hope is to lessen "it works at my desk" kinds of problems with a measure of reproducibility, and to encourage new developers to a package because things just work right away.
By no means does this file indicate that this is the only version of the toolchain that can be used with this package, or the only one it supports. This is just the default that gets developers working with it to a known good configuration for most things and swiftly can help them with that.
Also, it has no effect on dependent packages. This is for development of this package, even if the package is a library package like this one.
It's worth noting that this file instructs what toolchain version to use for testing with the Linux kernel versions added as GH workflows in this PR as it stands. The Qemu setup script uses of swiftly to install the swift toolchain into the Linux VM that normally won't have it. It's currently configured to use this file as it's visible, easy to update, and swiftly can use it.
Please let me know if you'd prefer the test script to always use "latest", or something that's statically coded into the test script, or the GH workflow. I think that this file is simpler, more visible, and more reproducible than those alternatives.
Fixes the following: