test(supervisor): stress tests for FD/goroutine/zombie leaks (forgejo-mcp-broker-31t)
Adds two stress tests: TestStress_NoLeaksAcross1000Cycles — spawns and reaps 1000 children in sequence, asserts FD count, goroutine count, and zombie status are all stable. TestStress_StopMidLifecycle — 200 cycles that exercise the Stop path (SIGTERM via Close+Signal) rather than relying on natural exit. Bypassed by -short for the unit-test inner loop. Notable findings: * Using the helper-process pattern at this scale was a dead end. Each spawn re-execs the test binary, which inherits the parent's open FDs and runs Go's `testing` package init. Past a few hundred cycles the inner test binaries drag delivery of EOF on their inherited stderr pipe ends, leaving drainStderr goroutines blocked in bufio.ReadString even after Wait returned. Replacing the helper with /bin/true (for quick-exit) and /bin/cat (for echo-loop) sidesteps the recursion and is closer to the production case anyway: the broker spawns forgejo-mcp, not itself. * Defensively close stdout/stderr handles in supervisor's reap goroutine after cmd.Wait returns. cmd.StderrPipe is supposed to be closed by Wait, but under load the kernel doesn't always deliver EOF promptly through Go 1.26's pidfd-based wait path; an explicit Close ensures drainStderr exits and FDs aren't held longer than needed. Tests pass under -race with FD/goroutine deltas in single digits across 1000+200 cycles, and Wait4(-1) confirms no zombie children. Closes forgejo-mcp-broker-31t. Phase 3 complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
bd68d7ed06
commit
03f67786be
3 changed files with 179 additions and 3 deletions
|
|
@ -92,6 +92,13 @@ func Start(ctx context.Context, cfg Config) (*Child, error) {
|
|||
_ = stdin.Close()
|
||||
return nil, fmt.Errorf("supervisor: start %q: %w", cfg.Cmd[0], err)
|
||||
}
|
||||
// Close-handles we may need to clean up explicitly post-Wait. Some
|
||||
// kernels don't deliver EOF to drainStderr promptly under load when
|
||||
// using cmd.StderrPipe; an explicit Close after Wait ensures
|
||||
// drainStderr exits and FDs aren't leaked across high-frequency
|
||||
// spawn/reap cycles.
|
||||
stdoutCloser, _ := stdout.(io.Closer)
|
||||
stderrCloser, _ := stderr.(io.Closer)
|
||||
|
||||
c := &Child{
|
||||
Stdin: stdin,
|
||||
|
|
@ -114,9 +121,18 @@ func Start(ctx context.Context, cfg Config) (*Child, error) {
|
|||
go drainStderr(stderr, onStderr)
|
||||
|
||||
// Reap the child. cmd.Wait must be called exactly once; do it here so
|
||||
// nobody else has to remember.
|
||||
// nobody else has to remember. After Wait, defensively close the
|
||||
// stdout/stderr handles so drainStderr definitely exits and our
|
||||
// reference count drops to zero — important under stress loads where
|
||||
// the kernel doesn't always deliver EOF promptly.
|
||||
go func() {
|
||||
err := cmd.Wait()
|
||||
if stdoutCloser != nil {
|
||||
_ = stdoutCloser.Close()
|
||||
}
|
||||
if stderrCloser != nil {
|
||||
_ = stderrCloser.Close()
|
||||
}
|
||||
c.mu.Lock()
|
||||
c.exitErr = err
|
||||
c.mu.Unlock()
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue