One of my more frequent "housekeeping" tasks lately has been checking for hung regression tests on NetBSD machines. The first symptom is that the last few lines of the console output look like this:

> 07:19:07 Build GlusterFS
> 07:19:07 ***************
> 07:19:07
> 07:19:07 + /opt/qa/build.sh

When I log in to the machine, more often than not I see a bunch of processes sitting in various versions of umount as part of the cleanup from the previous test. There will also typically be one glusterfs (client) process that should have exited in response to the umount request, but for some reason hasn't done so. Manually sending SIGKILL to that process gets things unstuck, and the next test usually runs properly after that.

This is a relatively new phenomenon. I didn't start seeing this particular syndrome until 1-2 months ago, so something must have changed to cause it. Perhaps one of our local NetBSD experts could look into it so we can still claim Gluster runs on toasters.
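For reference, the manual unblock is roughly the following (a sketch only; the exact process name to match is an assumption, and the client may show up differently in ps output):

```sh
# Find any glusterfs client process still hanging around after the umount
# and send it SIGKILL so the cleanup can finish.
for pid in $(pgrep -x glusterfs); do
    echo "killing hung glusterfs client pid $pid"
    kill -KILL "$pid"
done
```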
As much as I hate to do this, I've added a `pkill gluster` to the start of the NetBSD regression runs so we avoid this. I suspect this may be my fault. I changed the default timeout for NetBSD regression to 200 minutes the other day. If a job was killed mid regression run, it may have left the machine in an inconsistent state. I would much rather our regression.sh wrote a PID file and we kill only if the PID file pointed to a non-existent process.
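A rough sketch of that PID-file guard, as an illustration only (the PID file path is made up, and the pkill pattern just mirrors the blanket kill above):

```sh
# If a previous regression.sh recorded its PID but that process no longer
# exists, the run was killed mid-way and may have left gluster processes
# behind, so only then do the cleanup kill.
PIDFILE=/var/run/regression.pid   # hypothetical location

if [ -f "$PIDFILE" ] && ! kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    pkill gluster
fi

echo $$ > "$PIDFILE"              # record this run's PID
# ... run the regression tests ...
rm -f "$PIDFILE"                  # normal exit: remove the PID file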
Fun fact: the regression.sh on the NetBSD machines is a bit different from our normal one and already has a pkill to kill errant gluster processes.
This is a dupe of bug 1368441, which is now fixed. The issue that's causing umount to hang is being resolved in bug 1369401.

*** This bug has been marked as a duplicate of bug 1368441 ***