
Bug 1359879

Summary: Cleanup hanging on NetBSD machines
Product: [Community] GlusterFS
Component: project-infrastructure
Version: mainline
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: Jeff Darcy <jdarcy>
Assignee: Nigel Babu <nigelb>
QA Contact:
Docs Contact:
CC: bugs, gluster-infra, nigelb
Keywords: Triaged
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-23 11:31:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jeff Darcy 2016-07-25 15:29:30 UTC
One of my more frequent "housekeeping" tasks lately has been checking for hung regression tests on NetBSD machines.  The first symptom is that the last few lines of the console output look like this:

> 07:19:07 Build GlusterFS
> 07:19:07 ***************
> 07:19:07 
> 07:19:07 + /opt/qa/build.sh

When I log in to the machine, more often than not I see a bunch of processes stuck in various umount invocations, part of the cleanup from the previous test.  There will also typically be one glusterfs (client) process that should have exited in response to the umount request but for some reason hasn't.  Manually sending SIGKILL to that process gets things unstuck, and the next test usually runs properly after that.
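
For the record, the manual unsticking I do by hand is roughly the following (a sketch from memory, not an exact transcript; the process-name matches are approximate):

    # Find the glusterfs client that ignored the umount request
    ps ax | grep '[g]lusterfs'

    # Force-kill it; the queued umounts then complete on their own
    pkill -KILL -x glusterfs

    # Sanity check that nothing gluster-related is still mounted
    mount | grep gluster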

This is a relatively new phenomenon.  I didn't start seeing this particular syndrome until 1-2 months ago, so something must have changed to cause it.  Perhaps one of our local NetBSD experts could look into it so we can still claim Gluster runs on toasters.

Comment 1 Nigel Babu 2016-07-26 07:51:16 UTC
As much as I hate to do this, I've added a `pkill gluster` to the start of the NetBSD regression runs to avoid this.
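
For clarity, the workaround is just a cleanup step at the top of the NetBSD job, something like this (a sketch; the surrounding job script is simplified):

    # Kill anything gluster-related left behind by a previous, aborted run.
    # pkill exits non-zero when nothing matches, so don't let that fail the job.
    pkill gluster || true
    sleep 2    # give the processes a moment to die before the run starts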

I suspect this may be my fault. I changed the default timeout for NetBSD regression to 200 minutes the other day. If a job was killed partway through a regression run, it may have left the machine in an inconsistent state.

I would much rather have our regression.sh write a PID file, so that we only kill leftover processes when the PID file points to a process that no longer exists.
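
Something along these lines is what I have in mind (a rough sketch only; the PID file path and the cleanup command are placeholders, not what regression.sh currently does):

    PIDFILE=/var/run/regression.pid    # hypothetical location

    if [ -f "$PIDFILE" ]; then
        oldpid=$(cat "$PIDFILE")
        # Only clean up if the previous run is really gone.
        if ! kill -0 "$oldpid" 2>/dev/null; then
            pkill gluster || true
        else
            echo "regression run $oldpid still active, not touching its processes" >&2
            exit 1
        fi
    fi

    echo $$ > "$PIDFILE"               # record this run's PID
    trap 'rm -f "$PIDFILE"' EXIT       # remove the file on normal exit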

Comment 2 Nigel Babu 2016-08-05 12:22:16 UTC
Fun fact: the regression.sh on the NetBSD machines is a bit different from our normal one and already has a pkill call to kill errant gluster processes.

Comment 3 Nigel Babu 2016-08-23 11:31:21 UTC
This is a dupe of bug 1368441, which is now fixed. The issue that's causing umount to hang is being resolved in bug 1369401.

*** This bug has been marked as a duplicate of bug 1368441 ***