One of my more frequent "housekeeping" tasks lately has been checking for hung regression tests on NetBSD machines. The first symptom is that the last few lines of the console output look like this:

> 07:19:07 Build GlusterFS
> 07:19:07 ***************
> 07:19:07
> 07:19:07 + /opt/qa/build.sh

When I log in to the machine, more often than not I see a bunch of processes sitting in various versions of umount as part of the cleanup from the previous test. There will also typically be one glusterfs (client) process that should have exited in response to the umount request, but for some reason hasn't done so. Manually sending SIGKILL to that process gets things unstuck, and the next test usually runs properly after that.

This is a relatively new phenomenon. I didn't start seeing this particular syndrome until 1-2 months ago, so something must have changed to cause it. Perhaps one of our local NetBSD experts could look into it so we can still claim Gluster runs on toasters.
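For reference, the manual unblock is roughly the following (a sketch only; the exact process name to match is an assumption, and the client may show up differently in ps output):

```sh
# Find any glusterfs client process still hanging around after the umount
# and send it SIGKILL so the cleanup can finish.
for pid in $(pgrep -x glusterfs); do
    echo "killing hung glusterfs client pid $pid"
    kill -KILL "$pid"
done
```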
As much as I hate to do this, I've added a `pkill gluster` to the start of the NetBSD regression runs so we avoid this. I suspect this may be my fault. I changed the default timeout for NetBSD regression to 200 minutes the other day. If a job was killed mid regression run, it may have left the machine in an inconsistent state. I would much rather our regression.sh wrote a PID file and we kill only if the PID file pointed to a non-existent process.
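A rough sketch of that PID-file guard, as an illustration only (the PID file path is made up, and the pkill pattern just mirrors the blanket kill above):

```sh
# If a previous regression.sh recorded its PID but that process no longer
# exists, the run was killed mid-way and may have left gluster processes
# behind, so only then do the cleanup kill.
PIDFILE=/var/run/regression.pid   # hypothetical location

if [ -f "$PIDFILE" ] && ! kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    pkill gluster
fi

echo $$ > "$PIDFILE"              # record this run's PID
# ... run the regression tests ...
rm -f "$PIDFILE"                  # normal exit: remove the PID file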
Fun fact: the regression.sh on the NetBSD machines is a bit different from our normal one and already has a pkill to kill errant gluster processes.
This is a dupe of bug 1368441, which is now fixed. The issue that's causing umount to hang is being resolved in bug 1369401.

*** This bug has been marked as a duplicate of bug 1368441 ***