Occasionally, the NetBSD machines hang when trying to umount. The only way to recover is a hard reboot or a `/sbin/reboot -n`. We need to figure out why we're hitting this and, in the meantime, detect when it happens, fail sooner, and make sure we're notified.

FAILURE on http://build.gluster.org/job/netbsd7-regression/47/consoleText nbslave7g.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/42/consoleText nbslave79.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/39/consoleText nbslave7c.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/36/consoleText nbslave7h.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/35/consoleText nbslave7c.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/32/consoleText nbslave74.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/30/consoleText nbslave74.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/29/consoleText nbslave7h.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/28/consoleText nbslave71.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/27/consoleText nbslave7j.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/26/consoleText nbslave79.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/16/consoleText nbslave7h.cloud.gluster.org
FAILURE on http://build.gluster.org/job/netbsd7-regression/19/consoleText nbslave74.cloud.gluster.org
In bug 1359879, forcibly killing a glusterfs (client) process seemed to get things unstuck without a reboot. Is that not so for these hangs?
I should try that next time.
Interesting. So we do a `kill -9` in our test cleanup scripts. I did a spot check of the NetBSD machines, and a bunch of them had hung umount processes. But a `pkill gluster` fixed all but one. I'm going to add two things to the start of every job: a `ps ax | grep gluster` and a `pkill gluster`. I want to see how often processes are left over and how often we end up killing them.
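For reference, the two additions could look roughly like the sketch below. This is not the actual job configuration: the function names `filter_gluster` and `cleanup_gluster` are made up for illustration, and the log wording is an assumption. It only shows the idea of logging leftover gluster processes before killing them.

```shell
#!/bin/sh
# Sketch of a pre-job cleanup step. Function names are hypothetical.

# Keep only the lines of `ps ax` output that mention gluster.
# The [g] bracket stops grep from matching its own command line.
filter_gluster() {
    grep '[g]luster' || true
}

# Report leftover gluster processes, then kill them.
cleanup_gluster() {
    leftover=$(ps ax | filter_gluster)
    if [ -z "$leftover" ]; then
        echo "leftover gluster processes: 0"
        return 0
    fi
    count=$(printf '%s\n' "$leftover" | grep -c .)
    echo "leftover gluster processes: $count"
    printf '%s\n' "$leftover"
    # pkill matches the pattern against process names, so this covers
    # glusterd, glusterfs and glusterfsd alike.
    pkill gluster || true
}
```

Calling `cleanup_gluster` as the first step of a job would leave a line in the console log each time, which should make it possible to count how often processes were left over.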
*** Bug 1366168 has been marked as a duplicate of this bug. ***
The issue of new tests failing has now been fixed by adding the `pkill gluster`. I've filed bug 1369401 to track what's causing umount to hang in the first place.
*** Bug 1359879 has been marked as a duplicate of this bug. ***