Description of problem:

See: https://build.gluster.org/computer/builder106.cloud.gluster.org/builds

Failures over the last 2-3 days are all on the same test, mount-nfs-auth.t (at times other runs have failed earlier, possibly due to valid issues).

The logs from these runs are very sparse: there is just a cli.log, and /var/log/messages has the following lines that look interesting but state nothing more:

Mar 10 01:15:34 builder106 kernel: nfs: server builder106.cloud.gluster.org not responding, timed out
Mar 10 01:15:34 builder106 kernel: nfs: server builder106.cloud.gluster.org not responding, timed out

As nature would have it, the 4.0 regression fix was scheduled on this builder 4 times in a row! Hence I have taken the builder offline: https://build.gluster.org//computer/builder106.cloud.gluster.org/

Other interesting/possibly unwanted facts:
- First job that started these failures: https://build.gluster.org/job/centos7-regression/234/changes
This is now fixed. For future reference, it needed a restart.
So, this happened again: check the builds from centos regression #1977 till #1991 on https://build.gluster.org/computer/builder106.cloud.gluster.org/builds

In #1977 the dbench test within tests/bugs/rpc/bug-847624.t failed because the (gluster)nfs server lost its connection to the client and the ping timed out, leaving behind a stale NFS mount on the client. This stale mount then prevented subsequent tests from running successfully. The one run that succeeded in between (https://build.gluster.org/job/centos7-regression/1978/) was a doc-only change, so no actual tests were run.

@misc rebooted the node, so things may have cleared up. I am checking the logs from #1977 to see if we can root-cause the NFS client disconnection, to provide more information for resolving the problem in the future.
From the logs of run #1977 it can be seen that the gluster NFS server lost its connection (ping timed out) to the singleton brick, but the brick logs give no indication why. The end result is a stale NFS mount. @nigel, is there a way to get more information from the node itself now that it is offline?
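As a side note, a stale NFS mount like the one described above can usually be detected before it wedges later test runs, because operations on a stale handle hang rather than fail. The sketch below is a hypothetical cleanup helper (it is not part of the Gluster regression harness, and the mount point path is only an example): it bounds a `stat` on the mount point with `timeout`, and if the mount does not respond it forces a lazy unmount.

```shell
#!/bin/sh
# Hypothetical helper, not part of the test suite: check whether a mount
# point is responsive; if it is stale (stat hangs), force a lazy unmount.
check_and_clear_stale_mount() {
    mnt="$1"
    # stat on a stale NFS handle blocks indefinitely, so bound it.
    if timeout 5 stat -t "$mnt" >/dev/null 2>&1; then
        echo "healthy: $mnt"
    else
        echo "stale: $mnt -- forcing lazy unmount"
        # -f: force unmount (NFS), -l: lazy detach so the path is freed
        # even while processes still hold it open.
        umount -f -l "$mnt"
    fi
}

# Demo on a local directory, which is always responsive.
check_and_clear_stale_mount /tmp
```

Running something like this between test cases (or in a pre-job cleanup step) would at least stop one failed NFS test from poisoning the rest of the run on that builder.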
Oops, I did a restart before I saw the update to this bug.
Going to close this for now, but we'll take machines offline to debug next time before a restart.