Description of problem:

See: https://build.gluster.org/computer/builder106.cloud.gluster.org/builds

Failures over the last 2-3 days are all on the same test, mount-nfs-auth.t (at times other runs have failed earlier, possibly due to valid issues).

The logs from these runs are very sparse: there is just a cli.log, and /var/log/messages has the following lines that look interesting but state nothing more:

Mar 10 01:15:34 builder106 kernel: nfs: server builder106.cloud.gluster.org not responding, timed out
Mar 10 01:15:34 builder106 kernel: nfs: server builder106.cloud.gluster.org not responding, timed out

As nature would have it, the 4.0 regression fix was scheduled on this builder 4 times in a row! Hence I have taken the builder offline: https://build.gluster.org//computer/builder106.cloud.gluster.org/

Other interesting/possibly unwanted facts:
- First job that started these failures: https://build.gluster.org/job/centos7-regression/234/changes
This is now fixed. For future reference, it needed a restart.
So, this happened again: check the builds from centos regression #1977 till #1991 on https://build.gluster.org/computer/builder106.cloud.gluster.org/builds

In #1977 the dbench test within tests/bugs/rpc/bug-847624.t failed because the (gluster)nfs server lost its connection to the client and the ping timed out, leaving behind a stale NFS mount on the client. This stale mount then prevented subsequent tests from running successfully. The one run that succeeded in between (https://build.gluster.org/job/centos7-regression/1978/) was a doc-only change, so no actual tests were run.

@misc rebooted the node, so things may have cleared up. I am checking the logs from #1977 to see if we can root-cause the NFS client disconnection, to provide more information for resolving the problem in the future.
From the logs of run #1977 it can be seen that the gluster NFS server lost its connection (ping timed out) to the singleton brick, but the brick logs give no indication why. The end result is a stale NFS mount. @nigel, is there a way to get more information from the node itself now that it is offline?
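As a side note, a stale NFS mount like the one described above can usually be detected before it wedges later test runs, because operations on a stale handle hang rather than fail. The sketch below is a hypothetical cleanup helper (it is not part of the Gluster regression harness, and the mount point path is only an example): it bounds a `stat` on the mount point with `timeout`, and if the mount does not respond it forces a lazy unmount.

```shell
#!/bin/sh
# Hypothetical helper, not part of the test suite: check whether a mount
# point is responsive; if it is stale (stat hangs), force a lazy unmount.
check_and_clear_stale_mount() {
    mnt="$1"
    # stat on a stale NFS handle blocks indefinitely, so bound it.
    if timeout 5 stat -t "$mnt" >/dev/null 2>&1; then
        echo "healthy: $mnt"
    else
        echo "stale: $mnt -- forcing lazy unmount"
        # -f: force unmount (NFS), -l: lazy detach so the path is freed
        # even while processes still hold it open.
        umount -f -l "$mnt"
    fi
}

# Demo on a local directory, which is always responsive.
check_and_clear_stale_mount /tmp
```

Running something like this between test cases (or in a pre-job cleanup step) would at least stop one failed NFS test from poisoning the rest of the run on that builder.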
Oops, I did a restart before I saw the update to this bug.
Going to close this for now, but we'll take machines offline to debug next time before a restart.