1700098 – NFS tests are failing in baremetal 4.1 clusters

Summary: NFS tests are failing in baremetal 4.1 clusters

Keywords:
Status:	CLOSED DUPLICATE of bug 1700076
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Bradley Childs
QA Contact:	Wenqi He
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-15 20:39 UTC by Hemant Kumar
Modified:	2019-04-16 15:22 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-04-16 15:22:39 UTC
Target Upstream Version:
Embargoed:

Description Hemant Kumar 2019-04-15 20:39:50 UTC

It looks like some of the NFS-specific tests are failing on baremetal clusters.

For example - https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1621/pull-ci-openshift-installer-master-e2e-metal/4/?log#log

I debugged this and found that the NFS server is unable to come up when started by e2e:

Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: attempt to initialize umh client tracking in a container ignored.
Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: attempt to initialize legacy client tracking in a container ignored.
Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: Unable to initialize client recovery tracking! (-22)
Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: starting 10-second grace period (net f0000498)


The closest reference we have for this is https://github.com/kubernetes/kubernetes/issues/33447

In that case the cause was that the node did not have the NFS tools installed.

Comment 1 Jan Safranek 2019-04-16 13:10:28 UTC

Many NFS tests passed in this test run, so the problem is not missing NFS utils. The kernel logs listed above are IMO harmless. I ran the tests manually on a bare metal cluster and they passed.

In addition, the test creates a PVC + PV and checks that they are bound together. NFS is not involved at that point; it would only be used later, once they were Bound.

The PV and PVC are (from the test teardown):
Apr 15 23:10:25.120: INFO: Deleting PersistentVolumeClaim "pvc-dvmt7"
Apr 15 23:10:25.142: INFO: Deleting PersistentVolume "nfs-8wxmm"

The controller-manager logs show that the PVC can't find its PV:

I0415 23:08:42.998947       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"e2e-tests-pv-4tk65", Name:"pvc-dvmt7", UID:"3b64829e-5fd3-11e9-b3fc-0cc47a18ab96", APIVersion:"v1", ResourceVersion:"61900", FieldPath:""}): type: 'Normal' reason: 'FailedBinding' no persistent volumes available for this claim and no storage class is set

The PV cannot be found because it got bound to a PVC from a different test:

I0415 23:07:28.009082       1 pv_controller.go:874] claim "e2e-tests-statefulset-946xl/datadir-ss-0" bound to volume "nfs-8wxmm"
I0415 23:07:28.012629       1 pv_controller.go:824] volume "nfs-8wxmm" entered phase "Bound"
I0415 23:07:28.012648       1 pv_controller.go:963] volume "nfs-8wxmm" bound to claim "e2e-tests-statefulset-946xl/datadir-ss-0"

The test apparently races with the StatefulSet test "[It] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]". That test expects a default storage class to exist, so that its PV would be provisioned dynamically. Since there is none on this cluster, its PVC steals the PV from the NFS test instead.
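To make the stealing concrete, here is a minimal Go sketch of the matching behaviour described above. It is not the PV controller's code: the `volume` and `claim` types and the `bind` and `effectiveClass` helpers are hypothetical, and the sketch assumes that a claim with no class set is treated as requesting the empty class "", which is the class of the pre-created NFS PV. The object names are taken from the logs above.

```go
package main

import "fmt"

// volume and claim are hypothetical, heavily simplified stand-ins for
// PersistentVolume and PersistentVolumeClaim.
type volume struct {
	name    string
	class   string // storage class name; "" means no class
	boundTo string // name of the bound claim; "" while still available
}

type claim struct {
	name  string
	class *string // nil means the claim did not set a storage class
}

// effectiveClass treats an unset class as "" (assumption of this sketch).
func effectiveClass(c claim) string {
	if c.class == nil {
		return ""
	}
	return *c.class
}

// bind binds the claim to the first available volume whose class matches
// and returns that volume's name, or "" if no volume is available.
func bind(c claim, vols []*volume) string {
	for _, v := range vols {
		if v.boundTo == "" && v.class == effectiveClass(c) {
			v.boundTo = c.name
			return v.name
		}
	}
	return ""
}

func main() {
	empty := ""
	// The NFS test pre-creates one PV without a storage class.
	vols := []*volume{{name: "nfs-8wxmm"}}

	// The StatefulSet claim sets no class, and the cluster has no default class.
	statefulSetClaim := claim{name: "datadir-ss-0"}
	// The NFS claim explicitly asks for the empty class.
	nfsClaim := claim{name: "pvc-dvmt7", class: &empty}

	// The StatefulSet claim is processed first and takes the NFS PV.
	fmt.Printf("%s -> %q\n", statefulSetClaim.name, bind(statefulSetClaim, vols))
	// Nothing is left for the NFS claim.
	fmt.Printf("%s -> %q\n", nfsClaim.name, bind(nfsClaim, vols))
}
```

In this sketch the StatefulSet claim is bound to "nfs-8wxmm" and the NFS claim gets nothing, which corresponds to the FailedBinding event shown earlier.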


Even if there were a default storage class + dynamic provisioning, there would still be a (short) window of opportunity:

1. NFS test creates PV
2. StatefulSet test creates PVC
3. PV controller sees available PV from 1. and binds it to PVC from 2. instead of dynamic provisioning of a new PV for StatefulSet test.

These two tests should use a different storage class.

Comment 2 Jan Safranek 2019-04-16 15:22:39 UTC

> Even if there were a default storage class + dynamic provisioning, there would still be a (short) window of opportunity:
>
> 1. NFS test creates PV
> 2. StatefulSet test creates PVC
> 3. PV controller sees available PV from 1. and binds it to PVC from 2. instead of dynamic provisioning of a new PV for StatefulSet test.
>
> These two tests should use a different storage class.


False alarm, they *do* use a different storage class (when there is one). NFS PV tests explicitly set StorageClassName: "" in their PVCs, so they don't get the default class assigned by our default storage class admission plugin:
https://github.com/kubernetes/kubernetes/blob/252cabf155308b43c8c612f482855dc0cfa2e29c/test/e2e/storage/persistent_volumes.go#L140
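The difference between an unset class and an explicit "" can be sketched in Go as follows. This is only an illustration of the defaulting rule described above, not the admission plugin's code: `applyDefaultClass` and `describe` are hypothetical helpers, and "standard" is a made-up default class name.

```go
package main

import "fmt"

// applyDefaultClass is a hypothetical model of default storage class
// admission: only a claim with no class set at all (nil) receives the
// cluster default. An explicit "" is left untouched.
func applyDefaultClass(requested *string, defaultClass string) *string {
	if requested == nil && defaultClass != "" {
		return &defaultClass
	}
	return requested
}

// describe renders a class pointer for printing.
func describe(class *string) string {
	if class == nil {
		return "<unset>"
	}
	return fmt.Sprintf("%q", *class)
}

func main() {
	empty := ""

	// With a default class, the StatefulSet PVC is defaulted to it.
	fmt.Println("StatefulSet PVC:", describe(applyDefaultClass(nil, "standard")))
	// The NFS PVC keeps its explicit "" and can only match class-less PVs.
	fmt.Println("NFS PVC:", describe(applyDefaultClass(&empty, "standard")))
	// Without a default class, the StatefulSet PVC stays unset.
	fmt.Println("StatefulSet PVC, no default class:", describe(applyDefaultClass(nil, "")))
}
```

With a default class present, the two PVCs end up in different classes and cannot compete for the same PV. Without one, the StatefulSet PVC stays unset, which is the situation on the baremetal cluster.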

I think that skipping the tests that need a default storage class (bug #1700076) would also be enough to fix these NFS flakes.

*** This bug has been marked as a duplicate of bug 1700076 ***
