Bug 1700098

Summary: NFS tests are failing in baremetal 4.1 clusters
Product: OpenShift Container Platform
Reporter: Hemant Kumar <hekumar>
Component: Storage
Assignee: Bradley Childs <bchilds>
Status: CLOSED DUPLICATE
QA Contact: Wenqi He <wehe>
Severity: unspecified
Priority: unspecified
Version: 4.1.0
CC: aos-bugs, aos-storage-staff, jsafrane
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-04-16 15:22:39 UTC
Type: Bug

Description Hemant Kumar 2019-04-15 20:39:50 UTC
It looks like some of the NFS-specific tests are failing on baremetal clusters.

For example - https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1621/pull-ci-openshift-installer-master-e2e-metal/4/?log#log

I debugged this and found that the NFS server is unable to come up when started by e2e:

Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: attempt to initialize umh client tracking in a container ignored.
Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: attempt to initialize legacy client tracking in a container ignored.
Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: Unable to initialize client recovery tracking! (-22)
Apr 15 17:44:56 worker-0.ci-op-w6lfbtli-d3e37.origin-ci-int-aws.dev.rhcloud.com kernel: NFSD: starting 10-second grace period (net f0000498)


The closest reference we have for this is https://github.com/kubernetes/kubernetes/issues/33447, which happened because the node did not have the NFS tools.

Comment 1 Jan Safranek 2019-04-16 13:10:28 UTC
There are many NFS tests that passed in the test run, so it's not about missing NFS utils. The kernel logs listed above are IMO harmless. I ran the tests manually on a bare metal cluster and they passed.

In addition, the test creates a PVC + PV and checks that they're bound together. NFS is not involved here yet; it would only be used later, if they were Bound.

PV and PVC are (from test teardown):
Apr 15 23:10:25.120: INFO: Deleting PersistentVolumeClaim "pvc-dvmt7"
Apr 15 23:10:25.142: INFO: Deleting PersistentVolume "nfs-8wxmm"

The controller-manager logs show that the PVC can't find its PV:

I0415 23:08:42.998947       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"e2e-tests-pv-4tk65", Name:"pvc-dvmt7", UID:"3b64829e-5fd3-11e9-b3fc-0cc47a18ab96", APIVersion:"v1", ResourceVersion:"61900", FieldPath:""}): type: 'Normal' reason: 'FailedBinding' no persistent volumes available for this claim and no storage class is set

And the PV cannot be found, because it got bound to a PVC from a different test:

I0415 23:07:28.009082       1 pv_controller.go:874] claim "e2e-tests-statefulset-946xl/datadir-ss-0" bound to volume "nfs-8wxmm"
I0415 23:07:28.012629       1 pv_controller.go:824] volume "nfs-8wxmm" entered phase "Bound"
I0415 23:07:28.012648       1 pv_controller.go:963] volume "nfs-8wxmm" bound to claim "e2e-tests-statefulset-946xl/datadir-ss-0"

The test apparently races with the StatefulSet test "[It] should perform rolling updates and roll backs of template modifications with PVCs [Suite:openshift/conformance/parallel] [Suite:k8s]". That one expects that there is a default storage class and that it would get its PV provisioned dynamically. Its PVC steals the PV from the other test instead.
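
The sequence seen in the logs above can be sketched in a few lines of Go. This is only an illustration under loud assumptions: `Volume`, `Claim` and `bind` are hypothetical stand-ins, not the real PV controller code, and the matching rule is reduced to "first Available volume with the same storage class". With no default storage class on the cluster, both claims end up with an empty class, so whichever claim is processed first takes the NFS PV.

```go
package main

import "fmt"

// Volume is a simplified, hypothetical stand-in for a PersistentVolume.
type Volume struct {
	Name    string
	Class   string // "" means no storage class
	BoundTo string // "" means the volume is still Available
}

// Claim is a simplified, hypothetical stand-in for a PersistentVolumeClaim.
type Claim struct {
	Name  string
	Class string // "" means no storage class (no default class on the cluster)
}

// bind binds the claim to the first Available volume with a matching class.
// It returns the volume name and whether a volume was found.
func bind(vols []*Volume, c Claim) (string, bool) {
	for _, v := range vols {
		if v.BoundTo == "" && v.Class == c.Class {
			v.BoundTo = c.Name
			return v.Name, true
		}
	}
	return "", false
}

func main() {
	// The NFS test has created its PV (no storage class).
	vols := []*Volume{{Name: "nfs-8wxmm"}}

	// The StatefulSet claim is processed first and takes the NFS PV.
	pv, ok := bind(vols, Claim{Name: "e2e-tests-statefulset-946xl/datadir-ss-0"})
	fmt.Println("statefulset claim:", pv, ok)

	// The NFS test's own claim now finds no Available volume.
	pv, ok = bind(vols, Claim{Name: "e2e-tests-pv-4tk65/pvc-dvmt7"})
	fmt.Println("nfs claim:", pv, ok)
}
```

In this sketch the second call reports that no volume was found, which corresponds to the FailedBinding event for pvc-dvmt7.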


Even if there were a default storage class + dynamic provisioning, there would still be a (short) window of opportunity:

1. NFS test creates PV
2. StatefulSet test creates PVC
3. PV controller sees the available PV from 1. and binds it to the PVC from 2. instead of dynamically provisioning a new PV for the StatefulSet test.

These two tests should use a different storage class.

Comment 2 Jan Safranek 2019-04-16 15:22:39 UTC
> Even if there were a default storage class + dynamic provisioning, there would still be a (short) window of opportunity:
>
> 1. NFS test creates PV
> 2. StatefulSet test creates PVC
> 3. PV controller sees the available PV from 1. and binds it to the PVC from 2. instead of dynamically provisioning a new PV for the StatefulSet test.
>
> These two tests should use a different storage class.


False alarm, they *do* use a different storage class (when there is one). The NFS PV tests explicitly set StorageClassName: "" in their PVCs, so they don't get the default one assigned by our default storage class admission plugin:
https://github.com/kubernetes/kubernetes/blob/252cabf155308b43c8c612f482855dc0cfa2e29c/test/e2e/storage/persistent_volumes.go#L140
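
The difference between an unset class and an explicitly empty one can be sketched as follows. This is a minimal illustration, not the admission plugin's code: `applyDefault` and `describe` are hypothetical helpers, "standard" is an assumed default class name, and the class is modelled as a `*string` so that nil (unset) and "" (explicitly empty) can be told apart.

```go
package main

import "fmt"

// applyDefault is a hypothetical model of the defaulting rule: only a claim
// whose class is unset (nil) receives the cluster default, and only when a
// default class exists (def != ""). An explicit "" is left untouched.
func applyDefault(class *string, def string) *string {
	if class == nil && def != "" {
		return &def
	}
	return class
}

// describe renders a class pointer for printing.
func describe(class *string) string {
	if class == nil {
		return "<unset>"
	}
	return fmt.Sprintf("%q", *class)
}

func main() {
	empty := ""

	// StatefulSet-style claim: class unset, cluster has a default class.
	fmt.Println(describe(applyDefault(nil, "standard")))

	// NFS PV test claim: class explicitly "", so no default is assigned.
	fmt.Println(describe(applyDefault(&empty, "standard")))

	// Class unset and no default class on the cluster: it stays unset.
	fmt.Println(describe(applyDefault(nil, "")))
}
```

Only the last case, a cluster without a default storage class, leaves the StatefulSet-style claim without a class, which is the situation in which it could take the NFS PV.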

I think that skipping the tests that need a default storage class (bug #1700076) would also be enough to fix these NFS flakes.

*** This bug has been marked as a duplicate of bug 1700076 ***