Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1627547

Summary: On GCP, e2e tests fail because PVs aren't getting mounted and attached in a reasonable time
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Status: CLOSED WORKSFORME
QA Contact: Liang Xia <lxia>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, aos-storage-staff, bchilds, mawong
Target Milestone: ---
Target Release: 3.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-04 17:27:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2018-09-10 20:26:15 UTC
Both cluster-monitoring-operator (which uses PVs from a StatefulSet) and some of the StatefulSet tests are failing because the GCP PVs time out mounting (taking > 15 minutes in some cases).

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/9972/pull-ci-openshift-openshift-ansible-e2e-gcp/2255#sig-apps-statefulset-k8sio-basic-statefulset-functionality-statefulsetbasic-should-not-deadlock-when-a-pods-predecessor-fails-suiteopenshiftconformanceparallel-suitek8s

is one such failure.  This seems new - we weren't seeing this prior to a few weeks ago.

Flake rate is 1/8 but seems to be getting worse.

Comment 1 Clayton Coleman 2018-09-10 20:27:10 UTC
https://github.com/openshift/cluster-monitoring-operator/issues/85 is the issue where we found that Prometheus was never able to mount its PVC.

Comment 2 Clayton Coleman 2018-09-10 20:28:56 UTC
There may be overlap with https://bugzilla.redhat.com/show_bug.cgi?id=1624961 - either way, we need to track this down.

Comment 3 Hemant Kumar 2018-09-11 19:07:28 UTC
I think these tests may be sensitive to the fact that when a pod arrives with unbound PVCs, the attach/detach controller does not process it immediately; it is instead picked up by a separate populator loop that runs much more slowly (every 3 minutes) than the default 100ms reconciler. I think we should backport https://github.com/kubernetes/kubernetes/pull/66863, which may help with this somewhat, but I will look more.

Comment 4 Hemant Kumar 2018-09-11 21:04:37 UTC
Looking at the timings from the logs, it appears that my theory about the volume not being attached in time is not correct.

Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.881908   10912 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002")
Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.892681   10912 operation_generator.go:1184] Controller attach succeeded for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002") device path: "/dev/disk/by-id/google-kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002"
Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.982483   10912 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002")
Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.982709   10912 operation_generator.go:489] MountVolume.WaitForAttach entering for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002") DevicePath "/dev/disk/by-id/google-kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002"
Sep 10 18:41:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: E0910 18:41:47.983176   10912 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002\"" failed. No retries permitted until 2018-09-10 18:41:48.483130109 +0000 UTC m=+1517.953389124 (durationBeforeRetry 500ms). Error: "MountVolume.WaitForAttach failed for volume \"pvc-c0b335df-b527-11e8-85b6-42010a8e0002\" (UniqueName: \"kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002\") pod \"ss-0\" (UID: \"c0b56e8a-b527-11e8-85b6-42010a8e0002\") : Could not find attached GCE PD \"kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002\". Timeout waiting for mount paths to be created."
Sep 10 18:41:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:41:47.986040   10912 reconciler.go:301] Volume detached for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") on node "ci-op-y4fqx6lz-1827f-ig-n-bgfc" DevicePath "/dev/disk/by-id/google-kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002"


It appears that the volume did get attached in time (18:31:xx), but the device link that is typically created by udev did not appear in time. Discussed with Michelle (msau42), and there is a possibility of a race condition inside udev which can cause an iSCSI attach to be missed. I still need to isolate and confirm whether that is indeed the case, but it is looking like it.

Comment 7 Hemant Kumar 2018-09-14 02:46:01 UTC
So, I ran this e2e test for more than 2 hours today and didn't see the test in question fail. I know that the race condition in udev is still real. After discussing with msau42, we ended up filing https://github.com/kubernetes/kubernetes/issues/68528

I will look more closely tomorrow.

Comment 8 Clayton Coleman 2018-09-14 20:19:54 UTC
This is now failing in more than 1 in 3 runs - master is almost unable to merge changes, and the 3.11 queues aren't much better.

Crosslinking with https://bugzilla.redhat.com/show_bug.cgi?id=1624961 in case there is some overlap.

Comment 9 Hemant Kumar 2018-09-14 20:33:44 UTC
Thanks. I am still investigating this. But just to clarify - the flake is not an attach issue. The problem is that the volume is getting attached, but either it is not showing up on the node, or udev events are not getting triggered and the creation of the /dev/disk/by-id symbolic links is being missed (from the logs we can't tell one from the other).

Comment 10 Matthew Wong 2018-09-14 21:11:35 UTC
Looks like nodes are entering some kind of "bad" state; it is node-specific, not test-specific. For example, in all the tests that fail in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/9972/pull-ci-openshift-openshift-ansible-e2e-gcp/2264/ , mounts fail on the node ci-op-1jt1t4g1-1827f-ig-n-fdt1. For https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/9972/pull-ci-openshift-openshift-ansible-e2e-gcp/2255#sig-apps-statefulset-k8sio-basic-statefulset-functionality-statefulsetbasic-should-not-deadlock-when-a-pods-predecessor-fails-suiteopenshiftconformanceparallel-suitek8s , mounts fail on the node ci-op-y4fqx6lz-1827f-ig-n-bgfc. Is there maybe some GCE equivalent of the AWS "stuck attaching" instance state that we are hitting? As you said, the disks are getting attached, just not showing up at /dev/disk/by-id...
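One way to narrow down which failure mode a node is in (disk never surfaced to the kernel vs. disk present but udev symlink missing) would be to check the kernel's view and udev's view separately. A rough sketch; the directory layout, the `device/serial` file, and all names here are assumptions for illustration, not a verified GCE PD layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// classifyAttach distinguishes the two failure modes discussed above: the
// disk never appeared on the node at all, versus the kernel sees the disk
// but udev never created the /dev/disk/by-id symlink. sysBlockDir and
// byIDDir would be /sys/block and /dev/disk/by-id on a real node.
func classifyAttach(sysBlockDir, byIDDir, pdName string) string {
	// udev's view: does the by-id symlink exist?
	if _, err := os.Lstat(filepath.Join(byIDDir, "google-"+pdName)); err == nil {
		return "attached and symlinked"
	}
	// Kernel's view: scan block devices for a matching serial. The exact
	// sysfs file that carries the PD name is an assumption here.
	entries, err := os.ReadDir(sysBlockDir)
	if err == nil {
		for _, e := range entries {
			data, err := os.ReadFile(filepath.Join(sysBlockDir, e.Name(), "device", "serial"))
			if err == nil && strings.TrimSpace(string(data)) == pdName {
				return "disk present, udev symlink missing"
			}
		}
	}
	return "disk not visible on node"
}

func main() {
	fmt.Println(classifyAttach("/sys/block", "/dev/disk/by-id",
		"kubernetes-dynamic-pvc-example"))
}
```

Running something like this on a "bad" node would tell us whether to chase the GCE attach path or the udev event race.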

Comment 11 Hemant Kumar 2018-09-19 21:31:47 UTC
Okay, so I made a debug PR - https://github.com/openshift/origin/pull/21022 - and tried to reproduce the issue. This PR ran the GCP e2e tests more than 6 times, and I did not see the test in question flake even once.

We do know about udev events being missed on the node, but until we can reproduce that and confirm it, it seems wrong to make a fix based on a hunch.

Comment 12 Hemant Kumar 2019-01-04 17:27:23 UTC
Closing this. If it happens again, we will try to reproduce it.