Bug 1627547
| Summary: | On GCP, e2e tests fail because PVs aren't getting mounted and attached in a reasonable time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Storage | Assignee: | Hemant Kumar <hekumar> |
| Status: | CLOSED WORKSFORME | QA Contact: | Liang Xia <lxia> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | aos-bugs, aos-storage-staff, bchilds, mawong |
| Target Milestone: | --- | | |
| Target Release: | 3.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-04 17:27:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2018-09-10 20:26:15 UTC
https://github.com/openshift/cluster-monitoring-operator/issues/85 is the issue where we found that Prometheus was never able to mount its PVC. There may be overlap with https://bugzilla.redhat.com/show_bug.cgi?id=1624961 - either way, we need to track this down.

I think these tests may be sensitive to the fact that when a pod with unbound PVCs enters the attach/detach controller, it does not get processed by the default reconciler loop (every 100ms); it gets processed by another populator loop, which turns more slowly (every 3 minutes). I think we should backport https://github.com/kubernetes/kubernetes/pull/66863, which may help with this somewhat, but I will look more.

Looking at the timings from the logs, it appears that my theory about the volume not being attached in time is not correct:

```
Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.881908 10912 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002")
Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.892681 10912 operation_generator.go:1184] Controller attach succeeded for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002") device path: "/dev/disk/by-id/google-kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002"
Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.982483 10912 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002")
Sep 10 18:31:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:31:47.982709 10912 operation_generator.go:489] MountVolume.WaitForAttach entering for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") pod "ss-0" (UID: "c0b56e8a-b527-11e8-85b6-42010a8e0002") DevicePath "/dev/disk/by-id/google-kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002"
Sep 10 18:41:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: E0910 18:41:47.983176 10912 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002\"" failed. No retries permitted until 2018-09-10 18:41:48.483130109 +0000 UTC m=+1517.953389124 (durationBeforeRetry 500ms). Error: "MountVolume.WaitForAttach failed for volume \"pvc-c0b335df-b527-11e8-85b6-42010a8e0002\" (UniqueName: \"kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002\") pod \"ss-0\" (UID: \"c0b56e8a-b527-11e8-85b6-42010a8e0002\") : Could not find attached GCE PD \"kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002\". Timeout waiting for mount paths to be created."
Sep 10 18:41:47 ci-op-y4fqx6lz-1827f-ig-n-bgfc origin-node: I0910 18:41:47.986040 10912 reconciler.go:301] Volume detached for volume "pvc-c0b335df-b527-11e8-85b6-42010a8e0002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002") on node "ci-op-y4fqx6lz-1827f-ig-n-bgfc" DevicePath "/dev/disk/by-id/google-kubernetes-dynamic-pvc-c0b335df-b527-11e8-85b6-42010a8e0002"
```

It appears that the volume did get attached in time (18:31:xx), but the device link that is typically created by udev did not appear in time. I discussed this with Michelle (msau42), and there is a possibility of a race condition inside udev which can cause an iSCSI attach to be missed. I still need to isolate and find out if that is indeed the case, but it is looking like it.
So, I ran this e2e test for more than 2 hours today and didn't see the test in question fail. I know that the race condition in udev is still real; after discussing with msau42, we ended up filing https://github.com/kubernetes/kubernetes/issues/68528. I will look more closely tomorrow.

This is now failing in more than 1/3 of runs - master is almost unable to merge changes, and the 3.11 queues aren't much better. Crosslinking with https://bugzilla.redhat.com/show_bug.cgi?id=1624961 in case there is some overlap. Thanks.

I am still investigating this, but just to clarify: the flake is not an attach issue. The problem is that the volume is getting attached, but it is either not showing up on the node at all, or udev events are not getting triggered and the creation of the /dev/disk/by-id symbolic links is being missed (looking at the logs, we can't tell one from the other).

It looks like nodes are entering some kind of "bad" state; it is node-specific, not test-specific. For example, in all the tests that fail in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/9972/pull-ci-openshift-openshift-ansible-e2e-gcp/2264/ , mounts fail on the node ci-op-1jt1t4g1-1827f-ig-n-fdt1. For https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/9972/pull-ci-openshift-openshift-ansible-e2e-gcp/2255#sig-apps-statefulset-k8sio-basic-statefulset-functionality-statefulsetbasic-should-not-deadlock-when-a-pods-predecessor-fails-suiteopenshiftconformanceparallel-suitek8s , mounts fail on the node ci-op-y4fqx6lz-1827f-ig-n-bgfc.

Is there maybe some GCE equivalent to the AWS "stuck attaching" instance state we are hitting? As you said, the disks are getting attached, just not showing up at /dev/disk/by-id...

Okay, so I made a debug PR - https://github.com/openshift/origin/pull/21022 - and tried to reproduce the issue. This PR ran the gcp e2e tests more than 6 times, and I did not see the test in question flake even once.

We do know about udev events being missed on the node, but until we can reproduce that and confirm it, it seems wrong to make a fix based on a hunch. Closing this. If it happens again, we will try to reproduce it.