Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/312

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial] (48m28s) failed with:

```
fail [k8s.io/kubernetes/test/e2e/framework/util.go:2456]: Unexpected error:
    <*errors.errorString | 0xc0004729f0>: {
        s: "expected pod \"pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8\" success: Gave up after waiting 5m0s for pod \"pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8\" to be \"success or failure\"",
    }
    expected pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" success: Gave up after waiting 5m0s for pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" to be "success or failure"
occurred
```

[AfterEach] [Disruptive] teardown (/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:150):

```
Oct 20 02:10:28.316: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Oct 20 02:10:28.365: INFO: Running AfterSuite actions on all nodes
Oct 20 02:10:28.365: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-deployment-upgrade-7188" for this suite.
Oct 20 02:10:34.432: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:10:36.691: INFO: namespace e2e-sig-apps-deployment-upgrade-7188 deletion completed in 8.30937557s
Oct 20 02:10:36.692: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-storage-sig-api-machinery-secret-upgrade-7925" for this suite.
Oct 20 02:10:42.760: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:10:44.869: INFO: namespace e2e-sig-storage-sig-api-machinery-secret-upgrade-7925 deletion completed in 8.159917641s
Oct 20 02:10:44.869: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-storage-sig-api-machinery-configmap-upgrade-3449" for this suite.
Oct 20 02:10:50.936: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:10:53.017: INFO: namespace e2e-sig-storage-sig-api-machinery-configmap-upgrade-3449 deletion completed in 8.129973656s
Oct 20 02:10:53.017: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-replicaset-upgrade-917" for this suite.
Oct 20 02:11:05.083: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:11:07.193: INFO: namespace e2e-sig-apps-replicaset-upgrade-917 deletion completed in 14.158969734s
Oct 20 02:11:07.193: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-daemonset-upgrade-3901" for this suite.
Oct 20 02:11:57.262: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:11:59.340: INFO: namespace e2e-sig-apps-daemonset-upgrade-3901 deletion completed in 52.129459826s
Oct 20 02:11:59.340: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-job-upgrade-5396" for this suite.
Oct 20 02:12:45.408: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:12:47.484: INFO: namespace e2e-sig-apps-job-upgrade-5396 deletion completed in 48.125384235s
Oct 20 02:12:47.484: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-service-upgrade-7113" for this suite.
Oct 20 02:13:01.552: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:13:03.632: INFO: namespace e2e-service-upgrade-7113 deletion completed in 16.13025973s
Oct 20 02:13:03.632: INFO: Running AfterSuite actions on node 1
Oct 20 02:13:03.632: INFO: Dumping logs locally to: /tmp/artifacts/junit
Oct 20 02:13:03.632: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec ../../cluster/log-dump/log-dump.sh: no such file or directory
```

The run ends with the same "Gave up after waiting 5m0s for pod ... to be \"success or failure\"" failure repeated.
The cluster upgrade finished:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/312/artifacts/e2e-aws-upgrade/clusterversion.json

But the test pod failed to delete in time:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/312/build-log.txt

```
Oct 20 02:10:22.314: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 still exists
Oct 20 02:10:24.298: INFO: Waiting for pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 to disappear
Oct 20 02:10:24.315: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 still exists
Oct 20 02:10:26.298: INFO: Waiting for pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 to disappear
Oct 20 02:10:26.315: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 still exists
Oct 20 02:10:28.298: INFO: Waiting for pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 to disappear
Oct 20 02:10:28.314: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 no longer exists
Oct 20 02:10:28.314: INFO: Unexpected error occurred: expected pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" success: Gave up after waiting 5m0s for pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" to be "success or failure"
```

This is not a cluster version operator bug. Moving to the controller manager, since the pod failed to be deleted?
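As background on the 5m0s in that message: the e2e framework simply polls the pod's status until the phase is terminal and gives up after five minutes. A rough sketch of that pattern, assuming client-go (this is not the framework's actual helper, and the poll interval and function name are illustrative):

```
// Sketch only: poll a pod until it reaches a terminal phase, the way the
// e2e framework's `to be "success or failure"` wait behaves. Not the real
// framework helper; the interval and timeout handling are illustrative.
package e2esketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPodSuccessOrFailure returns nil once the pod is Succeeded or Failed,
// or an error if it does not reach a terminal phase within timeout.
func waitForPodSuccessOrFailure(ctx context.Context, c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			switch pod.Status.Phase {
			case corev1.PodSucceeded, corev1.PodFailed:
				return nil // terminal phase reached
			}
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("gave up after waiting %s for pod %q to be Succeeded or Failed", timeout, name)
		}
		time.Sleep(2 * time.Second)
	}
}
```

In other words, the failure above just means the pod never ran to completion within five minutes; the interesting question is why it never started.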
Flaky tests on CI:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11462#1:build-log.txt%3A10875
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11451#1:build-log.txt%3A6082
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/187
Observed in e2e-aws-upgrade-rollback-4.1-to-4.2:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/418#1:build-log.txt%3A1500
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/417#1:build-log.txt%3A1706
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1/412
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/434
(In reply to Hongkai Liu from comment #3)
> Flaky tests on ci:
>
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11462#1:build-log.txt%3A10875

The test itself is not very granular. You need to look inside for the actual failure message (just popping open the failed-test drop-down). In this case, it was:

```
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:130]: during upgrade
Unexpected error:
    <*errors.errorString | 0xc0041a06b0>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-0.ci-2019-11-21-115827: 12% complete",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-0.ci-2019-11-21-115827: 12% complete
occurred
```

which is completely different from the pod-configmap timeout that this bug is about. Please spin off into new bugs.
I'm not seeing this bug showing up in the recent upgrade scenarios in the CI, so I'm going to close this for now. Feel free to re-open if this reappears.
[1] shows 11% of release-openshift-origin-installer-e2e-gcp-4.5 failures matching this condition in the past 2d, and both release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3 failures over the same period matching it.

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=expected+pod.*pod-configmap.*success:+Gave+up+after+waiting&type=build-log
Specific example [1,2] from the above search:

```
fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc001431c70>: {
        s: "expected pod \"pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602\" success: Gave up after waiting 5m0s for pod \"pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602" success: Gave up after waiting 5m0s for pod "pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602" to be "Succeeded or Failed"
occurred
...
failed: (5m14s) 2020-05-06T15:21:31 "[sig-storage] ConfigMap should be consumable from pods in volume with mappings as non-root with FSGroup [LinuxOnly] [NodeFeature:FSGroup] [Suite:openshift/conformance/parallel] [Suite:k8s]"
```

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1884
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1884/build-log.txt
Upon further inspection I see there are several more failures similar to this:

```
fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc002a29220>: {
        s: "expected pod \"pod-projected-secrets-b0d502d6-8f32-4677-9e4e-6819a1ca761c\" success: Gave up after waiting 5m0s for pod \"pod-projected-secrets-b0d502d6-8f32-4677-9e4e-6819a1ca761c\" to be \"Succeeded or Failed\"",
    }

fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc0029754f0>: {
        s: "expected pod \"var-expansion-82d3c6b7-1056-4fcd-b54f-a8a19d9901b9\" success: Gave up after waiting 5m0s for pod \"var-expansion-82d3c6b7-1056-4fcd-b54f-a8a19d9901b9\" to be \"Succeeded or Failed\"",
    }

fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc00125ab20>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-6798\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-6798\" to be \"Succeeded or Failed\"",
    }
```

The one common thing they all have is this event:

```
May  6 15:21:22.066: INFO: At 2020-05-06 15:20:39 +0000 UTC - event for pod-subpath-test-inlinevolume-6798: {kubelet ci-op-bt9cf-w-c-cb6fs.c.openshift-gce-devel-ci.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-6798_e2e-provisioning-8928_7c59371b-bd4d-4c5a-96d3-9a0f6f892c4b_0(5080da563dae12b2e4a8394e8a384f5eebc3e5870ea9a4b81091e70a446b896e): Multus: [e2e-provisioning-8928/pod-subpath-test-inlinevolume-6798]: error adding container to network "openshift-sdn": delegateAdd: error invoking confAdd - "openshift-sdn": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: e2e-provisioning-8928, netnamespaces.network.openshift.io "e2e-provisioning-8928" not found
```

so this looks like a networking error, and I'm sending this to the appropriate team.
The expected resolution here is that:

```
expected pod "pod-configmap-..." success: Gave up after waiting 5m0s for pod "pod-configmap-..." to be "success or failure"
```

no longer shows up in CI. If whatever the fix for that is also addresses pod-projected-secrets-* and such from comment 11, great. If not, those will need to be spun off into their own bugs.

Looks like "success or failure" is "Succeeded or Failed" in modern openshift/origin, but still, I see none of the pod-configmap-* failures over the past 2d:

```
$ curl -s 'https://search.ci.openshift.org/search?search=Gave+up+after+waiting.*for+pod.*to+be&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*expected pod \\"\([^\\]*\)-[^-]*\\".*/\1/' | sort | uniq -c | sort -n
      1 exec-volume-test-aws
      1 exec-volume-test-dynamicpv
      1 pod-secrets-f45b43e3-25a4-494e-b921
      1 pod-subpath-test-gluster-preprovisionedpv
      1 pod-subpath-test-local-preprovisionedpv
      1 pod-subpath-test-secret
      1 security-context-e804d78c-b924-4d35-a8a6
      2 exec-volume-test-cinder-dynamicpv
      2 exec-volume-test-inlinevolume
      2 exec-volume-test-preprovisionedpv
      3 pod-subpath-test-preprovisionedpv
      4 pod-subpath-test-inlinevolume
```

So let's pivot this bug to be about pod-subpath-test-inlinevolume. Looking at the failing jobs, they are all using the new-style target string:

```
$ curl -s 'https://search.ci.openshift.org/search?search=Gave+up+after+waiting.*for+pod.*pod-subpath-test-inlinevolume.*to+be&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*to be//' | sort | uniq -c
      6  \"Succeeded or Failed\"",
```

Failing jobs:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Gave+up+after+waiting.*for+pod.*pod-subpath-test-inlinevolume.*to+be.*Succeeded&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | grep 'failures match'
release-openshift-ocp-installer-e2e-azure-4.5 - 5 runs, 80% failed, 25% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 20 runs, 90% failed, 11% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 20 runs, 60% failed, 8% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 20 runs, 75% failed, 7% of failures match
```

Matching example job (for 4.5.0-0.nightly-2020-08-03-123303):

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1290447311758954496
Given comment 15, and the fact that we've decided to pivot this bug to the pod-subpath-test-inlinevolume failures, I've decided to send it to the scheduler team.

There are no networking issues leading to the failures in the build from comment 15: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1290447311758954496

The build logs are, however, filled with:

```
W ns/e2e-provisioning-9050 pod/pod-subpath-test-inlinevolume-b9ts reason/FailedScheduling 0/6 nodes are available: 5 node(s) didn't match node selector
```

which is why the tests are failing (their test pods cannot be created).

For all failing tests: I have looked at the details of some of the test logs and it seems to be related to worker 1, aka ci-op-dgqvmdtt-fe6b0-t7mhv-worker-centralus1-x2jd9. I have not been able to understand why that node is not schedulable from the moment the job was launched until its end.

Looking at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1290447311758954496/artifacts/e2e-azure/nodes.json the node does not seem to be defective:

```
"addresses": [
  {
    "address": "ci-op-dgqvmdtt-fe6b0-t7mhv-worker-centralus1-x2jd9",
    "type": "Hostname"
  },
  {
    "address": "10.0.32.5",
    "type": "InternalIP"
  }
],
"allocatable": {
  "attachable-volumes-azure-disk": "8",
  "cpu": "3500m",
  "ephemeral-storage": "122112633448",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "15268236Ki",
  "pods": "250"
},
"capacity": {
  "attachable-volumes-azure-disk": "8",
  "cpu": "4",
  "ephemeral-storage": "133665772Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "16419212Ki",
  "pods": "250"
},
"conditions": [
  {
    "lastHeartbeatTime": "2020-08-04T02:04:33Z",
    "lastTransitionTime": "2020-08-04T01:13:40Z",
    "message": "kubelet has sufficient memory available",
    "reason": "KubeletHasSufficientMemory",
    "status": "False",
    "type": "MemoryPressure"
  },
  {
    "lastHeartbeatTime": "2020-08-04T02:04:33Z",
    "lastTransitionTime": "2020-08-04T01:13:40Z",
    "message": "kubelet has no disk pressure",
    "reason": "KubeletHasNoDiskPressure",
    "status": "False",
    "type": "DiskPressure"
  },
  {
    "lastHeartbeatTime": "2020-08-04T02:04:33Z",
    "lastTransitionTime": "2020-08-04T01:13:40Z",
    "message": "kubelet has sufficient PID available",
    "reason": "KubeletHasSufficientPID",
    "status": "False",
    "type": "PIDPressure"
  },
  {
    "lastHeartbeatTime": "2020-08-04T02:04:33Z",
    "lastTransitionTime": "2020-08-04T01:14:40Z",
    "message": "kubelet is posting ready status",
    "reason": "KubeletReady",
    "status": "True",
    "type": "Ready"
  }
],
```

Maybe the scheduling team will be able to debug further.
Jan, can you check this from the scheduler POV?
Sure, going through the logs now.
- pod-subpath-test-inlinevolume is defined in https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go#L154
- later on, the pod's NodeSelector is set to target a specific node, picked randomly from the set of ready and schedulable nodes (see the sketch after this comment)

So, by the time the pod is being created, the selected node is schedulable. The scheduler logs do not contain enough information to explain why the node was not selected; the reason is missing. There's an upstream issue about the missing reason: https://github.com/kubernetes/kubernetes/issues/91601. The suggested upstream fix is https://github.com/kubernetes/kubernetes/pull/93355. Worth backporting it to 4.5 to see if it helps.
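For readers unfamiliar with that test wiring, here is a minimal sketch of the node pinning described above. It is not the actual subpath.go code; the helper name, container image, and package are illustrative, and only the well-known kubernetes.io/hostname label is assumed.

```
// Sketch only: build a test pod pinned to one node via NodeSelector,
// the way the e2e storage tests pin their pod-subpath-test-* pods to a
// node picked from the ready and schedulable set.
package e2esketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pinnedTestPod returns a pod spec the scheduler can only place on nodeName.
func pinnedTestPod(name, namespace, nodeName string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			NodeSelector: map[string]string{
				// Well-known label the kubelet sets from the node's hostname.
				"kubernetes.io/hostname": nodeName,
			},
			Containers: []corev1.Container{{
				Name:    "test-container",
				Image:   "busybox", // illustrative image
				Command: []string{"sh", "-c", "true"},
			}},
		},
	}
}
```

If the chosen node's labels no longer match by the time scheduling happens, the scheduler reports FailedScheduling with "node(s) didn't match node selector", which is the warning seen in the build logs from comment 15.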
To be clear, https://github.com/kubernetes/kubernetes/pull/93355 will tell more about the reasons why the pod was not scheduled. It will not fix the problem.
https://github.com/kubernetes/kubernetes/pull/93355 is merged into 1.19 upstream. Has it been picked up through another rebase?
Yeah, I am waiting for the next 1.19 rebase
The 1.19 rebase is merged; I'll wait a few days to collect more flaking jobs to debug.
```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Gave+up+after+waiting.*for+pod.*pod-subpath-test-inlinevolume.*to+be.*Succeeded&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | grep 'failures match'
release-openshift-ocp-installer-e2e-ovirt-4.5 - 9 runs, 11% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 18 runs, 94% failed, 6% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 24 runs, 38% failed, 11% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 17 runs, 53% failed, 22% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 17 runs, 65% failed, 9% of failures match
release-openshift-ocp-installer-e2e-metal-compact-4.5 - 4 runs, 50% failed, 50% of failures match
```

## Jobs

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1306123437206409216

```
fail [k8s.io/kubernetes/test/e2e/framework/util.go:798]: Unexpected error:
    <*errors.errorString | 0xc000cb3b10>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-b5rf\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-b5rf\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-subpath-test-inlinevolume-b5rf" success: Gave up after waiting 5m0s for pod "pod-subpath-test-inlinevolume-b5rf" to be "Succeeded or Failed"
occurred
```

From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1306123437206409216/artifacts/e2e-ovirt/pods/openshift-kube-scheduler_openshift-kube-scheduler-ovirt17-4wv67-master-0_kube-scheduler.log:

```
I0916 07:59:01.985919       1 scheduler.go:731] pod e2e-provisioning-7746/pod-subpath-test-inlinevolume-b5rf is bound successfully on node "ovirt17-4wv67-worker-0-4ttv9", 5 nodes evaluated, 2 nodes were found feasible.
```

The kubelet running on ovirt17-4wv67-worker-0-4ttv9 says:

```
Sep 16 08:03:03.913483 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: E0916 08:03:03.913025 1581 pod_workers.go:191] Error syncing pod 345fdb23-9f66-42eb-9f1f-bb20f6c8ba08 ("pod-subpath-test-inlinevolume-b5rf_e2e-provisioning-7746(345fdb23-9f66-42eb-9f1f-bb20f6c8ba08)"), skipping: failed to "CreatePodSandbox" for "pod-subpath-test-inlinevolume-b5rf_e2e-provisioning-7746(345fdb23-9f66-42eb-9f1f-bb20f6c8ba08)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pod-subpath-test-inlinevolume-b5rf_e2e-provisioning-7746(345fdb23-9f66-42eb-9f1f-bb20f6c8ba08)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Sep 16 08:03:03.913483 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: I0916 08:03:03.913074 1581 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-provisioning-7746", Name:"pod-subpath-test-inlinevolume-b5rf", UID:"345fdb23-9f66-42eb-9f1f-bb20f6c8ba08", APIVersion:"v1", ResourceVersion:"41682", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Sep 16 08:03:04.134052 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: I0916 08:03:04.134005 1581 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/containers/storage/overlay-containers/02dbbec5128ef16d476c8ddfc0be05200d5a4e0b4eaff50bfcec1f93a2152e44/userdata/shm
Sep 16 08:03:04.134052 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: I0916 08:03:04.134047 1581 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/containers/storage/overlay/e2c0a04981b2eae291aa67f62f86562bc2db756b7933a70349597ba58d526fb1/merged
Sep 16 08:03:04.239460 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: E0916 08:03:04.239333 1581 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.6/1306118658665222144

```
[sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (default fs)] subPath should support file as subpath [LinuxOnly] [Suite:openshift/conformance/parallel] [Suite:k8s] (5m10s)
fail [k8s.io/kubernetes.0-rc.2/test/e2e/framework/util.go:715]: Unexpected error:
    <*errors.errorString | 0xc0003aebb0>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-vxsb\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-vxsb\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-subpath-test-inlinevolume-vxsb" success: Gave up after waiting 5m0s for pod "pod-subpath-test-inlinevolume-vxsb" to be "Succeeded or Failed"
occurred
```

From junit:

```
Sep 16 08:07:39.947: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: { } Scheduled: Successfully assigned e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb to y41mz62c-12e28-g5lbt-worker-0-sqzgj
Sep 16 08:07:39.947: INFO: At 2020-09-16 08:04:02 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: {kubelet y41mz62c-12e28-g5lbt-worker-0-sqzgj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(da0d8ba0a36a71f925f6a0d76c908649ec791816373649b036abda05c8a30ab8): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 08:07:39.947: INFO: At 2020-09-16 08:05:45 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: {kubelet y41mz62c-12e28-g5lbt-worker-0-sqzgj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(eec54f25a515e21dd9185738e8ca588df62a590f04d03071109a5805fd67f318): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 08:07:39.947: INFO: At 2020-09-16 08:07:26 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: {kubelet y41mz62c-12e28-g5lbt-worker-0-sqzgj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(425fbed87e39a4185935127c8052568de68369ed712757df8d3599de51b6d501): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
```

kubelet:

```
Sep 16 08:04:02.662006 y41mz62c-12e28-g5lbt-worker-0-sqzgj.novalocal hyperkube[1765]: E0916 08:04:02.661810 1765 kuberuntime_manager.go:730] createPodSandbox for pod "pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(da0d8ba0a36a71f925f6a0d76c908649ec791816373649b036abda05c8a30ab8): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 08:04:02.662006 y41mz62c-12e28-g5lbt-worker-0-sqzgj.novalocal hyperkube[1765]: E0916 08:04:02.661915 1765 pod_workers.go:191] Error syncing pod 464d7763-5703-448c-9db3-4c9ec522abbc ("pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)"), skipping: failed to "CreatePodSandbox" for "pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(da0d8ba0a36a71f925f6a0d76c908649ec791816373649b036abda05c8a30ab8): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition"
```

```
[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Inline-volume (default fs)] subPath should support readOnly file specified in the volumeMount [LinuxOnly] [Suite:openshift/conformance/parallel] [Suite:k8s] (5m9s)
fail [k8s.io/kubernetes.0-rc.2/test/e2e/framework/util.go:715]: Unexpected error:
    <*errors.errorString | 0xc0029b3a50>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-png4\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-png4\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-subpath-test-inlinevolume-png4" success: Gave up after waiting 5m0s for pod "pod-subpath-test-inlinevolume-png4" to be "Succeeded or Failed"
occurred
```

From junit:

```
Sep 16 08:05:51.327: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-png4: { } Scheduled: Successfully assigned e2e-provisioning-5179/pod-subpath-test-inlinevolume-png4 to y41mz62c-12e28-g5lbt-worker-0-sqzgj
```

The kubelet logs are:

```
Sep 16 08:02:15.349715 y41mz62c-12e28-g5lbt-worker-0-sqzgj.novalocal hyperkube[1765]: E0916 08:02:15.349080 1765 pod_workers.go:191] Error syncing pod f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7 ("pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179(f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7)"), skipping: failed to "CreatePodSandbox" for "pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179(f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179(f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179_f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7_0(5e6564601e82a2b4a62d6f52c598fe5b0de5b5681b0367f0a1dfb83ac6c690d3): Multus: [e2e-provisioning-5179/pod-subpath-test-inlinevolume-png4]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition"
```

In the other instances the pod always got scheduled; the kubelet failed to run the pod.

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1306096546189676544

```
Sep 16 06:24:21.082: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-l8nc: { } Scheduled: Successfully assigned e2e-provisioning-6020/pod-subpath-test-inlinevolume-l8nc to ci-op-4bzyjwnv-1a9c0-s7zsc-worker-c-cdm6z
...
Sep 16 06:24:21.082: INFO: At 2020-09-16 06:19:59 +0000 UTC - event for pod-subpath-test-inlinevolume-l8nc: {kubelet ci-op-4bzyjwnv-1a9c0-s7zsc-worker-c-cdm6z} FailedMount: MountVolume.SetUp failed for volume "test-volume" : mount failed: mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a30ff59a-1877-4da3-8751-22becc5a5324/volumes/kubernetes.io~glusterfs/test-volume --scope -- mount -t glusterfs -o auto_unmount,log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/test-volume/pod-subpath-test-inlinevolume-l8nc-glusterfs.log,log-level=ERROR 10.129.3.160:test_vol /var/lib/kubelet/pods/a30ff59a-1877-4da3-8751-22becc5a5324/volumes/kubernetes.io~glusterfs/test-volume
Output: Running scope as unit: run-r614d3d525db64ad58d4d91d4a59149be.scope
Mounting glusterfs on /var/lib/kubelet/pods/a30ff59a-1877-4da3-8751-22becc5a5324/volumes/kubernetes.io~glusterfs/test-volume failed.
```

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1305756913199222784

```
Sep 15 07:30:16.447: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-xwht: { } Scheduled: Successfully assigned e2e-provisioning-8289/pod-subpath-test-inlinevolume-xwht to ci-op-r94d0cmx-1a9c0-qd4v7-worker-b-jmtlm
```

The kubelet did not start the pod in time.

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306068071802212352

```
Sep 16 05:22:03.520: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: { } Scheduled: Successfully assigned e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7 to ovirt14-gtf64-worker-0-wdplj
Sep 16 05:22:03.520: INFO: At 2020-09-16 05:18:31 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: {kubelet ovirt14-gtf64-worker-0-wdplj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-v5j7_e2e-provisioning-1046_773d37ee-288b-404e-a025-af60dfe92721_0(2844d32da63bd56bd2737eec25fccfa1122af5bd9ce31ff12c8c395877764542): Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 05:22:03.520: INFO: At 2020-09-16 05:20:17 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: {kubelet ovirt14-gtf64-worker-0-wdplj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-v5j7_e2e-provisioning-1046_773d37ee-288b-404e-a025-af60dfe92721_0(0bcabfb1657ed4e4e73a584670d7f35e2af49f9fe4ecefd820fb0daf72523fe5): Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 05:22:03.520: INFO: At 2020-09-16 05:21:59 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: {kubelet ovirt14-gtf64-worker-0-wdplj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-v5j7_e2e-provisioning-1046_773d37ee-288b-404e-a025-af60dfe92721_0(70ed392b8409dc2e9b0abe30378f24282f6f9a23224048af37a95b75c44d5a83): Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
```

Pods in the other failing inline-volume tests also got scheduled, but the kubelet did not make the pods run.

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.6/1305705117474885632

```
Sep 15 04:14:35.884: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-lgbv: { } Scheduled: Successfully assigned e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv to ci-op-0zr8v53k-45fdb-jmj5t-worker-b-p8n9r
Sep 15 04:14:35.884: INFO: At 2020-09-15 04:09:33 +0000 UTC - event for pod-subpath-test-inlinevolume-lgbv: {controlplane } ErrorAddingLogicalPort: timeout waiting for namespace event
Sep 15 04:14:35.884: INFO: At 2020-09-15 04:09:46 +0000 UTC - event for pod-subpath-test-inlinevolume-lgbv: {kubelet ci-op-0zr8v53k-45fdb-jmj5t-worker-b-p8n9r} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-lgbv_e2e-provisioning-652_a5963cf0-24f7-4505-932d-598b7a8b294b_0(9b1b1519afb1c4ecddfa5e7c29bd0d0185e7ec85dcbb9e4a449367a40a94c61c): [e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv] failed to configure pod interface: timed out waiting for pod flows for pod: pod-subpath-test-inlinevolume-lgbv, error: timed out waiting for the condition
...
```

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.5/1305478194941399040

The pod got scheduled; the kubelet failed to start it.
In all the cases I checked, the pod-subpath-test-inlinevolume pods got scheduled. In most cases the underlying containers did not start due to FailedCreatePodSandBox. Re-assigning to the node team to check the kubelet side of the failures.
Hate to bounce this around, but it seems the majority of the failures are due to CNI failing underneath CRI-O, with the exception of the one glusterfs mount failure among the runs that Jan analyzed in comment 24:

```
Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Multus: [e2e-provisioning-5179/pod-subpath-test-inlinevolume-png4]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv] failed to configure pod interface: timed out waiting for pod flows for pod: pod-subpath-test-inlinevolume-lgbv, error: timed out waiting for the condition
```
When Multus is "waiting for ReadinessIndicatorFile", it's waiting for a semaphore from the default network (either OVN or OpenShift-SDN):

```
PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
```

Do we have any ovn-k or openshift-sdn logs? This will wind up being our root cause. Multus is provisioned before openshift-sdn or ovn-k, but it can't say "the nodes are ready" (by placing a CNI configuration on disk) until OpenShift-SDN or ovn-k are ready.
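Mechanically, that wait is nothing more than polling for a file on disk until a timeout, along the lines of the sketch below. The path and timeout here are placeholder assumptions, not Multus's real configuration; the real readiness indicator path comes from the Multus config and points at whatever file the default network writes when it is ready.

```
// Sketch only: poll for a CNI "readiness indicator" file the way the
// Multus error message describes. The path and timeout are illustrative
// assumptions; the real values come from the Multus configuration.
package cnisketch

import (
	"fmt"
	"os"
	"time"
)

// waitForReadinessIndicator blocks until path exists or timeout elapses,
// mirroring the "PollImmediate error waiting for ReadinessIndicatorFile"
// behavior seen in the failing sandbox creations.
func waitForReadinessIndicator(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil // default network (openshift-sdn / ovn-k) has written its file
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for readiness indicator %q", path)
		}
		time.Sleep(time.Second)
	}
}
```

So every FailedCreatePodSandBox above is Multus timing out on that poll because the default network never signalled ready on that node, which is why the ovn-k or openshift-sdn logs for those nodes are the next thing to look at.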