Bug 1763662 - Cluster upgrade should maintain a functioning cluster: expected pod "pod-subpath-test-inlinevolume-..." success: Gave up after waiting 5m0s
Status: CLOSED INSUFFICIENT_DATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Douglas Smith
QA Contact: Weibin Liang
Depends On: 1868314
 
Reported: 2019-10-21 09:45 UTC by Liang Xia
Modified: 2020-11-02 16:32 UTC
CC: 12 users

Last Closed: 2020-11-02 16:32:24 UTC



Description Liang Xia 2019-10-21 09:45:02 UTC
Description of problem:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/312
[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial] 48m28s
fail [k8s.io/kubernetes/test/e2e/framework/util.go:2456]: Unexpected error:
    <*errors.errorString | 0xc0004729f0>: {
        s: "expected pod \"pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8\" success: Gave up after waiting 5m0s for pod \"pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8\" to be \"success or failure\"",
    }
    expected pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" success: Gave up after waiting 5m0s for pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" to be "success or failure"
occurred

[AfterEach] [Disruptive]
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:150
Oct 20 02:10:28.316: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Oct 20 02:10:28.365: INFO: Running AfterSuite actions on all nodes
Oct 20 02:10:28.365: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-deployment-upgrade-7188" for this suite.
Oct 20 02:10:34.432: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:10:36.691: INFO: namespace e2e-sig-apps-deployment-upgrade-7188 deletion completed in 8.30937557s
Oct 20 02:10:36.692: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-storage-sig-api-machinery-secret-upgrade-7925" for this suite.
Oct 20 02:10:42.760: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:10:44.869: INFO: namespace e2e-sig-storage-sig-api-machinery-secret-upgrade-7925 deletion completed in 8.159917641s
Oct 20 02:10:44.869: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-storage-sig-api-machinery-configmap-upgrade-3449" for this suite.
Oct 20 02:10:50.936: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:10:53.017: INFO: namespace e2e-sig-storage-sig-api-machinery-configmap-upgrade-3449 deletion completed in 8.129973656s
Oct 20 02:10:53.017: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-replicaset-upgrade-917" for this suite.
Oct 20 02:11:05.083: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:11:07.193: INFO: namespace e2e-sig-apps-replicaset-upgrade-917 deletion completed in 14.158969734s
Oct 20 02:11:07.193: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-daemonset-upgrade-3901" for this suite.
Oct 20 02:11:57.262: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:11:59.340: INFO: namespace e2e-sig-apps-daemonset-upgrade-3901 deletion completed in 52.129459826s
Oct 20 02:11:59.340: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sig-apps-job-upgrade-5396" for this suite.
Oct 20 02:12:45.408: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:12:47.484: INFO: namespace e2e-sig-apps-job-upgrade-5396 deletion completed in 48.125384235s
Oct 20 02:12:47.484: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-service-upgrade-7113" for this suite.
Oct 20 02:13:01.552: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Oct 20 02:13:03.632: INFO: namespace e2e-service-upgrade-7113 deletion completed in 16.13025973s
Oct 20 02:13:03.632: INFO: Running AfterSuite actions on node 1
Oct 20 02:13:03.632: INFO: Dumping logs locally to: /tmp/artifacts/junit
Oct 20 02:13:03.632: INFO: Error running cluster/log-dump/log-dump.sh: fork/exec ../../cluster/log-dump/log-dump.sh: no such file or directory
fail [k8s.io/kubernetes/test/e2e/framework/util.go:2456]: Unexpected error:
    <*errors.errorString | 0xc0004729f0>: {
        s: "expected pod \"pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8\" success: Gave up after waiting 5m0s for pod \"pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8\" to be \"success or failure\"",
    }
    expected pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" success: Gave up after waiting 5m0s for pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" to be "success or failure"
occurred

Comment 1 Abhinav Dahiya 2019-11-04 19:56:41 UTC
The cluster upgrade finished:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/312/artifacts/e2e-aws-upgrade/clusterversion.json

But the test pod failed to delete:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2/312/build-log.txt

```
Oct 20 02:10:22.314: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 still exists
Oct 20 02:10:24.298: INFO: Waiting for pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 to disappear
Oct 20 02:10:24.315: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 still exists
Oct 20 02:10:26.298: INFO: Waiting for pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 to disappear
Oct 20 02:10:26.315: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 still exists
Oct 20 02:10:28.298: INFO: Waiting for pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 to disappear
Oct 20 02:10:28.314: INFO: Pod pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8 no longer exists
Oct 20 02:10:28.314: INFO: Unexpected error occurred: expected pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" success: Gave up after waiting 5m0s for pod "pod-configmap-deec7ff4-f2dd-11e9-98ce-0a58ac109ca8" to be "success or failure"
```

This is not a cluster version operator bug. Moving to the controller manager component, since the pod failed to be deleted?

Comment 7 W. Trevor King 2020-01-30 16:36:00 UTC
(In reply to Hongkai Liu from comment #3)
> Flaky tests on ci:
> 
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/11462#1:build-log.txt%3A10875

The test itself is not very granular.  You need to look inside for the actual failure message (just popping open the failed-test drop-down).  In this case, it was:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:130]: during upgrade
Unexpected error:
    <*errors.errorString | 0xc0041a06b0>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-0.ci-2019-11-21-115827: 12% complete",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-0.ci-2019-11-21-115827: 12% complete
occurred

which is completely different from the pod-configmap timeout that this bug is about.  Please spin off into new bugs.

Comment 8 Maciej Szulik 2020-05-06 13:23:06 UTC
I'm not seeing this bug showing up in the recent upgrade scenarios in the CI, so I'm going to close this for now. Feel free to re-open if this reappears.

Comment 9 W. Trevor King 2020-05-06 18:37:44 UTC
[1] shows 11% of release-openshift-origin-installer-e2e-gcp-4.5 failures in the past 2d matching this condition, and both of the release-openshift-origin-installer-e2e-gcp-shared-vpc-4.3 failures in the past 2d match it as well.

[1]: https://search.apps.build01.ci.devcluster.openshift.com/?search=expected+pod.*pod-configmap.*success:+Gave+up+after+waiting&type=build-log

Comment 10 W. Trevor King 2020-05-06 18:41:23 UTC
Specific example [1,2] from the above search:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc001431c70>: {
        s: "expected pod \"pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602\" success: Gave up after waiting 5m0s for pod \"pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602" success: Gave up after waiting 5m0s for pod "pod-configmaps-0a0136a8-5123-4a1c-b923-10539b42d602" to be "Succeeded or Failed"
occurred
...
failed: (5m14s) 2020-05-06T15:21:31 "[sig-storage] ConfigMap should be consumable from pods in volume with mappings as non-root with FSGroup [LinuxOnly] [NodeFeature:FSGroup] [Suite:openshift/conformance/parallel] [Suite:k8s]"

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1884
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1884/build-log.txt

Comment 11 Maciej Szulik 2020-05-11 15:50:24 UTC
Upon further inspection I see there are several more failures similar to this one:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc002a29220>: {
        s: "expected pod \"pod-projected-secrets-b0d502d6-8f32-4677-9e4e-6819a1ca761c\" success: Gave up after waiting 5m0s for pod \"pod-projected-secrets-b0d502d6-8f32-4677-9e4e-6819a1ca761c\" to be \"Succeeded or Failed\"",
    }


fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc0029754f0>: {
        s: "expected pod \"var-expansion-82d3c6b7-1056-4fcd-b54f-a8a19d9901b9\" success: Gave up after waiting 5m0s for pod \"var-expansion-82d3c6b7-1056-4fcd-b54f-a8a19d9901b9\" to be \"Succeeded or Failed\"",
    }

fail [k8s.io/kubernetes/test/e2e/framework/util.go:803]: Unexpected error:
    <*errors.errorString | 0xc00125ab20>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-6798\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-6798\" to be \"Succeeded or Failed\"",
    }

The one thing they all have in common is this event:

May  6 15:21:22.066: INFO: At 2020-05-06 15:20:39 +0000 UTC - event for pod-subpath-test-inlinevolume-6798: {kubelet ci-op-bt9cf-w-c-cb6fs.c.openshift-gce-devel-ci.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-6798_e2e-provisioning-8928_7c59371b-bd4d-4c5a-96d3-9a0f6f892c4b_0(5080da563dae12b2e4a8394e8a384f5eebc3e5870ea9a4b81091e70a446b896e): Multus: [e2e-provisioning-8928/pod-subpath-test-inlinevolume-6798]: error adding container to network "openshift-sdn": delegateAdd: error invoking confAdd - "openshift-sdn": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to find netid for namespace: e2e-provisioning-8928, netnamespaces.network.openshift.io "e2e-provisioning-8928" not found

This looks like a networking error, so I'm sending it to the appropriate team.

Comment 15 W. Trevor King 2020-08-04 21:58:00 UTC
The expected resolution here is that the message

  expected pod "pod-configmap-..." success: Gave up after waiting 5m0s for pod "pod-configmap-..." to be "success or failure"

no longer shows up in CI.  If whatever fixes that also addresses pod-projected-secrets-* and such from comment 11, great.  If not, those will need to be spun off into their own bugs.  Looks like "success or failure" has become "Succeeded or Failed" in modern openshift/origin, but still, I see none of these over the past 2d:

$ curl -s 'https://search.ci.openshift.org/search?search=Gave+up+after+waiting.*for+pod.*to+be&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*expected pod \\"\([^\\]*\)-[^-]*\\".*/\1/' | sort | uniq -c | sort -n
      1 exec-volume-test-aws
      1 exec-volume-test-dynamicpv
      1 pod-secrets-f45b43e3-25a4-494e-b921
      1 pod-subpath-test-gluster-preprovisionedpv
      1 pod-subpath-test-local-preprovisionedpv
      1 pod-subpath-test-secret
      1 security-context-e804d78c-b924-4d35-a8a6
      2 exec-volume-test-cinder-dynamicpv
      2 exec-volume-test-inlinevolume
      2 exec-volume-test-preprovisionedpv
      3 pod-subpath-test-preprovisionedpv
      4 pod-subpath-test-inlinevolume

So let's pivot this bug to be about pod-subpath-test-inlinevolume.  Looking at the failing jobs, they are all using the new-style target string:

$ curl -s 'https://search.ci.openshift.org/search?search=Gave+up+after+waiting.*for+pod.*pod-subpath-test-inlinevolume.*to+be&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*to be//' | sort | uniq -c
      6  \"Succeeded or Failed\"",

Failing jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Gave+up+after+waiting.*for+pod.*pod-subpath-test-inlinevolume.*to+be.*Succeeded&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | grep 'failures match'
release-openshift-ocp-installer-e2e-azure-4.5 - 5 runs, 80% failed, 25% of failures match
release-openshift-ocp-installer-e2e-aws-fips-4.6 - 20 runs, 90% failed, 11% of failures match
release-openshift-ocp-installer-e2e-azure-4.6 - 20 runs, 60% failed, 8% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 20 runs, 75% failed, 7% of failures match

Matching example job (for 4.5.0-0.nightly-2020-08-03-123303): https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1290447311758954496

Comment 16 Alexander Constantinescu 2020-08-05 12:47:16 UTC
Given comment 15 and the fact that we've decided to pivot this bug to the pod-subpath-test-inlinevolume failures, I've decided to send it to the scheduler team.

There are no networking issues leading to the failures in the build from comment 15: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1290447311758954496

The build logs are, however, filled with:

W ns/e2e-provisioning-9050 pod/pod-subpath-test-inlinevolume-b9ts reason/FailedScheduling 0/6 nodes are available: 5 node(s) didn't match node selector

which is why the tests are failing (their test pods cannot be scheduled).

For all the failing tests: I have looked at the details of some of the test logs, and they all seem to be related to the same worker, ci-op-dgqvmdtt-fe6b0-t7mhv-worker-centralus1-x2jd9.

I have not been able to understand why that node is not schedulable from the moment the job was launched until its end. Looking at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1290447311758954496/artifacts/e2e-azure/nodes.json, the node does not seem to be defective:

  "addresses": [
                    {
                        "address": "ci-op-dgqvmdtt-fe6b0-t7mhv-worker-centralus1-x2jd9",
                        "type": "Hostname"
                    },
                    {
                        "address": "10.0.32.5",
                        "type": "InternalIP"
                    }
                ],
                "allocatable": {
                    "attachable-volumes-azure-disk": "8",
                    "cpu": "3500m",
                    "ephemeral-storage": "122112633448",
                    "hugepages-1Gi": "0",
                    "hugepages-2Mi": "0",
                    "memory": "15268236Ki",
                    "pods": "250"
                },
                "capacity": {
                    "attachable-volumes-azure-disk": "8",
                    "cpu": "4",
                    "ephemeral-storage": "133665772Ki",
                    "hugepages-1Gi": "0",
                    "hugepages-2Mi": "0",
                    "memory": "16419212Ki",
                    "pods": "250"
                },
                "conditions": [
                    {
                        "lastHeartbeatTime": "2020-08-04T02:04:33Z",
                        "lastTransitionTime": "2020-08-04T01:13:40Z",
                        "message": "kubelet has sufficient memory available",
                        "reason": "KubeletHasSufficientMemory",
                        "status": "False",
                        "type": "MemoryPressure"
                    },
                    {
                        "lastHeartbeatTime": "2020-08-04T02:04:33Z",
                        "lastTransitionTime": "2020-08-04T01:13:40Z",
                        "message": "kubelet has no disk pressure",
                        "reason": "KubeletHasNoDiskPressure",
                        "status": "False",
                        "type": "DiskPressure"
                    },
                    {
                        "lastHeartbeatTime": "2020-08-04T02:04:33Z",
                        "lastTransitionTime": "2020-08-04T01:13:40Z",
                        "message": "kubelet has sufficient PID available",
                        "reason": "KubeletHasSufficientPID",
                        "status": "False",
                        "type": "PIDPressure"
                    },
                    {
                        "lastHeartbeatTime": "2020-08-04T02:04:33Z",
                        "lastTransitionTime": "2020-08-04T01:14:40Z",
                        "message": "kubelet is posting ready status",
                        "reason": "KubeletReady",
                        "status": "True",
                        "type": "Ready"
                    }
                ],
```

Maybe the scheduling team will be able to debug further.

Comment 17 Maciej Szulik 2020-08-11 10:42:47 UTC
Jan, can you check this from the scheduler POV?

Comment 18 Jan Chaloupka 2020-08-11 13:57:55 UTC
Sure, going through the logs now.

Comment 19 Jan Chaloupka 2020-08-11 15:40:20 UTC
- pod-subpath-test-inlinevolume is defined in https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go#L154
- later on, the pod's NodeSelector is set to target a specific node (picked randomly from the set of ready and schedulable nodes)

So, by the time the pod is being created, the selected node is schedulable. The scheduler logs do not contain enough information to explain why the node was not selected. The reason is missing. There's an upstream issue about the missing reason: https://github.com/kubernetes/kubernetes/issues/91601. Suggested upstream fix is https://github.com/kubernetes/kubernetes/pull/93355. Worth backporting it to 4.5 to see if it helps.
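
For reference, the node-pinning step works roughly like the sketch below (a minimal, illustrative sketch rather than the exact e2e framework helpers; the function names are made up): a ready, schedulable node is picked at random and the pod's NodeSelector is set to that node's hostname label.

```
package nodeselection

import (
	"math/rand"

	v1 "k8s.io/api/core/v1"
)

// isReadyAndSchedulable mirrors the intent of the framework's filter: the
// node must not be cordoned and must report a Ready=True condition.
func isReadyAndSchedulable(node v1.Node) bool {
	if node.Spec.Unschedulable {
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}

// pinPodToRandomNode picks a random ready, schedulable node and sets the
// pod's NodeSelector to that node's hostname label, so only that node can
// satisfy the selector.
func pinPodToRandomNode(pod *v1.Pod, nodes []v1.Node) {
	var candidates []v1.Node
	for _, n := range nodes {
		if isReadyAndSchedulable(n) {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		return // nothing schedulable; the test would fail later anyway
	}
	chosen := candidates[rand.Intn(len(candidates))]
	pod.Spec.NodeSelector = map[string]string{
		"kubernetes.io/hostname": chosen.Labels["kubernetes.io/hostname"],
	}
}
```

The selector leaves the scheduler exactly one feasible node for that pod, which is why a problem on the chosen node cannot be routed around.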

Comment 20 Jan Chaloupka 2020-08-11 15:41:14 UTC
To be clear, https://github.com/kubernetes/kubernetes/pull/93355 will tell more about the reasons why the pod was not scheduled. It will not fix the problem.

Comment 21 Stefan Schimanski 2020-08-25 07:27:34 UTC
https://github.com/kubernetes/kubernetes/pull/93355 is merged into 1.19 upstream. Has it been picked up through another rebase?

Comment 22 Jan Chaloupka 2020-08-25 09:32:58 UTC
Yeah, I am waiting for the next 1.19 rebase.

Comment 23 Jan Chaloupka 2020-09-10 14:13:36 UTC
The 1.19 rebase is merged; I will wait a few days to collect more flaking jobs to debug.

Comment 24 Jan Chaloupka 2020-09-16 10:02:00 UTC
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Gave+up+after+waiting.*for+pod.*pod-subpath-test-inlinevolume.*to+be.*Succeeded&maxAge=48h&type=junit&name=release-openshift-ocp-&context=0' | grep 'failures match'
release-openshift-ocp-installer-e2e-ovirt-4.5 - 9 runs, 11% failed, 100% of failures match
release-openshift-ocp-installer-e2e-openstack-4.6 - 18 runs, 94% failed, 6% of failures match
release-openshift-ocp-installer-e2e-ovirt-4.6 - 24 runs, 38% failed, 11% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 17 runs, 53% failed, 22% of failures match
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 - 17 runs, 65% failed, 9% of failures match
release-openshift-ocp-installer-e2e-metal-compact-4.5 - 4 runs, 50% failed, 50% of failures match

## Jobs

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1306123437206409216
```
fail [k8s.io/kubernetes/test/e2e/framework/util.go:798]: Unexpected error:
    <*errors.errorString | 0xc000cb3b10>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-b5rf\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-b5rf\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-subpath-test-inlinevolume-b5rf" success: Gave up after waiting 5m0s for pod "pod-subpath-test-inlinevolume-b5rf" to be "Succeeded or Failed"
occurred
```

From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.5/1306123437206409216/artifacts/e2e-ovirt/pods/openshift-kube-scheduler_openshift-kube-scheduler-ovirt17-4wv67-master-0_kube-scheduler.log
```
I0916 07:59:01.985919       1 scheduler.go:731] pod e2e-provisioning-7746/pod-subpath-test-inlinevolume-b5rf is bound successfully on node "ovirt17-4wv67-worker-0-4ttv9", 5 nodes evaluated, 2 nodes were found feasible.
```

The kubelet running on ovirt17-4wv67-worker-0-4ttv9 says:
```
Sep 16 08:03:03.913483 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: E0916 08:03:03.913025    1581 pod_workers.go:191] Error syncing pod 345fdb23-9f66-42eb-9f1f-bb20f6c8ba08 ("pod-subpath-test-inlinevolume-b5rf_e2e-provisioning-7746(345fdb23-9f66-42eb-9f1f-bb20f6c8ba08)"), skipping: failed to "CreatePodSandbox" for "pod-subpath-test-inlinevolume-b5rf_e2e-provisioning-7746(345fdb23-9f66-42eb-9f1f-bb20f6c8ba08)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pod-subpath-test-inlinevolume-b5rf_e2e-provisioning-7746(345fdb23-9f66-42eb-9f1f-bb20f6c8ba08)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Sep 16 08:03:03.913483 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: I0916 08:03:03.913074    1581 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-provisioning-7746", Name:"pod-subpath-test-inlinevolume-b5rf", UID:"345fdb23-9f66-42eb-9f1f-bb20f6c8ba08", APIVersion:"v1", ResourceVersion:"41682", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Sep 16 08:03:04.134052 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: I0916 08:03:04.134005    1581 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/containers/storage/overlay-containers/02dbbec5128ef16d476c8ddfc0be05200d5a4e0b4eaff50bfcec1f93a2152e44/userdata/shm
Sep 16 08:03:04.134052 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: I0916 08:03:04.134047    1581 fs.go:407] unable to determine file system type, partition mountpoint does not exist: /var/lib/containers/storage/overlay/e2c0a04981b2eae291aa67f62f86562bc2db756b7933a70349597ba58d526fb1/merged
Sep 16 08:03:04.239460 ovirt17-4wv67-worker-0-4ttv9 hyperkube[1581]: E0916 08:03:04.239333    1581 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

```

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.6/1306118658665222144

```
[sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (default fs)] subPath should support file as subpath [LinuxOnly] [Suite:openshift/conformance/parallel] [Suite:k8s] 5m10s
fail [k8s.io/kubernetes.0-rc.2/test/e2e/framework/util.go:715]: Unexpected error:
    <*errors.errorString | 0xc0003aebb0>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-vxsb\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-vxsb\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-subpath-test-inlinevolume-vxsb" success: Gave up after waiting 5m0s for pod "pod-subpath-test-inlinevolume-vxsb" to be "Succeeded or Failed"
occurred
```

From junit:
```
Sep 16 08:07:39.947: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: { } Scheduled: Successfully assigned e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb to y41mz62c-12e28-g5lbt-worker-0-sqzgj
Sep 16 08:07:39.947: INFO: At 2020-09-16 08:04:02 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: {kubelet y41mz62c-12e28-g5lbt-worker-0-sqzgj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(da0d8ba0a36a71f925f6a0d76c908649ec791816373649b036abda05c8a30ab8): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 08:07:39.947: INFO: At 2020-09-16 08:05:45 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: {kubelet y41mz62c-12e28-g5lbt-worker-0-sqzgj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(eec54f25a515e21dd9185738e8ca588df62a590f04d03071109a5805fd67f318): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 08:07:39.947: INFO: At 2020-09-16 08:07:26 +0000 UTC - event for pod-subpath-test-inlinevolume-vxsb: {kubelet y41mz62c-12e28-g5lbt-worker-0-sqzgj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(425fbed87e39a4185935127c8052568de68369ed712757df8d3599de51b6d501): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
```

kubelet:
```
Sep 16 08:04:02.662006 y41mz62c-12e28-g5lbt-worker-0-sqzgj.novalocal hyperkube[1765]: E0916 08:04:02.661810    1765 kuberuntime_manager.go:730] createPodSandbox for pod "pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(da0d8ba0a36a71f925f6a0d76c908649ec791816373649b036abda05c8a30ab8): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 08:04:02.662006 y41mz62c-12e28-g5lbt-worker-0-sqzgj.novalocal hyperkube[1765]: E0916 08:04:02.661915    1765 pod_workers.go:191] Error syncing pod 464d7763-5703-448c-9db3-4c9ec522abbc ("pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)"), skipping: failed to "CreatePodSandbox" for "pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447(464d7763-5703-448c-9db3-4c9ec522abbc)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-vxsb_e2e-provisioning-5447_464d7763-5703-448c-9db3-4c9ec522abbc_0(da0d8ba0a36a71f925f6a0d76c908649ec791816373649b036abda05c8a30ab8): Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition"
```

```
[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Inline-volume (default fs)] subPath should support readOnly file specified in the volumeMount [LinuxOnly] [Suite:openshift/conformance/parallel] [Suite:k8s] 5m9s
fail [k8s.io/kubernetes.0-rc.2/test/e2e/framework/util.go:715]: Unexpected error:
    <*errors.errorString | 0xc0029b3a50>: {
        s: "expected pod \"pod-subpath-test-inlinevolume-png4\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-inlinevolume-png4\" to be \"Succeeded or Failed\"",
    }
    expected pod "pod-subpath-test-inlinevolume-png4" success: Gave up after waiting 5m0s for pod "pod-subpath-test-inlinevolume-png4" to be "Succeeded or Failed"
occurred
```

From junit:
```
Sep 16 08:05:51.327: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-png4: { } Scheduled: Successfully assigned e2e-provisioning-5179/pod-subpath-test-inlinevolume-png4 to y41mz62c-12e28-g5lbt-worker-0-sqzgj
```

The kubelet logs show:
```
Sep 16 08:02:15.349715 y41mz62c-12e28-g5lbt-worker-0-sqzgj.novalocal hyperkube[1765]: E0916 08:02:15.349080    1765 pod_workers.go:191] Error syncing pod f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7 ("pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179(f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7)"), skipping: failed to "CreatePodSandbox" for "pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179(f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7)" with CreatePodSandboxError: "CreatePodSandbox for pod \"pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179(f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-png4_e2e-provisioning-5179_f6bb4f1e-3cda-412a-ab09-ba29a2fb9af7_0(5e6564601e82a2b4a62d6f52c598fe5b0de5b5681b0367f0a1dfb83ac6c690d3): Multus: [e2e-provisioning-5179/pod-subpath-test-inlinevolume-png4]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition"
```

In the other instances the pod always got scheduled, but the kubelet failed to run it.

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1306096546189676544

```
Sep 16 06:24:21.082: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-l8nc: { } Scheduled: Successfully assigned e2e-provisioning-6020/pod-subpath-test-inlinevolume-l8nc to ci-op-4bzyjwnv-1a9c0-s7zsc-worker-c-cdm6z
...
Sep 16 06:24:21.082: INFO: At 2020-09-16 06:19:59 +0000 UTC - event for pod-subpath-test-inlinevolume-l8nc: {kubelet ci-op-4bzyjwnv-1a9c0-s7zsc-worker-c-cdm6z} FailedMount: MountVolume.SetUp failed for volume "test-volume" : mount failed: mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a30ff59a-1877-4da3-8751-22becc5a5324/volumes/kubernetes.io~glusterfs/test-volume --scope -- mount -t glusterfs -o auto_unmount,log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/test-volume/pod-subpath-test-inlinevolume-l8nc-glusterfs.log,log-level=ERROR 10.129.3.160:test_vol /var/lib/kubelet/pods/a30ff59a-1877-4da3-8751-22becc5a5324/volumes/kubernetes.io~glusterfs/test-volume
Output: Running scope as unit: run-r614d3d525db64ad58d4d91d4a59149be.scope
Mounting glusterfs on /var/lib/kubelet/pods/a30ff59a-1877-4da3-8751-22becc5a5324/volumes/kubernetes.io~glusterfs/test-volume failed.
```

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1305756913199222784

```
Sep 15 07:30:16.447: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-xwht: { } Scheduled: Successfully assigned e2e-provisioning-8289/pod-subpath-test-inlinevolume-xwht to ci-op-r94d0cmx-1a9c0-qd4v7-worker-b-jmtlm
```

The kubelet did not start the pod in time.

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1306068071802212352

```
Sep 16 05:22:03.520: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: { } Scheduled: Successfully assigned e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7 to ovirt14-gtf64-worker-0-wdplj
Sep 16 05:22:03.520: INFO: At 2020-09-16 05:18:31 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: {kubelet ovirt14-gtf64-worker-0-wdplj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-v5j7_e2e-provisioning-1046_773d37ee-288b-404e-a025-af60dfe92721_0(2844d32da63bd56bd2737eec25fccfa1122af5bd9ce31ff12c8c395877764542): Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 05:22:03.520: INFO: At 2020-09-16 05:20:17 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: {kubelet ovirt14-gtf64-worker-0-wdplj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-v5j7_e2e-provisioning-1046_773d37ee-288b-404e-a025-af60dfe92721_0(0bcabfb1657ed4e4e73a584670d7f35e2af49f9fe4ecefd820fb0daf72523fe5): Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
Sep 16 05:22:03.520: INFO: At 2020-09-16 05:21:59 +0000 UTC - event for pod-subpath-test-inlinevolume-v5j7: {kubelet ovirt14-gtf64-worker-0-wdplj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-v5j7_e2e-provisioning-1046_773d37ee-288b-404e-a025-af60dfe92721_0(70ed392b8409dc2e9b0abe30378f24282f6f9a23224048af37a95b75c44d5a83): Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
```

Pods in the other failing inline-volume tests got scheduled, but the kubelet did not make them run.

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.6/1305705117474885632

```
Sep 15 04:14:35.884: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-inlinevolume-lgbv: { } Scheduled: Successfully assigned e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv to ci-op-0zr8v53k-45fdb-jmj5t-worker-b-p8n9r
Sep 15 04:14:35.884: INFO: At 2020-09-15 04:09:33 +0000 UTC - event for pod-subpath-test-inlinevolume-lgbv: {controlplane } ErrorAddingLogicalPort: timeout waiting for namespace event
Sep 15 04:14:35.884: INFO: At 2020-09-15 04:09:46 +0000 UTC - event for pod-subpath-test-inlinevolume-lgbv: {kubelet ci-op-0zr8v53k-45fdb-jmj5t-worker-b-p8n9r} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pod-subpath-test-inlinevolume-lgbv_e2e-provisioning-652_a5963cf0-24f7-4505-932d-598b7a8b294b_0(9b1b1519afb1c4ecddfa5e7c29bd0d0185e7ec85dcbb9e4a449367a40a94c61c): [e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv] failed to configure pod interface: timed out waiting for pod flows for pod: pod-subpath-test-inlinevolume-lgbv, error: timed out waiting for the condition
...
```

### https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.5/1305478194941399040

The pod got scheduled; the kubelet failed to start it.

Comment 25 Jan Chaloupka 2020-09-16 10:03:42 UTC
In all the cases I checked, the pod-subpath-test-inlinevolume pods got scheduled. In most cases the underlying containers did not start due to FailedCreatePodSandBox.

Re-assigning to node team to check the kubelet side of the failures.

Comment 26 Seth Jennings 2020-09-16 17:45:56 UTC
I hate to bounce this around, but it seems the majority of the failures are due to CNI failing underneath CRI-O, with the exception of the one glusterfs mount failure among the runs that Jan analyzed in comment 24:

Multus: [e2e-provisioning-5447/pod-subpath-test-inlinevolume-vxsb]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition

Multus: [e2e-provisioning-5179/pod-subpath-test-inlinevolume-png4]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition

Multus: [e2e-provisioning-1046/pod-subpath-test-inlinevolume-v5j7]: PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition

error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-provisioning-652/pod-subpath-test-inlinevolume-lgbv] failed to configure pod interface: timed out waiting for pod flows for pod: pod-subpath-test-inlinevolume-lgbv, error: timed out waiting for the condition

Comment 27 Douglas Smith 2020-09-16 18:05:02 UTC
When Multus is "waiting for ReadinessIndicatorFile", it is waiting for a semaphore from the default network (either OVN or OpenShift-SDN):

```
PollImmediate error waiting for ReadinessIndicatorFile: timed out waiting for the condition
```

Do we have any logs from ovn-k or openshift-sdn? That is where our root cause will wind up being.

Multus is provisioned before openshift-sdn or ovn-k, but it can't say "the nodes are ready" (by placing a CNI configuration on disk) until OpenShift-SDN or ovn-k is ready.
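
For context, the "PollImmediate error waiting for ReadinessIndicatorFile" message corresponds to a polling loop roughly like the sketch below (an illustrative sketch assuming a simple stat-based existence check; this is not the actual Multus source):

```
package readiness

import (
	"fmt"
	"os"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForReadinessIndicator polls until the indicator file exists or the
// timeout expires. On timeout it surfaces the same shape of error seen in
// the events above.
func waitForReadinessIndicator(path string, timeout time.Duration) error {
	err := wait.PollImmediate(1*time.Second, timeout, func() (bool, error) {
		if _, statErr := os.Stat(path); statErr == nil {
			return true, nil // the default network has written its CNI config
		}
		return false, nil // not there yet, keep polling
	})
	if err != nil {
		return fmt.Errorf("error waiting for ReadinessIndicatorFile %q: %v", path, err)
	}
	return nil
}
```

If the default network never writes that file (because openshift-sdn or ovn-k is unhealthy on the node), every sandbox creation on that node will keep timing out, which matches the events above.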

