Bug 1816194
| Summary: | 4.2z upgrade test fails with 'failed to get pod status: pods \"pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70\" not found' | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vadim Rutkovsky <vrutkovs> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.z | CC: | akashem, aos-bugs, bbennett, jdesousa, jokerman, mfojtik, rphillips |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-15 18:29:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I checked all clusteroperator objects; all status conditions reported fine except for the following:

curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/clusteroperators.json | jq '.items | .[] | .status.conditions[] | select((.type == "Progressing" and .status == "True") or (.type == "Upgradeable" and .status == "False") or (.type == "Degraded" and .status == "True") or (.type == "Available" and .status == "False")) | .'

{
"lastTransitionTime": "2020-03-23T11:39:40Z",
"message": "Progressing: not all deployment replicas are ready",
"reason": "ProgressingOAuthServerDeploymentNotReady",
"status": "True",
"type": "Progressing"
}
{
"lastTransitionTime": "2020-03-23T11:39:41Z",
"message": "DeploymentAvailable: 1 replicas ready at version 4.2.25",
"reason": "DeploymentAvailableFailedUpdate",
"status": "False",
"type": "Available"
}
{
"lastTransitionTime": "2020-03-23T11:40:40Z",
"message": "DaemonSet \"openshift-sdn/ovs\" is not available (awaiting 1 nodes)",
"reason": "Deploying",
"status": "True",
"type": "Progressing"
}

Can the error 'DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)' be the root cause?

Also, I searched CI for this exact error and found no trace of it as far back as two weeks. The following search returned zero results:

https://search.svc.ci.openshift.org/?search=Unexpected+error+occurred.*failed+to+get+pod+status.*pods.*not+found&maxAge=336h&context=2&type=all

Looking at the log, we see that the test creates the Pod object and waits for it:

Mar 23 11:36:44.524: INFO: Waiting up to 5m0s for pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" to be success or failure

After about 10 seconds the pod object gets removed (why?):

Mar 23 11:36:56.566: INFO: Pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 no longer exists
Mar 23 11:36:56.566: INFO: Unexpected error occurred: failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found

I also checked the events in the openshift-sdn namespace and found a couple of interesting events (a query sketch follows the event JSON below):
{
"apiVersion": "v1",
"count": 1,
"eventTime": null,
"firstTimestamp": "2020-03-23T11:39:12Z",
"involvedObject": {
"apiVersion": "v1",
"kind": "Pod",
"name": "sdn-x5stt",
"namespace": "openshift-sdn",
"resourceVersion": "25315",
"uid": "4ed85eb4-6cf6-11ea-b0d9-42010a000005"
},
"kind": "Event",
"lastTimestamp": "2020-03-23T11:39:12Z",
"message": "MountVolume.SetUp failed for volume \"config\" : couldn't propagate object cache: timed out waiting for the condition",
"metadata": {
"creationTimestamp": "2020-03-23T11:39:23Z",
"name": "sdn-x5stt.15feebf4366f06a9",
"namespace": "openshift-sdn",
"resourceVersion": "25897",
"selfLink": "/api/v1/namespaces/openshift-sdn/events/sdn-x5stt.15feebf4366f06a9",
"uid": "f15121b4-6cfa-11ea-ba81-42010a000003"
},
"reason": "FailedMount",
"reportingComponent": "",
"reportingInstance": "",
"source": {
"component": "kubelet",
"host": "ci-op--t7sqw-m-0.c.openshift-gce-devel-ci.internal"
},
"type": "Warning"
}
{
"apiVersion": "v1",
"count": 1,
"eventTime": null,
"firstTimestamp": "2020-03-23T11:39:18Z",
"involvedObject": {
"apiVersion": "v1",
"fieldPath": "spec.containers{sdn}",
"kind": "Pod",
"name": "sdn-x5stt",
"namespace": "openshift-sdn",
"resourceVersion": "25315",
"uid": "4ed85eb4-6cf6-11ea-b0d9-42010a000005"
},
"kind": "Event",
"lastTimestamp": "2020-03-23T11:39:18Z",
"message": "Liveness probe failed: Get http://10.0.0.4:10256/healthz: dial tcp 10.0.0.4:10256: connect: connection refused",
"metadata": {
"creationTimestamp": "2020-03-23T11:39:51Z",
"name": "sdn-x5stt.15feebf595d3c523",
"namespace": "openshift-sdn",
"resourceVersion": "27103",
"selfLink": "/api/v1/namespaces/openshift-sdn/events/sdn-x5stt.15feebf595d3c523",
"uid": "01e316ed-6cfb-11ea-9dae-42010a000004"
},
"reason": "Unhealthy",
"reportingComponent": "",
"reportingInstance": "",
"source": {
"component": "kubelet",
"host": "ci-op--t7sqw-m-0.c.openshift-gce-devel-ci.internal"
},
"type": "Warning"
}
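A minimal sketch of the kind of query that surfaces these Warning events, assuming oc and jq access to the affected cluster (the exact command used during this investigation is not recorded here):

# Sketch only: list Warning events in openshift-sdn and keep the fields used above.
oc get events -n openshift-sdn --field-selector type=Warning -o json \
  | jq '.items[] | {reason, message, lastTimestamp, pod: .involvedObject.name, node: .source.host}'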
I also checked the sdn Pod object above and found the following:
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"creationTimestamp": "2020-03-23T11:06:12Z",
"generateName": "sdn-",
"labels": {
"app": "sdn",
"component": "network",
"controller-revision-hash": "766d7fb896",
"openshift.io/component": "network",
"pod-template-generation": "1",
"type": "infra"
},
"name": "sdn-x5stt",
"status": {
"conditions": [
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:39:16Z",
"status": "True",
"type": "Initialized"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:39:23Z",
"status": "True",
"type": "Ready"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:39:23Z",
"status": "True",
"type": "ContainersReady"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:06:12Z",
"status": "True",
"type": "PodScheduled"
}
],
"containerStatuses": [
{
"containerID": "cri-o://e49a73bb1235e9e1c6a8ff7a8be43ee41eb8ed231b3d0a3774c0a500b11d9f91",
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a9a1297617df6617a8e8297987084beeb76dd324814e896dd0706cdbc9618ce",
"imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a9a1297617df6617a8e8297987084beeb76dd324814e896dd0706cdbc9618ce",
"lastState": {
"terminated": {
"containerID": "cri-o://7d220ad399e27e2cc459815280d54240a41a200c32559dd25000e2a3ea646ed7",
"exitCode": 255,
"finishedAt": "2020-03-23T11:39:10Z",
"message": "23 11:37:20.690211 4782 proxier.go:346] userspace syncProxyRules took 100.232011ms\nI0323 11:37:20.690223 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:20.690234 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:20.912750 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:21.003229 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:21.003255 4782 proxier.go:346] userspace syncProxyRules took 90.480026ms\nI0323 11:37:21.003267 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:21.971741 4782 vnids.go:162] Dissociate netid 16301474 from namespace \"e2e-k8s-sig-apps-deployment-upgrade-7206\"\nI0323 11:37:30.028273 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:30.028364 4782 service.go:357] Removing service port \"e2e-k8s-service-upgrade-8399/service-test:\"\nI0323 11:37:30.035938 4782 roundrobin.go:338] LoadBalancerRR: Removing endpoints for e2e-k8s-service-upgrade-8399/service-test:\nI0323 11:37:30.289797 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:30.393149 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:30.393183 4782 proxier.go:346] userspace syncProxyRules took 103.252369ms\nI0323 11:37:30.393198 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:30.393209 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:30.645528 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:30.742651 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:30.742678 4782 proxier.go:346] userspace syncProxyRules took 97.036018ms\nI0323 11:37:30.742690 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:40.698263 4782 vnids.go:162] Dissociate netid 1366709 from namespace \"e2e-k8s-service-upgrade-8399\"\ninterrupt: Gracefully shutting down ...\n",
"reason": "Error",
"startedAt": "2020-03-23T11:07:13Z"
}
},
"name": "sdn",
"ready": true,
"restartCount": 2,
"state": {
"running": {
"startedAt": "2020-03-23T11:39:17Z"
}
}
}
],
"hostIP": "10.0.0.4",
"initContainerStatuses": [
{
"containerID": "cri-o://078342e02ce0ece9e39b1439ef9b4d168913502f3fcd791960822daddd22a1ed",
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e4e7d4920234d742c8b1fd059cfe13865f5a6e3fc3c529cfb666ae77edb11afb",
"imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e4e7d4920234d742c8b1fd059cfe13865f5a6e3fc3c529cfb666ae77edb11afb",
"lastState": {},
"name": "install-cni-plugins",
"ready": true,
"restartCount": 1,
"state": {
"terminated": {
"containerID": "cri-o://078342e02ce0ece9e39b1439ef9b4d168913502f3fcd791960822daddd22a1ed",
"exitCode": 0,
"finishedAt": "2020-03-23T11:39:15Z",
"reason": "Completed",
"startedAt": "2020-03-23T11:39:15Z"
}
}
}
],
"phase": "Running",
"podIP": "10.0.0.4",
"qosClass": "Burstable",
"startTime": "2020-03-23T11:06:12Z"
}
}
The error in `containerStatuses` coincides with the test error. For now I am assigning this BZ to the network team so that they can validate it. Please feel free to reassign it back to `kube-apiserver` if you find otherwise.
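For reference, a minimal sketch of how the restart count and last termination state in `containerStatuses` can be pulled out, assuming oc and jq against the live cluster (pod and namespace names are the ones from the JSON above):

# Sketch only: show restart count and last termination details of the sdn container.
oc get pod sdn-x5stt -n openshift-sdn -o json \
  | jq '.status.containerStatuses[] | {name, restartCount, lastTermination: (.lastState.terminated | {exitCode, reason, finishedAt})}'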
The SDN pods seem to have been gracefully terminated around that time. The pod whose liveness probe failed isn't really a cause for concern, because the probe fails at 11:39:18 and the first log line after the restart is also at 11:39:18. These are all the SDN pods that failed at correlated times:

- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-dwtqq_sdn_previous.log
- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-pck4j_sdn_previous.log
- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-52pcx_sdn_previous.log

I see the SDN pods are restarted after logging this event:

"message": "Pod sandbox changed, it will be killed and re-created.",

Checking the node logs, it looks like the SDN is yet another victim here. I'm not quite sure why this happens, but I see a lot of activity from crio and the kubelet on m-0 around that time. I don't think the SDN is at fault here. Reassigning to the node team to have a look.

Test URL: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/nodes/

*** This bug has been marked as a duplicate of bug 1821341 ***
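A hedged way to re-check the correlation described in the comment above is to search each linked previous-container log for the graceful-shutdown marker that also appears in the sdn container's lastState message:

# Sketch only: assumes curl and grep; URLs are the artifact links from the comment above.
for pod in sdn-dwtqq sdn-pck4j sdn-52pcx; do
  echo "== ${pod} =="
  curl -s "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_${pod}_sdn_previous.log" \
    | grep -n "Gracefully shutting down"
done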
Description of problem:

4.2.24 -> 4.2.25 test failed with

fail [k8s.io/kubernetes/test/e2e/framework/util.go:2456]: Unexpected error:
    <*errors.errorString | 0xc002a02d60>: {
        s: "failed to get pod status: pods \"pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70\" not found",
    }
    failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
occurred

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471

Apparently this pod was removed earlier (race condition?):

Mar 23 11:36:44.524: INFO: Waiting up to 5m0s for pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" to be "success or failure"
Mar 23 11:36:54.553: INFO: Pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70": Phase="Pending", Reason="", readiness=false. Elapsed: 10.0291803s
Mar 23 11:36:56.556: INFO: Pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" not found. Error: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
STEP: delete the pod
Mar 23 11:36:56.563: INFO: Waiting for pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 to disappear
Mar 23 11:36:56.566: INFO: Pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 no longer exists
Mar 23 11:36:56.566: INFO: Unexpected error occurred: failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
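A possible follow-up check, not part of the original report and assuming oc/jq access while the e2e namespace still exists, is to look for events explaining why the pod disappeared:

# Sketch only: list events that reference the missing pod (evictions, kills, etc.).
oc get events -n e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434 \
  --field-selector involvedObject.name=pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 \
  -o json | jq '.items[] | {reason, message, lastTimestamp, component: .source.component}'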