Description of problem:

The 4.2.24 -> 4.2.25 upgrade test failed with:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:2456]: Unexpected error:
    <*errors.errorString | 0xc002a02d60>: {
        s: "failed to get pod status: pods \"pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70\" not found",
    }
    failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
occurred

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471

Apparently the pod was removed earlier (race condition?):

Mar 23 11:36:44.524: INFO: Waiting up to 5m0s for pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" to be "success or failure"
Mar 23 11:36:54.553: INFO: Pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70": Phase="Pending", Reason="", readiness=false. Elapsed: 10.0291803s
Mar 23 11:36:56.556: INFO: Pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" not found.
Error: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
STEP: delete the pod
Mar 23 11:36:56.563: INFO: Waiting for pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 to disappear
Mar 23 11:36:56.566: INFO: Pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 no longer exists
Mar 23 11:36:56.566: INFO: Unexpected error occurred: failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
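For reference, the timeline above can be re-extracted from the job output with something like the following; the build-log.txt path follows the usual Prow artifact layout and is an assumption, not something verified in this report:

# Assumed path (standard Prow layout); grep the job log for the failing test pod's name.
curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/build-log.txt | grep 'pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70'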
I checked all clusteroperator objects; all status conditions reported fine except for the following:

curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/clusteroperators.json | jq '.items | .[] | .status.conditions[] | select((.type == "Progressing" and .status == "True") or (.type == "Upgradeable" and .status == "False") or (.type == "Degraded" and .status == "True") or (.type == "Available" and .status == "False")) | .'

{
  "lastTransitionTime": "2020-03-23T11:39:40Z",
  "message": "Progressing: not all deployment replicas are ready",
  "reason": "ProgressingOAuthServerDeploymentNotReady",
  "status": "True",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2020-03-23T11:39:41Z",
  "message": "DeploymentAvailable: 1 replicas ready at version 4.2.25",
  "reason": "DeploymentAvailableFailedUpdate",
  "status": "False",
  "type": "Available"
}
{
  "lastTransitionTime": "2020-03-23T11:40:40Z",
  "message": "DaemonSet \"openshift-sdn/ovs\" is not available (awaiting 1 nodes)",
  "reason": "Deploying",
  "status": "True",
  "type": "Progressing"
}

Can this error 'DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)' be the root cause?

Also, I searched CI for this exact error and found no trace of it going back two weeks. The following search returned zero results:

https://search.svc.ci.openshift.org/?search=Unexpected+error+occurred.*failed+to+get+pod+status.*pods.*not+found&maxAge=336h&context=2&type=all

Looking at the log, we see that the test creates the Pod object and waits for it:

Mar 23 11:36:44.524: INFO: Waiting up to 5m0s for pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" to be "success or failure"

After about 10 seconds the pod object gets removed (why?):

Mar 23 11:36:56.566: INFO: Pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 no longer exists
Mar 23 11:36:56.566: INFO: Unexpected error occurred: failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
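Going back to the clusteroperator check above: the same filter can be extended to also show which operator reported each condition. A minor variation on the command above, against the same artifact file:

curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/clusteroperators.json | jq '.items[] | {name: .metadata.name, conditions: [.status.conditions[] | select((.type == "Progressing" and .status == "True") or (.type == "Upgradeable" and .status == "False") or (.type == "Degraded" and .status == "True") or (.type == "Available" and .status == "False"))]} | select(.conditions | length > 0)'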
I checked the events in the openshift-sdn namespace and found a couple of interesting ones:

{
  "apiVersion": "v1",
  "count": 1,
  "eventTime": null,
  "firstTimestamp": "2020-03-23T11:39:12Z",
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": "sdn-x5stt",
    "namespace": "openshift-sdn",
    "resourceVersion": "25315",
    "uid": "4ed85eb4-6cf6-11ea-b0d9-42010a000005"
  },
  "kind": "Event",
  "lastTimestamp": "2020-03-23T11:39:12Z",
  "message": "MountVolume.SetUp failed for volume \"config\" : couldn't propagate object cache: timed out waiting for the condition",
  "metadata": {
    "creationTimestamp": "2020-03-23T11:39:23Z",
    "name": "sdn-x5stt.15feebf4366f06a9",
    "namespace": "openshift-sdn",
    "resourceVersion": "25897",
    "selfLink": "/api/v1/namespaces/openshift-sdn/events/sdn-x5stt.15feebf4366f06a9",
    "uid": "f15121b4-6cfa-11ea-ba81-42010a000003"
  },
  "reason": "FailedMount",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "kubelet",
    "host": "ci-op--t7sqw-m-0.c.openshift-gce-devel-ci.internal"
  },
  "type": "Warning"
}
{
  "apiVersion": "v1",
  "count": 1,
  "eventTime": null,
  "firstTimestamp": "2020-03-23T11:39:18Z",
  "involvedObject": {
    "apiVersion": "v1",
    "fieldPath": "spec.containers{sdn}",
    "kind": "Pod",
    "name": "sdn-x5stt",
    "namespace": "openshift-sdn",
    "resourceVersion": "25315",
    "uid": "4ed85eb4-6cf6-11ea-b0d9-42010a000005"
  },
  "kind": "Event",
  "lastTimestamp": "2020-03-23T11:39:18Z",
  "message": "Liveness probe failed: Get http://10.0.0.4:10256/healthz: dial tcp 10.0.0.4:10256: connect: connection refused",
  "metadata": {
    "creationTimestamp": "2020-03-23T11:39:51Z",
    "name": "sdn-x5stt.15feebf595d3c523",
    "namespace": "openshift-sdn",
    "resourceVersion": "27103",
    "selfLink": "/api/v1/namespaces/openshift-sdn/events/sdn-x5stt.15feebf595d3c523",
    "uid": "01e316ed-6cfb-11ea-9dae-42010a000004"
  },
  "reason": "Unhealthy",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "kubelet",
    "host": "ci-op--t7sqw-m-0.c.openshift-gce-devel-ci.internal"
  },
  "type": "Warning"
}

I also checked the sdn Pod object referenced above and found the following:

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "creationTimestamp": "2020-03-23T11:06:12Z",
    "generateName": "sdn-",
    "labels": {
      "app": "sdn",
      "component": "network",
      "controller-revision-hash": "766d7fb896",
      "openshift.io/component": "network",
      "pod-template-generation": "1",
      "type": "infra"
    },
    "name": "sdn-x5stt"
  },
  "status": {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-03-23T11:39:16Z",
        "status": "True",
        "type": "Initialized"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-03-23T11:39:23Z",
        "status": "True",
        "type": "Ready"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-03-23T11:39:23Z",
        "status": "True",
        "type": "ContainersReady"
      },
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2020-03-23T11:06:12Z",
        "status": "True",
        "type": "PodScheduled"
      }
    ],
    "containerStatuses": [
      {
        "containerID": "cri-o://e49a73bb1235e9e1c6a8ff7a8be43ee41eb8ed231b3d0a3774c0a500b11d9f91",
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a9a1297617df6617a8e8297987084beeb76dd324814e896dd0706cdbc9618ce",
        "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a9a1297617df6617a8e8297987084beeb76dd324814e896dd0706cdbc9618ce",
        "lastState": {
          "terminated": {
            "containerID": "cri-o://7d220ad399e27e2cc459815280d54240a41a200c32559dd25000e2a3ea646ed7",
            "exitCode": 255,
            "finishedAt": "2020-03-23T11:39:10Z",
            "message": "23 11:37:20.690211 4782 proxier.go:346] userspace syncProxyRules took 100.232011ms\nI0323 11:37:20.690223 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:20.690234 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:20.912750 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:21.003229 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:21.003255 4782 proxier.go:346] userspace syncProxyRules took 90.480026ms\nI0323 11:37:21.003267 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:21.971741 4782 vnids.go:162] Dissociate netid 16301474 from namespace \"e2e-k8s-sig-apps-deployment-upgrade-7206\"\nI0323 11:37:30.028273 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:30.028364 4782 service.go:357] Removing service port \"e2e-k8s-service-upgrade-8399/service-test:\"\nI0323 11:37:30.035938 4782 roundrobin.go:338] LoadBalancerRR: Removing endpoints for e2e-k8s-service-upgrade-8399/service-test:\nI0323 11:37:30.289797 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:30.393149 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:30.393183 4782 proxier.go:346] userspace syncProxyRules took 103.252369ms\nI0323 11:37:30.393198 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:30.393209 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:30.645528 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:30.742651 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:30.742678 4782 proxier.go:346] userspace syncProxyRules took 97.036018ms\nI0323 11:37:30.742690 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:40.698263 4782 vnids.go:162] Dissociate netid 1366709 from namespace \"e2e-k8s-service-upgrade-8399\"\ninterrupt: Gracefully shutting down ...\n",
            "reason": "Error",
            "startedAt": "2020-03-23T11:07:13Z"
          }
        },
        "name": "sdn",
        "ready": true,
        "restartCount": 2,
        "state": {
          "running": {
            "startedAt": "2020-03-23T11:39:17Z"
          }
        }
      }
    ],
    "hostIP": "10.0.0.4",
    "initContainerStatuses": [
      {
        "containerID": "cri-o://078342e02ce0ece9e39b1439ef9b4d168913502f3fcd791960822daddd22a1ed",
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e4e7d4920234d742c8b1fd059cfe13865f5a6e3fc3c529cfb666ae77edb11afb",
        "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e4e7d4920234d742c8b1fd059cfe13865f5a6e3fc3c529cfb666ae77edb11afb",
        "lastState": {},
        "name": "install-cni-plugins",
        "ready": true,
        "restartCount": 1,
        "state": {
          "terminated": {
            "containerID": "cri-o://078342e02ce0ece9e39b1439ef9b4d168913502f3fcd791960822daddd22a1ed",
            "exitCode": 0,
            "finishedAt": "2020-03-23T11:39:15Z",
            "reason": "Completed",
            "startedAt": "2020-03-23T11:39:15Z"
          }
        }
      }
    ],
    "phase": "Running",
    "podIP": "10.0.0.4",
    "qosClass": "Burstable",
    "startTime": "2020-03-23T11:06:12Z"
  }
}

The error in `containerStatuses` coincides with the test error. So for now I am assigning this BZ to the network team so that they can validate this. Please feel free to reassign it back to `kube-apiserver` if you find otherwise.
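For anyone following up: a quick way to line the sdn container terminations up against the test timestamps is to list each container's restart count and last termination. A minimal sketch, run against a live cluster (or the equivalent gathered pod JSON); the app=sdn selector matches the labels shown on the pod above:

# Sketch: print pod name, container name, restart count, and last termination for each sdn container.
oc get pods -n openshift-sdn -l app=sdn -o json | jq '.items[] | .metadata.name as $pod | .status.containerStatuses[] | {pod: $pod, container: .name, restarts: .restartCount, lastTermination: .lastState.terminated}'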
SDN pods seem to have been gracefully terminated around that time. The pod whose liveness probe failed isn't really a cause for concern, because the probe fails at 11:39:18 and the first log line after the restart is also at 11:39:18.

These are all the SDN pods that failed at correlated times:

- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-dwtqq_sdn_previous.log
- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-pck4j_sdn_previous.log
- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-52pcx_sdn_previous.log

I see the SDN pods are restarted after logging this event:

"message": "Pod sandbox changed, it will be killed and re-created.",

Checking the node logs, it looks like the SDN is yet another victim here. I'm not quite sure why this happens, but I see a lot of activity on crio and the kubelet on m-0 around that time. I don't think the SDN is at fault here. Reassigning to the node team to have a look.

Node logs from the job artifacts:
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/nodes/
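For anyone re-checking this, the correlated shutdowns are visible right at the tail of the previous-container logs linked above, e.g.:

# Print the last few lines of each previous sdn container log listed above.
for p in sdn-dwtqq sdn-pck4j sdn-52pcx; do
  echo "== $p =="
  curl -s "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_${p}_sdn_previous.log" | tail -n 5
done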
*** This bug has been marked as a duplicate of bug 1821341 ***