Bug 1816194
| Summary: | 4.2z upgrade test fails with 'failed to get pod status: pods \"pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70\" not found' | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vadim Rutkovsky <vrutkovs> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.z | CC: | akashem, aos-bugs, bbennett, jdesousa, jokerman, mfojtik, rphillips |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-15 18:29:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I checked all clusteroperator objects; all status conditions reported fine except for the following:

curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/clusteroperators.json | jq '.items | .[] | .status.conditions[] | select((.type == "Progressing" and .status == "True") or (.type == "Upgradeable" and .status == "False") or (.type == "Degraded" and .status == "True") or (.type == "Available" and .status == "False")) | .'

{
"lastTransitionTime": "2020-03-23T11:39:40Z",
"message": "Progressing: not all deployment replicas are ready",
"reason": "ProgressingOAuthServerDeploymentNotReady",
"status": "True",
"type": "Progressing"
}
{
"lastTransitionTime": "2020-03-23T11:39:41Z",
"message": "DeploymentAvailable: 1 replicas ready at version 4.2.25",
"reason": "DeploymentAvailableFailedUpdate",
"status": "False",
"type": "Available"
}
{
"lastTransitionTime": "2020-03-23T11:40:40Z",
"message": "DaemonSet \"openshift-sdn/ovs\" is not available (awaiting 1 nodes)",
"reason": "Deploying",
"status": "True",
"type": "Progressing"
}

Can the error 'DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)' be the root cause?

Also, I searched CI for this exact error and found no trace of it as far back as two weeks. The following search returned zero results:

https://search.svc.ci.openshift.org/?search=Unexpected+error+occurred.*failed+to+get+pod+status.*pods.*not+found&maxAge=336h&context=2&type=all

Looking at the log, we see that the test creates the Pod object and waits for it:

Mar 23 11:36:44.524: INFO: Waiting up to 5m0s for pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" to be success or failure

After about 10 seconds the pod object gets removed (why?):

Mar 23 11:36:56.566: INFO: Pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 no longer exists
Mar 23 11:36:56.566: INFO: Unexpected error occurred: failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found

I also checked the events in the openshift-sdn namespace and found a couple of interesting events (a query sketch follows the event JSON below):
{
"apiVersion": "v1",
"count": 1,
"eventTime": null,
"firstTimestamp": "2020-03-23T11:39:12Z",
"involvedObject": {
"apiVersion": "v1",
"kind": "Pod",
"name": "sdn-x5stt",
"namespace": "openshift-sdn",
"resourceVersion": "25315",
"uid": "4ed85eb4-6cf6-11ea-b0d9-42010a000005"
},
"kind": "Event",
"lastTimestamp": "2020-03-23T11:39:12Z",
"message": "MountVolume.SetUp failed for volume \"config\" : couldn't propagate object cache: timed out waiting for the condition",
"metadata": {
"creationTimestamp": "2020-03-23T11:39:23Z",
"name": "sdn-x5stt.15feebf4366f06a9",
"namespace": "openshift-sdn",
"resourceVersion": "25897",
"selfLink": "/api/v1/namespaces/openshift-sdn/events/sdn-x5stt.15feebf4366f06a9",
"uid": "f15121b4-6cfa-11ea-ba81-42010a000003"
},
"reason": "FailedMount",
"reportingComponent": "",
"reportingInstance": "",
"source": {
"component": "kubelet",
"host": "ci-op--t7sqw-m-0.c.openshift-gce-devel-ci.internal"
},
"type": "Warning"
}
{
"apiVersion": "v1",
"count": 1,
"eventTime": null,
"firstTimestamp": "2020-03-23T11:39:18Z",
"involvedObject": {
"apiVersion": "v1",
"fieldPath": "spec.containers{sdn}",
"kind": "Pod",
"name": "sdn-x5stt",
"namespace": "openshift-sdn",
"resourceVersion": "25315",
"uid": "4ed85eb4-6cf6-11ea-b0d9-42010a000005"
},
"kind": "Event",
"lastTimestamp": "2020-03-23T11:39:18Z",
"message": "Liveness probe failed: Get http://10.0.0.4:10256/healthz: dial tcp 10.0.0.4:10256: connect: connection refused",
"metadata": {
"creationTimestamp": "2020-03-23T11:39:51Z",
"name": "sdn-x5stt.15feebf595d3c523",
"namespace": "openshift-sdn",
"resourceVersion": "27103",
"selfLink": "/api/v1/namespaces/openshift-sdn/events/sdn-x5stt.15feebf595d3c523",
"uid": "01e316ed-6cfb-11ea-9dae-42010a000004"
},
"reason": "Unhealthy",
"reportingComponent": "",
"reportingInstance": "",
"source": {
"component": "kubelet",
"host": "ci-op--t7sqw-m-0.c.openshift-gce-devel-ci.internal"
},
"type": "Warning"
}
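A minimal sketch of the kind of query that surfaces these Warning events, assuming oc and jq access to the affected cluster (the exact command used during this investigation is not recorded here):

# Sketch only: list Warning events in openshift-sdn and keep the fields used above.
oc get events -n openshift-sdn --field-selector type=Warning -o json \
  | jq '.items[] | {reason, message, lastTimestamp, pod: .involvedObject.name, node: .source.host}'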
I also checked the sdn Pod object above and found the following:
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"creationTimestamp": "2020-03-23T11:06:12Z",
"generateName": "sdn-",
"labels": {
"app": "sdn",
"component": "network",
"controller-revision-hash": "766d7fb896",
"openshift.io/component": "network",
"pod-template-generation": "1",
"type": "infra"
},
"name": "sdn-x5stt",
"status": {
"conditions": [
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:39:16Z",
"status": "True",
"type": "Initialized"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:39:23Z",
"status": "True",
"type": "Ready"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:39:23Z",
"status": "True",
"type": "ContainersReady"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-03-23T11:06:12Z",
"status": "True",
"type": "PodScheduled"
}
],
"containerStatuses": [
{
"containerID": "cri-o://e49a73bb1235e9e1c6a8ff7a8be43ee41eb8ed231b3d0a3774c0a500b11d9f91",
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a9a1297617df6617a8e8297987084beeb76dd324814e896dd0706cdbc9618ce",
"imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a9a1297617df6617a8e8297987084beeb76dd324814e896dd0706cdbc9618ce",
"lastState": {
"terminated": {
"containerID": "cri-o://7d220ad399e27e2cc459815280d54240a41a200c32559dd25000e2a3ea646ed7",
"exitCode": 255,
"finishedAt": "2020-03-23T11:39:10Z",
"message": "23 11:37:20.690211 4782 proxier.go:346] userspace syncProxyRules took 100.232011ms\nI0323 11:37:20.690223 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:20.690234 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:20.912750 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:21.003229 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:21.003255 4782 proxier.go:346] userspace syncProxyRules took 90.480026ms\nI0323 11:37:21.003267 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:21.971741 4782 vnids.go:162] Dissociate netid 16301474 from namespace \"e2e-k8s-sig-apps-deployment-upgrade-7206\"\nI0323 11:37:30.028273 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:30.028364 4782 service.go:357] Removing service port \"e2e-k8s-service-upgrade-8399/service-test:\"\nI0323 11:37:30.035938 4782 roundrobin.go:338] LoadBalancerRR: Removing endpoints for e2e-k8s-service-upgrade-8399/service-test:\nI0323 11:37:30.289797 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:30.393149 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:30.393183 4782 proxier.go:346] userspace syncProxyRules took 103.252369ms\nI0323 11:37:30.393198 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:30.393209 4782 proxy.go:331] hybrid proxy: syncProxyRules start\nI0323 11:37:30.645528 4782 proxy.go:334] hybrid proxy: mainProxy.syncProxyRules complete\nI0323 11:37:30.742651 4782 proxier.go:367] userspace proxy: processing 0 service events\nI0323 11:37:30.742678 4782 proxier.go:346] userspace syncProxyRules took 97.036018ms\nI0323 11:37:30.742690 4782 proxy.go:337] hybrid proxy: unidlingProxy.syncProxyRules complete\nI0323 11:37:40.698263 4782 vnids.go:162] Dissociate netid 1366709 from namespace \"e2e-k8s-service-upgrade-8399\"\ninterrupt: Gracefully shutting down ...\n",
"reason": "Error",
"startedAt": "2020-03-23T11:07:13Z"
}
},
"name": "sdn",
"ready": true,
"restartCount": 2,
"state": {
"running": {
"startedAt": "2020-03-23T11:39:17Z"
}
}
}
],
"hostIP": "10.0.0.4",
"initContainerStatuses": [
{
"containerID": "cri-o://078342e02ce0ece9e39b1439ef9b4d168913502f3fcd791960822daddd22a1ed",
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e4e7d4920234d742c8b1fd059cfe13865f5a6e3fc3c529cfb666ae77edb11afb",
"imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e4e7d4920234d742c8b1fd059cfe13865f5a6e3fc3c529cfb666ae77edb11afb",
"lastState": {},
"name": "install-cni-plugins",
"ready": true,
"restartCount": 1,
"state": {
"terminated": {
"containerID": "cri-o://078342e02ce0ece9e39b1439ef9b4d168913502f3fcd791960822daddd22a1ed",
"exitCode": 0,
"finishedAt": "2020-03-23T11:39:15Z",
"reason": "Completed",
"startedAt": "2020-03-23T11:39:15Z"
}
}
}
],
"phase": "Running",
"podIP": "10.0.0.4",
"qosClass": "Burstable",
"startTime": "2020-03-23T11:06:12Z"
}
}
The error in `containerStatuses` coincides with the test error. For now I am assigning this BZ to the network team so that they can validate it. Please feel free to reassign it back to `kube-apiserver` if you find otherwise.
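For reference, a minimal sketch of how the restart count and last termination state in `containerStatuses` can be pulled out, assuming oc and jq against the live cluster (pod and namespace names are the ones from the JSON above):

# Sketch only: show restart count and last termination details of the sdn container.
oc get pod sdn-x5stt -n openshift-sdn -o json \
  | jq '.status.containerStatuses[] | {name, restartCount, lastTermination: (.lastState.terminated | {exitCode, reason, finishedAt})}'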
The SDN pods seem to have been gracefully terminated around that time. The pod whose liveness probe failed isn't really a cause for concern, because the probe fails at 11:39:18 and the first log line after the restart is also at 11:39:18. These are all the SDN pods that failed at correlated times:

- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-dwtqq_sdn_previous.log
- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-pck4j_sdn_previous.log
- https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_sdn-52pcx_sdn_previous.log

I see the SDN pods are restarted after logging this event:

"message": "Pod sandbox changed, it will be killed and re-created.",

Checking the node logs, it looks like the SDN is yet another victim here. I'm not quite sure why this happens, but I see a lot of activity from crio and the kubelet on m-0 around that time. I don't think the SDN is at fault here. Reassigning to the node team to have a look.

Test URL: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/nodes/

*** This bug has been marked as a duplicate of bug 1821341 ***
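A hedged way to re-check the correlation described in the comment above is to search each linked previous-container log for the graceful-shutdown marker that also appears in the sdn container's lastState message:

# Sketch only: assumes curl and grep; URLs are the artifact links from the comment above.
for pod in sdn-dwtqq sdn-pck4j sdn-52pcx; do
  echo "== ${pod} =="
  curl -s "https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471/artifacts/e2e-gcp-upgrade/pods/openshift-sdn_${pod}_sdn_previous.log" \
    | grep -n "Gracefully shutting down"
done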
Description of problem:

4.2.24 -> 4.2.25 test failed with

fail [k8s.io/kubernetes/test/e2e/framework/util.go:2456]: Unexpected error:
    <*errors.errorString | 0xc002a02d60>: {
        s: "failed to get pod status: pods \"pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70\" not found",
    }
    failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
occurred

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/471

Apparently this pod was removed earlier (race condition?):

Mar 23 11:36:44.524: INFO: Waiting up to 5m0s for pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" to be "success or failure"
Mar 23 11:36:54.553: INFO: Pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70": Phase="Pending", Reason="", readiness=false. Elapsed: 10.0291803s
Mar 23 11:36:56.556: INFO: Pod "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434" not found. Error: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
STEP: delete the pod
Mar 23 11:36:56.563: INFO: Waiting for pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 to disappear
Mar 23 11:36:56.566: INFO: Pod pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 no longer exists
Mar 23 11:36:56.566: INFO: Unexpected error occurred: failed to get pod status: pods "pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70" not found
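A possible follow-up check, not part of the original report and assuming oc/jq access while the e2e namespace still exists, is to look for events explaining why the pod disappeared:

# Sketch only: list events that reference the missing pod (evictions, kills, etc.).
oc get events -n e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-6434 \
  --field-selector involvedObject.name=pod-secrets-9292e6fa-6cfa-11ea-bae2-0a58ac106c70 \
  -o json | jq '.items[] | {reason, message, lastTimestamp, component: .source.component}'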