Description of problem:

In https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24844/pull-ci-openshift-origin-master-e2e-gcp-upgrade/3637 we see:

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable

While this at first looks like https://bugzilla.redhat.com/show_bug.cgi?id=1817588, the symptoms are actually different:

- openshift-apiserver-operator switches to Degraded because it cannot deploy one of its pods (at the same time another one is not starting, possibly for a related root cause).

- One node (master-2) is unschedulable. It reports:

  "machineconfiguration.openshift.io/reason": "failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod \"cluster-monitoring-operator-5c4cbf9d66-hgcdx\" terminating: global timeout reached: 20s, error when waiting for pod \"machine-api-operator-7b9798f48b-gxnvx\" terminating: global timeout reached: 20s]",
  "machineconfiguration.openshift.io/state": "Degraded",

- The kubelet on master-2 keeps reporting the following for many minutes (observed for at least 15+ minutes):

  Pod "cluster-monitoring-operator-5c4cbf9d66-hgcdx_openshift-monitoring(da3230e0-929f-46d9-a027-8b9771c9cfbd)" is terminated, but some containers are still running

  getPodContainerStatuses for pod "machine-api-operator-7b9798f48b-gxnvx_openshift-machine-api(cef494f8-34c6-4ff4-a036-dd7feaea1dce)" failed: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist

  status_manager.go:434] Ignoring same status for pod "machine-config-server-tzm6l_openshift-machine-config-operator(c1577554-3ce7-4db4-8e72-a2b128505c3a)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP:10.0.0.4 PodIPs:[{IP:10.0.0.4}] StartTime:2020-04-09 07:50:03 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:machine-config-server State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-04-09 07:50:04 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ImageID:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable-initial@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ContainerID:cri-o://db3f8052a7c3de19295d1422763c6b8b52fd89637c87a90bc0c81ce3cc0139e4 Started:0xc0036030f0}] QOSClass:Burstable EphemeralContainerStatuses:[]}

  Apr 09 07:56:18.722663 ci-op-x47zh-m-2.c.openshift-gce-devel-ci.internal hyperkube[1471]: I0409 07:56:18.722640 1471 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-api", Name:"machine-api-operator-7b9798f48b-gxnvx", UID:"cef494f8-34c6-4ff4-a036-dd7feaea1dce", APIVersion:"v1", ResourceVersion:"8738", FieldPath:""}): type: 'Warning' reason: 'FailedSync' error determining status: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist

- The drain never finishes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
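For anyone triaging a similar run, here is a minimal client-go sketch (assuming client-go v0.18+ and a kubeconfig at the default location; this is not part of the failing job itself) that dumps the machineconfiguration.openshift.io state/reason annotations quoted above for every node:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig from the default location (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, n := range nodes.Items {
        // The machine-config daemon records drain failures in these annotations,
        // which is where the "failed to drain node (5 tries)" message above shows up.
        fmt.Printf("%s\n  state:  %s\n  reason: %s\n",
            n.Name,
            n.Annotations["machineconfiguration.openshift.io/state"],
            n.Annotations["machineconfiguration.openshift.io/reason"])
    }
}

The same annotations are visible with 'oc get node <name> -o yaml'.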
From the Slack discussion, we are going to revert [1] and follow up on the terminationGrace issue [2].

[1] https://github.com/cri-o/cri-o/pull/3458
[2] https://github.com/cri-o/cri-o/issues/3455
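For context while the follow-up in [2] is sorted out, here is a toy sketch (using k8s.io/apimachinery/pkg/util/wait; it is not the actual machine-config daemon or kubectl drain code, and the retry interval/timeout in main are arbitrary illustrative values) of how a per-pod wait hitting a 20s global timeout inside a retried drain composes the nested error strings seen in the reason annotation above when a container never stops:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// drainAttempt stands in for one drain try: it waits for an evicted pod to
// disappear and gives up when the drain's global timeout expires, producing
// the inner "global timeout reached: 20s" error from the annotation.
func drainAttempt() error {
    globalTimeout := 20 * time.Second
    podGone := func() (bool, error) { return false, nil } // the container never stops, as in this bug
    if err := wait.PollImmediate(time.Second, globalTimeout, podGone); err != nil {
        return fmt.Errorf("error when waiting for pod %q terminating: global timeout reached: %v",
            "machine-api-operator-7b9798f48b-gxnvx", globalTimeout)
    }
    return nil
}

func main() {
    // The retry wrapper around drain: when every attempt fails, wait.Poll
    // returns wait.ErrWaitTimeout, whose message is the outer
    // "timed out waiting for the condition" in the reason annotation.
    var lastErr error
    err := wait.Poll(2*time.Second, 10*time.Second, func() (bool, error) {
        lastErr = drainAttempt()
        return lastErr == nil, nil
    })
    if err != nil {
        fmt.Printf("failed to drain node: %v: [%v]\n", err, lastErr)
    }
}

Running this prints something like "failed to drain node: timed out waiting for the condition: [error when waiting for pod \"machine-api-operator-7b9798f48b-gxnvx\" terminating: global timeout reached: 20s]", mirroring the annotation on master-2.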
We are going to triage this some more. One BZ that spun out of this is a crash in Multus: https://bugzilla.redhat.com/show_bug.cgi?id=1822803
"not found: ID does not exist" shows up in bug 1819906 too. But not sure if getting that aspect fixed will resolve the other symptoms reported here or not.