Bug 1822566

Summary: Pod is terminated, but some containers are still running
Product: OpenShift Container Platform
Reporter: Stefan Schimanski <sttts>
Component: Node
Assignee: Kir Kolyshkin <kir>
Status: CLOSED DUPLICATE
QA Contact: Sunil Choudhary <schoudha>
Severity: urgent
Priority: urgent
Version: 4.5
CC: aos-bugs, jokerman, mpatel, rphillips, wking, zyu
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-05-11 22:35:10 UTC
Bug Blocks: 1822773

Description Stefan Schimanski 2020-04-09 11:10:11 UTC
Description of problem:

In https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24844/pull-ci-openshift-origin-master-e2e-gcp-upgrade/3637 we see

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable

While this at first looks like https://bugzilla.redhat.com/show_bug.cgi?id=1817588, the symptoms are actually different:

- the openshift-apiserver-operator switches to Degraded because it cannot deploy one of its pods (at the same time another pod is not starting, possibly for a related root cause)
- one node (master-2) is unschedulable. It reports:

 "machineconfiguration.openshift.io/reason": "failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod \"cluster-monitoring-operator-5c4cbf9d66-hgcdx\" terminating: global timeout reached: 20s, error when waiting for pod \"machine-api-operator-7b9798f48b-gxnvx\" terminating: global timeout reached: 20s]",
 "machineconfiguration.openshift.io/state": "Degraded",

- the kubelet on master-2 keeps reporting the following for many minutes (observed for at least 15 minutes):

Pod "cluster-monitoring-operator-5c4cbf9d66-hgcdx_openshift-monitoring(da3230e0-929f-46d9-a027-8b9771c9cfbd)" is terminated, but some containers are still running
getPodContainerStatuses for pod "machine-api-operator-7b9798f48b-gxnvx_openshift-machine-api(cef494f8-34c6-4ff4-a036-dd7feaea1dce)" failed: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist
status_manager.go:434] Ignoring same status for pod "machine-config-server-tzm6l_openshift-machine-config-operator(c1577554-3ce7-4db4-8e72-a2b128505c3a)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP:10.0.0.4 PodIPs:[{IP:10.0.0.4}] StartTime:2020-04-09 07:50:03 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:machine-config-server State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-04-09 07:50:04 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ImageID:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable-initial@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ContainerID:cri-o://db3f8052a7c3de19295d1422763c6b8b52fd89637c87a90bc0c81ce3cc0139e4 Started:0xc0036030f0}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Apr 09 07:56:18.722663 ci-op-x47zh-m-2.c.openshift-gce-devel-ci.internal hyperkube[1471]: I0409 07:56:18.722640    1471 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-api", Name:"machine-api-operator-7b9798f48b-gxnvx", UID:"cef494f8-34c6-4ff4-a036-dd7feaea1dce", APIVersion:"v1", ResourceVersion:"8738", FieldPath:""}): type: 'Warning' reason: 'FailedSync' error determining status: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist

- the drain never finishes.
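
To make the symptoms above checkable, here is a rough sketch of commands one could run against an affected node. The node name and container ID prefix are copied from the logs above; the commands themselves are illustrative and not part of the original report:

  # MCO drain state/reason as recorded in the node annotations
  oc get node ci-op-x47zh-m-2.c.openshift-gce-devel-ci.internal \
    -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/reason}{"\n"}'

  # kubelet messages about the pod that never finishes terminating
  oc adm node-logs ci-op-x47zh-m-2.c.openshift-gce-devel-ci.internal -u kubelet \
    | grep 'is terminated, but some containers are still running'

  # on the node itself: does CRI-O still know about the container ID the kubelet asks for?
  crictl ps -a | grep 6f378759109743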

Comment 1 Ryan Phillips 2020-04-09 19:39:04 UTC
From the Slack discussion, we are going to revert [1] and follow up on the terminationGrace issue [2].

[1] https://github.com/cri-o/cri-o/pull/3458
[2] https://github.com/cri-o/cri-o/issues/3455
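
For context on the terminationGrace angle, a hedged illustration (not from the original discussion): the "global timeout reached: 20s" messages above come from the drain's wait loop, while each pod also carries its own terminationGracePeriodSeconds that CRI-O is expected to honor when stopping containers. Comparing the two on a live cluster might look like the following; the pod name is taken from the logs above, and the container-name filter is an assumption:

  # grace period requested by one of the stuck pods
  oc -n openshift-monitoring get pod cluster-monitoring-operator-5c4cbf9d66-hgcdx \
    -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'

  # containers CRI-O still reports as running for that pod (run on the node)
  crictl ps --name cluster-monitoring-operator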

Comment 2 Ryan Phillips 2020-04-09 21:50:38 UTC
We are going to triage this some more.

One BZ that spun out of this is a crash in Multus: https://bugzilla.redhat.com/show_bug.cgi?id=1822803

Comment 3 W. Trevor King 2020-05-06 19:57:14 UTC
"not found: ID does not exist" shows up in bug 1819906 too.  But not sure if getting that aspect fixed will resolve the other symptoms reported here or not.