Bug 1822773 - [4.4] Pod is terminated, but some containers are still running
Summary: [4.4] Pod is terminated, but some containers are still running
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Kir Kolyshkin
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1819906 1822566
Blocks:
 
Reported: 2020-04-09 19:41 UTC by Ryan Phillips
Modified: 2020-05-11 22:35 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1822566
Environment:
Last Closed: 2020-04-09 21:29:51 UTC
Target Upstream Version:
Embargoed:



Description Ryan Phillips 2020-04-09 19:41:37 UTC
+++ This bug was initially created as a clone of Bug #1822566 +++

Description of problem:

In https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24844/pull-ci-openshift-origin-master-e2e-gcp-upgrade/3637 we see

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable

While this at first looks like https://bugzilla.redhat.com/show_bug.cgi?id=1817588, the symptoms are actually different:

- openshift-apiserver-operator goes Degraded because it cannot roll out one of its pods (another of its pods is also failing to start, possibly for a related root cause)
- one node (master-2) is unschedulable. It reports:

 "machineconfiguration.openshift.io/reason": "failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod \"cluster-monitoring-operator-5c4cbf9d66-hgcdx\" terminating: global timeout reached: 20s, error when waiting for pod \"machine-api-operator-7b9798f48b-gxnvx\" terminating: global timeout reached: 20s]",
 "machineconfiguration.openshift.io/state": "Degraded",

- the kubelet on master-2 keeps logging the following for many minutes (observed for at least 15 minutes):

Pod "cluster-monitoring-operator-5c4cbf9d66-hgcdx_openshift-monitoring(da3230e0-929f-46d9-a027-8b9771c9cfbd)" is terminated, but some containers are still running
getPodContainerStatuses for pod "machine-api-operator-7b9798f48b-gxnvx_openshift-machine-api(cef494f8-34c6-4ff4-a036-dd7feaea1dce)" failed: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist
status_manager.go:434] Ignoring same status for pod "machine-config-server-tzm6l_openshift-machine-config-operator(c1577554-3ce7-4db4-8e72-a2b128505c3a)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP:10.0.0.4 PodIPs:[{IP:10.0.0.4}] StartTime:2020-04-09 07:50:03 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:machine-config-server State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-04-09 07:50:04 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ImageID:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable-initial@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ContainerID:cri-o://db3f8052a7c3de19295d1422763c6b8b52fd89637c87a90bc0c81ce3cc0139e4 Started:0xc0036030f0}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Apr 09 07:56:18.722663 ci-op-x47zh-m-2.c.openshift-gce-devel-ci.internal hyperkube[1471]: I0409 07:56:18.722640    1471 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-api", Name:"machine-api-operator-7b9798f48b-gxnvx", UID:"cef494f8-34c6-4ff4-a036-dd7feaea1dce", APIVersion:"v1", ResourceVersion:"8738", FieldPath:""}): type: 'Warning' reason: 'FailedSync' error determining status: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist

- the drain never finishes (see the sketch below).
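
For context on that drain failure: the MCD's node drain evicts each pod and then waits, bounded by a timeout, for the pod to actually go away; when the pod's containers never stop, the wait times out and the drain is retried ("5 tries") until the node is marked Degraded. Below is a minimal sketch of that shape using the upstream k8s.io/kubectl/pkg/drain helper. This is not the MCO/MCD code itself; the KUBECONFIG lookup, the Helper field names, and the context-taking client-go signatures are assumptions based on current library versions, and the 1.17-based vendoring in 4.4 differs slightly (e.g. DeleteLocalData instead of DeleteEmptyDirData, no context arguments).

// Sketch: cordon one node, then drain it with a short per-drain timeout,
// mirroring the failure quoted in the MCD annotation above.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	helper := &drain.Helper{
		Ctx:                 context.Background(),
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,               // honor each pod's own terminationGracePeriodSeconds
		Timeout:             20 * time.Second, // matches the "global timeout reached: 20s" above
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	nodeName := "ci-op-x47zh-m-2.c.openshift-gce-devel-ci.internal"
	node, err := client.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		log.Fatal(err)
	}
	// When a pod's containers never actually stop (as in this bug), the eviction
	// is accepted but the wait-for-delete loop never sees the pod disappear, so
	// once the timeout fires the drain returns per-pod errors of roughly the
	// form quoted above: "error when waiting for pod ... terminating: global
	// timeout reached: 20s". The MCD wraps that in its own retry loop before
	// setting the node Degraded.
	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		log.Printf("drain failed: %v", err)
	}
}

With a pod like cluster-monitoring-operator-5c4cbf9d66-hgcdx stuck with running containers, this keeps failing on every retry, which is consistent with the machineconfiguration.openshift.io/reason annotation above.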

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
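
A minimal Go sketch for checking the node side of this (not taken from the original report; it assumes root access on the node, CRI-O's default socket at /var/run/crio/crio.sock, and the v1alpha2 CRI API of the 4.4 timeframe): ask CRI-O directly which containers it still tracks for the stuck pod, over the same CRI surface the kubelet is querying in the errors above.

// Diagnostic sketch only: list the sandbox and containers CRI-O reports for
// the pod the kubelet says is "terminated, but some containers are still running".
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the CRI-O CRI endpoint over its unix domain socket.
	dialer := func(ctx context.Context, addr string) (net.Conn, error) {
		return (&net.Dialer{}).DialContext(ctx, "unix", addr)
	}
	conn, err := grpc.DialContext(ctx, "/var/run/crio/crio.sock",
		grpc.WithInsecure(), grpc.WithBlock(), grpc.WithContextDialer(dialer))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	rt := runtimeapi.NewRuntimeServiceClient(conn)

	// Find the pod sandbox via the standard kubelet-set sandbox labels.
	sandboxes, err := rt.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{
		Filter: &runtimeapi.PodSandboxFilter{
			LabelSelector: map[string]string{
				"io.kubernetes.pod.name":      "cluster-monitoring-operator-5c4cbf9d66-hgcdx",
				"io.kubernetes.pod.namespace": "openshift-monitoring",
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, sb := range sandboxes.Items {
		// List every container CRI-O still tracks in that sandbox, whatever its state.
		ctrs, err := rt.ListContainers(ctx, &runtimeapi.ListContainersRequest{
			Filter: &runtimeapi.ContainerFilter{PodSandboxId: sb.Id},
		})
		if err != nil {
			log.Fatal(err)
		}
		for _, c := range ctrs.Containers {
			fmt.Printf("sandbox=%s container=%s name=%s state=%s\n",
				sb.Id, c.Id, c.Metadata.Name, c.State)
		}
	}
}

If CRI-O reports containers in CONTAINER_RUNNING while the kubelet already considers the pod terminated, that matches the "terminated, but some containers are still running" message; if CRI-O no longer knows the container ID at all, that matches the "ID does not exist" status errors above.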

--- Additional comment from Ryan Phillips on 2020-04-09 19:39:04 UTC ---

From the Slack discussion, we are going to revert [1] and follow up on the terminationGrace issue [2].

[1] https://github.com/cri-o/cri-o/pull/3458
[2] https://github.com/cri-o/cri-o/issues/3455

