Description of problem:

In https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24844/pull-ci-openshift-origin-master-e2e-gcp-upgrade/3637 we see:

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable

While this at first looks like https://bugzilla.redhat.com/show_bug.cgi?id=1817588, the symptoms are actually different:

- openshift-apiserver-operator switches to Degraded because it cannot deploy one of its pods (at the same time another one is not starting, possibly for a related root cause).

- One node (master-2) is unschedulable. It reports:

  "machineconfiguration.openshift.io/reason": "failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod \"cluster-monitoring-operator-5c4cbf9d66-hgcdx\" terminating: global timeout reached: 20s, error when waiting for pod \"machine-api-operator-7b9798f48b-gxnvx\" terminating: global timeout reached: 20s]",
  "machineconfiguration.openshift.io/state": "Degraded",

- The kubelet on master-2 keeps reporting the following for many minutes (observed for at least 15+ minutes):

  Pod "cluster-monitoring-operator-5c4cbf9d66-hgcdx_openshift-monitoring(da3230e0-929f-46d9-a027-8b9771c9cfbd)" is terminated, but some containers are still running

  getPodContainerStatuses for pod "machine-api-operator-7b9798f48b-gxnvx_openshift-machine-api(cef494f8-34c6-4ff4-a036-dd7feaea1dce)" failed: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist

  status_manager.go:434] Ignoring same status for pod "machine-config-server-tzm6l_openshift-machine-config-operator(c1577554-3ce7-4db4-8e72-a2b128505c3a)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:05 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-09 07:50:03 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP:10.0.0.4 PodIPs:[{IP:10.0.0.4}] StartTime:2020-04-09 07:50:03 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:machine-config-server State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-04-09 07:50:04 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ImageID:registry.svc.ci.openshift.org/ci-op-ddshm7dw/stable-initial@sha256:c55b27a07e19889da3ef60ef80312a518dc82522ed9b824779c17acf0f0503c9 ContainerID:cri-o://db3f8052a7c3de19295d1422763c6b8b52fd89637c87a90bc0c81ce3cc0139e4 Started:0xc0036030f0}] QOSClass:Burstable EphemeralContainerStatuses:[]}

  Apr 09 07:56:18.722663 ci-op-x47zh-m-2.c.openshift-gce-devel-ci.internal hyperkube[1471]: I0409 07:56:18.722640 1471 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-api", Name:"machine-api-operator-7b9798f48b-gxnvx", UID:"cef494f8-34c6-4ff4-a036-dd7feaea1dce", APIVersion:"v1", ResourceVersion:"8738", FieldPath:""}): type: 'Warning' reason: 'FailedSync' error determining status: rpc error: code = Unknown desc = container with ID starting with 6f378759109743f648b9306f357e7f6a281e23c134eb999e0a2257ae627e4569 not found: ID does not exist

- The drain never finishes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
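For anyone triaging a similar run, here is a minimal client-go sketch (assuming client-go v0.18+ and a kubeconfig at the default location; this is not part of the failing job itself) that dumps the machineconfiguration.openshift.io state/reason annotations quoted above for every node:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig from the default location (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, n := range nodes.Items {
        // The machine-config daemon records drain failures in these annotations,
        // which is where the "failed to drain node (5 tries)" message above shows up.
        fmt.Printf("%s\n  state:  %s\n  reason: %s\n",
            n.Name,
            n.Annotations["machineconfiguration.openshift.io/state"],
            n.Annotations["machineconfiguration.openshift.io/reason"])
    }
}

The same annotations are visible with 'oc get node <name> -o yaml'.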
From the Slack discussion, we are going to revert [1] and follow up on the terminationGrace issue [2].

[1] https://github.com/cri-o/cri-o/pull/3458
[2] https://github.com/cri-o/cri-o/issues/3455
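For context while the follow-up in [2] is sorted out, here is a toy sketch (using k8s.io/apimachinery/pkg/util/wait; it is not the actual machine-config daemon or kubectl drain code, and the retry interval/timeout in main are arbitrary illustrative values) of how a per-pod wait hitting a 20s global timeout inside a retried drain composes the nested error strings seen in the reason annotation above when a container never stops:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// drainAttempt stands in for one drain try: it waits for an evicted pod to
// disappear and gives up when the drain's global timeout expires, producing
// the inner "global timeout reached: 20s" error from the annotation.
func drainAttempt() error {
    globalTimeout := 20 * time.Second
    podGone := func() (bool, error) { return false, nil } // the container never stops, as in this bug
    if err := wait.PollImmediate(time.Second, globalTimeout, podGone); err != nil {
        return fmt.Errorf("error when waiting for pod %q terminating: global timeout reached: %v",
            "machine-api-operator-7b9798f48b-gxnvx", globalTimeout)
    }
    return nil
}

func main() {
    // The retry wrapper around drain: when every attempt fails, wait.Poll
    // returns wait.ErrWaitTimeout, whose message is the outer
    // "timed out waiting for the condition" in the reason annotation.
    var lastErr error
    err := wait.Poll(2*time.Second, 10*time.Second, func() (bool, error) {
        lastErr = drainAttempt()
        return lastErr == nil, nil
    })
    if err != nil {
        fmt.Printf("failed to drain node: %v: [%v]\n", err, lastErr)
    }
}

Running this prints something like "failed to drain node: timed out waiting for the condition: [error when waiting for pod \"machine-api-operator-7b9798f48b-gxnvx\" terminating: global timeout reached: 20s]", mirroring the annotation on master-2.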
We are going to triage this some more. One BZ that spun out of this is a crash in Multus: https://bugzilla.redhat.com/show_bug.cgi?id=1822803
"not found: ID does not exist" shows up in bug 1819906 too. But not sure if getting that aspect fixed will resolve the other symptoms reported here or not.