Bug 1843876

Summary: daemonset, deployment, and replicaset status can permafail
Product: OpenShift Container Platform
Reporter: Maciej Szulik <maszulik>
Component: kube-controller-manager
Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: high
Priority: high
Version: 4.4
CC: aos-bugs, bparees, deads, mfojtik, yinzhou
Target Milestone: ---
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: In certain cases a NotFound error was swallowed by the controller logic. Consequence: The missing NotFound event caused the controller to be unaware of missing pods. Fix: Properly react to NotFound events, which indicate that the pod was already removed by a different actor. Result: Controllers (deployment, daemonset, replicaset and others) now properly react to pod NotFound events.
Story Points: ---
Clone Of: 1843462
Clones: 1843877
Last Closed: 2020-07-06 20:47:16 UTC
Bug Depends On: 1843462
Bug Blocks: 1843877

Description Maciej Szulik 2020-06-04 11:15:49 UTC
+++ This bug was initially created as a clone of Bug #1843462 +++

+++ This bug was initially created as a clone of Bug #1843187 +++

When pod expectations are not met, status for workloads can wedge. When status for workloads wedges, operators wait indefinitely. When operators wait indefinitely, status is wrong. When status is wrong, upgrades can fail.

Picking https://github.com/kubernetes/kubernetes/pull/91008 seems like a fix.
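
For illustration only, below is a minimal self-contained Go sketch of the behaviour described above. It is not the actual kubernetes/kubernetes code and all names (deletionExpectations, deletePods, errNotFound) are made up for this example; it just shows why a pod delete that comes back NotFound (because a different actor already removed the pod) still has to be counted against the controller's expectations, otherwise the expectation counter never drains and the sync loop wedges.

package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the API server's 404 when the pod is already gone.
var errNotFound = errors.New("pod not found")

// deletionExpectations is a toy stand-in for the controller's expectation
// counter: the controller holds off recomputing status until it drains to zero.
type deletionExpectations struct{ pending int }

func (e *deletionExpectations) expect(n int)    { e.pending += n }
func (e *deletionExpectations) observed()       { e.pending-- }
func (e *deletionExpectations) satisfied() bool { return e.pending <= 0 }

// deletePods issues the deletes and reconciles the expectations. In the real
// controllers the happy path is observed through the informer's delete event;
// here it is observed inline to keep the example self-contained.
func deletePods(exp *deletionExpectations, handleNotFound bool, deletes []func() error) {
	exp.expect(len(deletes))
	for _, del := range deletes {
		if err := del(); err != nil {
			if errors.Is(err, errNotFound) && !handleNotFound {
				// Pre-fix behaviour: the NotFound error is swallowed and the
				// pending expectation is never lowered, so the controller
				// keeps waiting for a deletion it will never observe.
				continue
			}
			// With the fix: NotFound means the pod was already removed by a
			// different actor, and any other error means the delete will not
			// happen; either way the expectation is lowered so the next sync
			// can make progress.
		}
		exp.observed()
	}
}

func main() {
	deletes := []func() error{
		func() error { return nil },         // normal delete
		func() error { return errNotFound }, // pod removed concurrently by the user
	}

	broken, fixed := &deletionExpectations{}, &deletionExpectations{}
	deletePods(broken, false, deletes)
	deletePods(fixed, true, deletes)

	fmt.Println("pre-fix satisfied:", broken.satisfied()) // false: status wedges
	fmt.Println("fixed satisfied: ", fixed.satisfied())   // true: sync proceeds
}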

--- Additional comment from Maciej Szulik on 2020-06-03 12:54:58 CEST ---

Comment 1 Maciej Szulik 2020-06-18 09:56:48 UTC
This is waiting to be merged in the queue.

Comment 4 zhou ying 2020-06-28 05:54:29 UTC
Confirmed with payload 4.4.0-0.nightly-2020-06-27-171816.
Logged in to OCP and opened two terminals. On the first terminal I deleted one pod; on the second terminal I scaled the deployment down. No new pod was created and no other pods were deleted:

[zhouying@dhcp-140-138 ~]$ oc get po 
NAME                      READY   STATUS    RESTARTS   AGE
my-dep-58668cd74d-chcmh   1/1     Running   0          101s
my-dep-58668cd74d-vj2dr   1/1     Running   0          54s
my-dep-58668cd74d-w4ww5   1/1     Running   0          54s
[zhouying@dhcp-140-138 ~]$ oc scale deploy/my-dep --replicas=2
deployment.apps/my-dep scaled
[zhouying@dhcp-140-138 ~]$ oc get po
NAME                      READY   STATUS    RESTARTS   AGE
my-dep-58668cd74d-chcmh   1/1     Running   0          3m22s
my-dep-58668cd74d-vj2dr   1/1     Running   0          2m35s

Comment 6 errata-xmlrpc 2020-07-06 20:47:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786