Bug 1822016 - [4.3] Operator should auto-detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of staying stuck in Degraded
Keywords:
Status: CLOSED DUPLICATE of bug 1817419
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.3.z
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On: 1817419 1822018 1858763 1874597 1876484 1876486 1909600 1949370
Blocks:
 
Reported: 2020-04-08 04:58 UTC by Ke Wang
Modified: 2021-08-20 07:24 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1817419
Environment:
Last Closed: 2020-05-20 09:25:56 UTC
Target Upstream Version:
Embargoed:



Description Ke Wang 2020-04-08 04:58:49 UTC
+++ This bug was initially created as a clone of Bug #1817419 +++

Description of problem:
The operator should auto-detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of staying stuck in Degraded.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-04-07-141343

How reproducible:
Reproduced on at least two clusters: one IPI on AWS, another IPI on AWS (FIPS off) with OVN.

Steps to Reproduce:
1. Install a fresh cluster from the latest 4.3 nightly.

2. After the installation succeeds, check `oc get co`; kube-apiserver stays Degraded:
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver   4.3.0-0.nightly-2020-04-07-141343   True        True          True       150m

3. Check `oc get po -n openshift-kube-apiserver --show-labels`; all pods under openshift-kube-apiserver are Running but stuck on mixed revisions:
NAME                                                           READY   STATUS      RESTARTS   AGE    LABELS
kube-apiserver-ip-...39-158.....compute.internal      3/3     Running     3          103m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-ip-...47-36.....compute.internal       3/3     Running     3          111m   apiserver=true,app=openshift-kube-apiserver,revision=6
kube-apiserver-ip-...71-131.....compute.internal      3/3     Running     3          109m   apiserver=true,app=openshift-kube-apiserver,revision=6
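
The revision skew (one node on revision 7, two still on 6) can also be read off the operator resource itself. A quick sketch, assuming the usual static-pod operator status fields (latestAvailableRevision, nodeStatuses):

$ oc get kubeapiserver/cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'
$ oc get kubeapiserver/cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{"\t"}{.currentRevision}{"\t"}{.targetRevision}{"\t"}{.lastFailedRevision}{"\n"}{end}'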


Check `oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator`; it shows:
...'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeInstallerDegraded: 1 nodes are failing on revision 7:\nNodeInstallerDegraded: pods \"installer-7-ip-...-139-158.....compute.internal\" not found" to "NodeControllerDegraded: The master nodes not ready...
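
For reference, the same Degraded message can be read directly from the ClusterOperator conditions instead of the operator log:

$ oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'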

4. Force the operator to retry the rollout:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
After a few minutes, co/kube-apiserver returns to normal:
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver   4.3.0-0.nightly-2020-04-07-141343   True        False         False      156m

Pods also become normal:

kube-apiserver-ip-...39-158.....compute.internal      3/3     Running     0          5m56s   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-...47-36.....compute.internal       3/3     Running     0          4m8s    apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-ip-...71-131.....compute.internal      3/3     Running     0          2m24s   apiserver=true,app=openshift-kube-apiserver,revision=8
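
The recovery can also be watched live; the label selector below is taken from the pod listing above:

$ oc get co kube-apiserver -w
$ oc get po -n openshift-kube-apiserver -l apiserver=true -w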

Actual results:
1. The operator stays stuck rolling out the static pods even though step 4 shows the original cause is gone.

Expected results:
1. The operator should detect that the cause has disappeared and automatically retry the rollout.

Additional info:
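Until the operator retries on its own, the manual step 4 can be scripted as a stopgap. A rough client-side sketch (polling interval and reason value are illustrative only, not part of the product):

# Poll the Degraded condition; when NodeInstallerDegraded appears, force one
# redeploy, exactly like the manual `oc patch` in step 4 above.
while true; do
  msg=$(oc get co kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}')
  if echo "$msg" | grep -q NodeInstallerDegraded; then
    reason="retry-$(date +%s)"  # unique value so the replace is not a no-op
    oc patch kubeapiserver/cluster --type=json \
      -p "[{\"op\": \"replace\", \"path\": \"/spec/forceRedeploymentReason\", \"value\": \"$reason\"}]"
  fi
  sleep 60
done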

Comment 1 Stefan Schimanski 2020-05-20 09:25:56 UTC
Closing as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1817419. Please don't clone bugs in advance; the developer will assess necessary backports and create clones (exception: CVEs).

*** This bug has been marked as a duplicate of bug 1817419 ***

