Bug 1822018 - [4.4] Operator should auto detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of stuck in Degraded
Summary: [4.4] Operator should auto detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of stuck in Degraded
Keywords:
Status: CLOSED DUPLICATE of bug 1817419
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.z
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1822016 1858763 1909600 1949370
 
Reported: 2020-04-08 05:05 UTC by Ke Wang
Modified: 2021-08-20 07:24 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1817419
Environment:
Last Closed: 2020-05-20 09:25:01 UTC
Target Upstream Version:
Embargoed:



Description Ke Wang 2020-04-08 05:05:08 UTC
+++ This bug was initially created as a clone of Bug #1817419 +++

Description of problem:
The operator should automatically detect and quickly retry once the cause of NodeInstallerDegraded disappears, instead of staying stuck in Degraded.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-25-223508

How reproducible:
So far hit once

Steps to Reproduce:
1. Install a fresh latest 4.4 env; the env matrix is upi-on-gcp, disconnected-remove_rhcos_worker-fips-ovn
2. After the installation succeeds, check `oc get co`; it shows:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True  True  True  144m
3. Check `oc get po -n openshift-kube-apiserver --show-labels`; all pods under openshift-kube-apiserver are:
NAME                                                                     READY   STATUS      RESTARTS   AGE   LABELS
installer-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal         0/1     Completed   0          68m   app=installer
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal      4/4     Running     0          68m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal      4/4     Running     4          86m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal      4/4     Running     4          88m   apiserver=true,app=openshift-kube-apiserver,revision=7
revision-pruner-7-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          82m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   0/1     Completed   0          73m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          65m   app=pruner

Check `oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator`; it shows:
... NodeInstallerDegraded: pods \"installer-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not found\nNodeControllerDegraded: The master nodes not ready: node \"qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)"...

But checking `oc get no` shows no issue:
NAME                                                   STATUS   ROLES    AGE    VERSION
qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-0                  Ready    worker   34m    v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-1                  Ready    worker   34m    v1.17.1

Networking QE also helped debug; the network has no issue either. (Commands that read the same Degraded details directly from the API objects are sketched after step 4.)

4. Force the operator to retry rolling out by:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
Within minutes, co/kube-apiserver becomes normal again:
kube-apiserver  4.4.0-0.nightly-2020-03-25-223508   True  False False  173m
Pods also become normal:
$ ogpkas
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   4/4  Running  0  6m28s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   4/4  Running  0  8m23s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   4/4  Running  0  10m     apiserver=true,app=openshift-kube-apiserver,revision=9
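
For reference, the same Degraded details seen in step 3 can also be read from the API objects instead of the operator log. This is a minimal sketch, assuming cluster-admin access; the jsonpath expressions are illustrative and not taken from this report:

# Print every condition currently set to True on the clusteroperator, with its message
$ oc get clusteroperator kube-apiserver -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}: {.message}{"\n"}{end}'

# Per-node installer progress from the operator resource (current vs. target revision, last failed revision)
$ oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}: current={.currentRevision} target={.targetRevision} lastFailed={.lastFailedRevision}{"\n"}{end}'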

Actual results:
3. The operator is stuck rolling out the static pods even though step 4 shows the cause is gone.

Expected results:
3. The operator should automatically detect that the cause is gone and retry rolling out on its own.
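
One possible way to verify the expected behavior once fixed (a sketch; the timeout value is an arbitrary assumption) is to wait for the Degraded condition to clear on its own after the node recovers, without forcing a redeployment:

# Blocks until clusteroperator/kube-apiserver reports Degraded=False, or fails after the timeout
$ oc wait clusteroperator/kube-apiserver --for=condition=Degraded=False --timeout=30m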

Additional info:
This bug was filed from https://coreos.slack.com/archives/CH76YSYSC/p1585214604225000?thread_ts=1585210139.199800&cid=CH76YSYSC, where the discussion took place.

--- Additional comment from Abu Kashem on 2020-04-07 14:40:18 UTC ---

Facts:
- This is happening in upi-on-gcp and infrequently.
- All kube-apiserver pods are running successfully.
- The operator is reporting a misleading status in the clusteroperator object. I am assuming this will not block upgrade. (correct?)
- There is a workaround to fix this issue; that's a plus.
- Troubleshooting may not be obvious; we have to check the operator log to find what the issue is. That's a minus.

Since this is reporting misleading information in the ClusterOperator object, ideally we would want to fix it in 4.4. But given the time constraint, I think we can defer it to 4.5.
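
A quick way to check whether this state would actually block an upgrade (a sketch, assuming cluster-admin access; not confirmed in this report) is to look at the operator's Upgradeable condition and the cluster-level upgrade status:

# Shows the Upgradeable condition on the clusteroperator, if one is set
$ oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'

# Cluster-level view: reports any conditions that would block an upgrade
$ oc adm upgrade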


This bug exists on 4.5 and 4.3, so it is believed to exist on 4.4 as well. It is worth fixing.

Comment 1 Stefan Schimanski 2020-05-20 09:25:01 UTC
Closing as duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1817419. Please don't clone bugs in advance. The developer will assess necessary backports and create clones. Exception: CVEs

*** This bug has been marked as a duplicate of bug 1817419 ***

