Bug 1817419 - [4.5] Operator should auto detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of stuck in Degraded
Summary: [4.5] Operator should auto detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of stuck in Degraded
Keywords:
Status: CLOSED DUPLICATE of bug 1874597
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Duplicates: 1822016 1822018 (view as bug list)
Depends On: 1858763 1909600 1949370
Blocks: 1822016
 
Reported: 2020-03-26 10:22 UTC by Xingxing Xia
Modified: 2021-08-20 07:24 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1822016 1822018 (view as bug list)
Environment:
Last Closed: 2020-09-04 09:05:58 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Description Xingxing Xia 2020-03-26 10:22:13 UTC
Description of problem:
Operator should auto detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of stuck in Degraded

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-25-223508

How reproducible:
So far hit once

Steps to Reproduce:
1. Install a fresh env with the latest 4.4 nightly; the env matrix is upi-on-gcp, disconnected-remove_rhcos_worker-fips-ovn.
2. After the installation succeeds, check `oc get co`; it shows:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True  True  True  144m
3. Check `oc get po -n openshift-kube-apiserver --show-labels`; all pods under openshift-kube-apiserver are:
NAME                                                                     READY   STATUS      RESTARTS   AGE   LABELS
installer-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal         0/1     Completed   0          68m   app=installer
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal      4/4     Running     0          68m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal      4/4     Running     4          86m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal      4/4     Running     4          88m   apiserver=true,app=openshift-kube-apiserver,revision=7
revision-pruner-7-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          82m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   0/1     Completed   0          73m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          65m   app=pruner

Check `oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator`; it shows:
... NodeInstallerDegraded: pods \"installer-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not found\nNodeControllerDegraded: The master nodes not ready: node \"qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)"...

But `oc get no` shows no issue:
NAME                                                   STATUS   ROLES    AGE    VERSION
qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-0                  Ready    worker   34m    v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-1                  Ready    worker   34m    v1.17.1

Networking QE also helped debug; the network has no issue either.

4. Force the operator to retry the rollout:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
After a few minutes, co/kube-apiserver becomes normal:
kube-apiserver  4.4.0-0.nightly-2020-03-25-223508   True  False False  173m
Pods also become normal:
$ ogpkas
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   4/4  Running  0  6m28s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   4/4  Running  0  8m23s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   4/4  Running  0  10m     apiserver=true,app=openshift-kube-apiserver,revision=9

Actual results:
3. The operator is stuck rolling out the static pods even though step 4 shows the cause is gone.

Expected results:
3. The operator should auto-detect that the cause is gone and automatically retry the rollout.

Additional info:
This bug is filed from https://coreos.slack.com/archives/CH76YSYSC/p1585214604225000?thread_ts=1585210139.199800&cid=CH76YSYSC, which has the discussion.
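
For reference, a minimal sketch of the check-and-workaround flow described above, using the commands already shown in this report (the forceRedeploymentReason value is arbitrary; it only needs to change to trigger a new rollout):

# Confirm the stuck state: co/kube-apiserver Degraded while all nodes are Ready
$ oc get co kube-apiserver
$ oc get no
$ oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator | grep NodeInstallerDegraded

# Workaround: force the operator to roll out a new revision
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced retry '"$(date)"'" } ]'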

Comment 1 Abu Kashem 2020-04-07 14:40:18 UTC
Facts:
- This is happening in upi-on-gcp and infrequently.
- All kube-apiserver pods are running successfully.
- The operator is reporting a misleading status in the clusteroperator object. I am assuming this will not block upgrades. (correct?)
- There is a workaround for this issue; that's a plus.
- Troubleshooting may not be obvious; we have to check the operator log to find out what the issue is. That's a minus.

Since this reports misleading information in the ClusterOperator object, ideally we would want to fix it in 4.4. But given the time constraints, I think we can defer it to 4.5.

Comment 2 Xingxing Xia 2020-04-08 06:19:29 UTC
(In reply to Abu Kashem from comment #1)
> - This is happening in upi-on-gcp and infrequently
FYI, per https://bugzilla.redhat.com/show_bug.cgi?id=1822016#c0, which mentions ipi-on-aws, this is also happening on other clouds and matrices.

Comment 3 Ke Wang 2020-04-20 06:40:56 UTC
Recently, a similar problem was found during an OCP 4.4.0-rc.9-x86_64 installation with upi-on-osp:

NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd                                       4.4.0-rc.9   True        True          True       34m
kube-apiserver                             4.4.0-rc.9   True        True          True       76m

Check the related operators' error logs.
etcd operator log error:
E0420 03:09:41.719856       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: etcdserver: leader changed

kube-apiserver operator log error:
E0420 03:12:27.092864       1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: bound-sa-token-signing-certs-7,config-7,etcd-serving-ca-7,kube-apiserver-cert-syncer-kubeconfig-7,kube-apiserver-pod-7,kubelet-serving-ca-7,sa-token-signing-certs-7]

After forcing a rollout on them:
$ oc patch kube-apiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd                                       4.4.0-rc.9   True        False         False      3h4m
kube-apiserver                             4.4.0-rc.9   True        False         False      3h45m

Comment 4 Ke Wang 2020-04-20 06:53:02 UTC
Correction to the forced rollout CLI above; it should be:

oc patch etcd/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

Comment 6 Xingxing Xia 2020-04-26 10:45:41 UTC
> kube-apiserver                             4.4.0-0.nightly-2020-04-26-045325   True        False         False      8h
Sorry, in the previous comment I pasted the co status after the rollout. I intended to paste the co status before the rollout:
kube-apiserver                             4.4.0-0.nightly-2020-04-26-045325   True        True          True       7h49m

Comment 8 Abu Kashem 2020-04-29 17:19:14 UTC
Hi xxia,
I am looking into this issue. If you can reproduce it on a cluster, I would like to do a live debug.

Comment 9 Xingxing Xia 2020-04-30 00:53:33 UTC
I don't know how the above envs triggered conditions like the one in comment 0: 'The master nodes not ready ... qe-yapei44debug-03260202-m-2 ... not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)'.
Now, prompted by the words "nodes not ready", I tentatively tried the steps below to make a master not ready, and it seems to be a good manual reproducer; I tried twice and it reproduced both times, though it may not be exactly the same as what happened in the real scenarios of the fresh envs and the upgrading env above.
My on-hand env is 4.3.0-0.nightly-2020-04-27-151540. The commands below run in two terminals: A on the local client (xxia@pres) to run oc, and B (root@ip-10-0-63-68) on the master.
[xxia@pres 2020-04-30 07:53:26 CST my]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test'"$(date)"'" } ]'

[root@ip-10-0-63-68 2020-04-30 07:56:03 CST ~]# systemctl stop crio.service

[xxia@pres 2020-04-30 07:56:25 CST my]$ oc get no # now the master is made not ready
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-51-103.us-east-2.compute.internal   Ready      master   39h   v1.16.2
ip-10-0-63-68.us-east-2.compute.internal    NotReady   master   39h   v1.16.2
ip-10-0-72-76.us-east-2.compute.internal    Ready      master   39h   v1.16.2
...

[xxia@pres 2020-04-30 07:56:27 CST my]$ ogpkas-all | grep 14 # new revision is to be 14
installer-14-ip-10-0-63-68.us-east-2.compute.internal          0/1     Pending     0          9s     app=installer
installer-14-ip-10-0-72-76.us-east-2.compute.internal          0/1     Completed   0          2m6s   app=installer
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal        3/3     Running     0          118s   apiserver=true,app=openshift-kube-apiserver,revision=14
revision-pruner-14-ip-10-0-72-76.us-east-2.compute.internal    0/1     Completed   0          22s    app=pruner

[xxia@pres 2020-04-30 07:56:35 CST my]$ oc delete po installer-14-ip-10-0-63-68.us-east-2.compute.internal -n openshift-kube-apiserver --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "installer-14-ip-10-0-63-68.us-east-2.compute.internal" force deleted

[root@ip-10-0-63-68 2020-04-30 07:58:24 CST ~]# systemctl start crio.service
[root@ip-10-0-63-68 2020-04-30 07:58:33 CST ~]# systemctl status crio.service
...
   Active: active (running) since Thu 2020-04-30 07:58:33 CST; 17s ago
...
Apr 30 07:58:33 ip-10-0-63-68 crio[3635140]: time="2020-04-29 23:58:33.871619650Z" level=error msg="Error while checking pod to CNI net...

[root@ip-10-0-63-68 2020-04-30 07:59:12 CST ~]# systemctl restart kubelet

[xxia@pres 2020-04-30 07:59:48 CST my]$ oc get no # the master recovered to ready
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-51-103.us-east-2.compute.internal   Ready    master   39h   v1.16.2
ip-10-0-63-68.us-east-2.compute.internal    Ready    master   39h   v1.16.2
ip-10-0-72-76.us-east-2.compute.internal    Ready    master   39h   v1.16.2
...

[xxia@pres 2020-04-30 08:13:28 CST my]$ oc get co | grep -v "True.*False.*False" # still stuck in degraded
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver                             4.3.0-0.nightly-2020-04-27-151540   True        True          True       40h

[xxia@pres 2020-04-30 08:13:36 CST my]$ ogpkas
NAME                                                       READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ip-10-0-51-103.us-east-2.compute.internal   3/3     Running   0          40m   apiserver=true,app=openshift-kube-apiserver,revision=13
kube-apiserver-ip-10-0-63-68.us-east-2.compute.internal    3/3     Running   0          44m   apiserver=true,app=openshift-kube-apiserver,revision=13
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal    3/3     Running   0          19m   apiserver=true,app=openshift-kube-apiserver,revision=14

Manually force a rollout to get unstuck:
[xxia@pres 2020-04-30 08:13:44 CST my]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test'"$(date)"'" } ]'
kubeapiserver.operator.openshift.io/cluster patched

[xxia@pres 2020-04-30 08:13:53 CST my]$ oc get co | grep -v "True.*False.*False"
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
[xxia@pres 2020-04-30 08:24:26 CST my]$ ogpkas
NAME                                                       READY   STATUS    RESTARTS   AGE     LABELS
kube-apiserver-ip-10-0-51-103.us-east-2.compute.internal   3/3     Running   0          8m29s   apiserver=true,app=openshift-kube-apiserver,revision=15
kube-apiserver-ip-10-0-63-68.us-east-2.compute.internal    3/3     Running   0          10m     apiserver=true,app=openshift-kube-apiserver,revision=15
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal    3/3     Running   0          6m46s   apiserver=true,app=openshift-kube-apiserver,revision=15
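
For convenience, a condensed sketch of the manual reproducer above (the revision number N and the node/pod names are examples from this particular env and will differ elsewhere; repeat the installer-pod deletion if the operator recreates it):

# Terminal A (client): trigger a new revision N
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "test '"$(date)"'" } ]'
# Terminal B (target master): make the node NotReady while installer-N is still Pending
$ systemctl stop crio.service
# Terminal A: force-delete the pending installer pod for the new revision
$ oc -n openshift-kube-apiserver delete po installer-N-<master-node> --force --grace-period=0
# Terminal B: recover the node
$ systemctl start crio.service
$ systemctl restart kubelet
# Terminal A: the node returns to Ready, but co/kube-apiserver stays stuck Degraded with NodeInstallerDegraded
$ oc get no
$ oc get co | grep -v "True.*False.*False"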

Comment 10 Luis Sanchez 2020-05-05 19:39:31 UTC
So far, unable to reproduce using the simplified scenario documented in comment 9 on either 4.4.0-0.nightly-2020-05-01-231319 or 4.5.0-0.nightly-2020-05-04-113741.

The installer pod is always recreated when deleted, and when the node becomes 'Ready', the rollout continues without any further manual intervention.

Comment 13 Abu Kashem 2020-05-18 19:57:57 UTC
Hi kewang,
This cluster has been claimed; can you please reproduce the issue on a new cluster? I would like to debug it live. Please feel free to ping me on Slack when you have a cluster ready for me to investigate.

Comment 14 Ke Wang 2020-05-19 09:33:47 UTC
Hi Abu, I don't know how the above envs reproduced the issue; the env was used by someone else.

Comment 15 Stefan Schimanski 2020-05-20 09:25:01 UTC
*** Bug 1822018 has been marked as a duplicate of this bug. ***

Comment 16 Stefan Schimanski 2020-05-20 09:25:56 UTC
*** Bug 1822016 has been marked as a duplicate of this bug. ***

Comment 17 Abu Kashem 2020-05-20 16:57:16 UTC
We have not been able to reproduce it. We usually use GCP/AWS to provision clusters. If QE can reproduce it, we will do live debugging.

Comment 18 Ke Wang 2020-05-22 08:53:00 UTC
Hi Abu, the bug was reproduced, 

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-22-011031   True        True          70m     Unable to apply 4.5.0-0.nightly-2020-05-22-011031: the cluster operator kube-apiserver is degraded

$ oc describe co kube-apiserver
...
Status:
  Conditions:
    Last Transition Time:  2020-05-22T06:46:58Z
    Message:               NodeInstallerDegraded: 1 nodes are failing on revision 6:
NodeInstallerDegraded: pods "installer-6-..0522-k9m4v-master-1" not found
    Reason:                NodeInstaller_InstallerPodFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-22T06:28:21Z
    Message:               NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 6
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-22T06:24:18Z
    Message:               StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 6
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-22T06:21:21Z
...


$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-5876985f46-ns9f7 | grep E0522
E0522 08:02:45.133616       1 base_controller.go:180] "InstallerController" controller failed to sync "key", err: missing required resources: [secrets: aggregator-client,bound-service-account-signing-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,localhost-serving-cert-certkey,service-network-serving-certkey, secrets: etcd-client-6,kubelet-client-6,localhost-recovery-client-token-6,localhost-recovery-serving-certkey-6]
...

Comment 20 Abu Kashem 2020-05-22 18:56:38 UTC
After I logged in, this is what I saw.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-22-011031   True        True          6h36m   Unable to apply 4.5.0-0.nightly-2020-05-22-011031: the cluster operator openshift-samples is degraded

$ oc describe co kube-apiserver
Name:         kube-apiserver
...
Status:
  Conditions:
    Last Transition Time:  2020-05-22T12:24:10Z
    Message:               NodeControllerDegraded: All master nodes are ready
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-05-22T12:29:10Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 7
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-05-22T06:24:18Z
    Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-22T06:21:21Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     apiextensions.k8s.io
    Name:      
    Resource:  customresourcedefinitions
    Group:     security.openshift.io
    Name:      
    Resource:  securitycontextconstraints
    Group:     
    Name:      openshift-config
    Resource:  namespaces
    Group:     
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:     
    Name:      openshift-kube-apiserver
    Resource:  namespaces
    Group:     admissionregistration.k8s.io
    Name:      
    Resource:  mutatingwebhookconfigurations
    Group:     admissionregistration.k8s.io
    Name:      
    Resource:  validatingwebhookconfigurations
  Versions:
    Name:     raw-internal
    Version:  4.5.0-0.nightly-2020-05-22-011031
    Name:     kube-apiserver
    Version:  1.18.2
    Name:     operator
    Version:  4.5.0-0.nightly-2020-05-22-011031
Events:       <none>


So after about 5.5 hours it fixed itself: the Last Transition Time for Degraded went from 2020-05-22T06:46:58Z to 2020-05-22T12:24:10Z.

$ oc -n openshift-kube-apiserver-operator get pods
NAME                                       READY   STATUS    RESTARTS   AGE
kube-apiserver-operator-5876985f46-ns9f7   1/1     Running   0          6h27m


A few other CVO-managed operators rolled out around the same time; not sure whether that was triggered manually or by the CVO.
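
To check whether those operators rolled out around the same time, one possible sketch is to list each clusteroperator's Progressing lastTransitionTime (assumes jq is available):

$ oc get co -o json | jq -r '.items[] | [.metadata.name, (.status.conditions[] | select(.type == "Progressing") | .lastTransitionTime)] | @tsv'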

Comment 21 Abu Kashem 2020-05-22 20:42:21 UTC
Further debugging:

$ kubectl -n openshift-kube-apiserver-operator get pods kube-apiserver-operator-5876985f46-ns9f7 -o json | jq .status.startTime
"2020-05-22T08:02:37Z"

$ oc -n openshift-kube-apiserver-operator logs kube-apiserver-operator-5876985f46-ns9f7 | head -n 1
I0522 08:02:43.045256       1 cmd.go:196] Using service-serving-cert provided certificates


$ oc get kubeapiserver cluster -o json | jq .status
{
  "conditions": [
    {
      "lastTransitionTime": "2020-05-22T08:05:05Z",
      "message": "All master nodes are ready",
      "reason": "MasterNodesReady",
      "status": "False",
      "type": "NodeControllerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T08:02:45Z",
      "status": "False",
      "type": "InstallerControllerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T12:24:10Z",
      "status": "False",
      "type": "NodeInstallerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T12:22:07Z",
      "status": "False",
      "type": "RevisionControllerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T08:02:59Z",
      "status": "False",
      "type": "TargetConfigControllerDegraded"
    },
   ]
}


I inspected the log and constructed a timeline:

I0522 08:03:51.581423: status controller first detects that master-1 node is not ready, with the following error:
	{
		"lastTransitionTime": "2020-05-22T06:46:58Z",
		"message": "NodeControllerDegraded: The master nodes not ready: node \"xiyuan0522-k9m4v-master-1\" not ready since 2020-05-22 08:03:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nNodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
		"reason": "NodeController_MasterNodesReady::NodeInstaller_InstallerPodFailed",
		"status": "True",
		"type": "Degraded"
	},


There are lots of throttling entries (Throttling request took 2.37138117s) in between.


I0522 08:05:05.371494: status_controller.go:172 "master-1 not ready" is no longer reported in operator status:
	  {
        "lastTransitionTime": "2020-05-22T06:46:58Z",
        "message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
        "reason": "NodeInstaller_InstallerPodFailed",
        "status": "True",
        "type": "Degraded"
      },

I0522 08:05:05.397922: event.go:278 "OperatorStatusChanged" with the following message:
	Degraded message changed from
		"NodeControllerDegraded: The master nodes not ready: node \"xiyuan0522-k9m4v-master-1\" not ready since 2020-05-22 08:03:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nNodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found"
			to
		"NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found"


Clearly, at this point master-1 is reporting Ready according to the operator. But "Degraded" remained "True" with the "NodeInstallerDegraded" error.

Something interesting: from this point on and up to "12:21:27.220404", for more than 4 hours, the log only has
	request.go:621] Throttling request took 1.195921374s
and no other entries at all! Is that expected if the cluster is left untouched?

The first activity we see after that is the following:
I0522 12:21:27.220404       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}):
	type: 'Normal'
	reason: 'TargetUpdateRequired'
	"kube-controller-manager-client-cert-key" in "openshift-config-managed" requires a new target cert/key pair: past its refresh time 2020-05-22 12:21:26 +0000 UTC


I0522 12:22:07.558190       1 installer_controller.go:350] node xiyuan0522-k9m4v-master-1 is progressing towards 5 and needs new revision 7
I0522 12:22:07.558265       1 installer_controller.go:359] "xiyuan0522-k9m4v-master-1" moving to (v1.NodeStatus) {
	 NodeName: (string) (len=25) "xiyuan0522-k9m4v-master-1",
	 CurrentRevision: (int32) 5,
	 TargetRevision: (int32) 7,
	 LastFailedRevision: (int32) 6,
	 LastFailedRevisionErrors: ([]string) <nil>
	}


I0522 12:22:07.558753       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}):
	type: 'Normal' reason: 'RevisionCreate' Revision 6 created because secret/kubelet-client has changed

I0522 12:22:07.560168       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}):
	type: 'Normal' reason: 'RevisionTriggered' new revision 7 triggered by "secret/kubelet-client has changed"

W0522 12:22:07.566273       1 staticpod.go:37] revision 7 is unexpectedly already the latest available revision. This is a possible race!
E0522 12:22:07.594347       1 base_controller.go:180] "RevisionController" controller failed to sync "key", err: conflicting latestAvailableRevision 7

I0522 12:22:07.596637       1 status_controller.go:172:
	 {
		"lastTransitionTime": "2020-05-22T06:46:58Z",
		"message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found\nRevisionControllerDegraded: conflicting latestAvailableRevision 7",
		"reason": "NodeInstaller_InstallerPodFailed::RevisionController_Error",
		"status": "True",
		"type": "Degraded"
	  },


I0522 12:22:07.630223       1 status_controller.go:172]:
	  {
		"lastTransitionTime": "2020-05-22T06:46:58Z",
		"message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
		"reason": "NodeInstaller_InstallerPodFailed",
		"status": "True",
		"type": "Degraded"
	  },

I0522 12:22:07.722369       1 status_controller.go:172]
	  {
		"lastTransitionTime": "2020-05-22T06:46:58Z",
		"message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: ",
		"reason": "NodeInstaller_InstallerPodFailed",
		"status": "True",
		"type": "Degraded"
	  },

I0522 12:22:12.086708       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-7-xiyuan0522-k9m4v-master-1 -n openshift-kube-apiserver because it was missing
I0522 12:22:29.869235       1 installer_controller.go:331] "xiyuan0522-k9m4v-master-1" is in transition to 7, but has not made progress because static pod is pending

I0522 12:24:10.471080 installer_controller.go:318] "xiyuan0522-k9m4v-master-1" moving to
	(v1.NodeStatus) {
	 NodeName: (string) (len=25) "xiyuan0522-k9m4v-master-1",
	 CurrentRevision: (int32) 7,
	 TargetRevision: (int32) 0,
	 LastFailedRevision: (int32) 0,
	 LastFailedRevisionErrors: ([]string) <nil>
	} because static pod is ready
I0522 12:24:10.492913       1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeCurrentRevisionChanged' Updated node "xiyuan0522-k9m4v-master-1" from revision 5 to 7 because static pod is ready

I0522 12:24:10.471080       1 installer_controller.go:318] "xiyuan0522-k9m4v-master-1" moving to (v1.NodeStatus) {
 NodeName: (string) (len=25) "xiyuan0522-k9m4v-master-1",
 CurrentRevision: (int32) 7,
 TargetRevision: (int32) 0,
 LastFailedRevision: (int32) 0,
 LastFailedRevisionErrors: ([]string) <nil>
}



A couple of things to investigate further:
- Why does NodeInstallerDegraded not clear up?
- Is this pause of more than 4 hours, with no events reported and no log entries, expected even when the cluster is quiet?
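
A quick way to watch whether the individual Degraded conditions clear on their own, based on the kubeapiserver status output above (just a sketch; assumes jq is available):

$ oc get kubeapiserver cluster -o json | jq '.status.conditions[] | select(.type | endswith("Degraded")) | {type, status, lastTransitionTime}'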

Comment 22 Stefan Schimanski 2020-05-25 10:09:37 UTC
Please attach must-gather output next time this happens. We cannot rely on live clusters to debug this.

Comment 24 Xingxing Xia 2020-05-27 02:40:58 UTC
(In reply to Stefan Schimanski from comment #22)
> Please attach must-gather output next time this happens. We cannot rely on live clusters to debug this.
OK. Sorry about that. I thought a live cluster would be more intuitive to debug.

(In reply to Luis Sanchez from comment #10)
> So far, unable to reproduce using the simplified scenario documented in comment 9 on either 4.4.0-0.nightly-2020-05-01-231319 or 4.5.0-0.nightly-2020-05-04-113741.
> installer pod is always recreated when deleted
You may need to repeat the deletion a second or third time until the pod is no longer recreated.

Comment 25 Stefan Schimanski 2020-05-27 08:09:09 UTC
Moving out of the blocker list and will revisit when we get must-gather logs (or another live cluster in time).

Comment 26 Abu Kashem 2020-06-18 14:09:21 UTC
We don't have a must-gather or a live cluster to debug yet, so applying UpcomingSprint label.

Comment 28 Abu Kashem 2020-07-10 12:06:37 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 29 Abu Kashem 2020-07-31 15:47:11 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 31 Abu Kashem 2020-08-20 19:19:17 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. We will revisit it in a future sprint.

Comment 32 Michal Fojtik 2020-09-02 06:59:27 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 33 Xingxing Xia 2020-09-02 07:02:46 UTC
Removing LifecycleStale per the recent comment above in which the issue was reproduced.

Comment 34 Ben Parees 2020-09-02 14:46:06 UTC
>  The operator is reporting a misleading status in clusteroperator object. I am assuming this will not block upgrade. (correct?)

I believe Degraded blocks minor version upgrades, but not patch version upgrades.

Comment 35 Stefan Schimanski 2020-09-02 14:52:44 UTC
The stuck operator is probably due to https://bugzilla.redhat.com/show_bug.cgi?id=1858763.

For the condition message we lack the must-gather data to understand what was going on over time.

Comment 36 Stefan Schimanski 2020-09-02 14:53:29 UTC

*** This bug has been marked as a duplicate of bug 1858763 ***

Comment 37 Stefan Schimanski 2020-09-02 14:57:16 UTC
Confirmed that c27 was the same issue as in https://bugzilla.redhat.com/show_bug.cgi?id=1858763; one of the nodes has this state:

  - currentRevision: 9
    lastFailedRevision: 10
    lastFailedRevisionErrors:
    - pods "installer-10-ip-10-0-145-36.us-west-1.compute.internal" not found
    nodeName: ip-10-0-145-36.us-west-1.compute.internal
    targetRevision: 9

"targetVersion == currentVersion" and "lastFailedRevision: 10" is the symptom of that deadlock in https://bugzilla.redhat.com/show_bug.cgi?id=1858763.

Comment 38 Xingxing Xia 2020-09-03 02:23:54 UTC
Glad to see the bug has found its fix. This issue was originally found in 4.4, for which I opened this bug 1817419; it was also found in 4.3 and 4.5, and was cloned as the 4.3 bug 1822016 and the 4.4 bug 1822018. The 4.3 and 4.4 clones were closed when the fix had not been figured out at that time. I don't object to keeping the 4.3 and 4.4 clones closed as is. But I think at least the latest release, 4.5, needs a backport, given the severity and frequency of this issue as we see it. So I'm moving my bug 1817419's Target Release to 4.5.z instead of having it closed as a duplicate.

Comment 39 Stefan Schimanski 2020-09-04 09:05:58 UTC

*** This bug has been marked as a duplicate of bug 1874597 ***

