Description of problem:
Operator should auto-detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of staying stuck in Degraded.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-25-223508

How reproducible:
So far hit once

Steps to Reproduce:
1. Install a latest 4.4 fresh env; the env matrix is upi-on-gcp, disconnected-remove_rhcos_worker-fips-ovn.
2. After the installation succeeds, check `oc get co`, which shows:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True   True   True   144m
3. Check `oc get po -n openshift-kube-apiserver --show-labels`; all pods under openshift-kube-apiserver are:
NAME                                                                     READY   STATUS      RESTARTS   AGE   LABELS
installer-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal         0/1     Completed   0          68m   app=installer
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal      4/4     Running     0          68m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal      4/4     Running     4          86m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal      4/4     Running     4          88m   apiserver=true,app=openshift-kube-apiserver,revision=7
revision-pruner-7-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          82m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   0/1     Completed   0          73m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          65m   app=pruner
Check `oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator`, which shows:
...
NodeInstallerDegraded: pods \"installer-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not found\nNodeControllerDegraded: The master nodes not ready: node \"qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)"...
But check `oc get no`, no issue:
NAME                                                   STATUS   ROLES    AGE    VERSION
qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-0                  Ready    worker   34m    v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-1                  Ready    worker   34m    v1.17.1
Networking QE also helped debug; the network also has no issue.
4. Force the operator to retry rolling out by:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
After a few minutes, co/kube-apiserver quickly becomes normal:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True   False   False   173m
Pods also become normal:
$ ogpkas
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   4/4   Running   0   6m28s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   4/4   Running   0   8m23s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   4/4   Running   0   10m     apiserver=true,app=openshift-kube-apiserver,revision=9

Actual results:
3. Operator is stuck rolling out the static pods even though step 4 shows the cause is gone.

Expected results:
3. Operator should auto-detect the cause and auto-retry rolling out once the cause is gone.
Additional info: Bug is filed from https://coreos.slack.com/archives/CH76YSYSC/p1585214604225000?thread_ts=1585210139.199800&cid=CH76YSYSC with discussion there.
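To spot this state without digging through the full operator log, one could parse the Degraded condition out of `oc get co kube-apiserver -o json`. Below is a minimal offline sketch; the JSON excerpt is abbreviated and illustrative (not the exact cluster output), and the helper name is ours:

```python
import json

# Illustrative excerpt of `oc get co kube-apiserver -o json` output,
# mirroring the condition values from the steps above (message abbreviated).
co_json = """
{
  "status": {
    "conditions": [
      {"type": "Available",   "status": "True", "reason": "AsExpected"},
      {"type": "Progressing", "status": "True", "reason": "NodeInstaller"},
      {"type": "Degraded",    "status": "True",
       "reason": "NodeInstaller_InstallerPodFailed",
       "message": "NodeInstallerDegraded: 1 nodes are failing on revision 8"}
    ]
  }
}
"""

def degraded_reason(co):
    """Return (reason, message) when Degraded=True, else None."""
    for cond in co["status"]["conditions"]:
        if cond["type"] == "Degraded" and cond["status"] == "True":
            return cond["reason"], cond.get("message", "")
    return None

print(degraded_reason(json.loads(co_json)))
```

A reason of NodeInstaller_InstallerPodFailed paired with Available=True for all nodes is the misleading combination this bug describes.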
Facts:
- This is happening in upi-on-gcp and infrequently.
- All kube-apiserver pods are running successfully.
- The operator is reporting a misleading status in the clusteroperator object. I am assuming this will not block upgrade. (correct?)
- There is a workaround to fix this issue; that's a plus.
- Troubleshooting may not be obvious; we have to check the operator log to find what the issue is. That's a minus.

Since this is reporting misleading information in the ClusterOperator object, ideally we would want to fix it in 4.4. But given the time constraint I think we can defer it to 4.5.
(In reply to Abu Kashem from comment #1)
> - This is happening in upi-on-gcp and infrequently
FYI, per https://bugzilla.redhat.com/show_bug.cgi?id=1822016#c0 (which says ipi-on-aws), this is also happening on other clouds and matrices.
Recently, a similar problem was found in an OCP 4.4.0-rc.9-x86_64 installation with upi-on-osp:
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd             4.4.0-rc.9   True        True          True       34m
kube-apiserver   4.4.0-rc.9   True        True          True       76m
Check the related operators' error logs. etcd operator log error:
E0420 03:09:41.719856 1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: etcdserver: leader changed
kube-apiserver operator log error:
E0420 03:12:27.092864 1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: bound-sa-token-signing-certs-7,config-7,etcd-serving-ca-7,kube-apiserver-cert-syncer-kubeconfig-7,kube-apiserver-pod-7,kubelet-serving-ca-7,sa-token-signing-certs-7]
After a forced roll-out on both:
$ oc patch kube-apiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
NAME             VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd             4.4.0-rc.9   True        False         False      3h4m
kube-apiserver   4.4.0-rc.9   True        False         False      3h45m
Correction to the forced roll-out CLI above; it should be:
oc patch etcd/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
> kube-apiserver 4.4.0-0.nightly-2020-04-26-045325 True False False 8h
Sorry, in the previous comment I pasted the co after rolling out. I intended to paste the co before rolling out:
kube-apiserver   4.4.0-0.nightly-2020-04-26-045325   True   True   True   7h49m
Hi xxia, I am looking into this issue. If you can reproduce it on a cluster, I would like to do a live debug.
I didn't know how the above envs triggered conditions like comment 0's 'The master nodes not ready ... qe-yapei44debug-03260202-m-2 ... not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)'. Now, prompted by the words "nodes not ready", I tentatively tried the following to make a master not ready, and it seems to be a good manual reproducer: tried twice, reproduced both times, though it may not be exactly the same as what happened in the real scenarios of the above fresh envs and the above upgrading env.

My on-hand env is 4.3.0-0.nightly-2020-04-27-151540. The commands below run on two terminals: A in a local client (xxia@pres) to run oc, and B (root@ip-10-0-63-68) on the master.

[xxia@pres 2020-04-30 07:53:26 CST my]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test'"$(date)"'" } ]'

[root@ip-10-0-63-68 2020-04-30 07:56:03 CST ~]# systemctl stop crio.service

[xxia@pres 2020-04-30 07:56:25 CST my]$ oc get no # now the master is made not ready
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-51-103.us-east-2.compute.internal   Ready      master   39h   v1.16.2
ip-10-0-63-68.us-east-2.compute.internal    NotReady   master   39h   v1.16.2
ip-10-0-72-76.us-east-2.compute.internal    Ready      master   39h   v1.16.2
...
[xxia@pres 2020-04-30 07:56:27 CST my]$ ogpkas-all | grep 14 # new revision is to be 14
installer-14-ip-10-0-63-68.us-east-2.compute.internal         0/1   Pending     0   9s     app=installer
installer-14-ip-10-0-72-76.us-east-2.compute.internal         0/1   Completed   0   2m6s   app=installer
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal       3/3   Running     0   118s   apiserver=true,app=openshift-kube-apiserver,revision=14
revision-pruner-14-ip-10-0-72-76.us-east-2.compute.internal   0/1   Completed   0   22s    app=pruner

[xxia@pres 2020-04-30 07:56:35 CST my]$ oc delete po installer-14-ip-10-0-63-68.us-east-2.compute.internal -n openshift-kube-apiserver --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "installer-14-ip-10-0-63-68.us-east-2.compute.internal" force deleted

[root@ip-10-0-63-68 2020-04-30 07:58:24 CST ~]# systemctl start crio.service
[root@ip-10-0-63-68 2020-04-30 07:58:33 CST ~]# systemctl status crio.service
...
   Active: active (running) since Thu 2020-04-30 07:58:33 CST; 17s ago
...
Apr 30 07:58:33 ip-10-0-63-68 crio[3635140]: time="2020-04-29 23:58:33.871619650Z" level=error msg="Error while checking pod to CNI net...
[root@ip-10-0-63-68 2020-04-30 07:59:12 CST ~]# systemctl restart kubelet

[xxia@pres 2020-04-30 07:59:48 CST my]$ oc get no # the master recovered to ready
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-51-103.us-east-2.compute.internal   Ready    master   39h   v1.16.2
ip-10-0-63-68.us-east-2.compute.internal    Ready    master   39h   v1.16.2
ip-10-0-72-76.us-east-2.compute.internal    Ready    master   39h   v1.16.2
...
[xxia@pres 2020-04-30 08:13:28 CST my]$ oc get co | grep -v "True.*False.*False" # still stuck in degraded
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver   4.3.0-0.nightly-2020-04-27-151540   True        True          True       40h

[xxia@pres 2020-04-30 08:13:36 CST my]$ ogpkas
NAME                                                       READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ip-10-0-51-103.us-east-2.compute.internal   3/3     Running   0          40m   apiserver=true,app=openshift-kube-apiserver,revision=13
kube-apiserver-ip-10-0-63-68.us-east-2.compute.internal    3/3     Running   0          44m   apiserver=true,app=openshift-kube-apiserver,revision=13
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal    3/3     Running   0          19m   apiserver=true,app=openshift-kube-apiserver,revision=14

Manually roll out to make it not stuck:
[xxia@pres 2020-04-30 08:13:44 CST my]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test'"$(date)"'" } ]'
kubeapiserver.operator.openshift.io/cluster patched
[xxia@pres 2020-04-30 08:13:53 CST my]$ oc get co | grep -v "True.*False.*False"
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
[xxia@pres 2020-04-30 08:24:26 CST my]$ ogpkas
NAME                                                       READY   STATUS    RESTARTS   AGE     LABELS
kube-apiserver-ip-10-0-51-103.us-east-2.compute.internal   3/3     Running   0          8m29s   apiserver=true,app=openshift-kube-apiserver,revision=15
kube-apiserver-ip-10-0-63-68.us-east-2.compute.internal    3/3     Running   0          10m     apiserver=true,app=openshift-kube-apiserver,revision=15
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal    3/3     Running   0          6m46s   apiserver=true,app=openshift-kube-apiserver,revision=15
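The stuck state shows up as static pods sitting at mixed revision labels. As a quick offline check, here is a small sketch (the helper is hypothetical; the revision values are taken from the `ogpkas` output above) that reports when the pods disagree on revision:

```python
# Revision labels observed on the kube-apiserver static pods while stuck
# (from the `ogpkas` output above): two pods at 13, one already at 14.
pods = {
    "kube-apiserver-ip-10-0-51-103": 13,
    "kube-apiserver-ip-10-0-63-68": 13,
    "kube-apiserver-ip-10-0-72-76": 14,
}

def mixed_revisions(pods):
    """Return the set of revisions if the pods disagree, else None."""
    revs = set(pods.values())
    return revs if len(revs) > 1 else None

stuck = mixed_revisions(pods)
if stuck:
    print(f"roll-out stuck across revisions: {sorted(stuck)}")
```

After the forced roll-out, all pods land on revision 15 and the check returns None.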
So far, unable to reproduce using the simplified scenario documented in comment 9 on either 4.4.0-0.nightly-2020-05-01-231319 or 4.5.0-0.nightly-2020-05-04-113741. The installer pod is always recreated when deleted, and when the node becomes 'Ready', the roll-out continues without any further manual intervention.
Hi kewang, this cluster has been claimed, can you please reproduce the issue on a new cluster? I would like to debug it live. Please feel free to ping me on slack when you have a cluster ready for me to investigate.
Hi Abu, I don't know how the above envs reproduced the issue; the env came from someone else who had been using it.
*** Bug 1822018 has been marked as a duplicate of this bug. ***
*** Bug 1822016 has been marked as a duplicate of this bug. ***
We have not been able to reproduce it. We usually use gcp/aws to provision clusters. If QE can reproduce it, we will do live debugging.
Hi Abu, the bug was reproduced:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-22-011031   True        True          70m     Unable to apply 4.5.0-0.nightly-2020-05-22-011031: the cluster operator kube-apiserver is degraded

$ oc describe co kube-apiserver
...
Status:
  Conditions:
    Last Transition Time:  2020-05-22T06:46:58Z
    Message:               NodeInstallerDegraded: 1 nodes are failing on revision 6:
                           NodeInstallerDegraded: pods "installer-6-..0522-k9m4v-master-1" not found
    Reason:                NodeInstaller_InstallerPodFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-05-22T06:28:21Z
    Message:               NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 6
    Reason:                NodeInstaller
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-05-22T06:24:18Z
    Message:               StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 6
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-22T06:21:21Z
...

$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-5876985f46-ns9f7 | grep E0522
E0522 08:02:45.133616 1 base_controller.go:180] "InstallerController" controller failed to sync "key", err: missing required resources: [secrets: aggregator-client,bound-service-account-signing-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,localhost-serving-cert-certkey,service-network-serving-certkey, secrets: etcd-client-6,kubelet-client-6,localhost-recovery-client-token-6,localhost-recovery-serving-certkey-6]
...
After I logged in, this is what I saw.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-22-011031   True        True          6h36m   Unable to apply 4.5.0-0.nightly-2020-05-22-011031: the cluster operator openshift-samples is degraded

$ oc describe co kube-apiserver
Name:         kube-apiserver
...
Status:
  Conditions:
    Last Transition Time:  2020-05-22T12:24:10Z
    Message:               NodeControllerDegraded: All master nodes are ready
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-05-22T12:29:10Z
    Message:               NodeInstallerProgressing: 3 nodes are at revision 7
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-05-22T06:24:18Z
    Message:               StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-05-22T06:21:21Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:     operator.openshift.io
    Name:      cluster
    Resource:  kubeapiservers
    Group:     apiextensions.k8s.io
    Name:
    Resource:  customresourcedefinitions
    Group:     security.openshift.io
    Name:
    Resource:  securitycontextconstraints
    Group:
    Name:      openshift-config
    Resource:  namespaces
    Group:
    Name:      openshift-config-managed
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver-operator
    Resource:  namespaces
    Group:
    Name:      openshift-kube-apiserver
    Resource:  namespaces
    Group:     admissionregistration.k8s.io
    Name:
    Resource:  mutatingwebhookconfigurations
    Group:     admissionregistration.k8s.io
    Name:
    Resource:  validatingwebhookconfigurations
  Versions:
    Name:     raw-internal
    Version:  4.5.0-0.nightly-2020-05-22-011031
    Name:     kube-apiserver
    Version:  1.18.2
    Name:     operator
    Version:  4.5.0-0.nightly-2020-05-22-011031
Events:  <none>

So after 6.5 hours it fixed itself. Last Transition Time for Degraded: 2020-05-22T06:46:58Z -> 2020-05-22T12:24:10Z (about 6.5 hours).

$ oc -n openshift-kube-apiserver-operator get pods
NAME                                       READY   STATUS    RESTARTS   AGE
kube-apiserver-operator-5876985f46-ns9f7   1/1     Running   0          6h27m

A few other CVO-enabled operators rolled out around the same time. Not sure if it was manual or CVO-triggered.
Further debugging:

$ kubectl -n openshift-kube-apiserver-operator get pods kube-apiserver-operator-5876985f46-ns9f7 -o json | jq .status.startTime
"2020-05-22T08:02:37Z"

$ oc -n openshift-kube-apiserver-operator logs kube-apiserver-operator-5876985f46-ns9f7 | head -n 1
I0522 08:02:43.045256 1 cmd.go:196] Using service-serving-cert provided certificates

$ oc get kubeapiserver cluster -o json | jq .status
{
  "conditions": [
    {
      "lastTransitionTime": "2020-05-22T08:05:05Z",
      "message": "All master nodes are ready",
      "reason": "MasterNodesReady",
      "status": "False",
      "type": "NodeControllerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T08:02:45Z",
      "status": "False",
      "type": "InstallerControllerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T12:24:10Z",
      "status": "False",
      "type": "NodeInstallerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T12:22:07Z",
      "status": "False",
      "type": "RevisionControllerDegraded"
    },
    {
      "lastTransitionTime": "2020-05-22T08:02:59Z",
      "status": "False",
      "type": "TargetConfigControllerDegraded"
    }
  ]
}

I inspected the log and constructed a timeline:

I0522 08:03:51.581423: status controller first detects that the master-1 node is not ready, with the following error:
{
  "lastTransitionTime": "2020-05-22T06:46:58Z",
  "message": "NodeControllerDegraded: The master nodes not ready: node \"xiyuan0522-k9m4v-master-1\" not ready since 2020-05-22 08:03:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nNodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
  "reason": "NodeController_MasterNodesReady::NodeInstaller_InstallerPodFailed",
  "status": "True",
  "type": "Degraded"
},

There are lots of (Throttling request took 2.37138117s) in between.
I0522 08:05:05.371494 status_controller.go:172: "master-1 not ready" is no longer reported in the operator status:
{
  "lastTransitionTime": "2020-05-22T06:46:58Z",
  "message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
  "reason": "NodeInstaller_InstallerPodFailed",
  "status": "True",
  "type": "Degraded"
},

I0522 08:05:05.397922 event.go:278: "OperatorStatusChanged" with the following message:
Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xiyuan0522-k9m4v-master-1\" not ready since 2020-05-22 08:03:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nNodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found" to "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found"

Clearly, at this point master-1 is reporting ready according to the operator. But "Degraded" remained "True" with the "NodeInstallerDegraded" error.

Something interesting: from this point on, and up to 12:21:27.220404 more than 4.5 hours later, the log only has
request.go:621] Throttling request took 1.195921374s
and no other entries at all! Is that expected if the cluster is left untouched?
The first activity we see after that is the following:

I0522 12:21:27.220404 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TargetUpdateRequired' "kube-controller-manager-client-cert-key" in "openshift-config-managed" requires a new target cert/key pair: past its refresh time 2020-05-22 12:21:26 +0000 UTC
I0522 12:22:07.558190 1 installer_controller.go:350] node xiyuan0522-k9m4v-master-1 is progressing towards 5 and needs new revision 7
I0522 12:22:07.558265 1 installer_controller.go:359] "xiyuan0522-k9m4v-master-1" moving to (v1.NodeStatus) { NodeName: (string) (len=25) "xiyuan0522-k9m4v-master-1", CurrentRevision: (int32) 5, TargetRevision: (int32) 7, LastFailedRevision: (int32) 6, LastFailedRevisionErrors: ([]string) <nil> }
I0522 12:22:07.558753 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionCreate' Revision 6 created because secret/kubelet-client has changed
I0522 12:22:07.560168 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 7 triggered by "secret/kubelet-client has changed"
W0522 12:22:07.566273 1 staticpod.go:37] revision 7 is unexpectedly already the latest available revision. This is a possible race!
E0522 12:22:07.594347 1 base_controller.go:180] "RevisionController" controller failed to sync "key", err: conflicting latestAvailableRevision 7

I0522 12:22:07.596637 1 status_controller.go:172:
{
  "lastTransitionTime": "2020-05-22T06:46:58Z",
  "message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found\nRevisionControllerDegraded: conflicting latestAvailableRevision 7",
  "reason": "NodeInstaller_InstallerPodFailed::RevisionController_Error",
  "status": "True",
  "type": "Degraded"
},

I0522 12:22:07.630223 1 status_controller.go:172]:
{
  "lastTransitionTime": "2020-05-22T06:46:58Z",
  "message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
  "reason": "NodeInstaller_InstallerPodFailed",
  "status": "True",
  "type": "Degraded"
},

I0522 12:22:07.722369 1 status_controller.go:172]
{
  "lastTransitionTime": "2020-05-22T06:46:58Z",
  "message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: ",
  "reason": "NodeInstaller_InstallerPodFailed",
  "status": "True",
  "type": "Degraded"
},

I0522 12:22:12.086708 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-7-xiyuan0522-k9m4v-master-1 -n openshift-kube-apiserver because it was missing
I0522 12:22:29.869235 1 installer_controller.go:331] "xiyuan0522-k9m4v-master-1" is in transition to 7, but has not made progress because static pod is pending
I0522 12:24:10.471080 1 installer_controller.go:318] "xiyuan0522-k9m4v-master-1" moving to (v1.NodeStatus) { NodeName: (string) (len=25) "xiyuan0522-k9m4v-master-1", CurrentRevision: (int32) 7, TargetRevision: (int32) 0, LastFailedRevision: (int32) 0, LastFailedRevisionErrors: ([]string) <nil> } because static pod is ready
I0522 12:24:10.492913 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeCurrentRevisionChanged' Updated node "xiyuan0522-k9m4v-master-1" from revision 5 to 7 because static pod is ready

A couple of things to investigate further:
- Why does NodeInstallerDegraded not clear up?
- The pause of about 5 hours with no events reported or any log entries: is this expected even when the cluster is quiet?
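The long silent stretch noted above can be spotted mechanically by diffing klog timestamps. A minimal offline sketch, assuming the year is known (klog prefixes omit it) and using abbreviated log lines from this comment:

```python
from datetime import datetime, timedelta

# Abbreviated operator log lines from the timeline above; klog prefixes
# look like "<severity><MMDD> <HH:MM:SS.ffffff> ...".
log_lines = [
    "I0522 08:03:51.581423 1 status_controller.go:172] ...",
    "I0522 08:05:05.397922 1 event.go:278] OperatorStatusChanged ...",
    "I0522 12:21:27.220404 1 event.go:278] TargetUpdateRequired ...",
]

def klog_time(line, year=2020):
    """Parse a klog-prefixed line into a datetime (year is assumed)."""
    mmdd = line[1:5]               # e.g. "0522"
    clock = line.split()[1]        # e.g. "08:03:51.581423"
    return datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")

def silent_gaps(lines, threshold=timedelta(hours=1)):
    """Return (start, end) pairs of consecutive entries farther apart than threshold."""
    times = [klog_time(l) for l in lines]
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > threshold]

for start, end in silent_gaps(log_lines):
    print(f"gap of {end - start} between {start} and {end}")
```

On the sample lines this flags the single gap between 08:05:05 and 12:21:27, matching the quiet stretch observed in the operator log.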
Please attach must-gather output next time this happens. We cannot rely on live clusters to debug this.
(In reply to Stefan Schimanski from comment #22)
> Please attach must-gather output next time this happens. We cannot rely on live clusters to debug this.
OK. Sorry about that. I thought a live cluster is more intuitive to debug.

(In reply to Luis Sanchez from comment #10)
> So far, unable to reproduce using the simplified scenario documented in comment 9 on either 4.4.0-0.nightly-2020-05-01-231319 or 4.5.0-0.nightly-2020-05-04-113741.
> installer pod is always recreated when deleted
You may need to repeat the deletion a second or third time until the pod is no longer recreated.
Moving out of the blocker list and will revisit when we get must-gather logs (or another live cluster in time).
We don't have a must-gather or a live cluster to debug yet, so applying UpcomingSprint label.
This turned up in 4.4.5 -> 4.4.10 CI [1,2]. So we have all the usual gathered assets, including a must-gather [3]. [1]: https://github.com/openshift/cincinnati-graph-data/pull/295#issuecomment-651265210 [2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616 [3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/must-gather.tar
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. We will revisit it in a future sprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
Removing LifecycleStale as per previous recent reproduced comment.
> The operator is reporting a misleading status in clusteroperator object. I am assuming this will not block upgrade. (correct?)
I believe degraded blocks minor version upgrades, but not patch version upgrades.
The stuck operator is probably due to https://bugzilla.redhat.com/show_bug.cgi?id=1858763. For the condition message we lack the must-gather data to understand what was going on over time.
*** This bug has been marked as a duplicate of bug 1858763 ***
Confirmed that c27 was the same issue as in https://bugzilla.redhat.com/show_bug.cgi?id=1858763; one of the nodes has this state:

- currentRevision: 9
  lastFailedRevision: 10
  lastFailedRevisionErrors:
  - pods "installer-10-ip-10-0-145-36.us-west-1.compute.internal" not found
  nodeName: ip-10-0-145-36.us-west-1.compute.internal
  targetRevision: 9

"targetRevision == currentRevision" together with "lastFailedRevision: 10" is the symptom of the deadlock in https://bugzilla.redhat.com/show_bug.cgi?id=1858763.
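The deadlock symptom can be checked mechanically against the nodeStatuses from `oc get kubeapiserver cluster -o json`. A minimal sketch (the helper name is ours; the data mirrors the state pasted above):

```python
# nodeStatuses mirroring the deadlocked state above: the node's target
# revision has fallen back to its current revision while a newer revision
# (10) is recorded as failed.
node_statuses = [
    {
        "nodeName": "ip-10-0-145-36.us-west-1.compute.internal",
        "currentRevision": 9,
        "targetRevision": 9,
        "lastFailedRevision": 10,
    },
]

def deadlocked_nodes(statuses):
    """Nodes stuck with a failed revision newer than their settled target."""
    return [
        s["nodeName"]
        for s in statuses
        if s.get("lastFailedRevision", 0) > s["targetRevision"]
        and s["targetRevision"] == s["currentRevision"]
    ]

print(deadlocked_nodes(node_statuses))
```

A healthy node (lastFailedRevision 0, or a targetRevision ahead of currentRevision while the installer retries) would not be reported.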
Glad to see the bug found its fix. This issue was originally found in 4.4, for which I opened this bug 1817419; it was also formerly found in 4.3 & 4.5 via the clones: 4.3 bug 1822016 and 4.4 bug 1822018. The 4.3 and 4.4 clones were closed when the fix had not been figured out at that time. I don't object to keeping the 4.3 & 4.4 clones closed as-is. But I think at least the last release, 4.5, needs a backport due to the severity and frequency of this issue as we have seen it. So I'm moving my bug 1817419's Target Release to 4.5.z instead of closing it as a duplicate.
*** This bug has been marked as a duplicate of bug 1874597 ***