Bug 1817419
| Summary: | [4.5] Operator should auto detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of stuck in Degraded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | kube-apiserver | Assignee: | Abu Kashem <akashem> |
| Status: | CLOSED DUPLICATE | QA Contact: | Ke Wang <kewang> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.4 | CC: | akashem, aos-bugs, bparees, dollierp, kewang, mfojtik, sanchezl, sdodson, sttts, wking |
| Target Milestone: | --- | Keywords: | Reopened, Upgrades |
| Target Release: | 4.5.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1822016 1822018 (view as bug list) | Environment: | |
| Last Closed: | 2020-09-04 09:05:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1858763, 1909600, 1949370 | | |
| Bug Blocks: | 1822016 | | |
Facts:
- This is happening in upi-on-gcp and infrequently.
- All kube-apiserver pods are running successfully.
- The operator is reporting a misleading status in the clusteroperator object. I am assuming this will not block upgrade. (correct?)
- There is a workaround for this issue; that's a plus.
- Troubleshooting may not be obvious: we have to check the operator log to find what the issue is. That's a minus.

Since this reports misleading information in the ClusterOperator object, ideally we would want to fix it in 4.4, but given the time constraint I think we can defer it to 4.5.

(In reply to Abu Kashem from comment #1)
> - This is happening in upi-on-gcp and infrequently

FYI, per https://bugzilla.redhat.com/show_bug.cgi?id=1822016#c0, which mentions ipi-on-aws, this is also happening on other clouds and install matrices.

Recently, a similar problem was found during an OCP 4.4.0-rc.9-x86_64 installation with upi-on-osp:
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.4.0-rc.9 True True True 34m
kube-apiserver 4.4.0-rc.9 True True True 76m
Checking the related operators' error logs:
etcd operator log error:
E0420 03:09:41.719856 1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: etcdserver: leader changed
kube-apiserver operator log error:
E0420 03:12:27.092864 1 base_controller.go:121] InstallerController controller failed to sync installercontrollerKey with: missing required resources: [configmaps: aggregator-client-ca,client-ca, configmaps: bound-sa-token-signing-certs-7,config-7,etcd-serving-ca-7,kube-apiserver-cert-syncer-kubeconfig-7,kube-apiserver-pod-7,kubelet-serving-ca-7,sa-token-signing-certs-7]
After forcing a rollout on both of them:
$ oc patch kube-apiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.4.0-rc.9 True False False 3h4m
kube-apiserver 4.4.0-rc.9 True False False 3h45m
Correction: the forced-rollout commands above should be:
oc patch etcd/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
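For convenience, the forced-rollout workaround can be wrapped in a short script. This is only a sketch based on the commands already shown in this bug: the resource names (etcd/cluster, kubeapiserver/cluster) come from the correction above, while the loop and the Degraded check are my own illustration.

#!/usr/bin/env bash
# Sketch: apply the forced-rollout workaround only to operators that report Degraded=True.
set -euo pipefail
reason="forced test $(date)"   # any changed string triggers a new revision

degraded() {
  oc get clusteroperator "$1" -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
}

if [ "$(degraded etcd)" = "True" ]; then
  oc patch etcd/cluster --type=json \
    -p '[{"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "'"$reason"'"}]'
fi
if [ "$(degraded kube-apiserver)" = "True" ]; then
  oc patch kubeapiserver/cluster --type=json \
    -p '[{"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "'"$reason"'"}]'
fi

In the cases reported here, Degraded cleared within minutes of the new revision rolling out.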
> kube-apiserver 4.4.0-0.nightly-2020-04-26-045325 True False False 8h
Sorry, in the previous comment I pasted the co output from after the rollout. I intended to paste the output from before the rollout:
kube-apiserver 4.4.0-0.nightly-2020-04-26-045325 True True True 7h49m
Hi xxia, I am looking into this issue. If you can reproduce it on a cluster, I would like to do a live debug.

I didn't know how the above envs triggered conditions like the one in comment 0: 'The master nodes not ready ... qe-yapei44debug-03260202-m-2 ... not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)'. Now, prompted by the words "nodes not ready", I tentatively tried the steps below to make a master not ready, and they seem to be a good manual reproducer: I tried twice and both attempts reproduced the issue, though it may not be exactly the same as what happened in the real scenarios of the fresh envs and the upgrading env above.

My on-hand env is 4.3.0-0.nightly-2020-04-27-151540. The commands below run in two terminals: A on the local client (xxia@pres) to run oc, and B (root@ip-10-0-63-68) on the master.

[xxia@pres 2020-04-30 07:53:26 CST my]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test'"$(date)"'" } ]'

[root@ip-10-0-63-68 2020-04-30 07:56:03 CST ~]# systemctl stop crio.service

[xxia@pres 2020-04-30 07:56:25 CST my]$ oc get no # now the master is made not ready
NAME                                        STATUS     ROLES    AGE   VERSION
ip-10-0-51-103.us-east-2.compute.internal   Ready      master   39h   v1.16.2
ip-10-0-63-68.us-east-2.compute.internal    NotReady   master   39h   v1.16.2
ip-10-0-72-76.us-east-2.compute.internal    Ready      master   39h   v1.16.2
...

[xxia@pres 2020-04-30 07:56:27 CST my]$ ogpkas-all | grep 14 # new revision is to be 14
installer-14-ip-10-0-63-68.us-east-2.compute.internal         0/1   Pending     0   9s     app=installer
installer-14-ip-10-0-72-76.us-east-2.compute.internal         0/1   Completed   0   2m6s   app=installer
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal       3/3   Running     0   118s   apiserver=true,app=openshift-kube-apiserver,revision=14
revision-pruner-14-ip-10-0-72-76.us-east-2.compute.internal   0/1   Completed   0   22s    app=pruner

[xxia@pres 2020-04-30 07:56:35 CST my]$ oc delete po installer-14-ip-10-0-63-68.us-east-2.compute.internal -n openshift-kube-apiserver --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "installer-14-ip-10-0-63-68.us-east-2.compute.internal" force deleted

[root@ip-10-0-63-68 2020-04-30 07:58:24 CST ~]# systemctl start crio.service
[root@ip-10-0-63-68 2020-04-30 07:58:33 CST ~]# systemctl status crio.service
...
Active: active (running) since Thu 2020-04-30 07:58:33 CST; 17s ago
...
Apr 30 07:58:33 ip-10-0-63-68 crio[3635140]: time="2020-04-29 23:58:33.871619650Z" level=error msg="Error while checking pod to CNI net...

[root@ip-10-0-63-68 2020-04-30 07:59:12 CST ~]# systemctl restart kubelet

[xxia@pres 2020-04-30 07:59:48 CST my]$ oc get no # the master recovered to ready
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-51-103.us-east-2.compute.internal   Ready    master   39h   v1.16.2
ip-10-0-63-68.us-east-2.compute.internal    Ready    master   39h   v1.16.2
ip-10-0-72-76.us-east-2.compute.internal    Ready    master   39h   v1.16.2
...
[xxia@pres 2020-04-30 08:13:28 CST my]$ oc get co | grep -v "True.*False.*False" # still stuck in degraded
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
kube-apiserver   4.3.0-0.nightly-2020-04-27-151540   True        True          True       40h

[xxia@pres 2020-04-30 08:13:36 CST my]$ ogpkas
NAME                                                       READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ip-10-0-51-103.us-east-2.compute.internal   3/3     Running   0          40m   apiserver=true,app=openshift-kube-apiserver,revision=13
kube-apiserver-ip-10-0-63-68.us-east-2.compute.internal    3/3     Running   0          44m   apiserver=true,app=openshift-kube-apiserver,revision=13
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal    3/3     Running   0          19m   apiserver=true,app=openshift-kube-apiserver,revision=14

Manually roll out to unstick it:

[xxia@pres 2020-04-30 08:13:44 CST my]$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test'"$(date)"'" } ]'
kubeapiserver.operator.openshift.io/cluster patched

[xxia@pres 2020-04-30 08:13:53 CST my]$ oc get co | grep -v "True.*False.*False"
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE

[xxia@pres 2020-04-30 08:24:26 CST my]$ ogpkas
NAME                                                       READY   STATUS    RESTARTS   AGE     LABELS
kube-apiserver-ip-10-0-51-103.us-east-2.compute.internal   3/3     Running   0          8m29s   apiserver=true,app=openshift-kube-apiserver,revision=15
kube-apiserver-ip-10-0-63-68.us-east-2.compute.internal    3/3     Running   0          10m     apiserver=true,app=openshift-kube-apiserver,revision=15
kube-apiserver-ip-10-0-72-76.us-east-2.compute.internal    3/3     Running   0          6m46s   apiserver=true,app=openshift-kube-apiserver,revision=15

So far, unable to reproduce using the simplified scenario documented in comment 9 on either 4.4.0-0.nightly-2020-05-01-231319 or 4.5.0-0.nightly-2020-05-04-113741. The installer pod is always recreated when deleted, and when the node becomes 'Ready' the rollout continues without any further manual intervention.

Hi kewang, this cluster has been claimed; can you please reproduce the issue on a new cluster? I would like to debug it live. Please feel free to ping me on Slack when you have a cluster ready for me to investigate.

Hi Abu, I don't know how the envs above reproduced the issue; they came from an env someone else was using.

*** Bug 1822018 has been marked as a duplicate of this bug. ***

*** Bug 1822016 has been marked as a duplicate of this bug. ***

We have not been able to reproduce it. We usually use gcp/aws to provision clusters. If QE can reproduce it, we will do live debugging.

Hi Abu, the bug was reproduced:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.5.0-0.nightly-2020-05-22-011031 True True 70m Unable to apply 4.5.0-0.nightly-2020-05-22-011031: the cluster operator kube-apiserver is degraded
$ oc describe co kube-apiserver
...
Status:
Conditions:
Last Transition Time: 2020-05-22T06:46:58Z
Message: NodeInstallerDegraded: 1 nodes are failing on revision 6:
NodeInstallerDegraded: pods "installer-6-..0522-k9m4v-master-1" not found
Reason: NodeInstaller_InstallerPodFailed
Status: True
Type: Degraded
Last Transition Time: 2020-05-22T06:28:21Z
Message: NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 6
Reason: NodeInstaller
Status: True
Type: Progressing
Last Transition Time: 2020-05-22T06:24:18Z
Message: StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 5; 2 nodes are at revision 6
Reason: AsExpected
Status: True
Type: Available
Last Transition Time: 2020-05-22T06:21:21Z
...
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-5876985f46-ns9f7 | grep E0522
E0522 08:02:45.133616 1 base_controller.go:180] "InstallerController" controller failed to sync "key", err: missing required resources: [secrets: aggregator-client,bound-service-account-signing-key,external-loadbalancer-serving-certkey,internal-loadbalancer-serving-certkey,localhost-serving-cert-certkey,service-network-serving-certkey, secrets: etcd-client-6,kubelet-client-6,localhost-recovery-client-token-6,localhost-recovery-serving-certkey-6]
...
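To see at a glance why the clusteroperator stays Degraded, the individual checks used throughout this bug can be combined into one diagnostic pass. A minimal sketch, assuming jq is available; the nodeStatuses field is the per-node revision state that later comments quote.

# Operator conditions as the clusteroperator reports them:
oc get clusteroperator kube-apiserver -o json | jq '.status.conditions'
# Per-node revision state kept by the operator (currentRevision/targetRevision/lastFailedRevision):
oc get kubeapiserver cluster -o json | jq '.status.nodeStatuses'
# Error lines from the operator log (klog error lines start with "E"):
oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator | grep -E '^E[0-9]{4}'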
After I logged in, this is what I saw.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.5.0-0.nightly-2020-05-22-011031 True True 6h36m Unable to apply 4.5.0-0.nightly-2020-05-22-011031: the cluster operator openshift-samples is degraded
$ oc describe co kube-apiserver
Name: kube-apiserver
...
Status:
Conditions:
Last Transition Time: 2020-05-22T12:24:10Z
Message: NodeControllerDegraded: All master nodes are ready
Reason: AsExpected
Status: False
Type: Degraded
Last Transition Time: 2020-05-22T12:29:10Z
Message: NodeInstallerProgressing: 3 nodes are at revision 7
Reason: AsExpected
Status: False
Type: Progressing
Last Transition Time: 2020-05-22T06:24:18Z
Message: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 7
Reason: AsExpected
Status: True
Type: Available
Last Transition Time: 2020-05-22T06:21:21Z
Reason: AsExpected
Status: True
Type: Upgradeable
Extension: <nil>
Related Objects:
Group: operator.openshift.io
Name: cluster
Resource: kubeapiservers
Group: apiextensions.k8s.io
Name:
Resource: customresourcedefinitions
Group: security.openshift.io
Name:
Resource: securitycontextconstraints
Group:
Name: openshift-config
Resource: namespaces
Group:
Name: openshift-config-managed
Resource: namespaces
Group:
Name: openshift-kube-apiserver-operator
Resource: namespaces
Group:
Name: openshift-kube-apiserver
Resource: namespaces
Group: admissionregistration.k8s.io
Name:
Resource: mutatingwebhookconfigurations
Group: admissionregistration.k8s.io
Name:
Resource: validatingwebhookconfigurations
Versions:
Name: raw-internal
Version: 4.5.0-0.nightly-2020-05-22-011031
Name: kube-apiserver
Version: 1.18.2
Name: operator
Version: 4.5.0-0.nightly-2020-05-22-011031
Events: <none>
So after about 5.5 hours it fixed itself: the Last Transition Time for Degraded moved from 2020-05-22T06:46:58Z to 2020-05-22T12:24:10Z.
$ oc -n openshift-kube-apiserver-operator get pods
NAME READY STATUS RESTARTS AGE
kube-apiserver-operator-5876985f46-ns9f7 1/1 Running 0 6h27m
A few other CVO-enabled operators rolled out around the same time; not sure whether this was triggered manually or by the CVO.
Further debugging:
$ kubectl -n openshift-kube-apiserver-operator get pods kube-apiserver-operator-5876985f46-ns9f7 -o json | jq .status.startTime
"2020-05-22T08:02:37Z"
$ oc -n openshift-kube-apiserver-operator logs kube-apiserver-operator-5876985f46-ns9f7 | head -n 1
I0522 08:02:43.045256 1 cmd.go:196] Using service-serving-cert provided certificates
$ oc get kubeapiserver cluster -o json | jq .status
{
"conditions": [
{
"lastTransitionTime": "2020-05-22T08:05:05Z",
"message": "All master nodes are ready",
"reason": "MasterNodesReady",
"status": "False",
"type": "NodeControllerDegraded"
},
{
"lastTransitionTime": "2020-05-22T08:02:45Z",
"status": "False",
"type": "InstallerControllerDegraded"
},
{
"lastTransitionTime": "2020-05-22T12:24:10Z",
"status": "False",
"type": "NodeInstallerDegraded"
},
{
"lastTransitionTime": "2020-05-22T12:22:07Z",
"status": "False",
"type": "RevisionControllerDegraded"
},
{
"lastTransitionTime": "2020-05-22T08:02:59Z",
"status": "False",
"type": "TargetConfigControllerDegraded"
},
]
}
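The comparison being made here, operator pod start time versus condition transition times, can be repeated with two commands. A sketch only; the pod name is the one from this cluster and will differ elsewhere.

POD=kube-apiserver-operator-5876985f46-ns9f7   # pod name from this cluster; adjust as needed
# When the operator pod started:
oc -n openshift-kube-apiserver-operator get pod "$POD" -o json | jq .status.startTime
# When each operator condition last transitioned:
oc get kubeapiserver cluster -o json | jq '[.status.conditions[] | {type, status, lastTransitionTime}]'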
I inspected the log and constructed a timeline:
I0522 08:03:51.581423: status controller first detects that master-1 node is not ready, with the following error:
{
"lastTransitionTime": "2020-05-22T06:46:58Z",
"message": "NodeControllerDegraded: The master nodes not ready: node \"xiyuan0522-k9m4v-master-1\" not ready since 2020-05-22 08:03:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nNodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
"reason": "NodeController_MasterNodesReady::NodeInstaller_InstallerPodFailed",
"status": "True",
"type": "Degraded"
},
There are lots of throttling entries ("Throttling request took 2.37138117s") in between.
I0522 08:05:05.371494: status_controller.go:172 "master-1 not ready" is no longer reported in operator status:
{
"lastTransitionTime": "2020-05-22T06:46:58Z",
"message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
"reason": "NodeInstaller_InstallerPodFailed",
"status": "True",
"type": "Degraded"
},
I0522 08:05:05.397922: event.go:278 "OperatorStatusChanged" with the following message:
Degraded message changed from
"NodeControllerDegraded: The master nodes not ready: node \"xiyuan0522-k9m4v-master-1\" not ready since 2020-05-22 08:03:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nNodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found"
to
"NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found"
Clearly, at this point master-1 is reported as ready according to the operator, but "Degraded" remained "True" with the "NodeInstallerDegraded" error.
Something interesting: from this point on and up to 12:21:27.220404, for more than four hours, the log contains only
request.go:621] Throttling request took 1.195921374s
and no other entries at all! Is that expected if the cluster is left untouched?
The first activity we see after that is the following:
I0522 12:21:27.220404 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}):
type: 'Normal'
reason: 'TargetUpdateRequired'
"kube-controller-manager-client-cert-key" in "openshift-config-managed" requires a new target cert/key pair: past its refresh time 2020-05-22 12:21:26 +0000 UTC
I0522 12:22:07.558190 1 installer_controller.go:350] node xiyuan0522-k9m4v-master-1 is progressing towards 5 and needs new revision 7
I0522 12:22:07.558265 1 installer_controller.go:359] "xiyuan0522-k9m4v-master-1" moving to (v1.NodeStatus) {
NodeName: (string) (len=25) "xiyuan0522-k9m4v-master-1",
CurrentRevision: (int32) 5,
TargetRevision: (int32) 7,
LastFailedRevision: (int32) 6,
LastFailedRevisionErrors: ([]string) <nil>
}
I0522 12:22:07.558753 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}):
type: 'Normal' reason: 'RevisionCreate' Revision 6 created because secret/kubelet-client has changed
I0522 12:22:07.560168 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}):
type: 'Normal' reason: 'RevisionTriggered' new revision 7 triggered by "secret/kubelet-client has changed"
W0522 12:22:07.566273 1 staticpod.go:37] revision 7 is unexpectedly already the latest available revision. This is a possible race!
E0522 12:22:07.594347 1 base_controller.go:180] "RevisionController" controller failed to sync "key", err: conflicting latestAvailableRevision 7
I0522 12:22:07.596637 1 status_controller.go:172:
{
"lastTransitionTime": "2020-05-22T06:46:58Z",
"message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found\nRevisionControllerDegraded: conflicting latestAvailableRevision 7",
"reason": "NodeInstaller_InstallerPodFailed::RevisionController_Error",
"status": "True",
"type": "Degraded"
},
I0522 12:22:07.630223 1 status_controller.go:172]:
{
"lastTransitionTime": "2020-05-22T06:46:58Z",
"message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: pods \"installer-6-xiyuan0522-k9m4v-master-1\" not found",
"reason": "NodeInstaller_InstallerPodFailed",
"status": "True",
"type": "Degraded"
},
I0522 12:22:07.722369 1 status_controller.go:172]
{
"lastTransitionTime": "2020-05-22T06:46:58Z",
"message": "NodeInstallerDegraded: 1 nodes are failing on revision 6:\nNodeInstallerDegraded: ",
"reason": "NodeInstaller_InstallerPodFailed",
"status": "True",
"type": "Degraded"
},
I0522 12:22:12.086708 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-7-xiyuan0522-k9m4v-master-1 -n openshift-kube-apiserver because it was missing
I0522 12:22:29.869235 1 installer_controller.go:331] "xiyuan0522-k9m4v-master-1" is in transition to 7, but has not made progress because static pod is pending
I0522 12:24:10.471080 installer_controller.go:318] "xiyuan0522-k9m4v-master-1" moving to
(v1.NodeStatus) {
NodeName: (string) (len=25) "xiyuan0522-k9m4v-master-1",
CurrentRevision: (int32) 7,
TargetRevision: (int32) 0,
LastFailedRevision: (int32) 0,
LastFailedRevisionErrors: ([]string) <nil>
} because static pod is ready
I0522 12:24:10.492913 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"565063d0-d550-4655-89b8-8e2e9c39379f", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeCurrentRevisionChanged' Updated node "xiyuan0522-k9m4v-master-1" from revision 5 to 7 because static pod is ready
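The timeline above was reconstructed by reading the operator log; filtering for the controller and event lines quoted in it gives roughly the same view. A sketch, with the grep pattern chosen only to match the kinds of lines shown above.

# Pull out the log lines used to build the timeline (status updates, installer-controller
# transitions, emitted events), dropping the client-side throttling noise:
oc -n openshift-kube-apiserver-operator logs deploy/kube-apiserver-operator \
  | grep -E 'status_controller|installer_controller|event\.go|base_controller' \
  | grep -v 'Throttling request'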
A couple of things to investigate further:
- Why does NodeInstallerDegraded not clear up?
- The roughly four-hour pause with no events reported and no log entries at all: is this expected even when the cluster is quiet?
Please attach must-gather output next time this happens. We cannot rely on live clusters to debug this.

(In reply to Stefan Schimanski from comment #22)
> Please attach must-gather output next time this happens. We cannot rely on live clusters to debug this.

OK, sorry about that. I thought a live cluster would be more intuitive to debug.

(In reply to Luis Sanchez from comment #10)
> So far, unable to reproduce using the simplified scenario documented in comment 9 on either 4.4.0-0.nightly-2020-05-01-231319 or 4.5.0-0.nightly-2020-05-04-113741.
> installer pod is always recreated when deleted

You may need to repeat the deletion a second or third time until the pod is no longer recreated.

Moving this out of the blocker list; we will revisit when we get must-gather logs (or another live cluster in time).

We don't have a must-gather or a live cluster to debug yet, so applying the UpcomingSprint label.

This turned up in 4.4.5 -> 4.4.10 CI [1,2], so we have all the usual gathered assets, including a must-gather [3].

[1]: https://github.com/openshift/cincinnati-graph-data/pull/295#issuecomment-651265210
[2]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1275681394676207616/artifacts/launch/must-gather.tar

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. We will revisit it in a future sprint.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Removing LifecycleStale per the recent reproduction reported in the comments above.

> The operator is reporting a misleading status in clusteroperator object. I am assuming this will not block upgrade. (correct?)
I believe Degraded blocks minor version upgrades, but not patch version upgrades.
The stuck operator is probably due to https://bugzilla.redhat.com/show_bug.cgi?id=1858763. For the condition message we lack the must-gather data to understand what was going on over time.

*** This bug has been marked as a duplicate of bug 1858763 ***

Confirmed that c27 was the same issue as in https://bugzilla.redhat.com/show_bug.cgi?id=1858763; one of the nodes has this state:

- currentRevision: 9
  lastFailedRevision: 10
  lastFailedRevisionErrors:
  - pods "installer-10-ip-10-0-145-36.us-west-1.compute.internal" not found
  nodeName: ip-10-0-145-36.us-west-1.compute.internal
  targetRevision: 9

"targetRevision == currentRevision" together with "lastFailedRevision: 10" is the symptom of the deadlock in https://bugzilla.redhat.com/show_bug.cgi?id=1858763.

Glad to see the bug found its fix. This issue was originally found in 4.4, for which I opened this bug 1817419; it was also formerly found in 4.3 & 4.5, with the clones 4.3 bug 1822016 and 4.4 bug 1822018. The 4.3 and 4.4 clones were closed when the fix had not been figured out at that time. I don't object to keeping the 4.3 & 4.4 clones closed as is, but I think at least the latest release, 4.5, needs the backport given the severity and frequency of this issue as we have seen it. So I'm moving my bug 1817419 Target Release to 4.5.z instead of closing it as a duplicate.

*** This bug has been marked as a duplicate of bug 1874597 ***
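The deadlock symptom described above (targetRevision back at currentRevision while lastFailedRevision is newer) can be checked for directly. A sketch using jq over the fields quoted in the node state above; it only prints matching nodes and changes nothing.

# List nodes showing the bug 1858763 deadlock symptom:
oc get kubeapiserver cluster -o json \
  | jq '.status.nodeStatuses[]
        | select(.currentRevision == .targetRevision and .lastFailedRevision > .currentRevision)
        | {nodeName, currentRevision, targetRevision, lastFailedRevision}'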
Description of problem:
Operator should auto detect and retry quickly once the cause of NodeInstallerDegraded disappears, instead of getting stuck in Degraded.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-25-223508

How reproducible:
So far hit once

Steps to Reproduce:
1. Install a fresh env of the latest 4.4; the env matrix is upi-on-gcp, disconnected-remove_rhcos_worker-fips-ovn.
2. After the installation succeeds, check `oc get co` and find:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True   True   True   144m
3. Check `oc get po -n openshift-kube-apiserver --show-labels`; all pods under openshift-kube-apiserver are:
NAME                                                                     READY   STATUS      RESTARTS   AGE   LABELS
installer-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal         0/1     Completed   0          68m   app=installer
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal      4/4     Running     0          68m   apiserver=true,app=openshift-kube-apiserver,revision=8
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal      4/4     Running     4          86m   apiserver=true,app=openshift-kube-apiserver,revision=7
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal      4/4     Running     4          88m   apiserver=true,app=openshift-kube-apiserver,revision=7
revision-pruner-7-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          82m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   0/1     Completed   0          73m   app=pruner
revision-pruner-7-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   0/1     Completed   0          66m   app=pruner
revision-pruner-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   0/1     Completed   0          65m   app=pruner

Check `oc logs deploy/kube-apiserver-operator -n openshift-kube-apiserver-operator`, which shows:
... NodeInstallerDegraded: pods \"installer-8-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not found\nNodeControllerDegraded: The master nodes not ready: node \"qe-yapei44debug-03260202-m-2.c.openshift-qe.internal\" not ready since 2020-03-26 07:12:00 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network)"...

But `oc get no` shows no issue:
NAME                                                   STATUS   ROLES    AGE    VERSION
qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   Ready    master   114m   v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-0                  Ready    worker   34m    v1.17.1
qe-yapei44debug-03260202-w-a-l-rhel-1                  Ready    worker   34m    v1.17.1

Networking QE also helped debug; the network also has no issue.

4. Force the operator to retry rolling out with:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'
After a few minutes, co/kube-apiserver quickly becomes normal:
kube-apiserver   4.4.0-0.nightly-2020-03-25-223508   True   False   False   173m
Pods also become normal:
$ ogpkas
kube-apiserver-qe-yapei44debug-03260202-m-0.c.openshift-qe.internal   4/4   Running   0   6m28s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-1.c.openshift-qe.internal   4/4   Running   0   8m23s   apiserver=true,app=openshift-kube-apiserver,revision=9
kube-apiserver-qe-yapei44debug-03260202-m-2.c.openshift-qe.internal   4/4   Running   0   10m     apiserver=true,app=openshift-kube-apiserver,revision=9

Actual results:
3. Operator stuck rolling out the static pods even though step 4 shows the cause is gone.

Expected results:
3. Operator should auto-detect that the cause is gone and automatically retry the rollout.

Additional info:
Bug is filed from https://coreos.slack.com/archives/CH76YSYSC/p1585214604225000?thread_ts=1585210139.199800&cid=CH76YSYSC with discussion there.
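For reference, the manual reproducer from comment 9 boils down to the sequence below. This is a rough consolidation of the steps shown earlier in this bug (exercised there on 4.3), not an exact transcript; it assumes a disposable test cluster and two terminals.

# Terminal A (local client): trigger a new revision.
oc patch kubeapiserver/cluster --type=json -p '[{"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test '"$(date)"'"}]'
# Terminal B (on one master): make the node NotReady while the revision rolls out.
systemctl stop crio.service
# Terminal A: delete the now-Pending installer pod for that master (repeat if it is recreated).
oc -n openshift-kube-apiserver delete pod <installer-N-master-node> --force --grace-period=0
# Terminal B: recover the node.
systemctl start crio.service
systemctl restart kubelet
# Terminal A: the node returns to Ready, but co/kube-apiserver may stay stuck in Degraded (the bug).
oc get nodes
oc get co kube-apiserver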