Bug 1946853
| Summary: | Machine-config cluster operator in degraded state during the 4.7.5 -> 4.8 nightly upgrades | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Status: | CLOSED WORKSFORME | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | unspecified | Priority: | high |
| Version: | 4.7 | CC: | kewang, nelluri, smilner |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | Hardware: | Unspecified |
| OS: | Linux | Whiteboard: | aos-scalability-48 |
| Last Closed: | 2021-05-12 11:41:43 UTC | Type: | Bug |
Description
Naga Ravi Chaitanya Elluri
2021-04-07 03:09:24 UTC
---

We are aware of this and this is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1933772

*** This bug has been marked as a duplicate of bug 1933772 ***

---

The duplicate of bug 1933772 was verified, but we still hit this problem when upgrading from 4.7 to a 4.8 nightly, so I have to reopen the bug.

Upgrade command:

    ./oc adm upgrade --to-image=vmc.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.8.0-0.nightly-2021-04-25-110331 --force=true --allow-explicit-upgrade=true

    $ oc get node
    NAME              STATUS                        ROLES    AGE    VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
    compute-0         Ready                         worker   4h26m  v1.20.0+7d0a2b2   172.31.246.44   172.31.246.44   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    compute-1         NotReady,SchedulingDisabled   worker   4h26m  v1.20.0+7d0a2b2   172.31.246.61   172.31.246.61   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    control-plane-0   Ready                         master   4h39m  v1.20.0+7d0a2b2   172.31.246.28   172.31.246.28   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    control-plane-1   NotReady,SchedulingDisabled   master   4h39m  v1.20.0+7d0a2b2   172.31.246.52   172.31.246.52   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    control-plane-2   Ready                         master   4h39m  v1.20.0+7d0a2b2   172.31.246.41   172.31.246.41   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8

    $ oc get co
    NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
    authentication                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
    baremetal                                  4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    cloud-credential                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h39m
    cluster-autoscaler                         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    config-operator                            4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    console                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    csi-snapshot-controller                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      149m
    dns                                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
    etcd                                       4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
    image-registry                             4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
    ingress                                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h25m
    insights                                   4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h31m
    kube-apiserver                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h31m
    kube-controller-manager                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h32m
    kube-scheduler                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
    kube-storage-version-migrator              4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
    machine-api                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h25m
    machine-approver                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    machine-config                             4.7.0-0.nightly-2021-04-25-102429   False       True          True       132m
    marketplace                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h8m
    monitoring                                 4.8.0-0.nightly-2021-04-25-110331   False       True          True       83m
    network                                    4.8.0-0.nightly-2021-04-25-110331   True        True          True       4h38m
    node-tuning                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    openshift-apiserver                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
    openshift-controller-manager               4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h33m
    openshift-samples                          4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    operator-lifecycle-manager                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    service-ca                                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    storage                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h10m
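In this output, machine-config is the one operator still reporting the old 4.7 version, and both it and monitoring are unavailable and degraded. As a quick triage sketch (assuming access to the same cluster with this kubeconfig, and that the MCO annotates nodes with the usual machineconfiguration.openshift.io config keys), one might check which pools and nodes are holding up the rollout:

    # Pool-level view: the UPDATED/UPDATING/DEGRADED columns show which
    # machine config pool (master/worker) is stuck mid-upgrade.
    oc get mcp

    # Node-level view: compare each node's current vs. desired rendered
    # config; nodes that never rebooted into the new config will lag here.
    oc get nodes -o custom-columns='NAME:.metadata.name,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig'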
    $ oc describe co/machine-config
    Name:         machine-config
    Namespace:
    Labels:       <none>
    Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
                  include.release.openshift.io/self-managed-high-availability: true
                  include.release.openshift.io/single-node-developer: true
    API Version:  config.openshift.io/v1
    Kind:         ClusterOperator
    Metadata:
      Creation Timestamp:  2021-04-25T16:42:19Z
      Generation:          1
      Managed Fields:
        API Version:  config.openshift.io/v1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .:
              f:exclude.release.openshift.io/internal-openshift-hosted:
              f:include.release.openshift.io/self-managed-high-availability:
              f:include.release.openshift.io/single-node-developer:
          f:spec:
          f:status:
            .:
            f:versions:
        Manager:      cluster-version-operator
        Operation:    Update
        Time:         2021-04-25T16:42:19Z
        API Version:  config.openshift.io/v1
        Fields Type:  FieldsV1
        fieldsV1:
          f:status:
            f:conditions:
            f:extension:
              .:
              f:master:
              f:worker:
            f:relatedObjects:
            f:versions:
        Manager:         machine-config-operator
        Operation:       Update
        Time:            2021-04-25T20:05:22Z
      Resource Version:  141689
      UID:               cdf4f3fa-b3ca-4abe-bdb2-63dff6efbe9a
    Spec:
    Status:
      Conditions:
        Last Transition Time:  2021-04-25T19:01:50Z
        Message:               Working towards 4.8.0-0.nightly-2021-04-25-110331
        Status:                True
        Type:                  Progressing
        Last Transition Time:  2021-04-25T20:05:22Z
        Message:               Unable to apply 4.8.0-0.nightly-2021-04-25-110331: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 3, unavailable: 2)
        Reason:                MachineConfigDaemonFailed
        Status:                True
        Type:                  Degraded
        Last Transition Time:  2021-04-25T19:11:51Z
        Message:               Cluster not available for 4.8.0-0.nightly-2021-04-25-110331
        Status:                False
        Type:                  Available
        Last Transition Time:  2021-04-25T19:55:23Z
        Message:               One or more machine config pools are updating, please see `oc get mcp` for further details
        Reason:                PoolUpdating
        Status:                False
        Type:                  Upgradeable
      Extension:
        Master:  0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-3769e7e060d5890610360c5d5513eaa8
        Worker:  0 (ready 0) out of 2 nodes are updating to latest configuration rendered-worker-5570462504cf4f902167d5d3ce228ac4
      Related Objects:
        Group:
        Name:      openshift-machine-config-operator
        Resource:  namespaces
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  machineconfigpools
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  controllerconfigs
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  kubeletconfigs
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  containerruntimeconfigs
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  machineconfigs
        Group:
        Name:
        Resource:  nodes
        Group:
        Name:      openshift-kni-infra
        Resource:  namespaces
        Group:
        Name:      openshift-openstack-infra
        Resource:  namespaces
        Group:
        Name:      openshift-ovirt-infra
        Resource:  namespaces
        Group:
        Name:      openshift-vsphere-infra
        Resource:  namespaces
      Versions:
        Name:     operator
        Version:  4.7.0-0.nightly-2021-04-25-102429
    Events:  <none>
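The Degraded condition above blames the machine-config-daemon DaemonSet rollout (desired: 5, updated: 5, ready: 3, unavailable: 2), which lines up with the two NotReady,SchedulingDisabled nodes. A minimal sketch for confirming that mapping, assuming the daemon pods carry the usual k8s-app=machine-config-daemon label:

    # DaemonSet rollout state; NUMBER READY should match DESIRED (here 3 vs 5).
    oc -n openshift-machine-config-operator get daemonset machine-config-daemon

    # Map the not-ready daemon pods to their nodes; the pods scheduled on
    # compute-1 and control-plane-1 are the expected unavailable ones.
    oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide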
---

Hi Ke Wang, I see no indication that the issue in the BZ is related to your must-gather. From what you pointed out, the error is listed as:

    Unable to apply 4.8.0-0.nightly-2021-04-25-110331: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 3, unavailable: 2)

which at a glance looks like the same issue. But if you look at the nodes, there are 1 master and 1 worker in NotReady,SchedulingDisabled, which means they never came back up after the reboot. The daemon logs for those nodes are also empty, indicating that this is not the same issue: instead of the pod itself crashlooping, it isn't running at all. So it seems the 2 nodes didn't update properly; they could be stuck in a booting phase, or have failed services (e.g. kubelet) such that they are unable to rejoin the cluster. Are you able to access the console or ssh into either of those nodes to see what may be going on? If we find something else as the cause, we should perhaps open another BZ.

---

jerzhang, you are right. I dug deeper and found that the root cause is the following error:

    namespaces/openshift-monitoring/pods/cluster-monitoring-operator-8767f9b5d-xfdcg/cluster-monitoring-operator/cluster-monitoring-operator/logs/current.log:85:2021-04-25T20:05:19.044608892Z E0425 20:05:19.044589 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 2 unavailable nodes

I reopened bug 1937888 and will close this bug.
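For reference, the root-cause line quoted above is a cluster-monitoring-operator log entry from a must-gather. A hedged sketch of how to locate it, and of the console/ssh follow-up suggested earlier, assuming the must-gather was extracted to ./must-gather and that the RHCOS nodes accept SSH as the core user (the IP below is compute-1's from the node listing):

    # Search the unpacked must-gather for the failed node-exporter rollout;
    # this matches the current.log line quoted in the closing comment.
    grep -rn "waiting for DaemonSetRollout" ./must-gather

    # On a node that never rejoined the cluster, the kubelet journal is the
    # first place to look for failed services after the upgrade reboot.
    ssh core@172.31.246.61 'sudo journalctl -u kubelet --no-pager | tail -n 50'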