Bug 1946853
| Summary: | Machine-config cluster operator in degraded state during the 4.7.5 -> 4.8 nightly upgrades | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Status: | CLOSED WORKSFORME | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | unspecified | Priority: | high |
| Version: | 4.7 | CC: | kewang, nelluri, smilner |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | Hardware: | Unspecified |
| OS: | Linux | Whiteboard: | aos-scalability-48 |
| Last Closed: | 2021-05-12 11:41:43 UTC | Type: | Bug |
Description
Naga Ravi Chaitanya Elluri
2021-04-07 03:09:24 UTC
---

We are aware of this and this is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1933772

*** This bug has been marked as a duplicate of bug 1933772 ***

---

The duplicate of bug 1933772 was verified, but we still hit this problem when upgrading from 4.7 to a 4.8 nightly, so I have to reopen the bug.

Upgrade command:

    ./oc adm upgrade --to-image=vmc.mirror-registry.qe.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.8.0-0.nightly-2021-04-25-110331 --force=true --allow-explicit-upgrade=true

    $ oc get node
    NAME              STATUS                        ROLES    AGE    VERSION           INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
    compute-0         Ready                         worker   4h26m  v1.20.0+7d0a2b2   172.31.246.44   172.31.246.44   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    compute-1         NotReady,SchedulingDisabled   worker   4h26m  v1.20.0+7d0a2b2   172.31.246.61   172.31.246.61   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    control-plane-0   Ready                         master   4h39m  v1.20.0+7d0a2b2   172.31.246.28   172.31.246.28   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    control-plane-1   NotReady,SchedulingDisabled   master   4h39m  v1.20.0+7d0a2b2   172.31.246.52   172.31.246.52   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
    control-plane-2   Ready                         master   4h39m  v1.20.0+7d0a2b2   172.31.246.41   172.31.246.41   Red Hat Enterprise Linux CoreOS 47.83.202104250838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8

    $ oc get co
    NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
    authentication                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
    baremetal                                  4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    cloud-credential                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h39m
    cluster-autoscaler                         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    config-operator                            4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    console                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    csi-snapshot-controller                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      149m
    dns                                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
    etcd                                       4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
    image-registry                             4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
    ingress                                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h25m
    insights                                   4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h31m
    kube-apiserver                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h31m
    kube-controller-manager                    4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h32m
    kube-scheduler                             4.8.0-0.nightly-2021-04-25-110331   True        False         True       4h33m
    kube-storage-version-migrator              4.8.0-0.nightly-2021-04-25-110331   True        False         False      88m
    machine-api                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h25m
    machine-approver                           4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    machine-config                             4.7.0-0.nightly-2021-04-25-102429   False       True          True       132m
    marketplace                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h8m
    monitoring                                 4.8.0-0.nightly-2021-04-25-110331   False       True          True       83m
    network                                    4.8.0-0.nightly-2021-04-25-110331   True        True          True       4h38m
    node-tuning                                4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    openshift-apiserver                        4.8.0-0.nightly-2021-04-25-110331   True        False         True       84m
    openshift-controller-manager               4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h33m
    openshift-samples                          4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    operator-lifecycle-manager                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-04-25-110331   True        False         False      153m
    service-ca                                 4.8.0-0.nightly-2021-04-25-110331   True        False         False      4h38m
    storage                                    4.8.0-0.nightly-2021-04-25-110331   True        False         False      3h10m
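In this output, machine-config is the one operator still reporting the old 4.7 version, and both it and monitoring are unavailable and degraded. As a quick triage sketch (assuming access to the same cluster with this kubeconfig, and that the MCO annotates nodes with the usual machineconfiguration.openshift.io config keys), one might check which pools and nodes are holding up the rollout:

    # Pool-level view: the UPDATED/UPDATING/DEGRADED columns show which
    # machine config pool (master/worker) is stuck mid-upgrade.
    oc get mcp

    # Node-level view: compare each node's current vs. desired rendered
    # config; nodes that never rebooted into the new config will lag here.
    oc get nodes -o custom-columns='NAME:.metadata.name,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig'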
    $ oc describe co/machine-config
    Name:         machine-config
    Namespace:
    Labels:       <none>
    Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
                  include.release.openshift.io/self-managed-high-availability: true
                  include.release.openshift.io/single-node-developer: true
    API Version:  config.openshift.io/v1
    Kind:         ClusterOperator
    Metadata:
      Creation Timestamp:  2021-04-25T16:42:19Z
      Generation:          1
      Managed Fields:
        API Version:  config.openshift.io/v1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .:
              f:exclude.release.openshift.io/internal-openshift-hosted:
              f:include.release.openshift.io/self-managed-high-availability:
              f:include.release.openshift.io/single-node-developer:
          f:spec:
          f:status:
            .:
            f:versions:
        Manager:      cluster-version-operator
        Operation:    Update
        Time:         2021-04-25T16:42:19Z
        API Version:  config.openshift.io/v1
        Fields Type:  FieldsV1
        fieldsV1:
          f:status:
            f:conditions:
            f:extension:
              .:
              f:master:
              f:worker:
            f:relatedObjects:
            f:versions:
        Manager:         machine-config-operator
        Operation:       Update
        Time:            2021-04-25T20:05:22Z
      Resource Version:  141689
      UID:               cdf4f3fa-b3ca-4abe-bdb2-63dff6efbe9a
    Spec:
    Status:
      Conditions:
        Last Transition Time:  2021-04-25T19:01:50Z
        Message:               Working towards 4.8.0-0.nightly-2021-04-25-110331
        Status:                True
        Type:                  Progressing
        Last Transition Time:  2021-04-25T20:05:22Z
        Message:               Unable to apply 4.8.0-0.nightly-2021-04-25-110331: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 3, unavailable: 2)
        Reason:                MachineConfigDaemonFailed
        Status:                True
        Type:                  Degraded
        Last Transition Time:  2021-04-25T19:11:51Z
        Message:               Cluster not available for 4.8.0-0.nightly-2021-04-25-110331
        Status:                False
        Type:                  Available
        Last Transition Time:  2021-04-25T19:55:23Z
        Message:               One or more machine config pools are updating, please see `oc get mcp` for further details
        Reason:                PoolUpdating
        Status:                False
        Type:                  Upgradeable
      Extension:
        Master:  0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-3769e7e060d5890610360c5d5513eaa8
        Worker:  0 (ready 0) out of 2 nodes are updating to latest configuration rendered-worker-5570462504cf4f902167d5d3ce228ac4
      Related Objects:
        Group:
        Name:      openshift-machine-config-operator
        Resource:  namespaces
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  machineconfigpools
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  controllerconfigs
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  kubeletconfigs
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  containerruntimeconfigs
        Group:     machineconfiguration.openshift.io
        Name:
        Resource:  machineconfigs
        Group:
        Name:
        Resource:  nodes
        Group:
        Name:      openshift-kni-infra
        Resource:  namespaces
        Group:
        Name:      openshift-openstack-infra
        Resource:  namespaces
        Group:
        Name:      openshift-ovirt-infra
        Resource:  namespaces
        Group:
        Name:      openshift-vsphere-infra
        Resource:  namespaces
      Versions:
        Name:     operator
        Version:  4.7.0-0.nightly-2021-04-25-102429
    Events:  <none>
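The Degraded condition above blames the machine-config-daemon DaemonSet rollout (desired: 5, updated: 5, ready: 3, unavailable: 2), which lines up with the two NotReady,SchedulingDisabled nodes. A minimal sketch for confirming that mapping, assuming the daemon pods carry the usual k8s-app=machine-config-daemon label:

    # DaemonSet rollout state; NUMBER READY should match DESIRED (here 3 vs 5).
    oc -n openshift-machine-config-operator get daemonset machine-config-daemon

    # Map the not-ready daemon pods to their nodes; the pods scheduled on
    # compute-1 and control-plane-1 are the expected unavailable ones.
    oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide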
---

Hi Ke Wang, I see no indication that the issue in the BZ is related to your must-gather. From what you pointed out, the error is listed as:

    Unable to apply 4.8.0-0.nightly-2021-04-25-110331: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 5, updated: 5, ready: 3, unavailable: 2)

which at a glance looks like the same issue. But if you look at the nodes, there are 1 master and 1 worker in NotReady,SchedulingDisabled, which means they never came back up after the reboot. The daemon logs for those nodes are also empty, indicating that this is not the same issue: instead of the pod itself crashlooping, it isn't running at all. So it seems the 2 nodes didn't update properly; they could be stuck in a booting phase, or have failed services (e.g. kubelet) such that they are unable to rejoin the cluster. Are you able to access the console or ssh into either of those nodes to see what may be going on? If we find something else as the cause, we should perhaps open another BZ.

---

jerzhang, you are right. I dug deeper and found that the root cause is the following error:

    namespaces/openshift-monitoring/pods/cluster-monitoring-operator-8767f9b5d-xfdcg/cluster-monitoring-operator/cluster-monitoring-operator/logs/current.log:85:2021-04-25T20:05:19.044608892Z E0425 20:05:19.044589 1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 2 unavailable nodes

I reopened bug 1937888 and will close this bug.
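For reference, the root-cause line quoted above is a cluster-monitoring-operator log entry from a must-gather. A hedged sketch of how to locate it, and of the console/ssh follow-up suggested earlier, assuming the must-gather was extracted to ./must-gather and that the RHCOS nodes accept SSH as the core user (the IP below is compute-1's from the node listing):

    # Search the unpacked must-gather for the failed node-exporter rollout;
    # this matches the current.log line quoted in the closing comment.
    grep -rn "waiting for DaemonSetRollout" ./must-gather

    # On a node that never rejoined the cluster, the kubelet journal is the
    # first place to look for failed services after the upgrade reboot.
    ssh core@172.31.246.61 'sudo journalctl -u kubelet --no-pager | tail -n 50'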