Bug 1817455
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Node goes to degraded status when machine-config-daemon moves a file across filesystems | | |
| Product: | OpenShift Container Platform | Reporter: | Antonio Murdaca <amurdaca> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | urgent | Priority: | unspecified |
| Version: | 4.4 | CC: | alukiano, amurdaca, bparees, ccoleman, dshchedr, grajaiya, jhou, jiajliu, jiazha, jkaur, kgarriso, lmohanty, mifiedle, mnguyen, nschuetz, rpattath, skordas, wking, wsun, wzheng, yanpzhan |
| Target Milestone: | --- | Target Release: | 4.4.0 |
| Keywords: | Regression, TestBlocker | Hardware: | Unspecified |
| OS: | Unspecified | Doc Type: | If docs needed, set a value |
| Clone Of: | 1814397 | Last Closed: | 2020-05-04 11:47:28 UTC |
| Bug Depends On: | 1814397 | Bug Blocks: | 1817458 |
|
Comment 1
W. Trevor King
2020-03-27 01:21:22 UTC
*** Bug 1817847 has been marked as a duplicate of this bug. ***

Please make the bug description public unless it contains sensitive information; a private description impairs searching for bugs from CI tooling.

The upstream bug has Upgrade Impact Assessment questions; please answer them on the upstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=1814397#c14

*** Bug 1819232 has been marked as a duplicate of this bug. ***

Reproduced the bug when upgrading from 4.3.9 to a 4.4 nightly:
$ oc get node
NAME STATUS ROLES AGE VERSION
qe-upg-share-mmc2q-compute-0 Ready worker 19h v1.17.1
qe-upg-share-mmc2q-compute-1 Ready worker 19h v1.17.1
qe-upg-share-mmc2q-compute-2 Ready worker 19h v1.17.1
qe-upg-share-mmc2q-control-plane-0 Ready,SchedulingDisabled master 19h v1.16.2
qe-upg-share-mmc2q-control-plane-1 Ready master 19h v1.16.2
qe-upg-share-mmc2q-control-plane-2 Ready master 19h v1.16.2
$ oc get co machine-config -o yaml
<--snip-->
status:
conditions:
- lastTransitionTime: "2020-04-03T07:16:10Z"
message: Cluster not available for 4.4.0-0.nightly-2020-04-02-130551
status: "False"
type: Available
- lastTransitionTime: "2020-04-03T07:02:01Z"
message: Working towards 4.4.0-0.nightly-2020-04-02-130551
status: "True"
type: Progressing
- lastTransitionTime: "2020-04-03T07:16:10Z"
message: 'Unable to apply 4.4.0-0.nightly-2020-04-02-130551: timed out waiting
for the condition during syncRequiredMachineConfigPools: pool master has not
progressed to latest configuration: controller version mismatch for rendered-master-1d1b2a692c950078690d8b3b215bec2f
expected a7b13759061f645a76f03c04d385d275bbbd0c02 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca,
retrying'
reason: RequiredPoolsFailed
status: "True"
type: Degraded
<--snip-->
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-1d1b2a692c950078690d8b3b215bec2f False True True 3 0 0 1 20h
worker rendered-worker-72f43e4889519a6ede04333776de8d32 True False False 3 3 3 0 20h
$ oc get mcp master -o yaml
<--snip-->
- lastTransitionTime: "2020-04-03T07:08:32Z"
message: 'Node qe-upg-share-mmc2q-control-plane-0 is reporting: "rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig
/usr/local/bin/etcd-member-add.sh: invalid cross-device link"'
reason: 1 nodes are reporting degraded status on sync
status: "True"
type: NodeDegraded
<--snip-->
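The "invalid cross-device link" text is how Go's os.Rename surfaces EXDEV from rename(2): a rename cannot move a file between filesystems, and here the backup under /etc/machine-config-daemon/orig and the restore target under /usr/local/bin evidently sat on different mounts. Below is a minimal sketch in Go of a cross-device-safe move; the moveFile helper and its copy-then-remove fallback are illustrative assumptions, not the MCO's actual fix.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"syscall"
)

// moveFile renames src to dst. rename(2) cannot cross filesystems: it
// fails with EXDEV, which os.Rename reports as "invalid cross-device
// link". In that one case, fall back to copy-then-remove.
func moveFile(src, dst string) error {
	err := os.Rename(src, dst)
	if err == nil {
		return nil
	}
	var linkErr *os.LinkError
	if !errors.As(err, &linkErr) || linkErr.Err != syscall.EXDEV {
		return err // a genuine failure, not a cross-device move
	}
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	info, err := in.Stat()
	if err != nil {
		return err
	}
	// Recreate dst with the source's mode, copy the bytes, then drop src.
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, info.Mode())
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	if err := out.Close(); err != nil {
		return err
	}
	return os.Remove(src)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: movefile SRC DST")
		os.Exit(2)
	}
	if err := moveFile(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```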
*** Bug 1821369 has been marked as a duplicate of this bug. ***

*** Bug 1821364 has been marked as a duplicate of this bug. ***

*** Bug 1821716 has been marked as a duplicate of this bug. ***

Upgraded from 4.3.9 to nightly 4.4.0-0.nightly-2020-04-07-130324 successfully and no longer hit the issue in this bug, so moving the bug to Verified.

Hit the issue again on a 4.2.29 -> 4.3.18 -> 4.4.1 upgrade. After upgrading from v4.2.29 to v4.3.18 successfully, we continued upgrading the cluster to v4.4.1, but the upgrade failed.
# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.4.1: the cluster operator openshift-apiserver is degraded
# ./oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.4.1 True False False 4h56m
cloud-credential 4.4.1 True False False 5h8m
cluster-autoscaler 4.4.1 True False False 5h2m
console 4.4.1 True False False 137m
csi-snapshot-controller 4.4.1 True False False 123m
dns 4.4.1 True False False 5h8m
etcd 4.4.1 True False False 145m
image-registry 4.4.1 True False False 117m
ingress 4.4.1 True False False 117m
insights 4.4.1 True False False 5h8m
kube-apiserver 4.4.1 True False False 5h7m
kube-controller-manager 4.4.1 True False False 152m
kube-scheduler 4.4.1 True False False 152m
kube-storage-version-migrator 4.4.1 True False False 123m
machine-api 4.4.1 True False False 5h9m
machine-config 4.3.18 False True True 116m
marketplace 4.4.1 True False False 144m
monitoring 4.4.1 True False False 3h42m
network 4.4.1 True False False 5h7m
node-tuning 4.4.1 True False False 145m
openshift-apiserver 4.4.1 True False True 137m
openshift-controller-manager 4.4.1 True False False 5h7m
openshift-samples 4.4.1 False True True 1s
operator-lifecycle-manager 4.4.1 True False False 5h2m
operator-lifecycle-manager-catalog 4.4.1 True False False 5h5m
operator-lifecycle-manager-packageserver 4.4.1 True False False 136m
service-ca 4.4.1 True False False 5h8m
service-catalog-apiserver 4.4.1 True False False 137m
service-catalog-controller-manager 4.4.1 True False False 4h7m
storage 4.4.1 True False False 145m
Checked that openshift-apiserver is degraded because one of the masters has scheduling disabled.
# ./oc get node
NAME STATUS ROLES AGE VERSION
ugdci2-x9ljw-m-0.c.openshift-qe.internal Ready,SchedulingDisabled master 5h11m v1.16.2
ugdci2-x9ljw-m-1.c.openshift-qe.internal Ready master 5h11m v1.16.2
ugdci2-x9ljw-m-2.c.openshift-qe.internal Ready master 5h11m v1.16.2
ugdci2-x9ljw-w-a-frg9s.c.openshift-qe.internal Ready worker 5h5m v1.17.1
ugdci2-x9ljw-w-b-qggrc.c.openshift-qe.internal Ready worker 5h5m v1.17.1
ugdci2-x9ljw-w-c-7knmr.c.openshift-qe.internal Ready worker 5h6m v1.17.1
# ./oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-51bc0454ee0f0a886b5812eb225f400e False True True 3 0 0 1 5h13m
worker rendered-worker-1893f230e08f250db257307b8a6db414 True False False 3 3 3 0 5h13m
Status conditions on the master pool:
<--snip-->
Reason:
Status: False
Type: Updated
Last Transition Time: 2020-04-30T08:47:50Z
Message: All nodes are updating to rendered-master-8cf84e0dd6e3e8e6b2a0d533d74074d8
Reason:
Status: True
Type: Updating
Last Transition Time: 2020-04-30T08:48:09Z
Message: Node ugdci2-x9ljw-m-0.c.openshift-qe.internal is reporting: "rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link"
Reason: 1 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded
Last Transition Time: 2020-04-30T08:48:09Z
Message:
Reason:
Status: True
Type: Degraded
I can successfully upgrade from 4.3.18 to 4.4.1 on ipi-on-azure.

Dropping in a more complete summary of frequency from QE:

xiaoli 1 hour ago - bug 1817455, QE hit it 4 times from 4.2 to 4.3 to 4.4, 1 time from 4.3 to 4.4 (AWS), 3 succeeded from 4.3 to 4.4 (in Azure, GCP, vSphere) (edited)

Ok, I've looked at an Azure cluster and I'm confirming my previous comment. The different failures across platforms are still triggered (or not) by the same root cause: a bugged backup-and-restore routine. There are mainly two different ways to trigger this bug, but again the same root cause: the diff between 3 machine configs (rendered, in our case) triggers it.

In the Azure case we can only see 2 rendered MCs, **that's why it doesn't trigger**. In the AWS case we can see 3 rendered MCs because another MC has been deployed to tweak chrony. In the 4.2->4.3->4.4 case, regardless of the platform, we'll always have 3+ MCs, so this bug is triggered. So: same root cause, same fix for the >=3 rendered machineconfigs case (a quick way to count rendered configs is shown at the end of this report).

Verified on 4.4.0-0.nightly-2020-04-30-145451. Upgrades are working again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
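As an illustrative check for the susceptible case described above (three or more rendered MachineConfigs, versus the two seen on the unaffected Azure cluster), counting the rendered configs is enough; the names are cluster-specific:

$ oc get machineconfigs | grep rendered-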