1817455 – Node goes to degraded status when machine-config-daemon moves a file across filesystems

Bug 1817455 - Node goes to degraded status when machine-config-daemon moves a file across filesystems

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Antonio Murdaca
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:
Duplicates (5):	1817847 1819232 1821364 1821369 1821716
Depends On:	1814397
Blocks:	1817458

Reported:	2020-03-26 11:35 UTC by Antonio Murdaca
Modified:	2021-04-05 17:36 UTC
CC List:	21 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1814397
Clones:	1817458
Environment:
Last Closed:	2020-05-04 11:47:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1609	None	closed	Bug 1817455: [release-4.4] fix wrongful backup of files not originally on the system	2020-11-23 07:26:05 UTC
Github	openshift machine-config-operator pull 1702	None	closed	Bug 1817455: pkg/daemon: best effort restore or delete	2020-11-23 07:26:04 UTC
Red Hat Product Errata	RHBA-2020:0581	None	None	None	2020-05-04 11:48:13 UTC

Comment 1 W. Trevor King 2020-03-27 01:21:22 UTC

4.5 bug 1814397 still just POST, so too soon for the Bugzilla bot to link this 4.4 bug to its backport PR.  But I'm linking it manually now, because that makes it easier for me to understand where we stand on 4.4 UpgradeBlockers ;).

Comment 2 Antonio Murdaca 2020-03-27 09:59:08 UTC

*** Bug 1817847 has been marked as a duplicate of this bug. ***

Comment 3 Clayton Coleman 2020-03-29 01:37:11 UTC

Please make the bug description public unless you have sensitive information.  This impairs searching for bugs from CI tooling.

Comment 4 Scott Dodson 2020-03-30 20:00:29 UTC

Upstream bug has Upgrade Impact Assessment questions, please answer them on the upstream bug.

https://bugzilla.redhat.com/show_bug.cgi?id=1814397#c14

Comment 5 Antonio Murdaca 2020-04-03 07:30:09 UTC

*** Bug 1819232 has been marked as a duplicate of this bug. ***

Comment 6 Yanping Zhang 2020-04-03 08:17:17 UTC

Reproduced the bug when upgrading from 4.3.9 to a 4.4 nightly.
$ oc get node
NAME                                 STATUS                     ROLES    AGE   VERSION
qe-upg-share-mmc2q-compute-0         Ready                      worker   19h   v1.17.1
qe-upg-share-mmc2q-compute-1         Ready                      worker   19h   v1.17.1
qe-upg-share-mmc2q-compute-2         Ready                      worker   19h   v1.17.1
qe-upg-share-mmc2q-control-plane-0   Ready,SchedulingDisabled   master   19h   v1.16.2
qe-upg-share-mmc2q-control-plane-1   Ready                      master   19h   v1.16.2
qe-upg-share-mmc2q-control-plane-2   Ready                      master   19h   v1.16.2

$ oc get co machine-config -o yaml
<--snip-->
status:
  conditions:
  - lastTransitionTime: "2020-04-03T07:16:10Z"
    message: Cluster not available for 4.4.0-0.nightly-2020-04-02-130551
    status: "False"
    type: Available
  - lastTransitionTime: "2020-04-03T07:02:01Z"
    message: Working towards 4.4.0-0.nightly-2020-04-02-130551
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-04-03T07:16:10Z"
    message: 'Unable to apply 4.4.0-0.nightly-2020-04-02-130551: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: controller version mismatch for rendered-master-1d1b2a692c950078690d8b3b215bec2f
      expected a7b13759061f645a76f03c04d385d275bbbd0c02 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca,
      retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
<--snip-->

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-1d1b2a692c950078690d8b3b215bec2f   False     True       True       3              0                   0                     1                      20h
worker   rendered-worker-72f43e4889519a6ede04333776de8d32   True      False      False      3              3                   3                     0                      20h

$ oc get mcp master -o yaml
<--snip-->
  - lastTransitionTime: "2020-04-03T07:08:32Z"
    message: 'Node qe-upg-share-mmc2q-control-plane-0 is reporting: "rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig
      /usr/local/bin/etcd-member-add.sh: invalid cross-device link"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
<--snip-->

Comment 9 W. Trevor King 2020-04-07 20:14:54 UTC

*** Bug 1821369 has been marked as a duplicate of this bug. ***

Comment 10 W. Trevor King 2020-04-07 20:21:33 UTC

*** Bug 1821364 has been marked as a duplicate of this bug. ***

Comment 11 Antonio Murdaca 2020-04-08 08:46:15 UTC

*** Bug 1821716 has been marked as a duplicate of this bug. ***

Comment 12 Yanping Zhang 2020-04-09 08:57:00 UTC

Upgraded from 4.3.9 to nightly 4.4.0-0.nightly-2020-04-07-130324 successfully; the issue described in this bug no longer occurs.
Moving the bug to "Verified".

Comment 15 liujia 2020-04-30 11:11:06 UTC

Hit the issue again during a 4.2.29->4.3.18->4.4.1 upgrade.

After the upgrade from v4.2.29 to v4.3.18 succeeded, we continued upgrading the cluster to v4.4.1, but that upgrade failed.

# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.4.1: the cluster operator openshift-apiserver is degraded
# ./oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.1     True        False         False      4h56m
cloud-credential                           4.4.1     True        False         False      5h8m
cluster-autoscaler                         4.4.1     True        False         False      5h2m
console                                    4.4.1     True        False         False      137m
csi-snapshot-controller                    4.4.1     True        False         False      123m
dns                                        4.4.1     True        False         False      5h8m
etcd                                       4.4.1     True        False         False      145m
image-registry                             4.4.1     True        False         False      117m
ingress                                    4.4.1     True        False         False      117m
insights                                   4.4.1     True        False         False      5h8m
kube-apiserver                             4.4.1     True        False         False      5h7m
kube-controller-manager                    4.4.1     True        False         False      152m
kube-scheduler                             4.4.1     True        False         False      152m
kube-storage-version-migrator              4.4.1     True        False         False      123m
machine-api                                4.4.1     True        False         False      5h9m
machine-config                             4.3.18    False       True          True       116m
marketplace                                4.4.1     True        False         False      144m
monitoring                                 4.4.1     True        False         False      3h42m
network                                    4.4.1     True        False         False      5h7m
node-tuning                                4.4.1     True        False         False      145m
openshift-apiserver                        4.4.1     True        False         True       137m
openshift-controller-manager               4.4.1     True        False         False      5h7m
openshift-samples                          4.4.1     False       True          True       1s
operator-lifecycle-manager                 4.4.1     True        False         False      5h2m
operator-lifecycle-manager-catalog         4.4.1     True        False         False      5h5m
operator-lifecycle-manager-packageserver   4.4.1     True        False         False      136m
service-ca                                 4.4.1     True        False         False      5h8m
service-catalog-apiserver                  4.4.1     True        False         False      137m
service-catalog-controller-manager         4.4.1     True        False         False      4h7m
storage                                    4.4.1     True        False         False      145m

Checked: openshift-apiserver is degraded because one of the masters is unschedulable.
# ./oc get node
NAME                                             STATUS                     ROLES    AGE     VERSION
ugdci2-x9ljw-m-0.c.openshift-qe.internal         Ready,SchedulingDisabled   master   5h11m   v1.16.2
ugdci2-x9ljw-m-1.c.openshift-qe.internal         Ready                      master   5h11m   v1.16.2
ugdci2-x9ljw-m-2.c.openshift-qe.internal         Ready                      master   5h11m   v1.16.2
ugdci2-x9ljw-w-a-frg9s.c.openshift-qe.internal   Ready                      worker   5h5m    v1.17.1
ugdci2-x9ljw-w-b-qggrc.c.openshift-qe.internal   Ready                      worker   5h5m    v1.17.1
ugdci2-x9ljw-w-c-7knmr.c.openshift-qe.internal   Ready                      worker   5h6m    v1.17.1

# ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-51bc0454ee0f0a886b5812eb225f400e   False     True       True       3              0                   0                     1                      5h13m
worker   rendered-worker-1893f230e08f250db257307b8a6db414   True      False      False      3              3                   3                     0                      5h13m

    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2020-04-30T08:47:50Z
    Message:               All nodes are updating to rendered-master-8cf84e0dd6e3e8e6b2a0d533d74074d8
    Reason:                
    Status:                True
    Type:                  Updating
    Last Transition Time:  2020-04-30T08:48:09Z
    Message:               Node ugdci2-x9ljw-m-0.c.openshift-qe.internal is reporting: "rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link"
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2020-04-30T08:48:09Z
    Message:               
    Reason:                
    Status:                True
    Type:                  Degraded

Comment 17 Wenjing Zheng 2020-04-30 11:45:02 UTC

I was able to upgrade successfully from 4.3.18 to 4.4.1 on ipi-on-azure.

Comment 18 Scott Dodson 2020-04-30 12:44:26 UTC

Dropping in a more complete summary of frequency from QE.

xiaoli  1 hour ago - bug 1817455: QE hit it 4 times going from 4.2 to 4.3 to 4.4 and 1 time going from 4.3 to 4.4 (AWS); 3 upgrades from 4.3 to 4.4 succeeded (on Azure, GCP, vSphere)

Comment 22 Antonio Murdaca 2020-04-30 13:10:23 UTC

OK, I've looked at an Azure cluster, and it confirms my previous comment.

The failures differ across platforms, but whether they trigger or not comes down to the same root cause: a buggy backup and restore routine.

There are mainly two different ways to trigger this bug, but again the root cause is the same:

- the diff between 3 machine configs (rendered, in our case) triggers it

In the Azure case we only see 2 rendered MCs, **that's why it doesn't trigger**.
In the AWS case we see 3 rendered MCs because another MC was deployed to tweak chrony.
In the 4.2->4.3->4.4 case, regardless of the platform, we'll always have 3+ MCs, so the bug is triggered.

So: same root cause, and the same fix covers the >=3 rendered machineconfigs case.

Comment 24 Mike Fiedler 2020-05-01 15:09:32 UTC

Verified on 4.4.0-0.nightly-2020-04-30-145451.  Upgrades are working again.

Comment 26 errata-xmlrpc 2020-05-04 11:47:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 27 W. Trevor King 2021-04-05 17:36:17 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
