1911841 – [IPI Baremetal] After restoring to previous state the cluster operator machine-config is degraded

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Yu Qi Zhang
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-12-31 18:54 UTC by Ori Michaeli
Modified:	2021-03-15 17:14 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-15 17:14:09 UTC
Target Upstream Version:
Embargoed:

Description Ori Michaeli 2020-12-31 18:54:06 UTC

Description of problem:

Restoring the cluster to its previous state after a 4.6 to 4.7 upgrade leaves the machine-config operator in a Degraded state.

Version-Release number of selected component (if applicable):
Upgrade from 4.6.9 to 4.7.0-fc.0

[kni@provisionhost-0-0 ~]$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.6.9
built from commit a48ad4a15b42102d1747d2f5f3b635deffb950b5
release image registry.svc.ci.openshift.org/ocp/release@sha256:43d5c84169a4b3ff307c29d7374f6d69a707de15e9fa90ad352b432f77c0cead


How reproducible:
Every time.

Steps to Reproduce:
1. Back up cluster.
2. Mirror release image to the disconnected registry.
3. Create ImageContentSourcePolicy.
4. Create ConfigMap for image signature.
5. Create custom upgrade graph.
6. Point CVO to custom upgrade graph.
7. Upgrade to 4.7 nightly.
8. Restore cluster to previous state (based on https://docs.openshift.com/container-platform/4.6/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html)

Actual results:
The machine-config operator is in a Degraded state after the restore.

Expected results:
Cluster is restored to previous state.

Additional info:
Virtual env: 3 masters + 2 workers, disconnected deployment

Comment 2 Seth Jennings 2021-01-08 22:00:01 UTC

Pretty sure this is because the MCO started using Ignition version 3.2 in 4.7, and the MCD in 4.6 only understands up to 3.1.

https://github.com/openshift/machine-config-operator/pull/2248

Comment 3 Seth Jennings 2021-01-08 22:57:00 UTC

While I don't think y-stream downgrades are really supported, I was able to break out of this by:

- running `oc get mc -oyaml` on the currentConfig MC for each node type (named in the currentConfig annotation on the Nodes) and saving the output
- deleting all rendered MCs (one will be regenerated per node type)
- editing the currentConfig MCs saved in the first step, replacing Ignition version 3.2.0 with 3.1.0
- recreating the currentConfig MCs with `oc create -f`
- logging into each node and deleting /etc/machine-config-daemon/currentConfig

Should get things moving.
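
The edit step in the workaround above can be sketched in Python. This is a minimal illustration, not a supported tool: it assumes the Nodes and MachineConfigs were exported as JSON (`-o json`) rather than YAML, so that only the standard library is needed; that the annotation key is `machineconfiguration.openshift.io/currentConfig`; and that server-populated metadata such as `resourceVersion` has to be stripped before `oc create -f` will accept the object. The function names are hypothetical.

```python
import copy
import json

# Assumed annotation key that the MCO sets on each Node.
CURRENT_CONFIG_ANNOTATION = "machineconfiguration.openshift.io/currentConfig"

# Server-populated metadata that should not be sent back on create (assumption).
SERVER_FIELDS = (
    "resourceVersion", "uid", "creationTimestamp",
    "generation", "managedFields", "selfLink",
)


def current_config_names(node_list):
    """Return the distinct currentConfig MC names from a Node list dict."""
    names = set()
    for node in node_list.get("items", []):
        annotations = node.get("metadata", {}).get("annotations") or {}
        name = annotations.get(CURRENT_CONFIG_ANNOTATION)
        if name:
            names.add(name)
    return sorted(names)


def downgrade_ignition_version(mc, old="3.2.0", new="3.1.0"):
    """Return a copy of a MachineConfig dict with the Ignition version replaced.

    The input dict is left untouched. Only an exact match on `old` is changed.
    """
    result = copy.deepcopy(mc)
    ignition = result.get("spec", {}).get("config", {}).get("ignition", {})
    if ignition.get("version") == old:
        ignition["version"] = new
    metadata = result.get("metadata", {})
    for field in SERVER_FIELDS:
        metadata.pop(field, None)
    return result


def rewrite_machineconfig_json(text):
    """Apply downgrade_ignition_version to a JSON document given as a string."""
    return json.dumps(downgrade_ignition_version(json.loads(text)), indent=2)
```

One possible use is to save each MC with `oc get mc -o json`, pass the file contents through `rewrite_machineconfig_json`, and feed the result to `oc create -f` after the rendered MCs have been deleted. The remaining steps (deleting the rendered MCs and removing /etc/machine-config-daemon/currentConfig on each node) are not covered by this sketch.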

Comment 4 Xingxing Xia 2021-02-18 07:07:42 UTC

Adding the Keywords because this bug blocks the testing of https://issues.redhat.com/browse/API-1055 and the testing of downgrade cases by the upgrade subteam.

Comment 5 Michelle Krejci 2021-03-01 18:19:16 UTC

Could you say more about why you are testing a downgrade path? The MCO has been operating under the understanding that we do not support downgrade paths. There might be some reasoning here that we are not understanding. Thank you for providing more context.

Comment 6 Seth Jennings 2021-03-01 20:04:15 UTC

I think you meant to set needinfo on the reporter.

Comment 7 Xingxing Xia 2021-03-02 06:28:26 UTC

> Could you say more about why you are testing a downgrade path?
The upgrade QE team (such as Yang Yang above) has a downgrade test case. They say that although downgrade is not officially supported, Dev requires QE to run a basic check for downgrade, so that the cluster can be kept working when an upgrade hits a problem and puts it in an urgent situation.
In addition, while doing this basic downgrade checking, QE hit and reported many issues that made the cluster malfunction, such as bug 1907812, bug 1913620, and bug 1916586, and they were all fixed. This is why we test downgrade.

Comment 8 Ori Michaeli 2021-03-02 08:09:03 UTC

(In reply to Michelle Krejci from comment #5)
> Could you say more about why you are testing a downgrade path?
As Xingxing commented, we are testing disaster recovery as part of updates/upgrades testing on IPI BM.

Comment 10 Yu Qi Zhang 2021-03-05 22:56:39 UTC

Hi, this is Jerry from the MCO team.

Regarding major y-stream downgrades, the MCO has never guaranteed that they work. In this case, as Seth mentions, the MCO in 4.6 does not have the Ignition version bump, and we are unfortunately unlikely to backport that functionality, given the priority of other work.

As for a workaround, as Seth mentions, it's possible to manually set all Ignition 3.2 MachineConfigs to 3.1, which should get past that error (3.1->3.2 should not have changed anything unless you are using LUKS encryption, which would not be backwards compatible).
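
Before changing the version by hand, it may be worth checking for the LUKS caveat mentioned above. The following is a minimal sketch, assuming the MachineConfig was exported with `-o json` and loaded into a dict, and that LUKS devices appear under `spec.config.storage.luks` as in the Ignition 3.2 spec. The function name is hypothetical.

```python
def luks_device_names(mc):
    """Return the names of LUKS devices declared in a MachineConfig dict.

    An empty list means no `storage.luks` entries were found, so replacing
    version 3.2.0 with 3.1.0 should not drop any LUKS configuration.
    """
    storage = mc.get("spec", {}).get("config", {}).get("storage") or {}
    return [entry.get("name", "<unnamed>") for entry in storage.get("luks") or []]
```

If this returns a non-empty list for any MachineConfig, the manual version change is probably not safe for that config.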

Apologies for the disruption to the QE process. If you believe "MCO downgradeability" should be a supported flow, please raise the issue as a new epic. The MCO today does not consider this a bug.

Comment 11 Yu Qi Zhang 2021-03-15 17:14:09 UTC

Closing this as NOTABUG for now. If we would like to discuss this further, Jira is perhaps a better place to continue.