Bug 1999777 - Take etcd backups before minor-version OpenShift updates
Summary: Take etcd backups before minor-version OpenShift updates
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.z
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1997347
Blocks:
 
Reported: 2021-08-31 17:55 UTC by Sam Batschelet
Modified: 2021-09-21 08:02 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1997347
Environment:
Last Closed: 2021-09-21 08:02:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID                                             Last Updated
Github openshift/cluster-etcd-operator pull 652       2021-09-01 14:19:48 UTC
Github openshift/cluster-version-operator pull 649    2021-09-07 16:05:37 UTC
Red Hat Product Errata RHBA-2021:3511                 2021-09-21 08:02:43 UTC

Description Sam Batschelet 2021-08-31 17:55:58 UTC
+++ This bug was initially created as a clone of Bug #1997347 +++

OCP 4.9 will include a minor bump to etcd, bringing it to v3.5.0. Once the cluster has upgraded to 3.5, etcd will fail in a rollback scenario. For this reason, we must ensure that a valid backup exists for the user before the upgrade. This backup safety net can be used to mitigate the risk of a failed upgrade in a case where the customer forgot to take a backup. Otherwise, without a backup, the customer would be forced to create a new cluster.
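For context, the cluster-etcd-operator change surfaces the pre-upgrade backup as a RecentBackup condition on the etcd clusteroperator (exercised in the test comments below). A minimal sketch for checking it before kicking off the update:

  # Sketch: print the etcd operator's RecentBackup condition; the message names
  # the backup directory and the control-plane node it was written to.
  $ oc get -o json clusteroperator etcd \
      | jq -r '.status.conditions[] | select(.type == "RecentBackup") | .type + "=" + .status + " " + .reason + ": " + .message'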

--- Additional comment from Sam Batschelet on 2021-08-27 15:23:17 UTC ---

I wanted to give an update: a few assumptions behind the storage-related backup pre-checks were proven invalid. I am working on a new approach that performs these checks in the pod init containers. As a bonus, this should simplify the operator code.

Comment 3 W. Trevor King 2021-09-07 19:23:01 UTC
Following [1] to do some pre-merge testing here too (easier, because we have signed 4.9 releases to use), I launched a cluster-bot cluster with 'launch 4.8,openshift/cluster-etcd-operator#652,openshift/cluster-version-operator#649'.  Updating that cluster to 4.9.0-rc.0:

  $ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:d1c1401fdbfe0820036dd3f3cc5df1539b5a101fe9f21f1845e55d8655000f66
  $ oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-09-07T18:54:40Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
  EtcdMembersDegraded: No unhealthy members found
  2021-09-07T19:05:31Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
  EtcdMembersProgressing: No unstarted etcd members found
  2021-09-07T18:55:52Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
  EtcdMembersAvailable: 3 members are available
  2021-09-07T18:54:40Z Upgradeable=True AsExpected: All is well
  2021-09-07T19:20:56Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-07_192056 on node "ci-ln-7wqizib-f76d1-j78qx-master-1"
  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-09-07T18:50:35Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-09-07T19:15:06Z Available=True : Done applying 4.8.0-0.ci.test-2021-09-07-184416-ci-ln-7wqizib-latest
  2021-09-07T19:20:51Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted": 
  2021-09-07T19:20:30Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.9.0-rc.0: it may not be safe to apply this update

It sits like this for a bit, until the CVO takes another pass at downloading and verifying the release and checking preconditions. After that:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-09-07T18:50:35Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-09-07T19:15:06Z Available=True : Done applying 4.8.0-0.ci.test-2021-09-07-184416-ci-ln-7wqizib-latest
  2021-09-07T19:22:20Z Failing=False : 
  2021-09-07T19:20:30Z Progressing=True : Working towards 4.9.0-rc.0: 4 of 734 done (0% complete)

So looks good to me :)

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1997347#c5

Comment 4 Yang Yang 2021-09-08 02:02:17 UTC
Hi Trevor,

> 2021-09-07T19:20:56Z RecentBackup=True
> 2021-09-07T19:22:20Z Failing=False 

The CVO's Failing condition goes to False at 2021-09-07T19:22:20Z, while etcd's RecentBackup condition goes to True at 2021-09-07T19:20:56Z, so the CVO lags the condition transition by about 1m30s. Does the CVO need any improvement here? Thanks!

Comment 5 W. Trevor King 2021-09-08 17:15:15 UTC
> CVO delays 1m30 or so on the condition transition. Does CVO need any improvements on it?

Yeah, it sits for a bit to avoid hotlooping, and then pulls the release again (or launches a new container consuming the already-downloaded release image) and takes a second pass at checking upgrade preconditions.  We could probably squeeze that delay down with some CVO-side work, but I think that would be a forward-looking RFE, and not something that should block us releasing this initial implementation.
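
For reference, a quick way to eyeball that gap during testing (a sketch reusing the jq pattern from comment 3):

  # Sketch: print the etcd RecentBackup and CVO Failing transitions side by side.
  $ oc get -o json clusteroperator/etcd clusterversion/version \
      | jq -r '.items[].status.conditions[] | select(.type == "RecentBackup" or .type == "Failing")
               | .lastTransitionTime + " " + .type + "=" + .status'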

Comment 7 Sam Batschelet 2021-09-14 20:25:35 UTC
It's very important that we have a clear signal that this feature works as expected. Do you have a test case in Jenkins that we can exercise to ensure we have no issues? I am working on a possible addition to periodic testing that would allow testing of minor-version nightly images, but it would be good to have multiple test runs of the above.

Comment 8 Yang Yang 2021-09-15 01:38:25 UTC
Sam, currently I have the test case [1] in Polarion, which is not automated yet. I'll get it automated soon so that we can easily run it periodically. The test case only covers testing from the CVO side; Ge Liu is working on the testing from the etcd side.

[1] https://polarion.engineering.redhat.com/polarion/redirect/project/OSE/workitem?id=OCP-44209

Comment 9 Yang Yang 2021-09-15 10:17:45 UTC
Sam, the filename displayed in the RecentBackup message is incorrect. The RecentBackup condition reports "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-15_100541 on node \"yangyang0915-2-88b9l-master-0.c.openshift-qe.internal\"", but that path does not exist.

# oc debug node/yangyang0915-2-88b9l-master-0.c.openshift-qe.internal
Starting pod/yangyang0915-2-88b9l-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.2
If you don't see a command prompt, try pressing enter.

sh-4.4# chroot /host
sh-4.4# ls /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-15_100541
ls: cannot access '/etc/kubernetes/cluster-backup/upgrade-backup-2021-09-15_100541': No such file or directory
sh-4.4# ls /etc/kubernetes/cluster-backup/
upgrade-backup-2021-09-15_100535

The actual directory name is upgrade-backup-2021-09-15_100535, not upgrade-backup-2021-09-15_100541.
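
To reproduce the comparison in one place (a sketch; the node name below is from this cluster and would differ elsewhere):

  # Sketch: show the path the RecentBackup condition reports, then list what is
  # actually present under /etc/kubernetes/cluster-backup on the node.
  $ oc get -o json clusteroperator etcd \
      | jq -r '.status.conditions[] | select(.type == "RecentBackup") | .message'
  $ oc debug node/yangyang0915-2-88b9l-master-0.c.openshift-qe.internal -- \
      chroot /host ls /etc/kubernetes/cluster-backup/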

Comment 10 Sam Batschelet 2021-09-15 10:43:12 UTC
Thank you for the report. We can look into this as a 4.9.z fix; I would not consider it release-blocking, as the goal is for a functional backup to exist. Thank you for working with us on automation.

Comment 11 Yang Yang 2021-09-15 11:01:15 UTC
Thanks. I just opened a bz [1] to track the filename issue.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2004451

I'm automating my test case using QE's internal framework, and I need to build the Jenkins job manually. Do you have any test plan for this feature? Do we need to test every 4.8 nightly release?

Comment 16 Yang Yang 2021-09-16 12:11:26 UTC
> There is some info about "version: 4.9.0-rc.1" in clusterversion; could you please double-check whether it's an issue?

It looks correct. The history shows that the cluster is running 4.8.0-0.nightly-2021-09-15-162303 and is being upgraded to 4.9.0-rc.1, but the upgrade has not completed yet. The desired version, 4.9.0-rc.1, indicates the version the cluster is expected to reach whenever it differs from the current version.
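
A compact way to see both fields at once (a sketch in the same jq style used earlier in this bug):

  # Sketch: show the desired version next to each history entry and its state.
  $ oc get -o json clusterversion version \
      | jq -r '.status.desired.version,
               (.status.history[] | .state + " " + .version + " completed=" + (.completionTime // "in progress"))'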

By the way, I'm curious how you restored with the backup.
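
My expectation (a sketch of the documented etcd disaster-recovery flow, not something verified in this bug) is that, after the prerequisite steps in the docs, the restore script is run on a recovery control-plane host against the pre-upgrade backup directory:

  # Sketch: restore from the operator-created backup on a recovery control-plane
  # host; <recovery-master> is a placeholder, and the directory name is the one
  # reported by the RecentBackup condition above.
  $ oc debug node/<recovery-master>
  sh-4.4# chroot /host
  sh-4.4# /usr/local/bin/cluster-restore.sh /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-15_100535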

Comment 19 errata-xmlrpc 2021-09-21 08:02:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.12 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3511

