Bug 1997347

Summary: Take etcd backups before minor-version OpenShift updates
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: medium Docs Contact:
Priority: high    
Version: 4.8 CC: geliu, lcosic, yanyang, yselkowi
Target Milestone: --- Keywords: Upgrades
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1999777 (view as bug list) Environment:
Last Closed: 2021-10-18 17:48:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1999777    

Description W. Trevor King 2021-08-25 03:05:41 UTC
OCP 4.9 will include a minor bump to etcd, bringing it to v3.5.0. Once the cluster has upgraded to 3.5, etcd will fail in a rollback scenario. For this reason, we must ensure that a valid backup exists for the user before the upgrade. This backup safety net can be used to mitigate the risk of a failed upgrade in a case where the customer forgot to take a backup. Otherwise, without a backup, the customer would be forced to create a new cluster.

Comment 1 Sam Batschelet 2021-08-27 15:23:17 UTC
I wanted to give an update: a few assumptions about storage in the backup pre-checks were proven invalid. I am working on a new approach that performs these checks in the pod's init containers. As a bonus, this should simplify the operator code.

Comment 4 W. Trevor King 2021-08-31 23:32:27 UTC
To test the landed 4.9 code, we need a 4.10 release image.  We don't have any signed 4.10s yet, so I created an unsigned "4.10"  in my personal Quay account, using a random 4.9 release seed (in the CI registry):

  $ oc adm release new --from-release registry.ci.openshift.org/ocp/release:4.9.0-0.ci-2021-08-23-203814 --name 4.10.0-wking.0 --to-image quay.io/wking/scratch:4.10
  $ oc adm release info quay.io/wking/scratch:4.10 | grep Pull
  Pull From: quay.io/wking/scratch@sha256:09272d43fc7b140992da4bb94002f98d5b08cb9b94843bb5e5db8db634c91a21

Because it's not signed, I created [1], and launched a cluster-bot cluster with '4.9.0-0.ci-2021-08-31-200948,openshift/cluster-version-operator#648'.  Updating that cluster to my "4.10" release:

  $ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/wking/scratch@sha256:09272d43fc7b140992da4bb94002f98d5b08cb9b94843bb5e5db8db634c91a21

CVO blocks (yay):

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-08-31T22:50:07Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-08-31T23:12:43Z Available=True : Done applying 4.9.0-0.ci.test-2021-08-31-224351-ci-ln-f9nbcw2-latest
  2021-08-31T23:19:53Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "EtcdRecentBackupNotSet": RecentBackup: etcd backup condition is not set.
  2021-08-31T23:19:41Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.0-wking.0: it may not be safe to apply this update

Still waiting for etcd to notice:

  $ oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-08-31T22:55:34Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
  EtcdMembersDegraded: No unhealthy members found
  2021-08-31T23:06:12Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
  EtcdMembersProgressing: No unstarted etcd members found
  2021-08-31T22:57:14Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
  EtcdMembersAvailable: 3 members are available
  2021-08-31T22:55:55Z Upgradeable=True AsExpected: All is well

The backup pod did get launched, and completed:

  $ oc -n openshift-etcd get pod cluster-backup
  NAME             READY   STATUS      RESTARTS   AGE
  cluster-backup   0/1     Completed   0          6m31s

Not clear to me why the etcd operator isn't noticing and setting a RecentBackup=True condition in its ClusterOperator.  Talking to Sam, turns out that's because the operator wrote it to the etcd object and not the ClusterOperator:

  $ oc get -o json etcds cluster | jq -r '.status.conditions[] | select(.type == "RecentBackup") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-08-31T23:19:57Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2021-08-31_231957 on node "ci-ln-f9nbcw2-f76d1-jldxp-master-2"
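
The gating behaviour seen above (the update stays blocked until a RecentBackup=True condition is visible where the precondition looks for it) can be sketched roughly as follows. This is a minimal illustration, not the CVO's actual code: the `condition` type, `checkRecentBackup`, and the error text are hypothetical stand-ins, and the real precondition works on API objects rather than a plain slice.

```go
package main

import (
	"errors"
	"fmt"
)

// condition is a simplified, hypothetical stand-in for one entry of
// .status.conditions as printed by the jq filters in this comment.
type condition struct {
	Type   string
	Status string
	Reason string
}

// errBackupNotSet is an illustrative error, not the CVO's real message.
var errBackupNotSet = errors.New("RecentBackup condition is not set")

// checkRecentBackup returns nil only when a RecentBackup condition is
// present with status "True"; a missing or non-True condition blocks.
func checkRecentBackup(conds []condition) error {
	for _, c := range conds {
		if c.Type != "RecentBackup" {
			continue
		}
		if c.Status == "True" {
			return nil
		}
		return fmt.Errorf("RecentBackup=%s (%s)", c.Status, c.Reason)
	}
	return errBackupNotSet
}

func main() {
	// Conditions without RecentBackup, as on the ClusterOperator above.
	withoutBackup := []condition{{Type: "Upgradeable", Status: "True", Reason: "AsExpected"}}
	// The same list once the backup condition has been published.
	withBackup := append(withoutBackup, condition{Type: "RecentBackup", Status: "True", Reason: "UpgradeBackupSuccessful"})

	fmt.Println(checkRecentBackup(withoutBackup))
	fmt.Println(checkRecentBackup(withBackup))
}
```

With the condition written only to the etcd object, a check like this against the ClusterOperator conditions keeps failing, which matches the EtcdRecentBackupNotSet message reported by the CVO.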

[1]: https://github.com/openshift/cluster-version-operator/pull/648

Comment 5 W. Trevor King 2021-09-01 05:03:17 UTC
Ok, new commit in etcd-operator#653, retesting with a fresh 'launch 4.9.0-0.ci-2021-08-31-200948,openshift/cluster-version-operator#648,openshift/cluster-etcd-operator#653' (merging 42d1b59e [1]):

$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/wking/scratch@sha256:09272d43fc7b140992da4bb94002f98d5b08cb9b94843bb5e5db8db634c91a21
$ oc get -o json clusteroperator etcd | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-09-01T04:45:57Z Degraded=False AsExpected: NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: No unhealthy members found
2021-09-01T04:52:57Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 3
EtcdMembersProgressing: No unstarted etcd members found
2021-09-01T04:45:10Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 3
EtcdMembersAvailable: 3 members are available
2021-09-01T04:43:07Z Upgradeable=True AsExpected: All is well
2021-09-01T05:00:44Z RecentBackup=True UpgradeBackupSuccessful: UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2021-09-01_050044 on node "ci-ln-1bbqp5k-f76d1-jrjzl-master-0"
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-09-01T04:37:20Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-09-01T04:59:12Z Available=True : Done applying 4.9.0-0.ci.test-2021-09-01-043116-ci-ln-1bbqp5k-latest
2021-09-01T05:00:39Z Failing=True UpgradePreconditionCheckFailed: Precondition "EtcdRecentBackup" failed because of "ControllerStarted": 
2021-09-01T05:00:22Z Progressing=True UpgradePreconditionCheckFailed: Unable to apply 4.10.0-wking.0: it may not be safe to apply this update

Sits like this for a bit, until the CVO takes another pass at downloading and verifying the release and checking preconditions.  After which:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-09-01T04:37:20Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-09-01T04:59:12Z Available=True : Done applying 4.9.0-0.ci.test-2021-09-01-043116-ci-ln-1bbqp5k-latest
2021-09-01T05:02:14Z Failing=False : 
2021-09-01T05:00:22Z Progressing=True : Working towards 4.10.0-wking.0: 12 of 731 done (1% complete)

So hooray :).  Once etcd-operator#653 lands, I think we'll be in a good place to mark this one VERIFIED.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1432922463011868672#1:build-log.txt%3A3

Comment 8 Yaakov Selkowitz 2021-09-01 18:52:52 UTC
The cluster-etcd-operator change fails to compile on s390x:

go build -trimpath -ldflags "-X github.com/openshift/cluster-etcd-operator/pkg/version.versionFromGit="4.9.0-202109011537.p0.git.6e07db2.assembly.stream-6e07db2" -X github.com/openshift/cluster-etcd-operator/pkg/version.commitFromGit="6e07db2b134df4358dcf4c2a1e3ce0d086a05bc4" -X github.com/openshift/cluster-etcd-operator/pkg/version.gitTreeState="clean" -X github.com/openshift/cluster-etcd-operator/pkg/version.buildDate="2021-09-01T16:42:19Z" " github.com/openshift/cluster-etcd-operator/cmd/cluster-etcd-operator
# github.com/openshift/cluster-etcd-operator/pkg/cmd/verify
pkg/cmd/verify/backupstorage.go:221:34: invalid operation: int64(stat.Bavail) * stat.Bsize (mismatched types int64 and uint32)
make: *** [vendor/github.com/openshift/build-machinery-go/make/targets/golang/build.mk:16: build] Error 2

Fix forthcoming.
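
For context on the error above: Go never converts between integer types implicitly, and per the compiler message `stat.Bsize` is a uint32 on s390x (on other architectures such as amd64 the field is already int64, which is presumably why the build passed there). A minimal sketch of a portable form converts both operands explicitly. `statfsLike` and `availableBytes` are hypothetical names used here so the example runs on any platform; this is not necessarily the fix that landed.

```go
package main

import "fmt"

// statfsLike is a hypothetical stand-in for syscall.Statfs_t that mirrors
// the field widths implied by the s390x compile error (Bsize is uint32).
type statfsLike struct {
	Bavail uint64 // free blocks available to unprivileged users
	Bsize  uint32 // block size in bytes
}

// availableBytes converts both operands to int64 before multiplying, so
// the expression compiles regardless of the platform's field widths.
func availableBytes(stat statfsLike) int64 {
	return int64(stat.Bavail) * int64(stat.Bsize)
}

func main() {
	stat := statfsLike{Bavail: 1000, Bsize: 4096}
	fmt.Println(availableBytes(stat)) // 4096000
}
```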

Comment 11 ge liu 2021-09-03 08:13:25 UTC
Closing this bug according to comment 5; we will verify it with a 4.10 release upgrade. Regarding the s390x failure in comment 8, we can't cover s390x, so please file a new bug if there is an issue on s390x.

Comment 12 Lili Cosic 2021-09-06 07:00:15 UTC
https://github.com/openshift/cluster-etcd-operator/pull/653 has landed now. @geliu, could you try to re-verify it with 4.9.0-0.nightly-2021-09-06-004132?

Comment 14 ge liu 2021-09-07 10:39:42 UTC
@Lili Cosic, there is currently no 4.10 build available to upgrade to, so I'm afraid I can't try it.

Comment 15 Sam Batschelet 2021-09-21 19:57:59 UTC
*** Bug 1997376 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2021-10-18 17:48:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759