2079803 – Upgrade-triggered etcd backup is skipped during serial upgrade

Summary: Upgrade-triggered etcd backup is skipped during serial upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	melbeher
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2091604 2097431 2105148

Reported:	2022-04-28 09:47 UTC by liujia
Modified:	2023-03-14 04:46 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	2091604 2097431
Environment:
Last Closed:	2022-08-10 11:09:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 835	0	None	Merged	Bug 2079803: fix consecutive backup on consecutive upgrades	2022-06-01 04:07:25 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 11:09:32 UTC

Description liujia 2022-04-28 09:47:18 UTC

Description of problem:
During a serial upgrade from v4.9 to v4.10 to v4.11, the upgrade-triggered etcd backup is skipped during the v4.10 to v4.11 upgrade because RecentBackup was already set to True during the first upgrade.

There should be two backups, since there are two upgrades (v4.9 to v4.10, then v4.10 to v4.11), but there is only one backup file, taken on v4.9.

sh-4.4# ls -la /etc/kubernetes/cluster-backup
total 0
drwxr-xr-x. 3 root root  46 Apr 28 06:11 .
drwxr-xr-x. 7 root root 224 Apr 28 06:11 ..
drwxr-xr-x. 2 root root 132 Apr 28 06:11 upgrade-backup-2022-04-28_061106
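
To count the upgrade backups present on a node, the directory names from a listing like the one above can be parsed. This is only a sketch: the `upgrade-backup-YYYY-MM-DD_HHMMSS` naming pattern is inferred from the single directory shown in this report, and the helper names are made up for illustration.

```python
from datetime import datetime

# Assumption: prefix and timestamp layout are inferred from the one
# directory name in this report (upgrade-backup-2022-04-28_061106).
PREFIX = "upgrade-backup-"


def parse_backup_time(dirname: str) -> datetime:
    """Return the timestamp encoded in an upgrade-backup directory name."""
    if not dirname.startswith(PREFIX):
        raise ValueError(f"not an upgrade backup directory: {dirname!r}")
    return datetime.strptime(dirname[len(PREFIX):], "%Y-%m-%d_%H%M%S")


def upgrade_backups(entries):
    """Return (dirname, timestamp) pairs for upgrade backups, oldest first."""
    found = [(e, parse_backup_time(e)) for e in entries if e.startswith(PREFIX)]
    return sorted(found, key=lambda pair: pair[1])


# Entries as they appear in the `ls -la` output of this report.
backups = upgrade_backups([".", "..", "upgrade-backup-2022-04-28_061106"])
print(len(backups), backups[0][1].isoformat())  # 1 2022-04-28T06:11:06
```

After two upgrades, two entries would be expected here; the listing in this report yields only one.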

There is no cluster-backup pod:
# ./oc -n openshift-etcd get pod|grep backup
#

Version-Release number of selected component (if applicable):
v4.10.12

How reproducible:
always

Steps to Reproduce:
1. Install v4.9.30
2. Upgrade v4.9.30 to v4.10.12. During this upgrade, etcd is backed up successfully before the upgrade proceeds.
# ./oc get clusterversion -ojson|jq .items[].status.history
[
  {
    "completionTime": "2022-04-28T07:17:11Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:f77f4f75c1e1a4ddd0a0355f298a834db3473fd9ca473235013e9419d1df16db",
    "startedTime": "2022-04-28T06:10:43Z",
    "state": "Completed",
    "verified": true,
    "version": "4.10.12"
  },
  {
    "completionTime": "2022-04-28T03:50:36Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:af9369a6d57f40457440dd730f7f6a640837cb5ced64231b43b39fb2e5835fa6",
    "startedTime": "2022-04-28T03:29:55Z",
    "state": "Completed",
    "verified": false,
    "version": "4.9.30"
  }
]

# ./oc get co etcd -ojson|jq .status.conditions[-1]
{
  "lastTransitionTime": "2022-04-28T06:11:13Z",
  "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106 on node \"jliu49-kpt7q-master-2.c.openshift-qe.internal\"",
  "reason": "UpgradeBackupSuccessful",
  "status": "True",
  "type": "RecentBackup"
}

3. Continue the upgrade to v4.11.0-0.nightly-2022-04-26-181148.
The RecentBackup condition still points at the old backup file taken on v4.9.
# ./oc get co etcd -ojson|jq .status.conditions[-1]
{
  "lastTransitionTime": "2022-04-28T06:11:13Z",
  "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106 on node \"jliu49-kpt7q-master-2.c.openshift-qe.internal\"",
  "reason": "UpgradeBackupSuccessful",
  "status": "True",
  "type": "RecentBackup"
}
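
The check in the steps above can be scripted. The sketch below selects the RecentBackup condition by its `type` rather than relying on it being the last entry (`[-1]`), and flags a backup whose `lastTransitionTime` is older than the newest `startedTime` in the ClusterVersion history. This comparison is a heuristic for reproducing the bug, not the operator's own logic. The function names are hypothetical, and the 4.11 `startedTime` in the example is a made-up placeholder because the report does not include it.

```python
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    """Parse a timestamp such as 2022-04-28T06:11:13Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_condition(cluster_operator: dict, cond_type: str):
    """Return the condition with the given type, or None if it is absent."""
    for cond in cluster_operator.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def backup_predates_upgrade(cluster_operator: dict, history: list) -> bool:
    """True if no successful backup was recorded since the newest update started."""
    cond = find_condition(cluster_operator, "RecentBackup")
    if cond is None or cond.get("status") != "True":
        return True  # no successful backup recorded at all
    newest_start = max(parse_time(entry["startedTime"]) for entry in history)
    return parse_time(cond["lastTransitionTime"]) < newest_start


# Condition values copied from this report (message omitted for brevity).
etcd = {"status": {"conditions": [{
    "type": "RecentBackup",
    "status": "True",
    "reason": "UpgradeBackupSuccessful",
    "lastTransitionTime": "2022-04-28T06:11:13Z",
}]}}

history_after_step_2 = [
    {"version": "4.10.12", "startedTime": "2022-04-28T06:10:43Z"},
    {"version": "4.9.30", "startedTime": "2022-04-28T03:29:55Z"},
]
# Hypothetical startedTime for the 4.11 update; the real value is not in this report.
history_after_step_3 = [
    {"version": "4.11.0-0.nightly-2022-04-26-181148", "startedTime": "2022-04-28T08:00:00Z"},
] + history_after_step_2

print(backup_predates_upgrade(etcd, history_after_step_2))  # False
print(backup_predates_upgrade(etcd, history_after_step_3))  # True
```

The input dictionaries correspond to the output of `oc get co etcd -ojson` and the `status.history` list of `oc get clusterversion -ojson` used in the steps.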

Actual results:
The upgrade-triggered etcd backup is skipped when a backup already exists from a previous upgrade.

Expected results:
An upgrade-triggered etcd backup should be taken before each upgrade.


Additional info:
Not sure whether the CVO side or the etcd side is the right place for a fix, so we track the bug in CVO first. After discussing with etcd QE: at the least, there should be a new backup file to replace the old one.

Comment 1 W. Trevor King 2022-04-29 08:04:35 UTC

Reading through [1], I'm not actually noticing anything that moves RecentBackup from True to False after a previous, successful backup is no longer considered "recent".  I dunno what the freshness threshold would be (minutes?  Hours?).  But I think it's up to etcd to make that call, and set RecentBackup=False (while still keeping the condition around to point at the now-stale backup?  Or removing the condition?) when they think the backup isn't fresh enough.  If that ends up allowing:

1. Cluster is on 4.9
2. Requested update to 4.10
3. etcd takes a backup
4. Update to 4.10
5. Requested update to 4.11
6. Update to 4.11
7. etcd decides the 4.9 backup is stale, and sets RecentBackup=False

that's fine with me.  In the event of a disaster, the user can restore to step 3, and they can repeat as many of the subsequent steps as they like.  The value of a second snapshot at 5 doesn't seem all that high, since it would just pick up an hour or so of the step 4 activity.

[1]: https://github.com/openshift/cluster-etcd-operator/blob/95049e93f7acb4bd9ca7c684702390671e7a1371/pkg/operator/upgradebackupcontroller/upgradebackupcontroller.go
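
To make the staleness idea above concrete, here is a minimal sketch of an age-based check that flips RecentBackup to False while keeping the message that points at the old backup. It is not the code in [1] or in any later fix. The threshold is left as a parameter because no value has been agreed on, and the `UpgradeBackupStale` reason and the function name are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def refresh_recent_backup(condition: dict, now: datetime, max_age: timedelta) -> dict:
    """Return a copy of the condition, set to False once it is older than max_age."""
    taken = datetime.strptime(condition["lastTransitionTime"], TIME_FORMAT)
    taken = taken.replace(tzinfo=timezone.utc)
    if condition["status"] != "True" or now - taken <= max_age:
        return dict(condition)  # still fresh, or already not True
    stale = dict(condition)  # message is kept so it still points at the old backup
    stale["status"] = "False"
    stale["reason"] = "UpgradeBackupStale"  # hypothetical reason, not a real one
    stale["lastTransitionTime"] = now.strftime(TIME_FORMAT)
    return stale


# Condition values copied from the description of this bug.
recent = {
    "type": "RecentBackup",
    "status": "True",
    "reason": "UpgradeBackupSuccessful",
    "message": "UpgradeBackup pre 4.9 located at path "
               "/etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106",
    "lastTransitionTime": "2022-04-28T06:11:13Z",
}

one_hour = timedelta(hours=1)  # arbitrary threshold, chosen only for the example
soon = datetime(2022, 4, 28, 6, 30, tzinfo=timezone.utc)
later = datetime(2022, 4, 28, 9, 0, tzinfo=timezone.utc)

print(refresh_recent_backup(recent, soon, one_hour)["status"])   # True
print(refresh_recent_backup(recent, later, one_hour)["status"])  # False
```

With a check of this kind, a second update request arriving after the threshold would find RecentBackup=False, so a fresh backup would be required before it proceeds.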

Comment 2 melbeher 2022-05-18 14:52:20 UTC

May I ask how you upgraded to 4.11? Did you use the `--force` flag?

If you used `--force`, no backup will be taken.

@jiajliu

Comment 3 liujia 2022-05-19 00:03:27 UTC

(In reply to melbeher from comment #2)
> May I ask how you upgraded to 4.11? Did you use the `--force` flag?
> 
> If you used `--force`, no backup will be taken.
> 
> @jiajliu

No, `--force` was not used.

Comment 4 melbeher 2022-05-19 17:18:57 UTC

I raised a fix here: https://github.com/openshift/cluster-etcd-operator/pull/835

Please test extensively, @geliu.

Comment 9 ge liu 2022-05-27 06:59:52 UTC

Hello melbeher, this bug is fixed in 4.11, so I suppose it has not been fixed in 4.10, correct? Will it be backported to 4.10?

Comment 11 ge liu 2022-05-31 07:48:09 UTC

The status condition on CVO is listed in Comment 6.
@melbeher

Comment 12 melbeher 2022-05-31 10:53:36 UTC

So the backup status should be in the CEO conditions, not CVO. You can see this in Comment #1.

Comment 17 errata-xmlrpc 2022-08-10 11:09:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 18 W. Trevor King 2023-03-14 04:46:17 UTC

Errata shipped; presumably all the NEEDINFO were addressed :)