Bug 2079803

Summary:	Upgrade-triggered etcd backup will be skipped during serial upgrade
Product:	OpenShift Container Platform	Reporter:	liujia <jiajliu>
Component:	Etcd	Assignee:	melbeher
Status:	CLOSED ERRATA	QA Contact:	ge liu <geliu>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.10	CC:	alray, aos-bugs, geliu, melbeher, wking, yanyang
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	2091604 2097431		Environment:
Last Closed:	2022-08-10 11:09:16 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2091604, 2097431, 2105148

Description liujia 2022-04-28 09:47:18 UTC

Description of problem:
During a serial upgrade from v4.9 to v4.10 to v4.11, the upgrade-triggered etcd backup is skipped during the v4.10 to v4.11 upgrade because RecentBackup was already set to True during the first upgrade.

There should be two backups, since there are two upgrades (from v4.9 to v4.10, and then from v4.10 to v4.11), but there is only one backup file, the one taken for v4.9.

sh-4.4# ls -la /etc/kubernetes/cluster-backup
total 0
drwxr-xr-x. 3 root root  46 Apr 28 06:11 .
drwxr-xr-x. 7 root root 224 Apr 28 06:11 ..
drwxr-xr-x. 2 root root 132 Apr 28 06:11 upgrade-backup-2022-04-28_061106

No cluster-backup pod.
# ./oc -n openshift-etcd get pod|grep backup
#

Version-Release number of selected component (if applicable):
v4.10.12

How reproducible:
always

Steps to Reproduce:
1. Install v4.9.30
2. Upgrade v4.9.30 to v4.10.12. During this upgrade, etcd is backed up successfully before the upgrade starts.
# ./oc get clusterversion -ojson|jq '.items[].status.history'
[
  {
    "completionTime": "2022-04-28T07:17:11Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:f77f4f75c1e1a4ddd0a0355f298a834db3473fd9ca473235013e9419d1df16db",
    "startedTime": "2022-04-28T06:10:43Z",
    "state": "Completed",
    "verified": true,
    "version": "4.10.12"
  },
  {
    "completionTime": "2022-04-28T03:50:36Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:af9369a6d57f40457440dd730f7f6a640837cb5ced64231b43b39fb2e5835fa6",
    "startedTime": "2022-04-28T03:29:55Z",
    "state": "Completed",
    "verified": false,
    "version": "4.9.30"
  }
]

# ./oc get co etcd -ojson|jq '.status.conditions[-1]'
{
  "lastTransitionTime": "2022-04-28T06:11:13Z",
  "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106 on node \"jliu49-kpt7q-master-2.c.openshift-qe.internal\"",
  "reason": "UpgradeBackupSuccessful",
  "status": "True",
  "type": "RecentBackup"
}

3. Continue the upgrade to v4.11.0-0.nightly-2022-04-26-181148.
The RecentBackup condition still points to the old backup file for v4.9.
# ./oc get co etcd -ojson|jq '.status.conditions[-1]'
{
  "lastTransitionTime": "2022-04-28T06:11:13Z",
  "message": "UpgradeBackup pre 4.9 located at path /etc/kubernetes/cluster-backup/upgrade-backup-2022-04-28_061106 on node \"jliu49-kpt7q-master-2.c.openshift-qe.internal\"",
  "reason": "UpgradeBackupSuccessful",
  "status": "True",
  "type": "RecentBackup"
}
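
The comparison made by hand in steps 2 and 3 can also be scripted. The following is a minimal sketch, not part of the operator or of `oc`. It assumes the JSON from `oc get co etcd -ojson` and the `status.history` list from `oc get clusterversion -ojson` have already been loaded into Python, and that the history is ordered newest first as in the output above. It selects the `RecentBackup` condition by `type` rather than by position, then compares its `lastTransitionTime` with the `startedTime` of the newest history entry. The function names and the 4.11 `startedTime` in the usage lines are hypothetical; the other values are copied from the output in this report.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a UTC timestamp such as 2022-04-28T06:11:13Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def recent_backup_condition(cluster_operator):
    """Return the RecentBackup condition, or None if it is absent."""
    conditions = cluster_operator.get("status", {}).get("conditions", [])
    for condition in conditions:
        if condition.get("type") == "RecentBackup":
            return condition
    return None


def backup_is_stale(cluster_operator, history):
    """True if no successful backup is recorded, or if the recorded backup
    is older than the start of the newest update in history."""
    condition = recent_backup_condition(cluster_operator)
    if condition is None or condition.get("status") != "True":
        return True
    if not history:
        return False
    backup_time = parse_ts(condition["lastTransitionTime"])
    update_start = parse_ts(history[0]["startedTime"])
    return backup_time < update_start


# Condition and 4.10.12 history entry as shown in this report.
etcd = {"status": {"conditions": [{
    "type": "RecentBackup",
    "status": "True",
    "reason": "UpgradeBackupSuccessful",
    "lastTransitionTime": "2022-04-28T06:11:13Z",
}]}}
history_410 = [{"version": "4.10.12", "startedTime": "2022-04-28T06:10:43Z"}]

# Hypothetical start time for the 4.11 update; not taken from the report.
history_411 = [{
    "version": "4.11.0-0.nightly-2022-04-26-181148",
    "startedTime": "2022-04-28T08:00:00Z",
}] + history_410

print(backup_is_stale(etcd, history_410))  # False: backup taken after the update started
print(backup_is_stale(etcd, history_411))  # True: backup predates the second update
```

With the values from this report, the condition counts as fresh for the 4.10.12 update and as stale for any later update, which matches the behavior described under "Actual results".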

Actual results:
Upgrade-triggered etcd backup was skipped because a backup from the previous upgrade already existed.

Expected results:
Upgrade-triggered etcd backup should be executed before the upgrade.


Additional info:
Not sure whether the CVO side or the etcd side is the right place for a fix, so we track the bug in CVO first. After discussing with etcd QE: at the least, there should be a new backup file to replace the old one.

Comment 1 W. Trevor King 2022-04-29 08:04:35 UTC

Reading through [1], I'm not actually noticing anything that moves RecentBackup from True to False after a previous, successful backup is no longer considered "recent".  I dunno what the freshness threshold would be (minutes?  Hours?).  But I think it's up to etcd to make that call, and set RecentBackup=False (while still keeping the condition around to point at the now-stale backup?  Or removing the condition?) when they think the backup isn't fresh enough.  If that ends up allowing:

1. Cluster is on 4.9
2. Requested update to 4.10
3. etcd takes a backup
4. Update to 4.10
5. Requested update to 4.11
6. Update to 4.11
7. etcd decides the 4.9 backup is stale, and sets RecentBackup=False

that's fine with me.  In the event of a disaster, the user can restore to step 3, and they can repeat as many of the subsequent steps as they like.  The value of a second snapshot at 5 doesn't seem all that high, since it would just pick up an hour or so of the step 4 activity.

[1]: https://github.com/openshift/cluster-etcd-operator/blob/95049e93f7acb4bd9ca7c684702390671e7a1371/pkg/operator/upgradebackupcontroller/upgradebackupcontroller.go
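
The freshness idea in this comment could be sketched as follows. This is an illustration only, not the operator's implementation: the function name, the returned dictionary, the `BackupStale` reason, and the one-hour threshold are all assumptions, since the comment leaves the threshold open.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold; the comment leaves the real value open.
MAX_BACKUP_AGE = timedelta(hours=1)


def evaluate_recent_backup(backup_time, now, max_age=MAX_BACKUP_AGE):
    """Return a RecentBackup-style condition for a backup taken at backup_time."""
    if now - backup_time <= max_age:
        return {
            "type": "RecentBackup",
            "status": "True",
            "reason": "UpgradeBackupSuccessful",
        }
    # Keep the condition, but flip it so that the next update would need a
    # new backup. "BackupStale" is a hypothetical reason string.
    return {
        "type": "RecentBackup",
        "status": "False",
        "reason": "BackupStale",
    }


# Backup time from the report's RecentBackup condition.
backup_time = datetime(2022, 4, 28, 6, 11, 13, tzinfo=timezone.utc)

# Shortly after the backup: still considered recent.
print(evaluate_recent_backup(backup_time, backup_time + timedelta(minutes=20))["status"])  # True

# At the 4.10.12 completion time from the report (07:17:11Z): older than one hour.
done_410 = datetime(2022, 4, 28, 7, 17, 11, tzinfo=timezone.utc)
print(evaluate_recent_backup(backup_time, done_410)["status"])  # False
```

Under these assumptions the 4.9 backup would already be stale by the time the 4.10 update completes, so a later request to update to 4.11 would not find RecentBackup=True.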

Comment 2 melbeher 2022-05-18 14:52:20 UTC

May I ask how you upgraded to 4.11 ? .. Have you used `--force` flag ? 

If you used --force, no backup will be taken

@jiajliu

Comment 3 liujia 2022-05-19 00:03:27 UTC

(In reply to melbeher from comment #2)
> May I ask how you upgraded to 4.11 ? .. Have you used `--force` flag ? 
> 
> If you used --force, no backup will be taken
> 
> @jiajliu

No `--force`

Comment 4 melbeher 2022-05-19 17:18:57 UTC

I raised a fix here https://github.com/openshift/cluster-etcd-operator/pull/835 

please test extensively @geliu

Comment 9 ge liu 2022-05-27 06:59:52 UTC

Hello melbeher, this bug is fixed in 4.11, so I suppose it has not been fixed in 4.10, correct? Will it be backported to 4.10?

Comment 11 ge liu 2022-05-31 07:48:09 UTC

The status condition on CVO is listed in Comment 6.
@melbeher

Comment 12 melbeher 2022-05-31 10:53:36 UTC

So the backup status should be in the CEO conditions, not CVO. You can see this in Comment #1.

Comment 17 errata-xmlrpc 2022-08-10 11:09:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 18 W. Trevor King 2023-03-14 04:46:17 UTC

Errata shipped; presumably all the NEEDINFO were addressed :)