Bug 1775057 - [MSTR-485] Cluster is abnormal after etcd backup/restore when the backup is conducted during etcd encryption is migrating
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Andrew McDermott
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1776811
 
Reported: 2019-11-21 11:50 UTC by Xingxing Xia
Modified: 2020-11-16 12:34 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1776811
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)
Details of the abnormal cluster after backup and restore (9.61 KB, text/plain)
2019-11-21 11:58 UTC, Xingxing Xia


Links
System ID Priority Status Summary Last Updated
Github openshift library-go pull 603 'None' closed Bug 1775057: encryption: keep last read key after migration for easier backup/restore 2020-11-16 08:28:56 UTC
Red Hat Bugzilla 1788895 unspecified NEW Add step to restart router pods after etcd(enabled encryption) backup/restore 2020-10-14 00:28:05 UTC

Description Xingxing Xia 2019-11-21 11:50:48 UTC
Description of problem:
Cluster is abnormal after etcd backup/restore when the backup is taken while the etcd encryption migration is in progress

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-20-225457

How reproducible:
Tried once by geliu (a single attempt takes a lot of time); I observed his attempt.

Steps to Reproduce:
Follow the test case OCP-25932:
1. Install a fresh env and create 500 projects; this creates 4500 secrets
2. Enable encryption:
oc patch apiserver/cluster -p '{"spec":{"encryption": {"type":"aescbc"}}}' --type merge
3. Repeatedly check the EncryptionMigrationControllerProgressing condition until its "status" becomes True and "reason" becomes "Migrating":
oc get kubeapiserver cluster -o json | jq -r '.status.conditions[] | select(.type == "EncryptionMigrationControllerProgressing")'
4. When the above condition is True, take an etcd backup by following https://docs.openshift.com/container-platform/4.2/backup_and_restore/backing-up-etcd.html
5. Follow https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html to restore
6. After the restore, check cluster
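Steps 2 and 3 above can be scripted so the mid-migration backup window is not missed. A minimal sketch, assuming `oc` is logged in to the cluster and `jq` is on PATH; the `wait_for_migration` name and the 10-second poll interval are illustrative, not from the bug:

```shell
#!/bin/sh
# Hypothetical helper: block until the etcd encryption migration is
# actually in progress, i.e. the point at which the mid-migration
# backup in step 4 should be taken.
wait_for_migration() {
  while :; do
    cond=$(oc get kubeapiserver cluster -o json \
      | jq -r '.status.conditions[]
               | select(.type == "EncryptionMigrationControllerProgressing")
               | "\(.status)/\(.reason)"')
    [ "$cond" = "True/Migrating" ] && return 0
    sleep 10
  done
}

# Step 2: turn on aescbc encryption, then wait for the migration:
#   oc patch apiserver/cluster -p '{"spec":{"encryption":{"type":"aescbc"}}}' --type merge
#   wait_for_migration && echo "migration in progress - take the etcd backup now"
```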

Actual results:
6. Cluster is abnormal:
oc get po -A | grep -vE "(Running|Completed)"; oc get co; oc get no
openshift-apiserver-operator             openshift-apiserver-operator-7d7bbd897f-465z4   0/1    CrashLoopBackOff   3       4h4m
openshift-cluster-node-tuning-operator   tuned-4lx6b                                     0/1    Error              0       3h54m
openshift-cluster-node-tuning-operator   tuned-89b86                                     0/1    Error              0       3h54m
openshift-cluster-version                cluster-version-operator-667d947df7-zjfrx       0/1    CrashLoopBackOff   1       4h4m
...many other pods...
kube-apiserver                           4.3.0-0.nightly-2019-11-20-225457               True   True               False   3h57m
...

More details, see attachment

Expected results:
6. Cluster is normal

Additional info:

Comment 1 Xingxing Xia 2019-11-21 11:58:49 UTC
Created attachment 1638414 [details]
Details of the abnormal cluster after backup and restore

Env kubeconfig: http://file.rdu.redhat.com/~xxia/env-for-bug-1775057.kubeconfig . The env will be preserved for Dev debugging for 1 to 2 days.

Comment 2 ge liu 2019-11-22 02:23:08 UTC
@xxia, it's kube-apiserver component, your cake! ^_^

Comment 4 Xingxing Xia 2019-12-05 03:46:20 UTC
Moving this bug forward as its clone bug was moved.

Comment 5 Xingxing Xia 2020-01-07 15:49:11 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1776811#c29 through c30 describe the known router pod restart issue. That is not fixed yet; 4.4 is presumably the same, so assigning back.

Comment 8 Andrew McDermott 2020-03-03 13:53:56 UTC
I tried the backup procedure but ran into: 

Not a blocker for 4.4 and waiting on: 

- https://bugzilla.redhat.com/show_bug.cgi?id=1788895
- https://bugzilla.redhat.com/show_bug.cgi?id=1807447

Moving to 4.5.

Comment 11 Andrew McDermott 2020-05-19 15:57:24 UTC
Moved to 4.6 until we can work out whether the workaround is still acceptable, or whether the worker nodes where the router runs need to be rebooted, or the router scaled down and back up to pick up the new endpoints.
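One hypothetical form of the restart workaround described above: delete the router pods so the deployment recreates them against the restored etcd state, then wait until none are outside Running/Completed (the same filter as the triage command in the bug description). The `restart_and_wait` helper name and the 5-second poll are illustrative, not from this bug:

```shell
#!/bin/sh
# Hypothetical workaround sketch: restart pods in a namespace and
# wait for them to settle after an etcd restore.
restart_and_wait() {
  ns=$1
  oc -n "$ns" delete pods --all   # the deployment recreates them
  while oc -n "$ns" get po --no-headers 2>/dev/null \
      | grep -vE '(Running|Completed)' >/dev/null; do
    sleep 5
  done
}

# Usage after the restore:
#   restart_and_wait openshift-ingress
```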

Comment 12 Andrew McDermott 2020-06-17 09:45:00 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 13 Andrew McDermott 2020-07-09 12:04:22 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 14 Andrew McDermott 2020-07-30 09:59:10 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 15 mfisher 2020-08-18 20:01:41 UTC
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

Comment 16 Andrew McDermott 2020-09-10 11:44:12 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 17 Andrew McDermott 2020-10-02 16:13:41 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 18 Andrew McDermott 2020-10-23 15:59:19 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 19 Andrew McDermott 2020-11-16 08:29:27 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 20 Xingxing Xia 2020-11-16 12:34:25 UTC
Andrew McDermott, regular 4.6 tests didn't hit this, so moving to VERIFIED. Thanks for tracking it.

