Bug 1775057 - [MSTR-485] Cluster is abnormal after etcd backup/restore when the backup is conducted during etcd encryption is migrating
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Andrew McDermott
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1776811
 
Reported: 2019-11-21 11:50 UTC by Xingxing Xia
Modified: 2020-11-16 12:34 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1776811
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)
Details of the abnormal cluster after backup and restore (9.61 KB, text/plain)
2019-11-21 11:58 UTC, Xingxing Xia


Links
System ID Priority Status Summary Last Updated
Github openshift library-go pull 603 'None' closed Bug 1775057: encryption: keep last read key after migration for easier backup/restore 2020-11-16 08:28:56 UTC
Red Hat Bugzilla 1788895 unspecified NEW Add step to restart router pods after etcd(enabled encryption) backup/restore 2020-10-14 00:28:05 UTC

Description Xingxing Xia 2019-11-21 11:50:48 UTC
Description of problem:
Cluster is abnormal after etcd backup/restore when the backup is taken while the etcd encryption migration is in progress

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-20-225457

How reproducible:
Tried once by geliu (a single attempt takes a lot of time); I observed his attempt.

Steps to Reproduce:
Follow the test case OCP-25932:
1. Install a fresh env and create 500 projects; this creates 4500 secrets
2. Enable encryption:
oc patch apiserver/cluster -p '{"spec":{"encryption": {"type":"aescbc"}}}' --type merge
3. Repeatedly check the EncryptionMigrationControllerProgressing condition until its "status" becomes True and "reason" becomes "Migrating":
oc get kubeapiserver cluster -o json | jq -r '.status.conditions[] | select(.type == "EncryptionMigrationControllerProgressing")'
4. When the above condition is True, take an etcd backup by following https://docs.openshift.com/container-platform/4.2/backup_and_restore/backing-up-etcd.html
5. Follow https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html to restore
6. After the restore, check cluster
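Steps 2 and 3 above can be scripted so the mid-migration backup window is not missed. A minimal sketch, assuming `oc` is logged in to the cluster and `jq` is on PATH; the `wait_for_migration` name and the 10-second poll interval are illustrative, not from the bug:

```shell
#!/bin/sh
# Hypothetical helper: block until the etcd encryption migration is
# actually in progress, i.e. the point at which the mid-migration
# backup in step 4 should be taken.
wait_for_migration() {
  while :; do
    cond=$(oc get kubeapiserver cluster -o json \
      | jq -r '.status.conditions[]
               | select(.type == "EncryptionMigrationControllerProgressing")
               | "\(.status)/\(.reason)"')
    [ "$cond" = "True/Migrating" ] && return 0
    sleep 10
  done
}

# Step 2: turn on aescbc encryption, then wait for the migration:
#   oc patch apiserver/cluster -p '{"spec":{"encryption":{"type":"aescbc"}}}' --type merge
#   wait_for_migration && echo "migration in progress - take the etcd backup now"
```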

Actual results:
6. Cluster is abnormal:
oc get po -A | grep -vE "(Running|Completed)"; oc get co; oc get no
openshift-apiserver-operator             openshift-apiserver-operator-7d7bbd897f-465z4   0/1    CrashLoopBackOff   3       4h4m
openshift-cluster-node-tuning-operator   tuned-4lx6b                                     0/1    Error              0       3h54m
openshift-cluster-node-tuning-operator   tuned-89b86                                     0/1    Error              0       3h54m
openshift-cluster-version                cluster-version-operator-667d947df7-zjfrx       0/1    CrashLoopBackOff   1       4h4m
...many other pods...
kube-apiserver                           4.3.0-0.nightly-2019-11-20-225457               True   True               False   3h57m
...

More details, see attachment

Expected results:
6. Cluster is normal

Additional info:

Comment 1 Xingxing Xia 2019-11-21 11:58:49 UTC
Created attachment 1638414 [details]
Details of the abnormal cluster after backup and restore

Env kubeconfig: http://file.rdu.redhat.com/~xxia/env-for-bug-1775057.kubeconfig . The env will be preserved for Dev debugging for 1 to 2 days.

Comment 2 ge liu 2019-11-22 02:23:08 UTC
@xxia, it's kube-apiserver component, your cake! ^_^

Comment 4 Xingxing Xia 2019-12-05 03:46:20 UTC
Moving this bug forward as its clone bug was moved.

Comment 5 Xingxing Xia 2020-01-07 15:49:11 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1776811#c29 through c30 describe the known router pod restart issue. That is not fixed yet; 4.4 is presumably the same, so assigning back.

Comment 8 Andrew McDermott 2020-03-03 13:53:56 UTC
I tried the backup procedure but ran into: 

Not a blocker for 4.4 and waiting on: 

- https://bugzilla.redhat.com/show_bug.cgi?id=1788895
- https://bugzilla.redhat.com/show_bug.cgi?id=1807447

Moving to 4.5.

Comment 11 Andrew McDermott 2020-05-19 15:57:24 UTC
Moved to 4.6 until we can work out whether the workaround is still acceptable, or whether the worker nodes where the router runs need to be rebooted, or the router scaled down and back up to pick up the new endpoints.
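One hypothetical form of the restart workaround described above: delete the router pods so the deployment recreates them against the restored etcd state, then wait until none are outside Running/Completed (the same filter as the triage command in the bug description). The `restart_and_wait` helper name and the 5-second poll are illustrative, not from this bug:

```shell
#!/bin/sh
# Hypothetical workaround sketch: restart pods in a namespace and
# wait for them to settle after an etcd restore.
restart_and_wait() {
  ns=$1
  oc -n "$ns" delete pods --all   # the deployment recreates them
  while oc -n "$ns" get po --no-headers 2>/dev/null \
      | grep -vE '(Running|Completed)' >/dev/null; do
    sleep 5
  done
}

# Usage after the restore:
#   restart_and_wait openshift-ingress
```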

Comment 12 Andrew McDermott 2020-06-17 09:45:00 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 13 Andrew McDermott 2020-07-09 12:04:22 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 14 Andrew McDermott 2020-07-30 09:59:10 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 15 mfisher 2020-08-18 20:01:41 UTC
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

Comment 16 Andrew McDermott 2020-09-10 11:44:12 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 17 Andrew McDermott 2020-10-02 16:13:41 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 18 Andrew McDermott 2020-10-23 15:59:19 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 19 Andrew McDermott 2020-11-16 08:29:27 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 20 Xingxing Xia 2020-11-16 12:34:25 UTC
Andrew McDermott, regular 4.6 tests didn't hit this, so moving to VERIFIED. Thanks for tracking it.

