Description of problem:
The cluster is abnormal after an etcd backup/restore when the backup is taken while etcd encryption migration is in progress.

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-20-225457

How reproducible:
Tried once by geliu (a single run takes a long time); I observed the run.

Steps to Reproduce:
Follow test case OCP-25932:
1. Install a fresh env and create 500 projects; this creates 4500 secrets.
2. Enable encryption:
   oc patch apiserver/cluster -p '{"spec":{"encryption": {"type":"aescbc"}}}' --type merge
3. Repeatedly check the EncryptionMigrationControllerProgressing condition until its "status" becomes True and its "reason" becomes "Migrating" (a polling sketch follows this report):
   oc get kubeapiserver cluster -o json | jq -r '.status.conditions[] | select(.type == "EncryptionMigrationControllerProgressing")'
4. While the above condition is True, back up etcd by following https://docs.openshift.com/container-platform/4.2/backup_and_restore/backing-up-etcd.html
5. Restore by following https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
6. After the restore, check the cluster.

Actual results:
6. The cluster is abnormal:
oc get po -A | grep -vE "(Running|Completed)"; oc get co; oc get no
openshift-apiserver-operator             openshift-apiserver-operator-7d7bbd897f-465z4   0/1   CrashLoopBackOff   3   4h4m
openshift-cluster-node-tuning-operator   tuned-4lx6b                                     0/1   Error              0   3h54m
openshift-cluster-node-tuning-operator   tuned-89b86                                     0/1   Error              0   3h54m
openshift-cluster-version                cluster-version-operator-667d947df7-zjfrx       0/1   CrashLoopBackOff   1   4h4m
...many other pods...
kube-apiserver   4.3.0-0.nightly-2019-11-20-225457   True   True   False   3h57m
...
For more details, see the attachment.

Expected results:
6. The cluster is normal.

Additional info:
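For step 3, a minimal polling sketch in bash (illustrative only, not part of the test case); it assumes oc is logged in with cluster-admin access and that jq is installed:

  #!/bin/bash
  # Poll the kube-apiserver operator until encryption migration starts:
  # stop when EncryptionMigrationControllerProgressing reports status=True, reason=Migrating.
  while true; do
    cond=$(oc get kubeapiserver cluster -o json \
      | jq -r '.status.conditions[] | select(.type == "EncryptionMigrationControllerProgressing") | "\(.status) \(.reason)"')
    echo "$(date +%T) EncryptionMigrationControllerProgressing: ${cond}"
    if [ "${cond}" = "True Migrating" ]; then
      echo "Migration in progress -- take the etcd backup now (step 4)."
      break
    fi
    sleep 10
  done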
Created attachment 1638414 [details] Details of the abnormal cluster after backup and restore. Env kubeconfig: http://file.rdu.redhat.com/~xxia/env-for-bug-1775057.kubeconfig . The env will be preserved for Dev debugging for 1 to 2 days.
@xxia, it's the kube-apiserver component, your cake! ^_^
Moving this the same way its clone bug was moved.
https://bugzilla.redhat.com/show_bug.cgi?id=1776811#c29 through c30 describe the known router pod restart issue, which is not fixed yet. 4.4 presumably has the same issue, so assigning back.
I tried the backup procedure but ran into issues. Not a blocker for 4.4; waiting on:
- https://bugzilla.redhat.com/show_bug.cgi?id=1788895
- https://bugzilla.redhat.com/show_bug.cgi?id=1807447
Moving to 4.5.
Moved to 4.6 until we can work out whether the workaround is still acceptable, or whether the worker nodes where the router runs need to be rebooted, or the router needs to be scaled down and back up to pick up the new endpoints (a sketch of the scale down/up variant follows).
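For reference, a sketch of the scale down/up workaround under the usual defaults (IngressController "default" backed by deployment router-default in the openshift-ingress namespace; these names are assumptions, not confirmed in this bug):

  # Restart the router pods so they pick up the restored kube-apiserver endpoints.
  # Deleting the pods is equivalent and avoids the ingress operator reconciling
  # the replica count while the deployment is scaled down:
  oc -n openshift-ingress delete pod \
    -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default

  # Or the explicit scale down/up variant:
  oc -n openshift-ingress scale deployment/router-default --replicas=0
  oc -n openshift-ingress scale deployment/router-default --replicas=2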
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
Andrew McDermott, regular 4.6 tests didn't hit it, so moving to VERIFIED. Thanks for tracking it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633