From the original GitHub issue:

**Describe the bug**

- Default installation on AWS: https://docs.okd.io/latest/installing/installing_aws/installing-aws-default.html
- Created a project and, within it, two sample applications (Python Django / nginx)
- Created a backup following https://docs.openshift.com/container-platform/4.7/backup_and_restore/backing-up-etcd.html
- Deleted the two sample apps
- Followed https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
- At step 11 I don't seem to get "all nodes at the latest revision":

  $ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
  3 nodes are at revision 3; 0 nodes have achieved new revision 4

**Version**

4.7.0-0.okd-2021-03-28-152009 on AWS

**How reproducible**

100% (tried twice on a clean installation)

**Log bundle**

https://nfdvmaoeikfjtviehegjfcb.s3-eu-west-1.amazonaws.com/must-gather.local.1084280979694287889.tar.gz

**Relevant kubelet log indicating OVN blocking installer pod launch**

Apr 07 18:27:06.081721 ip-10-0-173-39 hyperkube[379114]: E0407 18:27:06.081226 379114 kuberuntime_manager.go:767] createPodSandbox for pod "installer-4-ip-10-0-173-39.eu-west-1.compute.internal_openshift-etcd(44f9ba1c-1a01-41cf-9990-52912f090149)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-4-ip-10-0-173-39.eu-west-1.compute.internal_openshift-etcd_44f9ba1c-1a01-41cf-9990-52912f090149_0(452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c): [openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal 452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c] [openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal 452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c] failed to get annotations: pod "installer-4-ip-10-0-173-39.eu-west-1.compute.internal" not found
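For anyone debugging this on a live cluster, a minimal set of checks might look like the following (a sketch assuming cluster-admin access; the installer pod name is the one from the kubelet log above and is specific to this cluster):

```
# Re-check the rollout condition that the restore doc's step 11 watches
# (same jsonpath query as in the report above).
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

# See whether the revision-4 installer pod is stuck in ContainerCreating and
# whether its events show the "failed to create pod network sandbox" error.
$ oc -n openshift-etcd get pods -o wide
$ oc -n openshift-etcd describe pod installer-4-ip-10-0-173-39.eu-west-1.compute.internal

# The sandbox failure points at ovn-kubernetes, so check the OVN pods too.
$ oc -n openshift-ovn-kubernetes get pods -o wide
```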
The disruptive job has a test that validates the documented restore procedure. Once the disruptive job has been transitioned to the step registry (https://github.com/openshift/release/pull/17556) I'll add an ovn+disruptive job that should allow reproduction of the observed issue and validation of an eventual fix.
Unable to reproduce in CI on AWS. It's not clear this is a reproducible problem; it may come down to the difficulty of following the manual restore procedure.
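For context, the step that kicks off the rollout checked in step 11 is the forced etcd redeployment earlier in the documented procedure; from memory it is roughly the following, so treat this as a sketch rather than the authoritative doc text:

```
# Force a new etcd static-pod revision; the NodeInstallerProgressing condition
# should then advance until all nodes report the new revision.
$ oc patch etcd cluster --type=merge -p '{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}'
```

In the report above the new revision (4) exists but its installer pod never gets a network sandbox, so the condition never converges.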
I'm unable to reproduce this on 4.7. CI for a cluster configured with OVN is passing the automated backup/restore procedure.

Test: [sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should recover from a backup taken on one node and recovered on another [Serial]

Passing job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26110/pull-ci-openshift-origin-release-4.7-e2e-aws-disruptive/1387233835602677760
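For anyone who wants to exercise the same path outside of Prow, the test can be run directly against an existing cluster with the openshift-tests binary; a sketch, assuming KUBECONFIG points at a disposable cluster you are willing to break:

```
# Run just the disaster-recovery test named above. It is disruptive: it takes
# an etcd backup on one node and restores it on another.
$ export KUBECONFIG=/path/to/disposable-cluster/kubeconfig
$ openshift-tests run-test "[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should recover from a backup taken on one node and recovered on another [Serial]"
```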