Bug 1947705 - After DR cluster restore, ovn appears to block launch of etcd installation pod
Summary: After DR cluster restore, ovn appears to block launch of etcd installation pod
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.z
Assignee: Maru Newby
QA Contact: ge liu
Depends On: 1886160
Reported: 2021-04-09 01:12 UTC by Maru Newby
Modified: 2021-05-10 22:44 UTC
CC: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-05-10 22:44:07 UTC
Target Upstream Version:


External Tracker: Github openshift/okd issue 586 (open) - "Disaster recovery on OVN stuck at 'failed to get annotations'" - last updated 2021-04-09 01:12:38 UTC

Description Maru Newby 2021-04-09 01:12:39 UTC
From original github issue:

Describe the bug

- Default installation on AWS 
  - https://docs.okd.io/latest/installing/installing_aws/installing-aws-default.html
- Created a project and, within it, 2 sample applications (Python Django / nginx)
- Created a backup following https://docs.openshift.com/container-platform/4.7/backup_and_restore/backing-up-etcd.html
- Deleted the 2 sample apps
- Followed https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
- At step 11 I don't seem to get "all nodes at the latest revision"
 $ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

 3 nodes are at revision 3; 0 nodes have achieved new revision 4
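The condition message above can be checked programmatically rather than eyeballed. Below is a minimal sketch that parses the reason/message output of the NodeInstallerProgressing condition; the message format is assumed from what this report shows ("N nodes are at revision M; K nodes have achieved new revision R") and is not guaranteed by any API contract:

```python
import re

def parse_progress(message: str):
    """Parse counts from a NodeInstallerProgressing message.

    Format assumed from this report:
      "3 nodes are at revision 3; 0 nodes have achieved new revision 4"
    Returns (at_old, old_rev, at_new, new_rev), or None when no pending
    "new revision" clause is present.
    """
    m = re.search(
        r"(\d+) nodes are at revision (\d+); "
        r"(\d+) nodes have achieved new revision (\d+)",
        message,
    )
    return tuple(int(g) for g in m.groups()) if m else None

def rollout_complete(message: str) -> bool:
    # A finished rollout reports only "N nodes are at revision M",
    # with no pending "new revision" clause.
    return parse_progress(message) is None
```

On the output quoted above, `parse_progress` yields (3, 3, 0, 4) and `rollout_complete` returns False, matching the reporter's observation that the cluster never reached "all nodes at the latest revision".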

4.7.0-0.okd-2021-03-28-152009 on AWS

How reproducible
100% (tried twice on clean installation)

Log bundle

Relevant kubelet log indicating OVN blocking installer pod launch:

Apr 07 18:27:06.081721 ip-10-0-173-39 hyperkube[379114]: E0407 18:27:06.081226  379114 kuberuntime_manager.go:767] createPodSandbox for pod "installer-4-ip-10-0-173-39.eu-west-1.compute.internal_openshift-etcd(44f9ba1c-1a01-41cf-9990-52912f090149)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-4-ip-10-0-173-39.eu-west-1.compute.internal_openshift-etcd_44f9ba1c-1a01-41cf-9990-52912f090149_0(452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c): [openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal 452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c] [openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal 452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c] failed to get annotations: pod "installer-4-ip-10-0-173-39.eu-west-1.compute.internal" not found
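The CNI error above comes from ovn-kubernetes waiting for the pod's OVN network annotation and instead finding that the pod object no longer exists, which is consistent with CNI setup racing an etcd restore that rolled the API state back. The following is an illustrative sketch of that wait-for-annotation pattern, not the actual ovn-kubernetes code; the annotation key and the PodNotFound/getter shape are assumptions for the example:

```python
import time

class PodNotFound(Exception):
    """Raised by the getter when the pod object is absent from the API."""

def wait_for_ovn_annotation(get_pod, pod_key, timeout=30.0, interval=1.0):
    """Poll a pod until its OVN network annotation appears.

    get_pod(pod_key) returns the pod's annotation dict, or raises
    PodNotFound when the pod no longer exists -- the situation behind
    the report's "failed to get annotations: pod ... not found".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            annotations = get_pod(pod_key)
        except PodNotFound:
            # Surface the same failure mode seen in the kubelet log.
            raise RuntimeError(
                f"failed to get annotations: pod {pod_key!r} not found"
            )
        if "k8s.ovn.org/pod-networks" in annotations:
            return annotations["k8s.ovn.org/pod-networks"]
        time.sleep(interval)
    raise TimeoutError(f"timed out waiting for OVN annotation on {pod_key!r}")
```

In the restore scenario, the installer pod referenced by the sandbox request is missing from the restored API view, so the lookup fails immediately rather than timing out, and the sandbox creation is rejected with CNI status 400.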

Comment 1 Maru Newby 2021-04-09 01:15:43 UTC
The disruptive job has a test that validates the documented restore procedure. Once the disruptive job has been transitioned to the step registry (https://github.com/openshift/release/pull/17556) I'll add an ovn+disruptive job that should allow reproduction of the observed issue and validation of an eventual fix.

Comment 2 Maru Newby 2021-04-19 17:40:01 UTC
Unable to reproduce in CI on AWS. It is not clear that this is a reproducible problem; it may come down to the difficulty of following the manual restore procedure.

Comment 3 Maru Newby 2021-04-28 20:41:01 UTC
I'm unable to reproduce this on 4.7. CI for a cluster configured with OVN is passing the automated backup/restore procedure.

Test: [sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should recover from a backup taken on one node and recovered on another [Serial]

Passing Job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26110/pull-ci-openshift-origin-release-4.7-e2e-aws-disruptive/1387233835602677760
