Bug 1947705 - After DR cluster restore, ovn appears to block launch of etcd installation pod
Summary: After DR cluster restore, ovn appears to block launch of etcd installation pod
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.z
Assignee: Maru Newby
QA Contact: ge liu
Depends On: 1886160
Reported: 2021-04-09 01:12 UTC by Maru Newby
Modified: 2021-05-10 22:44 UTC
CC: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-05-10 22:44:07 UTC
Target Upstream Version:


External Tracker: Github openshift/okd issue 586 (open) - "Disaster recovery on OVN stuck at 'failed to get annotations'" - last updated 2021-04-09 01:12:38 UTC

Description Maru Newby 2021-04-09 01:12:39 UTC
From original github issue:

Describe the bug

- Default installation on AWS 
  - https://docs.okd.io/latest/installing/installing_aws/installing-aws-default.html
- Created a project and, within it, 2 sample applications (Python Django / nginx)
- Created a backup following https://docs.openshift.com/container-platform/4.7/backup_and_restore/backing-up-etcd.html
- Deleted the 2 sample apps
- Followed https://docs.openshift.com/container-platform/4.7/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
- At step 11 I don't seem to get "all nodes at the latest revision"
 $ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

 3 nodes are at revision 3; 0 nodes have achieved new revision 4
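The condition message above can be checked programmatically rather than eyeballed. Below is a minimal sketch that parses the reason/message output of the NodeInstallerProgressing condition; the message format is assumed from what this report shows ("N nodes are at revision M; K nodes have achieved new revision R") and is not guaranteed by any API contract:

```python
import re

def parse_progress(message: str):
    """Parse counts from a NodeInstallerProgressing message.

    Format assumed from this report:
      "3 nodes are at revision 3; 0 nodes have achieved new revision 4"
    Returns (at_old, old_rev, at_new, new_rev), or None when no pending
    "new revision" clause is present.
    """
    m = re.search(
        r"(\d+) nodes are at revision (\d+); "
        r"(\d+) nodes have achieved new revision (\d+)",
        message,
    )
    return tuple(int(g) for g in m.groups()) if m else None

def rollout_complete(message: str) -> bool:
    # A finished rollout reports only "N nodes are at revision M",
    # with no pending "new revision" clause.
    return parse_progress(message) is None
```

On the output quoted above, `parse_progress` yields (3, 3, 0, 4) and `rollout_complete` returns False, matching the reporter's observation that the cluster never reached "all nodes at the latest revision".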

4.7.0-0.okd-2021-03-28-152009 on AWS

How reproducible
100% (tried twice on clean installation)

Log bundle

Relevant kubelet log indicating OVN blocking installer pod launch:

Apr 07 18:27:06.081721 ip-10-0-173-39 hyperkube[379114]: E0407 18:27:06.081226  379114 kuberuntime_manager.go:767] createPodSandbox for pod "installer-4-ip-10-0-173-39.eu-west-1.compute.internal_openshift-etcd(44f9ba1c-1a01-41cf-9990-52912f090149)" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-4-ip-10-0-173-39.eu-west-1.compute.internal_openshift-etcd_44f9ba1c-1a01-41cf-9990-52912f090149_0(452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c): [openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal 452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c] [openshift-etcd/installer-4-ip-10-0-173-39.eu-west-1.compute.internal 452bff968561e5c1997654c20e3a7fdff3d1325ee4f290b03c5366321a6d427c] failed to get annotations: pod "installer-4-ip-10-0-173-39.eu-west-1.compute.internal" not found
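The CNI error above comes from ovn-kubernetes waiting for the pod's OVN network annotation and instead finding that the pod object no longer exists, which is consistent with CNI setup racing an etcd restore that rolled the API state back. The following is an illustrative sketch of that wait-for-annotation pattern, not the actual ovn-kubernetes code; the annotation key and the PodNotFound/getter shape are assumptions for the example:

```python
import time

class PodNotFound(Exception):
    """Raised by the getter when the pod object is absent from the API."""

def wait_for_ovn_annotation(get_pod, pod_key, timeout=30.0, interval=1.0):
    """Poll a pod until its OVN network annotation appears.

    get_pod(pod_key) returns the pod's annotation dict, or raises
    PodNotFound when the pod no longer exists -- the situation behind
    the report's "failed to get annotations: pod ... not found".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            annotations = get_pod(pod_key)
        except PodNotFound:
            # Surface the same failure mode seen in the kubelet log.
            raise RuntimeError(
                f"failed to get annotations: pod {pod_key!r} not found"
            )
        if "k8s.ovn.org/pod-networks" in annotations:
            return annotations["k8s.ovn.org/pod-networks"]
        time.sleep(interval)
    raise TimeoutError(f"timed out waiting for OVN annotation on {pod_key!r}")
```

In the restore scenario, the installer pod referenced by the sandbox request is missing from the restored API view, so the lookup fails immediately rather than timing out, and the sandbox creation is rejected with CNI status 400.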

Comment 1 Maru Newby 2021-04-09 01:15:43 UTC
The disruptive job has a test that validates the documented restore procedure. Once the disruptive job has been transitioned to the step registry (https://github.com/openshift/release/pull/17556) I'll add an ovn+disruptive job that should allow reproduction of the observed issue and validation of an eventual fix.

Comment 2 Maru Newby 2021-04-19 17:40:01 UTC
Unable to reproduce in CI on AWS. It is not clear that this is a reproducible problem; it may come down to the difficulty of following the manual restore procedure.

Comment 3 Maru Newby 2021-04-28 20:41:01 UTC
I'm unable to reproduce this on 4.7. CI for a cluster configured with OVN is passing the automated backup/restore procedure.

Test: [sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should recover from a backup taken on one node and recovered on another [Serial]

Passing Job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26110/pull-ci-openshift-origin-release-4.7-e2e-aws-disruptive/1387233835602677760
