Bug 2019809 - [OVN][Upgrade] After upgrade to 4.7.34 ovnkube-master pods are in CrashLoopBackOff/ContainerCreating and other multiple issues at OVS/OVN level
Summary: [OVN][Upgrade] After upgrade to 4.7.34 ovnkube-master pods are in CrashLoopBa...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 1974403
Blocks: 2027864
 
Reported: 2021-11-03 11:37 UTC by Andre Costa
Modified: 2023-09-18 04:27 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2027864
Environment:
Last Closed: 2022-03-10 16:24:41 UTC
Target Upstream Version:
Embargoed:
mateusz.bacal: needinfo+


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 851 0 None Merged Bug 2019809: [DownstreamMerge] 11-29-21 2021-12-01 15:29:17 UTC
Github ovn-org ovn-kubernetes pull 2684 0 None Merged Fixes race between node handler and pod sync 2021-12-01 15:29:28 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:25:01 UTC

Comment 33 Łukasz Wrzesiński 2021-11-05 17:01:49 UTC
Hi,

We have tested the provided workaround:
```
# For each master node: remove the OVN NB/SB database files on the host,
# then delete (restart) the ovnkube-master pod scheduled on that node.
for master_node in $(oc get nodes --selector="node-role.kubernetes.io/master"="" \
    -o jsonpath='{range .items[*].metadata}{.name}{"\n"}{end}'); do
  oc debug node/${master_node} -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && \
  oc delete pod --field-selector spec.nodeName=${master_node} \
    -n openshift-ovn-kubernetes --selector=app=ovnkube-master
done
```
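
For reference (not part of the original workaround), a minimal way to confirm that the recreated ovnkube-master pods come back up before re-testing workloads, using the same namespace and label selector as above:
```
# Wait for the recreated ovnkube-master pods to report Ready, then list them.
oc -n openshift-ovn-kubernetes wait pod --selector=app=ovnkube-master \
  --for=condition=Ready --timeout=300s
oc -n openshift-ovn-kubernetes get pods --selector=app=ovnkube-master -o wide
```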

But after that, all freshly created pods were failing with errors similar to the one below:
```
3s          Warning   FailedCreatePodSandBox         pod/vault-1                               Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_vault-1_mcp-vault_cd8a898c-fd2d-4605-972d-c03afe6eeadb_0(ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa): error adding pod mcp-vault_vault-1 to CNI network "multus-cni-network": [mcp-vault/vault-1:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] [mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
```

This is a 4-node cluster with 1 worker and 3 master/worker nodes.

The ovnkube-master pods kept restarting after the step above.

After the additional step of removing all pods in openshift-ovn-kubernetes, the fresh and previously deleted pods were finally created successfully:
oc delete pod --all -n openshift-ovn-kubernetes


Additional observation (I was looking for information at the pod description level to identify the "broken" pods and automate deleting them).

Example from a working pod:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.130.1.124/23"],"mac_address":"0a:58:0a:82:01:7c","gateway_ips":["10.130.0.1"],"ip_address":"10.130.1.124/23","gateway_ip":"10.130.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
```

Example from a broken pod:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.0.38/23"],"mac_address":"0a:58:0a:81:00:26","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.38/23","gateway_ip":"10.129.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
```

And it looks like, on all working pods (which are not on hostNetwork):
- annotations["k8s.ovn.org/pod-networks"].default.mac_address == annotations["k8s.v1.cni.cncf.io/network-status"][0].mac

For all non-working pods (I've checked all clusters that have this issue) this condition is false: the MACs are different. The same applies to pod-networks/default/ip_addresses vs network-status/ips, but that is a CIDR string vs an array, so it is harder to compare; the MACs work fine.

So for now we are testing a workaround that checks this condition on any pod that has container restarts. If the MACs differ, we run oc delete pod on that pod. Tested on 2 clusters so far (different from 131 and 150 from case https://access.redhat.com/support/cases/#/case/03068154) and it fixed the issue there. A rough sketch of this check is below.
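
For illustration, a sketch of what automating this check could look like (assuming jq is available on the client and that scanning all namespaces is acceptable; the annotation names, the restart-count condition, the hostNetwork exclusion, and the oc delete pod action are taken from the observations above):
```
# List pods whose OVN pod-networks MAC does not match the Multus
# network-status MAC, restricted to non-hostNetwork pods that have
# container restarts, then delete (recreate) each mismatched pod.
oc get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.spec.hostNetwork != true)
  | select(([.status.containerStatuses[]?.restartCount] | add // 0) > 0)
  | select(.metadata.annotations["k8s.ovn.org/pod-networks"] and
           .metadata.annotations["k8s.v1.cni.cncf.io/network-status"])
  | select((.metadata.annotations["k8s.ovn.org/pod-networks"] | fromjson | .default.mac_address)
        != (.metadata.annotations["k8s.v1.cni.cncf.io/network-status"] | fromjson | .[0].mac))
  | "\(.metadata.namespace) \(.metadata.name)"
' | while read -r ns pod; do
  echo "MAC mismatch on ${ns}/${pod}, deleting pod"
  oc delete pod "${pod}" -n "${ns}"
done
```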

We are going to run tests over the weekend with this workaround placed at the critical points where drains/reboots happen and will see the results.

What do you think about this workaround?

Best Regards,
Łukasz Wrzesiński

Comment 63 Tim Rozet 2021-11-29 15:37:36 UTC
Posted a fix upstream for the duplicate IP issue:
https://github.com/ovn-org/ovn-kubernetes/pull/2684

As Andrew mentioned, we will also need to pull in the CNI fixes from:
https://github.com/openshift/ovn-kubernetes/pull/686

Comment 69 Mike Fiedler 2021-12-06 15:59:17 UTC
Marking this verified based on comment 68

Comment 72 errata-xmlrpc 2022-03-10 16:24:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 73 Red Hat Bugzilla 2023-09-18 04:27:39 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

