Bug 2019809

Summary: [OVN][Upgrade] After upgrade to 4.7.34 ovnkube-master pods are in CrashLoopBackOff/ContainerCreating and other multiple issues at OVS/OVN level
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.7
Target Milestone: ---
Target Release: 4.10.0
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Triaged
Reporter: Andre Costa <andcosta>
Assignee: Tim Rozet <trozet>
QA Contact: Anurag saxena <anusaxen>
CC: astoycos, bpickard, cldavey, cpassare, ffernand, lukasz.wrzesinski, mateusz.bacal, mifiedle, obockows, openshift-bugs-escalate, palonsor, pdiak, trozet
Flags: mateusz.bacal: needinfo+
Doc Type: If docs needed, set a value
Clones: 2027864 (view as bug list)
Type: Bug
Last Closed: 2022-03-10 16:24:41 UTC
Bug Depends On: 1974403
Bug Blocks: 2027864

Comment 33 Łukasz Wrzesiński 2021-11-05 17:01:49 UTC
Hi,

We have tested the provided workaround:
```
for master_node in $(oc get nodes --selector="node-role.kubernetes.io/master"="" -o jsonpath='{range .items[*].metadata}{.name}{"\n"}{end}'); do \
  oc debug node/${master_node} -- chroot /host sh \
   -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && \
   oc delete pod --field-selector spec.nodeName=${master_node} -n openshift-ovn-kubernetes --selector=app=ovnkube-master; \
done
```

However, after that, all freshly created pods were failing with errors similar to the one below:
```
3s          Warning   FailedCreatePodSandBox         pod/vault-1                               Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_vault-1_mcp-vault_cd8a898c-fd2d-4605-972d-c03afe6eeadb_0(ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa): error adding pod mcp-vault_vault-1 to CNI network "multus-cni-network": [mcp-vault/vault-1:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] [mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
```

This is a 4-node cluster with 1 worker and 3 master/worker nodes.

The ovnkube-master pods kept restarting after the step above.

After the additional step of removing all pods in openshift-ovn-kubernetes, the new and previously deleted pods were finally created successfully:
oc delete pod --all -n openshift-ovn-kubernetes
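
As a sanity check after that mass deletion we wait until everything in the namespace is Ready again before re-testing pod creation. This is just a sketch; the timeout value is arbitrary:
```
oc wait pod --all -n openshift-ovn-kubernetes --for=condition=Ready --timeout=300s
```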


Additional observation (I was looking for information at the pod-description level to identify the "broken" pods and to automate deleting them).

Example from a working pod:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.130.1.124/23"],"mac_address":"0a:58:0a:82:01:7c","gateway_ips":["10.130.0.1"],"ip_address":"10.130.1.124/23","gateway_ip":"1
0.130.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
```

Example from a broken pod:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.0.38/23"],"mac_address":"0a:58:0a:81:00:26","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.38/23","gateway_ip":"10.
129.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
```

And it looks like on all working pods (which are not on hostNetwork):
- the MAC in the k8s.ovn.org/pod-networks annotation (default.mac_address) == the MAC in the k8s.v1.cni.cncf.io/network-status annotation ([0].mac)

For all non-working pods (I've checked all clusters that have this issue) this condition is false: the MACs are different. The same applies to pod-networks/default/ip_addresses vs. network-status/ips, but that is a CIDR string vs. an array, so it is harder to compare; the MAC check works fine.
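
As a quick manual check, something like the following prints both annotations for a single pod so the MACs can be compared by eye. This is only a sketch using jq; the pod and namespace names are just the vault-1/mcp-vault example from the error above:
```
oc get pod vault-1 -n mcp-vault -o json \
  | jq -r '.metadata.annotations["k8s.ovn.org/pod-networks"],
           .metadata.annotations["k8s.v1.cni.cncf.io/network-status"]'
```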

So for now we are testing a workaround that checks this condition on any pod that has container restarts; if the MACs differ, we run oc delete pod on that pod (a rough sketch of such a sweep is shown below). Tested on 2 clusters so far (different from 131 and 150 from the case https://access.redhat.com/support/cases/#/case/03068154) and it fixed the issue there.
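
Roughly, the sweep looks like the following. This is only a sketch under our assumptions (jq available, cluster-admin kubeconfig, restartCount > 0 as the trigger), not the exact script we run:
```
#!/bin/bash
# Sketch: find non-hostNetwork pods with container restarts whose OVN-assigned MAC
# (k8s.ovn.org/pod-networks) differs from the MAC reported by CNI
# (k8s.v1.cni.cncf.io/network-status), and delete them so they are re-created.
oc get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.spec.hostNetwork != true)
  | select(([.status.containerStatuses[]?.restartCount] | add // 0) > 0)
  | select(.metadata.annotations["k8s.ovn.org/pod-networks"] != null)
  | select(.metadata.annotations["k8s.v1.cni.cncf.io/network-status"] != null)
  | select((.metadata.annotations["k8s.ovn.org/pod-networks"] | fromjson | .default.mac_address)
           != (.metadata.annotations["k8s.v1.cni.cncf.io/network-status"] | fromjson | .[0].mac))
  | "\(.metadata.namespace) \(.metadata.name)"
' | while read -r ns name; do
  oc delete pod "$name" -n "$ns"
done
```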

We are going to run tests over the weekend with this workaround placed in the critical places where drains/reboots happen and will see the results.

What do you think about this workaround?

Best Regards,
Łukasz Wrzesiński

Comment 63 Tim Rozet 2021-11-29 15:37:36 UTC
Posted a fix upstream for the duplicate IP issue:
https://github.com/ovn-org/ovn-kubernetes/pull/2684

As Andrew mentioned, we will also need to pull in the CNI fixes from:
https://github.com/openshift/ovn-kubernetes/pull/686

Comment 69 Mike Fiedler 2021-12-06 15:59:17 UTC
Marking this verified based on comment 68

Comment 72 errata-xmlrpc 2022-03-10 16:24:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 73 Red Hat Bugzilla 2023-09-18 04:27:39 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days