Bug 2019809 - [OVN][Upgrade] After upgrade to 4.7.34 ovnkube-master pods are in CrashLoopBackOff/ContainerCreating and other multiple issues at OVS/OVN level
Summary: [OVN][Upgrade] After upgrade to 4.7.34 ovnkube-master pods are in CrashLoopBa...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 1974403
Blocks: 2027864
 
Reported: 2021-11-03 11:37 UTC by Andre Costa
Modified: 2023-09-18 04:27 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2027864
Environment:
Last Closed: 2022-03-10 16:24:41 UTC
Target Upstream Version:
Embargoed:
mateusz.bacal: needinfo+


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 851 0 None Merged Bug 2019809: [DownstreamMerge] 11-29-21 2021-12-01 15:29:17 UTC
Github ovn-org ovn-kubernetes pull 2684 0 None Merged Fixes race between node handler and pod sync 2021-12-01 15:29:28 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:25:01 UTC

Comment 33 Łukasz Wrzesiński 2021-11-05 17:01:49 UTC
Hi,

We have tested the provided workaround:
```
# For each master node: remove the OVN NB/SB database files on the host,
# then delete (restart) the ovnkube-master pod scheduled on that node.
for master_node in $(oc get nodes --selector="node-role.kubernetes.io/master"="" \
    -o jsonpath='{range .items[*].metadata}{.name}{"\n"}{end}'); do
  oc debug node/${master_node} -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && \
  oc delete pod --field-selector spec.nodeName=${master_node} \
    -n openshift-ovn-kubernetes --selector=app=ovnkube-master
done
```
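
For reference (not part of the original workaround), a minimal way to confirm that the recreated ovnkube-master pods come back up before re-testing workloads, using the same namespace and label selector as above:
```
# Wait for the recreated ovnkube-master pods to report Ready, then list them.
oc -n openshift-ovn-kubernetes wait pod --selector=app=ovnkube-master \
  --for=condition=Ready --timeout=300s
oc -n openshift-ovn-kubernetes get pods --selector=app=ovnkube-master -o wide
```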

But after that, all freshly created pods were failing with errors similar to the one below:
```
3s          Warning   FailedCreatePodSandBox         pod/vault-1                               Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_vault-1_mcp-vault_cd8a898c-fd2d-4605-972d-c03afe6eeadb_0(ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa): error adding pod mcp-vault_vault-1 to CNI network "multus-cni-network": [mcp-vault/vault-1:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] [mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
```

This is a 4-node cluster with 1 worker and 3 master/worker nodes.

The ovnkube-master pods kept restarting after the step above.

After the additional step of removing all pods in openshift-ovn-kubernetes, the fresh and previously deleted pods were finally created successfully:
oc delete pod --all -n openshift-ovn-kubernetes


Additional observation (I was looking for information at the pod description level to identify the "broken" pods and automate deleting them).

Example from a working pod:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.130.1.124/23"],"mac_address":"0a:58:0a:82:01:7c","gateway_ips":["10.130.0.1"],"ip_address":"10.130.1.124/23","gateway_ip":"10.130.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
```

Example from a broken pod:
```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.0.38/23"],"mac_address":"0a:58:0a:81:00:26","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.38/23","gateway_ip":"10.129.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
```

And it looks like, on all working pods (which are not on hostNetwork):
- annotations["k8s.ovn.org/pod-networks"].default.mac_address == annotations["k8s.v1.cni.cncf.io/network-status"][0].mac

For all non-working pods (I've checked all clusters that have this issue) this condition is false: the MACs are different. The same applies to pod-networks/default/ip_addresses vs network-status/ips, but that is a CIDR string vs an array, so it is harder to compare; the MACs work fine.

So for now we are testing a workaround that checks this condition on any pod that has container restarts. If the MACs differ, we run oc delete pod on that pod. Tested on 2 clusters so far (different from 131 and 150 from case https://access.redhat.com/support/cases/#/case/03068154) and it fixed the issue there. A rough sketch of this check is below.
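
For illustration, a sketch of what automating this check could look like (assuming jq is available on the client and that scanning all namespaces is acceptable; the annotation names, the restart-count condition, the hostNetwork exclusion, and the oc delete pod action are taken from the observations above):
```
# List pods whose OVN pod-networks MAC does not match the Multus
# network-status MAC, restricted to non-hostNetwork pods that have
# container restarts, then delete (recreate) each mismatched pod.
oc get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.spec.hostNetwork != true)
  | select(([.status.containerStatuses[]?.restartCount] | add // 0) > 0)
  | select(.metadata.annotations["k8s.ovn.org/pod-networks"] and
           .metadata.annotations["k8s.v1.cni.cncf.io/network-status"])
  | select((.metadata.annotations["k8s.ovn.org/pod-networks"] | fromjson | .default.mac_address)
        != (.metadata.annotations["k8s.v1.cni.cncf.io/network-status"] | fromjson | .[0].mac))
  | "\(.metadata.namespace) \(.metadata.name)"
' | while read -r ns pod; do
  echo "MAC mismatch on ${ns}/${pod}, deleting pod"
  oc delete pod "${pod}" -n "${ns}"
done
```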

We are going to run tests over the weekend with this workaround placed at the critical points where drains/reboots happen and will see the results.

What do you think about this workaround?

Best Regards,
Łukasz Wrzesiński

Comment 63 Tim Rozet 2021-11-29 15:37:36 UTC
Posted a fix upstream for the duplicate IP issue:
https://github.com/ovn-org/ovn-kubernetes/pull/2684

As Andrew mentioned, we will also need to pull in the CNI fixes from:
https://github.com/openshift/ovn-kubernetes/pull/686

Comment 69 Mike Fiedler 2021-12-06 15:59:17 UTC
Marking this verified based on comment 68

Comment 72 errata-xmlrpc 2022-03-10 16:24:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 73 Red Hat Bugzilla 2023-09-18 04:27:39 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

