Hi,

We have tested the provided workaround:

```
for master_node in $(oc get nodes --selector="node-role.kubernetes.io/master"="" \
    -o jsonpath='{range .items[*].metadata}{.name}{"\n"}{end}'); do
  oc debug node/${master_node} -- chroot /host sh -c 'rm -f /var/lib/ovn/etc/ovn*_db.db' && \
  oc delete pod --field-selector spec.nodeName=${master_node} \
    -n openshift-ovn-kubernetes --selector=app=ovnkube-master
done
```

But after that, all freshly created pods were failing with errors similar to the one below:

```
3s   Warning   FailedCreatePodSandBox   pod/vault-1   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_vault-1_mcp-vault_cd8a898c-fd2d-4605-972d-c03afe6eeadb_0(ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa): error adding pod mcp-vault_vault-1 to CNI network "multus-cni-network": [mcp-vault/vault-1:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] [mcp-vault/vault-1 ba8b51fb20056950ceb2e172f294bf991c7cb5d1ff67cb673460c9797d80a7fa] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
```

This is a 4-node cluster with 1 worker and 3 master/worker nodes. The ovnkube-master pods kept restarting after the step above. After the additional step of removing all pods in openshift-ovn-kubernetes, the fresh and re-created pods were finally created successfully:

```
oc delete pod --all -n openshift-ovn-kubernetes
```

Additional observation (I was looking for something at the pod description level to identify the "broken" pods and automate killing them).

Example from a working pod:

```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.130.1.124/23"],"mac_address":"0a:58:0a:82:01:7c","gateway_ips":["10.130.0.1"],"ip_address":"10.130.1.124/23","gateway_ip":"10.130.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.130.1.124"
          ],
          "mac": "0a:58:0a:82:01:7c",
          "default": true,
          "dns": {}
      }]
```

Example from a broken pod:

```
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.0.38/23"],"mac_address":"0a:58:0a:81:00:26","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.38/23","gateway_ip":"10.129.0.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.129.0.25"
          ],
          "mac": "0a:58:0a:81:00:19",
          "default": true,
          "dns": {}
      }]
```

It looks like on all working pods (which are not on hostNetwork):

- the mac_address in the k8s.ovn.org/pod-networks annotation (default.mac_address) equals the mac in the first entry of the k8s.v1.cni.cncf.io/network-status annotation

For all non-working pods (I've checked all clusters that have this issue) this condition is false: the MACs differ. The same applies to the ip_addresses in pod-networks vs. the ips in network-status, but that is a CIDR string vs. an array, so it is harder to compare; the MAC check works fine.

So for now we are testing a workaround that checks this condition on any pod that has container restarts; if the MACs differ, we run oc delete pod on that pod.
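A minimal sketch of how that check could be automated, assuming jq is available (illustrative only, not the exact script we are running):

```
#!/usr/bin/env bash
# For every non-hostNetwork pod with container restarts, compare the MAC from
# the k8s.ovn.org/pod-networks annotation with the MAC from the first entry of
# the k8s.v1.cni.cncf.io/network-status annotation, and delete the pod when
# they differ. Sketch only; error handling is minimal.
oc get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.spec.hostNetwork != true)
  | select(([.status.containerStatuses[]?.restartCount] | add // 0) > 0)
  | . as $p
  | ($p.metadata.annotations["k8s.ovn.org/pod-networks"] // empty
     | fromjson? | .default.mac_address) as $ovn_mac
  | ($p.metadata.annotations["k8s.v1.cni.cncf.io/network-status"] // empty
     | fromjson? | .[0].mac) as $cni_mac
  | select($ovn_mac != $cni_mac)
  | "\($p.metadata.namespace) \($p.metadata.name)"
' | while read -r ns pod; do
  oc delete pod -n "$ns" "$pod"
done
```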
Tested on 2 clusters so far (different from 131 and 150 from the case https://access.redhat.com/support/cases/#/case/03068154), and it fixed the issue there. We are going to run tests over the weekend with this workaround placed in critical places where drains/reboots happen and will see the results. What do you think about this workaround?

Best Regards,
Łukasz Wrzesiński
Posted a fix upstream for the duplicate IP issue: https://github.com/ovn-org/ovn-kubernetes/pull/2684 As Andrew mentioned, we will also need to pull in the CNI fixes from: https://github.com/openshift/ovn-kubernetes/pull/686
Marking this verified based on comment 68.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days