Description of problem:
Deletion and creation of pods (3000) across multiple namespaces failed with an OVS flow request timeout on some pods, due to a mismatch in the pod IP.

(combined from similar events):
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dep-served-1-2-job-29-5c759c9849-hflhq_f5-served-ns-29_115a700e-8c69-4b0c-91aa-d20ba9093089_0(17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d): error adding pod f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq to CNI network "multus-cni-network": [f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq 17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d] [f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq 17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows '

Version-Release number of selected component (if applicable):
4.7.28 OVN - testing the patch https://bugzilla.redhat.com/show_bug.cgi?id=2001363

How reproducible:
Infrequently

Steps to Reproduce:
1. Deploy a healthy cluster with 110+ worker nodes
2. Update the OVN image with the patch
3. Create 3000+ pods across 35 namespaces
4. Delete and re-create the pods

Actual results:
Pods stuck in ContainerCreating state, with many timeouts while adding logical flows for the pod, due to a mismatch in the pod IP.
Expected results:
Pods should be created successfully.

Additional info:
IP mismatch between CNI and OVN:

_uuid               : c2f597fb-e63f-447b-a7ab-ae6a4f2f16d2
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : full
error               : []
external_ids        : {attached_mac="0a:58:0a:83:03:4a", iface-id=f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq, ip_addresses="10.131.3.74/23", ovn-installed="true", sandbox="40f1a64e8574f957a56e49262cf72a704b680c26dd65e3aab1bb6fddc4857f22"}

ovn-nbctl --no-leader-only find logical_switch_port name=f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq
_uuid               : 27bbbf68-913b-4629-b9ed-ae6847cd4ab6
addresses           : ["0a:58:0a:83:03:48 10.131.3.72"]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {namespace=f5-served-ns-29, pod="true"}
ha_chassis_group    : []
name                : f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq
options             : {requested-chassis=worker020-fc640}
parent_name         : []
port_security       : ["0a:58:0a:83:03:48 10.131.3.72"]
tag                 : []
tag_request         : []
type                : ""
up                  : true
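The mismatch in the records above (the OVS interface reports ip_addresses "10.131.3.74/23" while the OVN logical switch port carries "10.131.3.72") can be checked mechanically. A minimal sketch, assuming the field formats shown above; these helpers are hypothetical and not part of ovn-kubernetes:

```go
package main

import (
	"fmt"
	"strings"
)

// ipFromOVSInterface extracts the bare IP from the external_ids
// ip_addresses value on the OVS interface, e.g. "10.131.3.74/23".
func ipFromOVSInterface(ipAddresses string) string {
	return strings.SplitN(ipAddresses, "/", 2)[0]
}

// ipFromLSPAddresses extracts the IP from an OVN logical_switch_port
// addresses entry of the form "<mac> <ip>", e.g.
// "0a:58:0a:83:03:48 10.131.3.72".
func ipFromLSPAddresses(addresses string) string {
	fields := strings.Fields(addresses)
	if len(fields) < 2 {
		return ""
	}
	return fields[1]
}

func main() {
	// Values taken from the records in this bug report.
	cniIP := ipFromOVSInterface("10.131.3.74/23")
	ovnIP := ipFromLSPAddresses("0a:58:0a:83:03:48 10.131.3.72")
	if cniIP != ovnIP {
		fmt.Printf("IP mismatch: CNI has %s, OVN has %s\n", cniIP, ovnIP)
	}
}
```

Run against the data above, this reports the mismatch between 10.131.3.74 (CNI side) and 10.131.3.72 (OVN side).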
This is a hard case to hit, so moving severity to medium.

Basically what happened is: in the update-pod logic, we pass the current pod event to addLogicalPort. In addLogicalPort we assume that if annotations already exist for the pod's MAC/interface address, we can use them and need not update the annotations on the pod. This assumption is invalid, because the event may not reflect the current state of the pod. In other words, we could have a situation where:

1. A pod add event arrives; we annotate the pod with 10.0.0.2; assume the OVN execute fails.
2. Before the annotation lands, the pod is modified in some other way, signaling another pod update event.
3. The update event from step 2 is processed; the pod is annotated with 10.0.0.3, because that event was built from the original pod, before it was annotated with 10.0.0.2; assume the OVN execute fails again.
4. The event from step 1 is retried; since annotations exist on that pod object, nothing is re-annotated and 10.0.0.2 is found and used. The OVN logical port is configured with 10.0.0.2, and addLogicalPort succeeds.

Now the pod has 10.0.0.3 annotated and 10.0.0.2 in OVN. The CNI OpenFlow check will fail and the pod will never come up.
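The sequence above can be sketched in simplified form. The types, annotation key, and function shape below are illustrative only; the real logic lives in ovn-kubernetes' addLogicalPort and uses the Kubernetes pod object:

```go
package main

import "fmt"

// Pod models only what matters here: a snapshot of the pod's
// annotations as carried by a (possibly stale) event.
type Pod struct {
	Annotations map[string]string
}

// ipAnnotation is an illustrative key standing in for the real
// ovn-kubernetes pod-network annotation.
const ipAnnotation = "pod-ip"

// addLogicalPort (simplified) reuses an existing IP annotation if one
// is present on the event's pod object. This is the buggy assumption:
// the event may predate a later annotation, so the IP it carries can
// differ from what is currently annotated on the live pod.
func addLogicalPort(eventPod *Pod, allocate func() string) (ovnIP string) {
	if ip, ok := eventPod.Annotations[ipAnnotation]; ok {
		return ip // trust the (possibly stale) snapshot
	}
	ip := allocate()
	eventPod.Annotations[ipAnnotation] = ip
	return ip
}

func main() {
	// Step 1: the add event annotates 10.0.0.2, but the OVN execute fails.
	stale := &Pod{Annotations: map[string]string{ipAnnotation: "10.0.0.2"}}
	// Steps 2-3: a separate update event, built from the pre-annotation
	// pod, leads to the live pod being annotated with 10.0.0.3; the OVN
	// execute fails again.
	live := &Pod{Annotations: map[string]string{ipAnnotation: "10.0.0.3"}}
	// Step 4: the retry of event 1 succeeds and programs OVN with the
	// stale IP, while the live pod keeps the newer annotation.
	ovnIP := addLogicalPort(stale, func() string { return "unused" })
	fmt.Printf("OVN: %s, pod annotation: %s\n", ovnIP, live.Annotations[ipAnnotation])
	// Prints: OVN: 10.0.0.2, pod annotation: 10.0.0.3
}
```

A fix along these lines would presumably re-read the latest pod object (e.g. from the informer cache) before trusting an existing annotation, rather than relying on the snapshot carried by the event; that is an inference about the direction of the fix, not a detail stated in this report.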
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056