Bug 2002372

Summary: Pod creation failed due to mismatched pod IP address in CNI and OVN
Product: OpenShift Container Platform Reporter: Murali Krishnasamy <murali>
Component: Networking Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes QA Contact: Murali Krishnasamy <murali>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: urgent CC: anusaxen, bbennett, dblack, smalleni, trozet, yprokule
Version: 4.7 Keywords: FastFix
Target Milestone: --- Flags: murali: needinfo-
murali: needinfo-
Target Release: 4.10.0   
Hardware: All   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-10 16:08:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2004340    

Description Murali Krishnasamy 2021-09-08 16:10:08 UTC
Description of problem:
Deleting and re-creating pods (3000) across multiple namespaces failed for some pods with an OVS flow request timeout, caused by a mismatch in the pod IP.

combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dep-served-1-2-job-29-5c759c9849-hflhq_f5-served-ns-29_115a700e-8c69-4b0c-91aa-d20ba9093089_0(17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d): error adding pod f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq to CNI network "multus-cni-network": [f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq 17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d] [f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq 17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows '


Version-Release number of selected component (if applicable):
4.7.28
OVN - testing the patch https://bugzilla.redhat.com/show_bug.cgi?id=2001363

How reproducible:
Less frequently

Steps to Reproduce:
1. Deploy a healthy cluster with 110+ worker nodes
2. Update OVN image with the patch
3. Create 3000+ pods across 35 namespaces
4. Delete and re-create pods

Actual results:
Pods are stuck in the `ContainerCreating` state with many timeouts while adding lflows for the pod, due to a mismatch in the pod IP.

Expected results:
Pod should be created successfully

Additional info:

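IP mismatch between CNI and OVN: the OVS interface record below has ip_addresses 10.131.3.74/23, while the OVN logical switch port has 10.131.3.72.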

_uuid               : c2f597fb-e63f-447b-a7ab-ae6a4f2f16d2
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : full
error               : []
external_ids        : {attached_mac="0a:58:0a:83:03:4a", iface-id=f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq, ip_addresses="10.131.3.74/23", ovn-installed="true", sandbox="40f1a64e8574f957a56e49262cf72a704b680c26dd65e3aab1bb6fddc4857f22"}

ovn-nbctl --no-leader-only find logical_switch_port name=f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq
_uuid               : 27bbbf68-913b-4629-b9ed-ae6847cd4ab6
addresses           : ["0a:58:0a:83:03:48 10.131.3.72"]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {namespace=f5-served-ns-29, pod="true"}
ha_chassis_group    : []
name                : f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq
options             : {requested-chassis=worker020-fc640}
parent_name         : []
port_security       : ["0a:58:0a:83:03:48 10.131.3.72"]
tag                 : []
tag_request         : []
type                : ""
up                  : true
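
The mismatch in the two records above can be checked mechanically. The following is a minimal Go sketch, not part of ovn-kubernetes: the helper names (ipFromCIDR, ipFromLSPAddress, ipsMatch) are hypothetical, and it assumes the ip_addresses value is a single CIDR and the logical switch port address entry has the form "<mac> <ip>", as in the output above.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// ipFromCIDR returns the bare IP from a CIDR string such as "10.131.3.74/23".
func ipFromCIDR(cidr string) (string, error) {
	ip, _, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", err
	}
	return ip.String(), nil
}

// ipFromLSPAddress returns the IP from a logical switch port address entry
// of the form "<mac> <ip>", e.g. "0a:58:0a:83:03:48 10.131.3.72".
func ipFromLSPAddress(addr string) (string, error) {
	fields := strings.Fields(addr)
	if len(fields) < 2 {
		return "", fmt.Errorf("unexpected address entry %q", addr)
	}
	ip := net.ParseIP(fields[1])
	if ip == nil {
		return "", fmt.Errorf("invalid IP in %q", addr)
	}
	return ip.String(), nil
}

// ipsMatch reports whether the IP on the OVS interface (CNI side) equals
// the IP on the OVN logical switch port.
func ipsMatch(ovsCIDR, lspAddr string) (bool, error) {
	ovsIP, err := ipFromCIDR(ovsCIDR)
	if err != nil {
		return false, err
	}
	lspIP, err := ipFromLSPAddress(lspAddr)
	if err != nil {
		return false, err
	}
	return ovsIP == lspIP, nil
}

func main() {
	// Values copied from the two records in this report.
	match, err := ipsMatch("10.131.3.74/23", "0a:58:0a:83:03:48 10.131.3.72")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("IPs match:", match)
}
```

With the values from this report, the check reports that the IPs do not match.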

Comment 1 Tim Rozet 2021-09-08 16:24:54 UTC
This is a hard case to hit, so moving severity to medium. Basically what happened is:
In the update pod logic, we pass the current pod event to
addLogicalPort. In addLogicalPort we assume that if the annotations
exist for the pod mac/ifaddr, then we use those and do not update
annotations on the pod. This assumption is invalid, because this event
may not be the current state of the pod. In other words we could have a
situation where:

1. A pod add event arrives and we annotate the pod with 10.0.0.2. Assume the OVN execute fails.
2. Before the annotation is done, the pod is modified in some other way, which signals another pod update event.
3. The pod update event from step 2 arrives. The pod is annotated with 10.0.0.3, because this event carries the original pod from before it was annotated with 10.0.0.2. Assume the OVN execute fails again.
4. The pod update event caused by step 1 arrives. Since annotations exist on that event, nothing is annotated and 10.0.0.2 is used. The OVN logical port is configured with 10.0.0.2 and addLogicalPort succeeds. Now the pod has 10.0.0.3 annotated and 10.0.0.2 in OVN. The CNI openflow check will fail and the pod will never come up.
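
The sequence above can be replayed with a small simulation. This is a simplified Go sketch under stated assumptions, not the ovn-kubernetes implementation: the pod and cluster types, the allocate helper, and this addLogicalPort signature are all hypothetical and model only the IP annotation and the logical port IP.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for a pod object; only the IP annotation is modeled.
type pod struct {
	annotatedIP string // empty means "not yet annotated"
}

// cluster models the two places that must agree on the pod IP.
type cluster struct {
	apiPod pod    // current pod state as stored in the API server
	ovnIP  string // IP configured on the OVN logical port
	nextIP int    // last octet of the next IP to hand out
}

// allocate hands out the next free IP (hypothetical allocator).
func (c *cluster) allocate() string {
	ip := fmt.Sprintf("10.0.0.%d", c.nextIP)
	c.nextIP++
	return ip
}

// addLogicalPort mimics the flawed assumption described in the comment:
// it trusts the annotation on the event snapshot instead of the current pod.
func (c *cluster) addLogicalPort(event pod, ovnSucceeds bool) {
	ip := event.annotatedIP
	if ip == "" {
		// No annotation on this snapshot: allocate and annotate the pod.
		ip = c.allocate()
		c.apiPod.annotatedIP = ip
	}
	if ovnSucceeds {
		c.ovnIP = ip
	}
}

// simulate replays steps 1 to 4 and returns the final annotation and OVN IP.
func simulate() (annotated, ovn string) {
	c := &cluster{nextIP: 2}

	addEvent := c.apiPod              // snapshot without annotation
	c.addLogicalPort(addEvent, false) // step 1: annotate 10.0.0.2, OVN fails
	afterAnnotate := c.apiPod         // snapshot carrying 10.0.0.2

	staleUpdate := pod{}                 // step 2: snapshot taken before the annotation
	c.addLogicalPort(staleUpdate, false) // step 3: annotate 10.0.0.3, OVN fails

	c.addLogicalPort(afterAnnotate, true) // step 4: reuses 10.0.0.2, OVN succeeds
	return c.apiPod.annotatedIP, c.ovnIP
}

func main() {
	annotated, ovn := simulate()
	fmt.Printf("pod annotation: %s, OVN logical port: %s\n", annotated, ovn)
}
```

In this model the pod ends up annotated with 10.0.0.3 while the logical port holds 10.0.0.2, which is the same kind of divergence seen in the records in the description.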

Comment 8 errata-xmlrpc 2022-03-10 16:08:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056