Bug 2002372 - Pod creation failed due to mismatched pod IP address in CNI and OVN
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: All
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Murali Krishnasamy
URL:
Whiteboard:
Depends On:
Blocks: 2004340
Reported: 2021-09-08 16:10 UTC by Murali Krishnasamy
Modified: 2022-03-10 16:09 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:08:57 UTC
Target Upstream Version:
Embargoed:
murali: needinfo-
murali: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 735 0 None None None 2021-09-10 16:39:31 UTC
Github ovn-org ovn-kubernetes pull 2477 0 None None None 2021-09-08 16:27:08 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:09:24 UTC

Description Murali Krishnasamy 2021-09-08 16:10:08 UTC
Description of problem:
Deleting and re-creating pods (3000) across multiple namespaces failed for some pods with an OVS flow request timeout, caused by a mismatch between the pod IP seen by CNI and the one configured in OVN.

combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_dep-served-1-2-job-29-5c759c9849-hflhq_f5-served-ns-29_115a700e-8c69-4b0c-91aa-d20ba9093089_0(17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d): error adding pod f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq to CNI network "multus-cni-network": [f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq 17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d] [f5-served-ns-29/dep-served-1-2-job-29-5c759c9849-hflhq 17608c523b0a9f240d04dfe6aaf4eeb1361b439e1e4ab0235ea5b3533776198d] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows '


Version-Release number of selected component (if applicable):
4.7.28
OVN - testing the patch https://bugzilla.redhat.com/show_bug.cgi?id=2001363

How reproducible:
Intermittent; reproduces infrequently

Steps to Reproduce:
1. Deploy a healthy cluster with 110+ worker nodes
2. Update the OVN image with the patch
3. Create 3000+ pods across 35 namespaces
4. Delete and re-create pods

Actual results:
Pods are stuck in the `ContainerCreating` state, with many timeouts while adding lflows for the pod, due to the mismatch in pod IP.

Expected results:
Pod should be created successfully

Additional info:

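IP mismatch between CNI and OVN: the OVS interface record below has ip_addresses="10.131.3.74/23" in its external_ids, while the OVN logical switch port for the same pod has 10.131.3.72.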

_uuid               : c2f597fb-e63f-447b-a7ab-ae6a4f2f16d2
admin_state         : up
bfd                 : {}
bfd_status          : {}
cfm_fault           : []
cfm_fault_status    : []
cfm_flap_count      : []
cfm_health          : []
cfm_mpid            : []
cfm_remote_mpids    : []
cfm_remote_opstate  : []
duplex              : full
error               : []
external_ids        : {attached_mac="0a:58:0a:83:03:4a", iface-id=f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq, ip_addresses="10.131.3.74/23", ovn-installed="true", sandbox="40f1a64e8574f957a56e49262cf72a704b680c26dd65e3aab1bb6fddc4857f22"}

ovn-nbctl --no-leader-only find logical_switch_port name=f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq
_uuid               : 27bbbf68-913b-4629-b9ed-ae6847cd4ab6
addresses           : ["0a:58:0a:83:03:48 10.131.3.72"]
dhcpv4_options      : []
dhcpv6_options      : []
dynamic_addresses   : []
enabled             : []
external_ids        : {namespace=f5-served-ns-29, pod="true"}
ha_chassis_group    : []
name                : f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq
options             : {requested-chassis=worker020-fc640}
parent_name         : []
port_security       : ["0a:58:0a:83:03:48 10.131.3.72"]
tag                 : []
tag_request         : []
type                : ""
up                  : true
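
The mismatch in the two listings above can be checked mechanically. The following is a minimal sketch in Python, assuming the relevant fields have been copied out of the command output as plain strings. The helper names (interface_ips, lsp_ips, ips_match) are hypothetical and not part of any OVS or OVN tooling, and splitting ip_addresses on commas is an assumption about how multiple addresses would be stored.

```python
import ipaddress
import re


def interface_ips(external_ids: str) -> set:
    """Return the pod IPs from the ip_addresses key of an OVS interface
    external_ids string (prefix length stripped)."""
    match = re.search(r'ip_addresses="([^"]*)"', external_ids)
    if not match:
        return set()
    # Assumption: several addresses, if present, are comma-separated.
    cidrs = [c.strip() for c in match.group(1).split(",") if c.strip()]
    return {str(ipaddress.ip_interface(c).ip) for c in cidrs}


def lsp_ips(addresses: list) -> set:
    """Return the IPs from logical switch port address entries,
    each of the form "MAC IP [IP ...]"."""
    ips = set()
    for entry in addresses:
        for token in entry.split()[1:]:  # skip the leading MAC
            ips.add(str(ipaddress.ip_address(token)))
    return ips


def ips_match(external_ids: str, addresses: list) -> bool:
    """True when the interface and the logical switch port agree on the pod IPs."""
    return interface_ips(external_ids) == lsp_ips(addresses)


# Values copied from the two listings in this report.
ovs_external_ids = (
    'attached_mac="0a:58:0a:83:03:4a", '
    'iface-id=f5-served-ns-29_dep-served-1-2-job-29-5c759c9849-hflhq, '
    'ip_addresses="10.131.3.74/23", ovn-installed="true"'
)
ovn_addresses = ["0a:58:0a:83:03:48 10.131.3.72"]

if not ips_match(ovs_external_ids, ovn_addresses):
    print("mismatch:", interface_ips(ovs_external_ids), "vs", lsp_ips(ovn_addresses))
```

With the values from this report the check reports a mismatch, since the interface carries 10.131.3.74 and the logical switch port carries 10.131.3.72.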

Comment 1 Tim Rozet 2021-09-08 16:24:54 UTC
This is a hard case to hit, so moving severity to medium. Basically what happened is:
In the update pod logic, we pass the current pod event to
addLogicalPort. In addLogicalPort we assume that if the annotations
for the pod mac/ifaddr exist, then we use those and do not update
the annotations on the pod. This assumption is invalid, because the event
may not reflect the current state of the pod. In other words, we could have a
situation where:

1. A pod add event arrives and we annotate the pod with 10.0.0.2. Assume the OVN execute fails.
2. Before the annotation is done, the pod is modified in some other way, which signals another pod update event.
3. The pod update event for 2 arrives. The pod is annotated with 10.0.0.3, because this event is an update to the original pod from before it was annotated with 10.0.0.2. Assume the OVN execute fails again.
4. The pod update event for 1 arrives. Since annotations exist, nothing is annotated and 10.0.0.2 is used. The OVN logical port is configured with 10.0.0.2 and addLogicalPort succeeds. Now the pod has 10.0.0.3 annotated and 10.0.0.2 in OVN. The CNI OpenFlow check will fail and the pod will never come up.
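
The sequence above can be replayed with a toy model. This is only a sketch in Python under stated assumptions: FakeCluster, add_logical_port, and replay are hypothetical names, not the ovn-kubernetes code, and the use_live_state switch illustrates one possible mitigation (consulting the pod's current annotation instead of the event snapshot), which is not necessarily what the linked pull requests do.

```python
class FakeCluster:
    """Hypothetical stand-in for the live pod object and the OVN logical port."""

    def __init__(self):
        self.pod_annotation = None  # IP currently annotated on the pod
        self.ovn_port_ip = None     # IP configured on the OVN logical port
        self._next_host = 2

    def allocate_ip(self) -> str:
        ip = f"10.0.0.{self._next_host}"
        self._next_host += 1
        return ip


def add_logical_port(cluster, event_annotation, ovn_ok, use_live_state=False):
    """Toy version of the handler: trust an existing annotation, else allocate.

    event_annotation is the annotation carried by the (possibly stale) event.
    """
    seen = cluster.pod_annotation if use_live_state else event_annotation
    if seen is None:
        ip = cluster.allocate_ip()
        cluster.pod_annotation = ip  # annotate the pod
    else:
        ip = seen                    # annotation exists: reuse it, do not re-annotate
    if not ovn_ok:
        return False                 # OVN execute failure; the event is retried later
    cluster.ovn_port_ip = ip
    return True


def replay(use_live_state):
    cluster = FakeCluster()
    # Step 1: add event with no annotation; OVN execute fails.
    add_logical_port(cluster, None, ovn_ok=False, use_live_state=use_live_state)
    # Step 3: update event from step 2, captured before the annotation; OVN fails.
    add_logical_port(cluster, None, ovn_ok=False, use_live_state=use_live_state)
    # Step 4: update event for step 1, carrying the 10.0.0.2 annotation; OVN succeeds.
    add_logical_port(cluster, "10.0.0.2", ovn_ok=True, use_live_state=use_live_state)
    return cluster.pod_annotation, cluster.ovn_port_ip


print(replay(use_live_state=False))  # pod annotation and OVN port IP diverge
print(replay(use_live_state=True))   # both agree in this toy model
```

In the stale-event replay the pod ends up annotated with 10.0.0.3 while the logical port holds 10.0.0.2, which is the state described in step 4.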

Comment 8 errata-xmlrpc 2022-03-10 16:08:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

