Bug 1862521 - OCP 4.4.5: intermittent traffic failures with OVNKubernetes
Summary: OCP 4.4.5: intermittent traffic failures with OVNKubernetes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: mcambria@redhat.com
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 1937118 1870867
 
Reported: 2020-07-31 15:44 UTC by jteagno
Modified: 2021-03-09 21:27 UTC
CC: 16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1870867 1870869
Environment:
Last Closed: 2020-10-27 16:21:54 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 237 0 None closed Bug 1862521: 8-22-2020 merge 2021-02-14 14:49:25 UTC
Github ovn-org ovn-kubernetes pull 1622 0 None closed Set gateway mac explicitly so that it doesn't use the same address as… 2021-02-14 14:49:25 UTC
Red Hat Knowledge Base (Solution) 5339971 0 None None None 2020-08-24 13:03:31 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:22:21 UTC

Description jteagno 2020-07-31 15:44:33 UTC
Description of problem:


Version-Release number of selected component (if applicable): OpenShift 4.4.5


How reproducible: Repeatable every time


Steps to Reproduce:
1. Send traffic to an application inside a pod
2. Validate that the application receives the traffic, and that tcpdump captures the incoming packet arriving at (a) the node's IP address, (b) an ovn-k8s address (169.254.x.x), then (c) the pod's IP address
3. Wait 11 seconds
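The send-and-confirm check in steps 1 and 2 can be sketched locally. This is a minimal stand-in using localhost and a kernel-assigned port rather than the real pod IP and UDP port 35431 (the 11-second wait is not reproduced here; addresses, port, and payload are illustrative only):

```python
# Minimal local stand-in for steps 1-2: send a UDP datagram and confirm
# the "application" receives it. Uses localhost instead of the real pod
# IP; the address, port, and payload here are illustrative only.
import socket

# "Application" socket standing in for the pod listener.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # kernel picks a free port
receiver.settimeout(2)
addr = receiver.getsockname()

# External traffic source standing in for the sender outside the node.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"probe", addr)

data, peer = receiver.recvfrom(1024)
print(data)  # confirms the traffic arrived at the listener
```

In the real reproduction the listener is the application inside the pod, and arrival is confirmed at each hop with the tcpdump capture described below.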

Actual results: the application does *not* receive the traffic, and tcpdump shows packets arriving only at the node's IP address and an ovn-k8s address (but not the pod's IP address).


Expected results: Traffic should arrive in the application pod each time.


Additional info:

On OpenShift 4.4.5 bare metal IPI with the `OVNKubernetes` networkType, incoming traffic fails to reach the application pod, even though a tcpdump capture confirms it arrives at the node where the pod runs.

The problem can be duplicated reliably: the application receives the expected traffic, and then 11 seconds later the same traffic no longer reaches the application pod.  tcpdump captures confirm that traffic arrives at the IP address of the worker node in both the "good" and "failed" cases, but on failure, no packet is captured whose destination IP address is that of the pod.

The node on which the "bad" pod runs may be in an OVN or OVS error state: its pod logs from the `openshift-ovn-kubernetes` namespace are spewing error messages.  The corresponding `openshift-ovn-kubernetes` logs from other nodes running this application's pods are relatively quiet.

---

The application inside pod `MY-APP-POD` expects to receive traffic on UDP destination port 35431.  This pod runs on node `worker-6`, and the captures below were taken on that node as follows:

podman run -v /tmp/pcap:/tmp/pcap:rw,z --rm --net=host -it ${RHEL_8_SUPPORT_TOOLS} tcpdump -n -i any -v -s0 -w /tmp/pcap/worker-6.host.any.1596142079Z.pcap

# 172.18.8.14 is the source of traffic outside this node
# 172.19.1.79 is the IP address of the node (worker-6)
# 169.254.33.2 is adjacent to 169.254.33.1/24, which is bound on ovn-k8s-gw0
# 172.10.20.22 is the IP address of the application pod (`ip route get 172.10.20.22` prints device ovn-k8s-mp0)

---

good case: the application receives the expected traffic, and the tcpdump capture confirms traffic arriving at the node's IP address, then an ovn-k8s address, and finally at the pod's IP address:

$ tshark -o 'gui.column.format:"No.","%m","Time","%Yut","Source","%s","Source port","%uS","Destination","%d","Destination port","%uD","Protocol","%p","Length","%L","Info","%i"' -nn -r worker-6.host.any.1596142079Z.pcap -Y "frame.number == 130529 or frame.number == 130530 or frame.number == 130548" ; echo

130529 2020-07-30 20:48:57.807664  172.18.8.14 5000 172.19.1.79  35431 TAPA 269 Tunnel - V=0, T=Type 0
130530 2020-07-30 20:48:57.807675  172.18.8.14 5000 169.254.33.2 35431 TAPA 269 Tunnel - V=0, T=Type 0
130548 2020-07-30 20:48:57.808371  172.18.8.14 5000 172.10.20.22 35431 TAPA 269 Tunnel - V=0, T=Type 0

---

failed case: 11 seconds after the successful case, the application no longer receives the expected traffic.  The tcpdump shows packets arriving at the node's IP address, then an ovn-k8s address (twice), but not the expected third packet whose destination IP address is that of the pod:

$ tshark -o 'gui.column.format:"No.","%m","Time","%Yut","Source","%s","Source port","%uS","Destination","%d","Destination port","%uD","Protocol","%p","Length","%L","Info","%i"' -nn -r worker-6.host.any.1596142079Z.pcap -Y "frame.number >= 171081 and frame.number <= 171083" ; echo

171081 2020-07-30 20:49:08.110914  172.18.8.14 5000 172.19.1.79  35431 TAPA 116 Tunnel - V=0, T=Type 0
171082 2020-07-30 20:49:08.110931  172.18.8.14 5000 169.254.33.2 35431 TAPA 116 Tunnel - V=0, T=Type 0
171083 2020-07-30 20:49:08.111008  172.18.8.14 5000 169.254.33.2 35431 TAPA 116 Tunnel - V=0, T=Type 0
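The good/failed distinction above comes down to whether any captured packet's destination is the pod IP. That check can be made mechanical with a small filter over tshark's one-line summaries; this sketch assumes the whitespace-separated column layout shown above (frame number, date, time, source, source port, destination, ...), using the sample lines from the two captures:

```python
# Flag whether any tshark summary line shows the pod's IP (172.10.20.22)
# in the destination column; the column index assumes the layout above.
POD_IP = "172.10.20.22"

good = """\
130529 2020-07-30 20:48:57.807664 172.18.8.14 5000 172.19.1.79  35431 TAPA 269
130530 2020-07-30 20:48:57.807675 172.18.8.14 5000 169.254.33.2 35431 TAPA 269
130548 2020-07-30 20:48:57.808371 172.18.8.14 5000 172.10.20.22 35431 TAPA 269"""

failed = """\
171081 2020-07-30 20:49:08.110914 172.18.8.14 5000 172.19.1.79  35431 TAPA 116
171082 2020-07-30 20:49:08.110931 172.18.8.14 5000 169.254.33.2 35431 TAPA 116
171083 2020-07-30 20:49:08.111008 172.18.8.14 5000 169.254.33.2 35431 TAPA 116"""

def reaches_pod(capture: str) -> bool:
    # The destination IP is the 6th whitespace-separated field.
    return any(line.split()[5] == POD_IP for line in capture.splitlines())

print(reaches_pod(good))    # True: one packet is delivered to the pod IP
print(reaches_pod(failed))  # False: no packet carries the pod's IP
```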

---

Below are samples of the repeated OVN log messages that appear on the node running the failing pod:

--> Repeated every minute on worker-6; one instance on worker-7 at 2020-07-30T20:20:19.521Z ; no similar messages around time of problem on workers 0 & 3:
$ oc -n openshift-ovn-kubernetes logs ovnkube-node-62zrm -c ovnkube-node
...
W0730 20:49:03.778581    4517 helper_linux.go:269] failed to clear stale OVS port "" iface-id "MY-OTHER-APP-2": failed to run 'ovs-vsctl --timeout=30 remove Interface  external-ids iface-id': exit status 1
  "ovs-vsctl: no row \"\" in table Interface\n"
  ""
...

--> Incessantly spammed on worker-6; silence on workers 0, 3, & 7:
$ oc -n openshift-ovn-kubernetes logs ovnkube-node-62zrm -c ovn-controller
...
2020-07-30T20:49:04Z|204323|binding|INFO|Claiming lport MY-OTHER-APP-3 for this chassis.
2020-07-30T20:49:04Z|204324|binding|INFO|MY-OTHER-APP-3: Claiming dynamic
2020-07-30T20:49:04Z|204325|binding|INFO|Claiming lport MY-OTHER-APP-2 for this chassis.
2020-07-30T20:49:04Z|204326|binding|INFO|MY-OTHER-APP-2: Claiming dynamic
...

--> Written once a minute on worker-6; less noise (no adds/deletions; ~20 flow mods) on workers 0 & 3:
$ oc -n openshift-ovn-kubernetes logs ovs-node-tc5r6
...
2020-07-30T20:48:52.948Z|187390|connmgr|INFO|br-int<->unix#3: 144 flow_mods in the 20 s starting 51 s ago (93 adds, 48 deletes, 3 modifications)
2020-07-30T20:49:03.804Z|187391|bridge|INFO|bridge br-int: added interface f2d9b0295e5be0a on port 2427
2020-07-30T20:49:03.953Z|187392|bridge|INFO|bridge br-int: added interface 47ef0d4a6e774ac on port 2428
2020-07-30T20:49:12.653Z|187393|bridge|INFO|bridge br-int: deleted interface 47ef0d4a6e774ac on port 2428
2020-07-30T20:49:13.655Z|187394|bridge|INFO|bridge br-int: deleted interface f2d9b0295e5be0a on port 2427
...

--> ... and even noisier on worker-7:
$ oc -n openshift-ovn-kubernetes logs ovs-node-wbh6n
...
2020-07-30T20:48:42.454Z|37223|ofproto_dpif_upcall(revalidator150)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ed47a098-8e33-4f79-8752-0a02d5bb29b0
2020-07-30T20:48:42.455Z|16545|ofproto_dpif_upcall(revalidator152)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:6cdaa051-5783-4d2c-b7b3-8ee6f1e6b295
2020-07-30T20:48:44.457Z|189422|ofproto_dpif_upcall(revalidator141)|WARN|Dropped 3 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-30T20:48:44.457Z|189423|ofproto_dpif_upcall(revalidator141)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:9789457f-871d-4b6d-a1ae-9f4931c621c5
2020-07-30T20:48:50.947Z|189424|ofproto_dpif_upcall(revalidator141)|WARN|Dropped 7 log messages in last 7 seconds (most recently, 4 seconds ago) due to excessive rate
2020-07-30T20:48:50.948Z|189425|ofproto_dpif_upcall(revalidator141)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:d5daf84d-9c4f-4b1d-b8f4-86faea354621
2020-07-30T20:48:52.950Z|52624|connmgr|INFO|br-int<->unix#3: 24 flow_mods in the 20 s starting 51 s ago (16 adds, 8 deletes)
2020-07-30T20:48:59.018Z|14431|ofproto_dpif_upcall(revalidator145)|WARN|Dropped 5 log messages in last 8 seconds (most recently, 4 seconds ago) due to excessive rate
2020-07-30T20:48:59.018Z|14432|ofproto_dpif_upcall(revalidator145)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:feac34cb-cdb2-408d-b9f8-3b30a579c551
2020-07-30T20:49:11.023Z|48723|ofproto_dpif_upcall(handler82)|INFO|Dropped 29 log messages in last 32 seconds (most recently, 13 seconds ago) due to excessive rate
2020-07-30T20:49:11.023Z|48724|ofproto_dpif_upcall(handler82)|INFO|received packet on unassociated datapath port 4294967295
2020-07-30T20:49:11.134Z|189426|ofproto_dpif_upcall(revalidator141)|WARN|Dropped 1 log messages in last 12 seconds (most recently, 12 seconds ago) due to excessive rate
2020-07-30T20:49:11.134Z|189427|ofproto_dpif_upcall(revalidator141)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:6e628051-7200-473b-9741-32ad8e5d1d5e
2020-07-30T20:49:11.138Z|09308|ofproto_dpif_upcall(revalidator156)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:08a43d00-c65c-4173-a735-730dc7fbaa01
...

Comment 5 Ben Bennett 2020-07-31 19:40:26 UTC
Assigning to 4.6 so that we can address the problem in the development branch.  We will consider the backport once the problem is resolved.

Comment 16 jteagno 2020-08-04 13:54:59 UTC
In re: the ovs-node log message "Failed to acquire udpif_key corresponding to unexpected flow":

These ovs-node messages do not appear on the node running the "problem" pod, where tcpdump was run and where the "bad" test captured no traffic arriving at the pod's IP address.

The node with the "problem" pod was worker-6; these log messages appear on worker-7, which runs other pods belonging to this application.

Mentioning the above in case it's relevant.

Comment 48 errata-xmlrpc 2020-10-27 16:21:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

