Description of problem:


Version-Release number of selected component (if applicable):
OpenShift 4.4.5

How reproducible:
Repeatable every time

Steps to Reproduce:
1. Send traffic to an application inside a pod
2. Validate that the application receives the traffic, and that tcpdump captures the incoming packet arriving at (a) the node's IP address, (b) an ovn-k8s address (169.254.x.x), and then (c) the pod's IP address
3. Wait 11 seconds

Actual results:
The application does *not* receive the traffic, and tcpdump shows traffic arriving only at the node's IP address and an ovn-k8s address (but not at the pod's IP address).

Expected results:
Traffic should arrive in the application pod every time.

Additional info:
On OpenShift 4.4.5 bare metal IPI with the `OVNKubernetes` networkType, incoming traffic fails to appear inside an application pod, even though a tcpdump capture confirms that it arrives at the node where the pod runs. The problem can be reproduced reliably: the application receives the expected traffic, and then 11 seconds later the application pod no longer receives the same traffic. tcpdump captures confirm that traffic arrives at the worker node's IP address in both the "good" and "failed" cases, but on failure no packet is captured whose destination IP address is that of the pod.

The node on which the "bad" pod runs may be in an OVN or OVS error state: its pod logs in the `openshift-ovn-kubernetes` namespace are spewing error messages, while the corresponding `openshift-ovn-kubernetes` logs from other nodes running this application's pods are relatively quiet.

---

The application inside pod `MY-APP-POD` expects to receive traffic on UDP destination port 35431. The pod runs on node `worker-6`, and the captures below were taken on that node like so:

podman run -v /tmp/pcap:/tmp/pcap:rw,z --rm --net=host -it ${RHEL_8_SUPPORT_TOOLS} tcpdump -n -i any -v -s0 -w /tmp/pcap/worker-6.host.any.1596142079Z.pcap

# 172.18.8.14 is the source of traffic outside this node
# 172.19.1.79 is the IP address of the node (worker-6)
# 169.254.33.2 is adjacent to 169.254.33.1/24, which is bound on ovn-k8s-gw0
# 172.10.20.22 is the IP address of the application pod (`ip route get 172.10.20.22` prints device ovn-k8s-mp0)

---

good case: the application receives the expected traffic, and the tcpdump capture confirms traffic arriving at the node's IP address, then an ovn-k8s address, and finally at the pod's IP address:

$ tshark -o 'gui.column.format:"No.","%m","Time","%Yut","Source","%s","Source port","%uS","Destination","%d","Destination port","%uD","Protocol","%p","Length","%L","Info","%i"' -nn -r worker-6.host.any.1596142079Z.pcap -Y "frame.number == 130529 or frame.number == 130530 or frame.number == 130548" ; echo
130529 2020-07-30 20:48:57.807664 172.18.8.14 5000 172.19.1.79 35431 TAPA 269 Tunnel - V=0, T=Type 0
130530 2020-07-30 20:48:57.807675 172.18.8.14 5000 169.254.33.2 35431 TAPA 269 Tunnel - V=0, T=Type 0
130548 2020-07-30 20:48:57.808371 172.18.8.14 5000 172.10.20.22 35431 TAPA 269 Tunnel - V=0, T=Type 0
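(For reference, the address-to-interface mapping noted in the capture comments above can be rechecked from a debug shell on worker-6 with standard commands along these lines; this is a sketch, not output captured during this incident:)

$ ip route get 172.10.20.22        # prints "dev ovn-k8s-mp0", the OVN management port toward the pod network
$ ip -4 addr show dev ovn-k8s-gw0  # shows 169.254.33.1/24, the local address adjacent to 169.254.33.2
$ ovs-vsctl show                   # overall OVS bridge/port layout on the node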
---

failed case: 11 seconds after the successful case, the application no longer receives the expected traffic. The tcpdump shows packets arriving at the node's IP address, then an ovn-k8s address (twice), but not the expected third packet whose destination IP address is that of the pod:

$ tshark -o 'gui.column.format:"No.","%m","Time","%Yut","Source","%s","Source port","%uS","Destination","%d","Destination port","%uD","Protocol","%p","Length","%L","Info","%i"' -nn -r worker-6.host.any.1596142079Z.pcap -Y "frame.number >= 171081 and frame.number <= 171083" ; echo
171081 2020-07-30 20:49:08.110914 172.18.8.14 5000 172.19.1.79 35431 TAPA 116 Tunnel - V=0, T=Type 0
171082 2020-07-30 20:49:08.110931 172.18.8.14 5000 169.254.33.2 35431 TAPA 116 Tunnel - V=0, T=Type 0
171083 2020-07-30 20:49:08.111008 172.18.8.14 5000 169.254.33.2 35431 TAPA 116 Tunnel - V=0, T=Type 0

---

Highlighted below are samples of the spammed OVN log messages that appear on the node running the pod experiencing the failure:

--> Repeated every minute on worker-6; one instance on worker-7 at 2020-07-30T20:20:19.521Z; no similar messages around the time of the problem on workers 0 & 3:

$ oc -n openshift-ovn-kubernetes logs ovnkube-node-62zrm -c ovnkube-node
...
W0730 20:49:03.778581 4517 helper_linux.go:269] failed to clear stale OVS port "" iface-id "MY-OTHER-APP-2": failed to run 'ovs-vsctl --timeout=30 remove Interface external-ids iface-id': exit status 1 "ovs-vsctl: no row \"\" in table Interface\n" ""
...

--> Incessantly spammed on worker-6; silence on workers 0, 3, & 7:

$ oc -n openshift-ovn-kubernetes logs ovnkube-node-62zrm -c ovn-controller
...
2020-07-30T20:49:04Z|204323|binding|INFO|Claiming lport MY-OTHER-APP-3 for this chassis.
2020-07-30T20:49:04Z|204324|binding|INFO|MY-OTHER-APP-3: Claiming dynamic
2020-07-30T20:49:04Z|204325|binding|INFO|Claiming lport MY-OTHER-APP-2 for this chassis.
2020-07-30T20:49:04Z|204326|binding|INFO|MY-OTHER-APP-2: Claiming dynamic
...

--> Written once a minute on worker-6; less noise (no adds/deletions; ~20 flow mods) on workers 0 & 3:

$ oc -n openshift-ovn-kubernetes logs ovs-node-tc5r6
...
2020-07-30T20:48:52.948Z|187390|connmgr|INFO|br-int<->unix#3: 144 flow_mods in the 20 s starting 51 s ago (93 adds, 48 deletes, 3 modifications)
2020-07-30T20:49:03.804Z|187391|bridge|INFO|bridge br-int: added interface f2d9b0295e5be0a on port 2427
2020-07-30T20:49:03.953Z|187392|bridge|INFO|bridge br-int: added interface 47ef0d4a6e774ac on port 2428
2020-07-30T20:49:12.653Z|187393|bridge|INFO|bridge br-int: deleted interface 47ef0d4a6e774ac on port 2428
2020-07-30T20:49:13.655Z|187394|bridge|INFO|bridge br-int: deleted interface f2d9b0295e5be0a on port 2427
...

--> ... and even noisier on worker-7:

$ oc -n openshift-ovn-kubernetes logs ovs-node-wbh6n
...
2020-07-30T20:48:42.454Z|37223|ofproto_dpif_upcall(revalidator150)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:ed47a098-8e33-4f79-8752-0a02d5bb29b0
2020-07-30T20:48:42.455Z|16545|ofproto_dpif_upcall(revalidator152)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:6cdaa051-5783-4d2c-b7b3-8ee6f1e6b295
2020-07-30T20:48:44.457Z|189422|ofproto_dpif_upcall(revalidator141)|WARN|Dropped 3 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
2020-07-30T20:48:44.457Z|189423|ofproto_dpif_upcall(revalidator141)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:9789457f-871d-4b6d-a1ae-9f4931c621c5
2020-07-30T20:48:50.947Z|189424|ofproto_dpif_upcall(revalidator141)|WARN|Dropped 7 log messages in last 7 seconds (most recently, 4 seconds ago) due to excessive rate
2020-07-30T20:48:50.948Z|189425|ofproto_dpif_upcall(revalidator141)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:d5daf84d-9c4f-4b1d-b8f4-86faea354621
2020-07-30T20:48:52.950Z|52624|connmgr|INFO|br-int<->unix#3: 24 flow_mods in the 20 s starting 51 s ago (16 adds, 8 deletes)
2020-07-30T20:48:59.018Z|14431|ofproto_dpif_upcall(revalidator145)|WARN|Dropped 5 log messages in last 8 seconds (most recently, 4 seconds ago) due to excessive rate
2020-07-30T20:48:59.018Z|14432|ofproto_dpif_upcall(revalidator145)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:feac34cb-cdb2-408d-b9f8-3b30a579c551
2020-07-30T20:49:11.023Z|48723|ofproto_dpif_upcall(handler82)|INFO|Dropped 29 log messages in last 32 seconds (most recently, 13 seconds ago) due to excessive rate
2020-07-30T20:49:11.023Z|48724|ofproto_dpif_upcall(handler82)|INFO|received packet on unassociated datapath port 4294967295
2020-07-30T20:49:11.134Z|189426|ofproto_dpif_upcall(revalidator141)|WARN|Dropped 1 log messages in last 12 seconds (most recently, 12 seconds ago) due to excessive rate
2020-07-30T20:49:11.134Z|189427|ofproto_dpif_upcall(revalidator141)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:6e628051-7200-473b-9741-32ad8e5d1d5e
2020-07-30T20:49:11.138Z|09308|ofproto_dpif_upcall(revalidator156)|WARN|Failed to acquire udpif_key corresponding to unexpected flow (Invalid argument): ufid:08a43d00-c65c-4173-a735-730dc7fbaa01
...
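If more data from the noisy nodes would help, the upcall/revalidator and datapath-flow state that those warnings refer to can be snapshotted from the ovs-node pods with standard ovs-appctl commands, e.g. (a sketch using the pod names already quoted above):

$ oc -n openshift-ovn-kubernetes exec ovs-node-wbh6n -- ovs-appctl upcall/show
$ oc -n openshift-ovn-kubernetes exec ovs-node-wbh6n -- ovs-appctl dpctl/dump-flows
$ oc -n openshift-ovn-kubernetes exec ovs-node-tc5r6 -- ovs-appctl upcall/show

upcall/show summarizes the handler/revalidator threads and flow counts, and dpctl/dump-flows dumps the kernel datapath flows, which is the layer the "Failed to acquire udpif_key" warnings are about.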
Assigning to 4.6 so that we can address the problem in the development branch. We will consider a backport once the problem is resolved.
Regarding the ovs-node log message "Failed to acquire udpif_key corresponding to unexpected flow": these messages do not appear on the node running the "problem" pod, i.e. worker-6, where tcpdump was run and where the "bad" test captured no traffic arriving at the pod's IP address. They appear instead on worker-7, which runs other pods belonging to this application. Mentioning this in case it's relevant.
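For anyone cross-referencing the log excerpts above with nodes, the pod-to-node mapping used here can be re-derived with ordinary oc queries, e.g. (a sketch; names are the ones already used in this report, and the first command should be run in the application's namespace):

$ oc get pod MY-APP-POD -o wide
$ oc -n openshift-ovn-kubernetes get pods -o wide --field-selector spec.nodeName=worker-6
$ oc -n openshift-ovn-kubernetes get pods -o wide --field-selector spec.nodeName=worker-7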
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196