Bug 1939045
| Summary: | [OCPv4.6] pod to pod communication broken on PFCP protocol over UDP | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Angelo Gabrieli <agabriel> |
| Component: | Networking | Assignee: | Tim Rozet <trozet> |
| Networking sub component: | ovn-kubernetes | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Severity: | urgent |
| Priority: | urgent | CC: | alosadag, anbhat, bbennett, dcbw, djuran, fbaudin, fpan, fpaoline, fsoppels, hchatter, mavazque, mschwabe, openshift-bugs-escalate, pabeni, pibanezr, rkhan, trozet, zzhao |
| Version: | 4.6 | Target Release: | 4.9.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Clones: | 2024910 2024911 2024914 | Type: | Bug |
| Last Closed: | 2021-10-18 17:29:21 UTC | Bug Depends On: | 1939676 |
| Bug Blocks: | 2024910 | | |
Description
Angelo Gabrieli
2021-03-15 13:55:31 UTC
Further, possibly significant observation: we saw that if something is sent from pfcp-endpoint -> data-plane continuously (1 message/second), the connection keeps working. So the problem seems to happen when this "path" is not in use for a while.

After looking through this more, I believe the issue is caused by IP/port collisions between the pods and the service. We can see in OVS that there is a failure to commit the entry into conntrack, presumably because an entry already exists that would conflict:

2021-03-12T08:12:40.670Z|00004|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=84,label=0/0x1),ct(zone=85),recirc(0x19590) failed (Invalid argument) on packet udp,vlan_tci=0x0000,dl_src=0a:58:0a:81:02:07,dl_dst=0a:58:0a:81:02:08,nw_src=10.129.2.7,nw_dst=10.129.2.8,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=5054,tp_dst=5088 udp_csum:14c4 with metadata skb_priority(0),skb_mark(0),ct_state(0x21),ct_zone(0x54),ct_tuple4(src=10.129.2.7,dst=10.129.2.8,proto=17,tp_src=5054,tp_dst=5088),in_port(16) mtu 0

conflicting zone 84 entry:

udp,orig=(src=10.129.2.7,dst=172.30.9.90,sport=5054,dport=5088),reply=(src=10.129.2.8,dst=10.129.2.7,sport=5088,dport=5054),zone=84,labels=0x2

I've filed an OVN bug to handle this case: https://bugzilla.redhat.com/show_bug.cgi?id=1939676

With this potential fix we would SNAT(0.0.0.0) in conntrack, which would change the source port of the traffic only if there is a collision. The caveat is that the packet may then arrive at the server with a different source port, which may or may not be desirable. To fully avoid this type of scenario, the service and/or app configuration should be changed to avoid such port collisions.

4.9 contains OVN 21.09 with the relevant fix, as well as openvswitch2.15-2.15.0-28.el8fdp.x86_64.
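For reference, the colliding flows in the log above can be sketched with netcat from the client pod. This is a minimal illustration using the addresses from the excerpt (source pod 10.129.2.7, service VIP 172.30.9.90, backend pod 10.129.2.8), not a command sequence from the original report:
/ # echo hi | nc -u 172.30.9.90 5088 -p 5054    ### via the service VIP; commits the DNATed entry in zone 84
/ # echo hi | nc -u 10.129.2.8 5088 -p 5054     ### direct to the backend pod from the same source port; without the fix, the zone 84 commit fails
sh-4.4# tail -f /var/log/openvswitch/ovs-vswitchd.log | grep 'Invalid argument'    ### on the node: surfaces the failed commit pre-fix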
Verified this bug on 4.9.0-0.nightly-2021-08-18-144658 with versions:
openvswitch2.15-2.15.0-28.el8fdp.x86_64
ovn21.09-21.09.0-13.el8fdp.x86_64
steps:
1. Create a new project, z1.
2. Create a test pod and svc with the following JSON file (a create command for it follows the manifest):
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      "apiVersion": "v1",
      "kind": "ReplicationController",
      "metadata": {
        "labels": {
          "name": "test-rc"
        },
        "name": "test-rc"
      },
      "spec": {
        "replicas": 1,
        "template": {
          "metadata": {
            "labels": {
              "name": "test-pods"
            }
          },
          "spec": {
            "containers": [
              {
                "image": "quay.io/openshifttest/hello-sdn@sha256:d5785550cf77b7932b090fcd1a2625472912fb3189d5973f177a5a2c347a1f95",
                "name": "test-pod",
                "imagePullPolicy": "IfNotPresent"
              }
            ]
          }
        }
      }
    },
    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": {
        "labels": {
          "name": "test-service"
        },
        "name": "test-service"
      },
      "spec": {
        "ports": [
          {
            "name": "http",
            "port": 27017,
            "protocol": "TCP",
            "targetPort": 8080
          }
        ],
        "selector": {
          "name": "test-pods"
        }
      }
    }
  ]
}
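Assuming the list above is saved as test-rc-svc.json (an illustrative filename), create it in the project:
$ oc create -f test-rc-svc.json -n z1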
3. Create another client pod, which needs to be scheduled on the same node as the pod above (adjust spec.nodeName to a worker in your cluster; a create command follows the manifest):
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "generateName": "hello-pod2",
    "labels": {
      "name": "hello-pod2"
    }
  },
  "spec": {
    "containers": [
      {
        "name": "hello-pod",
        "image": "quay.io/openshifttest/hello-sdn@sha256:d5785550cf77b7932b090fcd1a2625472912fb3189d5973f177a5a2c347a1f95"
      }
    ],
    "nodeName": "ip-10-0-158-114.us-east-2.compute.internal"
  }
}
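Again with an illustrative filename, and with spec.nodeName edited to match a real worker:
$ oc create -f hello-pod2.json -n z1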
4. Check that the two pods are running on the same worker:
$ oc get pod -n z1 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hello-pod22ww4h 1/1 Running 0 34m 10.131.0.15 ip-10-0-158-114.us-east-2.compute.internal <none> <none>
test-rc-bcwzm 1/1 Running 0 39m 10.131.0.14 ip-10-0-158-114.us-east-2.compute.internal <none> <none>
$ oc get svc -n z1
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
test-service ClusterIP 172.30.226.85 <none> 27017/TCP 41m
5. Open a connection from the client pod to the svc address, using source port 5555:
$ oc rsh -n z1 hello-pod22ww4h
/ # nc 172.30.226.85 27017 -p 5555
6. Open another terminal and make a second connection, directly to the test pod and from the same source port:
$ oc rsh -n z1 hello-pod22ww4h
/ # nc 10.131.0.14 8080 -p 5555
7. Wait for a while; the connection from step 6 should not stall or be dropped.
8. oc rsh to the worker and check conntrack:
sh-4.4# conntrack -L | grep 10.131.0.15
tcp 6 431829 ESTABLISHED src=10.131.0.15 dst=10.131.0.14 sport=50151 dport=8080 src=10.131.0.14 dst=10.131.0.15 sport=8080 dport=50151 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=30 use=1
tcp 6 431821 ESTABLISHED src=10.131.0.15 dst=172.30.226.85 sport=5555 dport=27017 src=10.131.0.14 dst=10.131.0.15 sport=8080 dport=5555 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=31 use=1
tcp 6 431829 ESTABLISHED src=10.131.0.15 dst=10.131.0.14 sport=5555 dport=8080 src=10.131.0.14 dst=10.131.0.15 sport=8080 dport=50151 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=31 use=1
tcp 6 431821 ESTABLISHED src=10.131.0.15 dst=10.131.0.14 sport=5555 dport=8080 src=10.131.0.14 dst=10.131.0.15 sport=8080 dport=5555 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=30 use=1
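This output shows the fix in action: both connections were opened from source port 5555, but the direct connection's zone 30 entry was remapped to sport=50151 (the SNAT(0.0.0.0) behavior described above), so it no longer collides with the service connection's entry. An illustrative filter for the remapped port, not part of the original steps:
sh-4.4# conntrack -L | grep 'dport=50151'    ### the remapped direct connection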
9. Check the logs in ovs-vswitchd.log; there should be no failed conntrack commits:
sh-4.4# tail /var/log/openvswitch/ovs-vswitchd.log | grep "Invalid argument" ###should show nothing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759