Spent two days going over all the data supplied (and writing some tools), and it looks like there are duplicate IP addresses in the network. I'll put my full analysis in the next comment, but here is a short conclusion.
On the CPE trace we can see the node setting up a lot of connections to/from 10.128.3.182; however, they do not all come from the BAW node. Around half of the traffic comes from a device on tun0 with the same IP.
So the next action for support is to figure out who on the network is sharing these IPs and is also serving port 9443!
Can we ask the customer if reducing the number of clients helps to alleviate the issue?
If that is true, can we check the CPU usage while the testing is running to compare with comment#35?
(In reply to Flavio Leitner from comment #48)
> Can we ask the customer if reducing the number of clients helps to alleviate
> the issue?
What I am really looking for here is to know whether the system works well under light load, and whether the number of failures increases as they add more clients.
So looking at the last set of traces, we no longer see the wrongly translated NAT address issue. However, we still see the issue where setting up a TCP session sometimes needs a number of retries before it succeeds. To be precise, the connection setup needed retries 388 times, and all of them were from BAW to CPE (10.128.2.10:xxxx->10.128.2.6:9443).
After analyzing the PCAPs, and creating yet another tool to verify this, it looks like they are all due to ephemeral port collisions between the direct traffic and the service NATted traffic. For example:
- 10.128.2.10:36440->10.128.2.6:9443, seq=1284738543:
TX LEFT >> number:12324, time: 2021-02-15 13:53:53.103262 UTC
TX LEFT >> number:12351, time: 2021-02-15 13:53:54.129987 UTC
TX LEFT >> number:12392, time: 2021-02-15 13:53:56.177958 UTC
TX LEFT >> number:12469, time: 2021-02-15 13:54:00.212106 UTC
TX LEFT >> number:12626, time: 2021-02-15 13:54:08.593946 UTC
RX COLL >> number:8298, time: 2021-02-15 13:52:18.652697 UTC
TX ??? >> number:10056, time: 2021-02-15 13:52:18.652517 UTC: 10.128.2.10:36440->172.30.32.208:36440, seq=4259382902
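The tool used for this check is not attached to the bug. As a rough illustration only, a collision check of this kind could look like the sketch below. The names `Syn` and `find_port_collisions`, and the service-to-POD mapping (including the service port 9443), are assumptions made for the example and do not come from the traces.

```python
from collections import defaultdict
from typing import NamedTuple


class Syn(NamedTuple):
    """One observed TCP SYN (hypothetical record type for this sketch)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def find_port_collisions(syns, service_to_pod):
    """Pair up direct and service flows that end at the same backend
    with the same source IP and source port.

    service_to_pod maps (service_ip, service_port) -> (pod_ip, pod_port).
    """
    groups = defaultdict(lambda: {"direct": [], "service": []})
    for syn in syns:
        backend = service_to_pod.get((syn.dst_ip, syn.dst_port))
        if backend is None:
            # Not a known service address: treat as direct POD traffic.
            key = (syn.src_ip, syn.src_port, syn.dst_ip, syn.dst_port)
            groups[key]["direct"].append(syn)
        else:
            # Service traffic: key on the backend it is DNATted to.
            key = (syn.src_ip, syn.src_port, backend[0], backend[1])
            groups[key]["service"].append(syn)
    return [
        (direct, via_service)
        for group in groups.values()
        for direct in group["direct"]
        for via_service in group["service"]
    ]


# Assumed mapping for illustration; the real service port is not confirmed here.
services = {("172.30.32.208", 9443): ("10.128.2.6", 9443)}
observed = [
    Syn("10.128.2.10", 36440, "10.128.2.6", 9443),
    Syn("10.128.2.10", 36440, "172.30.32.208", 9443),
]
for direct, via_service in find_port_collisions(observed, services):
    print("collision:", direct, "<->", via_service)
```

This only flags candidates by address and port; it does not check whether the two connections actually overlapped in time.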
I'll start a discussion with the openshift-sdn folks to see what we can do about this.
In the meantime, as the customer is using both the direct POD IPs and the service IP (for example 10.128.2.6 and 172.30.32.208), can we ask them to use only one of the two for all communication? So make sure the CPEs and BAWs all communicate using either the Service IP or the POD IP, but not both.
We are still looking into how to solve this, but let me describe the problem in detail so it's captured in the BZ:
The openshift-sdn networks look something like this:
[dnat 172.30.249.115 to 10.128.3.182 and loop back to the bridge]
[veth: 10.128.3.180] ------ [OVS BRIDGE + CONNTRACK + NAT] ------ [veth: 10.128.3.182]
Now two connections are set up in the following order:
- POD A opens a connection to port 9443 on the Service IP of POD B. It selects ephemeral source port 30000.
10.128.3.180:30000 -> 172.30.249.115:9443
- POD A opens a connection to port 9443 on the direct IP address of POD B. Here it also selects ephemeral port 30000, which is fine, as from POD A's perspective this is a different, unique 5-tuple.
10.128.3.180:30000 -> 10.128.3.182:9443
The first connection goes through the tun0 interface and a connection tracking entry gets created:
tcp 6 src=10.128.3.180 dst=172.30.249.115 sport=30000 dport=9443 src=10.128.3.182 dst=10.128.3.180 sport=9443 dport=30000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=2
Now the problem is with the second connection, as OVS will use connection tracking. The second connection collides with the reverse tuple of the first connection, so the commit is not accepted, and the SYN packets get dropped. Hence the second connection will fail until the first connection is closed and the TIME_WAIT period has passed.
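To make the collision concrete, here is a small toy model of a connection tracking table that stores both the original and the reply tuple of each connection and refuses a commit when either is already present. This is a simplified sketch, not the kernel or OVS implementation; `Tuple5` and `ConntrackTable` are made-up names.

```python
from typing import NamedTuple


class Tuple5(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str = "tcp"

    def reverse(self):
        """Tuple as seen on packets travelling in the reply direction."""
        return Tuple5(self.dst_ip, self.dst_port,
                      self.src_ip, self.src_port, self.proto)


class ConntrackTable:
    """Toy table: every tuple (original or reply) must be unique."""

    def __init__(self):
        self._tuples = set()

    def commit(self, original, dnat_to=None):
        """Return True if the connection was committed, False on a clash."""
        after_nat = original
        if dnat_to is not None:
            # DNAT rewrites the destination before the reply tuple is derived.
            after_nat = original._replace(dst_ip=dnat_to[0], dst_port=dnat_to[1])
        reply = after_nat.reverse()
        if original in self._tuples or reply in self._tuples:
            return False
        self._tuples.update((original, reply))
        return True


table = ConntrackTable()
via_service = Tuple5("10.128.3.180", 30000, "172.30.249.115", 9443)
direct = Tuple5("10.128.3.180", 30000, "10.128.3.182", 9443)

print(table.commit(via_service, dnat_to=("10.128.3.182", 9443)))  # True
print(table.commit(direct))  # False: reply tuple already owned by the first
```

Both connections expect replies of the form 10.128.3.182:9443 -> 10.128.3.180:30000, which is why the second commit is rejected in this model.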
(In reply to Eelco Chaudron from comment #70)
The solution to this problem could be to let the kernel connection tracking resolve this potential collision by doing an SNAT on the source TCP port if a collision is detected. The kernel nftables does this automatically; however, Open vSwitch does not perform any NAT unless explicitly requested.
The kernel infrastructure will do this by adding a NULL SNAT entry, which can be simulated in OVS with the following action:
The conntrack output below shows the entries created when the duplicate source port, 30000, was used:
tcp 6 109 TIME_WAIT src=10.128.3.180 dst=172.30.249.115 sport=30000 dport=9443 src=10.128.3.182 dst=10.128.3.180 sport=9443 dport=30000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 109 TIME_WAIT src=10.128.3.180 dst=10.128.3.182 sport=30000 dport=9443 src=10.128.3.182 dst=10.128.3.180 sport=9443 dport=31099 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
I quickly tested the latest iteration of the patch, https://github.com/openshift/sdn/pull/269 but it still shows some CI failures.
Anyway, as this is not an OVS issue, I will move it to the OCP team.
I'm adding some details on how to replicate this:
- Create two PODs and allow port 9443 to be visible outside.
- Make sure the two PODs are on the same node!!
- Create a service for port 9443
- Start a server on POD-B; I was using the following:
nc -k -l 9443
- Now from POD-A open a connection via the service IP using a fixed source port (30000), and keep it open:
nc 172.30.167.140 9443 -p 30000
- Verify traffic from POD-A to B works.
- Now open another connection from POD-A to B using the POD's IP address and the same fixed source port (30000):
nc 10.131.0.23 9443 -p 30000
- Make sure traffic continues to work from both connections
- Verify the conntrack entries exist correctly and the source port for the second connection is translated:
conntrack -L | grep 10.131.0.23
tcp 6 431989 ESTABLISHED src=10.131.0.22 dst=172.30.167.140 sport=30000 dport=9443 src=10.131.0.23 dst=10.131.0.22 sport=9443 dport=30000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 431987 ESTABLISHED src=10.131.0.22 dst=10.131.0.23 sport=30000 dport=9443 src=10.131.0.23 dst=10.131.0.22 sport=9443 dport=50340 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
- Set up another connection using the POD IPs, but with a different source port, to make sure translation does NOT happen without the collision:
tcp 6 431992 ESTABLISHED src=10.131.0.22 dst=10.131.0.23 sport=30001 dport=9443 src=10.131.0.23 dst=10.131.0.22 sport=9443 dport=30001 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
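The check in the last two steps can also be scripted. The sketch below parses lines in the `conntrack -L` format shown above and reports whether the source port was translated, by comparing the original `sport` with the reply-direction `dport`. The helper names are hypothetical, and the parsing assumes each line carries exactly two src/dst/sport/dport groups, as in the TCP entries here.

```python
import re

# Matches the address and port fields; the first four hits belong to the
# original direction, the last four to the reply direction.
_FIELD = re.compile(r"\b(src|dst|sport|dport)=(\S+)")


def parse_entry(line):
    """Split one conntrack line into (original, reply) field dicts."""
    fields = _FIELD.findall(line)
    if len(fields) != 8:
        raise ValueError("unexpected conntrack line: %r" % line)
    return dict(fields[:4]), dict(fields[4:])


def source_port_translated(line):
    """True if the reply goes to a different port than the original sport."""
    original, reply = parse_entry(line)
    return original["sport"] != reply["dport"]


collided = ("tcp 6 431987 ESTABLISHED src=10.131.0.22 dst=10.131.0.23 "
            "sport=30000 dport=9443 src=10.131.0.23 dst=10.131.0.22 "
            "sport=9443 dport=50340 [ASSURED] mark=0 use=1")
clean = ("tcp 6 431992 ESTABLISHED src=10.131.0.22 dst=10.131.0.23 "
         "sport=30001 dport=9443 src=10.131.0.23 dst=10.131.0.22 "
         "sport=9443 dport=30001 [ASSURED] mark=0 use=1")

print(source_port_translated(collided))  # True
print(source_port_translated(clean))     # False
```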
Verified this bug on 4.8.0-0.nightly-2021-03-08-184701 using the steps from comment 73:
sh-4.4# conntrack -L | grep 10.131.0.22
tcp 6 431924 ESTABLISHED src=10.131.0.23 dst=10.131.0.22 sport=30000 dport=8080 src=10.131.0.22 dst=10.131.0.23 sport=8080 dport=14810 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 431994 ESTABLISHED src=10.131.0.23 dst=10.131.0.22 sport=30001 dport=8080 src=10.131.0.22 dst=10.131.0.23 sport=8080 dport=30001 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp 6 431867 ESTABLISHED src=10.131.0.23 dst=172.30.103.70 sport=30000 dport=27017 src=10.131.0.22 dst=10.131.0.23 sport=8080 dport=30000 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.