Bug 1903414
| Field | Value |
|---|---|
| Summary | NodePort is not working when configuring an egress IP address |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | openshift-sdn |
| Version | 4.6 |
| Target Release | 4.7.0 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | shishika |
| Assignee | Jacob Tanenbaum <jtanenba> |
| QA Contact | huirwang |
| CC | aconstan, anbhat, anowak, danw, huirwang, jdesousa, jtanenba, tkimura, zzhao |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Clones | 1986413 (view as bug list) |
| Bug Blocks | 1926662, 1986413 |
| Last Closed | 2021-02-24 15:37:21 UTC |
Description
shishika
2020-12-02 03:06:04 UTC
Hi Zhanqui, I think this may be a duplicate of BZ#1881882. Can you please try to reproduce this on RHEL nodes instead of RHCOS? I think this must be something in the kernel; to be more precise, I think this is conntrack.

huiran, could you help check whether this is the same issue as BZ#1881882?

Hello, this message is just the problem statement, feel free not to read it. The summary of the issue is that when a nodePort with externalTrafficPolicy: Local is reached from an egressIP, the node with the nodePort discards the egress traffic from the pod. There are two different scenarios here:

1. The client is on the node which is being used to reach the nodePort service. Here there is no egress IP because the packet is not leaving the node. This is working as intended.
2. The client is on a different node. This does not work combined with egress IP.
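For reference, a minimal sketch of the kind of Service involved here, written as a shell heredoc. The namespace, port and nodePort match the test output below; the selector label and the exact manifest are illustrative assumptions, not taken from this cluster:

$ oc -n test apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: hello-pod
spec:
  type: NodePort
  # Local: only nodes hosting a backend pod answer on the nodePort,
  # and the client source IP is preserved for the pod.
  externalTrafficPolicy: Local
  selector:
    app: hello-pod    # assumed label, not confirmed by the report
  ports:
  - port: 8000
    targetPort: 8000
    nodePort: 30011
EOF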
Now I have a simple test:

$ oc get netnamespace test
NAME   NETID      EGRESS IPS
test   12136949   ["172.31.249.201"]

$ oc get hostsubnet
NAME                                     HOST                                     HOST IP          SUBNET          EGRESS CIDRS   EGRESS IPS
huirwang-bug1903414-rrkmp-master-0       huirwang-bug1903414-rrkmp-master-0       172.31.249.123   10.130.0.0/23                  ["172.31.249.201"]
huirwang-bug1903414-rrkmp-worker-97znx   huirwang-bug1903414-rrkmp-worker-97znx   172.31.249.13    10.131.0.0/23   []
huirwang-bug1903414-rrkmp-worker-jmff2   huirwang-bug1903414-rrkmp-worker-jmff2   172.31.249.210   10.128.2.0/23
(some hostsubnets were deleted from the output for simplicity)

$ oc get pod -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                     NOMINATED NODE   READINESS GATES
test-rc-cc5n2   1/1     Running   0          70m   10.131.0.30   huirwang-bug1903414-rrkmp-worker-97znx   <none>           <none>
test-rc-rbglb   1/1     Running   0          70m   10.128.3.33   huirwang-bug1903414-rrkmp-worker-jmff2   <none>           <none>

$ oc get svc
NAME        TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
hello-pod   NodePort   172.30.131.139   <none>        8000:30011/TCP   13h
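For context, on openshift-sdn an egress IP like the one shown above is normally assigned by patching the project's NetNamespace and a node's HostSubnet. A sketch using the same namespace, node and address as this test, shown only to make the setup reproducible (the exact commands used were not recorded in this report):

$ oc patch netnamespace test --type=merge -p '{"egressIPs": ["172.31.249.201"]}'
$ oc patch hostsubnet huirwang-bug1903414-rrkmp-master-0 --type=merge -p '{"egressIPs": ["172.31.249.201"]}'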
I acquired a tcpdump while doing this simple test:

$ oc rsh test-rc-rbglb
sh-5.0$ curl 172.31.249.13:30011
^C
sh-5.0$ curl 172.31.249.13:30011
^C
sh-5.0$ curl 172.31.249.13:30011
^C
sh-5.0$ curl 172.31.249.13:30011
^C

And checking the tcpdump on the ens192 interface of the worker hosting the client pod:

$ tshark -r huirwang-bug1903414-rrkmp-worker-jmff2.pcap -Y 'tcp.port == 30011'
 207 2.571442 0.005239 15:59:23.744467 10.128.3.33 → 172.31.249.13 TCP 124 56812 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053704175 TSecr=0 WS=128
 244 3.621191 0.012833 15:59:24.794216 10.128.3.33 → 172.31.249.13 TCP 124 [TCP Retransmission] 56812 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053705225 TSecr=0 WS=128
 539 5.669215 0.003371 15:59:26.842240 10.128.3.33 → 172.31.249.13 TCP 124 [TCP Retransmission] 56812 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053707272 TSecr=0 WS=128
 1027 10.347814 0.005093 15:59:31.520839 10.128.3.33 → 172.31.249.13 TCP 124 56960 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053711951 TSecr=0 WS=128
 1110 11.365343 0.097540 15:59:32.538368 10.128.3.33 → 172.31.249.13 TCP 124 [TCP Retransmission] 56960 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053712969 TSecr=0 WS=128
 1551 13.413177 0.063932 15:59:34.586202 10.128.3.33 → 172.31.249.13 TCP 124 [TCP Retransmission] 56960 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053715016 TSecr=0 WS=128
 2054 17.649781 0.005325 15:59:38.822806 10.128.3.33 → 172.31.249.13 TCP 124 57096 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053719253 TSecr=0 WS=128
 2196 18.661183 0.007356 15:59:39.834208 10.128.3.33 → 172.31.249.13 TCP 124 [TCP Retransmission] 57096 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053720264 TSecr=0 WS=128
 6028 20.709184 0.003825 15:59:41.882209 10.128.3.33 → 172.31.249.13 TCP 124 [TCP Retransmission] 57096 → 30011 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3053722312 TSecr=0 WS=128

From this it's pretty obvious that either the traffic is not making it to the node, the pod is not getting it, or the server is not answering. Checking the server pod's tcpdump I see the pod ACKs the SYN, which means the issue happens on the server's node: for some reason the virtual switch discards this traffic.

$ tshark -r serverpod.pcap | head -6
 1 0.000000 0.000000 16:16:41.242155 172.31.249.201 → 10.131.0.30 TCP 74 47816 → 8000 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3054741674 TSecr=0 WS=128
 2 0.000043 0.000043 16:16:41.242198 10.131.0.30 → 172.31.249.201 TCP 74 8000 → 47816 [SYN, ACK] Seq=0 Ack=1 Win=27960 Len=0 MSS=1410 SACK_PERM=1 TSval=530118369 TSecr=3054741674 WS=128
 3 0.001140 0.001097 16:16:41.243295 172.31.249.201 → 10.131.0.30 TCP 54 47816 → 8000 [RST] Seq=1 Win=0 Len=0
 4 1.053287 1.052147 16:16:42.295442 172.31.249.201 → 10.131.0.30 TCP 74 [TCP Retransmission] 47816 → 8000 [SYN] Seq=0 Win=28200 Len=0 MSS=1410 SACK_PERM=1 TSval=3054742728 TSecr=0 WS=128
 5 1.053326 0.000039 16:16:42.295481 10.131.0.30 → 172.31.249.201 TCP 74 [TCP Previous segment not captured] [TCP Port numbers reused] 8000 → 47816 [SYN, ACK] Seq=16457586 Ack=1 Win=27960 Len=0 MSS=1410 SACK_PERM=1 TSval=530119423 TSecr=3054742728 WS=128
 6 1.053793 0.000467 16:16:42.295948 172.31.249.201 → 10.131.0.30 TCP 54 47816 → 8000 [RST] Seq=1 Win=0 Len=0

Investigation notes:

Inside the pod we see the packet sent:

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:01:12.517597 IP 172.31.249.201.33052 > 10.130.2.4.8000: Flags [S], seq 1370673013, win 28200, options [mss 1410,sackOK,TS val 47144581 ecr 0,nop,wscale 7], length 0
15:01:12.517646 IP 10.130.2.4.8000 > 172.31.249.201.33052: Flags [S.], seq 33490073, ack 1370673014, win 27960, options [mss 1410,sackOK,TS val 47144581 ecr 47144581,nop,wscale 7], length 0
15:01:12.518054 IP 172.31.249.201.33052 > 10.130.2.4.8000: Flags [R], seq 1370673014, win 0, length 0
15:01:13.520211 IP 172.31.249.201.33052 > 10.130.2.4.8000: Flags [S], seq 1370673013, win 28200, options [mss 1410,sackOK,TS val 47145584 ecr 0,nop,wscale 7], length 0
15:01:13.520272 IP 10.130.2.4.8000 > 172.31.249.201.33052: Flags [S.], seq 49156100, ack 1370673014, win 27960, options [mss 1410,sackOK,TS val 47145584 ecr 47145584,nop,wscale 7], length 0
15:01:13.520633 IP 172.31.249.201.33052 > 10.130.2.4.8000: Flags [R], seq 1370673014, win 0, length 0

In the node the conntrack entry isn't complete, it's just SYN_RECV:

sh-4.4# conntrack -L -p tcp | grep 30011
conntrack v1.4.4 (conntrack-tools): 202 flow entries have been shown.
tcp 6 59 SYN_RECV src=172.31.249.201 dst=172.31.249.158 sport=33926 dport=30011 src=10.130.2.4 dst=172.31.249.201 sport=8000 dport=33926 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1

sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep 10.130.2.4 | grep in_port=
cookie=0x0, duration=12898.275s, table=20, n_packets=31, n_bytes=1302, priority=100,arp,in_port=5,arp_spa=10.130.2.4,arp_sha=00:00:0a:82:02:04/00:00:ff:ff:ff:ff actions=load:0xb931f5->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=12898.275s, table=20, n_packets=52, n_bytes=3892, priority=100,ip,in_port=5,nw_src=10.130.2.4 actions=load:0xb931f5->NXM_NX_REG0[],goto_table:21

Following the packet inside the switch, we care about port 5, source IP 10.130.2.4 and destination IP 172.31.249.201.

table 0:

cookie=0x0, duration=48411.707s, table=0, n_packets=189023, n_bytes=53846847, priority=1000,ct_state=-trk,ip actions=ct(table=0) <- MATCHES conntrack table=0

# don't match
cookie=0x0, duration=48411.707s, table=0, n_packets=49787, n_bytes=4512768, priority=400,ip,in_port=tun0,nw_src=10.130.2.1 actions=goto_table:30
cookie=0x0, duration=48411.707s, table=0, n_packets=696, n_bytes=97440, priority=300,ip,in_port=tun0,nw_src=10.130.2.0/23,nw_dst=10.128.0.0/14 actions=goto_table:25
cookie=0x0, duration=48411.707s, table=0, n_packets=0, n_bytes=0, priority=250,ip,in_port=tun0,nw_dst=224.0.0.0/4 actions=drop
cookie=0x0, duration=48411.707s, table=0, n_packets=10922, n_bytes=458724, priority=200,arp,in_port=vxlan0,arp_spa=10.128.0.0/14,arp_tpa=10.130.2.0/23 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
cookie=0x0, duration=48411.707s, table=0, n_packets=68657, n_bytes=22547874, priority=200,ip,in_port=vxlan0,nw_src=10.128.0.0/14 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
cookie=0x0, duration=48411.707s, table=0, n_packets=48, n_bytes=2592, priority=200,ip,in_port=vxlan0,nw_dst=10.128.0.0/14 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
cookie=0x0, duration=48411.707s, table=0, n_packets=2291, n_bytes=96222, priority=200,arp,in_port=tun0,arp_spa=10.130.2.1,arp_tpa=10.128.0.0/14 actions=goto_table:30
cookie=0x0, duration=48411.707s, table=0, n_packets=20527, n_bytes=8734201, priority=200,ip,in_port=tun0 actions=goto_table:30
cookie=0x0, duration=48411.707s, table=0, n_packets=0, n_bytes=0, priority=150,in_port=vxlan0 actions=drop
cookie=0x0, duration=48411.707s, table=0, n_packets=14, n_bytes=1068, priority=150,in_port=tun0 actions=drop
cookie=0x0, duration=48411.707s, table=0, n_packets=11821, n_bytes=496482, priority=100,arp actions=goto_table:20

cookie=0x0, duration=48411.707s, table=0, n_packets=120306, n_bytes=31295397, priority=100,ip actions=goto_table:20 <- MATCHES go to table 20

table 20:

# don't match
cookie=0x0, duration=48536.838s, table=20, n_packets=1581, n_bytes=66402, priority=100,arp,in_port=veth07db5d4f,arp_spa=10.130.2.2,arp_sha=00:00:0a:82:02:02/00:00:ff:ff:ff:ff actions=load:0x390da7->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=48536.752s, table=20, n_packets=10242, n_bytes=430164, priority=100,arp,in_port=vethe3e9b323,arp_spa=10.130.2.3,arp_sha=00:00:0a:82:02:03/00:00:ff:ff:ff:ff actions=load:0x400329->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=13845.403s, table=20, n_packets=33, n_bytes=1386, priority=100,arp,in_port=veth105793d3,arp_spa=10.130.2.4,arp_sha=00:00:0a:82:02:04/00:00:ff:ff:ff:ff actions=load:0xb931f5->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=48536.838s, table=20, n_packets=19511, n_bytes=7245497, priority=100,ip,in_port=veth07db5d4f,nw_src=10.130.2.2 actions=load:0x390da7->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=48536.752s, table=20, n_packets=101058, n_bytes=24131939, priority=100,ip,in_port=vethe3e9b323,nw_src=10.130.2.3 actions=load:0x400329->NXM_NX_REG0[],goto_table:21

cookie=0x0, duration=13845.403s, table=20, n_packets=57, n_bytes=5375, priority=100,ip,in_port=veth105793d3,nw_src=10.130.2.4 actions=load:0xb931f5->NXM_NX_REG0[],goto_table:21 <- MATCHES REG0=0xb931f5

cookie=0x0, duration=48562.706s, table=20, n_packets=0, n_bytes=0, priority=0 actions=drop

table 21:

# doesn't match
cookie=0x0, duration=48648.757s, table=21, n_packets=97303, n_bytes=27643377, priority=200,ip,nw_dst=10.128.0.0/14 actions=ct(commit,table=30)

cookie=0x0, duration=48648.797s, table=21, n_packets=35399, n_bytes=4290695, priority=0 actions=goto_table:30 <- MATCH go to table 30

table 30:

# don't match:
cookie=0x0, duration=48684.809s, table=30, n_packets=2302, n_bytes=96684, priority=300,arp,arp_tpa=10.130.2.1 actions=output:tun0
cookie=0x0, duration=48684.809s, table=30, n_packets=48547, n_bytes=4630691, priority=300,ip,nw_dst=10.130.2.1 actions=output:tun0

cookie=0x0, duration=48684.809s, table=30, n_packets=60852, n_bytes=27968399, priority=300,ct_state=+rpl,ip,nw_dst=10.130.2.0/23 actions=ct(table=70,nat) <- MATCH nat conntrack table=70

# don't match
cookie=0x0, duration=48684.809s, table=30, n_packets=11883, n_bytes=499086, priority=200,arp,arp_tpa=10.130.2.0/23 actions=goto_table:40
cookie=0x0, duration=48684.809s, table=30, n_packets=78881, n_bytes=8146431, priority=200,ip,nw_dst=10.130.2.0/23 actions=goto_table:70
cookie=0x0, duration=48684.809s, table=30, n_packets=10979, n_bytes=461118, priority=100,arp,arp_tpa=10.128.0.0/14 actions=goto_table:50
cookie=0x0, duration=48684.809s, table=30, n_packets=52526, n_bytes=23326289, priority=100,ip,nw_dst=10.128.0.0/14 actions=goto_table:90
cookie=0x0, duration=48684.809s, table=30, n_packets=19677, n_bytes=3302902, priority=100,ip,nw_dst=172.30.0.0/16 actions=goto_table:60
cookie=0x0, duration=48684.809s, table=30, n_packets=0, n_bytes=0, priority=50,ip,in_port=vxlan0,nw_dst=224.0.0.0/4 actions=goto_table:120
cookie=0x0, duration=48684.809s, table=30, n_packets=0, n_bytes=0, priority=25,ip,nw_dst=224.0.0.0/4 actions=goto_table:110

cookie=0x0, duration=48684.809s, table=30, n_packets=3873, n_bytes=494356, priority=0,ip actions=goto_table:100 <- MATCH go to table 100

table 100:

# don't match
cookie=0x0, duration=48832.450s, table=100, n_packets=0, n_bytes=0, priority=300,udp,tp_dst=4789 actions=drop
cookie=0x0, duration=48832.450s, table=100, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.31.249.158,tp_dst=53 actions=output:tun0
cookie=0x0, duration=48832.450s, table=100, n_packets=3855, n_bytes=494856, priority=200,udp,nw_dst=172.31.249.158,tp_dst=53 actions=output:tun0

cookie=0x0, duration=48832.185s, table=100, n_packets=48, n_bytes=3552, priority=100,ip,reg0=0xb931f5 actions=ct(commit),move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.31.249.123->tun_dst,output:vxlan0 <- MATCH

Effective flows:

cookie=0x0, duration=48411.707s, table=0, n_packets=189023, n_bytes=53846847, priority=1000,ct_state=-trk,ip actions=ct(table=0) <- MATCHES conntrack table=0
cookie=0x0, duration=48411.707s, table=0, n_packets=120306, n_bytes=31295397, priority=100,ip actions=goto_table:20 <- MATCHES go to table 20
cookie=0x0, duration=13845.403s, table=20, n_packets=57, n_bytes=5375, priority=100,ip,in_port=veth105793d3,nw_src=10.130.2.4 actions=load:0xb931f5->NXM_NX_REG0[],goto_table:21 <- MATCHES REG0=0xb931f5 and go to table 21
cookie=0x0, duration=48648.797s, table=21, n_packets=35399, n_bytes=4290695, priority=0 actions=goto_table:30 <- MATCH go to table 30
cookie=0x0, duration=48684.809s, table=30, n_packets=60852, n_bytes=27968399, priority=300,ct_state=+rpl,ip,nw_dst=10.130.2.0/23 actions=ct(table=70,nat) <- MATCH nat conntrack table=70
cookie=0x0, duration=48684.809s, table=30, n_packets=3873, n_bytes=494356, priority=0,ip actions=goto_table:100 <- MATCH go to table 100
cookie=0x0, duration=48832.185s, table=100, n_packets=48, n_bytes=3552, priority=100,ip,reg0=0xb931f5 actions=ct(commit),move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.31.249.123->tun_dst,output:vxlan0 <- MATCH send the packet through the egressIP node, encapsulate with VNID 0xb931f5
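The walk above was done by hand against the dump-flows output. As a side note, OVS can do the same walk automatically with ofproto/trace; a sketch for this reply packet, using the port and addresses from the conntrack entry above (the exact flow fields and options may need adjusting, and on recent OVS the conntrack recirculation can be described with --ct-next):

sh-4.4# ovs-appctl ofproto/trace br0 "in_port=5,tcp,nw_src=10.130.2.4,nw_dst=172.31.249.201,tp_src=8000,tp_dst=33926"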
And we see the traffic sent back in the vxlan NIC:

$ tshark -r rhel-0-vxlan.pcap -Y 'ip.addr==10.130.2.4'
 11 3.559266 1.025974 15:33:47.650832 10.130.2.4 → 172.31.249.201 TCP 74 8000 → 50494 [SYN, ACK] Seq=0 Ack=1 Win=27960 Len=0 MSS=1410 SACK_PERM=1 TSval=49099714 TSecr=49099713 WS=128
 12 3.559990 0.000724 15:33:47.651556 172.31.249.201 → 10.130.2.4 TCP 54 50494 → 8000 [RST] Seq=1 Win=0 Len=0
 17 4.560400 0.571170 15:33:48.651966 10.130.2.4 → 172.31.249.201 TCP 74 [TCP Previous segment not captured] [TCP Port numbers reused] 8000 → 50494 [SYN, ACK] Seq=15644212 Ack=1 Win=27960 Len=0 MSS=1410 SACK_PERM=1 TSval=49100715 TSecr=49100716 WS=128
 18 4.560717 0.000317 15:33:48.652283 172.31.249.201 → 10.130.2.4 TCP 54 50494 → 8000 [RST] Seq=1 Win=0 Len=0
 19 7.266425 2.705708 15:33:51.357991 10.130.2.4 → 172.31.249.201 TCP 74 8000 → 50532 [SYN, ACK] Seq=0 Ack=1 Win=27960 Len=0 MSS=1410 SACK_PERM=1 TSval=49103421 TSecr=49103422 WS=128
 20 7.266925 0.000500 15:33:51.358491 172.31.249.201 → 10.130.2.4 TCP 54 50532 → 8000 [RST] Seq=1 Win=0 Len=0
 21 8.269347 1.002422 15:33:52.360913 10.130.2.4 → 172.31.249.201 TCP 74 [TCP Previous segment not captured] [TCP Port numbers reused] 8000 → 50532 [SYN, ACK] Seq=15670647 Ack=1 Win=27960 Len=0 MSS=1410 SACK_PERM=1 TSval=49104424 TSecr=49104425 WS=128
 22 8.269696 0.000349 15:33:52.361262 172.31.249.201 → 10.130.2.4 TCP 54 50532 → 8000 [RST] Seq=1 Win=0 Len=0

sh-4.4# ip route
default via 172.31.248.1 dev ens192 proto dhcp metric 100
10.128.0.0/14 dev tun0 scope link
172.30.0.0/16 dev tun0
172.31.248.0/23 dev ens192 proto kernel scope link src 172.31.249.158 metric 100

We have a default GW so this traffic should be sent there, so I'm inclined to think the packets are lost in iptables, however I added trace rules in iptables and the reply doesn't show up at all. Needs further investigation.
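For reference, trace rules of the kind mentioned above can be added roughly like this; a sketch that matches the reply as seen in this environment (pod source port 8000 towards the egress IP), not the exact rules that were used. The trace output appears in the kernel log, or in `xtables-monitor --trace` when the nft backend of iptables is in use; remember to delete the rules afterwards:

sh-4.4# iptables -t raw -A PREROUTING -p tcp --sport 8000 -d 172.31.249.201 -j TRACE
sh-4.4# iptables -t raw -A OUTPUT -p tcp --sport 8000 -d 172.31.249.201 -j TRACE
# ...reproduce the failing curl, inspect the TRACE entries, then clean up:
sh-4.4# iptables -t raw -D PREROUTING -p tcp --sport 8000 -d 172.31.249.201 -j TRACE
sh-4.4# iptables -t raw -D OUTPUT -p tcp --sport 8000 -d 172.31.249.201 -j TRACE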
OK I found the issue. The problem is this flow here:

We incorrectly try to encapsulate the traffic here:

cookie=0x0, duration=48832.185s, table=100, n_packets=48, n_bytes=3552, priority=100,ip,reg0=0xb931f5 actions=ct(commit),move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.31.249.123->tun_dst,output:vxlan0 <- MATCH send the packet through the egressIP node, encapsulate with VNID 0xb931f5

And the reason why we don't see the traffic going through the ens192 is because we have created the vxlan interface with 'options:remote_ip=flow' but we're not defining the field tun_dst, therefore OVS has no destination for it and drops it.

I can think of a couple of ways to fix it, but because this is an architectural change, it will be small in code but very high in complexity, and I'm fairly scared about possible side effects of the fix. I will need consensus here. Most of the team is on PTO and this is my last day before PTO. I'll raise a discussion about this when I'm back on the 2nd of January.

PS: I fixed this by adding a flow manually in my cluster which I don't think is going to break anything, but it may:

ovs-ofctl -O OpenFlow13 add-flow br0 table=100,priority=250,ip,nw_dst=172.31.249.201,reg0=0xb931f5,actions=output:tun0

In my previous comment I assumed the client was another pod within the cluster consuming an egressIP. Unfortunately the customer's client is external, which means we cannot rely on that constraint and my manually added flow won't work for this scenario. I need to discuss this with other SDN team members because I don't see a way to differentiate the client's egress traffic from the server's egress traffic so that we apply the egressIP only to the client...

(In reply to Juan Luis de Sousa-Valadas from comment #14)
> OK I found the issue. The problem is this flow here:
>
> We incorrectly try to encapsulate the traffic here:
>
> cookie=0x0, duration=48832.185s, table=100, n_packets=48, n_bytes=3552, priority=100,ip,reg0=0xb931f5
> actions=ct(commit),move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:172.31.249.123->tun_dst,output:vxlan0
> <- MATCH send the packet through the egressIP node, encapsulate with VNID 0xb931f5
>
> And the reason why we don't see the traffic going through the ens192 is
> because we have created the vxlan interface with 'options:remote_ip=flow'
> but we're not defining the field tun_dst, therefore OVS has no destination
> for it and drops it.

But we *are* defining tun_dst: "set_field:172.31.249.123->tun_dst"

(In reply to Dan Winship from comment #18)
> But we *are* defining tun_dst: "set_field:172.31.249.123->tun_dst"

You're right, I don't know why I didn't see the reply in the tcpdump of ens192. I think the filter 'tcp.port == 30011' is valid, but I don't have the tcpdump any more. Anyway, we still need to avoid hitting that flow when the pod is the server. Even if the packet is sent to the node that has the egress IP, and that node forwards it as we expect when the pod is the client, the client will get a reply with the wrong source IP.

ah... yes, I think you want to check `ct_state=-rpl`. Actually, all of table 100 should be bypassed for reply packets.

Hello, the reason why conntrack failed to skip table 100 is that I defined the flow as `ct_state=-rpl,actions=goto_table:101` instead of the current `ct_state=+rpl,actions=goto_table:101`.
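For reference, a minimal sketch of what such a reply-bypass flow could look like if added by hand with ovs-ofctl. This is only an illustration of the match/action quoted above; the real fix installs the flow through openshift-sdn, and the priority here is an assumption chosen only to sit above the reg0-based egressIP flow in table 100:

ovs-ofctl -O OpenFlow13 add-flow br0 "table=100,priority=300,ct_state=+rpl,ip,actions=goto_table:101"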
Jason verified both the nodePort and the egressIP work as expected in the PR.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633