Bug 2117791 - NodePort access to pod broken after upgrading from 4.8 to 4.10 - old LGW got traffic from mp0 versus new SGW/LGW got traffic from join switch
Summary: NodePort access to pod broken after upgrading from 4.8 to 4.10 - old LGW got traffic from mp0 versus new SGW/LGW got traffic from join switch
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-11 22:17 UTC by Akash Dubey
Modified: 2023-09-18 04:44 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-02 13:12:32 UTC
Target Upstream Version:
Embargoed:
surya: needinfo-


Attachments

Comment 2 Miguel Duarte Barroso 2022-08-16 11:15:29 UTC
Can we get the route info from before the change? Or at least know which version was being used before, so we can extract it? @adubey

I'm mostly a bit confused about the ask here: is it about additive default-route behavior? (i.e. keeping both the eth0 and net1 ifaces with a default route?)

Comment 4 Akash Dubey 2022-08-23 13:40:13 UTC
Hello Miguel,

Thank you so much for looking into this.

They were using v4.8.36, where the routing table already had the entries for NodePort traffic.

It's about the missing routing entry: when the customer manually added the entry, traffic started to be received. The source IP address for NodePort traffic was likely within the pod network in v4.8.36, which is why the routing table already had entries covering it.
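
A quick way to confirm which source IP the NodePort traffic arrives with inside the pod (a sketch only; it assumes tcpdump is available in the pod image or via a debug container, and port 8080 below is a placeholder for the service's targetPort):
~~~
# Observe source addresses of traffic reaching the pod's cluster-network interface (eth0).
# Replace 8080 with the port the NodePort service actually forwards to.
tcpdump -ni eth0 'tcp dst port 8080'
~~~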

Let me know if any other information is needed.

Comment 5 Miguel Duarte Barroso 2022-08-29 08:41:23 UTC
Thanks.

Let's see what Doug comes up with.
(mostly just clearing out the `needinfo` flag).

Comment 6 Akash Dubey 2022-09-13 09:17:02 UTC
Hi Team,

Is there an update on this issue? The customer is expecting a response from our end.

Can we please prioritize this?

Comment 8 Douglas Smith 2022-10-12 18:40:30 UTC
After some review of the bridge CNI between versions, I don't believe there is a significant change that would cause this.

I think we'll need some analysis from OVN-K team to see if there's a possibility that a change to nodePort services caused this issue.

Thanks.

Comment 9 Akash Dubey 2022-10-13 07:19:07 UTC
Hi Team,

Is there an update we can share with the customer? They are really waiting for a response from us.

Comment 15 Akash Dubey 2022-11-01 08:04:04 UTC
Hi @mduarted @surya 

I am attaching the route info as shared by the customer today.

Output of ip route in the pod, OCP v4.10:
~~~
default via 199.219.44.1 dev net1 
172.18.0.0/15 via 172.18.19.1 dev eth0 
172.18.19.0/25 dev eth0 proto kernel scope link src 172.18.19.12 
192.168.48.0/20 via 172.18.19.1 dev eth0 
199.219.44.0/24 dev net1 proto kernel scope link src 199.219.44.15 
~~~
That is for a pod whose second IP address is 199.219.44.15. The pod network is 172.18.0.0/15 and the service IP network is 192.168.48.0/20.

They deployed an OCP v4.8.36 cluster to share the routing table info below:
~~~
default via 140.223.56.1 dev net1 
140.223.56.0/24 dev net1 proto kernel scope link src 140.223.56.206 
172.18.0.0/15 via 172.18.9.1 dev eth0 
172.18.9.0/25 dev eth0 proto kernel scope link src 172.18.9.64 
192.168.48.0/20 via 172.18.9.1 dev eth0
~~~
That is for a pod whose second IP address is 140.223.56.206. The pod network is 172.18.0.0/15 and the service IP network is 192.168.48.0/20.

This is what they said when sharing these details:
~~~
I don't recall there being anything much different in the IP routing tables between the two versions. The problem is that the IP address of the traffic coming into the pod changed (to a 100.64 IP address... prior to 4.10 it was an IP address from the pod subnet). However, the IP routing table doesn't have an entry for this 100.64.x.x range like it does for the pod subnet (and the service IP subnet).
~~~
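
For illustration only (the exact entry the customer added is not captured in this bug), a route of roughly the following form, assuming the 100.64.x.x source range is 100.64.0.0/16 and reusing the pod's eth0 gateway from the v4.10 table above, would return the reply traffic over the cluster network instead of the net1 default route:
~~~
# Hypothetical sketch; the actual prefix and gateway the customer used are not shown here.
ip route add 100.64.0.0/16 via 172.18.19.1 dev eth0
~~~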

Comment 16 Surya Seetharaman 2022-11-02 13:12:32 UTC
The 100.64.x.x range is how we started routing service traffic for shared gateway mode from OCP 4.8 onwards. We adopted the same strategy for LGW from 4.10 so that we no longer have to use the DGP and the two modes are brought closer together. Before those versions, this topology, with the secondary interface on the pod set as the default route, worked purely by coincidence or luck. It was not intended to work that way: one consciously sets which default route the traffic should go out through, and in the previous versions the topology just happened to allow it to work. After speaking with my team lead, we have agreed that this was never supported from the start. If this is needed, please open an RFE; closing this bug since there is nothing more we can do here.
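
For context, a sketch of how one could check which gateway mode a cluster is running (assuming the gatewayConfig.routingViaHost field exposed by the cluster network operator in these releases; true corresponds to local gateway, false or unset to shared gateway):
~~~
# Inspect the OVN-Kubernetes gateway mode on the cluster.
oc get network.operator.openshift.io cluster \
  -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}'
~~~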

Comment 27 Red Hat Bugzilla 2023-09-18 04:44:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

