Description of problem:
[ocp-4.10][ovn-kubernetes] Pod fails to connect to the Kubernetes service IP when an egress IP is assigned to its namespace.

Version-Release number of selected component (if applicable):
4.10.8

How reproducible:

Steps to Reproduce:

1. Configure an egressIP for a namespace (see the manifest sketch at the end of this comment):

Mon Apr 18 10:38:44 skanakal ☻ ☀ oc create -f egressip.yml
egressip.k8s.ovn.org/egress-project1 created
Mon Apr 18 10:38:57 skanakal ☻ ☀
Mon Apr 18 10:38:57 skanakal ☻ ☀ oc get egressip
NAME              EGRESSIPS       ASSIGNED NODE                            ASSIGNED EGRESSIPS
egress-project1   192.168.51.13   ci-ln-r4vc4yk-c1627-5lfzr-worker-hk9ps   192.168.51.13
Mon Apr 18 10:39:00 skanakal ☻ ☀

2. Deploy test pods and verify connectivity to the Kubernetes service IP:

Mon Apr 18 10:37:40 skanakal ☻ ☀ oc get pods -o wide
NAME             READY   STATUS    RESTARTS   AGE   IP            NODE                                     NOMINATED NODE   READINESS GATES
caddy-rc-n7tfx   1/1     Running   0          99s   10.129.2.15   ci-ln-r4vc4yk-c1627-5lfzr-worker-kp8ps   <none>           <none>
caddy-rc-xghl7   1/1     Running   0          99s   10.128.2.22   ci-ln-r4vc4yk-c1627-5lfzr-worker-hk9ps   <none>           <none>
Mon Apr 18 10:38:04 skanakal ☻ ☀

It works from the pod that is currently on the egress node:

Mon Apr 18 10:39:01 skanakal ☻ ☀ oc rsh caddy-rc-xghl7
/srv $
/srv $
/srv $ nc -zv 172.30.0.1 443
172.30.0.1 (172.30.0.1:443) open
/srv $
/srv $ exit

It fails from the pod that is scheduled on a non-egress node:

Mon Apr 18 10:39:22 skanakal ☻ ☀ oc rsh caddy-rc-n7tfx
/srv $
/srv $ nc -zv 172.30.0.1 443
nc: 172.30.0.1 (172.30.0.1:443): Operation timed out
/srv $ exit
command terminated with exit code 1

3. After deleting the egressIP, it works from both pods:

Mon Apr 18 10:46:25 skanakal ☻ ☀ oc delete egressip egress-project1
egressip.k8s.ovn.org "egress-project1" deleted
Mon Apr 18 10:47:28 skanakal ☻ ☀
Mon Apr 18 10:47:30 skanakal ☻ ☀
Mon Apr 18 10:47:31 skanakal ☻ ☀ oc rsh caddy-rc-n7tfx
/srv $
/srv $ nc -zv 172.30.0.1 443
172.30.0.1 (172.30.0.1:443) open
/srv $
/srv $ exit

Actual results:
The pod fails to connect to the Kubernetes service IP while the egress IP is attached.

Expected results:
The pod should be able to connect to the Kubernetes service IP even when the egress IP is attached.

Additional info:
I am able to reproduce this issue locally and we have must-gather data.
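The egressip.yml used in step 1 is not attached to this report. As a rough sketch, a minimal EgressIP manifest of this kind could look like the following; only the object name and the egress IP are taken from the output above, while the namespace label is an assumption for illustration:

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-project1
spec:
  # IP taken from the "oc get egressip" output above
  egressIPs:
  - 192.168.51.13
  # Assumed label: the target namespace must carry a matching label
  # for the egress IP to apply to its pods
  namespaceSelector:
    matchLabels:
      name: egress-project1

Note that at least one node must be labeled k8s.ovn.org/egress-assignable, otherwise the egress IP will not be assigned to any node.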
sh-4.4# ovn-nbctl lr-policy-list ovn_cluster_router
Routing Policies
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-master-0" && ip4.dst == 192.168.51.14 /* ci-ln-r4vc4yk-c1627-5lfzr-master-0 */         reroute   10.129.0.2
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-master-0" && ip4.dst == 192.168.51.2 /* ci-ln-r4vc4yk-c1627-5lfzr-master-0 */          reroute   10.129.0.2
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-master-1" && ip4.dst == 192.168.51.19 /* ci-ln-r4vc4yk-c1627-5lfzr-master-1 */         reroute   10.128.0.2
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-master-2" && ip4.dst == 192.168.51.30 /* ci-ln-r4vc4yk-c1627-5lfzr-master-2 */         reroute   10.130.0.2
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-worker-hdfdx" && ip4.dst == 192.168.51.20 /* ci-ln-r4vc4yk-c1627-5lfzr-worker-hdfdx */ reroute   10.131.0.2
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-worker-hk9ps" && ip4.dst == 192.168.51.23 /* ci-ln-r4vc4yk-c1627-5lfzr-worker-hk9ps */ reroute   10.128.2.2
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-worker-hk9ps" && ip4.dst == 192.168.51.3 /* ci-ln-r4vc4yk-c1627-5lfzr-worker-hk9ps */  reroute   10.128.2.2
      1004 inport == "rtos-ci-ln-r4vc4yk-c1627-5lfzr-worker-kp8ps" && ip4.dst == 192.168.51.12 /* ci-ln-r4vc4yk-c1627-5lfzr-worker-kp8ps */ reroute   10.129.2.2
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.128.0.0/14       allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 100.64.0.0/16       allow    <<<<<<<<<----------
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 192.168.51.12/32    allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 192.168.51.14/32    allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 192.168.51.19/32    allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 192.168.51.20/32    allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 192.168.51.23/32    allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 192.168.51.30/32    allow
sh-4.4# exit

It seems host-network access is allowed for the services backed by egressIP-matching pods:

Mon Apr 18 11:41:28 skanakal ☻ ☀ oc get network cluster -o json | jq '.status'
{
  "clusterNetwork": [
    {
      "cidr": "10.128.0.0/14",
      "hostPrefix": 23
    }
  ],
  "clusterNetworkMTU": 1400,
  "networkType": "OVNKubernetes",
  "serviceNetwork": [
    "172.30.0.0/16"
  ]
}
Mon Apr 18 11:42:08 skanakal ☻ ☀
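For anyone re-running the check above: the listing comes from the OVN northbound database. A rough sketch of how to get the same output follows; the pod name is a placeholder and the label/container names are assumptions based on a default 4.10 OVN-Kubernetes deployment:

# Find an ovnkube-master pod and open a shell in its nbdb container
# (assumed label app=ovnkube-master; adjust the pod name for your cluster)
oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master
oc -n openshift-ovn-kubernetes rsh -c nbdb ovnkube-master-<xxxxx>

# Inside the container, dump the logical router policies shown above.
# In OVN, the numerically highest matching priority wins, so the 1004
# reroute policies are evaluated before the 101 allow policies.
ovn-nbctl lr-policy-list ovn_cluster_router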
If this is believed to be a regression (i.e. it worked in 4.9 but not in 4.10), please add the Regression keyword. It's unclear from the description whether this is believed to be a regression or not.
Is it possible this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2070929? @flavio: wdyt? Since the API server backend pods are host-networked, is it possible that the 1004 reroute policy takes priority over the 101 allow policy? I'm surprised we haven't noticed this for so long, though; I'm not sure whether the same happens in versions earlier than 4.9 as well.
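One way to make the 1004-vs-101 question concrete, assuming ovn-trace is available in the same container, is to trace a packet from the failing pod towards a host-networked apiserver backend and see which policy it hits. Everything below other than the node names and pod/master IPs is an assumption for illustration: the logical switch port is normally named <namespace>_<pod> (a "test" namespace is assumed here), the MACs follow OVN-Kubernetes' 0a:58 convention (confirm with ovn-nbctl lsp-get-addresses), and 192.168.51.14:6443 is used as the post-DNAT backend of the 172.30.0.1:443 service.

# Hypothetical trace from caddy-rc-n7tfx (10.129.2.15, on the non-egress node)
# to a host-networked kube-apiserver backend; adjust names, MACs and ports.
ovn-trace ci-ln-r4vc4yk-c1627-5lfzr-worker-kp8ps '
  inport == "test_caddy-rc-n7tfx" &&
  eth.src == 0a:58:0a:81:02:0f && eth.dst == 0a:58:0a:81:02:01 &&
  ip4.src == 10.129.2.15 && ip4.dst == 192.168.51.14 &&
  ip.ttl == 64 && tcp && tcp.dst == 6443'

The trace output shows which logical router policy (if any) the packet matches on ovn_cluster_router and where it gets rerouted.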
I think this is a duplicate of bug 2070929. I can see that SNAT entries are missing on the originating node (not the egress IP node).

*** This bug has been marked as a duplicate of bug 2070929 ***
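For future readers hitting the same symptom, a rough way to look at the NAT state described above (router names follow OVN-Kubernetes' GR_<nodeName> convention; run from the nbdb container as in the earlier comment):

# NAT table on the gateway router of the node hosting the failing pod
# (the "originating" node) vs. the egress IP node. With the egress IP
# assigned, look for snat entries covering the failing pod's IP
# (10.129.2.15); per the comment above they were found to be missing
# on the originating node.
ovn-nbctl lr-nat-list GR_ci-ln-r4vc4yk-c1627-5lfzr-worker-kp8ps
ovn-nbctl lr-nat-list GR_ci-ln-r4vc4yk-c1627-5lfzr-worker-hk9ps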
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days