Bug 2034477
Summary: | [OVN] Multiple EgressIP objects configured, EgressIPs weren't working properly | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | huirwang
Component: | Networking | Assignee: | Ben Bennett <bbennett>
Networking sub component: | ovn-kubernetes | QA Contact: | huirwang
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | anbhat, dbrahane, ffernand, jechen
Version: | 4.10 | Keywords: | Triaged
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-03-10 16:35:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2029742 | |
Bug Blocks: | | |
Description huirwang 2021-12-21 04:15:41 UTC
Can you give me a little more info about what is running on "10.0.2.196:9095"? Is that an external server that prints back the IP of the curl client? I would like to try it! ;)

I am assuming that to reproduce this issue you did not have a specific script and simply added/removed configs until you got the cluster into this bad state, correct?

* Regarding issue 1 of 2: pod test-rc-qpt8v using snat from egressip4

There may be a bug in the logic for deciding which egressip is usable by a given pod. Since "egressip-example6" is a superset of "egressip4", would you expect any pods from your example -- including "test-rc-65z6n" -- to use "egressip4"? The documentation [1] is not very clear on that, so I wonder if this is some undefined behavior. Or I may be missing something. I will look at the code some more, but I clearly see that ovn-k8s is adding the improper NAT in OVN:

```
[root@3aa61e97a1fe ~]# ovn-nbctl list logical_switch_port hrw_test-rc-qpt8v
_uuid     : d75b075f-88f5-4bd6-ab4c-636fb5bd908b
addresses : ["0a:58:0a:80:02:0c 10.128.2.12"]
...

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-58-47.us-east-2.compute.internal | grep 10.128.2.12
snat    10.0.58.102    10.128.2.12

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-61-37.us-east-2.compute.internal | grep 10.128.2.12
snat    10.0.58.101    10.128.2.12

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-67-155.us-east-2.compute.internal | grep 10.128.2.12
snat    10.0.67.100    10.128.2.12
```

Note from the output above that the pod's IP was not NAT'ed to any of the egress IPs of "egressip-example6", which is the exact opposite of what it should have done. :P

[1]: https://docs.openshift.com/container-platform/4.9/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html
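(As a side note for anyone reproducing the matching question above: a rough way to line up the Kubernetes-side selectors with the OVN-side NAT is sketched below. The namespace, pod, and node names are placeholders, and the ovn-nbctl commands are assumed to run from a pod with access to the OVN northbound database, as in the listings above.)

```shell
# Hypothetical names throughout; substitute your own namespace, pod, node, and pod IP.

# 1. List every EgressIP object, its namespaceSelector/podSelector, and its assigned addresses
#    (assuming the EgressIP CRD from ovn-kubernetes, egressips.k8s.ovn.io, is installed).
oc get egressip -o yaml

# 2. Compare those selectors against the labels on the namespace and on the pod itself.
oc get namespace <namespace> --show-labels
oc -n <namespace> get pod test-rc-qpt8v -o wide --show-labels

# 3. List the SNAT entries for the pod IP on each gateway router and confirm the external
#    addresses belong to the EgressIP object whose selectors actually match the pod.
ovn-nbctl lr-nat-list GR_<node-name> | grep <pod-ip>
```

Comparing the selector/label output with the NAT entries should make it clear which EgressIP object was expected to claim the pod.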
* Regarding issue 2 of 2: only 2 out of the 3 snat addresses are being observed

The reason we never see "10.0.67.112" is because of bug 2029742, where ovn_cluster_router is left with duplicate and wrong re-routes. Can you please retry this test with the fixes in this PR: https://github.com/ovn-org/ovn-kubernetes/pull/2735 , or wait for that bug's fix to be merged?

```
[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-58-47.us-east-2.compute.internal
TYPE     EXTERNAL_IP     EXTERNAL_PORT     LOGICAL_IP      EXTERNAL_MAC     LOGICAL_PORT
snat     10.0.58.111                       10.129.2.23
...

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-61-37.us-east-2.compute.internal
TYPE     EXTERNAL_IP     EXTERNAL_PORT     LOGICAL_IP      EXTERNAL_MAC     LOGICAL_PORT
snat     10.0.58.110                       10.129.2.23
...

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-67-155.us-east-2.compute.internal
TYPE     EXTERNAL_IP     EXTERNAL_PORT     LOGICAL_IP      EXTERNAL_MAC     LOGICAL_PORT
snat     10.0.67.112                       10.129.2.23
...

[root@a5eae22bcd51 ~]# ovn-nbctl show GR_ip-10-0-58-47.us-east-2.compute.internal
router 4cc1d62f-b9cd-4be4-9708-3c3a538bdf9e (GR_ip-10-0-58-47.us-east-2.compute.internal)
    port rtoj-GR_ip-10-0-58-47.us-east-2.compute.internal
        mac: "0a:58:64:40:00:07"
        networks: ["100.64.0.7/16"]
...

[root@a5eae22bcd51 ~]# ovn-nbctl show GR_ip-10-0-61-37.us-east-2.compute.internal
router fde7b52b-e3ec-41ee-89d5-48504ff93cd2 (GR_ip-10-0-61-37.us-east-2.compute.internal)
    port rtoj-GR_ip-10-0-61-37.us-east-2.compute.internal
        mac: "0a:58:64:40:00:05"
        networks: ["100.64.0.5/16"]
...

[root@a5eae22bcd51 ~]# ovn-nbctl show GR_ip-10-0-67-155.us-east-2.compute.internal
router 498fe80d-48ce-42e2-8ad7-d42ee766d657 (GR_ip-10-0-67-155.us-east-2.compute.internal)
    port rtoj-GR_ip-10-0-67-155.us-east-2.compute.internal
        mac: "0a:58:64:40:00:06"
        networks: ["100.64.0.6/16"]

[root@a5eae22bcd51 ~]# ovn-nbctl lr-policy-list ovn_cluster_router
Routing Policies
...
       100 ip4.src == 10.129.2.23    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7    <--- "37", "155", "47"
       100 ip4.src == 10.129.2.23    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7    <--- DUPLICATE
       100 ip4.src == 10.129.2.23    reroute    100.64.0.5, 100.64.0.7                <--- DUPLICATE AND WRONG!!!
...
```

Traffic path: pod on nodeA -> node switch on nodeA -> ovn_cluster_router (hits this priority-100 reroute policy) -> join switch -> GR (snat) -> external switch -> outside.

Found a flaw in the logic where the egressip's pod selector was not properly checking the labels of the pod. Potential fix posted upstream: https://github.com/ovn-org/ovn-kubernetes/pull/2742

Alexander asked me to give this bug to him; hopefully that is okay. :^)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
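(For anyone re-verifying a cluster that carries the fixes referenced above, a rough check of both symptoms is sketched below. The pod IP, node, namespace, and pod names are placeholders, and the ovn-nbctl commands are assumed to run from a pod with access to the OVN northbound database, as in the listings above.)

```shell
# Placeholder names; substitute values from your own cluster.

# Each gateway router hosting one of the assigned egress IPs should have exactly one SNAT
# entry for the pod IP, pointing at an address of the EgressIP object that selects the pod.
ovn-nbctl lr-nat-list GR_<node-1> | grep <pod-ip>
ovn-nbctl lr-nat-list GR_<node-2> | grep <pod-ip>
ovn-nbctl lr-nat-list GR_<node-3> | grep <pod-ip>

# There should be a single priority-100 reroute policy for the pod IP on ovn_cluster_router,
# with one nexthop per egress node (no duplicate or truncated entries).
ovn-nbctl lr-policy-list ovn_cluster_router | grep <pod-ip>

# Assuming the external service at 10.0.2.196:9095 echoes the caller's source IP (as asked
# about above), repeated requests from the pod should only ever show the expected egress IPs.
oc -n <namespace> exec <pod> -- curl -s http://10.0.2.196:9095/
```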