Bug 2034477
Summary: | [OVN] Multiple EgressIP objects configured, EgressIPs weren't working properly | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | huirwang
Component: | Networking | Assignee: | Ben Bennett <bbennett>
Networking sub component: | ovn-kubernetes | QA Contact: | huirwang
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | anbhat, dbrahane, ffernand, jechen
Version: | 4.10 | Keywords: | Triaged
Target Milestone: | --- | |
Target Release: | 4.10.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-03-10 16:35:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2029742 | |
Bug Blocks: | | |
Description huirwang 2021-12-21 04:15:41 UTC
Can you give me a little more info about what is running on "10.0.2.196:9095"? Is that an external server that prints back the IP of the curl client? I would like to try it! ;)

I am assuming that to reproduce this issue you did not have a specific script and simply added/removed configs until you got the cluster into this bad state, correct?

* Regarding issue 1 of 2: pod test-rc-qpt8v using snat from egressip4

There may be a bug in the logic for deciding which egressip is usable by a given pod. Since "egressip-example6" is a superset of "egressip4", would you expect any pods from your example -- including "test-rc-65z6n" -- to use "egressip4"? The documentation [1] is not very clear on that, so I wonder if this is some undefined behavior. Or I may be missing something. I will look at the code some more, but I clearly see that ovn-k8s is adding the improper NAT in OVN:

```
[root@3aa61e97a1fe ~]# ovn-nbctl list logical_switch_port hrw_test-rc-qpt8v
_uuid     : d75b075f-88f5-4bd6-ab4c-636fb5bd908b
addresses : ["0a:58:0a:80:02:0c 10.128.2.12"]
...

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-58-47.us-east-2.compute.internal | grep 10.128.2.12
snat    10.0.58.102    10.128.2.12

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-61-37.us-east-2.compute.internal | grep 10.128.2.12
snat    10.0.58.101    10.128.2.12

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-67-155.us-east-2.compute.internal | grep 10.128.2.12
snat    10.0.67.100    10.128.2.12
```

Note from the output above that the pod's IP was not NAT'ed to any of the egress IPs of "egressip-example6", which is the exact opposite of what it should have done. :P

[1]: https://docs.openshift.com/container-platform/4.9/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html
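(As a side note for anyone reproducing the matching question above: a rough way to line up the Kubernetes-side selectors with the OVN-side NAT is sketched below. The namespace, pod, and node names are placeholders, and the ovn-nbctl commands are assumed to run from a pod with access to the OVN northbound database, as in the listings above.)

```shell
# Hypothetical names throughout; substitute your own namespace, pod, node, and pod IP.

# 1. List every EgressIP object, its namespaceSelector/podSelector, and its assigned addresses
#    (assuming the EgressIP CRD from ovn-kubernetes, egressips.k8s.ovn.io, is installed).
oc get egressip -o yaml

# 2. Compare those selectors against the labels on the namespace and on the pod itself.
oc get namespace <namespace> --show-labels
oc -n <namespace> get pod test-rc-qpt8v -o wide --show-labels

# 3. List the SNAT entries for the pod IP on each gateway router and confirm the external
#    addresses belong to the EgressIP object whose selectors actually match the pod.
ovn-nbctl lr-nat-list GR_<node-name> | grep <pod-ip>
```

Comparing the selector/label output with the NAT entries should make it clear which EgressIP object was expected to claim the pod.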
* Regarding issue 2 of 2: only 2 out of the 3 snat addresses are being observed

The reason we never see "10.0.67.112" is because of bug 2029742, where ovn_cluster_router is left with duplicate and wrong re-routes. Can you please retry this test with the fixes in this PR: https://github.com/ovn-org/ovn-kubernetes/pull/2735 , or wait for that bug's fix to be merged?

```
[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-58-47.us-east-2.compute.internal
TYPE     EXTERNAL_IP     EXTERNAL_PORT     LOGICAL_IP      EXTERNAL_MAC     LOGICAL_PORT
snat     10.0.58.111                       10.129.2.23
...

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-61-37.us-east-2.compute.internal
TYPE     EXTERNAL_IP     EXTERNAL_PORT     LOGICAL_IP      EXTERNAL_MAC     LOGICAL_PORT
snat     10.0.58.110                       10.129.2.23
...

[root@a5eae22bcd51 ~]# ovn-nbctl lr-nat-list GR_ip-10-0-67-155.us-east-2.compute.internal
TYPE     EXTERNAL_IP     EXTERNAL_PORT     LOGICAL_IP      EXTERNAL_MAC     LOGICAL_PORT
snat     10.0.67.112                       10.129.2.23
...

[root@a5eae22bcd51 ~]# ovn-nbctl show GR_ip-10-0-58-47.us-east-2.compute.internal
router 4cc1d62f-b9cd-4be4-9708-3c3a538bdf9e (GR_ip-10-0-58-47.us-east-2.compute.internal)
    port rtoj-GR_ip-10-0-58-47.us-east-2.compute.internal
        mac: "0a:58:64:40:00:07"
        networks: ["100.64.0.7/16"]
...

[root@a5eae22bcd51 ~]# ovn-nbctl show GR_ip-10-0-61-37.us-east-2.compute.internal
router fde7b52b-e3ec-41ee-89d5-48504ff93cd2 (GR_ip-10-0-61-37.us-east-2.compute.internal)
    port rtoj-GR_ip-10-0-61-37.us-east-2.compute.internal
        mac: "0a:58:64:40:00:05"
        networks: ["100.64.0.5/16"]
...

[root@a5eae22bcd51 ~]# ovn-nbctl show GR_ip-10-0-67-155.us-east-2.compute.internal
router 498fe80d-48ce-42e2-8ad7-d42ee766d657 (GR_ip-10-0-67-155.us-east-2.compute.internal)
    port rtoj-GR_ip-10-0-67-155.us-east-2.compute.internal
        mac: "0a:58:64:40:00:06"
        networks: ["100.64.0.6/16"]

[root@a5eae22bcd51 ~]# ovn-nbctl lr-policy-list ovn_cluster_router
Routing Policies
...
       100 ip4.src == 10.129.2.23    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7    <--- "37", "155", "47"
       100 ip4.src == 10.129.2.23    reroute    100.64.0.5, 100.64.0.6, 100.64.0.7    <--- DUPLICATE
       100 ip4.src == 10.129.2.23    reroute    100.64.0.5, 100.64.0.7                <--- DUPLICATE AND WRONG!!!
...
```

Traffic path: pod on nodeA -> node switch on nodeA -> ovn_cluster_router (hits this priority-100 reroute policy) -> join switch -> GR (snat) -> external switch -> outside.

Found a flaw in the logic where the egressip's pod selector was not properly checking the labels of the pod. Potential fix posted upstream: https://github.com/ovn-org/ovn-kubernetes/pull/2742

Alexander asked me to give this bug to him; hopefully that is okay. :^)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
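(For anyone re-verifying a cluster that carries the fixes referenced above, a rough check of both symptoms is sketched below. The pod IP, node, namespace, and pod names are placeholders, and the ovn-nbctl commands are assumed to run from a pod with access to the OVN northbound database, as in the listings above.)

```shell
# Placeholder names; substitute values from your own cluster.

# Each gateway router hosting one of the assigned egress IPs should have exactly one SNAT
# entry for the pod IP, pointing at an address of the EgressIP object that selects the pod.
ovn-nbctl lr-nat-list GR_<node-1> | grep <pod-ip>
ovn-nbctl lr-nat-list GR_<node-2> | grep <pod-ip>
ovn-nbctl lr-nat-list GR_<node-3> | grep <pod-ip>

# There should be a single priority-100 reroute policy for the pod IP on ovn_cluster_router,
# with one nexthop per egress node (no duplicate or truncated entries).
ovn-nbctl lr-policy-list ovn_cluster_router | grep <pod-ip>

# Assuming the external service at 10.0.2.196:9095 echoes the caller's source IP (as asked
# about above), repeated requests from the pod should only ever show the expected egress IPs.
oc -n <namespace> exec <pod> -- curl -s http://10.0.2.196:9095/
```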