Bug 1958126
| Summary: | [OVN] Egress IP doesn't take effect | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | huirwang |
| Component: | Networking | Assignee: | Alexander Constantinescu <aconstan> |
| Networking sub component: | ovn-kubernetes | QA Contact: | huirwang |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aconstan, anusaxen, philipp.dallig |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:07:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: huirwang, 2021-05-07 09:11:45 UTC
This BZ affects egress IP but is not caused by it. I am seeing the following listed on all nodes' GR:

```
[root@huirwang-0507a-t7n9q-master-1 ~]# ovn-nbctl -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:172.31.249.126:9641,ssl:172.31.249.18:9641,ssl:172.31.249.193:9641 lr-nat-list GR_huirwang-0507a-t7n9q-worker-46cpf
TYPE             EXTERNAL_IP        EXTERNAL_PORT    LOGICAL_IP          EXTERNAL_MAC         LOGICAL_PORT
snat             172.31.249.182                      10.128.2.92
snat             172.31.249.182                      10.128.2.106
snat             172.31.249.182                      10.128.2.105
snat             172.31.249.43                       10.128.2.5
snat             172.31.249.43                       10.128.2.6
snat             172.31.249.43                       10.128.2.92
snat             172.31.249.43                       10.128.2.4
snat             172.31.249.43                       10.128.2.49
snat             172.31.249.43                       10.128.2.105
snat             172.31.249.43                       10.128.2.3
snat             172.31.249.43                       10.128.2.28
snat             172.31.249.43                       10.128.2.26
snat             172.31.249.43                       10.128.2.77
snat             172.31.249.43                       10.128.2.106
```

That is incorrect: the only SNAT entries that should exist on the GR are the egress IP ones. In this case something is assigning a dedicated SNAT for every pod running on the node. This in turn "scrambles" the egress IP configuration and causes OVN to use not the dedicated egress IP SNAT but the incorrect one, which is why we're not seeing the egress IP on the server's side.
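To make the faulty state concrete, here is a small sketch (in Go, the language ovn-kubernetes is written in) of the check being applied by eye above: given `lr-nat-list` output and the set of configured egress IPs, every `snat` row whose external IP is not an egress IP is unexpected. The function name and parsing are illustrative, not part of ovn-kubernetes.

```go
package main

import (
	"fmt"
	"strings"
)

// findUnexpectedSNATs scans `ovn-nbctl lr-nat-list` output and returns the
// logical IPs of snat entries whose external IP is not a known egress IP.
// On a healthy gateway router this slice should be empty. Assumes the rows
// have empty EXTERNAL_PORT/EXTERNAL_MAC/LOGICAL_PORT columns, as in this bug,
// so whitespace-splitting yields TYPE, EXTERNAL_IP, LOGICAL_IP.
func findUnexpectedSNATs(natList string, egressIPs map[string]bool) []string {
	var unexpected []string
	for _, line := range strings.Split(natList, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 || fields[0] != "snat" {
			continue // header line or non-snat entry
		}
		if !egressIPs[fields[1]] {
			unexpected = append(unexpected, fields[2])
		}
	}
	return unexpected
}

func main() {
	// Trimmed sample of the output from this report; 172.31.249.182 is the
	// configured egress IP, 172.31.249.43 is the node IP SNAT added per pod.
	sample := `TYPE EXTERNAL_IP EXTERNAL_PORT LOGICAL_IP EXTERNAL_MAC LOGICAL_PORT
snat 172.31.249.182  10.128.2.92
snat 172.31.249.43  10.128.2.5
snat 172.31.249.43  10.128.2.6`
	egress := map[string]bool{"172.31.249.182": true}
	fmt.Println(findUnexpectedSNATs(sample, egress))
}
```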
I've looked at the logs to see what command creates these SNAT objects, and I found the following:

```
W0507 09:10:16.872586       1 pods.go:334] Failed to get options for port: 0i9xy_test-rc-vc6hh
I0507 09:10:16.872665       1 kube.go:61] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.105/23"],"mac_address":"0a:58:0a:80:02:69","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.105/23","gateway_ip":"10.128.2.1"}}] on pod 0i9xy/test-rc-vc6hh
W0507 09:10:16.908185       1 pods.go:334] Failed to get options for port: 0i9xy_test-rc-7tcn6
I0507 09:10:16.908341       1 kube.go:61] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.106/23"],"mac_address":"0a:58:0a:80:02:6a","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.106/23","gateway_ip":"10.128.2.1"}}] on pod 0i9xy/test-rc-7tcn6
2021-05-07T09:10:16.940Z|04306|nbctl|INFO|Running command run -- add address_set 7a3c0f32-1bf3-4dd2-b1d9-bd157e72410f addresses "\"10.128.2.105\""
2021-05-07T09:10:16.953Z|04307|nbctl|INFO|Running command run --if-exists -- lr-nat-del GR_huirwang-0507a-t7n9q-worker-46cpf snat 10.128.2.105/32
2021-05-07T09:10:16.957Z|04308|nbctl|INFO|Running command run -- lr-nat-add GR_huirwang-0507a-t7n9q-worker-46cpf snat 172.31.249.43 10.128.2.105/32
2021-05-07T09:10:16.969Z|04309|nbctl|INFO|Running command run -- add address_set 7a3c0f32-1bf3-4dd2-b1d9-bd157e72410f addresses "\"10.128.2.106\""
I0507 09:10:16.977094       1 pods.go:289] [0i9xy/test-rc-vc6hh] addLogicalPort took 104.719222ms
```

It seems that at some point during pod setup in addLogicalPort we start setting up a SNAT for the pod on the GR, and this is happening for every pod on every node. The reason is that commit https://github.com/openshift/cluster-network-operator/commit/14a5e41bb9b8fedaec0037b8551be4888e0ac821 added --disable-snat-multiple-gws to ovnkube-master, which now performs that pod setup in addLogicalPort.
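A hypothetical, heavily simplified sketch of the behavior described above: with the per-pod SNAT path enabled by that flag, pod setup emits a dedicated SNAT from the pod IP to the node IP, which is exactly the entry that shadows the egress IP SNAT. The names (`podSNATOps`, `natOp`) are mine for illustration, not actual ovn-kubernetes symbols.

```go
package main

import "fmt"

// natOp is a stand-in for one `lr-nat-add ... snat` operation on the GR.
type natOp struct{ externalIP, logicalIP string }

// podSNATOps sketches the NAT operations issued during pod setup. With
// disableSNATMultipleGWs set, a dedicated per-pod SNAT to the node IP is
// created (the behavior observed in this bug); without it, pods rely on the
// shared cluster SNAT and nothing per-pod is added.
func podSNATOps(disableSNATMultipleGWs bool, nodeIP, podIP string) []natOp {
	if !disableSNATMultipleGWs {
		return nil // no per-pod SNAT; egress IP SNATs remain authoritative
	}
	return []natOp{{externalIP: nodeIP, logicalIP: podIP + "/32"}}
}

func main() {
	// Mirrors the logged command:
	// lr-nat-add GR_... snat 172.31.249.43 10.128.2.105/32
	fmt.Println(podSNATOps(true, "172.31.249.43", "10.128.2.105"))
}
```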
This is also the reason upstream CI did not pick up the problem (we have E2E tests for egress IP): that option is OpenShift specific. I need to talk to the Platform team about this. But this is clearly a regression that breaks egress IP for OpenShift, so I am setting the blocker+ flag.

Moreover, those pod annotations seem completely off to me: "ip_addresses":["10.128.2.106/23"] is not correct.

Another (cosmetic) problem is that even though the flag is provided to ovnkube-master, the logged "parsed config" does not indicate that it has been set correctly:

```
+ exec /usr/bin/ovnkube --init-master huirwang-0507a-t7n9q-master-1 --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --metrics-bind-address 127.0.0.1:29102 --gateway-mode shared --gateway-interface br-ex --sb-address ssl:172.31.249.126:9642,ssl:172.31.249.18:9642,ssl:172.31.249.193:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --nb-address ssl:172.31.249.126:9641,ssl:172.31.249.18:9641,ssl:172.31.249.193:9641 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nbctl-daemon-mode --nb-cert-common-name ovn --enable-multicast --disable-snat-multiple-gws --acl-logging-rate-limit 20
I0507 05:34:25.278043       1 config.go:1437] Parsed config file /run/ovnkube-config/ovnkube.conf
I0507 05:34:25.278112       1 config.go:1438] Parsed config: {Default:{MTU:1400 ConntrackZone:64000 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 RawClusterSubnets:10.128.0.0/14/23 ClusterSubnets:[]} Logging:{File: CNIFile: Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:5 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableEgressIP:true} Kubernetes:{Kubeconfig: CACert: APIServer:https://api-int.huirwang-0507a.qe.devcluster.openshift.com:6443 Token: CompatServiceCIDR: RawServiceCIDRs:172.30.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes MetricsBindAddress: OVNMetricsBindAddress: MetricsEnablePprof:false OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes: NoHostSubnetNodes:nil HostNetworkNamespace:openshift-host-network} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: northbound:false exec:<nil>} Gateway:{Mode:local Interface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64} MasterHA:{ElectionLeaseDuration:60 ElectionRenewDeadline:30 ElectionRetryPeriod:20} HybridOverlay:{Enabled:false RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full}}
```

Specifically, `DisableSNATMultipleGWs:false` incorrectly indicates that the flag was not provided. The flag was provided, so this should be `DisableSNATMultipleGWs:true`.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
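The cosmetic "parsed config" discrepancy described in the report is a classic ordering issue: the config snapshot is logged after the config file is read but before command-line flags are merged in, so a flag-provided boolean still shows its default. The sketch below is a minimal, hypothetical reproduction of that pattern using Go's standard `flag` package; `parseConfig` and the flag name mirror the report but are not ovn-kubernetes internals.

```go
package main

import (
	"flag"
	"fmt"
)

type config struct{ DisableSNATMultipleGWs bool }

// parseConfig captures the value that would be logged as "parsed config"
// before CLI flags are applied, alongside the effective value afterwards.
// The too-early snapshot is the (hypothesized) source of the misleading log.
func parseConfig(args []string) (logged, effective bool) {
	fs := flag.NewFlagSet("ovnkube", flag.PanicOnError)
	cfg := config{}
	fs.BoolVar(&cfg.DisableSNATMultipleGWs, "disable-snat-multiple-gws", false, "per-pod SNAT behavior")
	logged = cfg.DisableSNATMultipleGWs // snapshot taken before fs.Parse: still the default
	fs.Parse(args)
	return logged, cfg.DisableSNATMultipleGWs
}

func main() {
	logged, effective := parseConfig([]string{"--disable-snat-multiple-gws"})
	// The log would say false even though the effective value is true.
	fmt.Printf("logged DisableSNATMultipleGWs:%v effective:%v\n", logged, effective)
}
```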