Description of problem:
Unable to assign master-0 as an EgressIP node even though the egress-assignable label is set.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-06-10-213131

How reproducible:
Not sure; hit this on two fresh OVN cluster installations, and one of them somehow started working later. The issue only ever affected the master-0 node.

Steps to Reproduce:

$ oc get nodes -o wide
NAME                               STATUS   ROLES    AGE    VERSION           INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
yinzhou-regre-kk9x9-master-0       Ready    master   150m   v1.20.0+2817867   172.31.249.51    172.31.249.51    Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-master-1       Ready    master   150m   v1.20.0+2817867   172.31.249.136   172.31.249.136   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-master-2       Ready    master   150m   v1.20.0+2817867   172.31.249.216   172.31.249.216   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-rhel-0         Ready    worker   78m    v1.20.0+2817867   172.31.249.36    172.31.249.36    Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el7
yinzhou-regre-kk9x9-worker-fxnhq   Ready    worker   139m   v1.20.0+2817867   172.31.249.253   172.31.249.253   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-worker-lb2xl   Ready    worker   139m   v1.20.0+2817867   172.31.249.142   172.31.249.142   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8

1. Label node master-0:

$ oc label node yinzhou-regre-kk9x9-master-0 "k8s.ovn.org/egress-assignable"=""
node/yinzhou-regre-kk9x9-master-0 labeled

$ oc get nodes yinzhou-regre-kk9x9-master-0 --show-labels
NAME                           STATUS   ROLES    AGE    VERSION           LABELS
yinzhou-regre-kk9x9-master-0   Ready    master   136m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,k8s.ovn.org/egress-assignable=,kubernetes.io/arch=amd64,kubernetes.io/hostname=yinzhou-regre-kk9x9-master-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos

2. Create the EgressIP object:

$ oc create -f egressip.yaml

$ cat egressip.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs:
  - 172.31.249.227
  namespaceSelector:
    matchLabels:
      name: test

Actual results:

No egress node is assigned.

$ oc get egressip
NAME         EGRESSIPS        ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   172.31.249.227

$ oc get event | egrep egressip
33m   Warning   NoMatchingNodeFound   egressip/egressip-1   No matching nodes found, which can host any of the egress IPs: [172.31.249.227] for object EgressIP: egressip-1
71m   Warning   NoMatchingNodeFound   egressip/egressip-1   no assignable nodes for EgressIP: egressip-1, please tag at least one node with label: k8s.ovn.org/egress-assignable
28m   Warning   NoMatchingNodeFound   egressip/egressip-1   No matching nodes found, which can host any of the egress IPs: [172.31.249.227] for object EgressIP: egressip-1

Expected results:

yinzhou-regre-kk9x9-master-0 should be assigned as the egress node.

Additional info:

All other nodes worked except master-0. Checked that the egress IP is in the same subnet as the primary interface, and port 9 behaves as expected.

$ oc debug node/yinzhou-regre-kk9x9-master-0
Starting pod/yinzhou-regre-kk9x9-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.51
If you don't see a command prompt, try pressing enter.
sh-4.4# ip a show br-ex
4: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 00:50:56:ac:35:62 brd ff:ff:ff:ff:ff:ff
    inet 172.31.249.51/23 brd 172.31.249.255 scope global dynamic noprefixroute br-ex
       valid_lft 5900sec preferred_lft 5900sec
    inet6 fe80::c1b:568f:b47c:b043/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

sh-4.4# curl 172.31.249.51:9
curl: (7) Failed to connect to 172.31.249.51 port 9: Connection refused

sh-4.4# curl 172.31.249.51:9 -v
* Rebuilt URL to: 172.31.249.51:9/
*   Trying 172.31.249.51...
* TCP_NODELAY set
* connect to 172.31.249.51 port 9 failed: Connection refused
* Failed to connect to 172.31.249.51 port 9: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 172.31.249.51 port 9: Connection refused
TL;DR: we can remove the regression tag. This is not a regression but an existing problem on 4.7, which I think will be fixed by this commit (already on 4.8):
https://github.com/openshift/ovn-kubernetes/commit/1ba1ce885f089a023b4c803d85a2ffc7206eb98b

I've had a look at the cluster and saw the following. The reason we can't assign the egress IP to master-0 is that the annotation from which the primary IP address is parsed has an incorrect value:

k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"172.31.248.112/32"}'

We can see that it is incorrect from the default L3 gateway config annotation that OVN has set:

k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_yinzhou-regre-kk9x9-master-0","mac-address":"00:50:56:ac:35:62","ip-addresses":["172.31.249.51/23"],"ip-address":"172.31.249.51/23","next-hops":["172.31.248.1"],"next-hop":"172.31.248.1","node-port-enable":"true","vlan-id":"0"}}'

The IP 172.31.248.112 is actually the cluster ingress VIP, which was associated with master-0 at one point:

$ oc get cm -n kube-system cluster-config-v1 -o yaml
...
    platform:
      vsphere:
        apiVIP: 172.31.248.111
        cluster: Cluster-1
        datacenter: SDDC-Datacenter
        defaultDatastore: WorkloadDatastore
        ingressVIP: 172.31.248.112
        network: qe-segment
        password: ""
        username: ""
        vCenter: vcenter.sddc-44-236-21-251.vmwarevmc.com
    publish: External
    pullSecret: ""
...

If we look at the ovnkube-node logs on master-0, we can see that it found the ingress VIP when it started, and that the egress IP code which parses the primary IP address picks up the ingress VIP instead of the correct address:

$ oc logs -c ovnkube-node ovnkube-node-pj2z8 -n openshift-ovn-kubernetes | less
...
I0611 05:24:57.154095    6122 gateway_localnet.go:183] Node local addresses initialized to: map[10.130.0.2:{10.130.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 172.31.248.112:{172.31.248.112 ffffffff} 172.31.249.51:{172.31.248.0 fffffe00} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::90ea:baff:fefd:6dc9:{fe80:: ffffffffffffffff0000000000000000} fe80::c1b:568f:b47c:b043:{fe80:: ffffffffffffffff0000000000000000}]
...
I0611 05:24:57.990189    6122 kube.go:89] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_yinzhou-regre-kk9x9-master-0","mac-address":"00:50:56:ac:35:62","ip-addresses":["172.31.249.51/23"],"ip-address":"172.31.249.51/23","next-hops":["172.31.248.1"],"next-hop":"172.31.248.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:81f65b0b-619c-4ab5-a69c-6456b218144f k8s.ovn.org/node-mgmt-port-mac-address:92:ea:ba:fd:6d:c9 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"172.31.248.112/32"}] on node yinzhou-regre-kk9x9-master-0

The problem is that the ovnkube-node code on 4.7 which parses the node IP used for egress IP assignments uses the following function to get the default IP address:
https://github.com/openshift/ovn-kubernetes/blob/release-4.7/go-controller/pkg/node/helper_linux.go#L98-L118

whereas the function that parses the L3 annotation uses:
https://github.com/openshift/ovn-kubernetes/blob/master/go-controller/pkg/util/net_linux.go#L354-L380

Specifically, the egress-IP-related code is missing:
https://github.com/openshift/ovn-kubernetes/blob/master/go-controller/pkg/util/net_linux.go#L370-L376

That will be fixed by back-porting the commit:
https://github.com/openshift/ovn-kubernetes/commit/1ba1ce885f089a023b4c803d85a2ffc7206eb98b

/Alex
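The effect of the missing filtering can be sketched in a few lines of Go. This is a hypothetical illustration, not the actual ovn-kubernetes code: it assumes the relevant difference is that the ingress VIP sits on the node as a host-scope /32 address (note the ffffffff mask on 172.31.248.112 in the address map above), while the real primary address carries its subnet prefix (172.31.249.51/23, matching the `ip a show br-ex` output). The `pickPrimary` helper is an assumed name.

```go
package main

import (
	"fmt"
	"net"
)

// pickPrimary returns the first address whose prefix covers more than a
// single host, skipping /32 host-scope addresses such as a VIP. This only
// illustrates the idea behind the extra filtering in the linked commit;
// the real check in ovn-kubernetes may differ.
func pickPrimary(addrs []string) (string, error) {
	for _, a := range addrs {
		_, ipnet, err := net.ParseCIDR(a)
		if err != nil {
			return "", err
		}
		if ones, bits := ipnet.Mask.Size(); ones == bits {
			// Host-scope address (e.g. the ingress VIP 172.31.248.112/32): skip.
			continue
		}
		return a, nil
	}
	return "", fmt.Errorf("no suitable address found")
}

func main() {
	// IPv4 addresses on master-0 in the order the VIP was encountered first.
	addrs := []string{"172.31.248.112/32", "172.31.249.51/23"}

	// A naive "take the first address" pick returns the VIP:
	fmt.Println(addrs[0]) // 172.31.248.112/32

	// With the /32 filter, the real primary interface address wins:
	picked, err := pickPrimary(addrs)
	if err != nil {
		panic(err)
	}
	fmt.Println(picked) // 172.31.249.51/23
}
```

With the correct 172.31.249.51/23 in the node-primary-ifaddr annotation, the egress IP 172.31.249.227 falls inside the node's subnet and master-0 becomes assignable.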
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2502