Bug 1970779 - [OVN] Unable to assign master-0 for EgressIP even if the egress-assignable label is set
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: Alexander Constantinescu
QA Contact: huirwang
URL:
Whiteboard:
Depends On: 1970833
Blocks:
Reported: 2021-06-11 08:00 UTC by huirwang
Modified: 2021-06-29 04:20 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1970833 (view as bug list)
Environment:
Last Closed: 2021-06-29 04:20:23 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 571 0 None open [release-4.7] Bug 1970779: Remove getDefaultIfAddr and use getNetworkInterfaceIPAddresses 2021-06-11 10:07:35 UTC
Red Hat Product Errata RHBA-2021:2502 0 None None None 2021-06-29 04:20:39 UTC

Description huirwang 2021-06-11 08:00:28 UTC
Description of problem:
Unable to assign master-0 for EgressIP even if the egress-assignable label is set 

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-06-10-213131

How reproducible:
Not sure. I found this issue on two freshly installed OVN clusters, and one of them somehow started working later. The issue only happened on the master-0 node.

Steps to Reproduce:
$ oc get nodes -o wide
NAME                               STATUS   ROLES    AGE    VERSION           INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
yinzhou-regre-kk9x9-master-0       Ready    master   150m   v1.20.0+2817867   172.31.249.51    172.31.249.51    Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-master-1       Ready    master   150m   v1.20.0+2817867   172.31.249.136   172.31.249.136   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-master-2       Ready    master   150m   v1.20.0+2817867   172.31.249.216   172.31.249.216   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-rhel-0         Ready    worker   78m    v1.20.0+2817867   172.31.249.36    172.31.249.36    Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el7
yinzhou-regre-kk9x9-worker-fxnhq   Ready    worker   139m   v1.20.0+2817867   172.31.249.253   172.31.249.253   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8
yinzhou-regre-kk9x9-worker-lb2xl   Ready    worker   139m   v1.20.0+2817867   172.31.249.142   172.31.249.142   Red Hat Enterprise Linux CoreOS 47.83.202106032343-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-2.rhaos4.7.gitb53fa9d.el8

1. Label node master-0 
$ oc label node yinzhou-regre-kk9x9-master-0 "k8s.ovn.org/egress-assignable"=""
node/yinzhou-regre-kk9x9-master-0 labeled
$ oc get nodes yinzhou-regre-kk9x9-master-0 --show-labels
NAME                           STATUS   ROLES    AGE    VERSION           LABELS
yinzhou-regre-kk9x9-master-0   Ready    master   136m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,k8s.ovn.org/egress-assignable=,kubernetes.io/arch=amd64,kubernetes.io/hostname=yinzhou-regre-kk9x9-master-0,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.openshift.io/os_id=rhcos

2. Create egressip object
$ oc create -f egressip.yaml

$ cat egressip.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs:
  - 172.31.249.227
  namespaceSelector:
    matchLabels:
      name: test

Actual results:
No assigned egress node

$ oc get egressip
NAME         EGRESSIPS        ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   172.31.249.227

$ oc get event | egrep egressip
33m         Warning   NoMatchingNodeFound                          egressip/egressip-1                      No matching nodes found, which can host any of the egress IPs: [172.31.249.227] for object EgressIP: egressip-1
71m         Warning   NoMatchingNodeFound                          egressip/egressip-1                      no assignable nodes for EgressIP: egressip-1, please tag at least one node with label: k8s.ovn.org/egress-assignable
28m         Warning   NoMatchingNodeFound                          egressip/egressip-1                      No matching nodes found, which can host any of the egress IPs: [172.31.249.227] for object EgressIP: egressip-1

Expected results:
yinzhou-regre-kk9x9-master-0 should be assigned as egress node.

Additional info:

All other nodes worked; only master-0 failed. I checked that the egress IP is in the same subnet as the primary interface, and that port 9 is open as expected.

$ oc debug node/yinzhou-regre-kk9x9-master-0
Starting pod/yinzhou-regre-kk9x9-master-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.31.249.51
If you don't see a command prompt, try pressing enter.
sh-4.4# 
sh-4.4# 
sh-4.4# ip a show br-ex
4: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 00:50:56:ac:35:62 brd ff:ff:ff:ff:ff:ff
    inet 172.31.249.51/23 brd 172.31.249.255 scope global dynamic noprefixroute br-ex
       valid_lft 5900sec preferred_lft 5900sec
    inet6 fe80::c1b:568f:b47c:b043/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
sh-4.4# curl 172.31.249.51:9
curl: (7) Failed to connect to 172.31.249.51 port 9: Connection refused
sh-4.4# curl 172.31.249.51:9 -v
* Rebuilt URL to: 172.31.249.51:9/
*   Trying 172.31.249.51...
* TCP_NODELAY set
* connect to 172.31.249.51 port 9 failed: Connection refused
* Failed to connect to 172.31.249.51 port 9: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 172.31.249.51 port 9: Connection refused

Comment 3 Alexander Constantinescu 2021-06-11 09:51:33 UTC
TL;DR: we can remove the regression tag. This is not a regression, but an existing problem on 4.7 that I think will be fixed by the commit: https://github.com/openshift/ovn-kubernetes/commit/1ba1ce885f089a023b4c803d85a2ffc7206eb98b (which is on 4.8 already)

I've had a look at the cluster and I saw the following:

We can't assign the egress IP to master-0 because the annotation that records the node's primary IP address has the following value: k8s.ovn.org/node-primary-ifaddr: '{"ipv4":"172.31.248.112/32"}', which is incorrect. We can see that by comparing it with the default L3 config annotation that OVN has set:

k8s.ovn.org/l3-gateway-config: '{"default":{"mode":"shared","interface-id":"br-ex_yinzhou-regre-kk9x9-master-0","mac-address":"00:50:56:ac:35:62","ip-addresses":["172.31.249.51/23"],"ip-address":"172.31.249.51/23","next-hops":["172.31.248.1"],"next-hop":"172.31.248.1","node-port-enable":"true","vlan-id":"0"}}'
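
To illustrate why the wrong annotation value blocks the assignment, here is a minimal Go sketch. It assumes a node is only eligible when the egress IP falls inside the subnet of the node's recorded primary address, which is the same condition the reporter checked by hand against br-ex. The helper name canHost is hypothetical and is not taken from the ovn-kubernetes code.

```go
package main

import (
	"fmt"
	"net"
)

// canHost reports whether egressIP lies inside the subnet described by
// nodeCIDR, an address in CIDR notation as stored in the node annotations.
// Hypothetical helper for illustration only.
func canHost(nodeCIDR, egressIP string) (bool, error) {
	_, subnet, err := net.ParseCIDR(nodeCIDR)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(egressIP)
	if ip == nil {
		return false, fmt.Errorf("invalid IP %q", egressIP)
	}
	return subnet.Contains(ip), nil
}

func main() {
	const egressIP = "172.31.249.227"
	// First the value found in node-primary-ifaddr, then the br-ex address.
	for _, cidr := range []string{"172.31.248.112/32", "172.31.249.51/23"} {
		ok, err := canHost(cidr, egressIP)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s can host %s: %t\n", cidr, egressIP, ok)
	}
}
```

With the /32 ingress VIP recorded as the primary address, no egress IP other than the VIP itself can match, whereas the real br-ex address 172.31.249.51/23 covers 172.31.249.227.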

The IP: 172.31.248.112 is actually the cluster ingress VIP, which was associated with master-0 at one point:

$ oc get cm -n kube-system cluster-config-v1 -o yaml
...
    platform:
      vsphere:
        apiVIP: 172.31.248.111
        cluster: Cluster-1
        datacenter: SDDC-Datacenter
        defaultDatastore: WorkloadDatastore
        ingressVIP: 172.31.248.112
        network: qe-segment
        password: ""
        username: ""
        vCenter: vcenter.sddc-44-236-21-251.vmwarevmc.com
    publish: External
    pullSecret: ""
...

If we look at the ovnkube-node logs on master-0 we can see it found the ingressVIP when it started, and the egress IP code which parses the primary IP address picks up the ingressVIP address instead of the correct one:

$ oc logs -c ovnkube-node ovnkube-node-pj2z8 -n openshift-ovn-kubernetes | less
...
I0611 05:24:57.154095    6122 gateway_localnet.go:183] Node local addresses initialized to: map[10.130.0.2:{10.130.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 172.31.248.112:{172.31.248.112 ffffffff} 172.31.249.51:{172.31.248.0 fffffe00} ::1:{::1 ffffffffffffffffffffffffffffffff} fe80::90ea:baff:fefd:6dc9:{fe80:: ffffffffffffffff0000000000000000} fe80::c1b:568f:b47c:b043:{fe80:: ffffffffffffffff0000000000000000}]
...
I0611 05:24:57.990189    6122 kube.go:89] Setting annotations map[k8s.ovn.org/l3-gateway-config:{"default":{"mode":"shared","interface-id":"br-ex_yinzhou-regre-kk9x9-master-0","mac-address":"00:50:56:ac:35:62","ip-addresses":["172.31.249.51/23"],"ip-address":"172.31.249.51/23","next-hops":["172.31.248.1"],"next-hop":"172.31.248.1","node-port-enable":"true","vlan-id":"0"}} k8s.ovn.org/node-chassis-id:81f65b0b-619c-4ab5-a69c-6456b218144f k8s.ovn.org/node-mgmt-port-mac-address:92:ea:ba:fd:6d:c9 k8s.ovn.org/node-primary-ifaddr:{"ipv4":"172.31.248.112/32"}] on node yinzhou-regre-kk9x9-master-0
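
The mismatch between the two annotations in that log line can also be checked mechanically. The Go sketch below is only a diagnostic illustration, not part of the fix: it decodes the two annotation values quoted in this bug and reports whether the primary address appears among the gateway's ip-addresses. The type and function names are hypothetical, and only the JSON fields visible in the log are modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// primaryIfAddr mirrors the JSON seen in k8s.ovn.org/node-primary-ifaddr.
type primaryIfAddr struct {
	IPv4 string `json:"ipv4"`
}

// l3GatewayConfig keeps only the field of k8s.ovn.org/l3-gateway-config
// that this check needs; other fields are ignored when decoding.
type l3GatewayConfig struct {
	Default struct {
		IPAddresses []string `json:"ip-addresses"`
	} `json:"default"`
}

// primaryMatchesGateway reports whether the IPv4 address in the
// primary-ifaddr annotation equals one of the gateway's ip-addresses.
// Hypothetical helper for illustration only.
func primaryMatchesGateway(primaryJSON, gatewayJSON string) (bool, error) {
	var p primaryIfAddr
	if err := json.Unmarshal([]byte(primaryJSON), &p); err != nil {
		return false, err
	}
	var g l3GatewayConfig
	if err := json.Unmarshal([]byte(gatewayJSON), &g); err != nil {
		return false, err
	}
	primaryIP, _, err := net.ParseCIDR(p.IPv4)
	if err != nil {
		return false, err
	}
	for _, addr := range g.Default.IPAddresses {
		gatewayIP, _, err := net.ParseCIDR(addr)
		if err != nil {
			return false, err
		}
		if gatewayIP.Equal(primaryIP) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Annotation values copied from the ovnkube-node log on master-0.
	primary := `{"ipv4":"172.31.248.112/32"}`
	gateway := `{"default":{"mode":"shared","ip-addresses":["172.31.249.51/23"],"ip-address":"172.31.249.51/23","next-hops":["172.31.248.1"]}}`

	ok, err := primaryMatchesGateway(primary, gateway)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("primary-ifaddr matches gateway address:", ok)
}
```

For the values on master-0 the check reports false, which is the inconsistency described above.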

The problem is that the ovnkube-node code on 4.7 that determines the node IP used for egress IP assignments gets the default IP address with the following function: https://github.com/openshift/ovn-kubernetes/blob/release-4.7/go-controller/pkg/node/helper_linux.go#L98-L118 whereas the function parsing the L3 annotation uses https://github.com/openshift/ovn-kubernetes/blob/master/go-controller/pkg/util/net_linux.go#L354-L380

Specifically the egress IP related code is missing: https://github.com/openshift/ovn-kubernetes/blob/master/go-controller/pkg/util/net_linux.go#L370-L376

That will be fixed by back-porting the commit:  https://github.com/openshift/ovn-kubernetes/commit/1ba1ce885f089a023b4c803d85a2ffc7206eb98b

/Alex

Comment 9 errata-xmlrpc 2021-06-29 04:20:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502

