Bug 1914250
Summary: ovnkube-node fails on master nodes when both DHCPv6 and SLAAC addresses are configured on nodes

Product: OpenShift Container Platform
Component: Networking
Sub component: ovn-kubernetes
Version: 4.7
Target Release: 4.7.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Victor Voronkov <vvoronko>
Assignee: Antonio Ojea <aojeagar>
QA Contact: Victor Voronkov <vvoronko>
CC: aconstan, anbhat, kquinn
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Doc Text:
  Cause: The code in ovn-kube that detects the default gateway did not take multipath environments into consideration.
  Consequence: OVN-Kubernetes nodes failed to start because they could not find the default gateway.
  Fix: The logic has been modified to use the first available gateway when multipath is present.
  Result: OVN-Kubernetes works in environments with multipath and multiple default gateways.
Bug Blocks: 1910165
Last Closed: 2021-02-24 15:51:25 UTC
Description
Victor Voronkov 2021-01-08 12:59:29 UTC
Created attachment 1745589 [details]
installer gather logs
I can't reproduce the behavior locally. I have 2 default routes like in the description:

> default proto ra metric 20100 pref medium
>     nexthop via fe80::c24a:ff:fe2c:ec60 dev enp2s0 weight 1
>     nexthop via fe80::4969:2cb2:f186:5c13 dev enp2s0 weight 1

but the function returns the default gw correctly:

> {Ifindex: 2 Dst: <nil> Src: <nil> Gw: fe80::4969:2cb2:f186:5c13 Flags: [] Table: 254}
> enp2s0 [fe80::4969:2cb2:f186:5c13] <nil>

I also can't see the error mentioned in the attached logs:

> grep -r "failed to get default" log-bundle-20210108121911

Is it possible to access the environment?

I will prepare and give you such ASAP.

OK, one step at a time. I don't know if this is the root cause, but I've found a bug in the scripts used by NetworkManager that didn't work for interfaces with multiple IP addresses. We can see that br-ex has two global IP addresses:

[root@master-0-0 ~]# ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 fd00:1101::5715:49f9:21c1:a594/128 scope global dynamic noprefixroute
       valid_lft 2853sec preferred_lft 2853sec
    inet6 fe80::5054:ff:fea8:e75/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UNKNOWN qlen 1000
    inet6 fd2e:6f44:5dd8::127/128 scope global dynamic noprefixroute
       valid_lft 3223sec preferred_lft 3223sec
    inet6 fd2e:6f44:5dd8:0:ef82:12af:7151:bae6/64 scope global dynamic noprefixroute
       valid_lft 86384sec preferred_lft 14384sec
    inet6 fe80::b89c:c288:ac4e:1265/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

The script doesn't discriminate by IP, so it obtains 2 leases, one per IP, and fails the check:

Jan 14 08:46:53.777560 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + '[' -z fd2e:6f44:5dd8::14a ']'
Jan 14 08:46:53.778626 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: ++ ip -j -6 a show br-ex
Jan 14 08:46:53.778796 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: ++ jq -r '.[].addr_info[] | select(.scope=="global") | select(.deprecated!=true) | .preferred_life_time'
Jan 14 08:46:53.780413 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com hyperkube[2369]: E0114 08:46:53.780369 2369 kubelet.go:2250] node "master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com" not found
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + LEASE_TIME='3548
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: 14348'
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + '[' 3548 14348 -lt 4294967295 ']'
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: /etc/NetworkManager/dispatcher.d/30-static-dhcpv6: line 12: [: too many arguments
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + '[' ovs-if-br-ex == 'Wired Connection' ']'
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + IPS=($IP6_ADDRESS_0)
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + CHECK_STR='^fd2e:6f44:5dd8::14a/'
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + [[ fd2e:6f44:5dd8::14a/128 fe80::7072:275:6b56:66d =~ ^fd2e:6f44:5dd8::14a/ ]]
Jan 14 08:46:53.826074 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + IPS=($IP6_ADDRESS_1)
Jan 14 08:46:53.826074 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + CIDR=fd2e:6f44:5dd8::14a/128
Jan 14 08:46:53.826074 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + nmcli con mod ovs-if-br-ex ipv6.addresses fd2e:6f44:5dd8::14a/128

The query should discriminate by the IP address received as a parameter:

ip -j -6 a show br-ex | jq -r --arg IPADDRESS "$IPADDRESS" '.[].addr_info[] | select(.local==$IPADDRESS) | select(.scope=="global") | select(.deprecated!=true) | .preferred_life_time'

The NetworkManager script wasn't discriminating by IP address; hence, interfaces with multiple addresses broke the scripts. This was partially fixed by https://github.com/openshift/machine-config-operator/pull/2312. However, the parsing of the interface output still returns multiple fields, and those should be discriminated by the IP address received as a parameter: https://github.com/openshift/machine-config-operator/pull/2341

OK, I totally misread the script and it has already been fixed. We need to test again with a new version of the Machine Config Operator that includes https://github.com/openshift/machine-config-operator/pull/2312.

I've created an executable with the ovn code to detect the gateway:

package main

import (
	"fmt"
	"net"
	"syscall"

	"github.com/vishvananda/netlink"
	utilnet "k8s.io/utils/net"
)

// getDefaultGatewayInterfaceDetails returns the interface name on
// which the default gateway (for route to 0.0.0.0) is configured.
// It also returns the default gateways themselves.
func getDefaultGatewayInterfaceDetails() (string, []net.IP, error) {
	var intfName string
	var gatewayIPs []net.IP

	needIPv4 := false
	needIPv6 := true
	routes, err := netlink.RouteList(nil, syscall.AF_UNSPEC)
	if err != nil {
		return "", nil, fmt.Errorf("failed to get routing table in node")
	}
	for _, route := range routes {
		if route.Dst == nil && route.Gw != nil && route.LinkIndex > 0 {
			fmt.Println(route)
			intfLink, err := netlink.LinkByIndex(route.LinkIndex)
			if err != nil {
				continue
			}
			if utilnet.IsIPv6(route.Gw) {
				if !needIPv6 {
					continue
				}
				needIPv6 = false
			} else {
				if !needIPv4 {
					continue
				}
				needIPv4 = false
			}
			if intfName == "" {
				intfName = intfLink.Attrs().Name
			} else if intfName != intfLink.Attrs().Name {
				return "", nil, fmt.Errorf("multiple gateway interfaces detected: %s %s", intfName, intfLink.Attrs().Name)
			}
			gatewayIPs = append(gatewayIPs, route.Gw)
		}
	}
	if len(gatewayIPs) == 0 {
		return "", nil, fmt.Errorf("failed to get default gateway interface")
	}
	return intfName, gatewayIPs, nil
}

func main() {
	fmt.Println(getDefaultGatewayInterfaceDetails())
}

The executable works with multiple routes in my local environment, as you can see in comment#2. However, it fails (as ovnkube-node fails) in the environment:

[core@master-0-2 ~]$ sudo ./test
 [] failed to get default gateway interface

Created attachment 1747499 [details]
strace default gateway detection code in ovn
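For illustration, the per-address selection that the corrected jq query performs (select(.local==$IPADDRESS)) can also be sketched in Go. The JSON sample and the leaseTime helper below are hypothetical, modelled loosely on the br-ex output quoted earlier; they are not part of the actual dispatcher script.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addrInfo models only the fields of `ip -j -6 a show` output that the
// dispatcher check cares about.
type addrInfo struct {
	Local             string `json:"local"`
	Scope             string `json:"scope"`
	Deprecated        bool   `json:"deprecated"`
	PreferredLifeTime uint64 `json:"preferred_life_time"`
}

type iface struct {
	AddrInfo []addrInfo `json:"addr_info"`
}

// leaseTime mirrors the fixed jq filter: return the preferred lifetime of
// the single global, non-deprecated entry whose local address matches,
// instead of collecting one lease per configured address.
func leaseTime(raw []byte, ipAddress string) (uint64, bool) {
	var ifaces []iface
	if err := json.Unmarshal(raw, &ifaces); err != nil {
		return 0, false
	}
	for _, inf := range ifaces {
		for _, a := range inf.AddrInfo {
			if a.Local == ipAddress && a.Scope == "global" && !a.Deprecated {
				return a.PreferredLifeTime, true
			}
		}
	}
	return 0, false
}

func main() {
	// Hypothetical sample with two global addresses, as on br-ex above.
	sample := []byte(`[{"addr_info":[
		{"local":"fd2e:6f44:5dd8::127","scope":"global","preferred_life_time":3548},
		{"local":"fd2e:6f44:5dd8:0:ef82:12af:7151:bae6","scope":"global","preferred_life_time":14348}
	]}]`)
	lt, ok := leaseTime(sample, "fd2e:6f44:5dd8::127")
	fmt.Println(lt, ok) // 3548 true
}
```

With the address filter in place, the script gets exactly one LEASE_TIME value, so the `[ "$LEASE_TIME" -lt 4294967295 ]` comparison no longer fails with "too many arguments".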
It seems that the problem is that for multipath routes netlink returns the next hops as an array in the MultiPath field and leaves Gw nil, while we were only checking that Gw != nil.
> Debug route {Dst: <nil> Src: <nil> Gw: [{Ifindex: 5 Weight: 1 Gw: fe80::1b8b:3a78:f38c:bfdb Flags: []} {Ifindex: 5 Weight: 1 Gw: fe80::5054:ff:fe48:86d9 Flags: []}] Flags: [] Table: 254}
// Route represents a netlink route.
type Route struct {
	LinkIndex  int
	ILinkIndex int
	Scope      Scope
	Dst        *net.IPNet
	Src        net.IP
	Gw         net.IP
	MultiPath  []*NexthopInfo
	// ... (remaining fields omitted)
}
I have updated the PR and now the code detects the right gateway in the failing environment.
[core@master-0-2 ~]$ ./test
Found default gateway br-ex fe80::1b8b:3a78:f38c:bfdb
br-ex [fe80::1b8b:3a78:f38c:bfdb] <nil>
We verified in the environment that ovnkube-node now detects the gateway correctly:

[kni@provisionhost-0-0 ~]$ oc get pods -A | grep ovn
openshift-ovn-kubernetes   ovnkube-master-5lxdf   6/6   Running   21   19h
openshift-ovn-kubernetes   ovnkube-master-pwk5n   6/6   Running   25   19h
openshift-ovn-kubernetes   ovnkube-master-tx9fl   6/6   Running   18   19h
openshift-ovn-kubernetes   ovnkube-node-pg7fn     3/3   Running   0    11m
openshift-ovn-kubernetes   ovnkube-node-wt7ns     3/3   Running   4    13m
openshift-ovn-kubernetes   ovnkube-node-x67jc     3/3   Running   0    13m
openshift-ovn-kubernetes   ovs-node-67l6l         1/1   Running   0    19h
openshift-ovn-kubernetes   ovs-node-gx6zx         1/1   Running   0    19h
openshift-ovn-kubernetes   ovs-node-lz22m         1/1   Running   0    19h

Verified on 4.7.0-0.nightly-2021-01-22-063949. Bootstrap stage passed OK; OVN pods are up and running:

[kni@provisionhost-0-0 ~]$ oc get pods -A | grep ovn
openshift-ovn-kubernetes   ovnkube-master-fmzhw   6/6   Running   1    33m
openshift-ovn-kubernetes   ovnkube-master-pq55l   6/6   Running   3    33m
openshift-ovn-kubernetes   ovnkube-master-xsnc5   6/6   Running   3    33m
openshift-ovn-kubernetes   ovnkube-node-89vlv     3/3   Running   0    33m
openshift-ovn-kubernetes   ovnkube-node-bm8pm     3/3   Running   0    33m
openshift-ovn-kubernetes   ovnkube-node-k657b     3/3   Running   0    14m
openshift-ovn-kubernetes   ovnkube-node-zgtsj     3/3   Running   0    33m
openshift-ovn-kubernetes   ovnkube-node-zsqqt     3/3   Running   0    14m
openshift-ovn-kubernetes   ovs-node-dqhrl         1/1   Running   0    14m
openshift-ovn-kubernetes   ovs-node-lmzfs         1/1   Running   0    33m
openshift-ovn-kubernetes   ovs-node-sqt2t         1/1   Running   0    14m
openshift-ovn-kubernetes   ovs-node-vb64d         1/1   Running   0    33m
openshift-ovn-kubernetes   ovs-node-zwvcx         1/1   Running   0    33m

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633