Description of problem:
The bootstrap stage of a baremetal IPI deployment fails when both DHCPv6 and SLAAC addresses are configured on the nodes. ovnkube-node fails to start:

oc -n openshift-ovn-kubernetes logs ovnkube-node-5mw86 -c ovnkube-node
...
I0108 12:30:30.277830 82410 gateway_localnet.go:182] Node local addresses initialized to: map[127.0.0.1:{127.0.0.0 ff000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fd00:1101::fa58:4a9f:b174:2d5b:{fd00:1101::fa58:4a9f:b174:2d5b ffffffffffffffffffffffffffffffff} fd01:0:0:2::2:{fd01:0:0:2:: ffffffffffffffff0000000000000000} fd2e:6f44:5dd8:0:c8e1:6c1c:e835:dc96:{fd2e:6f44:5dd8:: ffffffffffffffff0000000000000000} fd2e:6f44:5dd8::103:{fd2e:6f44:5dd8::103 ffffffffffffffffffffffffffffffff} fe80::3c34:58ff:fe44:c135:{fe80:: ffffffffffffffff0000000000000000} fe80::5054:ff:fe69:4493:{fe80:: ffffffffffffffff0000000000000000} fe80::b0af:cdff:fe55:71d0:{fe80:: ffffffffffffffff0000000000000000} fe80::de6b:3249:5520:1b94:{fe80:: ffffffffffffffff0000000000000000}]
F0108 12:30:30.277980 82410 ovnkube.go:130] failed to get default gateway interface

Version-Release number of selected component (if applicable):
4.7.0-fc.1

How reproducible:
Trigger a deployment with an IPv6 control plane network when both DHCPv6 and SLAAC addresses are configured on the nodes.

Steps to Reproduce:
1. Prepare the nodes and complete all the prerequisites
2. Start RA together with DHCPv6 (configured with the same subnet, fd2e:6f44:5dd8::/64)
3. Trigger the deployment process

Actual results:
Deployment fails; the installer output is:

...
DEBUG The connection to the server api-int.ocp-edge-cluster-0.qe.lab.redhat.com:6443 was refused - did you specify the right host or port?
DEBUG Gather remote logs
...
DEBUG Log bundle written to /var/home/core/log-bundle-20210108121911.tar.gz
INFO Bootstrap gather logs captured here "/home/kni/clusterconfigs/log-bundle-20210108121911.tar.gz"
FATAL Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition

Expected results:
Deployment succeeds.

Additional info:

[core@master-0-0 ~]$ ip -6 route
::1 dev lo proto kernel metric 256 pref medium
fd00:1101::5343:db04:d44:490c dev enp4s0 proto kernel metric 100 pref medium
fd00:1101::/64 dev enp4s0 proto ra metric 100 pref medium
fd01:0:0:1::/64 dev ovn-k8s-mp0 proto kernel metric 256 pref medium
fd01::/48 via fd01:0:0:1::1 dev ovn-k8s-mp0 metric 1024 pref medium
fd02::/112 via fd01:0:0:1::1 dev ovn-k8s-mp0 metric 1024 pref medium
fd2e:6f44:5dd8::10c dev br-ex proto kernel metric 100 pref medium
fd2e:6f44:5dd8::/64 dev br-ex proto ra metric 100 pref medium
fe80::/64 dev enp4s0 proto kernel metric 100 pref medium
fe80::/64 dev br-ex proto kernel metric 100 pref medium
fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
fe80::/64 dev ovn-k8s-mp0 proto kernel metric 256 pref medium
default proto ra metric 100
        nexthop via fe80::6ae5:34fe:4ef6:e430 dev br-ex weight 1
        nexthop via fe80::5054:ff:feac:eac9 dev br-ex weight 1 pref medium
Created attachment 1745589 [details] installer gather logs
I can't reproduce the behavior locally. I have two default routes, as in the description:

> default proto ra metric 20100 pref medium
>         nexthop via fe80::c24a:ff:fe2c:ec60 dev enp2s0 weight 1
>         nexthop via fe80::4969:2cb2:f186:5c13 dev enp2s0 weight 1

but the function returns the default gateway correctly:

> {Ifindex: 2 Dst: <nil> Src: <nil> Gw: fe80::4969:2cb2:f186:5c13 Flags: [] Table: 254}
> enp2s0 [fe80::4969:2cb2:f186:5c13] <nil>

I also can't see the error mentioned in the attached logs:

> grep -r "failed to get default" log-bundle-20210108121911

Is it possible to access the environment?
I will prepare one and give you access ASAP.
OK, one step at a time. I don't know if this is the root cause, but I've found a bug in the scripts used by NetworkManager that didn't work for interfaces with multiple IP addresses.

We can see that br-ex has two global IP addresses:

[root@master-0-0 ~]# ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 fd00:1101::5715:49f9:21c1:a594/128 scope global dynamic noprefixroute
       valid_lft 2853sec preferred_lft 2853sec
    inet6 fe80::5054:ff:fea8:e75/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UNKNOWN qlen 1000
    inet6 fd2e:6f44:5dd8::127/128 scope global dynamic noprefixroute
       valid_lft 3223sec preferred_lft 3223sec
    inet6 fd2e:6f44:5dd8:0:ef82:12af:7151:bae6/64 scope global dynamic noprefixroute
       valid_lft 86384sec preferred_lft 14384sec
    inet6 fe80::b89c:c288:ac4e:1265/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

The script doesn't discriminate by IP, so it obtains two leases, one per IP, and the check fails:

Jan 14 08:46:53.777560 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + '[' -z fd2e:6f44:5dd8::14a ']'
Jan 14 08:46:53.778626 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: ++ ip -j -6 a show br-ex
Jan 14 08:46:53.778796 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: ++ jq -r '.[].addr_info[] | select(.scope=="global") | select(.deprecated!=true) | .preferred_life_time'
Jan 14 08:46:53.780413 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com hyperkube[2369]: E0114 08:46:53.780369 2369 kubelet.go:2250] node "master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com" not found
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + LEASE_TIME='3548
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: 14348'
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + '[' 3548 14348 -lt 4294967295 ']'
Jan 14 08:46:53.825754 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: /etc/NetworkManager/dispatcher.d/30-static-dhcpv6: line 12: [: too many arguments
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + '[' ovs-if-br-ex == 'Wired Connection' ']'
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + IPS=($IP6_ADDRESS_0)
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + CHECK_STR='^fd2e:6f44:5dd8::14a/'
Jan 14 08:46:53.825963 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + [[ fd2e:6f44:5dd8::14a/128 fe80::7072:275:6b56:66d =~ ^fd2e:6f44:5dd8::14a/ ]]
Jan 14 08:46:53.826074 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + IPS=($IP6_ADDRESS_1)
Jan 14 08:46:53.826074 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + CIDR=fd2e:6f44:5dd8::14a/128
Jan 14 08:46:53.826074 master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com nm-dispatcher[1580]: + nmcli con mod ovs-if-br-ex ipv6.addresses fd2e:6f44:5dd8::14a/128

The query needs to select the lease by address:

ip -j -6 a show br-ex | jq -r --arg IPADDRESS "$IPADDRESS" '.[].addr_info[] | select(.local==$IPADDRESS) | select(.scope=="global") | select(.deprecated!=true) | .preferred_life_time'

The NetworkManager script wasn't discriminating by IP address; hence, interfaces with multiple addresses broke the scripts. This was partially fixed by https://github.com/openshift/machine-config-operator/pull/2312. However, the parsing of the interface output still returns multiple fields, and those should be discriminated by the IP address received as a parameter:
https://github.com/openshift/machine-config-operator/pull/2341
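For illustration, the same per-address selection that the corrected jq filter performs can be sketched in Go. The `leaseTimeFor` helper and the sample JSON are hypothetical (not part of the MCO script); the JSON shape is a trimmed version of `ip -j -6 a show` output:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the JSON emitted by `ip -j -6 a show <dev>`.
type addrInfo struct {
	Local             string `json:"local"`
	Scope             string `json:"scope"`
	Deprecated        bool   `json:"deprecated"`
	PreferredLifeTime uint64 `json:"preferred_life_time"`
}

type iface struct {
	AddrInfo []addrInfo `json:"addr_info"`
}

// leaseTimeFor returns the preferred lifetime of the single global,
// non-deprecated address matching ipAddress, mirroring the fixed jq
// filter, which discriminates by IP instead of returning every lease.
func leaseTimeFor(ipJSON []byte, ipAddress string) (uint64, bool) {
	var ifaces []iface
	if err := json.Unmarshal(ipJSON, &ifaces); err != nil {
		return 0, false
	}
	for _, ifc := range ifaces {
		for _, a := range ifc.AddrInfo {
			if a.Local == ipAddress && a.Scope == "global" && !a.Deprecated {
				return a.PreferredLifeTime, true
			}
		}
	}
	return 0, false
}

func main() {
	// br-ex with a DHCPv6 /128 and a SLAAC /64, as in the logs above.
	sample := []byte(`[{"addr_info":[
		{"local":"fd2e:6f44:5dd8::127","scope":"global","preferred_life_time":3223},
		{"local":"fd2e:6f44:5dd8:0:ef82:12af:7151:bae6","scope":"global","preferred_life_time":14384},
		{"local":"fe80::b89c:c288:ac4e:1265","scope":"link","preferred_life_time":4294967295}]}]`)

	lt, ok := leaseTimeFor(sample, "fd2e:6f44:5dd8::127")
	fmt.Println(lt, ok) // prints only the matching lease, not both lifetimes
}
```

With two global addresses on br-ex, the unfixed filter yields two lifetimes ("3548 14348" in the log), which then breaks the shell `[ ... -lt ... ]` test with "too many arguments"; selecting by `local` address returns exactly one value.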
OK, I totally misread the script; it had already been fixed. We need to test again with a new version of the Machine Config Operator that includes https://github.com/openshift/machine-config-operator/pull/2312.
I've created an executable with the OVN code to detect the gateway:

import (
	"fmt"
	"net"
	"syscall"

	"github.com/vishvananda/netlink"
	utilnet "k8s.io/utils/net"
)

// getDefaultGatewayInterfaceDetails returns the interface name on
// which the default gateway (for route to 0.0.0.0) is configured.
// It also returns the default gateways themselves.
func getDefaultGatewayInterfaceDetails() (string, []net.IP, error) {
	var intfName string
	var gatewayIPs []net.IP

	needIPv4 := false
	needIPv6 := true
	routes, err := netlink.RouteList(nil, syscall.AF_UNSPEC)
	if err != nil {
		return "", nil, fmt.Errorf("failed to get routing table in node")
	}
	for _, route := range routes {
		if route.Dst == nil && route.Gw != nil && route.LinkIndex > 0 {
			fmt.Println(route)
			intfLink, err := netlink.LinkByIndex(route.LinkIndex)
			if err != nil {
				continue
			}
			if utilnet.IsIPv6(route.Gw) {
				if !needIPv6 {
					continue
				}
				needIPv6 = false
			} else {
				if !needIPv4 {
					continue
				}
				needIPv4 = false
			}

			if intfName == "" {
				intfName = intfLink.Attrs().Name
			} else if intfName != intfLink.Attrs().Name {
				return "", nil, fmt.Errorf("multiple gateway interfaces detected: %s %s", intfName, intfLink.Attrs().Name)
			}
			gatewayIPs = append(gatewayIPs, route.Gw)
		}
	}
	if len(gatewayIPs) == 0 {
		return "", nil, fmt.Errorf("failed to get default gateway interface")
	}
	return intfName, gatewayIPs, nil
}

func main() {
	fmt.Println(getDefaultGatewayInterfaceDetails())
}

The executable works with multiple routes in my local environment, as you can see in comment #2. However, it fails (as ovnkube-node fails) in the environment:

[core@master-0-2 ~]$ sudo ./test
 [] failed to get default gateway interface
Created attachment 1747499 [details] strace default gateway detection code in ovn
It seems that the problem is that netlink returns an array in the MultiPath field and keeps Gw nil, and we were only checking that Gw != nil:

> Debug route {Dst: <nil> Src: <nil> Gw: [{Ifindex: 5 Weight: 1 Gw: fe80::1b8b:3a78:f38c:bfdb Flags: []} {Ifindex: 5 Weight: 1 Gw: fe80::5054:ff:fe48:86d9 Flags: []}] Flags: [] Table: 254}

// Route represents a netlink route.
type Route struct {
	LinkIndex  int
	ILinkIndex int
	Scope      Scope
	Dst        *net.IPNet
	Src        net.IP
	Gw         net.IP
	MultiPath  []*NexthopInfo

I have updated the PR, and now the code detects the right gateway in the failing environment:

[core@master-0-2 ~]$ ./test
Found default gateway br-ex fe80::1b8b:3a78:f38c:bfdb
br-ex [fe80::1b8b:3a78:f38c:bfdb] <nil>
We verified in the environment that ovnkube-node now detects the gateway correctly:

[kni@provisionhost-0-0 ~]$ oc get pods -A | grep ovn
openshift-ovn-kubernetes   ovnkube-master-5lxdf   6/6   Running   21   19h
openshift-ovn-kubernetes   ovnkube-master-pwk5n   6/6   Running   25   19h
openshift-ovn-kubernetes   ovnkube-master-tx9fl   6/6   Running   18   19h
openshift-ovn-kubernetes   ovnkube-node-pg7fn     3/3   Running   0    11m
openshift-ovn-kubernetes   ovnkube-node-wt7ns     3/3   Running   4    13m
openshift-ovn-kubernetes   ovnkube-node-x67jc     3/3   Running   0    13m
openshift-ovn-kubernetes   ovs-node-67l6l         1/1   Running   0    19h
openshift-ovn-kubernetes   ovs-node-gx6zx         1/1   Running   0    19h
openshift-ovn-kubernetes   ovs-node-lz22m         1/1   Running   0    19h
Verified on 4.7.0-0.nightly-2021-01-22-063949. The bootstrap stage passed OK; OVN pods are up and running:

[kni@provisionhost-0-0 ~]$ oc get pods -A | grep ovn
openshift-ovn-kubernetes   ovnkube-master-fmzhw   6/6   Running   1    33m
openshift-ovn-kubernetes   ovnkube-master-pq55l   6/6   Running   3    33m
openshift-ovn-kubernetes   ovnkube-master-xsnc5   6/6   Running   3    33m
openshift-ovn-kubernetes   ovnkube-node-89vlv     3/3   Running   0    33m
openshift-ovn-kubernetes   ovnkube-node-bm8pm     3/3   Running   0    33m
openshift-ovn-kubernetes   ovnkube-node-k657b     3/3   Running   0    14m
openshift-ovn-kubernetes   ovnkube-node-zgtsj     3/3   Running   0    33m
openshift-ovn-kubernetes   ovnkube-node-zsqqt     3/3   Running   0    14m
openshift-ovn-kubernetes   ovs-node-dqhrl         1/1   Running   0    14m
openshift-ovn-kubernetes   ovs-node-lmzfs         1/1   Running   0    33m
openshift-ovn-kubernetes   ovs-node-sqt2t         1/1   Running   0    14m
openshift-ovn-kubernetes   ovs-node-vb64d         1/1   Running   0    33m
openshift-ovn-kubernetes   ovs-node-zwvcx         1/1   Running   0    33m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633