Description of problem: Dualstack has been failing since last week. Failures like this are showing up in the logs: ./host_service_logs/masters/kubelet_service.log:Nov 08 23:58:54.775332 master-1.ostest.test.metalkube.org hyperkube[3305]: echo "$(date -Iseconds) - ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts, giving up" Version-Release number of selected component (if applicable): 4.7 nightlies How reproducible: Always Steps to Reproduce: 1. Install baremetal IPI dualstack Actual results: Failed installation Expected results: Success Additional info: Logs available at: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack/1325575040141365248/artifacts/e2e-metal-ipi-ovn-dualstack/baremetalds-devscripts-gather/ Query for more jobs: https://prow.ci.openshift.org/?type=periodic&job=*ocp-4.7-e2e-metal-ipi-ovn-dualstack*
(In reply to Stephen Benjamin from comment #0) > Failures like this are showing up in the logs: > > ./host_service_logs/masters/kubelet_service.log:Nov 08 23:58:54.775332 > master-1.ostest.test.metalkube.org hyperkube[3305]: echo "$(date > -Iseconds) - ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts, > giving up" It's not logging that that error _happened_, it's just logging a debug/info message that includes the complete ovnkube-master podspec, including the script in the `postStart` hook that includes that "echo" command
Oh I see, thanks - that's confusing. I only skimmed the must-gather and it's obviously very noisy, that stuck out to me as something I saw repeatedly in the logs.
So the problem in the linked test run is that the workers are not being created. There are some baremetal/machine objects for them, but no nodes, and I forget the whole chain of events that needs to go right at that point to cause the nodes to be created, but something in there is not working.
It looks like cluster-machine-approver can't link the node, due to hostname mismatch. Machine only contains shortname, kubelet reported FQDN. Maybe something changed in Ironic agent that reports up the network addresses. machine-approver is scrolling: I1109 20:53:41.427270 1 main.go:182] CSR csr-nj5nf not authorized: failed to find machine for node worker-0.ostest.test.metalkube.org I1109 20:53:41.427362 1 main.go:218] Error syncing csr csr-nj5nf: failed to find machine for node worker-0.ostest.test.metalkube.org The machine network addresses are: status: addresses: - address: 192.168.111.23 type: InternalIP - address: fd00:1101::6caf:8d:e57e:f291 type: InternalIP - address: worker-0 type: Hostname - address: worker-0 type: InternalDNS
So this sounds like what was supposed to be fixed by https://github.com/openshift/ironic-ipa-downloader/pull/43. When I bring up a dual-stack environment with dev-scripts myself, it comes up fine. So maybe it's a race condition? Alternatively did the base OS version change maybe? Like maybe the default hostname is `localhost` rather than `localhost.localdomain` now, so the hack above doesn't work. (I guess that would explain why it fails with 4.7 but not 4.6.) IIRC there are no logs of the ironic-ipa-downloader stage so it's hard to debug... Need to come up with something less hacky here either way...
Created attachment 1728094 [details] worker journal log I think I've reproduced this, attaching the journal log from the worker node, see Network Manager looping through failed attempts to run Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn> [1605020803.7218] dhcp4 (enp1s0): request timed out Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7219] dhcp4 (enp1s0): state changed unknown -> timeout Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn> [1605020803.7219] dhcp4 (enp2s0): request timed out Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7219] dhcp4 (enp2s0): state changed unknown -> timeout Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7220] device (enp1s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed') Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn> [1605020803.7242] device (enp1s0): Activation: failed for connection 'Wired Connection' Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7242] device (enp2s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed') Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7249] manager: NetworkManager state is now DISCONNECTED Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn> [1605020803.7261] device (enp2s0): Activation: failed for connection 'Wired Connection' Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7267] device (enp1s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed') Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00092|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac) Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7333] dhcp4 (enp1s0): canceled DHCP transaction Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7334] dhcp4 (enp1s0): state changed timeout -> done Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7348] device (enp2s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed') Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00093|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac) Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00094|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac) Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7454] dhcp4 (enp2s0): canceled DHCP transaction Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info> [1605020803.7454] dhcp4 (enp2s0): state changed timeout -> done
> Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00092|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac) I assume the DHCP configuration in the dev-scripts environment uses the node MAC to assign the right IP to each node? The failures started happening (not completely reliably) on Nov 3. It's possible that it was caused by the bump to OVN 20.09.0-7 on Nov 2? Something changed in the interaction between OVN/OVS/NetworkManager-ovs with respect to how we copy the MAC address from eth0 to br-ex? (Can you confirm that br-ex on the workers (at least) has the wrong MAC, relative to the ethernet interface / the DHCP config?)
looks to me like DHCP6_FQDN_FQDN used to have the format "worker-0" and sometime on the 4th(in the periodic jobs) it changed to "worker-0.ostest.test.metalkube.org" forcing the hostname to exclude the domain allows the job to pass, but I don't know if this is the correct way around to do it. this worked for me locally diff --git a/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml b/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml index 974406a5c..d1afd3c55 100644 --- a/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml +++ b/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml @@ -20,7 +20,7 @@ contents: - [[ -n "$DHCP6_FQDN_FQDN" && "$DHCP6_FQDN_FQDN" != "localhost.localdomain" && "$DHCP6_FQDN_FQDN" =~ "." ]] && hostnamectl set-hostname --static --transient $DHCP6_FQDN_FQDN + [[ "$DHCP6_FQDN_FQDN" =~ "." && ! "$DHCP6_FQDN_FQDN" =~ localhost ]] && hostnamectl set-hostname --static --transient "${DHCP6_FQDN_FQDN/.*}" What I don't understand is why its only happening on the ovn/dualstack job (In reply to Dan Winship from comment #7) > > Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00092|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac) > > I assume the DHCP configuration in the dev-scripts environment uses the node > MAC to assign the right IP to each node? > > The failures started happening (not completely reliably) on Nov 3. It's > possible that it was caused by the bump to OVN 20.09.0-7 on Nov 2? Something > changed in the interaction between OVN/OVS/NetworkManager-ovs with respect > to how we copy the MAC address from eth0 to br-ex? > > (Can you confirm that br-ex on the workers (at least) has the wrong MAC, > relative to the ethernet interface / the DHCP config?) I think the "ignoring mac in Interface record" error might be a red herring, it is happening even in successful runs with the workaround above, but here are the MACs on the worker node to answer your question 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\ link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000\ link/ether 00:1d:6d:b8:f6:7c brd ff:ff:ff:ff:ff:ff 3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP mode DEFAULT group default qlen 1000\ link/ether 00:1d:6d:b8:f6:7e brd ff:ff:ff:ff:ff:ff 4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\ link/ether 9a:40:25:85:3a:8d brd ff:ff:ff:ff:ff:ff 5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\ link/ether 00:1d:6d:b8:f6:7e brd ff:ff:ff:ff:ff:ff 6: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN mode DEFAULT group default qlen 1000\ link/ether 82:60:10:6a:c2:63 brd ff:ff:ff:ff:ff:ff 7: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000\ link/ether 4a:20:0e:85:f3:cf brd ff:ff:ff:ff:ff:ff 8: br-local: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\ link/ether 8a:19:b4:9f:41:4e brd ff:ff:ff:ff:ff:ff 9: ovn-k8s-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\ link/ether 0a:58:a9:fe:00:01 brd ff:ff:ff:ff:ff:ff 10: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\ link/ether 0a:dd:92:c6:99:d8 brd ff:ff:ff:ff:ff:ff
(In reply to Derek Higgins from comment #8) > looks to me like DHCP6_FQDN_FQDN used to have the format "worker-0" and > sometime on the 4th(in the periodic jobs) it changed to > "worker-0.ostest.test.metalkube.org" I don't think so; as the name suggests, it's expected to be a fqdn. > forcing the hostname to exclude the domain allows the job to pass, but I don't know if this is the correct way around to do it. > this worked for me locally It will probably break single-stack IPv6 though. > --- a/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml Wait where does that file come from? Is that new? The issue is that in the dev-scripts setup, the DHCPv4 server sends "short" hostnames but the DHCPv6 server sends fully-qualified ones. So in a single-stack IPv4 environment the node gets a short hostname both in the preboot environment and when it's actually running as a node, and in single-stack IPv6 environment, it gets a fqdn hostname both in the preboot environment and when it's actually running as a node. In dual-stack things get tricky because we've had problems where the preboot environment ends up setting the hostname from the DHCPv6 server but when running as a node it sets the hostname from the DHCPv4 server, and then the names in the different "machine" objects don't match and the worker nodes can't come up. There's a hack in https://github.com/openshift/ironic-ipa-downloader/blob/master/chrooted.sh to try to only use the DHCPv6 hostname in a single-stack environment. (It only sets the hostname from $DHCP6_FQDN_FQDN if the hostname is currently "localhost.localdomain", meaning it _didn't_ get set from DHCPv4.) That had made things work for a while but it seems broken now.
This is failing in 4.6 apparently too (https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-release-4.6-e2e-metal-ipi-ovn-dualstack) so whatever broke it must have been backported there. (Or else is something unversioned like the openshift/release config fragments)
I've updated the workaround in IPA to always prefer the FQDN if set by IPv6. This appears to work for out IPv4/IPv6 and dualstack jobs - echo -e '[[ "$DHCP6_FQDN_FQDN" =~ - ]] && [[ "$(hostname)" == localhost.localdomain ]] && hostname $DHCP6_FQDN_FQDN' > /etc/NetworkManager/dispatcher.d/01-hostname + echo -e '[[ "$DHCP6_FQDN_FQDN" =~ "." ]] && hostname $DHCP6_FQDN_FQDN' > /etc/NetworkManager/dispatcher.d/01-hostname
A bug should not be in POST state without a PR linked in the external links section, which PR is purported to fix this?
New ironic-images RPM has landed in CI repos and is being brought into OCP 4.7 via https://github.com/openshift/ironic-ipa-downloader/pull/55
Just seen a 4.7 dualstack CI job pass on https://github.com/openshift/ovn-kubernetes/pull/365 Marking a verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633