Bug 1895909

Summary: e2e-metal-ipi-ovn-dualstack is failing
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: InstallerAssignee: Derek Higgins <derekh>
Installer sub component: OpenShift on Bare Metal IPI QA Contact: Amit Ugol <augol>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: danw, dcbw, derekh, ealcaniz, iheim, kquinn, sdodson, trozet
Version: 4.7Keywords: Triaged
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously when using dual stack deployments, worker node hostnames didn't match the name inspected before deployment causing nodes to need manual approval. This has been fixed.
Story Points: ---
Clone Of:
: 1902797 (view as bug list) Environment:
Last Closed: 2021-02-24 15:31:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1902797    
Attachments:
Description Flags
worker journal log none

Description Stephen Benjamin 2020-11-09 12:25:03 UTC
Description of problem:


Dualstack has been failing since last week. 

Failures like this are showing up in the logs:

./host_service_logs/masters/kubelet_service.log:Nov 08 23:58:54.775332 master-1.ostest.test.metalkube.org hyperkube[3305]:     echo "$(date -Iseconds) - ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts, giving up"


Version-Release number of selected component (if applicable):

4.7 nightlies

How reproducible:

Always

Steps to Reproduce:
1. Install baremetal IPI dualstack

Actual results:

Failed installation

Expected results:

Success

Additional info:

Logs available at:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-metal-ipi-ovn-dualstack/1325575040141365248/artifacts/e2e-metal-ipi-ovn-dualstack/baremetalds-devscripts-gather/

Query for more jobs:

https://prow.ci.openshift.org/?type=periodic&job=*ocp-4.7-e2e-metal-ipi-ovn-dualstack*

Comment 1 Dan Winship 2020-11-09 14:02:05 UTC
(In reply to Stephen Benjamin from comment #0)
> Failures like this are showing up in the logs:
> 
> ./host_service_logs/masters/kubelet_service.log:Nov 08 23:58:54.775332
> master-1.ostest.test.metalkube.org hyperkube[3305]:     echo "$(date
> -Iseconds) - ERROR RESTARTING - nbdb - too many failed ovn-nbctl attempts,
> giving up"

It's not logging that that error _happened_, it's just logging a debug/info message that includes the complete ovnkube-master podspec, including the script in the `postStart` hook that includes that "echo" command

Comment 2 Stephen Benjamin 2020-11-09 14:12:25 UTC
Oh I see, thanks - that's confusing. I only skimmed the must-gather and it's obviously very noisy, that stuck out to me as something I saw repeatedly in the logs.

Comment 3 Dan Winship 2020-11-09 15:41:37 UTC
So the problem in the linked test run is that the workers are not being created. There are some baremetal/machine objects for them, but no nodes, and I forget the whole chain of events that needs to go right at that point to cause the nodes to be created, but something in there is not working.

Comment 4 Stephen Benjamin 2020-11-10 12:20:55 UTC
It looks like cluster-machine-approver can't link the node, due to hostname mismatch. Machine only contains shortname, kubelet reported FQDN. Maybe something changed in Ironic agent that reports up the network addresses.


machine-approver is scrolling:

I1109 20:53:41.427270       1 main.go:182] CSR csr-nj5nf not authorized: failed to find machine for node worker-0.ostest.test.metalkube.org
I1109 20:53:41.427362       1 main.go:218] Error syncing csr csr-nj5nf: failed to find machine for node worker-0.ostest.test.metalkube.org

The machine network addresses are:

status:
  addresses:
  - address: 192.168.111.23
    type: InternalIP
  - address: fd00:1101::6caf:8d:e57e:f291
    type: InternalIP
  - address: worker-0
    type: Hostname
  - address: worker-0
    type: InternalDNS

Comment 5 Dan Winship 2020-11-10 13:29:32 UTC
So this sounds like what was supposed to be fixed by https://github.com/openshift/ironic-ipa-downloader/pull/43.

When I bring up a dual-stack environment with dev-scripts myself, it comes up fine. So maybe it's a race condition?

Alternatively did the base OS version change maybe? Like maybe the default hostname is `localhost` rather than `localhost.localdomain` now, so the hack above doesn't work. (I guess that would explain why it fails with 4.7 but not 4.6.) IIRC there are no logs of the ironic-ipa-downloader stage so it's hard to debug...

Need to come up with something less hacky here either way...

Comment 6 Derek Higgins 2020-11-10 17:25:15 UTC
Created attachment 1728094 [details]
worker journal log

I think I've reproduced this, attaching the journal log from the worker node, 

see Network Manager looping through failed attempts to run

Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn>  [1605020803.7218] dhcp4 (enp1s0): request timed out
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7219] dhcp4 (enp1s0): state changed unknown -> timeout
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn>  [1605020803.7219] dhcp4 (enp2s0): request timed out
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7219] dhcp4 (enp2s0): state changed unknown -> timeout
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7220] device (enp1s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn>  [1605020803.7242] device (enp1s0): Activation: failed for connection 'Wired Connection'
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7242] device (enp2s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7249] manager: NetworkManager state is now DISCONNECTED
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <warn>  [1605020803.7261] device (enp2s0): Activation: failed for connection 'Wired Connection'
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7267] device (enp1s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00092|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7333] dhcp4 (enp1s0): canceled DHCP transaction
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7334] dhcp4 (enp1s0): state changed timeout -> done
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7348] device (enp2s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00093|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00094|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7454] dhcp4 (enp2s0): canceled DHCP transaction
Nov 10 15:06:43 worker-0.ostest.test.metalkube.org NetworkManager[1393]: <info>  [1605020803.7454] dhcp4 (enp2s0): state changed timeout -> done

Comment 7 Dan Winship 2020-11-10 19:54:40 UTC
> Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00092|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)

I assume the DHCP configuration in the dev-scripts environment uses the node MAC to assign the right IP to each node?

The failures started happening (not completely reliably) on Nov 3. It's possible that it was caused by the bump to OVN 20.09.0-7 on Nov 2? Something changed in the interaction between OVN/OVS/NetworkManager-ovs with respect to how we copy the MAC address from eth0 to br-ex?

(Can you confirm that br-ex on the workers (at least) has the wrong MAC, relative to the ethernet interface / the DHCP config?)

Comment 8 Derek Higgins 2020-11-11 16:53:52 UTC
looks to me like DHCP6_FQDN_FQDN used to have the format "worker-0" and sometime on the 4th(in the periodic jobs) it changed to "worker-0.ostest.test.metalkube.org"
forcing the hostname to exclude the domain allows the job to pass, but I don't know if this is the correct way around to do it.
this worked for me locally

diff --git a/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml b/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml
index 974406a5c..d1afd3c55 100644
--- a/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml
+++ b/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml
@@ -20,7 +20,7 @@ contents:
-    [[ -n "$DHCP6_FQDN_FQDN" && "$DHCP6_FQDN_FQDN" != "localhost.localdomain" && "$DHCP6_FQDN_FQDN" =~ "." ]] && hostnamectl set-hostname --static --transient $DHCP6_FQDN_FQDN
+    [[ "$DHCP6_FQDN_FQDN" =~ "." && ! "$DHCP6_FQDN_FQDN" =~ localhost ]] && hostnamectl set-hostname --static --transient "${DHCP6_FQDN_FQDN/.*}"


What I don't understand is why its only happening on the ovn/dualstack job

(In reply to Dan Winship from comment #7)
> > Nov 10 15:06:43 worker-0.ostest.test.metalkube.org ovs-vswitchd[1361]: ovs|00092|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
> 
> I assume the DHCP configuration in the dev-scripts environment uses the node
> MAC to assign the right IP to each node?
> 
> The failures started happening (not completely reliably) on Nov 3. It's
> possible that it was caused by the bump to OVN 20.09.0-7 on Nov 2? Something
> changed in the interaction between OVN/OVS/NetworkManager-ovs with respect
> to how we copy the MAC address from eth0 to br-ex?
> 
> (Can you confirm that br-ex on the workers (at least) has the wrong MAC,
> relative to the ethernet interface / the DHCP config?)

I think the "ignoring mac in Interface record" error might be a red herring, it is happening even in successful runs with the workaround above, but here are the MACs on the worker node to answer your question
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000\    link/ether 00:1d:6d:b8:f6:7c brd ff:ff:ff:ff:ff:ff
3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP mode DEFAULT group default qlen 1000\    link/ether 00:1d:6d:b8:f6:7e brd ff:ff:ff:ff:ff:ff
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 9a:40:25:85:3a:8d brd ff:ff:ff:ff:ff:ff
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 00:1d:6d:b8:f6:7e brd ff:ff:ff:ff:ff:ff
6: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 82:60:10:6a:c2:63 brd ff:ff:ff:ff:ff:ff
7: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 4a:20:0e:85:f3:cf brd ff:ff:ff:ff:ff:ff
8: br-local: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 8a:19:b4:9f:41:4e brd ff:ff:ff:ff:ff:ff
9: ovn-k8s-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 0a:58:a9:fe:00:01 brd ff:ff:ff:ff:ff:ff
10: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 0a:dd:92:c6:99:d8 brd ff:ff:ff:ff:ff:ff

Comment 9 Dan Winship 2020-11-11 22:13:59 UTC
(In reply to Derek Higgins from comment #8)
> looks to me like DHCP6_FQDN_FQDN used to have the format "worker-0" and
> sometime on the 4th(in the periodic jobs) it changed to
> "worker-0.ostest.test.metalkube.org"

I don't think so; as the name suggests, it's expected to be a fqdn.

> forcing the hostname to exclude the domain allows the job to pass, but I don't know if this is the correct way around to do it.
> this worked for me locally

It will probably break single-stack IPv6 though.

> --- a/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml

Wait where does that file come from? Is that new?


The issue is that in the dev-scripts setup, the DHCPv4 server sends "short" hostnames but the DHCPv6 server sends fully-qualified ones. So in a single-stack IPv4 environment the node gets a short hostname both in the preboot environment and when it's actually running as a node, and in single-stack IPv6 environment, it gets a fqdn hostname both in the preboot environment and when it's actually running as a node.

In dual-stack things get tricky because we've had problems where the preboot environment ends up setting the hostname from the DHCPv6 server but when running as a node it sets the hostname from the DHCPv4 server, and then the names in the different "machine" objects don't match and the worker nodes can't come up.

There's a hack in https://github.com/openshift/ironic-ipa-downloader/blob/master/chrooted.sh to try to only use the DHCPv6 hostname in a single-stack environment. (It only sets the hostname from $DHCP6_FQDN_FQDN if the hostname is currently "localhost.localdomain", meaning it _didn't_ get set from DHCPv4.) That had made things work for a while but it seems broken now.

Comment 10 Dan Winship 2020-11-16 17:09:58 UTC
This is failing in 4.6 apparently too (https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-release-4.6-e2e-metal-ipi-ovn-dualstack) so whatever broke it must have been backported there. (Or else is something unversioned like the openshift/release config fragments)

Comment 11 Derek Higgins 2020-11-24 17:31:00 UTC
I've updated the workaround in IPA to always prefer the FQDN if set by IPv6. This appears to work for out IPv4/IPv6 and dualstack jobs

-	echo -e '[[ "$DHCP6_FQDN_FQDN" =~ - ]] && [[ "$(hostname)" == localhost.localdomain ]] && hostname $DHCP6_FQDN_FQDN' > /etc/NetworkManager/dispatcher.d/01-hostname
+	echo -e '[[ "$DHCP6_FQDN_FQDN" =~ "." ]] && hostname $DHCP6_FQDN_FQDN' > /etc/NetworkManager/dispatcher.d/01-hostname

Comment 14 Scott Dodson 2020-11-30 16:26:41 UTC
A bug should not be in POST state without a PR linked in the external links section, which PR is purported to fix this?

Comment 20 Dan Williams 2020-12-02 19:56:10 UTC
New ironic-images RPM has landed in CI repos and is being brought into OCP 4.7 via https://github.com/openshift/ironic-ipa-downloader/pull/55

Comment 21 Derek Higgins 2020-12-02 23:42:09 UTC
Just seen a 4.7 dualstack CI job pass on https://github.com/openshift/ovn-kubernetes/pull/365
Marking a verified

Comment 25 errata-xmlrpc 2021-02-24 15:31:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633