1860060 – Installing an additional bare metal worker node in UPI with VLAN needs additional reboot to work

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Dusty Mabe
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1186913

Reported:	2020-07-23 16:04 UTC by Devon
Modified:	2024-03-25 16:12 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:16:44 UTC
Target Upstream Version:
Embargoed:

Attachments
bootcmdline (17.51 KB, image/png) 2020-07-23 16:04 UTC, Devon

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5619761	0	None	None	None	2020-12-02 11:26:59 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:17:06 UTC

Description Devon 2020-07-23 16:04:17 UTC

Created attachment 1702253 [details]
bootcmdline

Description of problem:

(1) Put the ISO into virtual CD/DVD, configure run once from CD
(2) Boot the node, wait until ISO loads
(3) Edit boot parameters
(4) Enter boot cmdline like attached picture shows
(5) Boot up
(6) Wait until coreos installs, and node reboots
(7) Remove virtual CD
(8) Wait until coreos boots up and does additional configuring
(9) Wait until coreos login prompt is visible
(10) Note that no ip address gets configured
(11) Send ctrl+alt+delete through idrac virtual console
(12) Wait until coreos boots up and login prompt is visible
(13) Note that IP address is configured for eno1.41

At step (10) the node should either configure itself correctly or reboot once more on its own. Without manual intervention, the node stays unconfigured indefinitely.

The issue is that the VLAN interface does not come up on the first boot after installation; the node must be rebooted before the interface comes up as expected.

Version-Release number of selected component (if applicable):

All versions of RHCOS I have tested.

Steps to Reproduce:

(1) Put the ISO into virtual CD/DVD, configure run once from CD
(2) Boot the node, wait until ISO loads
(3) Edit boot parameters
(4) Enter boot cmdline like attached picture shows
(5) Boot up
(6) Wait until coreos installs, and node reboots
(7) Remove virtual CD
(8) Wait until coreos boots up and does additional configuring
(9) Wait until coreos login prompt is visible
(10) Note that no ip address gets configured
(11) Send ctrl+alt+delete through idrac virtual console
(12) Wait until coreos boots up and login prompt is visible
(13) Note that IP address is configured for eno1.41

Actual results:

Node must be manually rebooted in order for the vlan interface to come up.

Expected results:

After the initial reboot for ignition the vlan should come up successfully.
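
One way to tell the failing state in step (10) from the working state in step (13) is to check whether the VLAN interface holds an IPv4 address. The following is a minimal sketch, assuming the one-line format printed by `ip -o -4 addr show`; the function name `ipv4_addrs_for` is hypothetical and not part of RHCOS.

```shell
# Print every IPv4 address (with prefix length) assigned to one interface.
# Reads `ip -o -4 addr show` output on stdin; prints nothing if the
# interface has no IPv4 address.
# Assumption: field 2 is the interface name, field 3 is "inet",
# and field 4 is the address, as in iproute2 one-line output.
ipv4_addrs_for() {
    awk -v dev="$1" '{
        name = $2
        sub(/@.*$/, "", name)   # tolerate a "name@parent" form
        if (name == dev && $3 == "inet") print $4
    }'
}

# Usage on the node: ip -o -4 addr show | ipv4_addrs_for eno1.41
```

An empty result for `eno1.41` after the first boot would correspond to step (10); a non-empty result would correspond to step (13).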

Additional info:

Attached an image of the boot command line that is being used.

The initial bug for this case was 1842887, but that issue turned out to be unrelated, so this bug was created to track this problem separately.

Comment 1 Micah Abbott 2020-07-23 18:34:07 UTC

This may be a limitation in RHCOS 4.4 with propagating vlan configuration from the dracut cmdline during install to the real root.  See similar BZs related to this: BZ#1859897 and BZ#1857532

As part of RHCOS/OCP 4.6 we will have better support for complex network configurations.

Setting as medium priority and targeting for 4.6

Comment 2 Dusty Mabe 2020-07-30 19:37:31 UTC

This bug has not been selected for work in the current sprint.

Comment 3 Dusty Mabe 2020-08-12 21:32:23 UTC

This appears to be an issue with the current network teardown code in pre-4.6.

https://github.com/coreos/ignition-dracut/blob/bad799c410c6c5756ed21e2fa1614795cff7a120/dracut/30ignition/coreos-teardown-initramfs-network.sh#L7-L11

This code was overhauled in 4.6, and the teardown now attempts an `ip link delete` first:

https://github.com/coreos/fedora-coreos-config/blob/18a2c5182c8824021f246064cabf2fb665496df1/overlay.d/05core/usr/lib/dracut/modules.d/30ignition-coreos/coreos-teardown-initramfs.sh#L119-L131

I have confirmed that manually doing the `ip link delete ens2.100` before continuing the boot process allows the network to be brought up properly on first boot.

This should be fixed in the latest builds of 4.6.
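
The manual workaround above (deleting the VLAN link before the boot continues) can be illustrated roughly as follows. This is only a sketch and not the actual teardown script linked above: it assumes the one-line format of `ip -o link show type vlan`, the function names are hypothetical, and it prints the commands instead of running them.

```shell
# Extract VLAN link names from `ip -o link show type vlan` output on stdin.
# Field 2 looks like "ens2.100@ens2:", so the trailing ":" and the
# "@parent" suffix are stripped.
vlan_link_names() {
    awk '{ name = $2; sub(/:$/, "", name); sub(/@.*$/, "", name); print name }'
}

# Dry run: print one `ip link delete` command per VLAN link.
print_vlan_teardown() {
    vlan_link_names | while read -r dev; do
        echo "ip link delete $dev"
    done
}

# Usage (dry run): ip -o link show type vlan | print_vlan_teardown
```

Printing the commands first makes it possible to review what would be deleted before running anything as root.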

Comment 6 Micah Abbott 2020-09-19 18:41:08 UTC

I tested this on a single RHCOS 4.6 node and I think it is working as expected.  (I didn't set up a VLAN in libvirt, but I don't think it is necessary.)  Need to confirm with Dusty.


- Booted rhcos-46.82.202009182140-0-live.x86_64.iso
- Appended the following args to the kernel line:  `coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.inst.image_url=http://192.168.122.1:9001/rhcos-46.82.202009182140-0-metal.x86_64.raw.gz coreos.inst.ignition_url=http://192.168.122.1:9001/ignitionv3.json coreos.inst.insecure=true vlan=ens3.41:ens3 ip=ens3.41:dhcp`
- Install completed successfully
- Inspect login banner:

```
Red Hat Enterprise Linux CoreOS 46.82.202009182140-0 (Ootpa) 4.6
SSH host key: SHA256:MtQgvhoMJl/Q4xdHLDtsRGyuc+ufIb9Efy7ckHMk9cw (ECDSA)
SSH host key: SHA256:DMxnj/xmQOzmJ5xnlHYBJhGK09TJMQWXUpF7KdPIjhU (ED25519)
SSH host key: SHA256:4UMexCtXDcpxDhdNO+doyKYm7FVPiOSARTbA6uCaQmw (RSA)
ens3: 192.168.122.220 fe80::2b87:b9d4:751b:d273
localhost login: 
```

- Check journal for evidence of downing interface:

```
$ journalctl -b -1 | grep taking
Sep 19 18:29:39 localhost.localdomain coreos-teardown-initramfs[1021]: info: taking down network device: ens3
Sep 19 18:29:40 localhost.localdomain coreos-teardown-initramfs[1021]: info: taking down network device: ens3
```

- Check nmcli

```
$ nmcli con show
NAME                UUID                                  TYPE      DEVICE 
Wired connection 1  34dad74c-52ce-3f25-974f-5fcab4f4fb2a  ethernet  ens3   
ens3.41             032383f2-bcc0-4681-9184-7a877639054a  vlan      --     
```
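
To read the `nmcli con show` table above mechanically, a sketch like the following lists VLAN profiles that have no device attached. It assumes the default tabular layout shown above, where the last two columns are TYPE and DEVICE and `--` marks a profile without a device; the function name is hypothetical.

```shell
# Print the NAME of every vlan connection profile that has no device,
# reading the default `nmcli con show` table on stdin.
inactive_vlan_profiles() {
    awk 'NR > 1 && $(NF-1) == "vlan" && $NF == "--" {
        # NAME may contain spaces, so rebuild it from the leading fields
        # (everything before UUID, TYPE and DEVICE).
        name = $1
        for (i = 2; i <= NF - 3; i++) name = name " " $i
        print name
    }'
}

# Usage on the node: nmcli con show | inactive_vlan_profiles
```

For the table above this would report `ens3.41`, which matches the note that no VLAN was set up in libvirt for this test.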

Comment 7 Dusty Mabe 2020-10-05 14:57:45 UTC

Hey Devon,

Is there any chance you could try this and verify it is fixed in a build of 4.6?

Comment 8 Devon 2020-10-05 17:33:03 UTC

Yeah, for sure. I'll go ahead and test this and let you know if the behavior has changed. Thanks for the update.

Comment 10 Dusty Mabe 2020-10-09 19:30:43 UTC

From my vague memory, here is how I reproduced the problem and then tested the fix:


1. create second libvirt bridge without DHCP

<network>
  <name>secondary</name>
  <uuid>6bb93ef0-acf3-4b80-9ea7-54a8d5fa2783</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr1' stp='on' delay='0'/>
  <mac address='52:54:00:59:b9:93'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
  </ip>
</network>


2. Update bridgehelper configuration.

$ cat /etc/qemu/bridge.conf 
allow virbr0
allow virbr1

3. Start a VM (I used Fedora I think) with NIC on virbr0 and virbr1

The VM will get DHCP on one interface (you can ssh in).
Set up vlan on the second interface. eth1.100 or something.
Install dnsmasq and lay down a file like

$ sudo cat /etc/dnsmasq.d/vlandhcp 
interface=eth1.100
bind-interfaces
dhcp-range=192.168.200.150,192.168.200.160,12h

systemctl start dnsmasq

4. Start RHCOS VM with NIC on virbr1 and `vlan=ens2.100:ens2 ip=ens2.100:dhcp`

Verify it gets DHCP. Verify the BZ
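
To make "Verify it gets DHCP" concrete: the address leased on `ens2.100` should fall inside the `dhcp-range` from the dnsmasq file above. A minimal sketch, with the range 192.168.200.150 to 192.168.200.160 hard-coded from that file; the helper name `in_dhcp_range` is hypothetical.

```shell
# Succeed (exit 0) if the given IPv4 address lies inside
# 192.168.200.150-192.168.200.160, the dhcp-range configured above.
# The function body runs in a subshell so its variables stay local.
in_dhcp_range() (
    addr=$1
    prefix=${addr%.*}    # first three octets
    host=${addr##*.}     # last octet
    [ "$prefix" = "192.168.200" ] && [ "$host" -ge 150 ] && [ "$host" -le 160 ]
)

# Usage: in_dhcp_range 192.168.200.155 && echo "lease is in range"
```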

Comment 12 Michael Nguyen 2020-10-10 02:22:29 UTC

Verified on RHCOS 46.82.202010022240-0 which is included in registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-07-194305

1. create second libvirt bridge without DHCP
# cat << EOF > secondary.xml
<network>
  <name>secondary</name>
  <uuid>6bb93ef0-acf3-4b80-9ea7-54a8d5fa2783</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr1' stp='on' delay='0'/>
  <mac address='52:54:00:59:b9:93'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
  </ip>
</network>
EOF

# virsh net-define secondary.xml
# virsh net-autostart secondary
# virsh net-start secondary


2. Update bridgehelper configuration.

$ cat /etc/qemu/bridge.conf 
allow virbr0
allow virbr1

3. Start a VM (I used Fedora I think) with NIC on virbr0 and virbr1
The VM will get DHCP on one interface (you can ssh in).

Set up vlan on the second interface. eth1.100 or something.
# ip link add link eth1 name eth1.100 type vlan id 100
# ip addr add 192.168.200.1/24 dev eth1.100
# ip link set eth1.100 up

Install dnsmasq and lay down a file like
# dnf install dnsmasq

# cat << EOF > /etc/dnsmasq.d/vlandhcp
interface=eth1.100
bind-interfaces
dhcp-range=192.168.200.150,192.168.200.160,12h
EOF

# systemctl start dnsmasq

4. Start RHCOS VM with NIC on virbr1 and use kargs `vlan=ens2.100:ens2 ip=ens2.100:dhcp`
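
The kargs in step 4 follow the dracut forms `vlan=<vlanname>:<phydevice>` and `ip=<interface>:dhcp`. The sketch below splits the `vlan=` argument and checks that the `ip=` argument targets the VLAN device and not the parent NIC. It handles only the dotted `<parent>.<id>` naming and the two-field `ip=` form used here; both helper names are hypothetical.

```shell
# Split "vlan=<name>:<parent>" and print "<name> <parent> <id>".
# Assumes the dotted naming scheme, e.g. ens2.100.
parse_vlan_karg() (
    arg=${1#vlan=}
    name=${arg%%:*}
    parent=${arg#*:}
    id=${name##*.}
    echo "$name $parent $id"
)

# Succeed if the interface named in "ip=<interface>:<method>" is the
# VLAN device defined by the vlan= karg.
kargs_consistent() (
    vlan_name=$(parse_vlan_karg "$1" | cut -d' ' -f1)
    ip_arg=${2#ip=}
    [ "${ip_arg%%:*}" = "$vlan_name" ]
)

# Usage:
#   parse_vlan_karg vlan=ens2.100:ens2
#   kargs_consistent vlan=ens2.100:ens2 ip=ens2.100:dhcp && echo ok
```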

Comment 15 errata-xmlrpc 2020-10-27 16:16:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
