Created attachment 1702253 [details]
bootcmdline

Description of problem:

The vlan interface does not come up on the initial boot; the node must be rebooted for the interface to come up as expected. At step (10) below the node should either configure itself correctly or reboot once again. Without manual intervention the node stays unconfigured indefinitely.

Version-Release number of selected component (if applicable):

All versions of RHCOS I've tested.

Steps to Reproduce:
(1) Put the ISO into the virtual CD/DVD and configure a one-time boot from CD
(2) Boot the node and wait until the ISO loads
(3) Edit the boot parameters
(4) Enter a boot cmdline like the attached picture shows
(5) Boot up
(6) Wait until coreos installs and the node reboots
(7) Remove the virtual CD
(8) Wait until coreos boots up and does additional configuring
(9) Wait until the coreos login prompt is visible
(10) Note that no IP address gets configured
(11) Send ctrl+alt+delete through the iDRAC virtual console
(12) Wait until coreos boots up and the login prompt is visible
(13) Note that an IP address is configured for eno1.41

Actual results:

The node must be manually rebooted in order for the vlan interface to come up.

Expected results:

After the initial reboot for Ignition, the vlan should come up successfully.

Additional info:

Added the image of the boot line that is being used (an illustrative reconstruction of the syntax is sketched below). The initial bug for this case was BZ#1842887, but that issue turned out to be unrelated, so this bug was created to track this one.
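The attached cmdline image is not reproduced here; purely as an illustration of the dracut syntax involved (`vlan=<vlanname>:<phydevice>` and `ip=<interface>:dhcp`), a cmdline for this scenario, matching the eno1.41 name in the steps, would look along the lines of:

```
# Hypothetical example only; the actual values are in the attachment.
vlan=eno1.41:eno1 ip=eno1.41:dhcp coreos.inst=yes coreos.inst.install_dev=/dev/sda
```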
This may be a limitation in RHCOS 4.4 with propagating vlan configuration from the dracut cmdline during install to the real root. See similar BZs related to this: BZ#1859897 and BZ#1857532.

As part of RHCOS/OCP 4.6 we will have better support for complex network configurations. Setting as medium priority and targeting for 4.6.
This bug has not been selected for work in the current sprint.
This appears to be an issue with the current network teardown code in pre-4.6:

https://github.com/coreos/ignition-dracut/blob/bad799c410c6c5756ed21e2fa1614795cff7a120/dracut/30ignition/coreos-teardown-initramfs-network.sh#L7-L11

This code was overhauled in 4.6, and an `ip link delete` attempt was added first:

https://github.com/coreos/fedora-coreos-config/blob/18a2c5182c8824021f246064cabf2fb665496df1/overlay.d/05core/usr/lib/dracut/modules.d/30ignition-coreos/coreos-teardown-initramfs.sh#L119-L131

I have confirmed that manually running `ip link delete ens2.100` before continuing the boot process allows the network to be brought up properly on first boot. This should be fixed in the latest builds of 4.6.
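For illustration, a minimal sketch of the 4.6-style teardown ordering (not the exact script; see the second link above for the real code):

```
# Sketch only, based on the linked coreos-teardown-initramfs.sh.
# Deleting the link first removes vlan/virtual devices entirely, so
# NetworkManager can recreate them cleanly in the real root. Plain
# physical NICs can't be deleted, so fall back to down + flush,
# which is what the pre-4.6 code did unconditionally.
down_interface() {
    echo "info: taking down network device: $1"
    if ! ip link delete "$1"; then
        ip link set "$1" down
        ip addr flush dev "$1"
    fi
}
```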
I tested this on a single RHCOS 4.6 node and I think it is working as expected. (I didn't set up a VLAN in libvirt, but I don't think that is necessary.) Need to confirm with Dusty.

- Booted rhcos-46.82.202009182140-0-live.x86_64.iso
- Appended the following args to the kernel line: `coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.inst.image_url=http://192.168.122.1:9001/rhcos-46.82.202009182140-0-metal.x86_64.raw.gz coreos.inst.ignition_url=http://192.168.122.1:9001/ignitionv3.json coreos.inst.insecure=true vlan=ens3.41:ens3 ip=ens3.41:dhcp`
- Install completed successfully
- Inspected the login banner:
```
Red Hat Enterprise Linux CoreOS 46.82.202009182140-0 (Ootpa) 4.6
SSH host key: SHA256:MtQgvhoMJl/Q4xdHLDtsRGyuc+ufIb9Efy7ckHMk9cw (ECDSA)
SSH host key: SHA256:DMxnj/xmQOzmJ5xnlHYBJhGK09TJMQWXUpF7KdPIjhU (ED25519)
SSH host key: SHA256:4UMexCtXDcpxDhdNO+doyKYm7FVPiOSARTbA6uCaQmw (RSA)
ens3: 192.168.122.220 fe80::2b87:b9d4:751b:d273
localhost login:
```
- Checked the journal for evidence of downing the interface:
```
$ journalctl -b -1 | grep taking
Sep 19 18:29:39 localhost.localdomain coreos-teardown-initramfs[1021]: info: taking down network device: ens3
Sep 19 18:29:40 localhost.localdomain coreos-teardown-initramfs[1021]: info: taking down network device: ens3
```
- Checked nmcli:
```
$ nmcli con show
NAME                UUID                                  TYPE      DEVICE
Wired connection 1  34dad74c-52ce-3f25-974f-5fcab4f4fb2a  ethernet  ens3
ens3.41             032383f2-bcc0-4681-9184-7a877639054a  vlan      --
```
Hey Devon,

Is there any chance you could try this and verify it is fixed in a build of 4.6?
Yeah, for sure. I'll go ahead and test this and let you know if this behavior has changed. Thanks for the update.
From my vague memory, here is how I reproduced the problem and then tested the fix:

1. Create a second libvirt bridge without DHCP:
```
<network>
  <name>secondary</name>
  <uuid>6bb93ef0-acf3-4b80-9ea7-54a8d5fa2783</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr1' stp='on' delay='0'/>
  <mac address='52:54:00:59:b9:93'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
  </ip>
</network>
```

2. Update the bridge helper configuration:
```
$ cat /etc/qemu/bridge.conf
allow virbr0
allow virbr1
```

3. Start a VM (I used Fedora I think) with NICs on virbr0 and virbr1. The VM will get DHCP on one interface (you can ssh in). Set up a vlan on the second interface, eth1.100 or something. Install dnsmasq and lay down a file like:
```
$ sudo cat /etc/dnsmasq.d/vlandhcp
interface=eth1.100
bind-interfaces
dhcp-range=192.168.200.150,192.168.200.160,12h
```
Then `systemctl start dnsmasq`.

4. Start the RHCOS VM with a NIC on virbr1 and `vlan=ens2.100:ens2 ip=ens2.100:dhcp`. Verify it gets DHCP (a quick check is sketched below). Verify the BZ.
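For the "verify it gets DHCP" part of step 4, a quick check from the RHCOS console (interface name matching the kargs above):

```
$ ip -4 addr show dev ens2.100    # expect a lease from 192.168.200.150-160
$ nmcli con show                  # the ens2.100 vlan connection should be active
```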
Verified on RHCOS 46.82.202010022240-0 which is included in registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-10-07-194305

1. Create a second libvirt bridge without DHCP:
```
# cat << EOF > secondary.xml
<network>
  <name>secondary</name>
  <uuid>6bb93ef0-acf3-4b80-9ea7-54a8d5fa2783</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr1' stp='on' delay='0'/>
  <mac address='52:54:00:59:b9:93'/>
  <ip address='192.168.123.1' netmask='255.255.255.0'>
  </ip>
</network>
EOF
# virsh net-define secondary.xml
# virsh net-start secondary
# virsh net-autostart secondary
```

2. Update the bridge helper configuration:
```
$ cat /etc/qemu/bridge.conf
allow virbr0
allow virbr1
```

3. Start a VM (I used Fedora I think) with NICs on virbr0 and virbr1. The VM will get DHCP on one interface (you can ssh in). Set up a vlan on the second interface, eth1.100 or something:
```
# ip link add link eth1 name eth1.100 type vlan id 100
# ip addr add 192.168.200.1/24 dev eth1.100
# ip link set eth1.100 up
```
Install dnsmasq and lay down a file like:
```
# dnf install dnsmasq
# cat << EOF > /etc/dnsmasq.d/vlandhcp
interface=eth1.100
bind-interfaces
dhcp-range=192.168.200.150,192.168.200.160,12h
EOF
# systemctl start dnsmasq
```

4. Start the RHCOS VM with a NIC on virbr1 and use kargs `vlan=ens2.100:ens2 ip=ens2.100:dhcp`. Verify it gets DHCP on ens2.100 (a lease check from the dnsmasq side is sketched below).
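To confirm from the helper VM that a lease was actually handed out, one can also check the dnsmasq logs, e.g.:

```
# journalctl -u dnsmasq | grep -i dhcpack   # expect a DHCPACK on eth1.100 for the RHCOS VM's MAC
```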
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196