Bug 1758091 - [UPI][BAREMETAL] RHCOS 4.2 installation does not work when using bonding configuration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Vadim Rutkovsky
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913 1767771
Reported: 2019-10-03 08:51 UTC by Benjamin Chardi
Modified: 2023-12-15 16:48 UTC
CC List: 32 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1767771 (view as bug list)
Environment:
Last Closed: 2020-05-13 21:26:35 UTC
Target Upstream Version:
Embargoed:


Attachments
Failed RHCOS with bond - Emergency shell report at /run/initramfs/rdsosreport.txt (139.71 KB, text/plain)
2019-10-15 14:04 UTC, Benjamin Chardi
boot messages during OCP 4.3 install (176.75 KB, image/png)
2019-12-13 16:55 UTC, umesh_sunnapu
boot messages during OCP 4.3 install - screenshot 2 (208.67 KB, image/png)
2019-12-13 16:56 UTC, umesh_sunnapu


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-05-13 21:26:44 UTC

Internal Links: 1789601

Description Benjamin Chardi 2019-10-03 08:51:32 UTC
Description of problem:


I am trying to install RHCOS 4.2 with bonding on a bare-metal server (in order to install an OCP 4.2 cluster on bare metal) using the following PXE-like configuration:

kernel http://xxx/rhcos-42.80.20190828.2-installer-kernel rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://xxx/rhcos-42.80.20190828.2-metal-uefi.raw.gz bond=bond0:eno1,eno2:mode=802.3ad,miimon=100 ip=192.168.122.200::192.168.122.129:255.255.255.0:server:bond0:none:192.168.122.117 
initrd=rhcos-42.80.20190828.2-installer-initramfs.img coreos.inst.ignition_url=http://xxx/bootstrap.ign rd.debug rd.shell rd.driver.pre=ahci

Bonding configuration is explicitly done via dracut.cmdline:
bond=bond0:eno1,eno2:mode=802.3ad,miimon=100
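
For readers unfamiliar with the syntax, the bond= argument is a colon-separated triple: bond name, comma-separated member interfaces, and comma-separated bonding options. The following is a minimal Python sketch of that layout, not dracut's own parser; the function name parse_bond_arg is hypothetical, and the optional trailing MTU field is ignored.

```python
def parse_bond_arg(arg):
    """Split a dracut-style bond= argument into name, members, and options.

    Illustrative only: handles bond=<name>:<members>:<options> and ignores
    any trailing MTU field.
    """
    prefix = "bond="
    if not arg.startswith(prefix):
        raise ValueError("not a bond= argument: %r" % arg)
    fields = arg[len(prefix):].split(":")
    name = fields[0]
    # Member interfaces are comma-separated in the second field.
    members = fields[1].split(",") if len(fields) > 1 and fields[1] else []
    # Bonding options are comma-separated key=value pairs in the third field.
    options = {}
    if len(fields) > 2 and fields[2]:
        for item in fields[2].split(","):
            key, _, value = item.partition("=")
            options[key] = value
    return {"name": name, "members": members, "options": options}


bond = parse_bond_arg("bond=bond0:eno1,eno2:mode=802.3ad,miimon=100")
print(bond["name"], bond["members"], bond["options"])
```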


The RHCOS installation process starts as expected. I can see from the installation logs that all remote installation files are downloaded correctly via the bond interface during the RHCOS installation.
The installation logs also confirm that the bonding configuration is applied, persisted, and working on the system that is being installed.

From here the RHCOS installation progresses as expected until it gets stuck at the following point and never finishes:

[   41.807814] server ignition[1545]: INFO     : Ignition finished successfully
[   41.812901] server systemd[1]: Started Ignition (files).
[   41.818648] server systemd[1]: Started Tear down initramfs networking.
nfiguration from the Real Root...
[   41.834030] server systemd[1]: Reloading.
[   41.983564] server systemd[1]: Started Reload Configuration from the Real Root.
[   41.992672] server systemd[1]: Reached target Initrd File Systems. 
[   41.999177] server systemd[1]: Reached target Initrd Default Target.
[   42.007022] server systemd[1]: Starting dracut pre-pivot and cleanup hook...
[   42.025108] server dracut-pre-pivot[1755]: cat: /sys/class/net/bond0/name_assign_type: Invalid argument
[   42.170750] server dracut-pre-pivot[1755]: Oct 02 15:19:22 | /etc/multipath.conf does not exist, blacklisting all devices.
[   42.170750] server dracut-pre-pivot[1755]: Oct 02 15:19:22 | You can run "/sbin/mpathconf --enable" to create
[   42.170750] server dracut-pre-pivot[1755]: Oct 02 15:19:22 | /etc/multipath.conf. See man mpathconf(8) for more details
[   42.213165] server systemd[1]: Started dracut pre-pivot and cleanup hook. 
[   42.213348] server systemd[1]: Starting Cleaning Up and Shutting Down Daemons...
 initramfs networking...
...
...
[   42.991296] server systemd[1]: Closed udev Control Socket. 
[   43.007901] server systemd[1]: Closed udev Kernel Socket.
[   43.013470] server systemd[1]: Started Setup Virtual Console.
Shell.
[   43.027343] server systemd[1]: Reached target Emergency Mode.
[   43.035137] server systemd[1]: Startup finished in 13min 30.733s (firmware) + 2.354s (loader) + 8.136s (kernel) + 0 (initrd) + 34.898s (userspace) = 14min 16.123s.


I believe that the issue is on the following line:

dracut-pre-pivot[1755]: cat: /sys/class/net/bond0/name_assign_type: Invalid argument

The file /sys/class/net/bond0/name_assign_type exists on the system but is not readable.
At this point the RHCOS 4.2 installation gets stuck and does not finish.
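
One possible reading, offered as an assumption rather than a confirmed diagnosis: the kernel's sysfs handler for name_assign_type returns EINVAL when a device's name-assignment type is unknown, which can be the case for devices such as bonds that were not named by udev. If so, the cat error above may be harmless noise rather than the cause of the hang. A minimal Python sketch of reading the attribute while tolerating that case follows; the function name read_name_assign_type and the sysfs_root parameter are hypothetical.

```python
import errno


def read_name_assign_type(iface, sysfs_root="/sys/class/net"):
    """Return the interface's name_assign_type as an int, or None if unknown.

    Assumption: the kernel reports an unknown name-assignment type by failing
    the read with EINVAL, which is treated here as "no value" instead of an
    error. Any other failure (for example a missing interface) is re-raised.
    """
    path = "%s/%s/name_assign_type" % (sysfs_root, iface)
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return None
        raise
```

On an affected system, calling read_name_assign_type("bond0") would be expected to return None under this assumption, while a physical interface renamed by udev would return a small integer.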



Version-Release number of selected component (if applicable):

rhcos-42.80.20190828.2-installer-initramfs.img
rhcos-42.80.20190828.2-installer-kernel
rhcos-42.80.20190828.2-metal-bios.raw.gz
rhcos-42.80.20190828.2-metal-uefi.raw.gz


How reproducible:

Always


Steps to Reproduce:

1. Prepare a RHCOS 4.2 PXE-like install with bonding

kernel http://xxx/rhcos-42.80.20190828.2-installer-kernel rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://xxx/rhcos-42.80.20190828.2-metal-uefi.raw.gz bond=bond0:eno1,eno2:mode=802.3ad,miimon=100 ip=192.168.122.200::192.168.122.129:255.255.255.0:server:bond0:none:192.168.122.117
initrd=rhcos-42.80.20190828.2-installer-initramfs.img coreos.inst.ignition_url=http://xxx/bootstrap.ign rd.debug rd.shell rd.driver.pre=ahci

2. Run the RHCOS 4.2 PXE-like install with bonding against a bare-metal server

3. The installation gets stuck and does not finish



Actual results:
The RHCOS 4.2 installation gets stuck and does not finish.


Expected results:
The RHCOS 4.2 installation finishes successfully.


Additional info:

Running PXE-like install using a single network interface (without bonding configuration) works as expected.

Comment 5 Benjamin Chardi 2019-10-15 14:04:08 UTC
Created attachment 1625989
Failed RHCOS with bond - Emergency shell report at /run/initramfs/rdsosreport.txt

Comment 6 Benjamin Chardi 2019-10-15 14:06:54 UTC
Reproduced the same issue using the latest OCP/RHCOS versions:

https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/4.2.0-rc.5
https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.2.0-rc.5



PXE installation initialized with networkstatic=yes:

menuentry 'Install RHEL CoreOS' --class fedora --class gnu-linux --class gnu --class os {
	linux /images/vmlinuz nomodeset rd.neednet=1 coreos.inst=yes coreos.inst.install_dev="vda" coreos.inst.image_url=http://192.168.122.2:8001/rhcos-4.2.0-rc.5-x86_64-metal-uefi.raw.gz coreos.inst.ignition_url=http://192.168.122.2:8001/bootstrap.ign ip=192.168.122.99::192.168.122.1:255.255.255.0:ocp4-bootstrap.info.net:bond0:none rd.shell bond=bond0:ens3,ens4:mode=active-backup,miimon=100 nameserver=192.168.122.2 networkstatic=yes rd.shell rd.debug
	initrd /images/initramfs.img
}



>>> Are we sure we're not hitting something like: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-network_bonding_using_the_command_line_interface#sec-Check_if_Bonding_Kernel_Module_is_Installed where RHCOS does not have the right module installed/active?


The system then drops to an emergency shell and the RHCOS installation gets stuck. From the RHCOS emergency shell:


dracut:/# lsmod | grep -i bond
bonding               184320  0

So the bonding kernel module is loaded.


dracut:/# cat /proc/cmdline 
BOOT_IMAGE=/ostree/rhcos-1f8a02066bf1850bb60e814e7ffa9c7066494bd88f097eae08b37781f980cefe/vmlinuz-4.18.0-80.11.2.el8_0.x86_64 console=tty0 console=ttyS0,115200n8 rootflags=defaults,prjquota rw ignition.firstboot rd.neednet=1 ip=192.168.122.99::192.168.122.1:255.255.255.0:ocp4-bootstrap.info.net:bond0:none nameserver=192.168.122.2 bond=bond0:ens3,ens4:mode=active-backup,miimon=100 root=UUID=b4778d44-266f-458e-9c52-5665467785a0 ostree=/ostree/boot.0/rhcos/1f8a02066bf1850bb60e814e7ffa9c7066494bd88f097eae08b37781f980cefe/0 coreos.oem.id=metal ignition.platform.id=metal


"networkstatic=yes" does not appear here, although it was used in the PXE config (maybe this is the root issue?)
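
The comparison above can be automated. The following is a minimal Python sketch, assuming both command lines are available as plain strings; the function name missing_args is hypothetical, and the example strings are abridged to the network-related arguments from this comment (installer-only coreos.inst.* arguments are left out because they are not expected on the installed system).

```python
def missing_args(configured, booted):
    """Return arguments from the configured line that are absent after boot.

    Comparison is by exact whitespace-separated token, so a changed value
    (for example console=ttyS0 vs console=ttyS0,115200n8) counts as missing.
    """
    booted_tokens = set(booted.split())
    return [token for token in configured.split() if token not in booted_tokens]


pxe_args = (
    "rd.neednet=1 "
    "ip=192.168.122.99::192.168.122.1:255.255.255.0:ocp4-bootstrap.info.net:bond0:none "
    "bond=bond0:ens3,ens4:mode=active-backup,miimon=100 "
    "nameserver=192.168.122.2 networkstatic=yes"
)
proc_cmdline = (
    "rw ignition.firstboot rd.neednet=1 "
    "ip=192.168.122.99::192.168.122.1:255.255.255.0:ocp4-bootstrap.info.net:bond0:none "
    "nameserver=192.168.122.2 "
    "bond=bond0:ens3,ens4:mode=active-backup,miimon=100"
)
print(missing_args(pxe_args, proc_cmdline))
```

On a live system, the second string could instead be read from /proc/cmdline.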



I have also found a similar issue:
https://github.com/coreos/coreos-installer/issues/64


The rdsosreport.txt from the failed RHCOS install is attached to this case.
If you need more information from my side, please do not hesitate to ask. We need this feature working for OCP 4.2.

Comment 13 Baptiste Mille-Mathias 2019-10-22 15:49:36 UTC
We have the same problem.
Our RHCOS nodes are bare metal, installed using the virtual CD provided by the out-of-band management card.
We tried various configurations, such as providing the network configuration in the Ignition config or as parameters passed to the kernel; both methods failed.
We eventually fell back to using a single interface for now, but we are interested in getting this fixed as well.

Comment 17 acossett 2019-11-13 14:15:45 UTC
Any update on this issue? When can we expect the release?

Comment 18 ltourrea 2019-11-18 22:09:01 UTC
Hi, we need to have this issue fixed quickly. Please provide an update. We have to install OCP 4.2 on bare metal with bonded network interfaces.

Comment 20 Micah Abbott 2019-11-26 19:57:53 UTC
The fix for this has been confirmed to work by a number of the private comments and in QE.

Marking verified with 43.81.201911251600.0


Steps for reproduction

1.  Boot RHCOS installer ISO
2.  Provide the following kernel args (change IPs/interface names appropriately)

coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://192.168.124.1:9001/rhcos.metal.raw.gz coreos.inst.ignition_url=http://192.168.124.1:9001/ignition.json ip=192.168.124.199::192.168.124.1:255.255.255.0:rhcos:bond0:none bond=bond0:ens2,ens3:mode=active-backup,miimon=100 nameserver=192.168.124.1

3.  Allow install to complete
4.  Verify ifcfg script generated:

$ cat /etc/sysconfig/network-scripts/ifcfg-bond0 
# Generated by dracut initrd
NAME="bond0"
DEVICE="bond0"
ONBOOT=yes
NETBOOT=yes
UUID="c2bcabb7-cf61-46a4-829e-0cde8d279b59"
BOOTPROTO=none
IPADDR="192.168.124.199"
NETMASK="255.255.255.0"
GATEWAY="192.168.124.1"
BONDING_OPTS="mode=active-backup miimon=100"
NAME="bond0"
TYPE=Bond


The same steps should work if the `bond0` interface is configured to use DHCP as well.
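
The check in step 4 can also be scripted. The following is a minimal Python sketch, assuming the generated file uses simple KEY=value lines as shown above; the function name parse_ifcfg is hypothetical, and the sample text is abridged from the output in this comment.

```python
import shlex


def parse_ifcfg(text):
    """Parse simple KEY=value lines from an ifcfg file into a dict.

    Comments and blank lines are skipped, surrounding quotes are removed,
    and a repeated key (such as NAME above) keeps its last value.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        parts = shlex.split(value)
        settings[key] = parts[0] if parts else ""
    return settings


sample = '''# Generated by dracut initrd
NAME="bond0"
DEVICE="bond0"
ONBOOT=yes
BOOTPROTO=none
IPADDR="192.168.124.199"
BONDING_OPTS="mode=active-backup miimon=100"
NAME="bond0"
TYPE=Bond
'''

config = parse_ifcfg(sample)
# BONDING_OPTS is a space-separated list of key=value pairs.
bonding_opts = dict(item.split("=", 1) for item in config["BONDING_OPTS"].split())
print(config["TYPE"], config["DEVICE"], bonding_opts)
```

Comparing bonding_opts against the options given in the bond= kernel argument is one way to confirm that the configuration was persisted as intended.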

Comment 21 ltourrea 2019-11-27 05:44:20 UTC
Since it is a bug, we need to know if that fix will be backported to 4.2.
Please provide an update of this.

Comment 22 Micah Abbott 2019-11-27 16:04:56 UTC
(In reply to ltourrea from comment #21)
> Since it is a bug, we need to know if that fix will be backported to 4.2.
> Please provide an update of this.

The 4.2 backport will be tracked by BZ#1767771.  Because of the US holiday this week, the backport is unlikely to happen until next week.

Comment 23 acossett 2019-12-04 18:13:41 UTC
Any update on this?

Comment 26 umesh_sunnapu 2019-12-12 17:01:56 UTC
Team,

I have tried the steps mentioned by @Micah Abbott with rhcos-43.81.201912030353.0 (this is the version I see when I follow this link: https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/pre-release/latest/).

Below is how my PXE setup looks for each node that is part of the cluster:

menuentry 'bootstrap' --class fedora --class gnu-linux --class gnu --class os {
  linux rhcos/4.3/rhcos-43.81.201912030353.0-installer-kernel-x86_64 nomodeset rd.neednet=1 coreos.inst=yes coreos.inst.install_dev=nvme0n1 coreos.inst.image_url=http://100.82.38.11:8892/rhcos/4.3/rhcos-43.81.201912030353.0-metal.x86_64.raw.gz coreos.inst.ignition_url=http://100.82.38.11:8892/ignition/bootstrap.ign ip=100.82.38.36::100.82.38.1:255.255.255.0:rhcos:bond0:none bond=bond0:ens2f0,ens2f1:mode=active-backup,miimon=100 nameserver=100.82.38.11
  initrd rhcos/4.3/rhcos-43.81.201912030353.0-installer-initramfs.x86_64.img
}

The node is continuously rebooting. Also, I have a question regarding the value 'rhcos' specified in the ip portion of the kernel line (100.82.38.1:255.255.255.0:rhcos:bond0:none). Does it refer to the hostname, or does it have to be 'rhcos' by default?

Please let me know if you need any other additional information to help resolve this.

Comment 27 Micah Abbott 2019-12-12 20:28:56 UTC
(In reply to umesh_sunnapu from comment #26)

> The node is continuously rebooting.

We'll need the logs from the console to determine the cause for the reboot loop.  See the following for information to grab - https://github.com/openshift/os/blob/master/FAQ.md#q-how-do-i-debug-ignition-failures

> Also, I have a question regarding the value 'rhcos'
> specified in the ip portion of the kernel line
> (100.82.38.1:255.255.255.0:rhcos:bond0:none). Does it refer to the hostname,
> or does it have to be 'rhcos' by default?

In my example, `rhcos` was just a one-off hostname.  In production environments, it should be the hostname of the node.

NOTE: that `ip=` line is the same that would be specified if you were using RHEL.  I understand RHCOS is a different beast to work with, but at its core there is still mostly RHEL.

See associated docs:  https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-configuring_ip_networking_from_the_kernel_command_line

Also this KBase has an example for using the `ip=` line with RHCOS - https://access.redhat.com/solutions/4531011
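
To make the field order concrete, here is a minimal Python sketch that labels the colon-separated fields of a static `ip=` argument as used in this thread. It is not dracut's parser: the function name parse_static_ip_arg and the field labels are illustrative, and it assumes IPv4 addresses only, since bracketed IPv6 addresses contain colons of their own.

```python
# Field order of the static form used in this thread:
# ip=<client-IP>:<peer>:<gateway>:<netmask>:<hostname>:<interface>:<autoconf>
IP_FIELDS = (
    "client_ip", "peer", "gateway", "netmask",
    "hostname", "interface", "autoconf",
)


def parse_static_ip_arg(arg):
    """Label the fields of a static ip= kernel argument (IPv4 only)."""
    prefix = "ip="
    if not arg.startswith(prefix):
        raise ValueError("not an ip= argument: %r" % arg)
    values = arg[len(prefix):].split(":")
    if len(values) < len(IP_FIELDS):
        raise ValueError("expected at least %d fields" % len(IP_FIELDS))
    parsed = dict(zip(IP_FIELDS, values))
    # Anything after the autoconf field (for example a DNS server) is kept as-is.
    parsed["extra"] = values[len(IP_FIELDS):]
    return parsed


fields = parse_static_ip_arg(
    "ip=100.82.38.36::100.82.38.1:255.255.255.0:rhcos:bond0:none"
)
print(fields["hostname"], fields["interface"], fields["autoconf"])
```

In this layout the fifth field is the hostname, so `rhcos` in the example is simply the name given to that node.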

Comment 28 hiroshi.suganami 2019-12-13 03:47:35 UTC
Will it be modified so that bonding can be configured with the OpenShift 4.2 installation? Is it possible to set tagged VLAN for bonding setting?

Comment 29 Micah Abbott 2019-12-13 16:18:51 UTC
(In reply to hiroshi.suganami from comment #28)
> Will it be modified so that bonding can be configured with the OpenShift 4.2
> installation? Is it possible to set tagged VLAN for bonding setting?

Please see BZ#1767771 for progress related to OCP 4.2 around this issue.

Comment 30 umesh_sunnapu 2019-12-13 16:55:58 UTC
Created attachment 1644892
boot messages during OCP 4.3 install

Comment 31 umesh_sunnapu 2019-12-13 16:56:37 UTC
Created attachment 1644893
boot messages during OCP 4.3 install - screenshot 2

Comment 32 umesh_sunnapu 2019-12-13 16:58:45 UTC
@Micah Abbott, it does not look like bonding alone is the issue. I tested an OCP 4.3 install without bonding and it still goes into continuous reboots.

Please check the attached images. I could not find any other way to get to the logs, as I don't see any shell prompt during the failure.

Can you please tell me which version you tested most recently, with links to it? I can give that a try on our bare-metal servers and see where we land.

Comment 33 Baptiste Mille-Mathias 2019-12-13 17:08:24 UTC
(In reply to umesh_sunnapu from comment #32)
> @Micah Abbott, it does not look like bonding alone is the issue. I tested an
> OCP 4.3 install without bonding and it still goes into continuous reboots.
> 

So please open a different issue to continue discussion there.

Comment 34 Steve Milner 2020-01-09 21:38:30 UTC
I've opened up https://bugzilla.redhat.com/show_bug.cgi?id=1789601 for continued work on what looks like a regression.

Comment 36 errata-xmlrpc 2020-05-13 21:26:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

