Bug 1956641 - Kernel args do not configure baremetal worker node networking with bond device and tagged vlan
Summary: Kernel args do not configure baremetal worker node networking with bond devic...
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dusty Mabe
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-04 06:45 UTC by Duncan Milburn
Modified: 2021-05-11 11:26 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Duncan Milburn 2021-05-04 06:45:19 UTC
Description of problem:
Using kernel args to configure NIC bonding and VLAN tagging does not succeed.

OCP Version: 4.7.0
Platform: Baremetal
Architecture: x86_64


Configuring a bonded network interface with VLAN tagging using the following kernel args on the RHCOS 4.7.0 live ISO results in the interface failing to reach UP status:

bond=bond0:ens3f0,ens3f1:mode=active-backup ip=10.0.0.101::10.0.0.254:255.255.252.0:server01:bond0.2001:none:10.0.0.2:10.0.0.3 nomodeset rd.neednet=1 ipv6.disable=1 ignition.firstboot ignition.platform.id=metal systemd.unified_cgroup_hierarchy=0 coreos.inst.install_dev=sdb  coreos.inst=yes coreos.live.rootfs_url=<URL_to_img> coreos.inst.ignition_url=<URL_to_worker.ign> initrd=<rhcos-live-rootfs> vlan=bond0.2001:bond0 ip=bond0:off ip=ens3f0:off ip=ens3f1:off

However, bringing up the interface manually from the RHCOS live environment succeeds.
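
For readers checking the static `ip=` argument above field by field, here is a minimal Python sketch that splits it into named parts. The field order follows the static form documented in dracut.cmdline(7); the function name `parse_static_ip_arg` and the field labels are illustrative assumptions, and the sketch only handles IPv4 values.

```python
from typing import Dict

# Field order of the static form of dracut's ip= argument:
# ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<client_hostname>:
#    <interface>:<autoconf>[:[<dns1>][:<dns2>]]
# The labels below are illustrative names, not dracut identifiers.
STATIC_IP_FIELDS = [
    "client_ip", "peer", "gateway", "netmask",
    "hostname", "interface", "autoconf", "dns1", "dns2",
]


def parse_static_ip_arg(arg: str) -> Dict[str, str]:
    """Split an IPv4 static ip= kernel argument into named fields.

    IPv6 addresses (bracketed, containing colons) are not handled.
    """
    if not arg.startswith("ip="):
        raise ValueError("not an ip= argument: %r" % arg)
    parts = arg[len("ip="):].split(":")
    if len(parts) < 7 or len(parts) > len(STATIC_IP_FIELDS):
        raise ValueError("unexpected number of fields: %d" % len(parts))
    # Pad the optional trailing DNS fields with empty strings.
    parts += [""] * (len(STATIC_IP_FIELDS) - len(parts))
    return dict(zip(STATIC_IP_FIELDS, parts))
```

Applied to the argument in this report, the `interface` field comes out as `bond0.2001`, which is the device named on the left-hand side of `vlan=bond0.2001:bond0`, and the `peer` field is empty.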


Please note that all of the below have been consulted:

Bug 1897660 - Unable to use a bonded device ( bond0 ) on a vlan via UPI install of node workers: https://bugzilla.redhat.com/show_bug.cgi?id=1897660
Bug 1857532 - [sig-network] About vlan parameter of dracut doesn't work on RHCOS: https://bugzilla.redhat.com/show_bug.cgi?id=1857532
Bug 1882781 - nameserver= option to dracut creates extra NM connection profile: https://bugzilla.redhat.com/show_bug.cgi?id=1882781
Bug 1940871 - Unable to use a bonded device ( bond0 ) on a vlan via UPI install of node workers: https://bugzilla.redhat.com/show_bug.cgi?id=1940871

Comment 1 Micah Abbott 2021-05-04 13:35:39 UTC
There is not enough information provided to understand what has gone wrong.  Please answer the following questionnaire as best as possible:



OCP Version at Install Time:
RHCOS Version at Install Time:
OCP Version after Upgrade (if applicable):
RHCOS Version after Upgrade (if applicable):
Platform: AWS, Azure, bare metal, GCP, vSphere, etc
Architecture: x86_64/ppc64le/s390x


What are you trying to do? What is your use case?


What happened? What went wrong or what did you expect?


What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.


If you're having problems booting/installing RHCOS, please provide:
- the full contents of the serial console showing disk initialization, network configuration, and Ignition stage (see https://access.redhat.com/articles/7212 for information about configuring your serial console)
- Ignition JSON
- output of `journalctl -b`


If you're having problems post-upgrade, please provide:
- A complete must-gather (`oc adm must-gather`)


If you're having SELinux related issues, please provide:
- The full `/var/log/audit/audit.log` file
- Were any SELinux modules or booleans changed from the default configuration?
- The output of `ostree admin config-diff | grep selinux/targeted` on impacted nodes


Please add anything else that might be useful, for example:
- kernel command line (`cat /proc/cmdline`)
- contents of `/etc/NetworkManager/system-connections/`
- contents of `/etc/sysconfig/network-scripts/`

Comment 3 Duncan Milburn 2021-05-05 01:28:16 UTC
>> What are you trying to do? What is your use case?

Trying to install OCP on baremetal.

>> What happened? What went wrong or what did you expect?

Network configuration specified by kernel args does not bring up the interface as expected.

>> What are the steps to reproduce your issue? Please try to reduce these steps to something that can be reproduced with a single RHCOS node.

Boot RHCOS live ISO on server, with kernel args as specified (see attachment)

>> If you're having problems booting/installing RHCOS, please provide:
- the full contents of the serial console showing disk initialization, network configuration, and Ignition stage (see https://access.redhat.com/articles/7212 for information about configuring your serial console)

Serial console output is not practical, since the server is a blade server located in an off-site data centre.

Note: manual network configuration from within the live environment works OK (see log file)

Comment 5 Duncan Milburn 2021-05-05 02:21:36 UTC
To clarify what we're trying to achieve on the networking front...

- OCP is to be installed in a disconnected environment without DHCP => static IP addressing
- Physical server hardware presents two physical NICs, each connected to a separate physical network switch => create a bond device
- Network is not permitted to be untagged => only tagged vlan(s) permitted
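
The three requirements above map onto the `bond=`, `vlan=` and static `ip=` kernel arguments. As a rough sketch only, the following Python helper assembles those three strings from their parts, using the same syntax as the arguments in the description. The function name `build_network_kargs` and its parameters are hypothetical, and this reproduces the reported syntax rather than proposing a fix for the failure.

```python
from typing import List, Sequence


def build_network_kargs(
    bond: str,
    members: Sequence[str],
    bond_options: str,
    vlan_id: int,
    address: str,
    gateway: str,
    netmask: str,
    hostname: str,
    dns: Sequence[str] = (),
) -> List[str]:
    """Return bond=, vlan= and static ip= arguments in dracut syntax.

    Hypothetical helper: it only formats strings and does not validate
    the addresses or interface names it is given.
    """
    if len(dns) > 2:
        raise ValueError("the static ip= form carries at most two DNS servers")
    vlan_dev = f"{bond}.{vlan_id}"  # e.g. bond0.2001
    # Second field (peer) is left empty, as in the report.
    ip_fields = [address, "", gateway, netmask, hostname, vlan_dev, "none", *dns]
    return [
        f"bond={bond}:{','.join(members)}:{bond_options}",
        f"vlan={vlan_dev}:{bond}",
        "ip=" + ":".join(ip_fields),
    ]
```

Called with the values from the description (`bond0`, `ens3f0` and `ens3f1`, `mode=active-backup`, VLAN 2001, and the 10.0.0.x addresses), it yields the same `bond=`, `vlan=` and `ip=10.0.0.101::...` strings that were passed on the kernel command line.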

FWIW, we had been attempting to install OCP 4.6 (with no success), and then tried 4.7 because of comments in the linked BZs.

Comment 7 Dusty Mabe 2021-05-05 14:07:11 UTC
Hey Duncan,

I have a few questions/suggestions.

- The screenshot you shared shows `Installing Red Hat Enterprise Linux CoreOS 46.82.202012051820-0`. I'd like to standardize on trying to reproduce this problem on RHCOS 4.7. I'm trying locally with 4.7 and can't reproduce any issues. Can you share logs from an attempt with 4.7?

- If you're booting the Live ISO you shouldn't need to provide `coreos.live.rootfs_url=<URL_to_img>`. The rootfs is embedded in the ISO.

Comment 8 Duncan Milburn 2021-05-05 22:37:24 UTC
Thanks Dusty - sorry, my bad - I'll get a fresh capture from the customer.

Comment 9 Duncan Milburn 2021-05-07 06:21:22 UTC
Created attachment 1780600 [details]
RHCOS boot kernel args

Updated kernel boot arguments from customer

Comment 11 Dusty Mabe 2021-05-07 15:02:47 UTC
There's not really much new information here. I could really use a journal log from the boot of a 4.7 machine and a serial console log from the entire install (use console=ttyS0). If you can't get a serial console from the remote customer system then you can try in a VM or on your hardware if you have some.

I could also use a step-by-step summary of the exact steps you are taking and what you expect to work that's not working.

Without more information I'll take a stab in the dark: I notice you are installing to `sdb`. Any chance that `sda` is what is getting booted after the install?

Comment 14 Duncan Milburn 2021-05-10 06:39:06 UTC
The steps to reproduce are:

1) attach ISO image to server iLO
2) power up server
3) at grub prompt, press e
4) add kernel args
5) press ctrl+x

Any/all of the ip= arguments disappear, and the live environment does not have the network information it needs to proceed.

A third network interface appears in the journal log; I have asked the customer to disable or remove it and retry.


Please note that no `sda` appears, as shown in the attached block device screenshot.

Comment 15 Dusty Mabe 2021-05-10 14:14:54 UTC
(In reply to Duncan Milburn from comment #14)
 
> Any/all of the ip= arguments disappear,

Well that is concerning and obviously something we need to fix before you'll be able to proceed.

Can you try or have you tried to reproduce this in a VM? I can't reproduce here in a local VM (using virt-install with `--boot=uefi` so that I get GRUB and not isolinux.)

It also might be worth trying with a different server of the same hardware type, and also with servers of different hardware types, to give us more data points.

Comment 16 Dusty Mabe 2021-05-10 14:18:06 UTC
When I say "reproduce this", I'm specifically referring to typing kernel arguments into GRUB and having them disappear once the system is booted (i.e. not on the kernel command line).
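
One way to check this on a booted system is to compare the arguments typed into GRUB against the contents of `/proc/cmdline`. The Python sketch below does that comparison. The function names `read_cmdline` and `missing_kernel_args` are hypothetical, and the whitespace split assumes none of the arguments contain quoted spaces.

```python
from pathlib import Path
from typing import Iterable, List


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the kernel command line of the running system."""
    return Path(path).read_text()


def missing_kernel_args(cmdline: str, expected: Iterable[str]) -> List[str]:
    """Return the expected arguments that are absent from cmdline.

    Arguments are compared as whole whitespace-separated tokens, so
    this assumes no argument contains quoted spaces.
    """
    present = set(cmdline.split())
    return [arg for arg in expected if arg not in present]
```

On the live system, `missing_kernel_args(read_cmdline(), [...])` with the list of arguments entered at the GRUB prompt would return any that did not reach the kernel command line; an empty list means they all arrived.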

Comment 17 Duncan Milburn 2021-05-11 11:26:13 UTC
I've not been able to reproduce the behaviour in vSphere either.

I've requested a remote session with the customer to verify and collect more data.

