Bug 1842887 - [UPI] [Baremetal] RHCOS 4.3 Installation doesn't work when using tagged vlan [need info]
Summary: [UPI] [Baremetal] RHCOS 4.3 Installation doesn't work when using tagged vlan...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dusty Mabe
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913
Reported: 2020-06-02 10:36 UTC by Abhinit Kumar
Modified: 2023-12-15 18:03 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-15 19:57:39 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5219001 0 None None None 2022-02-07 02:22:07 UTC

Description Abhinit Kumar 2020-06-02 10:36:25 UTC
Description of problem:

Unable to install RHCOS when booting from PXE with a tagged VLAN configuration.

On RHEL 7/8 the kernel arguments would look something like this:

vlan=eth0.vlan100:vlan100 ip=192.168.1.100::192.168.1.1:255.255.255.0:localhost:vlan100:none

However, the same arguments do not work for RHCOS when booted from PXE.

Version-Release number of selected component (if applicable):

Environment: OpenShift 4.3
OS:  RHCOS

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
RHCOS doesn't boot from PXE with tagged vlan

Expected results:
RHCOS should boot from PXE with tagged vlan

Additional info:

The onsite consultant worked around the issue with the following steps:

1. Downloaded the Fedora CoreOS live CD.
2. Booted from the CD and ran coreos-installer after setting up the IP address and the VLAN tag manually.
3. Edited the first boot from the server's hard disk, adding the ip=vlan01:dhcp vlan=vlan01:eth0 arguments.

Those steps eventually solved the issue and we were able to continue with the cluster deployment from that point.
I believe the issue lies in how the PXE initramfs.img handles the VLAN tag, because the initramfs image receives the arguments as expected.

Comment 2 Micah Abbott 2020-06-02 20:10:09 UTC
We are currently working on higher priority bugs and features in RHCOS.  The BZ has been targeted for the future 4.6 release of RHCOS/OCP and will be evaluated more thoroughly in the near future.

Comment 3 Dusty Mabe 2020-06-10 04:27:58 UTC
This bug has not been selected for work in the current sprint.

Comment 5 Dusty Mabe 2020-06-26 20:46:13 UTC
This bug has not been selected for work in the current sprint.

Comment 10 Dusty Mabe 2020-07-06 22:10:00 UTC
Hi Abhinit, Frédéric,

I'm trying to isolate the issue but I lack the appropriate infrastructure to do so. I have a setup with VLANs in my local environment using VMs, but I don't have the PXE component. I have confirmed that I can boot with appropriate kernel args for the VLAN and have it do the install off of a VLAN-tagged interface. The only piece I'm missing is the PXE boot.

Abhinit,

I see you said the consultant did a workaround by booting into the ISO, setting up the network and then running the install. Can you test doing something similar with the ISO except use kargs instead? This should get us a lot closer to the PXE workflow, but without actually introducing PXE yet. Here is what I did that worked:

1. Boot ISO
2. Stop at grub prompt
3. Update kargs to something like:
    - console=ttyS0 coreos.inst.install_dev=sda coreos.inst.ignition_url=http://192.168.201.100:8000/config.ign coreos.inst.image_url=http://192.168.201.100:8000/rhcos-4.3.8-x86_64-metal.x86_64.raw.gz vlan=ens2.100:ens2 ip=192.168.201.101::192.168.201.1:255.255.255.0:localhost:ens2.100:none

4. Press Enter and watch the install complete
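
As an illustration of how the vlan= and ip= arguments in step 3 fit together, here is a minimal Python sketch that assembles them from their parts. The function name vlan_kargs and its parameters are hypothetical and not part of any RHCOS or dracut tooling; the field order simply mirrors the static ip= example above, and the VLAN device is assumed to be named <parent>.<id> as in that example.

```python
def vlan_kargs(parent, vlan_id, address, gateway, netmask, hostname="localhost"):
    """Build vlan= and ip= kernel arguments for a static address on a VLAN.

    Hypothetical helper: the layout copies the example in step 3 above.
    """
    # VLAN device named <parent>.<id>, e.g. ens2.100 on top of ens2
    vlan_dev = f"{parent}.{vlan_id}"
    vlan_arg = f"vlan={vlan_dev}:{parent}"
    # Static form: ip=<address>::<gateway>:<netmask>:<hostname>:<interface>:none
    # (the empty second field is the unused peer address)
    ip_arg = f"ip={address}::{gateway}:{netmask}:{hostname}:{vlan_dev}:none"
    return f"{vlan_arg} {ip_arg}"


kargs = vlan_kargs("ens2", 100, "192.168.201.101", "192.168.201.1", "255.255.255.0")
print(kargs)
```

With the values from step 3 this reproduces the vlan= and ip= portion of that command line; the console= and coreos.inst.* arguments would still need to be added separately.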

Comment 14 Frederic Giloux 2020-07-07 13:52:09 UTC
My customer was successful configuring VLAN tagging with RHCOS without PXE boot. That said, it is still a major issue for them, as it is not practical to manage multiple large bare-metal installations without PXE boot.
Using the Fedora shim.efi and grubx64.efi binaries, the customer was able to PXE boot as far as grub, but *grub itself* seems unable to load grub.conf over the VLAN.

The customer setup is:
- The nodes have two bonded 10G network interfaces that they want to use for load balancing and failover (both connections active, no LACP)
- They have VLANs configured, so the PXE request has to come with a VLAN tag. This can be configured in the BIOS when using UEFI mode
- They want to use UEFI for booting (see previous point)
- The final node configuration has to be a bond interface with a VLAN configured

Comment 16 Dusty Mabe 2020-07-07 16:03:03 UTC
Hi Frederic, Abhinit,

It looks like only certain hardware has the ability to set the VLAN in the UEFI settings, so I'm a bit limited in trying to reproduce this locally. So we'll try to ask a few questions blind and see if we can work towards a solution.

Does this work when trying to PXE boot in a UEFI environment with plain RHEL8? Some searching landed me on this old mailing list thread which makes me wonder if support to grub was ever added for this: https://help-grub.gnu.narkive.com/iSM0NEe0/uefi-pxe-boot-to-grub2-with-bios-configured-vlan-tagging

Comment 17 Abhinit Kumar 2020-07-08 03:46:46 UTC
Booting from the ISO and adding the "ip=vlan01:dhcp vlan=vlan01:eth0" arguments on the command line works fine.
That is the workaround the onsite consultant followed for the installation.

However, the problem occurs when passing this information via PXE.

Comment 19 Abhinit Kumar 2020-07-08 04:00:42 UTC
(In reply to Dusty Mabe from comment #16)
> Hi Frederic, Abhinit,
> 
> It looks like only certain hardware has the ability to set the VLAN in the
> UEFI settings, so I'm a bit limited in trying to reproduce this locally. So
> we'll try to ask a few questions blind and see if we can work towards a
> solution.
> 
> Does this work when trying to PXE boot in a UEFI environment with plain
> RHEL8? Some searching landed me on this old mailing list thread which makes
> me wonder if support to grub was ever added for this:
> https://help-grub.gnu.narkive.com/iSM0NEe0/uefi-pxe-boot-to-grub2-with-bios-configured-vlan-tagging

Hello Dusty,

Below is the comment from the onsite consultant, who tried to boot the same configuration with RHEL 8.

~~~
it looks like a VLAN TAG issue with the kernel cmdline argument we are providing is not working as expected.
we tried to install RHEL (vlan=eth0.005:eth0) on those servers in the same matter and RHEL was able to deploy successfully.
~~~

Comment 20 Dusty Mabe 2020-07-09 14:01:28 UTC
(In reply to Abhinit Kumar from comment #19)
> 
> Hello Dusty,
> 
> Below is the comment from onsite consultant when tried to boot the same
> configuration with RHEL 8.
> 
> ~~~
> it looks like a VLAN TAG issue with the kernel cmdline argument we are
> providing is not working as expected.
> we tried to install RHEL (vlan=eth0.005:eth0) on those servers in the same
> matter and RHEL was able to deploy successfully.
> ~~~

Thanks Abhinit. I'm a bit confused. From reading this bug report the summary of the problem that I'm coming away with is:

- The servers have a special setting for telling NICs to use a VLAN tag during early boot (i.e., PXE).
- There is an issue when trying to PXE boot on a VLAN tagged network where PXE on UEFI either:
     - successfully pulls the kernel and initrd, but then things get stuck in grub
     - OR never gets the kernel and initrd at all

From that it seems like kargs like `vlan=` have no impact on the problem because those don't apply until dracut and we're getting stuck before that.

Comment 24 Dusty Mabe 2020-07-15 19:57:39 UTC
After discussing this with several grub developers it turns out this feature does not currently exist in grub (neither upstream nor in RHEL). The feature does happen to exist in RHEL for the ppc64le architecture because, at the time it was implemented, it was the only platform that passed VLAN information along. 

For this particular BZ I am going to close it as NOTABUG (since the feature never existed in the first place). For those customers currently affected by this problem:

1. I have opened an RFE against grub requesting this feature: BZ1857410. Please follow and add more information to BZ1857410 so the grub team can make appropriate prioritization decisions.
2. Unfortunately, for now you will most likely want to work around this by performing PXE-based operations on a non-VLAN-tagged network.

I will be free to answer questions if anyone would like to discuss this further.

