Bug 1897660 - Unable to use a bonded device ( bond0 ) on a vlan via UPI install of node workers
Summary: Unable to use a bonded device ( bond0 ) on a vlan via UPI install of node wor...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Dusty Mabe
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1940871
Reported: 2020-11-13 17:39 UTC by mharris
Modified: 2024-03-25 17:04 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1940871 (view as bug list)
Environment:
Last Closed: 2021-04-12 13:49:55 UTC
Target Upstream Version:
Embargoed:


Attachments
bond0.nmconnection (292 bytes, text/plain)
2020-11-18 17:26 UTC, mharris
vlan291.nmconnection (354 bytes, text/plain)
2020-11-18 17:36 UTC, mharris
eno5.nmconnection (221 bytes, text/plain)
2020-11-18 17:37 UTC, mharris
eno6.nmconnection (221 bytes, text/plain)
2020-11-18 17:48 UTC, mharris
ens1f0.nmconnection (225 bytes, text/plain)
2020-11-18 17:48 UTC, mharris
ens1f1.nmconnection (225 bytes, text/plain)
2020-11-18 17:49 UTC, mharris

Description mharris 2020-11-13 17:39:59 UTC
Description of problem:

We have been unable to create a bonded device with vlan tagging.

We have been able to complete each task independently: we can create bonded devices, and we can add vlan tagging to individual interfaces (e.g. eno5, as opposed to bond0). However, we have been unable to apply vlan tagging to the bond0 (or bond0.291) interface.


Version-Release number of selected component (if applicable):

4.6.1


How reproducible:

Every time.


Steps to Reproduce:
1. Setup pxeboot to boot/install worker node

Actual results:

We end up with a bonded device that can't do vlan tagging.


Expected results:

A bonded device, bond0 or bond0.291, that can do vlan tagging.


Additional info:

Comment 3 Dusty Mabe 2020-11-13 22:38:56 UTC
I think you are using the wrong device name. From the dracut.cmdline man page:

```
vlan=<vlanname>:<phydevice>
    Setup vlan device named <vlanname> on <phydevice>. We support
    the four styles of vlan names: VLAN_PLUS_VID (vlan0005),
    VLAN_PLUS_VID_NO_PAD (vlan5), DEV_PLUS_VID (eth0.0005),
    DEV_PLUS_VID_NO_PAD (eth0.5)
```

So if I have

- bond=bond0:ens2,ens3:mode=active-backup,miimon=100
- vlan=bond0.291:bond0

Then my device name is `bond0.291` and I would then add:

- ip=bond0.291:dhcp


All of that is to say: the first entry before the colon in `vlan=` and in `ip=` needs to match.

For a full example see: https://bugzilla.redhat.com/show_bug.cgi?id=1857532#c13
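
The matching rule above can be checked mechanically before booting. The following is a minimal sketch, not part of dracut or NetworkManager: the helper names `karg_values` and `unmatched_vlans` are hypothetical, and it assumes only the short `ip=<device>:<method>` form is used (other `ip=` forms, such as a bare `ip=dhcp` or the full static form, are not handled). The `bad` command line is an invented counterexample.

```python
def karg_values(cmdline, key):
    """Return the value of every `key=value` argument on a kernel command line."""
    prefix = key + "="
    return [arg[len(prefix):] for arg in cmdline.split() if arg.startswith(prefix)]


def unmatched_vlans(cmdline):
    """Return vlan= device names that no ip=<device>:<method> argument refers to.

    Assumption: every ip= argument uses the short <device>:<method> form.
    """
    vlan_names = [value.split(":", 1)[0] for value in karg_values(cmdline, "vlan")]
    ip_devices = {value.split(":", 1)[0] for value in karg_values(cmdline, "ip")}
    return [name for name in vlan_names if name not in ip_devices]


# Arguments from the comment above: vlan= and ip= both name bond0.291.
good = ("bond=bond0:ens2,ens3:mode=active-backup,miimon=100 "
        "vlan=bond0.291:bond0 ip=bond0.291:dhcp")
# Hypothetical mistake: ip= names the bond instead of the vlan device.
bad = ("bond=bond0:ens2,ens3:mode=active-backup,miimon=100 "
       "vlan=bond0.291:bond0 ip=bond0:dhcp")

print(unmatched_vlans(good))
print(unmatched_vlans(bad))
```

An empty list means every `vlan=` device has a matching `ip=` entry; any name returned is a vlan device that would be created without an IP configuration pointing at it.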

Comment 4 Dusty Mabe 2020-11-14 15:26:30 UTC
This bug needs more information. It is not scheduled to be worked on in the current sprint.

Comment 5 Louis Santillan 2020-11-16 17:42:50 UTC
(In reply to Dusty Mabe from comment #3)
> I think you are using the wrong device name. From the dracut.cmdline man
> page:
> 
> ```
> vlan=<vlanname>:<phydevice>
>     Setup vlan device named <vlanname> on <phydevice>. We support
>     the four styles of vlan names: VLAN_PLUS_VID (vlan0005),
>     VLAN_PLUS_VID_NO_PAD (vlan5), DEV_PLUS_VID (eth0.0005),
>     DEV_PLUS_VID_NO_PAD (eth0.5)
> ```
> 
> So if I have
> 
> - bond=bond0:ens2,ens3:mode=active-backup,miimon=100
> - vlan=bond0.291:bond0
> 
> Then my device name is `bond0.291` and I would then add:
> 
> - ip=bond0.291:dhcp
> 
> 
> All of that is to say the first entry before the colon in `vlan=` and `ip=`
> need to match.
> 
> For a full example see:
> https://bugzilla.redhat.com/show_bug.cgi?id=1857532#c13

Dusty, we did try this kernel parameter set. However, at some point just before communicating with the ignition server, NetworkManager believes bond0 is down (though bond0.291 may have an IP) and decides to tear down bond0, taking bond0.291 with it.

Comment 6 Dusty Mabe 2020-11-16 21:28:01 UTC
Hey Louis,

It sounds like it's trying to get DHCP on both the bond and the vlan. I just did some local tests.

Can you try with something like this:

```
ip=vlan291:dhcp ip=bond0:off bond=bond0:ens2,ens3:mode=active-backup,miimon=100 vlan=vlan291:bond0
```

There are two things going on here:

1. NetworkManager is trying DHCP on bond0 AND vlan291. I don't know if this is appropriate behavior but you can tell it not to with `ip=bond0:off`.
2. There is a bug I just found with NetworkManager parsing the `DEV_PLUS_VID` form. https://bugzilla.redhat.com/show_bug.cgi?id=1898294 So use the VLAN_PLUS_VID_NO_PAD (`vlan291`) form for now.
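
To keep both workarounds together when generating PXE entries, the argument set can be assembled from its parts. This is only a sketch under stated assumptions: `bonded_vlan_kargs` is a hypothetical helper, the default bond options are simply the ones used in this comment, and the vlan device is always named in the `vlan<id>` style.

```python
def bonded_vlan_kargs(bond, members, vlan_id,
                      bond_opts="mode=active-backup,miimon=100"):
    """Build kernel arguments for DHCP on a vlan stacked on a bond.

    The bond itself gets ip=<bond>:off so DHCP is not attempted on it, and
    the vlan uses the vlan<id> naming style rather than <dev>.<id>.
    """
    vlan = f"vlan{vlan_id}"
    return " ".join([
        f"ip={vlan}:dhcp",
        f"ip={bond}:off",
        f"bond={bond}:{','.join(members)}:{bond_opts}",
        f"vlan={vlan}:{bond}",
    ])


print(bonded_vlan_kargs("bond0", ["ens2", "ens3"], 291))
```

With these inputs the helper reproduces the argument string suggested above; the member interface names would need to be replaced with the ones present on the target machine.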

Comment 9 mharris 2020-11-16 23:31:21 UTC
Dusty,

After trying the new boot config you gave us, it still appears that bond0 is not being put on the vlan.

One useful thing to note about this configuration change: after getting into the emergency console, we saw that bond0 existed and continued to exist, which I don't believe we have seen before.

Comment 10 Dusty Mabe 2020-11-17 01:53:21 UTC
It looks to me like vlan291 just times out and doesn't get DHCP in the log you posted. Any chance you have access to the DHCP server logs and can confirm the requests are coming through?

Also, when you get to the emergency shell can you manually set up a vlan on top of the existing bond0 and set a static IP and talk to other nodes on the network? Something like:

```
ip link add link bond0 name vlan291 type vlan id 291
ip addr add 192.168.200.1/24 dev vlan291
ip link set vlan291 up
```

Of course put in an IP address that makes sense for your network.

This test will give us a sanity check on the network setup to make sure it's what we expect.
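
For repeating this check with a different parent device, vlan ID, or address, the three commands can be generated rather than retyped. The sketch below only builds the command strings and does not run them; `manual_vlan_commands` is a hypothetical helper, and the 1 to 4094 range check assumes the usual usable 802.1Q vlan ID range.

```python
import ipaddress


def manual_vlan_commands(parent, vlan_id, cidr):
    """Return the ip(8) commands for a manual vlan-on-parent sanity check."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"vlan id out of range: {vlan_id}")
    # ip_interface() raises ValueError if the address/prefix is malformed.
    iface = ipaddress.ip_interface(cidr)
    name = f"vlan{vlan_id}"
    return [
        f"ip link add link {parent} name {name} type vlan id {vlan_id}",
        f"ip addr add {iface.with_prefixlen} dev {name}",
        f"ip link set {name} up",
    ]


for command in manual_vlan_commands("bond0", 291, "192.168.200.1/24"):
    print(command)
```

Validating the address up front catches typos before anything is typed into the emergency shell, where mistakes are slower to undo.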

Comment 11 mharris 2020-11-17 04:30:30 UTC
(In reply to Dusty Mabe from comment #10)
> It looks to me like vlan291 just times out and doesn't get DHCP in the log
> you posted. Any chance you have access to the DHCP server logs and can
> confirm the requests are coming through?
> 
> Also, when you get to the emergency shell can you manually set up a vlan on
> top of the existing bond0 and set a static IP and talk to other nodes on the
> network? Something like:
> 
> ```
> ip link add link bond0 name vlan291 type vlan id 291
> ip addr add 192.168.200.1/24 dev vlan291
> ip link set vlan291 up
> ```
> 
> Of course put in an IP address that makes sense for your network.
> 
> This test will give us a sanity check on the network setup to make sure it's
> what we expect.

Dusty,

That was how we got our network connection, in order to "off-load" those boot logs. We basically completed the following:

ip link add link bond0 name bond0.291 type vlan id 291
ip link set bond0.291 up
ip addr add 10.32.161.16/24 dev bond0.291
ip route add 10.32.161.0/24 via 10.32.161.254 dev bond0.291
ip route add default via 10.32.161.254

I was watching the DHCP server, on the network, as the system booted. I saw the original dhcp request during the pxeboot process, but I didn't see any other requests come in, valid, or otherwise.

Comment 12 Dusty Mabe 2020-11-17 18:39:22 UTC
Thanks for the quick debugging and updates @mharris. If you have a machine in that current state (emergency shell), can you also share the files and contents in /run/NetworkManager/system-connections/ ?


Other than that piece of information let's start peeling away some pieces and see if we can figure out where the problem is.

Can we remove bonding from the equation, just as a sanity check? Try something like:

- ip=vlan291:dhcp vlan=vlan291:eno5 ip=eno5:off ip=eno6:off ip=ens1f0:off ip=ens1f1:off

When we have that piece of information then we can start to investigate the bonding bit specifically.

Comment 13 mharris 2020-11-18 17:26:00 UTC
Created attachment 1730617 [details]
bond0.nmconnection

bond0.nmconnection from /run/NetworkManager/system-connections/

Comment 14 mharris 2020-11-18 17:36:34 UTC
Created attachment 1730622 [details]
vlan291.nmconnection

Comment 15 mharris 2020-11-18 17:37:15 UTC
Created attachment 1730623 [details]
eno5.nmconnection

Comment 16 mharris 2020-11-18 17:48:09 UTC
Created attachment 1730624 [details]
eno6.nmconnection

Comment 17 mharris 2020-11-18 17:48:37 UTC
Created attachment 1730625 [details]
ens1f0.nmconnection

Comment 18 mharris 2020-11-18 17:49:36 UTC
Created attachment 1730631 [details]
ens1f1.nmconnection

Comment 19 mharris 2020-11-18 18:46:55 UTC
Hi Dusty,

The following was used:

default menu.c32   
 prompt 1
 timeout 9
 ONTIMEOUT 1
 menu title ######## PXE Boot Menu ########  
 label 1
 menu label ^1) Install Worker Node
 menu default
 kernel rhcos/kernel
 append initrd=rhcos/initramfs.img nomodeset rd.neednet=1 coreos.inst.insecure coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://10.32.161.242:8080/np/install/bios.raw.gz coreos.live.rootfs_url=http://10.32.161.242:8080/np/install/rootfs.img coreos.inst.ignition_url=http://10.32.161.242:8080/np/ignition/worker.ign ip=vlan291:dhcp vlan=vlan291:eno5 ip=eno5:off ip=eno6:off ip=ens1f0:off ip=ens1f1:off


This boot/install failed; once again it appears that the device, vlan291 (on eno5), wasn't on the vlan.

It errored out, with the error we are seeing when it isn't on the same vlan: "premature end of input data at offset 0"

Comment 20 Dusty Mabe 2020-11-23 16:54:22 UTC
I did a debugging session with Mike last week.

We were able to get to the emergency shell and methodically modify the NetworkManager configuration files to determine that some of the interfaces would properly get DHCP (vlan tagged) from the DHCP server, while others would not.

He then tried to start up the server with `ip=vlan291:dhcp vlan=vlan291:ens1f0 ip=eno5:off ip=eno6:off ip=ens1f0:off ip=ens1f1:off` and it worked, while substituting in some other interfaces for the vlan did not.
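
The per-interface substitution can be enumerated so each NIC is tried with an otherwise identical argument set. This is a sketch only: `per_nic_vlan_kargs` is a hypothetical helper, and the NIC list is the one that appears in the arguments quoted in this bug.

```python
def per_nic_vlan_kargs(nics, vlan_id):
    """Map each NIC to kernel arguments that put the vlan on that NIC alone.

    Every physical NIC gets ip=<nic>:off so only the vlan device uses DHCP.
    """
    vlan = f"vlan{vlan_id}"
    all_off = " ".join(f"ip={nic}:off" for nic in nics)
    return {nic: f"ip={vlan}:dhcp vlan={vlan}:{nic} {all_off}" for nic in nics}


variants = per_nic_vlan_kargs(["eno5", "eno6", "ens1f0", "ens1f1"], 291)
for nic, kargs in variants.items():
    print(f"{nic}: {kargs}")
```

Booting once per variant and recording which ones obtain a lease gives a table of which NICs can actually reach the tagged vlan.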

Currently I think this is an issue where some NICs are attached to the appropriate physical networks while others are not, i.e. this is a non-software misconfiguration and not a bug in the vlan/bonding enablement/configuration that occurs in the initramfs.

Mike, please provide any additional info to support or deny the current theory.

Thanks!

Comment 25 Dusty Mabe 2020-12-04 22:36:45 UTC
This bug needs more information. It is not scheduled to be worked on in the current sprint.

Comment 27 Dusty Mabe 2020-12-14 21:22:56 UTC
Without any new information beyond the current investigation in comment#20 I'm going to close this as NOTABUG. Please re-open if there is new information to add.

Comment 30 Micah Abbott 2021-01-15 20:35:31 UTC
Higher priority work has prevented this issue from being worked on; adding UpcomingSprint keyword

Comment 34 Micah Abbott 2021-01-19 15:00:44 UTC
It is unlikely we will be able to get this addressed as part of the 4.7 release, given that we need additional data and investigation for the issue.  Going to target this for 4.8 until we get more information.

Comment 40 Dusty Mabe 2021-04-12 13:49:55 UTC
Still waiting on new information from the customer (requested in private comment #31 on 2021-01-15). Without new information I'm going to close this as NOTABUG. Please do re-open if we can get access to that new information.

Comment 41 Red Hat Bugzilla 2023-09-15 00:51:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

