Description of problem:
We have been unable to create a bonded device with vlan tagging. We have been able to complete each task independently: we can create bonded devices, and we can add vlan tagging to individual interfaces (i.e. eno5, as opposed to bond0). However, we have been unable to apply vlan tagging to the bond0 (or bond0.291) interface.

Version-Release number of selected component (if applicable):
4.6.1

How reproducible:
Every time.

Steps to Reproduce:
1. Set up pxeboot to boot/install worker node

Actual results:
We end up with a bonded device that can't do vlan tagging.

Expected results:
A bonded device, bond0 or bond0.291, that can do vlan tagging.

Additional info:
I think you are using the wrong device name. From the dracut.cmdline man page:

```
vlan=<vlanname>:<phydevice>
    Setup vlan device named <vlanname> on <phydevice>. We support
    the four styles of vlan names: VLAN_PLUS_VID (vlan0005),
    VLAN_PLUS_VID_NO_PAD (vlan5), DEV_PLUS_VID (eth0.0005),
    DEV_PLUS_VID_NO_PAD (eth0.5)
```

So if I have

- bond=bond0:ens2,ens3:mode=active-backup,miimon=100
- vlan=bond0.291:bond0

Then my device name is `bond0.291` and I would then add:

- ip=bond0.291:dhcp

All of that is to say that the first entry before the colon must be the same in `vlan=` and `ip=`.

For a full example see:
https://bugzilla.redhat.com/show_bug.cgi?id=1857532#c13
This bug needs more information. It is not scheduled to be worked on in the current sprint.
(In reply to Dusty Mabe from comment #3)
> I think you are using the wrong device name. From the dracut.cmdline man
> page:
>
> ```
> vlan=<vlanname>:<phydevice>
>     Setup vlan device named <vlanname> on <phydevice>. We support
>     the four styles of vlan names: VLAN_PLUS_VID (vlan0005),
>     VLAN_PLUS_VID_NO_PAD (vlan5), DEV_PLUS_VID (eth0.0005),
>     DEV_PLUS_VID_NO_PAD (eth0.5)
> ```
>
> So if I have
>
> - bond=bond0:ens2,ens3:mode=active-backup,miimon=100
> - vlan=bond0.291:bond0
>
> Then my device name is `bond0.291` and I would then add:
>
> - ip=bond0.291:dhcp
>
> All of that is to say the first entry before the colon in `vlan=` and `ip=`
> need to match.
>
> For a full example see:
> https://bugzilla.redhat.com/show_bug.cgi?id=1857532#c13

Dusty, we did try this kernel parameter set. However, at some point just before communicating with the ignition server, NetworkManager believes bond0 is down (though bond0.291 may have an IP) and decides to tear down bond0, taking bond0.291 with it.
Hey Louis,

It sounds like it's trying to get DHCP on both the bond and the vlan. I just did some local tests. Can you try with something like this:

```
ip=vlan291:dhcp ip=bond0:off bond=bond0:ens2,ens3:mode=active-backup,miimon=100 vlan=vlan291:bond0
```

There are two things going on here:

1. NetworkManager is trying DHCP on bond0 AND vlan291. I don't know if this is appropriate behavior but you can tell it not to with `ip=bond0:off`.
2. There is a bug I just found with NetworkManager parsing the `DEV_PLUS_VID` form. https://bugzilla.redhat.com/show_bug.cgi?id=1898294

So use the VLAN_PLUS_VID_NO_PAD (`vlan291`) form for now.
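To avoid typos when adapting the arguments suggested above to other bond names or VLAN IDs, they can be assembled mechanically. This is only a minimal sketch: `bond_vlan_kargs` is a hypothetical helper (not part of dracut or NetworkManager), and the bond options are simply the ones used in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of dracut or NetworkManager): print the kernel
# arguments for DHCP on a vlan stacked on a bond, with DHCP disabled on the
# bond itself.
# $1 = bond name, $2 = comma-separated member NICs, $3 = VLAN ID
bond_vlan_kargs() {
    bond=$1 members=$2 vid=$3
    vlandev="vlan${vid}"   # VLAN_PLUS_VID_NO_PAD form, e.g. vlan291
    printf 'ip=%s:dhcp ip=%s:off ' "$vlandev" "$bond"
    printf 'bond=%s:%s:mode=active-backup,miimon=100 ' "$bond" "$members"
    printf 'vlan=%s:%s\n' "$vlandev" "$bond"
}

bond_vlan_kargs bond0 ens2,ens3 291
```

The same device name (`vlan291` here) is used in both `ip=` and `vlan=`, which is the requirement described earlier in the thread.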
Dusty,

We tried the new boot config you gave us, and it still appears that bond0 is not getting put on the vlan. One useful thing to note about this configuration change: after getting into the emergency console, we saw that bond0 existed and continued to exist, which I don't believe we have seen before.
It looks to me like vlan291 just times out and doesn't get DHCP in the log you posted. Any chance you have access to the DHCP server logs and can confirm the requests are coming through?

Also, when you get to the emergency shell can you manually set up a vlan on top of the existing bond0 and set a static IP and talk to other nodes on the network? Something like:

```
ip link add link bond0 name vlan291 type vlan id 291
ip addr add 192.168.200.1/24 dev vlan291
ip link set vlan291 up
```

Of course put in an IP address that makes sense for your network.

This test will give us a sanity check on the network setup to make sure it's what we expect.
(In reply to Dusty Mabe from comment #10)
> It looks to me like vlan291 just times out and doesn't get DHCP in the log
> you posted. Any chance you have access to the DHCP server logs and can
> confirm the requests are coming through?
>
> Also, when you get to the emergency shell can you manually set up a vlan on
> top of the existing bond0 and set a static IP and talk to other nodes on the
> network? Something like:
>
> ```
> ip link add link bond0 name vlan291 type vlan id 291
> ip addr add 192.168.200.1/24 dev vlan291
> ip link set vlan291 up
> ```
>
> Of course put in an IP address that makes sense for your network.
>
> This test will give us a sanity check on the network setup to make sure it's
> what we expect.

Dusty,

That was how we got our network connection in order to "off-load" those boot logs. We basically ran the following:

    ip link add link bond0 name bond0.291 type vlan id 291
    ip link set bond0.291 up
    ip addr add 10.32.161.16/24 dev bond0.291
    ip route add 10.32.161.0/24 via 10.32.161.254 dev bond0.291
    ip route add default via 10.32.161.254

I was watching the DHCP server on the network as the system booted. I saw the original DHCP request during the pxeboot process, but I didn't see any other requests come in, valid or otherwise.
Thanks for the quick debugging and updates @mharris. If you have a machine in that current state (emergency shell) can you also share the files and contents in /run/NetworkManager/system-connections/ ?

Other than that piece of information, let's start peeling away some pieces and see if we can figure out where the problem is. Can we remove bonding from the equation just to sanity check? So trying something like:

- ip=vlan291:dhcp vlan=vlan291:eno5 ip=eno5:off ip=eno6:off ip=ens1f0:off ip=ens1f1:off

When we have that piece of information then we can start to investigate the bonding bit specifically.
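One way to collect the requested connection profiles in a single paste from the emergency shell is a small loop like the one below. It is a sketch only: `dump_connections` is a hypothetical helper name, and it assumes `cat` and `printf` are available in the emergency shell.

```shell
#!/bin/sh
# Hypothetical helper: print every file in a connection-profile directory with
# a header line, so the output can be attached or pasted as one block.
# $1 = directory (defaults to the path requested above)
dump_connections() {
    dir=${1:-/run/NetworkManager/system-connections}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue     # skip if the glob matched nothing
        printf '==> %s <==\n' "$f"
        cat "$f"
        printf '\n'
    done
}

dump_connections
```

If the directory is empty or missing, the function prints nothing, which is itself useful to report.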
Created attachment 1730617 [details] bond0.nmconnection bond0.nmconnection from /run/NetworkManager/system-connections/
Created attachment 1730622 [details] vlan291.nmconnection
Created attachment 1730623 [details] eno5.nmconnection
Created attachment 1730624 [details] eno6.nmconnection
Created attachment 1730625 [details] ens1f0.nmconnection
Created attachment 1730631 [details] ens1f1.nmconnection
Hi Dusty,

The following was used:

    default menu.c32
    prompt 1
    timeout 9
    ONTIMEOUT 1
    menu title ######## PXE Boot Menu ########
    label 1
    menu label ^1) Install Worker Node
    menu default
    kernel rhcos/kernel
    append initrd=rhcos/initramfs.img nomodeset rd.neednet=1 coreos.inst.insecure coreos.inst=yes coreos.inst.install_dev=sda coreos.inst.image_url=http://10.32.161.242:8080/np/install/bios.raw.gz coreos.live.rootfs_url=http://10.32.161.242:8080/np/install/rootfs.img coreos.inst.ignition_url=http://10.32.161.242:8080/np/ignition/worker.ign ip=vlan291:dhcp vlan=vlan291:eno5 ip=eno5:off ip=eno6:off ip=ens1f0:off ip=ens1f1:off

This boot/install failed. Once again it appears that the device, vlan291 (on eno5), wasn't on the vlan. It failed with the error we see whenever the device isn't on the same vlan: "premature end of input data at offset 0"
I did a debugging session with Mike last week. We were able to get to the emergency shell and methodically modify the NetworkManager configuration files, and determined that some of the interfaces would properly get DHCP from the DHCP server (vlan tagged), while others would not. He then tried to start up the server with `ip=vlan291:dhcp vlan=vlan291:ens1f0 ip=eno5:off ip=eno6:off ip=ens1f0:off ip=ens1f1:off` and it worked, while substituting in some other interfaces for the vlan did not.

Currently I think the issue is that some NICs are attached to the appropriate physical networks while others are not, i.e. this is a non-software misconfiguration and not a bug in the vlan/bonding enablement/configuration that occurs in the initramfs.

Mike, please provide any additional info to support or refute the current theory. Thanks!
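The per-interface substitution described above can be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper named `per_nic_vlan_kargs`; it only prints candidate kernel arguments (one line per NIC) and does not touch the network.

```shell
#!/bin/sh
# Hypothetical helper: for each candidate NIC, print the kernel arguments that
# put the vlan on that NIC alone and disable DHCP on every physical NIC.
# $1 = VLAN ID, remaining arguments = NIC names
per_nic_vlan_kargs() {
    vid=$1; shift
    off=""
    for nic in "$@"; do
        off="$off ip=$nic:off"      # accumulate " ip=<nic>:off" for all NICs
    done
    for nic in "$@"; do
        printf '%s: ip=vlan%s:dhcp vlan=vlan%s:%s%s\n' "$nic" "$vid" "$vid" "$nic" "$off"
    done
}

per_nic_vlan_kargs 291 eno5 eno6 ens1f0 ens1f1
```

Booting once per printed line and noting which NICs obtain a lease gives a simple map of which ports actually carry the tagged vlan.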
Without any new information beyond the current investigation in comment#20 I'm going to close this as NOTABUG. Please re-open if there is new information to add.
Higher priority work has prevented this issue from being worked on; adding UpcomingSprint keyword
It is unlikely we will be able to get this addressed as part of the 4.7 release, given that we need additional data and investigation for the issue. Going to target this for 4.8 until we get more information.
Still waiting on new information from the customer (requested in private comment #31 on 2021-01-15). Without new information I'm going to close this as NOTABUG. Please do re-open if we can get access to that new information.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days