Bug 1940871 - Unable to use a bonded device ( bond0 ) on a vlan via UPI install of node workers
Summary: Unable to use a bonded device ( bond0 ) on a vlan via UPI install of node workers
Keywords:
Status: CLOSED DUPLICATE of bug 1882781
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Luca BRUNO
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1897660
Blocks:
 
Reported: 2021-03-19 12:47 UTC by Neeraj
Modified: 2023-09-15 01:03 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1897660
Environment:
Last Closed: 2021-04-20 09:35:42 UTC
Target Upstream Version:
Embargoed:



Description Neeraj 2021-03-19 12:47:47 UTC
+++ This bug was initially created as a clone of Bug #1897660 +++

Description of problem:

We have been unable to create a bonded device with VLAN tagging.

During a bare-metal CoreOS OCP 4.6 worker node install using a network with active-active bonding, the static IP does not persist and ping fails after 30 seconds, causing worker node scaling to fail.

Version-Release number of selected component (if applicable):

4.6.1 and 4.6.8


How reproducible:

Every time.


Steps to Reproduce:
1. Set up PXE boot to boot/install the worker node
2. Set the network config parameters on the kernel command line
3. Or create the network connections by booting the live ISO and then pass --copy-network to coreos-installer (see the sketch after this list)
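
As a rough, illustrative sketch of step 3 on the live ISO (not taken from the report verbatim; connection names are arbitrary, the addresses/interfaces mirror the kernel args quoted later in comment 7, and VLAN ID 2225 is assumed from the interface name bond0.2225):

  # Create the bond, enslave the two NICs, and add the tagged VLAN on top
  nmcli con add type bond con-name bond0 ifname bond0 \
      bond.options "mode=802.3ad,miimon=100,lacp_rate=fast" \
      ipv4.method disabled ipv6.method disabled
  nmcli con add type ethernet con-name bond0-eno1 ifname eno1 master bond0
  nmcli con add type ethernet con-name bond0-eno2 ifname eno2 master bond0
  nmcli con add type vlan con-name bond0.2225 ifname bond0.2225 dev bond0 id 2225 \
      ipv4.method manual ipv4.addresses 10.141.97.10/24 \
      ipv4.gateway 10.141.97.1 ipv4.dns 172.24.109.51

  # Install to disk and carry the live-ISO network config over to the installed system
  sudo coreos-installer install /dev/sda --copy-network \
      --ignition-url http://10.141.96.8:80/ignition/worker.ign --insecure-ignition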

Actual results:

The network configs are lost; nothing is present under /etc/NetworkManager/system-connections/.


Expected results:

A bonded device, bond0 or bond0.xxxx, that can do VLAN tagging.

Comment 4 molnars 2021-03-22 08:26:29 UTC
The above scenario is reproducible mostly when deployed on bare-metal nodes (i.e., when testing in libvirt/KVM it tends to work just fine).
The network configs can be pushed in via kernel params or created with nmcli; the result does not change.

Comment 6 Micah Abbott 2021-03-23 19:56:54 UTC
Could you share the kernel args used to configure the network interfaces?

Please provide the contents of `cat /etc/NetworkManager/system-connections/*` after the `nmcli` commands were performed in the live ISO.

Please provide the serial console/journal for the system after the `coreos-installer` command has been run and the system has been rebooted.


There is not yet enough information in this report to determine what has gone wrong.
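
In case it helps, one way to capture the requested data (a sketch only, not an official procedure) could be:

  # On the live ISO, after creating the connections with nmcli:
  cat /etc/NetworkManager/system-connections/*

  # On the installed node after reboot (while it is still reachable, or via the serial console):
  nmcli device status
  ls -l /etc/NetworkManager/system-connections/
  journalctl -b -u NetworkManager > /tmp/nm-journal.txt
  journalctl -b > /tmp/full-journal.txt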

Comment 7 molnars 2021-03-29 13:47:37 UTC
Hello,

1) The following were the kernel args passed while booting the live ISO (a breakdown of the network arguments follows after this list):

 kernel /images/pxeboot/vmlinuz
  append initrd=/images/pxeboot/initrd.img,/images/ignition.img random.trust_cpu=on rd.luks.options=discard coreos.liveiso=RHCOS-CustomIso ignition.firstboot ignition.platform.id=metal ip=10.141.97.10::10.141.97.1:255.255.255.0:worker7.ocp-lab.menalab.corp.local:bond0.2225:none vlan=bond0.2225:bond0 bond=bond0:eno1,eno2:mode=802.3ad,lacp_rate=fast,miimon=100 nameserver=172.24.109.51  coreos.inst.install_dev=sda coreos.inst.ignition_url=http://10.141.96.8:80/ignition/worker.ign

2) We don't have the system-connections files at the moment, but the network was accessible throughout this stage (though nmcli was giving an object error) and later, during the boot-from-disk stage, for about 30 seconds. We will try to pull this from the next install.
3) We will get fresh logs from the system install described in 2).
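
For readers of the kernel arguments in 1): per dracut.cmdline(7), the network arguments used here roughly follow this shape (paraphrased for reference, not quoted from the report):

  ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<hostname>:<interface>:none   # static addressing on <interface>
  vlan=<vlanname>:<phydevice>                                                  # e.g. vlan=bond0.2225:bond0
  bond=<bondname>:<slave1>,<slave2>:<options>                                  # e.g. bond=bond0:eno1,eno2:mode=802.3ad,...
  nameserver=<DNS-IP>                                                          # DNS server (see comment 8)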

Comment 8 Luca BRUNO 2021-04-01 15:44:49 UTC
Possibly related: 4.6 releases had a bug in the handling of `nameserver=` kernel arguments: https://bugzilla.redhat.com/show_bug.cgi?id=1882781

On top of what Micah already asked, it would be great to perform this cluster installation directly on 4.7, or rework the kernel arguments to avoid hitting the above bug.
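
One possible rework, purely as an untested sketch and assuming the live ISO's dracut accepts DNS servers appended to the static ip= specification, would be to drop the separate nameserver= argument and fold the DNS server into ip= instead:

  # Hypothetical reworked argument; the trailing :172.24.109.51 is the DNS server
  ip=10.141.97.10::10.141.97.1:255.255.255.0:worker7.ocp-lab.menalab.corp.local:bond0.2225:none:172.24.109.51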

Comment 11 Luca BRUNO 2021-04-20 09:43:53 UTC
Speculatively pointing to https://bugzilla.redhat.com/show_bug.cgi?id=1882781 as the root cause for this, due to matching conditions/triggers and the lack of actionable logs to investigate further. Closing as a duplicate.

We deem RHCOS 4.7 to be generally fine for setups in a bond+VLAN environment.
If there are further cases of failures using 4.7, please open a dedicated ticket with full installation details and journal logs from the RHCOS node.

Comment 12 Red Hat Bugzilla 2023-09-15 01:03:42 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

