Bug 1940871

Summary: Unable to use a bonded device (bond0) on a VLAN via UPI install of worker nodes

Product: OpenShift Container Platform
Component: RHCOS
Version: 4.6
Target Release: 4.8.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: high
Status: CLOSED DUPLICATE
Keywords: Reopened
Reporter: Neeraj <nbhatt>
Assignee: Luca BRUNO <lucab>
QA Contact: Michael Nguyen <mnguyen>
CC: abhinkum, acai, bbreard, dornelas, dustymabe, imcleod, itiwana, jligon, lsantill, mharris, miabbott, mnguyen, nkaushik, nstielau, rgregory, sferguso, smolnar, vchoudha
Clone Of: 1897660
Bug Depends On: 1897660
Last Closed: 2021-04-20 09:35:42 UTC

Description Neeraj 2021-03-19 12:47:47 UTC
+++ This bug was initially created as a clone of Bug #1897660 +++

Description of problem:

We have been unable to create a bonded device with VLAN tagging.

During a bare-metal CoreOS OCP 4.6 worker-node install using an active-active bond, the static IP configuration does not persist and ping fails after about 30 seconds, causing worker-node scaling to fail.

Version-Release number of selected component (if applicable):

4.6.1 and 4.6.8


How reproducible:

Every time.


Steps to Reproduce:
1. Set up PXE boot to boot/install the worker node
2. Set the network config parameters on the kernel command line, or
3. Create the network configuration by booting the live ISO and then pass --copy-network to coreos-installer
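As an illustration of step 3, the live-ISO flow looks roughly like the following. This is a sketch only, not output from the affected systems: interface names, addresses, and the VLAN ID are placeholders modeled on the environment reported later in comment 7, and the commands require a live RHCOS environment with real hardware.

```shell
# In the RHCOS live ISO: create the bond, its two member ports, and the VLAN
# (connection/device names here are illustrative):
nmcli con add type bond ifname bond0 con-name bond0 \
    bond.options "mode=802.3ad,lacp_rate=fast,miimon=100"
nmcli con add type ethernet ifname eno1 master bond0
nmcli con add type ethernet ifname eno2 master bond0
nmcli con add type vlan ifname bond0.2225 dev bond0 id 2225 \
    ipv4.method manual ipv4.addresses 10.141.97.10/24 \
    ipv4.gateway 10.141.97.1 ipv4.dns 172.24.109.51

# Install to disk, copying the live network config into the installed system
# (--insecure-ignition is needed because the Ignition URL is plain HTTP):
sudo coreos-installer install /dev/sda \
    --copy-network \
    --ignition-url http://10.141.96.8/ignition/worker.ign \
    --insecure-ignition
```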

Actual results:

The network configuration is lost; nothing is present under /etc/NetworkManager/system-connections/.


Expected results:

A bonded device, bond0, with VLAN tagging (i.e. a persistent bond0.xxxx interface).
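For illustration, a successful install with persistent networking would be expected to leave NetworkManager keyfiles under /etc/NetworkManager/system-connections/ roughly like the following. File names and values are a sketch modeled on this report's environment, not captured from an affected system:

```ini
# bond0.nmconnection (illustrative)
[connection]
id=bond0
type=bond
interface-name=bond0

[bond]
mode=802.3ad
lacp_rate=fast
miimon=100

# bond0.2225.nmconnection (illustrative)
[connection]
id=bond0.2225
type=vlan
interface-name=bond0.2225

[vlan]
parent=bond0
id=2225

[ipv4]
method=manual
address1=10.141.97.10/24,10.141.97.1
dns=172.24.109.51;
```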

Comment 4 molnars 2021-03-22 08:26:29 UTC
The above scenario is mostly reproducible when deployed on bare-metal nodes (i.e. when testing in libvirt/KVM it tends to work just fine).
The network configs can be pushed in via kernel parameters or created with nmcli; the result does not change.

Comment 6 Micah Abbott 2021-03-23 19:56:54 UTC
Could you share the kernel args used to configure the network interfaces?

Please provide the contents of `cat /etc/NetworkManager/system-connections/*` after the `nmcli` commands were performed in the live ISO.

Please provide the serial console/journal for the system after the `coreos-installer` command has been run and the system has been rebooted.


There is not yet enough information in this report to determine what has gone wrong.

Comment 7 molnars 2021-03-29 13:47:37 UTC
Hello,

1) the following were the kernel args passed during the booting of the live ISO:

 kernel /images/pxeboot/vmlinuz
  append initrd=/images/pxeboot/initrd.img,/images/ignition.img random.trust_cpu=on rd.luks.options=discard coreos.liveiso=RHCOS-CustomIso ignition.firstboot ignition.platform.id=metal ip=10.141.97.10::10.141.97.1:255.255.255.0:worker7.ocp-lab.menalab.corp.local:bond0.2225:none vlan=bond0.2225:bond0 bond=bond0:eno1,eno2:mode=802.3ad,lacp_rate=fast,miimon=100 nameserver=172.24.109.51  coreos.inst.install_dev=sda coreos.inst.ignition_url=http://10.141.96.8:80/ignition/worker.ign

2) We don't have the system-connections files at the moment, but the network was accessible throughout this stage (though nmcli was giving an object error) and later, during the boot-from-disk stage, for roughly 30 seconds. We will try to pull this from the next install.
3) We will get a fresh journal from the system install in 2).
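For reference, the dracut-style arguments in the append line above decompose as follows: bond=bond0:eno1,eno2:... creates the LACP (802.3ad) bond from the two NICs, vlan=bond0.2225:bond0 stacks VLAN 2225 on top of it, and ip=...:bond0.2225:none assigns the static address to the VLAN device. A minimal sketch of how the colon-separated ip= fields line up (field order per dracut.cmdline(7); the sample value is copied from the append line above):

```shell
# Split the ip= value from the append line into its dracut fields:
# ip=<client-ip>:<peer>:<gateway>:<netmask>:<hostname>:<interface>:<autoconf>
arg="10.141.97.10::10.141.97.1:255.255.255.0:worker7.ocp-lab.menalab.corp.local:bond0.2225:none"

# The empty second field (<peer>) is preserved by the non-whitespace IFS split:
IFS=':' read -r client_ip peer gateway netmask hostname iface autoconf <<EOF
$arg
EOF

echo "client_ip=$client_ip gateway=$gateway iface=$iface autoconf=$autoconf"
```

Note that the static address is bound to the VLAN device (bond0.2225), not to bond0 itself, which is why the configuration must survive on the VLAN connection after install.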

Comment 8 Luca BRUNO 2021-04-01 15:44:49 UTC
Possibly related, 4.6 releases had a bug related to the handling of `nameserver=` kernel arguments: https://bugzilla.redhat.com/show_bug.cgi?id=1882781

On top of what Micah already asked, it would be great to perform this cluster installation directly on 4.7, or rework the kernel arguments to avoid hitting the above bug.

Comment 11 Luca BRUNO 2021-04-20 09:43:53 UTC
Speculatively pointing to https://bugzilla.redhat.com/show_bug.cgi?id=1882781 as the root cause for this, due to matching conditions/triggers and the lack of actionable logs to investigate further. Closing as a duplicate.

We deem RHCOS 4.7 to be generally fine for setups in a bond+VLAN environment.
If there are further cases of failures using 4.7, please open a dedicated ticket with full installation details and journal logs from the RHCOS node.

Comment 12 Red Hat Bugzilla 2023-09-15 01:03:42 UTC
The needinfo request[s] on this closed bug have been removed because they have been unresolved for 500 days.