+++ This bug was initially created as a clone of Bug #1897660 +++

Description of problem:
We have been unable to create a bonded device with VLAN tagging. During a bare-metal CoreOS OCP 4.6 worker node install on a network using active-active bonding, the static IP does not persist and ping fails after 30 seconds, causing worker node scaling to fail.

Version-Release number of selected component (if applicable):
4.6.1 and 4.6.8

How reproducible:
Every time.

Steps to Reproduce:
1. Set up pxeboot to boot/install the worker node.
2. Set the network config parameter in the kernel line.
3. Or create the networks by booting the ISO and then pass --copy-network with coreos-installer.

Actual results:
Network configs lost. Nothing under /etc/NetworkManager/system-connections/

Expected results:
A bonded device, bond0 or bond0.xxxx, that can do VLAN tagging.
The above scenario is reproducible mostly when deployed on bare-metal nodes (i.e., when testing in libvirt/KVM it tends to work just fine). The network configs can be pushed in via kernel params or created with nmcli; the result does not change.
Could you share the kernel args used to configure the network interfaces? Please provide the contents of `cat /etc/NetworkManager/system-connections/*` after the `nmcli` commands were performed in the live ISO. Please provide the serial console/journal for the system after the `coreos-installer` command has been run and the system has been rebooted. There is not yet enough information in this report to determine what has gone wrong.
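For anyone gathering the requested connection profiles, here is a minimal sketch (not an official tool) that summarizes NetworkManager keyfiles. It assumes the profiles are INI-style keyfiles with a `[connection]` section, and that it is run against a copy of the directory on a workstation, since the RHCOS node itself may not ship a Python interpreter. The function name `summarize_keyfiles` is hypothetical.

```python
import configparser
from pathlib import Path


def summarize_keyfiles(directory="/etc/NetworkManager/system-connections"):
    """Return id/type/interface-name for each keyfile found in `directory`.

    An empty list means no profiles were found, which matches the
    "Nothing under /etc/NetworkManager/system-connections/" symptom.
    """
    base = Path(directory)
    if not base.is_dir():
        return []

    summaries = []
    for path in sorted(base.iterdir()):
        if not path.is_file():
            continue
        # interpolation=None: keyfile values may contain '%' characters.
        parser = configparser.ConfigParser(interpolation=None, strict=False)
        try:
            parser.read(path)
        except configparser.Error:
            # Not a parseable keyfile; skip it rather than guess.
            continue
        if not parser.has_section("connection"):
            continue
        conn = parser["connection"]
        summaries.append({
            "file": path.name,
            "id": conn.get("id"),
            "type": conn.get("type"),
            "interface-name": conn.get("interface-name"),
        })
    return summaries


if __name__ == "__main__":
    for entry in summarize_keyfiles():
        print(entry)
```

On the node the keyfiles are usually readable by root only, so they would need to be copied off with root privileges first. The raw `cat` output is still what the report needs; this only helps to check quickly whether the bond and VLAN profiles are present.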
Hello,

1) The following were the kernel args passed during the booting of the live ISO:

kernel /images/pxeboot/vmlinuz
append initrd=/images/pxeboot/initrd.img,/images/ignition.img random.trust_cpu=on rd.luks.options=discard coreos.liveiso=RHCOS-CustomIso ignition.firstboot ignition.platform.id=metal ip=10.141.97.10::10.141.97.1:255.255.255.0:worker7.ocp-lab.menalab.corp.local:bond0.2225:none vlan=bond0.2225:bond0 bond=bond0:eno1,eno2:mode=802.3ad,lacp_rate=fast,miimon=100 nameserver=172.24.109.51 coreos.inst.install_dev=sda coreos.inst.ignition_url=http://10.141.96.8:80/ignition/worker.ign

2) We don't have the system-connections contents at the moment, but the network was accessible throughout this stage (though nmcli was giving an object error) and later, during the boot-from-disk stage, for about 30 seconds. Will try to pull this from the next install.

3) Will get a fresh one from the system install in 2).
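To make the structure of the network arguments above easier to review, here is a rough sketch that splits the dracut-style `ip=`, `vlan=`, `bond=` and `nameserver=` arguments into named fields and checks that they reference each other consistently. It assumes IPv4 addresses and the seven-field static form of `ip=` used in this report; other `ip=` forms (for example `ip=dhcp`) are not handled. The function names and field labels are made up for illustration.

```python
# Field order of the static form:
# ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<hostname>:<interface>:<autoconf>
IP_FIELDS = ["client_ip", "peer", "gateway", "netmask",
             "hostname", "interface", "autoconf"]


def parse_network_args(cmdline):
    """Extract ip=, vlan=, bond= and nameserver= from a kernel command line."""
    cfg = {"nameservers": []}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # bare flag such as ignition.firstboot
        if key == "ip":
            cfg["ip"] = dict(zip(IP_FIELDS, value.split(":")))
        elif key == "vlan":
            # vlan=<vlanname>:<phydevice>
            name, _, parent = value.partition(":")
            cfg["vlan"] = {"name": name, "parent": parent}
        elif key == "bond":
            # bond=<bondname>:<slave1,slave2,...>:<opt1=val1,opt2=val2,...>
            parts = value.split(":")
            slaves = parts[1].split(",") if len(parts) > 1 and parts[1] else []
            options = {}
            if len(parts) > 2 and parts[2]:
                for opt in parts[2].split(","):
                    opt_key, _, opt_val = opt.partition("=")
                    options[opt_key] = opt_val
            cfg["bond"] = {"name": parts[0], "slaves": slaves, "options": options}
        elif key == "nameserver":
            cfg["nameservers"].append(value)
    return cfg


def check_references(cfg):
    """Return a list of mismatches between the ip=, vlan= and bond= names."""
    problems = []
    ip, vlan, bond = cfg.get("ip"), cfg.get("vlan"), cfg.get("bond")
    if ip and vlan and ip.get("interface") != vlan["name"]:
        problems.append("ip= interface does not match vlan= name")
    if vlan and bond and vlan["parent"] != bond["name"]:
        problems.append("vlan= parent does not match bond= name")
    return problems


if __name__ == "__main__":
    args = ("ip=10.141.97.10::10.141.97.1:255.255.255.0:"
            "worker7.ocp-lab.menalab.corp.local:bond0.2225:none "
            "vlan=bond0.2225:bond0 "
            "bond=bond0:eno1,eno2:mode=802.3ad,lacp_rate=fast,miimon=100 "
            "nameserver=172.24.109.51")
    parsed = parse_network_args(args)
    print(parsed)
    print(check_references(parsed))
```

For the arguments in this report the cross-references line up: the static IP is assigned to bond0.2225, which is a VLAN on bond0, which in turn enslaves eno1 and eno2. A check like this only validates the syntax of the arguments; it says nothing about whether the resulting configuration is persisted to disk.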
Possibly related, 4.6 releases had a bug related to the handling of `nameserver=` kernel arguments: https://bugzilla.redhat.com/show_bug.cgi?id=1882781 On top of what Micah already asked, it would be great to perform this cluster installation directly on 4.7, or rework the kernel arguments to avoid hitting the above bug.
Speculatively pointing to https://bugzilla.redhat.com/show_bug.cgi?id=1882781 as the root cause for this, due to matching conditions/triggers and the lack of actionable logs to investigate further. Closing as a duplicate. We deem RHCOS 4.7 to be generally fine for setups in a bond+VLAN environment. If there are further cases of failures using 4.7, please open a dedicated ticket with full installation details and journal logs from the RHCOS node.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days