Bug 1902071 - [baremetal] Moving from single NIC to bonding configuration using MCO post-config not working
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6.z
Hardware: x86_64
OS: Unspecified
Target Milestone: ---
Assignee: Ben Nemec
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-26 17:49 UTC by Manuel Rodriguez
Modified: 2022-02-24 20:18 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-24 20:18:07 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Installation template (3.63 KB, text/plain)
2020-11-26 17:51 UTC, Manuel Rodriguez
machine config template for workers (1.35 KB, text/plain)
2020-11-26 17:52 UTC, Manuel Rodriguez
machine config operator outputs (14.38 KB, text/plain)
2020-11-26 17:53 UTC, Manuel Rodriguez
sosreport with only command outputs, /var and /etc directories (9.64 MB, application/x-xz)
2020-11-26 18:03 UTC, Manuel Rodriguez

Description Manuel Rodriguez 2020-11-26 17:49:52 UTC
Description of problem:

When deploying a baremetal cluster with the IPI installer, a machine config applied post-deployment to set up bonding does not take effect.


Version-Release number of selected component (if applicable):
4.6.4


How reproducible:

Scenario:
3 masters, 3 workers.
first nic  - provisioning (enp1s0)
second nic - baremetal (enp2s0)
third nic  - baremetal (enp3s0) -> intentionally left unused so the bond can be built afterwards.

I'm testing in a virtual lab (libvirt for VMs and vbmc to simulate IPMI); other bonding modes give the same results.


Steps to Reproduce:
1. Deploy OCP 4.6.4 with baremetal IPI: first nic for provisioning, second nic for baremetal, third nic unused
2. Apply a machine config for workers or masters to set up bond0 with the second and third nics
3. Wait for the machine config to be applied and the servers to reboot
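
For illustration, the machine config in step 2 generally takes the shape sketched below. This is a hypothetical reconstruction based on the ifcfg-* contents shown later in this report, not the exact attached template; the name 99-worker-bonding and the base64 placeholders are assumptions.

```yaml
# Hypothetical sketch; the real template is attached to this BZ.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-bonding
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        # Each source would carry the base64-encoded ifcfg contents
        # shown further down in this report.
        - path: /etc/sysconfig/network-scripts/ifcfg-bond0
          mode: 0644
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,<BASE64_OF_IFCFG_BOND0>
        - path: /etc/sysconfig/network-scripts/ifcfg-enp2s0
          mode: 0644
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,<BASE64_OF_IFCFG_ENP2S0>
        - path: /etc/sysconfig/network-scripts/ifcfg-enp3s0
          mode: 0644
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,<BASE64_OF_IFCFG_ENP3S0>
```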
 

Actual results:
- Everything continues to work as before, but the network is not configured as defined.
- ifcfg-* files are written for the bond and nics as defined in the machine config
- the bond is created with only one nic (enp3s0)
- br-ex still uses the original single nic (enp2s0), not the bond


Expected results:
- the bond should have the two defined nics as slaves
- br-ex should use the bond, not only the original single nic
 

Additional info:

example of a worker before applying machine config:

[core@worker-2 ~]$ nmcli con show                                                                         
NAME              UUID                                  TYPE           DEVICE                              
ovs-if-br-ex      4b32228e-85df-4646-83c8-df7bea727415  ovs-interface  br-ex                       
Wired Connection  29379185-7639-46f8-a1a9-8349a6a03256  ethernet       enp1s0                               
br-ex             d6446250-26dd-4e41-a3e2-2a0081bfe482  ovs-bridge     br-ex                                
ovs-if-phys0      2e06c1e9-39c9-4b62-9818-98ff3dd2f75f  ethernet       enp2s0                      
ovs-port-br-ex    3787fe2e-fc7a-46fa-9569-39b4cddb4f92  ovs-port       br-ex                    
ovs-port-phys0    eea4945e-4be5-48ad-b4be-95067e8f709e  ovs-port       enp2s0 



After applying the machine config and reboot

[core@worker-2 ~]$ nmcli con show
NAME              UUID                                  TYPE           DEVICE 
bond0             ad33d8b0-1f7b-cab9-9447-ba07f855b143  bond           bond0  
ovs-if-br-ex      4b32228e-85df-4646-83c8-df7bea727415  ovs-interface  br-ex  
Wired Connection  29379185-7639-46f8-a1a9-8349a6a03256  ethernet       enp1s0 
br-ex             d6446250-26dd-4e41-a3e2-2a0081bfe482  ovs-bridge     br-ex  
ovs-if-phys0      2e06c1e9-39c9-4b62-9818-98ff3dd2f75f  ethernet       enp2s0 
ovs-port-br-ex    3787fe2e-fc7a-46fa-9569-39b4cddb4f92  ovs-port       br-ex  
ovs-port-phys0    eea4945e-4be5-48ad-b4be-95067e8f709e  ovs-port       enp2s0 
System enp3s0     63aa2036-8665-f54d-9a92-c3035bad03f7  ethernet       enp3s0 
System enp2s0     8c6fd7b1-ab62-a383-5b96-46e083e04bb1  ethernet       -- 



The OVS database still shows the original nic (enp2s0) as a port in the bridge:

[core@worker-2 ~]$ sudo ovs-vsctl list-ports br-ex
enp2s0
patch-br-ex_worker-2-to-br-int



The bond has only one leg (the initially unused nic, enp3s0):

[core@worker-2 ~]$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac follow)
Primary Slave: None
Currently Active Slave: enp3s0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp3s0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 52:54:00:98:00:32
Slave queue ID: 0
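
The single-slave state can be confirmed mechanically by counting the "Slave Interface" stanzas in the bonding status; the driver emits one stanza per enslaved device. A small sketch (it feeds in a copy of the output above via a here-document; on a live node you would read /proc/net/bonding/bond0 directly):

```shell
#!/bin/sh
# Count the slaves a bond reports; the bonding driver prints one
# "Slave Interface: <nic>" stanza per enslaved device.
count_slaves() {
    grep -c '^Slave Interface:'
}

# Copy of the status shown above (a healthy bond would list enp2s0 too).
slaves=$(count_slaves <<'EOF'
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac follow)
Currently Active Slave: enp3s0
MII Status: up
Slave Interface: enp3s0
MII Status: up
EOF
)
echo "$slaves"   # prints 1: only enp3s0 joined, enp2s0 is missing
```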



The ifcfg-* files are written just fine:

[core@worker-2 ~]$ ls -l /etc/sysconfig/network-scripts/
total 12
-rw-r--r--. 1 root root 332 Nov 26 17:00 ifcfg-bond0
-rw-r--r--. 1 root root  77 Nov 26 17:00 ifcfg-enp2s0
-rw-r--r--. 1 root root  77 Nov 26 17:00 ifcfg-enp3s0
[core@worker-2 ~]$ cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
NAME=bond0
BONDING_MASTER=yes
BOOTPROTO=dhcp
ONBOOT=yes
MTU=1500
BONDING_OPTS="mode=active-backup miimon=100 fail_over_mac=follow"
AUTOCONNECT_SLAVES=yes
IPV6INIT=no
DHCPV6C=no
IPV6INIT=no
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_PEERDNS=no
IPV6_PEERROUTES=no
IPV6_FAILURE_FATAL=no
IPV4_DHCP_TIMEOUT=2147483647
[core@worker-2 ~]$ cat /etc/sysconfig/network-scripts/ifcfg-enp2s0
TYPE=Ethernet
DEVICE=enp2s0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
[core@worker-2 ~]$ cat /etc/sysconfig/network-scripts/ifcfg-enp3s0
TYPE=Ethernet
DEVICE=enp3s0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
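
Assuming these files were delivered through the machine config in the usual way (base64 data URLs in Ignition storage.files), the encoding step for one of them can be sketched as follows; the round trip confirms the payload decodes back to the file contents shown above:

```shell
#!/bin/sh
# Encode ifcfg-enp3s0 (as shown above) into the base64 form a
# MachineConfig data URL would carry, then decode it back to
# confirm the round trip.
ifcfg='TYPE=Ethernet
DEVICE=enp3s0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes'

encoded=$(printf '%s\n' "$ifcfg" | base64 -w0)
echo "data:text/plain;charset=utf-8;base64,${encoded}"

# Round trip: decoding must reproduce the original file contents.
decoded=$(printf '%s' "$encoded" | base64 -d)
[ "$decoded" = "$ifcfg" ] && echo "round trip OK"
```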


When first logging in to the worker, I noticed a NetworkManager-wait-online failure:

[root@worker-2 ~]# systemctl status NetworkManager-wait-online.service
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-11-26 17:01:15 UTC; 36min ago
     Docs: man:nm-online(1)
  Process: 1409 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, status=1/FAILURE)
 Main PID: 1409 (code=exited, status=1/FAILURE)
      CPU: 112ms

Nov 26 17:00:45 localhost systemd[1]: Starting Network Manager Wait Online...
Nov 26 17:01:15 worker-2 systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Nov 26 17:01:15 worker-2 systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Nov 26 17:01:15 worker-2 systemd[1]: Failed to start Network Manager Wait Online.
Nov 26 17:01:15 worker-2 systemd[1]: NetworkManager-wait-online.service: Consumed 112ms CPU time



I've attached some other outputs, the machine config templates, and a sosreport, just in case. Please let me know if I can provide any other helpful information.

Thanks,

Comment 1 Manuel Rodriguez 2020-11-26 17:51:37 UTC
Created attachment 1733851 [details]
Installation template

Comment 2 Manuel Rodriguez 2020-11-26 17:52:23 UTC
Created attachment 1733852 [details]
machine config template for workers

Comment 3 Manuel Rodriguez 2020-11-26 17:53:18 UTC
Created attachment 1733853 [details]
machine config operator outputs

Comment 4 Manuel Rodriguez 2020-11-26 18:03:32 UTC
Created attachment 1733857 [details]
sosreport with only command outputs, /var and /etc directories

Comment 5 Manuel Rodriguez 2020-11-26 22:31:57 UTC
I would like to add the following test results, in case they help narrow down the problem.

I noticed the network type has an impact; to summarize:
 - Using OVNKubernetes in 4.6.4 and setting up bonding with MC templates post-install doesn't work (as described above in this BZ)
 - Using OVNKubernetes in 4.6.4 and passing the MC templates as manifests in a fresh install results in the bond being set to round-robin mode instead of the mode defined in the templates (see BZ 1899350)
 - Using OpenShiftSDN in 4.6.4 and passing the MC templates as manifests in a fresh install works (the bond preserves the options we passed in the templates)

In the end we want to use OVNKubernetes; it doesn't really matter whether we set up the bond during or after the installation. We also tested with nmstate but got errors; I'll follow up on that in a different BZ.

Comment 6 Manuel Rodriguez 2020-12-04 13:27:18 UTC
Hi, I see the BZ has the needinfo flag. Is there any information you would like me to include? Thanks!

Comment 7 Victor Pickard 2022-02-24 20:18:07 UTC
Doing some BZ cleanup, closing this one out as it is over a year old. If this is still an issue, please open a new BZ and provide details. Thanks.

