Description of problem:
If we shut off one Virtual Connect of the redundant pair and, during that time, os-refresh-config gets triggered for any reason, its subprocess os-net-config fails to configure interface bonding and, before exiting, also deletes the previously working interfaces.

Although this issue surfaced with HP enclosures and their Virtual Connects, it could also be a problem with AirFrame solutions and their redundant leaf switches.

Version-Release number of selected component (if applicable):
RHOSP 10
os-net-config-5.2.0-4

How reproducible:
Always

Steps to Reproduce:
on tenant-bond: active-backup 0 ens31f1, ens31f0
~~~
ifdown ens31f1
os-refresh-config
~~~

Recover from iLO by:
~~~
ifup ens31f1
os-refresh-config
~~~

Additional info:
This looks similar to BZ 1437320, but in this case we don't have mapping.yaml under /etc/os-net-config.
[ os-net-config mapping.yaml will only map active interfaces, but no-carrier interfaces need to be mapped as well ]
https://bugzilla.redhat.com/show_bug.cgi?id=1437320
I want to confirm that the logs are from the host that had the interface problem. I see no /etc/os-net-config/* in the sosreport nor any os-net-config related log messages. Can you check?
Also please provide the nic config template files that were used for deployment.
Also please indicate the type of bond being used (it will be obvious in the requested nic config files). There is some discussion of issues with OVS bonds here: https://bugzilla.redhat.com/show_bug.cgi?id=1590598
Thanks for the logs in the case. We are trying to match up the description in the bug report to what we see in the logs.

From the description it looks like you've configured the tenant-bond with interfaces ens31f1 and ens31f0, but in config.json it looks like tenant-bond is set up as a vlan, and there are no ens31f1 and ens31f0 interfaces in the log file.

        "addresses": [
            {
                "ip_netmask": "172.17.2.16/24"
            }
        ],
        "mtu": 9000,
        "device": "tenant-bond",
        "use_dhcp": false,
        "type": "vlan",
        "vlan_id": 42
    }
],

I don't see any errors in the log with ovs-appctl here:

Jul 25 11:17:11 overcloud-sriovperformancecompute-0 os-collect-config: [2019/07/25 11:17:11 AM] [INFO] Running ovs-appctl bond/set-active-slave ('infra-bond', 'eno49')

but I only see infra-bond being configured. Is the only problem with tenant-bond?

As per comment 3, can we get the nic config files (heat template files) that were used in the deployment, along with the deployment command? This will allow us to see how the config.json is getting set up.

Also, are you able to rerun the os-net-config command on the node with debug via "os-net-config -c /etc/os-net-config/config.json --debug"? If this exhibits the error you are seeing, can you provide the resulting log?
Hello,

Here is an update: we reproduced the issue with the eno49 and eno50 interfaces:

~~~
# ovs-appctl bond/show
---- infra-bond ----
bond_mode: active-backup
bond may use recirculation: no, Recirc-ID : -1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
lacp_status: off
active slave mac: 5c:b9:01:92:21:fc(eno49)

slave eno49: enabled
        active slave
        may_enable: true

slave eno50: enabled
        may_enable: true
~~~

## shut the active interface eno49

~~~
# ifdown eno49
# ovs-appctl bond/show
---- infra-bond ----
bond_mode: active-backup
bond may use recirculation: no, Recirc-ID : -1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
lacp_status: off
active slave mac: 5c:b9:01:92:21:fd(eno50)

slave eno49: disabled
        may_enable: false

slave eno50: enabled
        active slave
        may_enable: true
~~~

## debug os-net-config

~~~
# os-net-config -c /etc/os-net-config/config.json --debug
~~~

No problem so far; now running 'os-refresh-config' (21-8-19 12:33:21 time-stamp in the logs).
The bond disappears and the compute node gets disconnected.

To fix it we log in to iLO and re-run 'os-refresh-config' → connection and bond are back.

We have the sosreport and deployment templates, plus a full /var/log tarball gathered from that hypervisor in case something was skipped in the sos collection.

Regards,
Sergii
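To compare `ovs-appctl bond/show` output before and after a step such as `ifdown`, the text can be reduced to the active slave and per-slave state. The following is a minimal sketch only: `parse_bond_show` is a hypothetical helper, not part of Open vSwitch or os-net-config, and it assumes the output layout shown in the comment above.

```python
import re


def parse_bond_show(text):
    """Parse `ovs-appctl bond/show` text.

    Returns {bond_name: {"active": slave_name_or_None,
                         "slaves": {slave_name: "enabled" | "disabled"}}}.
    Hypothetical helper; assumes the layout shown in this bug report.
    """
    bonds = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Bond header, e.g. "---- infra-bond ----"
        header = re.fullmatch(r"-+\s*(\S+)\s*-+", line)
        if header:
            current = bonds.setdefault(
                header.group(1), {"active": None, "slaves": {}}
            )
            continue
        if current is None:
            continue
        # e.g. "active slave mac: 5c:b9:01:92:21:fc(eno49)"
        active = re.match(r"active slave mac:\s*\S+?\((\S+)\)", line)
        if active:
            current["active"] = active.group(1)
            continue
        # e.g. "slave eno49: enabled"
        slave = re.match(r"slave (\S+): (\w+)", line)
        if slave:
            current["slaves"][slave.group(1)] = slave.group(2)
    return bonds
```

Feeding it the captured output from before and after `os-refresh-config` makes it easy to see whether a bond entry is missing entirely or only changed its active slave.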
(In reply to Bob Fournier from comment #4)
> From the description it looks like you've configured the tenant-bond with
> interfaces ens31f1 and ens31f0 but in config.json it looks like tenant-bond
> is set up as a vlan, and there are no ens31f1 and ens31f0 interfaces in the
> log file.
>
>         "addresses": [
>             {
>                 "ip_netmask": "172.17.2.16/24"
>             }
>         ],
>         "mtu": 9000,
>         "device": "tenant-bond",
>         "use_dhcp": false,
>         "type": "vlan",
>         "vlan_id": 42
>     }
> ],
>

I found confirmation of this in the NIC config templates. The VLANs are configured as being members of the bond, but they should be members of the bridge.

What is in the templates now:

            - type: vlan
              vlan_id: 42
              addresses:
                - ip_netmask: {get_param: TenantIpSubnet}
              device: tenant-bond
              mtu: 9000
              use_dhcp: false

What should appear in the template:

            - type: vlan
              vlan_id: 42
              addresses:
                - ip_netmask: {get_param: TenantIpSubnet}
              device: br-ex # <--- Note VLAN should be a member of the bridge, not the bond
              mtu: 9000
              use_dhcp: false

When os-net-config has to restart an object (a bond in this case), it will also restart all the members. Since the VLANs are defined as members of the bond, they also get restarted. If the VLANs were members of the underlying bridge instead of the bond, then the bond could be restarted without restarting the VLAN.

If changes are made to the VLANs on the br-ex bridge, they should also be made to the br-all bridge where the same error occurs.
(In reply to Dan Sneddon from comment #7)
> What should appear in the template:
>
>             - type: vlan
>               vlan_id: 42
>               addresses:
>                 - ip_netmask: {get_param: TenantIpSubnet}
>               device: br-ex # <--- Note VLAN should be a member of the
> bridge, not the bond
>               mtu: 9000
>               use_dhcp: false

Looking even more closely, the device: is marked as br-ex, but the VLAN itself is a member of the bridge. These two definitions are in conflict, and I will have to review the OSP 10 os-net-config source code to determine which definition is used.
(In reply to Dan Sneddon from comment #8)
> Looking even more closely, the device: is marked as br-ex, but the VLAN
> itself is a member of the bridge. These two definitions are in conflict, and
> I will have to review the OSP 10 os-net-config source code to determine
> which definition is used.

Sorry, I meant to say that the tenant VLAN (on the Controller role) is marked as being part of device "tenant-bond", but the VLAN is a member of br-ex. I will determine which takes precedence by looking at the source code and testing.
Regarding comments 7, 8, and 9: I did test os-net-config with the conflicting VLAN membership, and the end result is that the Tenant VLAN will be placed on br-ex due to the members relationship, and the "device: bond-all" will be ignored. It should be removed from the templates, but I no longer think this was the root cause of the issue.
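Since the test found that the "device:" key is ignored when a VLAN is listed under a bridge's members, the stale keys can be located mechanically before cleaning up the templates. The following is a minimal sketch only: the function names are hypothetical and not part of os-net-config, and it assumes config.json carries a top-level "network_config" list whose bridge objects have a "type" ending in "bridge" and a "members" list.

```python
import json


def load_network_config(path):
    """Read an os-net-config style JSON file and return its network_config list.

    Hypothetical helper; assumes a top-level "network_config" key.
    """
    with open(path) as handle:
        return json.load(handle).get("network_config", [])


def find_conflicting_vlans(network_config):
    """Return (bridge_name, vlan_id, device) for each VLAN that is a member of
    a bridge but also carries a "device" key naming something else."""
    conflicts = []
    for obj in network_config:
        # Only bridge objects (e.g. "ovs_bridge") are inspected.
        if not obj.get("type", "").endswith("bridge"):
            continue
        for member in obj.get("members", []):
            if member.get("type") != "vlan":
                continue
            device = member.get("device")
            if device is not None and device != obj.get("name"):
                conflicts.append((obj.get("name"), member.get("vlan_id"), device))
    return conflicts
```

For example, `find_conflicting_vlans(load_network_config("/etc/os-net-config/config.json"))` would list each VLAN whose "device:" disagrees with the bridge it is nested under.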
Closing per comment 15.