Description of problem:

Created an nncp with the configuration below, with VLAN filtering disabled:

~~~
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: enp11s0
          vlan: {}    <<<<<<<<
      description: Linux bridge with the wrong port
      ipv4:
        dhcp: true
        enabled: true
      name: br100
      state: up
      type: linux-bridge
  nodeSelector:
    kubernetes.io/hostname: worker1.ocp4.shiftvirt.com
~~~

Started a virtual machine with this network:

~~~
[root@worker1 ~]# bridge link show | grep br100
7: enp11s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br100 state forwarding priority 32 cost 100
289: veth0b9f29ee@enp8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br100 state forwarding priority 32 cost 2    <<<<<<
~~~

Modified the nncp and enabled VLAN filtering. This caused a reactivation of the bridge:

~~~
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.7678] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/31" pid=529363 uid=0 result="success"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.8635] audit: op="connection-update" uuid="b7a3edc6-05ca-49d6-abf9-fa95a62e82e6" name="br100" args="connection.lldp,bridge.vlan-filtering" pid=529363 uid=0 result="success"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.8852] audit: op="connection-update" uuid="2a787052-8ddb-403c-a258-629475ab66b1" name="enp11s0" args="bridge-port.vlans" pid=529363 uid=0 result="success"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.8878] audit: op="device-reapply" interface="br100" ifindex=285 args="connection.lldp,bridge.vlan-filtering" pid=529363 uid=0 result="fail" reason="Can't reapply any changes to 'bridge' setting"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.8898] device (br100): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com dbus-daemon[1222]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.8' (uid=0 pid=1674 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.8918] device (br100): disconnecting for new activation request.
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.8919] audit: op="connection-activate" uuid="b7a3edc6-05ca-49d6-abf9-fa95a62e82e6" name="br100" pid=529363 uid=0 result="success"
~~~

During the reconfiguration, the veth interface was detached from the bridge:

~~~
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.9330] device (br100): detached bridge port veth0b9f29ee
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info> [1654060233.9331] device (veth0b9f29ee): released from master device br100
~~~

But the veth is an unmanaged device:

~~~
[root@worker1 ~]# grep veth /usr/lib/udev/rules.d/85-nm-unmanaged.rules
ENV{ID_NET_DRIVER}=="veth", ENV{NM_UNMANAGED}="1"
~~~

So the port was never added back, and the VM lost network connectivity. Note that any network configuration change that stops and starts the bridge can trigger this issue.
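For reference, this is a minimal sketch of the kind of port-level change that triggers the reactivation. The exact VLAN stanza I applied is not quoted above, so the trunk configuration below is an assumption based on the nmstate schema; any non-empty vlan section that turns on bridge.vlan-filtering behaves the same way.

~~~
# Assumed nncp change (illustrative values): replacing the empty
# "vlan: {}" with an explicit trunk configuration makes nmstate enable
# bridge.vlan-filtering on br100, which NetworkManager cannot reapply
# in place, so it deactivates and reactivates the bridge.
spec:
  desiredState:
    interfaces:
    - name: br100
      type: linux-bridge
      state: up
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: enp11s0
          vlan:
            mode: trunk
            trunk-tags:
            - id-range:
                min: 2
                max: 4094
~~~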
I have tested the same by changing the MTU value, and the result was the same.

Version-Release number of selected component (if applicable):
4.10.6

How reproducible:
100%

Steps to Reproduce:
1. Create an nncp with a bridge configuration.
2. Start a VM with this network.
3. Change the network configuration (what I tested were a VLAN filtering change and an MTU change; see the sketch under "Additional info" below).
4. The VM loses network connectivity after nmstate reconfigures the network.

Actual results:
Modifying an nncp while virtual machines are running causes the VMs to be disconnected from the network.

Expected results:
Either the system should not allow changing the configuration while VMs are running, or NetworkManager should add the veth ports back after reconfiguration. However, even if it adds them back, there is a disconnection of 1-2 seconds during the reconfiguration, which may not be desirable for all workloads.

Additional info:
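For the MTU variant mentioned above, a minimal sketch of the change I mean (the value 9000 is an illustrative assumption; any MTU change that forces a full bridge reactivation reproduces the problem):

~~~
# Assumed MTU change on the same nncp: bumping the bridge MTU also
# forces NetworkManager to deactivate and reactivate br100, detaching
# the unmanaged veth port in the process.
spec:
  desiredState:
    interfaces:
    - name: br100
      type: linux-bridge
      state: up
      mtu: 9000
~~~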
This bug can also cause a major network outage of all VMs in the cluster during an upgrade from 2.6 to 4.8. It looks like VLAN filtering was not enabled by default in 2.6, and after the upgrade the new nmstate-handler adds the VLAN configuration to the nnce [1]. This causes a reconfiguration of all the bridges, which results in the veth ports getting detached from them. I didn't test this, but the logs from an affected environment point to it.

[1] https://github.com/nmstate/kubernetes-nmstate/pull/793/commits/bc345316695f1189e4c6fd9e6c731c19735dcb02
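For context, a sketch of the default port VLAN configuration that the newer nmstate-handler appears to fill in for bridge ports that previously had none. The trunk range below is my reading of the linked commit and should be verified against the deployed version; the point is that the rendered nnce no longer matches the old state, so every bridge gets reconfigured.

~~~
# Assumed default injected by the new handler (per [1]) for ports that
# had no vlan section: a full trunk range, which enables VLAN filtering
# and thereby triggers the bridge reactivation described above.
bridge:
  port:
  - name: enp11s0
    vlan:
      mode: trunk
      trunk-tags:
      - id-range:
          min: 2
          max: 4094
~~~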
I will keep the severity but lower the priority, to make sure we leave some headroom for cases where a quick fix is needed to resolve an unavoidable breakage. Since the standalone operator was moved out of CNV with 4.11, I'm moving this BZ to the OpenShift Network team. We will be happy to backport new fixes on this topic to CNV 4.10 and 4.11.

Note that veths getting disconnected is not a single issue. A few BZs have already been opened to address this problem in various situations, for example:

https://bugzilla.redhat.com/show_bug.cgi?id=2076131
https://bugzilla.redhat.com/show_bug.cgi?id=2035519

We would like to fix these issues at their root, by making sure NetworkManager supports editing bridges without disconnecting any interfaces. However, we may also consider a more protective approach where we reject some operations at a higher level.
Reassigning to Quique since he has been driving this fix.