Bug 2092204 - Modifying nncp with running virtual machines causes disconnection of VMs from network
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Quique Llorente
QA Contact: Aleksandra Malykhin
URL:
Whiteboard:
Depends On: 2092762
Blocks:
Reported: 2022-06-01 05:23 UTC by nijin ashok
Modified: 2023-01-16 16:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-14 19:31:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6961611 0 None None None 2022-06-02 12:51:28 UTC

Internal Links: 2092762

Description nijin ashok 2022-06-01 05:23:43 UTC
Description of problem:

Created an nncp with the configuration below, with vlan filtering disabled:

~~~
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: enp11s0
          vlan: {}  # <<<<<<<<
      description: Linux bridge with the wrong port
      ipv4:
        dhcp: true
        enabled: true
      name: br100
      state: up
      type: linux-bridge
  nodeSelector:
    kubernetes.io/hostname: worker1.ocp4.shiftvirt.com
~~~

Started a virtual machine with this network.

~~~
[root@worker1 ~]# bridge link show |grep br100
7: enp11s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br100 state forwarding priority 32 cost 100
289: veth0b9f29ee@enp8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br100 state forwarding priority 32 cost 2  <<<<<<
~~~

Modified the nncp and enabled the vlan filtering. This caused the reactivation of the bridge.

~~~
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.7678] audit: op="checkpoint-create" arg="/org/freedesktop/NetworkManager/Checkpoint/31" pid=529363 uid=0 result="success"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.8635] audit: op="connection-update" uuid="b7a3edc6-05ca-49d6-abf9-fa95a62e82e6" name="br100" args="connection.lldp,bridge.vlan-filtering" pid=529363 uid=0 result="success"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.8852] audit: op="connection-update" uuid="2a787052-8ddb-403c-a258-629475ab66b1" name="enp11s0" args="bridge-port.vlans" pid=529363 uid=0 result="success"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.8878] audit: op="device-reapply" interface="br100" ifindex=285 args="connection.lldp,bridge.vlan-filtering" pid=529363 uid=0 result="fail" reason="Can't reapply any changes to 'bridge' setting"
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.8898] device (br100): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com dbus-daemon[1222]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.8' (uid=0 pid=1674 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.8918] device (br100): disconnecting for new activation request.
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.8919] audit: op="connection-activate" uuid="b7a3edc6-05ca-49d6-abf9-fa95a62e82e6" name="br100" pid=529363 uid=0 result="success"
~~~
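The failed `device-reapply` followed by a `new-activation` state change is the sign that NetworkManager could not apply the change in place and re-activated the bridge instead. A minimal sketch for picking such lines out of a journal dump follows; the function name `failed_reapply` is hypothetical, and it assumes the audit line format shown in the log excerpt above.

```shell
# Hypothetical helper: print the interface name of every failed
# NetworkManager "device-reapply" audit line read from stdin.
failed_reapply() {
    grep 'op="device-reapply"' \
        | grep 'result="fail"' \
        | sed -n 's/.*interface="\([^"]*\)".*/\1/p'
}
```

On a node this could be fed from the journal, for example `journalctl -u NetworkManager --since "10 minutes ago" | failed_reapply`.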

And the veth interface was detached from the bridge during reconfiguration.

~~~
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.9330] device (br100): detached bridge port veth0b9f29ee
Jun 01 05:10:33 worker1.ocp4.shiftvirt.com NetworkManager[1674]: <info>  [1654060233.9331] device (veth0b9f29ee): released from master device br100
~~~

But the veth is an unmanaged device.

~~~
[root@worker1 ~]# grep veth /usr/lib/udev/rules.d/85-nm-unmanaged.rules
ENV{ID_NET_DRIVER}=="veth", ENV{NM_UNMANAGED}="1"
~~~ 
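Because of this udev rule, NetworkManager does not manage veth devices and so does not re-attach them when it re-activates the bridge. As a rough cross-check, the sketch below lists the devices that NetworkManager reports as unmanaged. The function name `unmanaged_devices` is hypothetical, and it assumes the terse `DEVICE:STATE` output of `nmcli -t -f DEVICE,STATE device` with no colons in the device names.

```shell
# Hypothetical helper: read "DEVICE:STATE" lines from stdin (the terse
# format printed by `nmcli -t -f DEVICE,STATE device`) and print the
# name of every device whose state is "unmanaged".
unmanaged_devices() {
    awk -F: '$2 == "unmanaged" { print $1 }'
}
```

Usage on a node: `nmcli -t -f DEVICE,STATE device | unmanaged_devices`.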

So the port was never added back, and the VM lost network connectivity.

Also, note that any potential network configuration change that stops and starts the bridge can cause this issue. I have tested the same by changing the MTU value and the result was the same.
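One way to spot the problem after any such change is to compare the bridge's port list before and after the nncp is modified. The sketch below is only an illustration: the helper names `bridge_ports` and `detached_ports` are hypothetical, and it assumes the `bridge link show` output format seen earlier in this report.

```shell
# Hypothetical helper: read `bridge link show` output on stdin and print
# the name of every port whose master is the bridge given as $1.
bridge_ports() {
    awk -v br="$1" '{
        for (i = 1; i < NF; i++) {
            if ($i == "master" && $(i + 1) == br) {
                name = $2
                sub(/:$/, "", name)   # drop the trailing colon
                sub(/@.*/, "", name)  # drop the "@peer" suffix on veths
                print name
            }
        }
    }'
}

# Hypothetical helper: print the ports listed in file $1 (before) that
# are absent from file $2 (after).
detached_ports() {
    grep -vxFf "$2" "$1"
}
```

For example, run `bridge link show | bridge_ports br100 > before.txt` before the change and the same command into `after.txt` afterwards; `detached_ports before.txt after.txt` then lists the ports that were not re-attached.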

Version-Release number of selected component (if applicable):

4.10.6

How reproducible:

100 %

Steps to Reproduce:

1. Create an nncp with bridge configuration. 
2. Start a VM with this network.
3. Change the network configuration. I tested a vlan filtering change and an MTU change.
4. The VM loses network connectivity after nmstate reconfigures the network.

Actual results:

Modifying nncp with running virtual machines causes disconnection of VMs from network

Expected results:

Either the system should not allow changing the configuration while VMs are running, or NetworkManager should add the veth ports back after reconfiguration. However, I believe that even if it adds them back, there is a disconnection of 1-2 seconds during the reconfiguration, which may not be desirable for all workloads.

Additional info:

Comment 1 nijin ashok 2022-06-01 05:54:38 UTC
This bug can also cause a major network outage of all VMs in the cluster during an upgrade from 2.6 to 4.8. It looks like the vlan filtering was not enabled by default in 2.6 and after the upgrade, the new nmstate-handler adds the vlan to nnce [1]. This causes reconfiguration of all the bridges which results in veth ports getting detached from the bridge.

I didn't test this but the logs from an affected environment point to this.  


[1] https://github.com/nmstate/kubernetes-nmstate/pull/793/commits/bc345316695f1189e4c6fd9e6c731c19735dcb02

Comment 8 Petr Horáček 2022-06-09 09:03:57 UTC
I will keep the severity, but lower the priority, to make sure we leave some headroom for cases where a quick fix is needed to resolve an unavoidable breakage.

Since the standalone operator was moved out of CNV with 4.11, I'm moving this BZ to the OpenShift Network team. We will be happy to backport new fixes on this topic to CNV 4.10 and 4.11.

Note that the issue with veths getting disconnected is not a single issue. There were already a few BZs opened to address this problem in various situations, for example:
https://bugzilla.redhat.com/show_bug.cgi?id=2076131
https://bugzilla.redhat.com/show_bug.cgi?id=2035519

We would like to fix these issues at their root, by making sure NetworkManager supports editing bridges without disconnecting any interfaces. However, we may also consider a more protective approach where we reject some operations at a higher level.

Comment 9 Ben Nemec 2022-06-30 15:20:08 UTC
Reassigning to Quique since he has been driving this fix.

