Bug 2149012

Summary: NM brings down interfaces attached to an OVS bridge after "nmcli networking off/on"
Product: Red Hat Enterprise Linux 9
Reporter: Beniamino Galvani <bgalvani>
Component: NetworkManager
Assignee: Fernando F. Mancera <ferferna>
Status: VERIFIED
QA Contact: Vladimir Benes <vbenes>
Severity: unspecified
Priority: high
Version: 9.2
CC: bgalvani, blitton, lrintel, palonsor, pdiak, rkhan, rravaiol, sfaye, sukulkar, till, tkondvil, vbenes
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: NetworkManager-1.43.10-1.el9
Doc Type: No Doc Update
Type: Bug

Description Beniamino Galvani 2022-11-28 14:47:06 UTC
When a virtual interface is created outside of NetworkManager and attached to an ovs bridge, after disabling and re-enabling networking multiple times via "nmcli networking off; nmcli networking on", NetworkManager brings the interface down. This can be reproduced with the following commands:

  ip link add vxlan1 type vxlan remote 172.25.12.1 id 120 dstport 0
  ip link set vxlan1 up
  ovs-vsctl add-br br1
  ovs-vsctl add-port br1 vxlan1

  ovs-vsctl show
  ip link show vxlan1

  nmcli networking off
  nmcli networking on

  sleep 1

  nmcli networking off
  nmcli networking on

  ovs-vsctl show
  ip link show vxlan1

At the end, vxlan1 is down:

  272: vxlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue master ovs-system state DOWN mode DEFAULT group default qlen 1000

The expected result is that the interface is not touched by NM since it was created externally.
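The final check can be automated by inspecting the flag list that "ip link show" prints between angle brackets. The following is a minimal sketch, not part of the original report; the function name link_has_flag is hypothetical, and the sample line is the one quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the flag list printed by "ip link show"
# (the <...> part of the line) contains exactly the given flag.
link_has_flag() {
    line=$1
    flag=$2
    # Keep only the text between '<' and '>'.
    flags=$(printf '%s\n' "$line" | sed -n 's/^[^<]*<\([^>]*\)>.*$/\1/p')
    # Surround with commas so that UP does not match LOWER_UP.
    case ",$flags," in
        *",$flag,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample line copied from the report.
sample='272: vxlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue master ovs-system state DOWN mode DEFAULT group default qlen 1000'

if link_has_flag "$sample" UP; then
    echo "vxlan1 is administratively up"
else
    echo "vxlan1 is administratively down"
fi
```

On a live system the first argument could instead be taken from "ip -o link show vxlan1", which prints the link on a single line.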

Affected versions:
NetworkManager 1.30, NetworkManager 1.40, current git main

Comment 1 Beniamino Galvani 2022-11-28 14:55:29 UTC
Initially, the vxlan is in disconnected state and is considered 'external'.

  [1669644967.4654] device (vxlan1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')

The problem is that after toggling networking, the 'external' state is lost and the device becomes fully managed.

  [1669644995.0894] device (vxlan1): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'external')
  [1669645001.8996] device (vxlan1): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
  [1669645001.9245] device (vxlan1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')

At this point a "networking off" will bring the interface down.
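To spot the loss of the 'external' state in a longer log, the state-change lines can be filtered mechanically. This is only a sketch under the assumption that the log lines keep the format quoted above; the function name sys_iface_state_changes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read NetworkManager log lines on stdin and print a
# line whenever a device's sys-iface-state differs from the previous value
# logged for that same device.
sys_iface_state_changes() {
    awk -v q="'" '
        /state change:/ && /sys-iface-state: / {
            # The device name sits inside "device (NAME)".
            dev = $0
            sub(/^.*device \(/, "", dev)
            sub(/\).*$/, "", dev)
            # The state is the quoted word after "sys-iface-state: ".
            key = "sys-iface-state: " q
            st = substr($0, index($0, key) + length(key))
            sub(q ".*$", "", st)
            if ((dev in last) && last[dev] != st)
                printf "%s: %s -> %s\n", dev, last[dev], st
            last[dev] = st
        }
    '
}

# The four log lines quoted in this comment.
sys_iface_state_changes <<'EOF'
[1669644967.4654] device (vxlan1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
[1669644995.0894] device (vxlan1): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'external')
[1669645001.8996] device (vxlan1): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
[1669645001.9245] device (vxlan1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
EOF
# prints: vxlan1: external -> managed
```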

Comment 13 Vladimir Benes 2023-04-20 13:00:06 UTC
After a certain number of repetitions, I still see LOWER_UP missing on the interface.
See the attachment.
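A loop along the following lines could be used to repeat the toggle and stop at the first round in which the flag disappears. It is only a sketch: the function name toggle_until_carrier_lost and the round count are assumptions, not the actual NMCI test, and the one-second pause simply mirrors the "sleep 1" in the original reproducer.

```shell
#!/bin/sh
# Hypothetical helper: toggle networking up to $2 times and stop at the
# first round after which interface $1 no longer reports LOWER_UP.
toggle_until_carrier_lost() {
    iface=$1
    rounds=$2
    i=1
    while [ "$i" -le "$rounds" ]; do
        nmcli networking off
        nmcli networking on
        sleep 1
        # "ip -o" prints the link on one line, flags included.
        if ! ip -o link show "$iface" | grep -q 'LOWER_UP'; then
            echo "round $i: $iface lost LOWER_UP"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$iface kept LOWER_UP for $rounds rounds"
}

# Example (needs root and a running NetworkManager):
#   toggle_until_carrier_lost vxlan1 20
```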

Comment 20 Beniamino Galvani 2023-05-29 08:37:44 UTC
> adding may_fail tag to the ovs_vxlan_networking_off_on test

I couldn't reproduce the new failure with the NMCI test, but according to the logs it seems to be caused by a race condition in NM that makes the external device fully managed by NM.

Stopping and resuming NM at the right time makes the issue 100% reproducible:

  # Temporarily stop NetworkManager to trigger the race condition, which
  # happens when NM detects the interface already attached to the OVS
  # bridge and already announced by udev.
  killall -STOP NetworkManager
  ip link add vxlan1 type vxlan remote 172.25.12.1 id 120 dstport 0
  ip link set vxlan1 up
  ovs-vsctl add-br br1
  ovs-vsctl add-port br1 vxlan1
  sleep .4
  killall -CONT NetworkManager

  ovs-vsctl show
  ip link show vxlan1

  nmcli networking off
  nmcli networking on

  sleep 1

  nmcli networking off
  nmcli networking on

  ovs-vsctl show
  ip link show vxlan1
  # vxlan1 is DOWN now

Comment 24 Vladimir Benes 2023-07-03 14:09:20 UTC
Working well; moving to VERIFIED.