Bug 2168477

Summary: DHCP transaction started on ovs internal interface before set-cloned MAC address
Product: Red Hat Enterprise Linux 9 Reporter: Gris Ge <fge>
Component: NetworkManagerAssignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA QA Contact: Matej Berezny <mberezny>
Severity: high Docs Contact:
Priority: high    
Version: 9.0CC: bgalvani, ellorent, lrintel, mberezny, phoracek, rkhan, sfaye, sukulkar, till, vbenes
Target Milestone: rcKeywords: Triaged
Target Release: 9.2   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: NetworkManager-1.42.2-1.el9 Doc Type: No Doc Update
Doc Text:
If this bug requires documentation, please select an appropriate Doc Type value.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-05-09 08:17:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2165874    
Attachments:
Description Flags
NetworkManager trace log none

Description Gris Ge 2023-02-09 06:37:34 UTC
Description of problem:

In OpenShift kubernetes-nmstate, we set cloned mac address to OVS internal interface and reboot. After bootup, we might different DHCP address:


Feb 01 07:09:38 localhost.localdomain NetworkManager[1673]: <info>  [1675235378.6696] dhcp4 (ovs0): state changed new lease, address=192.168.66.51


Feb 01 07:08:26 localhost.localdomain NetworkManager[1673]: <info>  [1675235306.6211] dhcp4 (ovs0): state changed new lease, address=192.168.66.102


Version-Release number of selected component (if applicable):
NetworkManager-1.40.10-1.el8 (also tried NetworkManager latest git build, got the same)

How reproducible:
5%

Steps to Reproduce:
1. Request openshift team to run kubernetes-nmstate CI
2.
3.

Actual results:

Since the `ovs0` interface is cloning the MAC address of `eth0`, after reboot, the `ovs0` hold __different__ IP address of `eth0` was.

Expected results:

Since the `ovs0` interface is cloning the MAC address of `eth0`, after reboot, the `ovs0` hold __identical__ IP address of `eth0` was.

Additional info:

The full log could be found at
https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/nmstate_kubernetes-nmstate/1059/pull-kubernetes-nmstate-e2e-handler-k8s/1620675826716512256/artifacts/NodeNetworkConfigurationPolicy_default_ovs-bridged_network_when_there_is_a_default_interface_with_dynamic_address_and_ovs_bridge_on_top_of_the_default_interface/NetworkManager.log

Highlights: 

Correct one:

Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1622] device (ovs0): set-hw-addr: set-cloned MAC address to 52:55:00:D1:55:02 (52:55:00:D1:55:02)
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1626] device (ovs0): carrier: link connected
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1628] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1637] device (eth0): Activation: connection 'eth0' enslaved, continuing activation
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1640] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1649] dhcp4 (ovs0): activation: beginning transaction (no timeout)
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1661] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1663] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Feb 01 07:07:52 localhost.localdomain NetworkManager[1673]: <info>  [1675235272.1668] device (eth0): Activation: successful, device activated.
Feb 01 07:08:26 localhost.localdomain NetworkManager[1673]: <info>  [1675235306.6211] dhcp4 (ovs0): state changed new lease, address=192.168.66.102

Wrong one:


Feb 01 07:09:01 node02 NetworkManager[1673]: <info>  [1675235341.2998] dhcp4 (ovs0): activation: beginning transaction (no timeout)
Feb 01 07:09:01 node02 NetworkManager[1673]: <info>  [1675235341.3010] device (ovs0): set-hw-addr: set-cloned MAC address to 52:55:00:D1:55:02 (52:55:00:D1:55:02)
Feb 01 07:09:02 localhost.localdomain NetworkManager[1673]: <info>  [1675235342.3294] audit: op="checkpoint-adjust-rollback-timeout" arg="/org/freedesktop/NetworkManager/Checkpoint/1" pid=3992 uid=0 result="success"
Feb 01 07:09:38 localhost.localdomain NetworkManager[1673]: <info>  [1675235378.6696] dhcp4 (ovs0): state changed new lease, address=192.168.66.51

Above logs indicate NetworkManager is starting DHCP before `set-hw-addr`

Comment 1 Beniamino Galvani 2023-02-09 08:46:16 UTC
Hi Gris, I don't understand why that happens. Can you please provide a trace log that shows the problem?

Comment 2 Gris Ge 2023-02-09 13:05:02 UTC
Created attachment 1943092 [details]
NetworkManager trace log

Trace log for NetworkManager-1.43.1-31701.copr.04fb042965.el8.x86_64

Comment 3 Gris Ge 2023-02-09 13:05:37 UTC
For above log, the expected IP address is 192.168.66.102. The problematic one is 192.168.66.117

Comment 4 Beniamino Galvani 2023-02-10 16:48:01 UTC
Thanks, there is a race condition when reactivating a ovs connection with a cloned MAC. It depends on how fast ovs-vswitchd is to update the MAC address, and I couldn't reproduce it with normal ovs. To force a delay, I recompiled ovs with the following patch, and then the problem becomes 100% visible.

--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1827,6 +1827,8 @@ netdev_linux_set_etheraddr(struct netdev *netdev_, const struct eth_addr mac)
     if (is_tap_netdev(netdev_)) {
         update_flags(netdev, NETDEV_UP, 0, &old_flags);
     }
+
+    usleep(200000);
     error = set_etheraddr(netdev_get_name(netdev_), mac);

Comment 7 Gris Ge 2023-02-21 05:19:44 UTC
*** Bug 2165874 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2023-05-09 08:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (NetworkManager bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2485