Bug 2188100

Summary: 802.3ad bond interface comes up with a different, random MAC address after booting when it's a slave of an active-backup interface
Product: Red Hat Enterprise Linux 9
Component: kernel
kernel sub component: Bonding
Version: CentOS Stream
Status: CLOSED UPSTREAM
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Reporter: Andrew Schorr <ajschorr>
Assignee: Hangbin Liu <haliu>
QA Contact: LiLiang <liali>
CC: bstinson, jwboyer, network-qe
Flags: pm-rhel: mirror+
Type: Bug
Last Closed: 2023-06-13 09:52:16 UTC

Description Andrew Schorr 2023-04-19 19:09:47 UTC
Description of problem:
After rebooting a system with an 802.3ad LACP bond interface, it comes up
with a different, random MAC address each time. 

Version-Release number of selected component (if applicable):
kernel-5.14.0-295.el9.x86_64


How reproducible:
always

Steps to Reproduce:
1. configure an 802.3ad LACP bond interface
2. reboot the system

Actual results:
[   67.410032] i40e 0000:01:00.0 lan2: set new mac address 72:b5:6a:b0:bd:38
[   67.410113] i40e 0000:01:00.1 lan3: set new mac address 72:b5:6a:b0:bd:38
sh-5.1$ ip link ls dev bond0
10: bond0: <BROADCAST,MULTICAST,MASTER,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master bond1 state UP mode DEFAULT group default qlen 1000
    link/ether 72:b5:6a:b0:bd:38 brd ff:ff:ff:ff:ff:ff
sh-5.1$ ip link ls dev lan2
4: lan2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 72:b5:6a:b0:bd:38 brd ff:ff:ff:ff:ff:ff permaddr 40:a6:b7:b0:b7:c0
sh-5.1$ ip link ls dev lan3
5: lan3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 72:b5:6a:b0:bd:38 brd ff:ff:ff:ff:ff:ff permaddr 40:a6:b7:b0:b7:c1
sh-5.1$ cat /sys/class/net/bond0/addr_assign_type 
3


Expected results:
The bond0 interface should use a MAC address of one of the slave devices.

Additional info:
The Bonding driver HOWTO states in FAQ question 8 that a bond takes its
MAC address from a slave device. That worked when I had an SFN8522 NIC in the system, but after I replaced it with a 4-port Intel X710-DA4, I get a different, random address on each boot. If I ifdown and then ifup the interface, it sets the address properly to a slave address:
sh-5.1# cat /sys/class/net/bond0/addr_assign_type
3
sh-5.1# ip link ls dev bond0
10: bond0: <BROADCAST,MULTICAST,MASTER,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master bond1 state UP mode DEFAULT group default qlen 1000
    link/ether 72:b5:6a:b0:bd:38 brd ff:ff:ff:ff:ff:ff
sh-5.1# ifdown bond0
Connection 'Bond bond0' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
sh-5.1# ifup bond0
Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/8)
sh-5.1# ip link ls dev bond0
11: bond0: <BROADCAST,MULTICAST,MASTER,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master bond1 state UP mode DEFAULT group default qlen 1000
    link/ether 40:a6:b7:b0:b7:c0 brd ff:ff:ff:ff:ff:ff
sh-5.1# cat /sys/class/net/bond0/addr_assign_type
2
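
For reference, the numbers printed by addr_assign_type can be decoded with a small helper. This is a minimal sketch, assuming the constants from the kernel's include/uapi/linux/netdevice.h (0 = NET_ADDR_PERM, 1 = NET_ADDR_RANDOM, 2 = NET_ADDR_STOLEN, 3 = NET_ADDR_SET). The function name addr_assign_name is made up for illustration.

```shell
# Hypothetical helper: map a numeric addr_assign_type value to the
# name of the corresponding kernel constant.
addr_assign_name() {
    case "$1" in
        0) echo "NET_ADDR_PERM" ;;    # permanent (hardware) address
        1) echo "NET_ADDR_RANDOM" ;;  # randomly generated address
        2) echo "NET_ADDR_STOLEN" ;;  # address taken from another device
        3) echo "NET_ADDR_SET" ;;     # address set explicitly
        *) echo "unknown"; return 1 ;;
    esac
}

# Example on a live system (bond0 is the interface from this report):
#   addr_assign_name "$(cat /sys/class/net/bond0/addr_assign_type)"
```

Read this way, the 3 seen right after boot means the address was set explicitly, and the 2 seen after ifdown/ifup means it was taken from another device.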

Comment 1 Andrew Schorr 2023-04-19 19:28:13 UTC
Actually, I'm now seeing that after ifdown/ifup, the MAC address is fixed only briefly.
After some time, bond0 again reverts to the random MAC address that was set at boot time.
This may be because I've got an active-backup bond bond1 that has bond0 as a slave
and is resetting the MAC address. I'd probably need to try with only the single bond0
802.3ad bonding interface to get a clearer view of what's going on. Maybe there's a bad
interaction between the 2 bonding interfaces...

Comment 2 Andrew Schorr 2023-04-20 02:36:03 UTC
As far as I can tell, this issue occurs only when the 802.3ad bond in bond0 is a slave
of another bond (in this case, bond1, which is an active-backup mode bond). When I eliminate
bond1, I don't see this problem with bond0. I guess this raises the question of how well
it's supported to have a bond where the slave is another bond. It seems to work but for
this issue, as far as I can tell, although the speed/duplex info for the top-level bond
is also incorrect.

Comment 3 LiLiang 2023-04-20 06:29:30 UTC
I reproduced this on RHEL 9 with kernel-5.14.0-296.2191_828573416.el9.x86_64.

After reboot, mybond0 and mybond1 are both using a random MAC.

# setup
[root@dell-per740-86 ~]# cat re2
nmcli con add con-name mybond1 type bond ifname mybond1 bond.options "mode=1,miimon=100,updelay=5000,primary=mybond0"

nmcli con add con-name mybond0 type bond ifname mybond0 bond.options "mode=802.3ad,miimon=100,updelay=5000" master mybond1
nmcli con add con-name ens1f0 type ethernet ifname ens1f0 master mybond0
nmcli con add con-name ens1f1 type ethernet ifname ens1f1 master mybond0
#$nmcli con up mybond0

nmcli con add con-name ens4f0np0 type ethernet ifname ens4f0np0 master mybond1
nmcli con up mybond1


# after reboot
16: mybond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 12:98:3a:86:eb:f3 brd ff:ff:ff:ff:ff:ff

17: mybond0: <BROADCAST,MULTICAST,MASTER,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master mybond1 state UP mode DEFAULT group default qlen 1000
    link/ether 12:98:3a:86:eb:f3 brd ff:ff:ff:ff:ff:ff

6: ens4f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master mybond1 state UP mode DEFAULT group default qlen 1000
    link/ether 12:98:3a:86:eb:f3 brd ff:ff:ff:ff:ff:ff permaddr 00:0f:53:7f:88:a0
    altname enp175s0f0np0

8: ens1f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master mybond0 state UP mode DEFAULT group default qlen 1000
    link/ether 12:98:3a:86:eb:f3 brd ff:ff:ff:ff:ff:ff permaddr b4:96:91:a5:9f:50
    altname enp59s0f0
9: ens1f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master mybond0 state UP mode DEFAULT group default qlen 1000
    link/ether 12:98:3a:86:eb:f3 brd ff:ff:ff:ff:ff:ff permaddr b4:96:91:a5:9f:51
    altname enp59s0f1


# after reboot, the lacp mode bonding has correct speed info
[root@dell-per740-86 ~]# ethtool mybond0
Settings for mybond0:
	Supported ports: [  ]
	Supported link modes:   Not reported
	Supported pause frame use: No
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  Not reported
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: 50000Mb/s
	Duplex: Full
	Auto-negotiation: off
	Port: Other
	PHYAD: 0
	Transceiver: internal
	Link detected: yes

# after reboot, the activebackup mode bonding has no speed info
[root@dell-per740-86 ~]# ethtool mybond1
Settings for mybond1:
	Supported ports: [  ]
	Supported link modes:   Not reported
	Supported pause frame use: No
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  Not reported
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: Unknown!
	Duplex: Unknown! (255)
	Auto-negotiation: off
	Port: Other
	PHYAD: 0
	Transceiver: internal
	Link detected: yes


[root@dell-per740-86 ~]# cat /proc/net/bonding/mybond0
Ethernet Channel Bonding Driver: v5.14.0-296.2191_828573416.el9.x86_64

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 0
Peer Notification Delay (ms): 0

802.3ad info
LACP active: on
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 12:98:3a:86:eb:f3
Active Aggregator Info:
	Aggregator ID: 1
	Number of ports: 2
	Actor Key: 21
	Partner Key: 47
	Partner Mac Address: b0:8b:d0:0a:73:3b

Slave Interface: ens1f0
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:a5:9f:50
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: monitoring
Partner Churn State: monitoring
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 12:98:3a:86:eb:f3
    port key: 21
    port priority: 255
    port number: 1
    port state: 61
details partner lacp pdu:
    system priority: 32768
    system mac address: b0:8b:d0:0a:73:3b
    oper key: 47
    port priority: 32768
    port number: 353
    port state: 63

Slave Interface: ens1f1
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b4:96:91:a5:9f:51
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: monitoring
Partner Churn State: monitoring
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 12:98:3a:86:eb:f3
    port key: 21
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 32768
    system mac address: b0:8b:d0:0a:73:3b
    oper key: 47
    port priority: 32768
    port number: 357
    port state: 63


[root@dell-per740-86 ~]# cat /proc/net/bonding/mybond1
Ethernet Channel Bonding Driver: v5.14.0-296.2191_828573416.el9.x86_64

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: mybond0 (primary_reselect always)
Currently Active Slave: mybond0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: mybond0
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 12:98:3a:86:eb:f3
Slave queue ID: 0

Slave Interface: ens4f0np0
MII Status: up
Speed: 25000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0f:53:7f:88:a0
Slave queue ID: 0

Comment 4 Andrew Schorr 2023-04-20 15:15:51 UTC
Good -- I see that you've duplicated the problem. 

After rebooting multiple times, it did come up correctly with a non-random MAC address one time. 
So this seems like a race condition.

Is there any way to overcome this by configuring the bond0 MAC address, or is a workaround
impossible, and we need a patch? I'd like to find some way to make this work, but haven't
come up with a solution so far. 

Thanks,
Andy

Comment 5 LiLiang 2023-04-21 02:07:24 UTC
(In reply to Andrew Schorr from comment #4)

Sorry, I couldn't find a workaround...

Comment 6 Andrew Schorr 2023-04-21 03:00:54 UTC
I haven't tried this yet, but since it seems like a race condition, I wonder if downing bond1 and then downing bond0,
and then reactivating bond0 and subsequently bond1 would fix the problem. I suspect the root cause is that NetworkManager
is starting bond1 before bond0 has had a chance to initialize. I'm considering adding a systemd service
that runs after the network is up to check the MAC addresses and then try this approach if it sees a problem.
But it's certainly an ugly solution. Is there some way to hint to NetworkManager to wait for bond0 to come
up before attempting to start bond1?
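
The MAC check that such a service would perform could look roughly like the sketch below. It assumes a bond whose slaves are physical NICs (bond0 here) and relies on the "Permanent HW addr:" lines that /proc/net/bonding/<bond> prints for each slave. The function name mac_is_slave_permaddr is made up for illustration.

```shell
# Hypothetical helper: succeed if the MAC address in $1 matches one of the
# "Permanent HW addr:" lines in the /proc/net/bonding/<bond> style file $2.
mac_is_slave_permaddr() {
    # Normalize to lowercase so the comparison is case-insensitive.
    want=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    awk -v want="$want" '
        /^Permanent HW addr:/ { if (tolower($4) == want) found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$2"
}

# Example on a live system (bond0 is the interface from this report):
#   cur=$(cat /sys/class/net/bond0/address)
#   mac_is_slave_permaddr "$cur" /proc/net/bonding/bond0 ||
#       echo "bond0 is not using any slave's permanent address"
```

This check would not be meaningful for the top-level bond, because the permanent address reported for a slave that is itself a bond may be the random one.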

Comment 7 LiLiang 2023-04-21 03:15:39 UTC
Maybe a NetworkManager dispatcher script can do this, but I am not familiar with it...
The NetworkManager developers need to have a look.

Comment 8 Andrew Schorr 2023-04-23 20:12:30 UTC
I'm configuring these with /etc/sysconfig/network-scripts legacy files. So I thought that I could solve the race condition problem by setting "ONBOOT=no" for the active-backup bond that's stacked on top of the 802.3ad bond.
I've got this config:

Underlying interfaces:

sh-5.1$ head -n20 /etc/sysconfig/network-scripts/ifcfg-lan*
==> /etc/sysconfig/network-scripts/ifcfg-lan0 <==
DEVICE=lan0
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=yes
MASTER=bond1
SLAVE=yes

==> /etc/sysconfig/network-scripts/ifcfg-lan1 <==
DEVICE=lan1
ONBOOT=no
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=no

==> /etc/sysconfig/network-scripts/ifcfg-lan2 <==
DEVICE=lan2
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=yes
MASTER=bond0
SLAVE=yes

==> /etc/sysconfig/network-scripts/ifcfg-lan3 <==
DEVICE=lan3
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=yes
MASTER=bond0
SLAVE=yes

==> /etc/sysconfig/network-scripts/ifcfg-lan4 <==
DEVICE=lan4
ONBOOT=no
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=no

==> /etc/sysconfig/network-scripts/ifcfg-lan5 <==
DEVICE=lan5
ONBOOT=no
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=no


Bonding interfaces:

sh-5.1$ head -n20 /etc/sysconfig/network-scripts/ifcfg-bond*
==> /etc/sysconfig/network-scripts/ifcfg-bond0 <==
DEVICE=bond0
TYPE=Bond
BOOTPROTO=none
ONBOOT=yes
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=yes
BONDING_MASTER=yes
BONDING_OPTS="miimon=100 ad_select=bandwidth mode=802.3ad xmit_hash_policy=layer2+3 arp_interval=0 lacp_rate=fast"
MASTER=bond1
SLAVE=yes

==> /etc/sysconfig/network-scripts/ifcfg-bond1 <==
DEVICE=bond1
TYPE=Bond
BOOTPROTO=static
ONBOOT=no
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=yes
IPADDR=192.168.30.27
NETMASK=255.255.254.0
BONDING_MASTER=yes
BONDING_OPTS="miimon=0 mode=active-backup arp_all_targets=any primary=bond0 arp_ip_target=192.168.30.13,192.168.30.12,192.168.30.7,192.168.30.5 arp_interval=1000 fail_over_mac=follow arp_validate=none primary_reselect=always"

==> /etc/sysconfig/network-scripts/ifcfg-bond1.3 <==
DEVICE=bond1.3
BOOTPROTO=static
ONBOOT=no
HOTPLUG=no
NOZEROCONF=yes
NM_CONTROLLED=yes
VLAN=yes
IPADDR=192.168.33.27
NETMASK=255.255.255.0

I thought that with bond1 and bond1.3 set to ONBOOT=no, I could wait for bond0 to come up with a proper MAC address from one of its slaves before starting bond1 and bond1.3. But the NetworkManager behavior is strange: it implicitly starts bond1 anyway, even though ONBOOT=no, but it does not start bond1.3. That seems wrong to me. Why is it implicitly starting bond1 at boot even though ONBOOT=no?

I also noticed that when both bonds are up, if I run "ifdown bond1", it takes down bond0 automatically. And if,
after both are down, I run "ifup bond0", it brings up bond1 automatically. This behavior seems wrong to me,
but I'm certainly not a NetworkManager expert. I opened a NetworkManager bug regarding this behavior:
https://bugzilla.redhat.com/show_bug.cgi?id=2188963

Strangely enough, in this new configuration, it seems like I am no longer getting the random MAC address.
So maybe setting ONBOOT=no somehow affects the timing of when it starts bond1 and delays it a bit.

I then added a local network_delayed_start.service, which runs after network.target
and before network-online.target, to bring up bond1.3, and everything seems OK now after a few test reboots.

Regards,
Andy

Comment 9 Hangbin Liu 2023-06-13 09:02:43 UTC
(In reply to LiLiang from comment #3)
> I reproduced this on rhel9. kernel-5.14.0-296.2191_828573416.el9.x86_64
> 
> After reboot, mybond0 and mybond1 both using a random mac.
> 
> # setup
> [root@dell-per740-86 ~]# cat re2
> nmcli con add con-name mybond1 type bond ifname mybond1 bond.options
> "mode=1,miimon=100,updelay=5000,primary=mybond0"

First, you create bond1 with a random MAC address. Let's call it MAC1.

> 
> nmcli con add con-name mybond0 type bond ifname mybond0 bond.options
> "mode=802.3ad,miimon=100,updelay=5000" master mybond1

Then you create bond0, also with a random MAC address. Let's call it MAC0.

After that, you set bond0's master to bond1. When bond0 is enslaved to bond1,
bond1 sets its MAC address to bond0's, because bond0 is its first slave. So bond1 will have MAC0.

Then bond0 stores its MAC address as perm_hwaddr and sets its MAC to bond1's MAC, which is still MAC0.
At the same time, though, bond0's address type becomes NET_ADDR_SET.


> nmcli con add con-name ens1f0 type ethernet ifname ens1f0 master mybond0

When ens1f0 is enslaved to bond0, bond0 does not change its MAC address even though it has no slaves yet,
because bond0's address type is no longer NET_ADDR_RANDOM.
Then ens1f0 stores its MAC address as perm_hwaddr and sets its MAC to bond0's MAC, which is MAC0.

> nmcli con add con-name ens1f1 type ethernet ifname ens1f1 master mybond0

The same happens with ens1f1.

> nmcli con add con-name ens4f0np0 type ethernet ifname ens4f0np0 master mybond1

ens4f0np0's MAC will also be set to bond1's MAC, which is MAC0.
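
The sequence described above can be replayed in a toy model. This is only a sketch of the rule as stated in this comment: a bond with no slaves adopts its first slave's MAC only while its own address type is still random, and an enslaved device is given its master's MAC with address type "set". It is not the bonding driver's code and it ignores fail_over_mac. The names enslave, bond_first, and nic_first, and the placeholder addresses MAC0, MAC1, and NIC0, are made up for illustration.

```shell
# Toy model of one enslave step.
# Arguments: MASTER_MAC MASTER_TYPE MASTER_HAS_SLAVES(yes|no) SLAVE_MAC
# Prints:    "<master mac> <master type> <slave mac> <slave type>"
enslave() {
    m_mac=$1
    m_type=$2
    m_has_slaves=$3
    s_mac=$4
    # A bond without slaves takes over the first slave's MAC,
    # but only while its own address type is still "random".
    if [ "$m_has_slaves" = no ] && [ "$m_type" = random ]; then
        m_mac=$s_mac
        m_type=stolen
    fi
    # The slave is then given the master's MAC; its type becomes "set".
    echo "$m_mac $m_type $m_mac set"
}

# Order from the reproducer: bond0 joins bond1 before it has any NIC slaves.
bond_first() {
    set -- $(enslave MAC1 random no MAC0)       # bond1 <- bond0
    bond0_mac=$3
    bond0_type=$4
    set -- $(enslave "$bond0_mac" "$bond0_type" no NIC0)   # bond0 <- ens1f0
    echo "$1"                                   # bond0's final MAC
}

# Alternative order: bond0 gets its first NIC while its type is still random.
nic_first() {
    set -- $(enslave MAC0 random no NIC0)       # bond0 <- ens1f0
    echo "$1"                                   # bond0's final MAC
}

echo "bond0 enslaved before its NICs: bond0 ends up with $(bond_first)"
echo "NIC enslaved to bond0 first:    bond0 ends up with $(nic_first)"
```

In this model the first order leaves bond0 with the random MAC0, while the second gives it the NIC's address, which matches the ordering dependence described in this comment.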

Comment 10 Hangbin Liu 2023-06-13 09:38:38 UTC
As a workaround, you can set fail_over_mac to "follow" on both bond0 and bond1.
This avoids changing bond0's address type, so when other interfaces are enslaved to bond0, bond0 can still set its MAC to the real NIC's MAC address.
fail_over_mac has no actual effect on bond0 (mode 4), since the option is only supported in active-backup mode. This is just a trick.

Note: after bringing bond0 up, you should wait a while until it is "really" up, and only then bring bond1 up and add the other slaves. Otherwise bond1 will still take bond0's random MAC address instead of the real NIC's address.
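
One way to express "wait until bond0 is really up" is to poll its address type before starting bond1. The sketch below assumes that addr_assign_type reading 2 (NET_ADDR_STOLEN) indicates bond0 has taken a slave's address, as in the outputs in the description. That is an observation from this report, not a documented guarantee, and the function name wait_for_stolen_mac and the optional sysfs-root argument are made up for illustration.

```shell
# Hypothetical helper: wait until DEV's addr_assign_type reads 2.
# $1 = device, $2 = timeout in seconds, $3 = sysfs root (default /sys/class/net)
wait_for_stolen_mac() {
    dev=$1
    remaining=$2
    root=${3:-/sys/class/net}
    while :; do
        if [ "$(cat "$root/$dev/addr_assign_type" 2>/dev/null)" = "2" ]; then
            return 0    # address was taken from another device
        fi
        if [ "$remaining" -le 0 ]; then
            return 1    # timed out
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Example on a live system (interface names from this report):
#   wait_for_stolen_mac bond0 30 && ifup bond1
```

Polling with a timeout rather than sleeping for a fixed interval keeps the delay short when bond0 settles quickly and still fails visibly when it never does.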

Comment 11 Hangbin Liu 2023-06-13 09:52:16 UTC
> Then you create bond0, also with a random MAC address. Let's call it MAC0.
> 
> After that, you set bond0's master to bond1. When bond0 is enslaved to bond1,
> bond1 sets its MAC address to bond0's, because bond0 is its first slave. So bond1 will have MAC0.
>
> Then bond0 stores its MAC address as perm_hwaddr and sets its MAC to bond1's MAC, which is still MAC0.
> At the same time, though, bond0's address type becomes NET_ADDR_SET.

The root cause of this race condition is that you enslave one bond to another bond while both still have random MAC addresses.

Comment 12 Andrew Schorr 2023-06-13 16:02:44 UTC
Regarding Comment #10: that is correct. I do in fact already use fail_over_mac=follow,
as you can see in the configs above. The problem is that bond1 is started too soon.
I believe this is a NetworkManager issue, which is why I opened https://bugzilla.redhat.com/show_bug.cgi?id=2188963

I have a current workaround which is to set ONBOOT=no for interfaces that have slaves that
are bonds. For some reason, NetworkManager decides to bring up the interface anyway,
but it apparently induces just enough of a delay for the real MAC address to get assigned
to bond0 before bond1 steals it.

Regards,
Andy