Bug 1860479

Summary:

Unable to attach VLAN-based logical networks to a bond

Product:

Red Hat Enterprise Linux 8

Reporter:

Mark R. <rhbugzilla>

Component:

kernel

Assignee:

Jonathan Toppins <jtoppins>

kernel sub component:

NIC Drivers

QA Contact:

LiLiang <liali>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

unspecified

CC:

amusil, anantha.subramanyam, ashutosh.kumar, astupnik, atragler, bgalvani, bugs, cgaynor, dholler, dhoward, forestia, gcase, gconsalv, goutham-konaghatta.vijayakumar, ivecera, jbainbri, jbenc, jcastran, jiji, jtoppins, kzhang, liali, linville, markus.falb, mchan, network-qe, pelauter, ptalbert, rmetrich, simon, tredaelli, vasundhara-v.volam

Version:

8.2

Keywords:

ZStream

Target Milestone:

Flags:

pm-rhel: mirror+

Target Release:

8.4

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

kernel-4.18.0-240.6.el8

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1886017

Environment:

Last Closed:

2021-05-18 13:54:36 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Network

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1886017, 1906080

Attachments:

Description	Flags
Logs from supervdsm during attempt	none
Output from NetworkManager during attempt	none
output of 'ip monitor link' during attempt to attach network	none
output of 'journalctl' for the 5 minute window around network addition	none
Output from journalctl after adding some further debugging, then attaching network to bond	none
Manually configuring the interfaces	none
Output of 'perf script' after attempting changes	none
Same failure using nmstatectl directly	none
Whoops, the issue _is_ triggered with just the iproute2 utils, no NM involved...	none

Description Mark R. 2020-07-24 18:09:14 UTC

Created attachment 1702375 [details]
Logs from supervdsm during attempt

Description of problem:

Note: Filing this against current 4.4.1 release, but I have had this issue with all 4.4.x attempts.

PowerEdge R6525, freshly installed and updated CentOS 8.2.2004 host (minimal install selected). I have two dedicated iSCSI interfaces, and a bond0 interface made up of two BCM57414 NetXtreme-E 10Gb/25Gb adapters. The bond is mode 4 (LACP) and is functioning as expected from both the OS's and the switches' point of view.

I do the hosted-engine deployment via 'hosted-engine --deploy' from CLI, keeping defaults everywhere possible. When it's complete my bond0 interface is now attached to the ovirtmgmt bridge. I attempt to create a new VLAN-based logical network and assign it to the same bond as ovirtmgmt from the "Setup Host Interfaces" UI. This results in an error:

    "HostSetupNetworksVDS failed: Internal JSON-RPC error:
    {'reason': 'Unexpected failure of libnm when running the mainloop: run
    execution'}"

I haven't found a way past this yet. I did try a newer NetworkManager, v1.26 from a COPR repo as it was suggested on the list, but the issue persists. I considered the possibility that creating the bond manually outside of oVirt was part of the problem, so I reinstalled the host with just a single interface for ovirtmgmt. Then in the UI I created a new LACP bond, and then created a VLAN-based network and tried to add it but the same error arises.

I can wipe and redeploy the host with the latest CentOS 7 / oVirt 4.3, do the same network configuration I'm attempting here, and it works, so this appears to be an issue unique to 4.4.



Version-Release number of selected component (if applicable):
vdsm-4.40.22-1.el8.x86_64
NetworkManager-1.22.14-1.el8.x86_64
NetworkManager-libnm-1.22.14-1.el8.x86_64


How reproducible:
See steps to reproduce...

Steps to Reproduce:
1. Deploy 4.4.x hosted-engine on freshly built CentOS 8 
2. Create a VLAN-based logical network
3. Attempt to add that logical network onto bond with ovirtmgmt


Actual results:
Unable to add networks to the bond, "HostSetupNetworksVDS failed: Internal JSON-RPC error: {'reason': 'Unexpected failure of libnm when running the mainloop: run execution'}"

Expected results:
A new usable VLAN-based network for VMs

Additional info:
Attached relevant supervdsm.log lines and NetworkManager logs

Comment 1 Mark R. 2020-07-24 18:12:16 UTC

Created attachment 1702376 [details]
Output from NetworkManager during attempt

Comment 2 Beniamino Galvani 2020-07-28 08:40:13 UTC

Hi,

would it be possible for you to run the following commands before step 2:

 nmcli general logging level TRACE
 ip -ts monitor link > ip-link.txt

Then, after the failure, stop the monitoring with Control-C, attach the ip-link.txt file and the output of 'journalctl --since="5 minutes ago"'. Thank you.

Comment 3 Mark R. 2020-07-28 14:55:11 UTC

Created attachment 1702671 [details]
output of 'ip monitor link' during attempt to attach network

Comment 4 Mark R. 2020-07-28 14:57:41 UTC

Created attachment 1702672 [details]
output of 'journalctl' for the 5 minute window around network addition

Both requested logs attached, happy to help pursue the issue however I can. Thanks!

Comment 5 Beniamino Galvani 2020-07-28 18:01:01 UTC

The problem seems to be related to a failure to add the VLAN to the bridge,
as reported by the kernel:

 NetworkManager[2401]: <debug> [1595947746.3755] platform: (bond0.22) link: enslaving to master 'DMZ'
 kernel: DMZ: port 1(bond0.22) entered blocking state
 kernel: DMZ: port 1(bond0.22) entered disabled state
 NetworkManager[2401]: <debug> [1595947746.3758] platform-linux: do-request-link: 30
 NetworkManager[2401]: <trace> [1595947746.3758] platform-linux: event-notification: RTM_NEWLINK, flags 0, seq 0: 30: bond0.22@14 <UP,LOWER_UP;broadcast,multicast,up,running,lowerup> mtu 1500 master 29 arp 1 vlan* not-init addrgenmode none addr BC:97:E1:24:BA:60 brd FF:FF:FF:FF:FF:FF rx:0,0 tx:0,0; vlan 22 flags 0x1
 NetworkManager[2401]: <debug> [1595947746.3758] platform: (bond0.22) signal: link changed: 30: bond0.22@14 <UP,LOWER_UP;broadcast,multicast,up,running,lowerup> mtu 1500 master 29 arp 1 vlan* init addrgenmode none addr BC:97:E1:24:BA:60 brd FF:FF:FF:FF:FF:FF driver vlan rx:0,0 tx:0,0
 NetworkManager[2401]: <debug> [1595947746.3758] device[e22d004ce19669d6] (bond0.22): queued link change for ifindex 30
 NetworkManager[2401]: <trace> [1595947746.3758] platform-linux: event-notification: RTM_NEWLINK, flags 0, seq 0: 30: bond0.22@14 <UP,LOWER_UP;broadcast,multicast,up,running,lowerup> mtu 1500 arp 1 vlan* not-init addrgenmode none addr BC:97:E1:24:BA:60 brd FF:FF:FF:FF:FF:FF rx:0,0 tx:0,0; vlan 22 flags 0x1
 NetworkManager[2401]: <debug> [1595947746.3759] platform: (bond0.22) signal: link changed: 30: bond0.22@14 <UP,LOWER_UP;broadcast,multicast,up,running,lowerup> mtu 1500 arp 1 vlan* init addrgenmode none addr BC:97:E1:24:BA:60 brd FF:FF:FF:FF:FF:FF driver vlan rx:0,0 tx:0,0
 NetworkManager[2401]: <debug> [1595947746.3759] platform-linux: netlink: recvmsg: error message from kernel: No data available (61) for request 526

ENODATA seems an unusual error code, I don't understand where it comes
from. Also, looking at the iproute output, bond0.22 is added to the
bridge for less than 1ms and then removed immediately:

 [2020-07-28T14:49:06.463305] 31: bond0.22@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
     link/ether bc:97:e1:24:ba:60 brd ff:ff:ff:ff:ff:ff
 [2020-07-28T14:49:06.494630] 31: bond0.22@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master DMZ state UNKNOWN group default
     link/ether bc:97:e1:24:ba:60 brd ff:ff:ff:ff:ff:ff
 [2020-07-28T14:49:06.495180] 31: bond0.22@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
     link/ether bc:97:e1:24:ba:60 brd ff:ff:ff:ff:ff:ff
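
As a quick sanity check on the "less than 1ms" figure, the last two timestamps from the 'ip -ts monitor link' output above can simply be subtracted. This is only a sketch; the helper name micros_between is made up for illustration and is not part of any tool:

```python
from datetime import datetime, timedelta

# Timestamp layout used by 'ip -ts monitor link' in the output above.
FMT = "%Y-%m-%dT%H:%M:%S.%f"


def micros_between(start: str, end: str) -> int:
    """Return the whole number of microseconds elapsed from start to end."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta // timedelta(microseconds=1)


enslaved = "2020-07-28T14:49:06.494630"  # 'master DMZ' shows up
released = "2020-07-28T14:49:06.495180"  # 'master DMZ' is gone again
print(micros_between(enslaved, released), "microseconds as a bridge port")
```

The difference works out to 550 microseconds, so the port really is attached and detached within a single millisecond.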

Could you please try again after running these commands:

 nmcli general logging level TRACE
 echo 'file net/bridge/* +p' > /sys/kernel/debug/dynamic_debug/control
 echo 'file net/8021q/* +p' > /sys/kernel/debug/dynamic_debug/control

The last two commands should add more information to the kernel
output in the journal. Then, attach the journal as usual.

I wonder if the same error happens when configuring interfaces
manually. Could you please try these commands starting from a clean
configuration (no bridges, vlan, bonds) and attach the output?:

 DEV1=eno33np0
 DEV2=ens2f0np0

 ip link add ovirtmgmt type bridge
 ip link add DMZ0 type bridge
 ip link add bond0 type bond mode 802.3ad
 ip link add link bond0 bond0.22 type vlan id 22

 ip link set ovirtmgmt up
 ip link set DMZ0 up
 ip link set bond0 up
 ip link set bond0.22 up

 ip link set $DEV1 down
 ip link set $DEV2 down
 ip link set $DEV1 master bond0
 ip link set $DEV2 master bond0
 ip link set $DEV1 up
 ip link set $DEV2 up

 ip link set bond0 master ovirtmgmt
 ip link set bond0.22 master DMZ0

 echo
 ip link

Thank you.

Comment 6 Mark R. 2020-07-28 18:37:59 UTC

Created attachment 1702711 [details]
Output from journalctl after adding some further debugging, then attaching network to bond

Here's the journalctl output with enhanced debugging. I'll send along the results of the manual interface setup soon.

Comment 7 Mark R. 2020-07-28 19:09:59 UTC

Created attachment 1702718 [details]
Manually configuring the interfaces

I dropped all of the requested commands into a script. After destroying all bridges/bonds and then running the script, the bond wasn't up/ready for a second or so, but did settle and I was then able to throw an IP address onto ovirtmgmt and had all normal network connectivity. Brought up the hosted-engine and could access it and tcpdump showed expected broadcast-type traffic arriving on bond0.22.

Comment 8 Beniamino Galvani 2020-07-29 11:54:06 UTC

Thanks for the additional information. Unfortunately there are no useful messages from the kernel. We'll need to add some probes using perf to see why the bridge port addition fails. Can you try these commands?

 dnf --enablerepo=base-debuginfo install kernel-debuginfo-common kernel-debuginfo perf -y

 perf probe -m bridge --add 'br_add_if%return ret=$retval'
 perf probe -m bridge --add 'br_sysfs_addif%return ret=$retval'
 perf probe           --add 'netdev_rx_handler_register%return ret=$retval'
 perf probe           --add 'netdev_master_upper_dev_link%return ret=$retval'
 perf probe -m bridge --add 'nbp_switchdev_mark_set%return ret=$retval'
 perf probe -m bridge --add 'nbp_vlan_init%return ret=$retval'

 perf record -e probe:br_add_if__return,probe:br_sysfs_addif__return,probe:netdev_rx_handler_register__return,probe:netdev_master_upper_dev_link__return,probe:nbp_switchdev_mark_set__return,probe:nbp_vlan_init__return -aR sleep 300

 # now from another console configure the network with nmstate/NM and wait it finishes

 # then interrupt perf recording with Ctrl-C and attach the output of 'perf script'

Thank you.

Comment 9 Beniamino Galvani 2020-07-29 11:55:36 UTC

In the instructions above, note that "perf record -e ... -aR sleep 300" is all on the same line.

Comment 10 Mark R. 2020-07-29 13:39:00 UTC

Hello Beniamino,

Can you just clarify this for me, "from another console configure the network with nmstate/NM and wait it finishes"... is the hope here to capture the same failure seen when modifying the network via oVirt? When the configuration is done from the command line it works, so once I have completed the 'perf record ....' step should I attempt the changes in oVirt (hitting the issue) or command line via 'ip' command?

Thanks,
Mark

Comment 11 Beniamino Galvani 2020-07-29 14:07:40 UTC

> is the hope here to capture the same failure seen when modifying the network via oVirt?

Yes, the idea is to capture why the configuration fails.

> should I attempt the changes in oVirt (hitting the issue) 

Yes, please do the config through oVirt.

Comment 12 Mark R. 2020-07-29 14:29:43 UTC

Created attachment 1702829 [details]
Output of 'perf script' after attempting changes

OK, I suspected that was the case, just wanted to be sure. I've attached the results.

Comment 13 Beniamino Galvani 2020-07-29 15:12:29 UTC

Ok, as suspected it's nbp_switchdev_mark_set() returning -ENODATA:

  NetworkManager  2395 [014]   683.265983:       probe:nbp_switchdev_mark_set__return: (ffffffffc027e980 <- ffffffffc026c820) ret=0xffffffc3
  NetworkManager  2395 [015]   683.266147:                    probe:br_add_if__return: (ffffffffc026c570 <- ffffffff99b292a6) ret=0xffffffc3

I don't know why the failure happens only when using NetworkManager
and not with iproute; however it looks like a kernel issue and as such
I think it should be reassigned to kernel to be investigated.
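
For anyone reading the perf output: ret=0xffffffc3 is a negative errno printed as an unsigned 32-bit value. A small sketch of the decoding follows. The helper name to_signed32 is made up for illustration, and the errno name lookup assumes Linux numbering, where 61 is ENODATA:

```python
import errno


def to_signed32(raw: int) -> int:
    """Reinterpret an unsigned 32-bit value as a two's-complement signed int."""
    raw &= 0xFFFFFFFF
    return raw - (1 << 32) if raw & 0x80000000 else raw


ret = to_signed32(0xFFFFFFC3)
# Kernel functions return -errno on failure, so negate before the lookup.
name = errno.errorcode.get(-ret, "unknown")
print(ret, name)
```

On a Linux host this should report -61 and ENODATA, which lines up with the "No data available (61)" netlink error in comment 5.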

Comment 14 Mark R. 2020-07-30 20:36:35 UTC

Created attachment 1702997 [details]
Same failure using nmstatectl directly

I agree that this doesn't appear to be oVirt/VDSM, because using 'nmstatectl' directly on the host with a json file that has the desired configuration (adding VLAN 22 to the bond and creating a 'LegacyDMZ' bridge for it) fails in exactly the same way. I've attached the output of that attempt. So, NetworkManager or kernel issue, is there further info I can get from this host to help diagnose?

Comment 15 Mark R. 2020-07-30 21:54:42 UTC

Created attachment 1703004 [details]
Whoops, the issue _is_ triggered with just the iproute2 utils, no NM involved...

I have to apologize for the bad info in a previous post. You asked me to start from a clean slate and use the 'ip' command to create the network configuration manually. I removed bonds, bridges, vlans, etc. with 'ip link delete', but that must not have cleaned up as much as I thought.

After completely reloading the host again, with just CentOS 8.2.2004 minimal and never having had any bonds/bridges/vlans created, stepping through the 'ip' commands you requested _does_ indeed trigger the same issue. You can't attach a bond to a bridge; it even fails when just attaching bond0 to ovirtmgmt (I kept the name for consistency, but oVirt isn't installed here).

Output is attached.

Comment 16 Mark R. 2020-07-30 22:41:31 UTC

This is repeatable from the boot media for me as well.  So the steps to reproduce seem to be:

1. Boot 8.2.2004 installation media
2. Switch to tty2 for a shell
3. Verify no interfaces are configured at all
4. Issue these commands, creating a bridge and a bond and attempting to attach bond to bridge:

  ip link add mybridge type bridge
  ip link add bond0 type bond mode 802.3ad
  ip link set mybridge up
  ip link set bond0 up
  ip link set eno33np0 down
  ip link set ens2f0np0 down
  ip link set eno33np0 master bond0
  ip link set ens2f0np0 master bond0
  ip link set eno33np0 up
  ip link set ens2f0np0 up
  ip link set bond0 master mybridge
    RTNETLINK answers: No data available

Booting with either the 7.8.2003 or 8.1.1911 installation media, the above steps work and you end up with the desired network configuration.

Comment 17 Mark R. 2020-07-31 01:26:59 UTC

Could this be driver specific? I can do the steps above, booting 8.2.2004 install media.  Create a bond from the 10/25Gb interfaces as bond0 as in the steps above.  I create a second bond as bond1, also 802.3ad, using a pair of 1Gb interfaces that aren't currently connected. With both bonds up, interfaces assigned, and bridge 'mybridge' created:

  ip link set bond1 master mybridge   # This works just fine  (tg3)
  ip link set bond1 nomaster
  ip link set bond0 master mybridge
    RTNETLINK answers: No data available  # Failure when using the 10/25Gb interfaces (bnxt_en)

If I remove *either* of the bnxt_en interfaces from bond0, I can then attach/detach it from bridges at will. Once it's attached, I can re-add the removed interface and it continues to function. However, 'ip link set bond0 nomaster ; ip link set bond0 master mybridge' then fails. Basically, it seems that if a bond has more than one bnxt_en interface it can't be attached to a bridge, but it will attach if it has only one bnxt_en, and will even attach if I use one bnxt_en and one tg3 interface (of course, that can't really work in LACP mode because of the mismatched speeds, but I tried it as a test and it still let me put that bond onto a bridge).

Comment 18 Mark R. 2020-07-31 14:28:12 UTC

Further experimenting shows that the failure also seems to require the bond to use bnxt_en interfaces from two different cards. If I create the bond from both interfaces of a single card (using either eno33np0 + eno34np1, or ens2f0np0 + ens2f1np1) I can assign the bond to a bridge w/o an error. Using one interface from each card triggers it every time.

The cards are:
  63:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
  63:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
  a1:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
  a1:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)

The bnxt_en module reports version 1.10.0, which matches the version from 8.1 and 7.8. I suppose that may squash the idea of it being a driver issue.

Comment 19 Patrick Talbert 2020-08-03 15:12:40 UTC

Hi Mark,

I *think* "CentOS 8.2.2004" corresponds to RHEL 8.2 so I'm assuming the issue is happening with some 4.18.0-193.el8 series kernel?

If so, can you try installing an older 4.18.0-147.el8 kernel and testing with it? RHEL 8.2 includes a pretty big overhaul of switchdev which seems relevant here.


For the bond and bnxt interfaces can you provide the output of:

$ cat /sys/class/net/<device>/phys_switch_id


Then also:

# devlink dev show
# devlink dev eswitch show
# devlink port show






nbp_switchdev_mark_set() is part of bridge forward mark support:

https://www.kernel.org/doc/html/latest/networking/switchdev.html#switch-id
https://www.kernel.org/doc/html/latest/networking/switchdev.html#flooding-l2-domain


- When a device is added to a bridge, br_add_if() calls nbp_switchdev_mark_set(). If the call returns anything but 0 (zero) it unlinks the device from the bridge.

  /* called with RTNL */
  int br_add_if(struct net_bridge *br, struct net_device *dev,
                struct netlink_ext_ack *extack)
  {
          struct net_bridge_port *p;
          int err = 0;
          unsigned br_hr, dev_hr;
          bool changed_addr;
  .....
>         err = nbp_switchdev_mark_set(p);
>         if (err)
>                 goto err6;
  .....
> err6:
>         netdev_upper_dev_unlink(dev, br->dev);
  err5:
          dev->priv_flags &= ~IFF_BRIDGE_PORT;
          netdev_rx_handler_unregister(dev);
  err4:
          br_netpoll_disable(p);
  err3:
          sysfs_remove_link(br->ifobj, p->dev->name);
  err2:
          kobject_put(&p->kobj);
          p = NULL; /* kobject_put frees */
  err1:
          dev_set_allmulti(dev, -1);
  put_back:
          dev_put(dev);
          kfree(p);
          return err;
  }



In this case we have a bond device being added to a bridge so the net_bridge_port struct passed into nbp_switchdev_mark_set() by br_add_if() is linked to the bond's net_device.

Note that with the logic here in any release, there are three basic possibilities:

1. The passed in device itself (the bond) triggers an -EOPNOTSUPP (which is ignored) and we move on without issue.
2. The passed in device itself (the bond) somehow provides a Switch ID. As far as I can tell the bonding driver can't do this.
3. The Switch IDs of the bond's lower devs are recursively probed. If at any point it is found that the Switch ID of one lower dev does not match another then -ENODATA is returned. Otherwise either the common Switch ID or -EOPNOTSUPP (which is ignored) is returned.


For the bond case we presumably fall into #3 and end up checking the lower devs' Switch IDs.


Prior to RHEL 8.1 (including all of RHEL7), nbp_switchdev_mark_set() would use a call to switchdev_port_attr_get() to get the Switch ID for the given port (netdev).

In RHEL 8.1, nbp_switchdev_mark_set() is updated to first check whether the netdev has an ndo_get_port_parent_id() handler set and, if so, use that. But bnxt in RHEL 8.1 does not set up such a handler, so we fall back to calling switchdev_port_attr_get() and the attempt to retrieve a Switch ID is performed in the same way as in 8.0.

In RHEL 8.2, further backports change the logic of nbp_switchdev_mark_set() to no longer use switchdev_port_attr_get()...


Keep in mind that for all of these releases, if -EOPNOTSUPP is returned back to nbp_switchdev_mark_set() then it is ignored and the function returns 0 back to br_add_if().


Here in RHEL 8.0 (kernel-4.18.0-80.el8) we can see that nbp_switchdev_mark_set() calls switchdev_port_attr_get() at line 34:

 24 int nbp_switchdev_mark_set(struct net_bridge_port *p)
 25 {
 26         struct switchdev_attr attr = {
 27                 .orig_dev = p->dev,
 28                 .id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
 29         };
 30         int err;
 31 
 32         ASSERT_RTNL();
 33 
>34         err = switchdev_port_attr_get(p->dev, &attr);
 35         if (err) {
 36                 if (err == -EOPNOTSUPP)
 37                         return 0;
 38                 return err;
 39         }
 40 
 41         p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev);
 42 
 43         return 0;
 44 }

- In switchdev_port_attr_get(), if the netdev's switchdev_ops switchdev_port_attr_get function pointer is not NULL then it is invoked and the result returned back to the caller:

  /**
   *      switchdev_port_attr_get - Get port attribute
   *
   *      @dev: port device
   *      @attr: attribute to get
   */
  int switchdev_port_attr_get(struct net_device *dev, struct switchdev_attr *attr)
  {
          const struct switchdev_ops *ops = dev->switchdev_ops;
          struct net_device *lower_dev;
          struct list_head *iter;
          struct switchdev_attr first = {
                  .id = SWITCHDEV_ATTR_ID_UNDEFINED
          };
          int err = -EOPNOTSUPP;

>         if (ops && ops->switchdev_port_attr_get)
>                 return ops->switchdev_port_attr_get(dev, attr);
  .....

- For bnxt in RHEL8.0, switchdev_port_attr_get() is set and points to bnxt_swdev_port_attr_get(). It simply calls bnxt_port_attr_get():

  static const struct switchdev_ops bnxt_switchdev_ops = {
          .switchdev_port_attr_get        = bnxt_swdev_port_attr_get
  };

  static int bnxt_swdev_port_attr_get(struct net_device *dev,
                                      struct switchdev_attr *attr)
  {
          return bnxt_port_attr_get(netdev_priv(dev), attr);
  }

- Basically, bnxt_port_attr_get() is either going to copy the switch ID into the passed-in switchdev_attr struct OR return -EOPNOTSUPP:

  int bnxt_port_attr_get(struct bnxt *bp, struct switchdev_attr *attr)
  {
          if (bp->eswitch_mode != DEVLINK_ESWITCH_MODE_SWITCHDEV)
                  return -EOPNOTSUPP;

          /* The PF and it's VF-reps only support the switchdev framework */
          if (!BNXT_PF(bp))
                  return -EOPNOTSUPP;

          switch (attr->id) {
          case SWITCHDEV_ATTR_ID_PORT_PARENT_ID:
                  attr->u.ppid.id_len = sizeof(bp->switch_id);
                  memcpy(attr->u.ppid.id, bp->switch_id, attr->u.ppid.id_len);
                  break;
          default:
                  return -EOPNOTSUPP;
          }
          return 0;
  }


In RHEL 8.1, nbp_switchdev_mark_set() does not just call switchdev_port_attr_get(). Instead, it checks if the netdev's net_device_ops ndo_get_port_parent_id function pointer member is set and if so, uses that. Otherwise, it falls back to calling switchdev_port_attr_get(). However, in RHEL 8.1 the bnxt driver does not have a handler for ndo_get_port_parent_id registered so the basic logic flow is the same as with 8.0. Further, how bnxt handles switchdev_port_attr_get is identical to 8.0.

  int nbp_switchdev_mark_set(struct net_bridge_port *p)
  {
          const struct net_device_ops *ops = p->dev->netdev_ops;
          struct switchdev_attr attr = {
                  .orig_dev = p->dev,
                  .id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
          };
          int err;

          ASSERT_RTNL();

          if (ops->ndo_get_port_parent_id)
                  err = dev_get_port_parent_id(p->dev, &attr.u.ppid, true);
          else
                  err = switchdev_port_attr_get(p->dev, &attr);
          if (err) {
                  if (err == -EOPNOTSUPP)
                          return 0;
                  return err;
          }

          p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev);

          return 0;
  }


Note then that in RHEL 8.0 & RHEL 8.1 with a bnxt PF there is no way for the above logic to result in -ENODATA bubbling back up to br_add_if().



In RHEL 8.2, nbp_switchdev_mark_set() only uses dev_get_port_parent_id():

  int nbp_switchdev_mark_set(struct net_bridge_port *p)
  {
          struct netdev_phys_item_id ppid = { };
          int err;

          ASSERT_RTNL();

          err = dev_get_port_parent_id(p->dev, &ppid, true);
          if (err) {
                  if (err == -EOPNOTSUPP)
                          return 0;
                  return err;
          }

          p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev);

          return 0;
  }


- Now dev_get_port_parent_id() itself checks the dev's net_device_ops for ndo_get_port_parent_id and uses it if it is set up. Next it will try devlink_compat_switch_id_get(). Otherwise it roughly integrates the same logic from the bottom of the switchdev_port_attr_get() function:

7635 /**
7636  *      dev_get_port_parent_id - Get the device's port parent identifier
7637  *      @dev: network device
7638  *      @ppid: pointer to a storage for the port's parent identifier
7639  *      @recurse: allow/disallow recursion to lower devices
7640  *
7641  *      Get the devices's port parent identifier
7642  */
7643 int dev_get_port_parent_id(struct net_device *dev, 
7644                            struct netdev_phys_item_id *ppid,
7645                            bool recurse)
7646 {
7647         const struct net_device_ops *ops = dev->netdev_ops;
7648         struct netdev_phys_item_id first = { };
7649         struct net_device *lower_dev;
7650         struct list_head *iter;
7651         int err;
7652 
7653         if (ops->ndo_get_port_parent_id) {
7654                 err = ops->ndo_get_port_parent_id(dev, ppid);
7655                 if (err != -EOPNOTSUPP)
7656                         return err; 
7657         }
7658 
7659         err = devlink_compat_switch_id_get(dev, ppid);
7660         if (!err || err != -EOPNOTSUPP)
7661                 return err;
7662 
7663         if (!recurse)
7664                 return -EOPNOTSUPP;
7665 
7666         netdev_for_each_lower_dev(dev, lower_dev, iter) {
7667                 err = dev_get_port_parent_id(lower_dev, ppid, recurse);
7668                 if (err)
7669                         break;
7670                 if (!first.id_len)
7671                         first = *ppid;
7672                 else if (memcmp(&first, ppid, sizeof(*ppid)))
7673                         return -ENODATA;
7674         }
7675 
7676         return err;
7677 }
7678 EXPORT_SYMBOL(dev_get_port_parent_id);


- With RHEL 8.2, bnxt only registers a handler for ndo_get_port_parent_id for VFs, not for PFs.

- When devlink_compat_switch_id_get() is called it checks the netdev's netdev_ops for an ndo_get_devlink_port handler, and this is something bnxt has: it's bnxt_get_devlink_port().

- The logic in devlink_compat_switch_id_get() is then either going to return -EOPNOTSUPP or, for bnxt, copy out whatever the attrs.switch_id member is set to in the private struct bnxt->dl_port member (a devlink_port struct) that bnxt_get_devlink_port() returns a pointer to.

  int devlink_compat_switch_id_get(struct net_device *dev,
                                   struct netdev_phys_item_id *ppid)
  {
          struct devlink_port *devlink_port;

          /* Caller must hold RTNL mutex or reference to dev, which ensures that
           * devlink_port instance cannot disappear in the middle. No need to take
           * any devlink lock as only permanent values are accessed.
           */
          devlink_port = netdev_to_devlink_port(dev);
          if (!devlink_port || !devlink_port->attrs.switch_port)
                  return -EOPNOTSUPP;

          memcpy(ppid, &devlink_port->attrs.switch_id, sizeof(*ppid));

          return 0;
  }

  static inline struct devlink_port *
  netdev_to_devlink_port(struct net_device *dev)
  {
          if (dev->netdev_ops->ndo_get_devlink_port)
                  return dev->netdev_ops->ndo_get_devlink_port(dev);
          return NULL;
  }

  static const struct net_device_ops bnxt_netdev_ops = {
          .ndo_open               = bnxt_open,
          .ndo_start_xmit         = bnxt_start_xmit,
          .ndo_stop               = bnxt_close,
  ......
          .ndo_bridge_getlink     = bnxt_bridge_getlink,
          .ndo_bridge_setlink     = bnxt_bridge_setlink,
>         .ndo_get_devlink_port   = bnxt_get_devlink_port,
  };

  static struct devlink_port *bnxt_get_devlink_port(struct net_device *dev)
  {
          struct bnxt *bp = netdev_priv(dev);

          return &bp->dl_port;
  }



So after all of the above, in the RHEL8.2 case with a bnxt PF we can see that the only way for nbp_switchdev_mark_set() to return an -ENODATA back to br_add_if() is if in dev_get_port_parent_id() we enter the netdev_for_each_lower_dev() "loop" at lines 7666-7674.

- Here we're iterating over each lower netdev, taking its ppid and if it doesn't match the prior one (memcmp is non-zero) we return -ENODATA.

.....
7666         netdev_for_each_lower_dev(dev, lower_dev, iter) {
7667                 err = dev_get_port_parent_id(lower_dev, ppid, recurse);
7668                 if (err)
7669                         break;
7670                 if (!first.id_len)
7671                         first = *ppid;
7672                 else if (memcmp(&first, ppid, sizeof(*ppid)))
7673                         return -ENODATA;
7674         }
7675 
7676         return err;
7677 }
7678 EXPORT_SYMBOL(dev_get_port_parent_id);



My *rough* assumption is that prior to RHEL 8.2 the bnxt interfaces returned -EOPNOTSUPP whereas with 8.2 the bridge is finally made aware that the two bond ports do not have the same Switch ID.
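
To make that assumption a bit more concrete, here is a toy user-space model of the 8.2 control flow quoted above. It is a sketch, not kernel code: the NetDev class and the switch ID values are invented for illustration, and the ndo/devlink lookups are collapsed into a single optional switch_id field.

```python
import errno
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NetDev:
    """Toy stand-in for struct net_device; not a real kernel or library type."""
    name: str
    switch_id: Optional[bytes] = None  # ID the driver/devlink reports, if any
    lowers: List["NetDev"] = field(default_factory=list)


def dev_get_port_parent_id(dev: NetDev, recurse: bool = True) -> Tuple[int, Optional[bytes]]:
    """Return (err, ppid), following the control flow of the quoted function."""
    if dev.switch_id is not None:  # the device itself reports a switch ID
        return 0, dev.switch_id
    if not recurse:
        return -errno.EOPNOTSUPP, None
    err, ppid, first = -errno.EOPNOTSUPP, None, None
    for lower in dev.lowers:
        err, ppid = dev_get_port_parent_id(lower, recurse)
        if err:
            break  # the first failing lower dev ends the walk
        if first is None:
            first = ppid
        elif first != ppid:
            return -errno.ENODATA, None  # two lower devs disagree
    return (err, None) if err else (0, ppid)


def nbp_switchdev_mark_set(port_dev: NetDev) -> int:
    """-EOPNOTSUPP is swallowed; any other error goes back to br_add_if()."""
    err, _ = dev_get_port_parent_id(port_dev)
    return 0 if err == -errno.EOPNOTSUPP else err


# Hypothetical switch IDs; only whether they match matters here.
no_ids = NetDev("bond0", lowers=[NetDev("p0"), NetDev("p1")])
matching = NetDev("bond0", lowers=[NetDev("p0", b"\x01"), NetDev("p1", b"\x01")])
mismatched = NetDev("bond0", lowers=[NetDev("p0", b"\x01"), NetDev("p1", b"\x02")])
mixed = NetDev("bond0", lowers=[NetDev("p0", b"\x01"), NetDev("p1")])

for bond in (no_ids, matching, mismatched, mixed):
    print([d.switch_id for d in bond.lowers], nbp_switchdev_mark_set(bond))
```

In this model only the bond whose two ports report different IDs produces -ENODATA. A bond whose ports report no ID at all, or where one port reports an ID and the other does not, still ends up returning 0, which at least does not contradict the bnxt_en + tg3 observation from comment 17.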

But I fully expect some switchdev expert to pop in here and blow this all away.

Patrick

Comment 20 Mark R. 2020-08-03 17:57:47 UTC

Hello Patrick,

> the issue is happening with some 4.18.0-193.el8 series kernel?

Correct. I grabbed RHEL 8.2 install media and verified the same issue arises with it when trying to put this specific bond onto a bridge.

> installing an older 4.18.0-147.el8 kernel and testing with it?

Indeed, I installed 4.18.0-147.8.1.el8 from the 8.1 repos onto this 8.2 host and adding the bond to a bridge now works as expected.

I wasn't certain which kernel you wanted the additional commands run against, so I hit both of them:

#--------------------------------------------
# 4.18.0-147.8.1.el8_1.x86_64:

cat /sys/class/net/eno33np0/phys_switch_id
cat: /sys/class/net/eno33np0/phys_switch_id: Operation not supported

cat /sys/class/net/ens2f0np0/phys_switch_id
cat: /sys/class/net/ens2f0np0/phys_switch_id: Operation not supported

cat /sys/class/net/bond0/phys_switch_id
cat: /sys/class/net/bond0/phys_switch_id: Operation not supported

devlink dev show
pci/0000:63:00.0
pci/0000:63:00.1
pci/0000:a1:00.0
pci/0000:a1:00.1

devlink dev eswitch show pci/0000:63:00.0
pci/0000:63:00.0: mode legacy

devlink dev eswitch show pci/0000:63:00.1
pci/0000:63:00.1: mode legacy

devlink dev eswitch show pci/0000:a1:00.0
pci/0000:a1:00.0: mode legacy

devlink dev eswitch show pci/0000:a1:00.1
pci/0000:a1:00.1: mode legacy

devlink port show
#--------------------------------------------



#--------------------------------------------
# 4.18.0-193.14.2.el8_2.x86_64:

cat /sys/class/net/eno33np0/phys_switch_id
60ba24feffe197bc

cat /sys/class/net/ens2f0np0/phys_switch_id
40f2d0feff2826b0

cat /sys/class/net/bond0/phys_switch_id
cat: /sys/class/net/bond0/phys_switch_id: Operation not supported

devlink dev show
pci/0000:63:00.0
pci/0000:63:00.1
pci/0000:a1:00.0
pci/0000:a1:00.1

devlink dev eswitch show pci/0000:63:00.0
pci/0000:63:00.0: mode legacy

devlink dev eswitch show pci/0000:63:00.1
pci/0000:63:00.1: mode legacy

devlink dev eswitch show pci/0000:a1:00.0
pci/0000:a1:00.0: mode legacy

devlink dev eswitch show pci/0000:a1:00.1
pci/0000:a1:00.1: mode legacy

devlink port show
pci/0000:63:00.0/0: type eth netdev eno33np0 flavour physical port 0
pci/0000:63:00.1/1: type eth netdev eno34np1 flavour physical port 1
pci/0000:a1:00.0/0: type eth netdev ens2f0np0 flavour physical port 0
pci/0000:a1:00.1/1: type eth netdev ens2f1np1 flavour physical port 1
#--------------------------------------------

To get the bond and bridge functional on the 193 kernel I just have to remove one of the interfaces from the bond, add the bond to the bridge, then re-add the interface to the bond. Hopefully that still gets you valid info from these commands.
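
The workaround just described could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: attach_bond_via_single_port and the IP_CMD override are hypothetical names, it assumes a two-port bond as in this report, and it brings the port down before re-enslaving it, as is customary for bonding.

```shell
#!/usr/bin/env bash
# Workaround sketch only. attach_bond_via_single_port and IP_CMD are
# illustrative names; IP_CMD lets the commands be previewed with echo.
IP_CMD="${IP_CMD:-ip}"

# Usage: attach_bond_via_single_port BOND PORT_TO_DETACH BRIDGE
attach_bond_via_single_port() {
    local bond=$1 port=$2 bridge=$3

    # 1. Temporarily release one port so the bond no longer spans two
    #    different switch IDs.
    $IP_CMD link set "$port" nomaster || return
    # 2. Attach the bond to the bridge while only one card is involved.
    $IP_CMD link set "$bond" master "$bridge" || return
    # 3. Put the released port back into the bond.
    $IP_CMD link set "$port" down || return
    $IP_CMD link set "$port" master "$bond" || return
    $IP_CMD link set "$port" up
}
```

Running `IP_CMD=echo attach_bond_via_single_port bond0 ens2f0np0 mybridge` prints the commands instead of executing them, which is a cheap way to review the sequence before applying it as root.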

Thanks again, let me know if there's any further info/tests you'd like to have.

Mark

Comment 21 Patrick Talbert 2020-08-04 07:50:13 UTC

Hi Mark,

Thank you for this information.

From what you have shared, I think we can say that prior to RHEL 8.2, when nbp_switchdev_mark_set() called switchdev_port_attr_get() for these bnxt devices, the descent would have stopped right away in bnxt_port_attr_get() because the interfaces are in legacy mode:

int bnxt_port_attr_get(struct bnxt *bp, struct switchdev_attr *attr)
{
        if (bp->eswitch_mode != DEVLINK_ESWITCH_MODE_SWITCHDEV)
                return -EOPNOTSUPP;


Now in 8.2 we're instead calling dev_get_port_parent_id() and when it comes to these bnxt devices the call to devlink_compat_switch_id_get() actually returns a useful value. But of course, the Switch ID of two physically separate cards is not expected to be the same so it is not really surprising that the overall result is the ENODATA.

Note the netdev_for_each_lower_dev logic in dev_get_port_parent_id() is taken almost verbatim from switchdev_port_attr_get(). Here, it includes a nice comment snippet with an explanation:

.....
        /* Switch device port(s) may be stacked under
         * bond/team/vlan dev, so recurse down to get attr on
         * each port.  Return -ENODATA if attr values don't
         * compare across ports.
         */

        netdev_for_each_lower_dev(dev, lower_dev, iter) {
                err = switchdev_port_attr_get(lower_dev, attr);
                if (err)
                        break;
                if (first.id == SWITCHDEV_ATTR_ID_UNDEFINED)
                        first = *attr;
                else if (memcmp(&first, attr, sizeof(*attr)))
                        return -ENODATA;
        }
.....



I will try to get Red Hat's bnxt driver maintainer involved. In my basic understanding of switchdev, it *seems* as if bnxt should still produce an EOPNOTSUPP when in Legacy mode.


Patrick

Comment 22 Patrick Talbert 2020-08-04 08:49:34 UTC

Moving this to NIC Drivers.

Updates to the bnxt driver in RHEL 8.2 prevent the user from creating a bond out of two bnxt ports (from different physical cards) and then adding that bond to a bridge.

When a new bridge port is being set up in br_add_if(), nbp_switchdev_mark_set() is called to get the Switch ID of the new port (in this case, the bond). Prior to 8.2, the bond's lower devs (the bnxt ports) do not report a Switch ID (EOPNOTSUPP) so this activity is moot. However, now in RHEL 8.2 the bnxt driver provides an ID via its new ndo_get_devlink_port() handler. Logic in dev_get_port_parent_id() returns ENODATA if the bond's ports do not all have the same switch identifier (here, phys_switch_id).

This customer has two physical bnxt cards, each with two ports. When the customer creates a bond using ports that are not from the same card, adding that bond to a bridge fails with ENODATA:

  ip link set eno33np0 master bond0
  ip link set ens2f0np0 master bond0
  ip link set bond0 master mybridge
    RTNETLINK answers: No data available




The old logic was nbp_switchdev_mark_set() -> switchdev_port_attr_get() -> switchdev_ops->switchdev_port_attr_get, which for bnxt is bnxt_swdev_port_attr_get(), which in turn calls bnxt_port_attr_get(). bnxt_port_attr_get() immediately returns EOPNOTSUPP when the card is not in SWITCHDEV mode:

int bnxt_port_attr_get(struct bnxt *bp, struct switchdev_attr *attr)
{
        if (bp->eswitch_mode != DEVLINK_ESWITCH_MODE_SWITCHDEV)
                return -EOPNOTSUPP;
......


This customer's cards report still being in Legacy mode so it sorta seems to me that the new bnxt ndo_get_devlink_port() handler might need similar logic?

I cannot find any upstream commit or netdev list discussion about this issue.


Patrick

Comment 23 Jonathan Toppins 2020-08-04 15:18:09 UTC

Michael, is Broadcom aware of the problem described in comment #22?

Comment 24 Michael Chan 2020-08-04 15:53:40 UTC

No, I'm not aware of this issue.  These port related changes are recent upstream changes that seem to have some side effects. CC Anantha.

Comment 25 Ales Musil 2020-08-25 08:01:07 UTC

Hello,

is there any reason to keep this bug as private? It seems like other users are facing this issue
and we would like to share the progress with them.

Comment 26 Mark R. 2020-08-25 20:59:46 UTC

(In reply to Ales Musil from comment #25)
> Hello,
> 
> is there any reason to keep this bug as private? It seems like other users
> are facing this issue
> and we would like to share the progress with them.

I don't think you're asking me (I didn't set it private, I'm just the reporter), but just in case... sure thing, open 'er up.  I'd like to move to 8.2 for my virtualization hosts but this is blocking me so anything that helps is good by me. I've left the original title intact but wonder if it should be changed to accurately reflect the issue, e.g. "Can't attach a bond made from ports on multiple cards to a bridge"?

Comment 27 Dominik Holler 2020-08-26 06:07:18 UTC

(In reply to Jonathan Toppins from comment #23)
> Michael, is Broadcom aware of the problem described in comment #22?

Did adding
Group: broadcom_ccx
change the bug to private? Is this intended?
As Ales wrote in comment 25, it would be helpful if this bug were public.

Comment 32 Vasundhara Volam 2020-09-01 14:45:24 UTC

As part of the below upstream commit, switch_id is passed to devlink_port_attrs_set(), which can be called only once while registering the devlink port.

---------------
commit 6605a226781eb1224c2dcf974a39eea11862b864
Author: Jiri Pirko <jiri>
Date:   Wed Apr 3 14:24:21 2019 +0200

    bnxt: pass switch ID through devlink_port_attrs_set()

    Pass the switch ID down the to devlink through devlink_port_attrs_set()
    so it can be used by devlink_compat_switch_id_get().

    Signed-off-by: Jiri Pirko <jiri>
    Signed-off-by: David S. Miller <davem>

----------------

And the following commit removed the ndo_get_port_parent_id implementation from bnxt.

----------------
commit 56d9f4e8f70e6f47ad4da7640753cf95ae51a356
Author: Jiri Pirko <jiri>
Date:   Wed Apr 3 14:24:22 2019 +0200

    bnxt: remove ndo_get_port_parent_id implementation for physical ports

    Remove implementation of get_port_parent_id ndo and rely on core calling
    into devlink for the information directly.

    Signed-off-by: Jiri Pirko <jiri>
    Signed-off-by: David S. Miller <davem>
----------------

And here is another commit, which indicates the plan to remove the old ndo method altogether in the future.

----------------
commit 119c0b5721da9d97f95202c4ad1be2919dac64b0
Author: Jiri Pirko <jiri>
Date:   Wed Apr 3 14:24:27 2019 +0200

    net: devlink: add warning for ndo_get_port_parent_id set when not needed

    Currently if the driver registers devlink port instance, he should set
    the devlink port attributes as well. Then the devlink core is able to
    obtain switch id itself, no need for driver to implement the ndo.
    Once all drivers will implement devlink port registration, this ndo
    should be removed. This warning guides new drivers to do things as
    they should be done.

    Signed-off-by: Jiri Pirko <jiri>
    Signed-off-by: David S. Miller <davem>
-----------------

So, it requires a general fix in devlink_compat_switch_id_get() to read the switch_id only when the mode is set to SWITCHDEV, even if the driver passes the switch_id.

I am planning to start a thread with the upstream community describing the issue.

Comment 33 Dominik Holler 2020-09-01 15:23:35 UTC

*** Bug 1871161 has been marked as a duplicate of this bug. ***

Comment 34 Jonathan Toppins 2020-09-01 18:51:30 UTC

(In reply to Vasundhara Volam from comment #32)

> So, it requires a general fix in devlink_compat_switch_id_get() to read the
> switch_id only when mode is set to SWITCHDEV mode even if driver passes the
> switch_id.
> 
> I am planning to start a thread with the upstream community describing the
> issue.

Yes exactly, this was almost the conclusion I had come to on Friday but had not let Michael know. Glad we came to similar conclusions.

Comment 35 Jonathan Toppins 2020-09-08 14:09:39 UTC

The current proposed upstream fix is the following:

====
diff --git a/net/core/dev.c b/net/core/dev.c
index d42c9ea0c3c0..7932594ca437 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8646,7 +8646,7 @@ int dev_get_port_parent_id(struct net_device *dev,
                if (!first.id_len)
                        first = *ppid;
                else if (memcmp(&first, ppid, sizeof(*ppid)))
-                       return -ENODATA;
+                       return -EOPNOTSUPP;
        }
 
        return err;
====
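
To illustrate why changing the return value is enough, here is a toy userspace model of the two pieces involved: the lower-device loop, and a caller that treats EOPNOTSUPP as "no switch ID, carry on". The caller behaviour is an assumption drawn from the discussion earlier in this report (ports without a switch ID never blocked enslavement), and all function names are invented for the model.

```shell
#!/usr/bin/env bash
# Toy model only: these shell functions are invented for illustration
# and merely mimic the control flow discussed in this report.

# Usage: model_port_parent_id MISMATCH_ERR PORT_RESULT...
# MISMATCH_ERR is the errno name returned when IDs differ
# (ENODATA before the change, EOPNOTSUPP after it).
# Each PORT_RESULT is a lowercase hex switch ID, or an uppercase errno
# name (e.g. EOPNOTSUPP) for a port that fails to report one.
model_port_parent_id() {
    local mismatch_err=$1 err=EOPNOTSUPP first="" res
    shift
    for res in "$@"; do
        case $res in
            E*) err=$res; break ;;   # error from a lower device ends the walk
        esac
        err=0
        if [ -z "$first" ]; then
            first=$res
        elif [ "$res" != "$first" ]; then
            echo "$mismatch_err"
            return
        fi
    done
    echo "$err"
}

# Usage: model_br_add_if MISMATCH_ERR PORT_RESULT...
# Assumption: the bridge side treats EOPNOTSUPP as "nothing to mark"
# and only fails the enslavement on other errors.
model_br_add_if() {
    local err
    err=$(model_port_parent_id "$@")
    case $err in
        0|EOPNOTSUPP) echo "ok" ;;
        *) echo "fail: $err" ;;
    esac
}
```

With the two switch IDs from this report, `model_br_add_if ENODATA 60ba24feffe197bc 40f2d0feff2826b0` prints "fail: ENODATA", while the same call with EOPNOTSUPP as the mismatch error prints "ok".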

Comment 36 Vasundhara Volam 2020-09-11 12:52:24 UTC

Upstream fix is merged.

https://patchwork.ozlabs.org/project/netdev/patch/20200910110127.3113683-2-idosch@idosch.org/

Please back-port the fix.

Comment 37 Jonathan Toppins 2020-09-16 14:35:45 UTC

A devel kernel is available here:
http://people.redhat.com/jtoppins/.bz1860479/

If QE, Broadcom, and the reporter could verify that this change fixes the problem with the above kernel, that would be great.


Question for the networking services team:
There were two patches for this upstream fix, one being the actual fix, the other a selftests change. Currently the selftests patch does not apply because the selftests infrastructure has not been updated. Do you know if there is an effort to update this infrastructure?

e1b9efe6baeb ("net: Fix bridge enslavement failure")
6374a5606990 ("selftests: rtnetlink: Test bridge enslavement with different parent IDs")

Comment 38 Jiri Benc 2020-09-16 14:56:11 UTC

(In reply to Jonathan Toppins from comment #37)
> There were two patches for this upstream fix, one being the actual fix, the
> other a selftests change. Currently the selftests patch does not apply due
> to the infrastructure for self tests not being updated. Do you know if there
> is an effort to update this infrastructure?

Do you mean 6374a5606990? It seems to almost apply; rtnetlink.sh is very up to date, and the only missing patch is c2a4d2747996, which was applied upstream only recently. I suggest including c2a4d2747996 with your backport.

Comment 39 Jonathan Toppins 2020-09-16 18:08:41 UTC

Posted to rhel-net tree.

Comment 40 Mark R. 2020-09-17 14:10:42 UTC

(In reply to Jonathan Toppins from comment #37)
> A devel kernel is available here:
> http://people.redhat.com/jtoppins/.bz1860479/
> 
> If QE, Broadcom, and the reporter could verify this change fixes the problem
> with the above kernel what would be great.
> 

I've grabbed the linked RPMs and can give them a shot. Hitting two dependencies though:

  kernel-tools-libs = 4.18.0-236.el8.bz1860479v1 needed by kernel-tools-4.18.0-236.el8.bz1860479v1.x86_64
  linux-firmware >= 20200619-99.git3890db36 needed by kernel-core-4.18.0-236.el8.bz1860479v1.x86_64

Do you know offhand where I might get packages to fulfill those?

Comment 41 Mark R. 2020-09-17 14:19:33 UTC

Apologies, found the necessary linux-firmware package in the repos, just looking for kernel-tools-libs-4.18.0-236.el8.bz1860479v1 at this point.

Comment 42 Jonathan Toppins 2020-09-17 18:48:09 UTC

You do not need to install the kernel-tools-libs package to run the kernel.

Comment 43 Mark R. 2020-09-17 19:25:18 UTC

(In reply to Jonathan Toppins from comment #42)
> You do not need to install the kernel-tools-libs package to run the kernel.

Whoops, that's what I get for assuming I should install everything in that folder.

OK, I've installed the necessary bits to test and the initial test steps (from comment #5) go without a hitch. I can configure everything as listed with no errors and get a functional setup.

Thanks for your patience and work on this!

Comment 45 Dominik Holler 2020-10-05 12:18:06 UTC

This issue would hit RHV-4.4 users, because RHV-4.4 hosts require at least RHEL 8.2 and are not supported on RHEL 8.1.
For this reason a backport to RHEL 8.3 would be beneficial for RHV-4.4 users.

Comment 48 Jan Stancek 2020-10-06 08:11:59 UTC

Patch(es) available on kernel-4.18.0-240.2.el8.dt1

Comment 51 LiLiang 2020-10-15 09:01:49 UTC

Set Tested based on comment #43 and regression test result https://beaker.engineering.redhat.com/jobs/4626176.

Comment 52 LiLiang 2020-10-19 02:07:37 UTC

reproduced:

[root@dell-per730-49 ~]# cat re

  ip link add mybridge type bridge
  ip link add bond0 type bond mode 802.3ad
  ip link set mybridge up
  ip link set bond0 up
  ip link set enp4s0f0np0 down
  ip link set enp7s0np0 down
  ip link set enp4s0f0np0 master bond0
  ip link set enp7s0np0 master bond0
  ip link set enp4s0f0np0 up
  ip link set enp7s0np0 up
  ip link set bond0 master mybridge

[root@dell-per730-49 ~]# source re
RTNETLINK answers: No data available

[root@dell-per730-49 ~]# uname -r
4.18.0-240.el8.x86_64

Tested:
[root@dell-per730-49 ~]# uname -r
4.18.0-240.4.el8.dt3.x86_64
[root@dell-per730-49 ~]# cat re

  ip link add mybridge type bridge
  ip link add bond0 type bond mode 802.3ad
  ip link set mybridge up
  ip link set bond0 up
  ip link set enp4s0f0np0 down
  ip link set enp7s0np0 down
  ip link set enp4s0f0np0 master bond0
  ip link set enp7s0np0 master bond0
  ip link set enp4s0f0np0 up
  ip link set enp7s0np0 up
  ip link set bond0 master mybridge
[root@dell-per730-49 ~]# source re
[root@dell-per730-49 ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:84 brd ff:ff:ff:ff:ff:ff
3: enp4s0f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff
4: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:85 brd ff:ff:ff:ff:ff:ff
5: eno3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:86 brd ff:ff:ff:ff:ff:ff
6: enp4s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:81 brd ff:ff:ff:ff:ff:ff
7: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:87 brd ff:ff:ff:ff:ff:ff
8: enp7s0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff
9: enp7s0np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:15:4d:13:7a:7e brd ff:ff:ff:ff:ff:ff
10: mybridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master mybridge state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff

Comment 53 LiLiang 2020-10-19 02:22:37 UTC

This reproducer requires two cards on the same system, which is not easy for our automation.

Comment 54 Jan Stancek 2020-10-19 19:53:48 UTC

Patch(es) available on kernel-4.18.0-240.6.el8

Comment 59 LiLiang 2020-10-20 02:22:35 UTC

verified:

[root@dell-per730-49 ~]# source re
[root@dell-per730-49 ~]# cat re
  ip link add mybridge type bridge
  ip link add bond0 type bond mode 802.3ad
  ip link set mybridge up
  ip link set bond0 up
  ip link set enp4s0f0np0 down
  ip link set enp7s0np0 down
  ip link set enp4s0f0np0 master bond0
  ip link set enp7s0np0 master bond0
  ip link set enp4s0f0np0 up
  ip link set enp7s0np0 up
  ip link set bond0 master mybridge
[root@dell-per730-49 ~]# ip link 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:84 brd ff:ff:ff:ff:ff:ff
3: enp4s0f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff
4: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:85 brd ff:ff:ff:ff:ff:ff
5: eno3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:86 brd ff:ff:ff:ff:ff:ff
6: enp4s0f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:81 brd ff:ff:ff:ff:ff:ff
7: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 80:18:44:e4:5d:87 brd ff:ff:ff:ff:ff:ff
8: enp7s0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff
9: enp7s0np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:15:4d:13:7a:7e brd ff:ff:ff:ff:ff:ff
40: mybridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff
41: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master mybridge state UP mode DEFAULT group default qlen 1000
    link/ether 00:0a:f7:b6:e0:80 brd ff:ff:ff:ff:ff:ff
[root@dell-per730-49 ~]# uname -r
4.18.0-240.6.el8.x86_64

Comment 60 SimonScott 2020-11-11 10:56:37 UTC

Can someone give me a link to the kernel patch please

Comment 61 Jonathan Toppins 2020-11-11 15:32:13 UTC

(In reply to SimonScott from comment #60)
> Can someone give me a link to the kernel patch please

See comment #35, the upstream commit is:

  e1b9efe6baebe79019a2183176686a0e709388ae net: Fix bridge enslavement failure

Comment 62 SimonScott 2020-11-11 16:52:38 UTC

Thanks Jonathan,

Unfortunately that means nothing to me as my git knowledge is little more than knowing it exists. Please can you advise the best way for me to apply a kernel patch to my existing oVirt environment?

Regards

Simon...

Comment 63 Jonathan Toppins 2020-11-11 17:03:46 UTC

(In reply to SimonScott from comment #62)
> Thanks Jonathan,
> 
> Unfortunately that means nothing to me as my git knowledge is little more
> than knowing it exists. Please can you advise the best way for me to apply a
> kernel patch to my existing oVirt environment?
> 

Not really as it would require git and/or patch knowledge to apply the kernel patch. The patch will be included in RHEL-8.4 and is being considered for RHEL-8.3. Would recommend contacting support as they are best positioned to provide recommendations for your specific setup.

Comment 64 SimonScott 2020-11-11 19:53:13 UTC

Many thanks Jonathan

Comment 70 Fani Orestiadou 2021-01-08 12:23:24 UTC

Hello, 

I have a customer who is experiencing the same issue when using bnxt_en ports from different NICs. However, according to the updates above, I understand that if ports from the same NIC are used for the LACP bond, the issue should not be present. My customer's problem is that when the machine reboots, the slaves do not detect their links and the following errors are present: 

Jan  8 12:12:21 redomyec kernel: bnxt_en 0000:13:00.0 eth0: Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet found at mem da910000, node addr f4:03:43:ca:31:a0
Jan  8 12:12:21 redomyec kernel: bnxt_en 0000:13:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
Jan  8 12:12:21 redomyec kernel: bnxt_en 0000:13:00.1 eth1: Broadcom BCM57416 NetXtreme-E 10GBase-T Ethernet found at mem da900000, node addr f4:03:43:ca:31:a8
Jan  8 12:12:21 redomyec kernel: bnxt_en 0000:13:00.1: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
Jan  8 12:12:21 redomyec kernel: bnxt_en 0000:13:00.0 ens3f0np0: renamed from eth0
Jan  8 12:12:21 redomyec kernel: bnxt_en 0000:13:00.1 ens3f1np1: renamed from eth1
Jan  8 12:12:45 redomyec kernel: bnxt_en 0000:13:00.0 ens3f0np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Jan  8 12:12:45 redomyec kernel: bnxt_en 0000:13:00.0 ens3f0np0: EEE is not active
Jan  8 12:12:45 redomyec kernel: bnxt_en 0000:13:00.0 ens3f0np0: FEC autoneg off encodings: None
Jan  8 12:12:46 redomyec kernel: bnxt_en 0000:13:00.1 ens3f1np1: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
Jan  8 12:12:46 redomyec kernel: bnxt_en 0000:13:00.1 ens3f1np1: EEE is not active
Jan  8 12:12:46 redomyec kernel: bnxt_en 0000:13:00.1 ens3f1np1: FEC autoneg off encodings: None
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.0 ens3f0np0: NIC Link is Down
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.0 ens3f0np0: NIC Link is Down
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1: cmdq[0x19]=0x11 status 0x1
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1: cmdq[0x1a]=0x11 status 0x1
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 ens3f1np1: NIC Link is Down
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 ens3f1np1: NIC Link is Down
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1: cmdq[0x1b]=0x11 status 0x1
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1: cmdq[0x1c]=0x11 status 0x1
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1: cmdq[0x1d]=0x11 status 0x1
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1: cmdq[0x1e]=0x11 status 0x1
Jan  8 12:12:47 redomyec kernel: bnxt_en 0000:13:00.1 bnxt_re1: Failed to add GID: 0xfffffff2

Bringing the interfaces manually up is working without issue.
Is this something that is known or do we hit a different bug here? 

Thanks 
Fani

Comment 71 Jonathan Toppins 2021-01-08 15:49:38 UTC

(In reply to Fani Orestiadou from comment #70)
> Hello, 
> 
> I have a customer who is experiencing the same issue when using bnxt_en
> ports from different NICs. However according to the updates above, I
> understand that if ports from same NIC are used for the LACP bond the issue
> should not be present. My customer experiences the issue that when the
> machine reboots the slaves do not detect their links and following errors
> are present: 
> 
... 
> Bringing the interfaces manually up is working without issue.
> Is this something that is known or do we hit a different bug here? 
> 
> Thanks 
> Fani

Not detecting link is different from what this bug is solving, which is allowing one to add a device to a bridge.
Sounds more like bz1879840 or bz1855131. Also, you are using bnxt_re on top, which makes this a possible RDMA bug and complicates the issue more. I would suggest you file a new bug and let engineering determine where in the stack the problem is occurring.

Comment 73 errata-xmlrpc 2021-05-18 13:54:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1578

Comment 74 Red Hat Bugzilla 2023-09-15 00:34:39 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days