
Bug 1680655

Summary:

[NMCI] default_route_for_vlan_over_team test failure

Product:

Red Hat Enterprise Linux 7

Reporter:

Vladimir Benes <vbenes>

Component:

libteam

Assignee:

Xin Long <lxin>

Status:

CLOSED ERRATA

QA Contact:

LiLiang <liali>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

7.7

CC:

atragler, bgalvani, fgiudici, haliu, lrintel, lxin, network-qe, rkhan, sukulkar, thaller

Target Milestone:

Keywords:

Regression

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

libteam-1.29-1.el7

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1714831 (view as bug list)

Environment:

Last Closed:

2020-03-31 20:05:37 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1714831

Attachments:

Description	Flags
[PATCH] libteam: set nl_cli event socket as non-blocking	none
teamd strace log	none

Description Vladimir Benes 2019-02-25 13:20:31 UTC

Description of problem:
This test occasionally fails on RHEL 7, and on RHEL 8 master too.
On RHEL 7 we see
Error: Connection activation failed: Active connection removed before it was initialized
in
https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/beaker-NetworkManager-master-veth-rhel7-upstream/1386/artifact/artifacts/FAIL_report_NetworkManager-ci_Test380_default_route_for_vlan_over_team.html

and on RHEL 8 we see a slightly different error
Error: Connection activation failed: teamd control failed
in 
https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/beaker-NetworkManager-master-veth-rhel8-upstream/309/artifact/artifacts/FAIL_report_NetworkManager-ci_Test380_default_route_for_vlan_over_team.html

teamd or NM issue?

Version-Release number of selected component (if applicable):
1.16.0

How reproducible:
sometimes

Comment 2 Beniamino Galvani 2019-04-08 09:09:35 UTC

In the log I see:

http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/03/34288/3428849/6649832/89940454/413972737/report_NetworkManager-ci_Test381_default_route_for_vlan_over_team.html

  <debug> [1553132334.8376] device[0x559288c9d8a0] (team7): running: /usr/bin/teamd -o -n -U -D -N -t team7 -gg
  <info>  [1553132334.8391] device (team7): Activation: (team) started teamd [pid 6931]...
  <debug> [1553132334.8391] device[0x559288c9d8a0] (team7): activation-stage: complete activate_stage1_device_prepare,v4 (id 1059)
  Using team device "team7".
  Using PID file "/var/run/teamd/team7.pid"
  This program is not intended to be run as root.
  Added loop callback: daemon, 0x564c991ecce0
  Added loop callback: libteam_events, 0x564c991ecce0
  Added loop callback: workq, 0x564c991ecce0
  Failed to get team runner name from config.
  Using default team runner "roundrobin".
  Failed to set team mode "roundrobin".
  Failed to init runner.
  Removed loop callback: workq, 0x564c991ecce0
  Removed loop callback: libteam_events, 0x564c991ecce0
  Removed loop callback: daemon, 0x564c991ecce0
  teamd_init() failed.
  Failed: Cannot allocate memory
  <debug> [1553132334.8476] device[0x559288c9d8a0] (team7): teamd 6931 died with status 256
  <warn>  [1553132334.8476] device (team7): teamd process 6931 quit unexpectedly; failing activation

which indicates an allocation failure in teamd. However, I don't see
any OOM message from the kernel in

  http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/03/34288/3428849/6649832/89940454/413972737/dmesg.log

Reassigning to libteam for investigation. The full CI job is at:

  https://beaker.engineering.redhat.com/recipes/6649832#task89940454

Comment 4 Hangbin Liu 2019-04-23 03:48:34 UTC

(In reply to Beniamino Galvani from comment #2)
>  
> http://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2019/03/34288/
> 3428849/6649832/89940454/413972737/dmesg.log
> 

In this log, I saw

[ 8996.084025] team7: Mode changed to "roundrobin"
[ 8996.108920] IPv6: ADDRCONF(NETDEV_UP): team7: link is not ready
[ 8996.141043] team7 (uninitialized): Failed to send options change via netlink (err -105)

The error message should come from the kernel function __team_options_change_check(), and errno 105 is

#define ENOBUFS         105     /* No buffer space available */

One call stack that may return -ENOBUFS is:
__team_options_change_check()
- team_nl_send_event_options_get()
  - team_nl_send_options_get()
    - send_func() / team_nl_send_multicast
      - genlmsg_multicast_netns
        - nlmsg_multicast
          - netlink_broadcast
            - netlink_broadcast_filtered

I haven't found any other path that could return -ENOBUFS yet.

Another possibility is that the failure is in userspace (in set_option_value): after team init fails and the device is removed, delivery of the kernel message fails. I'm not sure...
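
As a quick cross-check of the error numbers quoted in these logs, a negative kernel-style return value can be mapped back to its symbolic name. This is only a minimal sketch: `errno_name` is a hypothetical helper, not part of libteam or NetworkManager, and the numeric values are platform-specific (105 is ENOBUFS on Linux).

```python
import errno
import os


def errno_name(ret):
    """Map a negative kernel-style return code (e.g. -105) to (name, message).

    Hypothetical helper for reading logs; the mapping is the one the local
    platform defines, so run it on Linux to match the values in this report.
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)
```

On a Linux host, `errno_name(-105)` should name ENOBUFS, which matches the `#define` quoted above.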

Comment 8 Vladimir Benes 2019-05-07 09:16:26 UTC

So I have a reproducer; it looks quite simple and fails sooner or later:

while nmcli con add type team con-name team0 autoconnect no ifname nm-team \
   && nmcli con add type bridge con-name team_br autoconnect no ifname brA ip4 192.168.177.100/24 gw4 192.168.177.1 \
   && nmcli con modify id team0 connection.master brA connection.slave-type bridge \
   && nmcli con up team0 \
   && nmcli con del team0 team_br \
   && sleep 1; do :; done

Is there anything more visible with this?
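
For longer unattended runs, the same reproducer could be written as a bounded loop that reports which step failed and on which iteration. The sketch below is an assumption-laden rewrite, not something used in this bug: `run_step` and `iterations_until_failure` are hypothetical names, and only the five nmcli commands and the one-second pause come from the one-liner above.

```python
import subprocess
import time

# The five nmcli invocations from the reproducer, in order.
STEPS = [
    "nmcli con add type team con-name team0 autoconnect no ifname nm-team",
    "nmcli con add type bridge con-name team_br autoconnect no ifname brA "
    "ip4 192.168.177.100/24 gw4 192.168.177.1",
    "nmcli con modify id team0 connection.master brA connection.slave-type bridge",
    "nmcli con up team0",
    "nmcli con del team0 team_br",
]


def run_step(command):
    """Run one command and return its exit status."""
    return subprocess.run(command.split()).returncode


def iterations_until_failure(run=run_step, max_iterations=1000, delay=1.0):
    """Repeat STEPS until one fails.

    Returns (iteration, failing_command), or None if every iteration passed.
    `run` is injectable so the loop can be exercised without nmcli.
    """
    for iteration in range(1, max_iterations + 1):
        for command in STEPS:
            if run(command) != 0:
                return iteration, command
        time.sleep(delay)
    return None
```

Calling `iterations_until_failure()` as root on a test machine runs the real commands; passing a stub as `run` makes the loop testable without touching the network configuration.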

Comment 9 Hangbin Liu 2019-05-07 09:57:24 UTC

Cool, thanks a lot. From your "while; do; done" reproducer, maybe there is a memory leak issue. I will investigate it now.

Comment 10 Hangbin Liu 2019-05-08 03:06:38 UTC

Hi Vladimir,

With your reproducer, I got the following log info:

NetworkManager: This program is not intended to be run as root.
kernel: nm-team: Mode changed to "roundrobin"
NetworkManager[775]: <info>  [1557284236.8254] device (nm-team): Activation: (team) started teamd [pid 20202]...
NetworkManager[775]: <info>  [1557284236.8260] device (brA): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
teamd_nm-team[20202]: 1.27 successfully started.
NetworkManager[775]: <info>  [1557284236.8282] device (nm-team): teamd appeared on D-Bus
NetworkManager[775]: <error> [1557284241.8339] device (nm-team): failed to connect to teamd (err=-110)
NetworkManager[775]: <info>  [1557284241.8339] device (nm-team): state change: prepare -> failed (reason 'teamd-control-failed', sys-iface-state: 'managed')
NetworkManager[775]: <warn>  [1557284241.8343] device (nm-team): Activation: failed for connection 'team0'
NetworkManager[775]: <info>  [1557284241.8485] device (brA): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
kernel: IPv6: ADDRCONF(NETDEV_UP): brA: link is not ready
kernel: IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
NetworkManager[775]: <info>  [1557284241.8498] device (nm-team): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
teamd_nm-team[20202]: Got SIGINT, SIGQUIT or SIGTERM.
teamd_nm-team[20202]: Exiting...
NetworkManager: Daemon not running
NetworkManager[775]: <error> [1557284241.8589] device (nm-team): failed to connect to teamd (err=-22)
NetworkManager: libteamdctl: teamdctl_connect: Failed to connect using all CLIs.

This looks more like bug 1693142. What do you think?

Comment 11 Vladimir Benes 2019-05-08 09:02:49 UTC

no idea, but this is definitely a different product so we have to keep it open. Is bug 1693142 caused by NM? We have to switch component if yes.

Comment 12 Hangbin Liu 2019-05-09 02:16:02 UTC

(In reply to Vladimir Benes from comment #11)
> no idea, but this is definitely a different product so we have to keep it
> open. Is bug 1693142 caused by NM? We have to switch component if yes.

The failures (tests 147 and 150) in comment 5 should be caused by bug 1693142.

The failures in comment 0 and comment 2 (tests 380 and 381), however, look like another issue. Would you please help find a reproducer?

Comment 13 Beniamino Galvani 2019-05-09 15:30:59 UTC

(In reply to Vladimir Benes from comment #11)
> no idea, but this is definitely a different product so we have to keep it
> open. Is bug 1693142 caused by NM? 

Yes, it is.

> We have to switch component if yes.

It seems there are multiple issues here; I will try to reproduce the problem.

Comment 14 Beniamino Galvani 2019-05-15 15:07:31 UTC

I enabled teamdctl debugging and I see that teamd is not responding to the method call via unix socket.

 <info>  [1557931917.2115] device (nm-team): teamd appeared on D-Bus
 <debug> [1557931917.2125] teamdctl: Custom logging function 0x7f5475496830 registered.
 <debug> [1557931917.2126] teamdctl: Connected using CLI "usock".
 <debug> [1557931917.2126] teamdctl: usock: Calling method "ConfigDump"
 <debug> [1557931922.2176] teamdctl: usock: Wait for reply timed-out.
 <error> [1557931922.2177] device (nm-team): failed to connect to teamd (err=-110)

I'm not sure if the problem is in libteamdctl, teamd or elsewhere.

Comment 15 Beniamino Galvani 2019-05-16 06:36:59 UTC

I added debugging prints to teamd and I think there is a problem with
event processing. teamdctl uses a unix socket to send commands to
teamd and waits up to 5 seconds for a reply. Sometimes teamd seems
not to process the accept() callback on the listening socket and gets
stuck; then NM detects the timeout and sends SIGTERM to teamd. Only
at that point is the accept() callback run in teamd, but it's too late
since teamd is quitting. An extract from the log (attached):
 
 13:43:07 NetworkManager[4443]: <debug> [1557942187.6989] device[0x55c372447ea0] (nm-team): running: /usr/bin/teamd -o -n -U -D -N -t nm-team -gg
 
  (^ teamd launched by NM)
 
 13:43:07 teamd_nm-team[4740]: Custom logging function 0x560577b34bf0 registered.
 13:43:07 teamd_nm-team[4740]: Added loop callback: daemon, 0x5605794e1410
 13:43:07 teamd_nm-team[4740]: Added loop callback: libteam_events, 0x5605794e1410
 13:43:07 teamd_nm-team[4740]: Added loop callback: workq, 0x5605794e1410
 13:43:07 teamd_nm-team[4740]: Failed to get team runner name from config.
 13:43:07 teamd_nm-team[4740]: Using default team runner "roundrobin".
 13:43:07 teamd_nm-team[4740]: usock: Using sockpath "/var/run/teamd/nm-team.sock"
 13:43:07 teamd_nm-team[4740]: Added loop callback: usock, 0x5605794e1410
 
  (^ here the listening socket is added into the main loop select in teamd)
 
 13:43:07 teamd_nm-team[4740]: Added loop callback: dbus_dispatch, 0x5605794e7540
 13:43:07 teamd_nm-team[4740]: Added loop callback: dbus_watch, 0x5605794e5170
 13:43:07 teamd_nm-team[4740]: Added loop callback: dbus_watch, 0x5605794e51c0
 13:43:07 teamd_nm-team[4740]: dbus: connected to 7e18b7c05143a562242d5fe65cdc0c7e with name :1.4058
 ...
 13:43:07 teamd_nm-team[4740]: Added loop callback: dbus_timeout, 0x5605794e5ec0
 13:43:07 teamd_nm-team[4740]: Removed loop callback: dbus_timeout, 0x5605794e5ec0
 13:43:07 teamd_nm-team[4740]: dbus: have name org.libteam.teamd.nm-team
 13:43:07 teamd_nm-team[4740]: 1.27 successfully started.
 13:43:07 teamd_nm-team[4740]:  - invoke cb for 'dbus_dispatch' - events 1
  
 13:43:07 NetworkManager[4443]: <debug> [1557942187.7251] teamdctl: Custom logging function 0x7f2e36a95830 registered.
 13:43:07 NetworkManager[4443]: <debug> [1557942187.7252] teamdctl: Connected using CLI "usock".
 13:43:07 NetworkManager[4443]: <debug> [1557942187.7252] teamdctl: usock: Calling method "ConfigDump"
 13:43:07 NetworkManager[4443]: <debug> [1557942187.7252] teamdctl: usock: sent message "REQUEST\nConfigDump\n" len=19, result=19
 
  (^ NM sends the request)
 
 13:43:17 NetworkManager[4443]: <debug> [1557942197.7352] teamdctl: usock: Wait for reply timed-out.
 13:43:17 NetworkManager[4443]: <error> [1557942197.7353] device (nm-team): failed to connect to teamd (err=-110)
 13:43:17 NetworkManager[4443]: <debug> [1557942197.7353] kill child process 'teamd' (4740): wait for process to terminate after sending SIGTERM (15) (send SIGKILL in 2000 milliseconds)...
 
  (^ after 10 seconds it times out. I had increased the timeout from 5
     to 10 seconds. NM sends SIGTERM to teamd.)
 
 13:43:17 teamd_nm-team[4740]: select() done
 13:43:17 teamd_nm-team[4740]:  - invoke cb for 'usock' - events 1
 13:43:17 teamd_nm-team[4740]: Added loop callback: usock_acc_conn, 0x5605794e5280
 
  (^ now teamd detects the incoming request on the unix socket and
     accept()s it)
 
 13:43:17 teamd_nm-team[4740]:  - invoke cb for 'libteam_events' - events 1
 13:43:17 teamd_nm-team[4740]: Got SIGINT, SIGQUIT or SIGTERM.
 
  (^ and also quits)
 
 13:43:17 teamd_nm-team[4740]: Exiting...
 13:43:17 teamd_nm-team[4740]: Removed loop callback: usock_acc_conn, 0x5605794e5280
 13:43:17 NetworkManager[4443]: Daemon not running
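
The client side of this exchange (send a request, wait a bounded time for the reply, give up with a timeout) can be sketched as follows. This is not the libteamdctl implementation: `call_with_timeout` is a hypothetical function, and only the request string `"REQUEST\nConfigDump\n"` and the timeout behaviour are taken from the log above.

```python
import select
import socket


def call_with_timeout(sock, request, timeout):
    """Send `request` and wait up to `timeout` seconds for a reply.

    Returns the reply bytes, or None if the peer stayed silent; a real
    client would report a timeout at that point (err=-110 in the log).
    """
    sock.sendall(request)
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    return sock.recv(4096)


# Stand-in for the teamd unix socket: a connected socket pair whose
# "server" end never answers, like the stuck daemon in the log.
client, server = socket.socketpair()
print(call_with_timeout(client, b"REQUEST\nConfigDump\n", 0.1))
```

Because the server end never replies, the call returns after the timeout instead of hanging, which is the behaviour NM relies on before it sends SIGTERM.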

Comment 16 Hangbin Liu 2019-05-16 08:13:04 UTC

(In reply to Beniamino Galvani from comment #15)
>   
>  13:43:07 NetworkManager[4443]: <debug> [1557942187.7251] teamdctl: Custom
> logging function 0x7f2e36a95830 registered.
>  13:43:07 NetworkManager[4443]: <debug> [1557942187.7252] teamdctl:
> Connected using CLI "usock".
>  13:43:07 NetworkManager[4443]: <debug> [1557942187.7252] teamdctl: usock:
> Calling method "ConfigDump"
>  13:43:07 NetworkManager[4443]: <debug> [1557942187.7252] teamdctl: usock:
> sent message "REQUEST\nConfigDump\n" len=19, result=19
>  
>   (^ NM sends the request)
>  
>  13:43:17 NetworkManager[4443]: <debug> [1557942197.7352] teamdctl: usock:
> Wait for reply timed-out.
>  13:43:17 NetworkManager[4443]: <error> [1557942197.7353] device (nm-team):
> failed to connect to teamd (err=-110)
>  13:43:17 NetworkManager[4443]: <debug> [1557942197.7353] kill child process
> 'teamd' (4740): wait for process to terminate after sending SIGTERM (15)
> (send SIGKILL in 2000 milliseconds)...
>  
>   (^ after 10 seconds it times out. I had increased the timeout from 5
>      to 10 seconds. NM sends SIGTERM to teamd.)
>  

Hi Beniamino,

Thanks for the debug info. I reviewed the changelog since 1.27 (Aug 2017) but there
is no change for usock or teamdctl... Can you recall when this issue started to happen?

Hi Xin Long, any thoughts on this issue?

Comment 17 Vladimir Benes 2019-05-23 07:54:13 UTC

https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/beaker-NetworkManager-veth-RHEL-8.0/119/artifact/artifacts/FAIL_report_NetworkManager-ci_Test375_default_route_for_vlan_over_team.html

first occurrence:
03/19/2019

Any progress? I see it now on 1.18 and centos 7.6 too.

Comment 18 Xin Long 2019-05-23 14:36:46 UTC

(In reply to Vladimir Benes from comment #17)
> https://desktopqe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/beaker-
> NetworkManager-veth-RHEL-8.0/119/artifact/artifacts/
> FAIL_report_NetworkManager-ci_Test375_default_route_for_vlan_over_team.html
> 
> first occurrence:
> 03/19/2019
> 
> Any progress? I see it now on 1.18 and centos 7.6 too.

could be fixed by:
https://bugzilla.redhat.com/show_bug.cgi?id=1689774#c5

Pls try with libteam-1.27-9.el7

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=892337

Comment 19 Beniamino Galvani 2019-05-24 07:33:17 UTC

I tried libteam-1.27-9.el7.x86_64 and there is the same problem:

NetworkManager[13450]: <info>  [1558683072.7106] device (nm-team): teamd appeared on D-Bus
NetworkManager[13450]: <error> [1558683077.7162] device (nm-team): failed to connect to teamd (err=-110)

Comment 20 Vladimir Benes 2019-05-24 07:40:48 UTC

(In reply to Beniamino Galvani from comment #19)
> I tried libteam-1.27-9.el7.x86_64 and there is the same problem:
> 
> NetworkManager[13450]: <info>  [1558683072.7106] device (nm-team): teamd
> appeared on D-Bus
> NetworkManager[13450]: <error> [1558683077.7162] device (nm-team): failed to
> connect to teamd (err=-110)

Yeah, I'm trying it now too with CentOS 7.6 and the newest libteam, and I can easily reproduce it.

Comment 21 Beniamino Galvani 2019-05-24 16:16:46 UTC

Hi, I think I found the issue. Sometimes teamd blocks when reading netlink messages:

 09:18:44.802079 select(18, [3 10 11 15 16 17], [], [16], NULL) = 1 (in [10]) <0.000009>
 09:18:44.802189 epoll_wait(10, [{EPOLLIN, {u32=9, u64=38654705673}}], 2, -1) = 1 <0.000007>
 09:18:44.802214 recvmsg(9, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000001}, msg_iov(1)=[{"\354\4\0\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0k\6\0\0\3\20\0\0\0\0\0\0"..., 16384}], msg_controllen=0,  
                         msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 1260 <10.040834>
 09:18:54.843206 recvmsg(9, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000001}, msg_iov(1)=[{"", 16384}], msg_controllen=0, msg_flags=0}, 0) = 0 <0.000048>

Note how the first recvmsg blocks for 10 seconds; it is then probably interrupted by the signal sent by NM. I don't know whether this is expected or whether there is a bug in the kernel or libnl3. In any case, it seems better to make the netlink socket non-blocking, as in the attached patch.
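
To illustrate why a non-blocking socket protects the main loop, here is a minimal sketch using a generic socket pair rather than netlink. It is not the attached patch (which is C code in libteam and is not reproduced in this report); `drain` is a hypothetical helper, and the 16384-byte buffer simply mirrors the size seen in the strace output.

```python
import socket


def drain(sock, bufsize=16384):
    """Read everything currently queued on a non-blocking socket.

    Returns the list of chunks read and never waits for more data.
    """
    chunks = []
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            # Nothing queued: go back to the main loop instead of hanging.
            break
        if not data:
            # Zero-length read: end of stream or an empty datagram.
            break
        chunks.append(data)
    return chunks


a, b = socket.socketpair()
a.setblocking(False)  # the key step: reads on `a` can no longer hang
print(drain(a))       # nothing queued yet, returns immediately
b.send(b"event")
print(drain(a))
```

With a blocking socket, the first `drain(a)` call would wait indefinitely, which is the kind of stall the strace shows in recvmsg.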

Comment 22 Beniamino Galvani 2019-05-24 16:17:26 UTC

Created attachment 1572958 [details]
[PATCH] libteam: set nl_cli event socket as non-blocking

Comment 23 Beniamino Galvani 2019-05-24 16:18:30 UTC

Created attachment 1572959 [details]
teamd strace log

Comment 25 Beniamino Galvani 2019-05-24 17:18:09 UTC

I suspect the bug is somehow caused by the kernel sending an empty netlink notification when removing a non-existent bridge VLAN:

 ip link add br1 type bridge
 ip link set br1 up
 ip monitor &
 bridge vlan delete dev br1 vid 1 self
 > EOF on netlink
 > [3]   Exit 2                  ip monitor

This doesn't happen with kernel 5.0.7-200.fc29.x86_64. A related kernel patch seems to be:

 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e19b42a1a0669ed5b8009930c5269a5a87cc363c
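
The "EOF on netlink" symptom can be mimicked with any datagram socket: a reader that treats a 0-byte read as end-of-file stops at the first empty message. The sketch below uses a UNIX datagram socket pair as a stand-in for netlink, so it only illustrates the failure pattern and is not a claim about how `ip monitor` or libnl3 are implemented; `read_events` is a hypothetical name.

```python
import socket


def read_events(sock, bufsize=16384):
    """Naive reader: collect datagrams and treat a 0-byte read as EOF."""
    events = []
    while True:
        data = sock.recv(bufsize)
        if len(data) == 0:
            return events, "EOF"
        events.append(data)


rx, tx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
tx.send(b"link up")
tx.send(b"")           # empty notification, like the one suspected here
tx.send(b"link down")  # still queued, but the naive reader never sees it
print(read_events(rx))
```

The reader gives up after the empty datagram even though a valid message is still queued behind it.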

Comment 26 Vladimir Benes 2019-05-25 11:43:41 UTC

(In reply to Beniamino Galvani from comment #24)
> Vladimir, can you test the scratch build:

It looks very good, no fail so far, you nailed it! Thank you!

Comment 27 Hangbin Liu 2019-05-27 04:15:02 UTC

(In reply to Beniamino Galvani from comment #25)
> This doesn't happen with kernel 5.0.7-200.fc29.x86_64. A related kernel
> patch seems to be:
> 
>  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e19b42a1a0669ed5b8009930c5269a5a87cc363c

Hi Beniamino, I have built a test kernel with this patch [1], but it seems we still have this issue. Would you please give it a try?

[1] https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21847958
Thanks
Hangbin

Comment 28 Vladimir Benes 2019-05-27 08:05:21 UTC

# while nmcli con add type team con-name team0 autoconnect no ifname nm-team \
   && nmcli con add type bridge con-name team_br autoconnect no ifname brA ip4 192.168.177.100/24 gw4 192.168.177.1 \
   && nmcli con modify id team0 connection.master brA connection.slave-type bridge \
   && nmcli con up team0 \
   && nmcli con del team0 team_br; do :; done

<snip>
Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/70)
Connection 'team0' (2c1be6a2-8256-4871-9bc4-809bde62aa1c) successfully deleted.
Connection 'team_br' (ed9d78ca-2e10-4098-90d2-9f5d7fcd27df) successfully deleted.
Connection 'team0' (90e44dcc-66ce-4a35-be56-5ebf609872cc) successfully added.
Connection 'team_br' (16921117-16b7-436e-b88b-35984a9c979e) successfully added.
Error: Connection activation failed: teamd control failed


so no, we should go the libteam way in this bug and clone the kernel issue and investigate in a separate bugzilla.

Comment 29 Hangbin Liu 2019-05-28 13:07:34 UTC

(In reply to Beniamino Galvani from comment #25)
> I suspect the bug is somehow caused by kernel sending an empty netlink
> notification when removing non existent bridge vlan:
> 
>  ip link add br1 type bridge
>  ip link set br1 up
>  ip monitor &
>  bridge vlan delete dev br1 vid 1 self
> 
>  > EOF on netlink
>  > [3]   Exit 2                  ip monitor
> 

Hi Beniamino, Vladimir,

After debugging, I found that the issue is fixed by

    commit 59ccaaaa49b5b096cdc1f16706a9f931416b2332
    Author: Roopa Prabhu <roopa>
    Date:   Wed Jan 28 16:23:11 2015 -0800

        bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify

I have submitted a brew build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=21863150

Would you please give it a try when it's finished? If it fixes your problem, I think we can backport it to 7.7 since it's a blocker.

Thanks
Hangbin

Comment 30 Vladimir Benes 2019-05-28 17:39:15 UTC

Yeah, this fixes the issue as well, thank you!

Comment 31 sushil kulkarni 2019-06-05 18:16:14 UTC

Marking this as blocker-. Will fix it in 7.8.

-Sushil

Comment 32 Beniamino Galvani 2019-06-27 07:21:02 UTC

*** Bug 1701280 has been marked as a duplicate of this bug. ***

Comment 33 Hangbin Liu 2019-07-24 09:47:31 UTC

Hi Xin,

Beniamino's patch 5c5e498 ("libteam: set netlink event socket as non-blocking") has been applied upstream.
Will you take this bug and backport it to 7.8?

Thanks
Hangbin

Comment 34 Xin Long 2019-08-28 05:26:24 UTC

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=23213580

Comment 36 LiLiang 2019-10-21 09:21:04 UTC

(In reply to Vladimir Benes from comment #28)
> # while nmcli con add type team  con-name team0 autoconnect no ifname
> nm-team && nmcli con add type bridge con-name team_br autoconnect no ifname
> brA ip4 192.168.177.100/24 gw4 192.168.177.1 && nmcli con modify id team0
> connection.master brA connection.slave-type bridge && nmcli con up team0 &&
> nmcli con del team0 team_br ;  do :;done 
> 
> <snip>
> Connection successfully activated (master waiting for slaves) (D-Bus active
> path: /org/freedesktop/NetworkManager/ActiveConnection/70)
> Connection 'team0' (2c1be6a2-8256-4871-9bc4-809bde62aa1c) successfully
> deleted.
> Connection 'team_br' (ed9d78ca-2e10-4098-90d2-9f5d7fcd27df) successfully
> deleted.
> Connection 'team0' (90e44dcc-66ce-4a35-be56-5ebf609872cc) successfully added.
> Connection 'team_br' (16921117-16b7-436e-b88b-35984a9c979e) successfully
> added.
> Error: Connection activation failed: teamd control failed
> 
> 
> so no, we should go the libteam way in this bug and clone the kernel issue
> and investigate in a separate bugzilla.

Hi Vladimir,

I don't know why I can't reproduce this with your script...
I ran it for over 30 minutes.
Do you know a possible reason?

[root@ibm-x3650m4-03 ~]# rpm -q libteam
libteam-1.27-9.el7.x86_64
[root@ibm-x3650m4-03 ~]# rpm -q NetworkManager
NetworkManager-1.18.0-5.el7.x86_64
[root@ibm-x3650m4-03 ~]# uname -r
3.10.0-1062.el7.x86_64

Comment 37 Vladimir Benes 2019-10-30 10:30:26 UTC

it took me a while but I was able to reproduce here:
[root@wsfd-netdev34-vm-8 NetworkManager-ci]# rpm -q libteam
libteam-1.27-5.el7.x86_64
[root@wsfd-netdev34-vm-8 NetworkManager-ci]# uname -a
Linux wsfd-netdev34-vm-8.ntdv.lab.eng.bos.redhat.com 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@wsfd-netdev34-vm-8 NetworkManager-ci]# rpm -q NetworkManager
NetworkManager-1.18.0-5.el7_7.1.x86_64
[root@wsfd-netdev34-vm-8 NetworkManager-ci]# cat /etc/redhat-release 
Red Hat Enterprise Linux Workstation release 7.6 (Maipo)


The kernel patch "bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify",
which landed in 3.10.0-1053.el7, fixed it as well.


So to verify this it should be enough to grab a 7.6 machine and update libteam and teamd and run the above-mentioned reproducer. For me, it works well even with the buggy kernel.

Comment 38 LiLiang 2019-10-31 03:22:46 UTC

reproduced:

(process:21044): GLib-GIO-WARNING **: 23:02:56.345: gdbusobjectmanagerclient.c:1589: Processing InterfaceRemoved signal for path /org/freedesktop/NetworkManager/IP6Config/3864 but no object proxy exists
Connection 'team0' (c89934c0-0209-431a-b648-41545eafed06) successfully added.
Connection 'team_br' (898371e9-8a8d-4319-a356-d883fb3a34b2) successfully added.
Error: Connection activation failed: teamd control failed
[root@hp-dl380g10-02 ~]# uname -r
3.10.0-957.el7.x86_64
[root@hp-dl380g10-02 ~]# rpm -q libteam
libteam-1.27-5.el7.x86_64
[root@hp-dl380g10-02 ~]# rpm -q NetworkManager
NetworkManager-1.18.0-5.el7_7.1.x86_64


and I can't reproduce it on the versions below, so setting to verified:
[root@hp-dl380g10-02 ~]# rpm -q libteam
libteam-1.29-1.el7.x86_64
[root@hp-dl380g10-02 ~]# rpm -q teamd
teamd-1.29-1.el7.x86_64
[root@hp-dl380g10-02 ~]# rpm -q NetworkManager
NetworkManager-1.18.0-5.el7_7.1.x86_64
[root@hp-dl380g10-02 ~]# uname -r
3.10.0-957.el7.x86_64

Comment 40 errata-xmlrpc 2020-03-31 20:05:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1133