Bug 1257195

Summary: teamd fails to respond on aarch64
Product: Red Hat Enterprise Linux 7
Reporter: Vitezslav Humpa <vhumpa>
Component: libteam
Assignee: Marcelo Ricardo Leitner <mleitner>
Status: CLOSED ERRATA
QA Contact: Amit Supugade <asupugad>
Severity: high
Docs Contact:
Priority: urgent
Version: 7.2
CC: asupugad, jklimes, kzhang, lrintel, mleitner, network-qe, vbenes, vhumpa, zhchen
Target Milestone: rc
Target Release: ---
Hardware: aarch64
OS: Unspecified
Whiteboard:
Fixed In Version: 1.17-2.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-19 03:56:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vitezslav Humpa 2015-08-26 13:01:28 UTC
Description of problem:
On aarch64 (and apparently only there), the teamd daemon fails to respond.

Setting up a default team via NetworkManager:
$ nmcli con add type team

The teamd instance started by NetworkManager fails to reply to it:

Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <error> [1440593092.744665] [nm-device-team.c:160] ensure_teamd_connection(): (nm-team): failed to connect to teamd (err=-22)

Log:
Aug 26 08:44:52 apm-mustang-ev3-04 kernel: IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
Aug 26 08:44:52 apm-mustang-ev3-04 kernel: IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <debug> [1440593092.716192] [nm-device-team.c:445] teamd_kill(): [0x2aad790c600] (nm-team): running: /usr/bin/teamd -k -t nm-team
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Daemon not running
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <debug> [1440593092.723081] [nm-device-team.c:381] teamd_dbus_vanished(): [0x2aad790c600] (nm-team): teamd not on D-Bus (ignored)
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <debug> [1440593092.733431] [nm-device-team.c:495] teamd_start(): [0x2aad790c600] (nm-team): running: /usr/bin/teamd -o -n -U -D -N -t nm-team -gg
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <info>  (nm-team): Activation: (team) started teamd [pid 6144]...
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Using team device "nm-team".
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Using PID file "/var/run/teamd/nm-team.pid"
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: daemon, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: libteam_events, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: workq, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Failed to get team runner name from config.
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Using default team runner "roundrobin".
Aug 26 08:44:52 apm-mustang-ev3-04 kernel: IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
Aug 26 08:44:52 apm-mustang-ev3-04 kernel: nm-team: Mode changed to "roundrobin"
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: usock: Using sockpath "/var/run/teamd/nm-team.sock"
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: usock, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: dbus_dispatch, 0x2aaff904c70
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: dbus_watch, 0x2aaff9414c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: dbus_watch, 0x2aaff901a20
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: dbus: connected to 86ec7cb56fe0d6653122b8d655dce9cb with name :1.1496
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: <ifinfo_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: 1436: nm-team: 06:5f:6d:39:88:58: 0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: </ifinfo_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: <port_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: </port_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: <changed_option_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: </changed_option_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: dbus_timeout, 0x2aaff9025a0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <info>  (nm-team): teamd appeared on D-Bus
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <error> [1440593092.744665] [nm-device-team.c:160] ensure_teamd_connection(): (nm-team): failed to connect to teamd (err=-22)
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Removed loop callback: dbus_timeout, 0x2aaff9025a0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: dbus: have name org.libteam.teamd.nm-team
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: 1.17 successfully started.
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: <changed_option_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: *mode roundrobin
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: </changed_option_list>
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Added loop callback: usock_acc_conn, 0x2aaff905040
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: usock: calling method "ConfigDump"
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <info>  (nm-team): deactivation: stopping teamd...
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <debug> [1440593092.788768] [NetworkManagerUtils.c:667] nm_utils_kill_child_async(): kill child process 'teamd' (6144): wait for process to terminate after sending SIGTERM (15) (send SIGKILL in 2000 milliseconds)...
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Got SIGINT, SIGQUIT or SIGTERM.
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Exiting...
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Removed loop callback: usock_acc_conn, 0x2aaff905040
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Removed loop callback: usock, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Removed loop callback: workq, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Removed loop callback: libteam_events, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Removed loop callback: daemon, 0x2aaff9003c0
Aug 26 08:44:52 apm-mustang-ev3-04 kernel: IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <debug> [1440593092.861972] [NetworkManagerUtils.c:522] _kc_cb_watch_child(): kill child process 'teamd' (6144): terminated normally with status 0 (73209 usec elapsed)
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <debug> [1440593092.862030] [nm-device-team.c:381] teamd_dbus_vanished(): [0x2aad790c600] (nm-team): teamd not on D-Bus (ignored)
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager[6049]: <debug> [1440593092.862278] [nm-device-team.c:445] teamd_kill(): [0x2aad790c600] (nm-team): running: /usr/bin/teamd -k -t nm-team
Aug 26 08:44:52 apm-mustang-ev3-04 NetworkManager: Daemon not running

Running teamd manually, without NetworkManager (all commands as root):

$ teamd
1.17 successfully started.

... running and device team0 is set up ...

$ teamdctl team0 state
teamdctl_connect failed (Invalid argument)

No info in logs here.

Version-Release number of selected component (if applicable):
libteam-1.17-1.el7.aarch64
teamd-1.17-1.el7.aarch64
NetworkManager-team-1.0.4-10.el7.aarch64


How reproducible:
Always on aarch64 (RHEL-7.2-20150824.n.0)

Additional info:
Please ping me if you need access to an aarch64 system; I will try to provide it.

Comment 1 Marcelo Ricardo Leitner 2015-08-26 13:50:04 UTC
Yes, Vitezslav, please share access to such a system when possible.

Comment 3 Marcelo Ricardo Leitner 2015-08-26 18:01:20 UTC
Running strace on the teamdctl command from comment #0, I get:

socket(PF_LOCAL, SOCK_SEQPACKET, 0)     = 3
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/teamd/team0.sock"}, 27) = 0
sendto(3, "REQUEST\nConfigDump\n", 19, MSG_NOSIGNAL, NULL, 0) = 19
pselect6(4, [3], NULL, NULL, {0, 5000000000}, 0) = -1 EINVAL (Invalid argument)
write(2, "teamdctl_connect failed (Invalid"..., 43teamdctl_connect failed (Invalid argument)

The timeout argument is invalid, although other architectures accept it.
pselect6 takes its timeout with seconds + nanoseconds resolution. libteam calls select(), which takes seconds + microseconds and is converted to pselect6 later by glibc. But:

libteamdctl/teamdctl_private.h
#define TEAMDCTL_REPLY_TIMEOUT 5000 /* ms */                              

libteamdctl/cli_usock.c
#define WAIT_USEC (TEAMDCTL_REPLY_TIMEOUT * 1000)                         
                                                                          
static int cli_usock_wait_recv(int sock)                                  
{                                                                         
        fd_set rfds;                                                      
        int fdmax;                                                        
        int ret;                                                          
        struct timeval tv;                                                
                                                                          
        tv.tv_sec = 0;                                                    
        tv.tv_usec = WAIT_USEC;                                           
        FD_ZERO(&rfds);                                                   
        FD_SET(sock, &rfds);                                              
        fdmax = sock + 1;                                                 
        ret = select(fdmax, &rfds, NULL, NULL, &tv);                      

That sets tv.tv_usec to 5*10^6 (i.e. 5 s), which is not supposed to happen: tv_usec should stay below 10^6, with whole seconds carried in tv.tv_sec.
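
The arithmetic can be checked with a small sketch. It assumes the select() to pselect6 conversion simply scales microseconds by 1000 without carrying into seconds, which is what the {0, 5000000000} in the strace output suggests. REPLY_TIMEOUT_MS, OLD_WAIT_USEC and both helper functions are hypothetical names used only for illustration; they are not part of libteam.

```c
#include <stdint.h>
#include <sys/time.h>

#define REPLY_TIMEOUT_MS 5000                    /* mirrors TEAMDCTL_REPLY_TIMEOUT */
#define OLD_WAIT_USEC (REPLY_TIMEOUT_MS * 1000)  /* same arithmetic as the original WAIT_USEC */

/*
 * Hypothetical helper: scale microseconds to nanoseconds with no carry
 * into seconds (assumed behaviour, inferred from the strace output).
 */
static int64_t usec_to_nsec_no_carry(int64_t usec)
{
        return usec * 1000;
}

/*
 * Hypothetical helper: a timeval is normalized when both fields are
 * non-negative and tv_usec is strictly below one second (10^6 us).
 */
static int timeval_is_normalized(const struct timeval *tv)
{
        return tv->tv_sec >= 0 && tv->tv_usec >= 0 && tv->tv_usec < 1000000;
}
```

With tv_sec = 0 and tv_usec = OLD_WAIT_USEC, timeval_is_normalized() returns 0, and usec_to_nsec_no_carry() yields 5000000000, the same nanosecond value that pselect6 rejected with EINVAL.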

Comment 5 Marcelo Ricardo Leitner 2015-08-26 20:22:56 UTC
Please rebuild the .src.rpm from https://brewweb.devel.redhat.com/taskinfo?taskID=9752664 and try again. Thanks!

It has this patch:
commit 381983987d7dd01b7e7b12c676fb8f33694f7c36
Author: Marcelo Ricardo Leitner <mleitner>
Date:   Wed Aug 26 16:23:01 2015 -0300

    libteamdctl: fix timeval value for select
    
    timeval.tv_usec shouldn't be bigger than 10^6, as then it overlaps
    .tv_sec. aarch64 currently rejects such value with EINVAL.
    
    The fix is to normalize the fields regarding their resolutions.
    
    Reported-by: Vitezslav Humpa <vhumpa>
    Signed-off-by: Marcelo Ricardo Leitner <mleitner>

diff --git a/libteamdctl/cli_usock.c b/libteamdctl/cli_usock.c
index 0136d6909ea6..0dc97ae53f89 100644
--- a/libteamdctl/cli_usock.c
+++ b/libteamdctl/cli_usock.c
@@ -79,7 +79,8 @@ static int cli_usock_send(int sock, char *msg)
 	return 0;
 }
 
-#define WAIT_USEC (TEAMDCTL_REPLY_TIMEOUT * 1000)
+#define WAIT_SEC (TEAMDCTL_REPLY_TIMEOUT / 1000)
+#define WAIT_USEC (TEAMDCTL_REPLY_TIMEOUT % 1000 * 1000)
 
 static int cli_usock_wait_recv(int sock)
 {
@@ -88,7 +89,7 @@ static int cli_usock_wait_recv(int sock)
 	int ret;
 	struct timeval tv;
 
-	tv.tv_sec = 0;
+	tv.tv_sec = WAIT_SEC;
 	tv.tv_usec = WAIT_USEC;
 	FD_ZERO(&rfds);
 	FD_SET(sock, &rfds);
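
For illustration only, the same normalization can be written as a standalone helper that takes the timeout in milliseconds at run time instead of through macros. This is a minimal sketch, not the libteam code: wait_readable() and its timeout_ms parameter are hypothetical names, and it assumes a non-negative timeout.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical helper, not part of libteam: wait until fd becomes
 * readable or timeout_ms milliseconds elapse.
 * Returns select()'s result: > 0 if readable, 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, long timeout_ms)
{
        fd_set rfds;
        struct timeval tv;

        /* Split the timeout so that tv_usec always stays below 10^6. */
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        return select(fd + 1, &rfds, NULL, NULL, &tv);
}
```

For a 5000 ms timeout this gives tv_sec = 5 and tv_usec = 0, the same values the patched WAIT_SEC and WAIT_USEC macros produce; a 5500 ms timeout would give 5 and 500000.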

Comment 6 Vitezslav Humpa 2015-08-27 11:17:05 UTC
I've rebuilt the SRPM on aarch64 and tested the scenarios from comment #0, plus some extra use cases via NetworkManager. Everything works as expected. I have also scheduled a full set of jobs on aarch64, x86_64, s390x and ppc to provide more regression testing.

Comment 8 Marcelo Ricardo Leitner 2015-08-27 18:56:26 UTC
Awesome, thanks.

Comment 9 Marcelo Ricardo Leitner 2015-08-27 19:18:45 UTC
Posted upstream:
https://lists.fedorahosted.org/pipermail/libteam/2015-August/000409.html

Comment 10 Vitezslav Humpa 2015-08-28 08:26:25 UTC
Regression tests on the NetworkManager side also passed on all architectures.

Comment 19 Amit Supugade 2015-09-08 18:42:15 UTC
[root@amd-seattle-01 ~]# rpm -q teamd libteam
teamd-1.17-2.el7.aarch64
libteam-1.17-2.el7.aarch64


[root@amd-seattle-01 ~]# teamd -d -t team0
[  925.872809] team0: Mode changed to "roundrobin"
[root@amd-seattle-01 ~]# teamdctl team0 state
setup:
  runner: roundrobin
[root@amd-seattle-01 ~]# ip link set eth0 down
[ 1308.984465] amd-xgbe AMDI8001:00 eth0: Link is Down
[root@amd-seattle-01 ~]# teamdctl team0 port add eth0
[ 1311.014700] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 1311.020610] team0: Port device eth0 added
[root@amd-seattle-01 ~]# [ 1312.012750] amd-xgbe AMDI8001:00 eth0: Link is Up - 1Gbps/Full - flow control off
[ 1312.020232] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[root@amd-seattle-01 ~]# ip link set team0 up
[root@amd-seattle-01 ~]# 
[root@amd-seattle-01 ~]# teamdctl team0 state
setup:
  runner: roundrobin
ports:
  eth0
    link watches:
      link summary: up
      instance[link_watch_0]:
        name: ethtool
        link: up
        down count: 0
[root@amd-seattle-01 ~]# 
[root@amd-seattle-01 ~]# teamd -k -t team0
[ 1346.038355] amd-xgbe AMDI8001:00 eth0: Link is Down
[ 1346.044510] team0: Port device eth0 removed
[root@amd-seattle-01 ~]#


[root@amd-seattle-01 ~]# nmcli con add type team
[ 1362.223111] IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
[ 1362.229341] IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
[ 1362.244508] IPv6: ADDRCONF(NETDEV_UP): nm-team: link is not ready
[ 1362.259672] nm-team: Mode changed to "roundrobin"
Connection 'team' (4154f9c0-baf8-44ed-b180-e71056a49f3d) successfully added.
[root@amd-seattle-01 ~]#
[root@amd-seattle-01 ~]# nmcli connection show
NAME  UUID                                  TYPE            DEVICE 
eth0  269932e1-75f3-49d9-9f54-b1ec371eb9f2  802-3-ethernet  eth0   
eth1  11278af7-6733-47c5-9a59-04dccb84e9ea  802-3-ethernet  --     
team  4154f9c0-baf8-44ed-b180-e71056a49f3d  team            --     
[root@amd-seattle-01 ~]#

Comment 21 errata-xmlrpc 2015-11-19 03:56:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2176.html