Bug 1273052

Summary: teamd fails to start after reboot
Product: Red Hat Enterprise Linux 7
Reporter: ctcard
Component: libteam
Assignee: Marcelo Ricardo Leitner <mleitner>
Status: CLOSED ERRATA
QA Contact: Amit Supugade <asupugad>
Severity: high
Docs Contact:
Priority: high
Version: 7.1
CC: ctcard, jiri, jiri, kzhang, lxin, mleitner, network-qe
Target Milestone: rc
Flags: lxin: needinfo-
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: libteam-1.25-4.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-04 01:00:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1301628, 1313485

Description ctcard 2015-10-19 13:09:26 UTC
Description of problem:
We are occasionally seeing issues with the teaming daemon not starting after a reboot on CentOS 7 VMs. Here is an example (from /var/log/messages.minor):
Oct  6 23:36:21 ****** ovs-ctl[623]: Starting ovsdb-server [  OK  ]
Oct  6 23:36:22 ****** ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.6.2
Oct  6 23:36:22 ****** ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.3.1 "external-ids:system-id=\"47ff9309-5609-47e0-819c-b9055b25edbb\"" "system-type=\"CentOS\"" "system-version=\"7.1.1503-Core\""
Oct  6 23:36:22 ****** ovs-ctl[623]: Configuring Open vSwitch system IDs [  OK  ]
Oct  6 23:36:22 ****** network[733]: Bringing up loopback interface:  [  OK  ]
Oct  6 23:36:22 ****** kernel: [    6.158533] gre: GRE over IPv4 demultiplexor driver
Oct  6 23:36:22 ****** systemd[1]: Starting system-teamd.slice.
Oct  6 23:36:22 ****** systemd[1]: Created slice system-teamd.slice.
Oct  6 23:36:22 ****** systemd[1]: Starting Team Daemon for device bond0...
Oct  6 23:36:22 ****** kernel: [    6.199635] openvswitch: Open vSwitch switching datapath
Oct  6 23:36:22 ****** ovs-ctl[623]: Inserting openvswitch module [  OK  ]
Oct  6 23:36:22 ****** kernel: [    6.338577] device ovs-system entered promiscuous mode
Oct  6 23:36:22 ****** kernel: [    6.340086] openvswitch: netlink: Unknown key attribute (type=62, max=21).
Oct  6 23:36:22 ****** kernel: [    6.385293] device br-ex entered promiscuous mode
Oct  6 23:36:22 ****** kernel: [    6.426511] device br-int entered promiscuous mode
Oct  6 23:36:22 ****** teamd[857]: Failed to get interface information list.
Oct  6 23:36:22 ****** teamd[857]: Failed to init interface information list.
Oct  6 23:36:22 ****** teamd[857]: Team init failed.
Oct  6 23:36:22 ****** teamd[857]: teamd_init() failed.
Oct  6 23:36:22 ****** teamd[857]: Failed: Invalid argument
Oct  6 23:36:22 ****** systemd[1]: teamd: main process exited, code=exited, status=1/FAILURE
Oct  6 23:36:22 ****** network[733]: Bringing up interface bond0:  Job for teamd failed. See 'systemctl status teamd' and 'journalctl -xn' for details.
Oct  6 23:36:22 ****** kernel: [    6.433515] device br-tun entered promiscuous mode
Oct  6 23:36:22 ****** systemd[1]: Unit teamd entered failed state.
Oct  6 23:36:22 ****** ovs-ctl[623]: Starting ovs-vswitchd [  OK  ]
Oct  6 23:36:22 ****** network[733]: [FAILED]
Oct  6 23:36:22 ****** ovs-ctl[623]: Enabling remote OVSDB managers [  OK  ] 

Version-Release number of selected component (if applicable):
teamd-1.15-1.el7.centos.x86_64
libteam-1.15-1.el7.centos.x86_64


How reproducible:
Only happens occasionally; not reproducible on demand.


Steps to Reproduce:
1. reboot a VM
2. after reboot teamd fails to start with error "Failed to get interface information list."

Actual results:


Expected results:


Additional info:
Investigation has shown that teamd is failing because the libteam code in ifinfo.c does not handle the NLE_DUMP_INTR error returned by nl_recvmsgs() (part of libnl3).
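
For reference, here is a minimal sketch (not the actual upstream patch; the function name ifinfo_dump_links is hypothetical) of how a libnl3 link dump can tolerate NLE_DUMP_INTR. The kernel returns that error when a dump races with a concurrent table change, which is common during boot while interfaces are still being created, so the dump should be restarted rather than treated as fatal:

#include <sys/socket.h>
#include <netlink/netlink.h>
#include <netlink/errno.h>
#include <linux/rtnetlink.h>

/* Sketch: assumes sk is a libnl3 socket already connected to NETLINK_ROUTE. */
static int ifinfo_dump_links(struct nl_sock *sk)
{
	int err;

retry:
	/* Request a full link dump (RTM_GETLINK + NLM_F_DUMP). */
	err = nl_rtgen_request(sk, RTM_GETLINK, AF_UNSPEC, NLM_F_DUMP);
	if (err < 0)
		return err;

	err = nl_recvmsgs_default(sk);
	if (err == -NLE_DUMP_INTR)
		goto retry; /* dump was interrupted by a link change; restart it */

	return err;
}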

Comment 2 Jiri Benc 2015-10-20 10:05:58 UTC
Adding upstream maintainer to CC.

Comment 3 Xin Long 2015-10-20 11:53:27 UTC
Hi, can you provide the commands used to start the VM and the network configuration files from the guest?

Comment 5 Xin Long 2015-12-26 09:06:40 UTC
(In reply to Jiri Pirko from comment #4)
> This is fixed by:
> 
> https://github.com/jpirko/libteam/commit/8e44b17159522e6afecd64a507cdfae3ed341257

ok, thanks, Jiri.

Comment 6 Marcelo Ricardo Leitner 2016-01-20 15:35:09 UTC
Fix is prepared for 7.3.
Flagging 7.2.z as there is no workaround for this issue.

Comment 7 Marcelo Ricardo Leitner 2016-03-11 19:16:36 UTC
Oops, this should really be Modified, as libteam has been updated to 1.23, which contains that commit.

Comment 9 Amit Supugade 2016-06-28 15:29:24 UTC
Verified on-
libteam-1.23-1.el7.x86_64
teamd-1.23-1.el7.x86_64

Ran test multiple times.
LOG-

:: [   PASS   ] :: Command 'virsh reboot vm1' (Expected 0, got 0)
:: [   LOG    ] :: Duration: 2m 54s
:: [   LOG    ] :: Assertions: 12 good, 0 bad
:: [   PASS   ] :: RESULT: Start VM and upgrade kernel

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: Test
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   LOG    ] :: Output of 'vmsh run_cmd vm1 'ip a | grep team0:'':
:: [   LOG    ] :: --------------- OUTPUT START ---------------
:: [   LOG    ] :: spawn virsh console vm1
:: [   LOG    ] :: 
:: [   LOG    ] :: Connected to domain vm1
:: [   LOG    ] :: 
:: [   LOG    ] :: Escape character is ^]
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: Red Hat Enterprise Linux Server 7.2 Beta (Maipo)
:: [   LOG    ] :: 
:: [   LOG    ] :: Kernel 3.10.0-451.el7.x86_64 on an x86_64
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: localhost login: root
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: Password: 
:: [   LOG    ] :: 
:: [   LOG    ] :: Last login: Tue Jun 28 10:02:56 on ttyS0
:: [   LOG    ] :: 
:: [   LOG    ] :: [root@localhost ~]# ip a | grep team0:
:: [   LOG    ] :: 
:: [   LOG    ] :: 4: team0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
:: [   LOG    ] :: 
:: [   LOG    ] :: [root@localhost ~]# echo $?
:: [   LOG    ] :: 
:: [   LOG    ] :: 0
:: [   LOG    ] :: 
:: [   LOG    ] :: [root@localhost ~]# logout
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: Red Hat Enterprise Linux Server 7.2 Beta (Maipo)
:: [   LOG    ] :: 
:: [   LOG    ] :: Kernel 3.10.0-451.el7.x86_64 on an x86_64
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: 
:: [   LOG    ] :: localhost login: 
:: [   LOG    ] :: 
:: [   LOG    ] :: ---------------  OUTPUT END  ---------------
:: [   PASS   ] :: Command 'vmsh run_cmd vm1 'ip a | grep team0:'' (Expected 0, got 0)
:: [   PASS   ] :: There should not be an error and Team should initialise without errors (Assert: '0' should equal '0')
:: [   LOG    ] :: Output of 'ping -c 5 192.168.1.22':
:: [   LOG    ] :: --------------- OUTPUT START ---------------
:: [   LOG    ] :: PING 192.168.1.22 (192.168.1.22) 56(84) bytes of data.
:: [   LOG    ] :: 64 bytes from 192.168.1.22: icmp_seq=1 ttl=64 time=0.328 ms
:: [   LOG    ] :: 64 bytes from 192.168.1.22: icmp_seq=2 ttl=64 time=0.118 ms
:: [   LOG    ] :: 64 bytes from 192.168.1.22: icmp_seq=3 ttl=64 time=0.192 ms
:: [   LOG    ] :: 64 bytes from 192.168.1.22: icmp_seq=4 ttl=64 time=0.118 ms
:: [   LOG    ] :: 64 bytes from 192.168.1.22: icmp_seq=5 ttl=64 time=0.112 ms
:: [   LOG    ] :: 
:: [   LOG    ] :: --- 192.168.1.22 ping statistics ---
:: [   LOG    ] :: 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
:: [   LOG    ] :: rtt min/avg/max/mdev = 0.112/0.173/0.328/0.083 ms
:: [   LOG    ] :: ---------------  OUTPUT END  ---------------
:: [   PASS   ] :: Command 'ping -c 5 192.168.1.22' (Expected 0, got 0)
:: [   LOG    ] :: Duration: 13s
:: [   LOG    ] :: Assertions: 3 good, 0 bad
:: [   PASS   ] :: RESULT: Test

Comment 13 errata-xmlrpc 2016-11-04 01:00:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2219.html