Bug 1257237

Summary:

NetworkManager loops and takes CPU until it dies when teamd is unresponsive

Product:

Red Hat Enterprise Linux 7

Reporter:

Vitezslav Humpa <vhumpa>

Component:

NetworkManager

Assignee:

Beniamino Galvani <bgalvani>

Status:

CLOSED ERRATA

QA Contact:

Desktop QE <desktop-qa-list>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

7.2

CC:

aloughla, atragler, bgalvani, dcbw, kzhang, lkundrak, lrintel, mleitner, rkhan, thaller, vbenes

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-11-03 19:15:30 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1301628, 1313485

Attachments:

Description	Flags
A debug log of the loop	none

Description Vitezslav Humpa 2015-08-26 14:50:54 UTC

Created attachment 1067282 [details]
A debug log of the loop

Description of problem:
Currently, there is an issue with teamd failing to respond on dbus
on aarch64: https://bugzilla.redhat.com/show_bug.cgi?id=1257195

When attempting to create a plain standalone team with such failing teamd,
NetworkManager gives up attempting on first failure.

$ nmcli con add type team
Connection 'team' (f5933f53-70aa-4e46-9174-f7f7f2be7b15) successfully added.
$ nmcli con
NAME  UUID                                  TYPE            DEVICE 
eth0  a37acea1-1a99-45b9-9ba9-ea665c79fa2c  802-3-ethernet  eth0   
eth2  01718f51-dc36-4fd6-8f07-66080217cb4f  802-3-ethernet  --     
team  f5933f53-70aa-4e46-9174-f7f7f2be7b15  team            --     
eth1  03ed154d-509a-47a7-980b-47955eb52caa  802-3-ethernet  --   
$ nmcli device
DEVICE  TYPE      STATE         CONNECTION 
eth0    ethernet  connected     eth0       
eth1    ethernet  disconnected  --         
eth2    ethernet  disconnected  --         
lo      loopback  unmanaged     --

Log output in comment 0 in bz1257195

However, when we try to add a slave connection:

$ nmcli con add type team
$ nmcli connection add type team-slave ifname eth1 con-name team0.0 master nm-team

$ nmcli con
NAME     UUID                                  TYPE            DEVICE 
eth0     a37acea1-1a99-45b9-9ba9-ea665c79fa2c  802-3-ethernet  eth0   
eth2     01718f51-dc36-4fd6-8f07-66080217cb4f  802-3-ethernet  --     
eth1     03ed154d-509a-47a7-980b-47955eb52caa  802-3-ethernet  --     
team0.0  7ba578f3-2f5f-437e-8e84-a8079dfbdd92  802-3-ethernet  --     
team     c96f4a95-a79b-4fbf-a8ea-8ee444615223  team            --     

$ nmcli dev
DEVICE  TYPE      STATE                                  CONNECTION 
eth0    ethernet  connected                              eth0       
eth1    ethernet  connecting (getting IP configuration)  team0.0    
eth2    ethernet  disconnected                           --         
lo      loopback  unmanaged    

At that moment NetworkManager continues attempting activation of the device in loop,
taking up the CPU heavily, until finally the NM daemon itself becomes unresponsive.

$ nmcli gen

(process:13129): nmcli-CRITICAL **: Error: Could not create NMClient object: Unable to authenticate request..

Debug log of the messages loop attached.
  
Version-Release number of selected component (if applicable):
libteam-1.17-1.el7.aarch64
teamd-1.17-1.el7.aarch64
NetworkManager-team-1.0.4-10.el7.aarch64

Additional info:

I believe this bug is not aarch64 specific, though currently only occurs there due to bz1257195 , please ping me if you request access to an affected system.

Comment 3 Dan Williams 2016-04-26 15:34:28 UTC

NM uses libteamdctl, which does not implement async methods to control/communicate with teamd.  Ideally libteamdctl would add async methods alongside the current sync ones.  Otherwise we have to figure out some other way of talking to teamd.

Comment 4 Beniamino Galvani 2016-06-09 12:47:05 UTC

The attached log show a loop of auto-connect retries for team0.0 which
doesn't stop because the retry counter is not decremented:

[1440599581.531769] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599581.721971] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599581.880070] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599582.074959] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left

This is the probable cause of the unresponsiveness of the NM
process. NM 1.2 does not exhibit this behavior when teamd fails.

(In reply to Dan Williams from comment #3)
> NM uses libteamdctl, which does not implement async methods to
> control/communicate with teamd.  Ideally libteamdctl would add async methods
> alongside the current sync ones.  Otherwise we have to figure out some other
> way of talking to teamd.

Even if I'm not sure that the issue in comment 0 is caused by teamd
not responding and blocking NM, this is definitely a possible scenario
that should be avoided.

libteamdctl implements different backends to talk to teamd (unix
socket, D-Bus and ZeroMQ), but all its methods are sync, so if teamd
doesn't respond to a call NM would wait. But we could bypass
libteamdctl and interact with teamd via D-Bus using this API:

https://github.com/jpirko/libteam/wiki/Infrastructure-Specification#teamd-control-api

Actually, libteamdctl does not much more than wrapping the D-Bus
methods, so it should be quite simple.

Comment 5 Thomas Haller 2016-06-09 16:09:54 UTC

(In reply to Beniamino Galvani from comment #4)
> The attached log show a loop of auto-connect retries for team0.0 which
> doesn't stop because the retry counter is not decremented:
> 
> [1440599581.531769] [nm-policy.c:1125] device_state_changed(): Connection
> 'team0.0' failed to autoconnect; 4 tries left
> [1440599581.721971] [nm-policy.c:1125] device_state_changed(): Connection
> 'team0.0' failed to autoconnect; 4 tries left
> [1440599581.880070] [nm-policy.c:1125] device_state_changed(): Connection
> 'team0.0' failed to autoconnect; 4 tries left
> [1440599582.074959] [nm-policy.c:1125] device_state_changed(): Connection
> 'team0.0' failed to autoconnect; 4 tries left
> 
> This is the probable cause of the unresponsiveness of the NM
> process. NM 1.2 does not exhibit this behavior when teamd fails.
> 
> (In reply to Dan Williams from comment #3)
> > NM uses libteamdctl, which does not implement async methods to
> > control/communicate with teamd.  Ideally libteamdctl would add async methods
> > alongside the current sync ones.  Otherwise we have to figure out some other
> > way of talking to teamd.
> 
> Even if I'm not sure that the issue in comment 0 is caused by teamd
> not responding and blocking NM, this is definitely a possible scenario
> that should be avoided.
> 
> libteamdctl implements different backends to talk to teamd (unix
> socket, D-Bus and ZeroMQ), but all its methods are sync, so if teamd
> doesn't respond to a call NM would wait. But we could bypass
> libteamdctl and interact with teamd via D-Bus using this API:
> 
> https://github.com/jpirko/libteam/wiki/Infrastructure-Specification#teamd-
> control-api
> 
> Actually, libteamdctl does not much more than wrapping the D-Bus
> methods, so it should be quite simple.

Adding asyncronous API to libteamdctl seems a bit complicated, and maybe out of scope for the simple library that it is (which is a good thing).

libteamdctl.so is pretty small (23400 bytes), so doing without it is not a big win. But maybe still worth it? Usually it's pretty straight-forward to use D-Bus directly via gdbus (YMMV).

Comment 6 Beniamino Galvani 2016-06-21 07:40:22 UTC

There are places in which we can't use asynchronous calls and a
conversion to D-Bus would not provide any benefit since the calling
function must wait for the result before continuing:

 - update_connection()
 - master_update_slave_connection()
 - enslave_slave()

There we would have to add a timeout and continue if teamd doesn't
respond. Which is exactly what libteamdctl does (the timeout is
set to 5s).

Maybe instead we can continue using libteamdctl, but fail the
connection at the first communication error, so that we don't keep
trying to contact teamd and block every time for 5s?

Comment 7 Beniamino Galvani 2016-06-30 12:42:30 UTC

I believe the main issue is cause by a loop in activation failures as
described in comment 4 and this seems solved with the patches for bug
1270814.

Regarding the improvements in how NM interacts with teamd, in my
opinion we should do the following:

 1. in the short term, fix the few places where we ignore the return
   value of libteamdctl calls and make the activation fail immediately
   instead of potentially block again later

 2. as a long term solution, use teamd D-Bus API directly from NM to
   achieve asynchronous communication where possible. This will also
   allow to drop the dependency on libteamdctl.

Branch bg/team-conf-read-rh1257237 implements 1. Regarding 2. we have
an upstream bugzilla tracking it [1].


[1] https://bugzilla.gnome.org/show_bug.cgi?id=768189

Comment 8 Thomas Haller 2016-06-30 13:21:21 UTC

(In reply to Beniamino Galvani from comment #7)

The plan sounds good.

Branch bg/team-conf-read-rh1257237 lgtm

Comment 9 Lubomir Rintel 2016-07-07 13:46:27 UTC

The branch looks good to me too.

Comment 10 Beniamino Galvani 2016-07-07 15:37:45 UTC

Merged to master:

https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=0de2483685e75ccf55303d6e0c593371456cdfd0

Comment 12 Vladimir Benes 2016-08-17 14:36:12 UTC

CPU peaks are not seen anymore

Comment 14 errata-xmlrpc 2016-11-03 19:15:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html