Red Hat Bugzilla – Bug 1257237
NetworkManager loops and takes CPU until it dies when teamd is unresponsive
Last modified: 2016-11-03 15:15:30 EDT
Created attachment 1067282 [details]
A debug log of the loop

Description of problem:

There is currently an issue with teamd failing to respond on D-Bus on aarch64:
https://bugzilla.redhat.com/show_bug.cgi?id=1257195

When attempting to create a plain standalone team with such a failing teamd, NetworkManager gives up after the first failure:

$ nmcli con add type team
Connection 'team' (f5933f53-70aa-4e46-9174-f7f7f2be7b15) successfully added.

$ nmcli con
NAME  UUID                                  TYPE            DEVICE
eth0  a37acea1-1a99-45b9-9ba9-ea665c79fa2c  802-3-ethernet  eth0
eth2  01718f51-dc36-4fd6-8f07-66080217cb4f  802-3-ethernet  --
team  f5933f53-70aa-4e46-9174-f7f7f2be7b15  team            --
eth1  03ed154d-509a-47a7-980b-47955eb52caa  802-3-ethernet  --

$ nmcli device
DEVICE  TYPE      STATE         CONNECTION
eth0    ethernet  connected     eth0
eth1    ethernet  disconnected  --
eth2    ethernet  disconnected  --
lo      loopback  unmanaged     --

The corresponding log output is in comment 0 of bz1257195.

However, when we try to add a slave connection:

$ nmcli con add type team
$ nmcli connection add type team-slave ifname eth1 con-name team0.0 master nm-team

$ nmcli con
NAME     UUID                                  TYPE            DEVICE
eth0     a37acea1-1a99-45b9-9ba9-ea665c79fa2c  802-3-ethernet  eth0
eth2     01718f51-dc36-4fd6-8f07-66080217cb4f  802-3-ethernet  --
eth1     03ed154d-509a-47a7-980b-47955eb52caa  802-3-ethernet  --
team0.0  7ba578f3-2f5f-437e-8e84-a8079dfbdd92  802-3-ethernet  --
team     c96f4a95-a79b-4fbf-a8ea-8ee444615223  team            --

$ nmcli dev
DEVICE  TYPE      STATE                                  CONNECTION
eth0    ethernet  connected                              eth0
eth1    ethernet  connecting (getting IP configuration)  team0.0
eth2    ethernet  disconnected                           --
lo      loopback  unmanaged

At that point NetworkManager keeps attempting activation of the device in a loop, consuming CPU heavily, until finally the NM daemon itself becomes unresponsive:

$ nmcli gen
(process:13129): nmcli-CRITICAL **: Error: Could not create NMClient object: Unable to authenticate request..

A debug log of the message loop is attached.
Version-Release number of selected component (if applicable):
libteam-1.17-1.el7.aarch64
teamd-1.17-1.el7.aarch64
NetworkManager-team-1.0.4-10.el7.aarch64

Additional info:
I believe this bug is not aarch64-specific, though it currently occurs only there due to bz1257195. Please ping me if you need access to an affected system.
NM uses libteamdctl, which does not implement async methods to control/communicate with teamd. Ideally libteamdctl would add async methods alongside the current sync ones. Otherwise we have to figure out some other way of talking to teamd.
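As an aside, the benefit of asynchronous control calls can be sketched with a minimal, hypothetical Python/asyncio example (NM itself is C and would use GDBus async calls; all names below are invented for illustration): the caller attaches a deadline to the request and keeps servicing other work while waiting, instead of blocking its main loop on an unresponsive teamd.

```python
import asyncio

async def unresponsive_teamd():
    """Simulates a teamd that never answers a control request."""
    await asyncio.sleep(3600)

async def query_teamd(timeout):
    """Issue the request with a deadline; the event loop stays free."""
    try:
        await asyncio.wait_for(unresponsive_teamd(), timeout)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"

async def main():
    other_work_done = 0

    async def other_work():
        # Stands in for everything else NM's main loop must service.
        nonlocal other_work_done
        for _ in range(5):
            await asyncio.sleep(0.01)
            other_work_done += 1

    # Both run concurrently: the teamd query times out after 0.1 s,
    # while the other work completes unimpeded in the meantime.
    result, _ = await asyncio.gather(query_teamd(0.1), other_work())
    return result, other_work_done

result, done = asyncio.run(main())
```

With a synchronous call the five units of "other work" would only run after the full timeout had elapsed; here they complete while the query is still pending.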
The attached log shows a loop of auto-connect retries for team0.0 which doesn't stop because the retry counter is not decremented:

[1440599581.531769] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599581.721971] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599581.880070] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599582.074959] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left

This is the probable cause of the unresponsiveness of the NM process. NM 1.2 does not exhibit this behavior when teamd fails.

(In reply to Dan Williams from comment #3)
> NM uses libteamdctl, which does not implement async methods to
> control/communicate with teamd. Ideally libteamdctl would add async methods
> alongside the current sync ones. Otherwise we have to figure out some other
> way of talking to teamd.

Even though I'm not sure that the issue in comment 0 is caused by teamd not responding and blocking NM, this is definitely a possible scenario that should be avoided.

libteamdctl implements different backends to talk to teamd (Unix socket, D-Bus and ZeroMQ), but all its methods are synchronous, so if teamd doesn't respond to a call, NM will block waiting. But we could bypass libteamdctl and interact with teamd via D-Bus using this API:

https://github.com/jpirko/libteam/wiki/Infrastructure-Specification#teamd-control-api

Actually, libteamdctl does little more than wrap the D-Bus methods, so this should be quite simple.
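For illustration only, the effect of the missing decrement can be reproduced with a tiny, hypothetical sketch (the names below are invented and are not NetworkManager's actual nm-policy.c logic): with the decrement in place the loop terminates after the configured number of tries; without it, the counter would read "4 tries left" forever, as in the log above.

```python
RETRIES_DEFAULT = 4

def autoconnect_with_retries(activate, retries=RETRIES_DEFAULT):
    """Retry activation until it succeeds or the retry budget runs out.

    Decrementing `retries` on every failure is what guarantees
    termination; the bug discussed here was equivalent to skipping
    that decrement, so the loop never ended.
    """
    attempts = 0
    while retries > 0:
        attempts += 1
        if activate():
            return True, attempts
        retries -= 1  # the step the buggy code path effectively skipped
    return False, attempts

# A teamd that never responds: every activation attempt fails.
success, attempts = autoconnect_with_retries(lambda: False)
```

Here activation is abandoned after exactly four attempts instead of spinning indefinitely.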
(In reply to Beniamino Galvani from comment #4)
> libteamdctl implements different backends to talk to teamd (unix
> socket, D-Bus and ZeroMQ), but all its methods are sync, so if teamd
> doesn't respond to a call NM would wait. But we could bypass
> libteamdctl and interact with teamd via D-Bus using this API:
>
> https://github.com/jpirko/libteam/wiki/Infrastructure-Specification#teamd-control-api
>
> Actually, libteamdctl does not much more than wrapping the D-Bus
> methods, so it should be quite simple.

Adding an asynchronous API to libteamdctl seems a bit complicated, and maybe out of scope for the simple library that it is (which is a good thing). libteamdctl.so is pretty small (23400 bytes), so doing without it is not a big win.
But maybe still worth it? Usually it's pretty straightforward to use D-Bus directly via GDBus (YMMV).
There are places where we can't use asynchronous calls, and a conversion to D-Bus would not provide any benefit since the calling function must wait for the result before continuing:

- update_connection()
- master_update_slave_connection()
- enslave_slave()

There we would have to add a timeout and continue if teamd doesn't respond, which is exactly what libteamdctl already does (its timeout is set to 5s). Maybe instead we can keep using libteamdctl, but fail the connection at the first communication error, so that we don't keep trying to contact teamd and block for 5s every time?
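The fail-at-first-error idea could look roughly like this hypothetical sketch (invented names, not the real libteamdctl API): the first timed-out control call latches an error flag, and every later call fails immediately instead of blocking for another 5 s.

```python
class TeamdError(Exception):
    """Raised when a (simulated) teamd control call fails."""

class TeamdClient:
    """Hypothetical wrapper that fails fast after the first teamd timeout."""

    def __init__(self, call):
        self._call = call      # the blocking control-call function
        self._failed = False   # latched on the first communication error

    def request(self, method):
        if self._failed:
            # Fail immediately: don't block again on a dead teamd.
            raise TeamdError("teamd unresponsive; failing activation")
        try:
            return self._call(method)
        except TimeoutError:
            self._failed = True
            raise TeamdError(f"teamd did not answer {method!r}")

# A teamd that never answers; count how often we actually block on it.
calls = {"n": 0}
def dead_teamd(method):
    calls["n"] += 1
    raise TimeoutError  # stands in for libteamdctl's 5 s timeout

client = TeamdClient(dead_teamd)
errors = 0
for m in ("ConfigDump", "StateDump", "StateDump"):
    try:
        client.request(m)
    except TeamdError:
        errors += 1
```

Only the first request pays the timeout; the later ones fail instantly, so a stuck teamd costs one 5 s wait instead of one per call.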
I believe the main issue is caused by a loop of activation failures as described in comment 4, and this seems solved by the patches for bug 1270814. Regarding the improvements in how NM interacts with teamd, in my opinion we should do the following:

1. In the short term, fix the few places where we ignore the return value of libteamdctl calls, and make the activation fail immediately instead of potentially blocking again later.

2. As a long-term solution, use the teamd D-Bus API directly from NM to achieve asynchronous communication where possible. This will also allow us to drop the dependency on libteamdctl.

Branch bg/team-conf-read-rh1257237 implements 1. Regarding 2., we have an upstream bugzilla tracking it [1].

[1] https://bugzilla.gnome.org/show_bug.cgi?id=768189
(In reply to Beniamino Galvani from comment #7) The plan sounds good. Branch bg/team-conf-read-rh1257237 lgtm
The branch looks good to me too.
Merged to master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=0de2483685e75ccf55303d6e0c593371456cdfd0
CPU peaks are no longer seen.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2581.html