Bug 1257237
Summary: | NetworkManager loops and takes CPU until it dies when teamd is unresponsive | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Vitezslav Humpa <vhumpa>
Component: | NetworkManager | Assignee: | Beniamino Galvani <bgalvani>
Status: | CLOSED ERRATA | QA Contact: | Desktop QE <desktop-qa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 7.2 | CC: | aloughla, atragler, bgalvani, dcbw, kzhang, lkundrak, lrintel, mleitner, rkhan, thaller, vbenes
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-11-03 19:15:30 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1301628, 1313485 | |
Attachments: | | |

Description
Vitezslav Humpa
2015-08-26 14:50:54 UTC
NM uses libteamdctl, which does not implement async methods to control/communicate with teamd. Ideally libteamdctl would add async methods alongside the current sync ones. Otherwise we have to figure out some other way of talking to teamd.

The attached log shows a loop of auto-connect retries for team0.0 which doesn't stop because the retry counter is not decremented:

[1440599581.531769] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599581.721971] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599581.880070] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
[1440599582.074959] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left

This is the probable cause of the unresponsiveness of the NM process. NM 1.2 does not exhibit this behavior when teamd fails.

(In reply to Dan Williams from comment #3)
> NM uses libteamdctl, which does not implement async methods to
> control/communicate with teamd. Ideally libteamdctl would add async methods
> alongside the current sync ones. Otherwise we have to figure out some other
> way of talking to teamd.

Even if I'm not sure that the issue in comment 0 is caused by teamd not responding and blocking NM, this is definitely a possible scenario that should be avoided.

libteamdctl implements different backends to talk to teamd (unix socket, D-Bus and ZeroMQ), but all its methods are sync, so if teamd doesn't respond to a call NM would wait. But we could bypass libteamdctl and interact with teamd via D-Bus using this API:

https://github.com/jpirko/libteam/wiki/Infrastructure-Specification#teamd-control-api

Actually, libteamdctl does little more than wrap the D-Bus methods, so it should be quite simple.

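To make the retry-counter point from the log excerpt concrete, here is a minimal sketch of an autoconnect loop that does decrement its counter. It is not NetworkManager code: `Connection`, `on_activation_failed`, `autoconnect`, and the `max_attempts` safety cap are hypothetical names introduced for illustration, and the starting value of 4 is simply taken from the "4 tries left" messages above.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    """Hypothetical stand-in for an autoconnect-capable connection profile."""
    name: str
    retries_left: int = 4  # assumption: value copied from the log messages


def on_activation_failed(conn: Connection) -> bool:
    """Record one failed attempt; return True if another attempt is allowed."""
    if conn.retries_left > 0:
        # This is the step the looping log never appears to take.
        conn.retries_left -= 1
    return conn.retries_left > 0


def autoconnect(conn: Connection, activate, max_attempts: int = 100) -> int:
    """Call activate(conn) until it succeeds or retries run out.

    Returns the number of attempts made. max_attempts is only a safety
    cap for this sketch so that a buggy counter cannot spin forever.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if activate(conn):
            break
        if not on_activation_failed(conn):
            break
    return attempts
```

With an `activate` callback that always fails, the loop stops after four attempts instead of repeating the same "4 tries left" state indefinitely.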
(In reply to Beniamino Galvani from comment #4)
> The attached log shows a loop of auto-connect retries for team0.0 which
> doesn't stop because the retry counter is not decremented:
>
> [1440599581.531769] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
> [1440599581.721971] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
> [1440599581.880070] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
> [1440599582.074959] [nm-policy.c:1125] device_state_changed(): Connection 'team0.0' failed to autoconnect; 4 tries left
>
> This is the probable cause of the unresponsiveness of the NM
> process. NM 1.2 does not exhibit this behavior when teamd fails.
>
> (In reply to Dan Williams from comment #3)
> > NM uses libteamdctl, which does not implement async methods to
> > control/communicate with teamd. Ideally libteamdctl would add async methods
> > alongside the current sync ones. Otherwise we have to figure out some other
> > way of talking to teamd.
>
> Even if I'm not sure that the issue in comment 0 is caused by teamd
> not responding and blocking NM, this is definitely a possible scenario
> that should be avoided.
>
> libteamdctl implements different backends to talk to teamd (unix
> socket, D-Bus and ZeroMQ), but all its methods are sync, so if teamd
> doesn't respond to a call NM would wait. But we could bypass
> libteamdctl and interact with teamd via D-Bus using this API:
>
> https://github.com/jpirko/libteam/wiki/Infrastructure-Specification#teamd-control-api
>
> Actually, libteamdctl does little more than wrap the D-Bus
> methods, so it should be quite simple.

Adding an asynchronous API to libteamdctl seems a bit complicated, and maybe out of scope for the simple library that it is (which is a good thing). libteamdctl.so is pretty small (23400 bytes), so doing without it is not a big win.
But maybe still worth it? Usually it's pretty straightforward to use D-Bus directly via gdbus (YMMV).

There are places in which we can't use asynchronous calls, and a conversion to D-Bus would not provide any benefit since the calling function must wait for the result before continuing:

- update_connection()
- master_update_slave_connection()
- enslave_slave()

There we would have to add a timeout and continue if teamd doesn't respond, which is exactly what libteamdctl does (the timeout is set to 5s). Maybe instead we can continue using libteamdctl, but fail the connection at the first communication error, so that we don't keep trying to contact teamd and block every time for 5s?

I believe the main issue is caused by a loop in activation failures as described in comment 4, and this seems solved with the patches for bug 1270814.

Regarding the improvements in how NM interacts with teamd, in my opinion we should do the following:

1. In the short term, fix the few places where we ignore the return value of libteamdctl calls and make the activation fail immediately instead of potentially blocking again later.
2. As a long-term solution, use the teamd D-Bus API directly from NM to achieve asynchronous communication where possible. This will also allow us to drop the dependency on libteamdctl.

Branch bg/team-conf-read-rh1257237 implements 1. Regarding 2, we have an upstream bugzilla tracking it [1].

[1] https://bugzilla.gnome.org/show_bug.cgi?id=768189

(In reply to Beniamino Galvani from comment #7)
The plan sounds good. Branch bg/team-conf-read-rh1257237 lgtm

The branch looks good to me too.

Merged to master:
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=0de2483685e75ccf55303d6e0c593371456cdfd0

CPU peaks are not seen anymore

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2581.html
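For readers who want to picture the short-term fix discussed in the thread (check every return value and fail activation at the first communication error), the following is a rough sketch of that pattern. It is not the merged patch: `TeamdError` and `activate_team` are hypothetical names, and the convention that a step returns 0 on success and a non-zero code on error is an assumption made for this illustration.

```python
class TeamdError(Exception):
    """Hypothetical error raised when a teamd control step fails."""


def activate_team(steps):
    """Run control steps in order and abort on the first failure.

    Each step is a callable returning 0 on success and a non-zero code
    on error (an assumed C-style convention). Stopping at the first
    error means later steps never get a chance to block on an
    unresponsive daemon. Returns the number of steps completed.
    """
    for index, step in enumerate(steps):
        rc = step()
        if rc != 0:
            # Do not ignore the return value: fail the activation now.
            raise TeamdError(f"step {index} failed with code {rc}")
    return len(steps)
```

The point of the design is that a single failed call costs at most one timeout, rather than one timeout per remaining call.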