Bug 1408219 - ifup is very slow when NetworkManager is running
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: initscripts
Version: 7.3
Hardware: All
OS: Linux
Priority: high
Severity: urgent
Target Milestone: pre-dev-freeze
Target Release: ---
Assignee: David Kaspar [Dee'Kej]
QA Contact: Leos Pol
Docs Contact: Ioanna Gkioka
URL:
Whiteboard:
Keywords: ZStream
Depends On:
Blocks: vdsm_config_NetworkMgr_to_be_passive 1422820
 
Reported: 2016-12-22 13:51 UTC by Meni Yakove
Modified: 2017-08-01 07:29 UTC
CC: 27 users

*NetworkManager* no longer slows down the *ifup* script

Previously, the *ifup* script was very slow when *NetworkManager* was running. This particularly affected Red Hat Virtualization (RHV) network startup times. A patch has been applied to initscripts, and the described problem no longer occurs.
Clone Of:
Clones: 1422820
Last Closed: 2017-08-01 07:29:01 UTC




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2286 normal SHIPPED_LIVE initscripts bug fix and enhancement update 2017-08-01 11:26:43 UTC

Description Meni Yakove 2016-12-22 13:51:38 UTC
Description of problem:
The setupNetworks command is slow when NetworkManager is running.
setupNetworks command:
Sends setupNetwork action request to VDS host   with {'add': {'bond024': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond024', 'mode': None, 'slaves': ['dummy_3', 'dummy_2'], 'properties': None}, 'bond025': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond025', 'mode': None, 'slaves': ['dummy_19', 'dummy_18'], 'properties': None}, 'bond026': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond026', 'mode': None, 'slaves': ['dummy_17', 'dummy_16'], 'properties': None}, 'bond027': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond027', 'mode': None, 'slaves': ['dummy_15', 'dummy_14'], 'properties': None}, 'bond021': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond021', 'mode': None, 'slaves': ['dummy_9', 'dummy_8'], 'properties': None}, 'bond022': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond022', 'mode': None, 'slaves': ['dummy_7', 'dummy_6'], 'properties': None}, 'bond023': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond023', 'mode': None, 'slaves': ['dummy_5', 'dummy_4'], 'properties': None}, 'bond028': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond028', 'mode': None, 'slaves': ['dummy_13', 'dummy_12'], 'properties': None}, 'bond029': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond029', 'mode': None, 'slaves': ['dummy_11', 'dummy_10'], 'properties': None}, 'C2_sync1_7': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_7', 'nic': 'bond027', 'mode': None, 'slaves': [], 'properties': None}, 'C2_sync1_6': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_6', 'nic': 'bond026', 'mode': None, 'slaves': [], 'properties': None}, 'C2_sync1_5': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_5', 'nic': 'bond025', 'mode': None, 'slaves': [], 'properties': None}, 'C2_sync1_4': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_4', 'nic': 'bond024', 'mode': None, 
'slaves': [], 'properties': None}, 'C2_sync1_3': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_3', 'nic': 'bond023', 'mode': None, 'slaves': [], 'properties': None}, 'C2_sync1_2': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_2', 'nic': 'bond022', 'mode': None, 'slaves': [], 'properties': None}, 'C2_sync1_1': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_1', 'nic': 'bond021', 'mode': None, 'slaves': [], 'properties': None}, 'C2_sync1_9': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_9', 'nic': 'bond029', 'mode': None, 'slaves': [], 'properties': None}, 'C2_sync1_8': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_8', 'nic': 'bond028', 'mode': None, 'slaves': [], 'properties': None}}, 'host_name': 'host_mixed_1', 'persist': False}


NetworkManager not running: 
    ADD:    --- 29.0276751518 seconds ---
    REMOVE: --- 46.1872498989 seconds ---

NetworkManager running:     
    ADD:    --- 154.355500937 seconds ---
    REMOVE: --- 181.257375956 seconds ---

NetworkManager running, with the following block commented out in /etc/sysconfig/network-scripts/network-functions:

    if is_nm_running && [ "$REALDEVICE" != "lo" ] ; then
        nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
        if ! is_false $NM_CONTROLLED; then
            UUID=$(get_uuid_by_config $CONFIG)
            [ -n "$UUID" ] && _use_nm=true
        fi
    fi

    ADD:    --- 26.5447192192 seconds ---
    REMOVE: --- 40.7543609142 seconds ---
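For reference, the timings above can be reproduced with a minimal harness like the sketch below. The run_case function is a placeholder standing in for the real workload; on the test host it would be the ifup/ifdown sequence driven by VDSM's setupNetworks.

```shell
# Minimal timing harness producing output in the style of the numbers
# above. run_case is a hypothetical stand-in for the measured workload.
elapsed() {
    local start end
    start=$(date +%s.%N)
    "$@" >/dev/null
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "--- %s seconds ---\n", e - s }'
}

run_case() { sleep 0.1; }  # placeholder workload

elapsed run_case
```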


Version-Release number of selected component (if applicable):
vdsm-4.18.999-1219.gite8c2926.el7.centos.x86_64
NetworkManager-1.4.0-13.el7_3.x86_64
ovirt-engine-4.2.0-0.0.master.20161219154406.gitf36001c.el7.centos.noarch


Steps to Reproduce:
1.host_network_api/sync_test case 02

Comment 1 Edward Haas 2016-12-23 20:07:28 UTC
This seems to be a regression from RHEL 7.2 when NetworkManager is running.

VDSM frequently uses ifdown/ifup while setting up networks.
With RHEL 7.3, nmcli is called on each iface action: once to check whether NetworkManager is running, and again to load the ifcfg file.
These two calls seem to slow down the setup process heavily.

It will be impossible to run VDSM with NM running in this state.

Comment 4 Beniamino Galvani 2016-12-29 17:09:27 UTC
Hi,

is there a way to reproduce the regression without running the ovirt
test suite? Maybe you can share the ifcfg files and a sequence of
ifup/ifdown commands that exhibit the slowdown when NM is enabled?

Also, NM trace logs collected during the network setup would be
helpful. To increase logging level set 'level=TRACE' in the [logging]
section of /etc/NetworkManager/NetworkManager.conf and restart the
service ('systemctl restart NetworkManager'). Thanks!
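The logging change described above can be sketched as follows. This operates on a local copy of the file for illustration (the [main] section content is illustrative); on a real system the file is /etc/NetworkManager/NetworkManager.conf and the service must be restarted afterwards with 'systemctl restart NetworkManager'.

```shell
# Sketch: enable TRACE logging in a NetworkManager.conf-style file.
# A local sample copy is used here; the [main] contents are illustrative.
cat > NetworkManager.conf <<'EOF'
[main]
plugins=ifcfg-rh

[logging]
EOF

# Insert level=TRACE directly under the [logging] section header.
sed -i '/^\[logging\]/a level=TRACE' NetworkManager.conf

grep -A1 '^\[logging\]' NetworkManager.conf
```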

Comment 5 Edward Haas 2017-01-02 13:30:17 UTC
(In reply to Beniamino Galvani from comment #4)

I consulted Thomas Haller about this issue at the time.
It seems to be a known issue with the nmcli tool, which loads all the data on each call.
In RHEL 7.3, initscripts changed to include calls to nmcli (two such calls) on each ifup, so we suspected the slowdown was related.
After removing these calls from the script, everything passed in a reasonable time frame.

The ifcfg files are simple static IP addresses on dummy ifaces; if you run ifup/ifdown on multiple such devices multiple times, you should see the problem.

I guess an intermediate solution could be to avoid calling nmcli in the initscripts and provide other alternatives (accessing D-Bus directly, for example).
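A sketch of the reproduction setup described above. The device names, the 192.0.2.0/24 test range, and the local output directory are all illustrative, not taken from the bug. On a real host the files would go under /etc/sysconfig/network-scripts/, the devices would first be created with "ip link add dummy_N type dummy", and the slowdown shows up when looping "ifup dummy_N; ifdown dummy_N" over many such devices.

```shell
# Generate static-IP ifcfg files for dummy interfaces (illustrative
# names and addresses) into a local directory.
mkdir -p network-scripts
for i in 0 1 2 3; do
    cat > "network-scripts/ifcfg-dummy_$i" <<EOF
DEVICE=dummy_$i
TYPE=Ethernet
BOOTPROTO=none
IPADDR=192.0.2.$((10 + i))
PREFIX=24
ONBOOT=no
NM_CONTROLLED=yes
EOF
done

ls network-scripts/
```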

Comment 6 Beniamino Galvani 2017-01-03 18:17:36 UTC
Hi,

I understand, thanks. Maybe the suggestion in [1] would improve the
situation a lot (it does in my tests, but I'm not sure we have a
similar setup). Can you replace:

  nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"

in /etc/sysconfig/network-scripts/network-functions with the
equivalent D-Bus call (which avoids loading the whole NM D-Bus
object tree):

  dbus-send --system --print-reply \
      --dest=org.freedesktop.NetworkManager \
      /org/freedesktop/NetworkManager/Settings \
      org.freedesktop.NetworkManager.Settings.LoadConnections \
      array:string:"/etc/sysconfig/network-scripts/$CONFIG"

and see what the impact on performance is?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1345919#c2

Comment 8 Edward Haas 2017-01-04 06:58:35 UTC
(In reply to Beniamino Galvani from comment #6)
> Hi,
> 
> I understand, thanks. Maybe the suggestion in [1] would improve the
> situation a lot (it does in my tests, but I'm not sure we have a
> similar setup). Can you replace:
> 
>   nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
> 
> in /etc/sysconfig/network-scripts/network-functions with the
> equivalent D-Bus call (which avoids the load of the whole NM D-Bus
> object tree):
> 
>   dbus-send --system --print-reply \
>       --dest=org.freedesktop.NetworkManager \
>       /org/freedesktop/NetworkManager/Settings \
>       org.freedesktop.NetworkManager.Settings.LoadConnections \
>       array:string:"/etc/sysconfig/network-scripts/$CONFIG"
> 
> and see which is the impact on performances?
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1345919#c2

We will check this out, thank you.
What about the second nmcli reference: is_nm_running ?
is_nm_running ()
{
    [ "$(LANG=C nmcli -t --fields running general status 2>/dev/null)" = "running" ]
}

Comment 9 Beniamino Galvani 2017-01-04 08:58:29 UTC
(In reply to Edward Haas from comment #8)
> We will check this out, thank you.
> What about the second nmcli reference: is_nm_running ?
> is_nm_running ()
> {
>     [ "$(LANG=C nmcli -t --fields running general status 2>/dev/null)" =
> "running" ]
> }

The equivalent command using D-Bus would be:

is_nm_running ()
{
    dbus-send --system --print-reply \
        --dest=org.freedesktop.DBus \
        /org/freedesktop/DBus \
        org.freedesktop.DBus.GetNameOwner \
        string:"org.freedesktop.NetworkManager" >/dev/null 2>&1
}

Comment 10 Ori Ben Sasson 2017-01-05 15:29:27 UTC
(In reply to Beniamino Galvani from comment #9)
> (In reply to Edward Haas from comment #8)
> > We will check this out, thank you.
> > What about the second nmcli reference: is_nm_running ?
> > is_nm_running ()
> > {
> >     [ "$(LANG=C nmcli -t --fields running general status 2>/dev/null)" =
> > "running" ]
> > }
> 
> The equivalent command using D-Bus would be:
> 
> is_nm_running ()
> {
>     dbus-send --system --print-reply \
>         --dest=org.freedesktop.DBus \
>         /org/freedesktop/DBus \
>         org.freedesktop.DBus.GetNameOwner \
>         string:"org.freedesktop.NetworkManager" >/dev/null 2>&1
> }

Hi,
The two changes that you recommended look fine:

NetworkManager running:
ADD:    27.10322904586792 seconds
REMOVE: 43.55953407287598 seconds

NetworkManager not running:
ADD:    25.3023841381073 seconds
REMOVE: 40.55843997001648 seconds

Comment 11 Beniamino Galvani 2017-01-09 09:45:12 UTC
I think initscripts should consider applying the changes in comments 6
and 9. We are constantly improving the performance of nmcli and we
are working on speeding up the "con load" and "general status"
commands, but this will take time and will not be easily backportable
to previous RHEL versions due to a nearly complete rewrite of the
D-Bus handling code done for 1.6.

The proposed dbus-send replacement uses the same D-Bus API as nmcli
but without the overhead that libnm currently has in the
initialization phase.

Reassigning the bug to initscripts to get feedback on this proposal.

Comment 12 Beniamino Galvani 2017-01-11 09:36:37 UTC
Hi David,

what do you think about comment 11?

Comment 13 Yaniv Lavi 2017-01-22 13:55:11 UTC
Can you please reply?

Comment 14 Yaniv Lavi 2017-01-23 08:25:13 UTC
Can you please help here?
This is an urgent matter.

Comment 15 David Kaspar [Dee'Kej] 2017-01-23 12:30:40 UTC
(In reply to Beniamino Galvani from comment #12)
> Hi David,
> 
> what do you think about comment 11?

Hello Beniamino,

I'm no expert on D-Bus. Can you guarantee that these changes using D-Bus do not have any side effects? I don't want to backport anything that could make the current situation even worse by introducing new regressions.

Otherwise, I'm okay with changing this just for RHEL-7. (RHEL-6 and Fedora should be fine AFAIK).

Comment 16 David Kaspar [Dee'Kej] 2017-01-23 12:33:29 UTC
Restoring the needinfo lost from last comment.

Comment 17 Beniamino Galvani 2017-01-23 14:13:59 UTC
(In reply to David Kaspar [Dee'Kej] from comment #15)
> I'm no expert on the D-Bus. Can you guarantee that these changes with using
> D-Bus do not have any side-effects? I don't want to backport anything that
> could even worse the current situation by introducing new regressions.

Hi, those changes are equivalent to the old calls using nmcli.

Comment 18 David Kaspar [Dee'Kej] 2017-01-23 14:17:39 UTC
(In reply to Beniamino Galvani from comment #17)
> Hi, those changes are equivalent to the old calls using nmcli.

OK, in this case I'm OK with changing it in RHEL-7.

Comment 20 Petr Horáček 2017-01-31 12:14:52 UTC
I opened a pull request on [1] based on Beniamino's suggestions. Please review.

[1] https://github.com/fedora-sysv/initscripts/pull/57

Comment 22 Leos Pol 2017-02-07 09:01:37 UTC
Well, I was a little hasty with the qa_ack here. On second thought, I don't think this kind of change is a good candidate for a z-stream. I simply don't see a strong argument that this change does not break something. Also, applying a workaround to one of the key components because nmcli is slow and would take some time to fix is, I believe, not the best approach. Could you guys rethink the proposed solution? Otherwise I'm afraid we are not able to guarantee a regression-free release.

Comment 23 Dan Kenigsberg 2017-02-07 09:10:24 UTC
Leos, the main point is that rhel-7.3.0 was not "regression free". The code solving bug 1345919 added incredible slowness to ifup, which causes even worse slowdown in RHV network regression tests.

Here we are trying to revert this unbearable side-effect of bug 1345919, so we can start using NM in RHV.

Comment 24 Yaniv Lavi 2017-02-08 09:15:49 UTC
(In reply to Leos Pol from comment #22)
> Well, I was a little bit hasty with qa_ack here. On the second though I
> don't think this kind of change is good candidate for a z-stream. Simply I
> don't see the strong argument, that this change does not break something.
> Also let's apply some workaround to one of the key component 'cause nmcli is
> slow and it would take some time to fix it, I believe is not the best
> approach. Could you guys rethink to proposed solution otherwise I'm afraid
> we're not able to guarantee regression free release.

This is a regression in 7.3 and causes network ifup to be 5 times slower.
This is a major issue for us at scale and we need it fixed; that is why we got PM approval.

Comment 25 Yaniv Lavi 2017-02-08 09:52:07 UTC
We also did wide testing with these patches on the RHV automation and saw no regression for any of our use cases.

Comment 27 Yaniv Lavi 2017-02-15 09:00:59 UTC
Can we please clone this to z and backport?

Comment 28 Yaniv Lavi 2017-02-15 09:43:53 UTC
Can you also provide a brew build for us to test? Our network automation runs now take 9 hours instead of 3 due to this bug.

Comment 29 David Kaspar [Dee'Kej] 2017-02-15 12:43:36 UTC
(In reply to Yaniv Dary from comment #27)
> Can we please clone this to z and backport?

Sorry Yaniv, I'm not authorized in my role (I don't have the privileges) to create a clone for Z-Stream.

(In reply to Yaniv Dary from comment #28)
> Can you also provide a brew build for us to test? Our network automation run
> are now 9 hours instead of 3 hours due to this bug.

Yes, I can do that. But please keep in mind the scratch build is *not* to be used in a production environment.

Comment 41 errata-xmlrpc 2017-08-01 07:29:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2286

