Description of problem:

SetupNetworks command is slow when NetworkManager is running.

The SetupNetworks command sends a setupNetwork action request to the VDS host with:

{'add': {'bond024': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond024', 'mode': None, 'slaves': ['dummy_3', 'dummy_2'], 'properties': None},
         'bond025': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond025', 'mode': None, 'slaves': ['dummy_19', 'dummy_18'], 'properties': None},
         'bond026': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond026', 'mode': None, 'slaves': ['dummy_17', 'dummy_16'], 'properties': None},
         'bond027': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond027', 'mode': None, 'slaves': ['dummy_15', 'dummy_14'], 'properties': None},
         'bond021': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond021', 'mode': None, 'slaves': ['dummy_9', 'dummy_8'], 'properties': None},
         'bond022': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond022', 'mode': None, 'slaves': ['dummy_7', 'dummy_6'], 'properties': None},
         'bond023': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond023', 'mode': None, 'slaves': ['dummy_5', 'dummy_4'], 'properties': None},
         'bond028': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond028', 'mode': None, 'slaves': ['dummy_13', 'dummy_12'], 'properties': None},
         'bond029': {'datacenter': None, 'qos': None, 'network': None, 'nic': 'bond029', 'mode': None, 'slaves': ['dummy_11', 'dummy_10'], 'properties': None},
         'C2_sync1_7': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_7', 'nic': 'bond027', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_6': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_6', 'nic': 'bond026', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_5': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_5', 'nic': 'bond025', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_4': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_4', 'nic': 'bond024', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_3': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_3', 'nic': 'bond023', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_2': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_2', 'nic': 'bond022', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_1': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_1', 'nic': 'bond021', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_9': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_9', 'nic': 'bond029', 'mode': None, 'slaves': [], 'properties': None},
         'C2_sync1_8': {'datacenter': 'golden_env_mixed', 'qos': None, 'network': 'C2_sync1_8', 'nic': 'bond028', 'mode': None, 'slaves': [], 'properties': None}},
 'host_name': 'host_mixed_1', 'persist': False}

NetworkManager not running:
ADD: --- 29.0276751518 seconds ---
REMOVE: --- 46.1872498989 seconds ---

NetworkManager running:
ADD: --- 154.355500937 seconds ---
REMOVE: --- 181.257375956 seconds ---

NetworkManager running, with the following block commented out in /etc/sysconfig/network-scripts/network-functions:

if is_nm_running && [ "$REALDEVICE" != "lo" ] ; then
    nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
    if ! is_false $NM_CONTROLLED; then
        UUID=$(get_uuid_by_config $CONFIG)
        [ -n "$UUID" ] && _use_nm=true
    fi
fi

ADD: --- 26.5447192192 seconds ---
REMOVE: --- 40.7543609142 seconds ---

Version-Release number of selected component (if applicable):
vdsm-4.18.999-1219.gite8c2926.el7.centos.x86_64
NetworkManager-1.4.0-13.el7_3.x86_64
ovirt-engine-4.2.0-0.0.master.20161219154406.gitf36001c.el7.centos.noarch

Steps to Reproduce:
1. host_network_api/sync_test case 02
This seems to be a regression from RHEL 7.2 when NetworkManager is running. VDSM frequently uses ifdown/ifup while setting up networks. With RHEL 7.3, nmcli is called twice on each interface action: once to check whether NetworkManager is running and once more to load the ifcfg file. These two calls seem to slow down the setup process heavily. It will be impossible to run VDSM with NM running in this state.
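For reference, the two per-interface nmcli invocations in question, as they appear in the network-functions snippet quoted in the description and in comment #8 below, are:

    # check whether NetworkManager is running (the is_nm_running helper)
    nmcli -t --fields running general status

    # load the ifcfg file into NetworkManager
    nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"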
Hi, is there a way to reproduce the regression without running the oVirt test suite? Maybe you can share the ifcfg files and a sequence of ifup/ifdown commands that exhibit the slowdown when NM is enabled? Also, NM trace logs collected during the network setup would be helpful. To increase the logging level, set 'level=TRACE' in the [logging] section of /etc/NetworkManager/NetworkManager.conf and restart the service ('systemctl restart NetworkManager'). Thanks!
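For reference, after that edit the relevant part of /etc/NetworkManager/NetworkManager.conf should look like this:

    [logging]
    level=TRACE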
(In reply to Beniamino Galvani from comment #4)

I consulted at the time with Thomas Haller about this issue. It seems to be a known issue with the nmcli tool, which loads all the data on each call. In RHEL 7.3, initscripts changed to include calls to nmcli (two such calls) on each ifup, and therefore we suspected the slowdown was related. After removing these calls from the script, everything passed in a reasonable time frame. The ifcfg files are simple static IP addresses on dummy interfaces; if you run ifup/ifdown repeatedly on multiple such devices, you should see the problem. I guess an intermediate solution could be to avoid calling nmcli in the initscripts and provide alternatives (for example, accessing D-Bus directly).
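A minimal standalone reproducer along these lines could look as follows. The device names, ifcfg contents, and addresses are hypothetical (the comment only says the files carry simple static addresses on dummy interfaces), and it assumes the dummy devices already exist, e.g. created with 'ip link add dummy_0 type dummy':

    # hypothetical /etc/sysconfig/network-scripts/ifcfg-dummy_0
    DEVICE=dummy_0
    ONBOOT=no
    BOOTPROTO=none
    IPADDR=192.0.2.1
    NETMASK=255.255.255.0

    # time repeated ifup/ifdown over many such devices,
    # once with NetworkManager running and once with it stopped
    time for i in $(seq 0 19); do
        ifup dummy_$i
        ifdown dummy_$i
    done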
Hi,

I understand, thanks. Maybe the suggestion in [1] would improve the situation a lot (it does in my tests, but I'm not sure we have a similar setup). Can you replace:

nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"

in /etc/sysconfig/network-scripts/network-functions with the equivalent D-Bus call (which avoids loading the whole NM D-Bus object tree):

dbus-send --system --print-reply \
    --dest=org.freedesktop.NetworkManager \
    /org/freedesktop/NetworkManager/Settings \
    org.freedesktop.NetworkManager.Settings.LoadConnections \
    array:string:"/etc/sysconfig/network-scripts/$CONFIG"

and see what the impact on performance is?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1345919#c2
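A quick way to compare the two variants in isolation (the ifcfg file name below is hypothetical) is to time a single load with each:

    time nmcli con load /etc/sysconfig/network-scripts/ifcfg-dummy_0

    time dbus-send --system --print-reply \
        --dest=org.freedesktop.NetworkManager \
        /org/freedesktop/NetworkManager/Settings \
        org.freedesktop.NetworkManager.Settings.LoadConnections \
        array:string:"/etc/sysconfig/network-scripts/ifcfg-dummy_0"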
(In reply to Beniamino Galvani from comment #6)
> Hi,
>
> I understand, thanks. Maybe the suggestion in [1] would improve the
> situation a lot (it does in my tests, but I'm not sure we have a
> similar setup). Can you replace:
>
> nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
>
> in /etc/sysconfig/network-scripts/network-functions with the
> equivalent D-Bus call (which avoids loading the whole NM D-Bus
> object tree):
>
> dbus-send --system --print-reply \
>     --dest=org.freedesktop.NetworkManager \
>     /org/freedesktop/NetworkManager/Settings \
>     org.freedesktop.NetworkManager.Settings.LoadConnections \
>     array:string:"/etc/sysconfig/network-scripts/$CONFIG"
>
> and see what the impact on performance is?
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1345919#c2

We will check this out, thank you. What about the second nmcli reference, is_nm_running?

is_nm_running ()
{
    [ "$(LANG=C nmcli -t --fields running general status 2>/dev/null)" = "running" ]
}
(In reply to Edward Haas from comment #8)
> We will check this out, thank you. What about the second nmcli
> reference, is_nm_running?
>
> is_nm_running ()
> {
>     [ "$(LANG=C nmcli -t --fields running general status 2>/dev/null)" = "running" ]
> }

The equivalent command using D-Bus would be:

is_nm_running ()
{
    dbus-send --system --print-reply \
        --dest=org.freedesktop.DBus \
        /org/freedesktop/DBus \
        org.freedesktop.DBus.GetNameOwner \
        string:"org.freedesktop.NetworkManager" >/dev/null 2>&1
}
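Worth noting about this variant (my reading, not stated explicitly in the comment): GetNameOwner is answered by the D-Bus daemon itself, so the check never talks to NetworkManager or initializes libnm at all, and dbus-send simply exits non-zero when the org.freedesktop.NetworkManager name has no owner. Once the function is defined, it can be sanity-checked from a shell:

    if is_nm_running; then
        echo "NetworkManager is running"
    else
        echo "NetworkManager is not running"
    fi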
(In reply to Beniamino Galvani from comment #9)
> (In reply to Edward Haas from comment #8)
> > We will check this out, thank you. What about the second nmcli
> > reference, is_nm_running?
> >
> > is_nm_running ()
> > {
> >     [ "$(LANG=C nmcli -t --fields running general status 2>/dev/null)" = "running" ]
> > }
>
> The equivalent command using D-Bus would be:
>
> is_nm_running ()
> {
>     dbus-send --system --print-reply \
>         --dest=org.freedesktop.DBus \
>         /org/freedesktop/DBus \
>         org.freedesktop.DBus.GetNameOwner \
>         string:"org.freedesktop.NetworkManager" >/dev/null 2>&1
> }

Hi,

With the two changes that you recommended, the results look fine:

NetworkManager running:
ADD: 27.10322904586792 seconds
REMOVE: 43.55953407287598 seconds

NetworkManager not running:
ADD: 25.3023841381073 seconds
REMOVE: 40.55843997001648 seconds
I think initscripts should consider applying the changes in comments 6 and 9. We are constantly improving the performance of nmcli and we are working on speeding up the "con load" and "general status" commands, but this will take time and will not be easily backportable to previous RHEL versions due to a nearly complete rewrite of the D-Bus handling code done for 1.6. The proposed dbus-send replacement uses the same D-Bus API as nmcli but without the overhead that libnm currently has in the initialization phase. Reassigning the bug to initscripts to get feedback on this proposal.
Hi David, what do you think about comment 11?
Can you please reply?
Can you please help here? This is an urgent matter.
(In reply to Beniamino Galvani from comment #12)
> Hi David,
>
> what do you think about comment 11?

Hello Beniamino,

I'm no expert on D-Bus. Can you guarantee that these changes using D-Bus do not have any side-effects? I don't want to backport anything that could make the current situation even worse by introducing new regressions. Otherwise, I'm okay with changing this just for RHEL-7. (RHEL-6 and Fedora should be fine AFAIK.)
Restoring the needinfo lost from last comment.
(In reply to David Kaspar [Dee'Kej] from comment #15)
> I'm no expert on D-Bus. Can you guarantee that these changes using
> D-Bus do not have any side-effects? I don't want to backport anything
> that could make the current situation even worse by introducing new
> regressions.

Hi, those changes are equivalent to the old calls using nmcli.
(In reply to Beniamino Galvani from comment #17)
> Hi, those changes are equivalent to the old calls using nmcli.

OK, in that case I'm fine with changing it in RHEL-7.
I opened a pull request at [1] based on Beniamino's suggestions. Please review.

[1] https://github.com/fedora-sysv/initscripts/pull/57
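For readers without access to the pull request, a sketch of what the change to /etc/sysconfig/network-scripts/network-functions amounts to, based on comments 6 and 9 (the actual patch may differ in details such as output redirection):

    -    nmcli con load "/etc/sysconfig/network-scripts/$CONFIG"
    +    dbus-send --system --print-reply \
    +        --dest=org.freedesktop.NetworkManager \
    +        /org/freedesktop/NetworkManager/Settings \
    +        org.freedesktop.NetworkManager.Settings.LoadConnections \
    +        array:string:"/etc/sysconfig/network-scripts/$CONFIG" >/dev/null

     is_nm_running ()
     {
    -    [ "$(LANG=C nmcli -t --fields running general status 2>/dev/null)" = "running" ]
    +    dbus-send --system --print-reply \
    +        --dest=org.freedesktop.DBus \
    +        /org/freedesktop/DBus \
    +        org.freedesktop.DBus.GetNameOwner \
    +        string:"org.freedesktop.NetworkManager" >/dev/null 2>&1
     }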
Well, I was a little bit hasty with qa_ack here. On second thought, I don't think this kind of change is a good candidate for a z-stream. I simply don't see a strong argument that this change does not break anything. Also, applying a workaround to one of the key components because nmcli is slow and would take some time to fix is, I believe, not the best approach. Could you rethink the proposed solution? Otherwise I'm afraid we are not able to guarantee a regression-free release.
Leos, the main point is that rhel-7.3.0 was not "regression free". The code solving bug 1345919 added incredible slowness to ifup, which causes an even worse slowdown in RHV network regression tests. Here we are trying to revert this unbearable side-effect of bug 1345919, so that we can start using NM in RHV.
(In reply to Leos Pol from comment #22)
> Well, I was a little bit hasty with qa_ack here. On second thought, I
> don't think this kind of change is a good candidate for a z-stream. I
> simply don't see a strong argument that this change does not break
> anything. Also, applying a workaround to one of the key components
> because nmcli is slow and would take some time to fix is, I believe,
> not the best approach. Could you rethink the proposed solution?
> Otherwise I'm afraid we are not able to guarantee a regression-free
> release.

This is a regression in 7.3 and makes networking ifup 5 times slower. This is a major issue for us at scale; we need this fixed, which is why we got PM approval.
We also did wide testing with these patches on the RHV automation and saw no regression in any of our use cases.
Patch merged to the rhel7 branch: https://github.com/fedora-sysv/initscripts/commit/44b84c4b92705917f7c124b714bc93ca42e00ec3 -> POST
Can we please clone this to z and backport?
Can you also provide a brew build for us to test? Our network automation runs now take 9 hours instead of 3 hours due to this bug.
(In reply to Yaniv Dary from comment #27)
> Can we please clone this to z and backport?

Sorry Yaniv, I'm not authorized in my role (I don't have the privileges) to create a clone for Z-Stream.

(In reply to Yaniv Dary from comment #28)
> Can you also provide a brew build for us to test? Our network automation
> runs now take 9 hours instead of 3 hours due to this bug.

Yes, I can do that. But please keep in mind the scratch build is *not* to be used in a production environment.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2286