Bug 1211133 - high cpu use with many IPv6 cloned routes
Summary: high cpu use with many IPv6 cloned routes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Thomas Haller
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-04-13 06:25 UTC by Rik van Riel
Modified: 2015-11-19 11:01 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-19 11:01:34 UTC
Target Upstream Version:




Links
System | ID | Priority | Status | Summary | Last Updated
GNOME Bugzilla | 735445 | None | None | None | Never
GNOME Bugzilla | 747981 | None | None | None | Never
Red Hat Product Errata | RHSA-2015:2315 | normal | SHIPPED_LIVE | Moderate: NetworkManager security, bug fix, and enhancement update | 2015-11-19 10:06:58 UTC

Description Rik van Riel 2015-04-13 06:25:47 UTC
Description of problem:

When a host exchanges IPv6 traffic with many different destinations (for example, a busy DNS server), NetworkManager uses a lot of CPU time. According to strace, most of that activity is reading and writing netlink messages.

It appears NetworkManager is copying cached IPv6 routes into the normal route table, and removing them again when the cached IPv6 routes expire.

The number of normal IPv6 routes seems to track the number of cached IPv6 routes quite closely:

# ip -6 ro sh | grep -v cache |wc
   1404    9837  101213
# ip -6 ro sh | grep cache |wc
   1401    1401   15411

# ip -6 ro sh | grep -v cache |wc
   1239    8682   89234
# ip -6 ro sh | grep cache |wc
   1234    1234   13574

NetworkManager uses between 5% and 30% CPU most of the time.

Version-Release number of selected component (if applicable):

NetworkManager-1.0.0-14.git20150121.b4ea599c.el7.x86_64

Steps to Reproduce:
1. Run an IPv6 service that talks to many addresses (e.g. a DNSBL name server).
2. Watch NetworkManager use a lot of CPU time.

Expected results:

NetworkManager leaves cached ipv6 routes alone, and does not waste CPU time cloning them into the regular route table.

Comment 3 Dan Williams 2015-04-15 18:41:09 UTC
Rik pointed out on IRC that his greps are wrong because 'cache' is on a separate line.  NM is not (and should not be) duplicating the cached routes as static routes, as the original summary implies.

Possibly we could have src/platform/nm-linux-platform.c::event_notification() return early when an RTM_NEWROUTE is seen for an RTM_F_CLONED route, and skip processing it entirely.

I can reduce the NM CPU usage in response to this ping6-bomb script:

#!/bin/bash
for i in `seq 1 4000`; do
    ping6 -c 4 2001:470:20::${i} &
done

from around 80% down to 6% with this change to src/platform/nm-linux-platform.c::event_notification():

        debug ("netlink event (type %d)", event);
    }

+   if (   event == RTM_NEWROUTE
+       && type == OBJECT_TYPE_IP6_ROUTE
+       && rtnl_route_get_flags ((struct rtnl_route*) object) & RTM_F_CLONED)
+       return NL_OK;
+
    cache = choose_cache_by_type (platform, type);
    cached_object = nm_nl_cache_search (cache, object);

This at least prevents the cloned routes from showing up in the NM cache.  It may not be the correct solution, but since NM shouldn't care about cloned routes anyway, possibly this is OK?  Thomas?
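
To try the idea outside NetworkManager, the same check can be sketched against a raw rtnetlink message, without libnl. This is only a sketch and not the NM patch: is_cloned_ip6_route_event() is a hypothetical helper name, and it assumes the caller already holds a complete struct nlmsghdr received from a NETLINK_ROUTE socket.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper (not NetworkManager code): returns true when a raw
 * rtnetlink message announces a new cloned (cache) IPv6 route. */
static bool
is_cloned_ip6_route_event (const struct nlmsghdr *nlh)
{
    const struct rtmsg *rtm;

    if (nlh->nlmsg_type != RTM_NEWROUTE)
        return false;
    if (nlh->nlmsg_len < NLMSG_LENGTH (sizeof (*rtm)))
        return false;   /* too short to carry a struct rtmsg */

    rtm = NLMSG_DATA (nlh);
    return rtm->rtm_family == AF_INET6
        && (rtm->rtm_flags & RTM_F_CLONED) != 0;
}
```

An event handler could call this first and return early for matching messages, which is the same early exit the diff above adds on top of the libnl route object.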

Comment 4 Dan Williams 2015-04-15 18:54:06 UTC
Scratch build here with that patch for testing:

https://brewweb.devel.redhat.com/taskinfo?taskID=8986680

Comment 5 Rik van Riel 2015-04-15 21:14:08 UTC
CPU use spikes appear to coincide with ipv6 route cache garbage collection, and subsequent reinsertion of the route cache entries.

It makes me wonder if this should be fixed in the kernel, instead of in userspace.

Are there any applications at all that want NETLINK_ROUTE to mean "notify me every time the ipv6 route cache table is changed", instead of the ipv4 meaning of "notify me every time a static network route changes"?

Should ipv6 behaviour be changed to match ipv4 behaviour?

Comment 6 Thomas Haller 2015-05-29 11:05:04 UTC
(In reply to Rik van Riel from comment #5)
> CPU use spikes appear to coincide with ipv6 route cache garbage collection,
> and subsequent reinsertion of the route cache entries.
> 
> It makes me wonder if this should be fixed in the kernel, instead of in
> userspace.
> 
> Are there any applications at all that want NETLINK_ROUTE to mean "notify me
> every time the ipv6 route cache table is changed", instead of the ipv4
> meaning of "notify me every time a static network route changes"?
> 
> Should ipv6 behaviour be changed to match ipv4 behaviour?

Currently, NM uses rtnl_route_alloc_cache(), which explicitly requests a dump including RTM_F_CLONED routes, so we see them there.

Also, we enable netlink events via 
    nle = nl_socket_add_memberships (priv->nlh_event,
                                     RTNLGRP_LINK,
                                     RTNLGRP_IPV4_IFADDR, RTNLGRP_IPV6_IFADDR,
                                     RTNLGRP_IPV4_ROUTE,  RTNLGRP_IPV6_ROUTE,
                                     0);

and there we receive RTM_F_CLONED routes as well.
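
To see how much of the IPv6 route event stream consists of cloned routes on an affected system, a small listener outside NM could subscribe to the same multicast group and tally what arrives. The following is a rough sketch under stated assumptions, not NM code: the struct and function names are made up for illustration, it uses a plain netlink socket with the legacy RTMGRP_IPV6_ROUTE bitmask instead of libnl's nl_socket_add_memberships(), and the caller is expected to recv() into a buffer and pass it to count_route_events().

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Illustrative type; not part of NetworkManager or libnl. */
struct route_event_counts {
    unsigned cloned;    /* RTM_NEWROUTE with RTM_F_CLONED set */
    unsigned regular;   /* RTM_NEWROUTE without it */
};

/* Open a NETLINK_ROUTE socket subscribed to IPv6 route notifications.
 * Returns the file descriptor, or -1 on error. */
static int
open_ip6_route_listener (void)
{
    struct sockaddr_nl addr;
    int fd = socket (AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;
    memset (&addr, 0, sizeof (addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = RTMGRP_IPV6_ROUTE;  /* bitmask form of RTNLGRP_IPV6_ROUTE */
    if (bind (fd, (struct sockaddr *) &addr, sizeof (addr)) < 0) {
        close (fd);
        return -1;
    }
    return fd;
}

/* Walk one buffer filled by recv() and tally its RTM_NEWROUTE messages.
 * With the subscription above, every route message is an IPv6 route. */
static void
count_route_events (const void *buf, size_t len, struct route_event_counts *out)
{
    const struct nlmsghdr *nlh = buf;
    int remaining = (int) len;

    for (; NLMSG_OK (nlh, remaining); nlh = NLMSG_NEXT (nlh, remaining)) {
        const struct rtmsg *rtm;

        if (   nlh->nlmsg_type != RTM_NEWROUTE
            || nlh->nlmsg_len < NLMSG_LENGTH (sizeof (*rtm)))
            continue;
        rtm = NLMSG_DATA (nlh);
        if (rtm->rtm_flags & RTM_F_CLONED)
            out->cloned++;
        else
            out->regular++;
    }
}
```

If the cloned counter dominates while the ping6 script from comment 3 is running, that would be consistent with the CPU time going into processing cache entries and not static routes.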

While this could all be improved, I think the upstream branch that completely refactors NMPlatform is the right fix. We should merge https://bugzilla.gnome.org/show_bug.cgi?id=747981 and backport that to nm-1-0/rhel-7.

Comment 12 Rik van Riel 2015-10-29 15:16:01 UTC
The new NetworkManager seems to have much lower CPU use on my "problem system".

Comment 13 errata-xmlrpc 2015-11-19 11:01:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2315.html

