Bug 1283676 - net.core.netdev_max_backlog not set high enough
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: ga
Target Release: 8.0 (Liberty)
Assigned To: Giulio Fidente
QA Contact: Amit Ugol
Keywords: TestOnly, Triaged
Duplicates: 1299080
Depends On:
Blocks: 1299080
 
Reported: 2015-11-19 09:44 EST by Will Foster
Modified: 2016-06-23 14:18 EDT
CC List: 10 users

See Also:
Fixed In Version: openstack-tripleo-heat-templates-0.8.11-1.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1299080
Environment:
Last Closed: 2016-06-23 13:34:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID               | Priority | Status | Summary | Last Updated
Red Hat Bugzilla 1095811 | None     | None   | None    | Never
OpenStack gerrit 277601  | None     | None   | None    | 2016-02-08 18:00 EST

Description Will Foster 2015-11-19 09:44:59 EST
Description of problem:

/proc/sys/net/core/netdev_max_backlog default setting of 1000 is extremely low/unsuitable for medium to large cloud environments.

What happens is that things just stop working once you exceed roughly
1,000 dhcp-agent ports in use, which is very common for medium to large clouds.

We first saw this on Trystack[1] and thought it was a dnsmasq bug
until this bugzilla was brought to our attention:

https://bugzilla.redhat.com/show_bug.cgi?id=1095811

We've set net.core.netdev_max_backlog to around 100k, which resolved
Neutron grinding to a halt, the backlog of dhcp-agent/L3 churn, and the
system being brought down by overflowing backlog queues.

I understand that setting sysctl settings from an installer
perspective might seem a little invasive, but it's something that I
think would trip up large customers and deployments as it's not immediately apparent this might be the cause.

On a recent RHOS-D OSP7 installation I don't see this default changed, so I
figured I'd bring it to light here.

[root@overcloud-controller-0 ~]# cat /proc/sys/net/core/netdev_max_backlog
1000

Whether or not the installer is the best place to set this I am not sure.
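For reference, a manual workaround can be applied as a sysctl drop-in file (a sketch, not the shipped fix; the filename is hypothetical and 100k is simply the value we used on Trystack):

```ini
# /etc/sysctl.d/90-netdev-backlog.conf (hypothetical filename)
# 100k is roughly what we used on Trystack; tune to your environment
net.core.netdev_max_backlog = 100000
```

Loaded with `sysctl -p /etc/sysctl.d/90-netdev-backlog.conf`, or automatically on the next boot.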

Version-Release number of selected component (if applicable):

RHEL-OSP7 (or anything with Kilo)

How reproducible:


Steps to Reproduce:
1. Deploy RHEL-OSP7
2. Create over 1,000 Neutron networks with gateways set (using dhcp-agent)
3. Note things grinding to a halt.

Actual results:


Expected results:


Additional info:
Comment 3 Mike Burns 2016-01-20 12:23:08 EST
Will can you provide an appropriate default value?
Comment 4 Mike Burns 2016-01-27 10:36:05 EST
From trello

@hughbrock we use 100k for /proc/sys/net/core/netdev_max_backlog setting, best way to set this is probably a tuned profile.
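A tuned profile along the lines Mike suggests might look like this (a sketch; the profile name `osp-netdev-backlog` and the choice of `throughput-performance` as the parent profile are assumptions, not the shipped fix):

```ini
# /etc/tuned/osp-netdev-backlog/tuned.conf (hypothetical profile name)
[main]
include=throughput-performance

[sysctl]
net.core.netdev_max_backlog=100000
```

It would then be activated with `tuned-adm profile osp-netdev-backlog`.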
Comment 5 Hugh Brock 2016-02-07 09:29:42 EST
*** Bug 1299080 has been marked as a duplicate of this bug. ***
Comment 6 Giulio Fidente 2016-02-08 17:51:56 EST
From the tuning guide it looks like this number should be a function of the CPU capabilities (both the number of cores and their speed) and the NIC capabilities (again, both the number of links and their speed).

It is described as a queue within the Linux kernel where traffic is stored after reception from the NIC, but before processing by the protocol stacks.

The /proc/net/softnet_stat file contains a counter in the 2nd column that is incremented when the netdev backlog queue overflows. If this value is incrementing over time, then netdev_max_backlog needs to be increased.
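A quick way to check that counter (a sketch; each row of /proc/net/softnet_stat is one CPU, and the second column is a hex count of packets dropped because the backlog queue overflowed):

```shell
# Flag any CPU whose backlog-overflow counter (column 2, hex) is non-zero.
# Run it twice a few minutes apart: a value that keeps incrementing means
# netdev_max_backlog needs to be raised.
awk '$2 != "00000000" { overflow = 1 }
     END { print (overflow ? "backlog overflows detected" : "no backlog overflows") }' \
    /proc/net/softnet_stat
```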

The guide suggests doubling it and trying again until no overflows are observed. I will set a value of 10x the default in hiera so that it can be customized further if necessary.
