Bug 1283676 - net.core.netdev_max_backlog not set high enough
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: ga
Target Release: 8.0 (Liberty)
Assigned To: Giulio Fidente
QA Contact: Amit Ugol
Keywords: TestOnly, Triaged
Duplicates: 1299080
Depends On:
Blocks: 1299080
 
Reported: 2015-11-19 09:44 EST by Will Foster
Modified: 2016-06-23 14:18 EDT
CC List: 10 users

See Also:
Fixed In Version: openstack-tripleo-heat-templates-0.8.11-1.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1299080
Environment:
Last Closed: 2016-06-23 13:34:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID               | Priority | Status | Summary | Last Updated
Red Hat Bugzilla 1095811 | None     | None   | None    | Never
OpenStack gerrit 277601  | None     | None   | None    | 2016-02-08 18:00 EST

Description Will Foster 2015-11-19 09:44:59 EST
Description of problem:

/proc/sys/net/core/netdev_max_backlog default setting of 1000 is extremely low/unsuitable for medium to large cloud environments.

What happens is that things just stop working once you exceed roughly
1,000 dhcp-agent ports in use, which is very common for medium to large clouds.

We first saw this on Trystack[1] and thought it was a dnsmasq bug
until this bugzilla was brought to our attention:

https://bugzilla.redhat.com/show_bug.cgi?id=1095811

We've set net.core.netdev_max_backlog to around 100k, which resolved
Neutron grinding to a halt, the backlog of dhcp-agent/L3 churn, and the
system being brought down by overflowing backlog queues.

I understand that setting sysctl settings from an installer
perspective might seem a little invasive, but it's something that I
think would trip up large customers and deployments as it's not immediately apparent this might be the cause.

On a recent RHOS-D OSP7 installation I don't see this default changed, so I
figured I'd bring it to light here.

[root@overcloud-controller-0 ~]# cat /proc/sys/net/core/netdev_max_backlog
1000

Whether or not the installer is the best place to set this I am not sure.
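For reference, a manual workaround can be applied as a sysctl drop-in file (a sketch, not the shipped fix; the filename is hypothetical and 100k is simply the value we used on Trystack):

```ini
# /etc/sysctl.d/90-netdev-backlog.conf (hypothetical filename)
# 100k is roughly what we used on Trystack; tune to your environment
net.core.netdev_max_backlog = 100000
```

Loaded with `sysctl -p /etc/sysctl.d/90-netdev-backlog.conf`, or automatically on the next boot.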

Version-Release number of selected component (if applicable):

RHEL-OSP7 (or anything with Kilo)

How reproducible:


Steps to Reproduce:
1. Deploy RHEL-OSP7
2. Create over 1,000 Neutron networks with gateways set (using dhcp-agent)
3. Note things grinding to a halt.

Actual results:


Expected results:


Additional info:
Comment 3 Mike Burns 2016-01-20 12:23:08 EST
Will can you provide an appropriate default value?
Comment 4 Mike Burns 2016-01-27 10:36:05 EST
From trello

@hughbrock we use 100k for /proc/sys/net/core/netdev_max_backlog setting, best way to set this is probably a tuned profile.
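A tuned profile along the lines Mike suggests might look like this (a sketch; the profile name `osp-netdev-backlog` and the choice of `throughput-performance` as the parent profile are assumptions, not the shipped fix):

```ini
# /etc/tuned/osp-netdev-backlog/tuned.conf (hypothetical profile name)
[main]
include=throughput-performance

[sysctl]
net.core.netdev_max_backlog=100000
```

It would then be activated with `tuned-adm profile osp-netdev-backlog`.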
Comment 5 Hugh Brock 2016-02-07 09:29:42 EST
*** Bug 1299080 has been marked as a duplicate of this bug. ***
Comment 6 Giulio Fidente 2016-02-08 17:51:56 EST
From the tuning guide it looks like this number should be a function of the CPU capabilities (both the number of cores and their speed) and the NIC capabilities (again, both the number of links and their speed).

It is described as a queue within the Linux kernel where traffic is stored after reception from the NIC, but before processing by the protocol stacks.

The /proc/net/softnet_stat file contains a counter in the 2nd column that is incremented when the netdev backlog queue overflows. If this value is incrementing over time, then netdev_max_backlog needs to be increased.
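A quick way to check that counter (a sketch; each row of /proc/net/softnet_stat is one CPU, and the second column is a hex count of packets dropped because the backlog queue overflowed):

```shell
# Flag any CPU whose backlog-overflow counter (column 2, hex) is non-zero.
# Run it twice a few minutes apart: a value that keeps incrementing means
# netdev_max_backlog needs to be raised.
awk '$2 != "00000000" { overflow = 1 }
     END { print (overflow ? "backlog overflows detected" : "no backlog overflows") }' \
    /proc/net/softnet_stat
```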

The guide suggests doubling it and trying again until no overflows are observed. I will set a value of 10x the default in hiera so that it can be customized further if necessary.
