Description of problem:
The default /proc/sys/net/core/netdev_max_backlog setting of 1000 is far too low for medium to large cloud environments. Things simply stop working once more than 1,000 dhcp-agent ports are in use, which is common for clouds of that size. We first saw this on Trystack[1] and assumed it was a dnsmasq bug until this bugzilla was brought to our attention: https://bugzilla.redhat.com/show_bug.cgi?id=1095811

We set net.core.netdev_max_backlog to around 100k, which resolved Neutron grinding to a halt, the backlog of dhcp-agent/L3 churn, and the system being brought down by overflowing backlog queues.

I understand that setting sysctl values from the installer might seem a little invasive, but this is something I think would trip up large customers and deployments, since it is not immediately apparent that this setting could be the cause. On a recent RHOS-D OSP7 installation the default is unchanged, so I figured I'd bring it to light here:

[root@overcloud-controller-0 ~]# cat /proc/sys/net/core/netdev_max_backlog
1000

Whether or not the installer is the best place to set this, I am not sure.

Version-Release number of selected component (if applicable):
RHEL-OSP7 (or anything with Kilo)

How reproducible:

Steps to Reproduce:
1. Deploy RHEL-OSP7
2. Create over 1,000 Neutron networks with gateways set (using dhcp-agent)
3. Note things grinding to a halt.

Actual results:

Expected results:

Additional info:
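For reference, the workaround described above can be sketched as follows; the 100k value is the one from our deployment, and the sysctl.d file name is purely illustrative (both require root, and the right value for a given system may differ):

```shell
# Sketch of the workaround above; 100000 is the value we used,
# the sysctl.d file name is made up for illustration.

# Apply at runtime (lost on reboot):
sysctl -w net.core.netdev_max_backlog=100000

# Persist across reboots:
echo 'net.core.netdev_max_backlog = 100000' > /etc/sysctl.d/90-netdev-backlog.conf
sysctl -p /etc/sysctl.d/90-netdev-backlog.conf
```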
Will, can you provide an appropriate default value?
From Trello, @hughbrock: we use 100k for the /proc/sys/net/core/netdev_max_backlog setting; the best way to set this is probably a tuned profile.
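If the tuned route is taken, the profile could look something like the sketch below; the profile name `neutron-backlog`, the summary text, and the choice of `throughput-performance` as the parent profile are all illustrative assumptions, not an agreed design:

```shell
# Hypothetical tuned profile carrying the sysctl (requires root).
mkdir -p /etc/tuned/neutron-backlog
cat > /etc/tuned/neutron-backlog/tuned.conf <<'EOF'
[main]
summary=Raise netdev backlog for large Neutron deployments
include=throughput-performance

[sysctl]
net.core.netdev_max_backlog=100000
EOF

# Activate the profile:
tuned-adm profile neutron-backlog
```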
*** Bug 1299080 has been marked as a duplicate of this bug. ***
From the tuning guide it looks like this number should be a function of the CPU capabilities (both the number of cores and their speed) and the NIC capabilities (again, both the number of links and their speed). It is described as a queue within the Linux kernel where traffic is stored after reception from the NIC but before processing by the protocol stacks. The /proc/net/softnet_stat file contains a counter in the 2nd column that is incremented when the netdev backlog queue overflows. If this value is incrementing over time, then netdev_max_backlog needs to be increased. The guide suggests doubling it and trying again until no overflows are observed. I will set a value 10x the default in hiera so that it can be customized further if necessary.
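The overflow check described above can be sketched as a small shell helper; the function name `softnet_drops` is made up here. Each row of /proc/net/softnet_stat is one CPU, and the counters are hexadecimal, so the 2nd column needs a base-16 conversion:

```shell
# Sketch of the check above: column 2 of /proc/net/softnet_stat is a
# per-CPU hex counter of packets dropped because the netdev backlog
# queue was full. If it rises between samples, raise netdev_max_backlog
# (the guide: double it and retest). Helper name is hypothetical.
softnet_drops() {
    cpu=0
    while read -r processed dropped _; do
        # counters are hex; $((16#...)) converts to decimal
        printf 'cpu%d dropped=%d\n' "$cpu" "$((16#$dropped))"
        cpu=$((cpu + 1))
    done
}

# On a live system:
#   softnet_drops < /proc/net/softnet_stat
# Sample row to show the parsing:
printf '00001000 0000000a 00000000\n' | softnet_drops
# prints "cpu0 dropped=10"
```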