Bug 1491505

Summary: Cannot Scale to more than 116 subnets and VMs
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-tripleo-heat-templates
Assignee: Sai Sindhur Malleni <smalleni>
Status: CLOSED ERRATA
QA Contact: Sai Sindhur Malleni <smalleni>
Severity: high
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: abond, amuller, apevec, bhaley, chrisw, jjoyce, jlibosva, lhh, lpeer, mburns, nsantos, nyechiel, oblaut, pablo.iranzo, pneedle, racedoro, rhel-osp-director-maint, sauchter, sgaddam, smalleni, srelf, srevivo, tvignaud
Target Milestone: ga
Keywords: Triaged
Target Release: 12.0 (Pike)
Flags: sauchter: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard: scale_lab
Fixed In Version: openstack-tripleo-heat-templates-7.0.3-0.20171023134947.8da5e1f.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1512489 (view as bug list)
Environment: N/A
Last Closed: 2017-12-13 22:08:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1512489    

Description Sai Sindhur Malleni 2017-09-14 03:28:12 UTC
Description of problem:

We are using an OpenStack setup with 1 controller and 11 compute nodes, with ML2/ODL as the neutron backend. We execute the following Browbeat Rally plugin:

1. Create a network
2. Create a subnet
3. Boot an instance on this subnet

We do the above sequence of operations 500 times at a concurrency of 8. 

Even after several attempts we are unable to scale past 116 VMs (each VM is on its own subnet); 116 appears to be a hard limit. The port never transitions to ACTIVE: even though VIF plugging succeeds, the DHCP provisioning block is never cleared. Since ML2/ODL uses the neutron DHCP agent for DHCP, we looked in the DHCP agent logs and saw:

2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent [req-288477aa-318a-40c3-954e-dd6fc98c6c1b - - - - -] Unable to enable dhcp for bb6cdb16-72c0-4cc4-a316-69ebcd7633b2.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self.spawn_process()
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=False)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.260 91663 ERROR neutron.agent.linux.utils [req-d0ade748-22ea-4a45-ba10-277d45f20981 - - - - -] Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files


Even after increasing fs.inotify.max_user_watches from 8192 to 50000, we see the same behaviour:

[root@overcloud-controller-0 heat-admin]# sysctl fs.inotify.max_user_watches                                                                                                                  
fs.inotify.max_user_watches = 50000
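One way to confirm which limit is actually being hit: the "failed to create inotify" error comes from dnsmasq's inotify_init() call, which is capped per UID by fs.inotify.max_user_instances rather than by max_user_watches. Each dnsmasq process spawned by the DHCP agent holds one inotify instance, so with the default per-UID limit of 128, roughly 116 dnsmasq processes plus a handful of other inotify users under the same user would exhaust it. A rough diagnostic sketch, assuming a Linux controller with /proc available:

```shell
# Show both inotify limits: max_user_watches caps watches per instance,
# while max_user_instances caps inotify_init() calls per UID -- the
# limit dnsmasq actually trips here.
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# Count open inotify instances per process owner by scanning /proc;
# each dnsmasq spawned by the DHCP agent holds one instance.
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 \
  | xargs -r -I{} stat -c '%U' /proc/{} 2>/dev/null \
  | sort | uniq -c
```

If the per-UID instance count for the neutron/root user is close to the max_user_instances value, raising max_user_watches alone will not help, which matches the behaviour above.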




Version-Release number of selected component (if applicable):
OSP 12
puppet-neutron-11.3.0-0.20170805104936.743dde6.el7ost.noarch
python-neutronclient-6.5.0-0.20170807200849.355983d.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
python-neutron-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lib-1.9.1-0.20170731102145.0ef54c3.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170807223712.el7ost.noarch
opendaylight-6.2.0-0.1.20170913snap58.el7.noarch
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Run a scale test booting VMs, with each VM on its own subnet

Actual results:

Cannot scale to more than 116 VMs

Expected results:
Should be able to boot 500 VMs on 500 different subnets, since we have the capacity from a hypervisor point of view.

Additional info:

Comment 1 Sai Sindhur Malleni 2017-09-14 03:59:28 UTC
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep dnsmasq
dnsmasq-2.76-2.el7.x86_64
dnsmasq-utils-2.66-21.el7.x86_64

Comment 2 Brian Haley 2017-09-14 19:26:04 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1474515 has some good info on a similar bug reported to the RHEL team by Joe Talerico.

Comment 3 Sai Sindhur Malleni 2017-09-15 14:54:47 UTC
I can confirm that raising fs.inotify.max_user_instances from 128 to 256 (sysctl -w fs.inotify.max_user_instances=256, persisted in /etc/sysctl.conf) results in more subnets and VMs being created.

The question is: should we bump this default in RHEL, or have Director bump it at least for overcloud nodes?

Comment 4 Brian Haley 2017-09-15 16:37:18 UTC
I would think bumping this in Director would be best, since it might not apply to all RHEL users.  Assuming there are no negative side effects, setting it to its maximum might be best.  Maybe someone on the RHEL team can help with that.
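For reference, one way Director could apply this to overcloud nodes is through an environment file passed at deploy time. This is only a sketch, assuming the ExtraSysctlSettings interface in tripleo-heat-templates; the file name and the value 1024 are illustrative choices, not necessarily what the merged patch uses:

```yaml
# sysctl-inotify.yaml (hypothetical) -- pass to the deploy command with
#   openstack overcloud deploy ... -e sysctl-inotify.yaml
parameter_defaults:
  ExtraSysctlSettings:
    fs.inotify.max_user_instances:
      value: 1024
```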

Comment 5 Jakub Libosvar 2017-10-30 13:51:59 UTC
The patch was merged.

Comment 8 Sai Sindhur Malleni 2017-10-31 17:33:20 UTC
KCS: https://access.redhat.com/solutions/3228801

Comment 9 Sai Sindhur Malleni 2017-10-31 17:38:06 UTC
Ramon,

BZ for OSP10 backport

Comment 10 Sai Sindhur Malleni 2017-10-31 17:38:34 UTC
Sorry, I failed to link the actual BZ in the previous comment. Here it is:
https://bugzilla.redhat.com/show_bug.cgi?id=1508030

Comment 17 errata-xmlrpc 2017-12-13 22:08:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462