Bug 1491505 - Cannot Scale to more than 116 subnets and VMs [NEEDINFO]
Cannot Scale to more than 116 subnets and VMs
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 12.0 (Pike)
Assigned To: Sai Sindhur Malleni
QA Contact: Sai Sindhur Malleni
Whiteboard: scale_lab
Keywords: Triaged
Depends On:
Blocks: 1512489
Reported: 2017-09-13 23:28 EDT by Sai Sindhur Malleni
Modified: 2018-10-18 03:22 EDT
CC: 23 users

See Also:
Fixed In Version: openstack-tripleo-heat-templates-7.0.3-0.20171023134947.8da5e1f.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1512489 (view as bug list)
Environment:
N/A
Last Closed: 2017-12-13 17:08:57 EST
Type: Bug
Flags: pablo.iranzo: needinfo? (sauchter)


Attachments: none


External Trackers
Red Hat Knowledge Base (Solution) 3228801 (last updated 2017-11-03 09:51 EDT)
OpenStack gerrit 505381 (last updated 2017-09-25 10:25 EDT)
OpenStack gerrit 509433 (last updated 2017-10-30 09:51 EDT)
Red Hat Product Errata RHEA-2017:3462, normal, SHIPPED_LIVE: Red Hat OpenStack Platform 12.0 Enhancement Advisory (last updated 2018-02-15 20:43:25 EST)

Description Sai Sindhur Malleni 2017-09-13 23:28:12 EDT
Description of problem:

We are using an OpenStack setup with 1 controller and 11 compute nodes, with ML2/ODL as the neutron backend. We run the following Browbeat Rally plugin workload:

1. Create a network
2. Create a subnet
3. Boot an instance on this subnet

We do the above sequence of operations 500 times at a concurrency of 8. 

Even after several attempts we are unable to scale past 116 VMs (each VM on its own subnet); 116 seems to be a hard limit. The port never transitions to active: even though VIF plugging happens, the provisioning block (DHCP) is never cleared. Since ML2/ODL uses the neutron DHCP agent for DHCP, looking in the DHCP agent logs we see:

2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent [req-288477aa-318a-40c3-954e-dd6fc98c6c1b - - - - -] Unable to enable dhcp for bb6cdb16-72c0-4cc4-a316-69ebcd7633b2.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self.spawn_process()
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=False)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.260 91663 ERROR neutron.agent.linux.utils [req-d0ade748-22ea-4a45-ba10-277d45f20981 - - - - -] Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files


Even after increasing fs.inotify.max_user_watches from 8192 to 50000 we see the same behaviour:

[root@overcloud-controller-0 heat-admin]# sysctl fs.inotify.max_user_watches                                                                                                                  
fs.inotify.max_user_watches = 50000
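The "Too many open files" above is dnsmasq's inotify_init() call failing: the limit being exhausted is fs.inotify.max_user_instances (default 128), not max_user_watches, since each dnsmasq process (one per subnet's DHCP namespace) holds its own inotify instance. A minimal diagnostic sketch, assuming a Linux host with procfs (exact counts will vary per system):

```shell
# Check both inotify limits. Only max_user_instances bounds how many
# inotify_init() calls can succeed per user; each dnsmasq process
# consumes one instance, so ~128 subnets exhausts the default.
cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches

# Rough count of inotify instances currently open system-wide, by
# inspecting /proc/*/fd symlinks (run as root to see all processes).
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
```

If the instance count is near max_user_instances, the next dnsmasq spawn fails exactly as in the traceback above.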




Version-Release number of selected component (if applicable):
OSP 12
puppet-neutron-11.3.0-0.20170805104936.743dde6.el7ost.noarch
python-neutronclient-6.5.0-0.20170807200849.355983d.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
python-neutron-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lib-1.9.1-0.20170731102145.0ef54c3.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170807223712.el7ost.noarch
opendaylight-6.2.0-0.1.20170913snap58.el7.noarch
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Scale test booting VMs with each Vm on its own subnet

Actual results:

Cannot scale to more than 116 VMs

Expected results:
We should be able to boot 500 VMs on 500 different subnets, since we have the capacity from a hypervisor point of view.

Additional info:
Comment 1 Sai Sindhur Malleni 2017-09-13 23:59:28 EDT
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep dnsmasq
dnsmasq-2.76-2.el7.x86_64
dnsmasq-utils-2.66-21.el7.x86_64
Comment 2 Brian Haley 2017-09-14 15:26:04 EDT
https://bugzilla.redhat.com/show_bug.cgi?id=1474515 has some good info on a similar bug reported to the RHEL team by Joe Talerico.
Comment 3 Sai Sindhur Malleni 2017-09-15 10:54:47 EDT
I can confirm that raising fs.inotify.max_user_instances from 128 to 256 (sysctl -w fs.inotify.max_user_instances=256, with the setting also appended to /etc/sysctl.conf) allows more subnets and VMs to be created.

The question is: should we bump this default in RHEL, or have Director bump it, at least for overcloud nodes?
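For persistence across reboots, the same tweak can be written as a sysctl drop-in instead of appending to /etc/sysctl.conf. The path and value below are illustrative, not necessarily what an eventual fix ships:

```
# /etc/sysctl.d/98-inotify.conf (illustrative path and value; size the
# limit to the expected number of DHCP namespaces per controller)
fs.inotify.max_user_instances = 256
```

Load it without rebooting via sysctl -p /etc/sysctl.d/98-inotify.conf.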
Comment 4 Brian Haley 2017-09-15 12:37:18 EDT
I would think bumping this in Director would be best, since it might not apply to all RHEL users. Assuming there is no negative side effect, setting it to its maximum might be best. Maybe someone on the RHEL team can help with that.
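The Director-side change suggested here (and tracked in the gerrit reviews above) can be sketched as an overcloud environment file. The hiera key assumes tripleo's sysctl ExtraConfig interface, and the value is an example rather than whatever default the merged fix chose:

```yaml
# Illustrative environment file, e.g. inotify-limits.yaml, passed to the
# deploy command: openstack overcloud deploy ... -e inotify-limits.yaml
parameter_defaults:
  ExtraConfig:
    tripleo::profile::base::sysctl::sysctl_settings:
      fs.inotify.max_user_instances:
        value: 1024
```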
Comment 5 Jakub Libosvar 2017-10-30 09:51:59 EDT
The patch was merged.
Comment 8 Sai Sindhur Malleni 2017-10-31 13:33:20 EDT
KCS: https://access.redhat.com/solutions/3228801
Comment 9 Sai Sindhur Malleni 2017-10-31 13:38:06 EDT
Ramon,

BZ for OSP10 backport
Comment 10 Sai Sindhur Malleni 2017-10-31 13:38:34 EDT
Sorry, I failed to link the actual BZ in the previous comment. Here it is,
https://bugzilla.redhat.com/show_bug.cgi?id=1508030
Comment 17 errata-xmlrpc 2017-12-13 17:08:57 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462
