Description of problem:

We are using an OpenStack setup with 1 controller and 11 compute nodes. ML2/ODL is the neutron backend. We execute the following Browbeat Rally plugin:

1. Create a network
2. Create a subnet
3. Boot an instance on this subnet

We run the above sequence of operations 500 times at a concurrency of 8. Even after several attempts we are unable to scale past 116 VMs (each VM is on its own subnet); 116 seems to be a hard limit. The port never transitions to ACTIVE: although VIF plugging happens, the DHCP provisioning block is never cleared. Since ML2/ODL makes use of the neutron DHCP agent for DHCP, we looked in the DHCP agent logs and see:

2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent [req-288477aa-318a-40c3-954e-dd6fc98c6c1b - - - - -] Unable to enable dhcp for bb6cdb16-72c0-4cc4-a316-69ebcd7633b2.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr: dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self.spawn_process()
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=False)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.260 91663 ERROR neutron.agent.linux.utils [req-d0ade748-22ea-4a45-ba10-277d45f20981 - - - - -] Exit code: 5; Stdin: ; Stdout: ; Stderr: dnsmasq: failed to create inotify: Too many open files

Even after increasing fs.inotify.max_user_watches from 8192 to 50000 we are seeing the same behaviour:
[root@overcloud-controller-0 heat-admin]# sysctl fs.inotify.max_user_watches
fs.inotify.max_user_watches = 50000

Version-Release number of selected component (if applicable):
OSP 12
puppet-neutron-11.3.0-0.20170805104936.743dde6.el7ost.noarch
python-neutronclient-6.5.0-0.20170807200849.355983d.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
python-neutron-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170807144457.c9adfd4.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170807223712.el7ost.noarch
python-neutron-lib-1.9.1-0.20170731102145.0ef54c3.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170807223712.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170807223712.el7ost.noarch
opendaylight-6.2.0-0.1.20170913snap58.el7.noarch
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Scale test booting VMs with each VM on its own subnet
2.
3.

Actual results:
Cannot scale to more than 116 VMs

Expected results:
Should be able to boot 500 VMs on 500 different subnets, since we have the capacity from a hypervisor point of view

Additional info:
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep dnsmasq
dnsmasq-2.76-2.el7.x86_64
dnsmasq-utils-2.66-21.el7.x86_64
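The dnsmasq error comes from inotify_init() failing with EMFILE ("Too many open files"), which is returned when the per-user limit on inotify instances (fs.inotify.max_user_instances, default 128) is reached, not when the watch limit raised above is hit. Each dnsmasq spawned by the DHCP agent holds its own inotify instance, so one process per subnet exhausts the default around the ~116-network mark. A minimal diagnostic sketch to confirm this on the controller (not part of the original report; it assumes inotify fds appear as anon_inode:inotify symlinks under /proc):

# Compare the two per-user limits: instances (the one dnsmasq trips over)
# versus watches (the one that was raised to 50000 with no effect).
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# Count inotify instances currently open across all processes; with one
# dnsmasq per subnet this should sit at or near 128 when boots start failing.
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l

# Break the count down by process name to see which daemons hold them.
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 \
  | xargs -r -I{} ps -o comm= -p {} \
  | sort | uniq -c | sort -rn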
https://bugzilla.redhat.com/show_bug.cgi?id=1474515 has some good info on a similar bug reported to the RHEL team by Joe Talerico.
I can confirm that running "sysctl -w fs.inotify.max_user_instances=256 >> /etc/sysctl.conf" to raise the value from the default of 128 to 256 allows more subnets and VMs to be created. The question is whether we should bump this default in RHEL or have Director bump it, at least for overcloud nodes.
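For reference, a cleaner way to make the same change persistent is a drop-in under /etc/sysctl.d/ rather than appending sysctl's stdout to /etc/sysctl.conf (a sketch; the file name is arbitrary):

# Apply immediately on the running node.
sysctl -w fs.inotify.max_user_instances=256

# Persist across reboots, then re-apply from the drop-in to verify it parses.
echo 'fs.inotify.max_user_instances = 256' > /etc/sysctl.d/98-inotify.conf
sysctl -p /etc/sysctl.d/98-inotify.conf

Note that with one dnsmasq per subnet, 256 would still be too low for the 500-subnet target of this test, so the value would likely need to be raised well beyond that.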
I would think bumping this in Director would be best, since the change might not apply to all RHEL users. Assuming there is no negative side effect, setting it to its maximum might be best. Maybe someone on the RHEL team can help with that.
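If the change ends up in Director rather than in the RHEL default, one possible shape for it is a custom environment file applied at deploy time (just a sketch, assuming the sysctl composable service and its ExtraSysctlSettings parameter are used; the actual patch may take a different route, and 1024 is an illustrative value):

# Hypothetical environment file on the undercloud; ExtraSysctlSettings is
# assumed to be picked up by the sysctl composable service on all roles.
cat > ~/templates/inotify-instances.yaml <<'EOF'
parameter_defaults:
  ExtraSysctlSettings:
    fs.inotify.max_user_instances:
      value: 1024
EOF

# Then include it in the deployment, e.g.:
# openstack overcloud deploy --templates ... -e ~/templates/inotify-instances.yaml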
Patch was merged
KCS: https://access.redhat.com/solutions/3228801
Ramon, BZ for OSP10 backport
Sorry, I failed to link the actual BZ in the previous comment. Here it is, https://bugzilla.redhat.com/show_bug.cgi?id=1508030
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462