Bug 1474515

Summary: dhcp-agent dnsmasq max files
Product: Red Hat Enterprise Linux 7 Reporter: Joe Talerico <jtaleric>
Component: dnsmasq Assignee: Petr Menšík <pemensik>
Status: CLOSED ERRATA QA Contact: Daniel Rusek <drusek>
Severity: high Docs Contact:
Priority: high    
Version: 7.5 CC: abond, amedeo.salvati, brian.fife, drusek, james.beal, jtaleric, oblaut, pemensik, pneedle, psklenar, racedoro, rcernin, salmy, smalleni, srelf, srevivo, thozza
Target Milestone: alpha Keywords: Upstream
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: dnsmasq-2.76-7.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-30 09:49:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1534569, 1549614    
Attachments:
Description Flags
inotify conditional open patch none

Description Joe Talerico 2017-07-24 20:05:24 UTC
Description of problem:
During neutron tests, neutron-dhcp-agent / dnsmasq hit a max open files limit:

dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent     self.spawn_process()
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=False)
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 900, in execute
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.919 102053 ERROR neutron.agent.dhcp.agent
dhcp-agent.log-20170721.gz:2017-07-20 08:22:53.930 102053 ERROR neutron.agent.dhcp.agent [req-94082baf-936a-496b-890e-d5829442f125 - - - - -] Unable to enable dhcp for 63a1c55e-2623-4fbf-a9a9-ba3baebbe53b.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:


# neutron-dhcp-agent process
[root@overcloud-controller-0 neutron]# cat /proc/102053/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1159783              1159783              processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1159783              1159783              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        


# dnsmasq process
[root@overcloud-controller-2 heat-admin]# cat /proc/845399/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1159783              1159783              processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1159783              1159783              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
[root@overcloud-controller-2 heat-admin]# 


Version-Release number of selected component (if applicable):
dnsmasq-2.76-2.el7.x86_64
dnsmasq-utils-2.66-21.el7.x86_64
[root@overcloud-controller-0 neutron]# rpm -qa | grep neutron
python-neutron-lib-1.7.0-0.20170529134801.0ee4f4a.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutronclient-6.3.0-0.20170601203754.ba535c6.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-l2gw-agent-10.1.0-0.20170619104111.4bb9806.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
puppet-neutron-11.2.0-0.20170626053011.862f130.el7ost.noarch
openstack-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch


How reproducible:
N/A

Steps to Reproduce:
1. Rally neutron-router-create (Times: 1000, concurrency 32)

Comment 4 Joe Talerico 2017-07-25 12:20:53 UTC
Hey Tomas - Thanks for moving my bug from OpenStack to RHEL, since that seems to be the right product. 

I won't be able to invent a reproducer outside of OpenStack for you. I suggest using OpenStack in order to reproduce the issue. 

I can easily reproduce it with multiple scenarios orchestrated via OpenStack Rally. 

Running create-port-list times:1000, concurrency 32 I see:

2017-07-24 17:56:35.269 87837 DEBUG neutron.agent.dhcp.agent [req-3633ae11-2ab3-44eb-ac5a-9f12568900fd - - - - -] resync (594f387b-bdb2-45c3-bea5-47f6d378907c): [ProcessExecutionError(u'Exit code: 4; Stdin: # Generated by iptables_manager\n*filter\n:neutron-dhcp-age-FORWARD - [0:0]\n:neutron-dhcp-age-INPUT - [0:0]\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-local - [0:0]\n:neutron-filter-top - [0:0]\n-I FORWARD 1 -j neutron-filter-top\n-I FORWARD 2 -j neutron-dhcp-age-FORWARD\n-I INPUT 1 -j neutron-dhcp-age-INPUT\n-I OUTPUT 1 -j neutron-filter-top\n-I OUTPUT 2 -j neutron-dhcp-age-OUTPUT\n-I neutron-filter-top 1 -j neutron-dhcp-age-local\nCOMMIT\n# Completed by iptables_manager\n# Generated by iptables_manager\n*mangle\n:neutron-dhcp-age-FORWARD - [0:0]\n:neutron-dhcp-age-INPUT - [0:0]\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-POSTROUTING - [0:0]\n:neutron-dhcp-age-PREROUTING - [0:0]\n:neutron-dhcp-age-mark - [0:0]\n-I FORWARD 1 -j neutron-dhcp-age-FORWARD\n-I INPUT 1 -j neutron-dhcp-age-INPUT\n-I OUTPUT 1 -j neutron-dhcp-age-OUTPUT\n-I POSTROUTING 1 -j neutron-dhcp-age-POSTROUTING\n-I PREROUTING 1 -j neutron-dhcp-age-PREROUTING\n-I neutron-dhcp-age-POSTROUTING 1 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill\n-I neutron-dhcp-age-PREROUTING 1 -j neutron-dhcp-age-mark\nCOMMIT\n# Completed by iptables_manager\n# Generated by iptables_manager\n*nat\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-POSTROUTING - [0:0]\n:neutron-dhcp-age-PREROUTING - [0:0]\n:neutron-dhcp-age-float-snat - [0:0]\n:neutron-dhcp-age-snat - [0:0]\n:neutron-postrouting-bottom - [0:0]\n-I OUTPUT 1 -j neutron-dhcp-age-OUTPUT\n-I POSTROUTING 1 -j neutron-dhcp-age-POSTROUTING\n-I POSTROUTING 2 -j neutron-postrouting-bottom\n-I PREROUTING 1 -j neutron-dhcp-age-PREROUTING\n-I neutron-dhcp-age-snat 1 -j neutron-dhcp-age-float-snat\n-I neutron-postrouting-bottom 1 -m comment --comment "Perform source NAT on outgoing traffic." 
-j neutron-dhcp-age-snat\nCOMMIT\n# Completed by iptables_manager\n# Generated by iptables_manager\n*raw\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-PREROUTING - [0:0]\n-I OUTPUT 1 -j neutron-dhcp-age-OUTPUT\n-I PREROUTING 1 -j neutron-dhcp-age-PREROUTING\nCOMMIT\n# Completed by iptables_manager\n; Stdout: ; Stderr: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?\n',), ProcessExecutionError(u'Exit code: 5; Stdin: ; Stdout: ; Stderr: \ndnsmasq: failed to create inotify: Too many open files\n',), ProcessExecutionError(u'Exit code: 5; Stdin: ; Stdout: ; Stderr: \ndnsmasq: failed to create inotify: Too many open files\n',)] _periodic_resync_helper /usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:255

Comment 8 Petr Menšík 2017-08-23 12:25:16 UTC
I have looked into the dnsmasq code. There is always only a single inotify socket per dnsmasq instance. dnsmasq does not explicitly close the inotify file descriptor before exiting, but I think the kernel should close it when the process dies. I will try to verify that it is not leaked after process exit. I have not found a way to open the inotify socket more than once, however. Unless there are 128 instances of dnsmasq, it should be OK.

The instance of dnsmasq used by libvirtd, for example, uses 11 file descriptors for the main process and 5 for the dhcp-script helper.

# main process
$ lsof -p 18906
COMMAND   PID   USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
dnsmasq 18906 nobody  cwd       DIR              253,1      224      64 /
dnsmasq 18906 nobody  rtd       DIR              253,1      224      64 /
dnsmasq 18906 nobody  txt       REG              253,1   344888 6331416 /usr/sbin/dnsmasq
dnsmasq 18906 nobody  mem       REG              253,1    62184 6333789 /usr/lib64/libnss_files-2.17.so
dnsmasq 18906 nobody  mem       REG              253,1    44448 6333801 /usr/lib64/librt-2.17.so
dnsmasq 18906 nobody  mem       REG              253,1   144792 6333797 /usr/lib64/libpthread-2.17.so
dnsmasq 18906 nobody  mem       REG              253,1  2127336 6333771 /usr/lib64/libc-2.17.so
dnsmasq 18906 nobody  mem       REG              253,1   208928 6413923 /usr/lib64/libidn.so.11.6.11
dnsmasq 18906 nobody  mem       REG              253,1   304576 6391819 /usr/lib64/libdbus-1.so.3.7.4
dnsmasq 18906 nobody  mem       REG              253,1   164264 6333764 /usr/lib64/ld-2.17.so
dnsmasq 18906 nobody  mem       REG              253,1    26254 6337493 /usr/lib64/gconv/gconv-modules.cache
dnsmasq 18906 nobody    0u      CHR                1,3      0t0    4856 /dev/null
dnsmasq 18906 nobody    1u      CHR                1,3      0t0    4856 /dev/null
dnsmasq 18906 nobody    2u      CHR                1,3      0t0    4856 /dev/null
dnsmasq 18906 nobody    3u     IPv4              40164      0t0     UDP *:bootps 
dnsmasq 18906 nobody    4u  netlink                         0t0   40165 ROUTE
dnsmasq 18906 nobody    5u     IPv4              40167      0t0     UDP qeos-64.lab.eng.rdu2.redhat.com:domain 
dnsmasq 18906 nobody    6r  a_inode                0,9        0    4852 inotify
dnsmasq 18906 nobody    7r     FIFO                0,8      0t0   40173 pipe
dnsmasq 18906 nobody    8w     FIFO                0,8      0t0   40173 pipe
dnsmasq 18906 nobody    9u     unix 0xffff88005ff8ac00      0t0   40198 socket
dnsmasq 18906 nobody   12w     FIFO                0,8      0t0   40199 pipe

# helper process
$ lsof -p 18908
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
dnsmasq 18908 root  cwd    DIR  253,1      224      64 /
dnsmasq 18908 root  rtd    DIR  253,1      224      64 /
dnsmasq 18908 root  txt    REG  253,1   344888 6331416 /usr/sbin/dnsmasq
dnsmasq 18908 root  mem    REG  253,1    62184 6333789 /usr/lib64/libnss_files-2.17.so
dnsmasq 18908 root  mem    REG  253,1    44448 6333801 /usr/lib64/librt-2.17.so
dnsmasq 18908 root  mem    REG  253,1   144792 6333797 /usr/lib64/libpthread-2.17.so
dnsmasq 18908 root  mem    REG  253,1  2127336 6333771 /usr/lib64/libc-2.17.so
dnsmasq 18908 root  mem    REG  253,1   208928 6413923 /usr/lib64/libidn.so.11.6.11
dnsmasq 18908 root  mem    REG  253,1   304576 6391819 /usr/lib64/libdbus-1.so.3.7.4
dnsmasq 18908 root  mem    REG  253,1   164264 6333764 /usr/lib64/ld-2.17.so
dnsmasq 18908 root    0u   CHR    1,3      0t0    4856 /dev/null
dnsmasq 18908 root    1u   CHR    1,3      0t0    4856 /dev/null
dnsmasq 18908 root    2u   CHR    1,3      0t0    4856 /dev/null
dnsmasq 18908 root    8w  FIFO    0,8      0t0   40173 pipe
dnsmasq 18908 root   11r  FIFO    0,8      0t0   40199 pipe

I think the most important part is what Robin pointed out here: there is a separate limit on inotify instances, and it is low enough to be hit. By default it is 128.

More importantly, I think, the limit is not per session but per USER. And dnsmasq does not have a dedicated user; it uses the nobody user for the main process (which opens inotify).
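
As a rough illustration of the limit discussed above, here is a minimal Python sketch that reads the per-user inotify instance limit and computes how many instances a user could still create. It assumes a Linux host where the limit is exposed at /proc/sys/fs/inotify/max_user_instances; the function names are made up for this example and are not part of dnsmasq or neutron. As far as I can tell from inotify(7), reaching this limit makes inotify_init() fail with EMFILE, whose message text is the "Too many open files" seen in the dnsmasq error, which is why it looks like a file descriptor limit at first glance.

```python
import errno
import os

# Assumed location of the per-user inotify instance limit on Linux.
INOTIFY_LIMIT_PATH = "/proc/sys/fs/inotify/max_user_instances"


def read_inotify_instance_limit(path=INOTIFY_LIMIT_PATH):
    """Return the per-user inotify instance limit as an integer."""
    with open(path) as handle:
        return int(handle.read().strip())


def remaining_instances(limit, in_use):
    """Return how many more inotify instances fit under the limit."""
    return max(limit - in_use, 0)


def emfile_message():
    """Return the C library's message text for errno EMFILE."""
    return os.strerror(errno.EMFILE)
```

For example, remaining_instances(read_inotify_instance_limit(), 120) would report the headroom for a user that already holds 120 instances.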

Would you be able to run this command when the failure occurs and share its output?
$ lsof -u nobody | grep inotify
or just
$ lsof -u nobody | grep inotify | wc -l

I think it is possible that other inotify socket holders use many inotify sockets. Dnsmasq might just need a few more and thereby hit the maximum.

Do I understand the parameters correctly, that there should be at most 32 instances running at the same time?

I found a nice command to list inotify sockets used. It might help:

$ find /proc/*/fd -lname anon_inode:inotify |
   cut -d/ -f3 |
   xargs -I '{}' -- ps --no-headers -o '%p %U %c' -p '{}' |
   uniq -c |
   sort -nr
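
The pipeline above lists inotify holders per process. Since the limit is per user, a per-UID tally may be easier to compare against it. The following is only a sketch in Python under some assumptions: it walks a /proc-like tree, treats any fd symlink pointing to anon_inode:inotify as one instance, and attributes it to the real UID found in the process's status file. The function names are hypothetical. Note that the kernel charges an instance to the user that created it, so a process that switches users after creating its instance may be tallied here under a different UID than the one actually charged.

```python
import os
from collections import Counter

# Symlink target that /proc/<pid>/fd/<n> shows for an inotify instance.
INOTIFY_LINK = "anon_inode:inotify"


def real_uid(proc_root, pid):
    """Read the real UID of a process from its status file."""
    with open(os.path.join(proc_root, pid, "status")) as handle:
        for line in handle:
            if line.startswith("Uid:"):
                # Fields after "Uid:" are real, effective, saved, fs.
                return int(line.split()[1])
    raise ValueError("no Uid line for pid " + pid)


def inotify_instances_per_uid(proc_root="/proc"):
    """Return a Counter mapping real UID to the number of inotify fds."""
    counts = Counter()
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as "self"
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            uid = real_uid(proc_root, pid)
            fds = os.listdir(fd_dir)
        except (OSError, ValueError):
            continue  # process exited or is not readable by us
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were looking
            if target == INOTIFY_LINK:
                counts[uid] += 1
    return counts
```

Run as root, inotify_instances_per_uid().most_common() should list the heaviest users first; unprivileged runs will silently skip processes they cannot inspect.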

Comment 9 Sai Sindhur Malleni 2017-09-15 14:34:04 UTC
Hitting the same issue in a different environment, when trying to create networks and boot VMs in OpenStack. Once 116 networks are created (each has its own dnsmasq) and 116 VMs are booted (each on its own network, created as mentioned previously), we are unable to boot any more VMs because DHCP fails for the VM's port.

2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent [req-288477aa-318a-40c3-954e-dd6fc98c6c1b - - - - -] Unable to enable dhcp for bb6cdb16-72c0-4cc4-a316-69ebcd7633b2.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     getattr(driver, action)(**action_kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self.spawn_process()
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=False)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     pm.enable(reload_cfg=reload_with_HUP)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     run_as_root=self.run_as_root)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     log_fail_as_error=log_fail_as_error, **kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent     raise ProcessExecutionError(msg, returncode=returncode)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.260 91663 ERROR neutron.agent.linux.utils [req-d0ade748-22ea-4a45-ba10-277d45f20981 - - - - -] Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files


[root@overcloud-controller-0 heat-admin]#  lsof -u nobody | grep inotify | wc -l
116
[root@overcloud-controller-0 heat-admin]# ps aux | grep dnsmasq | wc -l
117


An OpenStack Scale related bug has been filed here: https://bugzilla.redhat.com/show_bug.cgi?id=1491505

This is clearly a blocker for OpenStack Scale; requesting more urgency on this.

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep dnsmasq
dnsmasq-2.76-2.el7.x86_64
dnsmasq-utils-2.66-21.el7.x86_64
[root@overcloud-controller-0 heat-admin]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Comment 10 Sai Sindhur Malleni 2017-09-15 14:38:08 UTC
Also, to be clear, the rate at which I was creating networks (dnsmasq processes) and VMs was just 2 at a time. I first tried 8 at a time and, since I wasn't sure whether the rate of creation had an impact, I then tried 2 at a time. OpenStack fails to scale beyond 116 networks and subnets because of this issue.

Comment 11 Sai Sindhur Malleni 2017-09-15 14:40:31 UTC
[root@overcloud-controller-0 heat-admin]# find /proc/*/fd -lname anon_inode:inotify |
>    cut -d/ -f3 |
>    xargs -I '{}' -- ps --no-headers -o '%p %U %c' -p '{}' |
>    uniq -c |
>    sort -nr
      5       1 root     systemd
      3    1419 root     NetworkManager
      2   17949 odl      java
      2    1417 polkitd  polkitd
      1    3932 root     crond
      1  312997 nobody   dnsmasq
      1  307772 nobody   dnsmasq
      1  307170 nobody   dnsmasq
      1  303854 nobody   dnsmasq
      1  302882 nobody   dnsmasq
      1  299363 nobody   dnsmasq
      1  298959 nobody   dnsmasq
      1  295331 nobody   dnsmasq
      1  295099 nobody   dnsmasq
      1  291719 nobody   dnsmasq
      1  290445 nobody   dnsmasq
      1  288273 nobody   dnsmasq
      1  287448 nobody   dnsmasq
      1  285631 nobody   dnsmasq
      1  284068 nobody   dnsmasq
      1  281812 nobody   dnsmasq
      1  278858 nobody   dnsmasq
      1  277852 nobody   dnsmasq
      1  275957 nobody   dnsmasq
      1  273351 nobody   dnsmasq
      1  269025 nobody   dnsmasq
      1  266707 nobody   dnsmasq
      1  265901 nobody   dnsmasq
      1  263640 nobody   dnsmasq
      1  263330 nobody   dnsmasq
      1  260207 nobody   dnsmasq
      1  260137 nobody   dnsmasq
      1  257830 nobody   dnsmasq
      1  256695 nobody   dnsmasq
      1  254814 nobody   dnsmasq
      1  254505 nobody   dnsmasq
      1  253550 nobody   dnsmasq
      1  253046 nobody   dnsmasq
      1  252287 nobody   dnsmasq
      1  252052 nobody   dnsmasq
      1  250961 nobody   dnsmasq
      1  250644 nobody   dnsmasq
      1  249868 nobody   dnsmasq
      1  249789 nobody   dnsmasq
      1  248762 nobody   dnsmasq
      1  248664 nobody   dnsmasq
      1  247941 nobody   dnsmasq
      1  247329 nobody   dnsmasq
      1  246261 nobody   dnsmasq
      1  245892 nobody   dnsmasq
      1  245511 nobody   dnsmasq
      1  244449 nobody   dnsmasq
      1  244139 nobody   dnsmasq
      1  243307 nobody   dnsmasq
      1  243002 nobody   dnsmasq
      1  242546 nobody   dnsmasq
      1  241946 nobody   dnsmasq
      1  241138 nobody   dnsmasq
      1  240684 nobody   dnsmasq
      1  240246 nobody   dnsmasq
      1  239502 nobody   dnsmasq
      1  239224 nobody   dnsmasq
      1  238329 nobody   dnsmasq
      1  238008 nobody   dnsmasq
      1  237835 nobody   dnsmasq
      1  237011 nobody   dnsmasq
      1  236634 nobody   dnsmasq
      1  235539 nobody   dnsmasq
      1  234809 nobody   dnsmasq
      1  234500 nobody   dnsmasq
      1  233481 nobody   dnsmasq
      1  232097 nobody   dnsmasq
      1  230986 nobody   dnsmasq
      1  229111 nobody   dnsmasq
      1  228553 nobody   dnsmasq
      1  226775 nobody   dnsmasq
      1  226038 nobody   dnsmasq
      1  224474 nobody   dnsmasq
      1  224145 nobody   dnsmasq
      1  223520 nobody   dnsmasq
      1  223008 nobody   dnsmasq
      1  222776 nobody   dnsmasq
      1  221582 nobody   dnsmasq
      1  220839 nobody   dnsmasq
      1  220594 nobody   dnsmasq
      1  220343 nobody   dnsmasq
      1  219465 nobody   dnsmasq
      1  218360 nobody   dnsmasq
      1  218146 nobody   dnsmasq
      1  217854 nobody   dnsmasq
      1  217087 nobody   dnsmasq
      1  216831 nobody   dnsmasq
      1  215738 nobody   dnsmasq
      1  215333 nobody   dnsmasq
      1  214496 nobody   dnsmasq
      1  214263 nobody   dnsmasq
      1  213385 nobody   dnsmasq
      1  213176 nobody   dnsmasq
      1  212103 nobody   dnsmasq
      1  211857 nobody   dnsmasq
      1  211501 nobody   dnsmasq
      1  210808 nobody   dnsmasq
      1  210536 nobody   dnsmasq
      1  209556 nobody   dnsmasq
      1  209243 nobody   dnsmasq
      1  208387 nobody   dnsmasq
      1  208167 nobody   dnsmasq
      1  207853 nobody   dnsmasq
      1  207344 nobody   dnsmasq
      1  206523 nobody   dnsmasq
      1  205860 nobody   dnsmasq
      1  205389 nobody   dnsmasq
      1  204868 nobody   dnsmasq
      1  203709 nobody   dnsmasq
      1  203211 nobody   dnsmasq
      1  202776 nobody   dnsmasq
      1  202243 nobody   dnsmasq
      1  201589 nobody   dnsmasq
      1  200944 nobody   dnsmasq
      1  200723 nobody   dnsmasq
      1  199292 nobody   dnsmasq
      1    1379 dbus     dbus-daemon
      1    1365 root     rsyslogd
      1    1032 root     systemd-udevd

Comment 12 Sai Sindhur Malleni 2017-09-15 14:54:05 UTC
I can confirm that running sysctl -w fs.inotify.max_user_instances=256 >> /etc/sysctl.conf to raise the value from 128 to 256 allows more subnets and VMs to be created. Should we set a higher default, then?
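
The one-liner above both applies the setting and persists it, because sysctl -w prints the "key = value" line it applied and the redirect appends that line to /etc/sysctl.conf. Running it repeatedly leaves several assignments in the file. The sketch below, in Python, shows one way to check which value a sysctl.conf-style text would end up applying, assuming lines are applied in order so that the last assignment wins. The helper names are made up for illustration and the parser deliberately ignores less common syntax.

```python
def effective_sysctl_value(conf_text, key):
    """Return the last value assigned to key in sysctl.conf-style text.

    Returns None when the key is never assigned. Comment lines start
    with '#' or ';' and are skipped.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        name, sep, setting = line.partition("=")
        if sep and name.strip() == key:
            value = setting.strip()  # later assignments override earlier ones
    return value


def render_sysctl_line(key, value):
    """Format one assignment in the 'key = value' form sysctl prints."""
    return "{} = {}\n".format(key, value)
```

For example, effective_sysctl_value(open("/etc/sysctl.conf").read(), "fs.inotify.max_user_instances") shows what would be applied from that file at the next boot.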

Comment 13 Ramon Acedo 2017-10-04 13:45:45 UTC
I'd suggest treating this as an improvement for TripleO Heat Templates (THT) instead of dnsmasq, changing the Product and Component to Red Hat OpenStack and openstack-tripleo-heat-templates, and adding the patch https://review.openstack.org/#/c/505381/ as an external tracker. Does that make sense to everybody?

Comment 14 Sai Sindhur Malleni 2017-10-04 13:54:23 UTC
Ramon,

I have a separate bug for OpenStack specific to scale. I'm tracking the upstream patches there https://bugzilla.redhat.com/show_bug.cgi?id=1491505

Comment 15 James Beal 2017-10-09 18:11:37 UTC
I have just been bitten by this... +1

Comment 16 Steve Relf 2017-10-29 13:18:48 UTC
Just been hit by this as well, on a prod platform. The item in comment 12 got me back up and working.

Comment 17 Brian Fife 2017-10-30 17:16:16 UTC
We just hit this on an upgrade within Newton from a fresh install back in April.  We had ~1200 networks / ~1200 dnsmasq processes running fine beforehand.

Comment 21 Petr Menšík 2017-10-31 14:49:58 UTC
Hi, I took a quick look at whether other options are available. I found something that might help under specific options.

Current upstream always creates an inotify socket. There are some cases when this is not necessary. The inotify socket is used to monitor resolv.conf file(s). If the --no-resolv option is used AND no --resolv-file is used, that use of inotify is never needed. If none of --hostsdir, --dhcp-hostsdir and --dhcp-optsdir is used either, the inotify socket does not have to be created for such a dnsmasq instance.

That would help if dnsmasq_dns_servers is used to configure dnsmasq. With the dnsmasq_local_resolv option, it would still require an inotify socket per instance.
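
The condition described in this comment can be sketched as a small predicate over a dnsmasq command line. This is only a Python paraphrase of the text above, not the logic of any actual dnsmasq patch. It assumes long option names only (short forms are not recognised) and cannot see options pulled in through --conf-file. The function name is hypothetical.

```python
# Directory options that, per the comment above, require inotify.
DIR_OPTIONS = ("--hostsdir", "--dhcp-hostsdir", "--dhcp-optsdir")


def needs_inotify(argv):
    """Guess whether a dnsmasq command line still needs an inotify socket."""
    # Reduce "--name=value" arguments to just "--name".
    names = {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}
    # resolv.conf monitoring is only unnecessary when --no-resolv is
    # given and no --resolv-file is given.
    watches_resolv = "--no-resolv" not in names or "--resolv-file" in names
    watches_dirs = any(opt in names for opt in DIR_OPTIONS)
    return watches_resolv or watches_dirs
```

A command line with --no-resolv and only file-based options such as --dhcp-hostsfile and --addn-hosts would return False here, while adding --dhcp-hostsdir or dropping --no-resolv would return True.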

Comment 22 Petr Menšík 2017-10-31 14:52:38 UTC
Created attachment 1345958 [details]
inotify conditional open patch

Open the inotify socket only if there is a resolv.conf file or hosts directory to watch.

Comment 23 Ramon Acedo 2017-10-31 15:01:37 UTC
Petr, this is a typical dnsmasq process as used in Neutron:

dnsmasq --no-hosts \
--no-resolv \
--strict-order \
--except-interface=lo \
--pid-file=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/pid \
--dhcp-hostsfile=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/host \
--addn-hosts=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/addn_hosts \
--dhcp-optsfile=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/opts \
--dhcp-leasefile=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/leases \
--dhcp-match=set:ipxe,175 \
--bind-interfaces \
--interface=tap5eff5ae2-af \
--dhcp-range=set:tag0,192.168.1.0,static,86400s \
--dhcp-option-force=option:mtu,1500 \
--dhcp-lease-max=256 \
--conf-file=/etc/dnsmasq-ironic.conf \
--domain=openstacklocal

It uses --no-resolv and it doesn't use --resolv-file. So it looks like your patch would work.

In the short term, Sai has created a patch for TripleO in Newton which hopefully will make it to OSP 10z7.

Thanks both for the info and work.

Comment 24 Sai Sindhur Malleni 2017-10-31 17:32:31 UTC
KCS for this: https://access.redhat.com/solutions/3228801

Comment 25 Petr Menšík 2017-11-01 17:15:14 UTC
Thanks, Ramon, for the typical dnsmasq setup command. Because it uses only files, it looks like the patch would fix it. Posted upstream for opinions.

http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2017q4/011814.html

Comment 26 Petr Menšík 2017-11-14 14:38:29 UTC
Patch accepted upstream in commit 075366ad6e6f53a68b173862546ab4cf70fa0b8d.

Comment 37 errata-xmlrpc 2018-10-30 09:49:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3110