Bug 1474515
Summary: dhcp-agent dnsmasq max files
Product: Red Hat Enterprise Linux 7
Component: dnsmasq
Version: 7.5
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Joe Talerico <jtaleric>
Assignee: Petr Menšík <pemensik>
QA Contact: Daniel Rusek <drusek>
CC: abond, amedeo.salvati, brian.fife, drusek, james.beal, jtaleric, oblaut, pemensik, pneedle, psklenar, racedoro, rcernin, salmy, smalleni, srelf, srevivo, thozza
Target Milestone: alpha
Target Release: ---
Keywords: Upstream
Fixed In Version: dnsmasq-2.76-7.el7
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-10-30 09:49:42 UTC
Bug Blocks: 1534569, 1549614
Description (Joe Talerico, 2017-07-24 20:05:24 UTC)
Hey Tomas - thanks for moving my bug from OpenStack to RHEL, since that seems to be the right product. I won't be able to invent a reproducer outside of OpenStack for you; I suggest using OpenStack to reproduce the issue. I can easily reproduce it with multiple scenarios orchestrated via OpenStack Rally. Running create-port-list with times: 1000, concurrency: 32, I see:

2017-07-24 17:56:35.269 87837 DEBUG neutron.agent.dhcp.agent [req-3633ae11-2ab3-44eb-ac5a-9f12568900fd - - - - -] resync (594f387b-bdb2-45c3-bea5-47f6d378907c): [ProcessExecutionError(u'Exit code: 4; Stdin: # Generated by iptables_manager\n*filter\n:neutron-dhcp-age-FORWARD - [0:0]\n:neutron-dhcp-age-INPUT - [0:0]\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-local - [0:0]\n:neutron-filter-top - [0:0]\n-I FORWARD 1 -j neutron-filter-top\n-I FORWARD 2 -j neutron-dhcp-age-FORWARD\n-I INPUT 1 -j neutron-dhcp-age-INPUT\n-I OUTPUT 1 -j neutron-filter-top\n-I OUTPUT 2 -j neutron-dhcp-age-OUTPUT\n-I neutron-filter-top 1 -j neutron-dhcp-age-local\nCOMMIT\n# Completed by iptables_manager\n# Generated by iptables_manager\n*mangle\n:neutron-dhcp-age-FORWARD - [0:0]\n:neutron-dhcp-age-INPUT - [0:0]\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-POSTROUTING - [0:0]\n:neutron-dhcp-age-PREROUTING - [0:0]\n:neutron-dhcp-age-mark - [0:0]\n-I FORWARD 1 -j neutron-dhcp-age-FORWARD\n-I INPUT 1 -j neutron-dhcp-age-INPUT\n-I OUTPUT 1 -j neutron-dhcp-age-OUTPUT\n-I POSTROUTING 1 -j neutron-dhcp-age-POSTROUTING\n-I PREROUTING 1 -j neutron-dhcp-age-PREROUTING\n-I neutron-dhcp-age-POSTROUTING 1 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill\n-I neutron-dhcp-age-PREROUTING 1 -j neutron-dhcp-age-mark\nCOMMIT\n# Completed by iptables_manager\n# Generated by iptables_manager\n*nat\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-POSTROUTING - [0:0]\n:neutron-dhcp-age-PREROUTING - [0:0]\n:neutron-dhcp-age-float-snat - [0:0]\n:neutron-dhcp-age-snat - [0:0]\n:neutron-postrouting-bottom - [0:0]\n-I OUTPUT 1 -j neutron-dhcp-age-OUTPUT\n-I POSTROUTING 1 -j neutron-dhcp-age-POSTROUTING\n-I POSTROUTING 2 -j neutron-postrouting-bottom\n-I PREROUTING 1 -j neutron-dhcp-age-PREROUTING\n-I neutron-dhcp-age-snat 1 -j neutron-dhcp-age-float-snat\n-I neutron-postrouting-bottom 1 -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-dhcp-age-snat\nCOMMIT\n# Completed by iptables_manager\n# Generated by iptables_manager\n*raw\n:neutron-dhcp-age-OUTPUT - [0:0]\n:neutron-dhcp-age-PREROUTING - [0:0]\n-I OUTPUT 1 -j neutron-dhcp-age-OUTPUT\n-I PREROUTING 1 -j neutron-dhcp-age-PREROUTING\nCOMMIT\n# Completed by iptables_manager\n; Stdout: ; Stderr: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?\n',), ProcessExecutionError(u'Exit code: 5; Stdin: ; Stdout: ; Stderr: \ndnsmasq: failed to create inotify: Too many open files\n',), ProcessExecutionError(u'Exit code: 5; Stdin: ; Stdout: ; Stderr: \ndnsmasq: failed to create inotify: Too many open files\n',)] _periodic_resync_helper /usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py:255

I have looked into the dnsmasq code. There is always only a single inotify socket per dnsmasq instance. It does not explicitly close the inotify file descriptor before exiting; I think that should be done by the kernel when the process dies. I will try to verify that it is not leaked even after the process exits. I have not found a way for it to open the inotify socket more than once, however. Unless there are 128 instances of dnsmasq running, it should be OK.
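As a quick way to check that in practice (generic commands, not taken from this report), one can count the running dnsmasq instances and the inotify descriptors they hold; each instance is expected to hold exactly one:

# number of dnsmasq processes currently running
$ pgrep -c dnsmasq

# total inotify descriptors held by dnsmasq processes
$ lsof -c dnsmasq 2>/dev/null | grep -c inotify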
The instance of dnsmasq used by libvirtd, for example, uses 11 file descriptors for the main process and 5 for the dhcp-script helper.

# main process
$ lsof -p 18906
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq 18906 nobody cwd DIR 253,1 224 64 /
dnsmasq 18906 nobody rtd DIR 253,1 224 64 /
dnsmasq 18906 nobody txt REG 253,1 344888 6331416 /usr/sbin/dnsmasq
dnsmasq 18906 nobody mem REG 253,1 62184 6333789 /usr/lib64/libnss_files-2.17.so
dnsmasq 18906 nobody mem REG 253,1 44448 6333801 /usr/lib64/librt-2.17.so
dnsmasq 18906 nobody mem REG 253,1 144792 6333797 /usr/lib64/libpthread-2.17.so
dnsmasq 18906 nobody mem REG 253,1 2127336 6333771 /usr/lib64/libc-2.17.so
dnsmasq 18906 nobody mem REG 253,1 208928 6413923 /usr/lib64/libidn.so.11.6.11
dnsmasq 18906 nobody mem REG 253,1 304576 6391819 /usr/lib64/libdbus-1.so.3.7.4
dnsmasq 18906 nobody mem REG 253,1 164264 6333764 /usr/lib64/ld-2.17.so
dnsmasq 18906 nobody mem REG 253,1 26254 6337493 /usr/lib64/gconv/gconv-modules.cache
dnsmasq 18906 nobody 0u CHR 1,3 0t0 4856 /dev/null
dnsmasq 18906 nobody 1u CHR 1,3 0t0 4856 /dev/null
dnsmasq 18906 nobody 2u CHR 1,3 0t0 4856 /dev/null
dnsmasq 18906 nobody 3u IPv4 40164 0t0 UDP *:bootps
dnsmasq 18906 nobody 4u netlink 0t0 40165 ROUTE
dnsmasq 18906 nobody 5u IPv4 40167 0t0 UDP qeos-64.lab.eng.rdu2.redhat.com:domain
dnsmasq 18906 nobody 6r a_inode 0,9 0 4852 inotify
dnsmasq 18906 nobody 7r FIFO 0,8 0t0 40173 pipe
dnsmasq 18906 nobody 8w FIFO 0,8 0t0 40173 pipe
dnsmasq 18906 nobody 9u unix 0xffff88005ff8ac00 0t0 40198 socket
dnsmasq 18906 nobody 12w FIFO 0,8 0t0 40199 pipe

# helper process
$ lsof -p 18908
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq 18908 root cwd DIR 253,1 224 64 /
dnsmasq 18908 root rtd DIR 253,1 224 64 /
dnsmasq 18908 root txt REG 253,1 344888 6331416 /usr/sbin/dnsmasq
dnsmasq 18908 root mem REG 253,1 62184 6333789 /usr/lib64/libnss_files-2.17.so
dnsmasq 18908 root mem REG 253,1 44448 6333801 /usr/lib64/librt-2.17.so
dnsmasq 18908 root mem REG 253,1 144792 6333797 /usr/lib64/libpthread-2.17.so
dnsmasq 18908 root mem REG 253,1 2127336 6333771 /usr/lib64/libc-2.17.so
dnsmasq 18908 root mem REG 253,1 208928 6413923 /usr/lib64/libidn.so.11.6.11
dnsmasq 18908 root mem REG 253,1 304576 6391819 /usr/lib64/libdbus-1.so.3.7.4
dnsmasq 18908 root mem REG 253,1 164264 6333764 /usr/lib64/ld-2.17.so
dnsmasq 18908 root 0u CHR 1,3 0t0 4856 /dev/null
dnsmasq 18908 root 1u CHR 1,3 0t0 4856 /dev/null
dnsmasq 18908 root 2u CHR 1,3 0t0 4856 /dev/null
dnsmasq 18908 root 8w FIFO 0,8 0t0 40173 pipe
dnsmasq 18908 root 11r FIFO 0,8 0t0 40199 pipe

I think the most important part is what Robin pointed out: there is another limit for inotify, which is quite low, 128 by default. More importantly, it is not per session but per USER, and dnsmasq does not have its own dedicated user; it uses the nobody user for the main process (which opens inotify).

Would you be able to run this command on failure and share its output?

$ lsof -u nobody | grep inotify

or just

$ lsof -u nobody | grep inotify | wc -l

I think it is possible there are other inotify socket holders that use many inotify sockets, and dnsmasq may just add the few more needed to hit the maximum. Do I understand the parameters right that there should be at most 32 instances running at the same time?

I found a nice command to list the inotify sockets in use.
It might help:

$ find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c' -p '{}' | uniq -c | sort -nr

Hitting the same issue on a different environment, when trying to create networks and boot VMs in OpenStack. Once 116 networks are created (each with its own dnsmasq) and 116 VMs are booted (each on its own network created as mentioned previously), we are unable to boot any more VMs because DHCP fails for the VM's port.

2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent [req-288477aa-318a-40c3-954e-dd6fc98c6c1b - - - - -] Unable to enable dhcp for bb6cdb16-72c0-4cc4-a316-69ebcd7633b2.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr: dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent self.spawn_process()
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=False)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent run_as_root=self.run_as_root)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent log_fail_as_error=log_fail_as_error, **kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent raise ProcessExecutionError(msg, returncode=returncode)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.260 91663 ERROR neutron.agent.linux.utils [req-d0ade748-22ea-4a45-ba10-277d45f20981 - - - - -] Exit code: 5; Stdin: ; Stdout: ; Stderr: dnsmasq: failed to create inotify: Too many open files

[root@overcloud-controller-0 heat-admin]# lsof -u nobody | grep inotify | wc -l
116
[root@overcloud-controller-0 heat-admin]# ps aux | grep dnsmasq | wc -l
117

An OpenStack Scale related bug has been filed here: https://bugzilla.redhat.com/show_bug.cgi?id=1491505. This is clearly a blocker for OpenStack Scale, so requesting more urgency on this.

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep dnsmasq
dnsmasq-2.76-2.el7.x86_64
dnsmasq-utils-2.66-21.el7.x86_64
[root@overcloud-controller-0 heat-admin]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Also, to be clear, the rate at which I was creating networks (dnsmasq processes) and VMs was just 2 at a time. I tried with 8 at a time first and, since I wasn't sure whether the creation rate had an impact, dropped down to 2 at a time. OpenStack fails to scale beyond 116 networks and subnets because of this issue.

[root@overcloud-controller-0 heat-admin]# find /proc/*/fd -lname anon_inode:inotify |
> cut -d/ -f3 |
> xargs -I '{}' -- ps --no-headers -o '%p %U %c' -p '{}' |
> uniq -c |
> sort -nr
5 1 root systemd
3 1419 root NetworkManager
2 17949 odl java
2 1417 polkitd polkitd
1 3932 root crond
1 312997 nobody dnsmasq
1 307772 nobody dnsmasq
1 307170 nobody dnsmasq
1 303854 nobody dnsmasq
1 302882 nobody dnsmasq
1 299363 nobody dnsmasq
1 298959 nobody dnsmasq
1 295331 nobody dnsmasq
1 295099 nobody dnsmasq
1 291719 nobody dnsmasq
1 290445 nobody dnsmasq
1 288273 nobody dnsmasq
1 287448 nobody dnsmasq
1 285631 nobody dnsmasq
1 284068 nobody dnsmasq
1 281812 nobody dnsmasq
1 278858 nobody dnsmasq
1 277852 nobody dnsmasq
1 275957 nobody dnsmasq
1 273351 nobody dnsmasq
1 269025 nobody dnsmasq
1 266707 nobody dnsmasq
1 265901 nobody dnsmasq
1 263640 nobody dnsmasq
1 263330 nobody dnsmasq
1 260207 nobody dnsmasq
1 260137 nobody dnsmasq
1 257830 nobody dnsmasq
1 256695 nobody dnsmasq
1 254814 nobody dnsmasq
1 254505 nobody dnsmasq
1 253550 nobody dnsmasq
1 253046 nobody dnsmasq
1 252287 nobody dnsmasq
1 252052 nobody dnsmasq
1 250961 nobody dnsmasq
1 250644 nobody dnsmasq
1 249868 nobody dnsmasq
1 249789 nobody dnsmasq
1 248762 nobody dnsmasq
1 248664 nobody dnsmasq
1 247941 nobody dnsmasq
1 247329 nobody dnsmasq
1 246261 nobody dnsmasq
1 245892 nobody dnsmasq
1 245511 nobody dnsmasq
1 244449 nobody dnsmasq
1 244139 nobody dnsmasq
1 243307 nobody dnsmasq
1 243002 nobody dnsmasq
1 242546 nobody dnsmasq
1 241946 nobody dnsmasq
1 241138 nobody dnsmasq
1 240684 nobody dnsmasq
1 240246 nobody dnsmasq
1 239502 nobody dnsmasq
1 239224 nobody dnsmasq
1 238329 nobody dnsmasq
1 238008 nobody dnsmasq
1 237835 nobody dnsmasq
1 237011 nobody dnsmasq
1 236634 nobody dnsmasq
1 235539 nobody dnsmasq
1 234809 nobody dnsmasq
1 234500 nobody dnsmasq
1 233481 nobody dnsmasq
1 232097 nobody dnsmasq
1 230986 nobody dnsmasq
1 229111 nobody dnsmasq
1 228553 nobody dnsmasq
1 226775 nobody dnsmasq
1 226038 nobody dnsmasq
1 224474 nobody dnsmasq
1 224145 nobody dnsmasq
1 223520 nobody dnsmasq
1 223008 nobody dnsmasq
1 222776 nobody dnsmasq
1 221582 nobody dnsmasq
1 220839 nobody dnsmasq
1 220594 nobody dnsmasq
1 220343 nobody dnsmasq
1 219465 nobody dnsmasq
1 218360 nobody dnsmasq
1 218146 nobody dnsmasq
1 217854 nobody dnsmasq
1 217087 nobody dnsmasq
1 216831 nobody dnsmasq
1 215738 nobody dnsmasq
1 215333 nobody dnsmasq
1 214496 nobody dnsmasq
1 214263 nobody dnsmasq
1 213385 nobody dnsmasq
1 213176 nobody dnsmasq
1 212103 nobody dnsmasq
1 211857 nobody dnsmasq
1 211501 nobody dnsmasq
1 210808 nobody dnsmasq
1 210536 nobody dnsmasq
1 209556 nobody dnsmasq
1 209243 nobody dnsmasq
1 208387 nobody dnsmasq
1 208167 nobody dnsmasq
1 207853 nobody dnsmasq
1 207344 nobody dnsmasq
1 206523 nobody dnsmasq
1 205860 nobody dnsmasq
1 205389 nobody dnsmasq
1 204868 nobody dnsmasq
1 203709 nobody dnsmasq
1 203211 nobody dnsmasq
1 202776 nobody dnsmasq
1 202243 nobody dnsmasq
1 201589 nobody dnsmasq
1 200944 nobody dnsmasq
1 200723 nobody dnsmasq
1 199292 nobody dnsmasq
1 1379 dbus dbus-daemon
1 1365 root rsyslogd
1 1032 root systemd-udevd
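As background (not part of the original output), the ceiling being hit here is the kernel's per-user limit on inotify instances, which can be read directly; the document notes it defaults to 128 on RHEL 7:

$ sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 128
$ cat /proc/sys/fs/inotify/max_user_instances
128

With 116 dnsmasq processes each holding one inotify descriptor under the nobody user, that default leaves very little headroom.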
I can confirm that running

sysctl -w fs.inotify.max_user_instances=256 >> /etc/sysctl.conf

to raise the value from 128 to 256 allows more subnets and VMs to be created. Should we set a higher default, then?

I'd suggest treating this as an improvement for the TripleO Heat Templates (THT) instead of dnsmasq, changing the Product and Component to Red Hat OpenStack and openstack-tripleo-heat-templates, and adding the patch https://review.openstack.org/#/c/505381/ as an external tracker. Does that make sense to everybody?

Ramon, I have a separate bug for the OpenStack side of the scale issue; I'm tracking the upstream patches there: https://bugzilla.redhat.com/show_bug.cgi?id=1491505

I have just been bitten by this...

+1, just been hit by this as well, on a production platform. The item in comment 12 got me back up and working.

We just hit this on an upgrade within Newton from a fresh install back in April. We had ~1200 networks / ~1200 dnsmasq processes running fine beforehand.

Hi, I took a quick look at whether other options are available and found something that might help under specific configurations. Current upstream always creates the inotify socket, but there are cases where this is not necessary. The inotify socket is used to monitor resolv.conf file(s): if the --no-resolv option is used AND no --resolv-file is given, that use of inotify never happens. If none of --hostsdir, --dhcp-hostsdir and --dhcp-optsdir is used either, the inotify socket does not have to be created for that dnsmasq instance at all. That would help if dnsmasq_dns_servers is used to configure dnsmasq; with the dnsmasq_local_resolv option it would still require an inotify socket per instance.

Created attachment 1345958 [details]
inotify conditional open patch
Open the inotify socket only if there is a resolv.conf file or a hosts directory to watch.
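To make the conditional-open idea concrete, here is a rough sketch in shell of the decision described above. It is illustrative only: the helper name needs_inotify and its argument handling are hypothetical and do not reflect the actual dnsmasq source or the attached patch.

# Hypothetical helper: given a dnsmasq command line, decide whether the
# instance would still need an inotify descriptor under the conditional-open
# change. Returns 0 (true) if inotify is needed, 1 (false) otherwise.
needs_inotify() {
    local args="$*"
    # resolv.conf monitoring is needed unless --no-resolv is set and no
    # explicit --resolv-file is given
    if ! grep -q -- '--no-resolv' <<<"$args" || grep -q -- '--resolv-file' <<<"$args"; then
        return 0
    fi
    # any of the directory-watching options also requires inotify
    if grep -Eq -- '--(hostsdir|dhcp-hostsdir|dhcp-optsdir)' <<<"$args"; then
        return 0
    fi
    return 1
}

For example, against a running instance one could try (PID is a placeholder):

$ needs_inotify "$(ps -ww -o args= -p PID)" && echo "needs inotify" || echo "no inotify needed"

A Neutron-style invocation that uses --no-resolv and only file-based options would fall into the "no inotify needed" case, which is exactly what frees up the per-user instance budget.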
Petr, this is a typical dnsmasq process as used in Neutron:

dnsmasq --no-hosts \
  --no-resolv \
  --strict-order \
  --except-interface=lo \
  --pid-file=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/pid \
  --dhcp-hostsfile=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/host \
  --addn-hosts=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/addn_hosts \
  --dhcp-optsfile=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/opts \
  --dhcp-leasefile=/var/lib/neutron/dhcp/2cba5238-5cde-4393-bed5-f58ed465b458/leases \
  --dhcp-match=set:ipxe,175 \
  --bind-interfaces \
  --interface=tap5eff5ae2-af \
  --dhcp-range=set:tag0,192.168.1.0,static,86400s \
  --dhcp-option-force=option:mtu,1500 \
  --dhcp-lease-max=256 \
  --conf-file=/etc/dnsmasq-ironic.conf \
  --domain=openstacklocal

It uses --no-resolv and it doesn't use --resolv-file, so it looks like your patch would work. In the short term Sai has created a patch for TripleO in Newton which hopefully will make it into OSP 10z7. Thanks both for the info and the work.

KCS for this: https://access.redhat.com/solutions/3228801

Thanks Ramon for the typical dnsmasq setup command. Because it uses only files, it looks like the patch would fix it. Posted upstream for opinions: http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2017q4/011814.html

Patch accepted upstream in commit 075366ad6e6f53a68b173862546ab4cf70fa0b8d.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3110