Description of problem:

We have a setup for doing bulk DNS lookups using 2 bind instances configured to run on high ports and a custom program for submitting bulk queries in parallel. Since upgrading to the version of bind in 5.4, we have experienced a failure where bind stops responding to queries after perhaps 5 to 10 million lookups.

We have verified that this is reproducible with both bind-9.3.6-4.P1.el5_4.1 and bind-9.3.6-4.P1.el5 on two different i386 systems.

It is unlikely that I can share with you the code performing the lookups, but there is nothing particularly unusual about it that I am aware of other than that we are doing bulk parallel A record lookups through 2 bind instances on one server; and that we run an unusual .conf file to optimise for that load (below).

With bind-9.3.4-10.P1.el5_3.3 and previous versions of bind-9.3.4 in 5.3, we experienced no problems. We have reverted to bind-9.3.4-10.P1.el5_3.3 (and bind-libs, bind-chroot, caching-nameserver etc) on our production setup (but still running on 5.4 otherwise), and with this our query load works fine, as it did for 1-2 years before updating to 5.4.

Restarting the bind instance affected fixes the problem.

    [root@...]# dig -p 10053 slashdot.org @localhost
    ...
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27490

    [root@...]# /etc/init.d/named.fastdns.1 restart
    Stopping named.fastdns.2: . [ OK ]
    Starting named.fastdns.2: [ OK ]

    [root@...]# dig -p 10053 slashdot.org @localhost
    ...
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10712

Version-Release number of selected component (if applicable):
bind-9.3.6-4.P1.el5_4.1
bind-9.3.6-4.P1.el5

Actual results:
Failure of bind; stops responding to queries, nothing happening in strace(), normal service is restored by a restart.

Expected results:
Successful lookups.
Additional info:

named.conf used:

    options {
        listen-on port 10053 { 127.0.0.1; };
        listen-on-v6 port 10053 { ::1; };
        directory "/var/named.fastdns.1";
        dump-file "/var/named.fastdns.1/data/cache_dump.db";
        statistics-file "/var/named.fastdns.1/data/named_stats.txt";
        memstatistics-file "/var/named.fastdns.1/data/named_mem_stats.txt";
        query-source port 10053;
        query-source-v6 port 10053;
        allow-query { localhost; };

        // config for fastdns
        max-cache-size 2m;        // 2Mbyte is the minimum in 9.2.0a2
        cleaning-interval 10;     // reclaim every 10mins (default is 60mins)
        max-cache-ttl 60;         // possibly not needed, but harmless
        max-ncache-ttl 60;        // possibly not needed, but harmless
        recursive-clients 64000;  // default is 1000, fastdns -p N needs >N
    };

    view localhost_resolver {
        match-clients { localhost; };
        match-destinations { localhost; };
        recursion yes;
        include "/etc/named.rfc1912.zones";
    };

I hope you find this information helpful.
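The manual dig check shown in the report could be scripted so that the moment an instance starts failing is caught while the process is still in its broken state. This is only a sketch under stated assumptions, not the reporter's tooling: the function names `dns_status` and `probe_named` are hypothetical, the port (10053) and test name (slashdot.org) are taken from the report, and the 2-second timeout with a single try is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: dns_status and probe_named are hypothetical helper names.

# Read dig output on stdin and print the status field from the header line
# (e.g. NOERROR or SERVFAIL). Prints nothing if no header line is present.
dns_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query one named instance on localhost.
#   $1 = port (default 10053, as in the report)
#   $2 = name to look up (default slashdot.org, as in the report)
# Prints the status, or TIMEOUT if dig returned no header at all.
probe_named() {
    port=${1:-10053}
    name=${2:-slashdot.org}
    # +time=2 +tries=1 are assumed values, chosen so a dead instance is
    # reported quickly rather than after dig's default retries.
    status=$(dig -p "$port" +time=2 +tries=1 "$name" @localhost 2>/dev/null | dns_status)
    echo "${status:-TIMEOUT}"
}

# Example use (not executed here):
#   while [ "$(probe_named 10053)" = "NOERROR" ]; do sleep 10; done
#   echo "instance on port 10053 stopped answering at $(date)"
```

Stopping at the first non-NOERROR result, rather than restarting the instance straight away, leaves the failed process available for inspection.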
Would it be possible to install the bind-debuginfo package and get a backtrace from the hung named process, please?

- run "gdb attach <named_pid>"
- at the gdb prompt run "t a a bt full" (short for "thread apply all bt full")
- attach the output here

Thank you.
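The steps above can also be run non-interactively, which is convenient when the hang occurs on a production box and the instance needs to be restarted quickly. The following is a minimal sketch, assuming the installed gdb supports the `-p`, `-batch` and `-ex` options; the function name `capture_named_backtrace` and the default output file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: capture_named_backtrace is a hypothetical helper name.

# Attach gdb to a running process, dump full backtraces of all threads
# into a file, then detach and exit.
#   $1 = PID of the hung named process
#   $2 = output file (default: named-backtrace.txt, an assumed name)
# Returns 2 if the PID argument is missing or not numeric.
capture_named_backtrace() {
    pid=$1
    out=${2:-named-backtrace.txt}
    case $pid in
        ''|*[!0-9]*)
            echo "usage: capture_named_backtrace <named_pid> [outfile]" >&2
            return 2
            ;;
    esac
    # "thread apply all bt full" is the long form of "t a a bt full".
    gdb -p "$pid" -batch -ex "thread apply all bt full" > "$out" 2>&1
}

# Example use (not executed here), run as root with bind-debuginfo installed:
#   capture_named_backtrace <named_pid> /tmp/named-backtrace.txt
```

With two named instances on the same server, the PID has to be that of the instance that stopped answering, so it is safer to take it from that instance's pid file or from ps output than from a lookup by process name.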
There has been no response since 2010-03-30. If you hit this problem again, please attach a backtrace from the hung named process and reopen this bug. Closing as wontfix.
Closing as insufficient data - in needinfo for more than three months ... If you still experience the issue, provide the requested information and feel free to reopen the bugzilla ticket. Thanks in advance.