Bug 553334
| Summary: | bind-9.3.6-4.P1.el5 fails under high (bulk lookup) query load | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Colin Phipps <cph> |
| Component: | bind | Assignee: | Adam Tkac <atkac> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | qe-baseos-daemons |
| Severity: | low | Priority: | low |
| Version: | 5.4 | CC: | ovasik |
| Target Milestone: | rc | Doc Type: | Bug Fix |
| Hardware: | i386 | OS: | Linux |
| : | 692595 | Last Closed: | 2010-07-02 11:03:32 UTC |
Would it be possible to install the bind-debuginfo package and get a backtrace from the hung named process, please?

- run "gdb attach <named_pid>"
- in the gdb prompt run "t a a bt full"
- attach output here

Thank you.

There has been no response since 2010-03-30. If you hit this problem again, attach a backtrace from the hung named process and reopen this bug, please. Closing as wontfix.

Closing insufficient data - in needinfo for more than three months ... If you still experience the issue, provide the requested information and feel free to reopen the bugzilla ticket. Thanks in advance.
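The backtrace request above could also be run non-interactively, which helps when two named instances have to be captured quickly before a restart. The following is only a sketch: it assumes a Python 3 interpreter is available (not the stock Python on RHEL 5) and that the installed gdb accepts `-p`, `-batch` and `-ex`. The function names and the output path are hypothetical; `thread apply all bt full` is the long form of the `t a a bt full` command requested above.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Return the gdb argv that dumps a full backtrace of every thread.

    Assumption: gdb supports -p (attach to PID), -batch and -ex.
    """
    return [
        "gdb",
        "-p", str(int(pid)),               # attach to the running named process
        "-batch",                          # run the commands, then detach and exit
        "-ex", "thread apply all bt full", # long form of "t a a bt full"
    ]


def capture_backtrace(pid, out_path):
    """Run gdb against `pid` and write its combined output to `out_path`.

    Returns gdb's exit status. Needs enough privilege to attach (usually root).
    """
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

For example, `capture_backtrace(1234, "named-backtrace.txt")` would write the output for a named process with PID 1234 (a placeholder) to a file that can be attached to the bug. The backtrace is only readable if bind-debuginfo is installed.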
Description of problem:

We have a setup for doing bulk DNS lookups using 2 bind instances configured to run on high ports and a custom program for submitting bulk queries in parallel. Since upgrading to the version of bind in 5.4, we have experienced a failure where bind stops responding to queries after perhaps 5 to 10 million lookups.

We have verified that this is reproducible with both bind-9.3.6-4.P1.el5_4.1 and bind-9.3.6-4.P1.el5 on two different i386 systems.

It is unlikely that I can share the code performing the lookups, but as far as I am aware there is nothing particularly unusual about it, other than that we are doing bulk parallel A record lookups through 2 bind instances on one server, and that we run an unusual .conf file to optimise for that load (below).

With bind-9.3.4-10.P1.el5_3.3 and previous versions of bind-9.3.4 in 5.3, we experienced no problems. We have reverted to bind-9.3.4-10.P1.el5_3.3 (and bind-libs, bind-chroot, caching-nameserver etc) on our production setup (but still running on 5.4 otherwise), and with this our query load works fine, as it did for 1-2 years before updating to 5.4.

Restarting the affected bind instance fixes the problem.

    [root@...]# dig -p 10053 slashdot.org @localhost
    ...
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27490
    [root@...]# /etc/init.d/named.fastdns.1 restart
    Stopping named.fastdns.2: . [ OK ]
    Starting named.fastdns.2: [ OK ]
    [root@...]# dig -p 10053 slashdot.org @localhost
    ...
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10712

Version-Release number of selected component (if applicable):

- bind-9.3.6-4.P1.el5_4.1
- bind-9.3.6-4.P1.el5

Actual results: Failure of bind; it stops responding to queries, strace shows no activity, and normal service is restored by a restart.

Expected results: Successful lookups.
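The dig check in the transcript above can be approximated with a small probe that sends one A query to the instance and reports the response code, so that a SERVFAIL or a timeout can be detected automatically. This is a minimal sketch, not the reporter's tooling: the function names are hypothetical, the host, port and query name are taken from the transcript, and it assumes a Python 3 interpreter. It does not check that the response ID matches the query.

```python
import socket
import struct

# RCODE values from the low 4 bits of the DNS header flags word.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name, query_id):
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Question: encoded name, root label, QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def parse_rcode(response):
    """Return the response code name from a raw DNS response."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F
    return RCODE_NAMES.get(rcode, str(rcode))


def probe(name="slashdot.org", host="127.0.0.1", port=10053,
          timeout=2.0, query_id=0x1234):
    """Send one query over UDP and return the RCODE name, or "TIMEOUT"."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, query_id), (host, port))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    return parse_rcode(data)
```

Calling `probe()` against a healthy instance should report the same status that dig shows in its header line. In the failed state described above it would be expected to return either "SERVFAIL" or "TIMEOUT".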
Additional info:

named.conf used:

    options {
        listen-on port 10053 { 127.0.0.1; };
        listen-on-v6 port 10053 { ::1; };
        directory "/var/named.fastdns.1";
        dump-file "/var/named.fastdns.1/data/cache_dump.db";
        statistics-file "/var/named.fastdns.1/data/named_stats.txt";
        memstatistics-file "/var/named.fastdns.1/data/named_mem_stats.txt";
        query-source port 10053;
        query-source-v6 port 10053;
        allow-query { localhost; };

        // config for fastdns
        max-cache-size 2m;        // 2Mbyte is the minimum in 9.2.0a2
        cleaning-interval 10;     // reclaim every 10mins (default is 60mins)
        max-cache-ttl 60;         // possibly not needed, but harmless
        max-ncache-ttl 60;        // possibly not needed, but harmless
        recursive-clients 64000;  // default is 1000, fastdns -p N needs >N
    };

    view localhost_resolver {
        match-clients { localhost; };
        match-destinations { localhost; };
        recursion yes;
        include "/etc/named.rfc1912.zones";
    };

I hope you find this information helpful.
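Since the reporter's lookup program cannot be shared, anyone trying to reproduce the load needs a stand-in. The sketch below shows one way bulk parallel lookups might be driven while keeping the number of in-flight queries below the `recursive-clients` value from the named.conf above. It is an assumption about the shape of such a tool, not the reporter's fastdns program: `bulk_lookup` and its parameters are hypothetical, and the `resolve` callable (one lookup against the instance on port 10053) has to be supplied by the caller. It assumes Python 3.

```python
from concurrent.futures import ThreadPoolExecutor

# Value of recursive-clients in the named.conf above; the config comment
# says the client's parallelism N must stay below it.
RECURSIVE_CLIENTS = 64000


def bulk_lookup(names, resolve, parallelism=64):
    """Resolve `names` with at most `parallelism` lookups in flight.

    `resolve` is any callable taking a name and returning a result.
    Returns a dict mapping each name to its result, or to the exception
    raised for that name, so one failure does not abort the batch.
    """
    if not 0 < parallelism < RECURSIVE_CLIENTS:
        raise ValueError("parallelism must be between 1 and recursive-clients - 1")

    def safe_resolve(name):
        try:
            return name, resolve(name)
        except Exception as exc:  # keep going; record the failure
            return name, exc

    # The pool size bounds how many lookups run concurrently.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(pool.map(safe_resolve, names))
```

A thread pool is used here only because it is the simplest way to bound concurrency. A real bulk client handling millions of lookups would more likely multiplex many outstanding UDP queries over a few sockets.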