553334 – bind-9.3.6-4.P1.el5 fails under high (bulk lookup) query load

Summary: bind-9.3.6-4.P1.el5 fails under high (bulk lookup) query load

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	bind
Sub Component:
Version:	5.4
Hardware:	i386
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Adam Tkac
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-01-07 16:30 UTC by Colin Phipps
Modified:	2010-07-02 11:05 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	692595
Environment:
Last Closed:	2010-07-02 11:03:32 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Colin Phipps 2010-01-07 16:30:20 UTC

Description of problem:
We have a setup for doing bulk DNS lookups using 2 bind instances configured to run on high ports and a custom program for submitting bulk queries in parallel. Since upgrading to the version of bind in 5.4, we have experienced a failure where bind stops responding to queries after perhaps 5 to 10 million lookups.

We have verified that this is reproducible with both bind-9.3.6-4.P1.el5_4.1 and bind-9.3.6-4.P1.el5 on two different i386 systems.

I am unlikely to be able to share the code performing the lookups, but I am not aware of anything particularly unusual about it, other than that we are doing bulk parallel A record lookups through 2 bind instances on one server and that we run an unusual .conf file to optimise for that load (below).
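For anyone trying to reproduce the load, a minimal sketch of such a driver might look like the following. This is not the reporter's program, which was not shared. The `bulk_lookup` name, the `lookup` callable, the worker count and the second port (10054) are all assumptions for illustration; only port 10053 appears in this report.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def bulk_lookup(names, ports, lookup, workers=64):
    """Resolve `names` in parallel, alternating between resolver ports.

    `lookup(name, port)` is a caller-supplied function (hypothetical here)
    that performs one A record query against the instance on `port`.
    Returns a list of (name, port, result) tuples in input order.
    """
    # Pair each name with a port, cycling through the configured instances.
    jobs = list(zip(names, cycle(ports)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order; an exception raised by `lookup`
        # is re-raised here when its result is consumed.
        results = list(pool.map(lambda job: lookup(*job), jobs))
    return [(name, port, res) for (name, port), res in zip(jobs, results)]
```

The number of queries in flight would presumably need to stay below the `recursive-clients` limit in the configuration below, in line with its "fastdns -p N needs >N" comment.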

With bind-9.3.4-10.P1.el5_3.3 and previous versions of bind-9.3.4 in 5.3, we experienced no problems. We have reverted to bind-9.3.4-10.P1.el5_3.3 (and bind-libs, bind-chroot, caching-nameserver etc) on our production setup (but still running on 5.4 otherwise), and with this our query load works fine, as it did for 1-2 years before updating to 5.4.

Restarting the bind instance affected fixes the problem.

[root@...]# dig -p 10053 slashdot.org @localhost
...
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27490
[root@...]# /etc/init.d/named.fastdns.1 restart
Stopping named.fastdns.2: . [ OK ]
Starting named.fastdns.2: [ OK ]
[root@...]# dig -p 10053 slashdot.org @localhost
...
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10712
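The same check as the dig calls above could be scripted so that a monitor notices when an instance starts answering SERVFAIL. The following is only a sketch using the Python standard library: the function names, the fixed query id and the 5 second timeout are assumptions, and it reads nothing beyond the status code in the response header.

```python
import socket
import struct

# Status codes from the low 4 bits of the DNS header flags (RFC 1035).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_a_query(name, query_id):
    """Encode a minimal recursive A/IN query for `name`."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label, then QTYPE=1 (A) and QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def rcode_name(response):
    """Return the status name carried in a raw DNS response."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, "RCODE%d" % code)


def probe(name="slashdot.org", host="127.0.0.1", port=10053, timeout=5.0):
    """Send one query over UDP and report the status, or TIMEOUT."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_a_query(name, 0x1234), (host, port))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    return rcode_name(data)
```

Calling `probe()` against a healthy instance would be expected to return "NOERROR", and "SERVFAIL" or "TIMEOUT" once the failure described here sets in.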

Version-Release number of selected component (if applicable):

bind-9.3.6-4.P1.el5_4.1
bind-9.3.6-4.P1.el5

Actual results:
bind fails: it stops responding to queries and strace shows no activity in the process. A restart restores normal service.

Expected results:
Successful lookups.

Additional info:
named.conf used:
options {
        listen-on port 10053 { 127.0.0.1; };
        listen-on-v6 port 10053 { ::1; };
        directory       "/var/named.fastdns.1";
        dump-file       "/var/named.fastdns.1/data/cache_dump.db";
        statistics-file "/var/named.fastdns.1/data/named_stats.txt";
        memstatistics-file "/var/named.fastdns.1/data/named_mem_stats.txt";
        query-source    port 10053;
        query-source-v6 port 10053;
        allow-query     { localhost; };

        // config for fastdns
        max-cache-size 2m;        // 2Mbyte is the minimum in 9.2.0a2
        cleaning-interval 10;     // reclaim every 10mins (default is 60mins)
        max-cache-ttl 60;         // possibly not needed, but harmless
        max-ncache-ttl 60;        // possibly not needed, but harmless
        recursive-clients 64000;  // default is 1000, fastdns -p N needs >N

};
view localhost_resolver {
        match-clients      { localhost; };
        match-destinations { localhost; };
        recursion yes;
        include "/etc/named.rfc1912.zones";
};

I hope you find this information helpful.

Comment 1 Adam Tkac 2010-03-30 12:36:28 UTC

Would it be possible to install the bind-debuginfo package and get a backtrace from the hung named process, please?

- run "gdb -p <named_pid>"
- at the gdb prompt run "thread apply all bt full" (can be abbreviated to "t a a bt full")
- attach the output here
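If the hang occurs unattended, the steps above could also be run non-interactively so the backtrace is saved before the instance is restarted. This is a sketch, not part of the original request: it relies on gdb's `-p`, `-batch` and `-ex` options, and the function names and output path are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb invocation that dumps every thread's stack and exits."""
    return [
        "gdb", "-p", str(int(pid)),        # attach to the running named
        "-batch",                          # exit after the commands finish
        "-ex", "thread apply all bt full", # long form of "t a a bt full"
    ]


def capture_backtrace(pid, out_path):
    """Run gdb against `pid`, writing its output to `out_path`."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

For example, `capture_backtrace(named_pid, "/tmp/named-backtrace.txt")` would leave a file that can be attached to this bug. gdb typically needs to run as root, or as the user that owns the named process, to attach.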

Thank you.

Comment 2 Adam Tkac 2010-07-02 11:03:32 UTC

There has been no response since 2010-03-30. If you hit this problem again, please attach a backtrace from the hung named process and reopen this bug. Closing as wontfix.

Comment 3 Ondrej Vasik 2010-07-02 11:05:08 UTC

Closing as insufficient data: the bug has been in needinfo for more than three months. If you still experience the issue, provide the requested information and feel free to reopen the bugzilla ticket. Thanks in advance.
