Bug 553334

Summary:	bind-9.3.6-4.P1.el5 fails under high (bulk lookup) query load
Product:	Red Hat Enterprise Linux 5	Reporter:	Colin Phipps <cph>
Component:	bind	Assignee:	Adam Tkac <atkac>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	qe-baseos-daemons
Severity:	low	Docs Contact:
Priority:	low
Version:	5.4	CC:	ovasik
Target Milestone:	rc
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	692595		Environment:
Last Closed:	2010-07-02 11:03:32 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---

Description Colin Phipps 2010-01-07 16:30:20 UTC

Description of problem:
We have a setup for doing bulk DNS lookups using 2 bind instances configured to run on high ports and a custom program for submitting bulk queries in parallel. Since upgrading to the version of bind in 5.4, we have experienced a failure where bind stops responding to queries after perhaps 5 to 10 million lookups.

We have verified that this is reproducible with both bind-9.3.6-4.P1.el5_4.1 and bind-9.3.6-4.P1.el5 on two different i386 systems.

I am unlikely to be able to share the code performing the lookups, but as far as I am aware there is nothing unusual about it, other than that we do bulk parallel A record lookups through 2 bind instances on one server and run an unusual .conf file to optimise for that load (below).

With bind-9.3.4-10.P1.el5_3.3 and previous versions of bind-9.3.4 in 5.3, we experienced no problems. We have reverted to bind-9.3.4-10.P1.el5_3.3 (and bind-libs, bind-chroot, caching-nameserver etc) on our production setup (but still running on 5.4 otherwise), and with this our query load works fine, as it did for 1-2 years before updating to 5.4.

Restarting the affected bind instance fixes the problem.

[root@...]# dig -p 10053 slashdot.org @localhost
...
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27490
[root@...]# /etc/init.d/named.fastdns.1 restart
Stopping named.fastdns.2: . [ OK ]
Starting named.fastdns.2: [ OK ]
[root@...]# dig -p 10053 slashdot.org @localhost
...
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10712
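The manual dig check above could be automated so that the failure is noticed as soon as an instance starts returning SERVFAIL. The following is only a minimal sketch: the function names are hypothetical, it assumes dig is on PATH, and the 5-second timeout is an arbitrary choice rather than anything taken from our setup.

```python
import re
import subprocess

# Matches the header line that dig prints, for example:
# ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27490"
_HEADER_RE = re.compile(r"->>HEADER<<-.*?status:\s*([A-Z]+)")


def parse_dig_status(dig_output):
    """Return the status word from dig output, or None if no header is found."""
    match = _HEADER_RE.search(dig_output)
    return match.group(1) if match else None


def named_is_healthy(port=10053, name="slashdot.org", server="localhost"):
    """Query one instance with dig; True only if it answers NOERROR.

    Assumption: dig is installed and on PATH. A timeout or a missing
    dig binary is treated the same as an unhealthy instance.
    """
    try:
        result = subprocess.run(
            ["dig", "-p", str(port), name, "@" + server],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return parse_dig_status(result.stdout) == "NOERROR"
```

Calling named_is_healthy() periodically (from cron, say) would give a rough timestamp for when the instance stopped resolving, which may help correlate the failure with query volume.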

Version-Release number of selected component (if applicable):

bind-9.3.6-4.P1.el5_4.1
bind-9.3.6-4.P1.el5

Actual results:
Failure of bind: it stops responding to queries, strace shows no activity, and normal service is restored by a restart.

Expected results:
Successful lookups.

Additional info:
named.conf used:
options {
        listen-on port 10053 { 127.0.0.1; };
        listen-on-v6 port 10053 { ::1; };
        directory       "/var/named.fastdns.1";
        dump-file       "/var/named.fastdns.1/data/cache_dump.db";
        statistics-file "/var/named.fastdns.1/data/named_stats.txt";
        memstatistics-file "/var/named.fastdns.1/data/named_mem_stats.txt";
        query-source    port 10053;
        query-source-v6 port 10053;
        allow-query     { localhost; };

        // config for fastdns
        max-cache-size 2m;        // 2Mbyte is the minimum in 9.2.0a2
        cleaning-interval 10;     // reclaim every 10mins (default is 60mins)
        max-cache-ttl 60;         // possibly not needed, but harmless
        max-ncache-ttl 60;        // possibly not needed, but harmless
        recursive-clients 64000;  // default is 1000, fastdns -p N needs >N

};
view localhost_resolver {
        match-clients      { localhost; };
        match-destinations { localhost; };
        recursion yes;
        include "/etc/named.rfc1912.zones";
};
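Since the lookup program itself cannot be shared, the load pattern (many A record queries sent in parallel to one instance on port 10053) might be approximated with something like the sketch below. This is not our production code: every function name is hypothetical, the packet layout follows the standard DNS wire format, and the batch is capped at 65536 names only because the query ID is 16 bits wide.

```python
import socket
import struct


def build_a_query(name, query_id):
    """Build a minimal DNS query packet asking for the A record of name."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def response_rcode(packet):
    """Return the RCODE (low 4 bits of the flags word) of a DNS response."""
    (flags,) = struct.unpack("!H", packet[2:4])
    return flags & 0x000F


def bulk_lookup(names, server=("127.0.0.1", 10053), timeout=5.0):
    """Send all queries up front, then collect replies as {query_id: rcode}."""
    names = list(names)
    if len(names) > 0x10000:
        raise ValueError("at most 65536 names per batch (16-bit query ID)")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    results = {}
    try:
        for query_id, name in enumerate(names):
            sock.sendto(build_a_query(name, query_id), server)
        while len(results) < len(names):
            try:
                packet, _ = sock.recvfrom(4096)
            except socket.timeout:
                break  # give up on whatever has not been answered
            if len(packet) < 4:
                continue  # too short to carry an ID and flags
            (reply_id,) = struct.unpack("!H", packet[:2])
            results[reply_id] = response_rcode(packet)
    finally:
        sock.close()
    return results
```

An RCODE of 2 in the results corresponds to the SERVFAIL status seen in the dig output. Very large batches may overflow the UDP socket buffers, so a real load generator would presumably interleave sending and receiving.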

I hope you find this information helpful.

Comment 1 Adam Tkac 2010-03-30 12:36:28 UTC

Would it be possible to install the bind-debuginfo package and get a backtrace from the hung named process, please?

- run "gdb -p <named_pid>" to attach to the running process
- at the gdb prompt, run "t a a bt full" (short for "thread apply all bt full")
- attach output here
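If an interactive gdb session is inconvenient on the affected machine, the same backtrace can probably be collected non-interactively. The sketch below is only an illustration: the helper names and the output file name are hypothetical, and it assumes gdb and bind-debuginfo are installed and that the caller has permission to attach to named.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb argv that attaches to pid, dumps all threads, and exits."""
    return [
        "gdb",
        "-batch",                            # exit once the commands have run
        "-ex", "thread apply all bt full",   # long form of "t a a bt full"
        "-p", str(pid),                      # attach to the running process
    ]


def save_backtrace(pid, out_path="named_backtrace.txt"):
    """Run gdb and write its combined output to out_path.

    out_path is a hypothetical default; gdb's exit status is not checked,
    so the file should be inspected before it is attached to the bug.
    """
    with open(out_path, "w") as out:
        subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out, stderr=subprocess.STDOUT, check=False,
        )
    return out_path
```

For example, save_backtrace(12345) would attach to PID 12345, write the full backtrace of every thread to named_backtrace.txt, and detach.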

Thank you.

Comment 2 Adam Tkac 2010-07-02 11:03:32 UTC

There has been no response since 2010-03-30. If you hit this problem again, please attach a backtrace from the hung named process and reopen this bug. Closing as wontfix.

Comment 3 Ondrej Vasik 2010-07-02 11:05:08 UTC

Closing as insufficient data: in needinfo for more than three months. If you still experience the issue, please provide the requested information and feel free to reopen the bugzilla ticket. Thanks in advance.