1281484 – bind stops resolving - flush cache clears issue


Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	bind
Sub Component:
Version:	6.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Petr Menšík
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-11-12 15:57 UTC by duncan
Modified:	2017-09-05 14:27 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-09-05 14:27:55 UTC
Target Upstream Version:
Embargoed:

Attachments
named.run.9 (1.02 MB, text/plain) 2015-11-13 20:39 UTC, duncan
named.run.8 (1.00 MB, text/plain) 2015-11-13 20:41 UTC, duncan

Description duncan 2015-11-12 15:57:39 UTC

Description of problem:

Periodically, bind version bind-9.8.2-0.37.rc1.el6_7.4.i686 stops resolving and returns "Host xxxxxxx.com not found: 2(SERVFAIL)". A cache flush (/usr/sbin/rndc flush) clears the issue and bind starts resolving again.

Version-Release number of selected component (if applicable):

9.8.2-0.37.rc1.el6_7.4.i686

How reproducible:

Not readily reproducible on demand, but it happens randomly and often on 2 servers running this version of bind.

Downgrading to 9.8.2-0.30.rc1.el6.i686 resolves the issue. I have 23 servers running that version with no issues since downgrading. All servers are configured exactly the same. If I upgrade one of the working servers, it exhibits the issue.

Additional info:

I can detect the issue with reasonable ease via a cronjob. When it next happens, I will attempt to retain the cache dump taken at the failure and one taken a few minutes before the failure.

Please advise anything you would want me to capture and how to do it.

Comment 2 Tomáš Hozza 2015-11-13 09:09:46 UTC

Hello.

Please increase the debug level using 'rndc trace 99' OR by adding '-d 99' on the command line when starting named. Then please attach the log from named (by default stored in /var/named/data/named.run).
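For anyone scripting the first option, a minimal Python sketch of raising the debug level through rndc follows. The /usr/sbin/rndc path is taken from the report above; the function names and the injectable `run` parameter are illustrative assumptions, not part of BIND.

```python
import subprocess


def rndc_trace_command(level, rndc="/usr/sbin/rndc"):
    """Build the argument list for `rndc trace <level>`."""
    if not isinstance(level, int) or level < 0:
        raise ValueError("debug level must be a non-negative integer")
    return [rndc, "trace", str(level)]


def raise_debug_level(level=99, run=subprocess.run):
    """Run rndc and report whether it exited with status 0.

    `run` is injectable (hypothetical convenience) so the call can be
    exercised without a running named instance.
    """
    result = run(rndc_trace_command(level), check=False)
    return result.returncode == 0
```

Calling `raise_debug_level()` on the resolver host needs sufficient privileges to talk to named over the rndc control channel.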

I presume you are using DNSSEC validation on the resolver, is that correct?

Is this happening for any domains until you flush the cache, or does named return SERVFAIL only for some particular domains?

Thank you in advance.

Comment 3 duncan 2015-11-13 12:14:51 UTC

I have disabled DNSSEC as I thought that was the issue.

dnssec-enable no;
dnssec-validation no;

It appears to happen for all domains until I flush the cache. My crontab job runs these checks in order:

1. Look up one of my domains on 127.0.0.1.
2. If that fails, ping 8.8.8.8 to check for a network connection.
3. Look up google.com using 127.0.0.1.
4. If that fails, look up my domain on 8.8.8.8 (which works).

At that point it does a flush and then repeats the lookup to check that DNS is now working. Finally it mails me the log, the cache at the time of the issue, and the cache at the last check.

I will enable the trace and upload the log when I next have the issue. It generally happens 2 or 3 times a day, so it shouldn't be too long.
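The check sequence described here could be sketched in Python roughly as follows. This is not the actual cron script: the function names, the verdict strings, and the handling of the intermediate branches (for example, what happens when the ping fails) are assumptions made for illustration. The probes are passed in as callables so the decision logic can be exercised without touching the network.

```python
import subprocess


def host_lookup(name, server):
    """True if `host <name> <server>` exits with status 0."""
    result = subprocess.run(
        ["host", name, server],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ping_once(address):
    """True if a single ICMP echo to `address` is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def rndc_flush():
    """Flush the local resolver cache (path taken from the report)."""
    subprocess.run(["/usr/sbin/rndc", "flush"], check=False)


def check_resolver(lookup, ping, flush, own_domain,
                   public_domain="google.com",
                   local="127.0.0.1", remote="8.8.8.8"):
    """Walk the checks in the order described and return a verdict."""
    if lookup(own_domain, local):
        return "ok"
    if not ping(remote):
        return "network-down"            # assumed branch: stop here
    if lookup(public_domain, local):
        return "own-domain-only"         # assumed branch: not the cache bug
    if not lookup(own_domain, remote):
        return "domain-broken-upstream"  # assumed branch
    flush()
    if lookup(own_domain, local):
        return "recovered-after-flush"
    return "still-failing"
```

From cron, something like `check_resolver(host_lookup, ping_once, rndc_flush, "example.com")` would run the real probes; `example.com` stands in for the reporter's own domain, which the thread does not name.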

Comment 4 duncan 2015-11-13 20:39:27 UTC

Created attachment 1093838
named.run.9

named.run.9 everything working fine

Comment 5 duncan 2015-11-13 20:41:08 UTC

Created attachment 1093839
named.run.8

named.run.8 SERVFAIL reported

Comment 6 duncan 2015-11-13 20:42:23 UTC

Uploaded named.run.9 and named.run.8.

Working fine as per log named.run.9, then starts returning SERVFAIL in named.run.8.

Hope this helps, let me know if you need anything else.

Comment 7 duncan 2015-11-18 15:12:54 UTC

Example (not related to the named.run files already uploaded)

version: 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4
CPUs found: 2
worker threads: 2
number of zones: 19
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/0/1000
tcp clients: 0/100
server is up and running

Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host yahoo.com not found: 2(SERVFAIL)

version: 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4
CPUs found: 2
worker threads: 2
number of zones: 19
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/0/1000
tcp clients: 0/100
server is up and running

Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host google.com not found: 2(SERVFAIL)
named status

version: 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4
CPUs found: 2
worker threads: 2
number of zones: 19
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/0/1000
tcp clients: 0/100
server is up and running

Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases: 

yahoo.com has address 206.190.36.45
yahoo.com has address 98.139.183.24
yahoo.com has address 98.138.253.109
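To compare status blocks like the ones above between a working and a failing moment, a small parser can help. The following is a sketch, not an official tool: the function names are made up, and naming the three "recursive clients" numbers in-use, soft limit, and hard limit is my assumption about the field layout.

```python
def parse_status(text):
    """Turn `key: value` lines into a dict; other lines are skipped."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            status[key.strip()] = value.strip()
    return status


def recursive_clients(status):
    """Split the `recursive clients` field into three integers.

    Assumed meaning: (in use, soft limit, hard limit).
    """
    in_use, soft, hard = (int(n) for n in status["recursive clients"].split("/"))
    return in_use, soft, hard


SAMPLE = """\
version: 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4
debug level: 0
recursive clients: 0/0/1000
tcp clients: 0/100
server is up and running
"""

status = parse_status(SAMPLE)
print(status["debug level"], recursive_clients(status))
```

In the failing samples the first number is 0, so the SERVFAIL responses do not coincide with the recursive client limit being reached.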

Comment 11 Tomáš Hozza 2016-08-16 10:52:44 UTC

Red Hat Enterprise Linux version 6 is entering the Production 2 phase of its lifetime and this bug doesn't meet the criteria for it, i.e. only high severity issues will be fixed. Please see https://access.redhat.com/support/policy/updates/errata/ for further information.

Comment 12 duncan 2016-09-08 09:28:03 UTC

I have since found that in an IPv6-only environment BIND stops resolving altogether and has to be restarted. This can occur on a very lightly loaded name server and results in restarting the name server every few hours.

As with the above problem, downgrading to 9.8.2-0.30.rc1.el6.i686 fixes the issue.

Both issues can be reproduced in the latest version of BIND available with Red Hat Enterprise Linux version 6.

Comment 13 Tomáš Hozza 2016-09-09 07:06:20 UTC

(In reply to duncan from comment #12)
> I have since found that in an ipv6 only environment BIND stops resolving
> altogether and has to be restarted. This can occur on a very lightly loaded
> name server and results in restarting the name server every few hours.
> 
> As with the above problem downgrading to 9.8.2-0.30.rc1.el6.i686 fixes the
> issue.
> 
> Both issues can be reproduced in the latest version of BIND available with
> Red Hat Enterprise Linux version 6

Thank you for the information.

Comment 15 Tomáš Hozza 2017-09-05 14:27:55 UTC

Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:
http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for the Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification.  Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com
