Bug 1281484

Summary: bind stops resolving - flush cache clears issue
Product: Red Hat Enterprise Linux 6
Reporter: duncan
Component: bind
Assignee: Petr Menšík <pemensik>
Status: CLOSED WONTFIX
QA Contact: qe-baseos-daemons
Severity: medium
Docs Contact:
Priority: low
Version: 6.7
CC: duncan, pemensik, thozza
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-05 14:27:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description    Flags
named.run.9    none
named.run.8    none

Description duncan 2015-11-12 15:57:39 UTC
Description of problem:

Periodically, bind (bind-9.8.2-0.37.rc1.el6_7.4.i686) stops resolving and returns "Host xxxxxxx.com not found: 2(SERVFAIL)". A cache flush (/usr/sbin/rndc flush) clears the issue and bind starts resolving again.

Version-Release number of selected component (if applicable):

9.8.2-0.37.rc1.el6_7.4.i686

How reproducible:

Not readily reproducible, but it happens randomly and often on 2 servers running this version of bind.

Downgrading to 9.8.2-0.30.rc1.el6.i686 resolves the issue. I have 23 servers running that older version with no issues since downgrading. All servers are configured exactly the same. If I upgrade one of the working servers, it exhibits the issue.

Additional info:

I can detect the issue with reasonable ease via a cron job. When it next happens, I will attempt to retain a cache dump taken at the time of failure and one taken a few minutes before it.

Please advise what you would like me to capture and how to do it.

Comment 2 Tomáš Hozza 2015-11-13 09:09:46 UTC
Hello.

Please increase the debug level using 'rndc trace 99' OR by adding '-d 99' on the command line when starting named. Then please attach the log from named (by default stored in /var/named/data/named.run).

I presume you are using DNSSEC validation on the resolver, is that correct?

Is this happening for any domains until you flush the cache, or does named return SERVFAIL only for some particular domains?

Thank you in advance.

Comment 3 duncan 2015-11-13 12:14:51 UTC
I have disabled DNSSEC as I thought that was the issue.

dnssec-enable no;
dnssec-validation no;

It appears to happen for all domains until I flush the cache. My cron job first looks up one of my domains on 127.0.0.1. If that fails, it pings 8.8.8.8 to check for a network connection, then tries a lookup of google.com using 127.0.0.1. If that also fails, it finally attempts a lookup of my domain on 8.8.8.8 (which works). At that point it does a flush and then rechecks the lookup to confirm DNS is working again. Finally, it mails me the log, the cache at the time of the issue, and the cache at the last check.
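
For anyone wanting to reproduce the detection, the check sequence described above could be sketched roughly as follows. This is not the actual cron script: the domain example.com stands in for the reporter's own domain, the result labels are made up for illustration, and it assumes that `host` exits non-zero on a failed lookup such as SERVFAIL. The mailing and cache-dump steps are left out.

```python
import subprocess

LOCAL = "127.0.0.1"
PUBLIC = "8.8.8.8"
OWN_DOMAIN = "example.com"   # hypothetical placeholder for the reporter's domain
OTHER_DOMAIN = "google.com"


def run_ok(cmd):
    """Return True if the command runs and exits with status 0."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def host_resolves(name, server):
    # Assumption: `host` exits non-zero when the lookup fails (e.g. SERVFAIL).
    return run_ok(["host", name, server])


def ping_ok(address):
    return run_ok(["ping", "-c", "1", address])


def rndc_flush():
    run_ok(["/usr/sbin/rndc", "flush"])


def check(resolves, ping, flush):
    """Walk the check sequence and return a label for the outcome."""
    if resolves(OWN_DOMAIN, LOCAL):
        return "ok"
    if not ping(PUBLIC):
        return "network-down"          # not a resolver problem
    if resolves(OTHER_DOMAIN, LOCAL):
        return "own-domain-only"       # local named still answers other names
    if not resolves(OWN_DOMAIN, PUBLIC):
        return "fails-upstream-too"    # the domain itself is broken
    # Local named fails while 8.8.8.8 works: flush the cache and recheck.
    flush()
    return "recovered" if resolves(OWN_DOMAIN, LOCAL) else "still-failing"
```

A cron entry would call check(host_resolves, ping_ok, rndc_flush) and act on the returned label. Passing the three helpers in as arguments keeps the decision logic testable without touching a real name server.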

I will enable the trace and upload the log when I next have the issue. It generally happens 2 or 3 times a day, so it shouldn't be too long.

Comment 4 duncan 2015-11-13 20:39:27 UTC
Created attachment 1093838 [details]
named.run.9

named.run.9 everything working fine

Comment 5 duncan 2015-11-13 20:41:08 UTC
Created attachment 1093839 [details]
named.run.8

named.run.8 SERVFAIL reported

Comment 6 duncan 2015-11-13 20:42:23 UTC
Uploaded named.run.9 and named.run.8.

Everything is working fine in named.run.9; named then starts returning SERVFAIL in named.run.8.

Hope this helps, let me know if you need anything else.

Comment 7 duncan 2015-11-18 15:12:54 UTC
Example (not related to the named.run files already uploaded)

version: 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4
CPUs found: 2
worker threads: 2
number of zones: 19
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/0/1000
tcp clients: 0/100
server is up and running

Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host yahoo.com not found: 2(SERVFAIL)

version: 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4
CPUs found: 2
worker threads: 2
number of zones: 19
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/0/1000
tcp clients: 0/100
server is up and running

Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host google.com not found: 2(SERVFAIL)
named status

version: 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4
CPUs found: 2
worker threads: 2
number of zones: 19
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 0/0/1000
tcp clients: 0/100
server is up and running

Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases: 

yahoo.com has address 206.190.36.45
yahoo.com has address 98.139.183.24
yahoo.com has address 98.138.253.109
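
The status dumps above are plain "key: value" lines, so a monitoring job could parse them to rule out quota exhaustion when SERVFAIL appears. The following is a minimal sketch only. The function names are made up, and it assumes that the first and last numbers in "recursive clients: 0/0/1000" are the current count and the hard limit, which is my reading of the format rather than something stated in this report.

```python
def parse_status(text):
    """Parse 'key: value' lines of the status output into a dict.

    Lines without a ': ' separator (e.g. 'server is up and running')
    are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def recursive_clients(fields):
    """Return the three 'recursive clients' counters as a tuple of ints."""
    return tuple(int(n) for n in fields["recursive clients"].split("/"))


def quota_exhausted(fields):
    # Assumption: first counter is current usage, last is the hard limit.
    used, _, limit = recursive_clients(fields)
    return limit > 0 and used >= limit
```

With the dumps in this comment, the counters are 0/0/1000 at the time of each failed lookup, so under that reading the recursive client quota was not the cause.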

Comment 11 Tomáš Hozza 2016-08-16 10:52:44 UTC
Red Hat Enterprise Linux version 6 is entering the Production 2 phase of its lifetime and this bug doesn't meet the criteria for it, i.e. only high severity issues will be fixed. Please see https://access.redhat.com/support/policy/updates/errata/ for further information.

Comment 12 duncan 2016-09-08 09:28:03 UTC
I have since found that in an IPv6-only environment BIND stops resolving altogether and has to be restarted. This can occur on a very lightly loaded name server and results in restarting the name server every few hours.

As with the above problem, downgrading to 9.8.2-0.30.rc1.el6.i686 fixes the issue.

Both issues can be reproduced in the latest version of BIND available with Red Hat Enterprise Linux version 6.

Comment 13 Tomáš Hozza 2016-09-09 07:06:20 UTC
(In reply to duncan from comment #12)
> I have since found that in an IPv6-only environment BIND stops resolving
> altogether and has to be restarted. This can occur on a very lightly loaded
> name server and results in restarting the name server every few hours.
> 
> As with the above problem downgrading to 9.8.2-0.30.rc1.el6.i686 fixes the
> issue.
> 
> Both issues can be reproduced in the latest version of BIND available with
> Red Hat Enterprise Linux version 6

Thank you for the information.

Comment 15 Tomáš Hozza 2017-09-05 14:27:55 UTC
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:
http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for the Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification.  Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com