RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1409321 - [bind-pkcs11] SERVFAIL on some lookups
Summary: [bind-pkcs11] SERVFAIL on some lookups
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: bind
Version: 8.5
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: rc
: 8.7
Assignee: Petr Menšík
QA Contact: rhel-cs-infra-services-qe
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-12-31 16:08 UTC by Brian J. Murrell
Modified: 2023-09-15 01:25 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-31 07:27:50 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-112824 0 None None None 2022-02-18 12:22:48 UTC

Description Brian J. Murrell 2016-12-31 16:08:19 UTC
Description of problem:
named-pkcs11 seems to get into a state where it returns SERVFAIL for certain names and needs to be restarted to clear that state.

Version-Release number of selected component (if applicable):
bind-pkcs11-9.9.4-38.el7_3.x86_64

How reproducible:
Intermittently

Steps to Reproduce:
1. Unknown
2.
3.

Actual results:
SERVFAIL

Expected results:
Should return info, like it will if simply restart it

Additional info:
# dig @localhost lh3.googleusercontent.com

; <<>> DiG 9.9.4-RedHat-9.9.4-38.el7_3 <<>> @localhost lh3.googleusercontent.com
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 36005
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;lh3.googleusercontent.com.     IN      A

;; Query time: 9 msec
;; SERVER: ::1#53(::1)
;; WHEN: Sat Dec 31 11:05:11 EST 2016
;; MSG SIZE  rcvd: 54

# dig @localhost lh3.googleusercontent.com

; <<>> DiG 9.9.4-RedHat-9.9.4-38.el7_3 <<>> @localhost lh3.googleusercontent.com
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33131
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;lh3.googleusercontent.com.     IN      A

;; ANSWER SECTION:
lh3.googleusercontent.com. 86400 IN     CNAME   googlehosted.l.googleusercontent.com.
googlehosted.l.googleusercontent.com. 300 IN A  216.58.219.193

;; AUTHORITY SECTION:   
googleusercontent.com.  172800  IN      NS      ns3.google.com.
googleusercontent.com.  172800  IN      NS      ns4.google.com.
googleusercontent.com.  172800  IN      NS      ns2.google.com.
googleusercontent.com.  172800  IN      NS      ns1.google.com.

;; ADDITIONAL SECTION:  
ns2.google.com.         172800  IN      A       216.239.34.10
ns1.google.com.         172800  IN      A       216.239.32.10
ns3.google.com.         172800  IN      A       216.239.36.10
ns4.google.com.         172800  IN      A       216.239.38.10

;; Query time: 131 msec 
;; SERVER: ::1#53(::1)  
;; WHEN: Sat Dec 31 11:05:57 EST 2016
;; MSG SIZE  rcvd: 242  

This has only been happening since upgrading from bind-pkcs11-32:9.9.4-29.el7_2.4.x86_64

Comment 3 Brian J. Murrell 2017-01-02 18:44:12 UTC
It seems also, if I wait a while the SERVFAILs will go away and named will return a result for the given name being looked up.

Comment 5 Petr Menšík 2017-09-25 14:00:38 UTC
Can you update to the latest version? There were some security fixes in handling CNAME replies that could fix this issue.

Such errors are usually connected with DNSSEC verification failures. In this case it is more likely that something in cache prevents fetching correct result. I need more details from your cache especially for names that are likely served by different servers in different parts of the world.

I think rndc commands would help here. Tracing can be enabled runtime by "rndc trace". More details would be logged into named log. Run it more times for more details. When query is received, much more details about its processing would be logged into /var/named/data/named.run.

When enough information, turn off tracing by "rndc notrace".

Useful would be also at least part of cache. It can be dumped by "rndc dumpdb". Named will then write cache contents into /var/named/data/cache_dump.db. At least parts of cache when it reports sigfail would be useful. For 

$ grep google -B1  /var/named/data/cache_dump.db

Without any relevant logs or way to reproduce, there is little I can do to fix this bug.

Comment 6 Brian J. Murrell 2017-09-25 14:09:53 UTC
> Can you update to the latest version?

Is the bind-pkcs11-9.9.4-50.el7_3.1 that is in the latest 7.3 sufficient or is the bind-pkcs11-9.9.4-51.el7 that is in 7.4 required?  I have not done my 7.4 validation testing yet so can't jump up that far quite yet.

As for getting trace data and cache dumps, I guess that will have to wait until the next occurrence as I also have no way to reproduce this on demand.

Comment 7 Brian J. Murrell 2017-12-13 12:32:37 UTC
Nothing terribly interesting I'm afraid.  For example:

$ dig github.com

; <<>> DiG 9.11.1-P3-RedHat-9.11.1-3.P3.fc26 <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 50945
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;github.com.			IN	A

;; Query time: 1 msec
;; SERVER: 10.75.22.247#53(10.75.22.247)
;; WHEN: Tue Dec 12 10:24:56 EST 2017
;; MSG SIZE  rcvd: 39

# grep github.com /var/named/data/named.run
12-Dec-2017 09:53:04.141 createfetch: api.github.com A
12-Dec-2017 09:53:04.142 createfetch: api.github.com AAAA
12-Dec-2017 10:08:36.895 createfetch: github.com A
12-Dec-2017 10:08:36.895 createfetch: github.com AAAA
12-Dec-2017 10:19:54.915 createfetch: api.github.com A
12-Dec-2017 10:19:54.916 createfetch: github.com AAAA
12-Dec-2017 10:19:54.918 createfetch: api.github.com AAAA
12-Dec-2017 10:19:54.918 createfetch: github.com A
12-Dec-2017 10:19:54.924 createfetch: assets-cdn.github.com A
12-Dec-2017 10:19:54.924 createfetch: assets-cdn.github.com AAAA
12-Dec-2017 10:19:58.258 createfetch: live.github.com A
12-Dec-2017 10:19:58.258 createfetch: live.github.com AAAA
12-Dec-2017 10:21:25.194 createfetch: github.com A
12-Dec-2017 10:23:42.622 createfetch: github.com AAAA
12-Dec-2017 10:23:42.625 client 10.75.22.215#52230 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:23:42.626 createfetch: github.com A
12-Dec-2017 10:23:42.626 createfetch: api.github.com A
12-Dec-2017 10:23:42.626 client 10.75.22.215#39721 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:23:42.627 createfetch: api.github.com AAAA
12-Dec-2017 10:23:42.628 createfetch: assets-cdn.github.com AAAA
12-Dec-2017 10:23:42.628 client 10.75.22.215#47590 (api.github.com): query failed (SERVFAIL) for api.github.com/IN/A at query.c:7007
12-Dec-2017 10:23:42.628 client 10.75.22.215#51376 (api.github.com): query failed (SERVFAIL) for api.github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:23:42.629 createfetch: assets-cdn.github.com A
12-Dec-2017 10:23:42.629 client 10.75.22.215#59679 (assets-cdn.github.com): query failed (SERVFAIL) for assets-cdn.github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:23:42.630 client 10.75.22.215#37076 (assets-cdn.github.com): query failed (SERVFAIL) for assets-cdn.github.com/IN/A at query.c:7007
12-Dec-2017 10:23:43.858 createfetch: assets-cdn.github.com.ilinx A
12-Dec-2017 10:23:43.858 createfetch: assets-cdn.github.com.ilinx AAAA
12-Dec-2017 10:23:43.861 createfetch: github.com.ilinx A
12-Dec-2017 10:23:43.863 createfetch: github.com.ilinx AAAA
12-Dec-2017 10:23:43.867 createfetch: api.github.com.ilinx A
12-Dec-2017 10:23:43.868 createfetch: api.github.com.ilinx AAAA
12-Dec-2017 10:23:45.049 createfetch: github.com AAAA
12-Dec-2017 10:23:45.049 createfetch: github.com A
12-Dec-2017 10:23:45.049 client 10.75.22.215#39076 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:23:45.049 client 10.75.22.215#46895 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:23:50.059 createfetch: github.com A
12-Dec-2017 10:23:50.060 createfetch: github.com AAAA
12-Dec-2017 10:23:50.060 client 10.75.22.215#60499 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:23:50.060 client 10.75.22.215#47494 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:24:09.608 createfetch: github.com A
12-Dec-2017 10:24:09.608 client 10.75.22.1#40614 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:24:20.071 createfetch: github.com A
12-Dec-2017 10:24:20.071 createfetch: github.com AAAA
12-Dec-2017 10:24:20.071 client 10.75.22.215#53790 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:24:20.071 client 10.75.22.215#36491 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:24:56.184 createfetch: github.com A
12-Dec-2017 10:24:56.185 client 10.75.22.1#41427 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:25:20.085 createfetch: github.com AAAA
12-Dec-2017 10:23:45.049 client 10.75.22.215#39076 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:23:45.049 client 10.75.22.215#46895 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:23:50.059 createfetch: github.com A
12-Dec-2017 10:23:50.060 createfetch: github.com AAAA
12-Dec-2017 10:23:50.060 client 10.75.22.215#60499 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:23:50.060 client 10.75.22.215#47494 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:24:09.608 createfetch: github.com A
12-Dec-2017 10:24:09.608 client 10.75.22.1#40614 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:24:20.071 createfetch: github.com A
12-Dec-2017 10:24:20.071 createfetch: github.com AAAA
12-Dec-2017 10:24:20.071 client 10.75.22.215#53790 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:24:20.071 client 10.75.22.215#36491 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:24:56.184 createfetch: github.com A
12-Dec-2017 10:24:56.185 client 10.75.22.1#41427 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:25:20.085 createfetch: github.com AAAA
12-Dec-2017 10:25:20.086 createfetch: github.com A
12-Dec-2017 10:25:20.086 client 10.75.22.215#33357 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:25:20.086 client 10.75.22.215#59591 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:26:48.597 createfetch: github.com AAAA
12-Dec-2017 10:26:48.598 createfetch: github.com A
12-Dec-2017 10:26:48.598 client 10.75.22.215#55990 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007
12-Dec-2017 10:26:48.598 client 10.75.22.215#35485 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007
12-Dec-2017 10:26:56.082 createfetch: github.com AAAA
12-Dec-2017 10:26:56.083 createfetch: github.com A

# grep github -B1  /var/named/data/cache_dump.db
; answer
github.com.ilinx.       10759   \-ANY   ;-$NXDOMAIN
--
; answer
api.github.com.ilinx.   10759   \-ANY   ;-$NXDOMAIN
--
; answer
assets-cdn.github.com.ilinx. 10759 \-ANY ;-$NXDOMAIN
--
; answer
collector.githubapp.com.ilinx. 10759 \-ANY ;-$NXDOMAIN
--
; answer
camo.githubusercontent.com.ilinx. 10759 \-ANY ;-$NXDOMAIN

Comment 8 Brian J. Murrell 2017-12-13 12:41:26 UTC
It seems to be happen quite frequently with google domains in the last few days.  But not limited to google domains.  Happened with bugzilla.redhat.com just as I was trying to submit this comment since it was happening right this minute.

Comment 9 Brian J. Murrell 2017-12-16 15:53:15 UTC
Any more ideas here?  This is happening very frequently (many many times per day) with very frequently used domains such as google domains and is really starting to annoy the users.

Interestingly, when it does start happening, something as simple as "rndc reload" will make the problem stop, but obviously that is not a workable solution or workaround even as it requires somebody to be babysitting, waiting for the server to start failing and issuing the "rndc reload".

Comment 10 Brian J. Murrell 2017-12-16 15:54:26 UTC
Additionally, since it is happening so frequently, there is ample opportunity to do more debugging and/or information gathering.  Happy to provide whatever is needed to help get to the bottom of this.

Comment 11 Brian J. Murrell 2018-01-17 12:08:17 UTC
Can we please try to make some progress on this?  Daily, and most frequently it's google.com domains that return SERVFAIL.  That wreaks havoc on a userbase rife with Android phones, Chromecasts, Google Homes, etc.

Just let me know what more information is needed and I'm happy to gather it.

But without being able to fix this, the RHEL7 named is big time impediment in my network, not an asset.

Comment 12 Brian J. Murrell 2018-01-17 12:09:32 UTC
I should add, issuing an "rndc reload" *always* rectifies the situation.  But I can't be sitting connected to my IPA server 24/7 just waiting to issue an "rndc reload" when people start complaining that they can't reach Google again.

Comment 13 Brian J. Murrell 2018-01-17 12:54:07 UTC
More detailed trace when SERVFAILs are happening:

17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: UDP request
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: using view '_default'
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: request is not signed
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: recursion available
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: query
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query (cache) 'connectivitycheck.gstatic.com/A/IN' approved
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): replace
17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: get client
17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: recycle
17-Jan-2018 07:47:20.948 createfetch: connectivitycheck.gstatic.com A
17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): create
17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): join
17-Jan-2018 07:47:20.948 fetch 0x7f102112cd58 (fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A)): created
17-Jan-2018 07:47:20.948 client @0x7f1012522cf0: udprecv
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: UDP request
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: using view '_default'
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: request is not signed
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: recursion available
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: query
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query (cache) 'connectivitycheck.gstatic.com/AAAA/IN' approved
17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): replace
17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: get client
17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: recycle
17-Jan-2018 07:47:20.948 createfetch: connectivitycheck.gstatic.com AAAA
17-Jan-2018 07:47:20.948 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): create
17-Jan-2018 07:47:20.948 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): join
17-Jan-2018 07:47:20.948 fetch 0x7f1012542270 (fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA)): created
17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): start
17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): try
17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): cancelqueries
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): getaddresses
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): no addresses
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): done
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): stopeverything
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): cancelqueries
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): sendevents
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): start
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): try
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): cancelqueries
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): getaddresses
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): no addresses
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): done
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): stopeverything
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): cancelqueries
17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): sendevents
17-Jan-2018 07:47:20.949 client @0x7f1022ae1090: udprecv
17-Jan-2018 07:47:20.952 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query failed (SERVFAIL) for connectivitycheck.gstatic.com/IN/A at query.c:7007
17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): error
17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): send
17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): sendto
17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): senddone
17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): next
17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): endrequest
17-Jan-2018 07:47:20.954 fetch completed at resolver.c:3074 for connectivitycheck.gstatic.com/A in 0.000691: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0]
17-Jan-2018 07:47:20.954 fetch 0x7f102112cd58 (fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A)): destroyfetch
17-Jan-2018 07:47:20.954 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): shutdown
17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query failed (SERVFAIL) for connectivitycheck.gstatic.com/IN/AAAA at query.c:7007
17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): error
17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): send
17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): sendto
17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): senddone
17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): next
17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): endrequest
17-Jan-2018 07:47:20.955 fetch completed at resolver.c:3074 for connectivitycheck.gstatic.com/AAAA in 0.000568: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0]
17-Jan-2018 07:47:20.956 fetch 0x7f1012542270 (fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA)): destroyfetch
17-Jan-2018 07:47:20.956 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): shutdown
17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): doshutdown
17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): stopeverything
17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): cancelqueries
17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): unlink
17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): destroy
17-Jan-2018 07:47:20.956 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): doshutdown
17-Jan-2018 07:47:20.956 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): stopeverything
17-Jan-2018 07:47:20.957 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): cancelqueries
17-Jan-2018 07:47:20.957 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): unlink
17-Jan-2018 07:47:20.957 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): destroy
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: UDP request
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: using view '_default'
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: request is not signed
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: recursion available
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: query
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852 (www.gstatic.com): query (cache) 'www.gstatic.com/A/IN' approved
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852 (www.gstatic.com): replace
17-Jan-2018 07:47:20.998 clientmgr @0x7f102c4928a0: get client
17-Jan-2018 07:47:20.998 clientmgr @0x7f102c4928a0: recycle
17-Jan-2018 07:47:20.998 createfetch: www.gstatic.com A
17-Jan-2018 07:47:20.998 fctx 0x7f1012b87290(www.gstatic.com/A): create
17-Jan-2018 07:47:20.998 fctx 0x7f1012b87290(www.gstatic.com/A): join
17-Jan-2018 07:47:20.998 fetch 0x7f1012542270 (fctx 0x7f1012b87290(www.gstatic.com/A)): created
17-Jan-2018 07:47:20.998 client @0x7f102233f240: udprecv
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: UDP request
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: using view '_default'
17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: request is not signed
17-Jan-2018 07:47:20.999 client 10.75.22.201#32852: recursion available
17-Jan-2018 07:47:20.999 client 10.75.22.201#32852: query
17-Jan-2018 07:47:20.999 client 10.75.22.201#32852 (www.gstatic.com): query (cache) 'www.gstatic.com/AAAA/IN' approved
17-Jan-2018 07:47:20.999 client 10.75.22.201#32852 (www.gstatic.com): replace
17-Jan-2018 07:47:20.999 clientmgr @0x7f102c4928a0: get client
17-Jan-2018 07:47:20.999 clientmgr @0x7f102c4928a0: recycle
17-Jan-2018 07:47:20.999 createfetch: www.gstatic.com AAAA
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): create
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): join
17-Jan-2018 07:47:20.999 fetch 0x7f102112cd58 (fctx 0x7f1012b9be50(www.gstatic.com/AAAA)): created
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): start
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): try
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): cancelqueries
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): getaddresses
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): no addresses
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): done
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): stopeverything
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): cancelqueries
17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): sendevents
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): start
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): try
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): cancelqueries
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): getaddresses
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): no addresses
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): done
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): stopeverything
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): cancelqueries
17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): sendevents
17-Jan-2018 07:47:20.999 client @0x7f1012573140: udprecv
17-Jan-2018 07:47:21.001 client 10.75.22.201#32852 (www.gstatic.com): query failed (SERVFAIL) for www.gstatic.com/IN/A at query.c:7007
17-Jan-2018 07:47:21.002 client 10.75.22.201#32852 (www.gstatic.com): error
17-Jan-2018 07:47:21.003 client 10.75.22.201#32852 (www.gstatic.com): send
17-Jan-2018 07:47:21.003 client 10.75.22.201#32852 (www.gstatic.com): sendto
17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): senddone
17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): next
17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): endrequest
17-Jan-2018 07:47:21.004 fetch completed at resolver.c:3074 for www.gstatic.com/A in 0.000683: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0]
17-Jan-2018 07:47:21.004 fetch 0x7f1012542270 (fctx 0x7f1012b87290(www.gstatic.com/A)): destroyfetch
17-Jan-2018 07:47:21.004 fctx 0x7f1012b87290(www.gstatic.com/A): shutdown
17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): query failed (SERVFAIL) for www.gstatic.com/IN/AAAA at query.c:7007
17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): error
17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): send
17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): sendto
17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): senddone
17-Jan-2018 07:47:21.006 client 10.75.22.201#32852 (www.gstatic.com): next
17-Jan-2018 07:47:21.006 client 10.75.22.201#32852 (www.gstatic.com): endrequest
17-Jan-2018 07:47:21.006 fetch completed at resolver.c:3074 for www.gstatic.com/AAAA in 0.000591: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0]
17-Jan-2018 07:47:21.006 fetch 0x7f102112cd58 (fctx 0x7f1012b9be50(www.gstatic.com/AAAA)): destroyfetch
17-Jan-2018 07:47:21.007 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): shutdown
17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): doshutdown
17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): stopeverything
17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): cancelqueries
17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): unlink
17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): destroy
17-Jan-2018 07:47:21.007 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): doshutdown
17-Jan-2018 07:47:21.008 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): stopeverything
17-Jan-2018 07:47:21.008 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): cancelqueries
17-Jan-2018 07:47:21.008 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): unlink
17-Jan-2018 07:47:21.009 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): destroy

Is that at all helpful?

Comment 14 Tomáš Hozza 2018-01-17 13:36:50 UTC
(In reply to Brian J. Murrell from comment #11)
> Can we please try to make some progress on this?  Daily, and most frequently
> it's google.com domains that return SERVFAIL.  That wreaks havoc on a
> userbase rife with Android phones, Chromecasts, Google Homes, etc.
> 
> Just let me know what more information is needed and I'm happy to gather it.
> 
> But without being able to fix this, the RHEL7 named is big time impediment
> in my network, not an asset.

Thank you for taking the time to report this issue to us. We appreciate the feedback and use reports such as this one to guide our efforts at improving our products. That being said, this bug tracking system is not a mechanism for requesting support, and we are not able to guarantee the timeliness or suitability of a resolution.

If this issue is critical or in any way time sensitive, please raise a ticket through the regular Red Hat support channels to ensure it receives the proper attention and prioritization to assure a timely resolution. 

For information on how to contact the Red Hat production support team, please visit:
    https://www.redhat.com/support/process/production/#howto

Currently our team is busy with higher priority issues and we will get to this bug on "best effort" basis.

Comment 15 Martin Osvald 🛹 2019-07-08 15:30:49 UTC
Hello,

my name is Martin Osvald and I am helping my colleague with bind cases which are old or not touched for a very long time.

I hope you are still able and willing to help me move this BZ further.


First of all, I would like to ask you whether you are still facing the same problem.

* If yes, please, kindly respond to all the below questions:

 1. Which version of bind do you currently have installed?
 2. How long does it usually take for the problem to show up after you (re)start bind?
 3. Would you be able to install & run an altered version of bind which would help with the investigation? (to get rid of internal timeouts to be able to use gdb for debugging)
 4. Would you be able to disable temporarily (for a very short time - minutes) any dns traffic coming to your machine except localhost when the problem happens again? (to log a sequence of called functions by gdb and not to be influenced by other clients)
 5. Would you be able and willing to continually run/test several bind packages based on a very old version? (to test which patch introduced this behavior)

I am asking the above questions so that I can prepare appropriate steps to investigate this further as the info/logs provided so far is not enough to find the real culprit.

Only note: If you don't reply within 1 month, I will (have to) close this BZ as WORKSFORME.



The below are technical details from the investigation (you can skip to read them, only if you are interested, they show why I am asking the above questions):

Short version:

According to the log snippet from comment 13 the code iterated 4 times between fctx_getaddresses() and dns_adb_createfind2() failing somewhere very likely shortly after or inside new_adbfind(). Looks like a problem with memory allocation (or related mutexes), new_adbfind() returning the same address (0x7f1020a40550) every time looks really suspicious/wrong.

Long version:

I went through all the comments in this BZ and can pinpoint the important info from them:

1. The last working version was 9.9.4-29.el7_2.4 and first non-working 9.9.4-38.el7_3:

~~~
Version-Release number of selected component (if applicable):
bind-pkcs11-9.9.4-38.el7_3.x86_64
...
This has only been happening since upgrading from bind-pkcs11-32:9.9.4-29.el7_2.4.x86_64
~~~

It is unfortunate that at the time of opening this BZ we could have provided you and asked you to run several versions of bind every time without one of the patches added between these two versions.

I have checked manually (and automatically comparing usually executed functions with the ones changed by these patches) which of the below set of patches could be responsible:

bind99-CVE-2015-8000.patch
bind99-CVE-2015-8704.patch
bind99-CVE-2016-1285-CVE-2016-1286.patch
bind99-rh1291185.patch
bind99-rh1259514.patch
bind99-rh1306610.patch
bind99-rh1220594-geoip.patch
bind99-automatic-interface-scanning-rh1294506.patch
bind99-CVE-2016-2776.patch
bind99-CVE-2016-8864.patch

but couldn't match the changed code with the behavior the provided logs are showing even some of the changes get executed on the way.

2. The below log snippet from comment 13:

~~~
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): getaddresses
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): no addresses
...
17-Jan-2018 07:47:20.952 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query failed (SERVFAIL) for connectivitycheck.gstatic.com/IN/A at query.c:7007
...
17-Jan-2018 07:47:20.954 fetch completed at resolver.c:3074 for connectivitycheck.gstatic.com/A in 0.000691: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0]
~~~

vs snippet when it works (captured on my testing machine):

~~~
08-Jul-2019 13:20:29.788 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): getaddresses
08-Jul-2019 13:20:29.788 expiring v4 for name 0x7fbe79677990
08-Jul-2019 13:20:29.788 expiring v6 for name 0x7fbe79677990
08-Jul-2019 13:20:29.788 dns_adb_createfind: found A for name 0x7fbe79677990 in db
08-Jul-2019 13:20:29.788 dns_adb_createfind: found AAAA for name 0x7fbe79677990
08-Jul-2019 13:20:29.788 expiring v4 for name 0x7fbe79677860
08-Jul-2019 13:20:29.788 expiring v6 for name 0x7fbe79677860
08-Jul-2019 13:20:29.788 dns_adb_createfind: found A for name 0x7fbe79677860 in db
08-Jul-2019 13:20:29.788 dns_adb_createfind: found AAAA for name 0x7fbe79677860
08-Jul-2019 13:20:29.788 expiring v4 for name 0x7fbe79677730
08-Jul-2019 13:20:29.788 expiring v6 for name 0x7fbe79677730
08-Jul-2019 13:20:29.788 dns_adb_createfind: found A for name 0x7fbe79677730 in db
08-Jul-2019 13:20:29.788 dns_adb_createfind: found AAAA for name 0x7fbe79677730
08-Jul-2019 13:20:29.789 expiring v4 for name 0x7fbe79677600
08-Jul-2019 13:20:29.789 expiring v6 for name 0x7fbe79677600
08-Jul-2019 13:20:29.789 dns_adb_createfind: found A for name 0x7fbe79677600 in db
08-Jul-2019 13:20:29.789 dns_adb_createfind: found AAAA for name 0x7fbe79677600
08-Jul-2019 13:20:29.789 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): query
08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): send
08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): sent
08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): udpconnected
08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): senddone
08-Jul-2019 13:20:29.808 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): response
08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): answer_response
08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): cache_message
08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): findnoqname
08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): clone_results
08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): cancelquery
08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): done
08-Jul-2019 13:20:29.809 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): stopeverything
08-Jul-2019 13:20:29.809 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): cancelqueries
08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe79675c40
08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe796754c0
08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe7967d790
08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe79675e20
~~~

I spent a few days investigating the related code, but unfortunately, the provided log lines are not enough to trace every corner of the code which could return with failure.

The main points here are that there are only "dns_adb_destroyfind" lines with the same address 0x7f1020a40550 between "getaddresses" and "no addresses" and the "adberr:4".

This means the code iterated 4 times (processing 4 nameservers for gstatic.com) prematurely ending somewhere inside dns_adb_createfind2() or functions called by it. The really strange thing is the same address 0x7f1020a40550 returned by new_adbfind(). It is normally every time different.

The usual sequence of called functions involved in this iteration (from my testing machine using gdb):

clone -> start_thread -> run -> dispatch -> resquery_response -> fctx_try -> fctx_getaddresses -> findname -> dns_adb_createfind2 -> new_adbfind()

According to the mentioned log snippet from comment 13 the code iterated 4 times between fctx_getaddresses() and dns_adb_createfind2() failing somewhere very likely shortly after or inside new_adbfind(). Looks like a problem with memory allocation (or related mutexes).

We can deduce it from the log functions and code from the below snippets (from version 9.9.4-38.el7_3):

~~~
2580 static isc_result_t
2581 fctx_getaddresses(fetchctx_t *fctx, isc_boolean_t badcache) {
...
2593         FCTXTRACE("getaddresses");
...
2739                 findname(fctx, &ns.name, 0, stdoptions, 0, now,
2740                          &need_alternate);
...
2811                         FCTXTRACE("no addresses");

2463 static void
2464 findname(fetchctx_t *fctx, dns_name_t *name, in_port_t port,
2465          unsigned int options, unsigned int flags, isc_stdtime_t now,
2466          isc_boolean_t *need_alternate)
2467 {
...
2491         find = NULL;
2492         result = dns_adb_createfind2(fctx->adb,
2493                                      res->buckets[fctx->bucketnum].task,
2494                                      fctx_finddone, fctx, name,
2495                                      &fctx->name, fctx->type,
2496                                      options, now, NULL,
2497                                      res->view->dstport,
2498                                      fctx->depth + 1, fctx->qc, &find);
2499         if (result != ISC_R_SUCCESS) {
...
2527         } else {
...
2548                 } else {
...
2551                         else
2552                                 fctx->adberr++; /* unreachable server, etc. */
...
2565                         dns_adb_destroyfind(&find);

3122 void    
3123 dns_adb_destroyfind(dns_adbfind_t **findp) {
...
3137         DP(DEF_LEVEL, "dns_adb_destroyfind on find %p", find);

2779 isc_result_t
2780 dns_adb_createfind2(dns_adb_t *adb, isc_task_t *task, isc_taskaction_t action,
2781                     void *arg, dns_name_t *name, dns_name_t *qname,
2782                     dns_rdatatype_t qtype, unsigned int options,
2783                     isc_stdtime_t now, dns_name_t *target,
2784                     in_port_t port, unsigned int depth, isc_counter_t *qc,
2785                     dns_adbfind_t **findp)
2786 {
...
2840         find = new_adbfind(adb);
~~~

The reason why I think the code returned/failed "very likely shortly after or inside new_adbfind()" is that we would see other logging messages between 

~~~
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): getaddresses
...
17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): no addresses
~~~

I mean messages like:

~~~
2927                         DP(DEF_LEVEL,
2928                            "dns_adb_createfind: found A for name %p in db",
2929                            adbname);

or

3019                         DP(DEF_LEVEL,
3020                            "dns_adb_createfind: started A fetch for name %p",
3021                            adbname);
~~~

than only the below one:

~~~
17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550
~~~

because they are printed every time at the same trace level 5 when everything works correctly.

Comment 16 Tomáš Hozza 2020-01-06 12:19:36 UTC
Red Hat Enterprise Linux version 7 entered the Maintenance Support 1 phase in August 2019. In this phase only qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available. Other errata advisories may be delivered as appropriate.

This bug has been reviewed by Support and Engineering representative and does not meet the inclusion criteria for Maintenance Support 1 phase. Also information requested in comment #15 was never provided.

For more information about Red Hat Enterprise Linux Lifecycle, please see https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_1_Phase.

Comment 17 Brian J. Murrell 2022-02-18 12:21:07 UTC
I'm reopening this since this continues to happen on EL8.5 also, using bind-pkcs11-9.11.26-6.el8.x86_64.  I don't seem to have permission to change the Product to RHEL 8 though.

Answering the questions from comment#15:

> 1. Which version of bind do you currently have installed?

bind-pkcs11-9.11.26-6.el8.x86_64

> 2. How long does it usually take for the problem to show up after you (re)start bind?

It's not really related to when bind is started.  It seems to happen for a period of time after a DDNS update is done, which fairly regularly as the DHCP server is doing DDNS updates when clients acquire leases.

 3. Would you be able to install & run an altered version of bind which would help with the investigation?

Yes.

 4. Would you be able to disable temporarily (for a very short time - minutes) any dns traffic coming to your machine except localhost when the problem happens again?

I'm afraid not, no.  Besides, given the intermittent and unpredictable behaviour, the machine needs to process queries for the situation to even be discovered.

 5. Would you be able and willing to continually run/test several bind packages based on a very old version?

Probably not, given that this bind instance is exposed to the Internet and running old (and presumably insecure) versions of bind would expose me to security risks.

Comment 21 Petr Menšík 2022-03-15 11:18:16 UTC
Switching to RHEL8, RHEL7 fix might be considered only when existing issue is confirmed and is serious enough.

It would be helpful if cache dump containing failing name and its parent would be made. rndc flushname <name> or rndc flushtree <name> should help if only few names are failing.

Servfail can be caused by multiple issues. Unfortunately it does not provide details with a reason into reply, best answer should be found in server log.

Brian, can you supply a recent log with RHEL8 version?

adberr is increased in case name server for domain points to cname [1], which is considered invalid. Also when loop is detected [2]. It would log it when it happens. Other case is generic failure with remote server [3], which does not log anything specific. I expect this would be the reason here.

Do we have tcpdump proof that response from remote server delivered a response when SERVFAIL happens? It might be valid result if remote network connectivity is broken for a while. If bad server is marked, some reason should be logged on first query. It would report errors just from cache.

1. https://gitlab.isc.org/isc-projects/bind9/-/blob/main/lib/dns/resolver.c#L3344
2. https://gitlab.isc.org/isc-projects/bind9/-/blob/main/lib/dns/resolver.c#L3390
3. https://gitlab.isc.org/isc-projects/bind9/-/blob/main/lib/dns/resolver.c#L3440

Comment 28 RHEL Program Management 2022-08-31 07:27:50 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 29 Red Hat Bugzilla 2023-09-15 01:25:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days


Note You need to log in before you can comment on or make changes to this bug.