Bug 1409321
| Summary: | [bind-pkcs11] SERVFAIL on some lookups | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Brian J. Murrell <brian> |
| Component: | bind | Assignee: | Petr Menšík <pemensik> |
| Status: | CLOSED WONTFIX | QA Contact: | rhel-cs-infra-services-qe <rhel-cs-infra-services-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.5 | CC: | pemensik, psklenar, thozza |
| Target Milestone: | rc | Keywords: | Reopened, Triaged |
| Target Release: | 8.7 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-08-31 07:27:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Brian J. Murrell
2016-12-31 16:08:19 UTC
It seems also, if I wait a while the SERVFAILs will go away and named will return a result for the given name being looked up. Can you update to the latest version? There were some security fixes in handling CNAME replies that could fix this issue. Such errors are usually connected with DNSSEC verification failures. In this case it is more likely that something in cache prevents fetching correct result. I need more details from your cache especially for names that are likely served by different servers in different parts of the world. I think rndc commands would help here. Tracing can be enabled runtime by "rndc trace". More details would be logged into named log. Run it more times for more details. When query is received, much more details about its processing would be logged into /var/named/data/named.run. When enough information, turn off tracing by "rndc notrace". Useful would be also at least part of cache. It can be dumped by "rndc dumpdb". Named will then write cache contents into /var/named/data/cache_dump.db. At least parts of cache when it reports sigfail would be useful. For $ grep google -B1 /var/named/data/cache_dump.db Without any relevant logs or way to reproduce, there is little I can do to fix this bug. > Can you update to the latest version?
Is the bind-pkcs11-9.9.4-50.el7_3.1 that is in the latest 7.3 sufficient or is the bind-pkcs11-9.9.4-51.el7 that is in 7.4 required? I have not done my 7.4 validation testing yet so can't jump up that far quite yet.
As for getting trace data and cache dumps, I guess that will have to wait until the next occurrence as I also have no way to reproduce this on demand.
Nothing terribly interesting I'm afraid. For example: $ dig github.com ; <<>> DiG 9.11.1-P3-RedHat-9.11.1-3.P3.fc26 <<>> github.com ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 50945 ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 4096 ;; QUESTION SECTION: ;github.com. IN A ;; Query time: 1 msec ;; SERVER: 10.75.22.247#53(10.75.22.247) ;; WHEN: Tue Dec 12 10:24:56 EST 2017 ;; MSG SIZE rcvd: 39 # grep github.com /var/named/data/named.run 12-Dec-2017 09:53:04.141 createfetch: api.github.com A 12-Dec-2017 09:53:04.142 createfetch: api.github.com AAAA 12-Dec-2017 10:08:36.895 createfetch: github.com A 12-Dec-2017 10:08:36.895 createfetch: github.com AAAA 12-Dec-2017 10:19:54.915 createfetch: api.github.com A 12-Dec-2017 10:19:54.916 createfetch: github.com AAAA 12-Dec-2017 10:19:54.918 createfetch: api.github.com AAAA 12-Dec-2017 10:19:54.918 createfetch: github.com A 12-Dec-2017 10:19:54.924 createfetch: assets-cdn.github.com A 12-Dec-2017 10:19:54.924 createfetch: assets-cdn.github.com AAAA 12-Dec-2017 10:19:58.258 createfetch: live.github.com A 12-Dec-2017 10:19:58.258 createfetch: live.github.com AAAA 12-Dec-2017 10:21:25.194 createfetch: github.com A 12-Dec-2017 10:23:42.622 createfetch: github.com AAAA 12-Dec-2017 10:23:42.625 client 10.75.22.215#52230 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:23:42.626 createfetch: github.com A 12-Dec-2017 10:23:42.626 createfetch: api.github.com A 12-Dec-2017 10:23:42.626 client 10.75.22.215#39721 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:23:42.627 createfetch: api.github.com AAAA 12-Dec-2017 10:23:42.628 createfetch: assets-cdn.github.com AAAA 12-Dec-2017 10:23:42.628 client 10.75.22.215#47590 (api.github.com): query failed (SERVFAIL) for api.github.com/IN/A at query.c:7007 12-Dec-2017 10:23:42.628 client 10.75.22.215#51376 (api.github.com): query failed (SERVFAIL) for api.github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:23:42.629 createfetch: assets-cdn.github.com A 12-Dec-2017 10:23:42.629 client 10.75.22.215#59679 (assets-cdn.github.com): query failed (SERVFAIL) for assets-cdn.github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:23:42.630 client 10.75.22.215#37076 (assets-cdn.github.com): query failed (SERVFAIL) for assets-cdn.github.com/IN/A at query.c:7007 12-Dec-2017 10:23:43.858 createfetch: assets-cdn.github.com.ilinx A 12-Dec-2017 10:23:43.858 createfetch: assets-cdn.github.com.ilinx AAAA 12-Dec-2017 10:23:43.861 createfetch: github.com.ilinx A 12-Dec-2017 10:23:43.863 createfetch: github.com.ilinx AAAA 12-Dec-2017 10:23:43.867 createfetch: api.github.com.ilinx A 12-Dec-2017 10:23:43.868 createfetch: api.github.com.ilinx AAAA 12-Dec-2017 10:23:45.049 createfetch: github.com AAAA 12-Dec-2017 10:23:45.049 createfetch: github.com A 12-Dec-2017 10:23:45.049 client 10.75.22.215#39076 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:23:45.049 client 10.75.22.215#46895 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:23:50.059 createfetch: github.com A 12-Dec-2017 10:23:50.060 createfetch: github.com AAAA 12-Dec-2017 10:23:50.060 client 10.75.22.215#60499 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:23:50.060 client 10.75.22.215#47494 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:24:09.608 createfetch: github.com A 12-Dec-2017 10:24:09.608 client 10.75.22.1#40614 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:24:20.071 createfetch: github.com A 12-Dec-2017 10:24:20.071 createfetch: github.com AAAA 12-Dec-2017 10:24:20.071 client 10.75.22.215#53790 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:24:20.071 client 10.75.22.215#36491 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:24:56.184 createfetch: github.com A 12-Dec-2017 10:24:56.185 client 10.75.22.1#41427 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:25:20.085 createfetch: github.com AAAA 12-Dec-2017 10:23:45.049 client 10.75.22.215#39076 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:23:45.049 client 10.75.22.215#46895 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:23:50.059 createfetch: github.com A 12-Dec-2017 10:23:50.060 createfetch: github.com AAAA 12-Dec-2017 10:23:50.060 client 10.75.22.215#60499 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:23:50.060 client 10.75.22.215#47494 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:24:09.608 createfetch: github.com A 12-Dec-2017 10:24:09.608 client 10.75.22.1#40614 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:24:20.071 createfetch: github.com A 12-Dec-2017 10:24:20.071 createfetch: github.com AAAA 12-Dec-2017 10:24:20.071 client 10.75.22.215#53790 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:24:20.071 client 10.75.22.215#36491 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:24:56.184 createfetch: github.com A 12-Dec-2017 10:24:56.185 client 10.75.22.1#41427 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:25:20.085 createfetch: github.com AAAA 12-Dec-2017 10:25:20.086 createfetch: github.com A 12-Dec-2017 10:25:20.086 client 10.75.22.215#33357 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:25:20.086 client 10.75.22.215#59591 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:26:48.597 createfetch: github.com AAAA 12-Dec-2017 10:26:48.598 createfetch: github.com A 12-Dec-2017 10:26:48.598 client 10.75.22.215#55990 (github.com): query failed (SERVFAIL) for github.com/IN/AAAA at query.c:7007 12-Dec-2017 10:26:48.598 client 10.75.22.215#35485 (github.com): query failed (SERVFAIL) for github.com/IN/A at query.c:7007 12-Dec-2017 10:26:56.082 createfetch: github.com AAAA 12-Dec-2017 10:26:56.083 createfetch: github.com A # grep github -B1 /var/named/data/cache_dump.db ; answer github.com.ilinx. 10759 \-ANY ;-$NXDOMAIN -- ; answer api.github.com.ilinx. 10759 \-ANY ;-$NXDOMAIN -- ; answer assets-cdn.github.com.ilinx. 10759 \-ANY ;-$NXDOMAIN -- ; answer collector.githubapp.com.ilinx. 10759 \-ANY ;-$NXDOMAIN -- ; answer camo.githubusercontent.com.ilinx. 10759 \-ANY ;-$NXDOMAIN It seems to be happen quite frequently with google domains in the last few days. But not limited to google domains. Happened with bugzilla.redhat.com just as I was trying to submit this comment since it was happening right this minute. Any more ideas here? This is happening very frequently (many many times per day) with very frequently used domains such as google domains and is really starting to annoy the users. Interestingly, when it does start happening, something as simple as "rndc reload" will make the problem stop, but obviously that is not a workable solution or workaround even as it requires somebody to be babysitting, waiting for the server to start failing and issuing the "rndc reload". Additionally, since it is happening so frequently, there is ample opportunity to do more debugging and/or information gathering. Happy to provide whatever is needed to help get to the bottom of this. Can we please try to make some progress on this? Daily, and most frequently it's google.com domains that return SERVFAIL. That wreaks havoc on a userbase rife with Android phones, Chromecasts, Google Homes, etc. Just let me know what more information is needed and I'm happy to gather it. But without being able to fix this, the RHEL7 named is big time impediment in my network, not an asset. I should add, issuing an "rndc reload" *always* rectifies the situation. But I can't be sitting connected to my IPA server 24/7 just waiting to issue an "rndc reload" when people start complaining that they can't reach Google again. More detailed trace when SERVFAILs are happening: 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: UDP request 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: using view '_default' 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: request is not signed 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: recursion available 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: query 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query (cache) 'connectivitycheck.gstatic.com/A/IN' approved 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): replace 17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: get client 17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: recycle 17-Jan-2018 07:47:20.948 createfetch: connectivitycheck.gstatic.com A 17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): create 17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): join 17-Jan-2018 07:47:20.948 fetch 0x7f102112cd58 (fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A)): created 17-Jan-2018 07:47:20.948 client @0x7f1012522cf0: udprecv 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: UDP request 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: using view '_default' 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: request is not signed 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: recursion available 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084: query 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query (cache) 'connectivitycheck.gstatic.com/AAAA/IN' approved 17-Jan-2018 07:47:20.948 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): replace 17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: get client 17-Jan-2018 07:47:20.948 clientmgr @0x7f102c4928a0: recycle 17-Jan-2018 07:47:20.948 createfetch: connectivitycheck.gstatic.com AAAA 17-Jan-2018 07:47:20.948 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): create 17-Jan-2018 07:47:20.948 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): join 17-Jan-2018 07:47:20.948 fetch 0x7f1012542270 (fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA)): created 17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): start 17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): try 17-Jan-2018 07:47:20.948 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): cancelqueries 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): getaddresses 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): no addresses 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): done 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): stopeverything 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): cancelqueries 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): sendevents 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): start 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): try 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): cancelqueries 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): getaddresses 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): no addresses 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): done 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): stopeverything 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): cancelqueries 17-Jan-2018 07:47:20.949 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): sendevents 17-Jan-2018 07:47:20.949 client @0x7f1022ae1090: udprecv 17-Jan-2018 07:47:20.952 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query failed (SERVFAIL) for connectivitycheck.gstatic.com/IN/A at query.c:7007 17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): error 17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): send 17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): sendto 17-Jan-2018 07:47:20.953 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): senddone 17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): next 17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): endrequest 17-Jan-2018 07:47:20.954 fetch completed at resolver.c:3074 for connectivitycheck.gstatic.com/A in 0.000691: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0] 17-Jan-2018 07:47:20.954 fetch 0x7f102112cd58 (fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A)): destroyfetch 17-Jan-2018 07:47:20.954 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): shutdown 17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query failed (SERVFAIL) for connectivitycheck.gstatic.com/IN/AAAA at query.c:7007 17-Jan-2018 07:47:20.954 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): error 17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): send 17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): sendto 17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): senddone 17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): next 17-Jan-2018 07:47:20.955 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): endrequest 17-Jan-2018 07:47:20.955 fetch completed at resolver.c:3074 for connectivitycheck.gstatic.com/AAAA in 0.000568: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0] 17-Jan-2018 07:47:20.956 fetch 0x7f1012542270 (fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA)): destroyfetch 17-Jan-2018 07:47:20.956 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): shutdown 17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): doshutdown 17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): stopeverything 17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): cancelqueries 17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): unlink 17-Jan-2018 07:47:20.956 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): destroy 17-Jan-2018 07:47:20.956 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): doshutdown 17-Jan-2018 07:47:20.956 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): stopeverything 17-Jan-2018 07:47:20.957 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): cancelqueries 17-Jan-2018 07:47:20.957 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): unlink 17-Jan-2018 07:47:20.957 fctx 0x7f102329d9d0(connectivitycheck.gstatic.com/AAAA): destroy 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: UDP request 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: using view '_default' 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: request is not signed 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: recursion available 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: query 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852 (www.gstatic.com): query (cache) 'www.gstatic.com/A/IN' approved 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852 (www.gstatic.com): replace 17-Jan-2018 07:47:20.998 clientmgr @0x7f102c4928a0: get client 17-Jan-2018 07:47:20.998 clientmgr @0x7f102c4928a0: recycle 17-Jan-2018 07:47:20.998 createfetch: www.gstatic.com A 17-Jan-2018 07:47:20.998 fctx 0x7f1012b87290(www.gstatic.com/A): create 17-Jan-2018 07:47:20.998 fctx 0x7f1012b87290(www.gstatic.com/A): join 17-Jan-2018 07:47:20.998 fetch 0x7f1012542270 (fctx 0x7f1012b87290(www.gstatic.com/A)): created 17-Jan-2018 07:47:20.998 client @0x7f102233f240: udprecv 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: UDP request 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: using view '_default' 17-Jan-2018 07:47:20.998 client 10.75.22.201#32852: request is not signed 17-Jan-2018 07:47:20.999 client 10.75.22.201#32852: recursion available 17-Jan-2018 07:47:20.999 client 10.75.22.201#32852: query 17-Jan-2018 07:47:20.999 client 10.75.22.201#32852 (www.gstatic.com): query (cache) 'www.gstatic.com/AAAA/IN' approved 17-Jan-2018 07:47:20.999 client 10.75.22.201#32852 (www.gstatic.com): replace 17-Jan-2018 07:47:20.999 clientmgr @0x7f102c4928a0: get client 17-Jan-2018 07:47:20.999 clientmgr @0x7f102c4928a0: recycle 17-Jan-2018 07:47:20.999 createfetch: www.gstatic.com AAAA 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): create 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): join 17-Jan-2018 07:47:20.999 fetch 0x7f102112cd58 (fctx 0x7f1012b9be50(www.gstatic.com/AAAA)): created 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): start 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): try 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): cancelqueries 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): getaddresses 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): no addresses 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): done 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): stopeverything 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): cancelqueries 17-Jan-2018 07:47:20.999 fctx 0x7f1012b87290(www.gstatic.com/A): sendevents 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): start 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): try 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): cancelqueries 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): getaddresses 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): no addresses 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): done 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): stopeverything 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): cancelqueries 17-Jan-2018 07:47:20.999 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): sendevents 17-Jan-2018 07:47:20.999 client @0x7f1012573140: udprecv 17-Jan-2018 07:47:21.001 client 10.75.22.201#32852 (www.gstatic.com): query failed (SERVFAIL) for www.gstatic.com/IN/A at query.c:7007 17-Jan-2018 07:47:21.002 client 10.75.22.201#32852 (www.gstatic.com): error 17-Jan-2018 07:47:21.003 client 10.75.22.201#32852 (www.gstatic.com): send 17-Jan-2018 07:47:21.003 client 10.75.22.201#32852 (www.gstatic.com): sendto 17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): senddone 17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): next 17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): endrequest 17-Jan-2018 07:47:21.004 fetch completed at resolver.c:3074 for www.gstatic.com/A in 0.000683: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0] 17-Jan-2018 07:47:21.004 fetch 0x7f1012542270 (fctx 0x7f1012b87290(www.gstatic.com/A)): destroyfetch 17-Jan-2018 07:47:21.004 fctx 0x7f1012b87290(www.gstatic.com/A): shutdown 17-Jan-2018 07:47:21.004 client 10.75.22.201#32852 (www.gstatic.com): query failed (SERVFAIL) for www.gstatic.com/IN/AAAA at query.c:7007 17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): error 17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): send 17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): sendto 17-Jan-2018 07:47:21.005 client 10.75.22.201#32852 (www.gstatic.com): senddone 17-Jan-2018 07:47:21.006 client 10.75.22.201#32852 (www.gstatic.com): next 17-Jan-2018 07:47:21.006 client 10.75.22.201#32852 (www.gstatic.com): endrequest 17-Jan-2018 07:47:21.006 fetch completed at resolver.c:3074 for www.gstatic.com/AAAA in 0.000591: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0] 17-Jan-2018 07:47:21.006 fetch 0x7f102112cd58 (fctx 0x7f1012b9be50(www.gstatic.com/AAAA)): destroyfetch 17-Jan-2018 07:47:21.007 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): shutdown 17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): doshutdown 17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): stopeverything 17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): cancelqueries 17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): unlink 17-Jan-2018 07:47:21.007 fctx 0x7f1012b87290(www.gstatic.com/A): destroy 17-Jan-2018 07:47:21.007 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): doshutdown 17-Jan-2018 07:47:21.008 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): stopeverything 17-Jan-2018 07:47:21.008 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): cancelqueries 17-Jan-2018 07:47:21.008 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): unlink 17-Jan-2018 07:47:21.009 fctx 0x7f1012b9be50(www.gstatic.com/AAAA): destroy Is that at all helpful? (In reply to Brian J. Murrell from comment #11) > Can we please try to make some progress on this? Daily, and most frequently > it's google.com domains that return SERVFAIL. That wreaks havoc on a > userbase rife with Android phones, Chromecasts, Google Homes, etc. > > Just let me know what more information is needed and I'm happy to gather it. > > But without being able to fix this, the RHEL7 named is big time impediment > in my network, not an asset. Thank you for taking the time to report this issue to us. We appreciate the feedback and use reports such as this one to guide our efforts at improving our products. That being said, this bug tracking system is not a mechanism for requesting support, and we are not able to guarantee the timeliness or suitability of a resolution. If this issue is critical or in any way time sensitive, please raise a ticket through the regular Red Hat support channels to ensure it receives the proper attention and prioritization to assure a timely resolution. For information on how to contact the Red Hat production support team, please visit: https://www.redhat.com/support/process/production/#howto Currently our team is busy with higher priority issues and we will get to this bug on "best effort" basis. Hello, my name is Martin Osvald and I am helping my colleague with bind cases which are old or not touched for a very long time. I hope you are still able and willing to help me move this BZ further. First of all, I would like to ask you whether you are still facing the same problem. * If yes, please, kindly respond to all the below questions: 1. Which version of bind do you currently have installed? 2. How long does it usually take for the problem to show up after you (re)start bind? 3. Would you be able to install & run an altered version of bind which would help with the investigation? (to get rid of internal timeouts to be able to use gdb for debugging) 4. Would you be able to disable temporarily (for a very short time - minutes) any dns traffic coming to your machine except localhost when the problem happens again? (to log a sequence of called functions by gdb and not to be influenced by other clients) 5. Would you be able and willing to continually run/test several bind packages based on a very old version? (to test which patch introduced this behavior) I am asking the above questions so that I can prepare appropriate steps to investigate this further as the info/logs provided so far is not enough to find the real culprit. Only note: If you don't reply within 1 month, I will (have to) close this BZ as WORKSFORME. The below are technical details from the investigation (you can skip to read them, only if you are interested, they show why I am asking the above questions): Short version: According to the log snippet from comment 13 the code iterated 4 times between fctx_getaddresses() and dns_adb_createfind2() failing somewhere very likely shortly after or inside new_adbfind(). Looks like a problem with memory allocation (or related mutexes), new_adbfind() returning the same address (0x7f1020a40550) every time looks really suspicious/wrong. Long version: I went through all the comments in this BZ and can pinpoint the important info from them: 1. The last working version was 9.9.4-29.el7_2.4 and first non-working 9.9.4-38.el7_3: ~~~ Version-Release number of selected component (if applicable): bind-pkcs11-9.9.4-38.el7_3.x86_64 ... This has only been happening since upgrading from bind-pkcs11-32:9.9.4-29.el7_2.4.x86_64 ~~~ It is unfortunate that at the time of opening this BZ we could have provided you and asked you to run several versions of bind every time without one of the patches added between these two versions. I have checked manually (and automatically comparing usually executed functions with the ones changed by these patches) which of the below set of patches could be responsible: bind99-CVE-2015-8000.patch bind99-CVE-2015-8704.patch bind99-CVE-2016-1285-CVE-2016-1286.patch bind99-rh1291185.patch bind99-rh1259514.patch bind99-rh1306610.patch bind99-rh1220594-geoip.patch bind99-automatic-interface-scanning-rh1294506.patch bind99-CVE-2016-2776.patch bind99-CVE-2016-8864.patch but couldn't match the changed code with the behavior the provided logs are showing even some of the changes get executed on the way. 2. The below log snippet from comment 13: ~~~ 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): getaddresses 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): no addresses ... 17-Jan-2018 07:47:20.952 client 10.75.22.201#55084 (connectivitycheck.gstatic.com): query failed (SERVFAIL) for connectivitycheck.gstatic.com/IN/A at query.c:7007 ... 17-Jan-2018 07:47:20.954 fetch completed at resolver.c:3074 for connectivitycheck.gstatic.com/A in 0.000691: failure/success [domain:gstatic.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:4,findfail:0,valfail:0] ~~~ vs snippet when it works (captured on my testing machine): ~~~ 08-Jul-2019 13:20:29.788 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): getaddresses 08-Jul-2019 13:20:29.788 expiring v4 for name 0x7fbe79677990 08-Jul-2019 13:20:29.788 expiring v6 for name 0x7fbe79677990 08-Jul-2019 13:20:29.788 dns_adb_createfind: found A for name 0x7fbe79677990 in db 08-Jul-2019 13:20:29.788 dns_adb_createfind: found AAAA for name 0x7fbe79677990 08-Jul-2019 13:20:29.788 expiring v4 for name 0x7fbe79677860 08-Jul-2019 13:20:29.788 expiring v6 for name 0x7fbe79677860 08-Jul-2019 13:20:29.788 dns_adb_createfind: found A for name 0x7fbe79677860 in db 08-Jul-2019 13:20:29.788 dns_adb_createfind: found AAAA for name 0x7fbe79677860 08-Jul-2019 13:20:29.788 expiring v4 for name 0x7fbe79677730 08-Jul-2019 13:20:29.788 expiring v6 for name 0x7fbe79677730 08-Jul-2019 13:20:29.788 dns_adb_createfind: found A for name 0x7fbe79677730 in db 08-Jul-2019 13:20:29.788 dns_adb_createfind: found AAAA for name 0x7fbe79677730 08-Jul-2019 13:20:29.789 expiring v4 for name 0x7fbe79677600 08-Jul-2019 13:20:29.789 expiring v6 for name 0x7fbe79677600 08-Jul-2019 13:20:29.789 dns_adb_createfind: found A for name 0x7fbe79677600 in db 08-Jul-2019 13:20:29.789 dns_adb_createfind: found AAAA for name 0x7fbe79677600 08-Jul-2019 13:20:29.789 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): query 08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): send 08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): sent 08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): udpconnected 08-Jul-2019 13:20:29.789 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): senddone 08-Jul-2019 13:20:29.808 resquery 0x7fbe794f4010 (fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A)): response 08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): answer_response 08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): cache_message 08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): findnoqname 08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): clone_results 08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): cancelquery 08-Jul-2019 13:20:29.808 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): done 08-Jul-2019 13:20:29.809 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): stopeverything 08-Jul-2019 13:20:29.809 fctx 0x7fbe794ed010(connectivitycheck.gstatic.com/A): cancelqueries 08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe79675c40 08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe796754c0 08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe7967d790 08-Jul-2019 13:20:29.809 dns_adb_destroyfind on find 0x7fbe79675e20 ~~~ I spent a few days investigating the related code, but unfortunately, the provided log lines are not enough to trace every corner of the code which could return with failure. The main points here are that there are only "dns_adb_destroyfind" lines with the same address 0x7f1020a40550 between "getaddresses" and "no addresses" and the "adberr:4". This means the code iterated 4 times (processing 4 nameservers for gstatic.com) prematurely ending somewhere inside dns_adb_createfind2() or functions called by it. The really strange thing is the same address 0x7f1020a40550 returned by new_adbfind(). It is normally every time different. The usual sequence of called functions involved in this iteration (from my testing machine using gdb): clone -> start_thread -> run -> dispatch -> resquery_response -> fctx_try -> fctx_getaddresses -> findname -> dns_adb_createfind2 -> new_adbfind() According to the mentioned log snippet from comment 13 the code iterated 4 times between fctx_getaddresses() and dns_adb_createfind2() failing somewhere very likely shortly after or inside new_adbfind(). Looks like a problem with memory allocation (or related mutexes). We can deduce it from the log functions and code from the below snippets (from version 9.9.4-38.el7_3): ~~~ 2580 static isc_result_t 2581 fctx_getaddresses(fetchctx_t *fctx, isc_boolean_t badcache) { ... 2593 FCTXTRACE("getaddresses"); ... 2739 findname(fctx, &ns.name, 0, stdoptions, 0, now, 2740 &need_alternate); ... 2811 FCTXTRACE("no addresses"); 2463 static void 2464 findname(fetchctx_t *fctx, dns_name_t *name, in_port_t port, 2465 unsigned int options, unsigned int flags, isc_stdtime_t now, 2466 isc_boolean_t *need_alternate) 2467 { ... 2491 find = NULL; 2492 result = dns_adb_createfind2(fctx->adb, 2493 res->buckets[fctx->bucketnum].task, 2494 fctx_finddone, fctx, name, 2495 &fctx->name, fctx->type, 2496 options, now, NULL, 2497 res->view->dstport, 2498 fctx->depth + 1, fctx->qc, &find); 2499 if (result != ISC_R_SUCCESS) { ... 2527 } else { ... 2548 } else { ... 2551 else 2552 fctx->adberr++; /* unreachable server, etc. */ ... 2565 dns_adb_destroyfind(&find); 3122 void 3123 dns_adb_destroyfind(dns_adbfind_t **findp) { ... 3137 DP(DEF_LEVEL, "dns_adb_destroyfind on find %p", find); 2779 isc_result_t 2780 dns_adb_createfind2(dns_adb_t *adb, isc_task_t *task, isc_taskaction_t action, 2781 void *arg, dns_name_t *name, dns_name_t *qname, 2782 dns_rdatatype_t qtype, unsigned int options, 2783 isc_stdtime_t now, dns_name_t *target, 2784 in_port_t port, unsigned int depth, isc_counter_t *qc, 2785 dns_adbfind_t **findp) 2786 { ... 2840 find = new_adbfind(adb); ~~~ The reason why I think the code returned/failed "very likely shortly after or inside new_adbfind()" is that we would see other logging messages between ~~~ 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): getaddresses ... 17-Jan-2018 07:47:20.949 fctx 0x7f10232b3590(connectivitycheck.gstatic.com/A): no addresses ~~~ I mean messages like: ~~~ 2927 DP(DEF_LEVEL, 2928 "dns_adb_createfind: found A for name %p in db", 2929 adbname); or 3019 DP(DEF_LEVEL, 3020 "dns_adb_createfind: started A fetch for name %p", 3021 adbname); ~~~ than only the below one: ~~~ 17-Jan-2018 07:47:20.949 dns_adb_destroyfind on find 0x7f1020a40550 ~~~ because they are printed every time at the same trace level 5 when everything works correctly. Red Hat Enterprise Linux version 7 entered the Maintenance Support 1 phase in August 2019. In this phase only qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available. Other errata advisories may be delivered as appropriate. This bug has been reviewed by Support and Engineering representative and does not meet the inclusion criteria for Maintenance Support 1 phase. Also information requested in comment #15 was never provided. For more information about Red Hat Enterprise Linux Lifecycle, please see https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_1_Phase. I'm reopening this since this continues to happen on EL8.5 also, using bind-pkcs11-9.11.26-6.el8.x86_64. I don't seem to have permission to change the Product to RHEL 8 though. Answering the questions from comment#15: > 1. Which version of bind do you currently have installed? bind-pkcs11-9.11.26-6.el8.x86_64 > 2. How long does it usually take for the problem to show up after you (re)start bind? It's not really related to when bind is started. It seems to happen for a period of time after a DDNS update is done, which fairly regularly as the DHCP server is doing DDNS updates when clients acquire leases. 3. Would you be able to install & run an altered version of bind which would help with the investigation? Yes. 4. Would you be able to disable temporarily (for a very short time - minutes) any dns traffic coming to your machine except localhost when the problem happens again? I'm afraid not, no. Besides, given the intermittent and unpredictable behaviour, the machine needs to process queries for the situation to even be discovered. 5. Would you be able and willing to continually run/test several bind packages based on a very old version? Probably not, given that this bind instance is exposed to the Internet and running old (and presumably insecure) versions of bind would expose me to security risks. Switching to RHEL8, RHEL7 fix might be considered only when existing issue is confirmed and is serious enough. It would be helpful if cache dump containing failing name and its parent would be made. rndc flushname <name> or rndc flushtree <name> should help if only few names are failing. Servfail can be caused by multiple issues. Unfortunately it does not provide details with a reason into reply, best answer should be found in server log. Brian, can you supply a recent log with RHEL8 version? adberr is increased in case name server for domain points to cname [1], which is considered invalid. Also when loop is detected [2]. It would log it when it happens. Other case is generic failure with remote server [3], which does not log anything specific. I expect this would be the reason here. Do we have tcpdump proof that response from remote server delivered a response when SERVFAIL happens? It might be valid result if remote network connectivity is broken for a while. If bad server is marked, some reason should be logged on first query. It would report errors just from cache. 1. https://gitlab.isc.org/isc-projects/bind9/-/blob/main/lib/dns/resolver.c#L3344 2. https://gitlab.isc.org/isc-projects/bind9/-/blob/main/lib/dns/resolver.c#L3390 3. https://gitlab.isc.org/isc-projects/bind9/-/blob/main/lib/dns/resolver.c#L3440 After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened. The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days |