Bug 2180341
| Summary: | possible query loop due to missing glue record in .by TLD | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Kseniya <ksyblast> |
| Component: | bind | Assignee: | Petr Menšík <pemensik> |
| Status: | ASSIGNED --- | QA Contact: | rhel-cs-infra-services-qe <rhel-cs-infra-services-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | --- | Keywords: | MoveUpstream, Reopened |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| URL: | https://lists.isc.org/pipermail/bind-users/2022-March/105885.html | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-04 16:20:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
While I think there is an existing issue, I do not think this should be fixed on the bind side. This creates a strange loop through the intermediate domain hoster.by:

```
;by. IN NS
;; ANSWER SECTION:
by.            2351 IN NS dns2.tld.becloudby.com.
by.            2351 IN NS dns4.tld.becloudby.com.
by.            2351 IN NS dns1.tld.becloudby.com.
by.            2351 IN NS dns7.tld.becloudby.com.
by.            2351 IN NS dns3.tld.becloudby.com.

;becloudby.com. IN NS
;; ANSWER SECTION:
becloudby.com.  600 IN NS u1.hoster.by.
becloudby.com.  600 IN NS u2.hoster.by.
```

The referral to u1 and u2 of hoster.by should include glue with addresses, because iteration from the root cannot resolve u1.hoster.by unless it is already in the cache. The root servers, however, supply the initial cache, so it works at least somewhat:

```
; <<>> DiG 9.18.14 <<>> +norec -4 @a.root-servers.net by.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47737
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 5, ADDITIONAL: 10

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;by. IN A

;; AUTHORITY SECTION:
by. 172800 IN NS dns2.tld.becloudby.com.
by. 172800 IN NS dns1.tld.becloudby.com.
by. 172800 IN NS dns4.tld.becloudby.com.
by. 172800 IN NS dns3.tld.becloudby.com.
by. 172800 IN NS dns7.tld.becloudby.com.

;; ADDITIONAL SECTION:
dns2.tld.becloudby.com. 172800 IN A    93.125.25.73
dns2.tld.becloudby.com. 172800 IN AAAA 2a00:c827:a:3::2
dns1.tld.becloudby.com. 172800 IN A    93.125.25.72
dns1.tld.becloudby.com. 172800 IN AAAA 2a00:c827:a:2::2
dns4.tld.becloudby.com. 172800 IN A    31.44.1.137
dns4.tld.becloudby.com. 172800 IN AAAA 2a0e:b81:8001:1001::2
dns3.tld.becloudby.com. 172800 IN A    185.98.83.4
dns3.tld.becloudby.com. 172800 IN AAAA 2a01:ba80:e:c:1::4c
dns7.tld.becloudby.com. 172800 IN A    31.44.5.245
```

I think the fix would be either at the "by." nameservers or the "hoster.by." nameservers; either one should work. I have written mail to the .by domain holder.

There were changes to the becloudby.com zone nameservers.
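The circular dependency described above can be checked mechanically. Below is a minimal sketch (hypothetical, stdlib-only Python; the delegation map is hard-coded from the dig answers above rather than fetched from the network) that follows which zone must be resolved to obtain each zone's nameserver addresses, and reports a cycle when that chain leads back to a zone already being resolved:

```python
# Map each zone to its NS host names, taken from the dig answers above
# (hypothetical static model; no network access, no glue records).
DELEGATIONS = {
    "by.": ["dns1.tld.becloudby.com.", "dns2.tld.becloudby.com."],
    "becloudby.com.": ["u1.hoster.by.", "u2.hoster.by."],
}

def enclosing_zone(host):
    """Return the most specific zone in DELEGATIONS containing host, or None."""
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in DELEGATIONS:
            return candidate
    return None

def find_glue_cycle(zone, seen=None):
    """Without glue, resolving a zone's NS host first requires resolving
    the zone that contains the host.  If that chain revisits a zone, the
    resolver is stuck unless the addresses are already cached."""
    seen = seen or []
    if zone in seen:
        return seen + [zone]          # cycle found
    if zone not in DELEGATIONS:
        return None                   # addresses reachable from the root
    for ns in DELEGATIONS[zone]:
        parent = enclosing_zone(ns)
        if parent is not None:
            cycle = find_glue_cycle(parent, seen + [zone])
            if cycle:
                return cycle
    return None

print(find_glue_cycle("by."))
# ['by.', 'becloudby.com.', 'by.'] -- the loop from this bug report
```

This is why the resolution only works while the root-supplied addresses for dns[1-7].tld.becloudby.com are still cached: once they expire, every path re-enters the cycle.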
```
$ dig -t ns becloud.com

; <<>> DiG 9.16.23-RH <<>> -t ns becloud.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12664
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: f729087332ca9d99010000006453d878302b7008fb5718a5 (good)
;; QUESTION SECTION:
;becloud.com. IN NS

;; ANSWER SECTION:
becloud.com. 86259 IN NS ns1.dan.com.
becloud.com. 86259 IN NS ns2.dan.com.

;; ADDITIONAL SECTION:
ns1.dan.com. 172658 IN A    97.74.98.67
ns2.dan.com. 172658 IN A    173.201.66.67
ns1.dan.com. 172658 IN AAAA 2603:5:2125::43
ns2.dan.com. 172658 IN AAAA 2603:5:2225::43

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu May 04 16:08:24 UTC 2023
;; MSG SIZE rcvd: 199

$ dig -t ns dan.com

; <<>> DiG 9.16.23-RH <<>> -t ns dan.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13649
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 15df3393040bbcc5010000006453d90f1d07840733cc1ec4 (good)
;; QUESTION SECTION:
;dan.com. IN NS

;; ANSWER SECTION:
dan.com. 3600 IN NS a1-245.akam.net.
dan.com. 3600 IN NS a6-66.akam.net.
dan.com. 3600 IN NS a9-67.akam.net.
dan.com. 3600 IN NS a11-64.akam.net.
dan.com. 3600 IN NS a20-65.akam.net.
dan.com. 3600 IN NS a8-67.akam.net.

;; ADDITIONAL SECTION:
a9-67.akam.net. 172800 IN A 184.85.248.67

;; Query time: 793 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu May 04 16:10:55 UTC 2023
;; MSG SIZE rcvd: 211
```

The loop seems gone.

```
$ dig -4 +nssearch by.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 31.44.5.245 in 76 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 31.44.1.137 in 76 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 93.125.25.73 in 122 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 93.125.25.72 in 123 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 185.98.83.4 in 129 ms.
```

The SOA record of the by. zone still points to dns1.tld.becloud.by., but that name does not exist. That is only a minor issue.

After flushing the cache for by., becloud.com or becloud.by with `rndc flushtree becloudby.com.`, it still works. It seems it was fixed at the authoritative servers. Flush the becloudby.com zone to refresh that data; once it is refreshed, it should work.

Unfortunately it doesn't work, and the problem was reproduced again.
Cache:

```
; glue
by.             772509  NS  dns1.tld.becloudby.com.
                772509  NS  dns7.tld.becloudby.com.
                772509  NS  dns2.tld.becloudby.com.
                772509  NS  dns3.tld.becloudby.com.
                772509  NS  dns4.tld.becloudby.com.
becloudby.com.  772509  NS  u1.hoster.by.
                772509  NS  u2.hoster.by.
```
I don't think anything was changed by the .by zone maintainers.
```
$ dig -t ns becloudby.com

; <<>> DiG 9.18.14 <<>> -t ns becloudby.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43478
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;becloudby.com. IN NS

;; ANSWER SECTION:
becloudby.com. 600 IN NS u2.hoster.by.
becloudby.com. 600 IN NS u1.hoster.by.

;; Query time: 84 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Fri Jun 16 11:28:28 +03 2023
;; MSG SIZE rcvd: 85
```
I understand that it is most probably a zone misconfiguration; however, the problem is reproduced only with the bind nameserver. Elsewhere these names are resolved without any problems.
The same issue occurs with bind9.16-9.16.23-0.12.el8.x86_64, while at the same time the .by zone is reachable from elsewhere. `rndc flush` fixes the issue.

I do not have a long-running iterative server where I could try to reproduce this issue. But as far as I can tell, those nameservers seem to be somewhat unstable. Could that be responsible for these issues?

```
$ dig -4t NS +norec hoster.by. @dns1.hoster.by.
;; communications error to 93.125.31.240#53: timed out

; <<>> DiG 9.18.16 <<>> -4t NS +norec hoster.by. @dns1.hoster.by.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45919
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;hoster.by. IN NS

;; ANSWER SECTION:
hoster.by. 3600 IN NS dns2.hoster.by.
hoster.by. 3600 IN NS dns1.hoster.by.

;; ADDITIONAL SECTION:
dns2.hoster.by. 3600 IN AAAA 2a0a:7d80:3:1::7
dns1.hoster.by. 3600 IN AAAA 2a0a:7d80:1:1::5:0
dns1.hoster.by. 3600 IN A    93.125.31.240
dns2.hoster.by. 3600 IN A    185.179.82.103

;; Query time: 62 msec
;; SERVER: 93.125.31.240#53(dns1.hoster.by.) (UDP)
;; WHEN: Mon Jul 17 10:54:49 CEST 2023
;; MSG SIZE rcvd: 164
```

I have tried playing with it on one of my instances. At debug level 2, it proves at least that my instance has connectivity problems. It is based in the US.
```
2023-07-17T10:31:21.238Z fetch: by/NS
2023-07-17T10:31:21.238Z fetch: dns1.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns2.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns3.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns4.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns7.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: u1.hoster.by/A
2023-07-17T10:31:21.238Z fetch: u2.hoster.by/A
2023-07-17T10:31:30.662Z fetch: by/A
2023-07-17T10:31:31.239Z client @0x7f86b451acc8 127.0.0.1#59895 (by): query failed (timed out) for by/IN/NS at ../../../lib/ns/query.c:7355
2023-07-17T10:31:31.239Z client @0x7f86b452a038 127.0.0.1#54589 (by): query failed (failure) for by/IN/A at ../../../lib/ns/query.c:7355
2023-07-17T10:31:31.240Z fetch: becloudby.com/A
2023-07-17T10:31:31.240Z fetch: u1.hoster.by/A
2023-07-17T10:31:31.240Z fetch: u2.hoster.by/A
2023-07-17T10:31:31.240Z client @0x7f86b452a038 127.0.0.1#56214 (becloudby.com): query failed (failure) for becloudby.com/IN/A at ../../../lib/ns/query.c:7355
2023-07-17T10:32:49.735Z fetch: by/A
2023-07-17T10:32:49.735Z fetch: dns1.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns2.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns3.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns4.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns7.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: u1.hoster.by/A
2023-07-17T10:32:49.735Z fetch: u2.hoster.by/A
2023-07-17T10:32:50.889Z fetch: becloudby.com/A
```

Results are unstable. It runs okay for some time, but then it does not. As soon as the hoster.by records become stale, it is not able to refresh them for some reason. When I query localhost with +norec, it gives me an answer, but with +rec it just SERVFAILs.

I found this problem discussed on the upstream list last year: https://lists.isc.org/pipermail/bind-users/2022-March/105885.html

It was forwarded to the dns-operations list as well: https://lists.dns-oarc.net/pipermail/dns-operations/2022-January/021501.html

I think they agree that the fix should be done in the TLD configuration, not in bind. But I admit unbound seems to handle it better, and the Fedora version 9.18 also seems to handle it better.
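The suspicion that the hoster.by servers are intermittently unreachable can be spot-checked without dig. Below is a minimal sketch (hypothetical, Python stdlib only; the DNS wire format is hand-rolled and the address is the one from the dig output above) that sends a single non-recursive NS query over UDP and reports a timeout:

```python
import socket
import struct

def build_query(qname, qtype=2, qid=0x1234):
    """Build a minimal DNS query packet (qtype 2 = NS).

    All flag bits are zero, so RD is clear -- this mimics `dig +norec`.
    """
    header = struct.pack(">HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00" + struct.pack(">HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question

def probe(server_ip, qname, timeout=2.0):
    """Send one query and report whether the server answered in time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(qname), (server_ip, 53))
        sock.recvfrom(512)
        return True
    except socket.timeout:
        return False
    finally:
        sock.close()

# Example (requires network access, so left commented out):
# print(probe("93.125.31.240", "hoster.by."))
```

Running this in a loop from the affected resolver's host would show whether the `communications error ... timed out` above is a persistent routing problem or intermittent flakiness.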
I was able to make it stable on c9s only with this workaround:
```
zone "by" IN {
    type static-stub;
    server-addresses {
        2a00:c827:a:2::2;      # dns1.tld.becloudby.com. 169007 IN AAAA
        2a00:c827:a:3::2;      # dns2.tld.becloudby.com. 169007 IN AAAA
        2a01:ba80:e:c:1::4c;   # dns3.tld.becloudby.com. 169007 IN AAAA
        2a0e:b81:8001:1001::2; # dns4.tld.becloudby.com. 169007 IN AAAA
        93.125.25.72;          # dns1.tld.becloudby.com. 169007 IN A
        93.125.25.73;          # dns2.tld.becloudby.com. 169007 IN A
        185.98.83.4;           # dns3.tld.becloudby.com. 169007 IN A
        31.44.1.137;           # dns4.tld.becloudby.com. 169007 IN A
        31.44.5.245;           # dns7.tld.becloudby.com. 169007 IN A
    };
};
```
Unless this would be solved by a rebase, I do not think such a change should be backported. This should be fixed at the TLD by using a sane configuration.
Description of problem:
Sometimes we have a problem resolving domains in the "by" zone. Queries result in SERVFAIL.

Version-Release number of selected component (if applicable):
BIND 9.11.36-RedHat-9.11.36-8.el8 (Extended Support Version) <id:68dbd5b> running on Linux x86_64 4.18.0-394.el8.x86_64 #1 SMP Tue May 31 16:19:11 UTC 2022, CentOS Stream release 8

How reproducible:
It is not always reproducible; the issue probably appears after some cache expiration. I will provide the possible steps to reproduce with an explanation below.

Steps to Reproduce:
1. The name server is started; resolving any name in the "by" zone (onliner.by, prior.by) works.
2. After some time (I can't say exactly how long yet) it stops working, returning SERVFAIL. It can be fixed by restarting the server or flushing the cache.

My investigation of the issue, from cache_dump.db:

```
by.             777506  NS  dns1.tld.becloudby.com.
                777506  NS  dns7.tld.becloudby.com.
                777506  NS  dns2.tld.becloudby.com.
                777506  NS  dns3.tld.becloudby.com.
                777506  NS  dns4.tld.becloudby.com.
; glue
becloudby.com.  777506  NS  u1.hoster.by.
                777506  NS  u2.hoster.by.
```

There are also lots of records like:

```
; dns4.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]
; dns1.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]
; dns3.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]
; dns7.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]
```

There are no other related records for becloudby.com, hoster.by, onliner.by or prior.by in the cache.
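The failure-cached address records above (`[v4 TTL 2] ... [v4 failure]`) suggest the resolver is deadlocked on its own cache contents. Below is a toy model (hypothetical, stdlib-only Python; zone names and structure taken from the cache dump above, everything else simplified) of why every lookup then requires another lookup once the glue addresses have expired while the long-lived NS records survive:

```python
# Toy cache state from the dump above: NS records live for days, but the
# nameserver address records have expired or are failure-cached, so no
# server can be contacted without first resolving another looped name.
CACHE_NS = {
    "by.": ["dns1.tld.becloudby.com."],   # addresses expired from cache
    "becloudby.com.": ["u1.hoster.by."],  # addresses expired from cache
}
CACHE_ADDR = {}  # no A/AAAA records left for any of the NS hosts

def zone_of(name):
    """Most specific cached delegation containing name (or None)."""
    labels = name.split(".")
    for i in range(1, len(labels)):
        z = ".".join(labels[i:])
        if z in CACHE_NS:
            return z
    return None

def resolve(name, depth=0):
    """Return an address for name, or 'SERVFAIL' when stuck in the loop."""
    if depth > 10:
        return "SERVFAIL"      # resolver gives up, as seen in the report
    if name in CACHE_ADDR:
        return CACHE_ADDR[name]
    zone = zone_of(name)
    if zone is None:
        return "from-root"     # a real resolver would restart at the root
    # To contact the zone's nameserver, its name must be resolved first.
    for ns in CACHE_NS[zone]:
        if resolve(ns, depth + 1) == "SERVFAIL":
            return "SERVFAIL"
    return "SERVFAIL"          # no nameserver address could be obtained

print(resolve("onliner.by."))  # SERVFAIL while the loop is in cache
```

The `from-root` branch hints at the difference from other resolvers: the problem only bites while the looped NS records are in the cache without their addresses; after `rndc flush` the root-supplied glue repopulates everything, which matches the observed behavior.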
debug:

```
Feb 02 13:27:31 myhostname named[944]: fetch: prior.by/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns1.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns1.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns2.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns2.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns3.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns3.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns4.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns4.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns7.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns7.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: u1.hoster.by/A
Feb 02 13:27:31 myhostname named[944]: fetch: u1.hoster.by/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: u2.hoster.by/A
Feb 02 13:27:31 myhostname named[944]: fetch: u2.hoster.by/AAAA
```

At the same time no actual requests are sent to the network. As I understand it, there may be a loop once the glue records disappear from the cache, because the "by" zone has NS records pointing to dns[1-7].tld.becloudby.com, and "becloudby.com" has NS records pointing to u[12].hoster.by, which brings us back to dns[1-7].tld.becloudby.com. However, this issue is not observed with other DNS servers, and these names are resolved successfully elsewhere.

I am also providing cache records for a successful resolution:

```
by.                     607714  NS    dns4.tld.becloudby.com.
                        607714  NS    dns3.tld.becloudby.com.
                        607714  NS    dns2.tld.becloudby.com.
                        607714  NS    dns7.tld.becloudby.com.
                        607714  NS    dns1.tld.becloudby.com.
; glue
hoster.by.              607143  NS    dns2.hoster.by.
                        607143  NS    dns1.hoster.by.
; glue
dns1.hoster.by.         607143  A     93.125.31.240
; glue
                        607143  AAAA  2a0a:7d80:1:1::5:0
; glue
dns2.hoster.by.         607143  A     178.172.139.139
; glue
                        607143  AAAA  2a0a:7d80:3:2::139
; authanswer
u1.hoster.by.           607143  A     93.125.30.201
; authanswer
                        607143  AAAA  2a0a:7d80:1:1::4:0
; authanswer
u2.hoster.by.           607143  A     178.172.137.158
; authanswer
                        607143  AAAA  2a0a:7d80:3:2::b
; glue
onliner.by.             607143  NS    u1.hoster.by.
                        607143  NS    u2.hoster.by.
; authanswer
                        607143  A     178.124.129.12
                        607143  A     178.124.129.14
                        607143  A     178.124.129.16
; authanswer
prior.by.               604945  NS    ns-ext2.priorbank.by.
                        604945  NS    ns-ext1.priorbank.by.
; glue
ns-ext1.priorbank.by.   607645  A     185.137.116.3
; glue
ns-ext2.priorbank.by.   607645  A     185.137.116.4
```

```
becloudby.com.          728830  NS    u1.hoster.by.
                        728830  NS    u2.hoster.by.
; glue
dns1.tld.becloudby.com. 771792  A     93.125.25.72
; glue
                        771792  AAAA  2a00:c827:a:2::2
; glue
dns2.tld.becloudby.com. 771792  A     93.125.25.73
; glue
                        771792  AAAA  2a00:c827:a:3::2
; glue
dns3.tld.becloudby.com. 771792  A     185.98.83.4
; glue
                        771792  AAAA  2a01:ba80:e:c:1::4c
; glue
dns4.tld.becloudby.com. 771792  A     31.44.1.137
; glue
                        771792  AAAA  2a0e:b81:8001:1001::2
; glue
dns7.tld.becloudby.com. 725763  A     31.44.5.245
```

Could you please tell me whether this is a zone misconfiguration or a possible bug? If it is a zone misconfiguration, could you please point me to an RFC or other information that I can use to contact the zone maintainers? I would like to mention again that I see no issues with other resolvers.