Bug 2180341

Summary: possible query loop due to missing glue record in .by TLD
Product: Red Hat Enterprise Linux 8
Component: bind
Version: ---
Hardware: x86_64
OS: Linux
Status: ASSIGNED
Severity: unspecified
Priority: unspecified
Reporter: Kseniya <ksyblast>
Assignee: Petr Menšík <pemensik>
QA Contact: rhel-cs-infra-services-qe <rhel-cs-infra-services-qe>
Keywords: MoveUpstream, Reopened
Target Milestone: rc
Target Release: ---
Type: Bug
Last Closed: 2023-05-04 16:20:56 UTC
URL: https://lists.isc.org/pipermail/bind-users/2022-March/105885.html

Description Kseniya 2023-03-21 08:51:53 UTC
Description of problem:

Sometimes we have a problem with resolving domains in "by" zone. Queries result in SERVFAIL.

Version-Release number of selected component (if applicable):

BIND 9.11.36-RedHat-9.11.36-8.el8 (Extended Support Version) <id:68dbd5b>
running on Linux x86_64 4.18.0-394.el8.x86_64 #1 SMP Tue May 31 16:19:11 UTC 2022
CentOS Stream release 8

How reproducible:
It's not always reproducible; the issue probably appears after some cache entries expire. I will provide possible steps to reproduce, with an explanation, below.

Steps to Reproduce:

1. Start the name server and try to resolve any name in the "by" zone (onliner.by, prior.by) - it works.
2. After some time (I can't yet say exactly how long) it stops working and returns SERVFAIL. It can be fixed by restarting the server or flushing the cache.
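Until the root cause is fixed, one possible mitigation is to watch for SERVFAIL and flush the resolver cache, as step 2 describes. A minimal sketch (the dig/rndc invocations and the probe domain follow this report; the helper functions themselves are hypothetical):

```python
import re
import subprocess

def dig_status(output: str) -> str:
    """Extract the DNS RCODE (e.g. NOERROR, SERVFAIL) from dig's header line."""
    m = re.search(r"status: (\w+)", output)
    return m.group(1) if m else "UNKNOWN"

def check_and_flush(name: str = "onliner.by") -> bool:
    """Query the local resolver; flush its cache if the query SERVFAILs.

    Returns True when a flush was triggered. Assumes dig and rndc are
    available on the host and named is the local resolver.
    """
    out = subprocess.run(["dig", name], capture_output=True, text=True).stdout
    if dig_status(out) == "SERVFAIL":
        subprocess.run(["rndc", "flush"])
        return True
    return False
```

This only papers over the symptom described in step 2; it does not address the delegation itself.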

My investigation on the issue:
cache_dump.db:

by.                     777506  NS      dns1.tld.becloudby.com.
                        777506  NS      dns7.tld.becloudby.com.
                        777506  NS      dns2.tld.becloudby.com.
                        777506  NS      dns3.tld.becloudby.com.
                        777506  NS      dns4.tld.becloudby.com.

; glue
becloudby.com.          777506  NS      u1.hoster.by.
                        777506  NS      u2.hoster.by.

There are also lots of records like:

; dns4.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]
; dns1.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]
; dns3.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]
; dns7.tld.becloudby.com [v4 TTL 2] [v6 TTL 2] [v4 failure] [v6 failure]


There are no other related records for becloudby.com, hoster.by, onliner.by or prior.by found in cache.

debug:

Feb 02 13:27:31 myhostname named[944]: fetch: prior.by/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns1.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns1.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns2.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns2.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns3.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns3.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns4.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns4.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: dns7.tld.becloudby.com/A
Feb 02 13:27:31 myhostname named[944]: fetch: dns7.tld.becloudby.com/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: u1.hoster.by/A
Feb 02 13:27:31 myhostname named[944]: fetch: u1.hoster.by/AAAA
Feb 02 13:27:31 myhostname named[944]: fetch: u2.hoster.by/A
Feb 02 13:27:31 myhostname named[944]: fetch: u2.hoster.by/AAAA

At the same time no actual requests are sent to the network.

As I understand it, there may be a loop once the glue records expire from the cache: the "by" zone has NS records pointing to dns[1-7].tld.becloudby.com, and "becloudby.com" has NS records pointing to u[12].hoster.by, which brings us back to dns[1-7].tld.becloudby.com.
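The dependency described above can be modeled as a graph in which each zone depends on the zone hosting its nameserver names. A minimal sketch (the delegation map is taken from the cache dump above; the cycle check itself is ordinary traversal):

```python
# Which zone must be resolvable to obtain addresses for each zone's
# nameservers. Without glue addresses in cache, each edge is a hard
# dependency.
deps = {
    "by": "becloudby.com",         # by. NS are dns[1-7].tld.becloudby.com
    "becloudby.com": "hoster.by",  # becloudby.com NS are u[12].hoster.by
    "hoster.by": "by",             # hoster.by is itself under by.
}

def find_cycle(start: str):
    """Follow the dependency chain from `start`; return the loop if any."""
    seen, node = [], start
    while node in deps:
        if node in seen:
            return seen[seen.index(node):] + [node]
        seen.append(node)
        node = deps[node]
    return None

print(find_cycle("by"))  # every zone in the chain is stuck once glue expires
```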

However, this issue is not observed with other DNS servers, and these names are resolved successfully elsewhere.

I am also providing the cache records for a successful resolution:


by.                     607714  NS      dns4.tld.becloudby.com.
                        607714  NS      dns3.tld.becloudby.com.
                        607714  NS      dns2.tld.becloudby.com.
                        607714  NS      dns7.tld.becloudby.com.
                        607714  NS      dns1.tld.becloudby.com.

; glue
hoster.by.              607143  NS      dns2.hoster.by.
                        607143  NS      dns1.hoster.by.
; glue
dns1.hoster.by.         607143  A       93.125.31.240
; glue
                        607143  AAAA    2a0a:7d80:1:1::5:0
; glue
dns2.hoster.by.         607143  A       178.172.139.139
; glue
                        607143  AAAA    2a0a:7d80:3:2::139
; authanswer
; authanswer
u1.hoster.by.           607143  A       93.125.30.201
; authanswer
                        607143  AAAA    2a0a:7d80:1:1::4:0
; authanswer
u2.hoster.by.           607143  A       178.172.137.158
; authanswer
                        607143  AAAA    2a0a:7d80:3:2::b
; glue
onliner.by.             607143  NS      u1.hoster.by.
                        607143  NS      u2.hoster.by.
; authanswer
                        607143  A       178.124.129.12
                        607143  A       178.124.129.14
                        607143  A       178.124.129.16
; authanswer
prior.by.               604945  NS      ns-ext2.priorbank.by.
                        604945  NS      ns-ext1.priorbank.by.
; glue
ns-ext1.priorbank.by.   607645  A       185.137.116.3
; glue
ns-ext2.priorbank.by.   607645  A       185.137.116.4
becloudby.com.          728830  NS      u1.hoster.by.
                        728830  NS      u2.hoster.by.
; glue
dns1.tld.becloudby.com. 771792  A       93.125.25.72
; glue
                        771792  AAAA    2a00:c827:a:2::2
; glue
dns2.tld.becloudby.com. 771792  A       93.125.25.73
; glue
                        771792  AAAA    2a00:c827:a:3::2
; glue
dns3.tld.becloudby.com. 771792  A       185.98.83.4
; glue
                        771792  AAAA    2a01:ba80:e:c:1::4c
; glue
dns4.tld.becloudby.com. 771792  A       31.44.1.137
; glue
                        771792  AAAA    2a0e:b81:8001:1001::2
; glue
dns7.tld.becloudby.com. 725763  A       31.44.5.245

Could you please tell me whether this is a zone misconfiguration or a possible bug? If it is a zone misconfiguration, could you please point me to an RFC or other information that I can use to contact the zone maintainers? I would like to mention again that I see no issues with other resolvers.

Comment 1 Petr Menšík 2023-05-03 16:40:41 UTC
While I think there is a real issue here, I do not think it should be fixed on the bind side. This creates a strange loop through the intermediate domain hoster.by.

;by.				IN	NS

;; ANSWER SECTION:
by.			2351	IN	NS	dns2.tld.becloudby.com.
by.			2351	IN	NS	dns4.tld.becloudby.com.
by.			2351	IN	NS	dns1.tld.becloudby.com.
by.			2351	IN	NS	dns7.tld.becloudby.com.
by.			2351	IN	NS	dns3.tld.becloudby.com.

;becloudby.com.			IN	NS

;; ANSWER SECTION:
becloudby.com.		600	IN	NS	u1.hoster.by.
becloudby.com.		600	IN	NS	u2.hoster.by.

u1 and u2 of hoster.by should have glue with their addresses in the delegation, because iteration from the root cannot resolve u1.hoster.by unless it is already in the cache.
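The point above can be illustrated with a small sketch: a zone is resolvable only if its nameserver addresses are known up front (glue) or come from a zone that is itself resolvable without looping back. The zone names mirror this report; the model is a simplification of iterative resolution:

```python
def resolvable(zone, ns_host_zone, glue, visiting=None):
    """True iff `zone` can be reached by iteration: either glue supplies
    its nameserver addresses, or the zone hosting its nameserver names
    is resolvable without revisiting a zone already on the path."""
    visiting = visiting or set()
    if zone in glue:
        return True
    if zone in visiting or zone not in ns_host_zone:
        return False  # loop or dead end: iteration cannot proceed
    return resolvable(ns_host_zone[zone], ns_host_zone, glue, visiting | {zone})

ns_host_zone = {"by": "becloudby.com", "becloudby.com": "hoster.by",
                "hoster.by": "by"}

# Without glue for the u[12].hoster.by names the chain loops and fails...
assert not resolvable("becloudby.com", ns_host_zone, glue=set())
# ...but glue addresses for the hoster.by names break the loop.
assert resolvable("becloudby.com", ns_host_zone, glue={"hoster.by"})
```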

The root servers, however, supply glue that seeds the initial cache, so it works at least somehow:

; <<>> DiG 9.18.14 <<>> +norec -4 @a.root-servers.net by.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47737
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 5, ADDITIONAL: 10

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;by.				IN	A

;; AUTHORITY SECTION:
by.			172800	IN	NS	dns2.tld.becloudby.com.
by.			172800	IN	NS	dns1.tld.becloudby.com.
by.			172800	IN	NS	dns4.tld.becloudby.com.
by.			172800	IN	NS	dns3.tld.becloudby.com.
by.			172800	IN	NS	dns7.tld.becloudby.com.

;; ADDITIONAL SECTION:
dns2.tld.becloudby.com.	172800	IN	A	93.125.25.73
dns2.tld.becloudby.com.	172800	IN	AAAA	2a00:c827:a:3::2
dns1.tld.becloudby.com.	172800	IN	A	93.125.25.72
dns1.tld.becloudby.com.	172800	IN	AAAA	2a00:c827:a:2::2
dns4.tld.becloudby.com.	172800	IN	A	31.44.1.137
dns4.tld.becloudby.com.	172800	IN	AAAA	2a0e:b81:8001:1001::2
dns3.tld.becloudby.com.	172800	IN	A	185.98.83.4
dns3.tld.becloudby.com.	172800	IN	AAAA	2a01:ba80:e:c:1::4c
dns7.tld.becloudby.com.	172800	IN	A	31.44.5.245

I think the fix should be made either at the "by." nameservers or at the "hoster.by." nameservers. Either one should work.

Comment 2 Petr Menšík 2023-05-04 16:20:56 UTC
I have written mail to the .by domain holder. The becloudby.com zone nameservers have been changed.

$ dig -t ns becloud.com

; <<>> DiG 9.16.23-RH <<>> -t ns becloud.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12664
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: f729087332ca9d99010000006453d878302b7008fb5718a5 (good)
;; QUESTION SECTION:
;becloud.com.			IN	NS

;; ANSWER SECTION:
becloud.com.		86259	IN	NS	ns1.dan.com.
becloud.com.		86259	IN	NS	ns2.dan.com.

;; ADDITIONAL SECTION:
ns1.dan.com.		172658	IN	A	97.74.98.67
ns2.dan.com.		172658	IN	A	173.201.66.67
ns1.dan.com.		172658	IN	AAAA	2603:5:2125::43
ns2.dan.com.		172658	IN	AAAA	2603:5:2225::43

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu May 04 16:08:24 UTC 2023
;; MSG SIZE  rcvd: 199

$ dig -t ns dan.com

; <<>> DiG 9.16.23-RH <<>> -t ns dan.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13649
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 15df3393040bbcc5010000006453d90f1d07840733cc1ec4 (good)
;; QUESTION SECTION:
;dan.com.			IN	NS

;; ANSWER SECTION:
dan.com.		3600	IN	NS	a1-245.akam.net.
dan.com.		3600	IN	NS	a6-66.akam.net.
dan.com.		3600	IN	NS	a9-67.akam.net.
dan.com.		3600	IN	NS	a11-64.akam.net.
dan.com.		3600	IN	NS	a20-65.akam.net.
dan.com.		3600	IN	NS	a8-67.akam.net.

;; ADDITIONAL SECTION:
a9-67.akam.net.		172800	IN	A	184.85.248.67

;; Query time: 793 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu May 04 16:10:55 UTC 2023
;; MSG SIZE  rcvd: 211

The loop seems gone.

$ dig -4 +nssearch by.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 31.44.5.245 in 76 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 31.44.1.137 in 76 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 93.125.25.73 in 122 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 93.125.25.72 in 123 ms.
SOA dns1.tld.becloud.by. support.becloud.by. 2305041610 3600 600 604800 3600 from server 185.98.83.4 in 129 ms.

The SOA record of the by. zone still points to dns1.tld.becloud.by., but that name does not exist. That is only a minor issue.
After flushing the cache for by., becloud.com or becloud.by with rndc flushtree becloudby.com., it works again, so it seems this was fixed at the authoritative servers. Flush the becloudby.com zone to refresh that data; once it is refreshed, it should work.

Comment 3 Kseniya 2023-06-16 08:31:01 UTC
Unfortunately that doesn't work, and the issue was reproduced again.

Cache:
; glue
by.                     772509  NS      dns1.tld.becloudby.com.
                        772509  NS      dns7.tld.becloudby.com.
                        772509  NS      dns2.tld.becloudby.com.
                        772509  NS      dns3.tld.becloudby.com.
                        772509  NS      dns4.tld.becloudby.com.

becloudby.com.          772509  NS      u1.hoster.by.
                        772509  NS      u2.hoster.by.

I don't think anything was changed by the .by zone maintainers:

$ dig -t ns becloudby.com

; <<>> DiG 9.18.14 <<>> -t ns becloudby.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43478
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;becloudby.com.			IN	NS

;; ANSWER SECTION:
becloudby.com.		600	IN	NS	u2.hoster.by.
becloudby.com.		600	IN	NS	u1.hoster.by.

;; Query time: 84 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Fri Jun 16 11:28:28 +03 2023
;; MSG SIZE  rcvd: 85


I understand that it is most probably a zone misconfiguration; however, the problem is reproduced only with the bind nameserver. Elsewhere these names are resolved without any problems.

Comment 5 Kseniya 2023-07-11 10:26:51 UTC
The same issue occurs with bind9.16-9.16.23-0.12.el8.x86_64, while at the same time the .by zone is reachable elsewhere. rndc flush fixes the issue.

Comment 6 Petr Menšík 2023-07-17 10:33:48 UTC
I do not have a long-running iterative server where I could try to reproduce this issue.

But as far as I can tell, those nameservers seem to be somewhat unstable. Could that be responsible for these issues?

$ dig -4t NS +norec hoster.by. @dns1.hoster.by.
;; communications error to 93.125.31.240#53: timed out

; <<>> DiG 9.18.16 <<>> -4t NS +norec hoster.by. @dns1.hoster.by.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45919
;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;hoster.by.			IN	NS

;; ANSWER SECTION:
hoster.by.		3600	IN	NS	dns2.hoster.by.
hoster.by.		3600	IN	NS	dns1.hoster.by.

;; ADDITIONAL SECTION:
dns2.hoster.by.		3600	IN	AAAA	2a0a:7d80:3:1::7
dns1.hoster.by.		3600	IN	AAAA	2a0a:7d80:1:1::5:0
dns1.hoster.by.		3600	IN	A	93.125.31.240
dns2.hoster.by.		3600	IN	A	185.179.82.103

;; Query time: 62 msec
;; SERVER: 93.125.31.240#53(dns1.hoster.by.) (UDP)
;; WHEN: Mon Jul 17 10:54:49 CEST 2023
;; MSG SIZE  rcvd: 164

Comment 7 Petr Menšík 2023-07-17 10:48:52 UTC
I have tried playing with it on one of my instances. At debug level 2, it shows at least that my instance has connectivity problems. It is based in the US.

2023-07-17T10:31:21.238Z fetch: by/NS
2023-07-17T10:31:21.238Z fetch: dns1.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns2.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns3.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns4.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: dns7.tld.becloudby.com/A
2023-07-17T10:31:21.238Z fetch: u1.hoster.by/A
2023-07-17T10:31:21.238Z fetch: u2.hoster.by/A
2023-07-17T10:31:30.662Z fetch: by/A
2023-07-17T10:31:31.239Z client @0x7f86b451acc8 127.0.0.1#59895 (by): query failed (timed out) for by/IN/NS at ../../../lib/ns/query.c:7355
2023-07-17T10:31:31.239Z client @0x7f86b452a038 127.0.0.1#54589 (by): query failed (failure) for by/IN/A at ../../../lib/ns/query.c:7355
2023-07-17T10:31:31.240Z fetch: becloudby.com/A
2023-07-17T10:31:31.240Z fetch: u1.hoster.by/A
2023-07-17T10:31:31.240Z fetch: u2.hoster.by/A
2023-07-17T10:31:31.240Z client @0x7f86b452a038 127.0.0.1#56214 (becloudby.com): query failed (failure) for becloudby.com/IN/A at ../../../lib/ns/query.c:7355
2023-07-17T10:32:49.735Z fetch: by/A
2023-07-17T10:32:49.735Z fetch: dns1.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns2.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns3.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns4.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: dns7.tld.becloudby.com/A
2023-07-17T10:32:49.735Z fetch: u1.hoster.by/A
2023-07-17T10:32:49.735Z fetch: u2.hoster.by/A
2023-07-17T10:32:50.889Z fetch: becloudby.com/A

Results are unstable. It runs okay for some time, but then it does not. As soon as the hoster.by records become stale, it is not able to refresh them for some reason. When I query localhost with +norec, it gives me an answer, but with recursion it just SERVFAILs.
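The "runs okay for some time" pattern is consistent with the NS records outliving the glue addresses in cache: once the addresses expire, only the looped NS names remain. A minimal sketch of that timeline (TTL values approximate the cache dumps earlier in this report):

```python
# Remaining TTLs, in seconds, as a (name, rrtype) -> ttl map.
cache = {
    ("by", "NS"): 772509,
    ("becloudby.com", "NS"): 772509,
    ("dns1.tld.becloudby.com", "A"): 771792,
    ("u1.hoster.by", "A"): 607143,  # glue address: the shortest-lived entry
}

def alive(cache, elapsed):
    """Records still cached after `elapsed` seconds."""
    return {rr for rr, ttl in cache.items() if ttl > elapsed}

early = alive(cache, 600_000)  # glue address still cached: resolution works
late = alive(cache, 772_000)   # only NS names left: the resolver is stuck
assert ("u1.hoster.by", "A") in early
assert ("u1.hoster.by", "A") not in late and ("by", "NS") in late
```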

Comment 8 Petr Menšík 2023-07-17 11:55:17 UTC
I have found this problem discussed on upstream list last year:
https://lists.isc.org/pipermail/bind-users/2022-March/105885.html

It was forwarded to the dns-operations list as well.
https://lists.dns-oarc.net/pipermail/dns-operations/2022-January/021501.html

I think they agree that the fix should be made in the TLD configuration, not in bind. But I admit unbound seems to handle it better.

Comment 9 Petr Menšík 2023-07-17 12:28:52 UTC
It seems the Fedora version, 9.18, handles it better.
On c9s I was able to make it stable only with this workaround:

zone "by" IN {
        type static-stub;
        server-addresses {
                2a00:c827:a:2::2;       #dns1.tld.becloudby.com.        169007  IN      AAAA    
                2a00:c827:a:3::2;       #dns2.tld.becloudby.com.        169007  IN      AAAA    
                2a01:ba80:e:c:1::4c;    #dns3.tld.becloudby.com.        169007  IN      AAAA    
                2a0e:b81:8001:1001::2;  #dns4.tld.becloudby.com.        169007  IN      AAAA    
                93.125.25.72;    #dns1.tld.becloudby.com.       169007  IN      A       
                93.125.25.73;    #dns2.tld.becloudby.com.       169007  IN      A       
                185.98.83.4;     #dns3.tld.becloudby.com.       169007  IN      A       
                31.44.1.137;     #dns4.tld.becloudby.com.       169007  IN      A       
                31.44.5.245;     #dns7.tld.becloudby.com.       169007  IN      A
        };
};
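The clause above pins the by. servers by address. If you need to regenerate such a clause after the addresses change, a small helper can render it from a list of server addresses (the zone name and addresses below come from this report; the helper itself is hypothetical string formatting, not part of BIND):

```python
def static_stub_zone(zone: str, addresses: list) -> str:
    """Render a BIND static-stub zone clause pinning the zone's servers."""
    lines = [f'zone "{zone}" IN {{',
             "        type static-stub;",
             "        server-addresses {"]
    lines += [f"                {addr};" for addr in addresses]
    lines += ["        };", "};"]
    return "\n".join(lines)

print(static_stub_zone("by", ["93.125.25.72", "2a00:c827:a:2::2"]))
```

Remember that the pinned addresses go stale if the TLD operators renumber their servers.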

Unless this is solved by a rebase, I do not think such a change should be backported. This should be fixed at the TLD by not using such a crazy configuration.