Bug 1882209
Summary: | [ BareMetal IPI ] local coredns resolution not working | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Anil Dhingra <adhingra> |
Component: | Machine Config Operator | Assignee: | Ben Nemec <bnemec> |
Status: | CLOSED ERRATA | QA Contact: | Victor Voronkov <vvoronko> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.5 | CC: | adhingra, bnemec, bward, mburman, mkrejci, vvoronko, yboaron |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | 4.7.0 | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | Telco:Deployment, Squad:Networking | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: CoreDNS forward plugin randomly distributes queries to all configured DNS servers. This is different from the behavior of the system resolver.
Consequence: resolv.conf configurations that work on their own may cause random resolution failures in coredns because queries are sent to non-functional DNS servers.
Fix: Set forward plugin to use the sequential method, which matches the system resolver.
Result: DNS server configurations in coredns will work the same as in the system resolv.conf.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-02-24 15:19:20 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1896384 |
Description
Anil Dhingra
2020-09-24 04:26:10 UTC
/etc/coredns/Corefile

[kni@provision ~]$ oc rsh -n openshift-kni-infra -c coredns coredns-master0.ocp4.example.com
sh-4.2# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.2 0.2 735104 47216 ? Ssl Sep17 20:47 /usr/bin/coredns --conf /etc/coredns/Corefile
root 32 0.0 0.0 11836 2812 pts/0 Ss 10:36 0:00 /bin/sh
root 38 0.0 0.0 51768 3428 pts/0 R+ 10:36 0:00 ps aux
sh-4.2# vi /etc/coredns/Corefile
sh-4.2# cat /etc/coredns/Corefile
. {
    errors
    health :18080
    mdns ocp4.example.com 0 ocp4 192.168.7.21
    forward . 192.168.7.77 172.22.0.1
    cache 30
    reload
    template IN A ocp4.example.com {
        match .*.apps.ocp4.example.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.7.41"
        fallthrough
    }
    template IN AAAA ocp4.example.com {
        match .*.apps.ocp4.example.com
        fallthrough
    }
    template IN A ocp4.example.com {
        match api.ocp4.example.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.7.40"
        fallthrough
    }
    template IN AAAA ocp4.example.com {
        match api.ocp4.example.com
        fallthrough
    }
    template IN A ocp4.example.com {
        match api-int.ocp4.example.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.7.40"
        fallthrough
    }
    template IN AAAA ocp4.example.com {
        match api-int.ocp4.example.com
        fallthrough
    }
}
sh-4.2# exit

nslookup has some odd behaviors due to its custom resolver logic, so I prefer to use dig for DNS debugging. I have a suspicion about what might be going on here, which is that we could be having issues with the way the coredns forward plugin works. Can you run the following commands and let me know the results? Thanks.

dig @192.168.7.21 github.com
dig @192.168.7.77 github.com
dig @172.22.0.1 github.com

[root@master0 ~]# dig @192.168.7.21 github.com
; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.21 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 54068
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: df0cc86fdcf447c2 (echoed)
;; QUESTION SECTION:
;github.com. IN A
;; Query time: 2 msec
;; SERVER: 192.168.7.21#53(192.168.7.21)
;; WHEN: Sat Sep 26 03:53:00 UTC 2020
;; MSG SIZE rcvd: 51
[root@master0 ~]#

=====

[root@master0 ~]# dig @192.168.7.77 github.com
; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.77 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35113
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 11
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: cfa9c0a9fcee50d6b98746835f6ebb4809279aadadf6fe4a (good)
;; QUESTION SECTION:
;github.com. IN A
;; ANSWER SECTION:
github.com. 18 IN A 140.82.113.4
;; AUTHORITY SECTION:
github.com. 198 IN NS dns4.p08.nsone.net.
github.com. 198 IN NS ns-520.awsdns-01.net.
github.com. 198 IN NS dns1.p08.nsone.net.
github.com. 198 IN NS dns3.p08.nsone.net.
github.com. 198 IN NS ns-1707.awsdns-21.co.uk.
github.com. 198 IN NS ns-1283.awsdns-32.org.
github.com. 198 IN NS ns-421.awsdns-52.com.
github.com. 198 IN NS dns2.p08.nsone.net.
;; ADDITIONAL SECTION:
ns-1707.awsdns-21.co.uk. 3858 IN A 205.251.198.171
dns1.p08.nsone.net. 3677 IN A 198.51.44.8
dns3.p08.nsone.net. 3677 IN A 198.51.44.72
dns4.p08.nsone.net. 3677 IN A 198.51.45.72
ns-421.awsdns-52.com. 58485 IN A 205.251.193.165
dns2.p08.nsone.net. 3677 IN A 198.51.45.8
ns-1283.awsdns-32.org. 3567 IN A 205.251.197.3
ns-520.awsdns-01.net. 2777 IN AAAA 2600:9000:5302:800::1
ns-421.awsdns-52.com. 58485 IN AAAA 2600:9000:5301:a500::1
ns-1283.awsdns-32.org. 3567 IN AAAA 2600:9000:5305:300::1
;; Query time: 1 msec
;; SERVER: 192.168.7.77#53(192.168.7.77)
;; WHEN: Sat Sep 26 03:53:44 UTC 2020
;; MSG SIZE rcvd: 502
[root@master0 ~]#

====

[root@master0 ~]# dig @172.22.0.1 github.com
; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @172.22.0.1 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 7273
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;github.com. IN A
;; Query time: 1 msec
;; SERVER: 172.22.0.1#53(172.22.0.1)
;; WHEN: Sat Sep 26 03:53:53 UTC 2020
;; MSG SIZE rcvd: 28
[root@master0 ~]#

After multiple tries, got a response once:

[root@master0 ~]# dig @192.168.7.21 github.com
; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.21 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 45904
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: d5926a1e3ec9da83 (echoed)
;; QUESTION SECTION:
;github.com. IN A
;; Query time: 10 msec
;; SERVER: 192.168.7.21#53(192.168.7.21)
;; WHEN: Sat Sep 26 03:58:13 UTC 2020
;; MSG SIZE rcvd: 51

worked once after multiple tries

[root@master0 ~]# dig @192.168.7.21 github.com
; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.21 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47803
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 11
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 46ad167f13f99fdf1675b7a95f6ebc57e6b3abc54ea01c91 (good)
;; QUESTION SECTION:
;github.com. IN A
;; ANSWER SECTION:
github.com. 30 IN A 140.82.114.4
;; AUTHORITY SECTION:
github.com. 30 IN NS ns-1283.awsdns-32.org.
github.com. 30 IN NS ns-421.awsdns-52.com.
github.com. 30 IN NS dns3.p08.nsone.net.
github.com. 30 IN NS ns-520.awsdns-01.net.
github.com. 30 IN NS dns4.p08.nsone.net.
github.com. 30 IN NS dns1.p08.nsone.net.
github.com. 30 IN NS dns2.p08.nsone.net.
github.com. 30 IN NS ns-1707.awsdns-21.co.uk.
;; ADDITIONAL SECTION:
dns1.p08.nsone.net. 30 IN A 198.51.44.8
ns-1283.awsdns-32.org. 30 IN A 205.251.197.3
dns4.p08.nsone.net. 30 IN A 198.51.45.72
dns2.p08.nsone.net. 30 IN A 198.51.45.8
ns-1707.awsdns-21.co.uk. 30 IN A 205.251.198.171
ns-421.awsdns-52.com. 30 IN A 205.251.193.165
dns3.p08.nsone.net. 30 IN A 198.51.44.72
ns-1283.awsdns-32.org. 30 IN AAAA 2600:9000:5305:300::1
ns-520.awsdns-01.net. 30 IN AAAA 2600:9000:5302:800::1
ns-421.awsdns-52.com. 30 IN AAAA 2600:9000:5301:a500::1
;; Query time: 2 msec
;; SERVER: 192.168.7.21#53(192.168.7.21)
;; WHEN: Sat Sep 26 03:58:15 UTC 2020
;; MSG SIZE rcvd: 834
[root@master0 ~]#

Okay, that confirms my suspicions. One of the other DNS servers is not resolving external records. The way we configure the forward plugin in coredns results in it forwarding requests to both of the other configured servers, so the reason the problem is intermittent is that it depends on which server the request gets routed to.

There are two possible solutions in this case:

1) Remove the bad DNS server from the node's configuration. It should be coming from DHCP in IPI, so that would require a change in the external DHCP configuration.

2) Switch the forward plugin to sequential so it will always use the first server when it is up. I would argue that having invalid DNS servers in resolv.conf is an error in itself, but this would make coredns behave more like regular resolv.conf server selection.

local nodes were added as nameserver in resolv.conf [ via /etc/NetworkManager/dispatcher.d/30-resolv-prepender ] so manual removal will recreate those entries as managed by operator.

Why we need to configure 2 DNS as a process of IPI & here local node ip in resolv.conf creating issue & external DNS at (.77) is working perfectly

2) Switch the forward plugin to sequential so it will always use the first server when it is up

to my understanding coredns default config is sequential & [ this script - /etc/NetworkManager/dispatcher.d/30-resolv-prepender ] always add local node ip as first server.

(In reply to Anil Dhingra from comment #7)
>
> local nodes were added as nameserver in resolv.conf [ via
> /etc/NetworkManager/dispatcher.d/30-resolv-prepender ] so manual removal
> will recreate those entries as managed by operator .
>
> Why we need to configure 2 DNS as a process of IPI & here local node ip in
> resolv.conf creating issue & external DNS at (.77) is working perfectly

Because on baremetal we don't have a cloud provider to serve the internal DNS records needed by OpenShift. To get around that, we run our own DNS server on each node that contains the internal records.

Also note that while .77 may be working, the other server is not, so if .77 went down you'd still be broken. This is not a good way to configure a system.

>
> 2) Switch the forward plugin to sequential so it will always use the first
> server when it is up
>
> to my understanding coredns default config is sequential & [ this script -
> /etc/NetworkManager/dispatcher.d/30-resolv-prepender ] always add local node
> ip as first server.

In the coredns forward plugin it is not sequential by default:

"policy specifies the policy to use for selecting upstream servers. The default is random."

from https://coredns.io/plugins/forward/

Verified on 4.7.0-fc.5

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-fc.5 True False 2d19h Cluster version is 4.7.0-fc.5

Didn't manage to reproduce the original issue; verified that the coredns policy is updated to sequential, so both external and internal resolution work even when an invalid forward server is added, like 8.8.8.8:

[core@worker-0-0 ~]$ sudo cat /etc/coredns/Corefile
. {
    errors
    health :18080
    mdns ocp-edge-cluster-0.qe.lab.redhat.com 0 ocp-edge-cluster-0 fd2e:6f44:5dd8::122
    forward . fe80::5054:ff:fe4e:386b%br-ex fd2e:6f44:5dd8::1 {
        policy sequential
    }
    ...

[core@worker-0-0 ~]$ host tst.apps.ocp-edge-cluster-0.qe.lab.redhat.com
tst.apps.ocp-edge-cluster-0.qe.lab.redhat.com has IPv6 address fd2e:6f44:5dd8::10
[core@worker-0-0 ~]$ host github.com
github.com has address 140.82.112.4
github.com mail is handled by 10 alt3.aspmx.l.google.com.
...

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
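
The intermittent failures discussed in this bug can be illustrated with a small simulation. This is only a sketch under stated assumptions, not CoreDNS code: the two upstreams and their response codes are taken from the dig results in this report, both servers are assumed to be reachable, and a REFUSED answer is assumed to be handed back to the client rather than retried on the other server. The names `UPSTREAMS`, `pick_upstream` and `simulate` are hypothetical.

```python
import random
from collections import Counter

# Upstreams from the "forward . 192.168.7.77 172.22.0.1" line in this bug,
# paired with the status each returned to a direct dig query (assumed fixed).
UPSTREAMS = [
    ("192.168.7.77", "NOERROR"),
    ("172.22.0.1", "REFUSED"),
]


def pick_upstream(servers, policy, rng):
    """Pick one upstream the way the named policy would, in this toy model."""
    if policy == "sequential":
        # Always the first configured server (assumed to be up).
        return servers[0]
    if policy == "random":
        # Any configured server, chosen uniformly at random.
        return rng.choice(servers)
    raise ValueError(f"unknown policy: {policy}")


def simulate(policy, queries=1000, seed=0):
    """Count the response codes seen over a number of forwarded queries."""
    rng = random.Random(seed)
    results = Counter()
    for _ in range(queries):
        _address, status = pick_upstream(UPSTREAMS, policy, rng)
        results[status] += 1
    return results


if __name__ == "__main__":
    for policy in ("random", "sequential"):
        print(policy, dict(simulate(policy)))
```

In this model the random policy sends a share of the queries to the server that answers REFUSED, while the sequential policy only ever uses the first server, which is the behavior the fix relies on.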