Description of problem:

Baremetal IPI deploys local CoreDNS pods that break name resolution during application deployment. In addition to the CoreDNS that runs with OCP, IPI provides a local DNS server on every node and adds the node's own IP as a nameserver in resolv.conf (via /etc/NetworkManager/dispatcher.d/30-resolv-prepender), which is causing issues.

openshift-kni-infra   coredns-master0.ocp4.example.com   2/2   Running   0   14d   192.168.7.21   master0.ocp4.example.com
openshift-kni-infra   coredns-master1.ocp4.example.com   2/2   Running   0   14d   192.168.7.22   master1.ocp4.example.com
openshift-kni-infra   coredns-master2.ocp4.example.com   2/2   Running   0   14d   192.168.7.23   master2.ocp4.example.com
openshift-kni-infra   coredns-worker0.ocp4.example.com   2/2   Running   0   14d   192.168.7.31   worker0.ocp4.example.com
openshift-kni-infra   coredns-worker1.ocp4.example.com   2/2   Running   0   14d   192.168.7.32   worker1.ocp4.example.com
openshift-kni-infra   coredns-worker2.ocp4.example.com   2/2   Running   0   10d   192.168.7.33   worker2.ocp4.example.com

[kni@provision ~]$ oc rsh -n openshift-kni-infra -c coredns coredns-master0.ocp4.example.com
sh-4.2# cat /etc/resolv.conf
search ocp4.example.com
nameserver 192.168.7.21
nameserver 192.168.7.77
nameserver 172.22.0.1

On physical nodes:

[root@master0 ~]# cat /etc/resolv.conf
# Generated by KNI resolv prepender NM dispatcher script
search ocp4.example.com
nameserver 192.168.7.21
nameserver 192.168.7.77
nameserver 172.22.0.1

So any external domain query that goes to port 5353 is forwarded to the local CoreDNS, which forwards it to the external DNS, and this breaks about 90% of the time:

[root@master0 ~]# ./test1.sh
+ nslookup -port=5353 github.com. 10.128.0.6
Server:         10.128.0.6
Address:        10.128.0.6#5353

Non-authoritative answer:
Name:   github.com
Address: 140.82.114.4
** server can't find github.com: REFUSED

+ nslookup -port=5353 github.com. 10.128.2.3
Server:         10.128.2.3
Address:        10.128.2.3#5353

** server can't find github.com: REFUSED

+ nslookup -port=5353 github.com. 10.130.0.8     << good
Server:         10.130.0.8
Address:        10.130.0.8#5353

Non-authoritative answer:
Name:   github.com
Address: 140.82.114.4

+ nslookup -port=5353 github.com. 10.129.2.3
Server:         10.129.2.3
Address:        10.129.2.3#5353

** server can't find github.com: REFUSED

+ nslookup -port=5353 github.com. 10.129.0.20
Server:         10.129.0.20
Address:        10.129.0.20#5353

Non-authoritative answer:
Name:   github.com
Address: 140.82.114.4
** server can't find github.com: REFUSED

+ nslookup -port=5353 github.com. 10.131.0.3
Server:         10.131.0.3
Address:        10.131.0.3#5353

** server can't find github.com: REFUSED

It looks like the request is sometimes sent to the local CoreDNS and sometimes to the external DNS; only one request above (10.130.0.8) returned a clean answer.

[kni@provision ~]$ oc -n openshift-dns get pods -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP            NODE
dns-default-2ct94   3/3     Running   0          14d   10.128.0.6    master2.ocp4.example.com
dns-default-bht2v   3/3     Running   0          14d   10.128.2.3    worker0.ocp4.example.com
dns-default-fz2s5   3/3     Running   0          14d   10.130.0.8    master1.ocp4.example.com
dns-default-glwjq   3/3     Running   0          10d   10.129.2.3    worker2.ocp4.example.com
dns-default-v5s2h   3/3     Running   0          14d   10.129.0.20   master0.ocp4.example.com
dns-default-vzd4z   3/3     Running   0          14d   10.131.0.3    worker1.ocp4.example.com

How reproducible: 100%

Additional info:

During app deployment the failure looks like this:

fatal: unable to access 'https://github.com/sclorg/ruby-ex.git/': Could not resolve host: github.com; Unknown error

[root@master0 ~]# nslookup github.com.
Server:         192.168.7.21
Address:        192.168.7.21#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.113.4
** server can't find github.com: REFUSED

If I remove the local node's nameserver entry from resolv.conf, resolution is fine and queries go directly to my DNS server (192.168.7.77):

[root@master0 ~]# nslookup github.com.
Server:         192.168.7.77
Address:        192.168.7.77#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.113.4
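For readers unfamiliar with the prepender, the effect it has on resolv.conf can be sketched roughly as follows. This is not the actual 30-resolv-prepender script (which is a NetworkManager dispatcher script not reproduced in this report); the function name `prepend_nameserver` and the DHCP-provided starting content are assumptions made only for illustration.

```python
def prepend_nameserver(resolv_conf: str, node_ip: str) -> str:
    """Return resolv.conf text with node_ip listed as the first nameserver."""
    lines = resolv_conf.splitlines()
    # Drop an existing entry for node_ip so it is never listed twice.
    lines = [l for l in lines if l.split() != ["nameserver", node_ip]]
    # Insert before the first remaining nameserver line (or at the end).
    idx = next(
        (i for i, l in enumerate(lines) if l.startswith("nameserver")),
        len(lines),
    )
    lines.insert(idx, f"nameserver {node_ip}")
    return "\n".join(lines) + "\n"


# Assumed content as it might arrive from DHCP, before the prepender runs.
dhcp_resolv_conf = (
    "search ocp4.example.com\n"
    "nameserver 192.168.7.77\n"
    "nameserver 172.22.0.1\n"
)

result = prepend_nameserver(dhcp_resolv_conf, "192.168.7.21")
print(result, end="")
```

Because the entry is re-added whenever the script runs, removing it by hand is only a temporary workaround.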
/etc/coredns/Corefile

[kni@provision ~]$ oc rsh -n openshift-kni-infra -c coredns coredns-master0.ocp4.example.com
sh-4.2# ps aux
USER   PID %CPU %MEM    VSZ   RSS TTY    STAT START  TIME COMMAND
root     1  0.2  0.2 735104 47216 ?      Ssl  Sep17 20:47 /usr/bin/coredns --conf /etc/coredns/Corefile
root    32  0.0  0.0  11836  2812 pts/0  Ss   10:36  0:00 /bin/sh
root    38  0.0  0.0  51768  3428 pts/0  R+   10:36  0:00 ps aux
sh-4.2# vi /etc/coredns/Corefile
sh-4.2# cat /etc/coredns/Corefile
. {
    errors
    health :18080
    mdns ocp4.example.com 0 ocp4 192.168.7.21
    forward . 192.168.7.77 172.22.0.1
    cache 30
    reload
    template IN A ocp4.example.com {
        match .*.apps.ocp4.example.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.7.41"
        fallthrough
    }
    template IN AAAA ocp4.example.com {
        match .*.apps.ocp4.example.com
        fallthrough
    }
    template IN A ocp4.example.com {
        match api.ocp4.example.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.7.40"
        fallthrough
    }
    template IN AAAA ocp4.example.com {
        match api.ocp4.example.com
        fallthrough
    }
    template IN A ocp4.example.com {
        match api-int.ocp4.example.com
        answer "{{ .Name }} 60 in {{ .Type }} 192.168.7.40"
        fallthrough
    }
    template IN AAAA ocp4.example.com {
        match api-int.ocp4.example.com
        fallthrough
    }
}
sh-4.2# exit
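A rough model of how this Corefile splits A queries between the template blocks and the forward plugin may help when reading the rest of the thread. It is a simplified sketch, not CoreDNS code: it assumes unanchored regex matching, ignores the mdns and cache plugins, and `resolve_a` is a hypothetical helper.

```python
import re

# (regex, answer) pairs copied from the `template IN A` blocks above.
A_TEMPLATES = [
    (r".*.apps.ocp4.example.com", "192.168.7.41"),
    (r"api.ocp4.example.com", "192.168.7.40"),
    (r"api-int.ocp4.example.com", "192.168.7.40"),
]
ZONE = "ocp4.example.com."
FORWARDERS = ["192.168.7.77", "172.22.0.1"]  # from the `forward .` line


def resolve_a(qname: str):
    """Return ('template', ip) or ('forward', upstreams) for an A query."""
    in_zone = qname == ZONE or qname.endswith("." + ZONE)
    if in_zone:
        for pattern, ip in A_TEMPLATES:
            if re.search(pattern, qname):
                return ("template", ip)
    # Anything else (simplified: mdns is ignored) goes to the upstreams.
    return ("forward", FORWARDERS)


for name in ("tst.apps.ocp4.example.com.", "api-int.ocp4.example.com.", "github.com."):
    print(name, resolve_a(name))
```

Under these assumptions, cluster-internal names are answered locally, while an external name such as github.com depends entirely on which forwarder is asked.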
nslookup has some odd behaviors due to its custom resolver logic, so I prefer to use dig for DNS debugging. I have a suspicion about what might be going on here, which is that we could be having issues with the way the coredns forward plugin works. Can you run the following commands and let me know the results? Thanks.

dig @192.168.7.21 github.com
dig @192.168.7.77 github.com
dig @172.22.0.1 github.com
[root@master0 ~]# dig @192.168.7.21 github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.21 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 54068
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: df0cc86fdcf447c2 (echoed)
;; QUESTION SECTION:
;github.com.                    IN      A

;; Query time: 2 msec
;; SERVER: 192.168.7.21#53(192.168.7.21)
;; WHEN: Sat Sep 26 03:53:00 UTC 2020
;; MSG SIZE  rcvd: 51

[root@master0 ~]#

=====

[root@master0 ~]# dig @192.168.7.77 github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.77 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35113
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 11

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: cfa9c0a9fcee50d6b98746835f6ebb4809279aadadf6fe4a (good)
;; QUESTION SECTION:
;github.com.                    IN      A

;; ANSWER SECTION:
github.com.             18      IN      A       140.82.113.4

;; AUTHORITY SECTION:
github.com.             198     IN      NS      dns4.p08.nsone.net.
github.com.             198     IN      NS      ns-520.awsdns-01.net.
github.com.             198     IN      NS      dns1.p08.nsone.net.
github.com.             198     IN      NS      dns3.p08.nsone.net.
github.com.             198     IN      NS      ns-1707.awsdns-21.co.uk.
github.com.             198     IN      NS      ns-1283.awsdns-32.org.
github.com.             198     IN      NS      ns-421.awsdns-52.com.
github.com.             198     IN      NS      dns2.p08.nsone.net.

;; ADDITIONAL SECTION:
ns-1707.awsdns-21.co.uk. 3858   IN      A       205.251.198.171
dns1.p08.nsone.net.     3677    IN      A       198.51.44.8
dns3.p08.nsone.net.     3677    IN      A       198.51.44.72
dns4.p08.nsone.net.     3677    IN      A       198.51.45.72
ns-421.awsdns-52.com.   58485   IN      A       205.251.193.165
dns2.p08.nsone.net.     3677    IN      A       198.51.45.8
ns-1283.awsdns-32.org.  3567    IN      A       205.251.197.3
ns-520.awsdns-01.net.   2777    IN      AAAA    2600:9000:5302:800::1
ns-421.awsdns-52.com.   58485   IN      AAAA    2600:9000:5301:a500::1
ns-1283.awsdns-32.org.  3567    IN      AAAA    2600:9000:5305:300::1

;; Query time: 1 msec
;; SERVER: 192.168.7.77#53(192.168.7.77)
;; WHEN: Sat Sep 26 03:53:44 UTC 2020
;; MSG SIZE  rcvd: 502

[root@master0 ~]#

====

[root@master0 ~]# dig @172.22.0.1 github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @172.22.0.1 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 7273
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;github.com.                    IN      A

;; Query time: 1 msec
;; SERVER: 172.22.0.1#53(172.22.0.1)
;; WHEN: Sat Sep 26 03:53:53 UTC 2020
;; MSG SIZE  rcvd: 28

[root@master0 ~]#
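When comparing several servers, the only field that matters at first glance is the status in the dig header line. A minimal sketch for pulling it out of saved output follows; the header lines are copied from the three results above, while `dig_status` is a hypothetical helper name.

```python
import re

# Matches the header line that dig prints for every response.
HEADER_RE = re.compile(r"->>HEADER<<- opcode: (\w+), status: (\w+), id: (\d+)")


def dig_status(output: str):
    """Return the response status (e.g. NOERROR, REFUSED) or None if absent."""
    match = HEADER_RE.search(output)
    return match.group(2) if match else None


# Header lines taken from the three dig runs in this comment.
samples = {
    "192.168.7.21": ";; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 54068",
    "192.168.7.77": ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35113",
    "172.22.0.1": ";; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 7273",
}

statuses = {server: dig_status(text) for server, text in samples.items()}
for server, status in statuses.items():
    print(f"{server}: {status}")
```

In this capture 192.168.7.77 answers normally, 172.22.0.1 refuses the query, and the node-local server at 192.168.7.21 also returned REFUSED on this attempt.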
After multiple tries, I got a successful response once. A failing attempt first:

[root@master0 ~]# dig @192.168.7.21 github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.21 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 45904
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: d5926a1e3ec9da83 (echoed)
;; QUESTION SECTION:
;github.com.                    IN      A

;; Query time: 10 msec
;; SERVER: 192.168.7.21#53(192.168.7.21)
;; WHEN: Sat Sep 26 03:58:13 UTC 2020
;; MSG SIZE  rcvd: 51

It worked once after multiple tries:

[root@master0 ~]# dig @192.168.7.21 github.com

; <<>> DiG 9.11.13-RedHat-9.11.13-5.el8_2 <<>> @192.168.7.21 github.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47803
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 11

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 46ad167f13f99fdf1675b7a95f6ebc57e6b3abc54ea01c91 (good)
;; QUESTION SECTION:
;github.com.                    IN      A

;; ANSWER SECTION:
github.com.             30      IN      A       140.82.114.4

;; AUTHORITY SECTION:
github.com.             30      IN      NS      ns-1283.awsdns-32.org.
github.com.             30      IN      NS      ns-421.awsdns-52.com.
github.com.             30      IN      NS      dns3.p08.nsone.net.
github.com.             30      IN      NS      ns-520.awsdns-01.net.
github.com.             30      IN      NS      dns4.p08.nsone.net.
github.com.             30      IN      NS      dns1.p08.nsone.net.
github.com.             30      IN      NS      dns2.p08.nsone.net.
github.com.             30      IN      NS      ns-1707.awsdns-21.co.uk.

;; ADDITIONAL SECTION:
dns1.p08.nsone.net.     30      IN      A       198.51.44.8
ns-1283.awsdns-32.org.  30      IN      A       205.251.197.3
dns4.p08.nsone.net.     30      IN      A       198.51.45.72
dns2.p08.nsone.net.     30      IN      A       198.51.45.8
ns-1707.awsdns-21.co.uk. 30     IN      A       205.251.198.171
ns-421.awsdns-52.com.   30      IN      A       205.251.193.165
dns3.p08.nsone.net.     30      IN      A       198.51.44.72
ns-1283.awsdns-32.org.  30      IN      AAAA    2600:9000:5305:300::1
ns-520.awsdns-01.net.   30      IN      AAAA    2600:9000:5302:800::1
ns-421.awsdns-52.com.   30      IN      AAAA    2600:9000:5301:a500::1

;; Query time: 2 msec
;; SERVER: 192.168.7.21#53(192.168.7.21)
;; WHEN: Sat Sep 26 03:58:15 UTC 2020
;; MSG SIZE  rcvd: 834

[root@master0 ~]#
Okay, that confirms my suspicions. One of the other DNS servers is not resolving external records. The way we configure the forward plugin in coredns results in it forwarding requests to both of the other configured servers, so the reason the problem is intermittent is that it depends on which server the request gets routed to.

There are two possible solutions in this case:

1) Remove the bad DNS server from the node's configuration. It should be coming from DHCP in IPI, so that would require a change in the external DHCP configuration.

2) Switch the forward plugin to sequential so it will always use the first server when it is up. I would argue that having invalid DNS servers in resolv.conf is an error in itself, but this would make coredns behave more like regular resolv.conf server selection.
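To illustrate why upstream selection makes the failure intermittent, here is a toy simulation of the two policies. It is a sketch under stated assumptions, not CoreDNS code: both upstreams are treated as reachable, a REFUSED reply is passed back to the client as-is, caching is ignored, and the `UPSTREAMS` table and helper names are made up for this example.

```python
import random

# Assumed model of the two forwarders from the Corefile in this report.
# True: the server resolves external names. False: it replies REFUSED.
UPSTREAMS = {"192.168.7.77": True, "172.22.0.1": False}


def pick_upstream(policy: str, rng: random.Random) -> str:
    """Choose an upstream according to a forward-style selection policy."""
    servers = list(UPSTREAMS)  # keeps the order written in the Corefile
    if policy == "sequential":
        return servers[0]  # always the first server while it is up
    if policy == "random":
        return rng.choice(servers)
    raise ValueError(f"unknown policy: {policy}")


def refused_rate(policy: str, queries: int = 1000, seed: int = 0) -> float:
    """Fraction of simulated queries that land on the refusing upstream."""
    rng = random.Random(seed)
    refused = sum(
        1 for _ in range(queries) if not UPSTREAMS[pick_upstream(policy, rng)]
    )
    return refused / queries


random_rate = refused_rate("random")
sequential_rate = refused_rate("sequential")
print(f"random: {random_rate:.0%} refused, sequential: {sequential_rate:.0%} refused")
```

In this model a random choice sends roughly half of the queries to the refusing server, while sequential selection never reaches it as long as the first server is up. The real failure rate also depends on caching and on what the client does with a REFUSED answer.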
The local node IPs were added as nameservers in resolv.conf (via /etc/NetworkManager/dispatcher.d/30-resolv-prepender), so if they are removed manually the entries will be recreated, since they are managed by the operator.

Why do we need to configure two DNS servers as part of IPI? Here the local node IP in resolv.conf is causing the issue, while the external DNS at .77 is working perfectly.

> 2) Switch the forward plugin to sequential so it will always use the first server when it is up

To my understanding, the coredns default config is sequential, and the script /etc/NetworkManager/dispatcher.d/30-resolv-prepender always adds the local node IP as the first server.
(In reply to Anil Dhingra from comment #7)
> local nodes were added as nameserver in resolv.conf [ via
> /etc/NetworkManager/dispatcher.d/30-resolv-prepender ] so manual removal
> will recreate those entries as managed by operator .
>
> Why we need to configure 2 DNS as a process of IPI & here local node ip in
> resolv.conf creating issue & external DNS at (.77) is working perfectly

Because on baremetal we don't have a cloud provider to serve the internal DNS records needed by OpenShift. To get around that, we run our own DNS server on each node that contains the internal records.

Also note that while .77 may be working, the other server is not, so if .77 went down you'd still be broken. This is not a good way to configure a system.

> 2) Switch the forward plugin to sequential so it will always use the first
> server when it is up
>
> to my understanding coredns default config is sequential & [ this script -
> /etc/NetworkManager/dispatcher.d/30-resolv-prepender ] always add local node
> ip as first server.

In the coredns forward plugin it is not sequential by default:

"policy specifies the policy to use for selecting upstream servers. The default is random."

from https://coredns.io/plugins/forward/
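In terms of the Corefile posted earlier in this bug, option 2 amounts to turning the one-line forward directive into a block that sets the policy explicitly. The sketch below only shows the shape of that change on a text fragment. It is not how the fix is delivered, the function name is hypothetical, and a hand-edited Corefile on a node may well be regenerated by the platform.

```python
def add_sequential_policy(corefile: str) -> str:
    """Rewrite one-line `forward` directives to use `policy sequential`."""
    out = []
    for line in corefile.splitlines():
        stripped = line.strip()
        # Only touch forward lines that do not already open a block.
        if stripped.startswith("forward ") and not stripped.endswith("{"):
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{line.rstrip()} {{")
            out.append(f"{indent}    policy sequential")
            out.append(f"{indent}}}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Fragment of the Corefile from this report (other plugins omitted).
original = (
    ". {\n"
    "    forward . 192.168.7.77 172.22.0.1\n"
    "    cache 30\n"
    "}\n"
)

patched = add_sequential_policy(original)
print(patched, end="")
```

With this policy the first listed upstream (192.168.7.77 here) is used whenever it is up, so the order of the servers becomes significant.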
Verified on 4.7.0-fc.5

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.5   True        False         2d19h   Cluster version is 4.7.0-fc.5

I did not manage to reproduce the original issue, but I verified that the coredns policy is now set to sequential, so both external and internal resolution work even when an invalid forward server is added, like 8.8.8.8:

[core@worker-0-0 ~]$ sudo cat /etc/coredns/Corefile
. {
    errors
    health :18080
    mdns ocp-edge-cluster-0.qe.lab.redhat.com 0 ocp-edge-cluster-0 fd2e:6f44:5dd8::122
    forward . fe80::5054:ff:fe4e:386b%br-ex fd2e:6f44:5dd8::1 {
        policy sequential
    }
...

[core@worker-0-0 ~]$ host tst.apps.ocp-edge-cluster-0.qe.lab.redhat.com
tst.apps.ocp-edge-cluster-0.qe.lab.redhat.com has IPv6 address fd2e:6f44:5dd8::10

[core@worker-0-0 ~]$ host github.com
github.com has address 140.82.112.4
github.com mail is handled by 10 alt3.aspmx.l.google.com.
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633