Description of problem:

A customer has two OpenShift 4.6.21 clusters, pre and prod, named in such a way that one cluster's domain is a subdomain of the other's (clustername.basedomain). When they try to import images from the pre cluster into the prod cluster, the pull fails with a "no such host" error.

To replicate the issue I created the following:

Type of install: RHV / IPI

DEV cluster:
  cluster name: pre
  basedomain: ocp4.testlab.local
  apps domain wildcard: *.apps.pre.ocp4.testlab.local

PROD cluster:
  cluster name: ocp4
  basedomain: testlab.local
  apps domain wildcard: *.apps.ocp4.testlab.local

The issue is that the CoreDNS static pods declare themselves owner of clustername.basedomain. PROD cluster Corefile:

. {
    errors
    health :18080
    mdns ocp4.testlab.local 0 ocp4 192.168.1.50
    forward . 192.168.1.191
    cache 30
    reload
    file /etc/coredns/node-dns-db ocp4.testlab.local
}

So the zone file contains:

$ORIGIN ocp4.testlab.local.
@ 3600 IN SOA host.ocp4.testlab.local. hostmaster (
        2017042752 ; serial
        7200       ; refresh (2 hours)
        3600       ; retry (1 hour)
        1209600    ; expire (2 weeks)
        3600       ; minimum (1 hour)
        )
api-int IN A 192.168.1.201
api     IN A 192.168.1.201
*.apps  IN A 192.168.1.200

Because of this configuration, this pod answers queries for ocp4.testlab.local as an authoritative nameserver and does not forward them to the upstream DNS server:

===
[root@ocp4-7vgpf-master-0 ~]# podman pull registry.apps.pre.ocp4.testlab.local/image
Trying to pull registry.apps.pre.ocp4.testlab.local/image...
Get "https://registry.apps.pre.ocp4.testlab.local/v2/": dial tcp: lookup registry.apps.pre.ocp4.testlab.local: no such host

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> registry.apps.pre.ocp4.testlab.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 31591
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: eced1033351d849b (echoed)
;; QUESTION SECTION:
;registry.apps.pre.ocp4.testlab.local. IN A

;; AUTHORITY SECTION:
ocp4.testlab.local. 30 IN SOA host.ocp4.testlab.local. hostmaster.ocp4.testlab.local. 2017042752 7200 3600 1209600 3600

;; Query time: 0 msec
;; SERVER: 192.168.1.50#53(192.168.1.50)
;; WHEN: Wed Apr 28 14:13:13 UTC 2021
;; MSG SIZE rcvd: 183
===

We can see that the query for registry.apps.pre.ocp4.testlab.local was never forwarded to the upstream DNS server.

I also see that the CoreDNS static pods are generated differently on vSphere/IPI with the same OCP 4.6.21 version. vSphere/IPI Corefile:

. {
    errors
    health :18080
    mdns ocp46ipi.rhlabs.local 0 ocp46ipi 172.20.1.50
    forward . 172.20.1.83
    cache 30
    reload
    hosts {
        172.20.1.160 api-int.ocp46ipi.rhlabs.local
        172.20.1.160 api.ocp46ipi.rhlabs.local
        fallthrough
    }
    template IN A ocp46ipi.rhlabs.local {
        match .*.apps.ocp46ipi.rhlabs.local
        answer "{{ .Name }} 60 in a 172.20.1.161"
        fallthrough
    }
}

With this configuration I don't see the problem:

===
cluster name: ocp46ipi
base domain: rhlabs.local
external registry url: registry.apps.pre.ocp46ipi.rhlabs.local
From a node:

[root@ocp46ipi-t46gj-worker-7sr4q /]# dig registry.apps.pre.ocp46ipi.rhlabs.local

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> registry.apps.pre.ocp46ipi.rhlabs.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37036
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 51d02173216c406c67699a4360897378d7dde66f392a29b2 (good)
;; QUESTION SECTION:
;registry.apps.pre.ocp46ipi.rhlabs.local. IN A

;; ANSWER SECTION:
registry.apps.pre.ocp46ipi.rhlabs.local. 30 IN A 172.20.1.83

;; AUTHORITY SECTION:
rhlabs.local. 30 IN NS ns.rhlabs.local.

;; ADDITIONAL SECTION:
ns.rhlabs.local. 30 IN A 172.20.1.83

;; Query time: 2 msec
;; SERVER: 172.20.1.50#53(172.20.1.50)
;; WHEN: Wed Apr 28 14:40:15 UTC 2021
;; MSG SIZE rcvd: 223
===

In this case the query is correctly forwarded to, and answered by, the upstream DNS server.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install an RHV/IPI 4.6.21 cluster.
2. Create an external DNS record whose name contains the cluster name and basedomain.
3. Try to resolve the name: it is answered by CoreDNS instead of being forwarded to the upstream DNS server.

Actual results:
The query is answered by CoreDNS as an authoritative name server.

Expected results:
The query should be forwarded to the upstream DNS server.

Additional info:
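To illustrate the collision mechanically (this helper is my own sketch, not code from either cluster): the authoritative zone owns every name whose trailing labels match the zone origin, so any name label-wise inside clustername.basedomain is answered locally and never forwarded:

```python
# Hypothetical helper: does `name` fall inside the zone `origin`?
# DNS zone ownership is decided on whole labels, not raw suffixes.
def in_zone(name: str, origin: str) -> bool:
    name_labels = name.rstrip(".").split(".")
    origin_labels = origin.rstrip(".").split(".")
    return name_labels[-len(origin_labels):] == origin_labels

# The pre cluster's apps wildcard sits inside the prod cluster's zone,
# so the prod CoreDNS answers NXDOMAIN for it instead of forwarding:
print(in_zone("registry.apps.pre.ocp4.testlab.local", "ocp4.testlab.local"))  # True
# A name outside that zone would still be forwarded upstream:
print(in_zone("registry.apps.other.testlab.local", "ocp4.testlab.local"))     # False
```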
I forgot to clarify that the issue only affects RHV/IPI: when I replicate the same scenario on a vSphere/IPI cluster, it works as intended. Both tests used the exact same version.
The network edge team doesn't manage the kni/coredns static pods. Given that this works on vSphere/IPI, it may be an installer issue or a KNI issue. Since the CoreDNS configurations differ between RHV and vSphere, I am moving this to the mDNS component.
As noted in the report, the RHV config just needs to switch to plugins that support fallthrough. Note that this has already been fixed in 4.8, because we consolidated the CoreDNS configs and the single config every platform now uses handles this correctly.
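For illustration only, an RHV Corefile patterned on the vSphere one above would replace the authoritative `file` zone with `hosts` and `template` blocks that fall through to `forward`. The IPs and names below are taken from the reproducer; this is a sketch, not the actual consolidated 4.8 config:

```
. {
    errors
    health :18080
    mdns ocp4.testlab.local 0 ocp4 192.168.1.50
    forward . 192.168.1.191
    cache 30
    reload
    hosts {
        192.168.1.201 api-int.ocp4.testlab.local
        192.168.1.201 api.ocp4.testlab.local
        fallthrough
    }
    template IN A ocp4.testlab.local {
        match .*.apps.ocp4.testlab.local
        answer "{{ .Name }} 60 in a 192.168.1.200"
        fallthrough
    }
}
```

With `fallthrough`, a query such as registry.apps.pre.ocp4.testlab.local that matches neither the hosts entries nor the template match pattern drops through to the `forward` plugin and reaches the upstream DNS server.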
Since this bug does not exist in 4.8, I will clone it to 4.7 and close this one. From there we can follow the normal backport process.