1962288 – error "no such host" when nodes pull images from a registry that is part of the cluster's own "cluster.basedomain" domain name.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	4.7.z
Assignee:	Ben Nemec
QA Contact:	Amit Ugol
Docs Contact:
URL:
Whiteboard:
Depends On:	1954670
Blocks:

Reported:	2021-05-19 16:52 UTC by Ben Nemec
Modified:	2021-07-06 11:39 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Use of a CoreDNS plugin that does not allow queries to be forwarded if they can't be answered by the local server. Consequence: Some queries for DNS names in the cluster domain incorrectly fail. Fix: Changed to a CoreDNS plugin that correctly forwards queries. Result: All valid queries in the cluster domain work correctly.
Clone Of:	1954670
Environment:
Last Closed:	2021-07-06 11:38:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2581	0	None	open	Bug 1962288: [ovirt] Stop using db file for static internal records	2021-05-19 16:55:28 UTC
Red Hat Product Errata	RHBA-2021:2554	0	None	None	None	2021-07-06 11:39:14 UTC

Description Ben Nemec 2021-05-19 16:52:03 UTC

+++ This bug was initially created as a clone of Bug #1954670 +++

Description of problem:

I have a customer with two OpenShift 4.6.21 clusters, pre and prod, named in such a way that one cluster's domain falls inside the other cluster's clustername.basedomain. When they try to import images from the pre cluster into prod, the pull fails with a "no such host" error. To replicate the issue I created the following:


Type of install: RHV / IPI

DEV Cluster:
cluster name: pre
basedomain: ocp4.testlab.local
apps domain wildcard: *.apps.pre.ocp4.testlab.local

PROD Cluster:
cluster name: ocp4
basedomain: testlab.local
apps domain wildcard: *.apps.ocp4.testlab.local


The issue is that the CoreDNS static pod declares itself the owner of clustername.basedomain:

PROD Cluster:

. {
    errors
    health :18080
    mdns ocp4.testlab.local 0 ocp4 192.168.1.50
    forward . 192.168.1.191
    cache 30
    reload
    file /etc/coredns/node-dns-db ocp4.testlab.local
}


So the domain file contains:
$ORIGIN ocp4.testlab.local.
@    3600 IN SOA host.ocp4.testlab.local. hostmaster (
                                2017042752 ; serial
                                7200       ; refresh (2 hours)
                                3600       ; retry (1 hour)
                                1209600    ; expire (2 weeks)
                                3600       ; minimum (1 hour)
                                )
api-int IN A 192.168.1.201
api IN A 192.168.1.201

*.apps  IN  A 192.168.1.200

Because of this configuration, the pod answers every query under ocp4.testlab.local as the authoritative nameserver and never forwards such queries to the upstream DNS server.
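
The behaviour described above can be modelled in a few lines. This is only a rough sketch of the decision, not CoreDNS code: the zone name and records are copied from the zone file above, while the function names and the `fallthrough` switch are hypothetical and stand in for the difference between an authoritative zone file and a plugin that passes unanswered queries on.

```python
from typing import Optional

# Zone and static records copied from the zone file above.
ZONE = "ocp4.testlab.local."
RECORDS = {
    "api-int.ocp4.testlab.local.": "192.168.1.201",
    "api.ocp4.testlab.local.": "192.168.1.201",
}
WILDCARD_SUFFIX = ".apps.ocp4.testlab.local."
WILDCARD_ADDR = "192.168.1.200"


def lookup_static(qname: str) -> Optional[str]:
    """Return the static A record for qname, or None if there is none."""
    if qname in RECORDS:
        return RECORDS[qname]
    if qname.endswith(WILDCARD_SUFFIX):
        return WILDCARD_ADDR
    return None


def resolve(qname: str, fallthrough: bool) -> str:
    """Model which component ends up answering qname (hypothetical helper)."""
    in_zone = qname == ZONE or qname.endswith("." + ZONE)
    if not in_zone:
        return "forwarded upstream"
    addr = lookup_static(qname)
    if addr is not None:
        return f"answered locally: {addr}"
    # Without fallthrough the authoritative zone ends the lookup here.
    return "forwarded upstream" if fallthrough else "NXDOMAIN"


if __name__ == "__main__":
    name = "registry.apps.pre.ocp4.testlab.local."
    print(resolve(name, fallthrough=False))
    print(resolve(name, fallthrough=True))
```

The registry name sits under ocp4.testlab.local but matches neither the static records nor the *.apps wildcard, so the model only hands it upstream when fallthrough is enabled.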

===
[root@ocp4-7vgpf-master-0 ~]# podman pull registry.apps.pre.ocp4.testlab.local/image
Trying to pull registry.apps.pre.ocp4.testlab.local/image...
  Get "https://registry.apps.pre.ocp4.testlab.local/v2/": dial tcp: lookup registry.apps.pre.ocp4.testlab.local: no such host

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> registry.apps.pre.ocp4.testlab.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 31591
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: eced1033351d849b (echoed)
;; QUESTION SECTION:
;registry.apps.pre.ocp4.testlab.local. IN A

;; AUTHORITY SECTION:
ocp4.testlab.local.	30	IN	SOA	host.ocp4.testlab.local. hostmaster.ocp4.testlab.local. 2017042752 7200 3600 1209600 3600

;; Query time: 0 msec
;; SERVER: 192.168.1.50#53(192.168.1.50)
;; WHEN: Wed Apr 28 14:13:13 UTC 2021
;; MSG SIZE  rcvd: 183


===

We can see that the query for registry.apps.pre.ocp4.testlab.local was never forwarded to the upstream DNS server.
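
For anyone comparing several nodes, the relevant header fields can be pulled out of saved dig output with a small script. This is a minimal sketch: the helper names are hypothetical, and the "aa set, ra missing" test is only a heuristic drawn from the output in this report, not a general rule.

```python
import re
from typing import NamedTuple, Set


class DigHeader(NamedTuple):
    status: str
    flags: Set[str]


def parse_dig_header(output: str) -> DigHeader:
    """Extract the response status and header flags from dig's text output."""
    status = re.search(r"status: (\w+)", output)
    flags = re.search(r"flags: ([a-z ]*);", output)
    if status is None or flags is None:
        raise ValueError("no dig header found")
    return DigHeader(status.group(1), set(flags.group(1).split()))


def answered_locally(header: DigHeader) -> bool:
    """Heuristic: an authoritative answer from a server offering no recursion."""
    return "aa" in header.flags and "ra" not in header.flags


if __name__ == "__main__":
    sample = (
        ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 31591\n"
        ";; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1\n"
    )
    header = parse_dig_header(sample)
    print(header.status, sorted(header.flags), answered_locally(header))
```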

I also see that the CoreDNS static pod configuration is generated differently on vSphere/IPI.

Corefile from an OCP 4.6.21 vSphere/IPI cluster:

. {
    errors
    health :18080
    mdns ocp46ipi.rhlabs.local 0 ocp46ipi 172.20.1.50
    forward . 172.20.1.83
    cache 30
    reload
    hosts {
        172.20.1.160 api-int.ocp46ipi.rhlabs.local
        172.20.1.160 api.ocp46ipi.rhlabs.local
        fallthrough
    }
    template IN A ocp46ipi.rhlabs.local {
        match .*.apps.ocp46ipi.rhlabs.local
        answer "{{ .Name }} 60 in a 172.20.1.161"
        fallthrough
    }
}

With this configuration, I don't see the problem:

===
cluster name: ocp46ipi
base domain: rhlabs.local
external registry url: registry.apps.pre.ocp46ipi.rhlabs.local.

from a node:
[root@ocp46ipi-t46gj-worker-7sr4q /]# dig registry.apps.pre.ocp46ipi.rhlabs.local

; <<>> DiG 9.11.13-RedHat-9.11.13-6.el8_2.1 <<>> registry.apps.pre.ocp46ipi.rhlabs.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37036
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 51d02173216c406c67699a4360897378d7dde66f392a29b2 (good)
;; QUESTION SECTION:
;registry.apps.pre.ocp46ipi.rhlabs.local. IN A

;; ANSWER SECTION:
registry.apps.pre.ocp46ipi.rhlabs.local. 30 IN A 172.20.1.83

;; AUTHORITY SECTION:
rhlabs.local.		30	IN	NS	ns.rhlabs.local.

;; ADDITIONAL SECTION:
ns.rhlabs.local.	30	IN	A	172.20.1.83

;; Query time: 2 msec
;; SERVER: 172.20.1.50#53(172.20.1.50)
;; WHEN: Wed Apr 28 14:40:15 UTC 2021
;; MSG SIZE  rcvd: 223

===

In this case, the query is correctly forwarded to and answered by the upstream DNS server.
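
One way to see why the vSphere Corefile lets this name through is to test it against the `match` pattern from the template block. The sketch below assumes the pattern is applied as an unanchored regular expression search against the query name; the function name is hypothetical, and Python's `re` stands in for CoreDNS's own regex engine.

```python
import re

# Pattern copied from the `match` line of the vSphere template block.
APPS_PATTERN = re.compile(r".*.apps.ocp46ipi.rhlabs.local")


def handled_by_template(qname: str) -> bool:
    """True if the template block would synthesize the answer itself."""
    return APPS_PATTERN.search(qname) is not None


if __name__ == "__main__":
    for name in (
        "console.apps.ocp46ipi.rhlabs.local.",
        "registry.apps.pre.ocp46ipi.rhlabs.local.",
    ):
        print(name, handled_by_template(name))
```

Under that assumption the registry name does not match, so the `fallthrough` directive passes it along and `forward` sends it to the upstream server.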




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install an RHV/IPI 4.6.21 cluster.
2. Create an external DNS record for a name that contains the cluster name and basedomain.
3. Try to resolve the name. It is answered by CoreDNS instead of being forwarded to the upstream DNS server.

Actual results:
The query is answered by CoreDNS as the authoritative name server.

Expected results:
The query should be forwarded to the upstream DNS server.

Additional info:

--- Additional comment from Alfredo Pizarro on 2021-04-28 14:54:14 UTC ---

I forgot to clarify that the issue only affects RHV/IPI. When I replicate the same scenario on a vSphere/IPI cluster, it works as intended. Both tests used exactly the same version.

--- Additional comment from Andrew McDermott on 2021-04-29 16:10:19 UTC ---

The network edge team doesn't manage the kni/coredns static pods. Given that this appears to work for vSphere/IPI, it may be an installer issue or a KNI issue. Given that the CoreDNS configurations differ between RHV and vSphere, I am going to move this to the mDNS component.

--- Additional comment from Ben Nemec on 2021-05-14 21:51:50 UTC ---

As noted in the report, the RHV config just needs to switch to a plugin that supports fallthrough.

Note that this has already been fixed in 4.8 because we consolidated the coredns configs and the one every platform uses will handle this correctly.

--- Additional comment from Ben Nemec on 2021-05-19 16:49:50 UTC ---

Since this does not exist in 4.8, I will clone it to 4.7 and close this one. From there we can follow the normal backport process.

Comment 3 Siddharth Sharma 2021-06-04 18:40:02 UTC

This bug will be shipped as part of the next z-stream release, 4.7.15, on June 14th, as 4.7.14 was dropped due to a regression https://bugzilla.redhat.com/show_bug.cgi?id=1967614

Comment 8 OpenShift Automated Release Tooling 2021-06-17 12:29:08 UTC

OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes that were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23 2021 and in the fast channel on June 28th.

Comment 12 Amit Ugol 2021-06-24 06:55:53 UTC

I communicated with the OCP on RHV team and we decided that once they have time for it, they will help us with validation.

Comment 17 errata-xmlrpc 2021-07-06 11:38:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.19 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2554
