Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1744046

Summary:	e2e failed: Failed to connect to kube-apiserver Kube API and openshift-apiserver OpenShift API due to dns issue
Product:	OpenShift Container Platform	Reporter:	zhou ying <yinzhou>
Component:	Installer	Assignee:	Abhinav Dahiya <adahiya>
Installer sub component:	openshift-installer	QA Contact:	sheng.lao <shlao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	high	CC:	anli, aos-bugs, bleanhar, piqin
Version:	4.2.0
Target Milestone:	---
Target Release:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-23 11:05:22 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1745720
Bug Blocks:

Description zhou ying 2019-08-21 08:14:57 UTC

Description of problem:
Failed test: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/24


Failed error: 
fail [k8s.io/kubernetes/test/e2e/e2e.go:104]: Unexpected error:
    <*url.Error | 0xc003334300>: {
        Op: "Get",
        URL: "https://api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0",
        Err: {
            Op: "dial",
            Net: "tcp",
            Source: nil,
            Addr: nil,
            Err: {
                Err: "no such host",
                Name: "api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com",
                Server: "10.142.15.249:53",
                IsTimeout: false,
                IsTemporary: false,
            },
        },
    }
    Get https://api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp: lookup api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com on 10.142.15.249:53: no such host
occurred

Aug 20 10:32:03.010 E kube-apiserver Kube API is not responding to GET requests
Aug 20 10:32:03.010 E openshift-apiserver OpenShift API is not responding to GET requests

Version-Release number of selected component (if applicable):
redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-serial-4.2

How reproducible:
always

Comment 1 Dan Mace 2019-08-21 13:16:44 UTC

Looks like something related to the DNS record for the API server, which is part of the installer. The DNS component is for cluster DNS bugs (e.g. CoreDNS). Routing would be appropriate for DNS issues related to routes.

Hope that helps clarify. I reassigned this to the Installer component. Let me know if that was a mistake!

Comment 2 Abhinav Dahiya 2019-08-21 23:26:09 UTC

```
E0820 10:28:31.227539     244 reflector.go:126] github.com/openshift/origin/pkg/monitor/operator.go:126: Failed to list *v1.ClusterOperator: Get https://api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators?limit=500&resourceVersion=0: dial tcp: lookup api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com on 10.142.15.249:53: no such host
```

The DNS server being queried is `10.142.15.249:53`. Compare that address with the installer's default networks:

> https://github.com/openshift/installer/blob/63bb767efaafde1b0daf9638b7f0889af97cff8f/pkg/types/defaults/installconfig.go#L17-L19

The cluster network (pod CIDR) is 10.128.0.0/14 (first IP 10.128.0.0, last IP 10.131.255.255).
The machine network (machine CIDR) is 10.0.0.0/16 (first IP 10.0.0.0, last IP 10.0.255.255).

So this IP doesn't belong to the virtual network or the pod network of the cluster. That means the request was made from the `test` pod of the CI run. Either DNS failed in the CI cluster or GCP had a hiccup; this doesn't seem like an installer problem.
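
The membership check above can be reproduced with a short sketch. It is only an illustration using the two CIDRs quoted in this comment; the function name `owning_network` is hypothetical and is not part of the installer.

```python
import ipaddress

# CIDRs quoted in this comment (installer defaults at the linked commit).
CLUSTER_NETWORK = ipaddress.ip_network("10.128.0.0/14")  # pod CIDR
MACHINE_NETWORK = ipaddress.ip_network("10.0.0.0/16")    # machine CIDR


def owning_network(ip):
    """Return 'cluster', 'machine', or None for the given IPv4 address."""
    addr = ipaddress.ip_address(ip)
    for name, net in (("cluster", CLUSTER_NETWORK), ("machine", MACHINE_NETWORK)):
        if addr in net:
            return name
    return None


# The resolver address from the error message belongs to neither network.
print(owning_network("10.142.15.249"))  # None
```

A result of `None` for the resolver address is consistent with the lookup having been issued from outside the cluster under test.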

On another run:

DNS works but the connection to the API fails: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/26#0:build-log.txt%3A71035
and then, a few seconds later, DNS does not resolve at all: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/26#0:build-log.txt%3A71042
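
The two symptoms in that run (the name resolves but the connection fails, versus the name not resolving at all) can be told apart with a small probe. This is a minimal sketch, not what the e2e suite does; `classify_api_failure` and its `resolve`/`connect` parameters are hypothetical names introduced so the logic can be exercised without a network.

```python
import socket


def classify_api_failure(host, port, timeout=5.0,
                         resolve=socket.getaddrinfo,
                         connect=socket.create_connection):
    """Return 'dns-failure', 'connect-failure', or 'ok' for host:port."""
    try:
        # Step 1: name resolution. A 'no such host' error surfaces here.
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"
    try:
        # Step 2: TCP connect. Refusals and timeouts are OSError subclasses.
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return "connect-failure"
    conn.close()
    return "ok"
```

Run from the same pod that executes the tests, for example `classify_api_failure("api.<cluster-domain>", 6443)`, where the host is a placeholder for the API name of the cluster under test.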

Comment 4 Abhinav Dahiya 2019-08-26 17:55:57 UTC

e2e-gcp-serial is running tests that are failing. Because the serial suite runs tests one at a time, this causes the test run to time out, and so the `no such host` errors happen towards the end of the run, as the CI cluster is being torn down.

A class of failures is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1745720

Comment 5 Abhinav Dahiya 2019-09-04 16:31:21 UTC

*** Bug 1748760 has been marked as a duplicate of this bug. ***

Comment 6 sheng.lao 2019-10-08 13:06:36 UTC

All the jobs are failing on the 4.3 branch, so I have to wait.

Comment 7 sheng.lao 2019-10-22 06:14:28 UTC

It is fixed. I verified it with https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.3/137

Comment 9 errata-xmlrpc 2020-01-23 11:05:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062