Bug 1744046 - e2e failed: Failed to connect to kube-apiserver Kube API and openshift-apiserver OpenShift API due to dns issue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 4.3.0
Assignee: Abhinav Dahiya
QA Contact: sheng.lao
URL:
Whiteboard:
Duplicates: 1748760
Depends On: 1745720
Blocks:
Reported: 2019-08-21 08:14 UTC by zhou ying
Modified: 2020-01-23 11:05 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:05:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:05:41 UTC

Description zhou ying 2019-08-21 08:14:57 UTC
Description of problem:
Failed test: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/24


Failed error: 
fail [k8s.io/kubernetes/test/e2e/e2e.go:104]: Unexpected error:
    <*url.Error | 0xc003334300>: {
        Op: "Get",
        URL: "https://api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0",
        Err: {
            Op: "dial",
            Net: "tcp",
            Source: nil,
            Addr: nil,
            Err: {
                Err: "no such host",
                Name: "api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com",
                Server: "10.142.15.249:53",
                IsTimeout: false,
                IsTemporary: false,
            },
        },
    }
    Get https://api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp: lookup api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com on 10.142.15.249:53: no such host
occurred

Aug 20 10:32:03.010 E kube-apiserver Kube API is not responding to GET requests
Aug 20 10:32:03.010 E openshift-apiserver OpenShift API is not responding to GET requests

Version-Release number of selected component (if applicable):
redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-serial-4.2

How reproducible:
always

Comment 1 Dan Mace 2019-08-21 13:16:44 UTC
Looks like something related to the DNS record for the API server, which is part of the installer. The DNS component is for cluster DNS bugs (e.g. CoreDNS). Routing would be appropriate for DNS issues related to routes.

Hope that helps clarify. I reassigned this to the Installer component. Let me know if that was a mistake!

Comment 2 Abhinav Dahiya 2019-08-21 23:26:09 UTC
```
E0820 10:28:31.227539     244 reflector.go:126] github.com/openshift/origin/pkg/monitor/operator.go:126: Failed to list *v1.ClusterOperator: Get https://api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators?limit=500&resourceVersion=0: dial tcp: lookup api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com on 10.142.15.249:53: no such host
```

The failing lookup was sent to the DNS server at `10.142.15.249:53`. For comparison, the installer's default networks are defined here:

> https://github.com/openshift/installer/blob/63bb767efaafde1b0daf9638b7f0889af97cff8f/pkg/types/defaults/installconfig.go#L17-L19

The cluster network (pod CIDR) is 10.128.0.0/14 (first IP 10.128.0.0, last IP 10.131.255.255).
The machine network (machine CIDR) is 10.0.0.0/16 (first IP 10.0.0.0, last IP 10.0.255.255).

So this IP doesn't belong to the virtual network or the pod network of the cluster. That means the request was made from the `test` pod of the CI run. Either DNS failed in the CI cluster or GCP had a hiccup; this doesn't seem like the installer's problem.
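
The address check above can be reproduced with a short script. This is only a sketch: the regular expression and the helper names `resolver_from_log` and `owning_network` are illustrative assumptions, not part of the installer or the test suite. The CIDRs are the two networks quoted in this comment.

```python
import ipaddress
import re

# The two cluster networks quoted in the comment above.
NETWORKS = {
    "cluster network (pod CIDR)": ipaddress.ip_network("10.128.0.0/14"),
    "machine network (machine CIDR)": ipaddress.ip_network("10.0.0.0/16"),
}

# Hypothetical pattern for the tail of the Go resolver error:
# "lookup <host> on <ip>:<port>: no such host"
LOOKUP_RE = re.compile(
    r"lookup (?P<host>\S+) on (?P<ip>[\d.]+):(?P<port>\d+): no such host"
)


def resolver_from_log(line):
    """Return (host, resolver_ip) from a 'no such host' log line, or None."""
    match = LOOKUP_RE.search(line)
    if match is None:
        return None
    return match.group("host"), ipaddress.ip_address(match.group("ip"))


def owning_network(ip):
    """Return the name of the cluster network containing ip, or None."""
    for name, network in NETWORKS.items():
        if ip in network:
            return name
    return None


log_line = (
    "dial tcp: lookup api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com "
    "on 10.142.15.249:53: no such host"
)
host, resolver_ip = resolver_from_log(log_line)
print(resolver_ip, "->", owning_network(resolver_ip))  # 10.142.15.249 -> None
```

A result of `None` means the resolver address lies outside both cluster networks, which is consistent with the lookup originating outside the cluster under test.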

On another run:

DNS resolves but the connection to the API fails: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/26#0:build-log.txt%3A71035
A few seconds later, DNS stops resolving entirely: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/26#0:build-log.txt%3A71042
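
The two failure modes seen in that run (name resolves but the connection fails, versus the name not resolving at all) can be told apart with a small probe. The following is a sketch under assumptions: `classify_api_failure` is a hypothetical helper, not something the e2e suite provides, and the `resolve` and `connect` parameters exist only so the function can be exercised without network access.

```python
import socket


def classify_api_failure(host, port, resolve=socket.getaddrinfo,
                         connect=socket.create_connection, timeout=5.0):
    """Return 'dns', 'connect', or 'ok' for a host:port endpoint.

    'dns'     -> the name could not be resolved (the 'no such host' case)
    'connect' -> the name resolved but the TCP connection failed
    'ok'      -> the name resolved and a TCP connection was established
    """
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns"
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return "connect"
    conn.close()
    return "ok"
```

For example, `classify_api_failure("api.ci-op-x3fpxir9-03113.origin-ci-int-gce.dev.openshift.com", 6443)` would be expected to return `"dns"` once the cluster's DNS records have been removed.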

Comment 4 Abhinav Dahiya 2019-08-26 17:55:57 UTC
The e2e-gcp-serial job is running tests that are failing. Because the serial suite runs its tests one at a time, these failures cause the job to time out, so the `no such host` errors happen towards the end of the run, as the CI cluster is being torn down.

One class of these failures is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1745720

Comment 5 Abhinav Dahiya 2019-09-04 16:31:21 UTC
*** Bug 1748760 has been marked as a duplicate of this bug. ***

Comment 6 sheng.lao 2019-10-08 13:06:36 UTC
All the jobs are failing on the 4.3 branch, so I have to wait.

Comment 9 errata-xmlrpc 2020-01-23 11:05:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

