Bug 1837754 - CI blocker: Unable to connect to the server: dial tcp: lookup api.ci-op-xxx on yy.yy.y: no such host
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Aniket Bhat
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-19 22:04 UTC by Kirsten Garrison
Modified: 2020-05-20 19:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 18:54:11 UTC
Target Upstream Version:
Embargoed:



Description Kirsten Garrison 2020-05-19 22:04:06 UTC
Description of problem:

In the MCO CI tests, after a machine-config is applied to a pool, we are unable to connect to the API server and see messages such as:
```
Unable to connect to the server: dial tcp: lookup api.ci-op-x160clxz-1354f.origin-ci-int-gce.dev.openshift.com on 10.142.0.52:53: no such host
```
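Note that this message is a DNS lookup failure for the API hostname, not a refused or timed-out connection. For triage across several runs it can help to pull the hostname and the resolver address out of each such line. Below is a minimal sketch, assuming the message keeps the `lookup <host> on <resolver>: no such host` shape shown above; the helper name `parse_lookup_failure` is hypothetical and not part of any CI tooling.

```python
import re

# Matches the lookup failure shape seen in the CI logs above, e.g.
# "dial tcp: lookup <host> on <resolver-ip>:<port>: no such host".
LOOKUP_FAILURE = re.compile(r"lookup (\S+) on (\S+?): no such host")


def parse_lookup_failure(line):
    """Return (hostname, resolver) if the line is a DNS lookup failure, else None.

    Hypothetical helper for illustration only.
    """
    match = LOOKUP_FAILURE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Grouping the results by resolver address would show whether the failures come from one DNS server or from several.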

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2305

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1719/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2306

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1719/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2307

This started happening today all at once:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op

In some cases the /pods dir is also missing from the must-gather, and where a /pods dir is present, a lot of pods seem to be missing from it. For example:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1733/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2303/artifacts/e2e-gcp-op/pods/


This is blocking our CI entirely, and no PR has merged to our repo for more than 24 hours. Please reassign as needed; I took my best guess at the component.

Comment 1 Kirsten Garrison 2020-05-19 22:13:24 UTC
I am also seeing this in a recent upgrade test from around the same time:

` May 19 21:26:50.335: FAIL: All nodes should be ready after test, Get https://api.ci-op-i2fqxshl-28de9.origin-ci-int-gce.dev.openshift.com:6443/api/v1/nodes: dial tcp: lookup api.ci-op-i2fqxshl-28de9.origin-ci-int-gce.dev.openshift.com on 10.142.0.79:53: no such host `

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1733/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/2080

Both started happening at the same time with runs around 2020-05-19 18:05:14 +0000 UTC

Comment 2 Kirsten Garrison 2020-05-19 23:09:25 UTC
In some of the artifacts that are missing a pods dir, I can look through events.json and see failures:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2305/artifacts/e2e-gcp-op/events.json

            "message": "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-58d8975c75-g6wvf_openshift-apiserver_658c3da4-7674-4567-bb36-ea4096758792_0(f24fa9506ad156fa871df2f28406f849557e4bbd1b38e579643092fcbe666749): netplugin failed with no error message"

            
However, in another run's events log I see:
            "lastTimestamp": "2020-05-19T21:05:16Z",
            "message": "Status for clusteroperator/openshift-apiserver changed: Available changed from True to False (\"APIServicesAvailable: \\\"apps.openshift.io.v1\\\" is not ready: 503 (the server is currently unable to handle the request)\")",

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1733/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2304/artifacts/e2e-gcp-op/events.json
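Scanning events.json by hand gets slow across many runs. Below is a rough sketch of filtering a downloaded copy for messages like the two quoted above. It assumes the file is a Kubernetes event list with a top-level `items` array whose entries carry `message` and `lastTimestamp` fields (only those two field names appear in the excerpts here; the `items` key is an assumption). The function names are hypothetical.

```python
import json


def load_events(path):
    """Load a downloaded events.json file (hypothetical helper)."""
    with open(path) as handle:
        return json.load(handle)


def matching_events(event_list, needles):
    """Return (lastTimestamp, message) pairs whose message contains any needle.

    Assumes a top-level "items" list; events without a message are skipped.
    """
    hits = []
    for item in event_list.get("items", []):
        message = item.get("message", "")
        if any(needle in message for needle in needles):
            hits.append((item.get("lastTimestamp"), message))
    return hits


# Example usage against a local copy of the artifact:
#   events = load_events("events.json")
#   for stamp, message in matching_events(
#           events, ["Failed to create pod sandbox", "Available changed from True to False"]):
#       print(stamp, message)
```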

Comment 3 Kirsten Garrison 2020-05-19 23:33:44 UTC
I am wondering whether this has anything to do with the quay outage going on today.

Comment 5 Kirsten Garrison 2020-05-20 16:32:14 UTC
Hi Aniket, thanks for your attention. Jobs look good now, so this _was_ related to the quay issue, then?

