Bug 1968079 - GCP CI: dial tcp: lookup api... on 172.30.0.10:53: no such host
Summary: GCP CI: dial tcp: lookup api... on 172.30.0.10:53: no such host
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Infrastructure
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Steve Kuznetsov
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-04 21:29 UTC by W. Trevor King
Modified: 2021-06-24 16:07 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-24 16:07:54 UTC
Target Upstream Version:



Description W. Trevor King 2021-06-04 21:29:47 UTC
We've seen a few of this type of thing over the years, including bug 1748760, bug 1744046, and bug 1837754.  Very popular recently:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=dial+tcp:+lookup+api%5B.%5D.*:53:+no+such+host&maxAge=24h&type=junit' | grep 'failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-devfile-integration-tests-main-v4.7.console-e2e-gcp-console-periodic (all) - 4 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-devfile-integration-tests-main-v4.8.console-e2e-gcp-console-periodic (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-opendatahub-io-odh-manifests-master-odh-manifests-e2e-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-opendatahub-io-opendatahub-operator-master-operator-e2e-nightly (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-insights-operator-master-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-insights-operator-release-4.6-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-insights-operator-release-4.8-insights-operator-test-time-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn-periodic (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-periodic (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-kni-cnf-features-deploy-release-4.6-e2e-gcp-ovn-periodic (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-kni-cnf-features-deploy-release-4.7-e2e-gcp-ovn-periodic (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-kni-cnf-features-deploy-release-4.7-e2e-gcp-periodic (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-openshift-tests-private-release-4.7-sanity (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-openshift-tests-private-release-4.8-sanity (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.4-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.5-upgrade-from-stable-4.4-e2e-aws-upgrade (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.6-upgrade-from-stable-4.5-e2e-gcp-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.5-e2e-gcp (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-proxy (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-proxy (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-redhat-operator-ecosystem-cvp-ocp-4.6-cvp-common-aws (all) - 24 runs, 8% failed, 50% of failures match = 4% impact
periodic-ci-redhat-operator-ecosystem-cvp-ocp-4.7-cvp-common-aws (all) - 30 runs, 3% failed, 100% of failures match = 3% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.3 (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.7 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 (all) - 5 runs, 80% failed, 75% of failures match = 60% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.2 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.5 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.6 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.7 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 5 runs, 80% failed, 100% of failures match = 80% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.4-to-4.5-to-4.6-to-4.7-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-compact-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-serial-4.2 (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
release-openshift-origin-installer-e2e-gcp-serial-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-serial-4.4 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.5 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
release-openshift-origin-installer-launch-gcp (all) - 95 runs, 34% failed, 3% of failures match = 1% impact

There are anecdotal reports that the error comes and goes for a specific cluster, which makes it less likely to be an early-teardown issue like some of the earlier bugs.  The fact that it is spread fairly evenly across multiple jobs and 4.y versions suggests it is test-infra and not a product issue.  And the lack of Azure matches suggests it may be provider infra and not something in the build* clusters themselves.
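One way to confirm the come-and-go behavior from inside an affected cluster would be a small polling loop that logs transitions between "resolves" and "no such host".  A rough sketch (the hostname is a placeholder, and `getent` only approximates the resolver path the Go e2e binaries use):

```shell
#!/bin/sh
# Sketch: poll a hostname and log transitions between "resolves" and
# "no such host", to see whether failures are intermittent.
# The hostname in the example below is a placeholder.

resolves() {
  # getent goes through NSS and /etc/resolv.conf, which roughly
  # approximates what the test binaries see when they dial the API.
  getent hosts "$1" >/dev/null 2>&1
}

watch_dns() {
  host=$1 last=unknown
  while :; do
    if resolves "$host"; then state=ok; else state=nxdomain; fi
    if [ "$state" != "$last" ]; then
      echo "$(date -u +%FT%TZ) $host -> $state"
      last=$state
    fi
    sleep 5
  done
}

# Example invocation (placeholder cluster name):
# watch_dns api.ci-op-example.origin-ci-int-gce.dev.openshift.com
```

If the transitions line up with the e2e disruption intervals, that would rule out a one-shot teardown race.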

Comment 1 W. Trevor King 2021-06-04 21:34:35 UTC
All of them are having trouble with the local DNS server:

  $ curl -s 'https://search.ci.openshift.org/search?search=dial+tcp:+lookup+api%5B.%5D.*:53:+no+such+host&maxAge=24h&type=junit' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed -n 's/.*lookup api[.].* on \(.*:53\): no such host.*/\1/p' | sort | uniq -c | sort -n
      480 172.30.0.10:53
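Since 172.30.0.10 is just the in-cluster DNS service, the NXDOMAIN could originate there or anywhere up its forwarding chain.  A sketch for telling those apart by comparing the cluster resolver against a public one (hostname is a placeholder; assumes `dig` is available in the image):

```shell
# Sketch: extract the DNS response code (NOERROR, NXDOMAIN, SERVFAIL, ...)
# from dig output, then ask the cluster DNS service and an upstream
# resolver the same question.  The hostname below is a placeholder.

dig_status() {
  # Reads dig output on stdin and prints the status field from the
  # ";; ->>HEADER<<- ... status: NXDOMAIN, ..." line.
  sed -n 's/.*status: \([A-Z]*\).*/\1/p'
}

# Compare resolvers (run from inside an affected cluster):
# name=api.ci-op-example.origin-ci-int-gce.dev.openshift.com
# dig @172.30.0.10 "$name" +time=2 +tries=1 | dig_status   # cluster DNS
# dig @8.8.8.8     "$name" +time=2 +tries=1 | dig_status   # upstream
```

If the upstream resolver also returns NXDOMAIN, the cluster DNS service is just faithfully relaying a broken external zone.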

Comment 2 W. Trevor King 2021-06-04 21:46:52 UTC
Picking a recent release job [1]:

INFO[2021-06-04T13:38:46Z] Jun  4 12:54:09.927: FAIL: Get "https://api.ci-op-iim16wm1-6cf85.origin-ci-int-gce.dev.openshift.com:6443/apis/user.openshift.io/v1/users/e2e-test-project-api-6twk7-user": dial tcp: lookup api.ci-op-iim16wm1-6cf85.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host 
...
INFO[2021-06-04T13:38:46Z] failed: (30.3s) 2021-06-04T12:54:40 "[sig-storage] [Serial] Volume metrics PVController should create unbound pv count metrics for pvc controller after creating pv only [Suite:openshift/conformance/serial] [Suite:k8s]" 

Refining my search to only pick up failures like that:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=FAIL.*dial+tcp:+lookup+api%5B.%5D.*172.30.0.10:53:+no+such+host&maxAge=24h&type=build-log' | grep 'failures match' | grep -v 'pull-ci-\|rehearse-' | sort
release-openshift-ocp-installer-e2e-gcp-ovn-4.6 (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.7 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 (all) - 5 runs, 80% failed, 75% of failures match = 60% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.7 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 5 runs, 80% failed, 100% of failures match = 80% impact
release-openshift-origin-installer-e2e-gcp-compact-4.3 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.5 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.6 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Seems like it broke around 2021-06-03 19:37Z [2], based on [3].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1400787820120903680#1:build-log.txt%3A114
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26200/pull-ci-openshift-origin-master-e2e-gcp-disruptive/1400537066063794176
[3]: https://search.ci.openshift.org/chart?search=FAIL.*dial+tcp:+lookup+api%5B.%5D.*172.30.0.10:53:+no+such+host&maxAge=48h&type=build-log

Comment 3 W. Trevor King 2021-06-04 22:23:00 UTC
Doesn't seem related to a specific build cluster:

$ curl -s 'https://search.ci.openshift.org/search?search=FAIL.*dial+tcp:+lookup+api%5B.%5D.*172.30.0.10:53:+no+such+host&search=Using+namespace.*ci.openshift.org&maxAge=24h&type=build-log' | jq -r 'to_entries[].value | select(length > 1)["Using namespace.*ci.openshift.org"][].context[]' | sed -n 's|.*Using namespace https://console.\(.*\).ci.openshift.org.*|\1|p' | sort | uniq -c
     37 build01
     53 build02

And currently build01 is 4.8.0-fc.7, and build02 is 4.7.9, so they're unlikely to be exposed to the same issues.

Comment 4 W. Trevor King 2021-06-04 22:25:12 UTC
Also including GCP in the title, based on comment 2 only matching GCP jobs.

Comment 5 W. Trevor King 2021-06-04 22:33:47 UTC
Possibly we broke something in our GCP-1 account?  Or there's something flaky in its DNS chain, anyway:

$ curl -s 'https://search.ci.openshift.org/search?search=FAIL.*dial+tcp:+lookup+api%5B.%5D.*172.30.0.10:53:+no+such+host&search=Acquired+.+lease&maxAge=24h&type=build-log' | jq -r 'to_entries[].value | select(length > 1)["Acquired . lease"][].context[]' | sed -n 's|.*Acquired . lease.*\[\(.*\)-[0-9]*]|\1|p' | sort | uniq -c
     92 us-east1--gcp-quota-slice

GCP-2 seems fine.

Comment 6 W. Trevor King 2021-06-04 22:51:55 UTC
(In reply to W. Trevor King from comment #0)
> Anecdotal reports that the error comes and goes for a specific cluster...

Ok, pinned down in [1], a job I mentioned in comment 2, which had two big API-connectivity outages.  The e2e-intervals chart shows both of them, and the associated synthetic JUnit is:

  [sig-api-machinery] kube-apiserver-new-connection should be available
  Run #0: Failed expand_less	51m58s
    kube-apiserver-new-connection was failing for 6m43s seconds (13% of the test duration)

linking the following stdout:

  Jun 04 12:53:50.723 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-iim16wm1-6cf85.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/default": dial tcp: lookup api.ci-op-iim16wm1-6cf85.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host
  Jun 04 12:53:50.723 - 103s  E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
  Jun 04 12:55:33.756 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests
  Jun 04 13:05:43.728 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-iim16wm1-6cf85.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/default": dial tcp: lookup api.ci-op-iim16wm1-6cf85.origin-ci-int-gce.dev.openshift.com on 172.30.0.10:53: no such host
  Jun 04 13:05:43.728 - 300s  E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
  Jun 04 13:10:43.771 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests

Using that to pivot from build-log to faster JUnit queries, at the cost of excluding older 4.y releases, which have slightly different strings in their uptime monitors:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=kube-apiserver-new-connection+started+failing:.*dial+tcp:+lookup+api%5B.%5D.*on+172.30.0.10:53:+no+such+host&maxAge=24h&type=junit' | grep 'failures match' | grep -v 'pull-ci-\|rehearse-' | sort
release-openshift-ocp-installer-e2e-gcp-ovn-4.7 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-ovn-4.8 (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.7 (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.8 (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
release-openshift-origin-installer-e2e-gcp-compact-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-serial-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.7 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-e2e-gcp-shared-vpc-4.8 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
release-openshift-origin-installer-launch-gcp (all) - 97 runs, 33% failed, 3% of failures match = 1% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1400787820120903680

Comment 7 W. Trevor King 2021-06-04 23:12:02 UTC
Refreshing my earlier breakdowns with the new kube-apiserver-new-connection root:

$ curl -s 'https://search.ci.openshift.org/search?search=kube-apiserver-new-connection+started+failing:.*dial+tcp:+lookup+api%5B.%5D.*on+172.30.0.10:53:+no+such+host&search=Using+namespace.*ci.openshift.org&maxAge=24h&type=build-log' | jq -r 'to_entries[].value | select(length > 1)["Using namespace.*ci.openshift.org"][].context[]' | sed -n 's|.*Using namespace https://console.\(.*\).ci.openshift.org.*|\1|p' | sort | uniq -c
     51 build01
     81 build02
$ curl -s 'https://search.ci.openshift.org/search?search=kube-apiserver-new-connection+started+failing:.*dial+tcp:+lookup+api%5B.%5D.*on+172.30.0.10:53:+no+such+host&search=Acquired+.+lease&maxAge=24h&type=build-log' | jq -r 'to_entries[].value | select(length > 1)["Acquired . lease"][].context[]' | sed -n 's|.*Acquired . lease.*\[\(.*\)-[0-9]*]|\1|p' | sort | uniq -c
    131 us-east1--gcp-quota-slice

So summarizing:

Recently, around 2021-06-03 19:37Z (comment 2), something happened that causes intermittent "no such host" errors in e2e code running on both build01 and build02 when it attempts to resolve api.*.origin-ci-int-gce.dev.openshift.com for clusters in GCP-1.  Clusters in GCP-2 use gcp-2.ci.openshift.org, and we don't see any problems there.  It's not yet clear to me where in the lookup chain the breakdown is.
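One way to localize the breakdown would be to walk the delegation chain for an affected name and see which level stops answering.  A sketch (placeholder hostname; assumes `dig` is installed):

```shell
# Sketch: walk a name up one label at a time so each parent zone's NS
# records can be queried directly; `dig +trace` automates the full walk
# from the root servers.  The hostname below is a placeholder.

parent_zone() {
  # api.foo.example.com -> foo.example.com
  echo "${1#*.}"
}

# name=api.ci-op-example.origin-ci-int-gce.dev.openshift.com
# dig +trace "$name"            # full delegation walk from the roots
# zone=$name
# while [ "$zone" != "${zone#*.}" ]; do
#   zone=$(parent_zone "$zone")
#   dig NS "$zone" +short       # who claims to be authoritative here?
# done
```

A broken or lame delegation at origin-ci-int-gce.dev.openshift.com, but not at gcp-2.ci.openshift.org, would match the GCP-1-only pattern above.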

Comment 9 Steve Kuznetsov 2021-06-24 16:07:54 UTC
This was resolved by waiting.

