Bug 1724346 - Install failure flake - timed out waiting for etcd - delayed DNS record creation?
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-26 20:23 UTC by Mark McLoughlin
Modified: 2019-10-16 06:32 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:32:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:32:53 UTC

Description Mark McLoughlin 2019-06-26 20:23:00 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149

Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: https://etcd-2.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp 10.0.165.46:2379: connect: connection refused
Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: https://etcd-0.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp 10.0.130.178:2379: connect: connection refused
Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: https://etcd-1.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp 10.0.145.70:2379: connect: connection refused
Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: Error: unhealthy cluster
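For anyone re-running the check by hand: bootkube's probe is essentially an etcdctl health check against each member. A minimal sketch, assuming etcdctl with the v3 API; the client cert paths below are assumptions, not taken from this job:

    export ETCDCTL_API=3
    DOMAIN=ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com
    for i in 0 1 2; do
      etcdctl --endpoints="https://etcd-${i}.${DOMAIN}:2379" \
        --cacert=/opt/openshift/tls/etcd-ca-bundle.crt \
        --cert=/opt/openshift/tls/etcd-client.crt \
        --key=/opt/openshift/tls/etcd-client.key \
        endpoint health   # prints "is healthy" / "is unhealthy" per endpoint
    done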

Version-Release number of the following components:

4.2.0-0.nightly-2019-06-26-111723

How reproducible:

Only saw one example of this as today's build cop

Comment 1 Mark McLoughlin 2019-06-26 20:24:15 UTC
From Sam on slack:

looks like it might be dns? creates records at what feels really late.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149/artifacts/e2e-aws-upgrade/installer/.openshift_install.log


time="2019-06-26T11:37:42Z" level=debug msg="2019/06/26 11:37:42 [DEBUG] Resource instance state not found for node \"module.dns.aws_route53_record.etcd_a_nodes[0]\", instance module.dns.aws_route53_record.etcd_a_nodes[0]"

then 8 mins later

time="2019-06-26T11:45:02Z" level=debug msg="module.dns.aws_route53_record.etcd_a_nodes[0]: Creation complete after 48s [id=Z068822582SCO3XXVJLR_etcd-0.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com_A]"

Comment 2 Sam Batschelet 2019-06-26 20:53:37 UTC
I found a bit more in the bootstrap must-gather that might be of interest.

curl https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149/artifacts/e2e-aws-upgrade/installer/log-bundle-20190626121617.tar | gunzip
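As written, that command streams the extracted bytes to stdout; a sketch that unpacks the bundle to disk instead, assuming the .tar is in fact gzip-compressed as the gunzip pipe implies:

    curl -sL https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149/artifacts/e2e-aws-upgrade/installer/log-bundle-20190626121617.tar -o log-bundle.tar
    tar -xzf log-bundle.tar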

lines 1079->38678

Jun 26 11:43:55 ip-10-0-12-118 systemd[1]: Started Kubernetes Kubelet.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: E0626 11:43:55.775558    1316 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: W0626 11:43:55.775599    1316 kubelet.go:1394] No api server defined - no node status update will be sent.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.776681    1316 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.777382    1316 status_manager.go:148] Kubernetes client is nil, not starting status manager.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.797933    1316 server.go:141] Starting to listen on 0.0.0.0:10250
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.801752    1316 server.go:347] Adding debug handlers to kubelet server.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: E0626 11:43:55.813077    1316 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /usr/lib/golang/src/runtime/asm_amd64.s:522
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /usr/lib/golang/src/runtime/panic.go:513
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:189
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:214
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:125
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:124
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/controller.go:122
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:54
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /usr/lib/golang/src/runtime/asm_amd64.s:1333

Comment 3 Sam Batschelet 2019-06-26 20:59:09 UTC
A little later, this starts looping as part of the above.

Jun 26 11:48:35 ip-10-0-12-118 hyperkube[1316]: E0626 11:48:35.805079    1316 remote_runtime.go:332] ContainerStatus of "9af3a36f500219ad37fad4cf1a286f900bb2b8f315a4cf7364687a9b14d947cc" failed: Image is not set
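To gauge how hot the loop is, grepping the kubelet journal inside the extracted bundle works; the path below is an assumption about the bundle layout:

    grep -c 'Image is not set' bootstrap/journals/kubelet.log   # occurrence count
    grep -n 'Observed a panic' bootstrap/journals/kubelet.log   # locate the panics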

Comment 4 Antonio Murdaca 2019-06-27 13:15:03 UTC
Bootstrap logs from this job don't show the "Image is not set" error (https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/901/pull-ci-openshift-machine-config-operator-master-e2e-aws/4112) but do contain the same panic.

Comment 5 Antonio Murdaca 2019-06-27 13:24:44 UTC
From the logs this looks like a kubelet issue; assigning to the Node team. Upstream: https://github.com/kubernetes/kubernetes/issues/76075

Comment 6 Antonio Murdaca 2019-06-27 13:29:45 UTC
The upstream issue was filed against kubeadm, but it looks like a generic issue with the 1.14 kubelet (the logs match as well).

Comment 10 Ryan Phillips 2019-07-08 19:05:14 UTC
Fixed as of RHCOS build 420.8.20190708.1
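To confirm a cluster picked up the fixed build, the OS image each node reports can be listed; a sketch:

    oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}'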

Comment 11 Ryan Phillips 2019-07-29 22:07:31 UTC
My fault... Moved into the wrong state.

Comment 13 errata-xmlrpc 2019-10-16 06:32:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

