Bug 1724346 - Install failure flake - timed out waiting for etcd - delayed DNS record creation?
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-26 20:23 UTC by Mark McLoughlin
Modified: 2019-10-16 06:32 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:32:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:32:53 UTC

Description Mark McLoughlin 2019-06-26 20:23:00 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149

Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: https://etcd-2.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp 10.0.165.46:2379: connect: connection refused
Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: https://etcd-0.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp 10.0.130.178:2379: connect: connection refused
Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: https://etcd-1.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp 10.0.145.70:2379: connect: connection refused
Jun 26 12:15:06 ip-10-0-12-118 bootkube.sh[1389]: Error: unhealthy cluster
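For anyone re-running the check by hand: bootkube's probe is essentially an etcdctl health check against each member. A minimal sketch, assuming etcdctl with the v3 API; the client cert paths below are assumptions, not taken from this job:

    export ETCDCTL_API=3
    DOMAIN=ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com
    for i in 0 1 2; do
      etcdctl --endpoints="https://etcd-${i}.${DOMAIN}:2379" \
        --cacert=/opt/openshift/tls/etcd-ca-bundle.crt \
        --cert=/opt/openshift/tls/etcd-client.crt \
        --key=/opt/openshift/tls/etcd-client.key \
        endpoint health   # prints "is healthy" / "is unhealthy" per endpoint
    done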

Version-Release number of the following components:

4.2.0-0.nightly-2019-06-26-111723

How reproducible:

Only saw one example of this as today's build cop

Comment 1 Mark McLoughlin 2019-06-26 20:24:15 UTC
From Sam on slack:

looks like it might be dns? creates records at what feels really late.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149/artifacts/e2e-aws-upgrade/installer/.openshift_install.log


time="2019-06-26T11:37:42Z" level=debug msg="2019/06/26 11:37:42 [DEBUG] Resource instance state not found for node \"module.dns.aws_route53_record.etcd_a_nodes[0]\", instance module.dns.aws_route53_record.etcd_a_nodes[0]"

then 8 mins later

time="2019-06-26T11:45:02Z" level=debug msg="module.dns.aws_route53_record.etcd_a_nodes[0]: Creation complete after 48s [id=Z068822582SCO3XXVJLR_etcd-0.ci-op-0dr21wv1-77109.origin-ci-int-aws.dev.rhcloud.com_A]"

Comment 2 Sam Batschelet 2019-06-26 20:53:37 UTC
I found a bit more in the bootstrap must-gather that might be of interest.

curl https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149/artifacts/e2e-aws-upgrade/installer/log-bundle-20190626121617.tar | gunzip
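As written, that command streams the extracted bytes to stdout; a sketch that unpacks the bundle to disk instead, assuming the .tar is in fact gzip-compressed as the gunzip pipe implies:

    curl -sL https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/3149/artifacts/e2e-aws-upgrade/installer/log-bundle-20190626121617.tar -o log-bundle.tar
    tar -xzf log-bundle.tar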

lines 1079->38678

Jun 26 11:43:55 ip-10-0-12-118 systemd[1]: Started Kubernetes Kubelet.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: E0626 11:43:55.775558    1316 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: W0626 11:43:55.775599    1316 kubelet.go:1394] No api server defined - no node status update will be sent.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.776681    1316 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.777382    1316 status_manager.go:148] Kubernetes client is nil, not starting status manager.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.797933    1316 server.go:141] Starting to listen on 0.0.0.0:10250
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: I0626 11:43:55.801752    1316 server.go:347] Adding debug handlers to kubelet server.
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: E0626 11:43:55.813077    1316 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /usr/lib/golang/src/runtime/asm_amd64.s:522
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /usr/lib/golang/src/runtime/panic.go:513
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:189
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:214
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:125
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/reflector.go:124
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/controller.go:122
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:54
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /builddir/build/BUILD/openshift-git-0.04ae0f4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71
Jun 26 11:43:55 ip-10-0-12-118 hyperkube[1316]: /usr/lib/golang/src/runtime/asm_amd64.s:1333

Comment 3 Sam Batschelet 2019-06-26 20:59:09 UTC
A little later, this starts looping as part of the above.

Jun 26 11:48:35 ip-10-0-12-118 hyperkube[1316]: E0626 11:48:35.805079    1316 remote_runtime.go:332] ContainerStatus of "9af3a36f500219ad37fad4cf1a286f900bb2b8f315a4cf7364687a9b14d947cc" failed: Image is not set
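To gauge how hot the loop is, grepping the kubelet journal inside the extracted bundle works; the path below is an assumption about the bundle layout:

    grep -c 'Image is not set' bootstrap/journals/kubelet.log   # occurrence count
    grep -n 'Observed a panic' bootstrap/journals/kubelet.log   # locate the panics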

Comment 4 Antonio Murdaca 2019-06-27 13:15:03 UTC
Bootstrap logs from this job don't show the "Image is not set" error (https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/901/pull-ci-openshift-machine-config-operator-master-e2e-aws/4112) but do contain the same panic.

Comment 5 Antonio Murdaca 2019-06-27 13:24:44 UTC
From the logs this looks like a kubelet issue; assigning to the Node team. Upstream: https://github.com/kubernetes/kubernetes/issues/76075

Comment 6 Antonio Murdaca 2019-06-27 13:29:45 UTC
The upstream issue was filed against kubeadm, but it looks like a generic issue with the 1.14 kubelet (the logs match as well).

Comment 10 Ryan Phillips 2019-07-08 19:05:14 UTC
Fixed as of RHCOS build 420.8.20190708.1
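To confirm a cluster picked up the fixed build, the OS image each node reports can be listed; a sketch:

    oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}'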

Comment 11 Ryan Phillips 2019-07-29 22:07:31 UTC
My fault... Moved into the wrong state.

Comment 13 errata-xmlrpc 2019-10-16 06:32:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

