Bug 1750286 - Bootkube reports the etcd cluster is up even when no etcd members are ready
Summary: Bootkube reports the etcd cluster is up even when no etcd members are ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Micah Abbott
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-09-09 08:43 UTC by weiwei jiang
Modified: 2019-10-16 06:41 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:40:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:41:01 UTC

Description weiwei jiang 2019-09-09 08:43:44 UTC
Description of problem:
When installing IPI on OSP, I found the following in the bootkube log on the bootstrap instance:

Sep 09 08:05:09 share-0909c-dgk97-bootstrap bootkube.sh[1730]: Starting etcd certificate signer...
Sep 09 08:05:12 share-0909c-dgk97-bootstrap bootkube.sh[1730]: 9dcecdffc78b278959a534e54d1dd6cc5d9e0327bd7616b93ffacd6a85e7252f
Sep 09 08:05:12 share-0909c-dgk97-bootstrap bootkube.sh[1730]: Waiting for etcd cluster...
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: https://etcd-0.share-0909c.qe.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-0.share-0909c.qe.rhcloud.com on 192.168.0.12:53: no such host
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: https://etcd-1.share-0909c.qe.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-1.share-0909c.qe.rhcloud.com on 192.168.0.12:53: no such host
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: https://etcd-2.share-0909c.qe.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-2.share-0909c.qe.rhcloud.com on 192.168.0.12:53: no such host
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: Error: unhealthy cluster
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: etcd cluster up. Killing etcd certificate signer...



Version-Release number of the following components:
./openshift-install v4.2.0
built from commit b5dbb46b7e97d2c63333048f055dd518aa01eb10
release image registry.svc.ci.openshift.org/ocp/release@sha256:0ef8b927112149e6eaee60074992cff97a16f386079de1d332c202eff766f55b

How reproducible:
Not sure

Steps to Reproduce:
1. Try to install IPI on OSP.
2. Check the bootkube service log on the bootstrap node.

Actual results:

> journalctl  -u bootkube|grep bootkube.sh|less
...
Sep 09 08:05:09 share-0909c-dgk97-bootstrap bootkube.sh[1730]: Starting etcd certificate signer...
Sep 09 08:05:12 share-0909c-dgk97-bootstrap bootkube.sh[1730]: 9dcecdffc78b278959a534e54d1dd6cc5d9e0327bd7616b93ffacd6a85e7252f
Sep 09 08:05:12 share-0909c-dgk97-bootstrap bootkube.sh[1730]: Waiting for etcd cluster...
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: https://etcd-0.share-0909c.qe.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-0.share-0909c.qe.rhcloud.com on 192.168.0.12:53: no such host
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: https://etcd-1.share-0909c.qe.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-1.share-0909c.qe.rhcloud.com on 192.168.0.12:53: no such host
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: https://etcd-2.share-0909c.qe.rhcloud.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-2.share-0909c.qe.rhcloud.com on 192.168.0.12:53: no such host
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: Error: unhealthy cluster
Sep 09 08:15:16 share-0909c-dgk97-bootstrap bootkube.sh[1730]: etcd cluster up. Killing etcd certificate signer...

Expected results:
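bootkube.sh should treat "Error: unhealthy cluster" as a failure and retry the health check instead of reporting "etcd cluster up".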


Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 2 Martin André 2019-09-09 09:36:44 UTC
The issue seems to be with the etcd image reporting success via its exit code while it failed to bring the etcd cluster up.
We're running the `etcdctl endpoint health` command [1].

This may be an issue with etcd itself.

[1] https://github.com/openshift/installer/blob/b5dbb46b7e97d2c63333048f055dd518aa01eb10/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L256-L273
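
For illustration, here is a minimal sketch of a wait loop that trusts only the exit status of the health check and retries on failure. This is not the code in bootkube.sh.template: the function name `retry_until_healthy` and its arguments are hypothetical, and the messages simply mirror the ones seen in the bootkube logs. It only helps if the non-zero status of `etcdctl endpoint health` actually reaches the script, which is the part that appears to go wrong here.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the installer): run a health-check command
# until it exits 0, or give up after a maximum number of attempts.
# Usage: retry_until_healthy <max_attempts> <delay_seconds> <command> [args...]
retry_until_healthy() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Only the exit status decides success; the command's output is ignored.
        if "$@"; then
            echo "etcd cluster up."
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts" >&2
            return 1
        fi
        echo "etcdctl failed. Retrying in ${delay} seconds..."
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

A call would look like `retry_until_healthy 10 5 <health check command>`, where the health check is whatever wraps `etcdctl endpoint health` on the bootstrap node. The wrapper must pass the exit status of etcdctl through unchanged.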

Comment 3 Abhinav Dahiya 2019-09-09 16:30:21 UTC

*** This bug has been marked as a duplicate of bug 1741157 ***

Comment 4 weiwei jiang 2019-09-20 02:56:41 UTC
Reopening this because it sometimes blocks our installations on bare metal and OpenStack.
We also need to track when the patch will land in OpenShift.

Comment 5 Abhinav Dahiya 2019-09-20 03:05:32 UTC
Since you reopened this issue, can you provide details about how it is not a duplicate of the already fixed bz https://bugzilla.redhat.com/show_bug.cgi?id=1741157?

What are we tracking? Is this issue not fixed by the bz 1741157?

Comment 9 weiwei jiang 2019-09-25 06:47:12 UTC
Verified on podman 1.4.2-stable2


Sep 25 06:02:54 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: Starting etcd certificate signer...
Sep 25 06:02:59 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: 38be6350ff30b8d0edc3cb3f154bbb092449af33f91ebc91be07cb7783353e6f
Sep 25 06:02:59 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: Waiting for etcd cluster...
Sep 25 06:13:05 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-1.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-1.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:13:05 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-0.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-0.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:13:05 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-2.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-2.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:13:05 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: Error: unhealthy cluster
Sep 25 06:13:05 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: etcdctl failed. Retrying in 5 seconds...
Sep 25 06:23:11 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-0.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-0.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:23:11 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-1.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-1.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:23:11 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-2.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-2.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:23:11 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: Error: unhealthy cluster
Sep 25 06:23:11 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: etcdctl failed. Retrying in 5 seconds...
Sep 25 06:33:17 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-1.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-1.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:33:17 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-2.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-2.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:33:17 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-0.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-0.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:33:17 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: Error: unhealthy cluster
Sep 25 06:33:17 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: etcdctl failed. Retrying in 5 seconds...
Sep 25 06:43:22 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-1.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-1.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:43:22 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-2.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-2.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:43:22 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: https://etcd-0.wjpurebm.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp: lookup etcd-0.wjpurebm.qe.devcluster.openshift.com on 147.75.207.208:53: server misbehaving
Sep 25 06:43:22 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: Error: unhealthy cluster
Sep 25 06:43:23 bootstrap.wjpurebm.qe.devcluster.openshift.com bootkube.sh[1829]: etcdctl failed. Retrying in 5 seconds...
[core@bootstrap ~]$ podman version 
Version:            1.4.2-stable2
RemoteAPI Version:  1
Go Version:         go1.12.8
OS/Arch:            linux/amd64
[core@bootstrap ~]$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1506a17d6f21d253cff1b3f84da3a8d9f49e76b8f23bd4eead2487ed003bd63f
              CustomOrigin: Image generated via coreos-assembler
                   Version: 42.80.20190923.1 (2019-09-23T19:53:33Z)

Comment 10 errata-xmlrpc 2019-10-16 06:40:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

