1699456 – bootstrap-kube-apiserver failing to connect to etcd with invalid cert error

Summary: bootstrap-kube-apiserver failing to connect to etcd with invalid cert error

Keywords:
Status:	CLOSED DUPLICATE of bug 1698456
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-04-12 18:55 UTC by Mike Fiedler
Modified:	2019-05-05 07:43 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-05-02 13:31:17 UTC
Target Upstream Version:
Embargoed:

Attachments
bootstrap-kube-apiserver pod log (69.99 KB, application/gzip) 2019-04-12 18:55 UTC, Mike Fiedler

Description Mike Fiedler 2019-04-12 18:55:10 UTC

Created attachment 1554862
bootstrap-kube-apiserver pod log

Description of problem:

While looking at bug 1698950, @deads2k noticed the following errors. bootstrap-kube-apiserver cannot connect to etcd because the TLS handshake fails certificate hostname verification:

2019-04-12T18:25:18.046585457+00:00 stderr F I0412 18:25:18.046567       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069030       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069360       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.073724663+00:00 stderr F I0412 18:25:18.073688       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073729       1 clientconn.go:953] Failed to dial etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073740       1 clientconn.go:953] Failed to dial etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.082036906+00:00 stderr F I0412 18:25:18.082004       1 clientconn.go:551] parsed scheme: ""
2019-04-12T18:25:18.082117937+00:00 stderr F I0412 18:25:18.082105       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
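
For triage, it can help to pull three things out of each handshake warning: the address that was dialed, the SANs on the certificate the server presented, and the name the client verified against. The following is a minimal sketch, not part of the original report; the function name and the regular expression are illustrative assumptions written against the log format shown above. For the first warning it shows that the certificate does cover the dialed host (etcd-1) but not the name it was checked against (etcd-0), which suggests the client verified the server under a different member's name than the one it dialed.

```python
import re

# Assumed pattern for the gRPC warning shown in the log above.
# Groups: dialed host, SAN list of the presented certificate, name verified against.
HANDSHAKE_RE = re.compile(
    r'failed to connect to \{(?P<dialed>[^:\s]+):\d+.*?'
    r'x509: certificate is valid for (?P<sans>.+?), not (?P<expected>[^\s"]+)"'
)


def parse_handshake_failure(line):
    """Return (dialed_host, sans, expected_name), or None if the line does not match."""
    match = HANDSHAKE_RE.search(line)
    if match is None:
        return None
    sans = [name.strip() for name in match.group("sans").split(",")]
    return match.group("dialed"), sans, match.group("expected")


# First handshake warning from the pod log, without the container-log prefix.
LINE = (
    'W0412 18:25:18.069030       1 clientconn.go:1304] grpc: addrConn.createTransport '
    'failed to connect to {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 0  <nil>}. '
    'Err :connection error: desc = "transport: authentication handshake failed: '
    'x509: certificate is valid for localhost, etcd.kube-system.svc, '
    'etcd.kube-system.svc.cluster.local, etcd-1.mffiedler-132.qe.devcluster.openshift.com, '
    'not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...'
)

dialed, sans, expected = parse_handshake_failure(LINE)
print("dialed host covered by cert:  ", dialed in sans)    # True
print("verified name covered by cert:", expected in sans)  # False
```

Running the same function over the full attached log would show whether every failure follows this pattern or whether some members present certificates that lack their own hostname.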



Version-Release number of selected component (if applicable): 4.0.0-0.nightly-2019-04-12-141717


How reproducible: 2/2 clusters so far


Steps to Reproduce:
1. Try to install 4.0.0-0.nightly-2019-04-12-141717.
2. The install fails with the 30-minute bootstrap timeout issue.
3. The errors above appear in the bootstrap-kube-apiserver pod logs.

Full pod logs are attached.

Comment 4 liujia 2019-04-30 09:54:27 UTC

Still hitting the etcd issue on 4.1.0-0.nightly-2019-04-29-235412 during a UPI installation on VMware.

See https://bugzilla.redhat.com/show_bug.cgi?id=1698456#c14

bootkube.sh log:
Apr 30 09:12:40 bootstrap-0 bootkube.sh[11321]: E0430 09:12:40.713004       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to lis>
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: Error: error while checking pod status: timed out waiting for the condition
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: Tearing down temporary bootstrap control plane...
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: Error: error while checking pod status: timed out waiting for the condition
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: unable to find container etcd-signer: no container with name or ID etcd-signer found: no such container
Apr 30 09:31:50 bootstrap-0 systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Apr 30 09:31:50 bootstrap-0 systemd[1]: bootkube.service: Failed with result 'exit-code'.
Apr 30 09:31:55 bootstrap-0 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Apr 30 09:31:55 bootstrap-0 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 2.
Apr 30 09:31:55 bootstrap-0 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Apr 30 09:31:55 bootstrap-0 systemd[1]: Started Bootstrap a Kubernetes cluster.
...
Apr 30 09:33:00 bootstrap-0 bootkube.sh[16365]: E0430 09:33:00.493809       1 reflector.go:251] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to watch *v1.Pod: Get https://api.jliu-demo.qe.devcluster.openshift.com:6443/api/v1/pods?watch=true: dial tcp 139.178.89.199:6443: connect: connection refused
Apr 30 09:33:01 bootstrap-0 bootkube.sh[16365]: E0430 09:33:01.527050       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.jliu-demo.qe.devcluster.openshift.com:6443/api/v1/pods: dial tcp 139.178.89.198:6443: connect: connection refused

Since the issue in bug 1698456 has been fixed, I am re-opening this bug to track the installation issue.

Comment 5 Sam Batschelet 2019-04-30 13:37:50 UTC

> Still hitting the etcd issue on 4.1.0-0.nightly-2019-04-29-235412 during a UPI installation on VMware.

The issue reported here was a cert SAN issue, but the logs above concern a failing signer container. Did you see errors like the ones originally reported? If so, can you please share them? Thanks!

Comment 6 Greg Blomquist 2019-05-01 12:58:04 UTC

Jia Liu, the log in comment #4 looks like a new issue related to bootkube restarting. I can't tell from the logs provided what is causing bootkube to restart.

Please open a new issue against the installer component to track the bootkube restart.  The team will likely need more context from the logs to track down what's happening there.

Comment 7 Eric Paris 2019-05-02 13:31:17 UTC

I'm going to go ahead and re-close this as a dup. Johnny, please feel free to open a new bug with the new (we believe unrelated) information in #4.

*** This bug has been marked as a duplicate of bug 1698456 ***

Comment 8 liujia 2019-05-05 07:43:50 UTC

I'm OK with closing this one. But since the issue does not reproduce 100% of the time, QE will file a new bug if we hit it again.
