Created attachment 1554862: bootstrap-kube-apiserver pod log

Description of problem:
While looking at bug 1698950, @deads2k noticed the following errors - bootstrap-kube-apiserver is unable to talk to etcd due to bad certs:

2019-04-12T18:25:18.046585457+00:00 stderr F I0412 18:25:18.046567 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069030 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069360 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.073724663+00:00 stderr F I0412 18:25:18.073688 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073729 1 clientconn.go:953] Failed to dial etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073740 1 clientconn.go:953] Failed to dial etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.082036906+00:00 stderr F I0412 18:25:18.082004 1 clientconn.go:551] parsed scheme: ""
2019-04-12T18:25:18.082117937+00:00 stderr F I0412 18:25:18.082105 1 clientconn.go:557] scheme "" not registered, fallback to default scheme

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-12-141717

How reproducible:
2/2 clusters so far

Steps to Reproduce:
1. Try to install 4.0.0-0.nightly-2019-04-12-141717
2. It will fail with the 30 min bootstrap issue
3. Errors above are seen in the bootstrap-kube-apiserver pod logs

Full pod logs attached
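For anyone triaging similar logs, the x509 warnings above can be broken down mechanically to see which endpoint was dialed, which names its certificate covers, and which name the handshake was verified against. The following is only a rough sketch: the regular expression and the `parse_mismatch` helper are hypothetical names written for this comment, and the pattern assumes the message keeps exactly the shape shown in the log above.

```python
import re

# Hypothetical pattern for the hostname-mismatch warning as it appears in the
# bootstrap-kube-apiserver log above; it is an assumption that other versions
# of the message keep the same shape.
X509_MISMATCH = re.compile(
    r'failed to connect to \{(?P<dialed>[^\s:}]+):\d+.*?'
    r'x509: certificate is valid for (?P<sans>.+?), not (?P<expected>[^\s"]+)"'
)


def parse_mismatch(line):
    """Return the dialed host, the certificate SANs, and the verified name."""
    match = X509_MISMATCH.search(line)
    if match is None:
        return None
    sans = [name.strip() for name in match.group("sans").split(",")]
    return {
        "dialed": match.group("dialed"),
        "sans": sans,
        "expected": match.group("expected"),
    }


# One of the warnings from the log above, minus the timestamp prefix.
SAMPLE = (
    'grpc: addrConn.createTransport failed to connect to '
    '{etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 0 <nil>}. '
    'Err :connection error: desc = "transport: authentication handshake failed: '
    'x509: certificate is valid for localhost, etcd.kube-system.svc, '
    'etcd.kube-system.svc.cluster.local, '
    'etcd-1.mffiedler-132.qe.devcluster.openshift.com, '
    'not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...'
)

result = parse_mismatch(SAMPLE)
if result is not None:
    print("dialed:            ", result["dialed"])
    print("verified against:  ", result["expected"])
    print("dialed host in SANs:", result["dialed"] in result["sans"])
```

Run against the sample, this reports that the dialed host (etcd-1) is listed in the certificate's SANs, while the name it was verified against (etcd-0) is not. The etcd-2 warning has the same shape.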
Still hitting the etcd issue on 4.1.0-0.nightly-2019-04-29-235412 with a UPI installation on VMware. Refer to https://bugzilla.redhat.com/show_bug.cgi?id=1698456#c14

bootkube.sh log:
Apr 30 09:12:40 bootstrap-0 bootkube.sh[11321]: E0430 09:12:40.713004 1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to lis>
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: Error: error while checking pod status: timed out waiting for the condition
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: Tearing down temporary bootstrap control plane...
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: Error: error while checking pod status: timed out waiting for the condition
Apr 30 09:31:50 bootstrap-0 bootkube.sh[11321]: unable to find container etcd-signer: no container with name or ID etcd-signer found: no such container
Apr 30 09:31:50 bootstrap-0 systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
Apr 30 09:31:50 bootstrap-0 systemd[1]: bootkube.service: Failed with result 'exit-code'.
Apr 30 09:31:55 bootstrap-0 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Apr 30 09:31:55 bootstrap-0 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 2.
Apr 30 09:31:55 bootstrap-0 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Apr 30 09:31:55 bootstrap-0 systemd[1]: Started Bootstrap a Kubernetes cluster.
...
Apr 30 09:33:00 bootstrap-0 bootkube.sh[16365]: E0430 09:33:00.493809 1 reflector.go:251] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to watch *v1.Pod: Get https://api.jliu-demo.qe.devcluster.openshift.com:6443/api/v1/pods?watch=true: dial tcp 139.178.89.199:6443: connect: connection refused
Apr 30 09:33:01 bootstrap-0 bootkube.sh[16365]: E0430 09:33:01.527050 1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://api.jliu-demo.qe.devcluster.openshift.com:6443/api/v1/pods: dial tcp 139.178.89.198:6443: connect: connection refused

Since the issue in bug 1698456 has been fixed, I am re-opening this bug to track the installation issue.
> Still hit the etcd issue on 4.1.0-0.nightly-2019-04-29-235412 when upi installation on vmware.

The issue reported here was a cert SAN issue, but the logs above are about a failing signer container. Did you see errors like the ones originally reported? If so, can you please share them? Thanks!
Jia Liu, the log in comment #4 looks like a new issue related to bootkube restarting. I can't tell from the logs provided what is causing bootkube to restart. Please open a new bug against the installer component to track the bootkube restart. The team will likely need more context from the logs to track down what is happening there.
I'm going to go ahead and re-close this as a dup. Johnny, please feel free to open a new bug with the new (we believe unrelated) information in #4.

*** This bug has been marked as a duplicate of bug 1698456 ***
I'm OK with closing this one. Since the issue is not 100% reproducible, QE will file a new bug if we hit it again.