Bug 1743661 - Failed to bootstrap a UPI BM cluster with OCP 4.2
Summary: Failed to bootstrap a UPI BM cluster with OCP 4.2
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-20 12:10 UTC by Denis Ollier
Modified: 2019-11-01 12:15 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-31 10:14:58 UTC
Target Upstream Version:
Embargoed:



Description Denis Ollier 2019-08-20 12:10:42 UTC
Description of problem
----------------------

I am trying to install OCP 4 on a UPI bare-metal (BM) cluster following the instructions from the documentation: https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html
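
For reference, the flow from those docs is roughly the following; the install directory name and kernel argument are illustrative:

```
# Rough outline of the UPI bare-metal flow (directory name is illustrative).
openshift-install create manifests --dir=bm1
openshift-install create ignition-configs --dir=bm1
# The generated *.ign files are served over HTTP and passed to the RHCOS
# installer on each machine, e.g. via coreos.inst.ignition_url=...
openshift-install wait-for bootstrap-complete --dir=bm1 --log-level=info
```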

When using the RHCOS image and OCP tools for OCP 4.1, it works like a charm.

When switching to the RHCOS and OCP tool versions for OCP 4.2, the cluster fails to deploy, even though I'm using the same machines, UPI setup, and procedure as for 4.1.

Installation fails during the bootstrap phase.

I'm attaching journalctl output from the bootstrap and master nodes.
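
Something along these lines is enough to pull the journals off the nodes (host names and the master output file name are placeholders):

```
# Dump the full boot journal from each node over SSH.
ssh core@<bootstrap-host> 'journalctl -b' > journal-bootstrap.log
ssh core@<master-0-host> 'journalctl -b' > journal-master-0.log
```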

I suspect this kind of error is the cause:

>    Aug 20 10:33:58 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com openshift.sh[2867]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://localhost:6443/api?timeout=32s: x509: certificate is valid for api.bm1.oc4, not localhost

The bootstrap node tries to connect to the API server using localhost instead of api.bm1.oc4 as specified in the provided install-config.yaml.
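
A quick way to confirm which names that certificate actually covers, as a sketch run from the bootstrap node:

```
# Print the Subject Alternative Names of the certificate served on localhost:6443.
echo | openssl s_client -connect localhost:6443 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'
```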

Version
-------

My last attempt was with the versions below, but it also failed with older 4.2 builds from August:

- OCP 4.2.0-0.nightly-2019-08-20-043744
- RHCOS 42.80.20190816.2

Comment 4 W. Trevor King 2019-08-20 19:21:55 UTC
>    Aug 20 10:33:58 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com openshift.sh[2867]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://localhost:6443/api?timeout=32s: x509: certificate is valid for api.bm1.oc4, not localhost

This is a red herring, and I've spun off bug 1743840 about quieting it down. I haven't dug into the actual failure cause here yet.

Comment 5 Abhinav Dahiya 2019-08-21 23:45:01 UTC
Please provide the log bundle from

```
openshift-install gather bootstrap --bootstrap <bootstrap-host-ip> --master <master-0-host-ip> [--master <master-N-host-ip>]
```
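
The resulting bundle can then be unpacked and searched locally; the file name pattern and layout below are assumptions:

```
# Unpack the gathered bundle and search all collected journals.
# The bundle name pattern and internal layout are assumptions.
tar xzf log-bundle-*.tar.gz
rg -i 'bootkube|kube-apiserver' log-bundle-*/
```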

Comment 6 Abhinav Dahiya 2019-08-21 23:48:23 UTC
From an initial look at the output of `cat <attachment>/journal-bootstrap.log | rg 'bootkube'`:

```
Aug 20 10:49:59 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com bootkube.sh[3439]: E0820 10:49:59.088980       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
Aug 20 10:49:59 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com bootkube.sh[3439]: [#1175] failed to fetch discovery: Get https://localhost:6443/api?timeout=32s: dial tcp [::1]:6443: connect: connection refused
Aug 20 10:49:59 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com bootkube.sh[3439]: [#1176] failed to fetch discovery: Get https://localhost:6443/api?timeout=32s: dial tcp [::1]:6443: connect: connection refused
Aug 20 10:49:59 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com bootkube.sh[3439]: [#1177] failed to fetch discovery: Get https://localhost:6443/api?timeout=32s: dial tcp [::1]:6443: connect: connection refused
Aug 20 10:49:59 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com bootkube.sh[3439]: [#1178] failed to fetch discovery: Get https://localhost:6443/api?timeout=32s: dial tcp [::1]:6443: connect: connection refused
Aug 20 10:49:59 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com bootkube.sh[3439]: [#1179] failed to fetch discovery: Get https://localhost:6443/api?timeout=32s: dial tcp [::1]:6443: connect: connection refused

```

The bootstrap-kube-apiserver is failing to start, so I'm moving this to the kube-apiserver component.

Comment 8 Michal Fojtik 2019-08-23 11:45:32 UTC
Seth, I don't see the reason for this:

Aug 22 08:38:53 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com hyperkube[6863]: E0822 08:38:53.045487    6863 pod_workers.go:190] Error syncing pod 60d454f702c957e050f32e835f08f8f3 ("bootstrap-kube-apiserver-cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com_kube-system(60d454f702c957e050f32e835f08f8f3)"), skipping: failed to "StartContainer" for "kube-apiserver" with CrashLoopBackOff: "Back-off 2m40s restarting failed container=kube-apiserver pod=bootstrap-kube-apiserver-cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com_kube-system(60d454f702c957e050f32e835f08f8f3)"

None of the static pods come up. I don't see any relevant container logs, so we can't do much.
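
One way to dig further on the bootstrap node itself is to pull the container state and logs directly; a sketch, assuming the usual CRI-O/kubelet setup on the bootstrap host:

```
# Find the failing kube-apiserver container and dump its last log lines.
sudo crictl ps -a | grep kube-apiserver
sudo crictl logs --tail=100 <container-id>
# Kubelet's view of why the static pod is crash-looping.
sudo journalctl -b -u kubelet | grep -i kube-apiserver | tail -n 50
```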

Comment 10 Denis Ollier 2019-08-31 10:14:58 UTC
Hi,

I retried with newer versions:

- OCP 4.2.0-0.nightly-2019-08-28-035628
- RHCOS 42.80.20190828.0

The bootstrap node was properly set up and I managed to get a working cluster.

> oc get clusterversion 
> 
> NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.2.0-0.nightly-2019-08-28-035628   True        False         54s     Cluster version is 4.2.0-0.nightly-2019-08-28-035628

Some nodes register as "localhost" even though they get a proper hostname from DNS, but that's probably a separate issue (a quick check is sketched after the node list below).

> oc get nodes
> 
> NAME                                      STATUS   ROLES    AGE   VERSION
> cnv-qe-07.cnvqe.lab.eng.rdu2.redhat.com   Ready    master   27m   v1.14.0+b985ea310
> localhost                                 Ready    worker   18m   v1.14.0+b985ea310
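
A quick check for the hostname issue, sketched with placeholder values:

```
# On the affected worker: what the OS reports as the hostname.
hostnamectl status
# Confirm the reverse DNS (PTR) record for the node's IP resolves as expected.
dig -x <node-ip> +short
```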

Closing this issue. (I will probably open a new one for the localhost issue after further investigation.)

Thanks.

