Bug 1768051 - OpenShift 4.2 Install Fails with error: the server doesn't have a resource type "csr"
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Erica von Buelow
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-11-02 00:50 UTC by Aja Lightner
Modified: 2023-03-24 15:53 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-03 18:24:03 UTC
Target Upstream Version:
Embargoed:


Attachments
Log bundle (534.26 KB, application/gzip)
2019-11-02 00:56 UTC, Aja Lightner

Description Aja Lightner 2019-11-02 00:50:55 UTC
Description of problem:
OpenShift 4.2 UPI install fails on CoreOS/VMware
- The bootstrap API server is up to an extent, but it's returning a 404
- The bootstrap is unable to approve CSRs

Version-Release number of the following components:
4.2


How reproducible:

Steps to Reproduce:
1. Customer is following instructions per: https://docs.openshift.com/container-platform/4.2/installing/installing_vsphere/installing-vsphere.html


Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:
Successful install of 4.2

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Aja Lightner 2019-11-02 00:55:26 UTC
control-plane/10.123.13.102/journals/kubelet.log:Oct 31 14:49:37 etcd-1.o4.dr3.demo.sk hyperkube[1154]: E1031 14:49:37.907026    1154 certificate_manager.go:385] Failed while requesting a signed certificate from the master: cannot create certificate signing request: the server rejected our request for an unknown reason (post certificatesigningrequests.certificates.k8s.io)

Comment 2 Aja Lightner 2019-11-02 00:56:16 UTC
Created attachment 1631710
Log bundle

Comment 3 Abhinav Dahiya 2019-11-04 19:43:30 UTC
> log-bundle-20191031145106/bootstrap/journals/bootkube.log


```
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-0.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.101:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-2.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.103:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-1.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.102:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: Error: unhealthy cluster
Oct 31 14:45:34 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: etcdctl failed. Retrying in 5 seconds...
```

The bootstrap-host is waiting for etcd-cluster formation on control-plane hosts.
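
For triage on a full log bundle, the unhealthy endpoints can be pulled out of bootkube.log mechanically. The following is a minimal sketch, assuming the journal lines keep the shape shown in the excerpt above; `UNHEALTHY` and `unhealthy_endpoints` are hypothetical names chosen for illustration, not part of the installer.

```python
import re

# Assumed line shape, taken from the bootkube.log excerpt in this bug:
#   "<endpoint> is unhealthy: failed to connect: dial tcp <ip:port>: <reason>"
UNHEALTHY = re.compile(
    r"(?P<endpoint>https://\S+) is unhealthy: "
    r".*?dial tcp (?P<addr>[\d.]+:\d+): (?P<reason>.+)$"
)


def unhealthy_endpoints(lines):
    """Map each unhealthy etcd endpoint to (dialed address, reason).

    If an endpoint is reported more than once, the last report wins.
    """
    found = {}
    for line in lines:
        match = UNHEALTHY.search(line)
        if match:
            found[match.group("endpoint")] = (
                match.group("addr"),
                match.group("reason").strip(),
            )
    return found


# Two lines copied from the excerpt above; the second one does not match.
sample = [
    "Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: "
    "https://etcd-0.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: "
    "dial tcp 10.123.13.101:2379: connect: connection refused",
    "Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: "
    "Error: unhealthy cluster",
]

for endpoint, (addr, reason) in unhealthy_endpoints(sample).items():
    print(endpoint, addr, reason)
```

"connection refused" on port 2379 means the host answered but nothing was listening, which fits etcd not having started on the control-plane hosts rather than a routing or firewall drop.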


> log-bundle-20191031145106/bootstrap/containers/machine-config-server-ae8426373114ed617b03030a747589d9a38efc8a7aa38b07849219995bb86a86.log

```
I1031 14:04:26.885488       1 api.go:97] Pool master requested by 10.123.13.80:46260
I1031 14:04:26.885538       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:04:26.887600       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:05:47.131961       1 api.go:97] Pool master requested by 10.123.13.80:46890
I1031 14:05:47.132943       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:05:47.133592       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:07:35.453662       1 api.go:97] Pool master requested by 10.123.13.80:47728
I1031 14:07:35.454651       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
```

The control-plane hosts have requested the ignition from bootstrap-host.
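
To see which clients actually fetched a config, the machine-config-server log can be tallied by pool and source address. This is a rough sketch that assumes the `Pool <name> requested by <ip>:<port>` line shape from the excerpt above; `REQUEST` and `requests_by_client` are hypothetical names. The address is whatever the server saw, so behind a load balancer or NAT it may not be the node's own address. In the excerpt every request comes from 10.123.13.80, while the bootkube log dials the etcd hosts at 10.123.13.101 to 10.123.13.103; the bug does not say what 10.123.13.80 is.

```python
import re
from collections import Counter

# Assumed line shape, taken from the machine-config-server excerpt in this bug.
REQUEST = re.compile(r"Pool (?P<pool>\S+) requested by (?P<ip>[\d.]+):\d+")


def requests_by_client(lines):
    """Count config requests per (pool, source IP) pair."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            counts[(match.group("pool"), match.group("ip"))] += 1
    return counts


# Lines copied from the excerpt above.
sample = [
    "I1031 14:04:26.885488       1 api.go:97] Pool master requested by 10.123.13.80:46260",
    'I1031 14:04:26.885538       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"',
    "I1031 14:05:47.131961       1 api.go:97] Pool master requested by 10.123.13.80:46890",
    "I1031 14:07:35.453662       1 api.go:97] Pool master requested by 10.123.13.80:47728",
]

for (pool, ip), count in requests_by_client(sample).items():
    print(pool, ip, count)
```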


So looking at 
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.101/
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.102/

there are no containers running on those hosts (the `containers` directory is empty), and the kubelet is also not showing any errors explaining why the etcd static pods are not running.

> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.103/

the init containers for etcd have completed, but the etcd-member pods are failing or haven't started yet, and there are no kubelet logs explaining why.


Moving to node team to help debug.

Comment 4 Aja Lightner 2019-11-04 22:16:50 UTC
Thank you, Abhinav, for the update.

Comment 5 Aja Lightner 2019-11-09 17:19:52 UTC
Hi Abhinav - Did you have any updates from the node team? Were they able to help with debugging this?

Thank you,
Aja

Comment 6 Ryan Phillips 2019-11-11 15:22:17 UTC
bootstrap/containers/machine-config-controller-7b89f76874a18448df276b8ecf7a14cef4ad2911a6f1b9f062a20e12dc4ddbaf.log:
```

I1031 14:04:22.198338       1 bootstrap.go:40] Version: v4.2.0-201910101614-dirty (62b0b6d2a751a5f364f2e6d5c9cfe63419668777)
W1031 14:04:22.426844       1 render.go:137] Warning: the controller config referenced an unsupported platform: vsphere
W1031 14:04:22.466008       1 render.go:137] Warning: the controller config referenced an unsupported platform: vsphere

```

Looks like the MCO is reporting vsphere is an unsupported platform. This is strange because the docs show vsphere should be supported [1]. Going to reassign to the MCO team for more input.

1. https://docs.openshift.com/container-platform/4.2/installing/installing_vsphere/installing-vsphere.html#installation-vsphere-config-yaml_installing-vsphere

Comment 7 Kirsten Garrison 2019-11-12 14:59:16 UTC
MCO has been waiting for verification on the status of vSphere:

https://github.com/openshift/machine-config-operator/pull/998#discussion_r318568006

We are happy to merge (and make any other changes) but were told there were issues with kubelet on vsphere with no update to the contrary.

Please let us know.

Comment 8 Ryan Phillips 2019-11-12 15:06:20 UTC
I have not heard of any kubelet issues on vSphere. Is there something the Node team should look into?

Comment 9 Kirsten Garrison 2019-11-12 16:53:59 UTC
Antonio, is there anything Ryan/Node needs outside of your comment here?: 
https://github.com/openshift/machine-config-operator/pull/998#discussion_r318568006

Comment 10 Antonio Murdaca 2019-11-21 11:15:52 UTC
(In reply to Kirsten Garrison from comment #9)
> Antonio, is there anything Ryan/Node needs outside of your comment here?: 
> https://github.com/openshift/machine-config-operator/pull/998#discussion_r318568006

I don't think so. Also, that's just a warning; how is it causing any issue here?

Comment 11 Erica von Buelow 2019-11-25 16:01:54 UTC
As https://github.com/openshift/machine-config-operator/pull/998 has merged, is there a further problem we need to investigate in this BZ?

Comment 12 Aja Lightner 2019-11-27 21:28:44 UTC
Hi Team,

The problem we were investigating is the failed 4.2 installation on vSphere (see Comment 1 and Comment 3), and not just removing the warning message that stated vSphere was unsupported (Comment 6). 

Please let me know if there is more I need to gather to help solve this.



Log Errors, from Comment 1:
control-plane/10.123.13.102/journals/kubelet.log:Oct 31 14:49:37 etcd-1.o4.dr3.demo.sk hyperkube[1154]: E1031 14:49:37.907026    1154 certificate_manager.go:385] Failed while requesting a signed certificate from the master: cannot create certificate signing request: the server rejected our request for an unknown reason (post certificatesigningrequests.certificates.k8s.io)


Log Errors, From Comment 3:
> log-bundle-20191031145106/bootstrap/journals/bootkube.log


```
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-0.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.101:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-2.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.103:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: https://etcd-1.o4.dr3.demo.sk:2379 is unhealthy: failed to connect: dial tcp 10.123.13.102:2379: connect: connection refused
Oct 31 14:45:33 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: Error: unhealthy cluster
Oct 31 14:45:34 bootstrap.o4.dr3.demo.sk bootkube.sh[1783]: etcdctl failed. Retrying in 5 seconds...
```

The bootstrap-host is waiting for etcd-cluster formation on control-plane hosts.


> log-bundle-20191031145106/bootstrap/containers/machine-config-server-ae8426373114ed617b03030a747589d9a38efc8a7aa38b07849219995bb86a86.log

```
I1031 14:04:26.885488       1 api.go:97] Pool master requested by 10.123.13.80:46260
I1031 14:04:26.885538       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:04:26.887600       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:05:47.131961       1 api.go:97] Pool master requested by 10.123.13.80:46890
I1031 14:05:47.132943       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
I1031 14:05:47.133592       1 bootstrap_server.go:82] reading file "/etc/mcs/bootstrap/machine-configs/rendered-master-b38dadd973a9c0be0f894d3cd69ee8e8.yaml"
I1031 14:07:35.453662       1 api.go:97] Pool master requested by 10.123.13.80:47728
I1031 14:07:35.454651       1 bootstrap_server.go:62] reading file "/etc/mcs/bootstrap/machine-pools/master.yaml"
```

The control-plane hosts have requested the ignition from bootstrap-host.


So looking at 
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.101/
> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.102/

there are no containers running on those hosts (the `containers` directory is empty), and the kubelet is also not showing any errors explaining why the etcd static pods are not running.

> /tmp/mozilla_adahiya0/log-bundle-20191031145106/control-plane/10.123.13.103/

the init containers for etcd have completed, but the etcd-member pods are failing or haven't started yet, and there are no kubelet logs explaining why.

Comment 13 Joseph Callen 2019-12-05 17:08:59 UTC
Please double-check:

- NTP/time on the ESXi hosts and confirm the guests have the correct time as well
- Confirm all DNS records, confirm the RHCOS guests are resolving correctly.
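
The DNS half of this check can be scripted. Below is a minimal sketch, assuming the etcd hostnames and addresses that appear in the bootkube log in comment 3 are the intended pairs; `check_forward_dns` and `EXPECTED` are hypothetical names, not part of any OpenShift tooling. It covers only forward lookups of the etcd names, not the time check or the other records the install guide asks for.

```python
import socket


def check_forward_dns(expected, resolve=socket.gethostbyname):
    """Compare each hostname's resolved IPv4 address with the expected one.

    Returns a list of (hostname, expected_ip, actual_ip) tuples for every
    mismatch; actual_ip is None when the name does not resolve at all.
    The resolver is injectable so the check can be exercised offline.
    """
    problems = []
    for host, want in expected.items():
        try:
            got = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            got = None
        if got != want:
            problems.append((host, want, got))
    return problems


# Hostname/address pairs as they appear in the bootkube log (assumed intended).
EXPECTED = {
    "etcd-0.o4.dr3.demo.sk": "10.123.13.101",
    "etcd-1.o4.dr3.demo.sk": "10.123.13.102",
    "etcd-2.o4.dr3.demo.sk": "10.123.13.103",
}
```

Calling `check_forward_dns(EXPECTED)` from a machine that uses the same resolvers as the RHCOS guests returns an empty list when every name resolves as expected.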

Comment 14 Aja Lightner 2019-12-09 17:01:29 UTC
I am asking the customer to confirm these items now. Thanks Joseph.

Comment 15 Erica von Buelow 2019-12-12 16:18:35 UTC
Any updates on this?

Comment 17 Steve Milner 2020-02-03 15:06:30 UTC
Aja,

Any updates?

