Bug 1720178 - [DOCS] UPI installation - expired certificate after 1 day - no serving certificate available for the kubelet
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.1.z
Assignee: Kathryn Alexander
QA Contact: Johnny Liu
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-13 10:20 UTC by Filip Brychta
Modified: 2019-07-31 20:50 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-31 20:49:54 UTC
Target Upstream Version:
Embargoed:



Description Filip Brychta 2019-06-13 10:20:06 UTC
Description of problem:
I have a cluster running in OpenStack, installed using the UPI installation mode: 3 masters + 2 workers, all on rhcos-410.8.20190520.1. The cluster was healthy, without any alerts.
After 1 day, the following errors are visible in the master logs:
Jun 13 09:49:31 host-10-0-150-208 hyperkube[1245]: I0613 09:49:31.779140    1245 certificate_manager.go:213] Current certificate is expired.
Jun 13 09:49:31 host-10-0-150-208 hyperkube[1245]: I0613 09:49:31.779302    1245 log.go:172] http: TLS handshake error from 10.128.0.9:44042: no serving certificate available for the kubelet

Following alerts are firing:
KubeletDown
Kubelet has disappeared from Prometheus target discovery.
TargetDown
100% of the kubelet targets are down.

Metrics for pods are not collected, and it's not possible to get pod logs.

Version-Release number of selected component (if applicable):
4.1.0
3 masters + 2 workers all on rhcos-410.8.20190520.1


How reproducible:
Always

Steps to Reproduce:
1. follow bare metal installation https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/installing/installing-on-bare-metal#installing-bare-metal

The only difference is that RHCOS was installed on VMs in OpenStack.

2. wait 1 day


Actual results:
See the description above.

Expected results:
The cluster should keep working.

Additional info:
Note that this is not a duplicate of bz1693951; I checked the uptime and the VMs were up the whole time.
I also tried https://bugzilla.redhat.com/show_bug.cgi?id=1714771#c6, which didn't help.

Logs from all masters available here: http://web.bc.jonqe.lab.eng.bos.redhat.com/master-logs.tar.gz

Cluster is still running so I can provide access there.

Comment 2 Seth Jennings 2019-06-13 12:57:55 UTC
It is likely that you do not have anything approving the kubelet serving CSRs. This can be verified with `oc get csr` showing CSRs in the Pending state.
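A quick way to check for this is to filter the CSR list for Pending requests. The sketch below uses a hypothetical sample of `oc get csr` output so it can run standalone; on a live cluster you would pipe `oc get csr` straight into the `awk` filter:

```shell
# Hypothetical sample of what `oc get csr` prints; on a real cluster,
# replace this variable with the live command's output.
csr_list='NAME        AGE   REQUESTOR              CONDITION
csr-8b2mt   25m   system:node:master-0   Pending
csr-qpx7a   25m   system:node:worker-0   Approved,Issued'

# Skip the header row and print the name of every CSR whose last
# column (CONDITION) is exactly "Pending".
pending=$(printf '%s\n' "$csr_list" | awk 'NR > 1 && $NF == "Pending" {print $1}')
printf '%s\n' "$pending"
```

If this prints any names, those CSRs are waiting on manual or automated approval.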

I run a bare metal installation and use this in my cluster (not for production use):
https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml

UPI requires the customer to provide some mechanism to approve these CSRs. We cannot blindly approve them, but we also have no information on which to base a verification check, unlike IPI, where we have the cloud provider API.
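For illustration, a minimal sketch of such a customer-side approver is below. This is an assumption about what the mechanism could look like, not a supported implementation, and it deliberately skips the verification step Seth describes, so it is not production-safe. The `OC` indirection is added here only so the sketch can be dry-run against a stub client:

```shell
# Sketch of a kubelet CSR auto-approver for UPI clusters (assumption: you
# accept the risk of approving every CSR without verifying the requester).
# OC defaults to the real client; point it at a stub for dry runs.
OC="${OC:-oc}"

approve_pending_csrs() {
  # `oc get csr -o name` lists all CSRs; a real approver should first
  # filter for Pending ones (e.g. with a go-template selecting CSRs that
  # have no .status) and verify which node sent each request.
  "$OC" get csr -o name |
  while read -r csr; do
    "$OC" adm certificate approve "$csr"
  done
}

# A continuous setup would run this periodically, e.g.:
# while true; do approve_pending_csrs; sleep 300; done
```

In practice this loop would live in a CronJob or systemd timer rather than being run by hand.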

Comment 3 Filip Brychta 2019-06-13 13:45:39 UTC
Yes, `oc get csr` showed lots of CSRs in the Pending state. In that case, it's necessary to update the documentation for the bare metal installation [1], because it says:
"After you approve the initial CSRs, the subsequent CSRs are automatically approved by the cluster kube-controller-manager."

Thank you for the quick response. I approved all CSRs and the cluster is working again.

[1] - https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/installing/installing-on-bare-metal#installation-approve-csrs_installing-bare-metal

Comment 4 Xingxing Xia 2019-06-14 03:05:41 UTC
FYI, similar (maybe not totally same) scenario to https://bugzilla.redhat.com/show_bug.cgi?id=1699293

Comment 5 Seth Jennings 2019-06-14 13:22:18 UTC
Michael, can you add that, for bare metal installs, the customer will need a way to continuously approve kubelet serving CSRs? The kube-controller-manager will only approve the kubelet client CSRs.

Comment 6 Kathryn Alexander 2019-06-14 17:59:52 UTC
Seth, do you have suggestions for how to continuously approve the kubelet CSRs? I can add it to the bare metal prereqs.


Ryan, why is this different from https://bugzilla.redhat.com/show_bug.cgi?id=1710427?

Comment 7 Kathryn Alexander 2019-06-17 17:25:38 UTC
Filip, can you provide the must-gather output?

Comment 9 Filip Brychta 2019-06-18 10:51:14 UTC
(In reply to Kathryn Alexander from comment #7)
> Filip, can you provide the must-gather output?

Sorry, but I don't know what the "must-gather output" is.

Comment 10 Kathryn Alexander 2019-06-20 15:27:00 UTC
Hi Filip! The description of the command is here: https://docs.openshift.com/container-platform/4.1/cli_reference/administrator-cli-commands.html#must-gather

> $ oc adm must-gather

Comment 11 Filip Brychta 2019-06-20 16:06:47 UTC
(In reply to Kathryn Alexander from comment #10)
> Hi Filip! The description of the command is here:
> https://docs.openshift.com/container-platform/4.1/cli_reference/
> administrator-cli-commands.html#must-gather
> 
> > $ oc adm must-gather

Hi Kathryn,
it's a 500 MB file even after compression. Is there any specific log or config I should attach, or do you need everything? It contains a lot of noise and other issues not related to this one.

Comment 12 Kathryn Alexander 2019-06-20 21:24:34 UTC
Hi Filip! I'm not sure which specific part of the log dev was interested in, but I think I have enough information to move forward.

The draft PR is here: https://github.com/openshift/openshift-docs/pull/15488/

David, does this sound right to you?

Comment 15 Kathryn Alexander 2019-07-22 14:45:26 UTC
Seth approved the change on the PR. Jianlin, will you PTAL?

Comment 16 Jens Reimann 2019-07-22 15:22:15 UTC
I just ran into the same issue. However, the "solution" described in PR #15488 [1] isn't really helping IMHO.

What should be decided? Which requests are valid and which aren't? If the decision cannot be made by OpenShift, then by whom (and how) can it be made?

And how should the process be implemented? I guess manually running `oc adm …` isn't a proper way to handle this.

[1] https://github.com/openshift/openshift-docs/pull/15488/

Comment 17 Johnny Liu 2019-07-23 02:24:12 UTC
The PR looks good to me. If anyone still has concerns about the resolution, please re-open this issue.

Comment 18 Kathryn Alexander 2019-07-23 12:39:45 UTC
I'm going to get a peer review before I merge. Thank you Jianlin!

Comment 19 Kathryn Alexander 2019-07-23 15:22:19 UTC
I've merged the change and am waiting for it to go live.

