Description of problem:
I have a cluster running in OpenStack, installed using UPI installation mode: 3 masters + 2 workers, all on rhcos-410.8.20190520.1. The cluster was healthy without any alerts. After 1 day, the following errors are visible in the master logs:

Jun 13 09:49:31 host-10-0-150-208 hyperkube[1245]: I0613 09:49:31.779140 1245 certificate_manager.go:213] Current certificate is expired.
Jun 13 09:49:31 host-10-0-150-208 hyperkube[1245]: I0613 09:49:31.779302 1245 log.go:172] http: TLS handshake error from 10.128.0.9:44042: no serving certificate available for the kubelet

The following alerts are firing:
KubeletDown: Kubelet has disappeared from Prometheus target discovery.
TargetDown: 100% of the kubelet targets are down.

Metrics for pods are not collected and it's not possible to get pod logs.

Version-Release number of selected component (if applicable):
4.1.0
3 masters + 2 workers, all on rhcos-410.8.20190520.1

How reproducible:
Always

Steps to Reproduce:
1. Follow the bare metal installation: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/installing/installing-on-bare-metal#installing-bare-metal
   The only difference is that RHCOS was installed on VMs in OpenStack.
2. Wait 1 day.

Actual results:
See description above.

Expected results:
The cluster should be working fine.

Additional info:
Note that this is not a duplicate of bz1693951; I checked uptime and the VMs were up all the time. I also tried https://bugzilla.redhat.com/show_bug.cgi?id=1714771#c6, which didn't help.
Logs from all masters are available here: http://web.bc.jonqe.lab.eng.bos.redhat.com/master-logs.tar.gz
The cluster is still running, so I can provide access to it.
It is likely that nothing in your cluster is approving the kubelet serving CSRs. You can verify this with `oc get csr`, which will show CSRs in Pending state. I run a bare metal installation and use this in my cluster (not for production use): https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_bootstrap_autoapprover/files/openshift-bootstrap-controller.yaml UPI requires the customer to provide some mechanism to approve these CSRs: we cannot blindly approve them, and, unlike IPI where we have the cloud provider API, we have no information against which to verify them.
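For anyone who wants to script the check described above, here is a minimal sketch, not an official tool. It assumes `oc` is on the PATH and logged in with permission to read CSRs, and it treats a CSR as Pending when its status carries neither an Approved nor a Denied condition. The helper names `pending_csr_names` and `list_pending_from_cluster` are made up for illustration.

```python
import json
import subprocess
from typing import Dict, List


def pending_csr_names(csr_list: Dict) -> List[str]:
    """Return names of CSRs with neither an Approved nor a Denied condition.

    `csr_list` is the parsed JSON that `oc get csr -o json` prints.
    """
    names = []
    for item in csr_list.get("items", []):
        conditions = (item.get("status") or {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names


def list_pending_from_cluster() -> List[str]:
    """Query the cluster (assumption: `oc` is installed and authenticated)."""
    result = subprocess.run(
        ["oc", "get", "csr", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return pending_csr_names(json.loads(result.stdout))
```

Each returned name can then be reviewed and, if it is expected, approved with `oc adm certificate approve <name>`.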
Yes, `oc get csr` showed many CSRs in Pending state. In that case the documentation for the bare metal installation [1] needs to be updated, because it says: "After you approve the initial CSRs, the subsequent CSRs are automatically approved by the cluster kube-controller-manger." Thank you for the quick response. I approved all CSRs and the cluster is working again. [1] - https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/installing/installing-on-bare-metal#installation-approve-csrs_installing-bare-metal
FYI, similar (maybe not totally same) scenario to https://bugzilla.redhat.com/show_bug.cgi?id=1699293
Michael, can you add that, for bare metal installs, the customer will need a way to continuously approve kubelet serving CSRs? The kube-controller-manager approves only the kubelet client CSRs.
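To illustrate the client versus serving distinction, here is a rough sketch of how a script could pick out kubelet serving CSRs from `oc get csr -o json` items. It rests on assumptions: that a serving CSR is requested by a `system:node:<name>` user (the `spec.username` field) and lists `server auth` in `spec.usages`. The function names are hypothetical, and this filter is not a security check: it does not verify that the requested hostnames or IPs really belong to the node, which is exactly the information UPI lacks.

```python
from typing import Dict, List

# Assumption: kubelets request serving certificates as "system:node:<node name>".
NODE_USER_PREFIX = "system:node:"


def is_kubelet_serving_csr(item: Dict) -> bool:
    """Heuristic: requested by a node identity and asking for server auth."""
    spec = item.get("spec") or {}
    username = spec.get("username", "")
    usages = spec.get("usages") or []
    return username.startswith(NODE_USER_PREFIX) and "server auth" in usages


def approve_command(csr_name: str) -> List[str]:
    """Build (but do not run) the approval command for a single CSR."""
    return ["oc", "adm", "certificate", "approve", csr_name]
```

Building the command without running it leaves the final decision to the operator, or to whatever additional verification the site can provide.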
Seth, do you have suggestions for how to continuously approve the kubelet CSRs? I can add it to the bare metal prereqs. Ryan, why is this different from https://bugzilla.redhat.com/show_bug.cgi?id=1710427?
Filip, can you provide the must-gather output?
(In reply to Kathryn Alexander from comment #7)
> Filip, can you provide the must-gather output?

Sorry, but I don't know what the "must-gather output" is.
Hi Filip! The description of the command is here: https://docs.openshift.com/container-platform/4.1/cli_reference/administrator-cli-commands.html#must-gather

$ oc adm must-gather
(In reply to Kathryn Alexander from comment #10)
> Hi Filip! The description of the command is here:
> https://docs.openshift.com/container-platform/4.1/cli_reference/administrator-cli-commands.html#must-gather
>
> $ oc adm must-gather

Hi Kathryn, it's a 500 MB file even after compression. Is there any specific log or config I should attach, or do you need everything? It contains a lot of noise and other issues not related to this one.
Hi Filip! I'm not sure which specific part of the log dev was interested in, but I think I have enough information to move forward. The draft PR is here: https://github.com/openshift/openshift-docs/pull/15488/ David, does this sound right to you?
Seth approved the change on the PR. Jianlin, will you PTAL?
I just ran into the same issue. However, the "solution" described in PR #15488 [1] isn't really helping IMHO. What should be decided? Which request is valid? Which isn't? If the decision cannot be made by OpenShift, then who can make it, and how? And how should the process be implemented? I guess manually running `oc adm …` isn't a proper way to handle this. [1] https://github.com/openshift/openshift-docs/pull/15488/
The PR looks good to me. If anyone still has concerns about the resolution, please reopen this issue.
I'm going to get a peer review before I merge. Thank you Jianlin!
I've merged the change and am waiting for it to go live.
This change is live on the portal: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/installing/installing-on-bare-metal#installation-approve-csrs_installing-bare-metal
And on docs.openshift.com: https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#installation-approve-csrs_installing-bare-metal