+++ This bug was initially created as a clone of Bug #1937466 +++

Description of problem
======================

There are a few issues with the description of the KubeClientCertificateExpiration alert:

- the description of the alert is brief, without more context or the action which needs to be taken
- the alert is not covered in OCS 4 documentation at all

This makes the alert confusing and not actionable.

Use case
========

1. Install OCP cluster.
2. OCP monitoring starts to fire the KubeClientCertificateExpiration alert.
3. OCP admin learns about the alert.

Actual results
==============

OCP admin is confused. The description of the alert states:

> Client certificate is about to expire.
> A client certificate used to authenticate to the apiserver is
> expiring in less than 1.5 hours.

When one tries to search the documentation, no information about this condition can be found. Using a search engine doesn't help either, as most references discuss OCP 3.11, where the responsibility for the client certificate was partially on the admin, while in OCP 4 this seems to be fully automated by some operator. See e.g.: https://access.redhat.com/solutions/5319341

The end result is that the admin is confused and stressed, expecting that a monitoring service will soon be degraded without further action, and without a clear understanding of what to do about it.

Expected results
================

As an OCP admin, I know or can figure out what to do when the KubeClientCertificateExpiration alert is firing.

Additional info
===============

I stumbled upon this alert recently, without any understanding of what was going on (see screenshot #1), and then later nothing happened. I can't assume that this will always be the case.

If you believe that the OCP alert itself needs to be tweaked, open an OCP eng. bugzilla as well. I would have done that if I understood this topic in more detail. But even if this requires engineering changes, a documentation update will be necessary anyway.

--- Additional comment from Martin Bukatovic on 2021-03-10 17:33:00 UTC ---

KubeClientCertificateExpiration alert is defined in:

https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml

--- Additional comment from W. Trevor King on 2021-03-12 21:37:52 UTC ---

Looks like it was dropped 10d ago [1] as part of bug 1923984. Maybe close this bug as a dup of that one, and start talking about whether we need backports?

[1]: https://github.com/openshift/cluster-monitoring-operator/commit/1496d7fb66e3043ad21014509221bdf37fbb2eaf#diff-9a529e7399b36b3c02f816e864690cfad2559b40127f86268cc44c5dbce1277fR16

--- Additional comment from Martin Bukatovic on 2021-03-15 09:04:49 UTC ---

Thanks for referencing BZ 1923984. I agree that this bug can be closed now.

--- Additional comment from Martin Bukatovic on 2021-03-15 10:55:06 UTC ---

Additional details from aos-devel list

https://mailman-int.corp.redhat.com/archives/aos-devel/2021-March/msg00161.html

On 3/15/21 10:02 AM, Simon Pasquier wrote:
> Yeah, Burr mentioned the same issue a few weeks ago. The alert tells
> us that someone uses a soon-to-expired client certificate but
> unfortunately it can't surface which client (and it can be anything:
> kubelet, operators, user workloads). A cluster admin would have to go
> through the API logs to find out exactly the client details.
>
> We've discussed removing the alert upstream [1] because we considered
> that the alert isn't really actionable but we didn't reach a
> consensus. Instead we've removed the alert from the cluster-monitoring
> operator (starting 4.8). FWIW we still have alerts in place if
> kubelets can't renew their certificates.
>
> [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/550

--- Additional comment from Simon Pasquier on 2021-04-16 10:21:57 UTC ---

Moving to ON_QA to be able to backport the removal of the rule in KubeClientCertificateExpiration.
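
To illustrate the point quoted above from the aos-devel thread (the alert sees that some client certificate is close to expiry, but not which client), here is a rough Python sketch. It assumes the rule evaluates a low quantile of an aggregated histogram of remaining certificate lifetimes and compares it with the 1.5 hour threshold from the alert text. The quantile value 0.01, the bucket boundaries, and the sample counts are assumptions made for illustration, not values taken from rules.yaml, and the real rule works on rates of bucket counters rather than raw counts.

```python
import math

# Threshold from the alert description quoted in this bug: "less than 1.5 hours".
THRESHOLD_SECONDS = 1.5 * 3600


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound_seconds, count) buckets.

    Uses linear interpolation inside the matching bucket, in the style of
    Prometheus' histogram_quantile(). Simplified sketch, not the real thing.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


def alert_fires(buckets, q=0.01):
    """True if the q-quantile of remaining lifetime is below the threshold.

    q=0.01 is an assumed value for this sketch.
    """
    value = histogram_quantile(q, buckets)
    return not math.isnan(value) and value < THRESHOLD_SECONDS


# Hypothetical samples: 3 of 100 requests used a certificate expiring within 30 minutes.
expiring = [(1800, 3), (3600, 3), (7200, 5), (86400, 40), (math.inf, 100)]
# Hypothetical samples: every request used a certificate valid for more than a day.
healthy = [(1800, 0), (3600, 0), (7200, 0), (86400, 0), (math.inf, 100)]

print(alert_fires(expiring))
print(alert_fires(healthy))
```

Note that the input holds only bucket counts: nothing in it identifies the kubelet, operator, or user workload that presented the short-lived certificate, which is why an admin would have to go through the API logs to find the client.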
Tested with the not-yet-merged PR: the KubeClientCertificateExpiration alert rule is removed.
The fix is in 4.7.0-0.nightly-2021-05-27-172500 and later builds. Based on Comment 2, moving to VERIFIED.
This bug will be shipped as part of the next z-stream release, 4.7.15, on June 14th, as 4.7.14 was dropped due to a regression: https://bugzilla.redhat.com/show_bug.cgi?id=1967614
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.16 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2286