+++ This bug was initially created as a clone of Bug #1937466 +++
Description of problem
There are a few issues with the description of the KubeClientCertificateExpiration alert:
- description of the alert is brief, without more context or action which
needs to be taken
- the alert is not covered in OCS 4 documentation at all
This makes the alert confusing and not actionable.
1. Install OCP cluster
2. OCP monitoring starts to fire KubeClientCertificateExpiration alert.
3. OCP admin learns about the alert.
OCP admin is confused. The description of the alert states:
> Client certificate is about to expire.
> A client certificate used to authenticate to the apiserver is
> expiring in less than 1.5 hours.
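To make the quoted condition concrete, here is a minimal sketch of the check the alert text describes: a certificate whose remaining validity is below 1.5 hours. The function name `expires_soon` and the sample timestamps are hypothetical and only illustrate the arithmetic; this is not how the alert itself is implemented.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the alert text quoted above ("less than 1.5 hours").
THRESHOLD = timedelta(hours=1.5)


def expires_soon(not_after, now, threshold=THRESHOLD):
    """Return True if the certificate is still valid but expires within `threshold`.

    `not_after` is the certificate's expiry time; both arguments are
    timezone-aware datetimes. Hypothetical helper, for illustration only.
    """
    remaining = not_after - now
    return timedelta(0) <= remaining < threshold


# Hypothetical sample values.
now = datetime(2021, 3, 10, 12, 0, tzinfo=timezone.utc)
one_hour_left = now + timedelta(hours=1)
one_day_left = now + timedelta(hours=24)

print(expires_soon(one_hour_left, now))
print(expires_soon(one_day_left, now))
```

Note that such a check only says that some certificate is close to expiry; it says nothing about who owns it or whether it will be rotated automatically, which is the gap this bug is about.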
When one tries to search in the documentation, no information about this
condition can be found.
Using a search engine doesn't help either, as most references discuss
OCP 3.11, where the responsibility for the client certificate was partially on
the admin, while in OCP 4, this seems to be fully automated by some operator.
See e.g.: https://access.redhat.com/solutions/5319341
The end result is that the admin is confused and stressed, expecting that a
monitoring service will soon be degraded unless action is taken, without a clear
understanding of what to do about it.
As an OCP admin, I know or can figure out what to do when the
KubeClientCertificateExpiration alert is firing.
I stumbled upon this alert recently, without any understanding of what was going
on (see screenshot #1), and then later nothing happened. I can't assume that
this will always be the case.
If you believe that the OCP alert itself needs to be tweaked, open an OCP eng.
bugzilla as well. I would have done that if I understood this topic in more
detail. But even if this requires engineering changes, a documentation update
will be necessary anyway.
--- Additional comment from Martin Bukatovic on 2021-03-10 17:33:00 UTC ---
KubeClientCertificateExpiration alert is defined in:
--- Additional comment from W. Trevor King on 2021-03-12 21:37:52 UTC ---
Looks like it was dropped 10d ago as part of bug 1923984. Maybe close this bug as a dup of that one, and start talking about whether we need backports?
--- Additional comment from Martin Bukatovic on 2021-03-15 09:04:49 UTC ---
Thanks for referencing BZ 1923984. I agree that this bug can be closed now.
--- Additional comment from Martin Bukatovic on 2021-03-15 10:55:06 UTC ---
Additional details from aos-devel list https://mailman-int.corp.redhat.com/archives/aos-devel/2021-March/msg00161.html
On 3/15/21 10:02 AM, Simon Pasquier wrote:
> Yeah, Burr mentioned the same issue a few weeks ago. The alert tells
> us that someone uses a soon-to-expired client certificate but
> unfortunately it can't surface which client (and it can be anything:
> kubelet, operators, user workloads). A cluster admin would have to go
> through the API logs to find out exactly the client details.
> We've discussed removing the alert upstream  because we considered
> that the alert isn't really actionable but we didn't reach a
> consensus. Instead we've removed the alert from the cluster-monitoring
> operator (starting 4.8). FWIW we still have alerts in place if
> kubelets can't renew their certificates.
>  https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/550
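A minimal sketch of why the alert cannot name the client, assuming the alert is derived from an aggregated histogram of remaining certificate lifetimes (an assumption for illustration; the comment above does not describe how the metric is built). The bucket bounds, client names, and lifetimes below are all hypothetical.

```python
from collections import Counter

# Hypothetical bucket upper bounds in seconds (not the real API server buckets).
BUCKETS = [1800, 3600, 7200, 86400, float("inf")]

# Threshold from the alert description: 1.5 hours.
THRESHOLD_SECONDS = 5400


def observe(histogram, remaining_seconds):
    """Count one request in the first bucket that fits its certificate's remaining lifetime."""
    for upper_bound in BUCKETS:
        if remaining_seconds <= upper_bound:
            histogram[upper_bound] += 1
            return


# Hypothetical requests: (client, remaining certificate lifetime in seconds).
requests = [
    ("kubelet", 30 * 24 * 3600),
    ("operator-x", 3000),
    ("user-workload", 90 * 24 * 3600),
]

hist = Counter()
for _client, remaining in requests:
    # The client identity is dropped here: only the lifetime is recorded.
    observe(hist, remaining)

# What an alert on this data can see: how many requests fell below the threshold.
soon = sum(count for upper_bound, count in hist.items() if upper_bound <= THRESHOLD_SECONDS)
print(f"requests with soon-to-expire certificates: {soon}")
```

In this sketch the histogram keys are only bucket bounds, so nothing in it can be mapped back to `operator-x`; that matches the point above that an admin would have to go through the API logs to find the client.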
--- Additional comment from Simon Pasquier on 2021-04-16 10:21:57 UTC ---
Moving to ON_QA to be able to backport the removal of the KubeClientCertificateExpiration rule.
Tested with the not-yet-merged PR: the KubeClientCertificateExpiration alert rule is removed.
The fix is in 4.7.0-0.nightly-2021-05-27-172500 and later builds; based on Comment 2, moving to VERIFIED.
This bug will be shipped as part of the next z-stream release, 4.7.15, on June 14th, as 4.7.14 was dropped due to a regression: https://bugzilla.redhat.com/show_bug.cgi?id=1967614
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.16 security and bug fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.