Bug 1950290 - KubeClientCertificateExpiration alert is confusing, without explanation in the documentation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1937466
Blocks:
Reported: 2021-04-16 10:24 UTC by Simon Pasquier
Modified: 2021-06-15 09:27 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1937466
Environment:
Last Closed: 2021-06-15 09:26:45 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1171 | 0 | None | open | Bug 1950290: remove KubeClientCertificateExpiration alert rule | 2021-05-19 15:05:35 UTC
Red Hat Product Errata | RHSA-2021:2286 | 0 | None | None | None | 2021-06-15 09:27:26 UTC

Description Simon Pasquier 2021-04-16 10:24:20 UTC
+++ This bug was initially created as a clone of Bug #1937466 +++

Description of problem
======================

There are a few issues with the description of the KubeClientCertificateExpiration alert:

- the description of the alert is brief and gives neither context nor the
  action that needs to be taken
- the alert is not covered in OCS 4 documentation at all

This makes the alert confusing and not actionable.

Use case
========

1. Install an OCP cluster.
2. OCP monitoring starts to fire the KubeClientCertificateExpiration alert.
3. The OCP admin learns about the alert.

Actual results
==============

OCP admin is confused. The description of the alert states:

> Client certificate is about to expire.
> A client certificate used to authenticate to the apiserver is
> expiring in less than 1.5 hours.

When one tries to search in the documentation, no information about this
condition can be found.

Using a search engine doesn't help either, as most references discuss
OCP 3.11, where the responsibility for the client certificate was partially on
the admin, while in OCP 4 this seems to be fully automated by some operator.

See e.g.: https://access.redhat.com/solutions/5319341

The end result is that the admin is confused and stressed, expecting that a
monitoring service will soon be degraded unless action is taken, without a
clear understanding of what to do about it.

Expected results
================

As an OCP admin, I know or can figure out what to do when the
KubeClientCertificateExpiration alert is firing.

Additional info
===============

I stumbled upon this alert recently, without any understanding of what was
going on (see screenshot #1), and then later nothing happened. I can't assume
that this will always be the case.

If you believe that the OCP alert itself needs to be tweaked, open an OCP eng.
bugzilla as well. I would have done that if I understood this topic in more
detail. But even if this requires engineering changes, a documentation update
will be necessary anyway.

--- Additional comment from Martin Bukatovic on 2021-03-10 17:33:00 UTC ---

KubeClientCertificateExpiration alert is defined in:

https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml

--- Additional comment from W. Trevor King on 2021-03-12 21:37:52 UTC ---

Looks like it was dropped 10d ago [1] as part of bug 1923984.  Maybe close this bug as a dup of that one, and start talking about whether we need backports?

[1]: https://github.com/openshift/cluster-monitoring-operator/commit/1496d7fb66e3043ad21014509221bdf37fbb2eaf#diff-9a529e7399b36b3c02f816e864690cfad2559b40127f86268cc44c5dbce1277fR16

--- Additional comment from Martin Bukatovic on 2021-03-15 09:04:49 UTC ---

Thanks for referencing BZ 1923984. I agree that this bug can be closed now.

--- Additional comment from Martin Bukatovic on 2021-03-15 10:55:06 UTC ---

Additional details from aos-devel list https://mailman-int.corp.redhat.com/archives/aos-devel/2021-March/msg00161.html

On 3/15/21 10:02 AM, Simon Pasquier wrote:
> Yeah, Burr mentioned the same issue a few weeks ago. The alert tells
> us that someone uses a soon-to-expire client certificate but
> unfortunately it can't surface which client (and it can be anything:
> kubelet, operators, user workloads). A cluster admin would have to go
> through the API logs to find out exactly the client details.
> 
> We've discussed removing the alert upstream [1] because we considered
> that the alert isn't really actionable but we didn't reach a
> consensus. Instead we've removed the alert from the cluster-monitoring
> operator (starting 4.8). FWIW we still have alerts in place if
> kubelets can't renew their certificates.
> 
> [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/550
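
To illustrate why the alert cannot surface which client is affected, here is a minimal Python sketch. It assumes the rule evaluates a low quantile of a histogram of remaining client certificate lifetimes, which is how the upstream kubernetes-mixin rule appears to work. The bucket boundaries, counts, and function name below are hypothetical, and the interpolation is a simplified version of what Prometheus does, not its actual implementation.

```python
import math

# Hypothetical cumulative buckets: (upper bound in seconds, cumulative request count).
# Each request only increments a counter; the identity of the client is not recorded.
BUCKETS = [
    (1800.0, 20),
    (3600.0, 20),
    (7200.0, 20),
    (86400.0, 920),
    (math.inf, 1000),
]

# "expiring in less than 1.5 hours", taken from the alert description.
ALERT_THRESHOLD_SECONDS = 1.5 * 3600


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - prev_count
            if math.isinf(upper) or in_bucket == 0:
                # No finite upper edge (or an empty bucket) to interpolate towards.
                return prev_bound
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return prev_bound


# Assumed quantile: the 1st percentile of remaining certificate lifetimes.
estimate = histogram_quantile(0.01, BUCKETS)
is_firing = estimate < ALERT_THRESHOLD_SECONDS
print(f"estimated remaining lifetime: {estimate:.0f}s, alert firing: {is_firing}")
```

In this sketch a small share of requests made with short-lived certificates is enough to pull the estimate under the threshold, and nothing in the buckets says who sent those requests. That matches the point above that an admin would have to go through the API logs to identify the client.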

--- Additional comment from Simon Pasquier on 2021-04-16 10:21:57 UTC ---

Moving to ON_QA to be able to backport the removal of the KubeClientCertificateExpiration rule.

Comment 2 Junqi Zhao 2021-05-21 06:49:56 UTC
Tested with the unmerged PR; the KubeClientCertificateExpiration alert rule is removed.

Comment 4 Junqi Zhao 2021-05-31 02:01:13 UTC
The fix is in 4.7.0-0.nightly-2021-05-27-172500 and later builds. Based on Comment 2, moving to VERIFIED.

Comment 6 Siddharth Sharma 2021-06-04 18:38:50 UTC
This bug will be shipped as part of the next z-stream release, 4.7.15, on June 14th, as 4.7.14 was dropped due to a regression: https://bugzilla.redhat.com/show_bug.cgi?id=1967614

Comment 10 errata-xmlrpc 2021-06-15 09:26:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.16 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2286

