Bug 1937466 - KubeClientCertificateExpiration alert is confusing, without explanation in the documentation
Summary: KubeClientCertificateExpiration alert is confusing, without explanation in th...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1950290
Reported: 2021-03-10 17:31 UTC by Martin Bukatovic
Modified: 2021-07-27 22:53 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1950290
Environment:
Last Closed: 2021-07-27 22:52:37 UTC
Target Upstream Version:
Embargoed:


Attachments
screenshot #1: critical KubeClientCertificateExpiration is firing for each node of the cluster (277.21 KB, image/png)
2021-03-10 17:31 UTC, Martin Bukatovic


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:53:05 UTC

Description Martin Bukatovic 2021-03-10 17:31:45 UTC
Created attachment 1762382
screenshot #1: critical KubeClientCertificateExpiration is firing for each node of the cluster

Description of problem
======================

There are a few issues with the description of the
KubeClientCertificateExpiration alert:

- the description of the alert is brief and gives no context or action that
  needs to be taken
- the alert is not covered in OCS 4 documentation at all

This makes the alert confusing and not actionable.

Use case
========

1. Install an OCP cluster.
2. OCP monitoring starts to fire the KubeClientCertificateExpiration alert.
3. The OCP admin learns about the alert.

Actual results
==============

The OCP admin is confused. The description of the alert states:

> Client certificate is about to expire.
> A client certificate used to authenticate to the apiserver is
> expiring in less than 1.5 hours.

When one tries to search in the documentation, no information about this
condition can be found.

Using a search engine doesn't help either, as most references discuss
OCP 3.11, where responsibility for the client certificate was partially on
the admin, while in OCP 4 this seems to be fully automated by some operator.

See e.g.: https://access.redhat.com/solutions/5319341

The end result is that the admin is confused and stressed, expecting that a
monitoring service will soon be degraded unless action is taken, with no clear
understanding of what to do about it.

Expected results
================

As an OCP admin, I know or can figure out what to do when the
KubeClientCertificateExpiration alert is firing.

Additional info
===============

I stumbled upon this alert recently without any understanding of what was
going on (see screenshot #1), and later nothing happened. I can't assume that
this will always be the case.

If you believe that the OCP alert itself needs to be tweaked, open an OCP eng.
bugzilla as well. I would have done that if I understood this topic in more
detail. But even if this requires engineering changes, a documentation update
will be necessary anyway.

Comment 1 Martin Bukatovic 2021-03-10 17:33:00 UTC
KubeClientCertificateExpiration alert is defined in:

https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml

Comment 2 W. Trevor King 2021-03-12 21:37:52 UTC
Looks like it was dropped 10d ago [1] as part of bug 1923984.  Maybe close this bug as a dup of that one, and start talking about whether we need backports?

[1]: https://github.com/openshift/cluster-monitoring-operator/commit/1496d7fb66e3043ad21014509221bdf37fbb2eaf#diff-9a529e7399b36b3c02f816e864690cfad2559b40127f86268cc44c5dbce1277fR16

Comment 3 Martin Bukatovic 2021-03-15 09:04:49 UTC
Thanks for referencing BZ 1923984. I agree that this bug can be closed now.

*** This bug has been marked as a duplicate of bug 1923984 ***

Comment 4 Martin Bukatovic 2021-03-15 10:55:06 UTC
Additional details from aos-devel list https://mailman-int.corp.redhat.com/archives/aos-devel/2021-March/msg00161.html

On 3/15/21 10:02 AM, Simon Pasquier wrote:
> Yeah, Burr mentioned the same issue a few weeks ago. The alert tells
> us that someone uses a soon-to-expired client certificate but
> unfortunately it can't surface which client (and it can be anything:
> kubelet, operators, user workloads). A cluster admin would have to go
> through the API logs to find out exactly the client details.
> 
> We've discussed removing the alert upstream [1] because we considered
> that the alert isn't really actionable but we didn't reach a
> consensus. Instead we've removed the alert from the cluster-monitoring
> operator (starting 4.8). FWIW we still have alerts in place if
> kubelets can't renew their certificates.
> 
> [1] https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/550
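
The point in the quoted message, that the alert can tell that some client certificate is close to expiry but not which client, can be illustrated with a minimal sketch. This is not the apiserver or Prometheus implementation: the bucket bounds, client names, and helper functions below are hypothetical, and the threshold check is a simplification of a quantile over a histogram. It only shows that once observations are reduced to bucket counts, the client identity is no longer available.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (assumed for illustration only).
BUCKETS = [1800, 3600, 7200, 21600, 86400, 604800]


def observe(counts, remaining_seconds):
    """Record one request's remaining certificate lifetime as a bucket count."""
    idx = bisect_left(BUCKETS, remaining_seconds)
    counts[idx] += 1  # index len(BUCKETS) acts as the +Inf bucket


def any_expiring_within(counts, threshold_seconds):
    """True if any observation landed in a bucket bounded by the threshold."""
    return any(
        count > 0
        for bound, count in zip(BUCKETS, counts)
        if bound <= threshold_seconds
    )


# Hypothetical clients with remaining certificate lifetimes in seconds.
requests = [
    ("kubelet-node-1", 3000),
    ("some-operator", 500000),
    ("user-workload", 90000),
]

counts = [0] * (len(BUCKETS) + 1)
for _client, remaining in requests:
    observe(counts, remaining)  # the client name is dropped at this point

# 1.5 hours, as in the alert description, is 5400 seconds.
print(any_expiring_within(counts, 5400))
print(counts)
```

The counts alone are enough to raise an alert, but mapping the alert back to "kubelet-node-1" requires a separate source such as the API server logs.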

Comment 6 Junqi Zhao 2021-04-19 03:31:07 UTC
Tested with 4.8.0-0.nightly-2021-04-18-101412; the KubeClientCertificateExpiration rule is removed.
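
A check like this can be scripted. The following is a minimal sketch, assuming the rules have already been loaded into a Python dict that follows the PrometheusRule layout (spec.groups[].rules[].alert); the helper name, sample documents, and group name are hypothetical, and this is not the procedure QA actually used.

```python
def find_alert(rule_doc, alert_name):
    """Return the names of rule groups that define the given alert."""
    groups = rule_doc.get("spec", {}).get("groups", [])
    return [
        group.get("name")
        for group in groups
        if any(rule.get("alert") == alert_name for rule in group.get("rules", []))
    ]


# Hypothetical rule documents: one that still has the alert, one without it.
with_alert = {
    "spec": {
        "groups": [
            {
                "name": "example-group",
                "rules": [
                    {"alert": "KubeClientCertificateExpiration"},
                    {"alert": "SomeOtherAlert"},
                ],
            }
        ]
    }
}
without_alert = {
    "spec": {"groups": [{"name": "example-group", "rules": [{"alert": "SomeOtherAlert"}]}]}
}

print(find_alert(with_alert, "KubeClientCertificateExpiration"))
print(find_alert(without_alert, "KubeClientCertificateExpiration"))
```

An empty result for the second document corresponds to the rule having been removed.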

Comment 9 errata-xmlrpc 2021-07-27 22:52:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

