Bug 2033575 - monitoring targets are down after the cluster has run for more than 1 day
Summary: monitoring targets are down after the cluster has run for more than 1 day
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Importance: high medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 2076831
Blocks: 2075757
 
Reported: 2021-12-17 10:03 UTC by Junqi Zhao
Modified: 2023-01-03 08:10 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2075757
Environment:
Last Closed: 2022-08-10 10:40:43 UTC
Target Upstream Version:
Embargoed:


Attachments
monitoring targets are down (180.55 KB, image/png), 2021-12-17 10:03 UTC, Junqi Zhao


Links
Github openshift cluster-monitoring-operator pull 1637 (Merged): Bug 2033575: use bearer token as fall-back authn method (last updated 2022-10-26 07:39:43 UTC)
Github prometheus common pull 345 (open): Check if TLS certificate and key file have been modified (last updated 2022-10-26 07:39:58 UTC)
Github prometheus prometheus issues 9512 (open): Prometheus keeps keepalive HTTP connections to targets open with expired TLS credentials even when new ones have been pr... (last updated 2022-10-26 07:39:56 UTC)
Red Hat Knowledge Base (Solution) 6956725 (last updated 2022-05-06 08:51:23 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:41:00 UTC)

Description Junqi Zhao 2021-12-17 10:03:01 UTC
Created attachment 1846694 [details]
monitoring targets are down

Description of problem:
the cluster has been running for 36h
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-14-083101   True        False         36h     Cluster version is 4.10.0-0.nightly-2021-12-14-083101

but a few monitoring targets are down due to "Error: server returned HTTP status 401 Unauthorized",
including the grafana/prometheus-k8s/telemeter-client/thanos-sidecar/alertmanager targets

checking the logs, there are errors like "Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time **** is after 2021-12-16T20:14:30Z";
after the cluster has been running for almost 1 day, the targets go down.
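
For reference, a possible way to list the down targets from inside the cluster is to query the up metric through the thanos-querier API; this is only a sketch, and the thanos-querier route plus "oc whoami -t" for the token are assumptions, not something taken from this report:

# sketch: list openshift-monitoring targets reporting up == 0 (route and token handling assumed)
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=up{namespace="openshift-monitoring"} == 0' \
  "https://$HOST/api/v1/query"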

example: kube-rbac-proxy container logs of prometheus-k8s-0 pod
2021-12-15T20:42:14.037307604Z I1215 20:42:14.037071       1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
2021-12-15T20:42:14.040466228Z I1215 20:42:14.040394       1 main.go:181] Valid token audiences: 
2021-12-15T20:42:14.040860463Z I1215 20:42:14.040816       1 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca::/etc/tls/client/client-ca.crt"
2021-12-15T20:42:14.041045983Z I1215 20:42:14.040997       1 main.go:305] Reading certificate files
2021-12-15T20:42:14.041296788Z I1215 20:42:14.041154       1 dynamic_cafile_content.go:167] Starting client-ca::/etc/tls/client/client-ca.crt
2021-12-15T20:42:14.041358071Z I1215 20:42:14.041342       1 reloader.go:98] reloading key /etc/tls/private/tls.key certificate /etc/tls/private/tls.crt
2021-12-15T20:42:14.041722605Z I1215 20:42:14.041682       1 main.go:339] Starting TCP socket on 0.0.0.0:9092
2021-12-15T20:42:14.042563009Z I1215 20:42:14.042495       1 main.go:346] Listening securely on 0.0.0.0:9092
2021-12-16T15:28:14.043412756Z I1216 15:28:14.043113       1 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca::/etc/tls/client/client-ca.crt"
2021-12-16T15:32:14.042711352Z I1216 15:32:14.042521       1 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca::/etc/tls/client/client-ca.crt"
2021-12-16T20:14:35.200286140Z E1216 20:14:35.200081       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:35Z is after 2021-12-16T20:14:30Z
2021-12-16T20:14:44.071594661Z E1216 20:14:44.071463       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:44Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:05.200062356Z E1216 20:15:05.199898       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:05Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:14.072273022Z E1216 20:15:14.070036       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:14Z is after 2021-12-16T20:14:30Z

and kube-rbac-proxy-metric container logs of alertmanager-main-0 pod
2021-12-15T20:41:57.832903155Z I1215 20:41:57.832680       1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
2021-12-15T20:41:57.849665600Z I1215 20:41:57.849561       1 main.go:181] Valid token audiences: 
2021-12-15T20:41:57.851298263Z I1215 20:41:57.851142       1 main.go:305] Reading certificate files
2021-12-15T20:41:57.855484254Z I1215 20:41:57.855434       1 dynamic_cafile_content.go:167] Starting client-ca::/etc/tls/client/client-ca.crt
2021-12-15T20:41:57.856139965Z I1215 20:41:57.856001       1 main.go:339] Starting TCP socket on 0.0.0.0:9097
2021-12-15T20:41:57.866276093Z I1215 20:41:57.863687       1 main.go:346] Listening securely on 0.0.0.0:9097
2021-12-16T20:14:30.664561462Z E1216 20:14:30.664417       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:30Z is after 2021-12-16T20:14:30Z
2021-12-16T20:14:56.424561585Z E1216 20:14:56.424473       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:56Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:00.664743050Z E1216 20:15:00.664629       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:00Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:26.424947375Z E1216 20:15:26.424880       1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:26Z is after 2021-12-16T20:14:30Z

NOTE: after recreating the alertmanager pods, the alertmanager targets become healthy again; the same should apply to the other affected targets
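
For example, the recreation mentioned in the NOTE can be done by deleting the pods and letting the operator bring them back; a sketch, with the pod names taken from this cluster:

# sketch: recreate the alertmanager pods, after which their targets report healthy again
oc -n openshift-monitoring delete pod alertmanager-main-0 alertmanager-main-1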


Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-14-083101

How reproducible:
check the targets after the cluster has been running for a long time

Steps to Reproduce:
1. see the description
2.
3.

Actual results:
a few monitoring targets are down

Expected results:
no targets down

Additional info:

Comment 4 Simon Pasquier 2021-12-17 13:45:18 UTC
from the CMO logs, I see that CMO requests renewal of the metrics client certificate at least twice (this certificate is used by Prometheus to authenticate against the metrics endpoints).

2021-12-16T14:21:11.327876626Z I1216 14:21:11.327257       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateRotationStarted' The current client certificate for OpenShiftMonitoringClientCertRequester expires in 5h53m19s. Start certificate rotation
2021-12-16T14:21:11.390888208Z I1216 14:21:11.375482       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRCreated' A csr "system:openshift:openshift-monitoring-xwh98" is created for OpenShiftMonitoringClientCertRequester
2021-12-16T14:21:11.574245322Z I1216 14:21:11.574057       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ClientCertificateCreated' A new client certificate for OpenShiftMonitoringClientCertRequester is available
2021-12-16T18:55:16.265051777Z I1216 18:55:16.264073       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateRotationStarted' The current client certificate for OpenShiftMonitoringClientCertRequester expires in 1h19m14s. Start certificate rotation
2021-12-16T18:55:16.273436660Z I1216 18:55:16.273365       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRCreated' A csr "system:openshift:openshift-monitoring-nqk4q" is created for OpenShiftMonitoringClientCertRequester
2021-12-16T18:55:16.383463392Z I1216 18:55:16.383290       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ClientCertificateCreated' A new client certificate for OpenShiftMonitoringClientCertRequester is available

The certificate seems to be renewed properly since the generated secret has a valid expiry date:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            a1:4f:95:61:a4:fe:1f:c4:db:21:cb:4a:f1:e8:66:1f
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kube-csr-signer_@1639668584
        Validity
            Not Before: Dec 16 18:50:16 2021 GMT
            Not After : Jan 15 15:29:44 2022 GMT
        Subject: CN = system:serviceaccount:openshift-monitoring:prometheus-k8s

The log line at 2021-12-16T18:55:16Z says that the certificate expires in 1h19m14s which means 2021-12-16T20:14:30Z and is consistent with the first error logs in Alertmanager's kube-rbac-proxy starting at 2021-12-16T20:14:33Z.
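
As a quick sanity check of that arithmetic (sketch, GNU date):

date -u -d '2021-12-16 18:55:16 UTC + 1 hour + 19 minutes + 14 seconds' '+%Y-%m-%dT%H:%M:%SZ'
# prints 2021-12-16T20:14:30Z, the expiry time showing up in the kube-rbac-proxy errors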

I don't understand why Prometheus uses the expired certificate but it could be that the certificate on disk hasn't been updated. To dig further we would need the following information:

* oc get secrets -n openshift-monitoring -o yaml metrics-client-certs
* oc get configmaps -n openshift-monitoring -o yaml metrics-client-ca
* a copy of /etc/tls/client/client-ca.crt from a kube-rbac-proxy sidecar that triggers the failure.
* a copy of /etc/prometheus/secrets/metrics-client-certs/tls.crt from prometheus-k8s-0 and prometheus-k8s-1.
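
A possible way to collect and inspect that data is sketched below; the resource names and file paths come from the list above, while the container names and openssl invocations are assumptions:

# client certificate stored in the secret
oc -n openshift-monitoring get secret metrics-client-certs \
  -o go-template='{{index .data "tls.crt"}}' | base64 -d | openssl x509 -noout -subject -dates
# client CA as mounted in a kube-rbac-proxy sidecar (container name assumed;
# openssl x509 only prints the first certificate of the bundle)
oc -n openshift-monitoring exec prometheus-k8s-0 -c kube-rbac-proxy -- \
  cat /etc/tls/client/client-ca.crt | openssl x509 -noout -issuer -dates
# client certificate as mounted in the Prometheus containers (container name assumed)
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
  cat /etc/prometheus/secrets/metrics-client-certs/tls.crt | openssl x509 -noout -subject -dates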

Note: some components aren't affected (e.g. node exporter is still ok) because their service monitors still provide the service account's token.
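
For example, the difference shows up in the ServiceMonitor specs; a sketch, assuming the node-exporter and prometheus-k8s ServiceMonitor names and the usual prometheus-operator endpoint fields:

# token-based authentication (unaffected components)
oc -n openshift-monitoring get servicemonitor node-exporter \
  -o jsonpath='{.spec.endpoints[*].bearerTokenFile}{"\n"}'
# TLS client certificate authentication (affected components)
oc -n openshift-monitoring get servicemonitor prometheus-k8s \
  -o jsonpath='{.spec.endpoints[*].tlsConfig.certFile}{"\n"}'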

Comment 9 Simon Pasquier 2021-12-21 11:46:56 UTC
Since Prometheus uses keep-alive connections, it doesn't renegotiate the TLS connection to kube-rbac-proxy. Instead, Prometheus should forcibly close the connection when it detects that the certificate/key on disk have changed. This explains why the issue partially resolves over time: when Prometheus reconnects after a network failure (container restart or network reset), it picks up the good cert/key.

The issue is already tracked upstream.
https://github.com/prometheus/prometheus/issues/9512
https://github.com/prometheus/common/pull/345

Comment 12 Junqi Zhao 2022-02-16 09:38:08 UTC
checked in a cluster that has been running for 31h; the pods under openshift-monitoring were not recreated (except the prometheus-adapter pods), and the monitoring targets are down with 401 errors except prometheus-adapter/node-exporter, see the picture
# oc get no
NAME                              STATUS   ROLES    AGE     VERSION
**-fsbkf-master-0         Ready    master   31h     v1.23.3+2e8bad7
**-fsbkf-master-1         Ready    master   31h     v1.23.3+2e8bad7
**-fsbkf-master-2         Ready    master   31h     v1.23.3+2e8bad7
**-fsbkf-worker-1-pv24c   Ready    worker   30h     v1.23.3+2e8bad7
**-fsbkf-worker-2-8vm7v   Ready    worker   30h     v1.23.3+2e8bad7
**-fsbkf-worker-3-z9j6r   Ready    worker   30h     v1.23.3+2e8bad7

# oc -n openshift-monitoring get pod
NAME                                         READY   STATUS    RESTARTS   AGE
alertmanager-main-0                          6/6     Running   0          29h
alertmanager-main-1                          6/6     Running   0          29h
cluster-monitoring-operator-68b5d4c8-zbb4c   2/2     Running   0          31h
grafana-5f9949b5d6-44jpb                     3/3     Running   0          30h
kube-state-metrics-55b8886bf8-l5xch          3/3     Running   0          30h
node-exporter-b69zg                          2/2     Running   0          30h
node-exporter-kp2zj                          2/2     Running   0          30h
node-exporter-ktfb2                          2/2     Running   0          30h
node-exporter-rd6gd                          2/2     Running   0          30h
node-exporter-rtqbs                          2/2     Running   0          30h
node-exporter-rx8vd                          2/2     Running   0          30h
openshift-state-metrics-99cbfbf78-j7wqd      3/3     Running   0          30h
prometheus-adapter-c476767f5-v27r9           1/1     Running   0          7h17m
prometheus-adapter-c476767f5-vbksd           1/1     Running   0          7h17m
prometheus-k8s-0                             6/6     Running   0          29h
prometheus-k8s-1                             6/6     Running   0          29h
prometheus-operator-ffd44f999-8v27b          2/2     Running   0          29h
telemeter-client-6f776759f4-gvnwf            3/3     Running   0          30h
thanos-querier-567f9c5f66-6qxgd              6/6     Running   0          29h
thanos-querier-567f9c5f66-nndjg              6/6     Running   0          29h

# oc -n openshift-user-workload-monitoring get pod
NAME                                  READY   STATUS    RESTARTS   AGE
prometheus-operator-c4b5bcb58-xtp66   2/2     Running   0          29h
prometheus-user-workload-0            5/5     Running   0          29h
prometheus-user-workload-1            5/5     Running   0          29h
thanos-ruler-user-workload-0          3/3     Running   0          29h
thanos-ruler-user-workload-1          3/3     Running   0          29h

Comment 23 Junqi Zhao 2022-04-24 04:36:10 UTC
tested with
# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-22-182942   True        False         33h     Cluster version is 4.11.0-0.nightly-2022-04-22-182942

the cluster has been running for 33 hours; checked from the telemeter server, there are no TargetDown alerts for monitoring and no down endpoints for monitoring
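
For reference, an equivalent in-cluster spot check is the same kind of thanos-querier API query sketched in the description, with the expression swapped for the alert itself; the route, token handling and alert labels are assumptions, and an empty result is the expected outcome on the fixed build:

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# expect no firing TargetDown alerts for the monitoring namespaces
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=ALERTS{alertstate="firing",alertname="TargetDown",namespace=~"openshift-monitoring|openshift-user-workload-monitoring"}' \
  "https://$HOST/api/v1/query"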

Comment 28 errata-xmlrpc 2022-08-10 10:40:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

