Created attachment 1846694 [details]
monitoring targets are down

Description of problem:
The cluster has been running for 36h.

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-14-083101   True        False         36h     Cluster version is 4.10.0-0.nightly-2021-12-14-083101

A few monitoring targets are down due to "Error: server returned HTTP status 401 Unauthorized". The affected targets include grafana, prometheus-k8s, telemeter-client, thanos-sidecar, and alertmanager.

Checking the logs, there are errors of the form "Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time **** is after 2021-12-16T20:14:30Z". After the cluster has been running for almost one day, the targets go down.

Example: kube-rbac-proxy container logs of the prometheus-k8s-0 pod:

2021-12-15T20:42:14.037307604Z I1215 20:42:14.037071 1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
2021-12-15T20:42:14.040466228Z I1215 20:42:14.040394 1 main.go:181] Valid token audiences:
2021-12-15T20:42:14.040860463Z I1215 20:42:14.040816 1 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca::/etc/tls/client/client-ca.crt"
2021-12-15T20:42:14.041045983Z I1215 20:42:14.040997 1 main.go:305] Reading certificate files
2021-12-15T20:42:14.041296788Z I1215 20:42:14.041154 1 dynamic_cafile_content.go:167] Starting client-ca::/etc/tls/client/client-ca.crt
2021-12-15T20:42:14.041358071Z I1215 20:42:14.041342 1 reloader.go:98] reloading key /etc/tls/private/tls.key certificate /etc/tls/private/tls.crt
2021-12-15T20:42:14.041722605Z I1215 20:42:14.041682 1 main.go:339] Starting TCP socket on 0.0.0.0:9092
2021-12-15T20:42:14.042563009Z I1215 20:42:14.042495 1 main.go:346] Listening securely on 0.0.0.0:9092
2021-12-16T15:28:14.043412756Z I1216 15:28:14.043113 1 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca::/etc/tls/client/client-ca.crt"
2021-12-16T15:32:14.042711352Z I1216 15:32:14.042521 1 dynamic_cafile_content.go:129] Loaded a new CA Bundle and Verifier for "client-ca::/etc/tls/client/client-ca.crt"
2021-12-16T20:14:35.200286140Z E1216 20:14:35.200081 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:35Z is after 2021-12-16T20:14:30Z
2021-12-16T20:14:44.071594661Z E1216 20:14:44.071463 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:44Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:05.200062356Z E1216 20:15:05.199898 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:05Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:14.072273022Z E1216 20:15:14.070036 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:14Z is after 2021-12-16T20:14:30Z

and kube-rbac-proxy-metric container logs of the alertmanager-main-0 pod:

2021-12-15T20:41:57.832903155Z I1215 20:41:57.832680 1 main.go:151] Reading config file: /etc/kube-rbac-proxy/config.yaml
2021-12-15T20:41:57.849665600Z I1215 20:41:57.849561 1 main.go:181] Valid token audiences:
2021-12-15T20:41:57.851298263Z I1215 20:41:57.851142 1 main.go:305] Reading certificate files
2021-12-15T20:41:57.855484254Z I1215 20:41:57.855434 1 dynamic_cafile_content.go:167] Starting client-ca::/etc/tls/client/client-ca.crt
2021-12-15T20:41:57.856139965Z I1215 20:41:57.856001 1 main.go:339] Starting TCP socket on 0.0.0.0:9097
2021-12-15T20:41:57.866276093Z I1215 20:41:57.863687 1 main.go:346] Listening securely on 0.0.0.0:9097
2021-12-16T20:14:30.664561462Z E1216 20:14:30.664417 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:30Z is after 2021-12-16T20:14:30Z
2021-12-16T20:14:56.424561585Z E1216 20:14:56.424473 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:14:56Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:00.664743050Z E1216 20:15:00.664629 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:00Z is after 2021-12-16T20:14:30Z
2021-12-16T20:15:26.424947375Z E1216 20:15:26.424880 1 proxy.go:73] Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid: current time 2021-12-16T20:15:26Z is after 2021-12-16T20:14:30Z

NOTE: after recreating the alertmanager pods, the alertmanager targets are healthy again; the same should hold for the other targets.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-14-083101

How reproducible:
Check the targets after the cluster has been running for a long time.

Steps to Reproduce:
1. See the description.

Actual results:
A few monitoring targets are down.

Expected results:
No targets are down.

Additional info:
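For context on where the 401s come from: kube-rbac-proxy rejects each scrape because verification of the client certificate presented by Prometheus fails, and the error text is Go's standard crypto/x509 message. Below is a minimal Go sketch of such a check; this is not kube-rbac-proxy's actual code, and the file paths are illustrative placeholders.

package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// Verify a client certificate against a CA bundle, the way an authenticating
// proxy would on each request. With an expired certificate this prints the
// same error seen in the logs above:
// "x509: certificate has expired or is not yet valid: current time ... is after ...".
func main() {
	caPEM, err := os.ReadFile("client-ca.crt") // e.g. /etc/tls/client/client-ca.crt
	if err != nil {
		panic(err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		panic("no CA certificates found")
	}

	clientPEM, err := os.ReadFile("tls.crt") // the certificate the client presented
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(clientPEM)
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}

	_, err = cert.Verify(x509.VerifyOptions{
		Roots:       roots,
		CurrentTime: time.Now(),
		KeyUsages:   []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	fmt.Println("verify:", err) // nil if valid, CertificateInvalidError if expired
}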
From the CMO logs, I see that CMO asks for renewal of the metrics client certificate at least twice (the certificate is used by Prometheus to authenticate against the metrics endpoints).

2021-12-16T14:21:11.327876626Z I1216 14:21:11.327257 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateRotationStarted' The current client certificate for OpenShiftMonitoringClientCertRequester expires in 5h53m19s. Start certificate rotation
2021-12-16T14:21:11.390888208Z I1216 14:21:11.375482 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRCreated' A csr "system:openshift:openshift-monitoring-xwh98" is created for OpenShiftMonitoringClientCertRequester
2021-12-16T14:21:11.574245322Z I1216 14:21:11.574057 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ClientCertificateCreated' A new client certificate for OpenShiftMonitoringClientCertRequester is available
2021-12-16T18:55:16.265051777Z I1216 18:55:16.264073 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateRotationStarted' The current client certificate for OpenShiftMonitoringClientCertRequester expires in 1h19m14s. Start certificate rotation
2021-12-16T18:55:16.273436660Z I1216 18:55:16.273365 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CSRCreated' A csr "system:openshift:openshift-monitoring-nqk4q" is created for OpenShiftMonitoringClientCertRequester
2021-12-16T18:55:16.383463392Z I1216 18:55:16.383290 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-monitoring", Name:"openshift-monitoring", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ClientCertificateCreated' A new client certificate for OpenShiftMonitoringClientCertRequester is available

The certificate seems to be renewed properly, since the generated secret has a valid expiry date:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            a1:4f:95:61:a4:fe:1f:c4:db:21:cb:4a:f1:e8:66:1f
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = kube-csr-signer_@1639668584
        Validity
            Not Before: Dec 16 18:50:16 2021 GMT
            Not After : Jan 15 15:29:44 2022 GMT
        Subject: CN = system:serviceaccount:openshift-monitoring:prometheus-k8s

The log line at 2021-12-16T18:55:16Z says that the certificate expires in 1h19m14s, i.e. at 2021-12-16T20:14:30Z, which is consistent with the first error logs in Alertmanager's kube-rbac-proxy starting at 2021-12-16T20:14:30Z. I don't understand why Prometheus uses the expired certificate, but it could be that the certificate on disk hasn't been updated.
To dig further, we would need the following information (a small inspection sketch follows the list):

* oc get secrets -n openshift-monitoring -o yaml metrics-client-certs
* oc get configmaps -n openshift-monitoring -o yaml metrics-client-ca
* a copy of /etc/tls/client/client-ca.crt from a kube-rbac-proxy sidecar that triggers the failure
* a copy of /etc/prometheus/secrets/metrics-client-certs/tls.crt from prometheus-k8s-0 and prometheus-k8s-1

Note: some components aren't affected (e.g. node-exporter is still OK) because their service monitors still provide the service account's token.
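To make that comparison concrete, here is a minimal Go sketch (only the command-line file path is assumed; nothing in it is specific to this cluster) that prints the validity window and a SHA-256 fingerprint for each certificate in a PEM file. Running it on the tls.crt decoded from the metrics-client-certs secret and on the copy mounted in the prometheus pods would show whether the renewed certificate actually reached the pod's filesystem.

package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// Print subject, validity window, and SHA-256 fingerprint of every
// certificate in the PEM file given as the first argument, e.g.
// `go run inspect.go tls.crt`.
func main() {
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	for {
		var block *pem.Block
		block, data = pem.Decode(data)
		if block == nil {
			break
		}
		if block.Type != "CERTIFICATE" {
			continue // skip keys and other PEM blocks
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			panic(err)
		}
		fmt.Printf("subject=%s notBefore=%s notAfter=%s sha256=%x\n",
			cert.Subject,
			cert.NotBefore.UTC().Format(time.RFC3339),
			cert.NotAfter.UTC().Format(time.RFC3339),
			sha256.Sum256(block.Bytes))
	}
}

Matching fingerprints would mean the kubelet propagated the renewed secret and the problem lies in how Prometheus (re)uses it; differing fingerprints would point at stale volume contents.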
Since Prometheus uses keep-alive connections, it doesn't renegotiate the TLS connection to kube-rbac-proxy. Instead, Prometheus should forcibly close the connection when it detects that the certificate/key on disk have changed. This also explains why the issue partially resolves over time: when Prometheus reconnects after a network failure (a container restart or network reset), it picks up the fresh cert/key. The issue is already tracked upstream (a sketch of the fix pattern follows the links):

https://github.com/prometheus/prometheus/issues/9512
https://github.com/prometheus/common/pull/345
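A minimal Go sketch of that pattern (the polling loop and file paths are illustrative assumptions, not the actual prometheus/common implementation): load the client certificate from disk on every TLS handshake, and close idle keep-alive connections when the files change, so no connection outlives the certificate it was established with.

package main

import (
	"crypto/tls"
	"net/http"
	"os"
	"time"
)

// newTransport returns an HTTP transport whose client certificate is re-read
// from disk on every TLS handshake, plus a background watcher that closes
// idle (kept-alive) connections whenever the cert file changes. Without the
// watcher, an established keep-alive connection keeps using the certificate
// presented at handshake time even after it expires on disk -- the failure
// mode described above.
func newTransport(certFile, keyFile string) *http.Transport {
	tr := &http.Transport{
		TLSClientConfig: &tls.Config{
			// Called on each handshake, so every new connection gets the
			// latest cert/key pair from disk.
			GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
				cert, err := tls.LoadX509KeyPair(certFile, keyFile)
				if err != nil {
					return nil, err
				}
				return &cert, nil
			},
		},
	}

	// Poll the cert file and drop idle connections when it changes, forcing
	// the next request to renegotiate TLS with the renewed certificate.
	go func() {
		var lastMod time.Time
		for range time.Tick(time.Minute) {
			fi, err := os.Stat(certFile)
			if err != nil {
				continue
			}
			if fi.ModTime().After(lastMod) {
				if !lastMod.IsZero() {
					tr.CloseIdleConnections()
				}
				lastMod = fi.ModTime()
			}
		}
	}()
	return tr
}

func main() {
	tr := newTransport(
		"/etc/prometheus/secrets/metrics-client-certs/tls.crt",
		"/etc/prometheus/secrets/metrics-client-certs/tls.key",
	)
	client := &http.Client{Transport: tr, Timeout: 30 * time.Second}
	_ = client // scrape targets with this client as usual
}

With this pattern, the first scrape after a certificate rotation opens a fresh connection and presents the renewed certificate instead of reusing a handshake done with the old one.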
Checked on a cluster that has been running for 31h. The pods under openshift-monitoring were not recreated (except the prometheus-adapter pods), and the monitoring targets are down with 401 errors, except prometheus-adapter and node-exporter; see the picture.

# oc get no
NAME                      STATUS   ROLES    AGE   VERSION
**-fsbkf-master-0         Ready    master   31h   v1.23.3+2e8bad7
**-fsbkf-master-1         Ready    master   31h   v1.23.3+2e8bad7
**-fsbkf-master-2         Ready    master   31h   v1.23.3+2e8bad7
**-fsbkf-worker-1-pv24c   Ready    worker   30h   v1.23.3+2e8bad7
**-fsbkf-worker-2-8vm7v   Ready    worker   30h   v1.23.3+2e8bad7
**-fsbkf-worker-3-z9j6r   Ready    worker   30h   v1.23.3+2e8bad7

# oc -n openshift-monitoring get pod
NAME                                         READY   STATUS    RESTARTS   AGE
alertmanager-main-0                          6/6     Running   0          29h
alertmanager-main-1                          6/6     Running   0          29h
cluster-monitoring-operator-68b5d4c8-zbb4c   2/2     Running   0          31h
grafana-5f9949b5d6-44jpb                     3/3     Running   0          30h
kube-state-metrics-55b8886bf8-l5xch          3/3     Running   0          30h
node-exporter-b69zg                          2/2     Running   0          30h
node-exporter-kp2zj                          2/2     Running   0          30h
node-exporter-ktfb2                          2/2     Running   0          30h
node-exporter-rd6gd                          2/2     Running   0          30h
node-exporter-rtqbs                          2/2     Running   0          30h
node-exporter-rx8vd                          2/2     Running   0          30h
openshift-state-metrics-99cbfbf78-j7wqd      3/3     Running   0          30h
prometheus-adapter-c476767f5-v27r9           1/1     Running   0          7h17m
prometheus-adapter-c476767f5-vbksd           1/1     Running   0          7h17m
prometheus-k8s-0                             6/6     Running   0          29h
prometheus-k8s-1                             6/6     Running   0          29h
prometheus-operator-ffd44f999-8v27b          2/2     Running   0          29h
telemeter-client-6f776759f4-gvnwf            3/3     Running   0          30h
thanos-querier-567f9c5f66-6qxgd              6/6     Running   0          29h
thanos-querier-567f9c5f66-nndjg              6/6     Running   0          29h

# oc -n openshift-user-workload-monitoring get pod
NAME                                  READY   STATUS    RESTARTS   AGE
prometheus-operator-c4b5bcb58-xtp66   2/2     Running   0          29h
prometheus-user-workload-0            5/5     Running   0          29h
prometheus-user-workload-1            5/5     Running   0          29h
thanos-ruler-user-workload-0          3/3     Running   0          29h
thanos-ruler-user-workload-1          3/3     Running   0          29h
Tested with:

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-22-182942   True        False         33h     Cluster version is 4.11.0-0.nightly-2022-04-22-182942

The cluster has been running for 33 hours. Checked from the telemeter server: there are no TargetDown alerts for monitoring and no down endpoints for monitoring.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069