Created attachment 1775144 [details]
prometheus-operator container logs in multi-node cluster

Description of problem:
"remote error: tls: bad certificate" messages appear in the prometheus-operator container logs; the errors do not otherwise affect the cluster.

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate"
ts=2021-04-25T02:08:04.736355299Z caller=stdlib.go:89 ts="2021/04/25 02:08:04" caller="http: TLS handshake error from 10.128.0.1:56180" msg="remote error: tls: bad certificate"
ts=2021-04-25T02:28:04.452578114Z caller=stdlib.go:89 ts="2021/04/25 02:28:04" caller="http: TLS handshake error from 10.128.0.1:51848" msg="remote error: tls: bad certificate"
ts=2021-04-25T02:48:04.383050955Z caller=stdlib.go:89 ts="2021/04/25 02:48:04" caller="http: TLS handshake error from 10.128.0.1:47396" msg="remote error: tls: bad certificate"
ts=2021-04-25T02:53:03.681243478Z caller=stdlib.go:89 ts="2021/04/25 02:53:03" caller="http: TLS handshake error from 10.128.0.1:53380" msg="remote error: tls: bad certificate"
ts=2021-04-25T02:58:04.3985992Z caller=stdlib.go:89 ts="2021/04/25 02:58:04" caller="http: TLS handshake error from 10.128.0.1:59346" msg="remote error: tls: bad certificate"

NOTE: the error only appears in the prometheus-operator container; every container in the cluster was checked with:

# for project in $(oc get project | grep -v NAME | awk '{print $1}'); do echo "namespace is: " $project; for pod_name in $(oc get pod -n $project | grep -v NAME | awk '{print $1}'); do echo "pod name is: " $pod_name; for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}"); do echo container: $container; oc -n $project logs -c $container $pod_name | grep "remote error: tls: bad certificate"; done; done; done

There are only a few such errors in a multi-node cluster, but many in a single-node (SNO) cluster:

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate" | wc -l
504

# oc get no
NAME                                            STATUS   ROLES           AGE   VERSION
ip-10-0-142-1.ap-northeast-2.compute.internal   Ready    master,worker   88m   v1.21.0-rc.0+a40ec9c

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-24-175929   True        False         61m     Cluster version is 4.8.0-0.nightly-2021-04-24-175929

prometheus-operator 0.45.0

How reproducible:
always

Steps to Reproduce:
1. See the description.
2.
3.

Actual results:
"remote error: tls: bad certificate" errors in the prometheus-operator container logs.

Expected results:
No such errors.

Additional info:
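As an aside, to see whether these errors cluster around a particular window (e.g. startup or a cert rotation), it helps to bucket a saved copy of the operator log by hour. This is just a sketch; `po.log` is a hypothetical filename for a locally saved copy of the container log:

```shell
# Bucket "bad certificate" handshake errors per hour from a saved copy of
# the prometheus-operator log (po.log is a hypothetical local filename).
# The first ts= field holds the RFC 3339 timestamp; its first 13 characters
# (YYYY-MM-DDTHH) are the hour bucket.
grep 'tls: bad certificate' po.log \
  | awk -F'ts=' '{print substr($2, 1, 13)}' \
  | sort | uniq -c | sort -rn
```

With the five sample lines from the description this prints a single bucket, `2021-04-25T02`, with count 5.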
Created attachment 1775145 [details] prometheus-operator container logs in single node cluster
I suspect that the errors come from the API servers trying to validate PrometheusRules resources managed by CVO against the prometheus operator's admission webhook (certificate authority not being synced?). Can you share the API server logs as well as the CVO logs?
Still investigating.
(In reply to Simon Pasquier from comment #2)
> I suspect that the errors come from the API servers trying to validate
> PrometheusRules resources managed by CVO against the prometheus operator
> (certificate authority not being synced?). Can you share the API logs as
> well as the CVO logs?

This is correct. CVO has many lines like the following in its logs:

task.go:112] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 677): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority

Looking at the service-ca-operator logs, there seem to be two things going on.

First, the new service-ca-operator instance is delayed in taking over from the old one:

1 leaderelection.go:346] lock is held by service-ca-operator-57bc47f76d-fn67k_01376061-3bcb-4243-b7f8-61b6238359b7 and has not yet expired
1 leaderelection.go:248] failed to acquire lease openshift-service-ca-operator/service-ca-operator-lock

Second, the service-ca-operator gets throttled on API watches (like other operators):

1 reflector.go:138] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)

The failed watches happen both before and after the initial symptoms (the log messages above all cluster around the same timestamp). It seems likely that the service-ca-operator doesn't sync new certificates because it can't retrieve the current cluster state. If that is indeed the case, https://github.com/openshift/cluster-monitoring-operator/pull/1168 should improve this and fix this issue.

@spasquie Would appreciate your opinion here.
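One way to confirm the stale-CA theory on an affected cluster is to pull the webhook's serving certificate and the injected service CA bundle and check locally whether they still chain. The extraction commands below are a sketch; the secret and configmap names are assumptions based on the usual openshift-monitoring naming and may need adjusting:

```shell
# Hypothetical extraction (resource names are assumptions; adjust as needed):
#   oc -n openshift-monitoring get secret prometheus-operator-tls \
#       -o jsonpath='{.data.tls\.crt}' | base64 -d > tls.crt
#   oc -n openshift-monitoring get configmap serving-certs-ca-bundle \
#       -o jsonpath='{.data.service-ca\.crt}' > service-ca.crt
#
# If the service-ca rotation lagged behind, this fails with
# "unable to get local issuer certificate" instead of printing "tls.crt: OK".
openssl verify -CAfile service-ca.crt tls.crt
```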
@jan your analysis makes sense. I don't think that https://github.com/openshift/cluster-monitoring-operator/pull/1168 will fix it though, since I'm still seeing the "remote error: tls: bad certificate" messages in the prometheus operator's logs [1]. FWIW the errors stop as soon as the service-ca operator logs "synced all controller resources" for the first time (14:19:02) [2]. Also, the operator starts at 14:17:40 but only acquires the lock at 14:19:00. The logs from the previous service-ca operator instance [3] show that it failed to renew its lease and probably bailed out. It might be worth checking with the service-ca operator team what they think about that.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5f76564447-bvmjr_prometheus-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-service-ca-operator_service-ca-operator-58d8d97f4b-sbv8g_service-ca-operator.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-service-ca-operator_service-ca-operator-58d8d97f4b-sbv8g_service-ca-operator_previous.log
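For anyone re-checking this from the gathered artifacts, the "errors stop once service-ca syncs" observation can be sanity-checked by comparing the last handshake error against the first sync message. This is only a sketch; the filenames are examples for locally saved copies of the two linked logs:

```shell
# Last TLS handshake error seen by the prometheus operator
# (prometheus-operator.log is a locally saved copy of the linked log):
grep 'tls: bad certificate' prometheus-operator.log | tail -n 1

# First successful service-ca sync; errors after this point would
# contradict the theory (service-ca-operator.log is a local copy):
grep -m 1 'synced all controller resources' service-ca-operator.log
```

If the timestamp of the first line precedes the second, the correlation holds.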
@slaznick Any chance you can comment on this or point me to someone who can? What is the correct way to handle an update to the service-ca operator without using outdated certificates while the ca operator is updated?
same issue with 4.9.0-0.nightly-2021-07-25-125326

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate" | wc -l
120
The solution is still unclear; this needs further investigation.
https://github.com/openshift/cluster-monitoring-operator/pull/1451 improved the situation already, but there are still a few instances left.
Tested with 4.10.0-0.nightly-2021-12-01-164437 (Prometheus Operator version 0.52.1); no such errors now:

# oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate"
no result
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056