Bug 1953264
| Summary: | "remote error: tls: bad certificate" logs in prometheus-operator container |
| --- | --- |
| Product: | OpenShift Container Platform |
| Component: | Monitoring |
| Version: | 4.8 |
| Status: | CLOSED ERRATA |
| Severity: | low |
| Priority: | medium |
| Reporter: | Junqi Zhao <juzhao> |
| Assignee: | Jan Fajerski <jfajersk> |
| QA Contact: | Junqi Zhao <juzhao> |
| CC: | anpicker, cchen, erooth, jfajersk, slaznick, spasquie, wking |
| Target Milestone: | --- |
| Target Release: | 4.10.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Regression: | --- |
| Story Points: | --- |
| Last Closed: | 2022-03-10 16:03:07 UTC |
| Doc Type: | Bug Fix |
Doc Text:

Cause:
CMO makes use of the service-ca-operator, which manages self-signed TLS artifacts. The service-ca-operator injects these artifacts into appropriately annotated resources such as config maps, specifically into their data field. CMO creates such resources with an empty data field.

Consequence:
CMO and the service-ca-operator compete over the content of the data field (CMO wants it empty, the service-ca-operator wants to add the TLS artifacts). This can lead to components trying to parse empty TLS artifacts. Additionally, both operators run superfluous reconcile loops.

Fix:
CMO no longer creates an empty data field in those resources, leaving the management of this field to the service-ca-operator. With additional declarations in the volume specifications of the affected pods, the pods wait until the needed TLS artifacts are populated before they start (see the sketch below).

Result:
Users no longer see "remote error: tls: bad certificate" errors in component logs.
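A minimal sketch of how the injection described above can be checked on a live cluster. The configmap name `serving-certs-ca-bundle` is an assumption used for illustration; the `service.beta.openshift.io/inject-cabundle` annotation is the standard service-ca-operator mechanism for CA-bundle injection, and the exact resources CMO manages may differ.

```bash
# Sketch (assumed configmap name): confirm the configmap carries the
# service-ca-operator CA-bundle injection annotation ...
oc -n openshift-monitoring get configmap serving-certs-ca-bundle -o yaml \
  | grep 'service.beta.openshift.io/inject-cabundle'

# ... and that the service-ca-operator has populated the data field that CMO
# now leaves unset (non-empty output means the injection happened).
oc -n openshift-monitoring get configmap serving-certs-ca-bundle \
  -o jsonpath='{.data.service-ca\.crt}' | head -c 100
```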
Attachments:
Created attachment 1775145 [details]
prometheus-operator container logs in single node cluster
I suspect that the errors come from the API servers trying to validate PrometheusRules resources managed by CVO against the prometheus operator (certificate authority not being synced?). Can you share the API logs as well as the CVO logs? Still investigating.

(In reply to Simon Pasquier from comment #2)
> I suspect that the errors come from the API servers trying to validate
> PrometheusRules resources managed by CVO against the prometheus operator
> (certificate authority not being synced?). Can you share the API logs as
> well as the CVO logs?

This is correct. CVO has many lines like the following in its logs:

    task.go:112] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 677): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority

Looking at the service-ca-operator logs, there seem to be two things going on. First, the new service-ca-operator instance is a bit delayed in taking over from the old one:

    1 leaderelection.go:346] lock is held by service-ca-operator-57bc47f76d-fn67k_01376061-3bcb-4243-b7f8-61b6238359b7 and has not yet expired
    1 leaderelection.go:248] failed to acquire lease openshift-service-ca-operator/service-ca-operator-lock

Second, the service-ca-operator gets throttled on API watches (like other operators too):

    1 reflector.go:138] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)

The failed watches happen before and after the initial symptoms (the above log messages are all around the same timestamp). It seems likely that the service-ca-operator doesn't sync new certificates because it can't retrieve the current cluster state. If that is indeed the case, https://github.com/openshift/cluster-monitoring-operator/pull/1168 should improve this and fix this issue. @spasquie Would appreciate your opinion here.

@jan your analysis makes sense. I don't think that https://github.com/openshift/cluster-monitoring-operator/pull/1168 will fix it, since I'm still seeing the "remote error: tls: bad certificate" messages in the prometheus operator's logs [1]. FWIW the errors stop as soon as the service-ca operator logs "synced all controller resources" for the first time (14:19:02). Also, the operator starts at 14:17:40 but acquires the lock only at 14:19:00. The logs from the previous service-ca operator instance show that it failed to renew the lease and probably bailed out. It might be worth checking with the service-ca operator team what they think about that.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5f76564447-bvmjr_prometheus-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-service-ca-operator_service-ca-operator-58d8d97f4b-sbv8g_service-ca-operator.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-service-ca-operator_service-ca-operator-58d8d97f4b-sbv8g_service-ca-operator_previous.log
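For anyone reproducing this analysis, a hedged sketch of how the three logs discussed above can be pulled and correlated. The deployment names and namespaces are assumptions based on the default OpenShift layout; the "synced all controller resources" message is the one quoted in the comment above.

```bash
# Sketch: correlate the last TLS handshake error in the prometheus-operator
# logs with the first full sync reported by the service-ca operator and the
# webhook failures reported by CVO (deployment names are assumptions).
oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator \
  | grep "tls: bad certificate" | tail -n 1
oc -n openshift-service-ca-operator logs deploy/service-ca-operator \
  | grep "synced all controller resources" | head -n 1
oc -n openshift-cluster-version logs deploy/cluster-version-operator \
  | grep "failed calling webhook \"prometheusrules.openshift.io\"" | tail -n 5
```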
@slaznick Any chance you can comment on this or point me to someone who can? What is the correct way to handle an update to the service-ca operator without using outdated certificates while the operator is being updated?

Same issue with 4.9.0-0.nightly-2021-07-25-125326:

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate" | wc -l
    120

The solution is still unclear and needs further investigation.

https://github.com/openshift/cluster-monitoring-operator/pull/1451 already improved the situation, but there are still a few instances left.

Tested with 4.10.0-0.nightly-2021-12-01-164437, Prometheus Operator version=0.52.1; no such error now:

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate"
    no result

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
Created attachment 1775144 [details]
prometheus-operator container logs in multi-node cluster

Description of problem:
"remote error: tls: bad certificate" logs appear in the prometheus-operator container; they do not affect the cluster.

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate"
    ts=2021-04-25T02:08:04.736355299Z caller=stdlib.go:89 ts="2021/04/25 02:08:04" caller="http: TLS handshake error from 10.128.0.1:56180" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:28:04.452578114Z caller=stdlib.go:89 ts="2021/04/25 02:28:04" caller="http: TLS handshake error from 10.128.0.1:51848" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:48:04.383050955Z caller=stdlib.go:89 ts="2021/04/25 02:48:04" caller="http: TLS handshake error from 10.128.0.1:47396" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:53:03.681243478Z caller=stdlib.go:89 ts="2021/04/25 02:53:03" caller="http: TLS handshake error from 10.128.0.1:53380" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:58:04.3985992Z caller=stdlib.go:89 ts="2021/04/25 02:58:04" caller="http: TLS handshake error from 10.128.0.1:59346" msg="remote error: tls: bad certificate"

NOTE: only the prometheus-operator container in the cluster shows this error (a slightly more robust variant of this cluster-wide scan is sketched at the end of this report):

    # for project in $(oc get project | grep -v NAME | awk '{print $1}'); do echo "namespace is: " $project; for pod_name in $(oc get pod -n $project| grep -v NAME | awk '{print $1}'); do echo "pod name is: " $pod_name; for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}"); do echo container: $container;oc -n $project logs -c $container $pod_name | grep "remote error: tls: bad certificate"; done; done; done

There are fewer such errors in multi-node clusters, but a lot of them in an SNO cluster:

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate" | wc -l
    504
    # oc get no
    NAME                                            STATUS   ROLES           AGE   VERSION
    ip-10-0-142-1.ap-northeast-2.compute.internal   Ready    master,worker   88m   v1.21.0-rc.0+a40ec9c

Version-Release number of selected component (if applicable):

    # oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.8.0-0.nightly-2021-04-24-175929   True        False         61m     Cluster version is 4.8.0-0.nightly-2021-04-24-175929

prometheus-operator 0.45.0

How reproducible: always

Steps to Reproduce:
1. See the description.

Actual results:

Expected results:

Additional info:
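As referenced in the NOTE above, a hedged, slightly more robust variant of the cluster-wide log scan. It avoids parsing table output by using jsonpath and only reports containers whose logs actually contain the error; it is purely illustrative and not part of the original reproduction steps.

```bash
# Sketch: scan every container log in the cluster for the TLS handshake error
# and print only the namespaces/pods/containers that hit it.
for ns in $(oc get projects -o jsonpath='{.items[*].metadata.name}'); do
  for pod in $(oc -n "$ns" get pods -o jsonpath='{.items[*].metadata.name}'); do
    for c in $(oc -n "$ns" get pod "$pod" -o jsonpath='{.spec.containers[*].name}'); do
      hits=$(oc -n "$ns" logs "$pod" -c "$c" 2>/dev/null \
        | grep -c "remote error: tls: bad certificate")
      [ "$hits" -gt 0 ] && echo "$ns/$pod/$c: $hits"
    done
  done
done
```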