Bug 1953264
| Summary: | "remote error: tls: bad certificate" logs in prometheus-operator container |
| --- | --- |
| Product: | OpenShift Container Platform |
| Component: | Monitoring |
| Version: | 4.8 |
| Status: | CLOSED ERRATA |
| Severity: | low |
| Priority: | medium |
| Reporter: | Junqi Zhao <juzhao> |
| Assignee: | Jan Fajerski <jfajersk> |
| QA Contact: | Junqi Zhao <juzhao> |
| CC: | anpicker, cchen, erooth, jfajersk, slaznick, spasquie, wking |
| Target Milestone: | --- |
| Target Release: | 4.10.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Regression: | --- |
| Story Points: | --- |
| Last Closed: | 2022-03-10 16:03:07 UTC |
| Doc Type: | Bug Fix |
Doc Text:

Cause:
CMO makes use of the service-ca-operator, which manages self-signed TLS artifacts. The service-ca-operator injects these artifacts into appropriately annotated resources such as config maps, specifically into their data field. CMO creates such resources with an empty data field.

Consequence:
CMO and the service-ca-operator compete over the content of the data field (CMO wants it empty, the service-ca-operator wants to add the TLS artifacts). This can lead to components trying to parse empty TLS artifacts. Additionally, both operators run superfluous reconcile loops.

Fix:
CMO no longer creates an empty data field in those resources, leaving the management of this field to the service-ca-operator. With additional declarations in the volume specifications of the affected pods, the pods wait until the needed TLS artifacts are populated before they start (see the sketch below).

Result:
Users no longer see "remote error: tls: bad certificate" errors in component logs.
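A minimal sketch of how the injection described above can be checked on a live cluster. The configmap name `serving-certs-ca-bundle` is an assumption used for illustration; the `service.beta.openshift.io/inject-cabundle` annotation is the standard service-ca-operator mechanism for CA-bundle injection, and the exact resources CMO manages may differ.

```bash
# Sketch (assumed configmap name): confirm the configmap carries the
# service-ca-operator CA-bundle injection annotation ...
oc -n openshift-monitoring get configmap serving-certs-ca-bundle -o yaml \
  | grep 'service.beta.openshift.io/inject-cabundle'

# ... and that the service-ca-operator has populated the data field that CMO
# now leaves unset (non-empty output means the injection happened).
oc -n openshift-monitoring get configmap serving-certs-ca-bundle \
  -o jsonpath='{.data.service-ca\.crt}' | head -c 100
```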
Attachments:
Created attachment 1775145 [details]
prometheus-operator container logs in single node cluster
I suspect that the errors come from the API servers trying to validate PrometheusRules resources managed by CVO against the prometheus operator (certificate authority not being synced?). Can you share the API logs as well as the CVO logs? Still investigating.

(In reply to Simon Pasquier from comment #2)
> I suspect that the errors come from the API servers trying to validate
> PrometheusRules resources managed by CVO against the prometheus operator
> (certificate authority not being synced?). Can you share the API logs as
> well as the CVO logs?

This is correct. CVO has many lines like the following in its logs:

    task.go:112] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 677): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority

Looking at the service-ca-operator logs, there seem to be two things going on. First, the new service-ca-operator instance is a bit delayed in taking over from the old one:

    1 leaderelection.go:346] lock is held by service-ca-operator-57bc47f76d-fn67k_01376061-3bcb-4243-b7f8-61b6238359b7 and has not yet expired
    1 leaderelection.go:248] failed to acquire lease openshift-service-ca-operator/service-ca-operator-lock

Second, the service-ca-operator gets throttled on API watches (like other operators too):

    1 reflector.go:138] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Failed to watch *v1.ConfigMap: the server has received too many requests and has asked us to try again later (get configmaps)

The failed watches happen before and after the initial symptoms (the above log messages are all around the same timestamp). It seems likely that the service-ca-operator doesn't sync new certificates because it can't retrieve the current cluster state. If that is indeed the case, https://github.com/openshift/cluster-monitoring-operator/pull/1168 should improve this and fix this issue. @spasquie Would appreciate your opinion here.

@jan your analysis makes sense. I don't think that https://github.com/openshift/cluster-monitoring-operator/pull/1168 will fix it, since I'm still seeing the "remote error: tls: bad certificate" messages in the prometheus operator's logs [1]. FWIW the errors stop as soon as the service-ca operator logs "synced all controller resources" for the first time (14:19:02). Also, the operator starts at 14:17:40 but acquires the lock only at 14:19:00. The logs from the previous service-ca operator instance show that it failed to renew the lease and probably bailed out. It might be worth checking with the service-ca operator team what they think about that.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-5f76564447-bvmjr_prometheus-operator.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-service-ca-operator_service-ca-operator-58d8d97f4b-sbv8g_service-ca-operator.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1168/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-single-node/1395370123937714176/artifacts/e2e-aws-single-node/gather-extra/artifacts/pods/openshift-service-ca-operator_service-ca-operator-58d8d97f4b-sbv8g_service-ca-operator_previous.log
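For anyone reproducing this analysis, a hedged sketch of how the three logs discussed above can be pulled and correlated. The deployment names and namespaces are assumptions based on the default OpenShift layout; the "synced all controller resources" message is the one quoted in the comment above.

```bash
# Sketch: correlate the last TLS handshake error in the prometheus-operator
# logs with the first full sync reported by the service-ca operator and the
# webhook failures reported by CVO (deployment names are assumptions).
oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator \
  | grep "tls: bad certificate" | tail -n 1
oc -n openshift-service-ca-operator logs deploy/service-ca-operator \
  | grep "synced all controller resources" | head -n 1
oc -n openshift-cluster-version logs deploy/cluster-version-operator \
  | grep "failed calling webhook \"prometheusrules.openshift.io\"" | tail -n 5
```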
@slaznick Any chance you can comment on this or point me to someone who can? What is the correct way to handle an update to the service-ca operator without using outdated certificates while the operator is being updated?

Same issue with 4.9.0-0.nightly-2021-07-25-125326:

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate" | wc -l
    120

The solution is still unclear and needs further investigation.

https://github.com/openshift/cluster-monitoring-operator/pull/1451 already improved the situation, but there are still a few instances left.

Tested with 4.10.0-0.nightly-2021-12-01-164437, Prometheus Operator version=0.52.1; no such error now:

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate"
    no result

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
Created attachment 1775144 [details]
prometheus-operator container logs in multi-node cluster

Description of problem:
"remote error: tls: bad certificate" logs appear in the prometheus-operator container; they do not affect the cluster.

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate"
    ts=2021-04-25T02:08:04.736355299Z caller=stdlib.go:89 ts="2021/04/25 02:08:04" caller="http: TLS handshake error from 10.128.0.1:56180" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:28:04.452578114Z caller=stdlib.go:89 ts="2021/04/25 02:28:04" caller="http: TLS handshake error from 10.128.0.1:51848" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:48:04.383050955Z caller=stdlib.go:89 ts="2021/04/25 02:48:04" caller="http: TLS handshake error from 10.128.0.1:47396" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:53:03.681243478Z caller=stdlib.go:89 ts="2021/04/25 02:53:03" caller="http: TLS handshake error from 10.128.0.1:53380" msg="remote error: tls: bad certificate"
    ts=2021-04-25T02:58:04.3985992Z caller=stdlib.go:89 ts="2021/04/25 02:58:04" caller="http: TLS handshake error from 10.128.0.1:59346" msg="remote error: tls: bad certificate"

NOTE: only the prometheus-operator container in the cluster shows this error (a slightly more robust variant of this cluster-wide scan is sketched at the end of this report):

    # for project in $(oc get project | grep -v NAME | awk '{print $1}'); do echo "namespace is: " $project; for pod_name in $(oc get pod -n $project| grep -v NAME | awk '{print $1}'); do echo "pod name is: " $pod_name; for container in $(oc -n $project get pods $pod_name -o jsonpath="{.spec.containers[*].name}"); do echo container: $container;oc -n $project logs -c $container $pod_name | grep "remote error: tls: bad certificate"; done; done; done

There are fewer such errors in multi-node clusters, but a lot of them in an SNO cluster:

    # oc -n openshift-monitoring logs $(oc -n openshift-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator | grep "tls: bad certificate" | wc -l
    504
    # oc get no
    NAME                                            STATUS   ROLES           AGE   VERSION
    ip-10-0-142-1.ap-northeast-2.compute.internal   Ready    master,worker   88m   v1.21.0-rc.0+a40ec9c

Version-Release number of selected component (if applicable):

    # oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.8.0-0.nightly-2021-04-24-175929   True        False         61m     Cluster version is 4.8.0-0.nightly-2021-04-24-175929

prometheus-operator 0.45.0

How reproducible: always

Steps to Reproduce:
1. See the description.

Actual results:

Expected results:

Additional info:
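As referenced in the NOTE above, a hedged, slightly more robust variant of the cluster-wide log scan. It avoids parsing table output by using jsonpath and only reports containers whose logs actually contain the error; it is purely illustrative and not part of the original reproduction steps.

```bash
# Sketch: scan every container log in the cluster for the TLS handshake error
# and print only the namespaces/pods/containers that hit it.
for ns in $(oc get projects -o jsonpath='{.items[*].metadata.name}'); do
  for pod in $(oc -n "$ns" get pods -o jsonpath='{.items[*].metadata.name}'); do
    for c in $(oc -n "$ns" get pod "$pod" -o jsonpath='{.spec.containers[*].name}'); do
      hits=$(oc -n "$ns" logs "$pod" -c "$c" 2>/dev/null \
        | grep -c "remote error: tls: bad certificate")
      [ "$hits" -gt 0 ] && echo "$ns/$pod/$c: $hits"
    done
  done
done
```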