1683913 – "x509: certificate signed by unknown authority" for 10250/metrics and 10250/metrics/cadvisor targets on worker nodes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Frederic Branczyk
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-02-28 05:46 UTC by Junqi Zhao
Modified:	2019-06-04 10:44 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:44:42 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:44:49 UTC

Description Junqi Zhao 2019-02-28 05:46:48 UTC

Description of problem:
Cloned from https://jira.coreos.com/browse/MON-591

After a 4.0 cluster has been running for more than a day, the 10250/metrics and 10250/metrics/cadvisor targets on all worker nodes report the error "x509: certificate signed by unknown authority".

In this case, we have two worker nodes, 172.31.128.135 and 172.31.153.7

$ oc get node -o wide | awk '{print $1"    "$3"    "$4"    "$6}'
NAME    ROLES    AGE    INTERNAL-IP
ip-172-31-128-135.us-east-2.compute.internal    worker    28h    172.31.128.135
ip-172-31-137-246.us-east-2.compute.internal    master    28h    172.31.137.246
ip-172-31-146-164.us-east-2.compute.internal    master    28h    172.31.146.164
ip-172-31-153-7.us-east-2.compute.internal      worker    28h    172.31.153.7
ip-172-31-164-14.us-east-2.compute.internal     master    28h    172.31.164.14

$ oc get pod -n openshift-monitoring -o wide
NAME                                           READY     STATUS    RESTARTS   AGE       IP               NODE                                           NOMINATED NODE
alertmanager-main-0                            3/3       Running   0          27h       10.128.2.5       ip-172-31-128-135.us-east-2.compute.internal   <none>
alertmanager-main-1                            3/3       Running   0          27h       10.131.0.12      ip-172-31-153-7.us-east-2.compute.internal     <none>
alertmanager-main-2                            3/3       Running   0          27h       10.128.2.11      ip-172-31-128-135.us-east-2.compute.internal   <none>
cluster-monitoring-operator-549ff4d5dd-vl9lj   1/1       Running   0          28h       10.130.0.21      ip-172-31-137-246.us-east-2.compute.internal   <none>
grafana-754d4bf6bc-nhtk9                       2/2       Running   0          27h       10.128.2.4       ip-172-31-128-135.us-east-2.compute.internal   <none>
kube-state-metrics-5799dc74ff-rbhtt            3/3       Running   0          27h       10.131.0.11      ip-172-31-153-7.us-east-2.compute.internal     <none>
node-exporter-8dtg5                            2/2       Running   0          27h       172.31.146.164   ip-172-31-146-164.us-east-2.compute.internal   <none>
node-exporter-gzh8l                            2/2       Running   0          27h       172.31.137.246   ip-172-31-137-246.us-east-2.compute.internal   <none>
node-exporter-j4gzp                            2/2       Running   0          27h       172.31.164.14    ip-172-31-164-14.us-east-2.compute.internal    <none>
node-exporter-p5rbm                            2/2       Running   0          27h       172.31.128.135   ip-172-31-128-135.us-east-2.compute.internal   <none>
node-exporter-wbxhq                            2/2       Running   0          27h       172.31.153.7     ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-adapter-85555d8646-776q5            1/1       Running   0          7m6s      10.131.1.138     ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-adapter-85555d8646-rhf2c            1/1       Running   0          6m58s     10.128.3.105     ip-172-31-128-135.us-east-2.compute.internal   <none>
prometheus-k8s-0                               6/6       Running   1          27h       10.128.2.10      ip-172-31-128-135.us-east-2.compute.internal   <none>
prometheus-k8s-1                               6/6       Running   1          27h       10.131.0.14      ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-operator-64fc65bf9c-x8w5h           1/1       Running   0          28h       10.131.0.7       ip-172-31-153-7.us-east-2.compute.internal     <none>
telemeter-client-6cfd8d6879-bhj8x              3/3       Running   0          12h       10.128.2.226     ip-172-31-128-135.us-east-2.compute.internal   <none>

$ prometheus_route=$(oc -n openshift-monitoring get route | grep prometheus-k8s | awk '{print $2}');curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://${prometheus_route}/targets | grep -i down
$ prometheus_route=$(oc -n openshift-monitoring get route | grep prometheus-k8s | awk '{print $2}');curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://${prometheus_route}/targets | grep -i x509

              <span class="alert alert-danger state_indicator">Get https://172.31.128.135:10250/metrics: x509: certificate signed by unknown authority</span>
              <span class="alert alert-danger state_indicator">Get https://172.31.153.7:10250/metrics: x509: certificate signed by unknown authority</span>
              <span class="alert alert-danger state_indicator">Get https://172.31.128.135:10250/metrics/cadvisor: x509: certificate signed by unknown authority</span>
              <span class="alert alert-danger state_indicator">Get https://172.31.153.7:10250/metrics/cadvisor: x509: certificate signed by unknown authority</span>
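The failing scrape URLs can be pulled out of that HTML in a more readable form. The following is a minimal sketch and not part of the original report: extract_x509_targets is a hypothetical helper name, and it assumes the error text keeps the form shown above ("Get <url>: x509: ...").

```shell
# extract_x509_targets (hypothetical helper): read the Prometheus /targets
# HTML on stdin and print each scrape URL that failed with an x509 error,
# one per line, de-duplicated.
extract_x509_targets() {
  # Capture the URL between "Get " and ": x509"; lines without an x509
  # error are not printed at all because of "sed -n ... p".
  sed -n 's|.*Get \(https://[^ ]*\): x509.*|\1|p' | LC_ALL=C sort -u
}

# Example against a live cluster (assumes the same route lookup and token
# as in the report above):
#   curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" \
#     "https://${prometheus_route}/targets" | extract_x509_targets
```

An empty result means no target currently reports an x509 error, which is the state later confirmed in Comment 4.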

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.nightly-2019-02-26-125216   True        False         28h       Cluster version is 4.0.0-0.nightly-2019-02-26-125216


How reproducible:
Reproducible once the cluster has been running for more than a day

Steps to Reproduce:
1. See the description above

Actual results:
"x509: certificate signed by unknown authority" for 10250/metrics and 10250/metrics/cadvisor targets on worker nodes

Expected results:
The 10250/metrics and 10250/metrics/cadvisor targets on worker nodes should be scraped without this error

Additional info:

Comment 1 Frederic Branczyk 2019-02-28 10:38:10 UTC

We have discovered that this is a bug in Prometheus itself: it does not properly reload the certificates. This will need to be fixed upstream.

Comment 2 Frederic Branczyk 2019-02-28 10:38:50 UTC

The respective upstream issue is: https://github.com/prometheus/prometheus/issues/4155

Comment 4 Junqi Zhao 2019-03-07 01:43:22 UTC

Tested with 4.0.0-0.nightly-2019-03-04-234414; the "x509: certificate signed by unknown authority" error no longer appears for the 10250/metrics and 10250/metrics/cadvisor targets on worker nodes.

Comment 8 errata-xmlrpc 2019-06-04 10:44:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
