Bug 1683913

Summary: "x509: certificate signed by unknown authority" for 10250/metrics and 10250/metrics/cadvisor targets on worker nodes
Product: OpenShift Container Platform Reporter: Junqi Zhao <juzhao>
Component: Monitoring Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: medium Docs Contact:
Priority: high    
Version: 4.1.0 CC: mloibl, sponnaga, surbania
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:44:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Junqi Zhao 2019-02-28 05:46:48 UTC
Description of problem:
Cloned from https://jira.coreos.com/browse/MON-591

After the 4.0 cluster has been running for more than a day, the "x509: certificate signed by unknown authority" error appears for the 10250/metrics and 10250/metrics/cadvisor targets on all worker nodes.

In this case, we have two worker nodes, 172.31.128.135 and 172.31.153.7.

$ oc get node -o wide | awk '{print $1"    "$3"    "$4"    "$6}'
NAME    ROLES    AGE    INTERNAL-IP
ip-172-31-128-135.us-east-2.compute.internal    worker    28h    172.31.128.135
ip-172-31-137-246.us-east-2.compute.internal    master    28h    172.31.137.246
ip-172-31-146-164.us-east-2.compute.internal    master    28h    172.31.146.164
ip-172-31-153-7.us-east-2.compute.internal      worker    28h    172.31.153.7
ip-172-31-164-14.us-east-2.compute.internal     master    28h    172.31.164.14

 

$ oc get pod -n openshift-monitoring -o wide
NAME                                           READY     STATUS    RESTARTS   AGE       IP               NODE                                           NOMINATED NODE
alertmanager-main-0                            3/3       Running   0          27h       10.128.2.5       ip-172-31-128-135.us-east-2.compute.internal   <none>
alertmanager-main-1                            3/3       Running   0          27h       10.131.0.12      ip-172-31-153-7.us-east-2.compute.internal     <none>
alertmanager-main-2                            3/3       Running   0          27h       10.128.2.11      ip-172-31-128-135.us-east-2.compute.internal   <none>
cluster-monitoring-operator-549ff4d5dd-vl9lj   1/1       Running   0          28h       10.130.0.21      ip-172-31-137-246.us-east-2.compute.internal   <none>
grafana-754d4bf6bc-nhtk9                       2/2       Running   0          27h       10.128.2.4       ip-172-31-128-135.us-east-2.compute.internal   <none>
kube-state-metrics-5799dc74ff-rbhtt            3/3       Running   0          27h       10.131.0.11      ip-172-31-153-7.us-east-2.compute.internal     <none>
node-exporter-8dtg5                            2/2       Running   0          27h       172.31.146.164   ip-172-31-146-164.us-east-2.compute.internal   <none>
node-exporter-gzh8l                            2/2       Running   0          27h       172.31.137.246   ip-172-31-137-246.us-east-2.compute.internal   <none>
node-exporter-j4gzp                            2/2       Running   0          27h       172.31.164.14    ip-172-31-164-14.us-east-2.compute.internal    <none>
node-exporter-p5rbm                            2/2       Running   0          27h       172.31.128.135   ip-172-31-128-135.us-east-2.compute.internal   <none>
node-exporter-wbxhq                            2/2       Running   0          27h       172.31.153.7     ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-adapter-85555d8646-776q5            1/1       Running   0          7m6s      10.131.1.138     ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-adapter-85555d8646-rhf2c            1/1       Running   0          6m58s     10.128.3.105     ip-172-31-128-135.us-east-2.compute.internal   <none>
prometheus-k8s-0                               6/6       Running   1          27h       10.128.2.10      ip-172-31-128-135.us-east-2.compute.internal   <none>
prometheus-k8s-1                               6/6       Running   1          27h       10.131.0.14      ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-operator-64fc65bf9c-x8w5h           1/1       Running   0          28h       10.131.0.7       ip-172-31-153-7.us-east-2.compute.internal     <none>
telemeter-client-6cfd8d6879-bhj8x              3/3       Running   0          12h       10.128.2.226     ip-172-31-128-135.us-east-2.compute.internal   <none>

$ prometheus_route=$(oc -n openshift-monitoring get route | grep prometheus-k8s | awk '{print $2}'); curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://${prometheus_route}/targets | grep -i x509

              <span class="alert alert-danger state_indicator">Get https://172.31.128.135:10250/metrics: x509: certificate signed by unknown authority</span>
              <span class="alert alert-danger state_indicator">Get https://172.31.153.7:10250/metrics: x509: certificate signed by unknown authority</span>
              <span class="alert alert-danger state_indicator">Get https://172.31.128.135:10250/metrics/cadvisor: x509: certificate signed by unknown authority</span>
              <span class="alert alert-danger state_indicator">Get https://172.31.153.7:10250/metrics/cadvisor: x509: certificate signed by unknown authority</span>
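
The raw HTML above is hard to scan when many targets fail. As a minimal sketch, assuming the /targets page keeps the "Get <url>: x509: ..." wording shown in this report, the failing scrape URLs could be pulled out with a small filter. The function name failed_x509_targets is hypothetical and not part of oc or Prometheus.

```shell
# Hypothetical helper: read the HTML of the Prometheus /targets page on stdin
# and print each scrape URL that failed with an x509 error, once per URL.
# Assumes the error text has the form "Get <url>: x509: ..." as in this report.
failed_x509_targets() {
  sed -n 's|.*Get \(https://[^ ]*\): x509.*|\1|p' | sort -u
}
```

It would replace the final grep in the command above, for example: curl -k -H "Authorization: Bearer ..." https://${prometheus_route}/targets | failed_x509_targets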

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.nightly-2019-02-26-125216   True        False         28h       Cluster version is 4.0.0-0.nightly-2019-02-26-125216


How reproducible:
Reproduces once the cluster has been running for more than a day

Steps to Reproduce:
1. See the description above

Actual results:
"x509: certificate signed by unknown authority" for 10250/metrics and 10250/metrics/cadvisor targets on worker nodes

Expected results:
Should not see this error

Additional info:

Comment 1 Frederic Branczyk 2019-02-28 10:38:10 UTC
We have discovered that this is a bug in Prometheus itself: it does not properly reload the certificates. This will need to be fixed upstream.

Comment 2 Frederic Branczyk 2019-02-28 10:38:50 UTC
The respective upstream issue is: https://github.com/prometheus/prometheus/issues/4155
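
To illustrate the failure mode described in comment 1: a long-running process that reads a CA bundle once at startup keeps trusting the old CA after the file on disk is rotated. One generic way to notice the rotation is to compare the file's current checksum against the one recorded at load time. The sketch below is an assumption for illustration only; the helper name ca_changed_since is hypothetical, and this is not a statement of how Prometheus implements or fixed its reload logic.

```shell
# Hypothetical helper: succeed (exit 0) if the CA bundle at $1 no longer
# matches the checksum $2 that was recorded when it was last loaded.
ca_changed_since() {
  ca_file=$1
  loaded_sum=$2
  # cksum is POSIX; reading from stdin keeps the file name out of the output.
  current_sum=$(cksum < "$ca_file")
  [ "$current_sum" != "$loaded_sum" ]
}
```

A caller would record loaded_sum=$(cksum < "$ca_file") when it loads the bundle and reload its TLS configuration whenever ca_changed_since "$ca_file" "$loaded_sum" succeeds.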

Comment 4 Junqi Zhao 2019-03-07 01:43:22 UTC
Tested with 4.0.0-0.nightly-2019-03-04-234414; the "x509: certificate signed by unknown authority" error no longer appears for the 10250/metrics and 10250/metrics/cadvisor targets on worker nodes.

Comment 8 errata-xmlrpc 2019-06-04 10:44:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758