Description of problem:
Cloned from https://jira.coreos.com/browse/MON-591

After a 4.0 cluster has run for more than a day, the 10250/metrics and 10250/metrics/cadvisor targets on all worker nodes report "x509: certificate signed by unknown authority". In this case, we have two worker nodes, 172.31.128.135 and 172.31.153.7.

$ oc get node -o wide | awk '{print $1" "$3" "$4" "$6}'
NAME ROLES AGE INTERNAL-IP
ip-172-31-128-135.us-east-2.compute.internal worker 28h 172.31.128.135
ip-172-31-137-246.us-east-2.compute.internal master 28h 172.31.137.246
ip-172-31-146-164.us-east-2.compute.internal master 28h 172.31.146.164
ip-172-31-153-7.us-east-2.compute.internal worker 28h 172.31.153.7
ip-172-31-164-14.us-east-2.compute.internal master 28h 172.31.164.14

$ oc get pod -n openshift-monitoring -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP               NODE                                           NOMINATED NODE
alertmanager-main-0                            3/3     Running   0          27h     10.128.2.5       ip-172-31-128-135.us-east-2.compute.internal   <none>
alertmanager-main-1                            3/3     Running   0          27h     10.131.0.12      ip-172-31-153-7.us-east-2.compute.internal     <none>
alertmanager-main-2                            3/3     Running   0          27h     10.128.2.11      ip-172-31-128-135.us-east-2.compute.internal   <none>
cluster-monitoring-operator-549ff4d5dd-vl9lj   1/1     Running   0          28h     10.130.0.21      ip-172-31-137-246.us-east-2.compute.internal   <none>
grafana-754d4bf6bc-nhtk9                       2/2     Running   0          27h     10.128.2.4       ip-172-31-128-135.us-east-2.compute.internal   <none>
kube-state-metrics-5799dc74ff-rbhtt            3/3     Running   0          27h     10.131.0.11      ip-172-31-153-7.us-east-2.compute.internal     <none>
node-exporter-8dtg5                            2/2     Running   0          27h     172.31.146.164   ip-172-31-146-164.us-east-2.compute.internal   <none>
node-exporter-gzh8l                            2/2     Running   0          27h     172.31.137.246   ip-172-31-137-246.us-east-2.compute.internal   <none>
node-exporter-j4gzp                            2/2     Running   0          27h     172.31.164.14    ip-172-31-164-14.us-east-2.compute.internal    <none>
node-exporter-p5rbm                            2/2     Running   0          27h     172.31.128.135   ip-172-31-128-135.us-east-2.compute.internal   <none>
node-exporter-wbxhq                            2/2     Running   0          27h     172.31.153.7     ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-adapter-85555d8646-776q5            1/1     Running   0          7m6s    10.131.1.138     ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-adapter-85555d8646-rhf2c            1/1     Running   0          6m58s   10.128.3.105     ip-172-31-128-135.us-east-2.compute.internal   <none>
prometheus-k8s-0                               6/6     Running   1          27h     10.128.2.10      ip-172-31-128-135.us-east-2.compute.internal   <none>
prometheus-k8s-1                               6/6     Running   1          27h     10.131.0.14      ip-172-31-153-7.us-east-2.compute.internal     <none>
prometheus-operator-64fc65bf9c-x8w5h           1/1     Running   0          28h     10.131.0.7       ip-172-31-153-7.us-east-2.compute.internal     <none>
telemeter-client-6cfd8d6879-bhj8x              3/3     Running   0          12h     10.128.2.226     ip-172-31-128-135.us-east-2.compute.internal   <none>

$ prometheus_route=$(oc -n openshift-monitoring get route | grep prometheus-k8s | awk '{print $2}');curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://${prometheus_route}/targets | grep -i down
$ prometheus_route=$(oc -n openshift-monitoring get route | grep prometheus-k8s | awk '{print $2}');curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://${prometheus_route}/targets | grep -i x509
<span class="alert alert-danger state_indicator">Get https://172.31.128.135:10250/metrics: x509: certificate signed by unknown authority</span>
<span class="alert alert-danger state_indicator">Get https://172.31.153.7:10250/metrics: x509: certificate signed by unknown authority</span>
<span class="alert alert-danger state_indicator">Get https://172.31.128.135:10250/metrics/cadvisor: x509: certificate signed by unknown authority</span>
<span class="alert alert-danger state_indicator">Get https://172.31.153.7:10250/metrics/cadvisor: x509: certificate signed by unknown authority</span>

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-26-125216   True        False         28h     Cluster version is 4.0.0-0.nightly-2019-02-26-125216

How reproducible:
Occurs after the cluster has run for more than a day.

Steps to Reproduce:
1. See the description above.

Actual results:
"x509: certificate signed by unknown authority" for the 10250/metrics and 10250/metrics/cadvisor targets on worker nodes.

Expected results:
The error should not appear.
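
The curl and grep check in the description can be condensed into a small filter that prints only the failing target URLs. This is a minimal sketch, not part of oc or Prometheus: the function name x509_failed_targets is hypothetical, and it assumes the /targets page reports errors in the "Get <url>: x509: ..." form shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads the HTML of the Prometheus /targets page on
# stdin and prints each distinct target URL whose scrape failed with an
# x509 error, one URL per line.
# Assumption: errors appear as "Get <url>: x509: ..." as in this report.
x509_failed_targets() {
    grep -o 'Get https://[^ ]*: x509' |      # keep only the matching fragment
        sed -e 's/^Get //' -e 's/: x509$//' | # strip everything but the URL
        LC_ALL=C sort -u                      # de-duplicate, stable ordering
}

# Example use against a live cluster, reusing the commands from the report:
#   curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" \
#       "https://${prometheus_route}/targets" | x509_failed_targets
```

On a cluster that is not affected, the filter prints nothing, which makes it easy to use as a pass/fail check when verifying a fix.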
We have discovered that this is a bug in Prometheus itself: it does not properly reload the certificates. This will need to be fixed upstream.
The respective upstream issue is: https://github.com/prometheus/prometheus/issues/4155
Tested with 4.0.0-0.nightly-2019-03-04-234414; the "x509: certificate signed by unknown authority" error no longer appears for the 10250/metrics and 10250/metrics/cadvisor targets on worker nodes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758