1819565 – user-workload-monitoring prometheus-operator endpoint is down due to x509 issue

Summary: user-workload-monitoring prometheus-operator endpoint is down due to x509 issue

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Pawel Krupa
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-01 04:32 UTC by Junqi Zhao
Modified:	2020-07-13 17:25 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:24:43 UTC
Target Upstream Version:
Embargoed:

Attachments
user-workload-monitoring prometheus-operator endpoint is down (65.38 KB, image/png) 2020-04-01 04:34 UTC, Junqi Zhao

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 730	0	None	closed	Bug 1819565: pkg/manifests: set correct server name for UWM prom-op service monitor	2020-07-21 13:39:57 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:25:14 UTC

Description Junqi Zhao 2020-04-01 04:32:36 UTC

Description of problem:
After enabling techPreviewUserWorkload, the user-workload-monitoring prometheus-operator endpoint is down due to an x509 error.
# oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    techPreviewUserWorkload:
      enabled: true
kind: ConfigMap
metadata:
  creationTimestamp: "2020-04-01T04:14:22Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
...

user-workload-monitoring prometheus-operator endpoint is down due to x509 error:
Get https://10.130.0.51:8443/metrics: x509: certificate is valid for prometheus-operator.openshift-user-workload-monitoring.svc, prometheus-operator.openshift-user-workload-monitoring.svc.cluster.local, not prometheus-operator-user-workload.openshift-monitoring.svc
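
The mismatch in the error above can be reproduced offline with a minimal sketch. It assumes a simplified hostname check (case-insensitive exact match against the DNS names on the certificate, no wildcard handling); the helper name `matches_san` is hypothetical and is not part of Prometheus or the operator. The names are copied from the error message.

```python
from typing import Sequence


def matches_san(server_name: str, dns_sans: Sequence[str]) -> bool:
    """Simplified check: case-insensitive exact match, no wildcard support."""
    wanted = server_name.rstrip(".").lower()
    return any(wanted == san.rstrip(".").lower() for san in dns_sans)


# DNS names the serving certificate is valid for (from the x509 error).
sans = [
    "prometheus-operator.openshift-user-workload-monitoring.svc",
    "prometheus-operator.openshift-user-workload-monitoring.svc.cluster.local",
]

# Server name the scrape request asked for (from the x509 error).
configured = "prometheus-operator-user-workload.openshift-monitoring.svc"

print(matches_san(configured, sans))  # False: the scrape fails verification
print(matches_san(sans[0], sans))     # True: a name the certificate covers
```

Under this assumption, any server name outside the two listed names fails, which is consistent with the target being reported as down.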

See the attached screenshot.


# oc -n openshift-user-workload-monitoring get pod -o wide | grep prometheus-operator
prometheus-operator-8687bb4d7c-qpz2q   2/2     Running   0          3h19m   10.130.0.51   ip-10-0-173-92.us-east-2.compute.internal    <none>           <none>

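However, querying the endpoint directly from inside the pod works. Note that curl -k skips certificate verification, so this does not exercise the server name check: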
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-user-workload-monitoring exec -c prometheus-operator prometheus-operator-8687bb4d7c-qpz2q -- curl -k -H "Authorization: Bearer $token" https://10.130.0.51:8443/metrics | head -n 5
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.2129e-05
go_gc_duration_seconds{quantile="0.25"} 2.1588e-05
go_gc_duration_seconds{quantile="0.5"} 4.0001e-05

In the generated Prometheus configuration, server_name is prometheus-operator-user-workload.openshift-monitoring.svc, which is not one of the names the certificate is valid for:
- job_name: openshift-user-workload-monitoring/prometheus-operator/0
  honor_labels: true
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - openshift-user-workload-monitoring
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
    server_name: prometheus-operator-user-workload.openshift-monitoring.svc
    insecure_skip_verify: false
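
A rough way to flag this kind of misconfiguration is to compare server_name with the namespaces the job scrapes. The sketch below assumes the serving certificate follows the `<service>.<namespace>.svc` pattern seen in the x509 error; the helper names are hypothetical and this is not the code from the linked pull request.

```python
from typing import Sequence


def expected_server_name(service: str, namespace: str) -> str:
    # Assumed pattern, inferred from the names listed in the x509 error.
    return "{}.{}.svc".format(service, namespace)


def server_name_ok(service: str, namespaces: Sequence[str], server_name: str) -> bool:
    """True if server_name matches the service in one of the scraped namespaces."""
    return server_name in [expected_server_name(service, ns) for ns in namespaces]


# Values taken from the scrape job shown above.
namespaces = ["openshift-user-workload-monitoring"]
configured = "prometheus-operator-user-workload.openshift-monitoring.svc"

print(server_name_ok("prometheus-operator", namespaces, configured))  # False
print(expected_server_name("prometheus-operator", namespaces[0]))
# prometheus-operator.openshift-user-workload-monitoring.svc
```

The second value is the first name listed in the certificate, so a server_name of that form would be expected to pass verification.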

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-03-31-203533

How reproducible:
Always

Steps to Reproduce:
1. See the description

Actual results:
user-workload-monitoring prometheus-operator endpoint is down

Expected results:
user-workload-monitoring prometheus-operator endpoint should be up

Additional info:

Comment 1 Junqi Zhao 2020-04-01 04:34:15 UTC

Created attachment 1675292 [details]
user-workload-monitoring prometheus-operator endpoint is down

Comment 4 Junqi Zhao 2020-04-02 02:48:40 UTC

Tested with 4.5.0-0.nightly-2020-04-01-232323; the user-workload-monitoring prometheus-operator endpoint is up.
- job_name: openshift-monitoring/prometheus-operator/0
  honor_labels: true
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - openshift-monitoring
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
    server_name: prometheus-operator.openshift-monitoring.svc
    insecure_skip_verify: false

Comment 6 errata-xmlrpc 2020-07-13 17:24:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
