Bug 1753352 - Telemeter-client unable to verify internal Prometheus certificate
Summary: Telemeter-client unable to verify internal Prometheus certificate
Keywords:
Status: CLOSED DUPLICATE of bug 1746711
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-09-18 16:38 UTC by Chance Zibolski
Modified: 2019-09-23 10:40 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-23 10:40:12 UTC
Target Upstream Version:



Description Chance Zibolski 2019-09-18 16:38:31 UTC
Description of problem: Starting September 13th, my 4.1.16 cluster's telemeter-client stopped reporting telemeter metrics because it could not verify the certificate of the Prometheus service in my cluster.

My 4.1.15 upgrade completed on the 13th, and I started and completed the 4.1.16 upgrade on the 13th as well.

The following logs are in telemeter-client:

2019/09/15 10:35:38 error: unable to forward results: Get https://prometheus-k8s.openshift-monitoring.svc:9091/federate?match%5B%5D=%7B__name__%3D%22up%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_available_updates%22%7D&match%5B%5D=%7B__name__%3D%22cluster_operator_up%22%7D&match%5B%5D=%7B__name__%3D%22cluster_operator_conditions%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_payload%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_payload_errors%22%7D&match%5B%5D=%7B__name__%3D%22cluster_installer%22%7D&match%5B%5D=%7B__name__%3D%22cluster_infrastructure_provider%22%7D&match%5B%5D=%7B__name__%3D%22cluster_feature_set%22%7D&match%5B%5D=%7B__name__%3D%22instance%3Aetcd_object_counts%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22ALERTS%22%2Calertstate%3D%22firing%22%7D&match%5B%5D=%7B__name__%3D%22code%3Aapiserver_request_count%3Arate%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22kube_pod_status_ready%3Aetcd%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22kube_pod_status_ready%3Aimage_registry%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acapacity_cpu_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acapacity_memory_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22openshift%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22openshift%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22workload%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22workload%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Anode_instance_type_count%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22subscription_sync_total%22%7D: x509: certificate signed by unknown authority
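The percent-encoded query in the log line above is hard to read. As a minimal sketch (not part of telemeter-client, which is written in Go), the following Python decodes the match[] selectors that the client requests from the /federate endpoint. FEDERATE_URL is a shortened excerpt of the logged URL with only four of its parameters, and the names FEDERATE_URL, federate_selectors and SELECTORS are hypothetical.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit

# Shortened excerpt of the URL from the log line; only four of the
# match[] parameters are kept here for brevity.
FEDERATE_URL = (
    "https://prometheus-k8s.openshift-monitoring.svc:9091/federate"
    "?match%5B%5D=%7B__name__%3D%22up%22%7D"
    "&match%5B%5D=%7B__name__%3D%22cluster_version%22%7D"
    "&match%5B%5D=%7B__name__%3D%22cluster_version_available_updates%22%7D"
    "&match%5B%5D=%7B__name__%3D%22ALERTS%22%2Calertstate%3D%22firing%22%7D"
)


def federate_selectors(url: str) -> List[str]:
    """Return the decoded match[] selectors from a /federate URL."""
    query = urlsplit(url).query
    # parse_qs percent-decodes keys and values, so the key becomes "match[]".
    return parse_qs(query).get("match[]", [])


SELECTORS = federate_selectors(FEDERATE_URL)

if __name__ == "__main__":
    for selector in SELECTORS:
        print(selector)
```

Decoded this way, the request is an ordinary federation scrape of a fixed set of series selectors. The failure in the log comes from the TLS handshake (x509: certificate signed by unknown authority), not from the query itself.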


My recent upgrade history:

    history:
    - completionTime: "2019-09-14T01:25:04Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-14T00:03:21Z"
      state: Completed
      verified: true
      version: 4.1.16
    - completionTime: "2019-09-14T00:03:21Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-11T19:58:11Z"
      state: Completed
      verified: true
      version: 4.1.15
    - completionTime: "2019-09-11T19:58:11Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
      startedTime: "2019-09-05T08:41:49Z"
      state: Completed
      verified: true
      version: 4.1.14
    - completionTime: "2019-09-05T08:41:49Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
      startedTime: "2019-08-27T18:36:55Z"
      state: Completed
      verified: true
      version: 4.1.13


ClusterID: af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Version-Release number of selected component (if applicable): 4.1.16


How reproducible: It's unclear.


Steps to Reproduce:
1. Upgrade cluster
2. ???
3. Telemeter-client fails to verify the Prometheus certificate

Actual results: Telemeter-client fails to verify the Prometheus certificate


Expected results: Telemeter-client handles certificate rotation and can verify the Prometheus service certificate
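To illustrate what "handles certificate rotation" could mean for a client in general, here is a minimal Python sketch that rebuilds its TLS trust store whenever the CA bundle file on disk changes. This is an assumption about one possible approach, not telemeter-client's actual implementation and not the fix tracked in bug 1746711. The names ReloadingTrustStore and load_ssl_context are hypothetical.

```python
import hashlib
import ssl
from typing import Any, Callable, Optional


def load_ssl_context(ca_path: str) -> ssl.SSLContext:
    """Default loader: a TLS client context that trusts only the CAs in ca_path."""
    return ssl.create_default_context(cafile=ca_path)


class ReloadingTrustStore:
    """Caches an object built from a CA bundle and rebuilds it when the file changes."""

    def __init__(self, ca_path: str, loader: Callable[[str], Any] = load_ssl_context):
        self._ca_path = ca_path
        self._loader = loader
        self._digest: Optional[str] = None
        self._cached: Any = None

    def get(self) -> Any:
        # Hash the bundle on every call; bundles are small, and a content
        # hash avoids relying on file modification-time resolution.
        with open(self._ca_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != self._digest:
            # The file is new or has changed: rebuild and remember the digest.
            self._cached = self._loader(self._ca_path)
            self._digest = digest
        return self._cached
```

A caller would construct the store once with the path of its mounted CA bundle and call get() before each scrape, for example passing the result as the context argument of urllib.request.urlopen. A client that instead reads the bundle only once at startup keeps trusting the old CA after a rotation and fails verification until it is restarted.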


Additional info: ClusterID af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Comment 1 Ben Parees 2019-09-18 18:49:01 UTC
My understanding is that Sergiusz's team owns the telemeter client, so I am assigning it over there. That said, I do not think we would block 4.2 on this, since it doesn't render the cluster unusable. Losing telemeter data would be very bad if cert rotation indeed isn't being handled and we eventually lose data for all clusters, but we do have time to put out a fix in a z-stream. (It's also possible this is already fixed in 4.2.)

I will let Sergiusz make the call, in his morning, on whether we can safely defer this, but my vote would be to target it to 4.3 and then consider z-stream backports.

