Bug 1753352

Summary: Telemeter-client unable to verify internal Prometheus certificate
Product: OpenShift Container Platform
Reporter: Chance Zibolski <chancez>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.1.z
CC: alegrand, anpicker, bparees, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-09-23 10:40:12 UTC
Type: Bug

Description Chance Zibolski 2019-09-18 16:38:31 UTC
Description of problem: Starting September 13th, the telemeter-client on my 4.1.16 cluster stopped reporting telemeter metrics because it could not verify the certificate of the Prometheus service in my cluster.

My 4.1.15 upgrade completed on the 13th, and I also started and completed the 4.1.16 upgrade on the 13th.

The following logs are in telemeter-client:

2019/09/15 10:35:38 error: unable to forward results: Get https://prometheus-k8s.openshift-monitoring.svc:9091/federate?match%5B%5D=%7B__name__%3D%22up%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_available_updates%22%7D&match%5B%5D=%7B__name__%3D%22cluster_operator_up%22%7D&match%5B%5D=%7B__name__%3D%22cluster_operator_conditions%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_payload%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_payload_errors%22%7D&match%5B%5D=%7B__name__%3D%22cluster_installer%22%7D&match%5B%5D=%7B__name__%3D%22cluster_infrastructure_provider%22%7D&match%5B%5D=%7B__name__%3D%22cluster_feature_set%22%7D&match%5B%5D=%7B__name__%3D%22instance%3Aetcd_object_counts%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22ALERTS%22%2Calertstate%3D%22firing%22%7D&match%5B%5D=%7B__name__%3D%22code%3Aapiserver_request_count%3Arate%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22kube_pod_status_ready%3Aetcd%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22kube_pod_status_ready%3Aimage_registry%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acapacity_cpu_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acapacity_memory_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22openshift%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22openshift%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22workload%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22workload%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Anode_instance_type_count%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22subscription_sync_total%22%7D: x509: certificate signed by unknown authority
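
The percent-encoded `match[]` parameters in the log line above are hard to read. The following is a minimal sketch, using only the Python standard library, that decodes them into the Prometheus selectors being federated. The URL is shortened to the first two matchers from the log, and `federate_matchers` is an illustrative helper name, not part of telemeter-client.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def federate_matchers(url: str) -> List[str]:
    """Return the decoded match[] selectors from a /federate URL."""
    query = urlsplit(url).query
    # parse_qs percent-decodes both keys and values, so the key
    # "match%5B%5D" becomes "match[]".
    return parse_qs(query).get("match[]", [])


# Shortened copy of the URL from the log line (first two matchers only).
url = (
    "https://prometheus-k8s.openshift-monitoring.svc:9091/federate"
    "?match%5B%5D=%7B__name__%3D%22up%22%7D"
    "&match%5B%5D=%7B__name__%3D%22cluster_version%22%7D"
)

for selector in federate_matchers(url):
    print(selector)
```

Passing the full URL from the log should list every metric selector the client asked Prometheus for before the TLS verification failed.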


My recent upgrade history:

    history:
    - completionTime: "2019-09-14T01:25:04Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
      startedTime: "2019-09-14T00:03:21Z"
      state: Completed
      verified: true
      version: 4.1.16
    - completionTime: "2019-09-14T00:03:21Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
      startedTime: "2019-09-11T19:58:11Z"
      state: Completed
      verified: true
      version: 4.1.15
    - completionTime: "2019-09-11T19:58:11Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
      startedTime: "2019-09-05T08:41:49Z"
      state: Completed
      verified: true
      version: 4.1.14
    - completionTime: "2019-09-05T08:41:49Z"
      image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
      startedTime: "2019-08-27T18:36:55Z"
      state: Completed
      verified: true
      version: 4.1.13
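
The history timestamps are in UTC, which is why the 4.1.16 upgrade is stamped 2019-09-14 while the description places it on the 13th. A small sketch of the arithmetic follows; the UTC-7 offset is purely an assumption for illustration, since the report does not state the reporter's time zone.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(stamp: str) -> datetime:
    """Parse a history timestamp such as 2019-09-14T00:03:21Z as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


# Values copied from the 4.1.16 history entry.
started = parse_utc("2019-09-14T00:03:21Z")
completed = parse_utc("2019-09-14T01:25:04Z")

# How long the 4.1.16 upgrade took.
duration = completed - started

# Assumption: an illustrative local offset of UTC-7, not taken from the report.
assumed_local = timezone(timedelta(hours=-7))
local_start_date = started.astimezone(assumed_local).date()
local_end_date = completed.astimezone(assumed_local).date()

print(duration, local_start_date, local_end_date)
```

Under that assumed offset, both the start and the completion of the 4.1.16 upgrade fall on September 13th local time, which would be consistent with the description.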


ClusterID: af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Version-Release number of selected component (if applicable): 4.1.16


How reproducible: It's unclear.


Steps to Reproduce:
1. Upgrade cluster
2. ???
3. Telemeter-client fails to verify the Prometheus certificate

Actual results: Telemeter-client fails to verify the Prometheus certificate


Expected results: Telemeter-client handles certificate rotation and can verify the Prometheus service certificate
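
To illustrate what "handles certificate rotation" could mean in general, here is a hedged sketch of a client-side helper that rebuilds its TLS context whenever the CA bundle file changes on disk. It is not telemeter-client's implementation, and this report does not establish the root cause. `ReloadingCAContext`, its `build` parameter, and the assumption that the CA bundle is available as a file are all hypothetical.

```python
import os
import ssl
from typing import Any, Callable, Optional, Tuple


class ReloadingCAContext:
    """Rebuild a TLS context when the CA bundle file changes (hypothetical helper)."""

    def __init__(self, ca_path: str,
                 build: Optional[Callable[[str], Any]] = None) -> None:
        self.ca_path = ca_path
        # Default builder: trust only the CAs found in the given file.
        self._build = build or (lambda p: ssl.create_default_context(cafile=p))
        self._stamp: Optional[Tuple[int, int]] = None
        self._ctx: Any = None

    def get(self) -> Any:
        """Return a context built from the current contents of the CA file."""
        st = os.stat(self.ca_path)
        # Modification time plus size is a cheap change detector.
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:
            self._ctx = self._build(self.ca_path)
            self._stamp = stamp
        return self._ctx
```

A caller would invoke `get()` before each request and pass the result to its HTTPS client, so that a rotated CA is picked up without restarting the process.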


Additional info: ClusterID af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Comment 1 Ben Parees 2019-09-18 18:49:01 UTC
My understanding is that Sergiusz's team owns the telemeter client, so I am assigning this over there. That said, I do not think we would block 4.2 on this, since it doesn't render the cluster unusable. Losing telemeter data would be very bad if cert rotation indeed isn't being handled and we eventually lose data for all clusters, but we do have time to put out a fix in a z-stream. (It's also possible this is already fixed in 4.2.)

I will let Sergiusz make the call in the morning on whether we can safely defer this, but my vote would be to target it to 4.3 and then consider z-stream backports.