Description of problem:

Starting September 13th, my 4.1.16 cluster's telemeter-client stopped reporting telemeter metrics due to it being unable to verify the certificate of the prometheus service in my cluster. My 4.1.15 upgrade completed on the 13th, and I started & completed the 4.1.16 upgrade on the 13th also.

The following logs are in telemeter-client:

2019/09/15 10:35:38 error: unable to forward results: Get https://prometheus-k8s.openshift-monitoring.svc:9091/federate?match%5B%5D=%7B__name__%3D%22up%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_available_updates%22%7D&match%5B%5D=%7B__name__%3D%22cluster_operator_up%22%7D&match%5B%5D=%7B__name__%3D%22cluster_operator_conditions%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_payload%22%7D&match%5B%5D=%7B__name__%3D%22cluster_version_payload_errors%22%7D&match%5B%5D=%7B__name__%3D%22cluster_installer%22%7D&match%5B%5D=%7B__name__%3D%22cluster_infrastructure_provider%22%7D&match%5B%5D=%7B__name__%3D%22cluster_feature_set%22%7D&match%5B%5D=%7B__name__%3D%22instance%3Aetcd_object_counts%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22ALERTS%22%2Calertstate%3D%22firing%22%7D&match%5B%5D=%7B__name__%3D%22code%3Aapiserver_request_count%3Arate%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22kube_pod_status_ready%3Aetcd%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22kube_pod_status_ready%3Aimage_registry%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acapacity_cpu_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acapacity_memory_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22openshift%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22openshift%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22workload%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22workload%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22cluster%3Anode_instance_type_count%3Asum%22%7D&match%5B%5D=%7B__name__%3D%22subscription_sync_total%22%7D: x509: certificate signed by unknown authority

My recent upgrade history:

  history:
  - completionTime: "2019-09-14T01:25:04Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:61ed953962d43cae388cb3c544b4cac358d4675076c2fc0befb236209d5116f7
    startedTime: "2019-09-14T00:03:21Z"
    state: Completed
    verified: true
    version: 4.1.16
  - completionTime: "2019-09-14T00:03:21Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
    startedTime: "2019-09-11T19:58:11Z"
    state: Completed
    verified: true
    version: 4.1.15
  - completionTime: "2019-09-11T19:58:11Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:fd41c9bda9e0ff306954f1fd7af6428edff8c3989b75f9fe984968db66846231
    startedTime: "2019-09-05T08:41:49Z"
    state: Completed
    verified: true
    version: 4.1.14
  - completionTime: "2019-09-05T08:41:49Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:212296a41e04176c308bfe169e7c6e05d77b76f403361664c3ce55cd30682a94
    startedTime: "2019-08-27T18:36:55Z"
    state: Completed
    verified: true
    version: 4.1.13

ClusterID: af8bc55b-9ae3-4735-bf65-b6ef43aeced9

Version-Release number of selected component (if applicable):
4.1.16

How reproducible:
It's unclear.

Steps to Reproduce:
1. Upgrade cluster
2. ???
3. Telemeter-client fails to verify prometheus certificate

Actual results:
Telemeter-client fails to verify prometheus certificate

Expected results:
Telemeter-client handles certificate rotation and can verify prometheus service certificate

Additional info:
ClusterID af8bc55b-9ae3-4735-bf65-b6ef43aeced9
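For anyone reading the telemeter-client error line above: the long /federate URL is a percent-encoded list of match[] selectors, and the actual failure is the trailing x509 message. The following is a minimal sketch, assuming Python 3 and only the standard library, of decoding such a URL to see which series the client was trying to federate. The helper name decode_matchers is hypothetical, and the sample URL is a shortened subset of the one in the log, not the full request.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def decode_matchers(url: str) -> List[str]:
    """Return the decoded match[] selectors from a /federate URL."""
    query = urlsplit(url).query
    # parse_qs percent-decodes keys and values, so "match%5B%5D" becomes "match[]".
    return parse_qs(query).get("match[]", [])


# Shortened sample: three of the selectors from the log line, for illustration only.
sample = (
    "https://prometheus-k8s.openshift-monitoring.svc:9091/federate"
    "?match%5B%5D=%7B__name__%3D%22up%22%7D"
    "&match%5B%5D=%7B__name__%3D%22cluster_version%22%7D"
    "&match%5B%5D=%7B__name__%3D%22ALERTS%22%2Calertstate%3D%22firing%22%7D"
)

for selector in decode_matchers(sample):
    print(selector)
```

Pasting the full URL from the log in place of the sample should list every selector in the request; this only helps read the log and does not touch the certificate problem itself.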
My understanding is that Sergiusz's team owns the telemeter client, so I'm assigning this over there. That said, I do not think we would block 4.2 on this, since it doesn't render the cluster unusable. Losing telemeter data would be very bad if cert rotation indeed isn't being handled and we're eventually going to lose data for all clusters, but we do have time to put out a fix in a z-stream. (It's also possible this is already fixed in 4.2.) I will let Sergiusz make the call this morning on whether we can safely defer this, but my vote would be to target it to 4.3 and then consider z-stream backports.