Description of problem:
The metrics endpoint for master kubelets may return HTTP 500, causing a false-positive Prometheus alert, after the etcd quorum restore procedure has been performed. Also reproducible on CI in MCO/installer's e2e-etcd-quorum-loss test.

Version-Release number of selected component (if applicable):
master/4.2/4.1

How reproducible:
90%

Steps to Reproduce:
1. Follow the steps in https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit#heading=h.qej2sc5mtfd2
2. Check Prometheus targets
3.

Actual results:
Two masters are displayed as 'down' in Prometheus, as `/metrics` returns HTTP 500:

Jun 03 22:17:39 ip-10-0-154-5 hyperkube[39381]: I0603 22:17:39.329549 39381 server.go:818] GET /metrics: (209.94539ms) 500
...
Jun 03 22:17:39 ip-10-0-154-5 hyperkube[39381]: logging error output: "An error has occurred during metrics collection:\n\n3 error(s) occurred:\n* collected metric kubelet_container_log_filesystem_used_bytes label:<name:\"container\" value:\"tuned\" > label:<name:\"namespace\" value:\"openshift-cluster-node-tuning-operator\" > label:<name:\"pod\" value:\"tuned-226ck\" > gauge:<value:0 > was collected before with the same name and label values\n* collected metric kubelet_container_log_filesystem_used_bytes label:<name:\"container\" value:\"openvswitch\" > label:<name:\"namespace\" value:\"openshift-sdn\" > label:<name:\"pod\" value:\"ovs-d642k\" > gauge:<value:53248 > was collected before with the same name and label values\n* collected metric kubelet_container_log_filesystem_used_bytes label:<name:\"container\" value:\"machine-config-daemon\" > label:<name:\"namespace\" value:\"openshift-machine-config-operator\" > label:<name:\"pod\" value:\"machine-config-daemon-248m5\" > gauge:<value:0 > was collected before with the same name and label values\n"

Expected results:
`/metrics` returns 200

Additional info:
See master kubelet logs - https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/807/pull-ci-openshift-machine-config-operator-master-e2e-etcd-quorum-loss/37/artifacts/e2e-etcd-quorum-loss/nodes/masters-journal
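To illustrate the failure mode (a minimal sketch, not the actual kubelet code; the dupeCollector type below is a made-up stand-in): the prometheus/client_golang registry rejects any gather in which the same metric name and label values are reported twice, and the default promhttp handler turns that into a 500 for the entire scrape, which is why Prometheus marks the target as down.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// dupeCollector is a made-up collector that reports the same series twice,
// mimicking a kubelet stats walk that visits one container two times.
type dupeCollector struct {
	desc *prometheus.Desc
}

func (c dupeCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c dupeCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ { // same name and same label values, twice
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, 0,
			"tuned", "openshift-cluster-node-tuning-operator", "tuned-226ck",
		)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(dupeCollector{desc: prometheus.NewDesc(
		"kubelet_container_log_filesystem_used_bytes",
		"Bytes used by the container's logs on the filesystem.",
		[]string{"container", "namespace", "pod"}, nil,
	)})
	// With default HandlerOpts, a duplicate detected during gathering fails
	// the whole scrape with HTTP 500 rather than dropping one sample.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Scraping :8080/metrics from this program reproduces the "was collected before with the same name and label values" error verbatim.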
It seems there is some duplication of containers when building up the kubelet_container_log_filesystem_used_bytes metric; the collector should probably ensure uniqueness at some level. https://github.com/kubernetes/kubernetes/pull/70749
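A minimal sketch of the kind of uniqueness guarantee meant here, keyed on the same (namespace, pod, container) tuple the metric labels use; containerLogStat and dedupe are hypothetical names for illustration, not the actual kubelet stats code:

```go
package main

import "fmt"

// containerLogStat is a hypothetical stand-in for the per-container log
// stats the kubelet walks when building up the metric.
type containerLogStat struct {
	namespace, pod, container string
	usedBytes                 uint64
}

// dedupe keeps the first entry seen for each (namespace, pod, container)
// tuple so the same series is never emitted twice.
func dedupe(stats []containerLogStat) []containerLogStat {
	seen := make(map[string]bool)
	out := make([]containerLogStat, 0, len(stats))
	for _, s := range stats {
		key := s.namespace + "/" + s.pod + "/" + s.container
		if seen[key] {
			continue // duplicate container entry: drop it
		}
		seen[key] = true
		out = append(out, s)
	}
	return out
}

func main() {
	stats := []containerLogStat{
		{"openshift-sdn", "ovs-d642k", "openvswitch", 53248},
		{"openshift-sdn", "ovs-d642k", "openvswitch", 53248}, // duplicate
	}
	fmt.Println(len(dedupe(stats))) // prints 1
}
```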
Ryan identified https://github.com/kubernetes/kubernetes/pull/77426
This is fixed in OCP 4.2. The Prometheus Go client (client_golang) was updated in Kubernetes 1.14.0 [1], which pulls in a race fix [2]. The required client_golang version is 0.9.2 or later [3].

[1] https://github.com/kubernetes/kubernetes/pull/74248
[2] https://github.com/prometheus/client_golang/pull/513
[3] https://github.com/prometheus/client_golang/releases/tag/v0.9.2
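For anyone verifying a rebase or backport, the requirement boils down to pinning the dependency at or above the fixed release. A go.mod line for illustration only (Kubernetes of that era vendored dependencies through its own tooling, so treat this purely as a sketch of the constraint):

```
require github.com/prometheus/client_golang v0.9.2 // v0.9.2+ carries the fix from [2]
```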
*** This bug has been marked as a duplicate of bug 1712645 ***