Bug 1716914

Summary:	[4.1.z] Kubelet metrics throw HTTP 500 after quorum restore procedure
Product:	OpenShift Container Platform	Reporter:	Vadim Rutkovsky <vrutkovs>
Component:	Node	Assignee:	Seth Jennings <sjenning>
Status:	CLOSED DUPLICATE	QA Contact:	Jianwei Hou <jhou>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.1.z	CC:	aos-bugs, gblomqui, jhou, jokerman, mmccomas, sjenning
Target Milestone:	---
Target Release:	4.1.z
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1716913	Environment:
Last Closed:	2019-06-04 17:35:59 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1712645, 1716913
Bug Blocks:

Description Vadim Rutkovsky 2019-06-04 12:02:16 UTC

+++ This bug was initially created as a clone of Bug #1716913 +++

Description of problem:

The `/metrics` endpoint of master kubelets may return HTTP 500 after the etcd quorum restore procedure has been performed, which causes a false-positive Prometheus alert.

This is also reproducible in CI, in the e2e-etcd-quorum-loss test for MCO/installer.


Version-Release number of selected component (if applicable):
master/4.2/4.1

How reproducible:
90%

Steps to Reproduce:
1. Follow the steps in https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit#heading=h.qej2sc5mtfd2
2. Check Prometheus targets
3.
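
For step 2, a minimal sketch of filtering the Prometheus targets listing for unhealthy targets. It assumes the JSON shape returned by the Prometheus HTTP API targets endpoint (`/api/v1/targets`) and works on an already-decoded response; fetching and authentication are left out. The helper name `down_targets` and the sample payload (addresses, port, error string) are illustrative, not taken from the affected cluster.

```python
import json


def down_targets(targets_response):
    """Return (scrape_url, last_error) pairs for active targets whose health is 'down'."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") == "down"
    ]


# Hypothetical response body, trimmed to the fields used above.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"scrapeUrl": "https://10.0.154.5:10250/metrics", "health": "down",
       "lastError": "server returned HTTP status 500 Internal Server Error"},
      {"scrapeUrl": "https://10.0.128.7:10250/metrics", "health": "up", "lastError": ""}
    ]
  }
}
""")

if __name__ == "__main__":
    for url, error in down_targets(SAMPLE_RESPONSE):
        print(url, error)
```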

Actual results:
Two masters are displayed as 'down' in Prometheus, as `/metrics` returns HTTP 500:

Jun 03 22:17:39 ip-10-0-154-5 hyperkube[39381]: I0603 22:17:39.329549   39381 server.go:818] GET /metrics: (209.94539ms) 500
...
Jun 03 22:17:39 ip-10-0-154-5 hyperkube[39381]: logging error output: "An error has occurred during metrics collection:\n\n3 error(s) occurred:\n* collected metric kubelet_container_log_filesystem_used_bytes label:<name:\"container\" value:\"tuned\" > label:<name:\"namespace\" value:\"openshift-cluster-node-tuning-operator\" > label:<name:\"pod\" value:\"tuned-226ck\" > gauge:<value:0 >  was collected before with the same name and label values\n* collected metric kubelet_container_log_filesystem_used_bytes label:<name:\"container\" value:\"openvswitch\" > label:<name:\"namespace\" value:\"openshift-sdn\" > label:<name:\"pod\" value:\"ovs-d642k\" > gauge:<value:53248 >  was collected before with the same name and label values\n* collected metric kubelet_container_log_filesystem_used_bytes label:<name:\"container\" value:\"machine-config-daemon\" > label:<name:\"namespace\" value:\"openshift-machine-config-operator\" > label:<name:\"pod\" value:\"machine-config-daemon-248m5\" > gauge:<value:0 >  was collected before with the same name and label values\n"
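
For triage, the duplicated series can be pulled out of the error text above with a small script. This is a rough sketch that assumes the message format shown in the journal line; the function name `parse_duplicates` and the regular expression are illustrative and not part of the kubelet or any Prometheus library.

```python
import re

# Matches one label pair, with or without the backslash-escaped quotes
# that appear in the journal output.
LABEL_RE = re.compile(
    r'label:<name:\\?"(?P<key>[^"\\]+)\\?" value:\\?"(?P<value>[^"\\]+)\\?" >'
)


def parse_duplicates(error_text):
    """Return a list of (metric_name, labels) for each duplicate-collection error."""
    duplicates = []
    # Each error entry starts with the literal text "collected metric ".
    for chunk in error_text.split("collected metric ")[1:]:
        if "was collected before" not in chunk:
            continue
        metric_name = chunk.split(None, 1)[0]
        labels = dict(LABEL_RE.findall(chunk))
        duplicates.append((metric_name, labels))
    return duplicates


# One entry copied from the kubelet log in this report.
SAMPLE = (
    r'* collected metric kubelet_container_log_filesystem_used_bytes '
    r'label:<name:\"container\" value:\"tuned\" > '
    r'label:<name:\"namespace\" value:\"openshift-cluster-node-tuning-operator\" > '
    r'label:<name:\"pod\" value:\"tuned-226ck\" > gauge:<value:0 >  '
    r'was collected before with the same name and label values\n'
)

if __name__ == "__main__":
    for name, labels in parse_duplicates(SAMPLE):
        print(name, labels)
```

Grouping the result by the `pod` label shows which pods the kubelet is reporting twice.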

Expected results:
`/metrics` returns 200

Additional info:
See master kubelet logs - https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/807/pull-ci-openshift-machine-config-operator-master-e2e-etcd-quorum-loss/37/artifacts/e2e-etcd-quorum-loss/nodes/masters-journal

Comment 1 Seth Jennings 2019-06-04 17:35:59 UTC


*** This bug has been marked as a duplicate of bug 1712645 ***

Comment 2 Seth Jennings 2019-06-04 17:39:40 UTC


*** This bug has been marked as a duplicate of bug 1713098 ***