Bug 1689021 - Prometheus adapter is reported as unreachable by apiservers and never recovers, causing wedged ns deletions
Summary: Prometheus adapter is reported as unreachable by apiservers and never recovers, causing wedged ns deletions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-14 23:47 UTC by Clayton Coleman
Modified: 2019-06-04 10:46 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:45:52 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes-incubator custom-metrics-apiserver pull 50 0 None closed Bump to Kubernetes 1.14 2020-04-06 14:00:40 UTC
Github kubernetes-incubator metrics-server pull 245 0 None closed add s-urbaniak, prune directxman12 from owners et al 2020-04-06 14:00:39 UTC
Github openshift k8s-prometheus-adapter pull 14 0 None closed bump to k8s 1.14 2020-04-06 14:00:39 UTC
Github openshift service-ca-operator pull 44 0 None closed Bug 1700037: Check cert issuer directly 2020-04-06 14:00:38 UTC
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:45:59 UTC

Description Clayton Coleman 2019-03-14 23:47:28 UTC
Observed in a CI run that the metrics apiservice was down, causing e2e tests to time out and fail (because namespace deletion could not proceed).

The pods appeared fine, with only the following in the logs:

 oc logs -n openshift-monitoring deploy/prometheus-adapter
Found 2 pods, using pod/prometheus-adapter-69bd595d44-5plvk
I0314 21:55:24.335450       1 adapter.go:91] successfully using in-cluster auth
I0314 21:55:25.393188       1 serve.go:96] Serving securely on [::]:6443
E0314 21:56:14.948618       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=131, ErrCode=NO_ERROR, debug=""
E0314 21:56:14.949009       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=131, ErrCode=NO_ERROR, debug=""
E0314 21:57:40.183961       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
E0314 21:57:40.187336       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""

The apiservice reported

v1beta1.metrics.k8s.io                   openshift-monitoring/prometheus-adapter                                  False (FailedDiscoveryCheck)   104m

Run was https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.0/78

I've seen similar failures in other e2e runs. It's not clear where the problem resides: the apiserver, the network, or the endpoint.
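
For anyone reproducing this outside of an e2e run, here is a minimal client-go sketch, assuming in-cluster credentials and the standard kubernetes/client-go packages. It is not the aggregator's exact probe, but it exercises the same discovery path that the FailedDiscoveryCheck condition above refers to:

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster credentials, same as the adapter uses.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Discovery for an aggregated group/version is proxied by the
	// kube-apiserver to the backing service (here prometheus-adapter), so this
	// call fails in roughly the same way as the aggregator's own discovery check.
	resources, err := client.Discovery().ServerResourcesForGroupVersion("metrics.k8s.io/v1beta1")
	if err != nil {
		fmt.Printf("discovery of metrics.k8s.io/v1beta1 failed: %v\n", err)
		return
	}
	for _, r := range resources.APIResources {
		fmt.Println("served resource:", r.Name)
	}
}

With a healthy adapter this should print the metrics resources served by metrics.k8s.io/v1beta1; with the adapter wedged as above it is expected to fail with a service-unavailable style error from the aggregator.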

Comment 2 lserven 2019-03-18 12:26:12 UTC
It seems that we see similar errors in other components in the OpenShift stack, including Prometheus itself, but some of those components are able to recover from the failure.
We need to investigate whether the Prometheus adapter is simply missing retry logic.
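
For reference, a minimal client-go sketch of what such retry logic typically looks like: a shared informer re-establishes its underlying list/watch whenever the stream is closed (for example after a GOAWAY) instead of giving up. This is only illustrative, using a pod informer as a stand-in; the adapter's actual list/watch wiring lives in custom-metrics-apiserver:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster auth, as in the adapter log line "successfully using in-cluster auth".
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A shared informer wraps list/watch in a reflector that re-lists and
	// re-watches after the watch stream is closed, so a dropped connection
	// only produces a log line rather than a permanently broken cache.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Block until the initial list has populated the cache.
	if !cache.WaitForCacheSync(stop, podInformer.HasSynced) {
		panic("pod cache never synced")
	}
	fmt.Println("pod cache synced; the watch is retried automatically on disconnect")
	<-stop
}

The point is only that the re-list/re-watch behaviour comes for free from the reflector underneath the informer; whether the adapter (via custom-metrics-apiserver) actually uses this machinery for its pod and node caches is what needs to be verified.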

Comment 3 Frederic Branczyk 2019-04-11 09:48:21 UTC
Regarding the error in the logs: it seems this was fixed in newer versions of Kubernetes apimachinery. I'd say we should upgrade to v1.14 (since we only use the pod and node APIs, this should be safe to do).

Comment 4 Frederic Branczyk 2019-04-11 10:59:46 UTC
For what it's worth, the log lines come from the list/watch and suggest that the apiserver is closing connections unexpectedly, so it doesn't seem to me that these log lines have anything to do with the failure. However, we're going to go through all the necessary components and update everything to the latest apimachinery code.

Comment 5 Frederic Branczyk 2019-04-11 12:11:43 UTC
Moving to assigned as we're working on updating the Kubernetes dependencies throughout the stack.

Comment 8 Clayton Coleman 2019-04-17 18:50:02 UTC
Note that GOAWAY isn't an actual error.  It's just an informational message and isn't indicative of any error state.

Comment 13 Junqi Zhao 2019-04-25 05:38:00 UTC
Tested with payload 4.1.0-0.nightly-2019-04-23-223857; the issue no longer occurs.
# oc get apiservice v1beta1.metrics.k8s.io
NAME                     SERVICE                                   AVAILABLE   AGE
v1beta1.metrics.k8s.io   openshift-monitoring/prometheus-adapter   True        28h

Comment 15 errata-xmlrpc 2019-06-04 10:45:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

