Description of problem:
We are failing the Prometheus availability tests on GCP, but the master is available. So either the rule in the test is incorrect, or the endpoint is not responding when the server starts. Either way, we need to fix this before we ship.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/727/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1289911562324152320

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Every time.

Steps to Reproduce:
1. Run the e2e-gcp-ovn test suite

Actual results:
Failing Prometheus tests.

Expected results:
All tests pass.

Additional info:
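To separate the two suspects above ("the rule in the test is wrong" vs. "the endpoint is not responding"), it helps to look at what an availability check against the standard Prometheus HTTP API distinguishes. This is only a hedged sketch: the query, the `prometheus-k8s` job label, and the helper name are illustrative assumptions, not the actual e2e rule.

```python
import json

# Hypothetical helper: decide whether Prometheus reports its targets as up,
# given a response body from /api/v1/query?query=up{job="prometheus-k8s"}.
# The query and job label are assumptions, not the real e2e test rule.
def prometheus_available(body: str) -> bool:
    resp = json.loads(body)
    if resp.get("status") != "success":
        # The endpoint responded, but the query itself failed.
        return False
    results = resp["data"]["result"]
    if not results:
        # No samples at all: the target isn't being scraped, or the
        # rule's label selector matches nothing -- i.e. the rule is wrong.
        return False
    # "up" is 1 when the last scrape of the target succeeded.
    return all(float(s["value"][1]) == 1 for s in results)

# Sample response in the shape the Prometheus HTTP API returns for an
# instant vector query.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"job": "prometheus-k8s"},
                         "value": [1596240000, "1"]}]},
})
print(prometheus_available(sample))  # True when every target reports up == 1
```

An empty result vector and a non-success status both come back as unavailable, which is why the test alone can't tell a bad rule from a slow-starting endpoint.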
Ryan is on leave
David made a test to observe this happening in the e2e tests: https://github.com/openshift/origin/pull/25391. It's not clear what would cause the delay; the kubelet's watch-based manager does print an error after just a second: https://github.com/kubernetes/kubernetes/blob/eb8b5a9854e3e113ac6d4d3e66d01548488b8e2f/pkg/kubelet/util/manager/watch_based_manager.go#L189-L192. This has roughly a 5-10% failure rate in e2e-aws: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-ocp-installer-e2e-aws-4.6&width=5&include-filter-by-regex=sig-instrumentation&include-filter-by-regex=Prometheus%20when%20installed
The watch-based manager in the kubelet hasn't seen a lot of recent activity: https://github.com/kubernetes/kubernetes/commits/master/pkg/kubelet/util/manager/watch_based_manager.go
This hasn't failed in 5 days at this point.
Looks like this may have been a transient issue. The high flake rate has stopped, and the failures in these tests now are not related to the Prometheus pods failing to start.