2033379 – Prometheus is not highly available

Summary: Prometheus is not highly available

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Test Framework
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.8.z
Assignee:	W. Trevor King
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	2033378
Blocks:

Reported:	2021-12-16 16:09 UTC by OpenShift BugZilla Robot
Modified:	2021-12-20 20:32 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-12-20 20:32:42 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 26698	0	None	Merged	bug 2033379: [release-4.8] remove perma-failing prometheus upgrade invariant	2021-12-20 20:25:41 UTC

Comment 4 Devan Goodwin 2021-12-20 13:57:54 UTC

QE is looking for help verifying this; I assume you are their best bet, @wking.

Comment 5 W. Trevor King 2021-12-20 20:32:42 UTC

Repeating the query from [1], but with a reduced maxAge because [2] landed in 4.8 four days ago:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=72h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 9 runs, 78% failed, 14% of failures match = 11% impact
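
For reference, the trailing impact figure on each line is roughly the product of the two preceding percentages: for the gcp job, 78% of runs failed and 14% of those failures match, so about 11% of all runs hit the test-case. A minimal Python sketch of that arithmetic, assuming the line format shown above; `parse_impact` is a hypothetical helper, not part of the search tooling, and the page presumably computes from raw counts, so the product of the rounded percentages can differ slightly from the reported value:

```python
import re

# Pattern for the summary lines printed by the search page, as seen above.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, (?P<match>\d+)% of failures match "
    r"= (?P<impact>\d+)% impact$"
)

def parse_impact(line):
    """Return (job, runs, computed_impact, reported_impact) for one summary line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    failed = int(m["failed"])
    match = int(m["match"])
    # Percent of all runs that hit the matching failure.
    computed = failed * match / 100
    return m["job"], int(m["runs"]), computed, int(m["impact"])

line = (
    "periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade"
    " (all) - 9 runs, 78% failed, 14% of failures match = 11% impact"
)
job, runs, computed, reported = parse_impact(line)
print(f"{job}: computed {computed:.1f}% vs reported {reported}%")
```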

So that's... better...  Poking at one of the single-node hits [3]:

  INFO[2021-12-19T22:40:01Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-10-211525 
  INFO[2021-12-19T22:40:01Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-11-001048 

No idea why they're still running jobs between those older nightlies, but it makes sense to me that jobs whose target release doesn't contain the fix will still be impacted.  I'll optimistically close CURRENTRELEASE based on the reduction in hit volume, and we'll open a new series or come back to this run if we are bothered by this test-case going forward.
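
The reasoning above (a job is still expected to fail when its target release predates the fix) can be sketched by reading the build timestamp embedded in the release tag. This is an illustration only: `release_timestamp` is a hypothetical helper, the tag layout (date followed by HHMMSS, taken as UTC) is inferred from the two pullspecs above, and the cutoff date is an assumption derived from "four days ago", not the actual merge time of [2].

```python
import re
from datetime import datetime, timezone

def release_timestamp(pullspec):
    """Extract the build time from a tag like 4.8.0-0.ci-2021-12-11-001048."""
    m = re.search(r"-(\d{4}-\d{2}-\d{2}-\d{6})$", pullspec)
    if m is None:
        raise ValueError(f"no timestamp in {pullspec!r}")
    # ASSUMPTION: the trailing six digits are HHMMSS in UTC.
    parsed = datetime.strptime(m.group(1), "%Y-%m-%d-%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)

# ASSUMPTION: approximate landing date of the fix, four days before 2021-12-20.
FIX_LANDED = datetime(2021, 12, 16, tzinfo=timezone.utc)

target = "registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-11-001048"
built = release_timestamp(target)
print(built.isoformat(), "predates fix:", built < FIX_LANDED)
```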

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2030539#c0
[2]: https://github.com/openshift/origin/pull/26698#event-5781227260
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node/1472698236211826688#1:build-log.txt%3A4
