Bug 2033379

Summary:	Prometheus is not highly available
Product:	OpenShift Container Platform	Reporter:	OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component:	Test Framework	Assignee:	W. Trevor King <wking>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.8	CC:	bparees, dgoodwin, wking
Target Milestone:	---
Target Release:	4.8.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-12-20 20:32:42 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2033378
Bug Blocks:

Comment 4 Devan Goodwin 2021-12-20 13:57:54 UTC

QE looking for help verifying, I assume you are their best bet @wking.

Comment 5 W. Trevor King 2021-12-20 20:32:42 UTC

repeating the query from [1], but with a reduced maxAge because [2] landed in 4.8 4 days ago:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=72h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 9 runs, 78% failed, 14% of failures match = 11% impact

So that's... better...  Poking at one of the single-node hits [3]:

  INFO[2021-12-19T22:40:01Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-10-211525 
  INFO[2021-12-19T22:40:01Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-11-001048 

No idea why they're still running jobs between those older nightlies, but makes sense to me that jobs whose target release doesn't contain the fix will still be impacted.  I'll optimistically close CURRENTRELEASE  based on the reduction in hit volume, and we'll open a new series or come back to this run if we are bothered by this test-case going forward.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2030539#c0
[2]: https://github.com/openshift/origin/pull/26698#event-5781227260
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node/1472698236211826688#1:build-log.txt%3A4