Bug 2033379

Summary: Prometheus is not highly available
Product: OpenShift Container Platform Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Test Framework Assignee: W. Trevor King <wking>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 4.8 CC: bparees, dgoodwin, wking
Target Milestone: ---   
Target Release: 4.8.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-12-20 20:32:42 UTC Type: ---
Bug Depends On: 2033378    
Bug Blocks:    

Comment 4 Devan Goodwin 2021-12-20 13:57:54 UTC
QE is looking for help verifying; I assume you are their best bet, @wking.

Comment 5 W. Trevor King 2021-12-20 20:32:42 UTC
Repeating the query from [1], but with a reduced maxAge because [2] landed in 4.8 four days ago:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Watchdog+alert+had+missing+intervals&maxAge=72h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 9 runs, 78% failed, 14% of failures match = 11% impact
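For readers unfamiliar with the summary lines above, the "impact" figure appears to be the share of all runs that both failed and matched the search string, i.e. roughly (percent failed) x (percent of failures matching). The following is a minimal Python sketch that parses the lines quoted above and recomputes that product. The regular expression, function names, and the product formula are assumptions made for illustration, not taken from the search tool itself; because the tool reports rounded percentages, the recomputed value is only expected to agree to within about a percentage point.

```python
import re

# Summary lines copied verbatim from the search output above.
SAMPLE = """\
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 9 runs, 78% failed, 14% of failures match = 11% impact
"""

# Hypothetical pattern for one summary line; the layout is inferred from the sample only.
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_line(line):
    """Return a dict of the fields in one summary line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    return {
        "job": fields["job"],
        "runs": int(fields["runs"]),
        "failed": int(fields["failed"]),
        "match": int(fields["match"]),
        "impact": int(fields["impact"]),
    }


def recomputed_impact(row):
    """Assumed formula: percent failed times percent of failures matching, as a percentage."""
    return row["failed"] * row["match"] / 100


rows = [r for r in (parse_line(l) for l in SAMPLE.splitlines()) if r is not None]

for row in rows:
    print(row["job"], "reported:", row["impact"], "recomputed:", round(recomputed_impact(row), 1))
```

With only 1 to 9 runs per job in the 72h window, each percentage moves in large steps, so these figures are better read as a rough signal than as a precise rate.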

So that's... better...  Poking at one of the single-node hits [3]:

  INFO[2021-12-19T22:40:01Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-10-211525 
  INFO[2021-12-19T22:40:01Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-11-001048 

No idea why they're still running jobs between those older nightlies, but it makes sense to me that jobs whose target release doesn't contain the fix will still be impacted. I'll optimistically close CURRENTRELEASE based on the reduction in hit volume, and we'll open a new series or come back to this run if we are bothered by this test-case going forward.
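The reasoning above (a job whose releases predate the fix will still hit the failure) can be checked mechanically. Below is a minimal Python sketch, assuming the trailing YYYY-MM-DD-HHMMSS part of a CI release tag is its build time in UTC, and assuming the fix landed around 2021-12-16, which is only an approximation derived from "four days ago" relative to this comment's date. The names build_time, predates_fix, and FIX_LANDED are hypothetical.

```python
import re
from datetime import datetime, timezone

# Pullspecs copied from the build log excerpt above.
PULLSPECS = [
    "registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-10-211525",
    "registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-12-11-001048",
]

# Assumption: approximate date the fix landed in 4.8 (comment date minus four days).
FIX_LANDED = datetime(2021, 12, 16, tzinfo=timezone.utc)

# Assumption: the tag ends in a YYYY-MM-DD-HHMMSS build timestamp.
TAG_TIME_RE = re.compile(r"(\d{4}-\d{2}-\d{2}-\d{6})$")


def build_time(pullspec):
    """Extract the trailing timestamp from a release pullspec as a UTC datetime."""
    m = TAG_TIME_RE.search(pullspec)
    if m is None:
        raise ValueError(f"no trailing timestamp in {pullspec!r}")
    parsed = datetime.strptime(m.group(1), "%Y-%m-%d-%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def predates_fix(pullspec, fix_landed=FIX_LANDED):
    """True if the release was built before the assumed fix date."""
    return build_time(pullspec) < fix_landed


for spec in PULLSPECS:
    print(spec, "predates fix:", predates_fix(spec))
```

A build timestamp earlier than the fix date only suggests that the release lacks the fix; confirming it would require inspecting the release contents.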

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2030539#c0
[2]: https://github.com/openshift/origin/pull/26698#event-5781227260
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node/1472698236211826688#1:build-log.txt%3A4