Bug 1833098

Summary: 35% of Azure failures include the alert e2e test NodeClockNotSynchronising firing
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: RHCOS
Assignee: Colin Walters <walters>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: high
Docs Contact:
Priority: medium
Version: 4.5
CC: bbreard, dmace, ffranz, imcleod, jligon, miabbott, nstielau, walters, wking
Target Milestone: ---
Keywords: Reopened
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1835801
Environment: [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed: 2020-10-27 15:58:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1835801    

Description Clayton Coleman 2020-05-07 19:31:00 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3561/pull-ci-openshift-installer-master-e2e-azure/540

https://search.apps.build01.ci.devcluster.openshift.com/?search=NodeClockNotSynchronising&maxAge=168h&context=1&type=bug%2Bjunit&name=azure&maxMatches=5&maxBytes=20971520&groupBy=job

Across 805 runs and 80 jobs (54.29% failed), matched 35.24% of failing runs and 13.75% of jobs
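
As a rough sanity check, the percentages above can be turned back into approximate counts. This is only a sketch: it assumes that "54.29% failed" is a fraction of the 805 runs, and that the match percentages apply to the failing runs and to the 80 jobs respectively. The search tool's exact denominators are not stated here, so the derived counts are estimates, and `approx_count` is a hypothetical helper.

```python
# Back-of-the-envelope reconstruction of the counts behind the CI search summary.
# Assumption: "54.29% failed" is a fraction of the 805 runs.


def approx_count(total: int, percent: float) -> int:
    """Return the nearest whole count for `percent` of `total`."""
    return round(total * percent / 100)


total_runs = 805
total_jobs = 80

failing_runs = approx_count(total_runs, 54.29)     # runs that failed for any reason
matching_runs = approx_count(failing_runs, 35.24)  # failing runs that hit the alert
matching_jobs = approx_count(total_jobs, 13.75)    # jobs with at least one match

print(f"failing runs:  ~{failing_runs}")
print(f"matching runs: ~{matching_runs}")
print(f"matching jobs: ~{matching_jobs}")
```

Under those assumptions, roughly 154 of about 437 failing Azure runs, spread over 11 jobs, included the NodeClockNotSynchronising match.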

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] 1m30s
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"NodeClockNotSynchronising\",\"alertstate\":\"firing\",\"endpoint\":\"https\",\"instance\":\"ci-op-hy8z3bni-2dc90-xpt9z-master-0\"

It looks like NodeClockNotSynchronising is firing on all three nodes in the run I linked, because node_timex_sync_status is empty.
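
To make the selector in the failing query concrete, here is a minimal sketch of how its label matchers decide which alerts count against the test. It is an illustration in Python, not the code in helpers.go: the `counts_as_failure` helper and the sample label sets are hypothetical, and the `severity` values in particular are assumed, since the output above is truncated before that label. PromQL regex matchers are fully anchored, which `re.fullmatch` mimics here.

```python
import re

# Alert names excluded by alertname!~"..." (pattern copied from the query).
ALLOWED = re.compile(
    r"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh"
)


def counts_as_failure(labels: dict) -> bool:
    """Mimic the ALERTS{...} label matchers from the test query.

    A missing label is treated as the empty string, as in PromQL.
    """
    return (
        ALLOWED.fullmatch(labels.get("alertname", "")) is None
        and labels.get("alertstate", "") == "firing"
        and labels.get("severity", "") != "info"
    )


# Hypothetical samples; only the alert names and alertstate="firing"
# for NodeClockNotSynchronising come from the report.
node_clock = {
    "alertname": "NodeClockNotSynchronising",
    "alertstate": "firing",
    "severity": "warning",  # assumed value
}
watchdog = {"alertname": "Watchdog", "alertstate": "firing", "severity": "none"}
pending = dict(node_clock, alertstate="pending")

for sample in (node_clock, watchdog, pending):
    print(sample["alertname"], sample["alertstate"], counts_as_failure(sample))
```

In this sketch only the firing NodeClockNotSynchronising sample is selected; Watchdog is excluded by the name pattern and the pending sample by the `alertstate` matcher. The real query additionally wraps the selector in `count_over_time(...[2h]) >= 1`, so an alert that fired at any point in the previous two hours would still fail the test.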

Comment 1 Colin Walters 2020-05-07 19:45:01 UTC
This should be fixed since https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/918
but rollout was stalled by https://bugzilla.redhat.com/show_bug.cgi?id=1781575

I think also we've only done the bump in 4.6 and need to backport it to 4.5.

Comment 3 Colin Walters 2020-05-15 20:22:01 UTC
Awesome!  Since then we have the same fix inbound for EC2 and GCP: https://github.com/coreos/fedora-coreos-config/pull/393

Comment 5 Colin Walters 2020-05-19 12:34:04 UTC
We also need https://github.com/openshift/installer/pull/3613 AKA bug 1837039 for the bootimage, although I didn't think that would be critical.
It's also possible that I regressed this when generalizing it in https://github.com/coreos/fedora-coreos-config/pull/393

I'll take some time to verify the code in the current release payload.

Comment 7 Colin Walters 2020-05-20 18:24:08 UTC
Re-marking as verified; haven't seen this in the last 12 hours, which is around when the fix merged into CI.

Comment 10 errata-xmlrpc 2020-10-27 15:58:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 11 Red Hat Bugzilla 2023-09-14 05:57:39 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days