Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1833098

Summary:	35% of Azure failures include the alert e2e test NodeClockNotSynchronising firing
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	RHCOS	Assignee:	Colin Walters <walters>
Status:	CLOSED ERRATA	QA Contact:	Michael Nguyen <mnguyen>
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.5	CC:	bbreard, dmace, ffranz, imcleod, jligon, miabbott, nstielau, walters, wking
Target Milestone:	---	Keywords:	Reopened
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	1835801		Environment:	[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed:	2020-10-27 15:58:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1835801

Description Clayton Coleman 2020-05-07 19:31:00 UTC

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3561/pull-ci-openshift-installer-master-e2e-azure/540

https://search.apps.build01.ci.devcluster.openshift.com/?search=NodeClockNotSynchronising&maxAge=168h&context=1&type=bug%2Bjunit&name=azure&maxMatches=5&maxBytes=20971520&groupBy=job

Across 805 runs and 80 jobs (54.29% failed), matched 35.24% of failing runs and 13.75% of jobs
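
As a rough sanity check, the percentages above can be turned back into approximate counts. This is only a sketch: it assumes each percentage is a simple ratio rounded to two decimals, and the resulting counts are back-computed estimates, not values reported by the search tool.

```python
# Back-compute approximate counts from the percentages quoted above.
# Assumption: each percentage is a plain ratio rounded to two decimals.

def approx_count(total: int, percent: float) -> int:
    """Return the nearest whole count for `percent` of `total`."""
    return round(total * percent / 100)

total_runs = 805
total_jobs = 80

failed_runs = approx_count(total_runs, 54.29)     # runs that failed for any reason
matching_runs = approx_count(failed_runs, 35.24)  # failing runs that matched the search
matching_jobs = approx_count(total_jobs, 13.75)   # jobs that matched the search

print(failed_runs, matching_runs, matching_jobs)
```

Under those assumptions this works out to roughly 437 failing runs, about 154 of which matched the search, spread over 11 jobs.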

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] 	1m30s
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"NodeClockNotSynchronising\",\"alertstate\":\"firing\",\"endpoint\":\"https\",\"instance\":\"ci-op-hy8z3bni-2dc90-xpt9z-master-0\"

Looks like in the run I linked NodeClockNotSynchronising is firing on all three nodes because node_timex_sync_status is empty.
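
To see which nodes an alert like this is firing on, the result vector embedded in the failure message can be grouped by instance. The following is a minimal sketch, not the test's own code: the helper name `firing_instances` is hypothetical, and the sample data is made up for illustration. Only the label keys and the `master-0` instance name are copied from the truncated output above; the other two instance names are assumed.

```python
import json
from collections import defaultdict
from typing import Dict, List


def firing_instances(result_json: str) -> Dict[str, List[str]]:
    """Map each firing alertname to the sorted list of instances reporting it."""
    by_alert = defaultdict(set)
    for sample in json.loads(result_json):
        labels = sample.get("metric", {})
        if labels.get("alertstate") == "firing":
            alert = labels.get("alertname", "<unknown>")
            by_alert[alert].add(labels.get("instance", "<unknown>"))
    return {alert: sorted(instances) for alert, instances in by_alert.items()}


# Hypothetical sample shaped like the JSON in the failure message.
# Only the master-0 name comes from the bug; master-1 and master-2 are assumed.
sample = json.dumps([
    {"metric": {"alertname": "NodeClockNotSynchronising", "alertstate": "firing",
                "endpoint": "https", "instance": "ci-op-hy8z3bni-2dc90-xpt9z-master-0"}},
    {"metric": {"alertname": "NodeClockNotSynchronising", "alertstate": "firing",
                "endpoint": "https", "instance": "ci-op-hy8z3bni-2dc90-xpt9z-master-1"}},
    {"metric": {"alertname": "NodeClockNotSynchronising", "alertstate": "firing",
                "endpoint": "https", "instance": "ci-op-hy8z3bni-2dc90-xpt9z-master-2"}},
])

print(firing_instances(sample))
```

Grouping by alert name first makes it easy to tell a cluster-wide condition (every node listed under one alert) from a single misbehaving node.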

Comment 1 Colin Walters 2020-05-07 19:45:01 UTC

This should be fixed since https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/918
but rollout was stalled by https://bugzilla.redhat.com/show_bug.cgi?id=1781575

I also think we've only done the bump in 4.6 and still need to backport it to 4.5.

Comment 2 Clayton Coleman 2020-05-15 19:32:54 UTC

https://search.apps.build01.ci.devcluster.openshift.com/?search=NodeClockNotSynchronising&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like no hits in the last 6 days.  Going to mark this as verified.
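
The search links in this bug all share the same query-string pattern, so the check can be repeated for another alert name or job filter by rebuilding the URL. A small sketch, assuming the parameters behave as their names suggest (this bug does not document them); the function name `ci_search_url` is hypothetical, and the parameter values are copied from the links above.

```python
from urllib.parse import urlencode

# Base URL as it appears in the links in this bug.
BASE = "https://search.apps.build01.ci.devcluster.openshift.com/"


def ci_search_url(search: str, result_type: str = "junit", name: str = "",
                  max_age: str = "168h") -> str:
    """Build a CI search URL using the same parameters as the links in this bug."""
    params = {
        "search": search,
        "maxAge": max_age,
        "context": 1,
        "type": result_type,
        "name": name,          # job-name filter; empty means no filter (assumed)
        "maxMatches": 5,
        "maxBytes": 20971520,
        "groupBy": "job",
    }
    # urlencode percent-encodes values, so "bug+junit" becomes "bug%2Bjunit".
    return BASE + "?" + urlencode(params)


print(ci_search_url("NodeClockNotSynchronising"))
print(ci_search_url("NodeClockNotSynchronising", result_type="bug+junit", name="azure"))
```

The two calls reproduce the unfiltered junit link and the Azure-only link from the description, respectively.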

Comment 3 Colin Walters 2020-05-15 20:22:01 UTC

Awesome!  Since then we have the same fix inbound for EC2 and GCP: https://github.com/coreos/fedora-coreos-config/pull/393

Comment 4 Dan Mace 2020-05-19 11:48:35 UTC

This is marked closed in 4.5, but it's still happening, and a lot:

https://search.apps.build01.ci.devcluster.openshift.com/?search=NodeClockNotSynchronising&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

example:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/354/pull-ci-openshift-cluster-etcd-operator-master-e2e-azure/1552

Comment 5 Colin Walters 2020-05-19 12:34:04 UTC

We also need https://github.com/openshift/installer/pull/3613 AKA bug 1837039 for the bootimage, although I didn't think that would be critical.
It's also possible that I regressed this when generalizing it in https://github.com/coreos/fedora-coreos-config/pull/393

I'll take some time to verify the code in the current release payload.

Comment 6 Colin Walters 2020-05-19 15:08:11 UTC

:cry: https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/955

Comment 7 Colin Walters 2020-05-20 18:24:08 UTC

Re-marking as verified; haven't seen this in the last 12 hours, which is around when the fix merged into CI.

Comment 10 errata-xmlrpc 2020-10-27 15:58:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images) and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 11 Red Hat Bugzilla 2023-09-14 05:57:39 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days