Bug 1833098 - 35% of Azure failures include the alert e2e test NodeClockNotSynchronising firing [NEEDINFO]
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
Depends On:
Blocks: 1835801
Reported: 2020-05-07 19:31 UTC by Clayton Coleman
Modified: 2020-10-27 15:59 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1835801
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed: 2020-10-27 15:58:53 UTC
Target Upstream Version:
dmace: needinfo? (ccoleman)

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 15:59:14 UTC

Description Clayton Coleman 2020-05-07 19:31:00 UTC


Across 805 runs and 80 jobs (54.29% failed), matched 35.24% of failing runs and 13.75% of jobs
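
As a rough cross-check, the percentages above can be turned back into approximate counts. This is only a back-of-the-envelope sketch: it assumes that "54.29% failed" refers to runs rather than jobs, and the helper name is illustrative, not part of any CI tooling.

```python
def implied_count(total: int, percent: float) -> int:
    """Return the whole number closest to `percent` percent of `total`."""
    return round(total * percent / 100)


# Assumption: the failure rate applies to the 805 runs, not the 80 jobs.
failed_runs = implied_count(805, 54.29)
# Share of those failing runs that matched the alert failure.
matching_runs = implied_count(failed_runs, 35.24)
# Share of the 80 jobs with at least one matching run.
matching_jobs = implied_count(80, 13.75)

print(failed_runs, matching_runs, matching_jobs)
```

Under that assumption the figures work out to roughly 437 failed runs, about 154 of which hit this alert, spread over 11 jobs.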

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] 1m30s
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"NodeClockNotSynchronising\",\"alertstate\":\"firing\",\"endpoint\":\"https\",\"instance\":\"ci-op-hy8z3bni-2dc90-xpt9z-master-0\"
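
For readers unfamiliar with the check, the PromQL expression in the failure above can be reconstructed from its parts: a list of alert names that are allowed to fire, a fixed look-back window, and a filter on firing, non-info alerts. The following is a minimal sketch in Python, not the code in helpers.go; the function name and parameters are hypothetical.

```python
def firing_alerts_query(allowed, window="2h"):
    """Build a PromQL expression that matches any firing, non-info alert
    whose name is not in `allowed`, looking back over `window`."""
    # The allowed names are joined into one regex for a negative match.
    excluded = "|".join(allowed)
    selector = (
        f'ALERTS{{alertname!~"{excluded}",'
        'alertstate="firing",severity!="info"}'
    )
    # Any series returned by this expression fired at least once in the window.
    return f"count_over_time({selector}[{window}]) >= 1"


query = firing_alerts_query(
    ["Watchdog", "AlertmanagerReceiversNotConfigured", "KubeAPILatencyHigh"]
)
print(query)
```

Any series returned by this expression means an unexpected alert fired during the window, which is why NodeClockNotSynchronising appears in the failure output.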

It looks like, in the run I linked, NodeClockNotSynchronising is firing on all three nodes because node_timex_sync_status is empty.
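
One way to see which nodes are affected is to query node_timex_sync_status directly; node_exporter sets it to 1 when the kernel reports the clock as synchronised and to 0 otherwise. The sketch below assumes the standard JSON shape of a Prometheus instant-query response. The function name, instance names, and sample values are hypothetical and not taken from the failing run.

```python
def unsynced_instances(response):
    """Return the instances whose node_timex_sync_status sample is 0."""
    results = response.get("data", {}).get("result", [])
    return sorted(
        sample["metric"].get("instance", "<unknown>")
        for sample in results
        # Each sample value is a [timestamp, "string value"] pair.
        if float(sample["value"][1]) == 0.0
    )


# Hypothetical response for the query `node_timex_sync_status`.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "master-0"}, "value": [1588879860, "0"]},
            {"metric": {"instance": "master-1"}, "value": [1588879860, "0"]},
            {"metric": {"instance": "worker-0"}, "value": [1588879860, "1"]},
        ],
    },
}

print(unsynced_instances(sample_response))
```

Instances that return no sample at all would not show up in this list, so a missing series needs to be checked separately.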

Comment 1 Colin Walters 2020-05-07 19:45:01 UTC
This should have been fixed by https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/918,
but the rollout was stalled by https://bugzilla.redhat.com/show_bug.cgi?id=1781575

I also think we've only done the bump in 4.6 and still need to backport it to 4.5.

Comment 3 Colin Walters 2020-05-15 20:22:01 UTC
Awesome!  Since then we have the same fix inbound for EC2 and GCP: https://github.com/coreos/fedora-coreos-config/pull/393

Comment 5 Colin Walters 2020-05-19 12:34:04 UTC
We also need https://github.com/openshift/installer/pull/3613 AKA bug 1837039 for the bootimage, although I didn't think that would be critical.
It's also possible that I regressed this when generalizing it in https://github.com/coreos/fedora-coreos-config/pull/393

I'll take some time to verify the code in the current release payload.

Comment 7 Colin Walters 2020-05-20 18:24:08 UTC
Re-marking as verified; haven't seen this in the last 12 hours, which is around when the fix merged into CI.

Comment 10 errata-xmlrpc 2020-10-27 15:58:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

