Bug 1835801 - [4.5] 35% of Azure failures include the alert e2e test NodeClockNotSynchronising firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Colin Walters
QA Contact: Ben Howard
URL:
Whiteboard:
Depends On: 1833098
Blocks:
Reported: 2020-05-14 14:03 UTC by Micah Abbott
Modified: 2020-07-13 17:39 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1833098
Environment:
[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed: 2020-07-13 17:38:54 UTC
Target Upstream Version:
Embargoed:


Links
System ID: Red Hat Product Errata RHBA-2020:2409
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2020-07-13 17:39:10 UTC

Description Micah Abbott 2020-05-14 14:03:19 UTC
+++ This bug was initially created as a clone of Bug #1833098 +++

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3561/pull-ci-openshift-installer-master-e2e-azure/540

https://search.apps.build01.ci.devcluster.openshift.com/?search=NodeClockNotSynchronising&maxAge=168h&context=1&type=bug%2Bjunit&name=azure&maxMatches=5&maxBytes=20971520&groupBy=job

Across 805 runs and 80 jobs (54.29% failed), matched 35.24% of failing runs and 13.75% of jobs
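
As a rough sanity check, the percentages in the search summary can be turned back into absolute counts. This is only a sketch: it assumes that "54.29% failed" refers to the share of the 805 runs that failed, which the summary line does not state explicitly, and the variable names are illustrative.

```python
# Figures quoted in the CI search summary.
total_runs = 805
total_jobs = 80
failed_run_pct = 54.29       # assumption: share of runs that failed
matched_failing_pct = 35.24  # share of failing runs that matched the search
matched_job_pct = 13.75      # share of jobs that matched the search

# Convert the percentages back into whole-number counts.
failing_runs = round(total_runs * failed_run_pct / 100)
matched_runs = round(failing_runs * matched_failing_pct / 100)
matched_jobs = round(total_jobs * matched_job_pct / 100)

print(failing_runs, matched_runs, matched_jobs)  # prints: 437 154 11
```

Under that assumption, roughly 154 of about 437 failing Azure runs, spread over 11 of the 80 jobs, included the NodeClockNotSynchronising alert.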

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] (1m30s)
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"NodeClockNotSynchronising\",\"alertstate\":\"firing\",\"endpoint\":\"https\",\"instance\":\"ci-op-hy8z3bni-2dc90-xpt9z-master-0\"

In the run I linked, NodeClockNotSynchronising appears to be firing on all three nodes because node_timex_sync_status is empty.
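
To illustrate why this alert fails the test while Watchdog does not, the label matchers of the query in the failure output can be approximated in Python. This is a minimal sketch, not the test's actual implementation: the `unexpected_alerts` helper and the sample alert dictionaries are hypothetical, the `severity` values in the sample are assumed (the truncated output does not show them), and the `count_over_time(...[2h]) >= 1` window is not modeled.

```python
import re

# Alert names excluded by the query in the failure output.
EXCLUDED = r"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh"


def unexpected_alerts(alerts):
    """Return the alerts the query's label matchers would select."""
    return [
        a for a in alerts
        if a.get("alertstate") == "firing"
        # A missing severity label also satisfies severity != "info".
        and a.get("severity") != "info"
        # PromQL regex matchers are fully anchored, hence fullmatch.
        and re.fullmatch(EXCLUDED, a["alertname"]) is None
    ]


# Hypothetical sample data; severity values are assumptions.
sample = [
    {"alertname": "Watchdog", "alertstate": "firing", "severity": "none"},
    {"alertname": "NodeClockNotSynchronising", "alertstate": "firing",
     "severity": "warning"},
]

for alert in unexpected_alerts(sample):
    print(alert["alertname"])
```

Any alert that survives this filter at least once in the window makes the test fail, so a single firing NodeClockNotSynchronising alert per node is enough.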

--- Additional comment from Colin Walters on 2020-05-07 19:45:01 UTC ---

This should be fixed since https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/918
but rollout was stalled by https://bugzilla.redhat.com/show_bug.cgi?id=1781575

I also think we've only done the bump in 4.6 and need to backport it to 4.5.

Comment 2 Micah Abbott 2020-05-27 19:58:17 UTC
Linked PR was merged on May 15; we've had a number of successful RHCOS 4.5 builds since then.  Marking as MODIFIED.

Comment 5 Ben Howard 2020-06-17 17:20:02 UTC
Validation Steps:
1. Launched an Azure Cluster
2. Confirmed that:
    - /run/systemd/generator/chronyd.service.d/coreos-platform-chrony.conf exists
    - /run/coreos-platform-chrony.conf exists
3. Confirmed that the run-time configuration is being used.
sh-4.4# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
  Drop-In: /run/systemd/generator/chronyd.service.d
           └─coreos-platform-chrony.conf
   Active: active (running) since Wed 2020-06-17 16:34:54 UTC; 42min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 1300 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 1294 ExecStart=/usr/sbin/chronyd -f /run/coreos-platform-chrony.conf $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1298 (chronyd)
    Tasks: 1
   Memory: 1.8M
      CPU: 449ms
   CGroup: /system.slice/chronyd.service
           └─1298 /usr/sbin/chronyd -f /run/coreos-platform-chrony.conf

Verified.
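
The validation steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not part of the original verification: the helper names are hypothetical, the two paths are the ones listed in step 2, and the check for step 3 simply looks for the `-f` argument on chronyd command lines in `systemctl status chronyd` output like that shown above.

```python
import os
import re

# Paths taken from the validation steps.
DROPIN = "/run/systemd/generator/chronyd.service.d/coreos-platform-chrony.conf"
RUNTIME_CONF = "/run/coreos-platform-chrony.conf"


def files_present(paths=(DROPIN, RUNTIME_CONF)):
    """Step 2: report whether each expected file exists on the node."""
    return {p: os.path.exists(p) for p in paths}


def chronyd_config_paths(status_text):
    """Extract the -f argument from each chronyd command line in the output."""
    return re.findall(r"/usr/sbin/chronyd -f (\S+)", status_text)


def uses_runtime_config(status_text, conf=RUNTIME_CONF):
    """Step 3: True if chronyd was started with the run-time config only."""
    paths = chronyd_config_paths(status_text)
    return bool(paths) and all(p == conf for p in paths)


# Two lines copied from the status output in this comment.
status = (
    "  Process: 1294 ExecStart=/usr/sbin/chronyd -f "
    "/run/coreos-platform-chrony.conf $OPTIONS (code=exited, status=0/SUCCESS)\n"
    "           └─1298 /usr/sbin/chronyd -f /run/coreos-platform-chrony.conf\n"
)
print(uses_runtime_config(status))
```

On a node, `files_present()` would be run directly and the status text would come from the `systemctl status chronyd` command; capturing that output is left out here to keep the sketch self-contained.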

Comment 7 errata-xmlrpc 2020-07-13 17:38:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

