Bug 1835801

Summary: [4.5] 35% of Azure failures include the alert e2e test NodeClockNotSynchronising firing
Product: OpenShift Container Platform
Reporter: Micah Abbott <miabbott>
Component: RHCOS
Assignee: Colin Walters <walters>
Status: CLOSED ERRATA
QA Contact: Ben Howard <behoward>
Severity: high
Priority: medium
Version: 4.5
CC: bbreard, behoward, ccoleman, imcleod, jligon, mnguyen, nstielau, walters, wking
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: No Doc Update
Clone Of: 1833098
Environment: [sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early]
Last Closed: 2020-07-13 17:38:54 UTC
Bug Depends On: 1833098

Description Micah Abbott 2020-05-14 14:03:19 UTC
+++ This bug was initially created as a clone of Bug #1833098 +++

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3561/pull-ci-openshift-installer-master-e2e-azure/540

https://search.apps.build01.ci.devcluster.openshift.com/?search=NodeClockNotSynchronising&maxAge=168h&context=1&type=bug%2Bjunit&name=azure&maxMatches=5&maxBytes=20971520&groupBy=job

Across 805 runs and 80 jobs (54.29% failed), matched 35.24% of failing runs and 13.75% of jobs
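As a rough sanity check, the percentages above can be turned back into absolute counts. This is only a sketch: it assumes "54.29% failed" refers to runs rather than jobs, and the rounded counts are derived here, not reported by the search tool.

```python
# Figures quoted in the search summary above.
total_runs = 805
total_jobs = 80
failed_run_share = 0.5429       # "54.29% failed" (assumed to refer to runs)
matched_failing_share = 0.3524  # "matched 35.24% of failing runs"
matched_job_share = 0.1375      # "13.75% of jobs"

# Convert the shares back into approximate absolute counts.
failed_runs = round(total_runs * failed_run_share)
matched_runs = round(failed_runs * matched_failing_share)
matched_jobs = round(total_jobs * matched_job_share)

print(failed_runs, matched_runs, matched_jobs)  # prints: 437 154 11
```

Under that assumption, roughly 154 of about 437 failing Azure runs hit the alert, which matches the "35%" in the summary.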

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] 1m30s
fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1": {
            s: "promQL query: count_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|KubeAPILatencyHigh\",alertstate=\"firing\",severity!=\"info\"}[2h]) >= 1 had reported incorrect results:\n[{\"metric\":{\"alertname\":\"NodeClockNotSynchronising\",\"alertstate\":\"firing\",\"endpoint\":\"https\",\"instance\":\"ci-op-hy8z3bni-2dc90-xpt9z-master-0\"

Looks like in the run I linked NodeClockNotSynchronising is firing on all three nodes because node_timex_sync_status is empty.
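To see which nodes an alert fired on, the "metric" entries in the failure output can be grouped by alert name. The following is a minimal sketch, assuming the result is a list of objects with a "metric" label map, as in the truncated output above. The function name is hypothetical, and the sample contains only the labels visible in that output.

```python
def firing_alerts_by_name(result):
    """Group instances by alert name for entries whose alertstate is 'firing'."""
    grouped = {}
    for sample in result:
        metric = sample.get("metric", {})
        if metric.get("alertstate") != "firing":
            continue  # ignore pending or otherwise non-firing entries
        name = metric.get("alertname", "<unknown>")
        grouped.setdefault(name, []).append(metric.get("instance", "<unknown>"))
    return grouped


# Labels copied from the truncated failure output; the rest of the entry is elided there.
sample_result = [
    {
        "metric": {
            "alertname": "NodeClockNotSynchronising",
            "alertstate": "firing",
            "endpoint": "https",
            "instance": "ci-op-hy8z3bni-2dc90-xpt9z-master-0",
        }
    }
]

print(firing_alerts_by_name(sample_result))
```

With the full result from the linked run, the same grouping would be expected to list all three master nodes under NodeClockNotSynchronising.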

--- Additional comment from Colin Walters on 2020-05-07 19:45:01 UTC ---

This should be fixed since https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/918
but rollout was stalled by https://bugzilla.redhat.com/show_bug.cgi?id=1781575

I also think we've only done the bump in 4.6 so far and need to backport it to 4.5.

Comment 2 Micah Abbott 2020-05-27 19:58:17 UTC
Linked PR was merged on May 15; we've had a number of successful RHCOS 4.5 builds since then.  Marking as MODIFIED.

Comment 5 Ben Howard 2020-06-17 17:20:02 UTC
Validation Steps:
1. Launched an Azure Cluster
2. Confirmed that:
    - /run/systemd/generator/chronyd.service.d/coreos-platform-chrony.conf exists
    - /run/coreos-platform-chrony.conf exists
3. Confirmed that the run-time configuration is being used.
sh-4.4# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
  Drop-In: /run/systemd/generator/chronyd.service.d
           └─coreos-platform-chrony.conf
   Active: active (running) since Wed 2020-06-17 16:34:54 UTC; 42min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 1300 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 1294 ExecStart=/usr/sbin/chronyd -f /run/coreos-platform-chrony.conf $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1298 (chronyd)
    Tasks: 1
   Memory: 1.8M
      CPU: 449ms
   CGroup: /system.slice/chronyd.service
           └─1298 /usr/sbin/chronyd -f /run/coreos-platform-chrony.conf

Verified.
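The check in step 3 could be scripted by parsing the `systemctl status chronyd` output. This is a sketch only: the function names are hypothetical, the regular expressions assume the output layout shown above, and the two paths come from the validation steps.

```python
import re

# Paths taken from the validation steps above.
DROPIN = "/run/systemd/generator/chronyd.service.d/coreos-platform-chrony.conf"
RUNTIME_CONF = "/run/coreos-platform-chrony.conf"


def dropin_path(status_text):
    """Return the first drop-in file listed in `systemctl status` output, or None."""
    m = re.search(r"Drop-In:\s*(\S+)\s*\n\s*└─(\S+)", status_text)
    return f"{m.group(1)}/{m.group(2)}" if m else None


def chronyd_config_file(status_text):
    """Return the file passed to `chronyd -f`, or None if no -f argument is found."""
    m = re.search(r"/usr/sbin/chronyd\s+-f\s+(\S+)", status_text)
    return m.group(1) if m else None


def uses_runtime_config(status_text):
    """True when the generated drop-in is active and chronyd reads the run-time config."""
    return (
        dropin_path(status_text) == DROPIN
        and chronyd_config_file(status_text) == RUNTIME_CONF
    )
```

On a node, the text could be captured with `subprocess.run(["systemctl", "status", "chronyd"], capture_output=True, text=True).stdout` and passed to `uses_runtime_config`. Checking that the two files exist (step 2) would be a separate `os.path.exists` call on each path.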

Comment 7 errata-xmlrpc 2020-07-13 17:38:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409