Bug 1874340

Summary: vmware: NodeClockNotSynchronising alert is triggered in openshift cluster after upgrading from 4.4.16 to 4.5.6
Product: OpenShift Container Platform
Reporter: Vedanti Jaypurkar <vjaypurk>
Component: Monitoring
Assignee: Pawel Krupa <pkrupa>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: medium
Version: 4.5
CC: alegrand, amigliet, anisal, anpicker, bbreard, dkiselev, erooth, fshaikh, hongyli, imcleod, jligon, juzhao, kakkoyun, lcosic, miabbott, mlichvar, nstielau, pkrupa, spasquie, ssadhale, surbania
Keywords: Triaged
Target Release: 4.7.0
Flags: anisal: needinfo-
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2021-02-24 15:17:02 UTC
Type: Bug

Comment 3 Pawel Krupa 2020-09-02 13:48:06 UTC
I believe you are hitting a kernel/chrony issue[1] with setting the sync_status flag. node_exporter reports sync_status via the timex collector as the "node_timex_sync_status" metric, on which the alert is based. This mechanism relies on information reported by the kernel's adjtimex syscall. In this case the status value reported by the kernel appears to be incorrect, and node_exporter simply propagates it.
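
For reference, the kernel state node_exporter reads can be inspected directly on a node. The following is a minimal sketch (not node_exporter's actual code) of the same adjtimex(2) check the timex collector performs, assuming golang.org/x/sys/unix is available; on an affected node it prints sync_status 0 even while chrony is tracking normally:

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

// TIME_ERROR (5): the clock is not synchronized, as defined in <sys/timex.h>.
const timeError = 5

func main() {
	var tx unix.Timex
	state, err := unix.Adjtimex(&tx) // read-only call: tx.Modes is zero, so nothing is changed
	if err != nil {
		log.Fatal(err)
	}
	sync := 1
	if state == timeError {
		sync = 0 // the value that node_timex_sync_status ends up reporting
	}
	fmt.Println("sync_status:", sync)
	// Maxerror is reported in microseconds; node_exporter exposes it
	// as node_timex_maxerror_seconds.
	fmt.Printf("maxerror: %.6f s\n", float64(tx.Maxerror)/1e6)
}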

Lowering severity, as this is not a release-blocking issue and the workaround is to silence a warning-level alert.

Reassigning to the RHCOS team, as this is a kernel/chrony issue and monitoring is just the messenger here.

[1]: https://listengine.tuxfamily.org/chrony.tuxfamily.org/chrony-users/2019/06/msg00000.html

Comment 4 Micah Abbott 2020-09-02 19:30:13 UTC
Thanks, Pawel, for the help triaging; since it is confirmed that this is not a Monitoring issue, it seems the problem lies with chrony (RHCOS ships the same `chrony` package as RHEL 8).

I'm going to send this to the RHEL team to see if they are aware of the issue or can provide additional debugging information.

Comment 24 Pawel Krupa 2020-09-23 09:31:37 UTC
I talked with the node_exporter maintainer, and https://github.com/prometheus/node_exporter/pull/1850 won't be accepted, as it goes against best practices. We figured out a way to fix the alert instead; once https://github.com/prometheus/node_exporter/pull/1851 is merged, I'll start the process of moving this fix into OpenShift.

Comment 34 hongyan li 2020-11-09 12:19:41 UTC
Checked the alert rule: the issue is fixed in payload 4.7.0-0.nightly-2020-10-27-051128.

NodeClockNotSynchronising
min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16
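
This works because the kernel only lets maxerror grow to its 16-second cap while synchronization is genuinely lost; when the unsynchronized flag is set spuriously but chrony is still applying corrections, maxerror stays small, so the second condition filters out the false positive. A toy sketch of the combined condition (illustrative names only, not the actual rule evaluation):

package main

import "fmt"

// shouldFire mirrors the corrected rule: the clock must have reported
// "not synchronized" for the whole 5m window AND maxerror must have
// reached the kernel's 16-second cap.
func shouldFire(minSyncStatusOver5m, maxerrorSeconds float64) bool {
	return minSyncStatusOver5m == 0 && maxerrorSeconds >= 16
}

func main() {
	fmt.Println(shouldFire(0, 0.5)) // false: spurious flag, clock still fine
	fmt.Println(shouldFire(0, 16))  // true: genuine loss of synchronization
}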

Comment 35 Junqi Zhao 2020-11-11 09:49:34 UTC
Tested with 4.7.0-0.nightly-2020-11-10-232221; the NodeClockNotSynchronising alert no longer fires for vSphere:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
{
  "alertname": "CannotRetrieveUpdates",
  "instance": "172.31.249.4:9099"
}
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "instance": null
}
{
  "alertname": "Watchdog",
  "instance": null
}

Comment 49 errata-xmlrpc 2021-02-24 15:17:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 51 Red Hat Bugzilla 2023-09-18 00:22:10 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.