I believe you are hitting a kernel/chrony issue [1] with the setting of the sync_status flag. node_exporter reports sync_status via the timex collector as the "node_timex_sync_status" metric, on which the alert is based. This mechanism relies on information reported by the kernel's adjtimex syscall; in this case the status value reported by the kernel appears to be incorrect, and node_exporter just propagates it. Lowering severity, as this is not a release-blocking issue and the workaround is to silence a warning-level alert. Reassigning to the RHCOS team, since this is a kernel/chrony issue and monitoring is just the messenger here.

[1]: https://listengine.tuxfamily.org/chrony.tuxfamily.org/chrony-users/2019/06/msg00000.html
Thanks Pawel for the help triaging. Since it is confirmed that this is not a monitoring issue, it seems the problem lies with chrony (RHCOS ships the same `chrony` package as RHEL 8). I'm going to send this to the RHEL team to see if they are aware of the issue or can provide additional debugging information.
I talked with the node_exporter maintainers and https://github.com/prometheus/node_exporter/pull/1850 won't be accepted, as it goes against best practices. We figured out a way to fix the alert instead; after https://github.com/prometheus/node_exporter/pull/1851 is merged, I'll start the process of moving this fix into OpenShift.
Checked the alert rule; the issue is fixed in payload 4.7.0-0.nightly-2020-10-27-051128:

NodeClockNotSynchronising:
min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16
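For context, the fixed expression only fires when sync_status has been 0 for the entire 5-minute window and the kernel's estimated error exceeds 16 seconds, so a transient STA_UNSYNC blip no longer triggers it. A sketch of how this could look as a Prometheus rule (the `for`, `labels`, and `annotations` fields here are assumptions for illustration, not copied from the shipped rule):

```yaml
- alert: NodeClockNotSynchronising
  # min_over_time smooths over single-scrape blips: the alert only
  # fires if sync_status was 0 at every sample in the last 5 minutes
  # AND the kernel's max error estimate exceeds 16 seconds.
  expr: |
    min_over_time(node_timex_sync_status[5m]) == 0
    and
    node_timex_maxerror_seconds >= 16
  for: 10m                # assumed hold duration
  labels:
    severity: warning     # assumed severity
  annotations:
    message: Clock on {{ $labels.instance }} is not synchronising.
```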
Tested with 4.7.0-0.nightly-2020-11-10-232221; there is no NodeClockNotSynchronising alert for vSphere now:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
{
  "alertname": "CannotRetrieveUpdates",
  "instance": "172.31.249.4:9099"
}
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "instance": null
}
{
  "alertname": "Watchdog",
  "instance": null
}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.