Bug 1874340 - vmware: NodeClockNotSynchronising alert is triggered in openshift cluster after upgrading from 4.4.16 to 4.5.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-01 05:40 UTC by Vedanti Jaypurkar
Modified: 2024-03-25 16:24 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:17:02 UTC
Target Upstream Version:
Embargoed:
anisal: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 963 0 None closed Bug 1890808: bump mixins to include new etcd alerts 2021-02-05 01:56:50 UTC
Github prometheus-operator kube-prometheus pull 729 0 None closed bump node-exporter rules to latest version 2021-02-05 01:56:50 UTC
Github prometheus node_exporter pull 1851 0 None closed docs/node-mixin: add max error condition to alert about desynchronized clock 2021-02-05 01:56:50 UTC
Red Hat Knowledge Base (Solution) 5368461 0 None None None 2020-10-08 07:43:34 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:17:26 UTC

Internal Links: 1976162

Comment 3 Pawel Krupa 2020-09-02 13:48:06 UTC
I believe you are hitting a kernel/chrony issue[1] with setting the sync_status flag. node_exporter reports the sync status through its timex collector as the "node_timex_sync_status" metric, on which the alert is based. This mechanism relies on information reported by the kernel's adjtimex syscall. In this case, the status value reported by the kernel appears to be incorrect, and node_exporter just propagates it.
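
As a rough illustration of the propagation described above, the sketch below shows how a timex-style collector could reduce the clock state returned by adjtimex to a 0/1 sync flag. The TIME_* values are the standard Linux constants. The function name `sync_status_from_adjtimex` is hypothetical, and the mapping (every state except TIME_ERROR counts as synchronised) is an assumption about the collector's behaviour, not a copy of its code.

```python
# Clock states returned by the Linux adjtimex(2) syscall (see <sys/timex.h>).
TIME_OK = 0
TIME_INS = 1
TIME_DEL = 2
TIME_OOP = 3
TIME_WAIT = 4
TIME_ERROR = 5  # the kernel considers the clock unsynchronised


def sync_status_from_adjtimex(state: int) -> int:
    """Hypothetical helper: map an adjtimex clock state to a 0/1 sync flag.

    Assumption: only TIME_ERROR is treated as "not synchronised". The
    exporter has no independent view of the clock, so a wrong state
    from the kernel yields a wrong metric value.
    """
    return 0 if state == TIME_ERROR else 1
```

Under this assumption, a kernel that keeps reporting TIME_ERROR while chrony is in fact tracking a source would hold the metric at 0, which matches the symptom in this bug.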

Lowering severity as this is not a release-blocking issue and the workaround is to silence a warning-level alert.

Reassigning to the RHCOS team as this is a kernel/chrony issue and monitoring is just the messenger here.

[1]: https://listengine.tuxfamily.org/chrony.tuxfamily.org/chrony-users/2019/06/msg00000.html

Comment 4 Micah Abbott 2020-09-02 19:30:13 UTC
Thanks Pawel for the help triaging; since it is confirmed that this is not a Monitoring issue, it seems the issue lies with chrony (as RHCOS ships the same `chrony` package as RHEL 8).

I'm going to send this to the RHEL team to see if they are aware of the issue or can provide additional debugging information.

Comment 24 Pawel Krupa 2020-09-23 09:31:37 UTC
I talked with the node_exporter maintainer, and https://github.com/prometheus/node_exporter/pull/1850 won't be accepted because it goes against best practices. We figured out a way to fix the alert, and after https://github.com/prometheus/node_exporter/pull/1851 is merged, I'll start the process of moving this fix into OpenShift.

Comment 34 hongyan li 2020-11-09 12:19:41 UTC
Checked alert rule: the issue is fixed in payload 4.7.0-0.nightly-2020-10-27-051128

NodeClockNotSynchronising
min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16
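
To make the combined condition easier to follow, here is a minimal Python sketch of the rule's logic for a single node. The function name and its arguments are hypothetical: the list stands in for the node_timex_sync_status samples in the 5m window, and the float for the current node_timex_maxerror_seconds value. It ignores PromQL label matching and is not how Prometheus evaluates the expression.

```python
from typing import Sequence

MAX_ERROR_THRESHOLD_SECONDS = 16  # threshold taken from the rule above


def clock_not_synchronising(sync_status_5m: Sequence[int],
                            maxerror_seconds: float) -> bool:
    """Hypothetical single-node model of the alert expression.

    sync_status_5m: node_timex_sync_status samples from the last 5 minutes.
    maxerror_seconds: current node_timex_maxerror_seconds value.
    """
    if not sync_status_5m:
        # No samples in the window: the expression returns no series.
        return False
    # min_over_time(...) == 0: at least one sample in the window was 0.
    saw_unsynced_sample = min(sync_status_5m) == 0
    # The added condition: the reported maximum error must also be large.
    error_too_large = maxerror_seconds >= MAX_ERROR_THRESHOLD_SECONDS
    return saw_unsynced_sample and error_too_large
```

With the extra condition, a node whose sync flag is 0 but whose reported maximum error stays small, as in the case discussed in this bug, no longer matches the rule.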

Comment 35 Junqi Zhao 2020-11-11 09:49:34 UTC
Tested with 4.7.0-0.nightly-2020-11-10-232221; the NodeClockNotSynchronising alert no longer fires on vSphere:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
{
  "alertname": "CannotRetrieveUpdates",
  "instance": "172.31.249.4:9099"
}
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "instance": null
}
{
  "alertname": "Watchdog",
  "instance": null
}
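
The same check can be sketched in Python against a saved copy of the /api/v1/alerts response. The helper names and the SAMPLE string are hypothetical; SAMPLE contains only the three alerts listed above and assumes the usual Prometheus response layout (`data.alerts[].labels`).

```python
import json
from typing import Dict, List, Optional


def active_alert_labels(api_response: str) -> List[Dict[str, Optional[str]]]:
    """Mirror the jq filter: keep alertname and instance for each alert."""
    alerts = json.loads(api_response)["data"]["alerts"]
    return [
        {
            "alertname": alert["labels"].get("alertname"),
            "instance": alert["labels"].get("instance"),  # None if absent
        }
        for alert in alerts
    ]


def has_alert(api_response: str, alertname: str) -> bool:
    """Return True if any alert in the response carries this alertname."""
    return any(a["alertname"] == alertname
               for a in active_alert_labels(api_response))


# Hypothetical response body rebuilt from the three alerts shown above.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "CannotRetrieveUpdates",
                    "instance": "172.31.249.4:9099"}},
        {"labels": {"alertname": "AlertmanagerReceiversNotConfigured"}},
        {"labels": {"alertname": "Watchdog"}},
    ]},
})
```

Calling `has_alert(SAMPLE, "NodeClockNotSynchronising")` on this sample reports that the alert is absent, which is the condition verified in this comment.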

Comment 49 errata-xmlrpc 2021-02-24 15:17:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 51 Red Hat Bugzilla 2023-09-18 00:22:10 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

