I believe you are hitting a kernel/chrony issue [1] with the setting of the sync_status flag. node_exporter reports sync_status via the timex collector as the "node_timex_sync_status" metric, on which the alert is based. This mechanism relies on information reported by the kernel's adjtimex syscall; in this case the status value reported by the kernel appears to be incorrect, and node_exporter just propagates it. Lowering severity, as this is not a release-blocking issue and the workaround is to silence a warning-level alert. Reassigning to the RHCOS team, since this is a kernel/chrony issue and monitoring is just the messenger here.

[1]: https://listengine.tuxfamily.org/chrony.tuxfamily.org/chrony-users/2019/06/msg00000.html
Thanks Pawel for the help triaging. Since it is confirmed that this is not a monitoring issue, it seems the problem lies with chrony (RHCOS ships the same `chrony` package as RHEL 8). I'm going to send this to the RHEL team to see if they are aware of the issue or can provide additional debugging information.
I talked with the node_exporter maintainers and https://github.com/prometheus/node_exporter/pull/1850 won't be accepted, as it goes against best practices. We figured out a way to fix the alert instead; after https://github.com/prometheus/node_exporter/pull/1851 is merged, I'll start the process of moving this fix into OpenShift.
Checked the alert rule; the issue is fixed in payload 4.7.0-0.nightly-2020-10-27-051128:

NodeClockNotSynchronising:
min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16
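For context, the fixed expression only fires when sync_status has been 0 for the entire 5-minute window and the kernel's estimated error exceeds 16 seconds, so a transient STA_UNSYNC blip no longer triggers it. A sketch of how this could look as a Prometheus rule (the `for`, `labels`, and `annotations` fields here are assumptions for illustration, not copied from the shipped rule):

```yaml
- alert: NodeClockNotSynchronising
  # min_over_time smooths over single-scrape blips: the alert only
  # fires if sync_status was 0 at every sample in the last 5 minutes
  # AND the kernel's max error estimate exceeds 16 seconds.
  expr: |
    min_over_time(node_timex_sync_status[5m]) == 0
    and
    node_timex_maxerror_seconds >= 16
  for: 10m                # assumed hold duration
  labels:
    severity: warning     # assumed severity
  annotations:
    message: Clock on {{ $labels.instance }} is not synchronising.
```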
Tested with 4.7.0-0.nightly-2020-11-10-232221; there is no NodeClockNotSynchronising alert for vSphere now:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
{
  "alertname": "CannotRetrieveUpdates",
  "instance": "172.31.249.4:9099"
}
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "instance": null
}
{
  "alertname": "Watchdog",
  "instance": null
}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.