Bug 1834252 - vmware: NodeClockNotSynchronising alert is triggered in restricted networks
Summary: vmware: NodeClockNotSynchronising alert is triggered in restricted networks
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913 1867512
 
Reported: 2020-05-11 11:58 UTC by Junqi Zhao
Modified: 2023-12-15 17:52 UTC
CC List: 25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1867512
Environment:
Last Closed: 2020-10-02 14:57:24 UTC
Target Upstream Version:
Embargoed:



Description Junqi Zhao 2020-05-11 11:58:43 UTC
Description of problem:
Created a 4.5.0-0.nightly-2020-05-11-011730 cluster and added 3 RHEL workers to it; the 3 RHEL workers triggered the NodeClockNotSynchronising alert:
alert: NodeClockNotSynchronising
expr: min_over_time(node_timex_sync_status[5m])
  == 0
for: 10m
labels:
  severity: warning
annotations:
  message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured
    on this host.
  summary: Clock not synchronising.


min_over_time(node_timex_sync_status[5m]) == 0
Element 	Value
{endpoint="https",instance="qe-jiazha3-up-05110345-rhel-0",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-kk8k9",service="node-exporter"}	0
{endpoint="https",instance="qe-jiazha3-up-05110345-rhel-1",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-dvnkh",service="node-exporter"}	0
{endpoint="https",instance="qe-jiazha3-up-05110345-rhel-2",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-ljhbh",service="node-exporter"}	0



# oc get node -o wide | grep rhel
qe-jiazha3-up-05110345-rhel-0             Ready    worker   5h19m   v1.18.0-rc.1   10.0.1.9      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.el7.x86_64        cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
qe-jiazha3-up-05110345-rhel-1             Ready    worker   5h19m   v1.18.0-rc.1   10.0.1.7      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.el7.x86_64        cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
qe-jiazha3-up-05110345-rhel-2             Ready    worker   5h19m   v1.18.0-rc.1   10.0.1.8      <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.el7.x86_64        cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://10.0.1.9:9100/metrics' | grep node_timex_sync_status
# HELP node_timex_sync_status Is clock synchronized to a reliable server (1 = yes, 0 = no).
# TYPE node_timex_sync_status gauge
node_timex_sync_status 0


Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-11-011730

How reproducible:
Reproducible on clusters with RHEL workers.

Steps to Reproduce:
1. Create a cluster and add RHEL workers to it.

Actual results:
The NodeClockNotSynchronising alert is triggered on the RHEL workers.

Expected results:
NodeClockNotSynchronising is not triggered.

Additional info:

Comment 2 Pawel Krupa 2020-05-11 13:35:28 UTC
It seems the nodes don't have NTP enabled. From the monitoring team's perspective everything looks fine: the kernel reports that the clock is not synchronized, and the alert fires accordingly. Reassigning to the node team to investigate whether NTP is enabled.
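
For reference, a minimal sketch of checking this directly on a node, assuming `oc debug` access (the node name is a placeholder):

# oc debug node/<node-name>
sh-4.2# chroot /host
sh-4.4# systemctl is-enabled chronyd     # enabled/disabled
sh-4.4# chronyc tracking                 # "Leap status : Not synchronised" means no sync
sh-4.4# timedatectl | grep synchronized  # reflects the same kernel flag as node_timex_sync_status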

Comment 3 Ryan Phillips 2020-05-11 14:52:26 UTC
We need more information on whether ntpd or chronyd is enabled or disabled on the RHEL node.

I believe the RHCOS team could help out the most with this BZ, so I am going to reassign to them.

Comment 4 Colin Walters 2020-05-11 14:58:37 UTC
What platform is this?

The RHCOS team takes care of RHCOS; I think it's on the RHEL base images and/or openshift-ansible to take care of things like NTP.

But see also e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1765609 for Azure.

Comment 5 Colin Walters 2020-05-11 15:02:31 UTC
Some related discussion in https://github.com/openshift/machine-config-operator/issues/629 too.

Comment 6 Russell Teague 2020-05-12 13:53:32 UTC
As part of the RHEL scaleup, 'chrony' is installed. I installed a fresh 4.5 cluster and added RHEL nodes; the chrony service is enabled and running. I checked all three nodes and they all report `node_timex_sync_status 1`.



[root@ip-10-0-139-146 ec2-user]# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2020-05-12 13:30:41 UTC; 1min 54s ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 553 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 526 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 531 (chronyd)
   Memory: 1.1M
   CGroup: /system.slice/chronyd.service
           └─531 /usr/sbin/chronyd

May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal systemd[1]: Starting NTP client/server...
May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal chronyd[531]: chronyd version 3.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER...DEBUG)
May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal chronyd[531]: Frequency 6.153 +/- 0.092 ppm read from /var/lib/chrony/drift
May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal systemd[1]: Started NTP client/server.
May 12 13:30:53 ip-10-0-139-146.us-east-2.compute.internal chronyd[531]: Selected source 169.254.169.123
Hint: Some lines were ellipsized, use -l to show in full.



$ oc get nodes -o wide | grep Maipo
ip-10-0-139-146.us-east-2.compute.internal   Ready    worker   16m   v1.18.2   10.0.139.146   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.el7.x86_64        cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-147-239.us-east-2.compute.internal   Ready    worker   16m   v1.18.2   10.0.147.239   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.el7.x86_64        cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-175-183.us-east-2.compute.internal   Ready    worker   16m   v1.18.2   10.0.175.183   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.el7.x86_64        cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7



$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://10.0.139.146:9100/metrics' | grep node_timex_sync_status
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 62395    0 62395    0     0   624k      0 --:--:-- --:--:-- --:--:--  628k
# HELP node_timex_sync_status Is clock synchronized to a reliable server (1 = yes, 0 = no).
# TYPE node_timex_sync_status gauge
node_timex_sync_status 1


Let me know if any additional information is needed from RHEL hosts.

Comment 7 Pawel Krupa 2020-05-12 14:00:57 UTC
@Junqi, are you able to check whether chrony is working? If it isn't, that is the culprit for the alert. Additionally, if this is a restricted environment, a firewall rule may be preventing NTP synchronization.

Either way, this seems like a problem with the environment configuration, not with the monitoring stack or OpenShift itself.
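
A quick way to see whether chrony can reach any time source at all (a sketch, to be run on an affected node):

sh-4.4# chronyc sources -v   # Reach column stuck at 0 = server never answered
sh-4.4# chronyc activity     # "0 sources online" = nothing reachable

If the alert is firing and every source shows zero reachability, blocked outbound NTP (UDP port 123) is the likely cause.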

Comment 9 Junqi Zhao 2020-05-13 09:34:19 UTC
Checked one 4.5.0-0.nightly-2020-05-12-195228 UPI AWS disconnected-network cluster with http_proxy: the NodeClockNotSynchronising alert is not fired for the RHEL workers, but it is now fired for the CoreOS workers, even though debugging on the CoreOS workers shows chronyd is started:
# oc get node  -o wide
NAME                                        STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-50-153.us-east-2.compute.internal   Ready    master   4h46m   v1.18.2   10.0.50.153   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-51-80.us-east-2.compute.internal    Ready    master   4h46m   v1.18.2   10.0.51.80    <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-53-158.us-east-2.compute.internal   Ready    worker   4h34m   v1.18.2   10.0.53.158   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-53-186.us-east-2.compute.internal   Ready    worker   4h33m   v1.18.2   10.0.53.186   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-53-249.us-east-2.compute.internal   Ready    worker   3h37m   v1.18.2   10.0.53.249   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.8.2.el7.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-56-202.us-east-2.compute.internal   Ready    worker   3h37m   v1.18.2   10.0.56.202   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.8.2.el7.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-60-41.us-east-2.compute.internal    Ready    worker   3h37m   v1.18.2   10.0.60.41    <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.8.2.el7.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-68-254.us-east-2.compute.internal   Ready    worker   4h34m   v1.18.2   10.0.68.254   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-70-104.us-east-2.compute.internal   Ready    master   4h46m   v1.18.2   10.0.70.104   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
{
  "alertname": "CustomResourceDetected",
  "instance": "10.129.0.30:8081"
}
{
  "alertname": "ClusterOperatorDown",
  "instance": "10.0.50.153:9099"
}
{
  "alertname": "ClusterOperatorDown",
  "instance": "10.0.50.153:9099"
}
{
  "alertname": "ClusterOperatorDegraded",
  "instance": "10.0.50.153:9099"
}
{
  "alertname": "ClusterOperatorDegraded",
  "instance": "10.0.50.153:9099"
}
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "ip-10-0-68-254.us-east-2.compute.internal"
}
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "ip-10-0-53-158.us-east-2.compute.internal"
}
{
  "alertname": "NodeClockNotSynchronising",
  "instance": "ip-10-0-53-186.us-east-2.compute.internal"
}
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "instance": null
}
{
  "alertname": "Watchdog",
  "instance": null
}


# oc debug node/ip-10-0-68-254.us-east-2.compute.internal
Starting pod/ip-10-0-68-254us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.68.254
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-05-13 04:50:44 UTC; 4h 39min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 1257 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 1245 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1251 (chronyd)
    Tasks: 1
   Memory: 2.7M
      CPU: 50ms
   CGroup: /system.slice/chronyd.service
           └─1251 /usr/sbin/chronyd

May 13 04:50:44 localhost systemd[1]: Started NTP client/server.
May 13 04:53:08 ip-10-0-68-254 chronyd[1251]: Source 23.152.160.126 replaced with 2607:fa18:3:88::35
May 13 05:27:34 ip-10-0-68-254 chronyd[1251]: Source 50.205.244.111 replaced with 64.62.190.177
May 13 05:57:49 ip-10-0-68-254 chronyd[1251]: Source 2607:fa18:3:88::35 replaced with 2607:f3c8:3803:1::6
May 13 06:36:26 ip-10-0-68-254 chronyd[1251]: Source 199.180.255.17 replaced with 2600:2600::99
May 13 07:06:50 ip-10-0-68-254 chronyd[1251]: Source 2607:f3c8:3803:1::6 replaced with 2606:4700:f1::1
May 13 07:36:59 ip-10-0-68-254 chronyd[1251]: Source 64.62.190.177 replaced with 2600:3c02::13:220
May 13 08:07:07 ip-10-0-68-254 chronyd[1251]: Source 2606:4700:f1::1 replaced with 66.246.75.245
May 13 08:37:17 ip-10-0-68-254 chronyd[1251]: Source 64.225.34.103 replaced with 50.116.52.97
May 13 09:11:21 ip-10-0-68-254 chronyd[1251]: Source 2600:2600::99 replaced with 159.203.82.102

Comment 11 Junqi Zhao 2020-05-13 09:46:50 UTC
Also checked on a non-proxy cluster; there is no such error, so it may be related to the proxy.
# oc get node -o wide
NAME                                        STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-48-64.us-east-2.compute.internal    Ready    master   74m   v1.18.2   10.0.48.64    <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-51-129.us-east-2.compute.internal   Ready    worker   64m   v1.18.2   10.0.51.129   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-59-223.us-east-2.compute.internal   Ready    worker   33m   v1.18.2   10.0.59.223   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.8.2.el7.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-61-19.us-east-2.compute.internal    Ready    master   75m   v1.18.2   10.0.61.19    <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-62-244.us-east-2.compute.internal   Ready    worker   33m   v1.18.2   10.0.62.244   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.8.2.el7.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-63-165.us-east-2.compute.internal   Ready    worker   33m   v1.18.2   10.0.63.165   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.8.2.el7.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-71-197.us-east-2.compute.internal   Ready    master   75m   v1.18.2   10.0.71.197   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-79-210.us-east-2.compute.internal   Ready    worker   63m   v1.18.2   10.0.79.210   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7794    0  7794    0     0   233k      0 --:--:-- --:--:-- --:--:--  237k
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "NodeIPTablesStale",
  "instance": "10.128.2.13:8443"
}
{
  "alertname": "ClusterIPTablesStale",
  "instance": null
}
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "instance": null
}
{
  "alertname": "Watchdog",
  "instance": null
}
{
  "alertname": "CustomResourceDetected",
  "instance": "10.128.0.26:8081"
}

Comment 12 Pawel Krupa 2020-05-13 09:59:45 UTC
The clock on those nodes is not in sync because chrony cannot contact an external NTP server, so the alert is correctly triggered. Closing, as this is due to incorrect environment configuration.


$ oc debug node/ip-10-0-68-254.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# timedatectl 
               Local time: Wed 2020-05-13 09:53:58 UTC
           Universal time: Wed 2020-05-13 09:53:58 UTC
                 RTC time: Wed 2020-05-13 09:53:58
                Time zone: UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no

sh-4.4# chronyc -a 'burst 4/4'
501 Not authorised
sh-4.4# chronyc -a makestep
501 Not authorised

Comment 13 Colin Walters 2020-05-13 13:11:08 UTC
> sh-4.4# chronyc -a 'burst 4/4'
> 501 Not authorised
> sh-4.4# chronyc -a makestep
> 501 Not authorised

That's sadly probably SELinux... spc_t denies random things that unconfined_t doesn't (IMO this is a bug).

xref too https://github.com/openshift/machine-config-operator/issues/629
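
If someone wants to confirm that theory, a sketch of looking for the denial from the debug shell (assumes the standard auditd setup on the host):

sh-4.4# ausearch -m avc -ts recent | grep chrony
sh-4.4# grep 'avc:.*denied' /var/log/audit/audit.log | grep chrony   # fallback without ausearch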

I still think we should consider having the control plane run NTP by default in some scenarios.

(Though, at least for public cloud, what Azure does in providing a virtual hardware clock by default is the best approach.)

In looking through the docs, I also found:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html
We should definitely do that by default in AWS (for both openshift-ansible+RHEL and RHCOS).
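
For reference, that AWS document boils down to pointing chrony at the link-local Amazon Time Sync Service; a sketch of the relevant /etc/chrony.conf line (minpoll/maxpoll values per the AWS doc):

# Amazon Time Sync Service is link-local, so it is reachable even
# without internet egress:
server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4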

Comment 15 Colin Walters 2020-05-13 13:40:30 UTC
Yep we'll take it.

Comment 25 Colin Walters 2020-06-17 17:33:57 UTC
So this bug was originally reported against RHEL7 workers, but we fixed the issue in RHCOS because there are well-defined methods to do that for public cloud.

I probably made a mistake in "repurposing" this bug for aws/azure/gcp.

The affected platform here is VMware; to the best of my knowledge, we do not currently have a default time-synchronization method there.

https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf
is relevant.
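
For anyone triaging on vSphere in the meantime, a sketch of inspecting VMware Tools periodic time sync from the guest (assumes open-vm-tools is installed on the node; note the paper above generally favors running NTP in the guest over periodic sync):

sh-4.4# vmware-toolbox-cmd timesync status   # "Enabled" or "Disabled"
sh-4.4# vmware-toolbox-cmd timesync enable   # opt in to host/guest sync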

Comment 29 Colin Walters 2020-07-13 11:08:08 UTC
Just to recap: as far as I know, OpenShift has never supported automatically synchronizing clocks on restricted vSphere systems. We definitely should, though.

https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf
is relevant.

Comment 37 Colin Walters 2020-10-02 14:49:07 UTC
This never worked; I think we should track this as an RFE.

In the short term, the workaround is to explicitly configure time servers, see e.g.
https://github.com/openshift/machine-config-operator/tree/6c37f9c3bfb16b3d452a6f74195dd5ad30cb759d#applying-configuration-changes-to-the-cluster
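
A minimal sketch of that workaround, assuming a 4.6-era cluster (Ignition spec 3.1.0) and an NTP server reachable from the restricted network; the server name is a placeholder:

# chrony.conf pointing at a reachable time source
cat << 'CHRONY' > chrony.conf
server ntp.example.com iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
CHRONY

# Wrap it in a MachineConfig for the worker pool; the MCO rolls it out
# and reboots the nodes one at a time.
cat << EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-worker-chrony
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/chrony.conf
        mode: 420
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,$(base64 -w0 chrony.conf)
EOF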

Comment 38 Colin Walters 2020-10-02 14:57:24 UTC
Moving to https://issues.redhat.com/browse/RFE-1260

