Description of problem:
Created a 4.5.0-0.nightly-2020-05-11-011730 cluster and added 3 RHEL workers to it; the 3 RHEL workers triggered the NodeClockNotSynchronising alert.

alert: NodeClockNotSynchronising
expr: min_over_time(node_timex_sync_status[5m]) == 0
for: 10m
labels:
  severity: warning
annotations:
  message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.
  summary: Clock not synchronising.

min_over_time(node_timex_sync_status[5m]) == 0
Element                                                                                                                                                                Value
{endpoint="https",instance="qe-jiazha3-up-05110345-rhel-0",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-kk8k9",service="node-exporter"}    0
{endpoint="https",instance="qe-jiazha3-up-05110345-rhel-1",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-dvnkh",service="node-exporter"}    0
{endpoint="https",instance="qe-jiazha3-up-05110345-rhel-2",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-ljhbh",service="node-exporter"}    0

# oc get node -o wide | grep rhel
qe-jiazha3-up-05110345-rhel-0   Ready   worker   5h19m   v1.18.0-rc.1   10.0.1.9   <none>   Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
qe-jiazha3-up-05110345-rhel-1   Ready   worker   5h19m   v1.18.0-rc.1   10.0.1.7   <none>   Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
qe-jiazha3-up-05110345-rhel-2   Ready   worker   5h19m   v1.18.0-rc.1   10.0.1.8   <none>   Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://10.0.1.9:9100/metrics' | grep node_timex_sync_status
# HELP node_timex_sync_status Is clock synchronized to a reliable server (1 = yes, 0 = no).
# TYPE node_timex_sync_status gauge
node_timex_sync_status 0

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-11-011730

How reproducible:
cluster with RHEL workers

Steps to Reproduce:
1. Create a cluster and add RHEL workers to it.
2.
3.

Actual results:
NodeClockNotSynchronising triggered

Expected results:

Additional info:
Seems like the nodes don't have NTP enabled. From the monitoring team's perspective everything looks fine: the kernel exposes data showing the clock is not synchronized, and the alert fires. Reassigning to the node team to investigate whether NTP is enabled.
We need more information on whether ntpd or chronyd is enabled or disabled on the RHEL node. I believe the RHCOS team could help out the most with this BZ, so I am going to reassign to them.
What platform is this? The RHCOS team takes care of RHCOS; I think it's on the RHEL base images and/or openshift-ansible to take care of things like NTP. But see also e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1765609 for Azure.
Some related discussion in https://github.com/openshift/machine-config-operator/issues/629 too.
As part of the RHEL scaleup, 'chrony' is installed. I installed a fresh 4.5 cluster and added RHEL nodes. The chrony service is enabled and running. I checked all three nodes and they all report `node_timex_sync_status 1`.

[root@ip-10-0-139-146 ec2-user]# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2020-05-12 13:30:41 UTC; 1min 54s ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 553 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 526 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 531 (chronyd)
   Memory: 1.1M
   CGroup: /system.slice/chronyd.service
           └─531 /usr/sbin/chronyd

May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal systemd[1]: Starting NTP client/server...
May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal chronyd[531]: chronyd version 3.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER...DEBUG)
May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal chronyd[531]: Frequency 6.153 +/- 0.092 ppm read from /var/lib/chrony/drift
May 12 13:30:41 ip-10-0-139-146.us-east-2.compute.internal systemd[1]: Started NTP client/server.
May 12 13:30:53 ip-10-0-139-146.us-east-2.compute.internal chronyd[531]: Selected source 169.254.169.123
Hint: Some lines were ellipsized, use -l to show in full.

$ oc get nodes -o wide | grep Maipo
ip-10-0-139-146.us-east-2.compute.internal   Ready   worker   16m   v1.18.2   10.0.139.146   <none>   Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-147-239.us-east-2.compute.internal   Ready   worker   16m   v1.18.2   10.0.147.239   <none>   Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-175-183.us-east-2.compute.internal   Ready   worker   16m   v1.18.2   10.0.175.183   <none>   Red Hat Enterprise Linux Server 7.8 (Maipo)   3.10.0-1127.el7.x86_64   cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://10.0.139.146:9100/metrics' | grep node_timex_sync_status
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 62395    0 62395    0     0   624k      0 --:--:-- --:--:-- --:--:--  628k
# HELP node_timex_sync_status Is clock synchronized to a reliable server (1 = yes, 0 = no).
# TYPE node_timex_sync_status gauge
node_timex_sync_status 1

Let me know if any additional information is needed from the RHEL hosts.
@Junqi, are you able to check whether chrony is working? If it isn't, then that is the culprit for the alert. Additionally, if this is a restricted environment, maybe a firewall rule is preventing NTP synchronization. Either way, this seems like a problem with the environment configuration, not with the monitoring stack or OpenShift itself.
Checked one 4.5.0-0.nightly-2020-05-12-195228 UPI AWS disconnected-network cluster with http_proxy: NodeClockNotSynchronising is not fired for the RHEL workers, but it is now fired for the CoreOS workers. However, debugging on the CoreOS workers shows chronyd (NTP) is started.

# oc get node -o wide
NAME                                        STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-50-153.us-east-2.compute.internal   Ready    master   4h46m   v1.18.2   10.0.50.153   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-51-80.us-east-2.compute.internal    Ready    master   4h46m   v1.18.2   10.0.51.80    <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-53-158.us-east-2.compute.internal   Ready    worker   4h34m   v1.18.2   10.0.53.158   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-53-186.us-east-2.compute.internal   Ready    worker   4h33m   v1.18.2   10.0.53.186   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-53-249.us-east-2.compute.internal   Ready    worker   3h37m   v1.18.2   10.0.53.249   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                     3.10.0-1127.8.2.el7.x86_64     cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-56-202.us-east-2.compute.internal   Ready    worker   3h37m   v1.18.2   10.0.56.202   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                     3.10.0-1127.8.2.el7.x86_64     cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-60-41.us-east-2.compute.internal    Ready    worker   3h37m   v1.18.2   10.0.60.41    <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                     3.10.0-1127.8.2.el7.x86_64     cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-68-254.us-east-2.compute.internal   Ready    worker   4h34m   v1.18.2   10.0.68.254   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-70-104.us-east-2.compute.internal   Ready    master   4h46m   v1.18.2   10.0.70.104   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005121551-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
{ "alertname": "CustomResourceDetected", "instance": "10.129.0.30:8081" }
{ "alertname": "ClusterOperatorDown", "instance": "10.0.50.153:9099" }
{ "alertname": "ClusterOperatorDown", "instance": "10.0.50.153:9099" }
{ "alertname": "ClusterOperatorDegraded", "instance": "10.0.50.153:9099" }
{ "alertname": "ClusterOperatorDegraded", "instance": "10.0.50.153:9099" }
{ "alertname": "NodeClockNotSynchronising", "instance": "ip-10-0-68-254.us-east-2.compute.internal" }
{ "alertname": "NodeClockNotSynchronising", "instance": "ip-10-0-53-158.us-east-2.compute.internal" }
{ "alertname": "NodeClockNotSynchronising", "instance": "ip-10-0-53-186.us-east-2.compute.internal" }
{ "alertname": "AlertmanagerReceiversNotConfigured", "instance": null }
{ "alertname": "Watchdog", "instance": null }

# oc debug node/ip-10-0-68-254.us-east-2.compute.internal
Starting pod/ip-10-0-68-254us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.68.254
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl status chronyd
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-05-13 04:50:44 UTC; 4h 39min ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 1257 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 1245 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1251 (chronyd)
    Tasks: 1
   Memory: 2.7M
      CPU: 50ms
   CGroup: /system.slice/chronyd.service
           └─1251 /usr/sbin/chronyd

May 13 04:50:44 localhost systemd[1]: Started NTP client/server.
May 13 04:53:08 ip-10-0-68-254 chronyd[1251]: Source 23.152.160.126 replaced with 2607:fa18:3:88::35
May 13 05:27:34 ip-10-0-68-254 chronyd[1251]: Source 50.205.244.111 replaced with 64.62.190.177
May 13 05:57:49 ip-10-0-68-254 chronyd[1251]: Source 2607:fa18:3:88::35 replaced with 2607:f3c8:3803:1::6
May 13 06:36:26 ip-10-0-68-254 chronyd[1251]: Source 199.180.255.17 replaced with 2600:2600::99
May 13 07:06:50 ip-10-0-68-254 chronyd[1251]: Source 2607:f3c8:3803:1::6 replaced with 2606:4700:f1::1
May 13 07:36:59 ip-10-0-68-254 chronyd[1251]: Source 64.62.190.177 replaced with 2600:3c02::13:220
May 13 08:07:07 ip-10-0-68-254 chronyd[1251]: Source 2606:4700:f1::1 replaced with 66.246.75.245
May 13 08:37:17 ip-10-0-68-254 chronyd[1251]: Source 64.225.34.103 replaced with 50.116.52.97
May 13 09:11:21 ip-10-0-68-254 chronyd[1251]: Source 2600:2600::99 replaced with 159.203.82.102
Also checked on a non-proxy cluster; there is no such error there, so it may be related to the proxy.

# oc get node -o wide
NAME                                        STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-48-64.us-east-2.compute.internal    Ready    master   74m   v1.18.2   10.0.48.64    <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-51-129.us-east-2.compute.internal   Ready    worker   64m   v1.18.2   10.0.51.129   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-59-223.us-east-2.compute.internal   Ready    worker   33m   v1.18.2   10.0.59.223   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                     3.10.0-1127.8.2.el7.x86_64     cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-61-19.us-east-2.compute.internal    Ready    master   75m   v1.18.2   10.0.61.19    <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-62-244.us-east-2.compute.internal   Ready    worker   33m   v1.18.2   10.0.62.244   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                     3.10.0-1127.8.2.el7.x86_64     cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-63-165.us-east-2.compute.internal   Ready    worker   33m   v1.18.2   10.0.63.165   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                     3.10.0-1127.8.2.el7.x86_64     cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el7
ip-10-0-71-197.us-east-2.compute.internal   Ready    master   75m   v1.18.2   10.0.71.197   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8
ip-10-0-79-210.us-east-2.compute.internal   Ready    worker   63m   v1.18.2   10.0.79.210   <none>        Red Hat Enterprise Linux CoreOS 45.81.202005130252-0 (Ootpa)    4.18.0-147.8.1.el8_1.x86_64    cri-o://1.18.0-17.dev.rhaos4.5.gitdea34b9.el8

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts' | jq '.data.alerts[].labels | {alertname,instance}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7794    0  7794    0     0   233k      0 --:--:-- --:--:-- --:--:--  237k
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "NodeIPTablesStale", "instance": "10.128.2.13:8443" }
{ "alertname": "ClusterIPTablesStale", "instance": null }
{ "alertname": "AlertmanagerReceiversNotConfigured", "instance": null }
{ "alertname": "Watchdog", "instance": null }
{ "alertname": "CustomResourceDetected", "instance": "10.128.0.26:8081" }
The clock on those nodes is not in sync because chrony cannot contact an external NTP server, so the alert is correctly triggered. Closing, as this is due to incorrect environment configuration.

$ oc debug node/ip-10-0-68-254.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# timedatectl
               Local time: Wed 2020-05-13 09:53:58 UTC
           Universal time: Wed 2020-05-13 09:53:58 UTC
                 RTC time: Wed 2020-05-13 09:53:58
                Time zone: UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no
sh-4.4# chronyc -a 'burst 4/4'
501 Not authorised
sh-4.4# chronyc -a makestep
501 Not authorised
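For reference, a minimal sketch of how this can be confirmed from an affected node with standard chrony/systemd tooling (nothing cluster-specific is assumed here):

$ oc debug node/<node>   # then chroot /host
sh-4.4# chronyc tracking      # "Leap status: Not synchronised" confirms the clock is not synced
sh-4.4# chronyc sources -v    # all sources showing Reach = 0 suggests NTP (UDP 123) egress is blocked
sh-4.4# timedatectl           # "System clock synchronized: no" matches node_timex_sync_status == 0

If outbound UDP 123 is blocked by the restricted/proxied network, chrony keeps cycling through pool sources without ever selecting one, which is consistent with the repeated "Source ... replaced with ..." messages in the chronyd journal shown earlier.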
> sh-4.4# chronyc -a 'burst 4/4'
> 501 Not authorised
> sh-4.4# chronyc -a makestep
> 501 Not authorised

That's sadly probably SELinux... spc_t denies random things that unconfined_t doesn't (IMO this is a bug). xref also https://github.com/openshift/machine-config-operator/issues/629

I still think we should consider having the control plane run NTP by default in some scenarios. (Though, at least for public cloud, what Azure does in providing a virtual hardware clock by default is best.)

In looking through the docs I also found: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html

We should definitely do that by default in AWS (for both openshift-ansible+RHEL and RHCOS).
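As a sketch of what that would look like per the AWS doc above: the Amazon Time Sync Service is served at the link-local address 169.254.169.123 (the same source the working RHEL nodes selected earlier in this bug), so it is reachable even from disconnected or proxied VPCs. Roughly:

# /etc/chrony.conf addition (sketch, per the AWS doc linked above)
server 169.254.169.123 prefer iburst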
Yep we'll take it.
https://github.com/coreos/fedora-coreos-config/pull/393
So this bug was originally reported against RHEL7 workers, but we fixed the issue in RHCOS because there are well-defined methods to do that for public clouds. I probably made a mistake in "repurposing" this bug for aws/azure/gcp. The affected platform here is VMware - we do not currently have a default time synchronization method there, to the best of my knowledge. https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf is relevant.
Just to recap, as far as I know, OpenShift has never supported automatically synchronizing clocks on restricted vSphere systems. We definitely should though. https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf is relevant.
This never worked; I think we should track this as an RFE. In the short term, the workaround is to explicitly configure time servers; see e.g. https://github.com/openshift/machine-config-operator/tree/6c37f9c3bfb16b3d452a6f74195dd5ad30cb759d#applying-configuration-changes-to-the-cluster
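For illustration, a rough sketch of that workaround as a MachineConfig, following the MCO docs linked above; ntp.example.com is a placeholder for a time server actually reachable from the restricted network:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-worker-chrony
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - path: /etc/chrony.conf
        filesystem: root
        mode: 420
        contents:
          # URL-encoded plaintext of:
          #   server ntp.example.com iburst     <- placeholder, substitute a reachable time server
          #   driftfile /var/lib/chrony/drift
          #   makestep 1.0 3
          #   rtcsync
          source: data:,server%20ntp.example.com%20iburst%0Adriftfile%20/var/lib/chrony/drift%0Amakestep%201.0%203%0Artcsync%0A

Applied with `oc create -f`, the MCO rolls the file out to the worker pool; an equivalent MachineConfig with the "master" role label would cover the control plane.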
Moving to https://issues.redhat.com/browse/RFE-1260