Bug 1765609 - [Azure] Virtual machines are unable to reach public NTP servers
Summary: [Azure] Virtual machines are unable to reach public NTP servers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.5
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1828342
Depends On:
Blocks: 1186913 1828342 1834565 1837039
 
Reported: 2019-10-25 14:59 UTC by Nils
Modified: 2023-09-07 20:52 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1828342 1834565 1837039
Environment:
Last Closed: 2020-07-13 17:11:31 UTC
Target Upstream Version:
Embargoed:




Links:
GitHub openshift/machine-config-operator pull 1682 (closed): Bug 1765609: templates: add chrony config on Azure for host time synchronization (last updated 2021-02-08 20:12:24 UTC)
Red Hat Knowledge Base Solution 4863201 (last updated 2020-02-28 15:57:09 UTC)
Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:12:02 UTC)

Description Nils 2019-10-25 14:59:27 UTC
Description of problem:

After cluster installation, the control plane virtual machines are unable to reach public NTP servers. The chrony daemon running on the RHCOS virtual machines appears to be configured to use the 2.rhel.pool.ntp.org pool as its NTP source.

Investigation has shown that, by default, virtual machines that have been added to a public load balancer are unable to send outgoing UDP packets (I have tested both NTP and DNS).
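
As a quick check, the load balancer's explicit outbound rules can be listed with the az CLI. This is a sketch: the resource group and load balancer names follow the installer's naming used in the workaround below and are assumptions for any given cluster.

cluster_id="cluster01-sjtmm"
# List explicit outbound rules on the cluster's public load balancer;
# an empty result suggests the cluster is relying on the default SNAT behaviour.
az network lb outbound-rule list -g "${cluster_id}-rg" --lb-name "${cluster_id}-public-lb" -o table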

Version-Release number of selected component (if applicable):

OpenShift 4.2.1 (https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.2.1/)

How reproducible: Always

Steps to Reproduce:
1. Install a cluster using default settings on Azure (installer provisioned infrastructure).
2. SSH into a master node. (I have linked the virtual network to another network where I'm running a VPN appliance.)
3. Run the "chronyc sources" command.

Actual results:

NTP synchronisation fails, since no NTP peers are reachable:

[core@cluster01-sjtmm-master-0 ~]$ sudo chronyc sources
210 Number of sources = 8
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^? no-reverse-yet.comsave.nl     0   9     0     -     +0ns[   +0ns] +/-    0ns
^? mail.klausen.dk               0   9     0     -     +0ns[   +0ns] +/-    0ns
^? ntp1.trans-ix.nl              0   9     0     -     +0ns[   +0ns] +/-    0ns
^? aardbei.vanderzwet.net        0   9     0     -     +0ns[   +0ns] +/-    0ns
^? x.ns.gin.ntt.net              0   6     0     -     +0ns[   +0ns] +/-    0ns
^? services.freshdot.net         0   6     0     -     +0ns[   +0ns] +/-    0ns
^? ntp5.linocomm.net             0   6     0     -     +0ns[   +0ns] +/-    0ns
^? alta.fancube.com              0   6     0     -     +0ns[   +0ns] +/-    0ns

# Outbound DNS requests are also not possible:
[core@cluster01-sjtmm-master-0 ~]$ dig www.redhat.com @8.8.8.8

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-17.P2.el8_0.1 <<>> www.redhat.com @8.8.8.8
;; global options: +cmd
;; connection timed out; no servers could be reached

Expected results:
(the output below was received after implementing a workaround)

The 'chronyc sources' command should display peers with a valid stratum.

[core@cluster01-sjtmm-master-0 ~]$ sudo chronyc sources
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^? mail.klausen.dk               2   6     7     1    -50ms[  -50ms] +/- 2217us
^? ntp1.trans-ix.nl              2   6     7     0    -51ms[  -51ms] +/-   34ms
^? beetjevreemd.nl               1   6     3     2    -50ms[  -50ms] +/- 3035us
^? server01.colocenter.nl        2   6     3     2    -51ms[  -51ms] +/-   33ms

Additional info:

Time drift becomes a problem for etcd, impacting the stability of the cluster.

The following workaround seems to work: disable outbound SNAT on the public load balancer and create an explicit outbound rule that allows all traffic:

az network lb rule update -g cluster01-ggrf9-rg --lb-name cluster01-ggrf9-public-lb --name api-internal --disable-outbound-snat true
az network lb outbound-rule create -g cluster01-ggrf9-rg --lb-name cluster01-ggrf9-public-lb --frontend-ip-configs public-lb-ip --protocol All --address-pool cluster01-ggrf9-public-lb-control-plane --name AllowOutbound

Terraform has an azurerm_lb_outbound_rule resource, so this could be configured by the installer.

Additionally, time drift should be monitored by the internal Prometheus installation. (https://github.com/prometheus/node_exporter/blob/master/docs/TIME.md)
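
For example, node_exporter's timex collector (documented at the link above) exposes the kernel's clock state. A rough check against a node, assuming node_exporter is reachable on its default port 9100:

# Inspect clock offset and kernel sync status via the timex collector.
curl -s http://localhost:9100/metrics | grep -E '^node_timex_(offset_seconds|sync_status)'

An alert on abs(node_timex_offset_seconds) could then flag drift before etcd is affected.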

Comment 1 Nils 2019-11-05 15:23:52 UTC
This also needs to be done for the worker nodes. The following seems to work:

cluster_id="cluster02-tflm4"
# Disable outbound SNAT on every load-balancing rule of the workers' public LB
for rule in $(az network lb rule list -g "${cluster_id}-rg" --lb-name "${cluster_id}" --query [].name -o tsv)
do
  az network lb rule update -g "${cluster_id}-rg" --lb-name "${cluster_id}" --name "${rule}" --disable-outbound-snat true
done
# Create an explicit outbound rule for all protocols on the workers' backend pool
frontend_ip="$(az network lb frontend-ip list -g "${cluster_id}-rg" --lb-name "${cluster_id}" --query [0].name -o tsv)"
az network lb outbound-rule create -g "${cluster_id}-rg" --lb-name "${cluster_id}" --protocol All --address-pool "${cluster_id}" --frontend-ip-configs "${frontend_ip}" --name AllowOutbound

Comment 3 Scott Dodson 2020-03-06 19:30:33 UTC
After reviewing this, the best solution looks to be that, when running on Azure, we should customize the NTP configuration to use host-local time syncing, per Azure best practices.
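
In chrony terms, host-local syncing means replacing the network NTP sources with the PTP clock the Hyper-V host exposes to the guest. A minimal sketch of the idea, assuming the device shows up as /dev/ptp_hyperv (the actual fix, per later comments, generates its configuration differently):

# Use the Azure host's PTP clock as chrony's reference instead of public NTP.
echo 'refclock PHC /dev/ptp_hyperv poll 3 dpoll -2 offset 0' | sudo tee -a /etc/chrony.conf
sudo systemctl restart chronyd
sudo chronyc sources   # should now list a PHC0 reference clock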

Comment 9 Nils 2020-03-17 11:55:03 UTC
Is there any more information I'm being asked to provide? If not, can we remove the needinfo flag set on my email address?

Comment 17 Sinny Kumari 2020-04-24 11:24:35 UTC
The fix for this issue is up for review: https://github.com/openshift/machine-config-operator/pull/1658. Please review it and, if needed, tag the right people. Thanks.

Comment 18 Antonio Murdaca 2020-04-27 12:41:06 UTC
The MCO PR will fix the specific issue with an Azure cluster; it will not, however, take care of the bootstrap host. It might make sense in the future to have something like the MCO create and configure the bootstrap host, but that is not the case today and it is not trivial. We're going to open a PR to the installer to take care of this issue on the bootstrap host as well, with the gotcha that the two chrony configs will need to be kept in sync.

Comment 21 Sinny Kumari 2020-04-30 04:56:22 UTC
*** Bug 1828342 has been marked as a duplicate of this bug. ***

Comment 22 Steve Milner 2020-05-07 21:07:22 UTC
PR merged. Should this move to MODIFIED?

Comment 24 Colin Walters 2020-05-08 19:21:54 UTC
After this lands in 4.5 and we've tested it, I'll clone this for 4.4 and probably 4.3 too.

Comment 26 Colin Walters 2020-05-12 00:28:03 UTC
Verified this:

walters@toolbox /s/w/rhcos-master> oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-11-211039   True        False         2m39s   Cluster version is 4.5.0-0.nightly-2020-05-11-211039
walters@toolbox /s/w/rhcos-master> oc debug node/ci-ln-0wb12s2-002ac-t2xsf-master-0
...
[root@ci-ln-0wb12s2-002ac-t2xsf-master-0 /]# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d202a8b981606f4ca691b6f87d847782f95b666963c81b207aa091ce1f806198
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202005111729-0 (2020-05-11T17:33:22Z)

[root@ci-ln-0wb12s2-002ac-t2xsf-master-0 /]# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
#* PHC0                          0   3   377     9   +772ns[ +525ns] +/-  339ns
[root@ci-ln-0wb12s2-002ac-t2xsf-master-0 /]#

Comment 27 Michael Nguyen 2020-05-12 21:22:14 UTC
$ oc get node
NAME                                                STATUS   ROLES    AGE   VERSION
ci-ln-tfmy982-002ac-mgnvn-master-0                  Ready    master   33m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-master-1                  Ready    master   33m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-master-2                  Ready    master   33m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-worker-centralus1-c5hhw   Ready    worker   18m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-worker-centralus2-xp5k9   Ready    worker   18m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-worker-centralus3-fhmbq   Ready    worker   18m   v1.18.2
$ oc debug node/ci-ln-tfmy982-002ac-mgnvn-master-0
Starting pod/ci-ln-tfmy982-002ac-mgnvn-master-0-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:434a0401a0fb22773e49fda5127c9610145d21b67435e8bc822bdc860aafaf98
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202005121031-0 (2020-05-12T10:34:32Z)
sh-4.4# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
#* PHC0                          0   3   377     6    -17us[  -25us] +/- 8808ns
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-12-163804   True        False         17m     Cluster version is 4.5.0-0.nightly-2020-05-12-163804
$ oc debug node/ci-ln-tfmy982-002ac-mgnvn-worker-centralus3-fhmbq
Starting pod/ci-ln-tfmy982-002ac-mgnvn-worker-centralus3-fhmbq-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:434a0401a0fb22773e49fda5127c9610145d21b67435e8bc822bdc860aafaf98
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202005121031-0 (2020-05-12T10:34:32Z)
sh-4.4# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
#* PHC0                          0   3   377    10  -5379ns[-6278ns] +/- 1374ns
sh-4.4# cd /run/systemd/generator/chronyd.service.d/
sh-4.4# ls
coreos-azure-phc.conf
sh-4.4# cat coreos-azure-phc.conf 
[Service]
ExecStart=
ExecStart=/usr/sbin/chronyd -f /run/coreos-azure-phc-chrony.conf $OPTIONS
sh-4.4# cat /dev/kmsg | grep PHC
12,654,17038655,-;coreos-azure-phc: Updated chrony to use Azure PHC

Comment 30 errata-xmlrpc 2020-07-13 17:11:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

