Description of problem:

After cluster installation, the control plane virtual machines are unable to reach public NTP servers. The chrony daemon running on the RHCOS virtual machines is configured to use the 2.rhel.pool.ntp.org pool as an NTP source. Investigation has shown that, by default, virtual machines that have been added to a public load balancer are unable to send outgoing UDP packets (I have tested both NTP and DNS).

Version-Release number of selected component (if applicable):

OpenShift 4.2.1 (https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.2.1/)

How reproducible:

Always

Steps to Reproduce:
1. Install a cluster using default settings on Azure (installer-provisioned infrastructure).
2. SSH into a master node. (I have linked the virtual network to another network where I'm running a VPN appliance.)
3. Run the "chronyc sources" command.

Actual results:

NTP synchronisation fails, since no NTP peers are reachable:

[core@cluster01-sjtmm-master-0 ~]$ sudo chronyc sources
210 Number of sources = 8
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? no-reverse-yet.comsave.nl     0   9     0     -     +0ns[   +0ns] +/-    0ns
^? mail.klausen.dk               0   9     0     -     +0ns[   +0ns] +/-    0ns
^? ntp1.trans-ix.nl              0   9     0     -     +0ns[   +0ns] +/-    0ns
^? aardbei.vanderzwet.net        0   9     0     -     +0ns[   +0ns] +/-    0ns
^? x.ns.gin.ntt.net              0   6     0     -     +0ns[   +0ns] +/-    0ns
^? services.freshdot.net         0   6     0     -     +0ns[   +0ns] +/-    0ns
^? ntp5.linocomm.net             0   6     0     -     +0ns[   +0ns] +/-    0ns
^? alta.fancube.com              0   6     0     -     +0ns[   +0ns] +/-    0ns

# Outbound DNS requests are also not possible:

[core@cluster01-sjtmm-master-0 ~]$ dig www.redhat.com @8.8.8.8

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-17.P2.el8_0.1 <<>> www.redhat.com @8.8.8.8
;; global options: +cmd
;; connection timed out; no servers could be reached

Expected results:

(The output below was received after implementing a workaround.)

The 'chronyc sources' command should display peers with a valid stratum.
[core@cluster01-sjtmm-master-0 ~]$ sudo chronyc sources
210 Number of sources = 4
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? mail.klausen.dk               2   6     7     1    -50ms[  -50ms] +/- 2217us
^? ntp1.trans-ix.nl              2   6     7     0    -51ms[  -51ms] +/-   34ms
^? beetjevreemd.nl               1   6     3     2    -50ms[  -50ms] +/- 3035us
^? server01.colocenter.nl        2   6     3     2    -51ms[  -51ms] +/-   33ms

Additional info:

Time drift becomes a problem for etcd, impacting the stability of the cluster.

The following workaround seems to work: disable outgoing SNAT on the public load balancer, and create an explicit outbound rule which allows all traffic:

az network lb rule update -g cluster01-ggrf9-rg --lb-name cluster01-ggrf9-public-lb --name api-internal --disable-outbound-snat true

az network lb outbound-rule create -g cluster01-ggrf9-rg --lb-name cluster01-ggrf9-public-lb --frontend-ip-configs public-lb-ip --protocol All --address-pool cluster01-ggrf9-public-lb-control-plane --name AllowOutbound

Terraform has an azurerm_lb_outbound_rule resource, so this could be configured by the installer.

Additionally, time drift should be monitored by the internal Prometheus installation (https://github.com/prometheus/node_exporter/blob/master/docs/TIME.md).
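As a sketch of the monitoring suggestion: node_exporter's timex collector exposes node_timex_offset_seconds and node_timex_sync_status, so a Prometheus alerting rule along the following lines could surface drift before it destabilises etcd. The alert names and the 50 ms threshold are illustrative, not an existing upstream rule:

```yaml
groups:
  - name: time-drift                        # illustrative rule group
    rules:
      - alert: NodeClockNotSynchronising    # illustrative alert name
        # node_timex_sync_status is 0 when the kernel clock is unsynchronised
        expr: node_timex_sync_status == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          message: "Clock on {{ $labels.instance }} is not synchronising."
      - alert: NodeClockSkewDetected        # illustrative threshold
        # offset of the system clock from its reference, in seconds
        expr: abs(node_timex_offset_seconds) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          message: "Clock on {{ $labels.instance }} is off by more than 50ms."
```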
This also needs to be done for the worker nodes. The following seems to work:

cluster_id="cluster02-tflm4"

# Disable outbound SNAT on every load-balancing rule
for rule in $(az network lb rule list -g "${cluster_id}-rg" --lb-name "${cluster_id}" --query [].name -o tsv)
do
    az network lb rule update -g "${cluster_id}-rg" --lb-name "${cluster_id}" --name "${rule}" --disable-outbound-snat true
done

# Create an explicit outbound rule on the first frontend IP configuration
frontend_ip="$(az network lb frontend-ip list -g "${cluster_id}-rg" --lb-name "${cluster_id}" --query [0].name -o tsv)"

az network lb outbound-rule create -g "${cluster_id}-rg" --lb-name "${cluster_id}" --protocol All --address-pool "${cluster_id}" --frontend-ip-configs "${frontend_ip}" --name AllowOutbound
After reviewing this, the best solution looks to be that when running on Azure we should customize the NTP configuration to use host-local time syncing, per the Azure best practices.
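Azure's time-sync guidance recommends pointing chrony at the Hyper-V PTP hardware clock (PHC) exposed to the guest, instead of external NTP pools. A minimal sketch of such a chrony configuration follows; the /dev/ptp_hyperv device name and the poll/dxt tuning values come from Azure's published guidance and are assumptions here, not necessarily what the installer/MCO will ship:

```
# Use the Hyper-V PTP device as the time source instead of
# public NTP pools that are unreachable behind the load balancer.
refclock PHC /dev/ptp_hyperv poll 3 dxt 0.0005 refid PHC

# Step the clock on large initial offsets; keep a drift file as usual.
makestep 1.0 -1
driftfile /var/lib/chrony/drift
```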
Is there any more information I should provide? If not, can the needinfo flag set for my email address be removed?
The fix for this issue is up for review: https://github.com/openshift/machine-config-operator/pull/1658. Please review it and tag in the right people if needed. Thanks.
The MCO PR will fix the specific issue on an Azure cluster; it will not, however, take care of the bootstrap host. It might make sense in the future to have something like the MCO create and configure the bootstrap host, but that's not the case today and it's not trivial. We're going to open a PR against the installer as well to address this on the bootstrap host, with the gotcha that the chrony configs need to be kept in sync.
https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/918
*** Bug 1828342 has been marked as a duplicate of this bug. ***
MR merged. Should this move to MODIFIED?
https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/934
After this lands in 4.5 and we've tested it, I'll clone this for 4.4 and probably 4.3 too.
This change is part of https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-05-11-211039 at least.
Verified this:

walters@toolbox /s/w/rhcos-master> oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-11-211039   True        False         2m39s   Cluster version is 4.5.0-0.nightly-2020-05-11-211039

walters@toolbox /s/w/rhcos-master> oc debug node/ci-ln-0wb12s2-002ac-t2xsf-master-0
...
[root@ci-ln-0wb12s2-002ac-t2xsf-master-0 /]# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d202a8b981606f4ca691b6f87d847782f95b666963c81b207aa091ce1f806198
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202005111729-0 (2020-05-11T17:33:22Z)
[root@ci-ln-0wb12s2-002ac-t2xsf-master-0 /]# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PHC0                          0   3   377     9   +772ns[ +525ns] +/-  339ns
[root@ci-ln-0wb12s2-002ac-t2xsf-master-0 /]#
$ oc get node
NAME                                                STATUS   ROLES    AGE   VERSION
ci-ln-tfmy982-002ac-mgnvn-master-0                  Ready    master   33m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-master-1                  Ready    master   33m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-master-2                  Ready    master   33m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-worker-centralus1-c5hhw   Ready    worker   18m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-worker-centralus2-xp5k9   Ready    worker   18m   v1.18.2
ci-ln-tfmy982-002ac-mgnvn-worker-centralus3-fhmbq   Ready    worker   18m   v1.18.2

$ oc debug node/ci-ln-tfmy982-002ac-mgnvn-master-0
Starting pod/ci-ln-tfmy982-002ac-mgnvn-master-0-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:434a0401a0fb22773e49fda5127c9610145d21b67435e8bc822bdc860aafaf98
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202005121031-0 (2020-05-12T10:34:32Z)
sh-4.4# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PHC0                          0   3   377     6    -17us[  -25us] +/- 8808ns
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-12-163804   True        False         17m     Cluster version is 4.5.0-0.nightly-2020-05-12-163804

$ oc debug node/ci-ln-tfmy982-002ac-mgnvn-worker-centralus3-fhmbq
Starting pod/ci-ln-tfmy982-002ac-mgnvn-worker-centralus3-fhmbq-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree status -b
State: idle
AutomaticUpdates: disabled
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:434a0401a0fb22773e49fda5127c9610145d21b67435e8bc822bdc860aafaf98
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.81.202005121031-0 (2020-05-12T10:34:32Z)
sh-4.4# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#* PHC0                          0   3   377    10  -5379ns[-6278ns] +/- 1374ns
sh-4.4# cd /run/systemd/generator/chronyd.service.d/
sh-4.4# ls
coreos-azure-phc.conf
sh-4.4# cat coreos-azure-phc.conf
[Service]
ExecStart=
ExecStart=/usr/sbin/chronyd -f /run/coreos-azure-phc-chrony.conf $OPTIONS
sh-4.4# cat /dev/kmsg | grep PHC
12,654,17038655,-;coreos-azure-phc: Updated chrony to use Azure PHC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409