Description of problem:
The overcloud deployment with pre-provisioned nodes failed the NTP sync test in the all-nodes validation task.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.0.1 (Train)
python3-tripleo-common-11.3.3-0.20200403044648.56c0fd5.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200403044648.56c0fd5.el8ost.noarch
ansible-role-tripleo-modify-image-1.1.1-0.20200302230738.bb6f78d.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200318124452.3fd14c9.el8ost.noarch
puppet-tripleo-11.4.1-0.20200402130301.b4678ba.el8ost.noarch
tripleo-ansible-0.4.2-0.20200404124614.67005aa.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200405044622.fdce01f.el8ost.noarch
ansible-tripleo-ipsec-9.2.1-0.20200302220300.0c8693c.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200314025720.8c91b46.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200302235857.a6fef08.el8ost.noarch
python3-tripleoclient-12.3.2-0.20200405044622.fdce01f.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200403044648.56c0fd5.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200405044622.ec9970c.el8ost.noarch

How reproducible:
100% reproduced in the Scale lab environment

Steps to Reproduce:
1. The pre-provisioned nodes are configured with the default NTP server.

# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* foreman.rdu2.scalelab.re>     2   8   377   188  -1916ns[  -39us] +/-   42ms

2. When running the overcloud deploy command with the NTP parameter, the deployment failed at the task below.

TASK [AllNodesValidationConfig] ************************************************
Thursday 23 April 2020  05:13:34 +0000 (0:00:00.965)       0:03:18.331 ********
fatal: [f08-h17-b07-5039ms]: FAILED!
=> {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.11 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.11 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]} fatal: [f08-h17-b08-5039ms]: FAILED! 
=> {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.12 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.12 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]} fatal: [f08-h20-b01-5039ms]: FAILED! 
=> {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.13 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.13 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]} fatal: [f08-h20-b02-5039ms]: FAILED! 
=> {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.15 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.15 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.15 for local network 192.168.24.0/24.\r\nPing to 192.168.24.15 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.15 for local network 192.168.24.0/24.\r\nPing to 192.168.24.15 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.15 for local network 192.168.24.0/24.", "Ping to 192.168.24.15 succeeded.", "SUCCESS", "Trying to ping 192.168.24.15 for local network 192.168.24.0/24.", "Ping to 192.168.24.15 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]}

3. Ideally, config-download should override the NTP hostname in the "pre-steps" task, but this conflicts with the condition in the playbook steps shown below.
cat /var/lib/mistral/overcloud/deploy_steps_playbook.yaml

    - name: AllNodesValidationConfig
      script: all_nodes_validation_script.sh
      environment:
        validate_controllers_icmp: "{{ validate_controllers_icmp }}"
        validate_gateways_icmp: "{{ validate_gateways_icmp }}"
        validate_fqdn: "{{ validate_fqdn }}"
        validate_ntp: "{{ validate_ntp }}"
        ping_test_ips: "{{ ping_test_ips | to_json }}"
        tripleo_role_name: "{{ tripleo_role_name }}"
    - name: ArtifactsConfig
      script: deploy-artifacts.sh
      environment:
        artifact_urls: "{{ deploy_artifact_urls }}"
  tags:
    - overcloud
    - pre_deploy_steps
- hosts: Controller
  name: Controller Host prep steps
  gather_facts: "{{ gather_facts | default(false) }}"
  any_errors_fatal: yes
  vars:
    bootstrap_server_id: 3899d15d-f51b-48d9-98a3-a0039f83eb3a
    deploy_identifier: 1587646089
    enable_debug: True
    enable_puppet: True
    container_cli: podman
    container_log_stdout_path: /var/log/containers/stdouts
    container_healthcheck_disabled: False
    docker_puppet_debug: False
    docker_puppet_process_count: 6
    docker_puppet_mount_host_puppet: True
  tasks:
    - name: Controller Host prep steps
      delegate_to: localhost
      run_once: true
      debug:
        msg: Use --start-at-task "Controller Host prep steps" to resume from this task
    - import_tasks: Controller/host_prep_tasks.yaml
  tags:
    - overcloud
    - host_prep_steps

4. As a workaround, we commented out the NTP validation in the all-nodes script, which lets the deployment pass the condition check before the "prep steps".

/usr/share/openstack-tripleo-heat-templates/validation-scripts/all-nodes.sh

## run chrony/ntpdate as available
#function _run_ntp_sync() {
#    local NTP_SERVER=$1
#    if ! type ntpdate 2>/dev/null; then
#        chronyd -Q "server $NTP_SERVER iburst"
#    else
#        ntpdate -qud $NTP_SERVER
#    fi
#}
#
## Verify at least one time source is available.
#function ntp_check() {
#    NTP_SERVERS=$(hiera ntp::servers nil |tr -d '[],"')
#    if [[ "$NTP_SERVERS" != "nil" ]];then
#        echo -n "Testing NTP..."
#        NTP_SUCCESS=0
#        for NTP_SERVER in $NTP_SERVERS; do
#            set +e
#            NTPDATE_OUT=$(_run_ntp_sync $NTP_SERVER 2>&1)
#            NTPDATE_EXIT=$?
#            set -e
#            if [[ "$NTPDATE_EXIT" == "0" ]];then
#                NTP_SUCCESS=1
#                break
#            else
#                NTPDATE_OUT_FULL="$NTPDATE_OUT_FULL $NTPDATE_OUT"
#            fi
#        done
#        if [[ "$NTP_SUCCESS" == "0" ]];then
#            echo "FAILURE"
#            echo "$NTPDATE_OUT_FULL"
#            exit 1
#        fi
#        echo "SUCCESS"
#    fi
#}

Actual results:
The NTP server on the pre-provisioned nodes was not overridden under the existing playbook step condition, and the NTP test failed.

Expected results:
For pre-provisioned node integration, the NTP validation should not fail, because the overcloud deploy command was passed the NTP server details.
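A note on the log above: each node prints "Failed to start Hiera" immediately before "Testing NTP...FAILURE". One possible reading is that the lookup in ntp_check returned an empty string rather than "nil", so the guard passed, the server loop never ran, and NTP_SUCCESS stayed 0. The sketch below reproduces only that control flow. It assumes the Hiera error goes to stderr and nothing reaches stdout, which this report does not confirm. The stub functions and the name ntp_guard_result are hypothetical, and the real sync attempt is replaced by a placeholder.

```shell
#!/bin/bash
# Sketch only: stubs stand in for the real `hiera` CLI; no network access.

# Hypothetical stub: Hiera cannot find its config, reports on stderr,
# and prints nothing on stdout (assumed behaviour, not confirmed here).
hiera_missing_config() {
    echo "Failed to start Hiera: RuntimeError: Config file not found" >&2
    return 1
}

# Hypothetical stub: Hiera works but ntp::servers is unset, so the
# default value "nil" given on the command line is printed.
hiera_unset() {
    echo "nil"
}

# Mirrors the guard and loop of ntp_check in all-nodes.sh, with the
# lookup command injected so it can be stubbed. stderr is discarded
# here only to keep the output clean.
ntp_guard_result() {
    local lookup=$1 servers server success=0
    servers=$("$lookup" 2>/dev/null | tr -d '[],"')
    if [[ "$servers" == "nil" ]]; then
        echo "SKIPPED"
        return 0
    fi
    for server in $servers; do
        success=1   # placeholder for: chronyd -Q "server $server iburst"
        break
    done
    if [[ "$success" == "0" ]]; then
        echo "FAILURE"
        return 1
    fi
    echo "SUCCESS"
}
```

With these stubs, `ntp_guard_result hiera_missing_config` reports FAILURE without contacting any server, while `ntp_guard_result hiera_unset` skips the test. If that reading is right, the failure may say more about the missing hiera.yaml on the pre-provisioned nodes than about NTP reachability.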
I was able to pinpoint a simpler workaround: add this parameter to your deployment:

```
---
parameter_defaults:
  ValidateNtp: false
```
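For anyone applying the ValidateNtp workaround, here is a minimal sketch of putting it into its own environment file. The function name and the default file name disable-ntp-validation.yaml are hypothetical choices for illustration; only the YAML content comes from this comment.

```shell
#!/bin/bash
# Writes the ValidateNtp override to an environment file and prints
# the path. The default file name is a hypothetical choice.
write_ntp_validation_override() {
    local env_file=${1:-disable-ntp-validation.yaml}
    cat > "$env_file" <<'EOF'
---
parameter_defaults:
  ValidateNtp: false
EOF
    echo "$env_file"
}
```

The resulting file would then be passed to the existing deploy command with an additional `-e <file>` argument, alongside the environment files already in use.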
Shouldn't pre-deployed nodes get the right NTP server before the deploy itself? If not, we'd probably want to ensure that the ntp/chrony configuration happens on pre-deployed nodes as well. So this is either a doc bug (pre-deployed nodes must have working NTP/chrony) or a DF bug (NTP/chrony must be configured by tripleo/director on pre-deployed nodes).
I faced the NTP problem in the alias lab while deploying the overcloud. There are multiple settings that should be tried before changing the validation:

1. Set masquerade to be on in undercloud.conf, since the default gateway is set to the undercloud node.
2. Use an NTP server that is either clock.redhat.com or foreman.rdu2.scalelab.redhat.com.
3. In the validation checks, ping an IP that is not directly on any of the subnets the node is physically a part of (192.168.24.0 or 10.1.39.x in this case). This would ensure that the packets are actually routed through the default gateway.
4. If the default gateway is set to the undercloud node, it would need to NAT in the default lab environment, which will not route 192.168.24.0 packets. For this, masquerade needs to be on, as stated earlier.
5. If the default gateway is not set to the undercloud node, the default is the lab's DHCP interface, which, being fully routed through the rest of the RH network, will work fine by default.

There's a separate issue here altogether: it seems that there are policies in place outside the lab networks which prevent NTP sync to servers other than clock.redhat.com. I haven't exhaustively tested the "whitelist" of servers that can be synced to, but clock.redhat.com definitely works. foreman.rdu2.scalelab.redhat.com should work as well. I had escalated this to wfoster, who verified it. I created a corresponding issue, so the default is now an NTP server reachable from within the lab networks, at least for the alias lab, where my setup is.
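Point 3 depends on knowing whether a candidate ping target sits inside one of the node's own subnets. A minimal sketch of that membership test in plain bash follows. The function names are hypothetical, only the 192.168.24.0/24 prefix is taken from the log in this report, and the code assumes well-formed dotted-quad IPv4 input without leading zeros.

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local IFS=.
    local -a octets
    read -r -a octets <<< "$1"
    echo $(( (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3] ))
}

# Return 0 if the address ($1) lies inside the CIDR block ($2), else 1.
ip_in_cidr() {
    local ip=$1 net=${2%/*} bits=${2#*/}
    local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    local ip_int net_int
    ip_int=$(ip_to_int "$ip")
    net_int=$(ip_to_int "$net")
    (( (ip_int & mask) == (net_int & mask) ))
}
```

For example, `ip_in_cidr 192.168.24.11 192.168.24.0/24` succeeds, so that address would be a poor choice for a routed ping test. A target for which the check fails against every local subnet has to go through the default gateway.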
I'm not certain this is an actual bug. We need the NTP servers to be accessible. The NTP servers that are tested are the ones provided via the NtpServer parameter. Even if the node is pre-deployed, we configure chrony/ntp as part of the deployment, which is why this is imported. That being said, we could probably remove this validation, because we have a different check later in the deployment to ensure we can actually sync the time. This is a very old validation that predates the better testing we now have as part of the ntp configuration.
I've proposed a patch to remove this part from the old validation scripts. We already do this when we configure the time as part of the deployment. The original bug was NOTABUG because the time servers need to be configured and available. However we should no longer continue the legacy validation and should remove it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284