Bug 1827268 - [OSP 16.0.1] "all_nodes_validation_script.sh" conflict to override NTP on pre-provisioned nodes integration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Alex Schultz
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-23 14:03 UTC by Pradipta Kumar Sahoo
Modified: 2020-10-28 15:37 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200728213431.6c7ccc9.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:36:49 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 731342 0 None MERGED Remove ValidateNtp 2021-01-06 22:48:09 UTC
Red Hat Product Errata RHEA-2020:4284 0 None None None 2020-10-28 15:37:17 UTC

Description Pradipta Kumar Sahoo 2020-04-23 14:03:00 UTC
Description of problem:
An overcloud deployment on pre-provisioned nodes fails the NTP test in the all-nodes validation task.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.0.1 (Train)
python3-tripleo-common-11.3.3-0.20200403044648.56c0fd5.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200403044648.56c0fd5.el8ost.noarch
ansible-role-tripleo-modify-image-1.1.1-0.20200302230738.bb6f78d.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200318124452.3fd14c9.el8ost.noarch
puppet-tripleo-11.4.1-0.20200402130301.b4678ba.el8ost.noarch
tripleo-ansible-0.4.2-0.20200404124614.67005aa.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200405044622.fdce01f.el8ost.noarch
ansible-tripleo-ipsec-9.2.1-0.20200302220300.0c8693c.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200314025720.8c91b46.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200302235857.a6fef08.el8ost.noarch
python3-tripleoclient-12.3.2-0.20200405044622.fdce01f.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200403044648.56c0fd5.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200405044622.ec9970c.el8ost.noarch

How reproducible: 100% reproduced in Scale lab environment

Steps to Reproduce:
1. The pre-provisioned nodes are configured with the default NTP server.

# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* foreman.rdu2.scalelab.re>     2   8   377   188  -1916ns[  -39us] +/-   42ms

2. When the overcloud deploy command is run with the NTP parameter, it fails at the task below.

TASK [AllNodesValidationConfig] ************************************************
Thursday 23 April 2020  05:13:34 +0000 (0:00:00.965)       0:03:18.331 ********
fatal: [f08-h17-b07-5039ms]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.11 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.11 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing
to 192.168.24.11 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]}
fatal: [f08-h17-b08-5039ms]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.12 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.12 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing
to 192.168.24.11 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]}
fatal: [f08-h20-b01-5039ms]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.13 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.13 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing
to 192.168.24.11 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.11 for local network 192.168.24.0/24.\r\nPing to 192.168.24.11 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Trying to ping 192.168.24.11 for local network 192.168.24.0/24.", "Ping to 192.168.24.11 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]}
fatal: [f08-h20-b02-5039ms]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Shared connection to 192.168.24.15 closed.\r\n", "stderr_lines": ["Shared connection to 192.168.24.15 closed."], "stdout": "Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.\r\nTrying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.15 for local network 192.168.24.0/24.\r\nPing
to 192.168.24.15 succeeded.\r\nSUCCESS\r\nTrying to ping 192.168.24.15 for local network 192.168.24.0/24.\r\nPing to 192.168.24.15 succeeded.\r\nSUCCESS\r\nFailed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found\r\nTesting NTP...FAILURE\r\n\r\n", "stdout_lines": ["Trying to ping default gateway 192.168.24.1...Ping to 192.168.24.1 succeeded.", "Trying to ping default gateway 10.1.39.254...Ping to 10.1.39.254 succeeded.", "SUCCESS", "Trying to ping 192.168.24.15 for local network 192.168.24.0/24.", "Ping to 192.168.24.15 succeeded.", "SUCCESS", "Trying to ping 192.168.24.15 for local network 192.168.24.0/24.", "Ping to 192.168.24.15 succeeded.", "SUCCESS", "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found", "Testing NTP...FAILURE", ""]}

3. Ideally, config-download would override the NTP server in the "pre-steps" tasks, but that conflicts with the validation task in the playbook steps shown below.

cat /var/lib/mistral/overcloud/deploy_steps_playbook.yaml
    - name: AllNodesValidationConfig
      script: all_nodes_validation_script.sh
      environment:
        validate_controllers_icmp: "{{ validate_controllers_icmp }}"
        validate_gateways_icmp: "{{ validate_gateways_icmp }}"
        validate_fqdn: "{{ validate_fqdn }}"
        validate_ntp: "{{ validate_ntp }}"
        ping_test_ips: "{{ ping_test_ips | to_json }}"
        tripleo_role_name: "{{ tripleo_role_name }}"

    - name: ArtifactsConfig
      script: deploy-artifacts.sh
      environment:
        artifact_urls: "{{ deploy_artifact_urls }}"

  tags:
    - overcloud
    - pre_deploy_steps


- hosts: Controller
  name: Controller Host prep steps
  gather_facts: "{{ gather_facts | default(false) }}"
  any_errors_fatal: yes
  vars:
    bootstrap_server_id: 3899d15d-f51b-48d9-98a3-a0039f83eb3a
    deploy_identifier: 1587646089
    enable_debug: True
    enable_puppet: True
    container_cli: podman
    container_log_stdout_path: /var/log/containers/stdouts
    container_healthcheck_disabled: False
    docker_puppet_debug: False
    docker_puppet_process_count: 6
    docker_puppet_mount_host_puppet: True
  tasks:
    - name: Controller Host prep steps
      delegate_to: localhost
      run_once: true
      debug:
        msg: Use --start-at-task "Controller Host prep steps" to resume from this task
    - import_tasks: Controller/host_prep_tasks.yaml
  tags:
    - overcloud
    - host_prep_steps

4. As a workaround, we commented out the NTP validation in the all-nodes script, which lets the check pass before the "prep steps".

/usr/share/openstack-tripleo-heat-templates/validation-scripts/all-nodes.sh
## run chrony/ntpdate as available
#function _run_ntp_sync() {
#  local NTP_SERVER=$1
#  if ! type ntpdate 2>/dev/null; then
#    chronyd -Q "server $NTP_SERVER iburst"
#  else
#    ntpdate -qud $NTP_SERVER
#  fi
#}
#
## Verify at least one time source is available.
#function ntp_check() {
#  NTP_SERVERS=$(hiera ntp::servers nil |tr -d '[],"')
#  if [[ "$NTP_SERVERS" != "nil" ]];then
#    echo -n "Testing NTP..."
#    NTP_SUCCESS=0
#    for NTP_SERVER in $NTP_SERVERS; do
#      set +e
#      NTPDATE_OUT=$(_run_ntp_sync $NTP_SERVER 2>&1)
#      NTPDATE_EXIT=$?
#      set -e
#      if [[ "$NTPDATE_EXIT" == "0" ]];then
#        NTP_SUCCESS=1
#        break
#      else
#        NTPDATE_OUT_FULL="$NTPDATE_OUT_FULL $NTPDATE_OUT"
#      fi
#    done
#    if  [[ "$NTP_SUCCESS" == "0" ]];then
#      echo "FAILURE"
#      echo "$NTPDATE_OUT_FULL"
#      exit 1
#    fi
#    echo "SUCCESS"
#  fi
#}
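
One possible reading of the log in step 2, offered as an assumption rather than a confirmed diagnosis: the hiera lookup fails because /etc/puppetlabs/puppet/hiera.yaml does not exist yet, so the server list captured by ntp_check is empty rather than "nil". The loop then never probes any server and the check reports FAILURE. The sketch below reproduces only that control flow. The `hiera` function is a hypothetical stand-in for the real CLI, the sync call is stubbed out, and `exit 1` is replaced by `return 1` so the sketch can be run safely.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real `hiera` CLI: it mimics the error seen in
# the log (message on stderr, nothing on stdout, non-zero exit status).
hiera() {
  echo "Failed to start Hiera: RuntimeError: Config file /etc/puppetlabs/puppet/hiera.yaml not found" >&2
  return 1
}

# Same control flow as ntp_check in all-nodes.sh, with the time sync stubbed.
ntp_check_sketch() {
  local NTP_SERVERS NTP_SERVER NTP_SUCCESS
  # The pipeline status is that of `tr`, so the hiera failure goes unnoticed
  # and NTP_SERVERS ends up empty instead of "nil".
  NTP_SERVERS=$(hiera ntp::servers nil | tr -d '[],"')
  if [[ "$NTP_SERVERS" != "nil" ]]; then
    echo -n "Testing NTP..."
    NTP_SUCCESS=0
    for NTP_SERVER in $NTP_SERVERS; do
      # Never reached when NTP_SERVERS is empty.
      NTP_SUCCESS=1
      break
    done
    if [[ "$NTP_SUCCESS" == "0" ]]; then
      echo "FAILURE"
      echo "${NTPDATE_OUT_FULL:-}"   # empty here, as no server was probed
      return 1
    fi
    echo "SUCCESS"
  fi
}

ntp_check_sketch 2>/dev/null || echo "ntp_check returned non-zero"
```

Run as written, the sketch prints `Testing NTP...FAILURE`, an empty line, and then `ntp_check returned non-zero`, which matches the shape of the stdout in the failed task.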


Actual results:
The NTP server on a pre-provisioned node is not overridden before the existing playbook validation step runs, so the NTP test fails.

Expected results:
With pre-provisioned nodes, the NTP validation should not fail when the overcloud deploy command is given the NTP server details.

Comment 1 Luke Short 2020-04-23 14:35:16 UTC
I was able to pinpoint a simpler workaround: add this parameter to your deployment.

```
---
parameter_defaults:
  ValidateNtp: false
```
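
A minimal sketch of how that parameter could be passed, assuming it is saved in an extra environment file and added with `-e`. The directory and file name are hypothetical, and the deploy command is only printed, because the other `-e` files and options of a real deployment are not shown here. Note that the linked gerrit change is titled "Remove ValidateNtp", so this applies to versions from before the fix.

```shell
#!/usr/bin/env bash
# Hypothetical location and name for the extra environment file.
ENV_DIR="${ENV_DIR:-$(mktemp -d)}"
ENV_FILE="$ENV_DIR/disable-ntp-validation.yaml"

# Write the parameter from the comment above into the environment file.
cat > "$ENV_FILE" <<'EOF'
---
parameter_defaults:
  ValidateNtp: false
EOF

# Print the command instead of running it; append the usual -e files and
# options of the actual deployment before using it.
echo "openstack overcloud deploy --templates -e $ENV_FILE"
```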

Comment 2 Cédric Jeanneret 2020-05-08 09:49:35 UTC
Shouldn't pre-deployed nodes get the right NTP server before the deploy itself? If not, we probably want to ensure that the ntp/chrony configuration happens on pre-deployed nodes as well.

So this is either a doc bug (pre-deployed nodes must have working NTP/chrony) or a DF bug (NTP/chrony must be configured by tripleo/director on pre-deployed nodes).

Comment 3 Mrugesh Karnik 2020-05-11 06:54:32 UTC
I faced the NTP problem in alias lab while deploying the overcloud. There are multiple settings that should be tried before changing the validation:

1. Set masquerade to be on in undercloud.conf since the default gateway is set to the undercloud node.
2. Use an NTP server that's either clock.redhat.com or foreman.rdu2.scalelab.redhat.com
3. In the validation checks, ping an IP that is not directly on any of the subnets the node is physically a part of (192.168.24.0 or 10.1.39.x in this case). This would ensure that the packets are actually routed through the default gateway.
4. If the default gateway is set to the undercloud node, it would need to NAT in the default lab environment, which will not route 192.168.24.0 packets. For this, masquerade needs to be on, as stated earlier.
5. If the default gateway is not set to the undercloud node, the default is the lab's DHCP interface, which being fully routed through the rest of the RH network will work fine by default.

There's a separate issue here altogether: It seems that there are policies in place outside the lab networks which prevent NTP sync to servers that are not clock.redhat.com. I haven't exhaustively tested the "whitelist" of servers that can be synced to, but clock.redhat.com definitely works. foreman.rdu2.scalelab.redhat.com should work as well. I had escalated this to wfoster, who verified it. I created a corresponding issue that thus now defaults to an NTP server reachable from within the lab networks, at least for the alias lab, where my setup is.
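
A quick way to check from a node which candidate server can actually be reached, sketched with the same `chronyd -Q` probe that the validation script uses. The function names and the `NTP_PROBE` override are hypothetical and exist only so the logic can be tried without network access.

```shell
#!/usr/bin/env bash
# Probe one server. By default this uses the command from all-nodes.sh;
# set NTP_PROBE (hypothetical hook) to the name of a replacement command.
probe_ntp() {
  if [[ -n "${NTP_PROBE:-}" ]]; then
    "$NTP_PROBE" "$1"
  else
    chronyd -Q "server $1 iburst"
  fi
}

# Print the first server that answers; return 1 if none does.
first_reachable_ntp() {
  local server
  for server in "$@"; do
    if probe_ntp "$server" >/dev/null 2>&1; then
      echo "$server"
      return 0
    fi
  done
  return 1
}
```

For example, `first_reachable_ntp clock.redhat.com foreman.rdu2.scalelab.redhat.com` would print whichever of the two servers named above responds first, and that name can then be passed as the NtpServer parameter.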

Comment 4 Alex Schultz 2020-05-11 14:26:14 UTC
I'm not certain this is an actual bug. We need NTP servers to be accessible. The NTP servers that are tested are the ones provided via the NtpServer parameter. Even if the node is pre-deployed, we configure chrony/ntp as part of the deployment, which is why this is important. That being said, we could probably remove this validation because we have a different check later in the deployment to ensure we can actually sync the time. This is a very old validation that predates the better testing we now have as part of the ntp configuration.

Comment 5 Alex Schultz 2020-05-27 21:07:10 UTC
I've proposed a patch to remove this part from the old validation scripts. We already do this when we configure the time as part of the deployment. The original bug was NOTABUG because the time servers need to be configured and available. However, we should no longer continue the legacy validation and should remove it.

Comment 13 errata-xmlrpc 2020-10-28 15:36:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

