Description of problem:
I would like to report a possible fault in the validation script /usr/share/openstack-tripleo-heat-templates/validation-scripts/all-nodes.sh.

It contains this function:

    function ping_retry() {
      local IP_ADDR=$1
      local TIMES=${2:-'10'}
      local COUNT=0
      local PING_CMD=ping
      if [[ $IP_ADDR =~ ":" ]]; then
        PING_CMD=ping6
      fi
      until [ $COUNT -ge $TIMES ]; do
        if $PING_CMD -w 300 -c 1 $IP_ADDR &> /dev/null; then
          echo "Ping to $IP_ADDR succeeded."
          return 0
        fi
        echo "Ping to $IP_ADDR failed. Retrying..."
        COUNT=$(($COUNT + 1))
      done
      return 1
    }

The problematic line is:

    $PING_CMD -w 300 -c 1 $IP_ADDR

According to man ping:

    -w deadline
        Specify a timeout, in seconds, before ping exits regardless of how many packets have been sent or received. In this case ping does not stop after count packet are sent, it waits either for deadline expire or until count probes are answered or for some error notification from network.

So "-w 300" sets a 300-second deadline, and the loop repeats this up to 10 times as per the TIMES variable. In the worst case this allows 3000 seconds (50 minutes) for the ping to complete.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.3.3-1.el7ost.noarch.rpm

Actual results:
If "some error notification from network" is received, ping does not wait for the "-w 300" deadline; it exits quite quickly. Here is a test in which I ping a non-existent IP address in an existing subnet:

    [VNF11 VPOD3 stack@director validation-scripts]$ time ping -w 300 -c 1 10.33.110.150
    PING 10.33.110.150 (10.33.110.150) 56(84) bytes of data.
    From 10.33.110.133 icmp_seq=1 Destination Host Unreachable
    From 10.33.110.133 icmp_seq=2 Destination Host Unreachable
    From 10.33.110.133 icmp_seq=3 Destination Host Unreachable
    From 10.33.110.133 icmp_seq=4 Destination Host Unreachable

    --- 10.33.110.150 ping statistics ---
    4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 2999ms
    pipe 4

    real 0m3.008s
    user 0m0.000s
    sys 0m0.002s

The ping exited in 3 seconds. If this happens during deployment, the total timeout is only 3 * 10 = 30 seconds, which is too short if the ping must go over a bonded interface with LACP.

Expected results:
It takes anywhere between 30 and 60 seconds for LACP to become functional. It does not matter whether slow or fast LACP mode is used on the switches; 30 seconds is a borderline minimum and is not enough.
I noticed this has been changed in OSP12 (openstack-tripleo-heat-templates-7.0.3-22.el7ost.noarch.rpm):

    if $PING_CMD -w 10 -c 1 $IP_ADDR &> /dev/null; then

Is it possible to backport this to OSP11 and OSP10?
Hi,

Perhaps I should be more precise with the BZ description. Simply adjusting the -w deadline will not solve this, because the "Destination Host Unreachable" case always ends in about 3 seconds, regardless of the -w value.

Let's assume the loop should keep pinging for 60 seconds to cover LACP activation. To achieve this we should rather consider doubling the TIMES variable, e.g. 3 sec x 20 = 60 sec. The worst case would then be:

    with -w 300: 300 sec x 20 = 6000 sec
    with -w 10:  10 sec x 20 = 200 sec

Could you please verify whether adjusting TIMES makes sense?

Thank you
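The arithmetic in this comment can be checked with a small shell helper. This is only an illustrative sketch: `retry_window` is a hypothetical name that does not exist in all-nodes.sh, and it assumes every attempt takes the same number of seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of all-nodes.sh): total seconds spent in the
# retry loop if every attempt takes PER_ATTEMPT seconds and the loop runs
# TIMES iterations without succeeding.
retry_window() {
    local per_attempt=$1
    local times=$2
    echo $(( per_attempt * times ))
}

retry_window 3 10     # fast "Destination Host Unreachable" exit, TIMES=10 -> 30
retry_window 3 20     # same fast exit with TIMES doubled -> 60
retry_window 300 20   # worst case with -w 300 and TIMES=20 -> 6000
retry_window 10 20    # worst case with -w 10 and TIMES=20 -> 200
```

The first two calls show why TIMES, not the -w value, controls the window when the network answers with an error right away.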
The change reduces -w 300 to -w 10 but then adds a 60-second sleep between loop iterations. So the worst case is (10 + 60) * 20 = 1400 sec, which should be sufficient based on the description of the issue. Please let us know if it is not.
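For illustration, here is a minimal sketch of how the loop might look with the shorter deadline and a pause between attempts. It is not the upstream patch; the linked review is the authoritative version. The default of 20 for TIMES is an assumption taken from the arithmetic in this comment, and `PING_OVERRIDE` and `SLEEP_CMD` are hypothetical hooks added only so the loop can be exercised without a network.

```shell
#!/usr/bin/env bash
# Sketch only, not the upstream change. PING_OVERRIDE and SLEEP_CMD are
# hypothetical environment hooks for dry runs.
ping_retry() {
    local IP_ADDR=$1
    local TIMES=${2:-20}          # assumption: 20 iterations
    local COUNT=0
    local PING_CMD=ping
    if [[ $IP_ADDR =~ ":" ]]; then
        PING_CMD=ping6
    fi
    PING_CMD=${PING_OVERRIDE:-$PING_CMD}
    until [ "$COUNT" -ge "$TIMES" ]; do
        # Each attempt waits at most 10 seconds for a reply.
        if $PING_CMD -w 10 -c 1 "$IP_ADDR" &> /dev/null; then
            echo "Ping to $IP_ADDR succeeded."
            return 0
        fi
        echo "Ping to $IP_ADDR failed. Retrying..."
        COUNT=$((COUNT + 1))
        # Pause after a failure so a fast error reply cannot burn
        # through all retries before LACP is up.
        ${SLEEP_CMD:-sleep} 60
    done
    return 1
}
```

A dry run such as `PING_OVERRIDE=false SLEEP_CMD=true ping_retry 192.0.2.1 3` walks through three failed attempts immediately. Because the sleep runs after every failure, even the 3-second "Destination Host Unreachable" case now spans roughly a minute per attempt.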
Hello,

Is there any chance to push OSP THT v5.3.8-8 to the CDN?
I believe it's currently slated for the next OSP10 update. If you need a hotfix you could request one, though it would be trivial to just fix it locally in THT for affected customers. The change is:

https://review.openstack.org/#/c/548665/1/validation-scripts/all-nodes.sh
Verified on puddle 2018-05-09.2

    [stack@undercloud-0 ~]$ rpm -q openstack-tripleo-heat-templates
    openstack-tripleo-heat-templates-5.3.10-1.el7ost.noarch
    [stack@undercloud-0 ~]$ sed -n -e 13p -e 19p /usr/share/openstack-tripleo-heat-templates/validation-scripts/all-nodes.sh
    if $PING_CMD -w 10 -c 1 $IP_ADDR &> /dev/null; then
    sleep 60
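The same check can be done without relying on line numbers, which may differ between package builds. This is a rough sketch: `has_ping_fix` is a hypothetical helper, and it only looks for the two strings shown in the verification output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a copy of all-nodes.sh contains both the
# 10-second ping deadline and the 60-second pause.
has_ping_fix() {
    local script=$1
    grep -qF -- '-w 10 -c 1' "$script" && grep -qF 'sleep 60' "$script"
}

SCRIPT=/usr/share/openstack-tripleo-heat-templates/validation-scripts/all-nodes.sh
# Only report when the script is actually installed on this host.
if [ -r "$SCRIPT" ]; then
    if has_ping_fix "$SCRIPT"; then
        echo "fix present"
    else
        echo "fix missing"
    fi
fi
```

This is a plain string match, so it confirms that the lines exist but not that they sit inside ping_retry.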
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1593