Description of problem:
While scaling up compute nodes in OSP10 cloud, the stack update gets "stuck" and eventually times out.

Version-Release number of selected component (if applicable):
OSP10 / RHEL 7.3

How reproducible:
When scaling OSP10 cloud from 144 to 160 computes.

Steps to Reproduce:
1. Login to one of the overcloud nodes
2. Stop os-collect-config service temporarily
   # systemctl stop os-collect-config
3. Run os-refresh-config manually with debug enabled
   # os-refresh-config --log-level DEBUG
   print v f=1 }f &&!/^# HEAT_HOSTS_END$/{next}/^# HEAT_HOSTS_END$/{f=0}!f' /etc/cloud/templates/hosts.redhat.tmpl
   /usr/libexec/os-refresh-config/configure.d/51-hosts: line 17: /bin/awk: Argument list too long
   [2017-08-23 12:58:45,719] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]

Actual results:
overcloud stack update as well as corresponding software deployments get stuck IN_PROGRESS until stack update eventually times out after 240 min.

Expected results:
overcloud stack update / compute scale-up completes successfully.

Additional info:
Upstream TripleO bug 1674732. Dropping in updated '51-hosts' file seems to resolve the issue.
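The "Argument list too long" error from /bin/awk is consistent with the full host list being handed to awk as one command-line argument (the "print v" fragment in the debug output suggests a variable passed that way). Linux caps a single argument string at 128 KiB (MAX_ARG_STRLEN with 4 KiB pages), so a large enough node count can exceed it. The following is only a minimal sketch of one way to avoid that limit, assuming the new entries are first written to a file; it is not the upstream patch, and the function name replace_hosts_block and both file arguments are hypothetical.

```shell
# Sketch: rewrite the block between the HEAT_HOSTS markers without
# passing the (potentially huge) entry list on the awk command line.
#   $1 = file holding the new host entries (hypothetical)
#   $2 = hosts template containing the markers
replace_hosts_block() {
    awk -v entries_file="$1" '
        /^# HEAT_HOSTS_START$/ {
            print                       # keep the start marker
            # Stream the entries from disk instead of from an argument.
            while ((getline line < entries_file) > 0) print line
            close(entries_file)
            skip = 1                    # drop the old entries that follow
            next
        }
        /^# HEAT_HOSTS_END$/ { skip = 0 }   # end marker is printed below
        !skip
    ' "$2"
}
```

Only the short file path is passed as an argument here, so the size of the host list no longer matters to the exec call. The output goes to stdout; a caller would redirect it to a temporary file and move it into place.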
*** Bug 1484523 has been marked as a duplicate of this bug. ***
*** Bug 1427878 has been marked as a duplicate of this bug. ***
How did you work around this? We tried the patch, but we still see the same issue where 51-hosts gets stuck. However, if I run "os-refresh-config --log-level DEBUG" manually on the node, I do not get any error and it completes fine. The node only gets stuck during deployment, and that already happens when we scale up to 35 nodes.
During deployment, we do not see "dib-run-parts Fri Sep 15 18:41:33 EDT 2017 51-hosts completed".
@bigswitch: Did you update your overcloud-full image with the patched 51-hosts file? We just used the libguestfs-tools utilities to update the image until the official fix is released. Mounting the modified image via guestfish (http://libguestfs.org/guestfish.1.html) would be an easy way to make and/or validate the change.
We used virt-customize to upload the new 51-hosts file into the overcloud image:
virt-customize -a overcloud-full.qcow2 --upload /home/stack/templates/51-hosts:/usr/libexec/os-refresh-config/configure.d/51-hosts
Does your updated 51-hosts file look like this? https://review.openstack.org/gitweb?p=openstack/tripleo-image-elements.git;a=blob_plain;f=elements/hosts/os-refresh-config/configure.d/51-hosts And when you run 'os-refresh-config' manually, does it update /etc/hosts on that overcloud node properly?
Yes, it is the same file. When I run it manually, it does update /etc/hosts on the node.
Did you also update the existing overcloud nodes with the updated 51-hosts file? If you did, and it's still failing with the above AWK error, I'd recommend opening a support case. In our case, the IN_PROGRESS ("stuck") software deployments cleared upon rerunning the overcloud stack update. To summarize, we updated the 51-hosts file on all existing overcloud nodes (144 computes and 3 controllers already built with the older overcloud image), updated the overcloud-full.qcow2 image, and re-uploaded it to glance before re-running the overcloud stack update to arrive at 160 computes and 3 controllers.
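Since the fix has to be present both in the image and on every node that was already built, it can help to confirm that a node's copy of 51-hosts really matches the patched file. A minimal sketch using checksums follows; the helper names file_digest and same_content are hypothetical, and how the digest is collected from each node (ssh, ansible, etc.) is left open.

```shell
# Print the SHA-256 digest of a file (digest only, no filename).
file_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Succeed only if both files have identical content.
#   $1 = reference copy (e.g. the patched 51-hosts)
#   $2 = candidate copy to check
same_content() {
    [ "$(file_digest "$1")" = "$(file_digest "$2")" ]
}
```

For example, running file_digest against /usr/libexec/os-refresh-config/configure.d/51-hosts on a node and comparing the result with the digest of the patched copy on the undercloud would show whether that node still carries the old script.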
All patches are present in a fresh deployment (2017-09-07.2 build), but due to a lack of hardware this has not been tested on a comparable setup.

[stack@undercloud-0 ~]$ rpm -q openstack-tripleo-image-elements
openstack-tripleo-image-elements-5.3.0-3.el7ost.noarch
Is there any setting you used on the undercloud to scale to 140 nodes?
At only 35 nodes it's unlikely you're running into this problem. I would suggest opening a separate bug with details on the issues you're seeing.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2825