1484533 – 51-hosts at scale fails to complete and does not report an error (need a backport to OSP10 overcloud image)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-image-elements
Sub Component:
Version:	10.0 (Newton)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	z5
Target Release:	10.0 (Newton)
Assignee:	Ben Nemec
QA Contact:	Gurenko Alex
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1427878 1484523
Depends On:
Blocks:

Reported:	2017-08-23 19:42 UTC by Randy Rubins
Modified:	2020-12-14 09:41 UTC
CC List:	12 users
Fixed In Version:	openstack-tripleo-image-elements-5.3.0-3.el7ost
Doc Type:	Bug Fix
Doc Text:	In larger scale deployments (100 or more overcloud nodes), the 51-hosts script used to write all overcloud nodes into each /etc/hosts file would fail due to an "Argument list too long" error. This limitation has been fixed and should no longer block large-scale deployments.
Clone Of:
Environment:
Last Closed:	2017-09-28 16:35:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Launchpad	1674732	0	None	None	None	2017-08-23 19:42:42 UTC
Red Hat Product Errata	RHBA-2017:2825	0	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 director Bug Fix Advisory	2017-09-28 20:33:35 UTC

Description Randy Rubins 2017-08-23 19:42:43 UTC

Description of problem:
While scaling up compute nodes in OSP10 cloud, the stack update gets "stuck" and eventually times out.

Version-Release number of selected component (if applicable):
OSP10 / RHEL 7.3

How reproducible:
When scaling OSP10 cloud from 144 to 160 computes.

Steps to Reproduce:
1. Log in to one of the overcloud nodes
2. Stop the os-collect-config service temporarily
   # systemctl stop os-collect-config
3. Run os-refresh-config manually with debug enabled
   # os-refresh-config --log-level DEBUG
     print v
            f=1
            }f &&!/^# HEAT_HOSTS_END$/{next}/^# HEAT_HOSTS_END$/{f=0}!f' /etc/cloud/templates/hosts.redhat.tmpl
/usr/libexec/os-refresh-config/configure.d/51-hosts: line 17: /bin/awk: Argument list too long
[2017-08-23 12:58:45,719] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]   
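
The awk failure above is consistent with the host entries being handed to awk as a single command-line argument, which Linux caps at 128 KiB per argument. A minimal sketch of one way to avoid that cap is shown here: the entries are read from a file inside awk instead of being passed on the command line. The function name replace_hosts_block and the separate entries file are hypothetical, the HEAT_HOSTS_START marker is assumed to pair with the HEAT_HOSTS_END marker visible in the output, and this is not necessarily how the shipped fix works.

```shell
#!/bin/bash
# Hypothetical helper: replace the lines between the HEAT_HOSTS markers in
# file $1 with the contents of the file named by $2. The entries never appear
# on the awk command line, so the per-argument size limit does not apply.
replace_hosts_block() {
    local file="$1" entries_file="$2" tmp status=0
    [ -f "$file" ] || return 0            # nothing to do if the target is absent
    tmp=$(mktemp) || return 1
    awk -v src="$entries_file" '
        /^# HEAT_HOSTS_START/ {
            print                          # keep the start marker
            while ((getline line < src) > 0) print line
            close(src)
            skipping = 1                   # drop the old entries that follow
            next
        }
        /^# HEAT_HOSTS_END$/ { skipping = 0 }
        !skipping                          # print everything outside the old block
    ' "$file" > "$tmp" && cat "$tmp" > "$file" || status=1
    rm -f "$tmp"
    return "$status"
}
```

For example, with the new entries written to a scratch file first: replace_hosts_block /etc/cloud/templates/hosts.redhat.tmpl /tmp/new-entries (the scratch path is illustrative).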

Actual results:
The overcloud stack update and its corresponding software deployments remain stuck in IN_PROGRESS until the stack update times out after 240 minutes.

Expected results:
The overcloud stack update (compute scale-up) completes successfully.

Additional info:
Upstream TripleO bug 1674732.
Dropping in the updated '51-hosts' file seems to resolve the issue.

Comment 1 Alex Schultz 2017-08-29 21:34:21 UTC

*** Bug 1484523 has been marked as a duplicate of this bug. ***

Comment 2 Alex Schultz 2017-08-30 22:20:24 UTC

*** Bug 1427878 has been marked as a duplicate of this bug. ***

Comment 5 bigswitch 2017-09-15 22:45:29 UTC

How did you work around this? We tried the patch and still see the same issue where 51-hosts gets stuck. However, if I run "os-refresh-config --log-level DEBUG" manually inside the node, I do not get any error and it goes through fine.

Node progress gets stuck only during the deployment, even when we scale up to just 35 nodes.

Comment 6 bigswitch 2017-09-15 22:47:05 UTC

During deployment, we do not see:

"dib-run-parts Fri Sep 15 18:41:33 EDT 2017 51-hosts completed"
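
A check for that completion line can be sketched as follows, assuming the os-refresh-config output has been captured to a file. The function name script_completed and the log path in the usage note are hypothetical; the pattern is modelled on the line quoted above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the captured log in $1 shows that
# dib-run-parts finished the named script (default: 51-hosts).
script_completed() {
    local logfile="$1" script="${2:-51-hosts}"
    # Matches lines such as:
    #   dib-run-parts <timestamp> 51-hosts completed
    grep -Eq "dib-run-parts .* ${script} completed" "$logfile"
}
```

For example, after saving the service journal with journalctl -u os-collect-config > /tmp/occ.log, running script_completed /tmp/occ.log || echo "51-hosts did not complete" reports whether the script ever finished.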

Comment 7 Randy Rubins 2017-09-15 23:15:32 UTC

@bigswitch: Did you update your overcloud-full image with the patched 51-hosts file? We just used the libguestfs-tools utilities to update the image until the official fix is released. Mounting the modified image via guestfish (http://libguestfs.org/guestfish.1.html) would be an easy way to make and/or validate the change.

Comment 8 bigswitch 2017-09-15 23:18:49 UTC

We used virt-customize to upload the new 51-hosts file into the overcloud image.

Comment 9 bigswitch 2017-09-15 23:19:40 UTC

virt-customize -a overcloud-full.qcow2 --upload /home/stack/templates/51-hosts:/usr/libexec/os-refresh-config/configure.d/51-hosts
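
One way to confirm that an upload like this took effect is to read the file back out of the image and compare it with the local copy. The sketch below assumes virt-cat from libguestfs-tools is installed; the function name image_file_matches is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file at guest path $2 inside disk image
# $1 is byte-for-byte identical to the local file $3.
image_file_matches() {
    local image="$1" guest_path="$2" local_file="$3" tmp status=0
    tmp=$(mktemp) || return 1
    # virt-cat writes a file from inside the image to stdout;
    # cmp -s then compares it with the local copy without printing anything.
    virt-cat -a "$image" "$guest_path" > "$tmp" \
        && cmp -s "$tmp" "$local_file" || status=1
    rm -f "$tmp"
    return "$status"
}
```

For example: image_file_matches overcloud-full.qcow2 /usr/libexec/os-refresh-config/configure.d/51-hosts /home/stack/templates/51-hosts && echo "image contains the patched 51-hosts"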

Comment 10 Randy Rubins 2017-09-16 00:49:02 UTC

Does your updated 51-hosts file look like this?
https://review.openstack.org/gitweb?p=openstack/tripleo-image-elements.git;a=blob_plain;f=elements/hosts/os-refresh-config/configure.d/51-hosts

And when you run 'os-refresh-config' manually, it updates the /etc/hosts on that overcloud node properly?

Comment 11 bigswitch 2017-09-16 16:17:02 UTC

Yes, it is the same file. When I run it manually, it does update /etc/hosts on the node.

Comment 12 Randy Rubins 2017-09-16 17:29:02 UTC

Did you also update the existing overcloud nodes with the updated 51-hosts file?
If you did, and it's still failing with the above AWK error, I'd recommend opening a support case then.

In our case, the IN_PROGRESS ("stuck") software deployments cleared upon rerunning overcloud stack update.

To summarize, we updated the 51-hosts file on all existing overcloud nodes (144 computes and 3 controllers already built with the older overcloud image), updated the overcloud-full.qcow2 image, and re-uploaded it to Glance before re-running the overcloud stack update to arrive at 160 computes and 3 controllers.

Comment 14 Gurenko Alex 2017-09-19 16:17:38 UTC

All patches are available on a fresh deployment (2017-09-07.2 build), but due to a lack of hardware it has not been tested on the same setup.

[stack@undercloud-0 ~]$ rpm -q openstack-tripleo-image-elements
openstack-tripleo-image-elements-5.3.0-3.el7ost.noarch

Comment 15 bigswitch 2017-09-20 18:16:41 UTC

Did you use any particular settings on the undercloud to scale to 140 nodes?

Comment 16 Ben Nemec 2017-09-20 18:52:32 UTC

At only 35 nodes it's unlikely you're running into this problem.  I would suggest opening a separate bug with details on the issues you're seeing.
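
For a rough sense of how close a node is to the limit, the size of the Heat-managed block in a hosts file can be compared with the Linux per-argument cap of 131072 bytes (MAX_ARG_STRLEN with 4 KiB pages). This is only a sketch: the function names are hypothetical, the comparison is approximate, and it assumes a HEAT_HOSTS_START marker paired with the HEAT_HOSTS_END marker seen in the description.

```shell
#!/bin/bash
# Per-argument limit on Linux with 4 KiB pages (MAX_ARG_STRLEN = 32 pages).
MAX_ARG_BYTES=131072

# Hypothetical helper: print the approximate size in bytes of the lines
# between the two markers in file $1 (markers themselves are not counted).
hosts_block_bytes() {
    awk '
        /^# HEAT_HOSTS_END$/  { inside = 0 }
        # length() counts characters, which equals bytes for ASCII entries;
        # the extra 1 accounts for the newline.
        inside                { bytes += length($0) + 1 }
        /^# HEAT_HOSTS_START/ { inside = 1 }
        END                   { print bytes + 0 }
    ' "$1"
}

# Hypothetical helper: succeed if the block would not fit in one argument.
hosts_block_too_big() {
    [ "$(hosts_block_bytes "$1")" -ge "$MAX_ARG_BYTES" ]
}
```

For example: hosts_block_too_big /etc/hosts && echo "host entries exceed the single-argument limit"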

Comment 17 errata-xmlrpc 2017-09-28 16:35:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2825
