Bug 1737456 - [OSP15] auto scale-up doesn't add new nodes in the cluster during controller replacement
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-pacemaker
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 15.0 (Stein)
Assignee: RHOS Maint
QA Contact: pkomarov
URL:
Whiteboard:
Duplicates: 1733697
Depends On: 1742169
Blocks:
 
Reported: 2019-08-05 12:05 UTC by Artem Hrechanychenko
Modified: 2023-09-14 05:59 UTC
CC: 12 users

Fixed In Version: puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-21 11:24:21 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible logs (4.57 MB, application/gzip)
2019-08-05 12:05 UTC, Artem Hrechanychenko


Links
Gerrithub.io 465263 (2019-08-18 08:30:37 UTC)
Gerrithub.io 466208 (2019-08-25 19:55:53 UTC)
OpenStack gerrit 674925: MERGED, "pcs 0.10: authenticate nodes before adding them to the cluster" (2020-12-08 06:16:05 UTC)
Red Hat Issue Tracker OSP-28698 (2023-09-14 05:52:17 UTC)
Red Hat Product Errata RHEA-2019:2811 (2019-09-21 11:24:41 UTC)

Description Artem Hrechanychenko 2019-08-05 12:05:57 UTC
Created attachment 1600638
ansible logs

Description of problem:
Controller replacement failed with a timeout.



(undercloud) [stack@undercloud-0 ~]$ cat overcloud_replace.sh 
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/network/dvr-override.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/extra_templates.yaml \
-e /home/stack/rm.yaml


TASK [Start containers for step 4] *********************************************
Monday 05 August 2019  06:27:54 -0400 (0:00:00.309)       1:20:46.432 ********* 
ok: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
Overcloud configuration failed.

Ansible timed out at 4919 seconds.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55394), raddr=('192.168.24.2', 13808)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 60452)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 41754)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=9, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55492), raddr=('192.168.24.2', 13989)>


Step #1 takes much longer than on OSP14.


Version-Release number of selected component (if applicable):
RHOS_TRUNK-15.0-RHEL-8-20190725.n.1
python-openstackclient-lang-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.2-0.20190710165331.c89fe3c.el8ost.noarch
openstack-heat-engine-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
python3-openstacksdk-0.27.0-0.20190405091843.4174082.el8ost.noarch
openstack-heat-common-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
puppet-openstacklib-14.4.1-0.20190420125152.3719ca1.el8ost.noarch
openstack-selinux-0.8.19-0.20190606150404.06faac7.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190705161217.2c8a6a5.el8ost.noarch
puppet-openstack_extras-14.4.1-0.20190420090934.6b1b687.el8ost.noarch
openstack-tripleo-validations-10.5.1-0.20190724100449.23ebc8a.el8ost.noarch
openstack-heat-agents-1.8.1-0.20190523210450.1e15344.el8ost.noarch
openstack-heat-api-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
openstack-tripleo-common-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-tripleo-heat-templates-10.6.1-0.20190725000448.e49b8db.el8ost.noarch
python3-openstackclient-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-common-containers-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-heat-monolith-12.0.1-0.20190704050403.bf16acc.el8ost.noarch


How reproducible:
always

Steps to Reproduce:
1. Deploy OSP15 with 3 controllers + 1 compute.
2. Try to replace a controller using https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/15-beta/html-single/director_installation_and_usage/index#preparing-for-controller-replacement and the fixes from https://bugzilla.redhat.com/show_bug.cgi?id=1722082 (a pre-check sketch follows below).
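
The linked procedure has you confirm cluster health from a surviving controller before starting the replacement. A minimal sketch, assuming the usual TripleO heat-admin user and default hostnames:

# Assumed pre-check: confirm the cluster is healthy on a surviving
# controller before removing the failed one (hostname illustrative).
ssh heat-admin@controller-0 'sudo pcs status'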


Actual results:
Overcloud configuration failed.

Ansible timed out at 4919 seconds.


Expected results:
Controller replacement completes successfully.

Additional info:

Comment 3 Emilien Macchi 2019-08-05 13:50:16 UTC
I think the issue is during the Pacemaker cluster bootstrap.
I grepped the puppet logs from /var/log/messages on the new controller (controller-3) that is replacing the previous one:
http://ix.io/1QG0

Grep for "puppet-user[51963]" and you can see that the Puppet task starts at 12:34:08 and fails one hour later. This is likely the problem.
Now, please tell me why I also see puppet logs from 4 hours ago (check the beginning of the file). Is controller-3 a fresh & clean node? It doesn't sound like that's the case. That *could* be the reason why it takes so long to replace this controller in the cluster.
If it's not the case, we need to find out why the cluster takes so long to bootstrap; we probably want to involve PIDONE at this point.
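
For reference, the grep mentioned above; the PID is specific to this run's /var/log/messages and will differ elsewhere:

# Follow one puppet run on the new controller by its logged PID.
grep 'puppet-user\[51963\]' /var/log/messages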

Comment 5 Damien Ciabrini 2019-08-06 20:14:22 UTC
That is probably because of https://bugs.launchpad.net/tripleo/+bug/1839209, which makes the puppet code retry needlessly and slows down the entire procedure.
I just posted https://review.opendev.org/#/c/674925/ upstream so that puppet-pacemaker correctly adds controller-3 to the cluster; that should fix it.

If that works, I'll use that bz to track the backport downstream.
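
For context, a minimal sketch of the pcs 0.10 sequence the upstream change's summary describes ("authenticate nodes before adding them to the cluster"); run from an existing cluster member, with node name and password illustrative:

# pcs 0.10 split authentication out of cluster setup: a new host must
# be authenticated before "pcs cluster node add" can succeed.
pcs host auth controller-3 -u hacluster -p <password>
pcs cluster node add controller-3 --start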

Comment 6 Artem Hrechanychenko 2019-08-07 10:31:25 UTC
*** Bug 1733697 has been marked as a duplicate of this bug. ***

Comment 8 pkomarov 2019-08-18 08:33:02 UTC
Verification depends on the controller replacement fix: https://review.gerrithub.io/c/rhos-infra/cloud-config/+/465263

Comment 12 pkomarov 2019-09-11 06:46:00 UTC
Verified,

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'rpm -q puppet-pacemaker' 
 [WARNING]: Found both group and host with same name: undercloud

 [WARNING]: Consider using the yum, dnf or zypper module rather than running 'rpm'.  If you need to use command because yum, dnf or zypper is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this
message.

controller-1 | CHANGED | rc=0 >>
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch

New nodes are added and the cluster is in a good state after controller replacement:
http://pastebin.test.redhat.com/796167
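
A hedged way to recheck the cluster state referenced above, reusing the ansible pattern from the rpm query; all controllers should show as Online with resources Started:

# Assumed verification: query pacemaker status through the same
# ansible inventory used above.
ansible controller-1 -mshell -b -a'pcs status'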

Comment 14 errata-xmlrpc 2019-09-21 11:24:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

Comment 15 Red Hat Bugzilla 2023-09-14 05:41:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

