Bug 1737456

Summary: [OSP15] auto scale-up doesn't add new nodes in the cluster during controller replacement
Product: Red Hat OpenStack
Reporter: Artem Hrechanychenko <ahrechan>
Component: puppet-pacemaker
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA
QA Contact: pkomarov
Severity: high
Priority: high
Version: 15.0 (Stein)
CC: dbecker, dciabrin, emacchi, jjoyce, jschluet, mburns, morazi, pkomarov, rhos-maint, slinaber, ssmolyak, tvignaud
Target Milestone: rc
Keywords: Triaged
Target Release: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Fixed In Version: puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost
Last Closed: 2019-09-21 11:24:21 UTC
Type: Bug
Bug Depends On: 1742169
Attachments:
Description Flags
ansible logs none

Description Artem Hrechanychenko 2019-08-05 12:05:57 UTC
Created attachment 1600638: ansible logs

Description of problem:
Controller replacement failed due to a timeout.



(undercloud) [stack@undercloud-0 ~]$ cat overcloud_replace.sh 
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/network/dvr-override.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/extra_templates.yaml \
-e /home/stack/rm.yaml \


TASK [Start containers for step 4] *********************************************
Monday 05 August 2019  06:27:54 -0400 (0:00:00.309)       1:20:46.432 ********* 
ok: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
Overcloud configuration failed.

Ansible timed out at 4919 seconds.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55394), raddr=('192.168.24.2', 13808)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 60452)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 41754)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=9, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55492), raddr=('192.168.24.2', 13989)>


Step #1 takes much longer than it does on OSP14.


Version-Release number of selected component (if applicable):
RHOS_TRUNK-15.0-RHEL-8-20190725.n.1
python-openstackclient-lang-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.2-0.20190710165331.c89fe3c.el8ost.noarch
openstack-heat-engine-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
python3-openstacksdk-0.27.0-0.20190405091843.4174082.el8ost.noarch
openstack-heat-common-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
puppet-openstacklib-14.4.1-0.20190420125152.3719ca1.el8ost.noarch
openstack-selinux-0.8.19-0.20190606150404.06faac7.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190705161217.2c8a6a5.el8ost.noarch
puppet-openstack_extras-14.4.1-0.20190420090934.6b1b687.el8ost.noarch
openstack-tripleo-validations-10.5.1-0.20190724100449.23ebc8a.el8ost.noarch
openstack-heat-agents-1.8.1-0.20190523210450.1e15344.el8ost.noarch
openstack-heat-api-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
openstack-tripleo-common-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-tripleo-heat-templates-10.6.1-0.20190725000448.e49b8db.el8ost.noarch
python3-openstackclient-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-common-containers-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-heat-monolith-12.0.1-0.20190704050403.bf16acc.el8ost.noarch


How reproducible:
always

Steps to Reproduce:
1. Deploy OSP15 with 3 controllers + 1 compute
2. Try to replace a controller following https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/15-beta/html-single/director_installation_and_usage/index#preparing-for-controller-replacement and applying the fixes from https://bugzilla.redhat.com/show_bug.cgi?id=1722082


Actual results:
Overcloud configuration failed.

Ansible timed out at 4919 seconds.


Expected results:
The controller replacement passes.

Additional info:

Comment 3 Emilien Macchi 2019-08-05 13:50:16 UTC
I think the issue is during the Pacemaker cluster bootstrap.
I grepped the puppet logs from /var/log/messages on the new controller (controller-3) that is replacing the previous one:
http://ix.io/1QG0

Grep for "puppet-user[51963]" and you can see that the Puppet task starts at 12:34:08 and fails one hour later. This is likely the problem.
Now, please tell me why I also see puppet logs from 4 hours ago (check the beginning of the file). Is controller-3 a fresh & clean node? It doesn't sound like it. That *could* be the reason why it takes so long to replace this controller in the cluster.
If it's not the case, we need to find out why the cluster takes so long to bootstrap; we probably want to involve PIDONE at this point.
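
The log check described in this comment can be sketched as a small shell helper. This is only an illustration: it assumes the syslog tag quoted above (puppet-user[51963]) and the /var/log/messages path, and the function name puppet_run_window is made up for this example. It prints the first and the last log line carrying the tag, so the start and end timestamps of the Puppet run can be compared.

```shell
#!/bin/bash
# Hypothetical helper (not part of any tool mentioned in this bug):
# print the first and last log lines that carry a given syslog tag.
#   $1 = log file, e.g. /var/log/messages
#   $2 = syslog tag, e.g. 'puppet-user[51963]'
puppet_run_window() {
    local logfile=$1 tag=$2
    # -F matches the tag as a fixed string, so "[51963]" is not read as a
    # regex bracket expression.
    grep -F -- "$tag" "$logfile" |
        awk 'NR == 1 { print } { last = $0 } END { if (NR > 1) print last }'
}
```

Example call on the new controller: puppet_run_window /var/log/messages 'puppet-user[51963]'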

Comment 5 Damien Ciabrini 2019-08-06 20:14:22 UTC
That is probably because of https://bugs.launchpad.net/tripleo/+bug/1839209, which makes the puppet code retry for nothing and slows down the entire procedure.
I just posted https://review.opendev.org/#/c/674925/ upstream so that puppet-pacemaker correctly adds controller-3 to the cluster; that should fix it.

If that works, I'll use that bz to track the backport downstream.

Comment 6 Artem Hrechanychenko 2019-08-07 10:31:25 UTC
*** Bug 1733697 has been marked as a duplicate of this bug. ***

Comment 8 pkomarov 2019-08-18 08:33:02 UTC
Verification depends on the controller replacement fix: https://review.gerrithub.io/c/rhos-infra/cloud-config/+/465263

Comment 12 pkomarov 2019-09-11 06:46:00 UTC
Verified.

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'rpm -q puppet-pacemaker' 
 [WARNING]: Found both group and host with same name: undercloud

 [WARNING]: Consider using the yum, dnf or zypper module rather than running 'rpm'.  If you need to use command because yum, dnf or zypper is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.

controller-1 | CHANGED | rc=0 >>
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch

New nodes are added and the cluster is in a good state after controller replacement:
http://pastebin.test.redhat.com/796167
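
A check of this kind could be scripted roughly as follows. This is a sketch under assumptions: the helper name node_online is hypothetical, and it assumes that the `pcs status` output lists cluster members on a line of the form "Online: [ node-a node-b ]", which is not shown in this bug. It reads the status text from stdin and succeeds only if the given node name appears as a separate entry on such a line.

```shell
#!/bin/bash
# Hypothetical helper: succeed if node $1 is listed on an "Online:" line of
# cluster status text read from stdin.
# Assumption: members are listed as "Online: [ node-a node-b ]".
node_online() {
    local node=$1
    awk -v n="$node" '
        # Only lines that start with "Online:" (optionally indented or
        # bulleted); "GuestOnline:" lines do not match.
        /^[ \t*]*Online:/ {
            for (i = 1; i <= NF; i++) if ($i == n) found = 1
        }
        END { exit !found }
    '
}
```

Example call on a controller: sudo pcs status | node_online controller-3 && echo "controller-3 is online"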

Comment 14 errata-xmlrpc 2019-09-21 11:24:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

Comment 15 Red Hat Bugzilla 2023-09-14 05:41:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days