Bug 1737456 - [OSP15] auto scale-up doesn't add new nodes in the cluster during controller replacement
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-pacemaker
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 15.0 (Stein)
Assignee: RHOS Maint
QA Contact: pkomarov
URL:
Whiteboard:
Duplicates: 1733697
Depends On: 1742169
Blocks:
 
Reported: 2019-08-05 12:05 UTC by Artem Hrechanychenko
Modified: 2023-09-14 05:59 UTC
CC: 12 users

Fixed In Version: puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-21 11:24:21 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible logs (4.57 MB, application/gzip)
2019-08-05 12:05 UTC, Artem Hrechanychenko


Links
Gerrithub.io 465263 (2019-08-18 08:30:37 UTC)
Gerrithub.io 466208 (2019-08-25 19:55:53 UTC)
OpenStack gerrit 674925: MERGED, "pcs 0.10: authenticate nodes before adding them to the cluster" (2020-12-08 06:16:05 UTC)
Red Hat Issue Tracker OSP-28698 (2023-09-14 05:52:17 UTC)
Red Hat Product Errata RHEA-2019:2811 (2019-09-21 11:24:41 UTC)

Description Artem Hrechanychenko 2019-08-05 12:05:57 UTC
Created attachment 1600638
ansible logs

Description of problem:
Controller replacement failed with a timeout.



(undercloud) [stack@undercloud-0 ~]$ cat overcloud_replace.sh 
#!/bin/bash

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/network/dvr-override.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/extra_templates.yaml \
-e /home/stack/rm.yaml


TASK [Start containers for step 4] *********************************************
Monday 05 August 2019  06:27:54 -0400 (0:00:00.309)       1:20:46.432 ********* 
ok: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
Overcloud configuration failed.

Ansible timed out at 4919 seconds.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55394), raddr=('192.168.24.2', 13808)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 60452)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 41754)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=9, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55492), raddr=('192.168.24.2', 13989)>


Step #1 takes much longer than on OSP14.


Version-Release number of selected component (if applicable):
RHOS_TRUNK-15.0-RHEL-8-20190725.n.1
python-openstackclient-lang-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.2-0.20190710165331.c89fe3c.el8ost.noarch
openstack-heat-engine-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
python3-openstacksdk-0.27.0-0.20190405091843.4174082.el8ost.noarch
openstack-heat-common-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
puppet-openstacklib-14.4.1-0.20190420125152.3719ca1.el8ost.noarch
openstack-selinux-0.8.19-0.20190606150404.06faac7.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190705161217.2c8a6a5.el8ost.noarch
puppet-openstack_extras-14.4.1-0.20190420090934.6b1b687.el8ost.noarch
openstack-tripleo-validations-10.5.1-0.20190724100449.23ebc8a.el8ost.noarch
openstack-heat-agents-1.8.1-0.20190523210450.1e15344.el8ost.noarch
openstack-heat-api-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
openstack-tripleo-common-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-tripleo-heat-templates-10.6.1-0.20190725000448.e49b8db.el8ost.noarch
python3-openstackclient-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-common-containers-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-heat-monolith-12.0.1-0.20190704050403.bf16acc.el8ost.noarch


How reproducible:
always

Steps to Reproduce:
1. Deploy OSP15 with 3 controllers + 1 compute.
2. Try to replace a controller using https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/15-beta/html-single/director_installation_and_usage/index#preparing-for-controller-replacement and the fixes from https://bugzilla.redhat.com/show_bug.cgi?id=1722082 (a pre-check sketch follows below).
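
The linked procedure has you confirm cluster health from a surviving controller before starting the replacement. A minimal sketch, assuming the usual TripleO heat-admin user and default hostnames:

# Assumed pre-check: confirm the cluster is healthy on a surviving
# controller before removing the failed one (hostname illustrative).
ssh heat-admin@controller-0 'sudo pcs status'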


Actual results:
Overcloud configuration failed.

Ansible timed out at 4919 seconds.


Expected results:
Controller replacement completes successfully.

Additional info:

Comment 3 Emilien Macchi 2019-08-05 13:50:16 UTC
I think the issue is during the Pacemaker cluster bootstrap.
I grepped the puppet logs from /var/log/messages on the new controller (controller-3) that is replacing the previous one:
http://ix.io/1QG0

Grep for "puppet-user[51963]" and you can see that the Puppet task starts at 12:34:08 and fails one hour later. This is likely the problem.
Now, please tell me why I also see puppet logs from 4 hours ago (check the beginning of the file). Is controller-3 a fresh & clean node? It doesn't sound like that's the case. That *could* be the reason why it takes so long to replace this controller in the cluster.
If it's not the case, we need to find out why the cluster takes so long to bootstrap; we probably want to involve PIDONE at this point.
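
For reference, the grep mentioned above; the PID is specific to this run's /var/log/messages and will differ elsewhere:

# Follow one puppet run on the new controller by its logged PID.
grep 'puppet-user\[51963\]' /var/log/messages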

Comment 5 Damien Ciabrini 2019-08-06 20:14:22 UTC
That is probably because of https://bugs.launchpad.net/tripleo/+bug/1839209, which makes the puppet code retry needlessly and slows down the entire procedure.
I just posted https://review.opendev.org/#/c/674925/ upstream so that puppet-pacemaker correctly adds controller-3 to the cluster; that should fix it.

If that works, I'll use that bz to track the backport downstream.
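
For context, a minimal sketch of the pcs 0.10 sequence the upstream change's summary describes ("authenticate nodes before adding them to the cluster"); run from an existing cluster member, with node name and password illustrative:

# pcs 0.10 split authentication out of cluster setup: a new host must
# be authenticated before "pcs cluster node add" can succeed.
pcs host auth controller-3 -u hacluster -p <password>
pcs cluster node add controller-3 --start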

Comment 6 Artem Hrechanychenko 2019-08-07 10:31:25 UTC
*** Bug 1733697 has been marked as a duplicate of this bug. ***

Comment 8 pkomarov 2019-08-18 08:33:02 UTC
Verification depends on the controller replacement fix: https://review.gerrithub.io/c/rhos-infra/cloud-config/+/465263

Comment 12 pkomarov 2019-09-11 06:46:00 UTC
Verified,

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'rpm -q puppet-pacemaker' 
 [WARNING]: Found both group and host with same name: undercloud

 [WARNING]: Consider using the yum, dnf or zypper module rather than running 'rpm'.  If you need to use command because yum, dnf or zypper is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this
message.

controller-1 | CHANGED | rc=0 >>
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch

New nodes are added and the cluster is in a good state after controller replacement:
http://pastebin.test.redhat.com/796167
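
A hedged way to recheck the cluster state referenced above, reusing the ansible pattern from the rpm query; all controllers should show as Online with resources Started:

# Assumed verification: query pacemaker status through the same
# ansible inventory used above.
ansible controller-1 -mshell -b -a'pcs status'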

Comment 14 errata-xmlrpc 2019-09-21 11:24:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

Comment 15 Red Hat Bugzilla 2023-09-14 05:41:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

