Created attachment 1600638 [details]
ansible logs

Description of problem:
Controller replacement failed by timeout.

(undercloud) [stack@undercloud-0 ~]$ cat overcloud_replace.sh
#!/bin/bash
openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock1.rdu2.redhat.com \
-e /home/stack/virt/config_lvm.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/network/dvr-override.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/nodes_data.yaml \
-e ~/containers-prepare-parameter.yaml \
-e /home/stack/virt/extra_templates.yaml \
-e /home/stack/rm.yaml

TASK [Start containers for step 4] *********************************************
Monday 05 August 2019 06:27:54 -0400 (0:00:00.309) 1:20:46.432 *********
ok: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}

Overcloud configuration failed.
Ansible timed out at 4919 seconds.
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55394), raddr=('192.168.24.2', 13808)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 60452)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 41754)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=9, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.2', 55492), raddr=('192.168.24.2', 13989)>

Step #1 takes much longer than it did on OSP14.

Version-Release number of selected component (if applicable):
RHOS_TRUNK-15.0-RHEL-8-20190725.n.1
python-openstackclient-lang-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.2-0.20190710165331.c89fe3c.el8ost.noarch
openstack-heat-engine-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
python3-openstacksdk-0.27.0-0.20190405091843.4174082.el8ost.noarch
openstack-heat-common-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
puppet-openstacklib-14.4.1-0.20190420125152.3719ca1.el8ost.noarch
openstack-selinux-0.8.19-0.20190606150404.06faac7.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190705161217.2c8a6a5.el8ost.noarch
puppet-openstack_extras-14.4.1-0.20190420090934.6b1b687.el8ost.noarch
openstack-tripleo-validations-10.5.1-0.20190724100449.23ebc8a.el8ost.noarch
openstack-heat-agents-1.8.1-0.20190523210450.1e15344.el8ost.noarch
openstack-heat-api-12.0.1-0.20190704050403.bf16acc.el8ost.noarch
openstack-tripleo-common-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-tripleo-heat-templates-10.6.1-0.20190725000448.e49b8db.el8ost.noarch
python3-openstackclient-3.18.0-0.20190312140834.6868499.el8ost.noarch
openstack-tripleo-common-containers-10.8.1-0.20190719020421.f2a2fd2.el8ost.noarch
openstack-heat-monolith-12.0.1-0.20190704050403.bf16acc.el8ost.noarch

How reproducible:
always

Steps to Reproduce:
1. Deploy OSP15 with 3 controllers + 1 compute
2. Try to replace a controller using https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/15-beta/html-single/director_installation_and_usage/index#preparing-for-controller-replacement and the fixes from https://bugzilla.redhat.com/show_bug.cgi?id=1722082

Actual results:
Overcloud configuration failed. Ansible timed out at 4919 seconds.

Expected results:
Passed.

Additional info:
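The contents of /home/stack/rm.yaml (passed as the last environment file in the deploy command above) are not attached here. A minimal sketch of the removal-policies file the documented replacement procedure uses, assuming the controller with Heat resource index 1 is the node being removed:

# Sketch only -- the real /home/stack/rm.yaml used in this run is not attached.
# Tells Heat which Controller resource to delete while the count stays at 3,
# so the failed node (index 1 assumed here) is removed instead of the newest one.
parameter_defaults:
  ControllerRemovalPolicies:
    [{'resource_list': ['1']}]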
I think the issue is during the Pacemaker cluster bootstrap. I grepped the puppet logs from /var/log/messages on the new controller (controller-3) that is replacing the previous one: http://ix.io/1QG0

Grep for "puppet-user[51963]" and you can see that the Puppet task starts at 12:34:08 and fails one hour later. This is likely the problem.

Now, please tell me why I also see puppet logs from 4 hours earlier (check the beginning of the file). Is controller-3 a fresh and clean node? It doesn't sound like that's the case, and that *could* be the reason why it takes so long to add this controller to the cluster. If that's not it, we need to find out why the cluster takes so long to bootstrap; we probably want to involve PIDONE at this point.
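For reference, roughly how those timestamps were pulled out of the log (the puppet-user PID 51963 is specific to this run, so adjust the pattern for the run being inspected):

# On controller-3; the puppet-user PID changes on every run.
sudo grep 'puppet-user\[51963\]' /var/log/messages | head -n 1   # first entry -> task start (12:34:08)
sudo grep 'puppet-user\[51963\]' /var/log/messages | tail -n 1   # last entry  -> failure ~1 hour later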
That is probably because of https://bugs.launchpad.net/tripleo/+bug/1839209, which makes the puppet code retry needlessly and slows down the entire procedure. I just posted https://review.opendev.org/#/c/674925/ upstream so that puppet-pacemaker correctly adds controller-3 to the cluster; that should fix it. If it works, I'll use this BZ to track the backport downstream.
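For context, the pcs-level equivalent of what puppet-pacemaker is expected to drive when adding the replacement node is roughly the following (a sketch for RHEL 8 / pcs 0.10, run from a surviving cluster member; not the exact puppet code path):

# Run from an existing cluster member, e.g. controller-0.
pcs host auth controller-3 -u hacluster               # authenticate pcsd on the new node
pcs cluster node add controller-3 --start --enable    # add it to the corosync/pacemaker cluster
pcs status nodes                                      # controller-3 should now show as Online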
*** Bug 1733697 has been marked as a duplicate of this bug. ***
Verification depends on the controller replacement fix: https://review.gerrithub.io/c/rhos-infra/cloud-config/+/465263
Verification also depends on https://bugzilla.redhat.com/show_bug.cgi?id=1742169 and https://review.gerrithub.io/c/rhos-infra/cloud-config/+/466208
Verified.

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'rpm -q puppet-pacemaker'
[WARNING]: Found both group and host with same name: undercloud
[WARNING]: Consider using the yum, dnf or zypper module rather than running 'rpm'. If you need to use command because yum, dnf or zypper is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.
controller-1 | CHANGED | rc=0 >>
puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch

New nodes are added and the cluster is in a good state after controller replacement: http://pastebin.test.redhat.com/796167
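The pastebin above is internal; an equivalent ad-hoc check of the cluster state from the undercloud, assuming the same inventory, would be something along these lines:

(undercloud) [stack@undercloud-0 ~]$ ansible controller-0 -b -mshell -a'pcs status nodes'
(undercloud) [stack@undercloud-0 ~]$ ansible controller-0 -b -mshell -a'pcs status | head -n 30'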
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2811
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days