Bug 1737456
| Summary: | [OSP15] auto scale-up doesn't add new nodes in the cluster during controller replacement | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Artem Hrechanychenko <ahrechan> |
| Component: | puppet-pacemaker | Assignee: | RHOS Maint <rhos-maint> |
| Status: | CLOSED ERRATA | QA Contact: | pkomarov |
| Severity: | high | Priority: | high |
| Version: | 15.0 (Stein) | CC: | dbecker, dciabrin, emacchi, jjoyce, jschluet, mburns, morazi, pkomarov, rhos-maint, slinaber, ssmolyak, tvignaud |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 15.0 (Stein) | Hardware: | x86_64 |
| OS: | Linux | Type: | Bug |
| Fixed In Version: | puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost | Doc Type: | If docs needed, set a value |
| Last Closed: | 2019-09-21 11:24:21 UTC | Bug Depends On: | 1742169 |
Description
Artem Hrechanychenko
2019-08-05 12:05:57 UTC
I think the issue occurs during the Pacemaker cluster bootstrap. I grepped the puppet logs from /var/log/messages on the new controller (controller-3) that is replacing the previous one: http://ix.io/1QG0

Grep for "puppet-user[51963]" and you can see that the Puppet task starts at 12:34:08 and fails one hour later. This is likely the problem.

Now, please tell me why I also see puppet logs from 4 hours ago (check the beginning of the file). Is controller-3 a fresh & clean node? It doesn't sound like it. That *could* be the reason why it takes so long to replace this controller in the cluster. If that is not the cause, we need to find out why the cluster takes so long to bootstrap; we probably want to involve PIDONE at this point.

That is probably because of https://bugs.launchpad.net/tripleo/+bug/1839209, which makes the puppet code retry for nothing and slows down the entire procedure.

I just posted https://review.opendev.org/#/c/674925/ upstream so that puppet-pacemaker correctly adds controller-3 to the cluster; that should fix it. If that works, I'll use that bz to track the backport downstream.

*** Bug 1733697 has been marked as a duplicate of this bug. ***

Verification depends on the controller replacement fix: https://review.gerrithub.io/c/rhos-infra/cloud-config/+/465263

Verification depends on https://bugzilla.redhat.com/show_bug.cgi?id=1742169 and https://review.gerrithub.io/c/rhos-infra/cloud-config/+/466208

Verified:

(undercloud) [stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'rpm -q puppet-pacemaker'

[WARNING]: Found both group and host with same name: undercloud

[WARNING]: Consider using the yum, dnf or zypper module rather than running 'rpm'. If you need to use command because yum, dnf or zypper is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.

controller-1 | CHANGED | rc=0 >>

puppet-pacemaker-0.7.3-0.20190807230458.8b30131.el8ost.noarch

New nodes are added and the cluster is in a good state after controller replacement: http://pastebin.test.redhat.com/796167

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.