| Summary: | pcs resource cleanup forms new rabbitmq cluster | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ondrej <ochalups> |
| Component: | resource-agents | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED DUPLICATE | QA Contact: | Ofer Blaut <oblaut> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.1 | CC: | abeekhof, agk, cfeist, cluster-maint, fdinitto, jbuchta, jorge_martinez, mkrcmari, plemenko, rcernin |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-23 14:08:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Can't really comment without a crm/sos report archive.

Is the systemd or OCF agent being used to control rabbit?

First step would be to see if/why the cluster thinks it needs to restart rabbit on the other nodes. It would also be useful to know if restricting the cleanup operation to just that resource+node produces the same behaviour.

The logs only go back to Jan 18 and do not include the period covered by this test. But if the cluster is not restarting rabbit on the other controllers, then it's hard to imagine how it might be the cause.

Moving to the resource-agents package, since that's where the script that controls rabbit lives.

FWIW, the behaviour once you begin stopping/starting things with rabbitmqctl is likely unsupportable. All interaction with rabbit (apart from simulating the initial failure) should go via the cluster.

I'm working on it. I'll provide a status update shortly.

*** Bug 1307056 has been marked as a duplicate of this bug. ***

Status update - I've found a way to fix this issue and rejoin this rogue, lone node back into the cluster. I'm going to propose a workaround soon (so far it doesn't look great). A consolidated sketch of these commands follows after this comment thread.

0. On any other node within the cluster we want to join, you must run:

   rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1

1. This removes the stale record and allows the rogue node to properly re-register itself (roughly speaking, RabbitMQ otherwise treats "ok, already registered" as a "403").

2. On the rogue node you must run, in sequence:

   rabbitmqctl stop_app
   rabbitmqctl reset
   rabbitmqctl join_cluster rabbit@overcloud-controller-2
   # rabbit@overcloud-controller-2 is an Erlang node name which
   # currently belongs to a cluster, not a domain name which
   # is "rabbit
   # here we will get something like "...done (already_member)."
   # if we forget to forget the rogue node (see step #0).
   rabbitmqctl start_app

3. Now you may retrieve the cluster status:

   rabbitmqctl cluster_status

My next step is to figure out why the node forgot about the cluster and prevent this from happening. Again - this is just a workaround. I'll propose a real solution later. Meanwhile you may try resetting the cluster back to a working state as mentioned above.

Petr, this looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1247303, for which we already have a patch in POST for rhel7.3; we just need to build/backport.

Strangely enough, we saw this issue only once in the past, and now it's showing up a lot all of a sudden.

Please review the patch in the other bug and let me know if that would work. Then we can close this bug as a duplicate and speed up the build/backport.

(In reply to Fabio Massimo Di Nitto from comment #17)
> Petr this looks like a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1247303 for which we already
> have a patch in POST for rhel7.3 and we just need to build/backport.
>
> Strangely enough, we did see this issue only once in the past, and now it's
> showing up a lot all of a sudden.
>
> Please review patch in the other bug and let me know if that would work.
> then we can close this bug as duplicate and speed up the build/backport.

Yes, that ticket describes exactly the same issue! We certainly must backport that patch (it does exactly what I proposed above), but I'd like to keep this ticket open for a while. I still want to figure out why a start-up failure can lead to this situation.

*** This bug has been marked as a duplicate of bug 1247303 ***
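For reference, a minimal shell sketch consolidating the rejoin workaround from the status update above. It assumes the rogue node is overcloud-controller-1 and that overcloud-controller-2 is a healthy member of the surviving cluster (both node names are taken from the example above; adjust them to your environment):

```bash
#!/bin/sh
# Sketch of the manual rejoin workaround described above.
# Run this on the rogue node (assumed: overcloud-controller-1).
set -e

# Step 0: on any healthy cluster node (e.g. overcloud-controller-2),
# drop the stale membership record first:
#   rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1

# Steps 2-3: on the rogue node itself, reset and rejoin the cluster.
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@overcloud-controller-2
rabbitmqctl start_app

# Verify that all controllers show up again in nodes/running_nodes.
rabbitmqctl cluster_status
```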
Description of problem:

From customer testing:

I have just performed an additional test:

A) This time I stopped rabbitmq "gently" using:

controller1# rabbitmqctl stop

RESULT: rabbitmq re-joins the cluster without a problem.

B) Simulate a crash of the rabbitmq process, but allow pacemaker to restart it; this time we are not impeding the restart by adding syntax errors to the config.

controller1# ps -ef | grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill

We saw that pacemaker tries to clean up using a stop and then starts the process:

[root@ost-ah1-controller1 rabbitmq]# ps -ef | grep rabbit
rabbitmq  2358     1  0 16:25 ?        00:00:00 /usr/lib64/erlang/erts-5.10.4/bin/epmd -daemon
root      4361  3334  0 16:25 ?        00:00:00 /bin/sh /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster stop
root      4391  4361  0 16:25 ?        00:00:00 /bin/sh /usr/sbin/rabbitmqctl cluster_status
root      4398 18697  0 16:25 pts/0    00:00:00 grep --color=auto rabbit

[root@ost-ah1-controller1 rabbitmq]# ps -ef | grep rabbit
rabbitmq  2358     1  0 16:25 ?        00:00:00 /usr/lib64/erlang/erts-5.10.4/bin/epmd -daemon
root      4464  3334  0 16:25 ?        00:00:00 /bin/sh /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster start
root      4626  4464  0 16:25 ?        00:00:00 sh -c /usr/sbin/rabbitmq-server > /var/log/rabbitmq/startup_log 2> /var/log/rabbitmq/startup_err
root      4630 18697  0 16:25 pts/0    00:00:00 grep --color=auto rabbit
root      4633  4626  0 16:25 ?        00:00:00 /bin/sh /usr/sbin/rabbitmq-server

After a while, we saw that:

> No other process was stopped during the time it was respawning - good
> Rabbitmq has joined the cluster:

[ost-ah1-controller0 at Wed Jan 13 16:26:53 CET 2016(1452698813)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:26:54 CET 2016(1452698814)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:26:56 CET 2016(1452698816)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller0',
                 'rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

And this time it works fine!

Let's now repeat the original test and prevent rabbitmq from being restarted ... this time we will use another strategy: simulate that the socket is already in use (due to a previously stalled process, for example):

controller1 # ps -ef | grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill
controller1 # nc -l 35672

# cat /var/log/rabbitmq/startup_err
Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{k

Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ ost-ah1-controller0 ost-ah1-controller2 ]
     Stopped: [ ost-ah1-controller1 ]

Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

And now we remove the impediment by freeing the port ('nc -l 35672') and ask pacemaker to clean up:

# pcs resource cleanup rabbitmq-clone
Resource: rabbitmq-clone successfully cleaned up

And finally ...... we got the expected error at last:

[ost-ah1-controller0 at Wed Jan 13 16:38:15 CET 2016(1452699495)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller2','rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:38:16 CET 2016(1452699496)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1']}]},
 {running_nodes,['rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},   <--------------------------- controller1 has started its own cluster and is not aware of the rest
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:38:17 CET 2016(1452699497)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

I tried to fix the situation as follows:

1) Try a gentle stop of the "confused" rabbitmq server on controller1:

controller1# rabbitmqctl stop

No difference; it starts over but no luck - still isolated:

[ost-ah1-controller1 at Wed Jan 13 16:44:28 CET 2016(1452699868)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1']}]},
 {running_nodes,['rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

2) Stop rabbitmq on controller2, for example:

controller2# rabbitmqctl stop

Which cluster will it join, the old or the new one? The new cluster won!

[ost-ah1-controller0 at Wed Jan 13 16:46:46 CET 2016(1452700006)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:46:48 CET 2016(1452700008)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller2','rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:46:49 CET 2016(1452700009)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

Now {cluster_name,<<"rabbit">>} has two members, and the old cluster is only formed by controller0 ...

As we expect, controller0 joins the new cluster after we stop it:

controller0# rabbitmqctl stop
Stopping and halting node 'rabbit@ost-ah1-controller0'

Final status:

[ost-ah1-controller0 at Wed Jan 13 16:49:32 CET 2016(1452700172)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:49:34 CET 2016(1452700174)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:49:35 CET 2016(1452700175)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                 'rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.4.x86_64
pacemaker-cli-1.1.12-22.el7_1.4.x86_64
pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
pacemaker-libs-1.1.12-22.el7_1.4.x86_64

How reproducible:

Steps to Reproduce:
1. Kill rabbit.
2. Occupy its port: nc -l 35672
3. pcs resource cleanup rabbitmq-clone
4. The cluster node forms a new cluster, and a start/stop of the other nodes makes them join the new one.

Actual results:
Nodes leave the cluster and form a new one.

Expected results:
All nodes stay in the same cluster.

Additional info:
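A minimal reproduction sketch of the steps above, run on one controller. The port 35672, the resource name rabbitmq-clone, and the kill/nc commands are taken from this report; the sleep delay is an illustrative assumption, not part of the original test:

```bash
#!/bin/sh
# Sketch of the reproduction steps above, run on one controller node.
# Assumes the pacemaker resource is named rabbitmq-clone and that the
# port blocked in the customer's test is 35672.

# 1. Kill the running rabbitmq processes (simulated crash).
ps -ef | grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill

# 2. Occupy the port so the restart attempted by pacemaker keeps failing.
nc -l 35672 &
NC_PID=$!

# Give pacemaker time to run its failed start attempts (illustrative delay).
sleep 60

# 3. Free the port again and ask pacemaker to clean up the failed resource.
kill "$NC_PID"
pcs resource cleanup rabbitmq-clone

# 4. Check whether this node has formed its own single-node cluster.
rabbitmqctl cluster_status
```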