Bug 1299923 - pcs resource cleanup forms new rabbitmq cluster
Status: CLOSED DUPLICATE of bug 1247303
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Peter Lemenkov
QA Contact: Ofer Blaut
Duplicates: 1307056
Depends On:
Blocks:
 
Reported: 2016-01-19 09:52 EST by Ondrej
Modified: 2016-11-21 11:25 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-23 09:08:09 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Ondrej 2016-01-19 09:52:34 EST
Description of problem:
From customer testing:
I have just performed an additional test:

A) This time I stopped rabbitmq "gently" using:
controller1# rabbitmqctl stop

 RESULT: rabbitmq re-joined the cluster without a problem.

B) Simulate a crash of the rabbitmq process, but allow pacemaker to restart it; this time we are not impeding the restart by adding syntax errors to the config.

controller1# ps -ef |grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill

We can see that pacemaker tries to clean up by stopping rabbitmq and then starting the process again:

[root@ost-ah1-controller1 rabbitmq]# ps -ef |grep rabbit
rabbitmq  2358     1  0 16:25 ?        00:00:00 /usr/lib64/erlang/erts-5.10.4/bin/epmd -daemon
root      4361  3334  0 16:25 ?        00:00:00 /bin/sh /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster stop
root      4391  4361  0 16:25 ?        00:00:00 /bin/sh /usr/sbin/rabbitmqctl cluster_status
root      4398 18697  0 16:25 pts/0    00:00:00 grep --color=auto rabbit
[root@ost-ah1-controller1 rabbitmq]# ps -ef |grep rabbit
rabbitmq  2358     1  0 16:25 ?        00:00:00 /usr/lib64/erlang/erts-5.10.4/bin/epmd -daemon
root      4464  3334  0 16:25 ?        00:00:00 /bin/sh /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster start
root      4626  4464  0 16:25 ?        00:00:00 sh -c /usr/sbin/rabbitmq-server > /var/log/rabbitmq/startup_log 2> /var/log/rabbitmq/startup_err
root      4630 18697  0 16:25 pts/0    00:00:00 grep --color=auto rabbit
root      4633  4626  0 16:25 ?        00:00:00 /bin/sh /usr/sbin/rabbitmq-server

After a while, we saw that:

> No other processes were stopped while it was respawning - good
> RabbitMQ has joined the cluster:
[ost-ah1-controller0 at Wed Jan 13 16:26:53 CET 2016(1452698813)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit@ost-ah1-controller0.localdomain">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:26:54 CET 2016(1452698814)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit@ost-ah1-controller0.localdomain">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:26:56 CET 2016(1452698816)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller0',
                 'rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit@ost-ah1-controller0.localdomain">>},
 {partitions,[]}]
...done.

And this time it is working fine!

Let's repeat the original test and prevent rabbitmq from being restarted ... this time we will use another strategy: simulate that the socket is in use (due to a previously stalled process, for example):

controller1 # ps -ef |grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill

controller1 # nc -l 35672

# cat /var/log/rabbitmq/startup_err

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{k

 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ ost-ah1-controller0 ost-ah1-controller2 ]
     Stopped: [ ost-ah1-controller1 ]

Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit@ost-ah1-controller0.localdomain">>},
 {partitions,[]}]
...done.

And now we remove the impediment by freeing the port (stopping 'nc -l 35672') and ask pacemaker to clean up:

# pcs resource cleanup rabbitmq-clone
Resource: rabbitmq-clone successfully cleaned up

And finally ...... we got the expected error at last:

[ost-ah1-controller0 at Wed Jan 13 16:38:15 CET 2016(1452699495)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller2','rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit@ost-ah1-controller0.localdomain">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:38:16 CET 2016(1452699496)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1']}]},
 {running_nodes,['rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>},  <--------------------------- controller1 has start his own cluster and is not aware of the rest
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:38:17 CET 2016(1452699497)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit@ost-ah1-controller0.localdomain">>},
 {partitions,[]}]
...done.

I tried to fix the situation as follows:

1) Try a gentle stop of the "confused" rabbitmq server on controller1 - rabbitmqctl stop
   -- No difference; it started over, but no luck - still isolated:

[ost-ah1-controller1 at Wed Jan 13 16:44:28 CET 2016(1452699868)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1']}]},
 {running_nodes,['rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>},
 {partitions,[]}]
...done.

2) Stop rabbitmq on controller2 for example:

controller2# rabbitmqctl stop

Which cluster will it join, the old or the new one?

New cluster won!


[ost-ah1-controller0 at Wed Jan 13 16:46:46 CET 2016(1452700006)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit@ost-ah1-controller0.localdomain">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:46:48 CET 2016(1452700008)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller2','rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:46:49 CET 2016(1452700009)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>},
 {partitions,[]}]
...done.

Now {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>} has two members; the old cluster is formed only by controller0 ...

As expected, controller0 joins the new cluster after we stop it:

controller0# rabbitmqctl stop
Stopping and halting node 'rabbit@ost-ah1-controller0'

Final status:

[ost-ah1-controller0 at Wed Jan 13 16:49:32 CET 2016(1452700172)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:49:34 CET 2016(1452700174)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:49:35 CET 2016(1452700175)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                 'rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit@ost-ah1-controller1.localdomain">>},
 {partitions,[]}]
...done.


Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.4.x86_64
pacemaker-cli-1.1.12-22.el7_1.4.x86_64
pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
pacemaker-libs-1.1.12-22.el7_1.4.x86_64


How reproducible:


Steps to Reproduce:
1. kill rabbit
2. occupy its port - nc -l 35672
3. pcs resource cleanup rabbitmq-clone
4. the cleaned-up node forms a new cluster, and stopping/starting the other nodes makes them join the new one (see the consolidated sketch below)
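
For convenience, here is the reproduction as one command sequence (a consolidated sketch only; every command, the resource name rabbitmq-clone and the port 35672 are taken from the test above, nothing new is assumed):

controller1# ps -ef | grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill   # 1. kill rabbit
controller1# nc -l 35672                             # 2. occupy its port so the automatic restart fails
                                                     #    (in the test above, nc was stopped again before the cleanup)
controller1# pcs resource cleanup rabbitmq-clone     # 3. ask pacemaker to clean up
controller1# rabbitmqctl cluster_status              # 4. the node now reports a single-node cluster of its own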

Actual results:
nodes leave the cluster and form a new one

Expected results:
all nodes stay in the same cluster

Additional info:
Comment 3 Andrew Beekhof 2016-01-24 16:57:34 EST
Can't really comment without a crm/sos report archive.

Is the systemd or OCF agent being used to control rabbit?
First step would be to see if/why the cluster thinks it needs to restart rabbit on the other nodes.

It would also be useful to know if restricting the cleanup operation to just that resource+node produces the same behaviour.
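
For reference, a cleanup restricted to just that resource and node could look roughly like this (a sketch only, using crm_resource options available in pacemaker 1.1; the primitive and node names are taken from the report above):

controller1# crm_resource --cleanup --resource rabbitmq --node ost-ah1-controller1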
Comment 5 Andrew Beekhof 2016-01-27 19:27:40 EST
The logs only go back to Jan 18 and do not include the period covered by this test.
But if the cluster is not restarting rabbit on the other controllers, then it's hard to imagine how it might be the cause.

Moving to the resource-agents package since that's where the script that controls rabbit lives.

FWIW, the behaviour once you begin stopping/starting things with rabbitmqctl is likely unsupportable. All interaction with rabbit (apart from simulating the initial failure) should go via the cluster.
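
For example, stopping and starting rabbit through the cluster rather than with rabbitmqctl could look roughly like this (a sketch of standard pcs commands, not commands taken from this report; the resource and node names are the ones used above):

# stop and start the whole clone via pacemaker
pcs resource disable rabbitmq-clone
pcs resource enable rabbitmq-clone

# or temporarily keep rabbit off a single node and let pacemaker bring it back
pcs resource ban rabbitmq-clone ost-ah1-controller1
pcs resource clear rabbitmq-clone ost-ah1-controller1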
Comment 8 Peter Lemenkov 2016-02-10 07:59:49 EST
I'm working on it. I'll provide a status update shortly.
Comment 12 Fabio Massimo Di Nitto 2016-02-13 02:17:18 EST
*** Bug 1307056 has been marked as a duplicate of this bug. ***
Comment 15 Peter Lemenkov 2016-02-17 17:53:06 EST
Status update - I've found a way to fix this issue and to rejoin this rogue lone node to the cluster. I'm going to propose a workaround soon (so far it doesn't look great).

0. On any other node within the cluster we want the rogue node to join, you must run:
   
rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1

1. This removes a stale record and allows the rogue node to properly re-register itself (roughly speaking, it actually treats "ok, already registered" as a "403").

2. On the rogue node you must then run, in order:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@overcloud-controller-2
# rabbit@overcloud-controller-2 is an Erlang node name which
# currently belongs to the cluster, not a domain name, which
# would be "rabbit@overcloud-controller-2.localdomain"

# here we will get something like "...done (already_member)." if we forgot to forget the rogue node (see step #0).

rabbitmqctl start_app

3. Now you can retrieve the cluster status:

rabbitmqctl cluster_status
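
Putting the workaround together in one place, annotated with where each command runs (overcloud-controller-1 stands for the rogue node and overcloud-controller-2 for any healthy cluster member, as in the example above):

# on any healthy member of the surviving cluster:
rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1

# on the rogue node:
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@overcloud-controller-2
rabbitmqctl start_app

# verify from any node:
rabbitmqctl cluster_status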

My next step is to figure out why the node forgot about the cluster and prevent this from happening.
Comment 16 Peter Lemenkov 2016-02-17 17:58:05 EST
Again - this is just a workaround. I'll propose a real solution later.

Meanwhile you may try resetting the cluster back to a working state as mentioned above.
Comment 17 Fabio Massimo Di Nitto 2016-02-18 00:08:38 EST
Petr this looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1247303 for which we already have a patch in POST for rhel7.3 and we just need to build/backport.

Strangely enough, we did see this issue only once in the past, and now it's showing up a lot all of a sudden.

Please review patch in the other bug and let me know if that would work. then we can close this bug as duplicate and speed up the build/backport.
Comment 18 Peter Lemenkov 2016-02-18 03:07:52 EST
(In reply to Fabio Massimo Di Nitto from comment #17)
> Petr this looks like a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1247303 for which we already
> have a patch in POST for rhel7.3 and we just need to build/backport.
> 
> Strangely enough, we did see this issue only once in the past, and now it's
> showing up a lot all of a sudden.
> 
> Please review patch in the other bug and let me know if that would work.
> then we can close this bug as duplicate and speed up the build/backport.

Yes, that ticket describes exactly the same issue!

We certainly must backport that patch (it does exactly what I proposed above), but I'd like to keep this ticket open for a while. I still want to figure out why a start-up failure can lead to this situation.
Comment 19 Fabio Massimo Di Nitto 2016-02-23 09:08:09 EST

*** This bug has been marked as a duplicate of bug 1247303 ***
