| Summary: | pcs resource cleanup forms new rabbitmq cluster | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Ondrej <ochalups> |
| Component: | resource-agents | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED DUPLICATE | QA Contact: | Ofer Blaut <oblaut> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.1 | CC: | abeekhof, agk, cfeist, cluster-maint, fdinitto, jbuchta, jorge_martinez, mkrcmari, plemenko, rcernin |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-23 14:08:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Can't really comment without a crm/sos report archive.

Is the systemd or OCF agent being used to control rabbit?

First step would be to see if/why the cluster thinks it needs to restart rabbit on the other nodes. It would also be useful to know if restricting the cleanup operation to just that resource+node produces the same behaviour.

The logs only go back to Jan 18 and do not include the period covered by this test. But if the cluster is not restarting rabbit on the other controllers, then it's hard to imagine how it might be the cause.

Moving to the resource-agents package, since that's where the script that controls rabbit lives.

FWIW, the behaviour once you begin stopping/starting things with rabbitmqctl is likely unsupportable. All interaction with rabbit (apart from simulating the initial failure) should go via the cluster.

I'm working on it. I'll provide a status update shortly.

*** Bug 1307056 has been marked as a duplicate of this bug. ***

Status update - I've found a way to fix this issue and rejoin this rogue, lone node back into the cluster. I'm going to propose a workaround soon (so far it doesn't look great). A consolidated sketch of these commands follows after this comment thread.

0. On any other node within the cluster we want to join, you must run:

   rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1

1. This removes the stale record and allows the rogue node to properly re-register itself (roughly speaking, RabbitMQ otherwise treats "ok, already registered" as a "403").

2. On the rogue node you must run, in sequence:

   rabbitmqctl stop_app
   rabbitmqctl reset
   rabbitmqctl join_cluster rabbit@overcloud-controller-2
   # rabbit@overcloud-controller-2 is an Erlang node name which
   # currently belongs to a cluster, not a domain name which
   # is "rabbit
   # here we will get something like "...done (already_member)."
   # if we forget to forget the rogue node (see step #0).
   rabbitmqctl start_app

3. Now you may retrieve the cluster status:

   rabbitmqctl cluster_status

My next step is to figure out why the node forgot about the cluster and prevent this from happening. Again - this is just a workaround. I'll propose a real solution later. Meanwhile you may try resetting the cluster back to a working state as mentioned above.

Petr, this looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1247303, for which we already have a patch in POST for rhel7.3; we just need to build/backport.

Strangely enough, we saw this issue only once in the past, and now it's showing up a lot all of a sudden.

Please review the patch in the other bug and let me know if that would work. Then we can close this bug as a duplicate and speed up the build/backport.

(In reply to Fabio Massimo Di Nitto from comment #17)
> Petr this looks like a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1247303 for which we already
> have a patch in POST for rhel7.3 and we just need to build/backport.
>
> Strangely enough, we did see this issue only once in the past, and now it's
> showing up a lot all of a sudden.
>
> Please review patch in the other bug and let me know if that would work.
> then we can close this bug as duplicate and speed up the build/backport.

Yes, that ticket describes exactly the same issue! We certainly must backport that patch (it does exactly what I proposed above), but I'd like to keep this ticket open for a while. I still want to figure out why a start-up failure can lead to this situation.

*** This bug has been marked as a duplicate of bug 1247303 ***
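For reference, a minimal shell sketch consolidating the rejoin workaround from the status update above. It assumes the rogue node is overcloud-controller-1 and that overcloud-controller-2 is a healthy member of the surviving cluster (both node names are taken from the example above; adjust them to your environment):

```bash
#!/bin/sh
# Sketch of the manual rejoin workaround described above.
# Run this on the rogue node (assumed: overcloud-controller-1).
set -e

# Step 0: on any healthy cluster node (e.g. overcloud-controller-2),
# drop the stale membership record first:
#   rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1

# Steps 2-3: on the rogue node itself, reset and rejoin the cluster.
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@overcloud-controller-2
rabbitmqctl start_app

# Verify that all controllers show up again in nodes/running_nodes.
rabbitmqctl cluster_status
```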
Description of problem:

From customer testing:

I have just performed an additional test:

A) This time I stopped rabbitmq "gently" using:

controller1# rabbitmqctl stop

RESULT: rabbitmq re-joins the cluster without a problem.

B) Simulate a crash of the rabbitmq process, but allow pacemaker to restart it; this time we are not impeding the restart by adding syntax errors to the config.

controller1# ps -ef | grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill

We saw that pacemaker tries to clean up using a stop and then starts the process:

[root@ost-ah1-controller1 rabbitmq]# ps -ef | grep rabbit
rabbitmq  2358     1  0 16:25 ?        00:00:00 /usr/lib64/erlang/erts-5.10.4/bin/epmd -daemon
root      4361  3334  0 16:25 ?        00:00:00 /bin/sh /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster stop
root      4391  4361  0 16:25 ?        00:00:00 /bin/sh /usr/sbin/rabbitmqctl cluster_status
root      4398 18697  0 16:25 pts/0    00:00:00 grep --color=auto rabbit

[root@ost-ah1-controller1 rabbitmq]# ps -ef | grep rabbit
rabbitmq  2358     1  0 16:25 ?        00:00:00 /usr/lib64/erlang/erts-5.10.4/bin/epmd -daemon
root      4464  3334  0 16:25 ?        00:00:00 /bin/sh /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster start
root      4626  4464  0 16:25 ?        00:00:00 sh -c /usr/sbin/rabbitmq-server > /var/log/rabbitmq/startup_log 2> /var/log/rabbitmq/startup_err
root      4630 18697  0 16:25 pts/0    00:00:00 grep --color=auto rabbit
root      4633  4626  0 16:25 ?        00:00:00 /bin/sh /usr/sbin/rabbitmq-server

After a while, we saw that:

> No other process was stopped during the time it was respawning - good
> Rabbitmq has joined the cluster:

[ost-ah1-controller0 at Wed Jan 13 16:26:53 CET 2016(1452698813)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:26:54 CET 2016(1452698814)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:26:56 CET 2016(1452698816)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller0',
                 'rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

And this time it works fine!

Let's now repeat the original test and prevent rabbitmq from being restarted ... this time we will use another strategy: simulate that the socket is already in use (due to a previously stalled process, for example):

controller1 # ps -ef | grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill
controller1 # nc -l 35672

# cat /var/log/rabbitmq/startup_err
Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{k

Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ ost-ah1-controller0 ost-ah1-controller2 ]
     Stopped: [ ost-ah1-controller1 ]

Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

And now we remove the impediment by freeing the port ('nc -l 35672') and ask pacemaker to clean up:

# pcs resource cleanup rabbitmq-clone
Resource: rabbitmq-clone successfully cleaned up

And finally ...... we got the expected error at last:

[ost-ah1-controller0 at Wed Jan 13 16:38:15 CET 2016(1452699495)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller2','rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:38:16 CET 2016(1452699496)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1']}]},
 {running_nodes,['rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},   <--------------------------- controller1 has started its own cluster and is not aware of the rest
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:38:17 CET 2016(1452699497)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

I tried to fix the situation as follows:

1) Try a gentle stop of the "confused" rabbitmq server on controller1:

controller1# rabbitmqctl stop

No difference; it starts over but no luck - still isolated:

[ost-ah1-controller1 at Wed Jan 13 16:44:28 CET 2016(1452699868)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1']}]},
 {running_nodes,['rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

2) Stop rabbitmq on controller2, for example:

controller2# rabbitmqctl stop

Which cluster will it join, the old or the new one? The new cluster won!

[ost-ah1-controller0 at Wed Jan 13 16:46:46 CET 2016(1452700006)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:46:48 CET 2016(1452700008)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller2','rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:46:49 CET 2016(1452700009)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

Now {cluster_name,<<"rabbit">>} has two members, and the old cluster is only formed by controller0 ...

As we expect, controller0 joins the new cluster after we stop it:

controller0# rabbitmqctl stop
Stopping and halting node 'rabbit@ost-ah1-controller0'

Final status:

[ost-ah1-controller0 at Wed Jan 13 16:49:32 CET 2016(1452700172)]
Cluster status of node 'rabbit@ost-ah1-controller0' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller1','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller0']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller1 at Wed Jan 13 16:49:34 CET 2016(1452700174)]
Cluster status of node 'rabbit@ost-ah1-controller1' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller2',
                 'rabbit@ost-ah1-controller1']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

[ost-ah1-controller2 at Wed Jan 13 16:49:35 CET 2016(1452700175)]
Cluster status of node 'rabbit@ost-ah1-controller2' ...
[{nodes,[{disc,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                'rabbit@ost-ah1-controller2']}]},
 {running_nodes,['rabbit@ost-ah1-controller0','rabbit@ost-ah1-controller1',
                 'rabbit@ost-ah1-controller2']},
 {cluster_name,<<"rabbit">>},
 {partitions,[]}]
...done.

Version-Release number of selected component (if applicable):
pacemaker-1.1.12-22.el7_1.4.x86_64
pacemaker-cli-1.1.12-22.el7_1.4.x86_64
pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
pacemaker-libs-1.1.12-22.el7_1.4.x86_64

How reproducible:

Steps to Reproduce:
1. Kill rabbit.
2. Occupy its port: nc -l 35672
3. pcs resource cleanup rabbitmq-clone
4. The cluster node forms a new cluster, and a start/stop of the other nodes makes them join the new one.

Actual results:
Nodes leave the cluster and form a new one.

Expected results:
All nodes stay in the same cluster.

Additional info:
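A minimal reproduction sketch of the steps above, run on one controller. The port 35672, the resource name rabbitmq-clone, and the kill/nc commands are taken from this report; the sleep delay is an illustrative assumption, not part of the original test:

```bash
#!/bin/sh
# Sketch of the reproduction steps above, run on one controller node.
# Assumes the pacemaker resource is named rabbitmq-clone and that the
# port blocked in the customer's test is 35672.

# 1. Kill the running rabbitmq processes (simulated crash).
ps -ef | grep rabbitmq | grep -v grep | awk '{print $2}' | xargs kill

# 2. Occupy the port so the restart attempted by pacemaker keeps failing.
nc -l 35672 &
NC_PID=$!

# Give pacemaker time to run its failed start attempts (illustrative delay).
sleep 60

# 3. Free the port again and ask pacemaker to clean up the failed resource.
kill "$NC_PID"
pcs resource cleanup rabbitmq-clone

# 4. Check whether this node has formed its own single-node cluster.
rabbitmqctl cluster_status
```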