Bug 1362117
| Summary: | RabbitMQ fails to rejoin the cluster after some consecutive start/stop actions | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Raoul Scarazzini <rscarazz> |
| Component: | resource-agents | Assignee: | John Eckersberg <jeckersb> |
| Status: | CLOSED DUPLICATE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.2 | CC: | agk, cluster-maint, fdinitto, michele, plemenko, rscarazz, ushkalim |
| Target Milestone: | rc | ||
| Target Release: | 7.2 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-12-07 14:00:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
All sosreports are available at this address [1] with this directory tree: BZ1362117 |-osp9 | |- with_resource-agents-3.9.5-54.el7 | \- with_resource-agents-3.9.5-80.el7 \-newton |- with_resource-agents-3.9.5-54.el7 \- with_resource-agents-3.9.5-80.el7 In every try the result is the same, what changes is just the number of starts/stops that precedes the failure. [1] http://file.rdu.redhat.com/~rscarazz/BZ1362117/ As an additional test I installed the exact rpm version from zstream (version resource-agents-3.9.5-54.el7_2.16 as from here [1]), and after 88 tries we got the same error:
rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=27889, status=complete, exitreason='none', last-rc-change='Tue Aug 2
17:41:29 2016', queued=0ms, exec=8124ms
And from the cluster log the error seems pretty much clear:
Aug 02 17:41:37 [40741] overcloud-controller-0.localdomain cib: info: crm_compress_string: Compressed 271070 bytes into 11028 (ratio 24:1) in 32ms
rabbitmq-cluster(rabbitmq)[23454]: 2016/08/02_17:41:37 INFO: node failed to join even after reseting local data. Check SELINUX policy
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ Error: process_not_running ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ Error: unable to connect to node 'rabbit@overcloud-con
troller-0': nodedown ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ DIAGNOSTICS ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ =========== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'
] ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ rabbit@overcloud-controller-0: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ * connected to epmd (port 4369) on overcloud-control
ler-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ * epmd reports: node 'rabbit' not running at all ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ no other nodes on overcloud-controller-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ * suggestion: start the node ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ current node details: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ - node name: 'rabbitmq-cli-15@overcloud-controller-0' ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ - cookie hash: YB33eAHDuEIsi0xTrQP4DA== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ Error: process_not_running ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ DIAGNOSTICS ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ =========== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'] ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ rabbit@overcloud-controller-0: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ * connected to epmd (port 4369) on overcloud-controller-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ * epmd reports: node 'rabbit' not running at all ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ no other nodes on overcloud-controll:
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ * suggestion: start the node ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ current node details: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ - node name: 'rabbitmq-cli-77@overcloud-controller-0' ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ - cookie hash: YB33eAHDuEIsi0xTrQP4DA== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:23454:stderr [ ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain lrmd: info: log_finished: finished - rsc:rabbitmq action:start call_id:27889 pid:23454 exit-code:1 exec-time:8124ms queue-time:0ms
I attached all the new sosreport here [2], these were taken after while so you need to check the time while the error happened (Aug 02 17:41:37).
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1351446
[2] http://file.rdu.redhat.com/~rscarazz/BZ1362117/osp9/with_resource-agents-3.9.5-54.el7_2.16/
*** This bug has been marked as a duplicate of bug 1397393 *** |
Description of problem: There's a race condition that prevents rabbit cluster to start correctly inside OSP environments deployed via Director. Problem is present with both the actual release of resource-agents (3.9.5-54.el7) and the patched one that comes from [1] (3.9.5-80.el7). Once you stop and start rabbitmq component, sometimes it fails to start on one node with this error: rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=21217, status=complete, exitreason='none', last-rc-change='Sat Jul 30 06:44:32 2016', queued=0ms, exec=7128ms Looking inside application logs of the failed node problem seems to be related to the joining of the cluster: Error description: {error,{inconsistent_cluster,"Node 'rabbit@overcloud-controller-1' thinks it's clustered with node 'rabbit@overcloud-controller-2', but 'rabbit@overcloud-controller-2' disagrees"}} [1] https://bugzilla.redhat.com/show_bug.cgi?id=1343905 Version-Release number of selected component (if applicable): resource-agents-3.9.5-54.el7 and resource-agents-3.9.5-80.el7 How reproducible: The race happens when you repeat indefinitely cycles of stop/start of the resource. With the non patched release it happened after 16 runs, with the patched release after 265. Actual results: Rabbit cluster fails on one node, cluster's resource is failed, cluster overall status is unclean. Expected results: Starts and stops should always succeed in a consecutive undefinable number of times. Additional info: Tests were made also using CentOS and Newton, with the same results.