Bug 1362117

Summary: RabbitMQ fails to rejoin the cluster after some consecutive start/stop actions
Product: Red Hat Enterprise Linux 7 Reporter: Raoul Scarazzini <rscarazz>
Component: resource-agentsAssignee: John Eckersberg <jeckersb>
Status: CLOSED DUPLICATE QA Contact: cluster-qe <cluster-qe>
Severity: high Docs Contact:
Priority: medium    
Version: 7.2CC: agk, cluster-maint, fdinitto, michele, plemenko, rscarazz, ushkalim
Target Milestone: rc   
Target Release: 7.2   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-12-07 14:00:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Raoul Scarazzini 2016-08-01 10:21:44 UTC
Description of problem:

There's a race condition that prevents rabbit cluster to start correctly inside OSP environments deployed via Director.
Problem is present with both the actual release of resource-agents (3.9.5-54.el7) and the patched one that comes from [1] (3.9.5-80.el7).
Once you stop and start rabbitmq component, sometimes it fails to start on one node with this error:

rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=21217, status=complete, exitreason='none', last-rc-change='Sat Jul 30
 06:44:32 2016', queued=0ms, exec=7128ms

Looking inside application logs of the failed node problem seems to be related to the joining of the cluster:

Error description:
   {error,{inconsistent_cluster,"Node 'rabbit@overcloud-controller-1' thinks it's clustered with node 'rabbit@overcloud-controller-2', but 'rabbit@overcloud-controller-2' disagrees"}}

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1343905

Version-Release number of selected component (if applicable):

resource-agents-3.9.5-54.el7 and resource-agents-3.9.5-80.el7

How reproducible:

The race happens when you repeat indefinitely cycles of stop/start of the resource. With the non patched release it happened after 16 runs, with the patched release after 265.

Actual results:

Rabbit cluster fails on one node, cluster's resource is failed, cluster overall status is unclean.

Expected results:

Starts and stops should always succeed in a consecutive undefinable number of times.

Additional info:

Tests were made also using CentOS and Newton, with the same results.

Comment 1 Raoul Scarazzini 2016-08-01 10:27:10 UTC
All sosreports are available at this address [1] with this directory tree:

BZ1362117
|-osp9
| |- with_resource-agents-3.9.5-54.el7
| \- with_resource-agents-3.9.5-80.el7
\-newton
  |- with_resource-agents-3.9.5-54.el7
  \- with_resource-agents-3.9.5-80.el7

In every try the result is the same, what changes is just the number of starts/stops that precedes the failure.

[1] http://file.rdu.redhat.com/~rscarazz/BZ1362117/

Comment 3 Raoul Scarazzini 2016-08-03 07:30:06 UTC
As an additional test I installed the exact rpm version from zstream (version resource-agents-3.9.5-54.el7_2.16 as from here [1]), and after 88 tries we got the same error:

rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=27889, status=complete, exitreason='none', last-rc-change='Tue Aug 2 
17:41:29 2016', queued=0ms, exec=8124ms

And from the cluster log the error seems pretty much clear:

Aug 02 17:41:37 [40741] overcloud-controller-0.localdomain        cib:     info: crm_compress_string:   Compressed 271070 bytes into 11028 (ratio 24:1) in 32ms
rabbitmq-cluster(rabbitmq)[23454]:      2016/08/02_17:41:37 INFO: node failed to join even after reseting local data. Check SELINUX policy
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: process_not_running ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: unable to connect to node 'rabbit@overcloud-con
troller-0': nodedown ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ DIAGNOSTICS ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ =========== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'
] ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ rabbit@overcloud-controller-0: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * connected to epmd (port 4369) on overcloud-control
ler-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * epmd reports: node 'rabbit' not running at all ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [                   no other nodes on overcloud-controller-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * suggestion: start the node ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ current node details: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - node name: 'rabbitmq-cli-15@overcloud-controller-0' ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - cookie hash: YB33eAHDuEIsi0xTrQP4DA== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: process_not_running ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ DIAGNOSTICS ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ =========== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'] ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ rabbit@overcloud-controller-0: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * connected to epmd (port 4369) on overcloud-controller-0 ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * epmd reports: node 'rabbit' not running at all ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [                   no other nodes on overcloud-controll:
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [   * suggestion: start the node ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ current node details: ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - node name: 'rabbitmq-cli-77@overcloud-controller-0' ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [ - cookie hash: YB33eAHDuEIsi0xTrQP4DA== ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:   notice: operation_finished:    rabbitmq_start_0:23454:stderr [  ]
Aug 02 17:41:37 [40743] overcloud-controller-0.localdomain       lrmd:     info: log_finished:  finished - rsc:rabbitmq action:start call_id:27889 pid:23454 exit-code:1 exec-time:8124ms queue-time:0ms

I attached all the new sosreport here [2], these were taken after while so you need to check the time while the error happened (Aug 02 17:41:37).

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1351446
[2] http://file.rdu.redhat.com/~rscarazz/BZ1362117/osp9/with_resource-agents-3.9.5-54.el7_2.16/

Comment 5 Peter Lemenkov 2016-12-07 14:00:10 UTC

*** This bug has been marked as a duplicate of bug 1397393 ***