Bug 1367429

Summary: RabbitMQ Fails to Start During OSP-9 HA Deployments
Product: Red Hat OpenStack Reporter: Udi Shkalim <ushkalim>
Component: rabbitmq-server    Assignee: Peter Lemenkov <plemenko>
Status: CLOSED NOTABUG QA Contact: Udi Shkalim <ushkalim>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 9.0 (Mitaka)    CC: apevec, jeckersb, lhh, srevivo
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-28 15:14:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
controller-0 sosreport
controller-1 sosreport
controller-2 sosreport

Description Udi Shkalim 2016-08-16 11:46:09 UTC
Created attachment 1191250 [details]
controller-0 sosreport

Description of problem:
During HA deployments in OSPd 9 the rabbitmq cluster fails to start.
The outcome is always the same:
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-2 ]
     Stopped: [ controller-0 controller-1 ]


I've been investigating the resources (RAM, CPU) during cluster startup but did not see any unusual memory consumption.
I did see a CPU spike to ~90% on the rabbitmq startup pid:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
rabbitmq 29557 88.3  0.7 2196764 59116 ?       Sl   13:30   0:02 /usr/lib64/erlang/erts-7.3.1/bin/beam.smp -W w -A 64 -K true -A30 -P 1048576 -K true -B i -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.3/ebin -noshell -noinput -s rabbit boot -sname rabbit@controller-1 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}] -kernel inet_default_connect_options [{nodelay,true}] -rabbit tcp_listeners [{"172.17.1.14",5672}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/rabbit"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/rabbit"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.6.3/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@controller-1-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@controller-1"
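
For completeness, the spike can also be tracked over time rather than with a single ps snapshot, e.g. with something like this (a minimal sketch, assuming the sysstat package is installed on the controller):

# sketch: sample per-second CPU usage of the beam.smp process started for rabbit@controller-1
pidstat -u -p $(pgrep -f 'beam.smp.*rabbit@controller-1') 1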


Following the startup process, this is what I observed (in general terms):
pacemaker starts rabbitmq on all controllers
controller-2 starts first and the other controllers need to join controller-2's cluster
controller-1 is the first to try to join the cluster and fails with "Error: mnesia_not_running"
logging in to controller-2, it seems that it had not fully started; I needed to run: rabbitmqctl start_app
It seems that the rabbitmq monitor reports the node as up, and the startup process continues, before rabbitmq on controller-2 is "actually" up.
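
For reference, the join attempt made by the rabbitmq-cluster resource agent is roughly equivalent to the sequence below (a simplified sketch of my reading of it; the actual agent wraps these calls in retries, wipes the mnesia directory on failure, and updates pacemaker attributes):

# simplified sketch, not the literal resource agent code
rabbitmq-server -detached                      # start the local node
rabbitmqctl stop_app                           # stop the app, leave the Erlang VM running
rabbitmqctl join_cluster rabbit@controller-2   # fails with "Error: mnesia_not_running" if controller-2 has not finished start_app
rabbitmqctl start_app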


Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.16.x86_64
rabbitmq-server-3.6.3-5.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-30.el7ost.noarch

How reproducible:
Hard to catch, since not everybody is experiencing this failure.
It seems to be related to a specific setup type (maybe CPU?).
Reproduces 8/10 times on my setup.

Steps to Reproduce:
1. Install OSPd 9
2. Deploy a simple HA overcloud (3 controllers, 1 compute); see the example deploy command below
3. Each machine has between 8-16 GB RAM and 2-4 CPUs
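
For reference, a deployment of this shape is driven by a command along these lines (a minimal sketch; the exact templates and environment files used on this setup are not recorded in this bug):

# illustrative only -- the real deploy command and environment files for this setup are not captured here
openstack overcloud deploy --templates \
    --control-scale 3 --compute-scale 1 \
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml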

Actual results:
Deployment fails


Expected results:
Deployment passes

Additional info:
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:46:55 INFO: Attempting to join cluster with target node rabbit@controller-2
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:46:56 INFO: Join process incomplete, shutting down.
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:46:56 INFO: node failed to join, wiping data directory and trying again
Aug 14 11:46:57 [7624] controller-1.localdomain        cib:     info: cib_process_ping: Reporting our current digest to controller-0: d9c03ca5636d10b3cc4d66a8d8e64b2a for 0.36.0 (0x1e41720 0)
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:46:59 INFO: RabbitMQ server is not running
Aug 14 11:46:59 [7627] controller-1.localdomain      attrd:     info: attrd_peer_update:        Setting rmq-node-attr-rabbitmq[controller-1]: rabbit@controller-1 -> (null) from controller-1
Aug 14 11:46:59 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   Diff: --- 0.36.0 2
Aug 14 11:46:59 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   Diff: +++ 0.36.1 (null)
Aug 14 11:46:59 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-rmq-node-attr-rabbitmq']
Aug 14 11:46:59 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   +  /cib:  @num_updates=1
Aug 14 11:46:59 [7624] controller-1.localdomain        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=controller-2/attrd/54, version=0.36.1)
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:46:59 INFO: Forgetting rabbit@controller-1 via nodes [ rabbit@controller-2  ].
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:47:01 ERROR: Failed to forget node rabbit@controller-1 via rabbit@controller-2.
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:47:01 INFO: Joining existing cluster with [ rabbit@controller-2  ] nodes.
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:47:01 INFO: Waiting for server to start
Aug 14 11:47:03 [7624] controller-1.localdomain        cib:     info: cib_process_request:      Completed cib_modify operation for section nodes: OK (rc=0, origin=controller-2/crm_attribute/4, version=0.36.1)
Aug 14 11:47:08 [7624] controller-1.localdomain        cib:     info: cib_process_ping: Reporting our current digest to controller-0: 0e7f8abdae93d33b8907214b782ed9a2 for 0.36.1 (0x1e02e00 0)
Aug 14 11:47:10 [7627] controller-1.localdomain      attrd:     info: attrd_peer_update:        Setting rmq-node-attr-rabbitmq[controller-1]: (null) -> rabbit@controller-1 from controller-1
Aug 14 11:47:10 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   Diff: --- 0.36.1 2
Aug 14 11:47:10 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   Diff: +++ 0.36.2 (null)
Aug 14 11:47:10 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   +  /cib:  @num_updates=2
Aug 14 11:47:10 [7624] controller-1.localdomain        cib:     info: cib_perform_op:   ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:  <nvpair id="status-2-rmq-node-attr-rabbitmq" name="rmq-node-attr-rabbitmq" value="rabbit@controller-1"/>
Aug 14 11:47:10 [7624] controller-1.localdomain        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=controller-2/attrd/55, version=0.36.2)
Aug 14 11:47:10 [7624] controller-1.localdomain        cib:     info: cib_process_request:      Forwarding cib_modify operation for section nodes to master (origin=local/crm_attribute/4)
Aug 14 11:47:10 [7624] controller-1.localdomain        cib:     info: cib_process_request:      Completed cib_modify operation for section nodes: OK (rc=0, origin=controller-1/crm_attribute/4, version=0.36.2)
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:47:13 INFO: Attempting to join cluster with target node rabbit@controller-2
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:47:14 INFO: Join process incomplete, shutting down.
rabbitmq-cluster(rabbitmq)[22186]:      2016/08/14_11:47:14 INFO: node failed to join even after reseting local data. Check SELINUX policy
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ Error: mnesia_not_running ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [  ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ DIAGNOSTICS ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ =========== ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [  ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [  ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ rabbit@controller-1: ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [   * connected to epmd (port 4369) on controller-1 ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [   * epmd reports: node 'rabbit' not running at all ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [                   no other nodes on controller-1 ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [   * suggestion: start the node ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [  ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ current node details: ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ - node name: 'rabbitmq-cli-43@controller-1' ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ - cookie hash: GVNYTPdNf3dmRkIs+xbR2A== ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [  ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:   notice: operation_finished:       rabbitmq_start_0:22186:stderr [ Error: mnesia_not_running ]
Aug 14 11:47:14 [7626] controller-1.localdomain       lrmd:     info: log_finished:     finished - rsc:rabbitmq action:start call_id:78 pid:22186 exit-code:1 exec-time:33128ms queue-time:0ms
Aug 14 11:47:14 [7629] controller-1.localdomain       crmd:     info: action_synced_wait:       Managed rabbitmq-cluster_meta-data_0 process 25954 exited with rc=0
Aug 14 11:47:14 [7629] controller-1.localdomain       crmd:   notice: process_lrm_event:        Operation rabbitmq_start_0: unknown error (node=controller-1, call=78, rc=1, cib-update=42, confirmed=true)
Aug 14 11:47:14 [7629] controller-1.localdomain       crmd:   notice: process_lrm_event:        controller-1-rabbitmq_start_0:78 [ Error: mnesia_not_running\nError: unable to connect to node 'rabbit@controller-1': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@controller-1']\n\nrabbit@controller-1:\n  * connected to epmd (port 4369) on controller-1\n  * epmd reports: node 'rabbit' not running at all\n

Comment 2 Udi Shkalim 2016-08-16 11:47:16 UTC
Created attachment 1191251 [details]
controller-1 sosreport

Comment 3 Udi Shkalim 2016-08-16 11:49:17 UTC
Created attachment 1191253 [details]
controller-2 sosreport

Comment 4 John Eckersberg 2016-08-17 15:05:51 UTC
I think you must be resource-starved on your controller hosts. RabbitMQ is taking a very long time to do even the most basic startup on controller-0:

=INFO REPORT==== 14-Aug-2016::13:27:58 ===
Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1

[...]

=INFO REPORT==== 14-Aug-2016::13:28:29 ===
Server startup complete; 0 plugins started.

Over 30 seconds.  That is before any of the clustering logic starts to kick in.  Under normal circumstances that part of startup should be nearly instant.  On my laptop it starts in less than one second.
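
The startup duration can be read straight from the broker log by comparing those two report timestamps, e.g. (the log path is taken from the -rabbit error_logger argument visible in the ps output above; adjust per node):

# prints the "Starting RabbitMQ" and "Server startup complete" lines together with their report timestamps
grep -B1 -E 'Starting RabbitMQ|Server startup complete' /var/log/rabbitmq/rabbit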

Comment 5 Udi Shkalim 2016-08-28 15:14:37 UTC
I've been monitoring the hypervisor CPU usage during the overcloud deployment, especially during the rabbitmq startup.

I've noticed that the undercloud is a heavy consumer, and during controller startup the CPU usage of the controllers and the undercloud spikes to 300-500% per VM.

When deploying 1 controller, CPU usage does not reach those numbers.
As explained by eck, the rabbit server is a fairly heavy consumer of CPU.

It seems the problem here is CPU and not memory.
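
For reference, per-VM CPU usage on the hypervisor can be sampled with something like this (a sketch, assuming the guests run as qemu-kvm processes and sysstat is installed on the baremetal host):

# samples CPU usage of every qemu-kvm process (one per VM) every 5 seconds; values above 100% mean more than one core
pidstat -u -C qemu-kvm 5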

Environment Details:
Baremetal is HP ProLiant DL170e G6  
12 CPU Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
64GB RAM
5 VMs: 1 undercloud, 3 controllers, 1 compute
undercloud and controllers each have 4 vcpu with: vcpu placement='static'
 

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
stepping	: 2
microcode	: 0x14
cpu MHz		: 2400.000
cache size	: 12288 KB
physical id	: 1
siblings	: 12
core id		: 0
cpu cores	: 6
apicid		: 32
initial apicid	: 32
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.22
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: