| Summary: | RabbitMQ Fails to Start During OSP-9 HA Deployments | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Udi Shkalim <ushkalim> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED NOTABUG | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | CC: | apevec, jeckersb, lhh, srevivo |
| Version: | 9.0 (Mitaka) | Target Milestone: | --- |
| Target Release: | --- | Hardware: | x86_64 |
| OS: | Linux | Whiteboard: | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-28 15:14:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Created attachment 1191251 [details]
controller-1 sosreport
Created attachment 1191253 [details]
controller-2 sosreport
I think you must be resource starved on your controller hosts. RabbitMQ is taking a very long time to do even the most basic startup on controller-0:

=INFO REPORT==== 14-Aug-2016::13:27:58 ===
Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1
[...]
=INFO REPORT==== 14-Aug-2016::13:28:29 ===
Server startup complete; 0 plugins started.

Over 30 seconds, and that is before any of the clustering logic starts to kick in. Under normal circumstances that part of startup should be nearly instant; on my laptop it starts in less than one second.

I've been monitoring the hypervisor CPU usage during the overcloud deployment, especially during the RabbitMQ startup. I've noticed that the undercloud is a heavy consumer, and during controller startup the CPU usage of the controllers and the undercloud spikes to 300-500% per VM. When deploying a single controller, CPU usage does not reach those numbers. As explained by eck, the rabbit server is a fairly heavy consumer of CPU, so the problem here appears to be CPU and not memory.

Environment details:

Baremetal: HP ProLiant DL170e G6, 12 CPUs (Intel(R) Xeon(R) CPU E5645 @ 2.40GHz), 64GB RAM
5 VMs: 1 undercloud, 3 controllers, 1 compute
Undercloud and controllers each have 4 vCPUs with vcpu placement='static':

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
stepping        : 2
microcode       : 0x14
cpu MHz         : 2400.000
cache size      : 12288 KB
physical id     : 1
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 32
initial apicid  : 32
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4800.22
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
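For reference, a minimal sketch of how the startup-time and CPU observations above could be reproduced on a controller. The commands are illustrative and not taken from the sosreports; the log path assumes a default RabbitMQ 3.6 layout under /var/log/rabbitmq/ and should be adjusted to whatever error_logger file is actually configured on the node.

# Startup duration: timestamps immediately before "Starting RabbitMQ" and
# "Server startup complete" in the broker log (path is an assumption)
grep -B1 -E 'Starting RabbitMQ|Server startup complete' /var/log/rabbitmq/rabbit@"$(hostname -s)".log

# CPU usage of the Erlang VM while the resource agent is starting rabbit
pidstat -p "$(pgrep -d, -f beam.smp)" 1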
Created attachment 1191250 [details]
controller-0 sosreport

Description of problem:
During HA deployments with OSPd 9 the RabbitMQ cluster fails to start. The outcome is always the same:

Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-2 ]
     Stopped: [ controller-0 controller-1 ]

I've been investigating the resources (RAM, CPU) during cluster startup but did not see any unusual memory consumption. I did see a CPU spike to ~90% on the rabbitmq startup pid:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
rabbitmq 29557 88.3 0.7 2196764 59116 ? Sl 13:30 0:02 /usr/lib64/erlang/erts-7.3.1/bin/beam.smp -W w -A 64 -K true -A30 -P 1048576 -K true -B i -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.3/ebin -noshell -noinput -s rabbit boot -sname rabbit@controller-1 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}] -kernel inet_default_connect_options [{nodelay,true}] -rabbit tcp_listeners [{"172.17.1.14",5672}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/var/log/rabbitmq/rabbit"} -rabbit sasl_error_logger {file,"/var/log/rabbitmq/rabbit"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.6.3/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@controller-1-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@controller-1"

Following the startup process, this is what I've got (in general terms):
1. pacemaker starts rabbitmq on all controllers
2. controller-2 starts first and the other controllers need to join controller-2's cluster
3. controller-1 is the first to try to join the cluster and fails with "Error: mnesia_not_running"
4. logging in to controller-2, it seems that it was not fully started; I needed to run: rabbitmqctl start_app

It seems that the rabbitmq monitor is reporting the node as up, and the startup process continues, before rabbitmq on controller-2 is "actually" up (see the manual-check sketch after the log excerpts below).

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-54.el7_2.16.x86_64
rabbitmq-server-3.6.3-5.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-30.el7ost.noarch

How reproducible:
Hard to catch, since not everybody is experiencing this failure; it seems related to a specific setup type (maybe CPU?). 8/10 on my setup.

Steps to Reproduce:
1. Install OSPd 9
2. Deploy a simple HA overcloud (3 controllers, 1 compute)
3. Each machine has between 8-16GB RAM and 2-4 CPUs

Actual results:
Deployment fails

Expected results:
Deployment passes

Additional info:
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:46:55 INFO: Attempting to join cluster with target node rabbit@controller-2
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:46:56 INFO: Join process incomplete, shutting down.
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:46:56 INFO: node failed to join, wiping data directory and trying again
Aug 14 11:46:57 [7624] controller-1.localdomain cib: info: cib_process_ping: Reporting our current digest to controller-0: d9c03ca5636d10b3cc4d66a8d8e64b2a for 0.36.0 (0x1e41720 0)
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:46:59 INFO: RabbitMQ server is not running
Aug 14 11:46:59 [7627] controller-1.localdomain attrd: info: attrd_peer_update: Setting rmq-node-attr-rabbitmq[controller-1]: rabbit@controller-1 -> (null) from controller-1
Aug 14 11:46:59 [7624] controller-1.localdomain cib: info: cib_perform_op: Diff: --- 0.36.0 2
Aug 14 11:46:59 [7624] controller-1.localdomain cib: info: cib_perform_op: Diff: +++ 0.36.1 (null)
Aug 14 11:46:59 [7624] controller-1.localdomain cib: info: cib_perform_op: -- /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-rmq-node-attr-rabbitmq']
Aug 14 11:46:59 [7624] controller-1.localdomain cib: info: cib_perform_op: + /cib: @num_updates=1
Aug 14 11:46:59 [7624] controller-1.localdomain cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=controller-2/attrd/54, version=0.36.1)
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:46:59 INFO: Forgetting rabbit@controller-1 via nodes [ rabbit@controller-2 ].
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:47:01 ERROR: Failed to forget node rabbit@controller-1 via rabbit@controller-2.
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:47:01 INFO: Joining existing cluster with [ rabbit@controller-2 ] nodes.
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:47:01 INFO: Waiting for server to start
Aug 14 11:47:03 [7624] controller-1.localdomain cib: info: cib_process_request: Completed cib_modify operation for section nodes: OK (rc=0, origin=controller-2/crm_attribute/4, version=0.36.1)
Aug 14 11:47:08 [7624] controller-1.localdomain cib: info: cib_process_ping: Reporting our current digest to controller-0: 0e7f8abdae93d33b8907214b782ed9a2 for 0.36.1 (0x1e02e00 0)
Aug 14 11:47:10 [7627] controller-1.localdomain attrd: info: attrd_peer_update: Setting rmq-node-attr-rabbitmq[controller-1]: (null) -> rabbit@controller-1 from controller-1
Aug 14 11:47:10 [7624] controller-1.localdomain cib: info: cib_perform_op: Diff: --- 0.36.1 2
Aug 14 11:47:10 [7624] controller-1.localdomain cib: info: cib_perform_op: Diff: +++ 0.36.2 (null)
Aug 14 11:47:10 [7624] controller-1.localdomain cib: info: cib_perform_op: + /cib: @num_updates=2
Aug 14 11:47:10 [7624] controller-1.localdomain cib: info: cib_perform_op: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']: <nvpair id="status-2-rmq-node-attr-rabbitmq" name="rmq-node-attr-rabbitmq" value="rabbit@controller-1"/>
Aug 14 11:47:10 [7624] controller-1.localdomain cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=controller-2/attrd/55, version=0.36.2)
Aug 14 11:47:10 [7624] controller-1.localdomain cib: info: cib_process_request: Forwarding cib_modify operation for section nodes to master (origin=local/crm_attribute/4)
Aug 14 11:47:10 [7624] controller-1.localdomain cib: info: cib_process_request: Completed cib_modify operation for section nodes: OK (rc=0, origin=controller-1/crm_attribute/4, version=0.36.2)
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:47:13 INFO: Attempting to join cluster with target node rabbit@controller-2
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:47:14 INFO: Join process incomplete, shutting down.
rabbitmq-cluster(rabbitmq)[22186]: 2016/08/14_11:47:14 INFO: node failed to join even after reseting local data. Check SELINUX policy
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ Error: mnesia_not_running ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ DIAGNOSTICS ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ =========== ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ rabbit@controller-1: ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ * connected to epmd (port 4369) on controller-1 ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ * epmd reports: node 'rabbit' not running at all ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ no other nodes on controller-1 ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ * suggestion: start the node ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ current node details: ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ - node name: 'rabbitmq-cli-43@controller-1' ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ - home dir: /var/lib/rabbitmq ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ - cookie hash: GVNYTPdNf3dmRkIs+xbR2A== ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: notice: operation_finished: rabbitmq_start_0:22186:stderr [ Error: mnesia_not_running ]
Aug 14 11:47:14 [7626] controller-1.localdomain lrmd: info: log_finished: finished - rsc:rabbitmq action:start call_id:78 pid:22186 exit-code:1 exec-time:33128ms queue-time:0ms
Aug 14 11:47:14 [7629] controller-1.localdomain crmd: info: action_synced_wait: Managed rabbitmq-cluster_meta-data_0 process 25954 exited with rc=0
Aug 14 11:47:14 [7629] controller-1.localdomain crmd: notice: process_lrm_event: Operation rabbitmq_start_0: unknown error (node=controller-1, call=78, rc=1, cib-update=42, confirmed=true)
Aug 14 11:47:14 [7629] controller-1.localdomain crmd: notice: process_lrm_event: controller-1-rabbitmq_start_0:78 [ Error: mnesia_not_running\nError: unable to connect to node 'rabbit@controller-1': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@controller-1']\n\nrabbit@controller-1:\n * connected to epmd (port 4369) on controller-1\n * epmd reports: node 'rabbit' not running at all\n
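For reference, a minimal sketch of the manual check and workaround described in the problem report above. The commands are illustrative, not taken from the sosreports; node names match the rabbit@controller-1 / rabbit@controller-2 names shown in the logs.

# On controller-2: the beam process can be up while the rabbit application is still starting
rabbitmqctl eval 'rabbit:is_running().'    # should return 'true' once the app is really up
rabbitmqctl cluster_status

# If the application is not running yet, this is the workaround mentioned above
rabbitmqctl start_app

# On controller-1: retry the join only after controller-2 reports the app as running
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@controller-2
rabbitmqctl start_app
rabbitmqctl cluster_status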