Bug 1318692 - rabbitmq uses 600% CPU and doesn't respond after some time
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: John Eckersberg
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-17 14:10 UTC by Robin Cernin
Modified: 2019-10-10 11:38 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-26 17:01:53 UTC
Target Upstream Version:



Description Robin Cernin 2016-03-17 14:10:35 UTC
Description of problem:

rabbitmqctl list_channels hangs and connections are very slow; sometimes it does not work at all and simply times out. RabbitMQ at times consumes almost 600% CPU, becomes very slow, or stops working entirely.

On controller-0 we see:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5176 rabbitmq  20   0 3405324 713864   2980 S 105.3  5.9   3787:52 beam.smp


On controller-1 we see a similar situation:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5359 rabbitmq  20   0 3553436 934812   2984 S 341.9  7.7   7420:51 beam.smp

In the rabbitmq logs we find the following:

=ERROR REPORT==== 17-Mar-2016::09:38:02 ===
closing AMQP connection <0.18261.3202> (X:38676 -> X:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 17-Mar-2016::09:38:03 ===
closing AMQP connection <0.25122.3203> (X:38704 -> X:5672):
{heartbeat_timeout,running}

Version-Release number of selected component (if applicable):

rabbitmq-server-3.3.5-16.el7ost.noarch

How reproducible:

Keep using RabbitMQ

Steps to Reproduce:
1.
2.
3.

Actual results:

CPU usage peaks well above 100% most of the time, and RabbitMQ eventually stops responding.

Expected results:


Additional info:

Comment 7 John Eckersberg 2016-03-18 15:57:38 UTC
This might be roughly the same thing as https://bugzilla.redhat.com/show_bug.cgi?id=1295896#c1. I see a lot of heartbeat timeouts and {inet_error,etimedout}, which implies TCP timeouts. It's hard to tell after the fact, but my theory is that a backlog of work had queued up for some reason and was being worked through.

Comment 8 John Eckersberg 2016-03-21 18:33:43 UTC
A few questions/thoughts/requests to help move this along:

1) It's not abnormal for rabbitmq to have high CPU utilization.  Under load, with mirrored queues, there is a lot going on to replicate state around.  The utilization by itself doesn't worry me too much.

2) Grab the output of `rabbitmqctl report` from any *one* of the controllers and attach to the case, just to get an idea of the scale of the system from an AMQP perspective.

3) If the list_channels command gets stuck or times out, run the following on *all* controllers, capture the output, and attach to the case:

  rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Comment 9 Robin Cernin 2016-03-22 10:20:58 UTC
Hi eck,

Ack.

After RabbitMQ restart the issue has not occurred till now.

Robin

Comment 13 Peter Lemenkov 2016-03-24 09:29:32 UTC
Just stumbled upon this (from the report attached above):


  {file_descriptors,[{total_limit,3996},
                    {total_used,110},
                    {sockets_limit,3594},
                    {sockets_used,108}]},
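As a quick sanity check, the socket headroom in an excerpt like the one above can be pulled out with a small shell sketch. The numbers below are copied verbatim from the excerpt; in practice you would grep the saved `rabbitmqctl report` output instead (this is only an illustration, not part of the case data):

```shell
# Minimal sketch: extract socket usage from a saved rabbitmqctl report
# excerpt. The data is copied from the file_descriptors block above.
report='{file_descriptors,[{total_limit,3996},
                  {total_used,110},
                  {sockets_limit,3594},
                  {sockets_used,108}]},'

# Pull the numeric value that follows each key.
sockets_limit=$(printf '%s\n' "$report" | grep -o 'sockets_limit,[0-9]*' | cut -d, -f2)
sockets_used=$(printf '%s\n' "$report" | grep -o 'sockets_used,[0-9]*' | cut -d, -f2)

echo "sockets: ${sockets_used}/${sockets_limit} used"
```

With the values from this report that prints `sockets: 108/3594 used`, well under the configured limit, so descriptor exhaustion was not in play at the moment the report was taken.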

We require at least 16k sockets (see bug 1282491). Could you please ask them to increase this and retest?
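For anyone hitting the same limit: on a stock systemd-managed rabbitmq-server, the descriptor limit can be raised with a drop-in like the one below. The path and the 16384 value are illustrative, based on the 16k figure above; a Pacemaker-managed OSP cluster may configure this elsewhere, so treat this as a sketch, not the supported procedure:

```ini
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
# Illustrative drop-in; run `systemctl daemon-reload` and restart
# rabbitmq-server afterwards, then verify with `rabbitmqctl status`.
[Service]
LimitNOFILE=16384
```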

Comment 15 John Eckersberg 2016-08-26 17:01:53 UTC
Looks like this was mostly a one-off occurrence and can't be reproduced, so I'm going to close it. Feel free to reopen if more data becomes available.

