Bug 1318692 - rabbitmq uses 600% CPU and doesn't respond after some time
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: John Eckersberg
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-17 14:10 UTC by Robin Cernin
Modified: 2019-10-10 11:38 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-26 17:01:53 UTC
Target Upstream Version:



Description Robin Cernin 2016-03-17 14:10:35 UTC
Description of problem:

rabbitmqctl list_channels hangs and connections are very slow; sometimes it does not work at all and simply times out. RabbitMQ at times consumes almost 600% CPU, becomes very slow, or stops working entirely.

On controller-0 we see:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5176 rabbitmq  20   0 3405324 713864   2980 S 105.3  5.9   3787:52 beam.smp


On controller-1 we see a similar situation:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5359 rabbitmq  20   0 3553436 934812   2984 S 341.9  7.7   7420:51 beam.smp

In the rabbitmq logs we find the following:

=ERROR REPORT==== 17-Mar-2016::09:38:02 ===
closing AMQP connection <0.18261.3202> (X:38676 -> X:5672):
{heartbeat_timeout,running}

=ERROR REPORT==== 17-Mar-2016::09:38:03 ===
closing AMQP connection <0.25122.3203> (X:38704 -> X:5672):
{heartbeat_timeout,running}

Version-Release number of selected component (if applicable):

rabbitmq-server-3.3.5-16.el7ost.noarch

How reproducible:

Keep using RabbitMQ

Steps to Reproduce:
1.
2.
3.

Actual results:

CPU usage peaks well above 100% most of the time, and RabbitMQ eventually stops responding.

Expected results:


Additional info:

Comment 7 John Eckersberg 2016-03-18 15:57:38 UTC
This might be roughly the same thing as https://bugzilla.redhat.com/show_bug.cgi?id=1295896#c1. I see a lot of heartbeat timeouts and {inet_error,etimedout}, which implies TCP timeouts. It's hard to tell after the fact, but my theory is that a backlog of work had queued up for some reason and was being worked through.

Comment 8 John Eckersberg 2016-03-21 18:33:43 UTC
A few questions/thoughts/requests to help move this along:

1) It's not abnormal for rabbitmq to have high CPU utilization.  Under load, with mirrored queues, there is a lot going on to replicate state around.  The utilization by itself doesn't worry me too much.

2) Grab the output of `rabbitmqctl report` from any *one* of the controllers and attach to the case, just to get an idea of the scale of the system from an AMQP perspective.

3) If the list_channels command gets stuck or times out, run the following on *all* controllers, capture the output, and attach to the case:

  rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Comment 9 Robin Cernin 2016-03-22 10:20:58 UTC
Hi eck,

Ack.

After RabbitMQ restart the issue has not occurred till now.

Robin

Comment 13 Peter Lemenkov 2016-03-24 09:29:32 UTC
Just stumbled upon this (from the report attached above):


  {file_descriptors,[{total_limit,3996},
                    {total_used,110},
                    {sockets_limit,3594},
                    {sockets_used,108}]},
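As a quick sanity check, the socket headroom in an excerpt like the one above can be pulled out with a small shell sketch. The numbers below are copied verbatim from the excerpt; in practice you would grep the saved `rabbitmqctl report` output instead (this is only an illustration, not part of the case data):

```shell
# Minimal sketch: extract socket usage from a saved rabbitmqctl report
# excerpt. The data is copied from the file_descriptors block above.
report='{file_descriptors,[{total_limit,3996},
                  {total_used,110},
                  {sockets_limit,3594},
                  {sockets_used,108}]},'

# Pull the numeric value that follows each key.
sockets_limit=$(printf '%s\n' "$report" | grep -o 'sockets_limit,[0-9]*' | cut -d, -f2)
sockets_used=$(printf '%s\n' "$report" | grep -o 'sockets_used,[0-9]*' | cut -d, -f2)

echo "sockets: ${sockets_used}/${sockets_limit} used"
```

With the values from this report that prints `sockets: 108/3594 used`, well under the configured limit, so descriptor exhaustion was not in play at the moment the report was taken.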

We require at least 16k sockets (see bug 1282491). Could you please ask them to increase this and retest?
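For anyone hitting the same limit: on a stock systemd-managed rabbitmq-server, the descriptor limit can be raised with a drop-in like the one below. The path and the 16384 value are illustrative, based on the 16k figure above; a Pacemaker-managed OSP cluster may configure this elsewhere, so treat this as a sketch, not the supported procedure:

```ini
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
# Illustrative drop-in; run `systemctl daemon-reload` and restart
# rabbitmq-server afterwards, then verify with `rabbitmqctl status`.
[Service]
LimitNOFILE=16384
```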

Comment 15 John Eckersberg 2016-08-26 17:01:53 UTC
Looks like this was mostly a one-off occurrence and can't be reproduced, so I'm going to close it. Feel free to reopen if more data becomes available.

