Description of problem:
rabbitmqctl list_channels hangs and connections are very slow. Sometimes it doesn't work at all and simply times out. RabbitMQ sometimes consumes almost 600% CPU, is very slow, or stops responding entirely.

On controller-0 we see:

  PID  USER     PR NI VIRT    RES    SHR  S %CPU  %MEM TIME+   COMMAND
  5176 rabbitmq 20 0  3405324 713864 2980 S 105.3 5.9  3787:52 beam.smp

On controller-1 the situation is similar:

  PID  USER     PR NI VIRT    RES    SHR  S %CPU  %MEM TIME+   COMMAND
  5359 rabbitmq 20 0  3553436 934812 2984 S 341.9 7.7  7420:51 beam.smp

In the RabbitMQ logs we find the following:

  =ERROR REPORT==== 17-Mar-2016::09:38:02 ===
  closing AMQP connection <0.18261.3202> (X:38676 -> X:5672):
  {heartbeat_timeout,running}

  =ERROR REPORT==== 17-Mar-2016::09:38:03 ===
  closing AMQP connection <0.25122.3203> (X:38704 -> X:5672):
  {heartbeat_timeout,running}

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-16.el7ost.noarch

How reproducible:
Keep using RabbitMQ.

Steps to Reproduce:
1.
2.
3.

Actual results:
CPU peaks over 100% most of the time and RabbitMQ eventually stops responding.

Expected results:

Additional info:
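When list_channels hangs like this, wrapping it in a timeout keeps diagnosis scripts from blocking indefinitely. A minimal sketch (the 30-second cutoff and the channels.txt filename are illustrative choices, not from the report):

```shell
# Sketch: capture the channel list without blocking forever if the
# broker is stuck. 30 seconds is an arbitrary cutoff, not a value
# taken from this bug report.
if timeout 30 rabbitmqctl list_channels >channels.txt 2>&1; then
    echo "channels captured in channels.txt"
else
    echo "list_channels timed out or failed"
fi
```

The exit status of `timeout` distinguishes a real failure from a hang (124 means the time limit was hit).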
This might be roughly the same thing as https://bugzilla.redhat.com/show_bug.cgi?id=1295896#c1. I see a lot of heartbeat timeouts and inet_error,etimedout, which implies TCP timeouts. It's hard to tell after the fact, but my theory is that a bunch of work was queued up for some reason and was being worked through.
A few questions/thoughts/requests to help move this along:

1) It's not abnormal for rabbitmq to have high CPU utilization. Under load, with mirrored queues, there is a lot going on to replicate state around. The utilization by itself doesn't worry me too much.

2) Grab the output of `rabbitmqctl report` from any *one* of the controllers and attach it to the case, just to get an idea of the scale of the system from an AMQP perspective.

3) If the list_channels command gets stuck or times out, run the following on *all* controllers, capture the output, and attach it to the case:

   rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
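The capture step in (3) can be sketched as a small shell helper. The output filename scheme and the RABBITMQCTL override are illustrative, not part of the original request:

```shell
#!/bin/sh
# Sketch: run the stuck-process diagnostic on one controller and save
# the output to a per-host, timestamped file for attaching to the case.
# RABBITMQCTL is overridable so the sketch can be exercised without a broker.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"
OUT="maybe_stuck.$(hostname).$(date +%Y%m%d-%H%M%S).log"
"$RABBITMQCTL" eval 'rabbit_diagnostics:maybe_stuck().' >"$OUT" 2>&1 \
    || echo "rabbitmqctl failed -- is RabbitMQ installed and running?" >>"$OUT"
echo "diagnostic output written to $OUT"
```

Run it on each controller and attach the resulting log files.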
Hi eck,

Ack. Since the RabbitMQ restart the issue has not recurred.

Robin
Just stumbled upon this (from the report attached above):

  {file_descriptors,[{total_limit,3996},
                     {total_used,110},
                     {sockets_limit,3594},
                     {sockets_used,108}]},

We require at least 16k sockets (see bug 1282491). Could you please ask them to increase this and retest?
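For a systemd-managed rabbitmq-server, one common way to raise the ceiling is a unit drop-in. This is a sketch; the exact unit name and the target value of 16384 should be confirmed against bug 1282491 and the deployment:

```
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
# Raise the file-descriptor limit so RabbitMQ can open >= 16k sockets.
[Service]
LimitNOFILE=16384
```

After adding the drop-in, run `systemctl daemon-reload` and restart rabbitmq-server, then verify the new limit in the `file_descriptors` section of `rabbitmqctl status`.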
Looks like this was mostly a one-off occurrence and can't be reproduced, so I'm going to close it. Feel free to re-open if more data becomes available.