Red Hat Bugzilla – Bug 1465529
Autoscaling fails, RabbitMQ being killed
Last modified: 2017-09-04 03:23:03 EDT
Description of problem:
Autoscaling is failing and RabbitMQ gets killed. It succeeds on hardware but fails in a cloud environment.
It fails in various ways, but generally the load increases until RabbitMQ falls over.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Could you give more detail about your problem and how to reproduce it?
It's hard to understand exactly what your problem is.
I have tried the lab and Ceph does not look healthy:
$ ceph health
HEALTH_ERR 68pgs are stuck inactive....
The Ceph pools have a size of 3 replicas with min_size 1. This makes Ceph slow. size should be 1 too, because you have only one node.
Also, three OSDs are configured in Ceph when you have only one, so Ceph tries to reach nonexistent nodes. That also makes Ceph slow, because it waits a long time on OSDs that will never come back.
Also, even without running tempest, Ceph is already reporting slow requests, some taking more than 500s to write data. So adding the tempest load is not going to work.
These slow requests have a good chance of coming from the missing OSD nodes.
So my guess is that the Ceph node is too slow and Gnocchi can't write the backlog to it. Ceilometer also can't post measures to Gnocchi, because Ceph is too slow to write them. That leaves many messages waiting to be processed on RabbitMQ.
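To illustrate the guess above, here is a minimal sketch of how a queue grows when messages are published faster than a slow backend lets them be consumed. The function name and all rates are made up for illustration; they are not measurements from this lab.

```python
def queue_depth_over_time(publish_rate, consume_rate, seconds, start_depth=0):
    """Return the queue depth after each second for fixed per-second rates.

    publish_rate: messages published per second (hypothetical value).
    consume_rate: messages consumed per second; a slow storage backend
                  would lower this (hypothetical value).
    """
    depth = start_depth
    history = []
    for _ in range(seconds):
        # Depth grows by the surplus each second and never drops below zero.
        depth = max(0, depth + publish_rate - consume_rate)
        history.append(depth)
    return history


# Consumers slower than publishers: the backlog grows without bound.
print(queue_depth_over_time(publish_rate=100, consume_rate=20, seconds=3))
# Consumers faster than publishers: an existing backlog drains.
print(queue_depth_over_time(publish_rate=20, consume_rate=100, seconds=3, start_depth=50))
```

As long as the consume rate stays below the publish rate, the depth only goes up, which would match the load increasing until RabbitMQ falls over.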
You should first fix the ceph setup.
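As a rough sketch of the pool part of that fix, the helper below builds one `ceph osd pool set <pool> size <n>` command per pool. The helper name and the pool names are hypothetical; list the real pools with `ceph osd pool ls` first. Some Ceph releases may ask for extra confirmation before accepting a size of 1, so treat this as a starting point rather than a tested procedure.

```python
import shlex


def pool_size_commands(pool_names, size=1):
    """Build the commands that set the replica count of each pool.

    The commands are only returned as strings, not executed, so they
    can be reviewed before being run on the Ceph node.
    """
    if size < 1:
        raise ValueError("replica count must be at least 1")
    return [
        "ceph osd pool set {} size {}".format(shlex.quote(name), size)
        for name in pool_names
    ]


# Hypothetical pool names, for illustration only.
for command in pool_size_commands(["metrics", "images", "volumes"]):
    print(command)
```

This only covers the replica count. The two OSDs that do not exist would still need to be removed from the cluster separately.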
I got the Ceph issue fixed and can reproduce the issue. I have added a "Depends On" link to the other BZ, since the root cause has a good chance of being the same for both BZs.
I'm closing this bug with the conclusion that this is not a bug and that the root cause is a lack of resources. The rest is discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1467947