Bug 1465529 - Autoscaling fails, RabbitMQ being killed
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Julien Danjou
QA Contact: Sasha Smolyak
Keywords: Triaged, ZStream
Depends On: 1467947
Reported: 2017-06-27 10:51 EDT by Morgan Weetman
Modified: 2017-09-04 03:23 EDT
CC List: 9 users

Last Closed: 2017-09-04 03:23:03 EDT
Type: Bug

Attachments: None
Description Morgan Weetman 2017-06-27 10:51:28 EDT
Description of problem:
Autoscaling is failing and RabbitMQ gets killed. It succeeds on hardware but fails in a cloud environment.

It fails in various ways, but generally the load increases until RabbitMQ falls over.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
Comment 2 Julien Danjou 2017-07-03 09:48:26 EDT
Could you give more information about your problem and a detailed way to reproduce it?

It's hard to understand exactly what your problem is.
Comment 5 Mehdi ABAAKOUK 2017-07-17 11:29:39 EDT
I have tried the lab, and Ceph does not look healthy:

$ ceph health
HEALTH_ERR 68pgs are stuck inactive....

The Ceph pool has a size of 3 replicas with min_size 1. This makes Ceph slow. size should be 1 too, because you have only one node.

Also, three OSDs are configured in Ceph when you have only one, so Ceph tries to reach nonexistent nodes. That also makes Ceph slow, because it waits a long time on OSDs that will never come back.

Also, even without running Tempest, Ceph is already reporting slow requests, such as more than 500s to write data. So adding the Tempest load is not going to work.
These slow requests most likely come from the missing OSD nodes.

So my guess is that the Ceph node is too slow and Gnocchi can't write the backlog to it. Ceilometer also can't post measures to Gnocchi, because Ceph is too slow to write them. That leaves many messages waiting to be processed on RabbitMQ.

You should first fix the Ceph setup.
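To illustrate the reasoning above, here is a minimal sketch of how a queue backs up when the consumer is limited by slow storage writes. It is a toy model, not code from Ceilometer, Gnocchi, or RabbitMQ; the function name and the rates are made-up assumptions chosen only for illustration.

```python
def simulate_backlog(publish_rate, consume_rate, seconds):
    """Return the queue depth after each second for constant rates.

    publish_rate: messages per second arriving in the queue (hypothetical).
    consume_rate: messages per second the consumer can drain, assumed here
                  to be bounded by how fast the storage backend accepts
                  writes (hypothetical).
    """
    depth = 0
    history = []
    for _ in range(seconds):
        depth += publish_rate
        # The consumer cannot drain more than what is currently queued.
        depth -= min(depth, consume_rate)
        history.append(depth)
    return history


# Healthy storage: the consumer keeps up, so the queue stays empty.
healthy = simulate_backlog(publish_rate=100, consume_rate=120, seconds=60)

# Slow storage: fewer messages are drained than arrive, so the queue keeps growing.
slow = simulate_backlog(publish_rate=100, consume_rate=20, seconds=60)

print("healthy final depth:", healthy[-1])
print("slow final depth:", slow[-1])
```

In this model the backlog grows linearly for as long as the consumer is slower than the publisher, which matches the observed pattern of load increasing until the broker runs out of resources.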
Comment 9 Mehdi ABAAKOUK 2017-07-18 05:30:32 EDT
I got the Ceph issue fixed and can reproduce the issue. I have added a Depends On link to the other issue, since the root cause has a good chance of being the same for both BZs.
Comment 11 Julien Danjou 2017-09-04 03:23:03 EDT
I'm closing this bug on the conclusion that it is not a bug and that the root cause is a lack of resources. The rest is discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1467947
