Bug 1465529

Summary: Autoscaling fails, RabbitMQ being killed
Product: Red Hat OpenStack
Reporter: Morgan Weetman <mweetman>
Component: openstack-ceilometer
Assignee: Julien Danjou <jdanjou>
Status: CLOSED NOTABUG
QA Contact: Sasha Smolyak <ssmolyak>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 10.0 (Newton)
CC: ftaylor, jdanjou, jruzicka, mabaakou, mweetman, rlocke, sclewis, srevivo, vstinner
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-04 07:23:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1467947
Bug Blocks:

Description Morgan Weetman 2017-06-27 14:51:28 UTC
Description of problem:
Autoscaling is failing and RabbitMQ gets killed. The same scenario succeeds on hardware but fails in a cloud environment.

It fails in various ways, but generally the load increases until RabbitMQ falls over.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Julien Danjou 2017-07-03 13:48:26 UTC
Could you give more information about the problem and detailed steps to reproduce it?

It's hard to understand what exactly the problem is.

Comment 5 Mehdi ABAAKOUK 2017-07-17 15:29:39 UTC
I have tried the lab, and Ceph does not look healthy:

$ ceph health
HEALTH_ERR 68 pgs are stuck inactive....

The Ceph pools have a size of 3 replicas with min_size 1. Since there is only one node, the three replicas can never all be placed, which makes Ceph slow; size should be 1 as well.
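
Something like this (substitute the actual pool names for your deployment) would check and lower the replica settings:

$ ceph osd pool ls
$ ceph osd pool get <pool> size
$ ceph osd pool set <pool> size 1
$ ceph osd pool set <pool> min_size 1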

Also, three OSDs are configured in Ceph while only one exists, so Ceph keeps trying to reach nonexistent nodes. That also makes Ceph slow, because it spends a long time waiting on OSDs that will never come back.
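
To see which OSDs Ceph knows about and drop the stale entries, the usual manual removal sequence is roughly the following, with <id> being each OSD that does not actually exist:

$ ceph osd tree
$ ceph osd out osd.<id>
$ ceph osd crush remove osd.<id>
$ ceph auth del osd.<id>
$ ceph osd rm <id>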

Also, even without running tempest, Ceph is already reporting slow requests, e.g. taking more than 500s to write data, so adding the tempest load is not going to work.
These slow requests most likely come from the missing OSD nodes.
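
The slow requests are visible directly in the cluster status, e.g.:

$ ceph health detail
$ ceph -s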

So my guess is that the Ceph node is too slow and Gnocchi can't write its backlog to it. Ceilometer in turn can't post measures to Gnocchi, because Ceph is too slow to write them, and that leaves many messages waiting to be processed on rabbitmq.
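
Both ends of that backlog can be confirmed, for instance like this (the exact queue names depend on the deployment, and gnocchi status assumes a Gnocchi version that exposes the /v1/status endpoint):

$ rabbitmqctl list_queues name messages consumers
$ gnocchi status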

You should first fix the ceph setup.

Comment 9 Mehdi ABAAKOUK 2017-07-18 09:30:32 UTC
I got the Ceph issue fixed and can now reproduce the problem. I have added a "depends on" link to the other issue, since the root cause has a good chance of being the same for both BZs.

Comment 11 Julien Danjou 2017-09-04 07:23:03 UTC
I'm closing this bug with the conclusion that it is not a bug: the root cause is a lack of resources, and the rest is discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1467947