1465529 – Autoscaling fails, RabbitMQ being killed

Summary: Autoscaling fails, RabbitMQ being killed

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-ceilometer
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	10.0 (Newton)
Assignee:	Julien Danjou
QA Contact:	Sasha Smolyak
Docs Contact:
URL:
Whiteboard:
Depends On:	1467947
Blocks:

Reported:	2017-06-27 14:51 UTC by Morgan Weetman
Modified:	2017-09-04 07:23 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-09-04 07:23:03 UTC
Target Upstream Version:
Embargoed:

Description Morgan Weetman 2017-06-27 14:51:28 UTC

Description of problem:
Autoscaling is failing and RabbitMQ gets killed. It succeeds on hardware but fails in a cloud environment.

It fails in various ways, but generally the load increases until RabbitMQ falls over.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Julien Danjou 2017-07-03 13:48:26 UTC

Could you give more information about your problem and describe in detail how to reproduce it?

It's hard to understand what your problem is exactly.

Comment 5 Mehdi ABAAKOUK 2017-07-17 15:29:39 UTC

I have tried the lab, and Ceph does not look healthy:

$ ceph health
HEALTH_ERR 68pgs are stuck inactive....

The Ceph pool has a size of 3 replicas with min_size 1. This makes Ceph slow. size should be 1 too, because you have only one node.

Also, three OSDs are configured in Ceph when you have only one, so Ceph tries to reach nonexistent nodes. That also makes Ceph slow, because it waits a long time on OSDs that will never come back.
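
As a rough sketch of the pool-size change suggested above, assuming a single-OSD lab: the helper below prints the `ceph osd pool set` command for each pool named on its command line, and only runs it when `APPLY=1` is set. The function name and the pool name `metrics` are hypothetical; substitute the pools your cluster actually reports. Newer Ceph releases may ask for extra confirmation before accepting a size of 1.

```shell
#!/bin/sh
# Sketch only: set the replica count ("size") of each named pool to 1.
# By default the commands are printed, not executed; export APPLY=1 to run them.
# set_single_replica is an illustrative helper, not part of Ceph.

set_single_replica() {
    for pool in "$@"; do
        if [ "${APPLY:-0}" = "1" ]; then
            # Real change: one replica per object, suitable only for a one-node lab.
            ceph osd pool set "$pool" size 1
        else
            # Dry run: show what would be executed.
            echo "ceph osd pool set $pool size 1"
        fi
    done
}

# "metrics" is an assumed pool name used purely for illustration.
set_single_replica metrics
```

Running it without `APPLY=1` first makes it easy to review the commands before touching the cluster.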

Also, even without running Tempest, Ceph is already reporting slow requests, such as more than 500s to write data. So adding the Tempest load is not going to work.
These slow requests have a good chance of coming from the missing OSD nodes.

So my guess is that the Ceph node is too slow and Gnocchi can't write the backlog to it. Ceilometer also can't post measures to Gnocchi, because Ceph is too slow to write them. That leaves many messages waiting to be processed in RabbitMQ.
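
One way to check whether messages really are piling up is to look at queue depths. The sketch below assumes the tab-separated output of `rabbitmqctl list_queues name messages` is piped into a small filter; `busy_queues` is a hypothetical helper, and the queue names and counts in the example are made-up sample data, not values from this lab.

```shell
#!/bin/sh
# Sketch only: print queues holding more messages than a threshold.
# Reads "name<TAB>messages" lines on stdin; banner lines without a numeric
# second field (e.g. "Listing queues ...") are skipped.
# busy_queues is an illustrative helper; the default limit of 1000 is arbitrary.

busy_queues() {
    awk -F '\t' -v limit="${1:-1000}" \
        '$2 ~ /^[0-9]+$/ && $2 + 0 > limit { print $1 ": " $2 }'
}

# Canned sample input (invented values) so the filter can be tried anywhere.
printf 'Listing queues ...\nmetering.sample\t52340\nnotifications.info\t12\n' \
    | busy_queues 1000
```

On a controller node the same filter could be fed live data with `rabbitmqctl list_queues name messages | busy_queues 1000`; a count that keeps growing between runs would be consistent with the backlog described above.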

You should first fix the Ceph setup.

Comment 9 Mehdi ABAAKOUK 2017-07-18 09:30:32 UTC

I got the Ceph issue fixed and can reproduce the issue. I have added a "Depends On" link to the other issue, since the root cause has a good chance of being the same for both BZs.

Comment 11 Julien Danjou 2017-09-04 07:23:03 UTC

I'm closing this bug on the conclusion that it is not a bug and that the root cause is a lack of resources. The rest is discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1467947