Bug 1265418

Summary: When rabbitmq is partitioned, heat-api breaks and never comes back
Product: Red Hat OpenStack
Component: openstack-heat
Version: 6.0 (Juno)
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WORKSFORME
Severity: medium
Priority: unspecified
Keywords: ZStream
Target Milestone: ---
Reporter: David Hill <dhill>
Assignee: Steve Baker <sbaker>
QA Contact: Amit Ugol <augol>
CC: dhill, fpercoco, mburns, mschuppe, nobody, rhel-osp-director-maint, sbaker, shardy, yeylon, zbitter
Fixed In Version: python-oslo-messaging-1.4.1-7.el7ost
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-04-06 21:07:35 UTC

Description David Hill 2015-09-22 21:19:40 UTC
Description of problem:
When rabbitmq is partitioned, the heat-api services break and never come back, even after the rabbitmq cluster returns to a normal state.

Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1. Set up an HA rabbitmq cluster
2. Get heat-api running
3. Partition rabbitmq (iptables, kill a node, etc.)
4. Heat breaks; every service breaks
5. Fix rabbitmq
6. All services come back except heat-api

Actual results:
Broken heat-apis

Expected results:
Working heat-apis

Additional info:
We're able to reproduce this issue on request.

Comment 3 Zane Bitter 2015-09-23 13:06:24 UTC
It's surprising that heat-api would be the only service not to come back, given that it relies on oslo.messaging the same as everything else.

Comment 4 Zane Bitter 2015-09-23 13:09:43 UTC
*** Bug 1265417 has been marked as a duplicate of this bug. ***

Comment 5 David Hill 2015-09-23 14:16:54 UTC
We're able to reproduce this problem as easily as adding an iptables rule to deliberately break a rabbitmq cluster.

Comment 6 Steve Baker 2015-09-24 03:45:42 UTC
Could you please compare the following configuration settings in /etc/heat/heat.conf against other services? They should be consistent (or consistently using the default value):

- rpc_backend
- rabbit_*
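A matching heat.conf fragment might look like the following. This is only a sketch: the hostnames and values are placeholders, not taken from the affected environment; the point is that these settings should be identical (or identically defaulted) across services.

```ini
[DEFAULT]
# RPC driver; must match what the other OpenStack services use.
rpc_backend = rabbit
# Illustrative rabbit_* settings; hosts and ports are placeholders.
rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672
rabbit_ha_queues = true
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
```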

Comment 7 Steve Baker 2015-09-29 04:36:51 UTC
As part of the above needinfo, could you also confirm that the latest rhos-6 packages are installed?

python-oslo-messaging-1.4.1-6.el7ost.noarch
openstack-heat-engine-2014.2.3-5.el7ost.noarch
openstack-heat-common-2014.2.3-5.el7ost.noarch
openstack-heat-api-2014.2.3-5.el7ost.noarch

Comment 9 Steve Baker 2015-09-29 22:52:17 UTC
Progress report: I've set up a 3-node cluster, and heat-api appears to recover from taking one node out, but removing and restoring 2 nodes results in the following for every request:

# heat stack-list
ERROR: The server could not comply with the request since it is either malformed or otherwise incorrect.

heat-api.log:
2015-09-29 18:50:35.547 16949 ERROR root [req-d898a6de-b2ae-47db-9e2f-948938246ca9 ] Exception handling resource: 'NoneType' object is not iterable

I will continue to investigate this exception.
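For reference, the error signature in the log above is Python's standard message for iterating over a missing value, as happens when an RPC call silently returns no result during the partition. A minimal stand-alone reproduction (not the actual Heat code):

```python
# Minimal reproduction of the log's error signature: iterating over None,
# which is what a lost RPC reply looks like to the calling code.
stacks = None  # stand-in for an RPC reply lost during the partition
try:
    for stack in stacks:
        print(stack)
except TypeError as exc:
    print(exc)  # prints: 'NoneType' object is not iterable
```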

Comment 10 Steve Baker 2015-09-30 03:43:49 UTC
I've observed heat not initially recovering from a rabbit partition; however, it does recover after some time (I've witnessed times ranging from 5 to 15 minutes).

I've also investigated the error mentioned in comment 9. It turns out that during the partition, calls to RPC client methods such as list_stacks start returning None. The calling code expects either a valid response or an exception, so later code ended up raising exceptions. Defensive checking of RPC client method results stopped those errors, but it was unclear whether this sped up recovery after the partition was healed.
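The defensive checking described above might look like this sketch. The client class, method names, and exception are illustrative stand-ins, not the actual Heat source; the idea is simply to fail loudly when the RPC layer returns None instead of letting downstream code iterate over it.

```python
# Sketch of defensive handling for RPC client results, assuming a client
# whose calls can return None during a messaging partition instead of
# raising. All names here are illustrative, not actual Heat code.

class RPCUnavailableError(Exception):
    """Raised when the RPC layer returns no result."""

def list_stacks_defensively(rpc_client, ctxt):
    stacks = rpc_client.list_stacks(ctxt)
    if stacks is None:
        # During a RabbitMQ partition the call may yield None rather
        # than a list; raise a clear error instead of iterating None.
        raise RPCUnavailableError("RPC returned no result; messaging "
                                  "layer may be partitioned")
    return stacks
```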

David, how long have you waited for heat-api to recover? Could you try waiting up to 30 minutes?

Comment 11 David Hill 2015-09-30 16:31:14 UTC
At some points, we waited for many hours before taking action.

Comment 12 David Hill 2015-11-10 19:01:25 UTC
This is still an issue for the customer.  What are the next steps?

Comment 13 Steve Baker 2015-11-10 20:20:26 UTC
It would help us to prioritise this issue with our existing workload to know how this is impacting the customer. All we have to go on at the moment is Severity: medium.

Is the heat-api partitioning issue one that happens in testing only, or is the customer actually affected by partition events?

Comment 14 David Hill 2015-11-10 20:27:06 UTC
Well, it's not impacting the customer right now, but once in a while their rabbitmq cluster partitions (network glitches, I guess), and then heat-api stops working properly and needs to be bounced. It only happens with this service.

Comment 17 Steve Baker 2016-02-03 21:50:01 UTC
There is a new release of oslo.messaging being prepared for bug #1276166 which may help with this issue. Once it has been released we'll request that partitioning be re-tested with the new oslo.messaging.

Comment 18 Flavio Percoco 2016-03-01 13:05:36 UTC
This could probably be re-tested now!

Comment 19 Steve Baker 2016-03-01 20:10:01 UTC
Hi David, could you please retest this behaviour with the recently released python-oslo-messaging-1.4.1-7.el7ost?

While my testing didn't replicate this partitioning behaviour, it did show issues which sound very much like bug #1276166.

Comment 21 Steve Baker 2016-04-06 21:07:35 UTC
Closing: no known environment currently has this issue, and python-oslo-messaging-1.4.1-7.el7ost is expected to fix it.