1265418 – When rabbitmq is partitioned, heat-apis will break and never come back

Summary: When rabbitmq is partitioned, heat-apis will break and never come back

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-heat
Sub Component:
Version:	6.0 (Juno)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	6.0 (Juno)
Assignee:	Steve Baker
QA Contact:	Amit Ugol
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1265417
Depends On:
Blocks:

Reported:	2015-09-22 21:19 UTC by David Hill
Modified:	2019-09-12 08:57 UTC
CC List:	10 users
Fixed In Version:	python-oslo-messaging-1.4.1-7.el7ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-04-06 21:07:35 UTC
Target Upstream Version:
Embargoed:

Description David Hill 2015-09-22 21:19:40 UTC

Description of problem:
When rabbitmq is partitioned, heat-apis will break and never come back, even after the rabbitmq cluster has returned to a normal state.

Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1. Set up an HA rabbitmq cluster
2. Get heat-api running
3. Partition rabbitmq (iptables, kill a node, etc.)
4. Heat breaks; every service breaks
5. Fix rabbitmq
6. All services come back except heat-api

Actual results:
Broken heat-apis

Expected results:
Working heat-apis

Additional info:
We're able to reproduce this issue on request.

Comment 3 Zane Bitter 2015-09-23 13:06:24 UTC

It's surprising that heat-api would be the only service not to come back, given that it relies on oslo.messaging the same as everything else.

Comment 4 Zane Bitter 2015-09-23 13:09:43 UTC

*** Bug 1265417 has been marked as a duplicate of this bug. ***

Comment 5 David Hill 2015-09-23 14:16:54 UTC

We can reproduce this problem simply by adding an iptables rule that deliberately breaks a rabbitmq cluster.

Comment 6 Steve Baker 2015-09-24 03:45:42 UTC

Could you please compare the following configuration settings in /etc/heat/heat.conf against other services? They should be consistent (or consistently using the default value):

- rpc_backend
- rabbit_*
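
A minimal sketch of that comparison in Python 3, assuming both files keep these options in their [DEFAULT] section (an assumption about this deployment, not something confirmed in this report). The function names messaging_settings and diff_settings are made up for illustration.

```python
import configparser


def messaging_settings(conf_text):
    """Return rpc_backend and rabbit_* options found in the [DEFAULT] section."""
    # interpolation=None keeps literal '%' characters intact; strict=False
    # tolerates options that are repeated within a file.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return {
        key: value
        for key, value in parser.defaults().items()
        if key == "rpc_backend" or key.startswith("rabbit_")
    }


def diff_settings(reference, other):
    """Map each option whose value differs to a (reference, other) pair.

    An option missing from one file is reported as None, meaning that
    service falls back to its built-in default.
    """
    keys = sorted(set(reference) | set(other))
    return {
        key: (reference.get(key), other.get(key))
        for key in keys
        if reference.get(key) != other.get(key)
    }
```

To use it, read /etc/heat/heat.conf and another service's configuration file into strings, pass each to messaging_settings, and hand the two results to diff_settings. An empty dictionary means the settings are consistent.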

Comment 7 Steve Baker 2015-09-29 04:36:51 UTC

As part of the above needinfo, could you also confirm that the latest rhos-6 packages are installed?

python-oslo-messaging-1.4.1-6.el7ost.noarch
openstack-heat-engine-2014.2.3-5.el7ost.noarch
openstack-heat-common-2014.2.3-5.el7ost.noarch
openstack-heat-api-2014.2.3-5.el7ost.noarch

Comment 9 Steve Baker 2015-09-29 22:52:17 UTC

Progress report: I've set up a 3-node cluster. heat-api appears to recover from taking one node out, but removing and restoring 2 nodes results in the following for every request:

# heat stack-list
ERROR: The server could not comply with the request since it is either malformed or otherwise incorrect.

heat-api.log:
2015-09-29 18:50:35.547 16949 ERROR root [req-d898a6de-b2ae-47db-9e2f-948938246ca9 ] Exception handling resource: 'NoneType' object is not iterable

I will continue to investigate this exception.

Comment 10 Steve Baker 2015-09-30 03:43:49 UTC

I've observed heat not initially recovering from a rabbit partition; however, it does recover after some time (I've witnessed times ranging from 5 to 15 minutes).

I've also investigated the error mentioned in comment 9. It turns out that during the partition, calls to RPC client methods such as list_stacks start returning None. The code either expects a valid response or an exception, so later code was raising exceptions. Defensive checking of RPC client method results stopped those errors, but it was unclear whether this sped recovery after the partition was healed.
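
The kind of defensive check described above might look roughly like the following. This is an illustrative sketch only, not the change that was actually tested in Heat: EmptyRPCResponse, require_result, stack_names and the two stand-in client classes are hypothetical names, and only list_stacks comes from this report.

```python
class EmptyRPCResponse(Exception):
    """Hypothetical error for an RPC call that returned None instead of data."""


def require_result(result, method_name):
    """Return result unchanged, or raise if the RPC layer handed back None."""
    if result is None:
        raise EmptyRPCResponse("%s returned no result" % method_name)
    return result


def stack_names(rpc_client):
    """List stack names, failing with a clear error when list_stacks yields None."""
    stacks = require_result(rpc_client.list_stacks(), "list_stacks")
    return [stack["stack_name"] for stack in stacks]


class PartitionedClient(object):
    """Stand-in for an RPC client during a partition: the call returns None."""

    def list_stacks(self):
        return None


class HealthyClient(object):
    """Stand-in for an RPC client that answers normally."""

    def list_stacks(self):
        return [{"stack_name": "demo"}]
```

Without the check, iterating over the None returned by PartitionedClient raises a TypeError with the same "'NoneType' object is not iterable" message seen in heat-api.log. With it, the caller gets an exception that names the RPC method involved.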

David, how long have you waited for heat-api to recover? Could you try waiting up to 30 minutes?
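
To put a number on the recovery time, a small polling loop along these lines could be left running after the partition is healed. It is a sketch under stated assumptions: wait_for_recovery and heat_stack_list_ok are hypothetical helpers, and the probe assumes that the heat CLI exits with a non-zero status when the request fails.

```python
import subprocess
import time


def wait_for_recovery(probe, timeout=1800.0, interval=30.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True.

    Returns the elapsed seconds at the first success, or None if the
    timeout (30 minutes by default) passes without one.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            pass
        if clock() - start >= timeout:
            return None
        sleep(interval)


def heat_stack_list_ok():
    """Probe: run 'heat stack-list' and report whether it exited with status 0."""
    result = subprocess.run(
        ["heat", "stack-list"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling wait_for_recovery(heat_stack_list_ok) right after the cluster is repaired would report how many seconds passed before the first successful request, or None if heat-api had still not recovered after 30 minutes.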

Comment 11 David Hill 2015-09-30 16:31:14 UTC

On some occasions, we waited for many hours before taking action.

Comment 12 David Hill 2015-11-10 19:01:25 UTC

This is still an issue for the customer.  What are the next steps?

Comment 13 Steve Baker 2015-11-10 20:20:26 UTC

It would help us to prioritise this issue with our existing workload to know how this is impacting the customer. All we have to go on at the moment is Severity: medium.

Is the heat-api partitioning issue one that happens in testing only, or is the customer actually affected by partition events?

Comment 14 David Hill 2015-11-10 20:27:06 UTC

Well, it's not impacting the customer right now, but once in a while their rabbitmq cluster partitions (network glitches, I guess), and then heat-api stops working properly and needs to be restarted.  It only happens with this service.

Comment 17 Steve Baker 2016-02-03 21:50:01 UTC

There is a new release of oslo.messaging being prepared for bug #1276166 which may help with this issue. Once it has been released we'll request that partitioning be re-tested with the new oslo.messaging.

Comment 18 Flavio Percoco 2016-03-01 13:05:36 UTC

This could probably be re-tested now!

Comment 19 Steve Baker 2016-03-01 20:10:01 UTC

Hi David, could you please retest this behaviour with the recently released python-oslo-messaging-1.4.1-7.el7ost?

While my testing didn't replicate this partitioning behaviour, it did show issues which sound very much like bug #1276166.

Comment 21 Steve Baker 2016-04-06 21:07:35 UTC

Closing: no known environment has this issue, and python-oslo-messaging-1.4.1-7.el7ost is expected to fix it.
