Bug 1265418 - When rabbitmq is partitioned, heat-apis will break and never come back
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Release: 6.0 (Juno)
Assigned To: Steve Baker
QA Contact: Amit Ugol
Keywords: ZStream
Duplicates: 1265417
Depends On:
Reported: 2015-09-22 17:19 EDT by David Hill
Modified: 2016-04-26 15:57 EDT
CC: 10 users

See Also:
Fixed In Version: python-oslo-messaging-1.4.1-7.el7ost
Doc Type: Bug Fix
Last Closed: 2016-04-06 17:07:35 EDT
Type: Bug

Attachments: None
Description David Hill 2015-09-22 17:19:40 EDT
Description of problem:
When rabbitmq is partitioned, heat-apis will break and never come back, even when the rabbitmq cluster has returned to a normal state.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Get HA rabbitmq cluster
2. Get heat-api running
3. Partition rabbitmq (iptables, kill a node, etc)
4. Heat breaks; every service breaks
5. Fix RabbitMQ
6. All services come back except heat-api
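
For step 3, one possible way to simulate the partition is to generate iptables rules that cut a RabbitMQ node off from its cluster peers on the clustering ports (4369 for epmd, 25672 for Erlang inter-node traffic). This is a sketch, not what the reporter ran: the peer addresses are placeholders, and the script only prints the commands rather than executing them.

```python
# Sketch only: print iptables commands that would isolate this node from
# its RabbitMQ cluster peers, simulating the partition in step 3.
# Peer IPs are placeholders; run the printed commands manually (as root),
# and delete the rules (-D instead of -A) afterwards to heal the partition.

CLUSTER_PORTS = [4369, 25672]  # epmd and Erlang inter-node distribution

def partition_commands(peers):
    """Return iptables commands dropping cluster traffic to/from peers."""
    cmds = []
    for peer in peers:
        for port in CLUSTER_PORTS:
            cmds.append(f"iptables -A INPUT -s {peer} -p tcp --dport {port} -j DROP")
            cmds.append(f"iptables -A OUTPUT -d {peer} -p tcp --dport {port} -j DROP")
    return cmds

for cmd in partition_commands(["192.0.2.11", "192.0.2.12"]):
    print(cmd)
```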

Actual results:
Broken heat-apis

Expected results:
Working heat-apis

Additional info:
We're able to reproduce this issue on request.
Comment 3 Zane Bitter 2015-09-23 09:06:24 EDT
It's surprising that heat-api would be the only service not to come back, given that it relies on oslo.messaging the same as everything else.
Comment 4 Zane Bitter 2015-09-23 09:09:43 EDT
*** Bug 1265417 has been marked as a duplicate of this bug. ***
Comment 5 David Hill 2015-09-23 10:16:54 EDT
We're able to reproduce this problem as easily as adding an iptables rule to deliberately break a rabbitmq cluster.
Comment 6 Steve Baker 2015-09-23 23:45:42 EDT
Could you please compare the following configuration settings in /etc/heat/heat.conf against other services? They should be consistent (or consistently using the default value):

- rpc_backend
- rabbit_*
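
A minimal sketch of the comparison suggested above: extract the rpc_backend and rabbit_* options from oslo-style config files and report where heat disagrees with another service. The file contents below are invented examples for illustration, not the customer's actual configs.

```python
import configparser

def messaging_settings(conf_text):
    """Return the rpc_backend and rabbit_* options found in a config."""
    cp = configparser.ConfigParser()
    cp.read_string(conf_text)
    merged = dict(cp.defaults())          # the [DEFAULT] section
    for section in cp.sections():
        merged.update(cp.items(section))  # any other sections
    return {k: v for k, v in merged.items()
            if k == "rpc_backend" or k.startswith("rabbit_")}

heat_conf = """\
[DEFAULT]
rpc_backend = rabbit
rabbit_hosts = 192.0.2.11:5672,192.0.2.12:5672
rabbit_ha_queues = True
"""

nova_conf = """\
[DEFAULT]
rpc_backend = rabbit
rabbit_hosts = 192.0.2.11:5672,192.0.2.12:5672
rabbit_ha_queues = False
"""

heat = messaging_settings(heat_conf)
nova = messaging_settings(nova_conf)
mismatched = {k for k in heat if nova.get(k) != heat[k]}
print(mismatched)  # the settings heat disagrees on: {'rabbit_ha_queues'}
```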
Comment 7 Steve Baker 2015-09-29 00:36:51 EDT
As part of the above needinfo, could you also confirm that the latest rhos-6 packages are installed?

Comment 9 Steve Baker 2015-09-29 18:52:17 EDT
Progress report: I've set up a 3-node cluster, and heat-api appears to recover from taking one node out, but removing and restoring 2 nodes results in the following for every request:

# heat stack-list
ERROR: The server could not comply with the request since it is either malformed or otherwise incorrect.

2015-09-29 18:50:35.547 16949 ERROR root [req-d898a6de-b2ae-47db-9e2f-948938246ca9 ] Exception handling resource: 'NoneType' object is not iterable

I will continue to investigate this exception.
Comment 10 Steve Baker 2015-09-29 23:43:49 EDT
I've observed heat not initially recovering from a rabbit partition; however, it does recover after some time (I've witnessed times ranging from 5 to 15 minutes).

I've also investigated the error mentioned in comment 9. It turns out that during the partition, calls to RPC client methods such as list_stacks start returning None. The code either expects a valid response or an exception, so later code was raising exceptions. Defensive checking of RPC client method results stopped those errors, but it was unclear whether this sped recovery after the partition was healed.
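
A sketch (not the actual heat patch) of the defensive checking described above: treat a None result from an RPC client method as an explicit error rather than letting later code fail with "'NoneType' object is not iterable". The class and function names here are invented for illustration.

```python
# Hypothetical illustration of the defensive check: raise a clear error
# when an RPC client method returns None (as observed during the rabbit
# partition), instead of passing None to code that expects a real result.

class RPCNoResultError(Exception):
    """The RPC layer returned no result, e.g. during a rabbit partition."""

def check_rpc_result(result, method_name):
    """Raise instead of propagating a None RPC result downstream."""
    if result is None:
        raise RPCNoResultError(
            f"RPC call {method_name!r} returned no result; "
            "the messaging backend may be unavailable")
    return result

# Stub standing in for heat-api's RPC client during a partition:
class StubEngineClient:
    def list_stacks(self):
        return None  # what the real client was observed returning

client = StubEngineClient()
try:
    stacks = check_rpc_result(client.list_stacks(), "list_stacks")
except RPCNoResultError as exc:
    print(exc)
```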

David, how long have you waited for heat-api to recover? Could you try waiting up to 30 minutes?
Comment 11 David Hill 2015-09-30 12:31:14 EDT
At some point, we waited many hours before taking action.
Comment 12 David Hill 2015-11-10 14:01:25 EST
This is still an issue for the customer.  What are the next steps?
Comment 13 Steve Baker 2015-11-10 15:20:26 EST
It would help us prioritise this issue against our existing workload to know how it is impacting the customer. All we have to go on at the moment is Severity: medium.

Is the heat-api partitioning issue one that happens in testing only, or is the customer actually affected by partition events?
Comment 14 David Hill 2015-11-10 15:27:06 EST
Well, it's not impacting the customer right now, but once in a while their rabbitmq cluster partitions (network glitches, I guess), and then heat-api stops working properly and needs to be bounced. It only happens with this service.
Comment 17 Steve Baker 2016-02-03 16:50:01 EST
There is a new release of oslo.messaging being prepared for bug #1276166 which may help with this issue. Once it has been released we'll request that partitioning be re-tested with the new oslo.messaging.
Comment 18 Flavio Percoco 2016-03-01 08:05:36 EST
This could probably be re-tested now!
Comment 19 Steve Baker 2016-03-01 15:10:01 EST
Hi David, could you please retest this behaviour with the recently released python-oslo-messaging-1.4.1-7.el7ost?

While my testing didn't replicate this partitioning behaviour, it did show issues which sound very much like bug #1276166
Comment 21 Steve Baker 2016-04-06 17:07:35 EDT
Closing: no known environment has this issue, and python-oslo-messaging-1.4.1-7.el7ost is expected to fix it.
