Red Hat Bugzilla – Bug 1265418
When rabbitmq is partitioned, heat-apis will break and never come back
Last modified: 2016-04-26 15:57:54 EDT
Description of problem:
When rabbitmq is partitioned, heat-apis will break and never come back, even when the rabbitmq cluster is back to a normal state.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Get HA rabbitmq cluster
2. Get heat-api running
3. Partition rabbitmq (iptables, kill a node, etc)
4. Heat breaks, as does every other service
5. Fix rabbitmq
6. All services come back except heat-api
We're able to reproduce this issue on request.
It's surprising that heat-api would be the only service not to come back, given that it relies on oslo.messaging the same as everything else.
*** Bug 1265417 has been marked as a duplicate of this bug. ***
We're able to reproduce this problem simply by adding an iptables rule that deliberately breaks a rabbitmq cluster.
Could you please compare the following configuration settings in /etc/heat/heat.conf against other services? They should be consistent (or consistently using the default value):
As part of the above needinfo, could you also confirm that the latest rhos-6 packages are installed?
Progress report: I've set up a 3-node cluster. heat-api appears to recover when one node is taken out, but removing and restoring 2 nodes results in the following for every request:
# heat stack-list
ERROR: The server could not comply with the request since it is either malformed or otherwise incorrect.
2015-09-29 18:50:35.547 16949 ERROR root [req-d898a6de-b2ae-47db-9e2f-948938246ca9 ] Exception handling resource: 'NoneType' object is not iterable
I will continue to investigate this exception.
I've observed heat not initially recovering from a rabbit partition; however, it does recover after some time (I've witnessed times ranging from 5 to 15 minutes).
I've also investigated the error mentioned in comment 9. It turns out that during the partition, calls to RPC client methods such as list_stacks start returning None. The code either expects a valid response or an exception, so later code was raising exceptions. Defensive checking of RPC client method results stopped those errors, but it was unclear whether this sped recovery after the partition was healed.
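For illustration, the kind of defensive check described above could look roughly like the following. This is only a sketch, not the actual change: `EmptyRPCResponse` and `checked_call` are hypothetical names, and it assumes that a None result is never a legitimate response from the RPC method being wrapped.

```python
class EmptyRPCResponse(Exception):
    """Hypothetical error: an RPC call returned None instead of a result."""


def checked_call(rpc_method, *args, **kwargs):
    """Call an RPC client method and treat a None result as a failure.

    Assumption: the wrapped method (for example list_stacks) either returns
    a real value or raises, so None can only mean the call silently failed.
    """
    result = rpc_method(*args, **kwargs)
    if result is None:
        name = getattr(rpc_method, "__name__", repr(rpc_method))
        raise EmptyRPCResponse("RPC call %s returned no result" % name)
    # An empty list is still a valid response and is passed through.
    return result
```

With a wrapper like this, the API layer would see an explicit exception rather than failing later with "'NoneType' object is not iterable".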
David, how long have you waited for heat-api to recover? Could you try waiting up to 30 minutes?
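If it helps to put a number on the recovery time, a small polling loop along these lines could be used. It is a rough sketch under stated assumptions: `wait_for_recovery` and its `check` callable are hypothetical, and `check` is expected to return True once a request to heat-api succeeds again.

```python
import time


def wait_for_recovery(check, timeout=1800.0, interval=30.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds have passed.

    Returns the elapsed seconds at the first successful check, or None if
    the service had not recovered by the timeout (default: 30 minutes).
    """
    start = clock()
    while True:
        try:
            if check():
                return clock() - start
        except Exception:
            # A failing check counts as "not recovered yet".
            pass
        if clock() - start >= timeout:
            return None
        sleep(interval)


# Hypothetical check, assuming the heat CLI exits non-zero on the ERROR above:
#   import subprocess
#   check = lambda: subprocess.call(["heat", "stack-list"]) == 0
```

Start it right after the partition is healed; a None result means heat-api was still failing after the full 30 minutes.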
At some points, we waited for many hours before taking action.
This is still an issue for the customer. What are the next steps?
It would help us to prioritise this issue with our existing workload to know how this is impacting the customer. All we have to go on at the moment is Severity: medium.
Is the heat-api partitioning issue one that happens in testing only, or is the customer actually affected by partition events?
Well, it's not impacting the customer right now but once in a while, their rabbitmq cluster partitions (network glitches I guess) and then heat-api stops working properly and needs to be bounced. It only happens with this service.
There is a new release of oslo.messaging being prepared for bug #1276166 which may help with this issue. Once it has been released we'll request that partitioning be re-tested with the new oslo.messaging.
This could probably be re-tested now!
Hi David, could you please retest this behaviour with the recently released python-oslo-messaging-1.4.1-7.el7ost?
While my testing didn't replicate this partitioning behaviour, it did show issues which sound very much like bug #1276166.
Closing, no known environment has this issue and python-oslo-messaging-1.4.1-7.el7ost is expected to fix it.