Bug 1276166

Summary: oslo.messaging holds connections when replies fail
Product: Red Hat OpenStack
Reporter: Faiaz Ahmed <fahmed>
Component: python-oslo-messaging
Assignee: Flavio Percoco <fpercoco>
Status: CLOSED ERRATA
QA Contact: Udi Shkalim <ushkalim>
Severity: high
Docs Contact:
Priority: high
Version: 6.0 (Juno)
CC: adahms, apevec, chenders, christer.eriksson, dmaley, ealcaniz, fahmed, fpercoco, hguemar, jbiao, jeckersb, lhh, mabaakou, nlevinki, oblaut, ochalups, pcaruana, plemenko, ushkalim, wlehman, yeylon, zbitter
Target Milestone: async
Keywords: Triaged, ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: python-oslo-messaging-1.4.1-7.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, when a reply queue did not exist, oslo.messaging would retry the reply every second for 60 seconds, and the connection used for the reply was held for the entire 60 seconds. When multiple replies failed, this could lead to starvation because the pool could run out of connections. With this release, the connection is released back to the pool whenever a reply attempt fails and is re-acquired on the next retry.
Story Points: ---
Clone Of:
: 1315842 (view as bug list)
Environment:
Last Closed: 2016-02-18 17:44:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1282918, 1301634, 1315842, 1341120, 1358909    

Comment 3 Peter Lemenkov 2015-10-29 15:50:06 UTC
> 
> - Restarting individual services, including rabbitmq, does not work.
> - Connecting to rabbitmq goes into an infinite loop.
> - Need to restart the entire cluster to reconnect to the services.

Could you please elaborate on this? So restarting the services one by one (incl. rabbitmq) didn't fix anything, but restarting the entire rabbitmq cluster did fix the issue?

Do I understand correctly that you stopped all rabbitmq instances and then started them again?

Comment 5 Peter Lemenkov 2015-10-30 16:05:50 UTC
Ok, this looks like a possible oslo.messaging issue. Let's bring in the python-oslo-messaging maintainers for advice.

Comment 9 Flavio Percoco 2015-11-09 14:49:51 UTC
This is, unfortunately, a known issue.

We've just had similar reports of this. The upstream patch that would help fix this is: https://review.openstack.org/#/c/193037/

That patch will be part of 8.0 so, perhaps, the target release for this bug should be changed.

I'm going to set a needinfo on Mehdi to get a second opinion on what the issue is here, but I'd guess it's the same one we've been seeing.

Comment 10 Faiaz Ahmed 2015-11-09 23:04:16 UTC
Hello Flavio,
This is a production environment for the customer, running version 6.

Can we have a backport? I am sure the customer will not be happy to hear they have to wait until version 8, though they do have a workaround of restarting the whole cluster.


Cheers
Faiaz

Comment 11 Mehdi ABAAKOUK 2015-11-16 10:36:14 UTC
I confirm what Flavio said: https://review.openstack.org/#/c/193037/ doesn't fix anything, it just reduces the recovery time in certain cases.

But if your installation 'never' recovers, it will not fix anything.

That said, I have dug into the oslo.messaging version packaged in rhos-6 and found that https://review.openstack.org/#/c/109373/ was backported with the same issue we had already found in rhos-5. The rhos-5 fix for this backport can be found here: https://code.engineering.redhat.com/gerrit/#/c/58673/.

I have just proposed the same fix for rhos-6: https://code.engineering.redhat.com/gerrit/#/c/61571/

Comment 12 Flavio Percoco 2015-11-16 13:49:09 UTC
Faiaz,

I was not suggesting the customer should wait until 8.0, but that the target release of *this* bug should be corrected, because it's currently set to 8.0. Did you mean 6.0?

I've merged Mehdi's backport and we could create a test build with that. Unfortunately, 193037 can't be backported as it's too invasive. That said, we're experimenting with some other, less invasive ways to work around this issue.

Comment 13 Faiaz Ahmed 2015-11-16 23:05:59 UTC
Hi,
Thanks for the clarification. We are definitely looking for a fix on OSP 6, since it is a production environment.

If you are going to do a backport for OSP 6, can we have a tentative target date?

Faiaz

Comment 15 Flavio Percoco 2015-11-17 19:00:57 UTC
Faiaz,

Is this being requested as a hotfix? We should probably get a test build first to verify whether it actually helps with the issue.

As I mentioned in one of my previous comments, we won't backport the bigger patch because it's too invasive.

Comment 17 Peter Lemenkov 2015-11-25 16:47:15 UTC
*** Bug 1277529 has been marked as a duplicate of this bug. ***

Comment 18 Peter Lemenkov 2015-11-25 16:54:56 UTC
*** Bug 1268365 has been marked as a duplicate of this bug. ***

Comment 19 Peter Lemenkov 2015-11-26 11:55:47 UTC
*** Bug 1255565 has been marked as a duplicate of this bug. ***

Comment 27 Udi Shkalim 2016-01-27 11:36:11 UTC
Failed QA
Version
2014.2.3 - OSP 6
python-oslo-messaging-1.4.1-7.el7ost.noarch

Steps done to recreate the problem:
1. Launched an instance with nova boot command - nova boot --image cirros-pop-osp --flavor m1.tiny --nic net-id=f05674ff-a734-4c08-8d6d-2318421b35c6 cirros
2. Restarted heat and rabbitmq - systemctl restart *heat* ; systemctl restart *rabbitmq*
3. In the rabbit log, the following message repeats itself in a loop:

=ERROR REPORT==== 27-Jan-2016::06:26:23 ===
connection <0.12730.0>, channel 1 - soft error:
{amqp_error,not_found,
            "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
            'exchange.declare'}

=ERROR REPORT==== 27-Jan-2016::06:26:23 ===
connection <0.2201.0>, channel 1 - soft error:
{amqp_error,not_found,
            "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
            'exchange.declare'}

=ERROR REPORT==== 27-Jan-2016::06:26:23 ===
connection <0.5471.0>, channel 1 - soft error:
{amqp_error,not_found,
            "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
            'exchange.declare'}



After the restart, the first instance is still in BUILD state.
Booting more instances afterwards was successful.

OSP6-controller ~(keystone_admin)]# nova list
+--------------------------------------+--------------+--------+------------+-------------+---------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks            |
+--------------------------------------+--------------+--------+------------+-------------+---------------------+
| acfbbfc2-b3d3-4338-b109-a7cbad982636 | cirros_udi   | BUILD  | -          | NOSTATE     | private=192.168.1.1 |
| 96323d44-3374-4ba1-b32f-a22e21171168 | cirros_udi_2 | ACTIVE | -          | Running     | private=192.168.1.3 |
| e1c03f40-bfe0-44da-ba9f-be37e336cbd5 | cirros_udi_3 | ACTIVE | -          | Running     | private=192.168.1.4 |
+--------------------------------------+--------------+--------+------------+-------------+---------------------+

/var/log/ directory is attached.

Comment 29 Flavio Percoco 2016-01-27 14:09:10 UTC
mmh, I think this bug is verified, TBH. I can't find the `NOT_FOUND` errors in the logs, which means it recovered correctly. 

After the restart it looks like you're able to create VMs and things keep working.

Comment 30 Udi Shkalim 2016-01-28 14:02:31 UTC
On the VM creation part you are right - the system is functioning.
But check out comment 6. The log keeps being populated with this error. If the problem described in comment 6 is a different problem, maybe we should open a different bug and verify this one.

Comment 32 Lon Hohberger 2016-01-29 02:04:17 UTC
If there's a separate issue, then you're right, we should get this one out and fix the other one separately.

Comment 33 Flavio Percoco 2016-02-02 20:52:37 UTC
Based on comment 6, it seems that services reconnected correctly and therefore this bug is verified. I'd probably recommend checking whether there was some service left that didn't reconnect and where these errors came from exactly.

Other than that, I don't think we should hold this bug up.

Comment 34 John Eckersberg 2016-02-02 21:18:05 UTC
(In reply to Udi Shkalim from comment #27)
> =ERROR REPORT==== 27-Jan-2016::06:26:23 ===
> connection <0.12730.0>, channel 1 - soft error:
> {amqp_error,not_found,
>             "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost
> '/'",
>             'exchange.declare'}
> 
> =ERROR REPORT==== 27-Jan-2016::06:26:23 ===
> connection <0.2201.0>, channel 1 - soft error:
> {amqp_error,not_found,
>             "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost
> '/'",
>             'exchange.declare'}
> 
> =ERROR REPORT==== 27-Jan-2016::06:26:23 ===
> connection <0.5471.0>, channel 1 - soft error:
> {amqp_error,not_found,
>             "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost
> '/'",
>             'exchange.declare'}

This part worries me:

$ grep -A3 '27-Jan-2016::06:26:23' rabbit | grep reply_453622d06afd45cab4c3e69fe2fb5f37 | wc -l
13

So, it tried to resend that 13 times in a one-second period.  The connection pid keeps changing, which tells me that o.m is in a tight loop: acquiring a connection from the pool, trying to reply, failing, and releasing it back to the pool.

Looks like the retry delay isn't happening.  I think in newer upstream versions it's every 1 second?
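
For what it's worth, here's a minimal Python sketch of that per-second check (not oslo.messaging code; it assumes the rabbit log has been saved locally as ./rabbit.log, and it uses the exchange name from the errors above). It buckets the not_found errors for that reply exchange per second, so you can see how many retries land in the same second:

# Sketch: count, per second, the not_found errors hitting one reply exchange.
# Assumes the rabbit log was saved as ./rabbit.log; timestamps follow the
# "=ERROR REPORT==== 27-Jan-2016::06:26:23 ===" format shown above.
import re
from collections import Counter

EXCHANGE = "reply_453622d06afd45cab4c3e69fe2fb5f37"
TS_RE = re.compile(r"=ERROR REPORT==== (\S+) ===")

counts = Counter()
current_ts = None
with open("rabbit.log") as log:
    for line in log:
        m = TS_RE.search(line)
        if m:
            current_ts = m.group(1)      # e.g. 27-Jan-2016::06:26:23
        elif EXCHANGE in line and current_ts:
            counts[current_ts] += 1      # error body belongs to the last timestamp seen

# Many hits in the same second means either the 1s delay is being skipped
# or several senders are retrying at the same time.
for ts, n in sorted(counts.items()):
    print(ts, n)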

Comment 35 Perry Myers 2016-02-03 03:25:10 UTC
I just fixed a bunch of private comments to be public. Please only make comments private if they contain internal/customer sensitive data.

Based on comment 34 I'm moving this bug back to ASSIGNED state for additional analysis and work

Comment 36 Mehdi ABAAKOUK 2016-02-03 08:14:07 UTC
It can be totally normal to get 404 errors in the rabbitmq logs.

Regarding the absence of a destination reply queue, we have two cases:
1) rabbit is restarted: the RPCServer has to wait for the clients to come back and recreate the queues with the same names.
2) the client is restarted: the reply queue is recreated with a new name, because it comes back as a new client. So, after the RPCServer has waited 60s in case #1 applies, it drops the reply that nobody can receive anymore.

When the destination queue for the reply doesn't exist, we retry every 1s for 60s. Before Mitaka, the connection used for the reply was held during these 60s, so if too many replies were failing, that could lead to a starvation of connections for replies. This has been fixed upstream by releasing the connection back to the pool between each try.
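
As a rough illustration of that fix (a sketch only -- `pool` and `send_reply` are hypothetical stand-ins, not the oslo.messaging API), the old code kept one pooled connection checked out for the entire 60s retry window, while the fixed code gives it back to the pool between attempts:

# Sketch only: `pool` and `send_reply` are hypothetical stand-ins, not oslo.messaging internals.
import time

ATTEMPTS = 60        # retry every 1s for up to 60s, as described above
RETRY_DELAY = 1.0

def reply_old(pool, msg):
    """Pre-fix behaviour: one pooled connection stays checked out for up to 60s."""
    conn = pool.acquire()
    try:
        for _ in range(ATTEMPTS):
            if send_reply(conn, msg):    # fails while the reply exchange is absent
                return True
            time.sleep(RETRY_DELAY)
        return False                     # abandon the reply after 60s
    finally:
        pool.release(conn)

def reply_fixed(pool, msg):
    """Fixed behaviour: the connection goes back to the pool between attempts,
    so many failing replies no longer starve the pool."""
    for _ in range(ATTEMPTS):
        conn = pool.acquire()
        try:
            if send_reply(conn, msg):
                return True
        finally:
            pool.release(conn)
        time.sleep(RETRY_DELAY)
    return False

With the second shape, a burst of undeliverable replies only ties up a connection for the instant of each attempt instead of the whole minute.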

Those fixes have been backported to osp5 and osp6, and to kilo and liberty too.

Note that when the starvation occurs, services can just go crazy and look stuck from the user/admin point of view; services can't report their health and look dead at times. All RPC calls finish in timeout, because the reply comes back too late. This can be caught like this: in the RPCServer (e.g. nova-compute) logs, for each 'Timed out waiting for a reply to message ID XXXX' we should get a 'No calling threads waiting for msg_id: XXXX'.
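
A quick way to look for that pairing (again just a sketch; "service.log" is a placeholder path, and the two message texts are the ones quoted above) is to collect the message IDs from both log lines and see how many overlap:

# Sketch: pair "Timed out waiting for a reply to message ID X" lines with
# "No calling threads waiting for msg_id: X" lines in a service log.
# "service.log" is a placeholder path.
import re

timed_out, late_replies = set(), set()
with open("service.log") as log:
    for line in log:
        m = re.search(r"Timed out waiting for a reply to message ID (\w+)", line)
        if m:
            timed_out.add(m.group(1))
        m = re.search(r"No calling threads waiting for msg_id: (\w+)", line)
        if m:
            late_replies.add(m.group(1))

# During the starvation scenario, most timed-out IDs should also show up as late replies.
print("timed out:", len(timed_out),
      "late replies:", len(late_replies),
      "both:", len(timed_out & late_replies))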

According to the logs, I don't think we fall into that bug anymore.

The 404 errors in the rabbitmq logs look totally normal here. We can see the expected messages in the nova-conductor logs, which look like:

"oslo.messaging._drivers.impl_rabbit [-] The exchange to reply to reply_453622d06afd45cab4c3e69fe2fb5f37 doesn't exist yet, retrying...", that corresponds to  "{amqp_error, not_found, "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",'exchange.declare'}" in rabbitmq logs.

And then after 60 seconds: "2016-01-27 06:35:09.358 23579 INFO oslo.messaging._drivers.impl_rabbit [-] The exchange to reply to reply_453622d06afd45cab4c3e69fe2fb5f37 still doesn't exist, abandoning..." 

Also, for the rpm versioning story:
* python-oslo-messaging-1.4.1-6.el7ost -> the reply retry code was not working at all  # the reason this bug was opened
* python-oslo-messaging-1.4.1-7.el7ost -> the reply retry code was fixed, but could lead to connection starvation  # the official fix for this bug; the new logs show that the reply retry works now
* python-oslo-messaging-1.4.1-8.el7ost -> the connection starvation was fixed  # looks like the version used by QA, given that each retry uses a new connection


So, to sum up, everything looks good to me.

Comment 37 Flavio Percoco 2016-02-03 11:13:27 UTC
Moving back to ON_QA; I agree with Mehdi's assessment here.

Comment 40 John Eckersberg 2016-02-03 19:51:33 UTC
OK, I'm on board now.  It *is* sleeping; the reason it doesn't look like it's sleeping is that a bunch of different conductor workers are all trying to reply to the same queue.  So each worker is behaving as expected, it just looks like they are misbehaving when they're all firing at the same time.

The fact that we seem to have a bunch of duplicate RPC requests in multiple workers simultaneously may or may not be indicative of a different bug, but it is not the same as this bug so would be tracked separately.

Comment 41 John Eckersberg 2016-02-03 20:02:48 UTC
Changed the summary to match bug 1301634; these are the same changes, but for different OSP versions.

Comment 42 Zane Bitter 2016-02-03 22:02:43 UTC
Do we think that this fix will also resolve bug 1265418?

Comment 43 John Eckersberg 2016-02-03 22:53:23 UTC
(In reply to Zane Bitter from comment #42)
> Do we think that this fix will also resolve bug 1265418?

It very well may, although I can't say for sure.  The good news is it sounds like you have a good reproducer for 1265418 so it should be easy to test if it helps.  Keep in mind that there are several fixes rolled up into the pending oslo.messaging release so even if this particular bug is not the cause, one of the others may be.

Comment 48 John Eckersberg 2016-02-15 19:11:43 UTC
*** Bug 1268487 has been marked as a duplicate of this bug. ***

Comment 50 errata-xmlrpc 2016-02-18 17:44:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0267.html