Bug 1276166
Summary: oslo.messaging holds connections when replies fail

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Faiaz Ahmed <fahmed> |
| Component: | python-oslo-messaging | Assignee: | Flavio Percoco <fpercoco> |
| Status: | CLOSED ERRATA | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | high | Docs Contact: | |
| Priority: | high | CC: | adahms, apevec, chenders, christer.eriksson, dmaley, ealcaniz, fahmed, fpercoco, hguemar, jbiao, jeckersb, lhh, mabaakou, nlevinki, oblaut, ochalups, pcaruana, plemenko, ushkalim, wlehman, yeylon, zbitter |
| Version: | 6.0 (Juno) | Keywords: | Triaged, ZStream |
| Target Milestone: | async | Target Release: | 6.0 (Juno) |
| Hardware: | Unspecified | OS: | Linux |
| Whiteboard: | | Fixed In Version: | python-oslo-messaging-1.4.1-7.el7ost |
| Doc Type: | Bug Fix | Doc Text: | Previously, when reply queues did not exist, oslo.messaging would retry every second for 60 seconds to connect, which meant the connection used by oslo.messaging was held for those 60 seconds. Moreover, when multiple replies failed, this could lead to starvation because the pool could run out of connections. With this release, connections are released back to the pool whenever a reply fails, and are re-acquired on each retry attempt. |
| Story Points: | --- | Clone Of: | |
| | 1315842 (view as bug list) | Environment: | |
| Last Closed: | 2016-02-18 17:44:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Bug Depends On: | |
| Bug Blocks: | 1282918, 1301634, 1315842, 1341120, 1358909 | | |
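The mechanism in the Doc Text — release the connection back to the pool whenever a reply fails, re-acquire it on the next retry — can be sketched as follows. This is a minimal illustration, not oslo.messaging's actual code; `ConnectionPool`, `send_reply`, and the `publish` callback are hypothetical stand-ins.

```python
import time


class ConnectionPool:
    """Minimal stand-in for a messaging connection pool (hypothetical API)."""

    def __init__(self, size):
        self._free = [object() for _ in range(size)]

    def get(self):
        if not self._free:
            raise RuntimeError("pool exhausted")  # the starvation case
        return self._free.pop()

    def put(self, conn):
        self._free.append(conn)


def send_reply(pool, publish, duration=60, retry_delay=1):
    """Retry a reply for up to `duration` seconds, releasing the
    connection back to the pool between attempts (the fixed behavior)."""
    deadline = time.monotonic() + duration
    while True:
        conn = pool.get()            # re-acquire a connection on each attempt
        try:
            publish(conn)
            return True
        except LookupError:          # reply exchange doesn't exist yet
            pass
        finally:
            pool.put(conn)           # release even when the reply fails
        if time.monotonic() >= deadline:
            return False             # abandon the reply after `duration`
        time.sleep(retry_delay)      # per-attempt delay (1 s in the bug)
```

With the pre-fix behavior (holding the connection across all 60 retries), a pool of N connections is exhausted by N concurrently failing replies; releasing between attempts lets other threads make progress in the meantime.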
Ok, this looks like a possible oslo.messaging issue. Let's call the python-oslo-messaging maintainers for advice.

This is, unfortunately, a known issue. We've just had similar reports for it. The upstream patch that would help fix this is: https://review.openstack.org/#/c/193037/ That patch will be part of 8.0 so, perhaps, the target release for this bug should be changed. I'm going to set a needinfo on Mehdi to get a second opinion on what the issue is here, but I'd guess it's the same one we've been seeing.

Hello Flavio,

This is a production environment for the customer, on version 6. Can we have a backport? I am sure the customer will not be happy to be told to wait until version 8, though they have a workaround of restarting the whole cluster.

Cheers,
Faiaz

I confirm what Flavio said: https://review.openstack.org/#/c/193037/ doesn't fix anything, it just reduces the recovery time in certain cases. If your installation 'never' recovers, it will not fix anything. That said, I have dug into the oslo.messaging version packaged into rhos-6 and found that https://review.openstack.org/#/c/109373/ was backported with the same issue we already found in rhos-5. The rhos-5 fix to this backport can be found here: https://code.engineering.redhat.com/gerrit/#/c/58673/. I have just proposed the same fix for rhos-6: https://code.engineering.redhat.com/gerrit/#/c/61571/

Faiaz, I was not suggesting the customer should wait until 8.0, but that the target release of *this* bug should be fixed because it's currently set to 8.0. Did you mean 6.0? I've merged Mehdi's backport and we could create a test build with it. Unfortunately, 193037 can't be backported as it's too invasive. That said, we're experimenting with other, less invasive ways to work around this issue.

Hi,

Thanks for the clarification. We are urgently looking for a fix on OSP 6 since it is a production environment. If you are going to backport this for OSP 6, can we have a tentative target date?
Faiaz

Faiaz, is this being requested as a hotfix? We should probably get a test build first to verify whether it actually helps with the issue. As I mentioned in one of my previous comments, we won't backport the bigger patch because it's too invasive.

*** Bug 1277529 has been marked as a duplicate of this bug. ***
*** Bug 1268365 has been marked as a duplicate of this bug. ***
*** Bug 1255565 has been marked as a duplicate of this bug. ***

Failed QA
Version 2014.2.3 - OSP 6
python-oslo-messaging-1.4.1-7.el7ost.noarch

Steps done to recreate the problem:
1. Launched an instance with the nova boot command - nova boot --image cirros-pop-osp --flavor m1.tiny --nic net-id=f05674ff-a734-4c08-8d6d-2318421b35c6 cirros
2. Restarted heat and rabbitmq - systemctl restart *heat* ; systemctl restart *rabbitmq*
3. In the rabbit log, the following message repeats itself in a loop:

=ERROR REPORT==== 27-Jan-2016::06:26:23 ===
connection <0.12730.0>, channel 1 - soft error:
{amqp_error,not_found,
"no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
'exchange.declare'}

=ERROR REPORT==== 27-Jan-2016::06:26:23 ===
connection <0.2201.0>, channel 1 - soft error:
{amqp_error,not_found,
"no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
'exchange.declare'}

=ERROR REPORT==== 27-Jan-2016::06:26:23 ===
connection <0.5471.0>, channel 1 - soft error:
{amqp_error,not_found,
"no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
'exchange.declare'}

After the restart, the first instance is still in BUILD state. Tried to boot more instances and was successful.
OSP6-controller ~(keystone_admin)]# nova list
+--------------------------------------+--------------+--------+------------+-------------+---------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks            |
+--------------------------------------+--------------+--------+------------+-------------+---------------------+
| acfbbfc2-b3d3-4338-b109-a7cbad982636 | cirros_udi   | BUILD  | -          | NOSTATE     | private=192.168.1.1 |
| 96323d44-3374-4ba1-b32f-a22e21171168 | cirros_udi_2 | ACTIVE | -          | Running     | private=192.168.1.3 |
| e1c03f40-bfe0-44da-ba9f-be37e336cbd5 | cirros_udi_3 | ACTIVE | -          | Running     | private=192.168.1.4 |
+--------------------------------------+--------------+--------+------------+-------------+---------------------+

The /var/log/ directory is attached.

Mmh, I think this bug is verified, TBH. I can't find the `NOT_FOUND` errors in the logs, which means it recovered correctly. After the restart, it looks like you're able to create VMs and things keep working.

On the VM creation part you are right - the system is functioning. But check out comment 6: the log keeps being populated with this error. If the problem described in comment 6 is a different problem, maybe we should open a different bug and verify this one.

If there's a separate issue, then you're right, we should get this one out and fix the other one separately. Based on comment 6, it seems that services reconnected correctly and therefore this bug is verified. I'd recommend checking whether there was some service left that didn't reconnect, and where these errors came from exactly. Other than that, I don't think we should hold this bug off.
(In reply to Udi Shkalim from comment #27)
> =ERROR REPORT==== 27-Jan-2016::06:26:23 ===
> connection <0.12730.0>, channel 1 - soft error:
> {amqp_error,not_found,
> "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
> 'exchange.declare'}
>
> =ERROR REPORT==== 27-Jan-2016::06:26:23 ===
> connection <0.2201.0>, channel 1 - soft error:
> {amqp_error,not_found,
> "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
> 'exchange.declare'}
>
> =ERROR REPORT==== 27-Jan-2016::06:26:23 ===
> connection <0.5471.0>, channel 1 - soft error:
> {amqp_error,not_found,
> "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",
> 'exchange.declare'}

This part worries me:

$ grep -A3 '27-Jan-2016::06:26:23' rabbit | grep reply_453622d06afd45cab4c3e69fe2fb5f37 | wc -l
13

So, it tried to resend that 13 times in a one-second period. The connection pid keeps changing, which tells me that o.m is in a tight loop: acquiring a connection from the pool, trying to reply, failing, and releasing it back to the pool. It looks like the retry delay isn't happening. I think in newer upstream versions it's every 1 second?

I just fixed a bunch of private comments to be public. Please only make comments private if they contain internal/customer-sensitive data.

Based on comment 34, I'm moving this bug back to the ASSIGNED state for additional analysis and work.

It can be totally normal to get 404 errors in the rabbitmq logs. Regarding the absent destination reply queue, we have two cases:

1) rabbit is restarted: the RPCServer has to wait for clients to come back and recreate queues with the same names.
2) the client is restarted: the reply queue is recreated with a new name, so it has become a new client.

So in case #1, after the RPCServer has waited 60s, it drops the reply that nobody can receive anymore. When the destination queue for the reply doesn't exist, we retry every 1s for 60s.
Before Mikita's fix, the connection used for the reply was held during those 60s, so if too many replies were failing, that could lead to a starvation of connections for replies. This has been fixed upstream by releasing the connection back to the pool between each try. That fix has been backported to osp5 and osp6, and to kilo and liberty too.

Note that when the starvation occurs, services can just go crazy and look stuck from the user/admin point of view; services can't report their health and look dead at times. All RPC calls finish in timeout, because the reply comes back too late. This can be caught like this: in the RPCServer (e.g. nova-compute) logs, for each 'Timed out waiting for a reply to message ID XXXX' we should see a 'No calling threads waiting for msg_id: XXXX'.

According to the logs, I don't think we fall into that bug anymore. The 404 errors in the rabbitmq logs look totally normal here. We can see expected messages in the nova-conductor logs that look like: "oslo.messaging._drivers.impl_rabbit [-] The exchange to reply to reply_453622d06afd45cab4c3e69fe2fb5f37 doesn't exist yet, retrying...", which corresponds to "{amqp_error, not_found, "no exchange 'reply_453622d06afd45cab4c3e69fe2fb5f37' in vhost '/'",'exchange.declare'}" in the rabbitmq logs. And then after 60 seconds: "2016-01-27 06:35:09.358 23579 INFO oslo.messaging._drivers.impl_rabbit [-] The exchange to reply to reply_453622d06afd45cab4c3e69fe2fb5f37 still doesn't exist, abandoning..."

Also, for the rpm versioning story:

* python-oslo-messaging-1.4.1-6.el7ost -> the reply retry code was not working at all # the reason this bug was opened
* python-oslo-messaging-1.4.1-7.el7ost -> the reply retry code was fixed but could lead to starvation of connections # the official fix for this bug; the new logs show that reply retry works now
* python-oslo-messaging-1.4.1-8.el7ost -> the starvation of connections was fixed # looks like the version used by QA, given that each retry uses a new connection
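The detection recipe above (pair each client-side 'Timed out waiting for a reply to message ID XXXX' with a server-side 'No calling threads waiting for msg_id: XXXX') could be automated roughly like this. This is a sketch with simplified, assumed log formats; `starved` is a hypothetical helper, not an oslo.messaging API.

```python
import re


def starved(log_lines):
    """Return True when every RPC timeout has a matching
    'No calling threads waiting for msg_id' line for the same message ID,
    i.e. the replies did come back, just too late -- the signature of
    connection-pool starvation described in the comment above."""
    text = "\n".join(log_lines)
    timed_out = set(re.findall(
        r"Timed out waiting for a reply to message ID (\w+)", text))
    orphaned = set(re.findall(
        r"No calling threads waiting for msg_id: (\w+)", text))
    # Starvation suspected: there are timeouts, and each timed-out
    # message ID also shows up as an orphaned (too-late) reply.
    return bool(timed_out) and timed_out <= orphaned
```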
So, to summarize, everything looks good to me. Moving back to ON_QA.

I agree with Mehdi's assessment here.

OK, I'm on board now. It *is* sleeping; the reason it doesn't look like it's sleeping is that a bunch of different conductor workers are all trying to reply to the same queue. So each worker is behaving as expected; it just looks like they are misbehaving when they're all firing at the same time. The fact that we seem to have a bunch of duplicate RPC requests in multiple workers simultaneously may or may not be indicative of a different bug, but it is not the same as this bug, so it would be tracked separately.

Changed the summary to match bug 1301634; these are the same changes, but for different OSP versions.

Do we think that this fix will also resolve bug 1265418?

(In reply to Zane Bitter from comment #42)
> Do we think that this fix will also resolve bug 1265418?

It very well may, although I can't say for sure. The good news is that it sounds like you have a good reproducer for 1265418, so it should be easy to test whether it helps. Keep in mind that there are several fixes rolled up into the pending oslo.messaging release, so even if this particular bug is not the cause, one of the others may be.

*** Bug 1268487 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0267.html
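A quick sanity check of the multiple-workers explanation, assuming the 13 log lines per second from the earlier grep all come from distinct workers (the worker count here is an inference, not a verified number):

```python
# Each conductor worker honors the 1 s retry delay; the aggregate log
# rate is (workers replying to the same queue) / delay. 13 NOT_FOUND
# lines in one second are therefore consistent with ~13 workers each
# sleeping correctly between attempts, not with one worker spinning.
workers = 13            # assumed: one worker per NOT_FOUND line in that second
retry_delay_s = 1.0     # per-worker delay between reply attempts
aggregate_lines_per_s = workers / retry_delay_s
per_worker_rate = aggregate_lines_per_s / workers
```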
> - Restarting individual services including rabbitmq does not work.
> - Connecting to rabbitmq goes into an infinite loop.
> - Need to restart the entire cluster to reconnect to the services.

Could you please elaborate on this? So when you restarted the services one by one (incl. rmq) this didn't fix anything, but when you restarted the entire rabbitmq cluster, that fixed the issue. Do I understand correctly that you stopped all rabbitmq instances and then started them again?