Bug 1484543

Summary: controller reporting missing reply queue during instance launch
Product: Red Hat OpenStack
Reporter: Jack Waterworth <jwaterwo>
Component: python-oslo-messaging
Assignee: John Eckersberg <jeckersb>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Udi Shkalim <ushkalim>
Severity: high
Priority: medium
Version: 10.0 (Newton)
Target Release: 10.0 (Newton)
Target Milestone: ---
Keywords: Triaged, ZStream
Hardware: All
OS: Linux
CC: apevec, chhudson, fpercoco, jbiao, jeckersb, jjoyce, jschluet, lhh, sibasankar.p, slinaber, srevivo, tvignaud, vstinner, wlehman
Last Closed: 2017-12-06 22:00:11 UTC
Type: Bug

Description Jack Waterworth 2017-08-23 20:08:50 UTC
Description of problem:
Instances are getting stuck in the scheduling state. The compute logs show messaging timeouts, and the controller logs show that replies are being dropped because the reply queue does not exist.

----------------------
2017-08-07 23:24:53.888 47922 ERROR oslo_service.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID 88cf5824149c4724bf1a38922f4f7c34
----------------------
2017-08-07 23:24:54.014 67804 WARNING oslo_messaging._drivers.amqpdriver [req-079dbfa8-bbb5-40b8-9968-ae41eb8af1a7 - - - - -] reply_97969ea892da4a3db18ea49621a7b0ed doesn't exists, drop reply to 88cf5824149c4724bf1a38922f4f7c34
2017-08-07 23:24:54.014 67804 INFO oslo_messaging._drivers.amqpdriver [req-079dbfa8-bbb5-40b8-9968-ae41eb8af1a7 - - - - -] The reply 88cf5824149c4724bf1a38922f4f7c34 cannot be sent  reply_97969ea892da4a3db18ea49621a7b0ed reply queue don't exist after 60 sec abandoning...
----------------------

Version-Release number of selected component (if applicable):
RHOSP10
python-oslo-messaging-5.10.1-2.el7ost.noarch

How reproducible:
Happened once.

Steps to Reproduce:
1. Experience a connectivity problem between the compute node and the controllers.
2. Wait for the tunnel to RabbitMQ to reconnect.

Actual results:
Messages begin to get lost on the compute node and dropped on the controller.


Expected results:
Messages should not be dropped on the controllers.

Additional info:
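A quick way to check for this state by hand is to passively declare the reply queue and the same-named exchange; a passive declare raises a 404 NOT_FOUND channel error if the entity is missing. The sketch below is illustrative only: the broker URL is an assumption, the reply name is copied from the log excerpt above, and it assumes a kombu release whose declare calls accept a passive flag.

----------------------
import kombu

BROKER_URL = 'amqp://guest:guest@localhost:5672//'  # assumption; use the real transport URL
REPLY = 'reply_97969ea892da4a3db18ea49621a7b0ed'    # reply name taken from the log above


def entity_exists(check):
    # Use a fresh channel per check: a failed passive declare (404 NOT_FOUND)
    # closes the channel it ran on.
    with kombu.Connection(BROKER_URL) as conn:
        try:
            check(conn.channel())
            return True
        except Exception:
            return False


queue_ok = entity_exists(
    lambda ch: kombu.Queue(REPLY, channel=ch).queue_declare(passive=True))
exchange_ok = entity_exists(
    lambda ch: kombu.Exchange(REPLY, type='direct', passive=True,
                              channel=ch).declare())
print('queue exists: %s, exchange exists: %s' % (queue_ok, exchange_ok))
----------------------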

Comment 5 John Eckersberg 2017-08-25 20:37:08 UTC
Here's what roughly seems to have happened in this case:

- Network outage causes nova-compute to reconnect to rabbitmq

- Somehow, the exchange bound to the reply queue for this compute node is lost, but the reply queue itself is still present.  We know the reply queue is still present because, in the rabbitmq logs, we can see it continue to migrate around the cluster as failures happen.  The most plausible explanation for this would be (1) the reply queue gets auto-deleted due to nova-compute disconnecting, (2) the exchange gets auto-deleted due to the queue being deleted, and (3) nova-compute redeclares the queue upon reconnecting, but does not redeclare the exchange at the same time.

- nova-compute sends out rpc requests to nova-conductor as usual, but nova-conductor always fails to send a reply since the reply exchange is missing.

- At some later point in time, two of the three rabbitmq nodes restart, which forces nova-compute to reconnect.

- Now nova-conductor is suddenly able to reply.  Reviewing sosreports confirms the exchange is declared.  So it seems that on reconnecting the second time, the exchange got redeclared correctly.


Given this, I think the most likely explanation is slightly different circumstances between the two reconnects, triggering different codepaths in the client.  I've looked through the relevant code, and I don't see anything yet to explain this.  The way it works is:

- The oslo.messaging rabbit driver has an on_reconnection callback that is registered with kombu.

- When kombu experiences a connection error, it automatically reconnects, invokes the callback, and passes it the newly reconnected channel.

- The handler calls _set_current_channel with the new channel

- _set_current_channel iterates through the list of consumers and calls declare() on each

- Consumer.declare() recreates the underlying kombu queue object, passing in as an argument the exchange it should be bound to, and calls declare() on the queue object.

- When passing in an exchange to a queue object, the queue declare method automatically declares the exchange along with the queue.

So I don't really see how it could end up in a state where the queue is redeclared but the accompanying exchange isn't.
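For reference, here is a minimal sketch of that last step using kombu directly.  The broker URL and the reply_example names are illustrative assumptions, not values from this environment, and the flags (direct, auto_delete) are only meant to mirror the reply_* entities seen in the logs:

----------------------
import kombu

connection = kombu.Connection('amqp://guest:guest@localhost:5672//')
channel = connection.channel()

# A direct exchange/queue pair analogous to the reply_* entities in the logs.
exchange = kombu.Exchange('reply_example', type='direct', auto_delete=True)
queue = kombu.Queue('reply_example', exchange=exchange,
                    routing_key='reply_example', auto_delete=True,
                    channel=channel)

# Queue.declare() declares the exchange, declares the queue, and binds the
# queue to the exchange -- which is why a reconnect that goes through this
# path should not leave the queue behind without its exchange.
queue.declare()
----------------------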

Comment 9 John Eckersberg 2017-11-03 19:03:17 UTC
A potential thought to mitigate this:

Flow a test message upon reconnection, to the exchange and back through the queue, to ensure the end-to-end health of the reply flow.

If this fails, maybe assume bad state with the particular pre-existing exchange/queue, and reconnect with a newly generated name (as you'd get on a full service restart).

The downside: you'd have to do this on each declare operation.  And declare operations occur frequently, so there could be nontrivial overhead introduced.
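Roughly, such a probe could look like the hypothetical helper below (a kombu-based sketch with illustrative names and timeout; this is not the oslo.messaging implementation):

----------------------
import uuid

import kombu


def reply_path_is_healthy(connection, exchange, queue, timeout=2):
    """Publish a probe through `exchange` and try to read it back from `queue`."""
    probe_id = str(uuid.uuid4())

    producer = connection.Producer()
    producer.publish({'probe': probe_id},
                     exchange=exchange,
                     routing_key=queue.routing_key,
                     declare=[queue])  # (re)declares the queue and its bound exchange

    simple = connection.SimpleQueue(queue)
    try:
        message = simple.get(timeout=timeout)
    except simple.Empty:
        return False  # nothing arrived; the exchange or binding is likely gone
    message.ack()
    return message.payload.get('probe') == probe_id
----------------------

If the probe never comes back, the caller could then tear down the queue/exchange pair and reconnect under a freshly generated name, as described above.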

Comment 10 John Eckersberg 2017-11-07 15:34:20 UTC
(In reply to John Eckersberg from comment #9)
> A potential thought to mitigate this:
> 
> Flow a test message upon reconnection, to the exchange and back through the
> queue, to ensure the end-to-end health of the reply flow.
> 
> If this fails, maybe assume bad state with the particular pre-existing
> exchange/queue, and reconnect with a newly generated name (as you'd get on a
> full service restart).
> 
> The downside, you'd have to do this on each declare operation.  And declare
> operations occur frequently, so there could be nontrivial overhead
> introduced.

The more I look into it, the more I think this is a bad idea.  This would be a rather invasive change into critical parts of oslo.messaging, and is quite risky.  I am not at all confident that making such a change would not introduce additional problems.

The root problem is probably not in oslo.messaging; it is likely in kombu, py-amqp, or somewhere in rabbitmq/erlang.  So potentially breaking oslo.messaging to work around an issue somewhere else is undesirable.

As we have seen, the original issue is quite rare, and thus far not reproducible.  Weighing the risk of the original issue vs. the risk of attempting to modify oslo.messaging in such a way, I think leaving things as-is is much safer for now.

I am going to try building a more stripped-down setup in order to reproduce the issue.  Maybe we will have more luck there, and then we can fix this the correct way.

Comment 11 John Eckersberg 2017-11-13 21:11:34 UTC
Quick update here:

I've been trying to reproduce this using the oslo.messaging simulator, which allows for precise control of the timing and nature of failures.  Before, I was simulating network failure with iptables, and relying on repeating tests and luck to maybe get the timing right to reproduce.  Now, things are much more controlled.

That said, I have still been unable to reproduce the problem so far.  I have a few more things to try, so not all hope is lost, but so far it hasn't panned out.

Comment 12 John Eckersberg 2017-11-17 19:46:08 UTC
I've exhausted the obvious places to break things in oslo.messaging.  My suspicion, even more so now, is that this lies somewhere lower, in kombu/py-amqp.  I'm going to run through the same general exercise, except in those codebases.

Comment 15 John Eckersberg 2017-12-06 22:00:11 UTC
Despite best efforts, I've still been unable to reproduce this.  With the given data, and no reproducer, I am closing this as insufficient data.  Obviously something went wrong, but it can't be pinpointed with what we have available.

Comment 16 sibasankar 2018-12-25 06:18:43 UTC
We are unable to launch an instance in OpenStack. We are getting the error below:
Error: Timed out waiting for a reply to message ID 29bbfdcbad354768bb4ece845032bf18

Comment 17 sibasankar 2018-12-25 06:22:08 UTC
We are unable to launch an instance in OpenStack. We are getting the error below:
Error: Timed out waiting for a reply to message ID 29bbfdcbad354768bb4ece845032bf18

Message: Timed out waiting for a reply to message ID 29bbfdcbad354768bb4ece845032bf18
Code: 500
Details:
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/conductor/manager.py", line 1231, in schedule_and_build_instances
    instance_uuids, return_alternates=True)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/conductor/manager.py", line 748, in _schedule_instances
    return_alternates=return_alternates)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/scheduler/utils.py", line 907, in wrapped
    return func(*args, **kwargs)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 53, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations
    instance_uuids, return_objects, return_alternates)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 158, in select_destinations
    return cctxt.call(ctxt, 'select_destinations', **msg_args)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
    retry=self.retry)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
    retry=retry)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
    call_monitor_timeout, retry=retry)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
    call_monitor_timeout)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
    'to message ID %s' % msg_id)
Created: Dec. 25, 2018, 10:43 a.m.