Bug 1120310

Summary:          Time out whilst waiting for RabbitMQ response
Product:          Red Hat OpenStack
Component:        python-oslo-messaging
Version:          5.0 (RHEL 7)
Target Release:   5.0 (RHEL 7)
Target Milestone: z6
Hardware:         x86_64
OS:               Linux
Status:           CLOSED DUPLICATE
Severity:         high
Priority:         high
Keywords:         OtherQA, ZStream
Type:             Bug
Doc Type:         Bug Fix
Reporter:         Rhys Oxenham <roxenham>
Assignee:         Flavio Percoco <fpercoco>
QA Contact:       Shai Revivo <srevivo>
CC:               apevec, ihrachys, jminter, jschluet, lhh, plemenko, racedoro, sgordon, srevivo
Bug Blocks:       1157848
Last Closed:      2016-07-26 11:25:18 UTC

Description Rhys Oxenham 2014-07-16 17:18:55 UTC
Description of problem:

When rabbitmq-server is restarted (or the host running RabbitMQ is rebooted), existing services such as nova-compute and neutron-openvswitch-agent time out while waiting for a response. They then appear to die silently: they log nothing further and do not attempt to reconnect to the AMQP broker.

It appears that when the agents are forced to reconnect to the broker, they attempt to re-use the previous topic. The agents receive no response on this topic and hang. Only an agent restart (of nova-compute and neutron-openvswitch-agent) restores connectivity.
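
For illustration, a minimal sketch of the blocking call path the agents sit in (the broker address is taken from the logs below; the topic and method names are hypothetical). Real calls, such as the conductor service_update in the traceback below, go through the same code:

# Sketch only: hypothetical topic/method; broker address taken from the logs below.
from oslo.config import cfg
from oslo import messaging

transport = messaging.get_transport(
    cfg.CONF, url='rabbit://guest:guest@10.33.11.36:5672/')
target = messaging.Target(topic='example-agent-topic')
client = messaging.RPCClient(transport, target)

try:
    # call() waits for a reply; the default rpc_response_timeout is 60 seconds.
    client.prepare(timeout=60).call({}, 'ping')
except messaging.MessagingTimeout:
    # The request may well be delivered, but if the reply queue/topic the client
    # was using before the broker restart is gone, no reply ever arrives and the
    # call times out here.
    pass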

I suspect this issue would not occur in a multi-controller environment (or one with mirrored queues), only with a single broker.

Note: I've only personally witnessed this with Nova and Neutron, but it may affect other components that rely on oslo.messaging.

Version-Release number of selected component (if applicable):

RHEL OSP 5.0 GA (latest)
RHEL 7.0 (latest)

python-oslo-messaging-1.3.0.2-4.el7ost.noarch
rabbitmq-server-3.1.5-6.3.el7ost.noarch
openstack-nova-compute-2014.1-7.el7ost.noarch
openstack-neutron-openvswitch-2014.1-35.el7ost.noarch
openstack-neutron-ml2-2014.1-35.el7ost.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. Restart rabbitmq-server (or reboot host)
2. Attempt to launch VM
3. Watch logs for nova-compute and/or neutron-openvswitch-agent
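
A hypothetical standalone reproducer (sketch only; it assumes the broker URL from the logs and a throwaway topic) that takes Nova/Neutron out of the picture: start the server, restart rabbitmq-server on the broker host, then re-run the client; the call should block until it times out.

# Sketch only: assumed broker URL and throwaway topic.
# Usage: "python repro.py server" in one shell, "python repro.py" in another.
import sys

from oslo.config import cfg
from oslo import messaging

URL = 'rabbit://guest:guest@10.33.11.36:5672/'
TOPIC = 'bz1120310-repro'


class Endpoint(object):
    def echo(self, ctxt, msg):
        return msg


def run_server():
    transport = messaging.get_transport(cfg.CONF, url=URL)
    target = messaging.Target(topic=TOPIC, server='repro-host')
    server = messaging.get_rpc_server(transport, target, [Endpoint()],
                                      executor='blocking')
    # With the 'blocking' executor in oslo.messaging 1.3.x, start() handles
    # requests in the current thread until stop() is called.
    server.start()


def run_client():
    transport = messaging.get_transport(cfg.CONF, url=URL)
    client = messaging.RPCClient(transport, messaging.Target(topic=TOPIC))
    # Before the broker restart this returns straight away; afterwards it should
    # block for the full timeout and then raise MessagingTimeout.
    print(client.prepare(timeout=60).call({}, 'echo', msg='ping'))


if __name__ == '__main__':
    run_server() if 'server' in sys.argv[1:] else run_client()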

Actual results:

2014-07-16 17:04:44.655 2292 TRACE oslo.messaging._drivers.impl_rabbit IOError: Socket closed
2014-07-16 17:04:44.655 2292 TRACE oslo.messaging._drivers.impl_rabbit 
2014-07-16 17:04:44.656 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:44.656 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:45.026 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:45.026 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:45.668 2292 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.33.11.36:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-07-16 17:04:46.035 2292 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.33.11.36:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
2014-07-16 17:04:46.669 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:46.669 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:47.682 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:51.035 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:51.035 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:52.048 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:05:47.736 2292 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 218, in service_update
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     return self._manager.service_update(context, service, values)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     service=service_p, values=values)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     wait_for_reply=True, timeout=timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     timeout=timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 412, in send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 403, in _send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     reply, ending = self._poll_connection(msg_id, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     % msg_id)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID f56debce2797494e81e201231e78dec0
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db 
[HANG]


[Service Restart]

2014-07-16 17:21:19.902 2292 INFO nova.openstack.common.service [-] Caught SIGTERM, exiting
2014-07-16 17:21:20.764 19665 INFO nova.openstack.common.periodic_task [-] Skipping periodic task _periodic_update_dns because its interval is negative
2014-07-16 17:21:20.832 19665 INFO nova.virt.driver [-] Loading compute driver 'nova.virt.libvirt.LibvirtDriver'
2014-07-16 17:21:20.879 19665 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:21:20.897 19665 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:21:20.937 19665 AUDIT nova.service [-] Starting compute node (version 2014.1-7.el7ost)
2014-07-16 17:21:21.013 19665 WARNING nova.virt.libvirt.driver [-] URI qemu:///system does not support events: adding cb to list
2014-07-16 17:21:21.284 19665 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2014-07-16 17:21:21.527 19665 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 23437
2014-07-16 17:21:21.528 19665 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 199
2014-07-16 17:21:21.528 19665 AUDIT nova.compute.resource_tracker [-] Free VCPUS: 8


Expected results:

The AMQP layer reconnects to the broker successfully after an outage and communication resumes without requiring a service restart.
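
For reference, the Icehouse rabbit driver already exposes reconnect-related options; the 1.0 second delays and the 1/5 second retry intervals in the log above appear to come from these. A nova.conf / neutron.conf sketch with example values follows; tuning them changes the reconnect loop, but per this report it does not by itself unstick the RPC calls that are already waiting for a reply:

[DEFAULT]
# Existing oslo.messaging 1.3.x options; the values shown are the defaults.
rpc_response_timeout = 60     # how long call() waits before MessagingTimeout
rabbit_retry_interval = 1     # initial broker reconnect interval (seconds)
rabbit_retry_backoff = 2      # step added to the interval between attempts
rabbit_max_retries = 0        # 0 means retry forever
kombu_reconnect_delay = 1.0   # the "Delaying reconnect for 1.0 seconds" above
rabbit_ha_queues = False      # only set True if the broker mirrors its queues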

Additional info:

This also appears to have been identified upstream on Launchpad, with a root-cause analysis outlined:

https://bugs.launchpad.net/oslo.messaging/+bug/1338732

Comment 2 Ihar Hrachyshka 2014-07-17 14:07:09 UTC
FYI: neutron does not use oslo.messaging in Icehouse. Instead, it relies on code from oslo-incubator. So if the fix belongs there, please make sure you create a corresponding bug for neutron (and possibly other components that still don't use oslo.messaging in Icehouse).
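
For illustration only (module paths as they exist in the Icehouse trees; this is not a fix), the split looks roughly like this:

# Icehouse neutron agents go through the RPC code copied from oslo-incubator:
from neutron.openstack.common import rpc      # in-tree oslo-incubator copy
# whereas the nova traceback above goes through the standalone library:
from oslo import messaging                    # python-oslo-messaging package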

Comment 5 Jon Schlueter 2015-07-28 19:07:05 UTC
The upstream patch (OpenStack Gerrit change 123115) was abandoned.

A possibly related patch was submitted upstream (https://review.openstack.org/#/c/184473/4) in the Kilo/Liberty timeframe, but the feasibility of a backport is unclear.

Comment 7 Peter Lemenkov 2016-07-26 11:25:18 UTC
This issue was addressed in bug 1301634. Closing this one as a duplicate.

Search for "Timed out waiting for a reply to message ID" for further details.

*** This bug has been marked as a duplicate of bug 1301634 ***