Description of problem:

When rabbitmq-server is restarted (or the host running RabbitMQ is rebooted), existing services such as nova-compute and neutron-openvswitch-agent time out awaiting a response. They then appear to die silently, logging nothing further, and make no attempt to reconnect to the AMQP broker. It appears that when the agents are forced to reconnect to the broker, they re-use the previous reply topic; receiving no response on that topic, they hang. Only an agent restart (nova-compute and neutron-openvswitch-agent) restores connectivity. I suspect this issue would not occur with multiple controllers (or in an environment with mirrored queues), only with a single broker.

Note: I've only personally witnessed issues with Nova and Neutron; it may exist in other components that rely on Oslo messaging.

Version-Release number of selected component (if applicable):
RHEL OSP 5.0 GA (latest)
RHEL 7.0 (latest)
python-oslo-messaging-1.3.0.2-4.el7ost.noarch
rabbitmq-server-3.1.5-6.3.el7ost.noarch
openstack-nova-compute-2014.1-7.el7ost.noarch
openstack-neutron-openvswitch-2014.1-35.el7ost.noarch
openstack-neutron-ml2-2014.1-35.el7ost.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. Restart rabbitmq-server (or reboot the host)
2. Attempt to launch a VM
3. Watch the logs for nova-compute and/or neutron-openvswitch-agent

Actual results:

2014-07-16 17:04:44.655 2292 TRACE oslo.messaging._drivers.impl_rabbit IOError: Socket closed
2014-07-16 17:04:44.655 2292 TRACE oslo.messaging._drivers.impl_rabbit
2014-07-16 17:04:44.656 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:44.656 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:45.026 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:45.026 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:45.668 2292 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.33.11.36:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-07-16 17:04:46.035 2292 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.33.11.36:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
2014-07-16 17:04:46.669 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:46.669 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:47.682 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:51.035 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:51.035 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:52.048 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:05:47.736 2292 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 218, in service_update
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     return self._manager.service_update(context, service, values)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     service=service_p, values=values)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     wait_for_reply=True, timeout=timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     timeout=timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 412, in send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 403, in _send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     reply, ending = self._poll_connection(msg_id, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     % msg_id)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID f56debce2797494e81e201231e78dec0
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db

[HANG]

[Service Restart]

2014-07-16 17:21:19.902 2292 INFO nova.openstack.common.service [-] Caught SIGTERM, exiting
2014-07-16 17:21:20.764 19665 INFO nova.openstack.common.periodic_task [-] Skipping periodic task _periodic_update_dns because its interval is negative
2014-07-16 17:21:20.832 19665 INFO nova.virt.driver [-] Loading compute driver 'nova.virt.libvirt.LibvirtDriver'
2014-07-16 17:21:20.879 19665 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:21:20.897 19665 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:21:20.937 19665 AUDIT nova.service [-] Starting compute node (version 2014.1-7.el7ost)
2014-07-16 17:21:21.013 19665 WARNING nova.virt.libvirt.driver [-] URI qemu:///system does not support events: adding cb to list
2014-07-16 17:21:21.284 19665 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2014-07-16 17:21:21.527 19665 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 23437
2014-07-16 17:21:21.528 19665 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 199
2014-07-16 17:21:21.528 19665 AUDIT nova.compute.resource_tracker [-] Free VCPUS: 8

Expected results:
The AMQP implementation reconnects to the broker successfully after an outage, and communication resumes without requiring a service restart.

Additional info:
This seems to be identified in Launchpad too, with an RCA outlined:
https://bugs.launchpad.net/oslo.messaging/+bug/1338732
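The failure mode described above can be illustrated with a minimal, self-contained simulation (plain Python, no RabbitMQ or oslo.messaging involved; `FakeBroker`, `RpcClient`, and the queue names are all invented for illustration). The point it sketches: a non-durable reply queue vanishes when the broker restarts, so a client that reconnects but keeps using its cached reply queue name without redeclaring it never sees another reply, which matches the observed MessagingTimeout hang.

```python
class FakeBroker:
    """Stands in for RabbitMQ: non-durable queues are lost on restart."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def publish(self, queue, msg):
        # Like RabbitMQ, silently drop messages sent to a missing queue.
        if queue in self.queues:
            self.queues[queue].append(msg)

    def consume(self, queue):
        return self.queues.get(queue, [])

    def restart(self):
        # Simulate a rabbitmq-server restart: all non-durable queues are gone.
        self.queues.clear()


class RpcClient:
    """Hypothetical RPC client that caches its reply queue name."""

    def __init__(self, broker, name, redeclare_on_reconnect):
        self.broker = broker
        self.reply_queue = "reply_" + name  # cached reply topic
        self.redeclare_on_reconnect = redeclare_on_reconnect
        broker.declare_queue(self.reply_queue)

    def reconnect(self):
        # Buggy behavior: reconnect but reuse the old queue name without
        # redeclaring it, so the queue no longer exists on the broker.
        if self.redeclare_on_reconnect:
            self.broker.declare_queue(self.reply_queue)

    def call(self):
        # Server side echoes a reply onto the client's reply queue.
        self.broker.publish(self.reply_queue, "pong")
        replies = self.broker.consume(self.reply_queue)
        return replies[-1] if replies else None  # None ~ MessagingTimeout


broker = FakeBroker()
buggy = RpcClient(broker, "buggy", redeclare_on_reconnect=False)
fixed = RpcClient(broker, "fixed", redeclare_on_reconnect=True)

broker.restart()   # rabbitmq-server restart wipes the reply queues
buggy.reconnect()
fixed.reconnect()

print(buggy.call())  # None: reply dropped, the call would time out
print(fixed.call())  # pong: redeclared queue receives the reply
```

This is only a model of the RCA linked above, not the oslo.messaging code path itself; the real driver's fix involves redeclaring queues when the connection is re-established.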
FYI: Neutron does not use oslo.messaging in Icehouse; it relies on code from oslo-incubator instead. So if the fix belongs there, please make sure to create a corresponding bug for Neutron (and possibly for other components that still don't use oslo.messaging in Icehouse).
The upstream patch (OpenStack Gerrit 123115) was abandoned. A possibly related patch was submitted upstream (https://review.openstack.org/#/c/184473/4) in the Kilo or Liberty timeframe, but I'm not sure of the feasibility of a backport.
This issue was addressed in bug 1301634; closing this one as a duplicate. Search for "Timed out waiting for a reply to message ID" for further details.

*** This bug has been marked as a duplicate of bug 1301634 ***