Bug 1120310 - Time out whilst waiting for RabbitMQ response
Summary: Time out whilst waiting for RabbitMQ response
Keywords:
Status: CLOSED DUPLICATE of bug 1301634
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-messaging
Version: 5.0 (RHEL 7)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z6
Target Release: 5.0 (RHEL 7)
Assignee: Flavio Percoco
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks: 1157848
 
Reported: 2014-07-16 17:18 UTC by Rhys Oxenham
Modified: 2016-07-26 11:25 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-26 11:25:18 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links:
Launchpad 1338732 (status: None; last updated: never)
OpenStack gerrit 123115 (status: ABANDONED): Declare queue in publisher also (last updated 2020-04-28 06:39:33 UTC)

Description Rhys Oxenham 2014-07-16 17:18:55 UTC
Description of problem:

When rabbitmq-server is restarted (or the host running RabbitMQ is rebooted), existing services such as nova-compute and neutron-openvswitch-agent time out awaiting a response. They then appear to die silently: nothing further is logged and they do not attempt to reconnect to the AMQP broker.

It appears that when the agents are forced to reconnect to the broker, they attempt to re-use the previous topic. The agents receive no response on this topic and hang. Only an agent restart (of nova-compute or neutron-openvswitch-agent) restores connectivity.

I suspect that this issue would not occur in a multi-controller environment (or one with mirrored queues); it only affects deployments with a single broker.

Note: I've only personally witnessed this with Nova and Neutron; it may also affect other components that rely on oslo.messaging.
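
To make the suspected failure mode concrete, here is a minimal kombu sketch (not the actual oslo.messaging code; the broker URL, queue naming, and the non-durable/auto_delete flags are assumptions based on the description above). The per-call reply queue is non-durable and auto-delete, so a restart of a single, unmirrored broker wipes it out; a caller that keeps waiting on the old reply topic after reconnecting can only hit its timeout.

# Minimal kombu sketch of the suspected failure mode (assumed broker URL,
# queue naming, and flags; this is not the oslo.messaging implementation).
import socket
import uuid

from kombu import Connection, Consumer, Exchange, Queue

BROKER_URL = 'amqp://guest:guest@10.33.11.36:5672//'  # host taken from the logs below

conn = Connection(BROKER_URL)
channel = conn.channel()

# A per-call reply queue, roughly how oslo.messaging 1.3 sets one up:
# non-durable and auto_delete, so a broker restart destroys it.
reply_topic = 'reply_' + uuid.uuid4().hex
reply_exchange = Exchange(reply_topic, type='direct', durable=False, auto_delete=True)
reply_queue = Queue(reply_topic, exchange=reply_exchange, routing_key=reply_topic,
                    durable=False, auto_delete=True)
reply_queue(channel).declare()

# ... rabbitmq-server restarts here; the queue and exchange are gone, but the
# agent reconnects and keeps listening on the same reply topic ...

replies = []
with Consumer(channel, queues=[reply_queue],
              callbacks=[lambda body, message: replies.append(body)]):
    try:
        # Nothing re-declares the reply queue after the reconnect, so no reply
        # can ever be routed here and this only ends when the timeout fires.
        conn.drain_events(timeout=60)
    except socket.timeout:
        print('Timed out waiting for a reply, matching the MessagingTimeout trace below')

In the real agents the analogous timeout surfaces as the MessagingTimeout traceback in the "Actual results" section below, after which the reporting thread goes quiet.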

Version-Release number of selected component (if applicable):

RHEL OSP 5.0 GA (latest)
RHEL 7.0 (latest)

python-oslo-messaging-1.3.0.2-4.el7ost.noarch
rabbitmq-server-3.1.5-6.3.el7ost.noarch
openstack-nova-compute-2014.1-7.el7ost.noarch
openstack-neutron-openvswitch-2014.1-35.el7ost.noarch
openstack-neutron-ml2-2014.1-35.el7ost.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. Restart rabbitmq-server (or reboot host)
2. Attempt to launch VM
3. Watch logs for nova-compute and/or neutron-openvswitch-agent

Actual results:

2014-07-16 17:04:44.655 2292 TRACE oslo.messaging._drivers.impl_rabbit IOError: Socket closed
2014-07-16 17:04:44.655 2292 TRACE oslo.messaging._drivers.impl_rabbit 
2014-07-16 17:04:44.656 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:44.656 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:45.026 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:45.026 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:45.668 2292 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.33.11.36:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-07-16 17:04:46.035 2292 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.33.11.36:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
2014-07-16 17:04:46.669 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:46.669 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:47.682 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:51.035 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 10.33.11.36:5672
2014-07-16 17:04:51.035 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-07-16 17:04:52.048 2292 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:05:47.736 2292 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 218, in service_update
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     return self._manager.service_update(context, service, values)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     service=service_p, values=values)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     wait_for_reply=True, timeout=timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     timeout=timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 412, in send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 403, in _send
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     reply, ending = self._poll_connection(msg_id, timeout)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db     % msg_id)
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID f56debce2797494e81e201231e78dec0
2014-07-16 17:05:47.736 2292 TRACE nova.servicegroup.drivers.db 
[HANG]


[Service Restart]

2014-07-16 17:21:19.902 2292 INFO nova.openstack.common.service [-] Caught SIGTERM, exiting
2014-07-16 17:21:20.764 19665 INFO nova.openstack.common.periodic_task [-] Skipping periodic task _periodic_update_dns because its interval is negative
2014-07-16 17:21:20.832 19665 INFO nova.virt.driver [-] Loading compute driver 'nova.virt.libvirt.LibvirtDriver'
2014-07-16 17:21:20.879 19665 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:21:20.897 19665 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 10.33.11.36:5672
2014-07-16 17:21:20.937 19665 AUDIT nova.service [-] Starting compute node (version 2014.1-7.el7ost)
2014-07-16 17:21:21.013 19665 WARNING nova.virt.libvirt.driver [-] URI qemu:///system does not support events: adding cb to list
2014-07-16 17:21:21.284 19665 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2014-07-16 17:21:21.527 19665 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 23437
2014-07-16 17:21:21.528 19665 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 199
2014-07-16 17:21:21.528 19665 AUDIT nova.compute.resource_tracker [-] Free VCPUS: 8


Expected results:

The AMQP implementation reconnects to the broker successfully after an outage, and communication resumes without requiring a service restart.
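
For illustration, a hedged kombu sketch of the kind of recovery loop being asked for (the function name, exchange/queue names, and timeouts are assumptions, not the eventual fix): on every reconnect the consumer re-declares its exchange, queue, and binding before consuming, so anything the broker lost while it was down is recreated.

# Hedged sketch of the expected recovery behaviour (assumed kombu usage,
# names, and timeouts; this is not the actual fix that landed).
import socket
import time

from kombu import Connection, Consumer, Exchange, Queue

def consume_with_recovery(broker_url, topic, on_message):
    """Consume 'topic', re-declaring broker state after every reconnect."""
    exchange = Exchange('nova', type='topic', durable=False)
    queue = Queue(topic, exchange=exchange, routing_key=topic,
                  durable=False, auto_delete=True)
    while True:
        conn = Connection(broker_url)
        try:
            channel = conn.channel()
            # Declaring the bound queue recreates the exchange, the queue and
            # the binding if the broker lost them during its restart.
            queue(channel).declare()
            with Consumer(channel, queues=[queue], callbacks=[on_message]):
                while True:
                    try:
                        conn.drain_events(timeout=1)
                    except socket.timeout:
                        pass  # idle tick; keep waiting for messages
        except conn.connection_errors as exc:
            print('AMQP connection lost (%s); reconnecting in 1 second' % exc)
            time.sleep(1)
        finally:
            conn.release()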

Additional info:

This seems to have been identified in Launchpad too, with a root-cause analysis (RCA) outlined at:

https://bugs.launchpad.net/oslo.messaging/+bug/1338732

Comment 2 Ihar Hrachyshka 2014-07-17 14:07:09 UTC
FYI: Neutron does not use oslo.messaging in Icehouse; instead, it relies on code from oslo-incubator. So if the fix belongs there, please make sure you create a corresponding bug for Neutron (and perhaps for other components that still don't use oslo.messaging in Icehouse).

Comment 5 Jon Schlueter 2015-07-28 19:07:05 UTC
The upstream patch, OpenStack gerrit 123115, was abandoned.

A possibly related patch was submitted upstream (https://review.openstack.org/#/c/184473/4) in the Kilo or Liberty timeframe, but the feasibility of a backport is unclear.
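
For context, a rough kombu sketch of the idea behind the abandoned gerrit 123115 change ("Declare queue in publisher also"); the function name and flags here are illustrative assumptions, not the patch itself. The publisher re-declares the reply queue before each send, so a reply published after an unclustered broker restart is still routable.

# Illustrative kombu sketch of the "declare queue in publisher also" idea from
# gerrit 123115 (assumed names and flags; not the abandoned patch itself).
from kombu import Connection, Exchange, Producer, Queue

def publish_reply(broker_url, reply_topic, body):
    """Publish an RPC reply, re-declaring the reply queue first."""
    with Connection(broker_url) as conn:
        channel = conn.channel()
        exchange = Exchange(reply_topic, type='direct', durable=False, auto_delete=True)
        queue = Queue(reply_topic, exchange=exchange, routing_key=reply_topic,
                      durable=False, auto_delete=True)
        # Declaration is idempotent: if the broker still has the queue this is
        # a no-op, and if the queue was lost on restart this recreates it
        # instead of letting the reply be silently dropped as unroutable.
        queue(channel).declare()
        Producer(channel, exchange=exchange).publish(body, routing_key=reply_topic)

Publisher-side declaration alone does not re-attach the original consumer, so the waiting caller still needs to re-subscribe after its own reconnect.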

Comment 7 Peter Lemenkov 2016-07-26 11:25:18 UTC
This issue was addressed in bug 1301634; closing this one as a duplicate.

Search for "Timed out waiting for a reply to message ID" for further details.

*** This bug has been marked as a duplicate of bug 1301634 ***

