Bug 1049488

| Field | Value |
|---|---|
| Summary | Services can't connect to qpid after reboot |
| Product | [Community] RDO |
| Component | distribution |
| Version | unspecified |
| Hardware / OS | Unspecified |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Steven Hardy <shardy> |
| Assignee | RHOS Maint <rhos-maint> |
| QA Contact | Ami Jeain <ajeain> |
| CC | jpeeler, lars, markmc, mrunge, sdake, shardy, yeylon |
| Type | Bug |
| Doc Type | Bug Fix |
| Cloned as | 1049504 (view as bug list) |
| Bug Blocks | 1049504 |
| Last Closed | 2014-05-20 09:40:40 UTC |
Description (Steven Hardy, 2014-01-07 15:51:13 UTC)
Actually it looks like it's not just heat, and heat does recover (as do nova, neutron and ceilometer, which all seem to be broken in the same way, although not glance or keystone); it just takes a few seconds for heat to re-establish the connection:

```
# tail -f /var/log/messages | grep impl_qpid
Jan 7 17:53:28 localhost ceilometer-agent-central: 2014-01-07 17:53:28.060 2135 ERROR ceilometer.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:28 localhost ceilometer-collector: 2014-01-07 17:53:28.344 2232 ERROR ceilometer.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:28 localhost ceilometer-alarm-notifier: 2014-01-07 17:53:28.494 2226 ERROR ceilometer.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:29 localhost neutron-l3-agent: 2014-01-07 17:53:29.177 3336 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:29 localhost neutron-l3-agent: 2014-01-07 17:53:29.179 3336 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:30 localhost neutron-dhcp-agent: 2014-01-07 17:53:30.346 3337 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:30 localhost neutron-dhcp-agent: 2014-01-07 17:53:30.346 3337 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:30 localhost neutron-openvswitch-agent: 2014-01-07 17:53:30.372 3339 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:53:30 localhost nova-compute: 2014-01-07 17:53:30.704 3335 ERROR nova.openstack.common.rpc.impl_qpid [req-46157e0d-b549-41ad-bed7-63f78c0bc85c None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:54:10 localhost heat-engine: 2014-01-07 17:54:10.515 5535 ERROR heat.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:54:19 localhost nova-cert: 2014-01-07 17:54:19.670 2166 ERROR nova.openstack.common.rpc.impl_qpid [req-21344615-8375-4a52-b908-40762e19f210 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:54:19 localhost nova-consoleauth: 2014-01-07 17:54:19.778 2235 ERROR nova.openstack.common.rpc.impl_qpid [req-4c3ce8e8-54f4-496b-9796-2d2cce600c31 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:54:19 localhost nova-conductor: 2014-01-07 17:54:19.811 2241 ERROR nova.openstack.common.rpc.impl_qpid [req-d38a13c3-1148-419d-8c98-4aad67e24072 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan 7 17:54:19 localhost nova-scheduler: 2014-01-07 17:54:19.845 2248 ERROR nova.openstack.common.rpc.impl_qpid [req-1f02af65-0b35-4965-a2c5-7a4805317ac7 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
```

After `systemctl restart qpidd.service`:

```
Jan 7 18:01:28 localhost ceilometer-agent-compute: 2014-01-07 18:01:28.315 2139 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:28 localhost ceilometer-agent-central: 2014-01-07 18:01:28.496 2135 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:28 localhost ceilometer-alarm-notifier: 2014-01-07 18:01:28.753 2226 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:28 localhost ceilometer-collector: 2014-01-07 18:01:28.767 2232 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:28 localhost ceilometer-agent-central: 2014-01-07 18:01:28.884 2135 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:29 localhost neutron-l3-agent: 2014-01-07 18:01:29.198 3336 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:29 localhost neutron-l3-agent: 2014-01-07 18:01:29.201 3336 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:29 localhost neutron-l3-agent: 2014-01-07 18:01:29.219 3336 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:30 localhost neutron-dhcp-agent: 2014-01-07 18:01:30.699 3337 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:30 localhost neutron-dhcp-agent: 2014-01-07 18:01:30.702 3337 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:30 localhost neutron-openvswitch-agent: 2014-01-07 18:01:30.715 3339 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:30 localhost neutron-dhcp-agent: 2014-01-07 18:01:30.727 3337 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:30 localhost neutron-openvswitch-agent: 2014-01-07 18:01:30.745 3339 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:31 localhost nova-compute: 2014-01-07 18:01:31.008 3335 INFO nova.openstack.common.rpc.impl_qpid [req-46157e0d-b549-41ad-bed7-63f78c0bc85c None None] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:01:31 localhost nova-compute: 2014-01-07 18:01:31.026 3335 INFO nova.openstack.common.rpc.impl_qpid [req-46157e0d-b549-41ad-bed7-63f78c0bc85c None None] Connected to AMQP server on 192.168.0.11:5672
Jan 7 18:02:10 localhost heat-engine: 2014-01-07 18:02:10.807 5535 INFO heat.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
```

So this appears to be either a qpidd or an oslo issue.

---

Steven Dake (comment #2):

In the past, Heat didn't try to reconnect to the AMQP bus after it was disconnected. Zane fixed this about a year ago, and after that it was possible to restart services, or start services in non-dependent order, and they would figure it out. So this may be "operating as intended", since the software itself retries the connections. I believe the retry interval was 2-3 seconds. There may be a dependency ordering issue between the OpenStack services and qpid.

---
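The "Sleeping 60 seconds" messages in the log above come from a retry interval that grows on each failed attempt. A minimal sketch of that style of multiplying backoff (the function and parameter names here are illustrative assumptions, not the exact oslo/impl_qpid code; the default values are chosen to reproduce the 60-second cap seen in the log):

```python
import itertools

def reconnect_intervals(interval_start=1, interval_stepping=2, interval_max=60):
    """Yield successive sleep intervals between reconnect attempts.

    Sketch of a multiplying backoff: start small, multiply the delay on
    each failure, and cap it at interval_max. Once the cap is reached,
    every further attempt waits the full interval_max, which is why a
    service can sit for nearly a minute before reconnecting.
    """
    interval = interval_start
    while True:
        yield interval
        interval = min(interval * interval_stepping, interval_max)

# The first eight delays a service would sleep between attempts.
delays = list(itertools.islice(reconnect_intervals(), 8))
print(delays)  # → [1, 2, 4, 8, 16, 32, 60, 60]
```

With these assumed defaults, a service that misses the broker early at boot backs off quickly to the 60-second ceiling, matching the roughly one-minute reconnect gaps in the log.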
Steven Hardy (in reply to Steven Dake from comment #2):

I think the retries are working as intended, although the multiplier in impl_qpid means you can end up waiting nearly a minute for services to reconnect: https://github.com/openstack/heat/blob/master/heat/openstack/common/rpc/impl_qpid.py#L519

The bug, AFAICT, is that something is broken in qpidd which requires a restart before the services can connect; either that, or there is a bug in the oslo common code that handles making the connection. Currently the former seems most likely, but I'm not sure how to debug it yet. Setting the component to distribution, since this isn't a heat-specific problem, and it's been confirmed on IRC that I'm not the only one seeing this.

Note this is probably a dupe of bug #984968, but since that one has been open for many months with no comments at all, and this bug has more details, I'm not sure we want to close this one (yet).

---

IMHO one can also trigger this by doing a suspend/resume cycle.

---

Lars Kellogg-Stedman (comment #7):

Steve,

When you run into this, is qpidd *not running*, or is it running but nothing is able to connect to it? If the latter, this may be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1055660. Can you take a look and let me know?

---

Steven Hardy (in reply to Lars Kellogg-Stedman from comment #7):

Yes, I think this probably was the problem; since that update landed I've not seen the problem again. Marking as a dupe of 1055660.

*** This bug has been marked as a duplicate of bug 1055660 ***
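Lars's distinction (daemon dead vs. running but unconnectable) can be checked directly: `systemctl status qpidd` shows whether the process is up, and a plain TCP connect to the AMQP port tells you whether anything is actually listening. A small illustrative check, assuming the broker host and port from the logs above:

```python
import socket

def amqp_port_status(host="192.168.0.11", port=5672, timeout=2.0):
    """Classify the broker endpoint from a client's point of view.

    Returns 'listening' if a TCP connect succeeds, 'refused' on
    ECONNREFUSED (nothing bound to the port, matching the
    "[Errno 111] ECONNREFUSED" errors in the logs above), or
    'unreachable' on timeouts and routing failures.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"
```

A 'refused' result while systemd still reports qpidd as active would match Lars's running-but-unconnectable case; 'refused' with the unit inactive would mean the daemon simply never started.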