RDO tickets are now tracked in Jira https://issues.redhat.com/projects/RDO/issues/
Bug 1049488 - Services can't connect to qpid after reboot
Summary: Services can't connect to qpid after reboot
Keywords:
Status: CLOSED DUPLICATE of bug 1055660
Alias: None
Product: RDO
Classification: Community
Component: distribution
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Ami Jeain
URL:
Whiteboard:
Depends On:
Blocks: 1049504
 
Reported: 2014-01-07 15:51 UTC by Steven Hardy
Modified: 2014-05-20 09:40 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1049504 (view as bug list)
Environment:
Last Closed: 2014-05-20 09:40:40 UTC
Embargoed:



Description Steven Hardy 2014-01-07 15:51:13 UTC
Description of problem:
After reboot, heat is broken because it can't connect to the qpid broker. I'm not sure if this is a heat or qpid issue, but it doesn't seem to affect all the other services; e.g. glance and keystone still work (although nova seems to have similar issues).

Version-Release number of selected component (if applicable):
# rpm -qa | grep -e heat -e packstack
openstack-heat-common-2013.2-1.0.fc20.noarch
openstack-heat-api-cfn-2013.2-1.0.fc20.noarch
openstack-heat-api-cloudwatch-2013.2-1.0.fc20.noarch
python-heatclient-0.2.6-2.fc20.noarch
openstack-heat-engine-2013.2-1.0.fc20.noarch
openstack-packstack-2013.2.1-0.25.dev936.fc20.noarch
openstack-heat-api-2013.2-1.0.fc20.noarch


How reproducible:
Always, seems broken after every reboot.


Steps to Reproduce:
1. Install F20 (bare metal box w/16G ram, selected development tools)
2. sudo su -
3. yum -y update
4. shutdown -r now (reboot to ensure all updates are active)
5. yum -y install openstack-packstack
6. packstack --allinone --os-heat-install=y --os-swift-install=n --os-ceilometer-install=y --os-heat-cloudwatch-install=y --os-heat-cfn-install=y
7. reboot

Actual results:
# tail -f /var/log/heat/api.log 
2014-01-07 15:45:08.599 1635 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.11
2014-01-07 15:45:08.643 1635 ERROR heat.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds
2014-01-07 15:45:09.645 1635 ERROR heat.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds
2014-01-07 15:45:11.648 1635 ERROR heat.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds

Expected results:
Heat should be fully operational after reboot.

Additional info:

This workaround gets heat working again:
# service qpidd restart
# service openstack-heat-engine restart
# service openstack-heat-api restart
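If the root cause were start-up ordering (services racing qpidd at boot), a systemd drop-in along these lines would make heat-engine wait for the broker. This is a sketch of the idea, not a tested fix; the unit names are the F20 ones used above, and the drop-in path is illustrative:

```ini
# /etc/systemd/system/openstack-heat-engine.service.d/qpidd.conf
# Hypothetical ordering drop-in: pull in qpidd and start heat-engine
# only after qpidd has been started.
[Unit]
Wants=qpidd.service
After=qpidd.service
```

After creating the drop-in, `systemctl daemon-reload` would be needed for it to take effect. Note this only fixes ordering; it would not help if qpidd itself comes up broken after reboot.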

Comment 1 Steven Hardy 2014-01-07 18:06:14 UTC
Actually it looks like it's not just heat, and heat does recover (as do nova, neutron and ceilometer, which all seem to be broken in the same way, although not glance or keystone); it just takes a while for heat to re-establish the connection:

# tail -f /var/log/messages | grep impl_qpid
Jan  7 17:53:28 localhost ceilometer-agent-central: 2014-01-07 17:53:28.060 2135 ERROR ceilometer.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:28 localhost ceilometer-collector: 2014-01-07 17:53:28.344 2232 ERROR ceilometer.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:28 localhost ceilometer-alarm-notifier: 2014-01-07 17:53:28.494 2226 ERROR ceilometer.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:29 localhost neutron-l3-agent: 2014-01-07 17:53:29.177 3336 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:29 localhost neutron-l3-agent: 2014-01-07 17:53:29.179 3336 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:30 localhost neutron-dhcp-agent: 2014-01-07 17:53:30.346 3337 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:30 localhost neutron-dhcp-agent: 2014-01-07 17:53:30.346 3337 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:30 localhost neutron-openvswitch-agent: 2014-01-07 17:53:30.372 3339 ERROR neutron.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:53:30 localhost nova-compute: 2014-01-07 17:53:30.704 3335 ERROR nova.openstack.common.rpc.impl_qpid [req-46157e0d-b549-41ad-bed7-63f78c0bc85c None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:54:10 localhost heat-engine: 2014-01-07 17:54:10.515 5535 ERROR heat.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:54:19 localhost nova-cert: 2014-01-07 17:54:19.670 2166 ERROR nova.openstack.common.rpc.impl_qpid [req-21344615-8375-4a52-b908-40762e19f210 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:54:19 localhost nova-consoleauth: 2014-01-07 17:54:19.778 2235 ERROR nova.openstack.common.rpc.impl_qpid [req-4c3ce8e8-54f4-496b-9796-2d2cce600c31 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:54:19 localhost nova-conductor: 2014-01-07 17:54:19.811 2241 ERROR nova.openstack.common.rpc.impl_qpid [req-d38a13c3-1148-419d-8c98-4aad67e24072 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
Jan  7 17:54:19 localhost nova-scheduler: 2014-01-07 17:54:19.845 2248 ERROR nova.openstack.common.rpc.impl_qpid [req-1f02af65-0b35-4965-a2c5-7a4805317ac7 None None] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds

(systemctl restart qpidd.service)

Jan  7 18:01:28 localhost ceilometer-agent-compute: 2014-01-07 18:01:28.315 2139 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:28 localhost ceilometer-agent-central: 2014-01-07 18:01:28.496 2135 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:28 localhost ceilometer-alarm-notifier: 2014-01-07 18:01:28.753 2226 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:28 localhost ceilometer-collector: 2014-01-07 18:01:28.767 2232 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:28 localhost ceilometer-agent-central: 2014-01-07 18:01:28.884 2135 INFO ceilometer.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:29 localhost neutron-l3-agent: 2014-01-07 18:01:29.198 3336 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:29 localhost neutron-l3-agent: 2014-01-07 18:01:29.201 3336 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:29 localhost neutron-l3-agent: 2014-01-07 18:01:29.219 3336 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:30 localhost neutron-dhcp-agent: 2014-01-07 18:01:30.699 3337 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:30 localhost neutron-dhcp-agent: 2014-01-07 18:01:30.702 3337 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:30 localhost neutron-openvswitch-agent: 2014-01-07 18:01:30.715 3339 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:30 localhost neutron-dhcp-agent: 2014-01-07 18:01:30.727 3337 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:30 localhost neutron-openvswitch-agent: 2014-01-07 18:01:30.745 3339 INFO neutron.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:31 localhost nova-compute: 2014-01-07 18:01:31.008 3335 INFO nova.openstack.common.rpc.impl_qpid [req-46157e0d-b549-41ad-bed7-63f78c0bc85c None None] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:01:31 localhost nova-compute: 2014-01-07 18:01:31.026 3335 INFO nova.openstack.common.rpc.impl_qpid [req-46157e0d-b549-41ad-bed7-63f78c0bc85c None None] Connected to AMQP server on 192.168.0.11:5672
Jan  7 18:02:10 localhost heat-engine: 2014-01-07 18:02:10.807 5535 INFO heat.openstack.common.rpc.impl_qpid [-] Connected to AMQP server on 192.168.0.11:5672

So this appears to be either a qpidd or oslo issue.

Comment 2 Steven Dake 2014-01-07 18:18:29 UTC
In the past Heat didn't try to reconnect to the AMQP bus after it was disconnected.  Zane fixed this about 1 year ago, and after that it was possible to restart services or start services in non-dependent order and they would figure it out.  So this may be "operating as intended" since the software itself retries the connections.  I believe the retry was 2-3 seconds.  There may be a dependency ordering issue with openstack services vs QPID.

Comment 3 Steven Hardy 2014-01-07 22:21:26 UTC
(In reply to Steven Dake from comment #2)
> In the past Heat didn't try to reconnect to the AMQP bus after it was
> disconnected.  Zane fixed this about 1 year ago, and after that it was
> possible to restart services or start services in non-dependent order and
> they would figure it out.  So this may be "operating as intended" since the
> software itself retries the connections.  I believe the retry was 2-3
> seconds.  There may be a dependency ordering issue with openstack services
> vs QPID.

I think the retries are working as intended, although the multiplier in impl_qpid means you can end up waiting for nearly a minute for services to reconnect:

https://github.com/openstack/heat/blob/master/heat/openstack/common/rpc/impl_qpid.py#L519

The bug, AFAICT, is that something is broken with qpidd which requires a restart before the services can connect; either that, or there is a bug in the oslo common code which handles making the connection. Currently the former seems most likely, but I'm not sure how to debug it yet.
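For reference, the reconnect loop in impl_qpid is essentially exponential backoff with a cap. A minimal sketch of that behaviour (not the actual oslo code; the function name, signature, and defaults here are illustrative):

```python
import time

def reconnect(connect, initial=1, backoff=2, max_interval=60, sleep=time.sleep):
    """Retry `connect` until it succeeds, sleeping between attempts.

    Rough sketch of the backoff in the oslo-incubator impl_qpid
    reconnect loop linked above: the delay starts at `initial`
    seconds, is multiplied by `backoff` after each failure, and is
    capped at `max_interval` -- which is why a service can end up in
    a "Sleeping 60 seconds" loop and take nearly a minute to notice
    the broker is back.
    """
    delay = initial
    while True:
        try:
            return connect()
        except OSError:  # e.g. [Errno 111] ECONNREFUSED
            sleep(delay)
            delay = min(delay * backoff, max_interval)
```

This matches the "Sleeping 1 seconds / 2 seconds / 4 seconds" progression in the api.log excerpt above.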

Comment 4 Steven Hardy 2014-01-08 11:50:05 UTC
Setting to distribution since this isn't a heat specific problem, and it's been confirmed on IRC that I'm not the only one seeing this.

Comment 5 Steven Hardy 2014-01-08 11:52:06 UTC
Note this is probably a dupe of bug #984968, but since that's been open for many months with no comments at all, and this bug has more details, I'm not sure we want to close this one (yet).

Comment 6 Matthias Runge 2014-01-09 07:57:07 UTC
IMHO one can also trigger this by doing a suspend/resume cycle.

Comment 7 Lars Kellogg-Stedman 2014-02-10 19:15:07 UTC
Steve,

When you run into this, is qpidd *not running*, or is it running but nothing is able to connect to it?  If the latter, this may be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1055660.  Can you take a look and let me know?

Comment 8 Steven Hardy 2014-05-20 09:40:40 UTC
(In reply to Lars Kellogg-Stedman from comment #7)
> Steve,
> 
> When you run into this, is qpidd *not running*, or is it running but nothing
> is able to connect to it?  If the latter, this may be a dupe of
> https://bugzilla.redhat.com/show_bug.cgi?id=1055660.  Can you take a look
> and let me know?

Yes, I think this probably was the problem, as since that update landed I've not seen the problem. Marking as a dupe of 1055660.

*** This bug has been marked as a duplicate of bug 1055660 ***

