Bug 1005691 - compute doesn't recover from controller reboot
Summary: compute doesn't recover from controller reboot
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.0
Assignee: Xavier Queralt
QA Contact: Ami Jeain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-09-09 08:02 UTC by Jaroslav Henner
Modified: 2023-09-18 09:58 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-14 20:30:51 UTC
Target Upstream Version:
Embargoed:
jhenner: needinfo-


Attachments
compute.log (44.81 KB, text/x-log), 2013-09-09 08:03 UTC, Jaroslav Henner

Description Jaroslav Henner 2013-09-09 08:02:50 UTC
Description of problem:
After rebooting the controller, which runs mysql, nova-api, quantum-server and amqpd, the nova-compute host stays in state XXX for more than 20 minutes.

Version-Release number of selected component (if applicable):
openstack-nova-compute-2013.1.3-3.el6ost.noarch

How reproducible:
always

Steps to Reproduce:
1. have controller + compute-node deployment
2. reboot controller
3. nova-manage service list

Actual results:
nova-compute     master-04.rhos... nova             enabled    XXX   2013-09-09 07:29:20


Expected results:
nova-compute     master-04.rhos... nova             enabled    :-)   2013-09-09 07:XX:XX

Additional info:
It seems like the service tried to reconnect to the AMQP server, but instead of waiting 1 second between attempts it made the retries immediately.
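
For illustration, here is a minimal sketch of the kind of exponential backoff the reconnect loop is expected to apply between attempts. This is an approximation for discussion, not the actual nova.openstack.common.rpc.impl_qpid code; the function name and interval parameters are made up:

import socket
import time

def reconnect(host, port, interval_start=1, interval_max=60):
    """Keep retrying the connection, doubling the sleep after each failure."""
    delay = interval_start
    while True:
        try:
            # In nova this would be the qpid/AMQP connection attempt.
            return socket.create_connection((host, port), timeout=5)
        except socket.error as exc:
            print("Unable to connect to AMQP server: %s. Sleeping %d seconds"
                  % (exc, delay))
            time.sleep(delay)
            delay = min(delay * 2, interval_max)

If the loop instead retried without sleeping, the log would show a burst of back-to-back connection errors, which is what the attached compute.log appeared to show.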

Comment 1 Jaroslav Henner 2013-09-09 08:03:20 UTC
Created attachment 795522 [details]
compute.log

Comment 4 Daniel Berrangé 2013-12-05 15:42:17 UTC
I tested current upstream using devstack on a single machine. I shut down the qpidd daemon and watched the nova-compute service logs and the nova-manage service list. I see the same kind of error messages:

2013-12-05 15:34:39.935 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 16 seconds
2013-12-05 15:34:40.784 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:40.785 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:47.820 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:55.945 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 32 seconds

and service list

$ nova-manage service list 2>/dev/null
Binary           Host                                 Zone             Status     State Updated_At
nova-conductor   mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:32
nova-cert        mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:34
nova-network     mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:35
nova-scheduler   mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:37
nova-compute     mustard.gsslab.fab.redhat.com        nova             enabled    XXX   2013-12-05 15:33:34
nova-consoleauth mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:31


Looking at the timestamps there, we certainly see the same bizarre waits. I think what is happening, though, is that these log messages are coming from different eventlet threads, hence we get three 'Sleeping 60 seconds' messages within a short time, each from a different thread. So I think Jaroslav's logs just show lots of threads waiting in parallel.
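
To illustrate, a small standalone sketch (assuming eventlet is available; the thread names and delays are invented for the example, this is not the nova code) of how several greenthreads, each with its own backoff counter, produce clusters of 'Sleeping N seconds' lines in a shared log:

import time
import eventlet

def reconnect_loop(name, max_delay=60):
    # Each RPC consumer/publisher runs its own reconnect loop with its own
    # backoff counter, so the log lines from different threads interleave.
    delay = 1
    for _ in range(5):
        print("%.3f %s: Unable to connect to AMQP server. Sleeping %d seconds"
              % (time.time(), name, delay))
        eventlet.sleep(delay)
        delay = min(delay * 2, max_delay)

threads = [eventlet.spawn(reconnect_loop, "thread-%d" % i) for i in range(3)]
for t in threads:
    t.wait()

Running this prints three "Sleeping 1 seconds" lines close together, then three "Sleeping 2 seconds" lines, and so on, which reproduces the clustered pattern described above.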

When I restarted the qpidd daemon it eventually reconnected and went back to normal operation; I didn't see any 20 minute delay. It is possible there is a difference between upstream git master and the RHOS 4 codebase, but I don't currently have an environment able to run the latter, so I can't test that.

The logs attached to this bug only cover a 1 minute interval. We could really do with the full, unedited nova-compute log file from the host showing this problem, rather than just a short snippet.

Comment 5 Daniel Berrangé 2013-12-09 17:16:16 UTC
I've now tested the RHOS 4 versions directly.

I did a 2 node install using packstack on rhel-6.5

# packstack --install-hosts=192.168.122.84,192.168.122.82

Once everything was up & running, I rebooted the controller node. The compute node showed it was attempting to reconnect periodically. Once the controller was fully up & running, the compute server re-connected within 1 minute, as expected.

Versions on controller:

# rpm -qa | grep openstack | sort
openstack-ceilometer-alarm-2013.2-4.el6ost.noarch
openstack-ceilometer-api-2013.2-4.el6ost.noarch
openstack-ceilometer-central-2013.2-4.el6ost.noarch
openstack-ceilometer-collector-2013.2-4.el6ost.noarch
openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-cinder-2013.2-7.el6ost.noarch
openstack-dashboard-2013.2-8.el6ost.noarch
openstack-dashboard-theme-2013.2-8.el6ost.noarch
openstack-glance-2013.2-4.el6ost.noarch
openstack-keystone-2013.2-3.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-api-2013.2-9.el6ost.noarch
openstack-nova-cert-2013.2-9.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-conductor-2013.2-9.el6ost.noarch
openstack-nova-console-2013.2-9.el6ost.noarch
openstack-nova-novncproxy-2013.2-9.el6ost.noarch
openstack-nova-scheduler-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch
python-django-openstack-auth-1.1.2-1.el6ost.noarch
redhat-access-plugin-openstack-4.0.0-0.el6ost.noarch

Versions on compute:

openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-ceilometer-compute-2013.2-4.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-compute-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch


Unless anyone can demonstrate the flaw, I suggest closing this as NOTABUG or WORKSFORME.

