Bug 1005691 - compute doesn't recover from controller reboot [NEEDINFO]
compute doesn't recover from controller reboot
Status: CLOSED WORKSFORME
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 3.0
Hardware: x86_64 Linux
Priority: high   Severity: medium
Target Milestone: ---
Target Release: 4.0
Assigned To: Xavier Queralt
QA Contact: Ami Jeain
Keywords: ZStream
Depends On:
Blocks:
 
Reported: 2013-09-09 04:02 EDT by Jaroslav Henner
Modified: 2014-01-14 15:30 EST
CC: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-14 15:30:51 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
berrange: needinfo? (jhenner)


Attachments
compute.log (44.81 KB, text/x-log)
2013-09-09 04:03 EDT, Jaroslav Henner
no flags

Description Jaroslav Henner 2013-09-09 04:02:50 EDT
Description of problem:
After rebooting the controller, which runs mysql, nova-api, quantum-server and amqpd, the nova-compute host stays in state XXX for more than 20 minutes.

Version-Release number of selected component (if applicable):
openstack-nova-compute-2013.1.3-3.el6ost.noarch

How reproducible:
always

Steps to Reproduce:
1. have controller + compute-node deployment
2. reboot controller
3. nova-manage service list

Actual results:
nova-compute     master-04.rhos... nova             enabled    XXX   2013-09-09 07:29:20


Expected results:
nova-compute     master-04.rhos... nova             enabled    :-)   2013-09-09 07:XX:XX

Additional info:
It seems nova-compute tried to reconnect to the AMQP server, but instead of waiting 1 second between attempts it retried immediately.
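
For reference, the expected behaviour here is a back-off between reconnect attempts rather than an immediate retry. A minimal illustrative sketch (not the actual nova/oslo reconnect code; the 1-second start and 60-second cap are assumptions based on the "Sleeping N seconds" messages in compute.log):

import time


def reconnect_with_backoff(connect, min_interval=1, max_interval=60):
    # Keep retrying the connection, doubling the wait between attempts.
    # `connect` is a stand-in for whatever call opens the AMQP connection.
    delay = min_interval
    while True:
        try:
            return connect()
        except Exception as exc:
            print("Unable to connect to AMQP server: %s. Sleeping %d seconds"
                  % (exc, delay))
            time.sleep(delay)  # the report suggests this wait was effectively skipped
            delay = min(delay * 2, max_interval)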
Comment 1 Jaroslav Henner 2013-09-09 04:03:20 EDT
Created attachment 795522 [details]
compute.log
Comment 4 Daniel Berrange 2013-12-05 10:42:17 EST
I tested current upstream using devstack on a single machine. I shut down the qpidd daemon and watched the nova-compute service logs and the nova-manage service list. I see the same kind of error messages:

2013-12-05 15:34:39.935 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 16 seconds
2013-12-05 15:34:40.784 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:40.785 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:47.820 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:55.945 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 32 seconds

and the service list:

$ nova-manage service list 2>/dev/null
Binary           Host                                 Zone             Status     State Updated_At
nova-conductor   mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:32
nova-cert        mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:34
nova-network     mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:35
nova-scheduler   mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:37
nova-compute     mustard.gsslab.fab.redhat.com        nova             enabled    XXX   2013-12-05 15:33:34
nova-consoleauth mustard.gsslab.fab.redhat.com        internal         enabled    :-)   2013-12-05 15:34:31


Looking at the timestamps there, we certainly see the same bizarre waits. I think what is happening, though, is that these log messages are coming from different eventlet threads; hence we get three 'Sleeping 60 seconds' messages within a short time, each from a different thread. So I think Jaroslav's logs just show lots of threads waiting in parallel.
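
To illustrate that point, here is a small standalone sketch (illustrative only, not nova code) of several eventlet greenthreads, each running its own independent back-off loop; their messages interleave in a single log, producing clusters of 'Sleeping 60 seconds' lines that can look like one thread retrying without waiting:

import random
import eventlet


def reconnect_worker(name, max_interval=60):
    # Each worker backs off independently; their output interleaves in one log.
    delay = 1
    while True:
        # Stand-in for a failing AMQP connection attempt.
        print("%s: Unable to connect to AMQP server. Sleeping %d seconds"
              % (name, delay))
        eventlet.sleep(delay + random.random())  # yields to the other greenthreads
        delay = min(delay * 2, max_interval)


# Several RPC consumers/publishers, each with its own connection and back-off.
for i in range(3):
    eventlet.spawn(reconnect_worker, "thread-%d" % i)
eventlet.sleep(10)  # let the interleaved output accumulate for a while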

When I restarted the qpidd daemon it eventually reconnected and went back to normal operation; I didn't see any 20-minute delay. It is possible there's a difference between upstream git master and the RHOS 4 codebase, but I don't currently have an environment able to run the latter, so I can't test that.

The logs attached to this bug only show data from a 1-minute interval. We could really do with the full, unedited nova-compute logfile from the host showing this problem, rather than just a short snippet.
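
For context on the :-) vs XXX column above: nova-manage marks a service as up only if its last heartbeat is recent enough. A simplified sketch of that check (the 60-second service_down_time default is an assumption for this release):

import datetime


def service_is_up(updated_at, service_down_time=60):
    # Simplified version of the check behind the :-) / XXX column:
    # a service counts as up if its last heartbeat (updated_at) is within
    # service_down_time seconds of now. Until the compute service can
    # reconnect to AMQP and write a new heartbeat, this stays False.
    elapsed = datetime.datetime.utcnow() - updated_at
    return abs(elapsed.total_seconds()) <= service_down_time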
Comment 5 Daniel Berrange 2013-12-09 12:16:16 EST
I've now tested the RHOS 4 versions directly.

I did a two-node install using packstack on RHEL 6.5:

# packstack --install-hosts=192.168.122.84,192.168.122.82

Once everything was up and running, I rebooted the controller node. The compute node showed it was attempting to reconnect periodically. Once the controller was fully back up, the compute service reconnected within 1 minute, as expected.

Versions on controller:

# rpm -qa | grep openstack | sort
openstack-ceilometer-alarm-2013.2-4.el6ost.noarch
openstack-ceilometer-api-2013.2-4.el6ost.noarch
openstack-ceilometer-central-2013.2-4.el6ost.noarch
openstack-ceilometer-collector-2013.2-4.el6ost.noarch
openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-cinder-2013.2-7.el6ost.noarch
openstack-dashboard-2013.2-8.el6ost.noarch
openstack-dashboard-theme-2013.2-8.el6ost.noarch
openstack-glance-2013.2-4.el6ost.noarch
openstack-keystone-2013.2-3.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-api-2013.2-9.el6ost.noarch
openstack-nova-cert-2013.2-9.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-conductor-2013.2-9.el6ost.noarch
openstack-nova-console-2013.2-9.el6ost.noarch
openstack-nova-novncproxy-2013.2-9.el6ost.noarch
openstack-nova-scheduler-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch
python-django-openstack-auth-1.1.2-1.el6ost.noarch
redhat-access-plugin-openstack-4.0.0-0.el6ost.noarch

Versions on compute:

openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-ceilometer-compute-2013.2-4.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-compute-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch


Unless anyone can demonstrate the flaw, I suggest closing this as NOTABUG or WORKSFORME.
