Description of problem:
After rebooting the controller (which runs mysql, nova-api, quantum-server and amqpd), the nova-compute host stays in state XXX for more than 20 minutes.

Version-Release number of selected component (if applicable):
openstack-nova-compute-2013.1.3-3.el6ost.noarch

How reproducible:
always

Steps to Reproduce:
1. have a controller + compute-node deployment
2. reboot the controller
3. nova-manage service list

Actual results:
nova-compute     master-04.rhos...     nova     enabled     XXX     2013-09-09 07:29:20

Expected results:
nova-compute     master-04.rhos...     nova     enabled     :-)     2013-09-09 07:XX:XX

Additional info:
It looks like nova-compute tried to reconnect to the AMQP server, but instead of waiting 1 second between attempts it retried immediately.
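For context, the behaviour described above should amount to a reconnect loop with an increasing back-off in the RPC layer. The following is a minimal, hypothetical sketch of that pattern (the names, constants and socket-level connect are illustrative, not the actual nova.openstack.common.rpc.impl_qpid code): each failed attempt is supposed to be followed by a sleep that grows up to a maximum, rather than an immediate retry.

import logging
import socket
import time

LOG = logging.getLogger(__name__)

def reconnect(broker_host, broker_port=5672, max_delay=60):
    # Hypothetical back-off loop: start at 1 second and double after each
    # failure, capping at max_delay. The reported symptom suggests the
    # sleep between attempts is being skipped.
    delay = 1
    while True:
        try:
            return socket.create_connection((broker_host, broker_port),
                                            timeout=5)
        except socket.error as err:
            LOG.error("Unable to connect to AMQP server: %s. "
                      "Sleeping %d seconds", err, delay)
            time.sleep(delay)
            delay = min(delay * 2, max_delay)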
Created attachment 795522 [details] compute.log
I tested current upstream using devstack on a single machine. I shut down the qpidd daemon and watched the nova-compute service logs and nova-manage service list. I see the same kind of error messages:

2013-12-05 15:34:39.935 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 16 seconds
2013-12-05 15:34:40.784 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:40.785 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:47.820 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:55.945 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 32 seconds

and the service list:

$ nova-manage service list 2>/dev/null
Binary            Host                           Zone      Status   State  Updated_At
nova-conductor    mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:32
nova-cert         mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:34
nova-network      mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:35
nova-scheduler    mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:37
nova-compute      mustard.gsslab.fab.redhat.com  nova      enabled  XXX    2013-12-05 15:33:34
nova-consoleauth  mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:31

Looking at the timestamps there we certainly see the same bizarre waits. I think what is happening, though, is that these log messages come from different eventlet threads - hence we get three 'Sleeping 60 seconds' messages within a short time, each from a different thread. So I think Jaroslav's logs just show lots of threads waiting in parallel.

When I restarted the qpidd daemon the service eventually reconnected and went back to normal operation; I didn't see any 20 minute delay.

It is possible there's a difference between upstream GIT master and the RHOS 4 codebase, but I don't currently have an environment able to run the latter myself, so I can't test that. The logs attached to this bug only show data from a 1 minute interval. We could really do with the full, un-edited nova-compute logfile from the host showing this problem, rather than just a short snippet.
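To make the "parallel threads" explanation above concrete, here is a small, hypothetical eventlet sketch (not nova code; the delays are shortened stand-ins for the real 16/32/60 second back-offs): three greenthreads each running their own back-off loop emit "Sleeping N seconds" lines within milliseconds of each other, even though every individual thread does wait correctly between its own attempts.

import logging
import random
import eventlet

logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)
LOG = logging.getLogger(__name__)

def reconnect_loop(name):
    # Each RPC consumer/publisher runs its own reconnect loop, so when the
    # broker is down several loops log their back-off message almost
    # simultaneously, each with its own delay value.
    delay = random.choice([1, 2, 3])  # stand-ins for 16/32/60 seconds
    for _ in range(3):
        LOG.info("[%s] Unable to connect to AMQP server. Sleeping %d seconds",
                 name, delay)
        eventlet.sleep(delay)

threads = [eventlet.spawn(reconnect_loop, "thread-%d" % i) for i in range(3)]
for t in threads:
    t.wait()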
I've now tested the RHOS 4 versions directly. I did a 2-node install using packstack on rhel-6.5:

# packstack --install-hosts=192.168.122.84,192.168.122.82

Once everything was up & running, I rebooted the controller node. The compute node showed it was attempting to reconnect periodically. Once the controller was fully up & running again, the compute server re-connected within 1 minute, as expected.

Versions on controller:

# rpm -qa | grep openstack | sort
openstack-ceilometer-alarm-2013.2-4.el6ost.noarch
openstack-ceilometer-api-2013.2-4.el6ost.noarch
openstack-ceilometer-central-2013.2-4.el6ost.noarch
openstack-ceilometer-collector-2013.2-4.el6ost.noarch
openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-cinder-2013.2-7.el6ost.noarch
openstack-dashboard-2013.2-8.el6ost.noarch
openstack-dashboard-theme-2013.2-8.el6ost.noarch
openstack-glance-2013.2-4.el6ost.noarch
openstack-keystone-2013.2-3.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-api-2013.2-9.el6ost.noarch
openstack-nova-cert-2013.2-9.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-conductor-2013.2-9.el6ost.noarch
openstack-nova-console-2013.2-9.el6ost.noarch
openstack-nova-novncproxy-2013.2-9.el6ost.noarch
openstack-nova-scheduler-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch
python-django-openstack-auth-1.1.2-1.el6ost.noarch
redhat-access-plugin-openstack-4.0.0-0.el6ost.noarch

Versions on compute:

openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-ceilometer-compute-2013.2-4.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-compute-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch

Unless anyone can demonstrate the flaw, I suggest closing this as NOTABUG or WORKSFORME.
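For anyone re-running this check, a small polling script along the lines of the sketch below (a hypothetical helper, assuming nova-manage is available on the host and a Python with subprocess.check_output) makes it easy to measure how long nova-compute takes to return to ':-)' after the controller comes back up.

import subprocess
import time

def wait_for_compute_up(timeout=1200, interval=10):
    # Poll 'nova-manage service list' until the nova-compute row shows
    # the ':-)' state, or give up after 'timeout' seconds.
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(["nova-manage", "service", "list"])
        for line in out.decode().splitlines():
            if line.startswith("nova-compute") and ":-)" in line:
                print("nova-compute is back up: " + line)
                return True
        time.sleep(interval)
    return False

if __name__ == "__main__":
    wait_for_compute_up()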