Bug 1005691
Summary: compute doesn't recover from controller reboot

| Field | Value |
|---|---|
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Version | 3.0 |
| Target Release | 4.0 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED WORKSFORME |
| Severity | medium |
| Priority | high |
| Reporter | Jaroslav Henner <jhenner> |
| Assignee | Xavier Queralt <xqueralt> |
| QA Contact | Ami Jeain <ajeain> |
| CC | dallan, jhenner, ndipanov, sgordon, yeylon |
| Keywords | ZStream |
| Flags | jhenner: needinfo- |
| Type | Bug |
| Doc Type | Bug Fix |
| Last Closed | 2014-01-14 20:30:51 UTC |
Description
Jaroslav Henner 2013-09-09 08:02:50 UTC

Created attachment 795522 [details]: compute.log
I tested current upstream using devstack on a single machine. I shut down the qpidd daemon and watched the nova-compute service logs and `nova-manage service list`. I see the same kind of error messages:

```
2013-12-05 15:34:39.935 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 16 seconds
2013-12-05 15:34:40.784 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:40.785 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:47.820 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 60 seconds
2013-12-05 15:34:55.945 ERROR nova.openstack.common.rpc.impl_qpid [-] Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 32 seconds
```

And the service list:

```
$ nova-manage service list 2>/dev/null
Binary            Host                           Zone      Status   State  Updated_At
nova-conductor    mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:32
nova-cert         mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:34
nova-network      mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:35
nova-scheduler    mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:37
nova-compute      mustard.gsslab.fab.redhat.com  nova      enabled  XXX    2013-12-05 15:33:34
nova-consoleauth  mustard.gsslab.fab.redhat.com  internal  enabled  :-)    2013-12-05 15:34:31
```

Looking at the timestamps there, we certainly see the same bizarre waits. I think what is happening, though, is that these log messages are coming in from different eventlet threads; hence we get three "Sleeping 60 seconds" messages within a short time, each from a different thread. So I think Jaroslav's logs just show lots of threads waiting in parallel.
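The mix of "Sleeping 16/32/60 seconds" messages is consistent with each consumer thread independently retrying with an exponential backoff capped at 60 seconds. The sketch below is illustrative only: the generator name and parameters are ours, not the actual `impl_qpid` configuration options, and it assumes a simple doubling schedule with a cap.

```python
def backoff_intervals(start=1, max_interval=60):
    """Yield retry sleep intervals: start, 2*start, 4*start, ...
    capped at max_interval (a sketch of the backoff seen in the logs;
    names and defaults are assumptions, not nova's actual options)."""
    interval = start
    while True:
        yield interval
        interval = min(interval * 2, max_interval)

intervals = backoff_intervals()
print([next(intervals) for _ in range(6)])  # [1, 2, 4, 8, 16, 32]
print(next(intervals))                      # 60 -- every later retry sleeps 60s
```

With several threads each running their own copy of this schedule, started at slightly different times, the log naturally interleaves different sleep values within the same second, which matches the "bizarre waits" above.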
When I restarted the qpidd daemon, the service eventually reconnected and went back to normal operation; I didn't see any 20-minute delay. It is possible there's a difference between upstream git master and the RHOS 4 codebase, but I don't currently have an environment able to run the latter myself, so I can't test that. The logs attached to this bug only show data from a one-minute interval. We could really do with the full, unedited nova-compute logfile from the host showing this problem, rather than just a short snippet.

I've now tested the RHOS 4 versions directly. I did a two-node install using packstack on rhel-6.5:

```
# packstack --install-hosts=192.168.122.84,192.168.122.82
```

Once everything was up and running, I rebooted the controller node. The compute node showed it was attempting to reconnect periodically. Once the controller was fully up and running, the compute server re-connected within 1 minute, as expected.

Versions on the controller:

```
# rpm -qa | grep openstack | sort
openstack-ceilometer-alarm-2013.2-4.el6ost.noarch
openstack-ceilometer-api-2013.2-4.el6ost.noarch
openstack-ceilometer-central-2013.2-4.el6ost.noarch
openstack-ceilometer-collector-2013.2-4.el6ost.noarch
openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-cinder-2013.2-7.el6ost.noarch
openstack-dashboard-2013.2-8.el6ost.noarch
openstack-dashboard-theme-2013.2-8.el6ost.noarch
openstack-glance-2013.2-4.el6ost.noarch
openstack-keystone-2013.2-3.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-api-2013.2-9.el6ost.noarch
openstack-nova-cert-2013.2-9.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-conductor-2013.2-9.el6ost.noarch
openstack-nova-console-2013.2-9.el6ost.noarch
openstack-nova-novncproxy-2013.2-9.el6ost.noarch
openstack-nova-scheduler-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch
python-django-openstack-auth-1.1.2-1.el6ost.noarch
redhat-access-plugin-openstack-4.0.0-0.el6ost.noarch
```

Versions on compute:

```
openstack-ceilometer-common-2013.2-4.el6ost.noarch
openstack-ceilometer-compute-2013.2-4.el6ost.noarch
openstack-neutron-2013.2-13.el6ost.noarch
openstack-neutron-openvswitch-2013.2-13.el6ost.noarch
openstack-nova-common-2013.2-9.el6ost.noarch
openstack-nova-compute-2013.2-9.el6ost.noarch
openstack-packstack-2013.2.1-0.13.dev876.el6ost.noarch
openstack-selinux-0.1.3-1.el6ost.noarch
openstack-utils-2013.2-2.el6ost.noarch
```

Unless anyone can demonstrate the flaw, I suggest closing this as NOTABUG or WORKSFORME.
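For context on the `:-)` versus `XXX` column in `nova-manage service list`: nova considers a service down when its last heartbeat (the Updated_At column) is older than the `service_down_time` option, which defaults to 60 seconds. The helper below is a sketch of that check, not nova's actual code; the function name and the fixed "now" are ours.

```python
from datetime import datetime, timedelta

# service_down_time defaults to 60 seconds in nova; the helper itself
# is an illustrative sketch, not nova's implementation.
SERVICE_DOWN_TIME = timedelta(seconds=60)

def service_state(updated_at, now):
    """Return the state marker nova-manage would print for a service."""
    return ":-)" if now - updated_at <= SERVICE_DOWN_TIME else "XXX"

now = datetime(2013, 12, 5, 15, 34, 40)
# nova-conductor heartbeated 8 seconds ago -> up
print(service_state(datetime(2013, 12, 5, 15, 34, 32), now))  # :-)
# nova-compute last heartbeated 66 seconds ago -> down
print(service_state(datetime(2013, 12, 5, 15, 33, 34), now))  # XXX
```

This is why nova-compute showed `XXX` in the listing above while qpidd was down: with the message bus unreachable, its periodic state report could not be delivered, so Updated_At aged past the threshold even though the process itself was still running and retrying.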