Bug 974045

Summary:

openstack-nova: 'model server went away' ERROR after creating more than 2 snapshots from different instances

Product:

Red Hat OpenStack

Reporter:

Dafna Ron <dron>

Component:

openstack-nova

Assignee:

Vladan Popovic <vpopovic>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Haim <hateya>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

unspecified

CC:

abaron, dallan, jkt, ndipanov, shedoh, yeylon

Target Milestone:

---

Target Release:

4.0

Hardware:

x86_64

OS:

Linux

Whiteboard:

storage

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2013-10-18 14:33:06 UTC

Type:

Bug

Attachments:

Description	Flags
logs	none

Description Dafna Ron 2013-06-13 10:49:45 UTC

Created attachment 760570
logs

Description of problem:

I have 10 instances running on two different hosts.
When trying to create snapshots from the 10 running instances, we constantly get 'model server went away' ERRORs.

Version-Release number of selected component (if applicable):

openstack-nova-compute-2013.1.1-4.el6ost.noarch
openstack-nova-api-2013.1.1-4.el6ost.noarch
libvirt-0.10.2-18.el6_4.5.x86_64
qemu-img-rhev-0.12.1.2-2.355.el6_4.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.4.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create an image and launch 10 instances from the image.
2. Once the instances are running, create a snapshot of each instance.

Actual results:

After the third snapshot, "model server went away" ERRORs start appearing in the log, the command 'nova <server> <name>' takes a long time to return, and even compute.log stops being updated for a few minutes.

Expected results:

We should be able to create several snapshots for different instances at the same time without timeouts.


Additional info:


2013-06-13 11:32:27.171 2966 ERROR nova.servicegroup.drivers.db [-] model server went away
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 92, in _report_state
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 627, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.conductor_rpcapi.service_update(context, service, values)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 365, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.call(context, msg, version='1.34')
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/proxy.py", line 80, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return rpc.call(context, self._get_topic(topic), msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/__init__.py", line 140, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return _get_impl().call(CONF, context, topic, msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 610, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rpc_amqp.get_connection_pool(conf, Connection))
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 612, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rv = list(rv)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 554, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.done()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.gen.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 551, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self._iterator.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 435, in iterconsume
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     yield self.ensure(_error_callback, _consume)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 379, in ensure
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     error_callback(e)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 420, in _error_callback
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     raise rpc_common.Timeout()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Timeout: Timeout while waiting on RPC response.
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db
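
To count how often the error above occurs across a long log, something like the following rough sketch could help. It assumes the log line layout seen in the excerpt (timestamp, PID, level, logger name, message); the function name find_heartbeat_failures is hypothetical and not part of nova.

```python
import re

# Line layout taken from the excerpt above:
# "<date> <time> <pid> <LEVEL> <logger> <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) ?(?P<msg>.*)$"
)


def find_heartbeat_failures(lines):
    """Return the timestamps of ERROR lines reporting 'model server went away'."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.rstrip("\n"))
        if (
            m
            and m.group("level") == "ERROR"
            and "model server went away" in m.group("msg")
        ):
            hits.append(m.group("ts"))
    return hits
```

It can be fed an open file object, for example the compute log on the affected host (the exact path depends on the installation), and the returned timestamps can then be compared with the times at which the snapshot requests were issued.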

Comment 2 Vladan Popovic 2013-10-10 12:28:32 UTC

I was unable to reproduce this behaviour in upstream Havana after many tries.
I created one snapshot for every instance in a loop, so there was no pause between the calls, and the snapshots were created without errors.
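
A back-to-back loop of this kind could look roughly like the sketch below. This is not the script actually used for the test; it assumes the nova command-line client is installed with credentials already set in the environment, and the helper names and the "snap-" name prefix are made up for illustration.

```python
import subprocess


def snapshot_commands(servers, prefix="snap-"):
    """Build one `nova image-create <server> <name>` command per server."""
    return [["nova", "image-create", s, prefix + s] for s in servers]


def create_snapshots(servers, run=subprocess.call):
    """Issue the snapshot requests one after another with no pause.

    Returns a dict mapping each server to the exit code of its command.
    `run` is injectable so the loop can be exercised without a real cloud.
    """
    return {cmd[2]: run(cmd) for cmd in snapshot_commands(servers)}
```

Calling create_snapshots with the names or IDs of the running instances sends all requests without waiting for any snapshot to finish, which matches the "no pause between the calls" condition described here.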

While testing on RHOS 4.0 I ran into an issue with qemu-img-rhev: https://bugzilla.redhat.com/show_bug.cgi?id=1016896
After that was fixed, the snapshots were created fine and there were no errors at all.

I did see this behaviour in Grizzly, though, after creating snapshots for a few machines. I'll investigate it and try to apply the patches that fix this issue.