Bug 974045

Summary: openstack-nova: 'model server went away' ERROR after creating more than 2 snapshots from different instances
Product: Red Hat OpenStack Reporter: Dafna Ron <dron>
Component: openstack-nova Assignee: Vladan Popovic <vpopovic>
Status: CLOSED CURRENTRELEASE QA Contact: Haim <hateya>
Severity: high Docs Contact:
Priority: unspecified    
Version: unspecified CC: abaron, dallan, jkt, ndipanov, shedoh, yeylon
Target Milestone: ---   
Target Release: 4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-10-18 14:33:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Dafna Ron 2013-06-13 10:49:45 UTC
Created attachment 760570
logs

Description of problem:

I have 10 instances running on two different hosts.
When trying to create snapshots of the 10 running instances, we constantly get 'model server went away' ERRORs.

Version-Release number of selected component (if applicable):

openstack-nova-compute-2013.1.1-4.el6ost.noarch
openstack-nova-api-2013.1.1-4.el6ost.noarch
libvirt-0.10.2-18.el6_4.5.x86_64
qemu-img-rhev-0.12.1.2-2.355.el6_4.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.4.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create an image and launch 10 instances from it.
2. Once the instances are running, create a snapshot of each instance.

Actual results:

After the third snapshot, "model server went away" ERRORs start appearing in the log, the 'nova <server> <name>' command takes a long time to return, and even compute.log stops updating for a few minutes.

Expected results:

We should be able to create several snapshots of different instances at the same time without timeouts.


Additional info:


2013-06-13 11:32:27.171 2966 ERROR nova.servicegroup.drivers.db [-] model server went away
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 92, in _report_state
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 627, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.conductor_rpcapi.service_update(context, service, values)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 365, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.call(context, msg, version='1.34')
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/proxy.py", line 80, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return rpc.call(context, self._get_topic(topic), msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/__init__.py", line 140, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return _get_impl().call(CONF, context, topic, msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 610, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rpc_amqp.get_connection_pool(conf, Connection))
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 612, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rv = list(rv)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 554, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.done()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.gen.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 551, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self._iterator.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 435, in iterconsume
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     yield self.ensure(_error_callback, _consume)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 379, in ensure
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     error_callback(e)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 420, in _error_callback
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     raise rpc_common.Timeout()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Timeout: Timeout while waiting on RPC response.
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db
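
To gauge how often the state report fails while snapshots are running, the log can be scanned for the ERROR line shown above. The following is only a rough sketch: the regular expression is derived from the line layout in this excerpt (timestamp, PID, level, logger, request context, message), and the function name find_report_state_failures is made up for illustration.

```python
import re

# Layout assumed from the excerpt above:
#   <timestamp> <pid> <LEVEL> <logger> [<context>] <message>
# TRACE lines carry no "[...]" context field, so they do not match.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[[^\]]*\] (?P<msg>.*)$"
)


def find_report_state_failures(lines):
    """Return (timestamp, pid) for every 'model server went away' ERROR line."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue
        if (match.group("level") == "ERROR"
                and "model server went away" in match.group("msg")):
            hits.append((match.group("ts"), int(match.group("pid"))))
    return hits
```

It could be fed an open log file, for example `find_report_state_failures(open(path))`, where `path` points at the compute log on the affected host (the exact location depends on the installation).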

Comment 2 Vladan Popovic 2013-10-10 12:28:32 UTC
I was unable to reproduce this behaviour in upstream Havana after many tries.
I created one snapshot for every instance in a loop, so there was no pause between the calls, and the snapshots were created fine.
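
For reference, a back-to-back loop of this kind could look roughly like the sketch below. It is not the exact script used here: it assumes the `nova` command-line client is installed with credentials already set in the environment, and the instance names and the "snap-<server>" naming scheme are placeholders.

```python
import subprocess


def build_snapshot_commands(servers, prefix="snap"):
    """Return one `nova image-create <server> <name>` argv list per server.

    The "<prefix>-<server>" snapshot name is an illustrative assumption.
    """
    return [["nova", "image-create", server, "%s-%s" % (prefix, server)]
            for server in servers]


def snapshot_all(servers, run=subprocess.call):
    """Issue the snapshot requests one after another, with no pause.

    `run` is injectable so the loop can be exercised without a real cloud;
    by default each command is executed and its exit code collected.
    """
    return [run(cmd) for cmd in build_snapshot_commands(servers)]


if __name__ == "__main__":
    # Dry run: only print the commands. The names are hypothetical;
    # substitute the ones reported by `nova list`.
    for cmd in build_snapshot_commands(["vm-%d" % i for i in range(1, 11)]):
        print(" ".join(cmd))
```

Calling `snapshot_all(...)` with real instance names issues the requests; without `--poll` each `nova image-create` call returns once the request is accepted, so the snapshots overlap on the compute hosts.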

While testing on RHOS 4.0 I ran into an issue with qemu-img-rhev: https://bugzilla.redhat.com/show_bug.cgi?id=1016896
After fixing this, the snapshots were created fine and there was no error at all.

I did get this behaviour in Grizzly, though, after creating snapshots for a few machines. I'll investigate it and try to apply the patches that fix this issue.