Created attachment 760570
logs

Description of problem:
I have 10 instances running on two different hosts. When trying to create snapshots of the 10 running instances, we constantly get 'model server went away' ERRORs.

Version-Release number of selected component (if applicable):
openstack-nova-compute-2013.1.1-4.el6ost.noarch
openstack-nova-api-2013.1.1-4.el6ost.noarch
libvirt-0.10.2-18.el6_4.5.x86_64
qemu-img-rhev-0.12.1.2-2.355.el6_4.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.4.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an image and launch 10 instances from the image.
2. Once the instances are running, create a snapshot of each instance.

Actual results:
After the 3rd snapshot we start getting "model server went away" ERRORs in the log, the 'nova <server> <name>' commands take a long time to return, and even compute.log stops updating for a few minutes.

Expected results:
We should be able to create several snapshots of different instances at the same time without timeouts.

Additional info:
2013-06-13 11:32:27.171 2966 ERROR nova.servicegroup.drivers.db [-] model server went away
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 92, in _report_state
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 627, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.conductor_rpcapi.service_update(context, service, values)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 365, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.call(context, msg, version='1.34')
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/proxy.py", line 80, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return rpc.call(context, self._get_topic(topic), msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/__init__.py", line 140, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return _get_impl().call(CONF, context, topic, msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 610, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rpc_amqp.get_connection_pool(conf, Connection))
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 612, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rv = list(rv)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 554, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.done()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.gen.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 551, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self._iterator.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 435, in iterconsume
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     yield self.ensure(_error_callback, _consume)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 379, in ensure
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     error_callback(e)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 420, in _error_callback
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     raise rpc_common.Timeout()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Timeout: Timeout while waiting on RPC response.
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db

I was unable to reproduce this behaviour in upstream Havana after many tries. I created one snapshot for every instance in a loop, so there was no pause between the calls, and the snapshots were created fine.

While testing on RHOS 4.0 I ran into an issue with qemu-img-rhev:
https://bugzilla.redhat.com/show_bug.cgi?id=1016896
After fixing this, the snapshots were created fine and there was no error at all.

I did get this behaviour in Grizzly, though, after creating snapshots for a few machines. I'll investigate it and try to apply the patches that fix this issue.
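
For anyone who wants to repeat the back-to-back snapshot loop described above, here is a minimal sketch. It assumes the snapshots are taken with the `nova image-create <server> <name>` CLI subcommand (the comments in this bug do not say which command was used), and the helper names, the instance names and the `snap-` prefix are all hypothetical.

```python
import subprocess


def snapshot_commands(servers, prefix="snap"):
    """Build one `nova image-create` command per server.

    The subcommand and the snapshot naming scheme are assumptions,
    not something taken from this bug report.
    """
    return [["nova", "image-create", server, "%s-%s" % (prefix, server)]
            for server in servers]


def snapshot_all(servers, run=subprocess.call):
    """Issue the snapshot requests one after another with no pause.

    `run` is injectable so the loop can be dry-run without a nova client;
    by default each command is executed and its exit code collected.
    """
    return [run(cmd) for cmd in snapshot_commands(servers)]
```

Calling `snapshot_all(["instance-%d" % i for i in range(1, 11)])` would request ten snapshots in a row; passing `run=print` instead only prints the commands, which is a cheap way to check them before running against a real deployment.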