974045 – openstack-nova: 'model server went away' ERROR after creating more than 2 snapshots from different instances

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	unspecified
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.0
Assignee:	Vladan Popovic
QA Contact:	Haim
Docs Contact:
URL:
Whiteboard:	storage
Depends On:
Blocks:

Reported:	2013-06-13 10:49 UTC by Dafna Ron
Modified:	2019-09-09 16:31 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-10-18 14:33:06 UTC
Target Upstream Version:
Embargoed:

Attachments
logs (1.71 MB, application/x-gzip) 2013-06-13 10:49 UTC, Dafna Ron	no flags

Description Dafna Ron 2013-06-13 10:49:45 UTC

Created attachment 760570
logs

Description of problem:

I have 10 instances running on two different hosts.
When we try to create snapshots from the 10 running instances, we constantly get 'model server went away' ERRORs.

Version-Release number of selected component (if applicable):

openstack-nova-compute-2013.1.1-4.el6ost.noarch
openstack-nova-api-2013.1.1-4.el6ost.noarch
libvirt-0.10.2-18.el6_4.5.x86_64
qemu-img-rhev-0.12.1.2-2.355.el6_4.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.4.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Create an image and launch 10 instances from the image.
2. Once the instances are running, create a snapshot for each of the instances.
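
The steps above can be scripted. The following is a minimal sketch, assuming the snapshots are taken with the `nova image-create <server> <name>` CLI command (the report itself does not name the subcommand). The helper names `snapshot_commands` and `run_all` and the `snap-` prefix are hypothetical and only serve the illustration. The commands are started back to back, without waiting for one snapshot to finish before requesting the next.

```python
import subprocess


def snapshot_commands(servers, prefix="snap-"):
    """Build one 'nova image-create' command per instance name."""
    return [["nova", "image-create", name, prefix + name] for name in servers]


def run_all(commands):
    """Start every command without pausing in between, then collect exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Calling `run_all(snapshot_commands(names))` with the names of the 10 running instances would request all snapshots at once. It needs the `nova` client on the PATH and valid credentials in the environment.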

Actual results:

After the third snapshot, "model server went away" ERRORs start appearing in the log, the command 'nova <server> <name>' takes a long time to return, and even compute.log stops updating for a few minutes.

Expected results:

We should be able to create several snapshots for different instances at the same time without timeouts.


Additional info:


2013-06-13 11:32:27.171 2966 ERROR nova.servicegroup.drivers.db [-] model server went away
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 92, in _report_state
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     service.service_ref, state_catalog)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 627, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.conductor_rpcapi.service_update(context, service, values)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 365, in service_update
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return self.call(context, msg, version='1.34')
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/proxy.py", line 80, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return rpc.call(context, self._get_topic(topic), msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/__init__.py", line 140, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     return _get_impl().call(CONF, context, topic, msg, timeout)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 610, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rpc_amqp.get_connection_pool(conf, Connection))
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 612, in call
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     rv = list(rv)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 554, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.done()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self.gen.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 551, in __iter__
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     self._iterator.next()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 435, in iterconsume
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     yield self.ensure(_error_callback, _consume)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 379, in ensure
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     error_callback(e)
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 420, in _error_callback
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db     raise rpc_common.Timeout()
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Timeout: Timeout while waiting on RPC response.
2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db
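
To see how often the error recurs, the log can be scanned for the ERROR line shown above. The following is a minimal sketch, assuming the line layout seen in this excerpt (timestamp, PID, level, logger name, message). The function name `find_model_server_errors` is illustrative and not part of nova.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2013-06-13 11:32:27.171 2966 ERROR nova.servicegroup.drivers.db [-] model server went away
ERROR_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<pid>\d+) ERROR (?P<logger>\S+) .*model server went away"
)


def find_model_server_errors(lines):
    """Return (timestamp, pid) for every 'model server went away' ERROR line."""
    hits = []
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            # %f accepts the three-digit millisecond field used in the log.
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            hits.append((ts, int(match.group("pid"))))
    return hits
```

The function only iterates over lines, so an open file object for the compute log can be passed in directly. TRACE lines are skipped, so each traceback is counted once.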

Comment 2 Vladan Popovic 2013-10-10 12:28:32 UTC

I was unable to reproduce this behaviour in upstream Havana after many tries.
I created one snapshot for every instance in a loop, so there was no pause between the calls, and all snapshots were created successfully.

While testing on RHOS 4.0 I ran into an issue with qemu-img-rhev: https://bugzilla.redhat.com/show_bug.cgi?id=1016896
Once that was fixed, the snapshots were created without any errors.

I did get this behaviour in Grizzly, though, after creating snapshots for a few machines. I'll investigate it and try to apply the patches that fix this issue.
