Bug 974051

Summary: openstack-nova: instances state moves to 'shutoff' when we have time-outs on create snapshots for several instances

Product: Red Hat OpenStack
Component: openstack-nova
Version: unspecified
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 4.0
Reporter: Dafna Ron <dron>
Assignee: Vladan Popovic <vpopovic>
QA Contact: Ami Jeain <ajeain>
CC: dallan, dron, jkt, ndipanov, yeylon
Flags: vpopovic: needinfo+
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-10-23 09:59:48 UTC

Attachments: logs

Description Dafna Ron 2013-06-13 10:59:14 UTC
Created attachment 760591 [details]
logs

Description of problem:

I launched 10 instances and tried creating snapshots for all of them once the instances were running.

We get the following errors from nova:

2013-06-13 11:32:27.171 2966 ERROR nova.servicegroup.drivers.db [-] model server went away

2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Timeout: Timeout while waiting on RPC response.

and then some of the instances change state to SHUTOFF.

Version-Release number of selected component (if applicable):

openstack-nova-compute-2013.1.1-4.el6ost.noarch
openstack-nova-api-2013.1.1-4.el6ost.noarch
libvirt-0.10.2-18.el6_4.5.x86_64
qemu-img-rhev-0.12.1.2-2.355.el6_4.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.4.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create an image and launch 10 instances from the image on two different hosts
2. create snapshots for each of the instances (see the CLI sketch below)
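
A minimal CLI sketch of these steps, assuming the nova client of that era; the image name "haha-image" and the instance name prefix are placeholders, since the original image no longer exists:

    # boot 10 tiny instances from one image (flavor 1 = m1.tiny) so they land
    # on the two compute hosts
    for i in $(seq 1 10); do
        nova boot --image haha-image --flavor 1 "HAHA-$i"
    done

    # wait until all instances are ACTIVE
    nova list

    # snapshot every instance in quick succession
    for id in $(nova list | awk '/HAHA-/ {print $2}'); do
        nova image-create "$id" "snap-$id"
    done

    # watch for instances dropping to SHUTOFF while the snapshots run
    watch nova list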

Actual results:

We get time-out errors in the compute log, and some of the instances move to SHUTOFF and need a soft reboot.

Expected results:

Instances should not move to SHUTOFF.

Additional info:

[root@opens-vdsb ~(keystone_admin)]# nova list
+--------------------------------------+-------------------------------------------+---------+---------------------------+
| ID                                   | Name                                      | Status  | Networks                  |
+--------------------------------------+-------------------------------------------+---------+---------------------------+
| 1b1eb170-3bdb-496e-bcaa-2a3cd5078e06 | HAHA-1b1eb170-3bdb-496e-bcaa-2a3cd5078e06 | ACTIVE  | novanetwork=192.168.32.19 |
| 2e19d87f-44ac-4525-bccd-41f2d0350a92 | HAHA-2e19d87f-44ac-4525-bccd-41f2d0350a92 | SHUTOFF | novanetwork=192.168.32.17 |
| 7f4df8e2-77fc-495e-acea-467b039e1ccd | HAHA-7f4df8e2-77fc-495e-acea-467b039e1ccd | SHUTOFF | novanetwork=192.168.32.3  |
| 80fbf669-e8c5-422a-b2ff-90c34e09d8ec | HAHA-80fbf669-e8c5-422a-b2ff-90c34e09d8ec | ACTIVE  | novanetwork=192.168.32.4  |
| 9f0cb489-56d3-4258-93e6-0c927cf70352 | HAHA-9f0cb489-56d3-4258-93e6-0c927cf70352 | ACTIVE  | novanetwork=192.168.32.2  |
| a512a696-13f3-4e48-8154-d8de3ad35af4 | HAHA-a512a696-13f3-4e48-8154-d8de3ad35af4 | SHUTOFF | novanetwork=192.168.32.16 |
| d965ae10-6175-4c57-93f9-47b57c7a2907 | HAHA-d965ae10-6175-4c57-93f9-47b57c7a2907 | ACTIVE  | novanetwork=192.168.32.14 |
| da936d2d-0273-4f47-b6cb-24ea24577317 | HAHA-da936d2d-0273-4f47-b6cb-24ea24577317 | ACTIVE  | novanetwork=192.168.32.15 |
| fa7ea571-a0c1-4cef-9788-ef1e0cf1bf8d | HAHA-fa7ea571-a0c1-4cef-9788-ef1e0cf1bf8d | SHUTOFF | novanetwork=192.168.32.18 |
| 12893599-1418-4cf0-b74d-5f2200418a74 | haha10                                    | ACTIVE  | novanetwork=192.168.32.5  |
+--------------------------------------+-------------------------------------------+---------+---------------------------+

Comment 2 Vladan Popovic 2013-10-18 14:32:26 UTC
I struggled with this issue a lot while trying to reproduce it on 3.0; even after lots of testing with small instances (64 MB memory / 1 GB storage) I didn't always get the "model server went away" error. I hit it once a while ago but can't get the same behaviour again.

Could you please tell me more about how I could actually reproduce this on Grizzly?
Which flavor did you use?
Which image did you use to get this error?
Any other details would be more than welcome.


In 4.0 I never got this issue after numerous tests.

Comment 3 Dafna Ron 2013-10-18 14:39:22 UTC
I have only worked with Havana and not Grizzly...
The flavor was 1 (tiny).
These images no longer exist.

Not sure what else I can give you... two compute nodes -> 10 instances on each -> try to create snapshots for each instance.

Comment 4 Vladan Popovic 2013-10-21 17:09:04 UTC
I'm sorry, but I cannot reproduce this on my local setup after countless attempts.
I had a lot of trouble getting myself into a situation where I could test this at all (it probably requires more resources than I have on my laptop), but even after managing that, I couldn't get the instances to go into the SHUTOFF state.

Could you please provide me with access to the machines where I can reproduce and debug this?

Comment 6 Dafna Ron 2013-10-22 15:29:16 UTC
After speaking to Ami today I realised that I did open this bug on Grizzly.
I tested this with Vladan on Havana and it no longer reproduces there, so I think we can close this.

Comment 7 Vladan Popovic 2013-10-22 17:18:12 UTC
I agree; the described behaviour is not reproducible just by snapshotting 10 instances running on 2 hosts, so I guess we can close this bug.

However, we managed to reproduce this behaviour with Dafna on Havana by snapshotting multiple instances and then opening a VNC console for one of them. All of the instances that were on that node went into the SHUTOFF state.

Dafna please correct me if I'm wrong.
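
A rough sketch of that scenario, again using the standard nova CLI; which instance we opened the console for was not recorded, so <instance-id> is a placeholder:

    # with several image snapshots still in progress...
    for id in $(nova list | awk '/HAHA-/ {print $2}'); do
        nova image-create "$id" "snap-$id"
    done

    # ...request a noVNC console URL for one of the instances on the same
    # compute node and open it in a browser
    nova get-vnc-console <instance-id> novnc

    # all instances on that compute node then ended up in SHUTOFF
    nova list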

More investigation is needed on this issue. When we manage to reproduce it in 100% of the cases, I suggest opening another bug and describing the steps to reproduce in detail.

For now the traceback shows only this:


 _volume_snapshot_create /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:1594
     % image_id, instance=instance)
   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 309, in decorated_function
     *args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 2293, in snapshot_instance
     task_states.IMAGE_SNAPSHOT)
   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 2324, in _snapshot_instance
     update_task_state)
   File "/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py", line 1374, in snapshot
     virt_dom.managedSave(0)
   File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 187, in doit
     result = proxy_call(self._autowrap, f, *args, **kwargs)
   File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 147, in proxy_call
     rv = execute(f,*args,**kwargs)
   File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 76, in tworker
     rv = meth(*args,**kwargs)
   File "/usr/lib64/python2.6/site-packages/libvirt.py", line 863, in managedSave
     if ret == -1: raise libvirtError ('virDomainManagedSave() failed', dom=self)
 libvirtError: internal error received hangup / error event on socket

Comment 8 Vladan Popovic 2013-10-23 09:59:48 UTC
I'm closing this bug now because it's not reproducible in 4.0.