Bug 974051 - openstack-nova: instances state moves to 'shutoff' when we have time-outs on create snapshots for several instances
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: unspecified
Hardware: x86_64 Linux
Priority: unspecified  Severity: high
Target Milestone: ---
Target Release: 4.0
Assigned To: Vladan Popovic
QA Contact: Ami Jeain
Depends On:
Blocks:
Reported: 2013-06-13 06:59 EDT by Dafna Ron
Modified: 2016-04-26 12:46 EDT
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-10-23 05:59:48 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
vpopovic: needinfo+


Attachments
logs (1.71 MB, application/x-gzip)
2013-06-13 06:59 EDT, Dafna Ron

Description Dafna Ron 2013-06-13 06:59:14 EDT
Created attachment 760591 [details]
logs

Description of problem:

I launched 10 instances and tried creating snapshots for all of them once the instances were running.

We get errors from nova:

2013-06-13 11:32:27.171 2966 ERROR nova.servicegroup.drivers.db [-] model server went away

2013-06-13 11:32:27.171 2966 TRACE nova.servicegroup.drivers.db Timeout: Timeout while waiting on RPC response.

and then some of the instances change state to shutoff.

Version-Release number of selected component (if applicable):

openstack-nova-compute-2013.1.1-4.el6ost.noarch
openstack-nova-api-2013.1.1-4.el6ost.noarch
libvirt-0.10.2-18.el6_4.5.x86_64
qemu-img-rhev-0.12.1.2-2.355.el6_4.4.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.4.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create an image and launch 10 instances from the image on two different hosts
2. create snapshots for each of the instances 
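
For reference, a minimal sketch of these steps with python-novaclient (the credentials, image name, and instance names below are placeholders, and the Grizzly-era v1.1 client interface is assumed; placement across the two compute hosts is left to the scheduler):

# Hypothetical reproduction sketch: boot 10 instances from one image and
# snapshot them all at roughly the same time. Credentials, image name and
# instance names are placeholders.
import time
from novaclient.v1_1 import client

nova = client.Client("admin", "password", "admin",
                     auth_url="http://controller:5000/v2.0")

image = nova.images.find(name="test-image")    # assumed image name
flavor = nova.flavors.find(name="m1.tiny")     # flavor 1 (tiny)

servers = [nova.servers.create("HAHA-%02d" % i, image, flavor)
           for i in range(10)]

# Wait until every instance is ACTIVE before snapshotting.
while any(nova.servers.get(s.id).status != "ACTIVE" for s in servers):
    time.sleep(5)

# Request a snapshot of each instance.
for s in servers:
    s.create_image("snap-%s" % s.name)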

Actual results:

We get time-out errors in the compute log, and some of the instances move to shutoff and need a soft reboot.

Expected results:

Instances should not move to shutoff.

Additional info:

[root@opens-vdsb ~(keystone_admin)]# nova list
+--------------------------------------+-------------------------------------------+---------+---------------------------+
| ID                                   | Name                                      | Status  | Networks                  |
+--------------------------------------+-------------------------------------------+---------+---------------------------+
| 1b1eb170-3bdb-496e-bcaa-2a3cd5078e06 | HAHA-1b1eb170-3bdb-496e-bcaa-2a3cd5078e06 | ACTIVE  | novanetwork=192.168.32.19 |
| 2e19d87f-44ac-4525-bccd-41f2d0350a92 | HAHA-2e19d87f-44ac-4525-bccd-41f2d0350a92 | SHUTOFF | novanetwork=192.168.32.17 |
| 7f4df8e2-77fc-495e-acea-467b039e1ccd | HAHA-7f4df8e2-77fc-495e-acea-467b039e1ccd | SHUTOFF | novanetwork=192.168.32.3  |
| 80fbf669-e8c5-422a-b2ff-90c34e09d8ec | HAHA-80fbf669-e8c5-422a-b2ff-90c34e09d8ec | ACTIVE  | novanetwork=192.168.32.4  |
| 9f0cb489-56d3-4258-93e6-0c927cf70352 | HAHA-9f0cb489-56d3-4258-93e6-0c927cf70352 | ACTIVE  | novanetwork=192.168.32.2  |
| a512a696-13f3-4e48-8154-d8de3ad35af4 | HAHA-a512a696-13f3-4e48-8154-d8de3ad35af4 | SHUTOFF | novanetwork=192.168.32.16 |
| d965ae10-6175-4c57-93f9-47b57c7a2907 | HAHA-d965ae10-6175-4c57-93f9-47b57c7a2907 | ACTIVE  | novanetwork=192.168.32.14 |
| da936d2d-0273-4f47-b6cb-24ea24577317 | HAHA-da936d2d-0273-4f47-b6cb-24ea24577317 | ACTIVE  | novanetwork=192.168.32.15 |
| fa7ea571-a0c1-4cef-9788-ef1e0cf1bf8d | HAHA-fa7ea571-a0c1-4cef-9788-ef1e0cf1bf8d | SHUTOFF | novanetwork=192.168.32.18 |
| 12893599-1418-4cf0-b74d-5f2200418a74 | haha10                                    | ACTIVE  | novanetwork=192.168.32.5  |
+--------------------------------------+-------------------------------------------+---------+---------------------------+
Comment 2 Vladan Popovic 2013-10-18 10:32:26 EDT
I struggled with this issue a lot. Trying to reproduce it in 3.0, I didn't always get the "model server went away" error after lots of testing with small instances (64MB mem / 1GB storage). I got it a while ago but can't get the same behaviour again.

Could you please tell me more about how I can actually reproduce this in Grizzly?
Which flavor did you use?
Which image did you use to get this error?
Any other details would be more than welcome.


In 4.0 I never got this issue after numerous tests.
Comment 3 Dafna Ron 2013-10-18 10:39:22 EDT
I have only worked with Havana and not Grizzly...
The flavour was 1 (tiny).
These images no longer exist.

Not sure what else to give you... two computes -> 10 instances on each -> try to create snapshots for each instance.
Comment 4 Vladan Popovic 2013-10-21 13:09:04 EDT
I'm sorry, but I cannot reproduce this on my local setup after countless attempts.
I had a lot of trouble getting into a situation where I could test this, and it probably requires more resources than I have on my laptop, but after managing that, I couldn't get the instances to go into the shutoff state.

Could you please provide me with access to the machines where I can reproduce and debug this?
Comment 6 Dafna Ron 2013-10-22 11:29:16 EDT
After speaking to Ami today I realised that I did open this bug on Grizzly.
I tested this with Vladan on Havana and it no longer reproduces there, so I think we can close this.
Comment 7 Vladan Popovic 2013-10-22 13:18:12 EDT
I agree; the described behaviour is not reproducible just by taking snapshots of 10 instances running on 2 hosts, so I guess we can close this bug.

However, we managed to reproduce this behaviour with Dafna in Havana by snapshotting multiple instances and then opening a VNC console for one of them. All of the instances that were on that node went into the Shutoff state.

Dafna please correct me if I'm wrong.

More investigation is needed on this issue. When we manage to reproduce it in 100% of the cases, I suggest opening another bug and describing the steps to reproduce in detail.

For now the traceback shows only this:


 _volume_snapshot_create /usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py:1594
     % image_id, instance=instance)
   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 309, in decorated_function
     *args, **kwargs)
   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 2293, in snapshot_instance
     task_states.IMAGE_SNAPSHOT)
   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 2324, in _snapshot_instance
     update_task_state)
   File "/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py", line 1374, in snapshot
     virt_dom.managedSave(0)
   File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 187, in doit
     result = proxy_call(self._autowrap, f, *args, **kwargs)
   File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 147, in proxy_call
     rv = execute(f,*args,**kwargs)
   File "/usr/lib/python2.6/site-packages/eventlet/tpool.py", line 76, in tworker
     rv = meth(*args,**kwargs)
   File "/usr/lib64/python2.6/site-packages/libvirt.py", line 863, in managedSave
     if ret == -1: raise libvirtError ('virDomainManagedSave() failed', dom=self)
 libvirtError: internal error received hangup / error event on socket
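
For context, the call that raises here is libvirt's managedSave(), which the nova libvirt driver runs through an eventlet tpool proxy (see the driver.py and tpool.py frames above). managedSave() suspends the guest and writes its state to disk, leaving the domain stopped until it is started again, which is consistent with the instances ending up in Shutoff when this path errors out. A minimal standalone sketch of that call, assuming a placeholder connection URI and domain name:

# Standalone sketch of the failing call from the traceback above.
# The connection URI and domain (instance) name are placeholders.
import libvirt
from eventlet import tpool

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-00000001")

try:
    # Blocking libvirt calls go through a tpool proxy so they do not block
    # the eventlet hub; managedSave(0) saves the guest's state to disk and
    # leaves the domain shut off until it is started again.
    tpool.execute(dom.managedSave, 0)
except libvirt.libvirtError as exc:
    # e.g. "internal error received hangup / error event on socket"
    print("managedSave failed: %s" % exc)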
Comment 8 Vladan Popovic 2013-10-23 05:59:48 EDT
I'm closing this bug now because it's not reproducible in 4.0.
