Bug 907877

Summary: vdsm: we are re-running vm that raised libvirt error domain is already active (no exception raised by vdsm to engine)
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: vdsm Assignee: Nobody's working on this, feel free to take it <nobody>
Status: CLOSED DUPLICATE QA Contact:
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.0 CC: bazulay, hateya, iheim, lpeer, michal.skrivanek, ykaul
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: virt
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-14 12:38:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Dafna Ron 2013-02-05 13:03:07 UTC
Created attachment 693360
logs

Description of problem:

I had a VM that was stuck in "wait for launch", so after a minute I decided to power off the VM and restart it.
After I powered off the VM and restarted it, it failed to run on the same host again and the engine re-ran it on the second host.
Looking at the error in vdsm, libvirt failed to start the VM because the domain is already active in libvirt.
However, since no specific error was raised to the engine, the engine restarted the VM on the second host.

The failure is listed in the event log:
VM NNNNN is down. Exit message: Requested operation is not valid: domain is already active as 'NNNNN'.

But I cannot see any exception that would prevent the engine from re-running the VM.

Version-Release number of selected component (if applicable):

sf5
vdsm-4.10.2-5.0.el6ev.x86_64
libvirt-0.10.2-18.el6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. create a vm and run it
2. suspend the vm
3. create a live snapshot while the vm is suspended
4. once the snapshot was created resume the vm
5. power off the vm
6. run the vm again
7. vm will be stuck in wait for launch -> power off
8. try to start the vm again. 
  
Actual results:

we are re-running a domain on a second host when the domain already exists in libvirt. 

Expected results:

We should not re-run a VM if the domain already exists in libvirt.
An exception should be raised to the engine.
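
To illustrate the expected behaviour, here is a minimal sketch, not vdsm's actual code. It assumes the VM start path could translate the "domain is already active" failure into a distinct error that the caller treats as non-rerunnable. All names (`FakeLibvirtError`, `DomainAlreadyActive`, `start_vm`, `should_rerun`) are hypothetical. `FakeLibvirtError` stands in for `libvirt.libvirtError` so the sketch runs without libvirt installed. With the real binding one would probably also check `e.get_error_code()` against `libvirt.VIR_ERR_OPERATION_INVALID` instead of relying on message text alone.

```python
# Hypothetical sketch: none of these names are vdsm's or the engine's real API.

ALREADY_ACTIVE_MARKER = "domain is already active"


class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError so the sketch runs without libvirt."""


class DomainAlreadyActive(Exception):
    """Hypothetical error meaning: do not rerun this VM on another host."""


def start_vm(create_domain, domxml):
    """Call create_domain(domxml) and translate 'already active' failures.

    create_domain is any callable playing the role of connection.createXML.
    """
    try:
        return create_domain(domxml)
    except FakeLibvirtError as e:
        if ALREADY_ACTIVE_MARKER in str(e):
            # The domain already exists on this host, so starting it
            # elsewhere could leave the same VM running twice.
            raise DomainAlreadyActive(str(e))
        raise


def should_rerun(exc):
    """Assumed caller-side policy: rerun on any failure except 'already active'."""
    return not isinstance(exc, DomainAlreadyActive)
```

With this shape, the error from the traceback in this report would surface as `DomainAlreadyActive`, and `should_rerun` would return `False` for it, while other start failures would still be eligible for rerun.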

Additional info:

first host: 


virsh > list
 Id    Name                           State
----------------------------------------------------
 5     KKKKK                          shut off
 8     NNNNN                          shut off


second host: 

 Id    Name                           State
----------------------------------------------------
 29    KKKKK                          running
 31    NNNNN                          running
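
The first host's listing above shows domains that have a numeric Id but are reported as "shut off". As a rough sketch, the following flags such rows in `virsh list` output. It assumes the usual convention that inactive domains show `-` in the Id column, so a numeric Id together with "shut off" is treated as suspicious. The function names are hypothetical and the parsing is deliberately simple.

```python
def parse_virsh_list(text):
    """Return (id, name, state) tuples from 'virsh list' style output."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Skip the prompt, the header, the dashed separator, and blank lines.
        if len(parts) < 3 or not (parts[0].isdigit() or parts[0] == "-"):
            continue
        rows.append((parts[0], parts[1], parts[2].strip()))
    return rows


def inconsistent_domains(rows):
    """Names of domains that have a numeric Id but are reported as shut off."""
    return [name for dom_id, name, state in rows
            if dom_id.isdigit() and state == "shut off"]


FIRST_HOST = """\
 Id    Name                           State
----------------------------------------------------
 5     KKKKK                          shut off
 8     NNNNN                          shut off
"""

print(inconsistent_domains(parse_virsh_list(FIRST_HOST)))
```

Run against the second host's listing, where both domains are "running", the same check would return an empty list.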


  File "/usr/share/vdsm/vm.py", line 662, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1518, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 104, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2645, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Requested operation is not valid: domain is already active as 'KKKKK'


2013-02-05 14:23:00,397 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-44) [437e7a48] Rerun vm 11d0501a-59aa-4566-81f5-be8c5eeced79. Called from vds gold-vdsd

Comment 1 Dafna Ron 2013-02-05 14:39:18 UTC
sorry - I forgot a step:

Steps to Reproduce:
1. create a vm and run it
2. suspend the vm
3. create a live snapshot while the vm is suspended
4. once the snapshot was created resume the vm
5. power off the vm
6. delete the snapshot
7. run the vm again
8. vm will be stuck in wait for launch -> power off
9. try to start the vm again.

Comment 2 Dafna Ron 2013-02-05 15:24:19 UTC
After some more tests, the scenario turns out to be simpler.
The domain is listed as existing in libvirt because of a bug: after suspend -> resume -> power off -> power on of a VM, the VM starts with status "shut off" in libvirt, so vdsm does not get a pid and the VM is stuck in wait for launch.

https://bugzilla.redhat.com/show_bug.cgi?id=907972

Comment 3 Michal Skrivanek 2013-02-14 12:38:17 UTC
then I'd really dupe it, if you don't mind. We need to avoid 907972 in the first place

*** This bug has been marked as a duplicate of bug 907972 ***