Bug 1010980

Summary: Vm fails to start after OS installation.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Leonid Natapov <lnatapov>
Component: ovirt-hosted-engine-setup
Assignee: Sandro Bonazzola <sbonazzo>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: urgent
Docs Contact:
Priority: urgent
Version: unspecified
CC: acathrow, dfediuck, fsimonce, gpadgett, iheim, michal.skrivanek, oschreib, pstehlik, sbonazzo
Target Milestone: ---
Keywords: Triaged
Target Release: 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: integration
Fixed In Version: ovirt-hosted-engine-setup-1.0.0-0.7.beta2.el6ev
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 16:53:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description  Flags
logs         none
logs         none

Description Leonid Natapov 2013-09-23 12:42:33 UTC
Created attachment 801592 [details]
logs

Description of problem:

The VM is unable to start after OS installation.

Scenario:
ovirt-hosted-engine-setup creates the VM.
The user connects to the VM and installs the OS.
After a successful OS installation the VM reboots and is destroyed.

At this point setup asks the user whether the OS installation was successful.
If the answer is yes, setup should create and start the VM again.

However, the VM fails to start.
Here is the traceback from the vdsm.log file:

Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 356, in teardownVolumePath
    res = self.irs.teardownImage(drive['domainID'],
  File "/usr/share/vdsm/vm.py", line 1361, in __getitem__
    raise KeyError(key)
KeyError: 'domainID'

The full vdsm log and the ovirt-hosted-engine-setup.log file are attached.
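
The traceback shows a dictionary-style lookup on a drive object failing with KeyError. The sketch below illustrates one way such an error can arise, assuming the object maps item access onto its attributes. The class `Drive`, the helper `describe_lookup`, and the sample values are hypothetical stand-ins, not vdsm's actual code.

```python
class Drive:
    """Hypothetical stand-in for a device object with dict-style access."""

    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)

    def __getitem__(self, key):
        # Item access falls back to attributes; a missing one is a KeyError
        try:
            return getattr(self, str(key))
        except AttributeError:
            raise KeyError(key)


def describe_lookup(drive, key):
    """Return the value for key, or a message naming the missing key."""
    try:
        return drive[key]
    except KeyError as err:
        return "missing key: %s" % err


# Hypothetical drives: one lacks a domainID attribute, one has it
path_only = Drive(path="/some/path")
with_domain = Drive(path="/some/path", domainID="hypothetical-domain-id")
```

Under these assumptions, any code path that reads `drive['domainID']` without checking first would fail for the first drive and succeed for the second.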

Comment 1 Leonid Natapov 2013-09-23 12:44:50 UTC
vdsm-4.12.0-127.gitedb88bf.el6ev.x86_64
libvirt-0.10.2-18.el6_4.9.x86_64
ovirt-hosted-engine-setup-1.0.0-0.4.1.beta.1.el6.noarch
ovirt-hosted-engine-ha-0.1.0-0.1.beta.1.el6.noarch

Comment 2 Leonid Natapov 2013-09-23 12:45:16 UTC
Created attachment 801604 [details]
logs

Comment 3 Greg Padgett 2013-09-23 13:55:09 UTC
From what I can tell, the issue is that if qemu exits, vdsm stats for the vm become stale instead of being removed.  This patch should at least serve as a workaround, if not a complete fix:

http://gerrit.ovirt.org/19470

Eduardo, Federico, your thoughts on this would be most welcome.  Thanks!

Comment 4 Greg Padgett 2013-09-23 21:32:38 UTC
In http://gerrit.ovirt.org/#/c/19470/ danken noted that the stats need to stay around until the engine can retrieve the status.  I then did the following test:

1. Start the vm
2. Power off the vm via the console
3. Confirm the bug was reproduced (starting vm fails with "Virtual machine already exists")
4. Destroy the vm with vdsClient
5. Start the vm - this time it succeeds.

Sandro, perhaps adding a call in hosted-engine-setup to destroy the vm after os installation (as in step 4) would solve the issue?
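
The test sequence above can be modelled with a toy registry. This is only a sketch of the behaviour described in this comment: `FakeHost` and `start_vm` are hypothetical names, and the real vdsm bookkeeping is more involved.

```python
class FakeHost:
    """Toy model of a host that keeps entries for VMs that went Down."""

    def __init__(self):
        self.vms = {}  # vm_id -> status string

    def create(self, vm_id):
        if vm_id in self.vms:
            # A stale "Down" entry still blocks creation
            raise RuntimeError("Virtual machine already exists")
        self.vms[vm_id] = "Up"

    def guest_poweroff(self, vm_id):
        # The guest shut itself down; the entry stays so status can be read
        self.vms[vm_id] = "Down"

    def destroy(self, vm_id):
        # An explicit destroy removes the entry
        self.vms.pop(vm_id, None)


def start_vm(host, vm_id, destroy_stale=False):
    """Create the VM, optionally clearing a stale Down entry first."""
    if destroy_stale and host.vms.get(vm_id) == "Down":
        host.destroy(vm_id)
    host.create(vm_id)
    return host.vms[vm_id]
```

Calling `start_vm` after `guest_poweroff` raises the "already exists" error unless `destroy_stale=True` is passed, which corresponds to step 4.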

Comment 5 Sandro Bonazzola 2013-09-24 07:05:37 UTC
(In reply to Greg Padgett from comment #4)

> Sandro, perhaps adding a call in hosted-engine-setup to destroy the vm after
> os installation (as in step 4) would solve the issue?

Greg, the hosted engine VM is created with the 'destroy' action on the 'on_poweroff', 'on_reboot', and 'on_crash' events.
http://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-setup.git;a=blob;f=src/vdsm_hooks/hostedengine.py;h=e9e2ac42fe0981606d89a29a7b56cacd5809e928;hb=HEAD#l36
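
In libvirt domain XML these lifecycle actions are the `<on_poweroff>`, `<on_reboot>`, and `<on_crash>` elements. The sketch below shows how they could be set to `destroy` with the standard library. It is not the linked hook: `set_lifecycle_action` and the trimmed `SAMPLE` definition are hypothetical.

```python
import xml.etree.ElementTree as ET

LIFECYCLE_EVENTS = ("on_poweroff", "on_reboot", "on_crash")


def set_lifecycle_action(domain_xml, action="destroy"):
    """Return domain XML with every lifecycle event set to `action`."""
    root = ET.fromstring(domain_xml)
    for event in LIFECYCLE_EVENTS:
        node = root.find(event)
        if node is None:
            # Add the element when the definition does not have it yet
            node = ET.SubElement(root, event)
        node.text = action
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed domain definition
SAMPLE = (
    "<domain type='kvm'><name>HostedEngine</name>"
    "<on_reboot>restart</on_reboot></domain>"
)
```

Existing elements are overwritten rather than duplicated, so running the function twice yields the same result.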

It also already issues a destroy command if the above is not honored:
after OS installation
http://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-setup.git;a=blob;f=src/plugins/ovirt-hosted-engine-setup/vm/runvm.py;h=32d82a15fc8d312c9e06227d5a9920f30ba1bcbc;hb=HEAD#l297

and after engine liveliness validation before starting ha daemons:
http://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-setup.git;a=blob;f=src/plugins/ovirt-hosted-engine-setup/ha/ha_services.py;h=a96fce43ad77c90d6824875e1da12476296eb3a1;hb=HEAD#l70

Comment 6 Greg Padgett 2013-09-27 21:10:56 UTC
(In reply to Sandro Bonazzola from comment #5)
> (In reply to Greg Padgett from comment #4)
> 
> > Sandro, perhaps adding a call in hosted-engine-setup to destroy the vm after
> > os installation (as in step 4) would solve the issue?
> 
> Greg, the hosted engine VM is created with the 'destroy' action on the
> 'on_poweroff', 'on_reboot', and 'on_crash' events.
> http://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-setup.git;a=blob;f=src/vdsm_hooks/hostedengine.py;h=e9e2ac42fe0981606d89a29a7b56cacd5809e928;hb=HEAD#l36
> 
> It also already issues a destroy command if the above is not honored:
> after OS installation
> http://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-setup.git;a=blob;f=src/plugins/ovirt-hosted-engine-setup/vm/runvm.py;h=32d82a15fc8d312c9e06227d5a9920f30ba1bcbc;hb=HEAD#l297
> 
> and after engine liveliness validation before starting ha daemons:
> http://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-setup.git;a=blob;f=src/plugins/ovirt-hosted-engine-setup/ha/ha_services.py;h=a96fce43ad77c90d6824875e1da12476296eb3a1;hb=HEAD#l70

I did some more testing... I think there are a few things going on:

1. I could be mistaken, but I think the on_poweroff/etc hooks affect libvirt state (or at least, I don't see anything in vdsm that watches them).

2. The `hosted-engine --deploy` operation does indeed run the destroy command, as you mentioned above.  It appears I ran into this failure because I had restarted the vm using `hosted-engine --vm-start`, changed some things, and shut it down again.  Because deploy had already destroyed the vm and displayed the prompt asking whether the OS had installed successfully, it didn't destroy it again.

You could say it's my fault for starting the vm out-of-band from the deployment, rather than just answering "No" and letting the deployment code restart the vm for me.  We could work around this either by adding more messaging so people don't go rogue like me and start the vm by hand, or perhaps by adding a vdsm destroy command somewhere in the `hosted-engine --vm-start` flow, but only if the vm is not already running, of course.

Comment 7 Sandro Bonazzola 2013-10-02 09:07:43 UTC
(In reply to Greg Padgett from comment #6)

> You could say it's my fault for starting the vm out-of-band from the
> deployment, rather than just answering "No" and letting the deployment code
> restart the vm for me.  We could work around this by either adding more
> messaging so people don't go rogue like me and start the vm by hand, or
> perhaps by adding a vdsm destroy command somewhere in the
> `hosted-engine --vm-start` flow--only if it's not running already of course.

I'm not sure that destroying the VM when --vm-start is called is a good idea.
As Dan pointed out:

Whatever started the Vm should monitor it and issue the destroy verb when it finds the VM has gone Down.

So if the user starts the VM with --vm-start, it should also be the user who calls --vm-poweroff after the shutdown.
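
The "monitor it and issue the destroy verb" rule could look roughly like the loop below. This is a sketch under stated assumptions: `get_status` and `destroy` are hypothetical callables supplied by whatever started the VM, and the "Down" status string is assumed.

```python
import time


def watch_and_destroy(get_status, destroy, vm_id,
                      poll_seconds=1.0, max_polls=60):
    """Poll until the VM reports Down, then destroy it.

    Returns True if destroy was issued, False if polling gave up first.
    """
    for _ in range(max_polls):
        if get_status(vm_id) == "Down":
            destroy(vm_id)
            return True
        time.sleep(poll_seconds)
    return False
```

Bounding the loop with `max_polls` keeps the watcher from running forever if the VM never goes Down.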

However:
- I think that the destroy on shutdown requested by the hook should be honored.
- I think that I can add some additional checks before trying to create the VM, checking whether the user has done something like you did, leaving a stale VM around, and telling them to clean it up.
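
A pre-create check of the kind proposed in the second point might be sketched as follows. The function name, the `vm_list` structure (dicts with 'vmId' and 'status' keys), and the message wording are assumptions for illustration, not the merged patch.

```python
def check_no_stale_vm(vm_list, vm_id):
    """Raise RuntimeError if vm_id is already known to the host.

    vm_list is a hypothetical list of dicts with 'vmId' and 'status' keys.
    """
    for vm in vm_list:
        if vm["vmId"] != vm_id:
            continue
        if vm["status"] == "Down":
            # Stale entry left behind by an out-of-band start and shutdown
            raise RuntimeError(
                "VM %s exists in Down state; clean it up with "
                "'hosted-engine --vm-poweroff' and retry" % vm_id
            )
        raise RuntimeError("VM %s is already running" % vm_id)
```

Failing early with an explicit message leaves the cleanup decision to the user instead of destroying the VM implicitly.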

Comment 9 Sandro Bonazzola 2013-10-21 14:28:31 UTC
I assume you've moved this on me for:

> - I think that I can add some additional checks before trying to create the
> VM, checking whether the user has done something like you did, leaving a stale
> VM around, and telling them to clean it up.

right?

Comment 10 Greg Padgett 2013-10-21 15:02:28 UTC
(In reply to Sandro Bonazzola from comment #9)
> I assume you've moved this on me for:
> 
> > - I think that I can add some additional checks before trying to create the
> > VM, checking whether the user has done something like you did, leaving a stale
> > VM around, and telling them to clean it up.
> 
> right?

right, thanks.

Comment 11 Sandro Bonazzola 2013-10-25 13:22:15 UTC
Patch merged on upstream master and the 1.0 branch.

Comment 13 Leonid Natapov 2013-11-03 11:56:47 UTC
fixed.

Comment 14 Charlie 2013-11-28 01:18:52 UTC
This bug is currently attached to errata RHBA-2013:15257. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 15 Sandro Bonazzola 2013-12-05 10:42:08 UTC
Hosted engine is a new package, so it does not need errata text for specific bugs found during its development.

Comment 16 errata-xmlrpc 2014-01-21 16:53:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0083.html