Bug 1220841

Summary: [RFE] provide a better way to remove broken hosts/vms from the database
Product: [Retired] oVirt
Reporter: Sven Kieske <s.kieske>
Component: ovirt-engine-core
Assignee: bugs <bugs>
Status: CLOSED NOTABUG
QA Contact: Pavel Stehlik <pstehlik>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.5
CC: amureini, ecohen, gklein, lsurette, mgoldboi, ofrenkel, oourfali, rbalakri, s.kieske, yeylon
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-05-18 08:57:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sven Kieske 2015-05-12 15:20:52 UTC
Description of problem:

Scenario:

You have a datacenter with one host (say, with local storage) with running VMs.
The host crashes fatally (disk defect) and can't be restored.

You want to remove the whole DC/cluster/host/"running" VMs (they are "running"
in the DB, but the host is gone).

There is no easy way to do this.
I had to tweak the DB in order to achieve it.
Even awels on IRC couldn't find a tool to do this (there may be one somewhere...).

If there _is_ a tool to do this in a clean way, please document it better, because even RH employees can't find it ;)

Version-Release number of selected component (if applicable):
latest master?

How reproducible:
always

Steps to Reproduce:
1. The host was in status "Non Responsive" with 2 "running" VMs on it (actually it wasn't even powered on, as its hard disks had been replaced).
2. So I wanted to delete everything, but I couldn't put the host into maintenance, because oVirt complains: "Error while executing action: Cannot switch Host to Maintenance mode.
Host still has running VMs on it and is in Non Responsive state."
3. So I deleted the records by hand in the database:

select deletevm('VM1_UUID');
select deletevm('VM2_UUID');

engine=# select vds_id from vds_static where vds_name = 'MYBROKENHOST';
-- result: vds_id = $BROKEN_HOST_UUID

-- $BROKEN_HOST_UUID stands for the UUID returned by the query above
delete from vds_statistics where vds_id='$BROKEN_HOST_UUID';
delete from vds_dynamic where vds_id='$BROKEN_HOST_UUID';
delete from vds_static where vds_id='$BROKEN_HOST_UUID';

Then I could force-remove the DC and the cluster.
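The host part of the manual cleanup above (the three delete statements) can at least be made atomic, so that a failure part-way through does not leave a half-deleted host. The sketch below is a hypothetical illustration, not an oVirt tool: it uses Python's built-in sqlite3 only so that it runs standalone (the real engine DB is PostgreSQL), the three-table schema with foreign keys from vds_dynamic and vds_statistics to vds_static is an assumed simplification, and the VM rows removed with deletevm() are left out. The helper names open_db and remove_host_rows are made up for this example.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for the three engine tables used
# above. The real engine DB is PostgreSQL with many more columns; the foreign
# keys below are an assumption, included to show why child rows go first.
SCHEMA = """
CREATE TABLE vds_static (vds_id TEXT PRIMARY KEY, vds_name TEXT UNIQUE);
CREATE TABLE vds_dynamic (vds_id TEXT PRIMARY KEY REFERENCES vds_static(vds_id));
CREATE TABLE vds_statistics (vds_id TEXT PRIMARY KEY REFERENCES vds_static(vds_id));
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory DB with the assumed schema and FK checks enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def remove_host_rows(conn: sqlite3.Connection, host_name: str) -> str:
    """Delete one host's rows in a single transaction and return its vds_id."""
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if any statement raises, so no half-deleted host remains.
    with conn:
        row = conn.execute(
            "SELECT vds_id FROM vds_static WHERE vds_name = ?", (host_name,)
        ).fetchone()
        if row is None:
            raise LookupError(f"no host named {host_name!r}")
        vds_id = row[0]
        # Dependent tables first, then the parent row they reference.
        for table in ("vds_statistics", "vds_dynamic", "vds_static"):
            conn.execute(f"DELETE FROM {table} WHERE vds_id = ?", (vds_id,))
    return vds_id
```

Under the assumed schema, remove_host_rows(conn, "MYBROKENHOST") returns the deleted vds_id, while deleting from vds_static first would be rejected with sqlite3.IntegrityError and rolled back.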

Actual results:
Long, error-prone and risky manual actions to remove things that are _gone_.

Expected results:
Provide a tool (maybe even in the GUI, but that's not necessary) to clean up broken hosts, vanished VMs, etc.

Additional info:
I encountered this in oVirt 3.3.3, but I don't think there has been any improvement in this area since.

I'd like to hear if there is a command-line tool to do this!

The component might be wrong (webadmin), but I can't find "ovirt-engine-core",
which did exist in the past.



kind regards

Sven

Comment 1 Oved Ourfali 2015-05-17 09:57:20 UTC
Have you tried right-clicking the host and selecting the "confirm host has been rebooted" option? That should bring the host into a state that allows safe removal.

Comment 2 Omer Frenkel 2015-05-17 10:47:39 UTC
In addition to comment 1: this flow should let you clean up easily using the UI. It cleans the DB; any storage on the host (if it still exists) must be removed manually.
Again, this flow relies on the user to make sure the host is really down, because the engine cannot verify this.

1. Right-click the host and select "confirm host has been rebooted": this will clear the "SPM" status of the host and move all running VMs to Down.

2. Assuming the storage is not reachable:
right-click the Data Center and select "Force remove": this will remove the VMs/templates related to this DC, then its storage domains, and finally the DC itself.

Now you should be left with the cluster and the host:
3. Move the host to maintenance; it can now be removed.
4. Remove the cluster.

Comment 3 Oved Ourfali 2015-05-17 10:59:24 UTC
Based on the flow in Comment #2, and my comment, closing as NOTABUG.

Comment 4 Sven Kieske 2015-05-18 08:53:45 UTC
I'm 99% sure that the "confirm host has been rebooted" option was grayed out, so it could not be selected.

AFAIK you can only select this option if you are able to bring the host into
maintenance state, which was not the case.

I'd therefore like to reopen this bug.

I'll try to reproduce it, but this could take some time...

Comment 5 Oved Ourfali 2015-05-18 08:57:10 UTC
(In reply to Sven Kieske from comment #4)
> I'm 99% sure that the option "confirm host has been rebooted" was grayed
> out, thus could not be selected.
> 
> afaik you can only select this option, if you are able to bring the host into
> maintenance state, which was not the case.
> 
> I'd like to reopen therefore.
> 
> I'll try to reproduce, but this could take some time..

It shouldn't be, so if it was, it could be a different bug.
Please open a new bug if you manage to reproduce this.

Comment 6 Sven Kieske 2015-06-10 18:33:07 UTC
I can now confirm that this works almost as described here:

(In reply to Omer Frenkel from comment #2)
> in addition to comment 1 - this flow should allow you to clean up easily
> using the UI, this will clean the db, any storage on the host (if still
> exists) should be removed manually:
> again this flow relies on the user that the host is really down, because
> engine cannot verify this
> 
> 1. right click the host and select "confirm host has been rebooted" - this
> will clean the "SPM" status of the host and will move all running vms to
> down.
> 
> 2. assuming storage is not reachable:
> right click the Data Center and select "Force remove" - this will remove
> vm/templates related to this DC and also storage domains, finally the DC
> itself
> 

-> Could not remove the DC because it contains hosts that are not in maintenance:
"Error while executing action: Cannot remove Data Center while there are Hosts that are not in Maintenance mode."

--> I put the host into maintenance first, and then it worked!


> now you should be left with the cluster and the host:
> 3. move host to maintenance - now can be removed
> 4. remove cluster
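The ordering constraints reported in this thread can be captured in a small toy model that replays a sequence of cleanup steps. This is not oVirt engine code: the class, function and state names are hypothetical, and each precondition is taken only from the error messages and steps quoted above. The rule that a cluster can be removed only once it holds no hosts is an extra assumption. Replaying the order from comment 2 fails at the Data Center step, while the order that worked in comment 6 (maintenance before "Force remove") completes.

```python
from dataclasses import dataclass, field


class ActionError(Exception):
    """A simulated action was refused because a precondition failed."""


@dataclass
class Setup:
    # Starting point from the report: one Non Responsive host whose two VMs
    # are still recorded as running in the engine DB.
    host_status: str = "NonResponsive"
    vms: dict = field(default_factory=lambda: {"VM1": "Up", "VM2": "Up"})
    dc: bool = True
    host: bool = True
    cluster: bool = True


def confirm_rebooted(s: Setup) -> None:
    # Comment 2, step 1: VMs recorded as running on the host move to Down.
    for name in s.vms:
        s.vms[name] = "Down"


def to_maintenance(s: Setup) -> None:
    # Refused while VMs still count as running on a Non Responsive host.
    if s.host_status == "NonResponsive" and "Up" in s.vms.values():
        raise ActionError("Cannot switch Host to Maintenance mode. Host still has "
                          "running VMs on it and is in Non Responsive state.")
    s.host_status = "Maintenance"


def force_remove_dc(s: Setup) -> None:
    # Comment 6: refused while the DC still has a host not in maintenance.
    if s.host and s.host_status != "Maintenance":
        raise ActionError("Cannot remove Data Center while there are Hosts "
                          "that are not in Maintenance mode.")
    s.vms.clear()  # comment 2: the DC's VMs/templates are removed as well
    s.dc = False


def remove_host(s: Setup) -> None:
    # Comment 2, step 3: a host can be removed once it is in maintenance.
    if s.host_status != "Maintenance":
        raise ActionError("Host must be in maintenance to be removed.")
    s.host = False


def remove_cluster(s: Setup) -> None:
    # Assumption: a cluster can only be removed once it has no hosts left.
    if s.host:
        raise ActionError("Cluster still contains a host.")
    s.cluster = False


ACTIONS = {f.__name__: f for f in (confirm_rebooted, to_maintenance,
                                   force_remove_dc, remove_host, remove_cluster)}


def run(steps) -> Setup:
    """Apply the named steps in order to a fresh Setup; stop at the first refusal."""
    s = Setup()
    for step in steps:
        ACTIONS[step](s)
    return s


# Step orders as described in comment 2 and as they worked in comment 6.
COMMENT_2 = ["confirm_rebooted", "force_remove_dc", "to_maintenance",
             "remove_host", "remove_cluster"]
COMMENT_6 = ["confirm_rebooted", "to_maintenance", "force_remove_dc",
             "remove_host", "remove_cluster"]
```

For example, run(COMMENT_2) raises ActionError with the "Cannot remove Data Center..." message, run(["to_maintenance"]) reproduces the refusal from the original report, and run(COMMENT_6) ends with the DC, host and cluster all removed.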