Bug 850286 - [engine] InvocationTargetException when domain in problem and storage pool is force-removed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.2.0
Assignee: Tal Nisan
QA Contact: Gadi Ickowicz
URL:
Whiteboard: infra
Depends On:
Blocks: 915537
 
Reported: 2012-08-21 14:02 UTC by Gadi Ickowicz
Modified: 2016-02-10 19:31 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
vdsm + engine logs (1.12 MB, application/x-gzip)
2012-08-21 14:02 UTC, Gadi Ickowicz

Description Gadi Ickowicz 2012-08-21 14:02:12 UTC
Created attachment 605941
vdsm + engine logs

Description of problem:
The host is stuck on "waiting to enter maintenance" after force-removing the data center. The engine logs show it tries to recover the VM even though the data center has been removed from the web interface.

Running vdsClient on the host shows the VM as still running, even though it was stopped and removed from the web GUI before the data center was removed.

The scenario was a single data center/cluster/host with 2 iSCSI data domains. After blocking the non-master domain while a VM was up, and then unblocking it, the test failed and cleaned up the storage domains (on the storage server only - dynamic storage...).

To finish cleaning up everything, I manually (from the GUI) stopped the VM (which was still running), and then removed it. Then I force-removed the data center and set the host to maintenance.
At this point the host got stuck on "waiting for host to enter maintenance", and it still shows 1 VM running.

Version-Release number of selected component (if applicable):
rhevm-3.1.0-12.el6ev.noarch

logs:
2012-08-21 14:30:23,508 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-17) [ce10a21] VDS::UpdateVmRunTimeInfo Error: found VM on a VDS that is not in the database!
2012-08-21 14:30:23,998 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-64) [59d49915] starting ProcessDomainRecovery for domain 93b21619-8a9d-467c-a9cf-cf2272702fc8
2012-08-21 14:30:24,001 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (QuartzScheduler_Worker-64) failed to invoke sceduled method OnTimer: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor200.invoke(Unknown Source) [:1.7.0_05-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_05-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_05-icedtea]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:64) [engine-scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
Caused by: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData$7.runInTransaction(IrsBrokerCommand.java:1317) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:168) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:107) [engine-utils.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData.ProcessDomainRecovery(IrsBrokerCommand.java:1235) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData.OnTimer(IrsBrokerCommand.java:1215) [engine-vdsbroker.jar:]
        ... 6 more

2012-08-21 14:30:24,002 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-9) [698e4e3e] starting ProcessDomainRecovery for domain 76c5f446-4768-4c96-adb7-14f538fb3ea7
2012-08-21 14:30:24,010 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (QuartzScheduler_Worker-9) failed to invoke sceduled method OnTimer: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor200.invoke(Unknown Source) [:1.7.0_05-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_05-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_05-icedtea]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:64) [engine-scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
Caused by: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData$7.runInTransaction(IrsBrokerCommand.java:1317) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:168) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:107) [engine-utils.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData.ProcessDomainRecovery(IrsBrokerCommand.java:1235) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand$IrsProxyData.OnTimer(IrsBrokerCommand.java:1215) [engine-vdsbroker.jar:]
        ... 6 more

2012-08-21 14:30:25,610 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-69) [4705279e] VDS::UpdateVmRunTimeInfo Error: found VM on a VDS that is not in the database!

Comment 2 Omer Frenkel 2012-09-02 14:24:09 UTC
There are a few things here:
1. The log doesn't match the described steps - there is a request to run the VM at 2012-08-21 14:04:58, and no request to stop or delete it before the force remove of the storage pool was requested. This leads to a situation where the VM is running on VDSM but doesn't exist in the backend, and this is expected.
Moving this host to maintenance probably needs a different bug of its own, where we can decide what behaviour we would like to have.

2. The InvocationTargetException seen here is a bug in force remove of the storage pool, caused because the IrsProxy cache and the domain recovery timers are not cleaned up.
Marking the bug as storage to be handled, and changing the title to be more accurate.
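
To make item 2 concrete, here is a minimal, self-contained sketch of the failure mode. All names are hypothetical and a ScheduledExecutorService stands in for the engine's Quartz timers; this illustrates the pattern, not the engine's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class DomainRecoverySketch {
    // Stand-in for the storage pool table: pool id -> pool name.
    static final Map<String, String> poolDb = new ConcurrentHashMap<>();
    // Stand-in for the Quartz scheduler that fires the OnTimer jobs.
    static final ScheduledExecutorService timers =
            Executors.newSingleThreadScheduledExecutor();

    static ScheduledFuture<?> scheduleDomainRecovery(String poolId) {
        return timers.scheduleAtFixedRate(() -> {
            // The timer callback looks the pool up by id on every tick.
            String poolName = poolDb.get(poolId); // null once the pool is force-removed
            // Dereferencing the null result throws NullPointerException, which the
            // reflective timer wrapper then reports as InvocationTargetException.
            System.out.println("recovering domains of " + poolName.trim());
        }, 100, 100, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        poolDb.put("pool-1", "dc1");
        ScheduledFuture<?> recovery = scheduleDomainRecovery("pool-1");

        // Force remove deletes the DB row but, in the buggy flow, never cancels
        // the timer or evicts the proxy cache entry - so the next tick NPEs.
        poolDb.remove("pool-1");

        // The fix direction described above: cancel the recovery timers and drop
        // the IrsProxy cache entry as part of removing the pool.
        recovery.cancel(false);
        timers.shutdown();
    }
}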

Comment 3 Ayal Baron 2012-09-05 09:09:20 UTC
Need to prevent force remove of dc when it has hosts which are not in maintenance mode.
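
A sketch of that kind of guard, with hypothetical type and status names (this is not the engine's actual validation code):

import java.util.List;

// Hypothetical host status values, loosely mirroring the engine's VDS statuses.
enum VdsStatus { UP, MAINTENANCE, NON_OPERATIONAL, NON_RESPONSIVE }

record Host(String name, VdsStatus status) {}

class ForceRemoveDcGuardSketch {
    // Reject the force remove while any host in the DC is not in Maintenance.
    static boolean canForceRemoveDataCenter(List<Host> hostsInDc, StringBuilder failureReason) {
        for (Host host : hostsInDc) {
            if (host.status() != VdsStatus.MAINTENANCE) {
                failureReason.append("Host ").append(host.name())
                             .append(" is ").append(host.status())
                             .append("; move all hosts to Maintenance first.");
                return false;
            }
        }
        return true;
    }
}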

Comment 4 Tal Nisan 2012-11-07 19:30:47 UTC
http://gerrit.ovirt.org/9110

Comment 5 mkublin 2012-11-07 22:48:30 UTC
(In reply to comment #3)
> Need to prevent force remove of dc when it has hosts which are not in
> maintenance mode.
Ayal, about problem 2 (agreed with Omer): it can also be reproduced without force remove of the storage pool - you also have ForceRemoveStorageDomainCommand.
Also, I thought that force remove is meant to remove a pool from the DB in any case;
if some hosts are non-responsive or non-operational - no force remove?

Comment 6 Ayal Baron 2012-11-07 22:55:07 UTC
(In reply to comment #5)
> (In reply to comment #3)
> > Need to prevent force remove of dc when it has hosts which are not in
> > maintenance mode.
> Ayal, about problem 2 (agreed with Omer): it can also be reproduced without
> force remove of the storage pool - you also have
> ForceRemoveStorageDomainCommand.

Which should also only happen when the storage domain is in maintenance.
Can it be reproduced any other way?

> Also, I thought that force remove is meant to remove a pool from the DB in
> any case;
> if some hosts are non-responsive or non-operational - no force remove?

Move host to maintenance then force remove.
Moving a host to maintenance should always work.

Comment 7 mkublin 2012-11-08 09:15:05 UTC
> Move host to maintenance then force remove.
> Moving a host to maintenance should always work.
It is not: a host that failed to disconnect from the storage pool will be moved to NonOperational.

Comment 8 Ayal Baron 2012-11-08 09:34:24 UTC
(In reply to comment #7)
> > Move host to maintenance then force remove.
> > Moving a host to maintenance should always work.
> It is not: a host that failed to disconnect from the storage pool will be
> moved to NonOperational.

That is a bug (that is already open).

Comment 9 mkublin 2012-11-08 09:58:04 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > > Move host to maintenance then force remove.
> > > Moving a host to maintenance should always work.
> > It is not: a host that failed to disconnect from the storage pool will be
> > moved to NonOperational.
> 
> That is a bug (that is already open).

Regarding the bug, I know; but again the question: why does force delete need all hosts in maintenance, when for a regular remove the host should be in status UP? Isn't that confusing? Especially since force delete is a weapon of judgement day; when I am using it, I know that I have already lost my system.

Comment 10 Ayal Baron 2012-11-08 12:53:04 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #7)
> > > > Move host to maintenance then force remove.
> > > > Moving a host to maintenance should always work.
> > > It is not: a host that failed to disconnect from the storage pool will
> > > be moved to NonOperational.
> > 
> > That is a bug (that is already open).
> 
> Regarding the bug, I know; but again the question: why does force delete
> need all hosts in maintenance, when for a regular remove the host should be
> in status UP? Isn't that confusing? Especially since force delete is a
> weapon of judgement day; when I am using it, I know that I have already
> lost my system.

I see no sense in a host being up when removing a DC.
That is the same problem as bug 869309.
But when you run a normal delete you need the SPM, you go through the host, and you have validations that make sure there are no running VMs etc.
With destroy you don't have all this, and there is no reason for having any host running with VMs (up or non-operational).

Destroy should be something that is not easily done.

Comment 11 mkublin 2012-11-08 13:08:23 UTC
Now I get it; we think differently about what this command does.
Today, force remove of a storage pool just removes the pool, its domains, and its networks from the DB; nothing is sent to the host.
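
For illustration, a minimal sketch of those DB-only semantics; the table names are assumptions, not the real engine schema. The whole command is one local transaction and never talks to a host:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class ForceRemovePoolDbOnlySketch {
    // Deletes the pool-scoped rows in one transaction; note that there is no
    // VDSM/host call anywhere on this path.
    static void forceRemovePool(Connection db, String poolId) throws SQLException {
        db.setAutoCommit(false);
        try {
            for (String sql : new String[] {
                    "DELETE FROM network WHERE storage_pool_id = ?",              // assumed table
                    "DELETE FROM storage_pool_iso_map WHERE storage_pool_id = ?", // assumed table
                    "DELETE FROM storage_pool WHERE id = ?" }) {                  // assumed table
                try (PreparedStatement ps = db.prepareStatement(sql)) {
                    ps.setString(1, poolId);
                    ps.executeUpdate();
                }
            }
            db.commit();
        } catch (SQLException e) {
            db.rollback();
            throw e;
        }
    }
}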

Comment 12 Simon Grinberg 2012-11-18 16:56:36 UTC
(In reply to comment #11)
> Now I get it; we think differently about what this command does.
> Today, force remove of a storage pool just removes the pool, its domains,
> and its networks from the DB; nothing is sent to the host.

And we should keep it this way.
This operation is there to clean up the DB after the admin failed to clean up properly with remove DC, and thus resorts to manual steps.

The flow to use destroy DC (sketched in code after this list):
Make sure all the hosts are in maintenance (fenced or confirmed to be rebooted).
Force remove the DC.
Manually clean up the storage, outside of the engine's scope.
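
Tying the earlier sketches together, a usage sketch of that flow, using the same hypothetical names as above (steps 1 and 2 are the engine-side guard and DB cleanup; step 3 stays manual):

import java.sql.Connection;
import java.util.List;

class DestroyDcFlowSketch {
    static void destroyDataCenter(Connection db, String poolId, List<Host> hostsInDc)
            throws Exception {
        StringBuilder reason = new StringBuilder();
        // Step 1: all hosts must already be in Maintenance (fenced or rebooted).
        if (!ForceRemoveDcGuardSketch.canForceRemoveDataCenter(hostsInDc, reason)) {
            throw new IllegalStateException(reason.toString());
        }
        // Step 2: DB-only force remove; nothing is sent to any host.
        ForceRemovePoolDbOnlySketch.forceRemovePool(db, poolId);
        // Step 3: storage cleanup happens manually, outside the engine's scope.
    }
}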

(In reply to comment #9)
> Regarding the bug, I know; but again the question: why does force delete
> need all hosts in maintenance, when for a regular remove the host should be
> in status UP? Isn't that confusing? Especially since force delete is a
> weapon of judgement day; when I am using it, I know that I have already
> lost my system.

Exactly because of that. 
For regular remove you need the hosts to be up for the manager to clean up the storage. If you have to use force remove, it's because everything else failed and you want to clean up this DC from your manager no matter what. This means a DB-only operation, which implies the hosts should already be disconnected from the storage and have no VMs, in order to guarantee the cleanup is safe and there are no running VMs that may still write to the storage.


Ayal's comment #3 is the correct answer.

Comment 17 Gadi Ickowicz 2013-03-14 09:36:28 UTC
SF9
Able to force remove the pool if a domain is in problem.

Comment 18 Itamar Heim 2013-06-11 09:03:14 UTC
3.2 has been released


