Bug 874019

Summary: ovirt-engine-backend: Non-operational hosts that have been switched to 'Maintenance' return to non-operational status when disconnectStoragePool fails.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Omri Hochman <ohochman>
Component: ovirt-engine
Assignee: mkublin <mkublin>
Status: CLOSED ERRATA
QA Contact: Elad <ebenahar>
Severity: high
Priority: high
Docs Contact:
Version: unspecified
CC: bazulay, chetan, gickowic, hateya, iheim, lnatapov, sgrinber, yeylon, ykaul, yzaslavs
Target Milestone: ---
Target Release: 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when the disconnectStoragePool call failed or returned with a timeout, a non-operational host that had been switched to maintenance mode would return to a non-operational state. Now, if the disconnectStoragePool call fails, the host remains connected to the storage pool but is marked in the database as being in maintenance.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 21:18:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 915537
Attachments:
engine.log (flags: none)
vdsm.log (flags: none)

Description Omri Hochman 2012-11-07 09:58:37 UTC
Created attachment 639921 [details]
engine.log

ovirt-engine-backend: Non-operational hosts that have been switched to 'Maintenance' return to non-operational status when disconnectStoragePool fails.

Description: 
*************
I had a storage issue that caused some hosts in my cluster to fail reading VG metadata. Because of this, the hosts switched to 'Non-operational' status in RHEV-M. When I attempted to switch the hosts to 'Maintenance', it looked successful at first: the hosts moved to 'Preparing for Maintenance' and then to 'Maintenance'. However, a few minutes later (3-4 minutes) the hosts switched back from 'Maintenance' to 'Non-operational'.

Looking at engine.log / vdsm.log, it seems that the move to maintenance sends disconnectStoragePool on the same thread; if disconnectStoragePool fails or returns with a timeout, the host is switched back to Non-operational.
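
For illustration only, here is a minimal Java sketch of the pre-fix flow described above (this is not the actual ovirt-engine code; the class, enum and method names below are invented). The point it shows is that the same thread that completes the move to maintenance also issues the disconnect, and any failure or timeout demotes the host back to Non-Operational:

    // Hypothetical, simplified model of the buggy flow (names are illustrative).
    public class MaintenanceFlowSketch {

        enum VdsStatus { PREPARING_FOR_MAINTENANCE, MAINTENANCE, NON_OPERATIONAL }

        /** Stand-in for the DisconnectStoragePoolVDS broker call; throws on failure or timeout. */
        interface DisconnectStoragePoolCall {
            void run() throws Exception;
        }

        static VdsStatus moveToMaintenance(DisconnectStoragePoolCall disconnect) {
            // The engine has already flipped the DB status to Maintenance at this point.
            VdsStatus status = VdsStatus.MAINTENANCE;
            try {
                // Same-thread call; a slow vgs/lock on the host can exceed the VDSM
                // resource timeout (120 seconds in the vdsm.log below) and throw here.
                disconnect.run();
            } catch (Exception e) {
                // Pre-fix behaviour: any failure sends the host straight back to Non-Operational.
                status = VdsStatus.NON_OPERATIONAL;
            }
            return status;
        }

        public static void main(String[] args) {
            VdsStatus result = moveToMaintenance(() -> {
                throw new Exception("Resource timeout: ()");
            });
            System.out.println("Host status after failed disconnect: " + result);
        }
    }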

Note:
******
Due to the storage issue, running the 'vgs' command on the problematic hosts took a very long time (around 8 minutes).

engine.log :
(follow ID: ad912e12-2662-11e2-9057-441ea17336ee)
******************************************************************
2012-11-06 10:18:17,910 INFO  [org.ovirt.engine.core.bll.MaintananceNumberOfVdssCommand] (pool-4-thread-157) [37436bfb] Running command: MaintananceNumberOfVdssCommand internal: false. Entities affected :  ID: ad912e12-2662-11e2-9057-441ea17336ee Type: VDS
2012-11-06 10:18:17,923 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-157) [37436bfb] START, SetVdsStatusVDSCommand(HostName = puma23, HostId = ad912e12-2662-11e2-9057-441ea17336ee, status=PreparingForMaintenance, nonOperationalReason=NONE), log id: 4fc3e3c8
2012-11-06 10:18:17,952 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-157) [37436bfb] FINISH, SetVdsStatusVDSCommand, log id: 4fc3e3c8
2012-11-06 10:18:18,014 INFO  [org.ovirt.engine.core.bll.MaintananceVdsCommand] (pool-4-thread-157) [37436bfb] Running command: MaintananceVdsCommand internal: true. Entities affected :  ID: ad912e12-2662-11e2-9057-441ea17336ee Type: VDS
2012-11-06 10:18:18,461 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [6b6342e3] vds::Updated vds status from Preparing for Maintenance to Maintenance in database,  vds = ad912e12-2662-11e2-9057-441ea17336ee : puma23
2012-11-06 10:18:18,494 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-35) [6b6342e3] Clearing cache of pool: 020d9b34-265d-11e2-8865-441ea17336ee for problematic entities of VDS: puma23.
..
..
..
2012-11-06 10:18:18,537 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (QuartzScheduler_Worker-35) [6b6342e3] START, DisconnectStoragePoolVDSCommand(HostName = puma23, HostId = ad912e12-2662-11e2-9057-441ea17336ee, storagePoolId = 020d9b34-265d-11e2-8865-441ea17336ee, vds_spm_id = 20), log id: 302b03f8
2012-11-06 10:18:32,711 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-24) Failed to decrypt Data must not be longer than 256 bytes
2012-11-06 10:20:18,112 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Failed in DisconnectStoragePoolVDS method
2012-11-06 10:20:18,112 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Error code ResourceTimeout and error message VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: ()
2012-11-06 10:20:18,112 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Command org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         851
mMessage                      Resource timeout: ()


2012-11-06 10:20:18,112 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] HostName = puma23
2012-11-06 10:20:18,112 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Command DisconnectStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: ()
2012-11-06 10:20:18,112 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (QuartzScheduler_Worker-35) [6b6342e3] FINISH, DisconnectStoragePoolVDSCommand, log id: 302b03f8
2012-11-06 10:20:18,114 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [6b6342e3] Host encounter a problem moving to maintenance mode. The Host status will change to Non operational status.
2012-11-06 10:20:18,191 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-35) [5e3a930f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: ad912e12-2662-11e2-9057-441ea17336ee Type: VDS
2012-11-06 10:20:18,204 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-35) [5e3a930f] START, SetVdsStatusVDSCommand(HostName = puma23, HostId = ad912e12-2662-11e2-9057-441ea17336ee, status=NonOperational, nonOperationalReason=NONE), log id: 66b96e65
2012-11-06 10:20:18,218 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-35) [5e3a930f] FINISH, SetVdsStatusVDSCommand, log id: 66b96e65
2012-11-06 10:20:18,241 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [5e3a930f] ResourceManager::RerunFailedCommand: Error: VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: (), vds = ad912e12-2662-11e2-9057-441ea17336ee : puma23
2012-11-06 10:20:18,242 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [5e3a930f] VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: (): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: ()
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:212) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.MaintananceVdsCommand.ProcessStorageOnVdsInactive(MaintananceVdsCommand.java:178) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VdsEventListener.vdsMovedToMaintanance(VdsEventListener.java:69) [engine-bll.jar:]
        at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:210) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:362) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:193) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:42) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:176) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.vdsMovedToMaintanance(Unknown Source)
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.AfterRefreshTreatment(VdsUpdateRunTimeInfo.java:336) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:272) [engine-vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:64) [engine-scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]

Comment 1 Omri Hochman 2012-11-07 11:53:54 UTC
Attaching vdsm.log (clocks are in sync):
*********************************
Thread-78197::DEBUG::2012-11-06 10:18:18,556::BindingXMLRPC::171::vds::(wrapper) [10.35.160.85]
Thread-78197::DEBUG::2012-11-06 10:18:18,556::task::588::TaskManager.Task::(_updateState) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::moving from state init -> state preparing
Thread-78197::INFO::2012-11-06 10:18:18,556::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='020d9b34-265d-11e2-8865-441ea17336ee', hostID=20, scsiKey='020d9b34-265d-11e2-8865-441ea17336ee', remove=False, options=None)
Thread-78197::DEBUG::2012-11-06 10:18:18,557::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.020d9b34-265d-11e2-8865-441ea17336ee`ReqID=`7fa71204-6600-40b2-8171-51678108a690`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-78197::DEBUG::2012-11-06 10:18:18,557::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.020d9b34-265d-11e2-8865-441ea17336ee' for lock type 'exclusive'
Thread-78197::DEBUG::2012-11-06 10:18:18,558::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.020d9b34-265d-11e2-8865-441ea17336ee' is currently locked, Entering queue (11 in queue)
Thread-78197::DEBUG::2012-11-06 10:20:18,559::resourceManager::186::ResourceManager.Request::(cancel) ResName=`Storage.020d9b34-265d-11e2-8865-441ea17336ee`ReqID=`7fa71204-6600-40b2-8171-51678108a690`::Canceled request
Thread-78197::DEBUG::2012-11-06 10:20:18,559::resourceManager::705::ResourceManager.Owner::(acquire) 194eee42-7939-4ba5-af9e-ed75c8cec59d: request for 'Storage.020d9b34-265d-11e2-8865-441ea17336ee' timed out after '120.000000' seconds
Thread-78197::ERROR::2012-11-06 10:20:18,560::task::853::TaskManager.Task::(_setError) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 891, in disconnectStoragePool
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1301, in getExclusiveLock
    self.resOwner.acquire(namespace, resName, resourceManager.LockType.exclusive, timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 706, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
Thread-78197::DEBUG::2012-11-06 10:20:18,560::task::872::TaskManager.Task::(_run) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::Task._run: 194eee42-7939-4ba5-af9e-ed75c8cec59d ('020d9b34-265d-11e2-8865-441ea17336ee', 20, '020d9b34-265d-11e2-8865-441ea17336ee', False) {} failed - stopping task
Thread-78197::DEBUG::2012-11-06 10:20:18,560::task::1199::TaskManager.Task::(stop) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::stopping in state preparing (force False)
Thread-78197::DEBUG::2012-11-06 10:20:18,561::task::978::TaskManager.Task::(_decref) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::ref 1 aborting True

Comment 2 Omri Hochman 2012-11-07 11:57:38 UTC
Created attachment 639995 [details]
vdsm.log

Comment 4 mkublin 2012-11-19 10:16:12 UTC
And what is the expected behaviour?
If the host did not succeed in disconnecting, I cannot move it to maintenance, because afterwards the host could be moved to another pool, and activating it would fail during connect to the new pool with an error that the host is connected to another pool.

Comment 5 mkublin 2012-11-22 13:38:17 UTC
After this commit, http://gerrit.ovirt.org/#/c/9110/, in the described situation the user will not be able to force-remove a problematic data center. The only solution would be to fix vdsm; if that is not possible, to drop and recreate the DB schema.

Comment 6 Ayal Baron 2012-12-02 13:22:37 UTC
Point is to move the host to maintenance even if it failed connecting.  If host is later moved and doesn't come up, then user needs to fence it, no big deal (as it's not running any VMs, etc).

Comment 7 mkublin 2012-12-02 13:37:01 UTC
(In reply to comment #6)
> Point is to move the host to maintenance even if it failed connecting.  If
> host is later moved and doesn't come up, then user needs to fence it, no big
> deal (as it's not running any VMs, etc).

It failed to disconnect.
So what is the reason for the disconnectStoragePool verb, if I can move the host to maintenance anyway?

Comment 8 Ayal Baron 2012-12-02 20:38:47 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Point is to move the host to maintenance even if it failed connecting.  If
> > host is later moved and doesn't come up, then user needs to fence it, no big
> > deal (as it's not running any VMs, etc).
> 
> It failed to disconnect.
> So what is the reason for the disconnectStoragePool verb, if I can move the
> host to maintenance anyway?

And if it fails on a timeout?
The user wants to move the host to maintenance not because everything is OK but because he wants the system to stop touching/monitoring the host.
If there are running VMs on the host then obviously this is not the case. Otherwise, it should succeed.
Regarding not sending the disconnect to begin with - that is only valid if you change initvdsonup to disconnect the host if it is connected to the wrong pool.

Comment 9 mkublin 2012-12-03 07:29:47 UTC
I want to be clear:
1. When a host fails to disconnect from the pool, I will move it to maintenance - this is an easy fix.
2. There is no obligation from the storage team for the following cases:
   a) The host cannot be connected to a storage pool because it is already connected to another one.
   b) The host is actually still connected to the pool and running, with no indication in the engine management that the host even exists (any host in maintenance can be removed without any call to the host).

Comment 10 Barak 2012-12-06 12:55:20 UTC
An idea:

What if we add an additional ability to ssh to the host and restart vdsm - will that do the job?

Then we could use it on various occasions.

Comment 11 Ayal Baron 2012-12-06 13:53:45 UTC
(In reply to comment #10)
> An idea:
> 
> What if we add an additional ability to ssh to the host and restart vdsm -
> will that do the job?
> 
> Then we could use it on various occasions.

and if restart vdsm fails?
you're moving a *non* operational host to *maintenance*
I would say - don't touch the host at all.
Just move to maintenance.
In activate from maintenance run connect and if it fails because already connected, disconnect and try again.  If fails again, move to non-op.
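
As an illustration of the activation flow proposed above (a hedged sketch only; none of these class or method names exist in the project, and the real VDSM/engine error handling is more involved), the idea is: try to connect, and only if the connect fails because the host is still attached to another pool, disconnect and retry once; a second failure moves the host to Non-Operational:

    // Hypothetical sketch of "connect; if already connected, disconnect and retry; else non-op".
    public class ActivateFromMaintenanceSketch {

        enum Outcome { UP, NON_OPERATIONAL }

        interface PoolOps {
            /** Returns true on success, false if the host reports it is attached to another pool. */
            boolean connectStoragePool();
            void disconnectStoragePool();
        }

        static Outcome activate(PoolOps vdsm) {
            if (vdsm.connectStoragePool()) {
                return Outcome.UP;
            }
            // First connect failed because the host is still attached to the old pool:
            // disconnect and try exactly once more, as suggested in the comment above.
            vdsm.disconnectStoragePool();
            return vdsm.connectStoragePool() ? Outcome.UP : Outcome.NON_OPERATIONAL;
        }

        public static void main(String[] args) {
            // Toy stand-in: the first connect fails, the disconnect succeeds, the retry comes up.
            Outcome result = activate(new PoolOps() {
                private boolean disconnected = false;
                public boolean connectStoragePool() { return disconnected; }
                public void disconnectStoragePool() { disconnected = true; }
            });
            System.out.println("Activation outcome: " + result);
        }
    }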

Comment 12 mkublin 2012-12-09 10:07:35 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > An idea:
> > 
> > What if we add an additional ability to ssh to the host and restart vdsm -
> > will that do the job?
> > 
> > Then we could use it on various occasions.
> 
> and if restart vdsm fails?
> you're moving a *non* operational host to *maintenance*
> I would say - don't touch the host at all.
> Just move to maintenance.
> In activate from maintenance run connect and if it fails because already
> connected, disconnect and try again.  If fails again, move to non-op.
I will not implement it that way; the number of possible bugs, corner cases, and races is too big.
As I said, I can implement this: if the host fails to disconnect, I will switch its status to Maintenance. The user should be aware that in some cases he will then need to restart the host manually, or in some cases he will end up with a "ghost" host which is not in the engine DB but is still connected to the pool and running. If this is acceptable to the storage team, I will implement it; if not, I will close the bug as working as designed.

Comment 13 mkublin 2012-12-16 14:13:17 UTC
Patch: http://gerrit.ovirt.org/#/c/10109/
The behaviour will be: even if the disconnect fails, the host will be moved to maintenance.
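
A simplified sketch of the behaviour described in this comment (illustrative only; this is not the code from the gerrit change, and the names are invented): the disconnect is still attempted, but a failure is only logged and the host stays in Maintenance in the database:

    // Hypothetical model of the post-patch flow: keep Maintenance even if disconnect fails.
    public class MaintenanceAfterFixSketch {

        enum VdsStatus { MAINTENANCE, NON_OPERATIONAL }

        interface DisconnectStoragePoolCall {
            void run() throws Exception;
        }

        static VdsStatus moveToMaintenance(DisconnectStoragePoolCall disconnect) {
            try {
                disconnect.run();
            } catch (Exception e) {
                // Post-fix behaviour: the failure is logged; the host may still be connected
                // to the pool on the VDSM side, but the engine keeps it in Maintenance.
                System.err.println("disconnectStoragePool failed, host stays in Maintenance: " + e.getMessage());
            }
            return VdsStatus.MAINTENANCE;
        }

        public static void main(String[] args) {
            VdsStatus result = moveToMaintenance(() -> {
                throw new Exception("Resource timeout: ()");
            });
            System.out.println("Host status after failed disconnect: " + result);
        }
    }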

Comment 14 Ayal Baron 2012-12-17 06:51:50 UTC
Barak suggested to provide ability to reboot host through ssh to solve the issues...

Comment 15 Itamar Heim 2012-12-17 08:37:32 UTC
(In reply to comment #14)
> Barak suggested to provide ability to reboot host through ssh to solve the
> issues...

why not via a vdsm verb?

Comment 16 Ayal Baron 2012-12-19 21:42:57 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > Barak suggested to provide ability to reboot host through ssh to solve the
> > issues...
> 
> why not via a vdsm verb?

possible, but would not be usable for cases when vdsm is not responding even though the host is.

Comment 17 Simon Grinberg 2012-12-24 16:25:53 UTC
(In reply to comment #13)
> http://gerrit.ovirt.org/#/c/10109/ 
> Patch. The behaviour will be:
> Even if failed to Disconnect, host will be moved to maintenance.

I don't like the fact that there is a host connected to storage while in maintenance mode - non-operational is the proper status. There are some types of operations that we only permit on a host in maintenance mode, like re-install, and I have no idea of the side effects of a re-install while the storage is connected.

On the other hand, Kaul said that due to auto-recovery, if the host is placed into non-operational it afterwards toggles between non-operational and up. That is even worse.

The solution to the above should come from the auto-recovery procedure, while the move to maintenance may be enforced by fencing the host.

Andy?

Comment 20 Ayal Baron 2012-12-30 20:32:32 UTC
*** Bug 890824 has been marked as a duplicate of this bug. ***

Comment 26 Elad 2013-03-18 11:07:41 UTC
Verified on SF10.
I blocked connectivity between the HSM host and the master domain with iptables. The host became non-operational after 5 minutes. Moving the host to maintenance succeeded, and it remained in maintenance and did not become non-operational.

Comment 28 errata-xmlrpc 2013-06-10 21:18:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html