Bug 874019 - ovirt-engine-backend: Non-operational hosts that have been switched to 'Maintenance' return to non-operational status when disconnectStoragePool fails.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine (Show other bugs)
Version: unspecified
Hardware/OS: x86_64 Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assigned To: mkublin
QA Contact: Elad
Whiteboard: infra
Duplicates: 890824 (view as bug list)
Depends On:
Blocks: 915537
Reported: 2012-11-07 04:58 EST by Omri Hochman
Modified: 2016-02-10 14:29 EST
CC: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when the disconnectStoragePool call failed or timed out, a non-operational host that had been switched to maintenance mode would return to a non-operational state. Now, if the disconnect fails, the host remains connected to the storage pool but is marked in the database as being in maintenance.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 17:18:46 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra

Attachments
engine.log (659.44 KB, application/x-zip-compressed), 2012-11-07 04:58 EST, Omri Hochman
vdsm.log (400.51 KB, application/octet-stream), 2012-11-07 06:57 EST, Omri Hochman

Description Omri Hochman 2012-11-07 04:58:37 EST
Created attachment 639921 [details]
engine.log

ovirt-engine-backend: Non-operational hosts that have been switched to 'Maintenance' return to non-operational status when disconnectStoragePool fails.

Description: 
*************
I had a storage issue that caused some hosts in my cluster to fail reading VG metadata. Due to this, the hosts switched to the 'Non-operational' status in RHEV-M. When I attempted to switch the hosts to 'Maintenance', it looked successful at first: the hosts moved to 'Preparing for Maintenance' and then to 'Maintenance'. However, a few minutes later (3-4 minutes) the hosts switched back from 'Maintenance' to 'Non-operational'.

Looking at engine.log and vdsm.log, it seems that moving a host to maintenance sends disconnectStoragePool on the same thread; if disconnectStoragePool fails or returns with a timeout, the host is switched back to Non-operational.

Note:
******
Due to the storage issue, running the 'vgs' command on the problematic hosts took very long (around 8 minutes).
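The reported sequence can be sketched as a minimal Python model (the real engine code is Java, and all names here are illustrative, not the actual oVirt API):

```python
from enum import Enum

class HostStatus(Enum):
    MAINTENANCE = "Maintenance"
    NON_OPERATIONAL = "NonOperational"

class Host:
    def __init__(self):
        self.status = HostStatus.NON_OPERATIONAL

class DisconnectTimeout(Exception):
    """Stands in for the ResourceTimeout error seen in the logs."""

def move_to_maintenance(host, disconnect_storage_pool):
    """Pre-fix flow: the maintenance transition and the
    disconnectStoragePool call run on the same thread, and a
    failure reverts the host to NonOperational."""
    host.status = HostStatus.MAINTENANCE
    try:
        disconnect_storage_pool(host)
    except DisconnectTimeout:
        # This is the reported bug: the error handler overrides
        # the user's explicit request for Maintenance.
        host.status = HostStatus.NON_OPERATIONAL
    return host.status
```

With a disconnect callback that times out (as in the logs below), the host ends up NonOperational even though the user asked for Maintenance.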

engine.log :
(follow ID: ad912e12-2662-11e2-9057-441ea17336ee)
******************************************************************
2012-11-06 10:18:17,910 INFO  [org.ovirt.engine.core.bll.MaintananceNumberOfVdssCommand] (pool-4-thread-157) [37436bfb] Running command: MaintananceNumberOfVdssCommand internal: false. Entities affected :  ID: ad912e12-2662-11e2-9057-441ea17336ee Type: VDS
2012-11-06 10:18:17,923 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-157) [37436bfb] START, SetVdsStatusVDSCommand(HostName = puma23, HostId = ad912e12-2662-11e2-9057-441ea17336ee, status=PreparingForMaintenance, nonOperationalReason=NONE), log id: 4fc3e3c8
2012-11-06 10:18:17,952 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-4-thread-157) [37436bfb] FINISH, SetVdsStatusVDSCommand, log id: 4fc3e3c8
2012-11-06 10:18:18,014 INFO  [org.ovirt.engine.core.bll.MaintananceVdsCommand] (pool-4-thread-157) [37436bfb] Running command: MaintananceVdsCommand internal: true. Entities affected :  ID: ad912e12-2662-11e2-9057-441ea17336ee Type: VDS
2012-11-06 10:18:18,461 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [6b6342e3] vds::Updated vds status from Preparing for Maintenance to Maintenance in database,  vds = ad912e12-2662-11e2-9057-441ea17336ee : puma23
2012-11-06 10:18:18,494 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-35) [6b6342e3] Clearing cache of pool: 020d9b34-265d-11e2-8865-441ea17336ee for problematic entities of VDS: puma23.
..
..
..
2012-11-06 10:18:18,537 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (QuartzScheduler_Worker-35) [6b6342e3] START, DisconnectStoragePoolVDSCommand(HostName = puma23, HostId = ad912e12-2662-11e2-9057-441ea17336ee, storagePoolId = 020d9b34-265d-11e2-8865-441ea17336ee, vds_spm_id = 20), log id: 302b03f8
2012-11-06 10:18:32,711 ERROR [org.ovirt.engine.core.engineencryptutils.EncryptionUtils] (QuartzScheduler_Worker-24) Failed to decrypt Data must not be longer than 256 bytes
2012-11-06 10:20:18,112 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Failed in DisconnectStoragePoolVDS method
2012-11-06 10:20:18,112 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Error code ResourceTimeout and error message VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: ()
2012-11-06 10:20:18,112 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Command org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         851
mMessage                      Resource timeout: ()


2012-11-06 10:20:18,112 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] HostName = puma23
2012-11-06 10:20:18,112 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-35) [6b6342e3] Command DisconnectStoragePoolVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: ()
2012-11-06 10:20:18,112 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (QuartzScheduler_Worker-35) [6b6342e3] FINISH, DisconnectStoragePoolVDSCommand, log id: 302b03f8
2012-11-06 10:20:18,114 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [6b6342e3] Host encounter a problem moving to maintenance mode. The Host status will change to Non operational status.
2012-11-06 10:20:18,191 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-35) [5e3a930f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: ad912e12-2662-11e2-9057-441ea17336ee Type: VDS
2012-11-06 10:20:18,204 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-35) [5e3a930f] START, SetVdsStatusVDSCommand(HostName = puma23, HostId = ad912e12-2662-11e2-9057-441ea17336ee, status=NonOperational, nonOperationalReason=NONE), log id: 66b96e65
2012-11-06 10:20:18,218 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-35) [5e3a930f] FINISH, SetVdsStatusVDSCommand, log id: 66b96e65
2012-11-06 10:20:18,241 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [5e3a930f] ResourceManager::RerunFailedCommand: Error: VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: (), vds = ad912e12-2662-11e2-9057-441ea17336ee : puma23
2012-11-06 10:20:18,242 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-35) [5e3a930f] VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: (): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DisconnectStoragePoolVDS, error = Resource timeout: ()
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:212) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunVdsCommand(VDSBrokerFrontendImpl.java:33) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.MaintananceVdsCommand.ProcessStorageOnVdsInactive(MaintananceVdsCommand.java:178) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.VdsEventListener.vdsMovedToMaintanance(VdsEventListener.java:69) [engine-bll.jar:]
        at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptorFactory$ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptorFactory.java:72) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.WeavedInterceptor.processInvocation(WeavedInterceptor.java:53) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:36) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.jpa.interceptor.SBInvocationInterceptor.processInvocation(SBInvocationInterceptor.java:47)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:53) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.pool.PooledInstanceInterceptor.processInvocation(PooledInstanceInterceptor.java:51) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInNoTx(CMTTxInterceptor.java:210) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:362) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:193) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:42) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59) [jboss-as-ejb3.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:165) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:176) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) [jboss-invocation.jar:1.1.1.Final-redhat-1]
        at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:72) [jboss-as-ee.jar:7.1.2.Final-redhat-1]
        at org.ovirt.engine.core.common.businessentities.IVdsEventListener$$$view6.vdsMovedToMaintanance(Unknown Source)
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.AfterRefreshTreatment(VdsUpdateRunTimeInfo.java:336) [engine-vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.OnTimer(VdsManager.java:272) [engine-vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) [:1.7.0_09-icedtea]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_09-icedtea]
        at java.lang.reflect.Method.invoke(Method.java:601) [rt.jar:1.7.0_09-icedtea]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:64) [engine-scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz-2.1.2.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-2.1.2.jar:]
Comment 1 Omri Hochman 2012-11-07 06:53:54 EST
Attaching vdsm.log (clocks are in sync):
*********************************
Thread-78197::DEBUG::2012-11-06 10:18:18,556::BindingXMLRPC::171::vds::(wrapper) [10.35.160.85]
Thread-78197::DEBUG::2012-11-06 10:18:18,556::task::588::TaskManager.Task::(_updateState) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::moving from state init -> state preparing
Thread-78197::INFO::2012-11-06 10:18:18,556::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='020d9b34-265d-11e2-8865-441ea17336ee', hostID=20, scsiKey='020d9b34-265d-11e2-8865-441ea17336ee', remove=False, options=None)
Thread-78197::DEBUG::2012-11-06 10:18:18,557::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.020d9b34-265d-11e2-8865-441ea17336ee`ReqID=`7fa71204-6600-40b2-8171-51678108a690`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-78197::DEBUG::2012-11-06 10:18:18,557::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.020d9b34-265d-11e2-8865-441ea17336ee' for lock type 'exclusive'
Thread-78197::DEBUG::2012-11-06 10:18:18,558::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.020d9b34-265d-11e2-8865-441ea17336ee' is currently locked, Entering queue (11 in queue)
Thread-78197::DEBUG::2012-11-06 10:20:18,559::resourceManager::186::ResourceManager.Request::(cancel) ResName=`Storage.020d9b34-265d-11e2-8865-441ea17336ee`ReqID=`7fa71204-6600-40b2-8171-51678108a690`::Canceled request
Thread-78197::DEBUG::2012-11-06 10:20:18,559::resourceManager::705::ResourceManager.Owner::(acquire) 194eee42-7939-4ba5-af9e-ed75c8cec59d: request for 'Storage.020d9b34-265d-11e2-8865-441ea17336ee' timed out after '120.000000' seconds
Thread-78197::ERROR::2012-11-06 10:20:18,560::task::853::TaskManager.Task::(_setError) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 891, in disconnectStoragePool
    vars.task.getExclusiveLock(STORAGE, spUUID)
  File "/usr/share/vdsm/storage/task.py", line 1301, in getExclusiveLock
    self.resOwner.acquire(namespace, resName, resourceManager.LockType.exclusive, timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 706, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
Thread-78197::DEBUG::2012-11-06 10:20:18,560::task::872::TaskManager.Task::(_run) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::Task._run: 194eee42-7939-4ba5-af9e-ed75c8cec59d ('020d9b34-265d-11e2-8865-441ea17336ee', 20, '020d9b34-265d-11e2-8865-441ea17336ee', False) {} failed - stopping task
Thread-78197::DEBUG::2012-11-06 10:20:18,560::task::1199::TaskManager.Task::(stop) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::stopping in state preparing (force False)
Thread-78197::DEBUG::2012-11-06 10:20:18,561::task::978::TaskManager.Task::(_decref) Task=`194eee42-7939-4ba5-af9e-ed75c8cec59d`::ref 1 aborting True
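The timeout in this trace comes from an exclusive storage-pool lock that was already held: the request sat in a queue and was cancelled after 120 seconds. A toy Python model of that behaviour (illustrative names, much shorter timeout; not vdsm's actual resourceManager code):

```python
import threading

class ResourceTimeout(Exception):
    """Mirrors vdsm's se.ResourceTimeout."""

class ExclusiveResource:
    """Toy model of the behaviour in the log: 'Storage.<pool>' is
    held exclusively, later requests queue up, and a queued request
    fails with ResourceTimeout once the timeout elapses."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, timeout):
        # threading.Lock is not reentrant, so a second acquire
        # blocks until the timeout elapses, like the queued request.
        if not self._lock.acquire(timeout=timeout):
            raise ResourceTimeout("Resource timeout: ()")

    def release(self):
        self._lock.release()
```

In the logged run, another operation (the slow 'vgs' scan) held the pool lock, so disconnectStoragePool's request timed out exactly this way.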
Comment 2 Omri Hochman 2012-11-07 06:57:38 EST
Created attachment 639995 [details]
vdsm.log
Comment 4 mkublin 2012-11-19 05:16:12 EST
And what is the expected behaviour?
If the host did not succeed in disconnecting, I cannot move it to maintenance, because after that the host could be moved to another pool, and activating it would fail during connect to the new pool with the error 'host is connected to another pool'.
Comment 5 mkublin 2012-11-22 08:38:17 EST
After this commit, http://gerrit.ovirt.org/#/c/9110/, in the described situation the user will not be able to force-remove a problematic data center. The only solution will be to fix vdsm; if that is not possible, drop/create the DB schema.
Comment 6 Ayal Baron 2012-12-02 08:22:37 EST
Point is to move the host to maintenance even if it failed connecting.  If host is later moved and doesn't come up, then user needs to fence it, no big deal (as it's not running any VMs, etc).
Comment 7 mkublin 2012-12-02 08:37:01 EST
(In reply to comment #6)
> Point is to move the host to maintenance even if it failed connecting.  If
> host is later moved and doesn't come up, then user needs to fence it, no big
> deal (as it's not running any VMs, etc).

It failed to disconnect.
So what is the reason for the disconnectStoragePool verb, if I can move the host to maintenance anyway?
Comment 8 Ayal Baron 2012-12-02 15:38:47 EST
(In reply to comment #7)
> (In reply to comment #6)
> > Point is to move the host to maintenance even if it failed connecting.  If
> > host is later moved and doesn't come up, then user needs to fence it, no big
> > deal (as it's not running any VMs, etc).
> 
> It is failed to disconnect. 
> So what is a reason for verb disconnectStoragePool? If I usually can move
> host to maintenance?

And if it fails on a timeout?
The user wants to move the host to maintenance not because everything is OK, but because he wants the system to stop touching/monitoring the host.
If there are running VMs on the host then obviously this is not the case. Otherwise, it should succeed.
Regarding not sending the disconnect to begin with: that is only valid if you change InitVdsOnUp to disconnect the host if it is connected to the wrong pool.
Comment 9 mkublin 2012-12-03 02:29:47 EST
I want to be clear:
1. When a host fails to disconnect from the pool, I will move it to maintenance - this is an easy fix.
2. There is no obligation from the storage team for:
   a) cases where the host cannot be connected to a storage pool because it is already connected to another one.
   b) a host that is actually connected to the pool and running, with no indication in the engine management that the host even exists (any host in maintenance can be removed without any call to the host).
Comment 10 Barak 2012-12-06 07:55:20 EST
An idea:

What if we add the ability to SSH to the host and restart vdsm? Would that do the job?

Then we could use it on various occasions.
Comment 11 Ayal Baron 2012-12-06 08:53:45 EST
(In reply to comment #10)
> An idea:
> 
> What if we add additional ability to ssh to the host an restart vdsm , will
> that do the work ?
> 
> Than we can use it on various occations

And if restarting vdsm fails?
You're moving a *non* operational host to *maintenance*.
I would say: don't touch the host at all.
Just move it to maintenance.
On activation from maintenance, run connect; if it fails because the host is already connected, disconnect and try again. If that fails again, move the host to non-operational.
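The activation flow proposed in this comment could be sketched as follows (a hedged Python sketch; all names, including AlreadyConnectedError, are illustrative and not the engine's actual API):

```python
class AlreadyConnectedError(Exception):
    """Hypothetical error for 'host is connected to another pool'."""

def activate_from_maintenance(host, connect, disconnect):
    """Try to connect the host to the pool. If the host reports it
    is already connected elsewhere, disconnect and retry once; any
    further failure moves the host to non-operational."""
    try:
        connect(host)
        return "Up"
    except AlreadyConnectedError:
        try:
            disconnect(host)  # clear the stale connection
            connect(host)     # single retry
            return "Up"
        except Exception:
            return "NonOperational"
    except Exception:
        return "NonOperational"
```

The design choice here is to defer all storage cleanup to activation time, so moving a dead host to maintenance never has to touch the host at all.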
Comment 12 mkublin 2012-12-09 05:07:35 EST
(In reply to comment #11)
> (In reply to comment #10)
> > An idea:
> > 
> > What if we add additional ability to ssh to the host an restart vdsm , will
> > that do the work ?
> > 
> > Than we can use it on various occations
> 
> and if restart vdsm fails?
> you're moving a *non* operational host to *maintenance*
> I would say - don't touch the host at all.
> Just move to maintenance.
> In activate from maintenance run connect and if it fails because already
> connected, disconnect and try again.  If fails again, move to non-op.
I will not implement it that way; the number of possible bugs, corner cases, and races is too big.
As I said, I can implement this: when the host fails to disconnect, I will switch its status to Maintenance. The user should be aware that in some cases he will then need to restart the host manually, or that in some cases he will have a "ghost" host which is no longer in the engine DB but is still connected to the pool and running. If this is OK with the storage team I will implement it; if not, I will close the bug as working as designed.
Comment 13 mkublin 2012-12-16 09:13:17 EST
http://gerrit.ovirt.org/#/c/10109/ 
Patch. The behaviour will be: even if the disconnect fails, the host will be moved to Maintenance.
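In other words, the patched flow logs the disconnect failure instead of reverting the status. A minimal Python sketch of that behaviour (illustrative names only; the actual patch is in the Java engine code linked above):

```python
def move_to_maintenance_fixed(host, disconnect_storage_pool, log):
    """Post-fix flow per this comment and the Doc Text: a failed
    disconnect is logged, but the host stays in Maintenance in the
    engine database."""
    host.status = "Maintenance"
    try:
        disconnect_storage_pool(host)
    except Exception as err:
        # The host may still be physically connected to the pool;
        # the engine records it as Maintenance anyway.
        log("disconnectStoragePool failed: %s" % err)
    return host.status
```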
Comment 14 Ayal Baron 2012-12-17 01:51:50 EST
Barak suggested providing the ability to reboot the host through SSH to solve these issues...
Comment 15 Itamar Heim 2012-12-17 03:37:32 EST
(In reply to comment #14)
> Barak suggested to provide ability to reboot host through ssh to solve the
> issues...

why not via a vdsm verb?
Comment 16 Ayal Baron 2012-12-19 16:42:57 EST
(In reply to comment #15)
> (In reply to comment #14)
> > Barak suggested to provide ability to reboot host through ssh to solve the
> > issues...
> 
> why not via a vdsm verb?

Possible, but it would not be usable in cases where vdsm is not responding even though the host is.
Comment 17 Simon Grinberg 2012-12-24 11:25:53 EST
(In reply to comment #13)
> http://gerrit.ovirt.org/#/c/10109/ 
> Patch. The behaviour will be:
> Even if failed to Disconnect, host will be moved to maintenance.

I don't like the fact that a host in maintenance mode stays connected to storage; non-operational is the proper status. There are some types of operations that we only permit on a host in maintenance mode, such as re-install, and I have no idea what the side effects of a re-install are while the storage is connected.

On the other hand, Kaul said that due to auto-recovery, a host placed into non-operational afterwards toggles between non-operational and up and vice versa. That is even worse.

The solution to the above should come from the auto-recovery procedure, while the move to maintenance may be enforced by fencing the host.

Andy?
Comment 20 Ayal Baron 2012-12-30 15:32:32 EST
*** Bug 890824 has been marked as a duplicate of this bug. ***
Comment 26 Elad 2013-03-18 07:07:41 EDT
Verified on SF10.
I blocked connectivity between the HSM host and the master domain using iptables. The host became non-operational after 5 minutes. Moving the host to maintenance succeeded, and it remained in maintenance and did not become non-operational.
Comment 28 errata-xmlrpc 2013-06-10 17:18:46 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0888.html
