Bug 844350 - after failed connectStoragePool host doesn't move to non-operational
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
3.1.0
x86_64 Linux
unspecified Severity high
: ---
: 3.2.0
Assigned To: mkublin
Dafna Ron
infra
:
Depends On:
Blocks: 915537
 
Reported: 2012-07-30 08:07 EDT by Haim
Modified: 2016-02-10 14:26 EST (History)
9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine.log (49.71 KB, application/x-gzip)
2012-07-30 08:20 EDT, Haim

Description Haim 2012-07-30 08:07:48 EDT
Description of problem:

I have the following case where I'm able to run a vm on a host that cannot connect to the storage pool (and should therefore have been moved to non-operational).
The case goes as follows:

- create a cluster and a data-center
- add 2 hosts to it
- make sure one of the hosts is already connected to a different pool
- create a new storage domain
- attach the new storage domain to the new data-center; a new pool will be created (createStoragePool)
- connectStoragePool is sent to both hosts; in my case vdsm returned an error on one of them (see attached logs), but the host did not move to non-operational
- create a vm and a disk
- start the vm; it was started on the host which is not connected to the pool (a sketch of this scheduling gap follows below)
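
The core of the problem is that VM scheduling only looks at the host status, and the failed host is still reported as Up. A rough, purely illustrative Python sketch of that gap (hypothetical names, not the engine's actual scheduler, which is Java):

def pick_host_for_vm(hosts):
    # Simplified: the scheduler accepts any host whose status is Up;
    # nothing here checks whether the host actually joined the pool.
    for host in hosts:
        if host["status"] == "Up":
            return host
    return None

# nott-vds1 failed connectStoragePool but was never moved to NonOperational,
# so it is still a valid scheduling target and the vm can start on it.
hosts = [
    {"name": "nott-vds1", "status": "Up", "connected_to_pool": False},
    {"name": "nott-vds2", "status": "Up", "connected_to_pool": True},
]
print(pick_host_for_vm(hosts)["name"])  # -> nott-vds1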


engine-log (see the exception):

2012-07-30 16:48:26,680 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: b8423a1f-8889-469a-b2fa-39ab78ac3a57 Type: StoragePool
2012-07-30 16:48:26,806 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] hostFromVds::selectedVds - nott-vds1.qa.lab.tlv.redhat.com, spmStatus Unknown_Pool, storage pool iSCSI
2012-07-30 16:48:26,823 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] START, ConnectStoragePoolVDSCommand(vdsId = 78e0bf24-d979-11e1-9e8e-001a4a16970e, storagePoolId = b8423a1f-8889-469a-b2fa-39ab78ac3a57, vds_spm_id = 1, masterDomainId = 4f38a996-dbeb-4981-885b-742b46a4714f, masterVersion = 1), log id: 643f68c3
2012-07-30 16:48:26,874 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] FINISH, ConnectStoragePoolVDSCommand, log id: 643f68c3
2012-07-30 16:48:26,874 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] IrsBroker::Failed::ActivateStorageDomainVDS due to: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Could not connect host to Data Center(Storage issue)
2012-07-30 16:48:26,902 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] Irs placed on server null failed. Proceed Failover
2012-07-30 16:48:26,934 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] hostFromVds::selectedVds - nott-vds2.qa.lab.tlv.redhat.com, spmStatus Unknown_Pool, storage pool iSCSI
2012-07-30 16:48:26,942 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] START, ConnectStoragePoolVDSCommand(vdsId = 16cbcf58-d97a-11e1-bba6-001a4a16970e, storagePoolId = b8423a1f-8889-469a-b2fa-39ab78ac3a57, vds_spm_id = 2, masterDomainId = 4f38a996-dbeb-4981-885b-742b46a4714f, masterVersion = 1), log id: 37ad18f9
2012-07-30 16:48:30,227 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] FINISH, ConnectStoragePoolVDSCommand, log id: 37ad18f9
2012-07-30 16:48:30,257 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] hostFromVds::Connected host to pool - selectedVds - nott-vds2.qa.lab.tlv.redhat.com, spmStatus Free, storage pool iSCSI
2012-07-30 16:48:30,286 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] starting spm on vds nott-vds2.qa.lab.tlv.redhat.com, storage pool iSCSI, prevId -1, LVER -1
2012-07-30 16:48:30,290 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (ajp-/127.0.0.1:8009-4) [5aa80001] START, SpmStartVDSCommand(vdsId = 16cbcf58-d97a-11e1-bba6-001a4a16970e, storagePoolId = b8423a1f-8889-469a-b2fa-39ab78ac3a57, prevId=-1, prevLVER=-1, storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), log id: 21e59259

vdsm log:

Thread-20934::INFO::2012-07-30 16:47:11,238::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='b8423a1f-8889-469a-b2fa-39ab78ac3a57', hostID=1, scsiKey='b8423a1f-8889-469a-b2fa-39ab78ac3a57', msdUUID='4f38a996-dbeb-4981-885b-742b46a4714f', masterVersion=1, options=None)
Thread-20934::ERROR::2012-07-30 16:47:11,238::task::853::TaskManager.Task::(_setError) Task=`e1fdfa2e-e35b-4a2f-895f-a0e46db7740a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 814, in connectStoragePool
    return self._connectStoragePool(spUUID, hostID, scsiKey, msdUUID, masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 822, in _connectStoragePool
    raise se.CannotConnectMultiplePools(str(self.pools.keys()))
CannotConnectMultiplePools: Cannot connect pool, already connected to another pool: ("['77887fd1-3408-44ec-8de4-153948d89cdc']",)
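
The CannotConnectMultiplePools error comes from a guard in vdsm's hsm.py that refuses to join a second pool while the host is still connected to another one. A simplified sketch of that kind of check (illustrative only, not the exact vdsm code; the pools dict and exception name follow the traceback above):

class CannotConnectMultiplePools(Exception):
    # Raised when the host already belongs to a different storage pool.
    pass

class HsmSketch(object):
    def __init__(self):
        # spUUID -> pool object, for pools this host is currently connected to
        self.pools = {}

    def connectStoragePool(self, spUUID, hostID, msdUUID, masterVersion):
        # A host may only be connected to one pool at a time; nott-vds1 was
        # still connected to '77887fd1-...', so joining the new pool fails
        # instead of silently switching pools.
        if self.pools and spUUID not in self.pools:
            raise CannotConnectMultiplePools(str(list(self.pools.keys())))
        self.pools[spUUID] = object()  # placeholder for the real pool object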

create vm (engine log):

2012-07-30 16:50:08,088 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-4-thread-49) [12e6df6a] START, CreateVmVDSCommand(vdsId = 78e0bf24-d979-11e1-9e8e-001a4a16970e, vmId=0fdbfebb-8583-43df-8b11-78b7cd3e2372, vm=org.ovirt.engine.core.common.businessentities.VM@8ff6ae0), log id: 184adef5
2012-07-30 16:50:08,116 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-49) [12e6df6a] START, CreateVDSCommand(vdsId = 78e0bf24-d979-11e1-9e8e-001a4a16970e, vmId=0fdbfebb-8583-43df-8b11-78b7cd3e2372, vm=org.ovirt.engine.core.common.businessentities.VM@8ff6ae0), log id: 4d12a8c7
2012-07-30 16:50:08,229 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-49) [12e6df6a] org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand spiceSslCipherSuite=DEFAULT,memSize=512,kvmEnable=true,smp=1,emulatedMachine=rhel6.3.0,vmType=kvm,keyboardLayout=en-us,nice=0,display=qxl,smpCoresPerSocket=1,spiceSecureChannels=smain,sinputs,scursor,splayback,srecord,sdisplay,timeOffset=-43200,transparentHugePages=true,vmId=0fdbfebb-8583-43df-8b11-78b7cd3e2372,devices=[Ljava.util.Map;@238837e7,acpiEnable=true,vmName=vm,cpuType=Conroe,custom={}
2012-07-30 16:50:08,230 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-49) [12e6df6a] FINISH, CreateVDSCommand, log id: 4d12a8c7
2012-07-30 16:50:08,237 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-4-thread-49) [12e6df6a] IncreasePendingVms::CreateVmIncreasing vds nott-vds1.qa.lab.tlv.redhat.com pending vcpu count, now 1. Vm: vm
2012-07-30 16:50:08,252 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-4-thread-49) [12e6df6a] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 184adef5
2012-07-30 16:50:08,253 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-49) [12e6df6a] Lock freed to object EngineLock [exclusiveLocks= key: 0fdbfebb-8583-43df-8b11-78b7cd3e2372 value: VM
, sharedLocks= ]


create vm (vdsm log):

Thread-20997::DEBUG::2012-07-30 16:48:52,561::BindingXMLRPC::864::vds::(wrapper) client [10.35.97.14]::call vmCreate with ({'custom': {}, 'keyboardLayout': 'en-us', 'kvmEnable': 'true', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.3.0', 'vmId': '0fdbfebb-8583-43df-8b11-78b7cd3e2372', 'devices': [{'device': 'qxl', 'specParams': {'vram': '65536'}, 'type': 'video', 'deviceId': '9a204903-075d-4800-a809-3d4016ba80f1'}, {'index': '2', 'iface': 'ide', 'specParams': {'path': ''}, 'readonly': 'true', 'deviceId': 'aaf36377-48a7-4e5b-afef-cd9930058b66', 'path': '', 'device': 'cdrom', 'shared': 'false', 'type': 'disk'}, {'index': 0, 'iface': 'virtio', 'format': 'raw', 'optional': 'false', 'volumeID': 'b584bfb1-183b-4329-80d5-b43bebfa44bf', 'imageID': 'f619a935-0cf2-4e21-b031-0c9840476d81', 'specParams': {}, 'readonly': 'false', 'domainID': '4f38a996-dbeb-4981-885b-742b46a4714f', 'deviceId': 'f619a935-0cf2-4e21-b031-0c9840476d81', 'poolID': 'b8423a1f-8889-469a-b2fa-39ab78ac3a57', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}, {'nicModel': 'pv', 'macAddr': '00:1a:4a:23:61:67', 'network': 'rhevm', 'specParams': {}, 'deviceId': 'fe3eb7bb-b838-4c64-be53-79c6246fe913', 'device': 'bridge', 'type': 'interface'}, {'device': 'memballoon', 'specParams': {'model': 'virtio'}, 'type': 'balloon', 'deviceId': '6a29d58e-37cf-4294-b87b-ce40948f9433'}], 'smp': '1', 'vmType': 'kvm', 'timeOffset': '-43200', 'memSize': 512, 'spiceSslCipherSuite': 'DEFAULT', 'cpuType': 'Conroe', 'spiceSecureChannels': 'smain,sinputs,scursor,splayback,srecord,sdisplay', 'smpCoresPerSocket': '1', 'vmName': 'vm', 'display': 'qxl', 'transparentHugePages': 'true', 'nice': '0'},) {} flowID [12e6df6a]
Thread-20997::INFO::2012-07-30 16:48:52,579::API::579::vds::(_getNetworkIp) network None: using 0
Thread-20997::INFO::2012-07-30 16:48:52,580::API::216::vds::(create) vmContainerLock acquired by vm 0fdbfebb-8583-43df-8b11-78b7cd3e2372
Thread-20998::DEBUG::2012-07-30 16:48:52,584::vm::564::vm.Vm::(_startUnderlyingVm) vmId=`0fdbfebb-8583-43df-8b11-78b7cd3e2372`::Start
Thread-20997::DEBUG::2012-07-30 16:48:52,585::API::229::vds::(create) Total desktops after creation of 0fdbfebb-8583-43df-8b11-78b7cd3e2372 is 1
Thread-20998::DEBUG::2012-07-30 16:48:52,585::vm::568::vm.Vm::(_startUnderlyingVm) vmId=`0fdbfebb-8583-43df-8b11-78b7cd3e2372`::_ongoingCreations acquired
Thread-20997::DEBUG::2012-07-30 16:48:52,586::BindingXMLRPC::870::vds::(wrapper) return vmCreate with {'status': {'message': 'Done', 'code': 0}, 'vmList': {'status': 'WaitForLaunch', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.3.0', 'vmId': '0fdbfebb-8583-43df-8b11-78b7cd3e2372', 'pid': '0', 'timeOffset': '-43200', 'displayPort': '-1', 'displaySecurePort': '-1', 'spiceSslCipherSuite': 'DEFAULT', 'cpuType': 'Conroe', 'custom': {}, 'clientIp': '', 'nicModel': 'rtl8139,pv', 'keyboardLayout': 'en-us', 'kvmEnable': 'true', 'transparentHugePages': 'true', 'devices': [{'device': 'qxl', 'specParams': {'vram': '65536'}, 'type': 'video', 'deviceId': '9a204903-075d-4800-a809-3d4016ba80f1'}, {'index': '2', 'iface': 'ide', 'specParams': {'path': ''}, 'readonly': 'true', 'deviceId': 'aaf36377-48a7-4e5b-afef-cd9930058b66', 'path': '', 'device': 'cdrom', 'shared': 'false', 'type': 'disk'}, {'index': 0, 'iface': 'virtio', 'format': 'raw', 'optional': 'false', 'volumeID': 'b584bfb1-183b-4329-80d5-b43bebfa44bf', 'imageID': 'f619a935-0cf2-4e21-b031-0c9840476d81', 'specParams': {}, 'readonly': 'false', 'domainID': '4f38a996-dbeb-4981-885b-742b46a4714f', 'deviceId': 'f619a935-0cf2-4e21-b031-0c9840476d81', 'poolID': 'b8423a1f-8889-469a-b2fa-39ab78ac3a57', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}, {'nicModel': 'pv', 'macAddr': '00:1a:4a:23:61:67', 'network': 'rhevm', 'specParams': {}, 'deviceId': 'fe3eb7bb-b838-4c64-be53-79c6246fe913', 'device': 'bridge', 'type': 'interface'}, {'device': 'memballoon', 'specParams': {'model': 'virtio'}, 'type': 'balloon', 'deviceId': '6a29d58e-37cf-4294-b87b-ce40948f9433'}], 'smp': '1', 'vmType': 'kvm', 'memSize': 512, 'displayIp': '0', 'spiceSecureChannels': 'smain,sinputs,scursor,splayback,srecord,sdisplay', 'smpCoresPerSocket': '1', 'vmName': 'vm', 'display': 'qxl', 'nice': '0'}}
Comment 1 Haim 2012-07-30 08:20:34 EDT
Created attachment 601245 [details]
engine.log
Comment 2 mkublin 2012-08-06 08:15:44 EDT
Actually, I don't know how to solve this bug; every mechanism worked as designed.
Let me explain:
The engine runs AddStoragePoolWithStorages:
1. We try to connect the storage to a vdsm; we choose a good vdsm and succeed.
2. We run createStoragePool, again on a good vdsm, and succeed.
3. Stop, why not? The pool is created and vdsm is connected to the storage.
4. Now we try ActivateStorageDomain. This is an SPM operation, so we choose an SPM; this time it is the bad host and we fail, but we try again, choose a good vdsm as SPM, and succeed.
This is the design of SPM election. The domain was activated.

So it is possible that in the future the vdsm will be moved to NonOperational or NonResponsive, but only after 5 minutes, and in the meantime Haim managed to run a vm.

Everything described above worked as designed, maybe badly, but I don't see anything I can do.
The bug is also not related to the locking mechanism.
Actually, this bug is a good example of a scenario where, almost for the first time, all the different mechanisms worked as designed, and we are still not happy.
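
As an illustration of the SPM election described above: the engine iterates over candidate hosts and fails over to the next one when ConnectStoragePool fails, and nothing in this loop touches the status of the host that failed. A rough Python sketch (the real engine code is Java; the names here are illustrative):

def elect_spm(hosts, connect_storage_pool, start_spm):
    # Try each candidate host; fail over on connect errors.
    for host in hosts:
        try:
            connect_storage_pool(host)   # ConnectStoragePoolVDSCommand
        except Exception:
            # "Irs placed on server ... failed. Proceed Failover" - the
            # failed host is skipped but NOT marked NonOperational.
            continue
        if start_spm(host):              # SpmStartVDSCommand
            return host                  # the domain is then activated
    return None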
Comment 3 Ayal Baron 2012-08-07 03:50:45 EDT
(In reply to comment #2)
> Actually, I don't know how to solve this bug; every mechanism worked as designed.
> Let me explain:
> The engine runs AddStoragePoolWithStorages:
> 1. We try to connect the storage to a vdsm; we choose a good vdsm and succeed.
> 2. We run createStoragePool, again on a good vdsm, and succeed.
> 3. Stop, why not? The pool is created and vdsm is connected to the storage.
> 4. Now we try ActivateStorageDomain. This is an SPM operation, so we choose an SPM; this time it is the bad host and we fail, but we try again, choose a good vdsm as SPM, and succeed.
> This is the design of SPM election. The domain was activated.
> 
> So it is possible that in the future the vdsm will be moved to NonOperational or NonResponsive, but only after 5 minutes, and in the meantime Haim managed to run a vm.
> 
> Everything described above worked as designed, maybe badly, but I don't see anything I can do.
> The bug is also not related to the locking mechanism.
> Actually, this bug is a good example of a scenario where, almost for the first time, all the different mechanisms worked as designed, and we are still not happy.

This has nothing to do with the problem.
Problem is very simple - a host should not be 'up' in a pool until it is successfully configured to be part of it. This means that InitVdsOnUp (or whatever) needs to run on it once it is added to the data center and connected to the correct pool. Once this happens, getVdsStats needs to run, and only after we see that the host is connected to the correct pool and that the storage domain monitoring is ok should the host move to 'up'. Once we do this, many problems, including this one, will disappear.
This has nothing to do with storage per se.
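
A sketch of the gating proposed above, using hypothetical names (this is not existing engine code): the host would only reach Up after the engine has verified that it is connected to the correct pool and that its storage domain monitoring is ok.

def decide_host_status(host, pool, connect_storage_pool, get_vds_stats):
    # Proposed InitVdsOnUp-style flow: verify the host before it goes Up.
    try:
        connect_storage_pool(host, pool)        # must join the correct pool
    except Exception:
        return "NonOperational"

    stats = get_vds_stats(host)                 # then poll the host
    if stats["connected_pool"] != pool or not stats["domains_ok"]:
        return "NonOperational"                 # wrong pool or monitoring not ok
    return "Up"                                 # only now is the host schedulable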
Comment 4 mkublin 2012-08-07 04:18:14 EDT
(In reply to comment #3)
> (In reply to comment #2)
> This has nothing to do with the problem.
> Problem is very simple - a host should not be 'up' in a pool until it is successfully configured to be part of it. This means that InitVdsOnUp (or whatever) needs to run on it once it is added to the data center and connected to the correct pool. Once this happens, getVdsStats needs to run, and only after we see that the host is connected to the correct pool and that the storage domain monitoring is ok should the host move to 'up'. Once we do this, many problems, including this one, will disappear.
> This has nothing to do with storage per se.

I think you are not aware that I have no influence on InitVdsOnUp; it runs when it runs, and this is also by design. Also, why should the vdsm not be up? It is already connected to the wrong pool.
Comment 6 Ayal Baron 2012-09-12 04:57:30 EDT
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > This has nothing to do with the problem.
> > Problem is very simple - a host should not be 'up' in a pool until it is successfully configured to be part of it. This means that InitVdsOnUp (or whatever) needs to run on it once it is added to the data center and connected to the correct pool. Once this happens, getVdsStats needs to run, and only after we see that the host is connected to the correct pool and that the storage domain monitoring is ok should the host move to 'up'. Once we do this, many problems, including this one, will disappear.
> > This has nothing to do with storage per se.
> 
> I think you are not aware that I have no influence on InitVdsOnUp; it runs when it runs, and this is also by design. Also, why should the vdsm not be up? It is already connected to the wrong pool.

This is not a 'design'; it's an implementation detail, and it's a bug.
A host should not move to 'up' until the engine has verified that everything is ok with it, period.

This has nothing to do with storage flows; anything that would cause the host to become non-operational later would cause the same behaviour.
Comment 7 Barak 2012-09-24 05:08:25 EDT
Discussed this BZ with Ayal, Kublin & Yair:

1 - This scenario is rare (having a host up while it is connected to a different storage pool).
2 - When moving a host between clusters it must be in maintenance, so InitVdsOnUp will be called and fix the status (this is the usual scenario).
3 - It is very late to touch such a delicate piece of code.

Moving to rhevm_future.
Comment 8 mkublin 2012-12-03 05:24:20 EST
This should be fixed by the following commit:
http://gerrit.ovirt.org/#/c/9566/

The host will not be moved to UP if it failed to connect to the storage pool.
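
A minimal sketch of the behaviour after the fix, in the spirit of the verification log in comment 9 (illustrative Python only; the actual change in gerrit 9566 is in the engine's Java code):

import logging

def refresh_pool_hosts(hosts, pool, connect_storage_pool, set_status):
    # After the fix: a host that fails connectStoragePool is not left Up.
    for host in hosts:
        try:
            connect_storage_pool(host, pool)
            set_status(host, "Up")
        except Exception:
            logging.error("Could not connect vds %s to pool %s - moving host "
                          "to non-operational", host, pool)
            set_status(host, "NonOperational")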
Comment 9 Dafna Ron 2013-02-05 05:36:14 EST
Verified on sf5.

Added a new pool while blocking one of the hosts from reaching the storage using iptables.
The host became non-operational as soon as SPM contending finished.


2013-02-05 12:32:27,034 ERROR [org.ovirt.engine.core.bll.storage.RefreshPoolSingleAsyncOperation] (pool-3-thread-41) [50303fbd] Could not connect vds gold-vdsc to pool DC - moving host to non-operational
Comment 10 Itamar Heim 2013-06-11 04:35:16 EDT
3.2 has been released
Comment 11 Itamar Heim 2013-06-11 04:35:16 EDT
3.2 has been released
Comment 12 Itamar Heim 2013-06-11 04:35:37 EDT
3.2 has been released
Comment 13 Itamar Heim 2013-06-11 04:44:07 EDT
3.2 has been released
