Bug 1032925 - Domain monitoring stops after disconnecting and reconnecting storage pool leading to bogus "No space left" error
Domain monitoring stops after disconnecting and reconnecting storage pool lea...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm (Show other bugs)
3.3.0
Unspecified Unspecified
unspecified Severity high
: ---
: 3.3.0
Assigned To: Liron Aravot
Aharon Canan
storage
: AutomationBlocker, Regression, TestBlocker
: 1024018 1024269 1035822 (view as bug list)
Depends On:
Blocks: 3.3rc1
  Show dependency treegraph
 
Reported: 2013-11-21 04:17 EST by Ilanit Stein
Modified: 2016-02-10 12:02 EST (History)
19 users (show)

See Also:
Fixed In Version: is27
Doc Type: Bug Fix
Doc Text:
Domain monitoring stopped after disconnecting and reconnecting the storage pool, which yielded misleading "No space left" errors. This was caused by an anonymous thread running at unexpected times, and silently stopping all domain monitors. Now, the monitoring is stopped when appropriate, so users do not receive incorrect storage space limitation errors.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 11:22:02 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (3.21 MB, application/x-bzip2)
2013-11-21 04:21 EST, Ilanit Stein
no flags Details
pg_dump (156.56 KB, application/zip)
2013-11-26 10:12 EST, Ilanit Stein
no flags Details
pg dump unzipped (1.10 MB, text/x-vhdl)
2013-12-01 03:37 EST, Ilanit Stein
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 22058 None None None Never

  None (edit)
Description Ilanit Stein 2013-11-21 04:17:51 EST
Description of problem:
Create VM pool sometimes fail on  MISSED_STORAGES_FOR_SOME_DISKS error.

Version-Release number of selected component (if applicable):
is23.1

How reproducible:
sometimes, during a test of reg VMs:

ERROR on engine.log:

2013-11-20 13:12:59,184 - MainThread - plmanagement.error_fetcher - ERROR - Errors fetched from VDC(jenkins-automation-rpm-vm29.eng.lab.tlv.redhat.com): 2013-11-20 13:12:58,947 ERROR [org.ovirt.engine.core.bll.AddVmPoolWithVmsCommand] (ajp-/127.0.0.1:8702-6) [ae32cdf] Can not found any default active domain for one of the disks of template with id : 24f9ad07-6315-4ae6-8aba-b95ddd2dc6b2
2013-11-20 13:12:58,947 WARN  [org.ovirt.engine.core.bll.AddVmPoolWithVmsCommand] (ajp-/127.0.0.1:8702-6) [ae32cdf] CanDoAction of action AddVmPoolWithVms failed. Reasons:VAR__TYPE__DESKTOP_POOL,VAR__ACTION__CREATE,ACTION_TYPE_FAILED_MISSED_STORAGES_FOR_SOME_DISKS
2013-11-20 13:12:58,952 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-6) Operation Failed: [Cannot create VM-Pool. One or more provided storage domains are in maintenance/non-operational status.]

Steps to Reproduce (flow in regression VMs test, which fail):

1. Create a VM, add a disk
2. Create a template from it
3. Update template name \description\ cpu to: name='rest_templ3', description='Update template', cpu_socket='2', cpu_cores='2'
4. Create a VM pool from this template

Failing test link:
http://jenkins-ci.eng.lab.tlv.redhat.com/job/rhevm_3.3_automation_coretools_two_hosts_restapi_vms_nfs_rest_factory/48
Comment 1 Ilanit Stein 2013-11-21 04:21:06 EST
Created attachment 827086 [details]
logs
Comment 4 Liron Aravot 2013-11-25 08:57:46 EST
The given message is being when various issues related to the storage domain of the template disks are being encountered by the engine - regardless of this issue it's bad and should be fixed so we can know explictly what's the error and not have general message for few errors.

In that case, it appears that the error is free space related,
I suspect that the issue was caused by a vdsm bug that was handled here:
http://gerrit.ovirt.org/#/c/21490/

Ilanit, please check what are the disks sizes that appear in the engine in that case so we can know if that's the issue.

Regardless, the error messages propogation issue should be handled, but that's possibly 3.4 bug (Allon?).
Comment 5 Liron Aravot 2013-11-25 11:04:41 EST
Additional info : http://gerrit.ovirt.org/#/c/21490/ shouldn't be relevant as this issue was present only upstream.
Comment 6 Liron Aravot 2013-11-25 18:40:24 EST
Ilanit, if you can - please attach a pg_dump of the db when this scenario occurs so i can look exactly on the current state.
thanks.
Comment 7 Ilanit Stein 2013-11-26 10:12:42 EST
Created attachment 829320 [details]
pg_dump
Comment 8 Liron Aravot 2013-11-28 10:01:38 EST
Ilanit, the pg_dump file is corrupted - therefore i can't see the creation of the template/disk sizes/etc.

I tested it manually on 3.3 and it seems to work fine (Atleast the described scenario)

please try to reproduce it and attach again or if you come to clear reproduction steps it'll be even better.
thanks.
Comment 9 Ilanit Stein 2013-12-01 03:37:42 EST
Created attachment 831130 [details]
pg dump unzipped
Comment 10 Sergey Gotliv 2013-12-01 08:08:12 EST
*** Bug 1035822 has been marked as a duplicate of this bug. ***
Comment 11 Ilanit Stein 2013-12-01 08:52:56 EST
The "create vmpool" failure, might have same root cause, 
as another failure in this same test, "clone VM from template". 
In this link both failures occur: http://jenkins-ci.eng.lab.tlv.redhat.com/job/rhevm_3.3_automation_coretools_two_hosts_restapi_vms_nfs_rest_factory/72/

amureini is on that.
Comment 12 Liron Aravot 2013-12-01 09:05:11 EST
I've looked into the db dump/vdsm/engine logs -

I'll provide here short analyses and then will add the relevant log parts:
The host cinteg14 becomes the spm, getStoragePoolInfo returns us the info including free space on each domain. Then a copy image operation is being started on that host (copying the image within the same domain) - we stop the domain monitoring on all the domain (which should be inspected) and later on fail to stop it on the domain the image is copied within - this will lead to GetStoragePoolInfo to return from now without the diskfree key indicating the free space and to not have repoStats on any of the domains.

In the engine, the response of GetStoragePoolInfo is updated to the engine db, as now the available_disk_size should be 0/null, we will get CDA messages for insufficient space (till the domains will move to inactive).

Therefore i'm moving the bug to the vdsm as it's the main issue here -
in the engine logs could be added in case that the free space isn't retrieved from vdsm, but that's regardless.




Some related log parts:

(1) note this reponse of getStoragePoolInfo, it contains the "diskfree" key for each of the domains

--------------------------------------------------------------------------
Thread-1496::INFO::2013-11-20 13:11:05,641::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 1, 'master_uuid': '
9502a003-6e7e-496e-97c0-225d45b476c1', 'name': 'DefaultRest', 'version': '3', 'domains': '8981cbaa-732a-43cd-b143-00c71734458b:Active,9502a003-6e7e-496e-97c0-225d45b476c1:Ac
tive,7df91534-49bd-499a-bb59-9c26b87b7f2b:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'NFS', 'master_ver': 1, 'lver': 3}, 'dominfo': {'8981cbaa-732a-43cd-b
143-00c71734458b': {'status': 'Active', 'diskfree': '1325691109376', 'isoprefix': '', 'alerts': [], 'disktotal': '1460288880640', 'version': 0}, '9502a003-6e7e-496e-97c0-225
d45b476c1': {'status': 'Active', 'diskfree': '1325691109376', 'isoprefix': '', 'alerts': [], 'disktotal': '1460288880640', 'version': 3}, '7df91534-49bd-499a-bb59-9c26b87b7f
2b': {'status': 'Active', 'diskfree': '1325691109376', 'isoprefix': '', 'alerts': [], 'disktotal': '1460288880640', 'version': 3}}}
Thread-1496::DEBUG::2013-11-20 13:11:05,641::task::1168::TaskManager.Task::(prepare) Task=`6f773615-784c-45cd-9046-e8b4b2b03289`::finished: {'info': {'spm_id': 1, 'master_uu
id': '9502a003-6e7e-496e-97c0-225d45b476c1', 'name': 'DefaultRest', 'version': '3', 'domains': '8981cbaa-732a-43cd-b143-00c71734458b:Active,9502a003-6e7e-496e-97c0-225d45b47
6c1:Active,7df91534-49bd-499a-bb59-9c26b87b7f2b:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'NFS', 'master_ver': 1, 'lver': 3}, 'dominfo': {'8981cbaa-732a-
43cd-b143-00c71734458b': {'status': 'Active', 'diskfree': '1325691109376', 'isoprefix': '', 'alerts': [], 'disktotal': '1460288880640', 'version': 0}, '9502a003-6e7e-496e-97
c0-225d45b476c1': {'status': 'Active', 'diskfree': '1325691109376', 'isoprefix': '', 'alerts': [], 'disktotal': '1460288880640', 'version': 3}, '7df91534-49bd-499a-bb59-9c26
b87b7f2b': {'status': 'Active', 'diskfree': '1325691109376', 'isoprefix': '', 'alerts': [], 'disktotal': '1460288880640', 'version': 3}}}

--------------------------------------------------------------------------


2. then  a copy image operation is being started
--------------------------------------------------------------------------
Thread-1498::INFO::2013-11-20 13:11:09,385::logUtils::44::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='9502a003-6e7e-496e-97c0-225d45b476c1', spUUID='1d695907-39
24-4eb7-9484-812d82023213', vmUUID='', srcImgUUID='cf92af3b-317b-433a-a374-eef1c2ed4307', srcVolUUID='bba48cad-35d1-4f79-8e8b-5034ca6949ce', dstImgUUID='06a2c384-a49c-4c67-9
110-6df59ab3b658', dstVolUUID='897b2c2d-680d-4122-bc53-d40c84b28e18', description='Active VM', dstSdUUID='9502a003-6e7e-496e-97c0-225d45b476c1', volType=6, volFormat=4, prea
llocate=2, postZero='false', force='false')

Thread-1498::DEBUG::2013-11-20 13:11:09,448::taskManager::68::TaskManager::(scheduleJob) scheduled job copyImage_06a2c384-a49c-4c67-9110-6df59ab3b658 for task 6828283e-958c-
41a5-8dd0-b215bfab904d 
Thread-1498::INFO::2013-11-20 13:11:09,448::logUtils::47::dispatcher::(wrapper) Run and protect: copyImage, Return response: None

--------------------------------------------------------------------------

3. the domain monitors are being stopped for the domains for some reason (need to see if that's related or not)

--------------------------------------------------------------------------
Thread-833::DEBUG::2013-11-20 13:11:09,466::domainMonitor::169::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 8981cbaa-732a-43cd-b143-00c71734458b
Thread-833::DEBUG::2013-11-20 13:11:09,467::clusterlock::86::SafeLease::(releaseHostId) Host id for domain 8981cbaa-732a-43cd-b143-00c71734458b released successfully (id: 1)
6828283e-958c-41a5-8dd0-b215bfab904d::DEBUG::2013-11-20 13:11:09,467::task::736::TaskManager.Task::(_save) Task=`6828283e-958c-41a5-8dd0-b215bfab904d`::_save: orig /rhev/dat
a-center/1d695907-3924-4eb7-9484-812d82023213/mastersd/master/tasks/6828283e-958c-41a5-8dd0-b215bfab904d temp /rhev/data-center/1d695907-3924-4eb7-9484-812d82023213/mastersd
/master/tasks/6828283e-958c-41a5-8dd0-b215bfab904d.temp
Thread-834::DEBUG::2013-11-20 13:11:09,468::domainMonitor::169::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 7df91534-49bd-499a-bb59-9c26b87b7f2b
Thread-834::INFO::2013-11-20 13:11:09,468::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 7df91534-49bd-499a-bb59-9c26b87b7f2b (id: 1)
6828283e-958c-41a5-8dd0-b215bfab904d::DEBUG::2013-11-20 )
--------------------------------------------------------------------------

4. the image is being copied:

--------------------------------------------------------------------------
6828283e-958c-41a5-8dd0-b215bfab904d::INFO::2013-11-20 13:11:09,510::image::645::Storage.Image::(copy) sdUUID=9502a003-6e7e-496e-97c0-225d45b476c1 vmUUID= srcImgUUID=cf92af3
b-317b-433a-a374-eef1c2ed4307 srcVolUUID=bba48cad-35d1-4f79-8e8b-5034ca6949ce dstImgUUID=06a2c384-a49c-4c67-9110-6df59ab3b658 dstVolUUID=897b2c2d-680d-4122-bc53-d40c84b28e18
 dstSdUUID=9502a003-6e7e-496e-97c0-225d45b476c1 volType=6 volFormat=COW preallocate=SPARSE force=False postZero=False
--------------------------------------------------------------------------


5.  we attempt to stop the domain monitoring thread of the domain the image is copied within and fail -
--------------------------------------------------------------------------
Thread-838::DEBUG::2013-11-20 13:11:11,470::domainMonitor::169::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 9502a003-6e7e-496e-97c0-225d45b476c1
Thread-838::INFO::2013-11-20 13:11:11,470::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 9502a003-6e7e-496e-97c0-225d45b476c1 (id: 1)
Thread-838::DEBUG::2013-11-20 13:11:11,470::domainMonitor::178::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain 9502a003-6e7e-496e-97c0-225d45b476c1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 175, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 461, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 204, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('9502a003-6e7e-496e-97c0-225d45b476c1', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
Thread-1503::DEBUG::2013-11-20 13:11:15,662::BindingXMLRPC::177::vds::(wrapper) client [10.35.19.10]
--------------------------------------------------------------------------


6. Later on GetStoragePoolInfo request will return without freespaceinfo - it can be compared with (1), diskFree key is missing from the reponse..
--------------------------------------------------------------------------
Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 1, 'master_uuid': '9502a003-6e7e-496e-97c0-225d45b476c1', 'name': 'DefaultRest', 'version': '3', 'domains': '8981cbaa-732a-43cd-b143-00c71734458b:Active,9502a003-6e7e-496e-97c0-225d45b476c1:Active,7df91534-49bd-499a-bb59-9c26b87b7f2b:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'NFS', 'master_ver': 1, 'lver': 3}, 'dominfo': {'8981cbaa-732a-43cd-b143-00c71734458b': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '9502a003-6e7e-496e-97c0-225d45b476c1': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '7df91534-49bd-499a-bb59-9c26b87b7f2b': {'status': 'Active', 'isoprefix': '', 'alerts': []}}}
Thread-1504::DEBUG::2013-11-20 13:11:15,673::task::1168::TaskManager.Task::(prepare) Task=`5a807298-58d6-4cde-8e83-8a78d343df15`::finished: {'info': {'spm_id': 1, 'master_uuid': '9502a003-6e7e-496e-97c0-225d45b476c1', 'name': 'DefaultRest', 'version': '3', 'domains': '8981cbaa-732a-43cd-b143-00c71734458b:Active,9502a003-6e7e-496e-97c0-225d45b476c1:Active,7df91534-49bd-499a-bb59-9c26b87b7f2b:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'NFS', 'master_ver': 1, 'lver': 3}, 'dominfo': {'8981cbaa-732a-43cd-b143-00c71734458b': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '9502a003-6e7e-496e-97c0-225d45b476c1': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '7df91534-49bd-499a-bb59-9c26b87b7f2b': {'status': 'Active', 'isoprefix': '', 'alerts': []}}}

Furthermore, we don't have repostats anymore - which will cause all the domains to go inactive.
Thread-1557::INFO::2013-11-20 13:12:10,656::logUtils::44::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-1557::INFO::2013-11-20 13:12:10,656::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {}
Comment 13 Liron Aravot 2013-12-01 09:18:48 EST
Nir, can you take a look? it's possibly a regression/duplicate.
Comment 14 Nir Soffer 2013-12-01 14:15:49 EST
Federico, do you have any idea why releaseHostId is failing here?

13:11:11,470::domainMonitor::178::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain 9502a003-6e7e-496e-97c0-225d45b476c1
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 175, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 461, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 204, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('9502a003-6e7e-496e-97c0-225d45b476c1', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
Comment 15 Nir Soffer 2013-12-01 15:43:15 EST
(In reply to Liron Aravot from comment #12)
> 
> 3. the domain monitors are being stopped for the domains for some reason
> (need to see if that's related or not)
> 

I could not find why they were stopped - please apply http://gerrit.ovirt.org/21904 
when you run the test again.
Comment 16 Nir Soffer 2013-12-02 16:35:06 EST
The gerrit patch is not a bug fix but a debugging patch.
Comment 17 Nir Soffer 2013-12-04 12:47:00 EST
After adding more logging we understand the problem.

1. Storage pool is disconnected

Thread-762::INFO::2013-12-04 18:49:25,262::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='7b7cf016-bb4c-44d4-aceb-7454d9680064', hostID=1, scsiKey='7b7cf016-bb4c-44d4-aceb-745
4d9680064', remove=False, options=None)

2. Domain monitors are stopped

Thread-762::INFO::2013-12-04 18:49:25,265::sp::726::Storage.StoragePool::(disconnect) Disconnect from the storage pool 7b7cf016-bb4c-44d4-aceb-7454d9680064
Thread-762::INFO::2013-12-04 18:49:25,266::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring ea0101ca-8a0e-4ff9-9032-98caa3a02825
Thread-68::DEBUG::2013-12-04 18:49:25,267::domainMonitor::174::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for ea0101ca-8a0e-4ff9-9032-98caa3a02825
Thread-68::INFO::2013-12-04 18:49:25,267::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain ea0101ca-8a0e-4ff9-9032-98caa3a02825 (id: 1)
Thread-68::DEBUG::2013-12-04 18:49:27,267::clusterlock::207::SANLock::(releaseHostId) Host id for domain ea0101ca-8a0e-4ff9-9032-98caa3a02825 released successfully (id: 1)
Thread-762::INFO::2013-12-04 18:49:27,268::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6
Thread-149::DEBUG::2013-12-04 18:49:27,269::domainMonitor::174::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6
Thread-149::INFO::2013-12-04 18:49:27,269::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6 (id: 1)

3. Storage pool is connected again

Thread-771::INFO::2013-12-04 18:51:21,136::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='7b7cf016-bb4c-44d4-aceb-7454d9680064', hostID=1, scsiKey='7b7cf016-bb4c-44d4-aceb-7454d9680064', msdUUID='ea0101ca-8a0e-4ff9-9032-98caa3a02825', masterVersion=1, options=None)

4. Storage pool starts monitoring domains

Thread-771::INFO::2013-12-04 18:51:25,305::domainMonitor::100::Storage.DomainMonitor::(startMonitoring) Start monitoring ea0101ca-8a0e-4ff9-9032-98caa3a02825
Thread-775::DEBUG::2013-12-04 18:51:25,305::domainMonitor::164::Storage.DomainMonitorThread::(_monitorLoop) Starting domain monitor for ea0101ca-8a0e-4ff9-9032-98caa3a02825
Thread-771::DEBUG::2013-12-04 18:51:25,305::sp::1577::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `7b7cf016-bb4c-44d4-aceb-7454d9680064` started monitoring domain `ea0101ca-8a0e-4ff9-9032-98caa3a02825`
Thread-771::INFO::2013-12-04 18:51:25,306::domainMonitor::100::Storage.DomainMonitor::(startMonitoring) Start monitoring 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6
Thread-771::DEBUG::2013-12-04 18:51:25,308::sp::1577::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `7b7cf016-bb4c-44d4-aceb-7454d9680064` started monitoring domain `61ac6bae-79ba-4e79-a05a-dde38e0aa1c6`
Thread-776::DEBUG::2013-12-04 18:51:25,308::domainMonitor::164::Storage.DomainMonitorThread::(_monitorLoop) Starting domain monitor for 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6
Thread-771::INFO::2013-12-04 18:51:25,310::domainMonitor::100::Storage.DomainMonitor::(startMonitoring) Start monitoring 29f87f6b-e39b-4215-a66d-bdc4a8ec5afe
Thread-771::DEBUG::2013-12-04 18:51:25,313::sp::1577::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `7b7cf016-bb4c-44d4-aceb-7454d9680064` started monitoring domain `29f87f6b-e39b-4215-a66d-bdc4a8ec5afe`

5. Storage pool desctructor is invoked

3b41b49b-1ae4-4109-beef-97d700ea94f1::INFO::2013-12-04 19:03:10,363::sp::156::Storage.StoragePool::(__del__) Storage Pool 7b7cf016-bb4c-44d4-aceb-7454d9680064 starting cleanup thread

6. Storage pool cleanup tread running, stopping all domain monitors

Thread-1426::INFO::2013-12-04 19:03:10,365::sp::150::Storage.StoragePool::(run) Storage Pool 7b7cf016-bb4c-44d4-aceb-7454d9680064 cleanup thread started
3b41b49b-1ae4-4109-beef-97d700ea94f1::DEBUG::2013-12-04 19:03:10,366::task::579::TaskManager.Task::(_updateState) Task=`3b41b49b-1ae4-4109-beef-97d700ea94f1`::moving from state queued -> state running
Thread-1426::INFO::2013-12-04 19:03:10,366::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring ea0101ca-8a0e-4ff9-9032-98caa3a02825
Thread-775::DEBUG::2013-12-04 19:03:10,367::domainMonitor::174::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for ea0101ca-8a0e-4ff9-9032-98caa3a02825
Thread-775::INFO::2013-12-04 19:03:10,368::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain ea0101ca-8a0e-4ff9-9032-98caa3a02825 (id: 1)
3b41b49b-1ae4-4109-beef-97d700ea94f1::DEBUG::2013-12-04 19:03:10,369::task::736::TaskManager.Task::(_save) Task=`3b41b49b-1ae4-4109-beef-97d700ea94f1`::_save: orig /rhev/data-center/7b7cf016-bb4c-44d4-aceb-7454d9680064/mastersd/master/tasks/3b41b49b-1ae4-4109-beef-97d700ea94f1 temp /rhev/data-center/7b7cf016-bb4c-44d4-aceb-7454d9680064/mastersd/master/tasks/3b41b49b-1ae4-4109-beef-97d700ea94f1.temp
Thread-775::DEBUG::2013-12-04 19:03:10,370::domainMonitor::183::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain ea0101ca-8a0e-4ff9-9032-98caa3a02825
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 180, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 461, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 204, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('ea0101ca-8a0e-4ff9-9032-98caa3a02825', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
Thread-1426::INFO::2013-12-04 19:03:10,372::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6
Thread-776::DEBUG::2013-12-04 19:03:10,372::domainMonitor::174::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6
Thread-776::INFO::2013-12-04 19:03:10,373::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 61ac6bae-79ba-4e79-a05a-dde38e0aa1c6 (id: 1)

7. Since domain monitorign stopped, getStoragePoolInfo returns no diskfree info, leading to "No Space" error.

Thread-1435::INFO::2013-12-04 19:03:19,962::logUtils::47::dispatcher::(wrapper) Run and protect: getStoragePoolInfo, Return response: {'info': {'spm_id': 1, 'master_uuid': 'ea0101ca-8a0e-4ff9-9032-98caa3a02825', 'name': 'DefaultRest', 'version': '3', 'domains': 'ea0101ca-8a0e-4ff9-9032-98caa3a02825:Active,61ac6bae-79ba-4e79-a05a-dde38e0aa1c6:Active,29f87f6b-e39b-4215-a66d-bdc4a8ec5afe:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'NFS', 'master_ver': 1, 'lver': 4}, 'dominfo': {'ea0101ca-8a0e-4ff9-9032-98caa3a02825': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '61ac6bae-79ba-4e79-a05a-dde38e0aa1c6': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '29f87f6b-e39b-4215-a66d-bdc4a8ec5afe': {'status': 'Active', 'isoprefix': '', 'alerts': []}}}
Thread-1435::DEBUG::2013-12-04 19:03:19,962::task::1168::TaskManager.Task::(prepare) Task=`0c9391c3-fa14-4bbf-816b-4510a418cbd1`::finished: {'info': {'spm_id': 1, 'master_uuid': 'ea0101ca-8a0e-4ff9-9032-98caa3a02825', 'name': 'DefaultRest', 'version': '3', 'domains': 'ea0101ca-8a0e-4ff9-9032-98caa3a02825:Active,61ac6bae-79ba-4e79-a05a-dde38e0aa1c6:Active,29f87f6b-e39b-4215-a66d-bdc4a8ec5afe:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'NFS', 'master_ver': 1, 'lver': 4}, 'dominfo': {'ea0101ca-8a0e-4ff9-9032-98caa3a02825': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '61ac6bae-79ba-4e79-a05a-dde38e0aa1c6': {'status': 'Active', 'isoprefix': '', 'alerts': []}, '29f87f6b-e39b-4215-a66d-bdc4a8ec5afe': {'status': 'Active', 'isoprefix': '', 'alerts': []}}}
Comment 18 Ilanit Stein 2013-12-17 07:16:20 EST
Verified on is27 (vdsm-4.13.2-0.2.rc.el6ev.x86_64).

Problem did not occur on 5 runs on jenkins qe / 20 runs on ci jenkins.
Comment 19 Allon Mureinik 2013-12-25 02:32:55 EST
*** Bug 1024269 has been marked as a duplicate of this bug. ***
Comment 20 errata-xmlrpc 2014-01-21 11:22:02 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html
Comment 21 Allon Mureinik 2014-08-14 05:11:41 EDT
*** Bug 1024018 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.