Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1059757

Summary: [vdsm] domain monitor does not stop during disconnectStoragePool on HSM during destroyStoragePool flow on SPM
Product: [Retired] oVirt
Reporter: Gadi Ickowicz <gickowic>
Component: vdsm
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED CURRENTRELEASE
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.4
CC: acathrow, bazulay, dron, gickowic, gklein, iheim, mgoldboi, nlevinki, nsoffer, sbonazzo, yeylon
Target Milestone: ---
Target Release: 3.4.0
Hardware: All
OS: Linux
Whiteboard: storage
Fixed In Version: ovirt-3.4.0-beta3
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned as: 1064480 (view as bug list)
Environment:
Last Closed: 2014-03-31 12:28:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1064480
Attachments: logs (flags: none)

Description Gadi Ickowicz 2014-01-30 15:10:16 UTC
Created attachment 857529 [details]
logs

Description of problem:
To remove the last storage domain of a datacenter through oVirt, the storage domain must first be moved to Maintenance (which moves the whole DC to maintenance), then the datacenter is removed (which moves the storage domain to unattached), and only then can the storage domain itself be removed. When performing this flow, the domain monitor on the HSM does not stop monitoring the domain, so the sanlock lease for that domain remains held and removing the domain using that host fails.

A workaround to fix a host in this problematic state is to stop the vdsmd service on the host and then start it again.

During datacenter removal, the engine triggers the following flow:
* All hosts start out disconnected from the pool, since moving the domain to maintenance in the DC causes disconnectStoragePool.
1. connectStorageServer + connectStoragePool on all hosts in the DC.
2. SPM election in the DC, so that one host becomes SPM and can perform the destroyStoragePool command.
3. The elected SPM destroys the storage pool, which triggers the domain monitor to stop monitoring the domains:

Thread-66985::INFO::2014-01-30 15:28:07,170::logUtils::44::dispatcher::(wrapper) Run and protect: destroyStoragePool(spUUID='36263611-ae52-4e2d-90cb-0b5629f99555', hostID=1, options=None)
Thread-66985::INFO::2014-01-30 15:28:07,170::hsm::1149::Storage.HSM::(destroyStoragePool) spUUID=36263611-ae52-4e2d-90cb-0b5629f99555
Thread-66985::WARNING::2014-01-30 15:28:07,173::sp::911::Storage.StoragePool::(forcedDetachSD) Force detaching domain `454354bd-d1df-49d4-b509-b2110286c3d5`
Thread-66985::DEBUG::2014-01-30 15:28:07,174::persistentDict::167::Storage.PersistentDict::(transaction) Starting transaction
Thread-66985::DEBUG::2014-01-30 15:28:07,174::persistentDict::173::Storage.PersistentDict::(transaction) Flushing changes
<.....snip.....>
Thread-66985::INFO::2014-01-30 15:28:10,407::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 454354bd-d1df-49d4-b509-b2110286c3d5
Thread-66933::DEBUG::2014-01-30 15:28:10,408::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 454354bd-d1df-49d4-b509-b2110286c3d5
Thread-66985::DEBUG::2014-01-30 15:28:12,410::sp::1384::Storage.StoragePool::(updateMonitoringThreads) Storage Pool `36263611-ae52-4e2d-90cb-0b5629f99555` stopped monitoring domain `454354bd-d1df-49d4-b509-b211028
Thread-66985::DEBUG::2014-01-30 15:28:12,410::sp::928::Storage.StoragePool::(forcedDetachSD) Force detach for domain `454354bd-d1df-49d4-b509-b2110286c3d5` is done
<.....snip.....>
Thread-66985::INFO::2014-01-30 15:28:13,305::sp::644::Storage.StoragePool::(disconnect) Disconnect from the storage pool 36263611-ae52-4e2d-90cb-0b5629f99555
Thread-66943::INFO::2014-01-30 15:28:13,770::storage_mailbox::511::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail
Thread-66943::INFO::2014-01-30 15:28:13,770::storage_mailbox::387::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['/bin/dd', 'of=/rhev/data-center/36263611-ae52-4e2d-90cb-0b5629
Thread-66943::DEBUG::2014-01-30 15:28:13,770::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) '/bin/dd of=/rhev/data-center/36263611-ae52-4e2d-90cb-0b5629f99555/mastersd/dom_md/inbox iflag=fullblock oflag
Thread-66943::DEBUG::2014-01-30 15:28:14,783::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 1.00187 s, 4.1 kB/s\n'; <rc> = 
Thread-66985::DEBUG::2014-01-30 15:28:15,427::lvm::295::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cach
Thread-66985::DEBUG::2014-01-30 15:28:15,699::lvm::295::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
Thread-66985::DEBUG::2014-01-30 15:28:15,703::persistentDict::234::Storage.PersistentDict::(refresh) read lines (VGTagMetadataRW)=['LEASETIMESEC=60', 'VERSION=3', 'TYPE=ISCSI', 'LOGBLKSIZE=512', 'LEASERETRIES=3', 
Thread-66985::INFO::2014-01-30 15:28:15,704::logUtils::47::dispatcher::(wrapper) Run and protect: destroyStoragePool, Return response: True


4. The HSM hosts in the DC do not receive destroyStoragePool; they only get disconnectStoragePool. In this specific flow (perhaps because the pool no longer exists, or because the domain is no longer part of a pool), the domain monitor is not stopped. As a result, the sanlock lease for the domain still exists, the host never releases it, and the host cannot be used to remove the domain later:

Thread-543::INFO::2014-01-30 15:28:05,583::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID='36263611-ae52-4e2d-90cb-0b5629f99555', hostID=2, remove=False, options=None)
Thread-543::INFO::2014-01-30 15:28:05,586::sp::698::Storage.StoragePool::(disconnect) Disconnect from the storage pool 36263611-ae52-4e2d-90cb-0b5629f99555
Thread-544::INFO::2014-01-30 15:28:06,049::logUtils::44::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-544::INFO::2014-01-30 15:28:06,049::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'454354bd-d1df-49d4-b509-b2110286c3d5': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.00109253', 'lastCheck': '5.1', 'valid': True}}
Thread-12::DEBUG::2014-01-30 15:28:06,216::utils::555::root::(execCmd) '/sbin/ip -d -o link' (cwd None)
Thread-12::DEBUG::2014-01-30 15:28:06,228::utils::575::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
Thread-528::INFO::2014-01-30 15:28:06,257::storage_mailbox::511::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail
Thread-528::INFO::2014-01-30 15:28:06,258::storage_mailbox::387::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['/bin/dd', 'of=/rhev/data-center/36263611-ae52-4e2d-90cb-0b5629f99555/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=16']
Thread-528::DEBUG::2014-01-30 15:28:06,258::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) '/bin/dd of=/rhev/data-center/36263611-ae52-4e2d-90cb-0b5629f99555/mastersd/dom_md/inbox iflag=fullblock oflag=direct conv=notrunc bs=512 seek=16' (cwd None)
Thread-528::DEBUG::2014-01-30 15:28:06,280::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 0.00864146 s, 474 kB/s\n'; <rc> = 0
Thread-543::DEBUG::2014-01-30 15:28:06,950::lvm::295::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/360060160128230003456a6ddcb3be311|/dev/mapper/3600601601282300070524cd7cb3be311|/dev/mapper/3600601601282300072524cd7cb3be311|/dev/mapper/360060160128230007297d8e31c88e311|/dev/mapper/3600601601282300074524cd7cb3be311|/dev/mapper/3600601601282300076524cd7cb3be311|/dev/mapper/36006016012823000a01215db1c88e311|/dev/mapper/36006016012823000a21215db1c88e311|/dev/mapper/36006016012823000a41215db1c88e311|/dev/mapper/36006016012823000a61215db1c88e311|/dev/mapper/36006016012823000a81215db1c88e311|/dev/mapper/36006016012823000aa1215db1c88e311|/dev/mapper/36006016012823000ac1215db1c88e311|/dev/mapper/36006016012823000ae1215db1c88e311|/dev/mapper/36006016012823000b01215db1c88e311|/dev/mapper/360060160f4a0300004808fd56157e311|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 454354bd-d1df-49d4-b509-b2110286c3d5' (cwd None)
Thread-543::DEBUG::2014-01-30 15:28:07,287::lvm::295::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
Thread-543::DEBUG::2014-01-30 15:28:07,290::persistentDict::234::Storage.PersistentDict::(refresh) read lines (VGTagMetadataRW)=['LEASETIMESEC=60', 'VERSION=3', 'TYPE=ISCSI', 'LOGBLKSIZE=512', 'LEASERETRIES=3', 'IOOPTIMEOUTSEC=10', 'LOCKRENEWALINTERVALSEC=5', 'PHYBLKSIZE=512', 'SDUUID=454354bd-d1df-49d4-b509-b2110286c3d5', u'PV0=pv:36006016012823000a81215db1c88e311,uuid:TfsbCd-N6HH-LZP0-X5Pv-7HZR-dy0d-7uG5IP,pestart:0,pecount:77,mapoffset:0', 'DESCRIPTION=extend_me', 'VGUUID=f3Xz8Z-tnZw-KQb7-QG5w-0rEz-G40A-IxF6kP', 'CLASS=Data', 'LOCKPOLICY=', 'POOL_DESCRIPTION=newDC', 'MASTER_VERSION=1', 'POOL_SPM_LVER=0', 'POOL_DOMAINS=', 'POOL_SPM_ID=-1', 'ROLE=Regular', 'POOL_UUID=', '_SHA_CKSUM=fd8ed5e0c508429f8e737a830ac329b730613acc']
Thread-543::INFO::2014-01-30 15:28:07,291::logUtils::47::dispatcher::(wrapper) Run and protect: disconnectStoragePool, Return response: True


Version-Release number of selected component (if applicable):
vdsm-4.14.0-21.git54cc674.el6.x86_64

How reproducible:
100%

Steps to Reproduce (from ovirt webadmin):
1. Create a DC and add at least 2 hosts to it
2. Add a storage domain to the DC and wait for SPM election
3. Move the storage domain to maintenance
4. Remove the datacenter. Note: during this phase the events will show that one of the hosts was elected SPM; the other hosts are the problematic ones

Actual results:
HSM hosts are left with a sanlock lease on the domain and the domain monitor still running. Attempting to remove the domain using those hosts will also fail.

Expected results:
Removing the datacenter should leave the domain unattached, and the hosts should not hold leases for the domain.

Additional info:
* The attached logs include the vdsm log from the HSM and from the SPM, starting with the domain in maintenance and continuing through the "remove datacenter" phase.
* sanlock client status -D was run on each host during each part of the flow; the output files are also included in the attached tar.
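The step-4 behavior can be illustrated with a small, hypothetical Python model. The names below (DomainMonitor, Hsm, the disconnect_* methods) are illustrative stand-ins, not the actual vdsm code; the sketch only shows why disconnecting from the pool without explicitly stopping the per-domain monitor threads leaves the (simulated) sanlock lease held:

```python
# Minimal, hypothetical model of the reported behavior -- not vdsm itself.
import threading


class DomainMonitor:
    """Simulated per-domain monitor that holds a lease while running."""

    def __init__(self, sd_uuid):
        self.sd_uuid = sd_uuid
        self._stop = threading.Event()
        self.lease_acquired = True  # models 'acquired': True in repoStats
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        # A real monitor would refresh the lease periodically; here we
        # just wait until asked to stop, then release the lease.
        self._stop.wait()
        self.lease_acquired = False

    def stop(self):
        self._stop.set()
        self._thread.join()


class Hsm:
    """Simulated HSM host: tracks monitor threads per storage domain."""

    def __init__(self):
        self.monitors = {}

    def start_monitoring(self, sd_uuid):
        self.monitors[sd_uuid] = DomainMonitor(sd_uuid)

    def disconnect_storage_pool_buggy(self):
        # Reported flow: pool state is dropped but the monitors are never
        # stopped, so the leases stay acquired.
        pass

    def disconnect_storage_pool_fixed(self):
        # Expected flow: stop every domain monitor on disconnect so the
        # host releases its leases.
        for monitor in self.monitors.values():
            monitor.stop()


hsm = Hsm()
hsm.start_monitoring("454354bd-d1df-49d4-b509-b2110286c3d5")
hsm.disconnect_storage_pool_buggy()
buggy = hsm.monitors["454354bd-d1df-49d4-b509-b2110286c3d5"].lease_acquired

hsm.disconnect_storage_pool_fixed()
fixed = hsm.monitors["454354bd-d1df-49d4-b509-b2110286c3d5"].lease_acquired

print(buggy, fixed)  # True False
```

After the buggy disconnect the lease is still acquired (matching the repoStats output above); after the fixed disconnect it is released, which is what restarting vdsmd achieves as a workaround.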

Comment 1 Sergey Gotliv 2014-02-10 09:43:12 UTC
*** Bug 856240 has been marked as a duplicate of this bug. ***

Comment 2 Nir Soffer 2014-02-10 09:54:35 UTC
Looking at the rhev-3.3 code, the same bug should also be present in rhev-3.3 and needs a backport.

Gadi, can you reproduce this with rhev-3.3?

Comment 3 Aharon Canan 2014-02-10 11:30:30 UTC
Nir, 

I don't see any reason to reproduce this in 3.3 - you have all the information needed, since you found the bug in both 3.4 and 3.3.
Reproducing it won't give any additional information, as you already saw it in the 3.3 code.

For a 3.3 fix you need to ask for a clone.

Comment 4 Leonid Natapov 2014-02-23 10:53:32 UTC
3.4.0-0.11.beta3.el6. Tested according to the steps to reproduce. After removing the DC, the LV was not open and the host does not hold a lease for the SD.

Comment 5 Sandro Bonazzola 2014-03-31 12:28:30 UTC
This is an automated message: moving to CLOSED CURRENTRELEASE since oVirt 3.4.0 has been released.