Bug 872270
| Summary: | 3.1 - [vdsm] Pool link is missing under /rhev/data-center after failure of storage domain during live-snapshot (although host sees both pool and storage domain) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | vvyazmin <vvyazmin> |
| Component: | vdsm | Assignee: | Yaniv Bronhaim <ybronhei> |
| Status: | CLOSED ERRATA | QA Contact: | vvyazmin <vvyazmin> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.3 | CC: | abaron, achan, amureini, bazulay, ewarszaw, hateya, iheim, ilvovsky, lpeer, sgrinber, smizrahi, ybronhei, ykaul |
| Target Milestone: | rc | Keywords: | Regression, ZStream |
| Target Release: | 6.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage, infra | | |
| Fixed In Version: | vdsm-4.9.6-42.0 | Doc Type: | Bug Fix |
| Doc Text: | The pool link went missing under /rhev/data-center after a storage domain failure during a live snapshot. This was caused by a race between the cleanup of the pool's symbolic link and its recreation at recovery when the storage pool was accessed. The fix moves cleanStorageRepository into the same thread, eliminating the race. The pool link now appears correctly under /rhev/data-center whenever the storage becomes available. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-12-04 19:13:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
There is a race between hsm.__init__ (which removes the pool links) and connectStoragePool(), which creates them.
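The race can be sketched in miniature. This is an illustrative model only, not the real vdsm API: `clean_storage_repository`, `connect_storage_pool`, and the startup functions are hypothetical stand-ins for `hsm.__cleanStorageRepository` and `connectStoragePool()`. The point is the ordering: the buggy version lets the cleanup thread run concurrently with pool connection, while the fixed version (per the patch) runs the cleanup in the same thread, strictly before any link can be recreated.

```python
import os
import shutil
import tempfile
import threading

def clean_storage_repository(repo):
    # Models Thread-12 in the log: remove leftover links/dirs under the repo.
    for entry in os.listdir(repo):
        path = os.path.join(repo, entry)
        if os.path.islink(path):
            os.unlink(path)
        elif os.path.isdir(path):
            shutil.rmtree(path)

def connect_storage_pool(repo, sp_uuid):
    # Models Thread-23 in the log: create the pool directory/link.
    os.makedirs(os.path.join(repo, sp_uuid))

def racy_startup(repo, sp_uuid):
    # Buggy ordering: cleanup runs in a separate thread, concurrently with
    # connectStoragePool, so a freshly created pool link can be wiped.
    t = threading.Thread(target=clean_storage_repository, args=(repo,))
    t.start()
    connect_storage_pool(repo, sp_uuid)
    t.join()

def fixed_startup(repo, sp_uuid):
    # The fix: cleanup happens in the same thread, before the pool link
    # can be (re)created, so the two operations can no longer interleave.
    clean_storage_repository(repo)
    connect_storage_pool(repo, sp_uuid)

repo = tempfile.mkdtemp()
fixed_startup(repo, "b5b11a32-558f-440a-b1a1-cd618dc4d4e7")
print(os.path.isdir(os.path.join(repo, "b5b11a32-558f-440a-b1a1-cd618dc4d4e7")))  # True
```

With `racy_startup` the outcome depends on scheduling: the pool directory may survive or be removed, which matches the intermittent symptom in the logs.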
# hsm.__init__.storageRefresh() thread is started.
Thread-12::DEBUG::2012-11-01 15:19:13,658::lvm::319::OperationMutex::(_reloadpvs) Operation 'lvm reload operation' got the operation mutex
Thread-12::DEBUG::2012-11-01 15:19:13,665::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm pvs --config " devices { preferred_names = [\\"^/dev/mapper/\\
# ConnectStoragePool started
Thread-23::INFO::2012-11-01 15:19:23,338::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='b5b11a32-558f-440a-b1a1-cd618dc4d4e7', hostID=2, scsiKey='b5b11a32-558f-440a-b1a1-cd618dc4d4e7', msdUUID='98a413eb-e536-4d6f-8145-65e3e6ad597c', masterVersion=1, options=None)
# Pool link exists!
Thread-23::INFO::2012-11-01 15:19:29,461::storage_mailbox::340::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['/bin/dd', 'of=/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=16']
Thread-23::DEBUG::2012-11-01 15:19:29,461::__init__::1164::Storage.Misc.excCmd::(_log) '/bin/dd of=/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/mastersd/dom_md/inbox iflag=fullblock oflag=direct conv=notrunc bs=512 seek=16' (cwd None)
Thread-23::DEBUG::2012-11-01 15:19:30,470::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 1.00134 s, 4.1 kB/s\n'; <rc> = 0
# Cleaning
Thread-12::DEBUG::2012-11-01 15:19:29,530::hsm::356::Storage.HSM::(__cleanStorageRepository) Started cleaning storage repository at '/rhev/data-center'
Thread-12::DEBUG::2012-11-01 15:19:29,542::hsm::388::Storage.HSM::(__cleanStorageRepository) White list: ['/rhev/data-center/hsm-tasks', '/rhev/data-center/hsm-tasks/*', '/rhev/data-center/mnt']
Thread-12::DEBUG::2012-11-01 15:19:29,543::hsm::389::Storage.HSM::(__cleanStorageRepository) Mount list: []
Thread-12::DEBUG::2012-11-01 15:19:29,543::hsm::391::Storage.HSM::(__cleanStorageRepository) Cleaning leftovers
Thread-12::DEBUG::2012-11-01 15:19:29,546::hsm::434::Storage.HSM::(__cleanStorageRepository) Finished cleaning storage repository at '/rhev/data-center'
# Connect storage pool finished
Thread-23::INFO::2012-11-01 15:19:30,474::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
# And the pool directory disappeared!
clientIFinit::ERROR::2012-11-01 15:19:35,471::blockVolume::401::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/blockVolume.py", line 399, in validateImagePath
os.mkdir(imageDir, 0755)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/98a413eb-e536-4d6f-8145-65e3e6ad597c/images/942899c2-3c0e-465a-8382-205650c909a0'
clientIFinit::ERROR::2012-11-01 15:19:35,472::task::853::TaskManager.Task::(_setError) Task=`d9bd3136-5c6d-4bcc-99d8-eecffe599627`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 861, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 2803, in prepareImage
imgVolumes = img.prepare(sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/image.py", line 347, in prepare
chain = self.getChain(sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/image.py", line 283, in getChain
srcVol = volclass(self.repoPath, sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/blockVolume.py", line 77, in __init__
volume.Volume.__init__(self, repoPath, sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/volume.py", line 127, in __init__
self.validate()
File "/usr/share/vdsm/storage/blockVolume.py", line 86, in validate
volume.Volume.validate(self)
File "/usr/share/vdsm/storage/volume.py", line 139, in validate
self.validateImagePath()
File "/usr/share/vdsm/storage/blockVolume.py", line 402, in validateImagePath
raise se.ImagePathError(imageDir)
ImagePathError: Image path does not exist or cannot be accessed/created: ('/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/98a413eb-e536-4d6f-8145-65e3e6ad597c/images/942899c2-3c0e-465a-8382-205650c909a0',)
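The "[Errno 2] No such file or directory" in the traceback above follows directly from the cleanup: os.mkdir() cannot create a directory whose parent is gone, which is exactly the state after __cleanStorageRepository has removed the pool tree. A minimal reproduction (paths are illustrative stand-ins for the real pool/domain UUIDs):

```python
import errno
import os
import tempfile

# After the cleanup thread runs, the pool directory and everything under it
# are gone, so validateImagePath's os.mkdir() fails with ENOENT because the
# parent directories no longer exist.
base = tempfile.mkdtemp()
image_dir = os.path.join(base, "pool-uuid", "sd-uuid", "images", "img-uuid")

try:
    os.mkdir(image_dir, 0o755)  # parent directories do not exist
    failed_errno = None
except OSError as e:
    failed_errno = e.errno

print(failed_errno == errno.ENOENT)  # True
```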
In addition, note that vdsm's threads are exhausted:
ps auxH | awk '/vdsm/ {print $1}' |grep vdsm | wc -l
4096
Continuous engine requests that fail due to the missing pool directory appear to leak threads.
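The leak mechanism can be modeled simply. This is a hypothetical sketch, not vdsm code: each engine request spawns a task thread, and when the task blocks on a pool link that never reappears, the thread is never reclaimed, so the count climbs toward the per-process limit (4096 in the `ps` output above), at which point `thread.error: can't start new thread` is raised.

```python
import threading

def stuck_task(event):
    # Models a task blocked waiting for the pool link that never appears.
    event.wait()

stop = threading.Event()
before = threading.active_count()

# Each "engine request" spawns a thread that never finishes.
threads = [threading.Thread(target=stuck_task, args=(stop,)) for _ in range(50)]
for t in threads:
    t.start()

leaked = threading.active_count() - before
print(leaked)  # 50 threads outstanding until the pool link comes back

# Unblock and reap the threads so the example exits cleanly.
stop.set()
for t in threads:
    t.join()
```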
25100c57-bcc1-4d20-b386-e0ce53267580::ERROR::2012-11-01 17:03:34,768::task::853::TaskManager.Task::(_setError) Task=`25100c57-bcc1-4d20-b386-e0ce53267580`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 861, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/storage/task.py", line 320, in run
return self.cmd(*self.argslist, **self.argsdict)
File "/usr/share/vdsm/storage/sp.py", line 303, in startSpm
self.spmMailer = storage_mailbox.SPM_MailMonitor(self, maxHostID)
File "/usr/share/vdsm/storage/storage_mailbox.py", line 472, in __init__
self.tp = ThreadPool(tpSize, waitTimeout, maxTasks)
File "/usr/share/vdsm/storage/threadPool.py", line 54, in __init__
self.setThreadCount(numThreads)
File "/usr/share/vdsm/storage/threadPool.py", line 87, in setThreadCount
self.__setThreadCountNolock(newNumThreads)
File "/usr/share/vdsm/storage/threadPool.py", line 102, in __setThreadCountNolock
newThread.start()
File "/usr/lib64/python2.6/threading.py", line 474, in start
_start_new_thread(self.__bootstrap, ())
error: can't start new thread
25100c57-bcc1-4d20-b386-e0ce53267580::DEBUG::2012-11-01 17:03:34,775::task::872::TaskManager.Task::(_run) Task=`25100c57-bcc1-4d20-b386-e0ce53267580`::Task._run: 25100c57-bcc1-4
Upstream patch: http://gerrit.ovirt.org/#/c/8980/

The patch http://gerrit.ovirt.org/#/c/8980/ is related to https://bugzilla.redhat.com/show_bug.cgi?id=872935.

http://gerrit.ovirt.org/#/c/9005/ - fixed the missing link issue

Verified on RHEVM 3.1 - SI24
RHEVM: rhevm-3.1.0-28.el6ev.noarch
VDSM: vdsm-4.9.6-42.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.5.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-1508.html
Created attachment 636743 [details]
Logs: vdsm, rhevm, screen-shots

Description of problem:
Pool link is missing under /rhev/data-center after failure of storage domain during live-snapshot (although host sees both pool and storage domain).

Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI23
RHEVM: rhevm-3.1.0-25.el6ev.noarch
VDSM: vdsm-4.9.6-40.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with one host and one SD
2. Create a VM with an OS installed (in my case a RHEL 6.3 VM with a VirtIO disk)
3. Run the VM
4. Create a live snapshot
5. Wait until the VM's disks are in "Locked" state
6. Block the SD connection via iptables
7. Wait until the host is in "Non Responsive" state
8. Remove the iptables restriction

Actual results:
Pool link is missing under /rhev/data-center after failure of storage domain during live-snapshot (although host sees both pool and storage domain).
After 2 hours, no more free threads.
VM disks are in "Locked" state.
Snapshot is in "Locked" state.

Expected results:
The system should handle SD disconnection and know how to recover from this state.

Additional info:
[root@cougar08 ~]# vgs
  VG                                   #PV #LV #SN Attr   VSize   VFree
  98a413eb-e536-4d6f-8145-65e3e6ad597c   1  10   0 wz--n- 199.62g 147.12g
  c7b9c1ee-c8b0-4036-b2d3-37d742d53db9   1   6   0 wz--n- 199.62g 195.75g
  vg0                                    1   3   0 wz--n- 465.27g       0