Bug 872270
| Summary: | 3.1 - [vdsm] Pool link is missing under /rhev/data-center after failure of storage domain during live-snapshot (although host sees both pool and storage domain) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | vvyazmin <vvyazmin> |
| Component: | vdsm | Assignee: | Yaniv Bronhaim <ybronhei> |
| Status: | CLOSED ERRATA | QA Contact: | vvyazmin <vvyazmin> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.3 | CC: | abaron, achan, amureini, bazulay, ewarszaw, hateya, iheim, ilvovsky, lpeer, sgrinber, smizrahi, ybronhei, ykaul |
| Target Milestone: | rc | Keywords: | Regression, ZStream |
| Target Release: | 6.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage, infra | | |
| Fixed In Version: | vdsm-4.9.6-42.0 | Doc Type: | Bug Fix |
| Doc Text: | The pool link went missing under /rhev/data-center after a storage domain failure during a live snapshot. This was caused by a race between the cleanup of the pool's symbolic link and its recreation at recovery when the storage pool was accessed. The fix moves cleanStorageRepository into the same thread, eliminating the race. The pool link now appears correctly under /rhev/data-center whenever the storage becomes available. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-12-04 19:13:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
There is a race between hsm.__init__ (which removes the pool links) and connectStoragePool(), which creates them.
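The race can be sketched in miniature. This is an illustrative model only, not the real vdsm API: `clean_storage_repository`, `connect_storage_pool`, and the startup functions are hypothetical stand-ins for `hsm.__cleanStorageRepository` and `connectStoragePool()`. The point is the ordering: the buggy version lets the cleanup thread run concurrently with pool connection, while the fixed version (per the patch) runs the cleanup in the same thread, strictly before any link can be recreated.

```python
import os
import shutil
import tempfile
import threading

def clean_storage_repository(repo):
    # Models Thread-12 in the log: remove leftover links/dirs under the repo.
    for entry in os.listdir(repo):
        path = os.path.join(repo, entry)
        if os.path.islink(path):
            os.unlink(path)
        elif os.path.isdir(path):
            shutil.rmtree(path)

def connect_storage_pool(repo, sp_uuid):
    # Models Thread-23 in the log: create the pool directory/link.
    os.makedirs(os.path.join(repo, sp_uuid))

def racy_startup(repo, sp_uuid):
    # Buggy ordering: cleanup runs in a separate thread, concurrently with
    # connectStoragePool, so a freshly created pool link can be wiped.
    t = threading.Thread(target=clean_storage_repository, args=(repo,))
    t.start()
    connect_storage_pool(repo, sp_uuid)
    t.join()

def fixed_startup(repo, sp_uuid):
    # The fix: cleanup happens in the same thread, before the pool link
    # can be (re)created, so the two operations can no longer interleave.
    clean_storage_repository(repo)
    connect_storage_pool(repo, sp_uuid)

repo = tempfile.mkdtemp()
fixed_startup(repo, "b5b11a32-558f-440a-b1a1-cd618dc4d4e7")
print(os.path.isdir(os.path.join(repo, "b5b11a32-558f-440a-b1a1-cd618dc4d4e7")))  # True
```

With `racy_startup` the outcome depends on scheduling: the pool directory may survive or be removed, which matches the intermittent symptom in the logs.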
# hsm.__init__.storageRefresh() thread is started.
Thread-12::DEBUG::2012-11-01 15:19:13,658::lvm::319::OperationMutex::(_reloadpvs) Operation 'lvm reload operation' got the operation mutex
Thread-12::DEBUG::2012-11-01 15:19:13,665::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm pvs --config " devices { preferred_names = [\\"^/dev/mapper/\\
# ConnectStoragePool started
Thread-23::INFO::2012-11-01 15:19:23,338::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='b5b11a32-558f-440a-b1a1-cd618dc4d4e7', hostID=2, scsiKey='b5b11a32-558f-440a-b1a1-cd618dc4d4e7', msdUUID='98a413eb-e536-4d6f-8145-65e3e6ad597c', masterVersion=1, options=None)
# Pool link exists!
Thread-23::INFO::2012-11-01 15:19:29,461::storage_mailbox::340::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['/bin/dd', 'of=/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=16']
Thread-23::DEBUG::2012-11-01 15:19:29,461::__init__::1164::Storage.Misc.excCmd::(_log) '/bin/dd of=/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/mastersd/dom_md/inbox iflag=fullblock oflag=direct conv=notrunc bs=512 seek=16' (cwd None)
Thread-23::DEBUG::2012-11-01 15:19:30,470::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 1.00134 s, 4.1 kB/s\n'; <rc> = 0
# Cleaning
Thread-12::DEBUG::2012-11-01 15:19:29,530::hsm::356::Storage.HSM::(__cleanStorageRepository) Started cleaning storage repository at '/rhev/data-center'
Thread-12::DEBUG::2012-11-01 15:19:29,542::hsm::388::Storage.HSM::(__cleanStorageRepository) White list: ['/rhev/data-center/hsm-tasks', '/rhev/data-center/hsm-tasks/*', '/rhev/data-center/mnt']
Thread-12::DEBUG::2012-11-01 15:19:29,543::hsm::389::Storage.HSM::(__cleanStorageRepository) Mount list: []
Thread-12::DEBUG::2012-11-01 15:19:29,543::hsm::391::Storage.HSM::(__cleanStorageRepository) Cleaning leftovers
Thread-12::DEBUG::2012-11-01 15:19:29,546::hsm::434::Storage.HSM::(__cleanStorageRepository) Finished cleaning storage repository at '/rhev/data-center'
# Connect storage pool finished
Thread-23::INFO::2012-11-01 15:19:30,474::logUtils::39::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
# And the pool directory disappeared!
clientIFinit::ERROR::2012-11-01 15:19:35,471::blockVolume::401::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/blockVolume.py", line 399, in validateImagePath
os.mkdir(imageDir, 0755)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/98a413eb-e536-4d6f-8145-65e3e6ad597c/images/942899c2-3c0e-465a-8382-205650c909a0'
clientIFinit::ERROR::2012-11-01 15:19:35,472::task::853::TaskManager.Task::(_setError) Task=`d9bd3136-5c6d-4bcc-99d8-eecffe599627`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 861, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 2803, in prepareImage
imgVolumes = img.prepare(sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/image.py", line 347, in prepare
chain = self.getChain(sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/image.py", line 283, in getChain
srcVol = volclass(self.repoPath, sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/blockVolume.py", line 77, in __init__
volume.Volume.__init__(self, repoPath, sdUUID, imgUUID, volUUID)
File "/usr/share/vdsm/storage/volume.py", line 127, in __init__
self.validate()
File "/usr/share/vdsm/storage/blockVolume.py", line 86, in validate
volume.Volume.validate(self)
File "/usr/share/vdsm/storage/volume.py", line 139, in validate
self.validateImagePath()
File "/usr/share/vdsm/storage/blockVolume.py", line 402, in validateImagePath
raise se.ImagePathError(imageDir)
ImagePathError: Image path does not exist or cannot be accessed/created: ('/rhev/data-center/b5b11a32-558f-440a-b1a1-cd618dc4d4e7/98a413eb-e536-4d6f-8145-65e3e6ad597c/images/942899c2-3c0e-465a-8382-205650c909a0',)
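The "[Errno 2] No such file or directory" in the traceback above follows directly from the cleanup: os.mkdir() cannot create a directory whose parent is gone, which is exactly the state after __cleanStorageRepository has removed the pool tree. A minimal reproduction (paths are illustrative stand-ins for the real pool/domain UUIDs):

```python
import errno
import os
import tempfile

# After the cleanup thread runs, the pool directory and everything under it
# are gone, so validateImagePath's os.mkdir() fails with ENOENT because the
# parent directories no longer exist.
base = tempfile.mkdtemp()
image_dir = os.path.join(base, "pool-uuid", "sd-uuid", "images", "img-uuid")

try:
    os.mkdir(image_dir, 0o755)  # parent directories do not exist
    failed_errno = None
except OSError as e:
    failed_errno = e.errno

print(failed_errno == errno.ENOENT)  # True
```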
In addition, note that vdsm's threads are exhausted:
ps auxH | awk '/vdsm/ {print $1}' |grep vdsm | wc -l
4096
Continuous engine requests that fail due to the missing pool directory appear to leak threads.
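The leak mechanism can be modeled simply. This is a hypothetical sketch, not vdsm code: each engine request spawns a task thread, and when the task blocks on a pool link that never reappears, the thread is never reclaimed, so the count climbs toward the per-process limit (4096 in the `ps` output above), at which point `thread.error: can't start new thread` is raised.

```python
import threading

def stuck_task(event):
    # Models a task blocked waiting for the pool link that never appears.
    event.wait()

stop = threading.Event()
before = threading.active_count()

# Each "engine request" spawns a thread that never finishes.
threads = [threading.Thread(target=stuck_task, args=(stop,)) for _ in range(50)]
for t in threads:
    t.start()

leaked = threading.active_count() - before
print(leaked)  # 50 threads outstanding until the pool link comes back

# Unblock and reap the threads so the example exits cleanly.
stop.set()
for t in threads:
    t.join()
```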
25100c57-bcc1-4d20-b386-e0ce53267580::ERROR::2012-11-01 17:03:34,768::task::853::TaskManager.Task::(_setError) Task=`25100c57-bcc1-4d20-b386-e0ce53267580`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 861, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/storage/task.py", line 320, in run
return self.cmd(*self.argslist, **self.argsdict)
File "/usr/share/vdsm/storage/sp.py", line 303, in startSpm
self.spmMailer = storage_mailbox.SPM_MailMonitor(self, maxHostID)
File "/usr/share/vdsm/storage/storage_mailbox.py", line 472, in __init__
self.tp = ThreadPool(tpSize, waitTimeout, maxTasks)
File "/usr/share/vdsm/storage/threadPool.py", line 54, in __init__
self.setThreadCount(numThreads)
File "/usr/share/vdsm/storage/threadPool.py", line 87, in setThreadCount
self.__setThreadCountNolock(newNumThreads)
File "/usr/share/vdsm/storage/threadPool.py", line 102, in __setThreadCountNolock
newThread.start()
File "/usr/lib64/python2.6/threading.py", line 474, in start
_start_new_thread(self.__bootstrap, ())
error: can't start new thread
25100c57-bcc1-4d20-b386-e0ce53267580::DEBUG::2012-11-01 17:03:34,775::task::872::TaskManager.Task::(_run) Task=`25100c57-bcc1-4d20-b386-e0ce53267580`::Task._run: 25100c57-bcc1-4
Upstream patch: http://gerrit.ovirt.org/#/c/8980/

The patch http://gerrit.ovirt.org/#/c/8980/ is related to https://bugzilla.redhat.com/show_bug.cgi?id=872935.

http://gerrit.ovirt.org/#/c/9005/ - fixed the missing link issue

Verified on RHEVM 3.1 - SI24
RHEVM: rhevm-3.1.0-28.el6ev.noarch
VDSM: vdsm-4.9.6-42.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.5.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHSA-2012-1508.html
Created attachment 636743 [details]
Logs: vdsm, rhevm, screen-shots

Description of problem:
Pool link is missing under /rhev/data-center after failure of storage domain during live-snapshot (although host sees both pool and storage domain).

Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI23
RHEVM: rhevm-3.1.0-25.el6ev.noarch
VDSM: vdsm-4.9.6-40.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with one host and one SD
2. Create a VM with an OS installed (in my case a RHEL 6.3 VM with a VirtIO disk)
3. Run the VM
4. Create a live snapshot
5. Wait until the VM's disks are in "Locked" state
6. Block the SD connection via iptables
7. Wait until the host is in "Non Responsive" state
8. Remove the iptables restriction

Actual results:
Pool link is missing under /rhev/data-center after failure of storage domain during live-snapshot (although host sees both pool and storage domain).
After 2 hours, no more free threads.
VM disks are in "Locked" state.
Snapshot is in "Locked" state.

Expected results:
The system should handle SD disconnection and know how to recover from this state.

Additional info:
[root@cougar08 ~]# vgs
  VG                                   #PV #LV #SN Attr   VSize   VFree
  98a413eb-e536-4d6f-8145-65e3e6ad597c   1  10   0 wz--n- 199.62g 147.12g
  c7b9c1ee-c8b0-4036-b2d3-37d742d53db9   1   6   0 wz--n- 199.62g 195.75g
  vg0                                    1   3   0 wz--n- 465.27g       0