Bug 838547

Summary: 3.1 - VDSM [Scalability]: 'OSError' during refreshStoragePool mkdir that fails over File exists.
Product: Red Hat Enterprise Linux 6 Reporter: Omri Hochman <ohochman>
Component: vdsmAssignee: Eduardo Warszawski <ewarszaw>
Status: CLOSED ERRATA QA Contact: Haim <hateya>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.3CC: abaron, aburden, amureini, bazulay, chetan, cpelland, danken, iheim, ilvovsky, yeylon, ykaul, zdover
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage,scale
Fixed In Version: vdsm-4.9.6-41.0 Doc Type: Bug Fix
Doc Text:
Previously, when attempting to activate five or six storage domains at the same time, refreshStoragePool attempted to create a folder that already existed, which caused an "OSError [Errno 17]" to be thrown. This issue was caused by a race condition in fileUtils.createdir(). An updated version of VDSM now creates the target directory; if the target directory already exists, VDSM ensures that the directory has the correct permissions, and "OSError [Errno 17]" is not thrown.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-04 19:02:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
full_vdsm)log none

Description Omri Hochman 2012-07-09 12:41:15 UTC
VDSM: refreshStoragePool attempted to mkdir an exists folder and failed with 'OSError: [Errno 17] File exists:'.

Description: 
***********
I worked on scale environment with 40 hosts and 100 iSCSI Storage-Domains (EMC Machine), When attempted to activate several (5-6) Storage Domains at the same time, refreshStoragePool attempted to mkdir folder that was already exist It returned OSError [Errno 17].  

Note: 
*******
might be related - the SSL request of the refreshStoragePool command took long time (more than 4 minutes) it failed on timeout with "BrokenPipe" Error (same Thread-38098 at vdsm.log). 


VDSM.log
************
Thread-38098::ERROR::2012-07-09 08:35:34,901::task::853::TaskManager.Task::(_setError) Task=`43caf957-5dda-46e0-932e-6699ba021304`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 699, in refreshStoragePool
    pool.refresh(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1178, in refresh
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1139, in __rebuild
    fileUtils.createdir(os.path.join(domaindir, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/fileUtils.py", line 149, in createdir
    os.makedirs(dirPath)
  File "/usr/lib64/python2.6/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: '/rhev/data-center/mnt/blockSD/3191acd9-0fc6-4270-99b3-7775fa70ed15/dom_md'
', '35459266-07dd-4f15-92c1-f6e36e9b851a', 1) {} failed - stopping task
Thread-38098::DEBUG::2012-07-09 08:35:34,905::task::1199::TaskManager.Task::(stop) Task=`43caf957-5dda-46e0-932e-6699ba021304`::stopping in state preparing (force False)
Thread-38098::DEBUG::2012-07-09 08:35:34,905::task::978::TaskManager.Task::(_decref) Task=`43caf957-5dda-46e0-932e-6699ba021304`::ref 1 aborting True
Thread-38098::INFO::2012-07-09 08:35:34,906::task::1157::TaskManager.Task::(prepare) Task=`43caf957-5dda-46e0-932e-6699ba021304`::aborting: Task is aborted: u"[Errno 17] File exists: '/rhev/data-center/mnt/blockSD
/3191acd9-0fc6-4270-99b3-7775fa70ed15/dom_md'" - code 100
Thread-38098::DEBUG::2012-07-09 08:35:34,906::task::1162::TaskManager.Task::(prepare) Task=`43caf957-5dda-46e0-932e-6699ba021304`::Prepare: aborted: [Errno 17] File exists: '/rhev/data-center/mnt/blockSD/3191acd9-
0fc6-4270-99b3-7775fa70ed15/dom_md'
Thread-38098::DEBUG::2012-07-09 08:35:34,906::task::978::TaskManager.Task::(_decref) Task=`43caf957-5dda-46e0-932e-6699ba021304`::ref 0 aborting True
Thread-38098::DEBUG::2012-07-09 08:35:34,907::task::913::TaskManager.Task::(_doAbort) Task=`43caf957-5dda-46e0-932e-6699ba021304`::Task._doAbort: force False
Thread-38098::DEBUG::2012-07-09 08:35:34,907::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-38098::DEBUG::2012-07-09 08:35:34,907::task::588::TaskManager.Task::(_updateState) Task=`43caf957-5dda-46e0-932e-6699ba021304`::moving from state preparing -> state aborting
Thread-38098::DEBUG::2012-07-09 08:35:34,907::task::537::TaskManager.Task::(__state_aborting) Task=`43caf957-5dda-46e0-932e-6699ba021304`::_aborting: recover policy none
Thread-38098::DEBUG::2012-07-09 08:35:34,908::task::588::TaskManager.Task::(_updateState) Task=`43caf957-5dda-46e0-932e-6699ba021304`::moving from state aborting -> state failed
Thread-38098::DEBUG::2012-07-09 08:35:34,908::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.50d275f3-2029-4fd9-bc86-20333f3cb24f': < ResourceRef 'Storag
e.50d275f3-2029-4fd9-bc86-20333f3cb24f', isValid: 'True' obj: 'None'>}
Thread-38098::DEBUG::2012-07-09 08:35:34,908::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-38098::DEBUG::2012-07-09 08:35:34,909::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.50d275f3-2029-4fd9-bc86-20333f3cb24f'
Thread-38098::DEBUG::2012-07-09 08:35:34,909::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.50d275f3-2029-4fd9-bc86-20333f3cb24f' (0 active users)
Thread-38098::DEBUG::2012-07-09 08:35:34,909::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.50d275f3-2029-4fd9-bc86-20333f3cb24f' is free, finding out if anyone is waiting for it.
Thread-38098::DEBUG::2012-07-09 08:35:34,910::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.50d275f3-2029-4fd9-bc86-20333f3cb24f', Clearing records.
Thread-38098::ERROR::2012-07-09 08:35:34,910::dispatcher::69::Storage.Dispatcher.Protect::(run) [Errno 17] File exists: '/rhev/data-center/mnt/blockSD/3191acd9-0fc6-4270-99b3-7775fa70ed15/dom_md'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1164, in prepare
    raise self.error
OSError: [Errno 17] File exists: '/rhev/data-center/mnt/blockSD/3191acd9-0fc6-4270-99b3-7775fa70ed15/dom_md'

Comment 1 Omri Hochman 2012-07-09 12:43:17 UTC
Created attachment 597067 [details]
full_vdsm)log

Comment 2 Eduardo Warszawski 2012-07-15 18:09:24 UTC
Brief of refreshStoragePoolTimes:

Thread-38055    2012-07-09 08:25:24,973         2012-07-09 08:35:31,909
Thread-38080    2012-07-09 08:26:08,145         2012-07-09 08:30:27,482
Thread-38098    2012-07-09 08:28:51,500         2012-07-09 08:35:34,910 Raise
Thread-38129    2012-07-09 08:29:59,013         2012-07-09 08:35:31,986

Comment 3 Eduardo Warszawski 2012-10-23 11:34:20 UTC
http://gerrit.ovirt.org/#/c/8732/2

Comment 6 Omri Hochman 2012-11-11 14:36:25 UTC
Unable to reproduce vdsm-4.9.6-41.0.el6_3.x86_64 (rhevm build si24).

Comment 8 errata-xmlrpc 2012-12-04 19:02:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html