Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1112171

Summary: Inconsistent failures when attempting to create a local sd on a directory owned by root
Product: [Retired] oVirt Reporter: Idan Shaby <ishaby>
Component: vdsmAssignee: Greg Padgett <gpadgett>
Status: CLOSED CURRENTRELEASE QA Contact: Aharon Canan <acanan>
Severity: low Docs Contact:
Priority: low    
Version: 3.5CC: acanan, amureini, bazulay, bugs, ebenahar, gklein, gpadgett, mgoldboi, nsoffer, rbalakri, scohen, yeylon, ylavi
Target Milestone: m1   
Target Release: 3.6.0   
Hardware: All   
OS: Linux   
Whiteboard: storage
Fixed In Version: v4.17.3 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1250752 (view as bug list) Environment:
Last Closed: 2015-11-04 12:59:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
VDSM log file none

Description Idan Shaby 2014-06-23 09:30:14 UTC
Created attachment 911366 [details]
VDSM log file

Description of problem:
Creating a sd which is owned by root returns a general error 100 instead of a more informative error message the first try, and succeeds the second try.

Version-Release number of selected component (if applicable):
vdsm-4.14.6-0.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a directory:
[root@dhcp-2-166 storage]# mkdir sd_owned_by_root

2. chmod it: 
[root@dhcp-2-166 storage]# chmod 777 sd_owned_by_root/

3. Create a local sd in the UI (gets the error).
4. Click OK again (works).

Actual results:
See above.

Expected results:
Should fail consistently with an informative error message.

Additional info:

Comment 1 Allon Mureinik 2015-07-07 13:35:36 UTC
*** Bug 1218556 has been marked as a duplicate of this bug. ***

Comment 2 Allon Mureinik 2015-07-13 08:35:25 UTC
Should also be relevant to NFS.

Comment 3 Greg Padgett 2015-08-05 19:34:51 UTC
Proposed fix will cause the following error to be displayed in the UI when the file storage permissions are wrong (checked both local and nfs storage):

Error while executing action Add Storage Connection: Permission settings on the specified path do not allow access to the storage.
Verify permission settings on the specified storage path.

Comment 4 Nir Soffer 2015-08-05 20:10:48 UTC
(In reply to Idan Shaby from comment #0)
> Created attachment 911366 [details]
> VDSM log file
> 
> Description of problem:
> Creating a sd which is owned by root returns a general error 100 instead of
> a more informative error message the first try, and succeeds the second try.
> 
> Version-Release number of selected component (if applicable):
> vdsm-4.14.6-0.el6.x86_64

This is a rather old version, can you reproduce it with 4.17?

> 3. Create a local sd in the UI (gets the error).

What error?

> 4. Click OK again (works).

> Expected results:
> Should fail consistently with an informative error message.

If this works in the second time, vdsm probably can cope with the
existing permissions, so it is not clear why it should fail.

Where is vdsm log showing the actual error?

Comment 5 Greg Padgett 2015-08-05 20:37:02 UTC
(In reply to Nir Soffer from comment #4)
> (In reply to Idan Shaby from comment #0)
> > Created attachment 911366 [details]
> > VDSM log file
> > 
> > Description of problem:
> > Creating a sd which is owned by root returns a general error 100 instead of
> > a more informative error message the first try, and succeeds the second try.
> > 
> > Version-Release number of selected component (if applicable):
> > vdsm-4.14.6-0.el6.x86_64
> 
> This is a rather old version, can you reproduce it with 4.17?
> 

I did

> > 3. Create a local sd in the UI (gets the error).
> 
> What error?
> 

I don't recall the wording, something useless about a general error.  VDSM log of the error:

Thread-41103::ERROR::2015-08-04 18:09:04,223::hsm::2464::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2461, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 591, in connect
    os.chmod(lnPath, 0o775)
OSError: [Errno 1] Operation not permitted: '/rhev/data-center/mnt/_storage_local'
Thread-41103::DEBUG::2015-08-04 18:09:04,223::hsm::2483::Storage.HSM::(connectStorageServer) knownSDs: {20b5566c-6bc3-499c-9dcc-696888c21d7d: storage.nfsSD.findDomain, bf77b1a4-7bc5-41f7-9e11-2b76e884a2cb: storage.nfsSD.findDomain, 624eb942-111e-4b96-b508-b78a84df3957: storage.nfsSD.findDomain, a1b43220-a08f-4f7a-be4a-e3529448802c: storage.blockSD.findDomain, 09bb7379-e5b3-4400-9aa5-f43924c543fd: storage.nfsSD.findDomain}
Thread-41103::INFO::2015-08-04 18:09:04,223::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 100, 'id': u'00000000-0000-0000-0000-000000000000'}]}

> > 4. Click OK again (works).
> 
> > Expected results:
> > Should fail consistently with an informative error message.
> 
> If this works in the second time, vdsm probably can cope with the
> existing permissions, so it is not clear why it should fail.
> 
> Where is vdsm log showing the actual error?

It didn't work the second time for me.

Instead, I received a different error... I guess because part of the SD data was present from the first attempt, having made it partially through the creation process before failing.

Here's my log of the second error:

Thread-41118::ERROR::2015-08-04 18:10:02,352::sdc::144::Storage.StorageDomainCache::(_findDomain) domain 25c961a4-9bf3-4e11-ba77-c0bd56e0710e not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 142, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 172, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'25c961a4-9bf3-4e11-ba77-c0bd56e0710e',)
Thread-41118::INFO::2015-08-04 18:10:02,353::localFsSD::73::Storage.StorageDomain::(create) sdUUID=25c961a4-9bf3-4e11-ba77-c0bd56e0710e domainName=local remotePath=/storage/local domClass=1
Thread-41118::DEBUG::2015-08-04 18:10:02,358::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Thread-41118::ERROR::2015-08-04 18:10:02,364::task::866::Storage.TaskManager.Task::(_setError) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2695, in createStorageDomain
    domVersion)
  File "/usr/share/vdsm/storage/localFsSD.py", line 84, in create
    cls._preCreateValidation(sdUUID, mntPoint, remotePath, version)
  File "/usr/share/vdsm/storage/localFsSD.py", line 51, in _preCreateValidation
    fileSD.validateFileSystemFeatures(sdUUID, domPath)
  File "/usr/share/vdsm/storage/fileSD.py", line 89, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 350, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 507, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 391, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 13] Permission denied
Thread-41118::DEBUG::2015-08-04 18:10:02,364::task::885::Storage.TaskManager.Task::(_run) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::Task._run: ab7907ce-74a1-4329-9080-ca139db56eea (4, u'25c961a4-9bf3-4e11-ba77-c0bd56e0710e', u'local', u'/storage/local', 1, u'3') {} failed - stopping task
Thread-41118::DEBUG::2015-08-04 18:10:02,364::task::1246::Storage.TaskManager.Task::(stop) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::stopping in state preparing (force False)
Thread-41118::DEBUG::2015-08-04 18:10:02,364::task::993::Storage.TaskManager.Task::(_decref) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::ref 1 aborting True
Thread-41118::INFO::2015-08-04 18:10:02,364::task::1171::Storage.TaskManager.Task::(prepare) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::aborting: Task is aborted: u'[Errno 13] Permission denied' - code 100
Thread-41118::DEBUG::2015-08-04 18:10:02,364::task::1176::Storage.TaskManager.Task::(prepare) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::Prepare: aborted: [Errno 13] Permission denied
Thread-41118::DEBUG::2015-08-04 18:10:02,364::task::993::Storage.TaskManager.Task::(_decref) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::ref 0 aborting True
Thread-41118::DEBUG::2015-08-04 18:10:02,365::task::928::Storage.TaskManager.Task::(_doAbort) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::Task._doAbort: force False
Thread-41118::DEBUG::2015-08-04 18:10:02,365::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-41118::DEBUG::2015-08-04 18:10:02,365::task::595::Storage.TaskManager.Task::(_updateState) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::moving from state preparing -> state aborting
Thread-41118::DEBUG::2015-08-04 18:10:02,365::task::550::Storage.TaskManager.Task::(__state_aborting) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::_aborting: recover policy none
Thread-41118::DEBUG::2015-08-04 18:10:02,365::task::595::Storage.TaskManager.Task::(_updateState) Task=`ab7907ce-74a1-4329-9080-ca139db56eea`::moving from state aborting -> state failed
Thread-41118::DEBUG::2015-08-04 18:10:02,365::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-41118::DEBUG::2015-08-04 18:10:02,365::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-41118::ERROR::2015-08-04 18:10:02,365::dispatcher::79::Storage.Dispatcher::(wrapper) [Errno 13] Permission denied
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 71, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 104, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1179, in prepare
    raise self.error
OSError: [Errno 13] Permission denied

Comment 6 Nir Soffer 2015-08-05 20:55:57 UTC
(In reply to Greg Padgett from comment #5)
> > > 4. Click OK again (works).
> > 
> > > Expected results:
> > > Should fail consistently with an informative error message.
> > 
> > If this works in the second time, vdsm probably can cope with the
> > existing permissions, so it is not clear why it should fail.
> > 
> > Where is vdsm log showing the actual error?
> 
> It didn't work the second time for me.

Did you use the same permissions used by Idan?

The results you get makes sense, we fail with EPERM when permissions are wrong.

Can you verify that the unlikely flow (failing first time, succeeding in 
the second time) described in the bug description does not happen?

Comment 7 Greg Padgett 2015-08-05 21:29:37 UTC
(In reply to Nir Soffer from comment #6)
> Did you use the same permissions used by Idan?
> 
> The results you get makes sense, we fail with EPERM when permissions are
> wrong.
> 
> Can you verify that the unlikely flow (failing first time, succeeding in 
> the second time) described in the bug description does not happen?

Actually I found out how to reproduce that behavior.  The results above are applicable to directories having root:root ownership with 0755 permissions (or less).

If the permissions are 0777, it fails with OSError on the single os.chmod call in storageServer.py and then succeeds a second time.  If I remove the call to chmod, it creates a storage domain without complaint.

The addition of that call was done in a huge commit 93390aab8381bff869bd029bf121f6a1f3761e8e where Saggi did some refactoring and, it seems, functional changes.  If we check the permissions and they are adequate, do we need this chmod at all?

---
Also just for the record, here are the two UI errors for a directory with insufficient permissions:

1st: Error while executing action Add Storage Connection: General Exception
2nd: Error while executing action New Local Storage Domain: Error creating a storage domain

Comment 8 Nir Soffer 2015-08-05 21:53:30 UTC
(In reply to Greg Padgett from comment #7)
> (In reply to Nir Soffer from comment #6)
> If the permissions are 0777, it fails with OSError on the single os.chmod
> call in storageServer.py and then succeeds a second time.  If I remove the
> call to chmod, it creates a storage domain without complaint.

Sound very strange - if the chmod call failed, how this succeeds in the second
time?

Can you explain how this does not fail in the second call?

> The addition of that call was done in a huge commit
> 93390aab8381bff869bd029bf121f6a1f3761e8e where Saggi did some refactoring
> and, it seems, functional changes.  If we check the permissions and they are
> adequate, do we need this chmod at all?

Removing bad behavior changes which were part of huge "refactoring" worked 
well for us in the past. I would check first this direction.

Comment 9 Greg Padgett 2015-08-05 22:11:30 UTC
(In reply to Nir Soffer from comment #8)
> (In reply to Greg Padgett from comment #7)
> > (In reply to Nir Soffer from comment #6)
> > If the permissions are 0777, it fails with OSError on the single os.chmod
> > call in storageServer.py and then succeeds a second time.  If I remove the
> > call to chmod, it creates a storage domain without complaint.
> 
> Sound very strange - if the chmod call failed, how this succeeds in the
> second
> time?
> 
> Can you explain how this does not fail in the second call?

Sure.  In storageServer.py, LocalDirectoryConnection.connect() checks for the symlink before trying to create and chmod the symlinkcalling symlink and chmod.  After the first failure the link exists, so the second time it doesn't get to the chmod and happily carries on.

> > The addition of that call was done in a huge commit
> > 93390aab8381bff869bd029bf121f6a1f3761e8e where Saggi did some refactoring
> > and, it seems, functional changes.  If we check the permissions and they are
> > adequate, do we need this chmod at all?
> 
> Removing bad behavior changes which were part of huge "refactoring" worked 
> well for us in the past. I would check first this direction.

Seems to work, but I should do more verification.  I think the existing patch and this new change would resolve the issue altogether.

Comment 10 Nir Soffer 2015-08-05 22:19:26 UTC
(In reply to Greg Padgett from comment #9)

Changing the mode of the file should not be done in connect, this is part
of domain initialization, so we can safely remove it. This is the minimal
fix for this bug, no need to add additional permission checking.

Adding permission checking makes sense but not required for this bug.

Comment 11 Greg Padgett 2015-08-05 22:25:35 UTC
(In reply to Nir Soffer from comment #10)
> (In reply to Greg Padgett from comment #9)
> 
> Changing the mode of the file should not be done in connect, this is part
> of domain initialization, so we can safely remove it. This is the minimal
> fix for this bug, no need to add additional permission checking.

Good, I'll verify the change and submit it.  Thanks.

> Adding permission checking makes sense but not required for this bug.

(In reply to Idan Shaby from comment #0)
> Expected results:
> Should fail consistently with an informative error message.

Well I suppose the chmod removal fix satisfies inconsistent behavior, but I wouldn't consider either error in comment #7 an informative error message :)

Comment 12 Nir Soffer 2015-08-05 22:35:38 UTC
Actually there are two bugs here - bad error message and in consistent failures. Lets clone the bug to handle the bad error messages, we already
have a patch for this :-)

Comment 13 Greg Padgett 2015-08-06 00:09:58 UTC
(In reply to Nir Soffer from comment #12)
> Actually there are two bugs here - bad error message and in consistent
> failures. Lets clone the bug to handle the bad error messages, we already
> have a patch for this :-)

Done!  This one and its patch are about the inconsistent failures.  Bug 1250752 is about the uninformative message.

Comment 14 Greg Padgett 2015-08-17 22:00:11 UTC
Regarding testing this fix:

The bug was cloned to separate the bad message fix vs inconsistent failures.  This is for the inconsistent failures.

1. Create a local storage dir owned by root
2. chmod 0775 the directory
3. Add storage; it should fail
4. Try again; should still fail

Repeat the above, but use 0777 in step 2.  This should work the first time.

(Note that the trigger here is that vdsm/kvm need rwx permission, which is granted by the 'others' field if the directory is owned by root.)

Comment 15 Allon Mureinik 2015-08-20 11:33:57 UTC
Greg - is there anything intelligent to document here?
If so - please provide doctext.
If not - please set requires-doctext-.


Thanks!

Comment 16 Greg Padgett 2015-08-20 22:38:45 UTC
(In reply to Allon Mureinik from comment #15)
> Greg - is there anything intelligent to document here?
> If so - please provide doctext.
> If not - please set requires-doctext-.

Documentation already says to chown the dir to 36:36; setting requires-doctext-.

Comment 17 Aharon Canan 2015-09-08 06:40:08 UTC
(In reply to Greg Padgett from comment #14)
> Regarding testing this fix:
> 
> The bug was cloned to separate the bad message fix vs inconsistent failures.
> This is for the inconsistent failures.
> 
> 1. Create a local storage dir owned by root
> 2. chmod 0775 the directory
> 3. Add storage; it should fail
> 4. Try again; should still fail
> 
> Repeat the above, but use 0777 in step 2.  This should work the first time.
> 
> (Note that the trigger here is that vdsm/kvm need rwx permission, which is
> granted by the 'others' field if the directory is owned by root.)

Verified using 3.6.0-11

Comment 18 Sandro Bonazzola 2015-11-04 12:59:57 UTC
oVirt 3.6.0 has been released on November 4th, 2015 and should fix this issue.
If problems still persist, please open a new BZ and reference this one.