Bug 870024

Summary: 3.1.z - vdsm: Unexpected exception when upgrading local/NFS domain from 3.0 to 3.1
Product: Red Hat Enterprise Linux 6
Reporter: Dafna Ron <dron>
Component: vdsm
Assignee: Haim <hateya>
Status: CLOSED ERRATA
QA Contact: Dafna Ron <dron>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.3
CC: abaron, achan, amureini, bazulay, cpelland, dpaikov, hateya, iheim, ilvovsky, lpeer, thildred, yeylon, ykaul
Target Milestone: rc
Keywords: TestBlocker, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: vdsm-4.9.6-41.0
Doc Type: Bug Fix
Doc Text:
As part of the upgrade from Red Hat Enterprise Virtualization 3.0 to 3.1, storage domain metadata is also upgraded. Previously, storage lease metadata files had incorrect permissions, which prevented file-based storage domains from being recovered after an upgrade. This fix modifies the permissions on the storage lease metadata files during the upgrade procedure. The local domain can now be upgraded without issue, and will come online after the upgrade.
Story Points: ---
Clone Of:
: 883327 (view as bug list)
Environment:
Last Closed: 2012-12-04 19:13:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 883327, 887900    
Attachments:
  logs (flags: none)
  nfs logs (flags: none)
  pool is not recovered logs (flags: none)
  vdsm log (flags: none)

Description Dafna Ron 2012-10-25 12:15:48 UTC
Created attachment 633292 [details]
logs

Description of problem:

I upgraded a 3.0 setup to 3.1 (si22.1) and tried to upgrade my local domain.
The UI shows an unexpected exception when we try to upgrade the DC.

Backend: 

2012-10-25 13:53:56,990 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (http-/0.0.0.0:8443-5) [399bd68b] Weird return value: Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         661
mMessage                      Cannot acquire host id: ('508b3e1a-7533-4bec-b27b-8196dc5be596', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))


vdsm: 

Thread-121::ERROR::2012-10-25 13:51:39,415::task::853::TaskManager.Task::(_setError) Task=`e480d5a0-3ca0-48d8-9234-f8a3cbd8e051`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3068, in upgradeStoragePool
    pool._upgradePool(targetDomVersion)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 413, in _upgradePool
    self._convertDomain(self.masterDomain, str(targetDomVersion))
  File "/usr/share/vdsm/storage/sp.py", line 1031, in _convertDomain
    domain.getRealDomain(), isMsd, targetFormat)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 252, in convert
    converter(repoPath, hostId, imageRepo, isMsd)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 99, in v3DomainConverter
    newClusterLock.acquireHostId(hostId, async=False)
  File "/usr/share/vdsm/storage/safelease.py", line 175, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('508b3e1a-7533-4bec-b27b-8196dc5be596', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))

sanlock:

2012-10-25 13:51:38+0200 4914 [1929]: s1 lockspace 508b3e1a-7533-4bec-b27b-8196dc5be596:1:/rhev/data-center/mnt/_data_local/508b3e1a-7533-4bec-b27b-8196dc5be596/dom_md/ids:0
2012-10-25 13:51:38+0200 4914 [9395]: open error -13 /rhev/data-center/mnt/_data_local/508b3e1a-7533-4bec-b27b-8196dc5be596/dom_md/ids
2012-10-25 13:51:38+0200 4914 [9395]: s1 open_disk /rhev/data-center/mnt/_data_local/508b3e1a-7533-4bec-b27b-8196dc5be596/dom_md/ids error -13
2012-10-25 13:51:39+0200 4915 [1929]: s1 add_lockspace fail result -19
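
For reference, the sanlock return codes above are plain errno values; a quick interactive Python check (my own note, not from the attached logs) shows that -13 is EACCES and -19 is ENODEV, which matches the 'No such device' text in the SanlockException:

    >>> import errno, os
    >>> errno.errorcode[13], os.strerror(13)
    ('EACCES', 'Permission denied')
    >>> errno.errorcode[19], os.strerror(19)
    ('ENODEV', 'No such device')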


Version-Release number of selected component (if applicable):

3.0 to 3.1 (si22.1)
vdsm-4.9.6-39.0.el6_3.x86_64
sanlock-2.3-4.el6_3.x86_64


How reproducible:

100%

Steps to Reproduce:
1. create a local domain with 1 vm on 3.0 rhevm
2. upgrade rhevm to 3.1
3. yum update vdsm on host -> restart vdsm service (my vm was up) -> try to upgrade DC level to 3.1
  
Actual results:

we get an unexpected exception
we also cannot remove the DC once this happens

Expected results:

the upgrade should either succeed or be blocked before the operation starts

Additional info: full logs (including audit log)

Comment 1 Dafna Ron 2012-10-25 13:44:17 UTC
I tested the upgrade with NFS type storage and I got the same issue: 

Thread-963::ERROR::2012-10-25 15:32:51,766::task::853::TaskManager.Task::(_setError) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3068, in upgradeStoragePool
    pool._upgradePool(targetDomVersion)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 413, in _upgradePool
    self._convertDomain(self.masterDomain, str(targetDomVersion))
  File "/usr/share/vdsm/storage/sp.py", line 1031, in _convertDomain
    domain.getRealDomain(), isMsd, targetFormat)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 252, in convert
    converter(repoPath, hostId, imageRepo, isMsd)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 99, in v3DomainConverter
    newClusterLock.acquireHostId(hostId, async=False)
  File "/usr/share/vdsm/storage/safelease.py", line 175, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('63c20d84-f712-4f69-b44a-0c87a33ba2e5', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))
Thread-963::DEBUG::2012-10-25 15:32:51,769::task::872::TaskManager.Task::(_run) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::Task._run: dbdf6dd4-2f42-4274-aba3-f31104e40548 ('c58fc8ee-0b57-46d9-8a7e-b7ce7416c56d', '3') {} failed - stopping task
Thread-963::DEBUG::2012-10-25 15:32:51,770::task::1199::TaskManager.Task::(stop) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::stopping in state preparing (force False)
Thread-963::DEBUG::2012-10-25 15:32:51,770::task::978::TaskManager.Task::(_decref) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::ref 1 aborting True
Thread-963::INFO::2012-10-25 15:32:51,771::task::1157::TaskManager.Task::(prepare) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::aborting: Task is aborted: 'Cannot acquire host id' - code 661
Thread-963::DEBUG::2012-10-25 15:32:51,771::task::1162::TaskMan

Comment 2 Dafna Ron 2012-10-25 13:55:26 UTC
[root@gold-vdsc ~]# ls -lZd /rhev/
drwxr-xr-x. root root system_u:object_r:root_t:s0      /rhev/
[root@gold-vdsc ~]# ls -lZd /rhev/*
drwxrwxr-x. vdsm kvm system_u:object_r:mnt_t:s0       /rhev/data-center
[root@gold-vdsc ~]#

Comment 4 Dafna Ron 2012-10-25 14:08:30 UTC
Created attachment 633340 [details]
nfs logs

Comment 5 Dafna Ron 2012-10-25 14:09:36 UTC
After this error, the NFS pool cannot be recovered.

Comment 6 Dafna Ron 2012-10-25 14:11:12 UTC
Created attachment 633343 [details]
pool is not recovered logs

Comment 7 Federico Simoncelli 2012-10-26 02:06:32 UTC
Once the pool is marked as 3.1 in the engine, the upgrade process is started automatically when the spmStart is issued (expected domainVersion). If that process keeps failing (on the master domain) it can prevent you from re-activating the pool. Since the V3 upgrade is stuck on the acquireHostId command, I fear that we might start experiencing a selinux issue on rhel6 too. Dafna, could you try to temporarily disable selinux (setenforce 0) and activate the pool again? Thanks.
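
To rule SELinux in or out quickly, the enforcing mode can also be checked and flipped from Python on the host. This is only a minimal sketch, assuming the libselinux-python bindings are installed; it is equivalent to running getenforce/setenforce 0:

    # assumes the libselinux-python package (the 'selinux' module) is available
    import selinux

    if selinux.is_selinux_enabled():
        # 1 = enforcing, 0 = permissive
        print "enforcing:", selinux.security_getenforce()
        # temporarily switch to permissive, same effect as 'setenforce 0'
        selinux.security_setenforce(0)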

Comment 8 Haim 2012-10-29 16:18:35 UTC
(In reply to comment #7)
> Once the pool is marked as 3.1 in the engine the upgrade process is started
> automatically when the spmStart is issued (expected domainVersion). If such
> process keeps failing (on the master domain) it could prevent you from
> re-activating the pool. Since the V3 upgrade is stuck on the acquireHostId
> command I fear that we might start experiencing a selinux issue on rhel6
> too. Dafna could you try to temporarily disable selinux (setenforce 0) and
> activate the pool again. Thanks.

Federico, would you please elaborate on why you need this scenario reproduced again? To me it's obvious that it's related to SELinux and that disabling it will work around the problem; furthermore, we have an AVC for the wdmd service in audit.log.


type=AVC msg=audit(1351160857.306:122085): avc:  denied  { read } for  pid=12435 comm="setfiles" path=2F746D702F73682D7468642D31333531313833363630202864656C6574656429 dev=dm-0 ino=915187 scontext=unconfined_u:system_r:setfiles_t:s0 tcontext=unconfined_u:object_r:initrc_tmp_t:s0 tclass=file
type=AVC msg=audit(1351160858.196:122086): avc:  denied  { write } for  pid=12443 comm="shutdown" path="/tmp/vds_installer.805233.log" dev=dm-0 ino=915855 scontext=unconfined_u:unconfined_r:shutdown_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file
type=AVC msg=audit(1351160858.196:122086): avc:  denied  { write } for  pid=12443 comm="shutdown" path="/tmp/vds_bootstrap_complete.20121025_122705.log" dev=dm-0 ino=915858 scontext=unconfined_u:unconfined_r:shutdown_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { search } for  pid=6519 comm="wdmd" name="/" dev=tmpfs ino=5538 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=dir
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { write } for  pid=6519 comm="wdmd" name="/" dev=tmpfs ino=5538 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=dir
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { remove_name } for  pid=6519 comm="wdmd" name="wdmd" dev=tmpfs ino=18610 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=dir
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { unlink } for  pid=6519 comm="wdmd" name="wdmd" dev=tmpfs ino=18610 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=file
type=AVC msg=audit(1351171127.900:61645): avc:  denied  { getattr } for  pid=3534 comm="qemu-kvm" name="/" dev=0:20 ino=2 scontext=system_u:system_r:svirt_t:s0:c600,c657 tcontext=system_u:object_r:nfs_t:s0 tclass=filesystem
type=AVC msg=audit(1351171128.967:61651): avc:  denied  { getattr } for  pid=3534 comm="qemu-kvm" name="/" dev=0:20 ino=2 scontext=system_u:system_r:svirt_t:s0:c600,c657 tcontext=system_u:object_r:nfs_t:s0 tclass=filesystem

Comment 9 Federico Simoncelli 2012-10-30 01:18:33 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > Once the pool is marked as 3.1 in the engine the upgrade process is started
> > automatically when the spmStart is issued (expected domainVersion). If such
> > process keeps failing (on the master domain) it could prevent you from
> > re-activating the pool. Since the V3 upgrade is stuck on the acquireHostId
> > command I fear that we might start experiencing a selinux issue on rhel6
> > too. Dafna could you try to temporarily disable selinux (setenforce 0) and
> > activate the pool again. Thanks.
> 
> Federico, would you please elaborate on the reason you need this scenario
> reproduced again ? to me, its obvious that its related to Selinux and
> disabling it will workaround the problem, furthermore, we have an AVC for
> wdmd service on audit.log.

It is not obvious; in fact, the avc warnings that you reported are related to the bootstrap and wdmd (not to sanlock).
Anyway, it is not related to selinux at all. I logged in to the host and did some tests. One of my (bugzilla) comments probably got lost (or was not submitted), but what I saw was that the metadata files (ids/leases) had permissions 600 (vdsm:kvm) and sanlock wasn't able to access them. What I also wrote (and lost) was to change those permissions to 660 and activate the pool again. Now we need to understand whether we need to make this automatic (are those the standard permissions in 3.0?).
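
For the record, the manual workaround amounts to making the two sanlock metadata files group readable/writable (mode 0660, owner vdsm:kvm) so that the sanlock daemon can open them. A minimal sketch, using the local-domain path from the description above (adjust the mount point and domain UUID as needed):

    import os
    import stat

    # dom_md directory of the master domain from this bug's sanlock log
    dom_md = ("/rhev/data-center/mnt/_data_local/"
              "508b3e1a-7533-4bec-b27b-8196dc5be596/dom_md")

    for name in ("ids", "leases"):
        path = os.path.join(dom_md, name)
        # 0660: read/write for owner (vdsm) and group (kvm), nothing for others
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP)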

Comment 10 Allon Mureinik 2012-10-31 15:06:07 UTC
Just to be completely clear - does this issue fix itself in VDSM if I chmod to 660, or do we corrupt the dc/sd?

Comment 11 Federico Simoncelli 2012-11-01 03:22:14 UTC
(In reply to comment #10)
> Just to be completely clear - does this issue fix itslef in VDSM if I chmod
> to 660, or do we corrupt the dc/sd?

Looking at the exceptions and at the code it should fix itself but I wanted QA to verify it.

Comment 12 Haim 2012-11-01 12:19:58 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > Just to be completely clear - does this issue fix itslef in VDSM if I chmod
> > to 660, or do we corrupt the dc/sd?
> 
> Looking at the exceptions and at the code it should fix itself but I wanted
> QA to verify it.

Check what? Do you have a fix for us?

Comment 13 Daniel Paikov 2012-11-01 13:50:25 UTC
(In reply to comment #9)
> It is not obvious, in fact the avc warnings that you reported are related to
> the bootstrap and wdmd (and not to sanlock).
> Anyway it is not related to selinux at all. I logged to the host and I did
> some tests. Probably one of my (bugzilla) comments got lost (or not
> submitted) but what I saw was that the metadata files (ids/leases) had
> permissions 600 (vdsm:kvm) and sanlock wasn't able to access the files. What
> I also wrote (and lost) was to change such permissions to 660 and activate
> the pool again. Now we need to understand if we need to make this automatic
> (are those the standard permissions in 3.0?).

It's impossible to create domains under 3.0 in directories with permissions of 600 OR of 660. It fails like this:

Thread-6655::ERROR::2012-11-01 15:31:01,794::storage_connection::302::Storage.ServerConnection::(__validateLocalConnection) Error during storage connection validation: Permission settings on the specified path do not allow access to the storage. Verify permission settings on the specified storage path.: 'path = /blah'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 296, in __validateLocalConnection
    validateDirAccess(con['rp'])
  File "/usr/share/vdsm/storage/storage_connection.py", line 27, in validateDirAccess
    getProcPool().fileUtils.validateAccess(dirPath)
  File "/usr/share/vdsm/storage/processPool.py", line 33, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 44, in runExternally
    return self._procPool.runExternally(*args, **kwargs)
  File "/usr/share/vdsm/storage/processPool.py", line 127, in runExternally
    raise err
StorageServerAccessPermissionError: Permission settings on the specified path do not allow access to the storage. Verify permission settings on the specified storage path.: 'path = /blah'
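
For context, the validation that fails here checks POSIX access on the storage directory itself, and a directory needs the execute (search) bit to be traversed, so modes like 600 or 660 on the directory are rejected up front. A simplified sketch of that kind of check (not the exact vdsm code):

    import os

    def validateDirAccess(dirPath):
        # a storage directory must be readable, writable and searchable (x bit);
        # directory modes such as 0600 or 0660 fail this check
        if not os.access(dirPath, os.R_OK | os.W_OK | os.X_OK):
            raise OSError("Permission settings on the specified path do not "
                          "allow access to the storage: %r" % dirPath)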

Comment 14 Igor Lvovsky 2012-11-02 06:51:47 UTC
http://gerrit.ovirt.org/#/c/8985/1

Comment 16 Daniel Paikov 2012-11-13 13:16:34 UTC
Checked on 4.9.6-42.

Comment 17 Haim 2012-11-27 14:10:48 UTC
Moving to re-open: the issue was found on vdsm 4.9.6-44 when trying to upgrade a storage pool of type NFS. We hit the following issue:

Thread-499::ERROR::2012-11-27 15:30:10,778::task::853::TaskManager.Task::(_setError) Task=`2411593b-52a1-4a8c-82ea-33455722b390`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3177, in upgradeStoragePool
    pool._upgradePool(targetDomVersion)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 414, in _upgradePool
    self._convertDomain(self.masterDomain, str(targetDomVersion))
  File "/usr/share/vdsm/storage/sp.py", line 1032, in _convertDomain
    domain.getRealDomain(), isMsd, targetFormat)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 282, in convert
    converter(repoPath, hostId, imageRepo, isMsd)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 99, in v3DomainConverter
    newClusterLock.acquireHostId(hostId, async=False)
  File "/usr/share/vdsm/storage/safelease.py", line 175, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('6713fb2c-38d8-4468-8b34-293d76d034ef', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))

I added the following debug printout to the vdsm code:

    if domain.getStorageType() in sd.FILE_DOMAIN_TYPES:
        # only file-based domains get their metadata permissions adjusted here
        log.debug('### storageType=%s', domain.getStorageType())
        domain.setMetadataPermissions()

and it appears that we never enter the if branch.
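
To see why the branch is skipped, it may help to also log the constant we compare against; a small debugging sketch extending the printout above (it uses only the names already in the snippet, nothing new assumed):

    # dump both sides of the comparison so a type/value mismatch becomes visible
    log.debug('### storageType=%s FILE_DOMAIN_TYPES=%s',
              domain.getStorageType(), sd.FILE_DOMAIN_TYPES)
    if domain.getStorageType() in sd.FILE_DOMAIN_TYPES:
        domain.setMetadataPermissions()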

Comment 18 Haim 2012-11-27 14:18:20 UTC
Created attachment 652770 [details]
vdsm log

Comment 19 Haim 2012-11-27 15:58:41 UTC
proposed upstream: http://gerrit.ovirt.org/#/c/9517/1

Comment 24 errata-xmlrpc 2012-12-04 19:13:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html