Bug 870024 - 3.1.z - vdsm: Unexpected exception when upgrading local/NFS domain from 3.0 to 3.1
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Haim
QA Contact: Dafna Ron
Whiteboard: storage
Keywords: TestBlocker, ZStream
Depends On:
Blocks: 883327 887900
Reported: 2012-10-25 08:15 EDT by Dafna Ron
Modified: 2014-01-12 19:54 EST (History)
13 users

See Also:
Fixed In Version: vdsm-4.9.6-41.0
Doc Type: Bug Fix
Doc Text:
As part of the upgrade from Red Hat Enterprise Virtualization 3.0 to 3.1, storage domain metadata is also upgraded. Previously, the storage lease metadata files carried incorrect permissions, which prevented file-based storage domains from being recovered after an upgrade. This fix corrects the permissions on the storage lease metadata files during the upgrade procedure. The local domain can now be upgraded without issue and comes online after the upgrade.
Story Points: ---
Clone Of:
: 883327
Environment:
Last Closed: 2012-12-04 14:13:22 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (829.71 KB, application/x-gzip)
2012-10-25 08:15 EDT, Dafna Ron
nfs logs (848.03 KB, application/x-gzip)
2012-10-25 10:08 EDT, Dafna Ron
pool is not recovered logs (428.49 KB, application/x-gzip)
2012-10-25 10:11 EDT, Dafna Ron
vdsm log (536.94 KB, application/x-gzip)
2012-11-27 09:18 EST, Haim


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2012:1508 normal SHIPPED_LIVE Important: rhev-3.1.0 vdsm security, bug fix, and enhancement update 2012-12-04 18:48:05 EST

Description Dafna Ron 2012-10-25 08:15:48 EDT
Created attachment 633292 [details]
logs

Description of problem:

I upgraded my setup from 3.0 to 3.1 (si22.1) and tried to upgrade my local domain.
The UI shows an unexpected exception when we try to upgrade the DC.

Backend: 

2012-10-25 13:53:56,990 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (http-/0.0.0.0:8443-5) [399bd68b] Weird return value: Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         661
mMessage                      Cannot acquire host id: ('508b3e1a-7533-4bec-b27b-8196dc5be596', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))


vdsm: 

Thread-121::ERROR::2012-10-25 13:51:39,415::task::853::TaskManager.Task::(_setError) Task=`e480d5a0-3ca0-48d8-9234-f8a3cbd8e051`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3068, in upgradeStoragePool
    pool._upgradePool(targetDomVersion)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 413, in _upgradePool
    self._convertDomain(self.masterDomain, str(targetDomVersion))
  File "/usr/share/vdsm/storage/sp.py", line 1031, in _convertDomain
    domain.getRealDomain(), isMsd, targetFormat)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 252, in convert
    converter(repoPath, hostId, imageRepo, isMsd)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 99, in v3DomainConverter
    newClusterLock.acquireHostId(hostId, async=False)
  File "/usr/share/vdsm/storage/safelease.py", line 175, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('508b3e1a-7533-4bec-b27b-8196dc5be596', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))

sanlock:

2012-10-25 13:51:38+0200 4914 [1929]: s1 lockspace 508b3e1a-7533-4bec-b27b-8196dc5be596:1:/rhev/data-center/mnt/_data_local/508b3e1a-7533-4bec-b27b-8196dc5be596/dom_md/ids:0
2012-10-25 13:51:38+0200 4914 [9395]: open error -13 /rhev/data-center/mnt/_data_local/508b3e1a-7533-4bec-b27b-8196dc5be596/dom_md/ids
2012-10-25 13:51:38+0200 4914 [9395]: s1 open_disk /rhev/data-center/mnt/_data_local/508b3e1a-7533-4bec-b27b-8196dc5be596/dom_md/ids error -13
2012-10-25 13:51:39+0200 4915 [1929]: s1 add_lockspace fail result -19
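The negative codes in these sanlock lines are negated errno values, which is what ties the failure back to permissions: -13 is EACCES on the ids file, and -19 (ENODEV) is what surfaces in vdsm as "No such device". A quick decoding check (plain Python, nothing vdsm-specific assumed):

```python
import os

# sanlock logs failures as negated errno values:
#   "open error -13"                -> errno 13
#   "add_lockspace fail result -19" -> errno 19
print(os.strerror(13))  # Permission denied (EACCES: sanlock cannot open the ids file)
print(os.strerror(19))  # No such device (ENODEV: reported as "Sanlock lockspace add failure")
```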


Version-Release number of selected component (if applicable):

3.0 to si22.1
vdsm-4.9.6-39.0.el6_3.x86_64
sanlock-2.3-4.el6_3.x86_64


How reproducible:

100%

Steps to Reproduce:
1. create a local domain with 1 vm on 3.0 rhevm
2. upgrade rhevm to 3.1
3. yum update vdsm on host -> restart vdsm service (my vm was up) -> try to upgrade DC level to 3.1
  
Actual results:

We get an unexpected exception.
We also cannot remove the DC once this happens.

Expected results:

The upgrade should succeed, or the operation should be blocked before it starts.

Additional info: full logs (including audit log)
Comment 1 Dafna Ron 2012-10-25 09:44:17 EDT
I tested the upgrade with NFS type storage and I got the same issue: 

Thread-963::ERROR::2012-10-25 15:32:51,766::task::853::TaskManager.Task::(_setError) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3068, in upgradeStoragePool
    pool._upgradePool(targetDomVersion)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 413, in _upgradePool
    self._convertDomain(self.masterDomain, str(targetDomVersion))
  File "/usr/share/vdsm/storage/sp.py", line 1031, in _convertDomain
    domain.getRealDomain(), isMsd, targetFormat)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 252, in convert
    converter(repoPath, hostId, imageRepo, isMsd)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 99, in v3DomainConverter
    newClusterLock.acquireHostId(hostId, async=False)
  File "/usr/share/vdsm/storage/safelease.py", line 175, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('63c20d84-f712-4f69-b44a-0c87a33ba2e5', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))
Thread-963::DEBUG::2012-10-25 15:32:51,769::task::872::TaskManager.Task::(_run) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::Task._run: dbdf6dd4-2f42-4274-aba3-f31104e40548 ('c58fc8ee-0b57-46d9-8a7e-b7ce7416c56d', '3') {} failed - stopping task
Thread-963::DEBUG::2012-10-25 15:32:51,770::task::1199::TaskManager.Task::(stop) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::stopping in state preparing (force False)
Thread-963::DEBUG::2012-10-25 15:32:51,770::task::978::TaskManager.Task::(_decref) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::ref 1 aborting True
Thread-963::INFO::2012-10-25 15:32:51,771::task::1157::TaskManager.Task::(prepare) Task=`dbdf6dd4-2f42-4274-aba3-f31104e40548`::aborting: Task is aborted: 'Cannot acquire host id' - code 661
Thread-963::DEBUG::2012-10-25 15:32:51,771::task::1162::TaskMan
Comment 2 Dafna Ron 2012-10-25 09:55:26 EDT
[root@gold-vdsc ~]# ls -lZd /rhev/
drwxr-xr-x. root root system_u:object_r:root_t:s0      /rhev/
[root@gold-vdsc ~]# ls -lZd /rhev/*
drwxrwxr-x. vdsm kvm system_u:object_r:mnt_t:s0       /rhev/data-center
[root@gold-vdsc ~]#
Comment 4 Dafna Ron 2012-10-25 10:08:30 EDT
Created attachment 633340 [details]
nfs logs
Comment 5 Dafna Ron 2012-10-25 10:09:36 EDT
After this error, the NFS pool cannot be recovered.
Comment 6 Dafna Ron 2012-10-25 10:11:12 EDT
Created attachment 633343 [details]
pool is not recovered logs
Comment 7 Federico Simoncelli 2012-10-25 22:06:32 EDT
Once the pool is marked as 3.1 in the engine, the upgrade process starts automatically when the spmStart is issued (expected domainVersion). If such a process keeps failing (on the master domain), it could prevent you from re-activating the pool. Since the V3 upgrade is stuck on the acquireHostId command, I fear that we might be starting to experience an SELinux issue on RHEL 6 too. Dafna, could you try to temporarily disable SELinux (setenforce 0) and activate the pool again? Thanks.
Comment 8 Haim 2012-10-29 12:18:35 EDT
(In reply to comment #7)
> Once the pool is marked as 3.1 in the engine the upgrade process is started
> automatically when the spmStart is issued (expected domainVersion). If such
> process keeps failing (on the master domain) it could prevent you from
> re-activating the pool. Since the V3 upgrade is stuck on the acquireHostId
> command I fear that we might start experiencing a selinux issue on rhel6
> too. Dafna could you try to temporarily disable selinux (setenforce 0) and
> activate the pool again. Thanks.

Federico, would you please elaborate on why you need this scenario reproduced again? To me, it's obvious that it's related to SELinux and that disabling it will work around the problem; furthermore, we have an AVC for the wdmd service in audit.log.


type=AVC msg=audit(1351160857.306:122085): avc:  denied  { read } for  pid=12435 comm="setfiles" path=2F746D702F73682D7468642D31333531313833363630202864656C6574656429 dev=dm-0 ino=915187 scontext=unconfined_u:system_r:setfiles_t:s0 tcontext=unconfined_u:object_r:initrc_tmp_t:s0 tclass=file
type=AVC msg=audit(1351160858.196:122086): avc:  denied  { write } for  pid=12443 comm="shutdown" path="/tmp/vds_installer.805233.log" dev=dm-0 ino=915855 scontext=unconfined_u:unconfined_r:shutdown_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file
type=AVC msg=audit(1351160858.196:122086): avc:  denied  { write } for  pid=12443 comm="shutdown" path="/tmp/vds_bootstrap_complete.20121025_122705.log" dev=dm-0 ino=915858 scontext=unconfined_u:unconfined_r:shutdown_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=file
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { search } for  pid=6519 comm="wdmd" name="/" dev=tmpfs ino=5538 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=dir
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { write } for  pid=6519 comm="wdmd" name="/" dev=tmpfs ino=5538 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=dir
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { remove_name } for  pid=6519 comm="wdmd" name="wdmd" dev=tmpfs ino=18610 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=dir
type=AVC msg=audit(1351160860.863:122103): avc:  denied  { unlink } for  pid=6519 comm="wdmd" name="wdmd" dev=tmpfs ino=18610 scontext=system_u:system_r:wdmd_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=file
type=AVC msg=audit(1351171127.900:61645): avc:  denied  { getattr } for  pid=3534 comm="qemu-kvm" name="/" dev=0:20 ino=2 scontext=system_u:system_r:svirt_t:s0:c600,c657 tcontext=system_u:object_r:nfs_t:s0 tclass=filesystem
type=AVC msg=audit(1351171128.967:61651): avc:  denied  { getattr } for  pid=3534 comm="qemu-kvm" name="/" dev=0:20 ino=2 scontext=system_u:system_r:svirt_t:s0:c600,c657 tcontext=system_u:object_r:nfs_t:s0 tclass=filesystem
Comment 9 Federico Simoncelli 2012-10-29 21:18:33 EDT
(In reply to comment #8)
> (In reply to comment #7)
> > Once the pool is marked as 3.1 in the engine the upgrade process is started
> > automatically when the spmStart is issued (expected domainVersion). If such
> > process keeps failing (on the master domain) it could prevent you from
> > re-activating the pool. Since the V3 upgrade is stuck on the acquireHostId
> > command I fear that we might start experiencing a selinux issue on rhel6
> > too. Dafna could you try to temporarily disable selinux (setenforce 0) and
> > activate the pool again. Thanks.
> 
> Federico, would you please elaborate on the reason you need this scenario
> reproduced again ? to me, its obvious that its related to Selinux and
> disabling it will workaround the problem, furthermore, we have an AVC for
> wdmd service on audit.log.

It is not obvious; in fact, the AVC warnings that you reported are related to the bootstrap and wdmd (and not to sanlock).
Anyway, it is not related to SELinux at all. I logged in to the host and ran some tests. One of my (bugzilla) comments probably got lost (or was not submitted), but what I saw was that the metadata files (ids/leases) had permissions 600 (vdsm:kvm), and sanlock was not able to access them. What I also wrote (and lost) was to change those permissions to 660 and activate the pool again. Now we need to understand whether we should make this automatic (are those the standard permissions in 3.0?).
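The manual workaround described in this comment can be sketched as follows. This is a demonstration on throwaway files, not a real domain: the real targets would be the ids and leases files under the domain's dom_md/ directory, and the vdsm:kvm ownership is taken from this comment.

```python
import os
import stat
import tempfile

# Stand-in for the domain's dom_md/ directory; on a real host this would be
# e.g. /rhev/data-center/mnt/_data_local/<sdUUID>/dom_md (owned vdsm:kvm).
dom_md = tempfile.mkdtemp()

for name in ("ids", "leases"):
    path = os.path.join(dom_md, name)
    open(path, "w").close()
    os.chmod(path, 0o600)  # the state the broken upgrade leaves behind

# The fix from this comment: group read/write (660) so that sanlock,
# running in group kvm, can open the lease files.
for name in ("ids", "leases"):
    os.chmod(os.path.join(dom_md, name), 0o660)

mode = stat.S_IMODE(os.stat(os.path.join(dom_md, "ids")).st_mode)
print(oct(mode))  # 0o660
```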
Comment 10 Allon Mureinik 2012-10-31 11:06:07 EDT
Just to be completely clear - does this issue fix itself in VDSM if I chmod to 660, or do we corrupt the DC/SD?
Comment 11 Federico Simoncelli 2012-10-31 23:22:14 EDT
(In reply to comment #10)
> Just to be completely clear - does this issue fix itslef in VDSM if I chmod
> to 660, or do we corrupt the dc/sd?

Looking at the exceptions and at the code it should fix itself but I wanted QA to verify it.
Comment 12 Haim 2012-11-01 08:19:58 EDT
(In reply to comment #11)
> (In reply to comment #10)
> > Just to be completely clear - does this issue fix itslef in VDSM if I chmod
> > to 660, or do we corrupt the dc/sd?
> 
> Looking at the exceptions and at the code it should fix itself but I wanted
> QA to verify it.

Check what? Do you have a fix for us?
Comment 13 Daniel Paikov 2012-11-01 09:50:25 EDT
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #7)
> > > Once the pool is marked as 3.1 in the engine the upgrade process is started
> > > automatically when the spmStart is issued (expected domainVersion). If such
> > > process keeps failing (on the master domain) it could prevent you from
> > > re-activating the pool. Since the V3 upgrade is stuck on the acquireHostId
> > > command I fear that we might start experiencing a selinux issue on rhel6
> > > too. Dafna could you try to temporarily disable selinux (setenforce 0) and
> > > activate the pool again. Thanks.
> > 
> > Federico, would you please elaborate on the reason you need this scenario
> > reproduced again ? to me, its obvious that its related to Selinux and
> > disabling it will workaround the problem, furthermore, we have an AVC for
> > wdmd service on audit.log.
> 
> It is not obvious, in fact the avc warnings that you reported are related to
> the bootstrap and wdmd (and not to sanlock).
> Anyway it is not related to selinux at all. I logged to the host and I did
> some tests. Probably one of my (bugzilla) comments got lost (or not
> submitted) but what I saw was that the metadata files (ids/leases) had
> permissions 600 (vdsm:kvm) and sanlock wasn't able to access the files. What
> I also wrote (and lost) was to change such permissions to 660 and activate
> the pool again. Now we need to understand if we need to make this automatic
> (are those the standard permissions in 3.0?).

It's impossible to create domains under 3.0 in directories with permissions of 600 OR of 660. It fails like this:

Thread-6655::ERROR::2012-11-01 15:31:01,794::storage_connection::302::Storage.ServerConnection::(__validateLocalConnection) Error during storage connection validation: Permission settings on the specified path do not allow access to the storage. Verify permission settings on the specified storage path.: 'path = /blah'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_connection.py", line 296, in __validateLocalConnection
    validateDirAccess(con['rp'])
  File "/usr/share/vdsm/storage/storage_connection.py", line 27, in validateDirAccess
    getProcPool().fileUtils.validateAccess(dirPath)
  File "/usr/share/vdsm/storage/processPool.py", line 33, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 44, in runExternally
    return self._procPool.runExternally(*args, **kwargs)
  File "/usr/share/vdsm/storage/processPool.py", line 127, in runExternally
    raise err
StorageServerAccessPermissionError: Permission settings on the specified path do not allow access to the storage. Verify permission settings on the specified storage path.: 'path = /blah'
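For illustration, the kind of check that raises this error can be sketched with os.access. This is an assumed approximation of the validateDirAccess call in the traceback, not vdsm's actual implementation:

```python
import os
import tempfile


class StorageServerAccessPermissionError(Exception):
    pass


def validate_dir_access(dir_path):
    # Approximation of vdsm's validateDirAccess: require read, write, and
    # traverse permission on the export path before accepting the connection.
    if not os.access(dir_path, os.R_OK | os.W_OK | os.X_OK):
        raise StorageServerAccessPermissionError(
            "Permission settings on the specified path do not allow access "
            "to the storage. Verify permission settings on the specified "
            "storage path.: 'path = %s'" % dir_path)


ok_dir = tempfile.mkdtemp()  # mode 700, owned by us: passes the check
validate_dir_access(ok_dir)
print("access ok")
```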
Comment 14 Igor Lvovsky 2012-11-02 02:51:47 EDT
http://gerrit.ovirt.org/#/c/8985/1
Comment 16 Daniel Paikov 2012-11-13 08:16:34 EST
Checked on 4.9.6-42.
Comment 17 Haim 2012-11-27 09:10:48 EST
Moving to re-open - the issue was found on vdsm 4.9.6-44 while trying to upgrade a storage pool of type NFS. We hit the following issue:

Thread-499::ERROR::2012-11-27 15:30:10,778::task::853::TaskManager.Task::(_setError) Task=`2411593b-52a1-4a8c-82ea-33455722b390`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3177, in upgradeStoragePool
    pool._upgradePool(targetDomVersion)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 414, in _upgradePool
    self._convertDomain(self.masterDomain, str(targetDomVersion))
  File "/usr/share/vdsm/storage/sp.py", line 1032, in _convertDomain
    domain.getRealDomain(), isMsd, targetFormat)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 282, in convert
    converter(repoPath, hostId, imageRepo, isMsd)
  File "/usr/share/vdsm/storage/imageRepository/formatConverter.py", line 99, in v3DomainConverter
    newClusterLock.acquireHostId(hostId, async=False)
  File "/usr/share/vdsm/storage/safelease.py", line 175, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('6713fb2c-38d8-4468-8b34-293d76d034ef', SanlockException(19, 'Sanlock lockspace add failure', 'No such device'))

I have added the following printout in the vdsm code:


    if domain.getStorageType() in sd.FILE_DOMAIN_TYPES:
        log.debug('### storageType=%s', domain.getStorageType())
        domain.setMetadataPermissions()

and it appears that we never enter the if branch.
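To make the failure mode concrete, here is a minimal sketch of that guard; the numeric storage-type constants are invented for the sketch and are not vdsm's real values:

```python
# Illustrative reconstruction of the guard above. The constants are made up
# for this sketch; vdsm's real storage-type values differ.
NFS, LOCALFS, ISCSI = 1, 4, 3
FILE_DOMAIN_TYPES = (NFS, LOCALFS)


def convert_domain(storage_type):
    """Return True when the metadata-permission fix would be applied."""
    if storage_type in FILE_DOMAIN_TYPES:
        # domain.setMetadataPermissions() would run here.
        return True
    # Any other reported type silently skips the fix, leaving the lease
    # files at mode 600 -- the symptom observed in this comment.
    return False


print(convert_domain(NFS), convert_domain(ISCSI))  # True False
```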
Comment 18 Haim 2012-11-27 09:18:20 EST
Created attachment 652770 [details]
vdsm log
Comment 19 Haim 2012-11-27 10:58:41 EST
proposed upstream: http://gerrit.ovirt.org/#/c/9517/1
Comment 24 errata-xmlrpc 2012-12-04 14:13:22 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
