Bug 882667

Summary: vdsm: master domain is partially inaccessible when umount fails for iso/export domain (only on posix master domain over nfs)
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: vdsm
Assignee: Daniel Erez <derez>
Status: CLOSED ERRATA
QA Contact: Elad <ebenahar>
Severity: urgent
Docs Contact:
Priority: high
Version: 3.1.0
CC: abaron, acathrow, amureini, bazulay, dyasny, ewarszaw, hateya, iheim, lnatapov, lpeer, Rhev-m-bugs, scohen, yeylon, ykaul, zdover
Target Milestone: ---
Keywords: Regression
Target Release: 3.2.0
Flags: scohen: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: vdsm-4.10.2-17.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously metadata related to the file storage domain was wrongly reported as missing. This meant that during some race conditions, attaching and detaching ISO domains and export domains led to those domains being put in a non-operational state. VDSM now verifies the existence of file storage domain metadata. ISO domains and export domains no longer fall into a non-operational state due to an inability to find file storage domain metadata.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-06-10 20:37:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
logs (flags: none)
engine.log (flags: none)

Description Dafna Ron 2012-12-02 15:07:33 UTC
Created attachment 656044 [details]
logs

Description of problem:

In a 3-host cluster I added ISO and export domains.
When I detached the domains I got an error:

Host gold-vdsc cannot access one of the Storage Domains attached to the Data Center POSIX. Setting Host state to Non-Operational.

The host becomes non-operational, and we cannot activate it; activation fails with:

StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('74b2f39e-d374-4bda-b8a1-fb2aca984a5a',)


Version-Release number of selected component (if applicable):

vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:

100%

Steps to Reproduce:
1. In a 3-host cluster, create a POSIX domain (I used Gluster storage but with an NFS v3 configuration)
2. Attach ISO/export domains
3. Detach the ISO/export domains
  
Actual results:

The HSM host becomes non-operational and we cannot connect to the storage; connection attempts fail with "partially accessible or entirely inaccessible" errors.

Expected results:

umount should succeed. We should not report the master domain as inaccessible when umount fails for ISO/export domains.

Additional info: logs

Please note that this is NFS over Gluster.


Looking at the logs we can see that umount failed:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2164, in disconnectStorageServer
    conObj.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 199, in disconnect
    self._mount.umount(True, True)
  File "/usr/share/vdsm/storage/mount.py", line 226, in umount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 214, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (1, ';umount: /rhev/data-center/mnt/paikov-rhevm-gluster.qa.lab.tlv.redhat.com:_Dafna-GA1: not found\n')
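
For illustration only, a minimal sketch (a hypothetical standalone helper, not vdsm's mount.py) of a disconnect path that treats the "umount: ... not found" error shown above as already unmounted instead of failing the whole disconnect flow:

import subprocess

def safe_umount(mountpoint):
    # Hypothetical helper: if the mount point is already gone ("not found" /
    # "not mounted"), treat the umount as a no-op instead of raising.
    result = subprocess.run(["umount", mountpoint],
                            capture_output=True, text=True)
    if result.returncode == 0:
        return
    err = (result.stderr or "").lower()
    if "not found" in err or "not mounted" in err:
        return  # nothing to unmount; treat as success
    raise RuntimeError("umount %s failed: %s" % (mountpoint, result.stderr.strip()))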



filer01.qa.lab.tlv.redhat.com:/Dafna-GA on /rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com:_Dafna-GA type nfs (rw,vers=3,addr=10.35.64.21)
wolf:/volumes/wolf/shared/shared_iso_domain on /rhev/data-center/mnt/wolf:_volumes_wolf_shared_shared__iso__domain type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.102)


The domain that appears partially inaccessible:

[root@gold-vdsc ~]# cd /rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com\:_Dafna-GA/74b2f39e-d374-4bda-b8a1-fb2aca984a5a/dom_md/
[root@gold-vdsc dom_md]# 
Display all 1725 possibilities? (y or n)
[root@gold-vdsc dom_md]# cat metadata 
CLASS=Data
DESCRIPTION=Dafna-01
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=POSIX
POOL_DOMAINS=74b2f39e-d374-4bda-b8a1-fb2aca984a5a:Active
POOL_SPM_ID=1
POOL_SPM_LVER=0
POOL_UUID=85047de7-c841-4ff0-9ebb-3f722001b0b4
REMOTE_PATH=filer01.qa.lab.tlv.redhat.com:/Dafna-GA
ROLE=Master
SDUUID=74b2f39e-d374-4bda-b8a1-fb2aca984a5a
TYPE=POSIXFS
VERSION=3
_SHA_CKSUM=79b3b6568dd5e0180a0034daf156e461874d07ec
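
The dom_md/metadata file shown above is a flat KEY=VALUE store with an embedded checksum (_SHA_CKSUM). As a minimal sketch, assuming a hypothetical standalone parser rather than vdsm's own persistentDict code, reading it and failing loudly on the empty or incomplete file this bug revolves around could look like:

def parse_domain_metadata(path):
    # Hypothetical illustration: load KEY=VALUE lines into a dict and fail
    # if the file is empty or missing keys the pool code relies on.
    entries = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, value = line.split("=", 1)
            entries[key] = value
    if not entries:
        raise ValueError("metadata at %s is empty or unreadable" % path)
    for required in ("SDUUID", "ROLE", "MASTER_VERSION", "POOL_UUID"):
        if required not in entries:
            raise ValueError("metadata is missing required key %s" % required)
    return entries

# e.g. parse_domain_metadata(".../dom_md/metadata")["ROLE"] == "Master"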

Comment 1 Allon Mureinik 2012-12-05 10:32:01 UTC
Need to understand if the storage server is wrongly disconnected by engine, or if this is a VDSM issue.

Comment 4 Vered Volansky 2013-02-28 10:13:41 UTC
The same behaviour occurs with plain NFS (no Gluster), vdsm 4.10-*.
The only difference is that the error is not logged in either vdsm log.
The engine log does have the following:

2013-02-28 12:00:02,611 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-7-thread-47) Domain 67fdd12d-87c9-4129-a8aa-3dd076c11a79 is not seen by Host
2013-02-28 12:00:02,612 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-7-thread-47) One of the Storage Domains of host h3 in pool dc_posix is problematic
Full log attached.

* Couldn't reproduce with fewer than three hosts.

Comment 5 Vered Volansky 2013-02-28 10:15:06 UTC
Created attachment 703871 [details]
engine.log

Comment 7 Allon Mureinik 2013-03-03 17:28:05 UTC
(In reply to comment #1)
> Need to understand if the storage server is wrongly disconnected by engine,
> or if this is a VDSM issue.

Vered, any input on this?

Comment 8 Vered Volansky 2013-03-04 06:52:08 UTC
(In reply to comment #7)
> (In reply to comment #1)
> > Need to understand if the storage server is wrongly disconnected by engine,
> > or if this is a VDSM issue.
> 
> Vered, any input on this?
It seems that umount fails on the vdsm side; it doesn't look like the engine has anything to do with it at this point.

Comment 9 Vered Volansky 2013-03-06 18:07:00 UTC
Dafna, please reproduce on your setup with clean logs.
It doesn't reproduce for me anymore, and the logs here contain too much unrelated data.

Thanks.

Comment 10 Vered Volansky 2013-03-19 11:58:38 UTC
The StorageDomain metadata seems to be missing the ROLE key (which should be set to MASTER_DOMAIN); this results in a connectStoragePool (hsm.py) failure.
changeRole in sd.py should set the DMDK_ROLE key (already defined as ROLE), so my next step is to see why it is not called.

Comment 11 Eduardo Warszawski 2013-03-25 08:16:32 UTC
The cause of:
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('74b2f39e-d374-4bda-b8a1-fb2aca984a5a',)

is that the host was disconnected from the storage server.

In the excerpt below you can see that refreshStoragePool succeeds when the host is connected.

Afterwards disconnectStorageServer is received; therefore the next refreshStoragePool fails to reach the MSD and raises the error.


disconnectStorageServer should not be sent by the engine.


In addition:
This is not related to Gluster or POSIX; all the attempts to connect the Gluster storage failed consistently.




###########################################################################
# vdsm.log


Thread-154845::INFO::2012-12-02 15:27:43,960::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'wolf:/volumes/wolf/Dafna_New_Export_0_nfs_71122241851338', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '3ab751fe-128b-418d-b749-cda8a8d32ca6', 'port': ''}], options=None)
Thread-154845::INFO::2012-12-02 15:27:44,003::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '3ab751fe-128b-418d-b749-cda8a8d32ca6'}]}


Thread-154848::INFO::2012-12-02 15:27:48,662::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154848::INFO::2012-12-02 15:27:51,253::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None

Thread-154856::INFO::2012-12-02 15:27:51,282::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'wolf:/volumes/wolf/shared/shared_iso_domain', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': 'f7f9dede-a2de-450f-916f-b09da173303f', 'port': ''}], options=None)
Thread-154856::INFO::2012-12-02 15:27:53,436::logUtils::39::dispatcher::(wrapper) Run and protect: disconnectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': 'f7f9dede-a2de-450f-916f-b09da173303f'}]}

Thread-154860::INFO::2012-12-02 15:27:54,819::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154860::ERROR::2012-12-02 15:27:57,831::task::853::TaskManager.Task::(_setError) Task=`6e08df80-e564-43e2-9de6-c99ee5173245`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 745, in refreshStoragePool
    self.validateSdUUID(msdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 246, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 258, in validate
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('74b2f39e-d374-4bda-b8a1-fb2aca984a5a',)



# In engine log:

2012-12-02 15:27:24,898 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (pool-3-thread-40) [7dc2f39b] START, DisconnectStorageServerVDSComman
d(HostName = gold-vdsd, HostId = 9bef26b8-3324-11e2-8dfe-001a4a169741, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 3ab751fe-
128b-418d-b749-cda8a8d32ca6, connection: wolf:/volumes/wolf/Dafna_New_Export_0_nfs_71122241851338, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: nu
ll, nfsTimeo: null };]), log id: 2a24890a
2012-12-02 15:27:28,654 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (pool-3-thread-40) [7dc2f39b] FINISH, DisconnectStorageServerVDSComma
nd, return: {3ab751fe-128b-418d-b749-cda8a8d32ca6=0}, log id: 2a24890a

Comment 12 Ayal Baron 2013-04-03 10:59:24 UTC
As far as I can see the analysis here is incorrect.

The user wants to detach the ISO/export domain, so the flow is:
On spm:
deactivateSD
detachSD
disconnectSD

After that is finished, what *currently* happens on all non-SPM hosts is:
disconnectStorageServer (only for the ISO/export domain, as is evident above)
refreshStoragePool

The refresh should succeed in this case; I'm not clear why it fails.
It should probably be:
refreshStoragePool
disconnectStorageServer

But this is a nicety; the refresh should work anyway, since the ISO domain is no longer listed in the metadata.

The real problem here is that the refresh reads the master domain metadata and does not find any data:
Thread-154860::DEBUG::2012-12-02 15:27:54,827::persistentDict::226::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=[]
Thread-154860::WARNING::2012-12-02 15:27:54,827::persistentDict::248::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is

My guess is that for some reason the file is not there and readlines is swallowing this error.
(read lines above calls the following):

fileSD.py:
     def readlines(self):
         if not self._oop.fileUtils.pathExists(self._metafile):
                 return []
         return misc.stripNewLines(self._oop.directReadLines(self._metafile))
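
As a sketch only, assuming (without knowing the actual patch) that the fix is along the lines the Doc Text describes, i.e. verifying that the metadata file exists instead of silently returning an empty list, a variant of the method above might look like:

     def readlines(self):
         # Hypothetical variant: surface a missing metadata file as an error
         # (the real code would presumably raise a vdsm storage exception)
         # rather than returning [] and letting an empty metadata set pass
         # as valid.
         if not self._oop.fileUtils.pathExists(self._metafile):
             raise IOError("metadata file not found: %s" % self._metafile)
         return misc.stripNewLines(self._oop.directReadLines(self._metafile))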

Comment 13 Eduardo Warszawski 2013-04-11 07:19:44 UTC
(In reply to comment #12)
> As far as I can see the analysis here is incorrect.
> 
> User wants to detach the iso/export domain so the flow is:
> On spm:
> deactivateSD
> detachSD
> disconnectSD
> 
> After that is finished, what *currently happens is on all non-spm hosts:
> disconnectStorageServer (only of the iso/export domain as is evident above)
> refreshStoragePool
> 
> refresh should succeed in this case. I'm not clear why it fails.
> it probably should be:
> refreshStoragePool
> disconnectStorageServer
> 
> But this is a nicety, the refresh should work anyway since the iso domain is
> no longer listed in the metadata.
> 
> The real problem here is that refresh is reading the master domain metadata
> and did not find any data:
> Thread-154860::DEBUG::2012-12-02
> 15:27:54,827::persistentDict::226::Storage.PersistentDict::(refresh) read
> lines (FileMetadataRW)=[]
> Thread-154860::WARNING::2012-12-02
> 15:27:54,827::persistentDict::248::Storage.PersistentDict::(refresh) data
> has no embedded checksum - trust it as it is
> 
> My guess is that for some reason the file is not there and readlines is
> swallowing this error.
> (read lines above calls the following):
> 
> fileSD.py:
>      def readlines(self):
>          if not self._oop.fileUtils.pathExists(self._metafile):
>                  return []
>          return misc.stripNewLines(self._oop.directReadLines(self._metafile))


You can see below that the pool MD is retrieved correctly multiple times after the first connect, until it can't be found anymore.
After the master domain is "reconnected", the MD can be accessed again.
The code mentioned swallows errors, but the root cause of those errors should be elucidated.



###################################################################
Thread-151674::INFO::2012-12-02 14:02:20,341::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=6, spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', conList=[{'port': '', 'connection': 'filer01.qa.lab.tlv.redhat.com:/Dafna-GA', 'mnt_options': 'vers=3', 'portal': '', 'user': '', 'iqn': '', 'vfs_type': 'nfs', 'password': '******', 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}], options=None)
Thread-151674::INFO::2012-12-02 14:02:20,452::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}]}
Thread-154313::INFO::2012-12-02 15:13:30,268::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154313::INFO::2012-12-02 15:13:31,267::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
Thread-154824::INFO::2012-12-02 15:27:29,349::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154824::INFO::2012-12-02 15:27:30,270::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
Thread-154848::INFO::2012-12-02 15:27:48,662::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154848::INFO::2012-12-02 15:27:51,253::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
Thread-154860::INFO::2012-12-02 15:27:54,819::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
MainThread::INFO::2012-12-02 16:22:04,641::vdsm::70::vds::(run) I am the actual vdsm 4.9-44.0
Thread-17::INFO::2012-12-02 16:22:06,879::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=6, spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', conList=[{'port': '', 'connection': 'filer01.qa.lab.tlv.redhat.com:/Dafna-GA', 'mnt_options': 'vers=3', 'portal': '', 'user': '', 'iqn': '', 'vfs_type': 'nfs', 'password': '******', 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}], options=None)
Thread-17::INFO::2012-12-02 16:22:06,882::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}]}
Thread-176::INFO::2012-12-02 16:25:38,574::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-176::INFO::2012-12-02 16:25:43,156::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None

Comment 15 Leonid Natapov 2013-05-06 14:30:12 UTC
vdsm-4.10.2-17.0.el6ev.x86_64. Cluster with 3 hosts. Created an NFS domain.
Attached ISO/export domains and then detached the ISO/export domains. umount succeeded.

Comment 18 errata-xmlrpc 2013-06-10 20:37:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html