Bug 882667
Summary: vdsm: master domain is partially inaccessible when umount fails for iso/export domain (only on posix master domain over nfs)
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Version: 3.1.0
Target Release: 3.2.0
Keywords: Regression
Reporter: Dafna Ron <dron>
Assignee: Daniel Erez <derez>
QA Contact: Elad <ebenahar>
CC: abaron, acathrow, amureini, bazulay, dyasny, ewarszaw, hateya, iheim, lnatapov, lpeer, Rhev-m-bugs, scohen, yeylon, ykaul, zdover
Flags: scohen: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: storage
oVirt Team: Storage
Fixed In Version: vdsm-4.10.2-17.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, metadata belonging to the file storage domain was wrongly reported as missing. Under some race conditions, attaching and detaching ISO domains and export domains therefore left those domains in a non-operational state.
VDSM now verifies the existence of file storage domain metadata. ISO domains and export domains no longer fall into a non-operational state due to an inability to find file storage domain metadata.
Type: Bug
Last Closed: 2013-06-10 20:37:09 UTC
Need to understand if the storage server is wrongly disconnected by engine, or if this is a VDSM issue.

The same behaviour occurs with just nfs, no gluster, vdsm 4.10-*. The only difference is that the error is not logged in either vdsm log. The engine log does have the following:

2013-02-28 12:00:02,611 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-7-thread-47) Domain 67fdd12d-87c9-4129-a8aa-3dd076c11a79 is not seen by Host
2013-02-28 12:00:02,612 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (pool-7-thread-47) One of the Storage Domains of host h3 in pool dc_posix is problematic

Full log attached. *Couldn't reproduce with less than three hosts.

Created attachment 703871 [details]
engine.log
(In reply to comment #1)
> Need to understand if the storage server is wrongly disconnected by engine,
> or if this is a VDSM issue.

Vered, any input on this?

(In reply to comment #7)
> Vered, any input on this?

Seems like umount fails on the vdsm side; it doesn't look like engine has anything to do with it at this time.

Dafna, please reproduce on your setup with clean logs. It doesn't reproduce for me anymore and the logs here have too much unrelated data. Thanks.

There seems to be a StorageDomain metadata file that's missing the ROLE key (which should be MASTER_DOMAIN), which results in connectStoragePool (hsm.py) failure. changeRole in sd.py should set the DMDK_ROLE (already set to ROLE) key, so my next step is to see why this is not called.

The cause of:

StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('74b2f39e-d374-4bda-b8a1-fb2aca984a5a',)

is that the host was disconnected from the storage server. In the excerpt below you can see that refreshStoragePool succeeds while the host is connected. Afterwards disconnectStorageServer is received, so the next refreshStoragePool fails to reach the MSD and raises the error. disconnectStorageServer should not be sent by engine.

In addition: this is not related to gluster or posix; all the attempts to connect the gluster storage failed consistently.
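The missing-ROLE symptom described above can be illustrated with a minimal validation sketch. The KEY=VALUE layout and key names come from the dom_md/metadata dump later in this bug; `validate_domain_md`, `parse_md`, and `InvalidMetadataError` are hypothetical helpers for illustration, not vdsm code:

```python
# Required keys taken from the metadata dump in the bug description;
# the helper names below are illustrative, not vdsm's actual API.
REQUIRED_KEYS = ("ROLE", "SDUUID", "MASTER_VERSION")


class InvalidMetadataError(Exception):
    """Hypothetical error; vdsm raises its own StorageDomain* exceptions."""


def parse_md(lines):
    # Storage domain metadata is KEY=VALUE, one pair per line.
    md = {}
    for line in lines:
        if "=" in line:
            key, _, value = line.partition("=")
            md[key] = value
    return md


def validate_domain_md(lines):
    # Fail loudly when a required key (e.g. ROLE) is absent, instead of
    # letting connectStoragePool trip over it later.
    md = parse_md(lines)
    missing = [k for k in REQUIRED_KEYS if k not in md]
    if missing:
        raise InvalidMetadataError("missing keys: %s" % ", ".join(missing))
    return md
```

The point of the sketch is that an explicit check turns "ROLE silently absent" into an immediate, attributable error at metadata-read time.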
###########################################################################
# vdsm.log

Thread-154845::INFO::2012-12-02 15:27:43,960::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'wolf:/volumes/wolf/Dafna_New_Export_0_nfs_71122241851338', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '3ab751fe-128b-418d-b749-cda8a8d32ca6', 'port': ''}], options=None)
Thread-154845::INFO::2012-12-02 15:27:44,003::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '3ab751fe-128b-418d-b749-cda8a8d32ca6'}]}
Thread-154848::INFO::2012-12-02 15:27:48,662::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154848::INFO::2012-12-02 15:27:51,253::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
Thread-154856::INFO::2012-12-02 15:27:51,282::logUtils::37::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': 'wolf:/volumes/wolf/shared/shared_iso_domain', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': 'f7f9dede-a2de-450f-916f-b09da173303f', 'port': ''}], options=None)
Thread-154856::INFO::2012-12-02 15:27:53,436::logUtils::39::dispatcher::(wrapper) Run and protect: disconnectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': 'f7f9dede-a2de-450f-916f-b09da173303f'}]}
Thread-154860::INFO::2012-12-02 15:27:54,819::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154860::ERROR::2012-12-02 15:27:57,831::task::853::TaskManager.Task::(_setError) Task=`6e08df80-e564-43e2-9de6-c99ee5173245`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 745, in refreshStoragePool
    self.validateSdUUID(msdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 246, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 258, in validate
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('74b2f39e-d374-4bda-b8a1-fb2aca984a5a',)

# In engine log:
2012-12-02 15:27:24,898 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (pool-3-thread-40) [7dc2f39b] START, DisconnectStorageServerVDSCommand(HostName = gold-vdsd, HostId = 9bef26b8-3324-11e2-8dfe-001a4a169741, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 3ab751fe-128b-418d-b749-cda8a8d32ca6, connection: wolf:/volumes/wolf/Dafna_New_Export_0_nfs_71122241851338, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 2a24890a
2012-12-02 15:27:28,654 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (pool-3-thread-40) [7dc2f39b] FINISH, DisconnectStorageServerVDSCommand, return: {3ab751fe-128b-418d-b749-cda8a8d32ca6=0}, log id: 2a24890a

As far as I can see the analysis here is incorrect.

User wants to detach the iso/export domain, so the flow is:
On spm:
deactivateSD
detachSD
disconnectSD

After that is finished, what currently happens on all non-spm hosts is:
disconnectStorageServer (only of the iso/export domain, as is evident above)
refreshStoragePool

refresh should succeed in this case. I'm not clear why it fails.
It probably should be:
refreshStoragePool
disconnectStorageServer

But this is a nicety; the refresh should work anyway, since the iso domain is no longer listed in the metadata.

The real problem here is that refresh is reading the master domain metadata and did not find any data:

Thread-154860::DEBUG::2012-12-02 15:27:54,827::persistentDict::226::Storage.PersistentDict::(refresh) read lines (FileMetadataRW)=[]
Thread-154860::WARNING::2012-12-02 15:27:54,827::persistentDict::248::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is

My guess is that for some reason the file is not there and readlines is swallowing this error. ("read lines" above calls the following):

fileSD.py:
    def readlines(self):
        if not self._oop.fileUtils.pathExists(self._metafile):
            return []
        return misc.stripNewLines(self._oop.directReadLines(self._metafile))

(In reply to comment #12)
> My guess is that for some reason the file is not there and readlines is
> swallowing this error.

You can see below that the pool MD is retrieved correctly multiple times after the 1st connect, until it can't be found any more. After "reconnecting" the master domain, the MD can be accessed again. The code mentioned is swallowing errors, but the root cause of those errors should be elucidated.
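The swallowing behaviour under discussion can be made concrete with a standalone sketch. This is illustrative only, not the vdsm fix: plain `os.path`/`open` stand in for vdsm's oop-based I/O, and `MetadataNotFoundError` is a hypothetical exception type:

```python
import os


class MetadataNotFoundError(Exception):
    """Hypothetical; vdsm raises its own storage exceptions."""


def readlines_swallowing(metafile):
    # Pattern from fileSD.py quoted above: a missing metadata file
    # silently becomes "no data", indistinguishable from an empty domain.
    if not os.path.exists(metafile):
        return []
    with open(metafile) as f:
        return [line.rstrip("\n") for line in f]


def readlines_strict(metafile):
    # Alternative pattern: surface the missing file, so callers can tell
    # "metadata absent" apart from "metadata present but empty".
    if not os.path.exists(metafile):
        raise MetadataNotFoundError(metafile)
    with open(metafile) as f:
        return [line.rstrip("\n") for line in f]
```

With the first pattern, persistentDict sees `[]` and logs "data has no embedded checksum - trust it as it is", exactly as in the WARNING line quoted above; with the second, the failure is attributed to the missing file at the point it happens.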
###################################################################
Thread-151674::INFO::2012-12-02 14:02:20,341::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=6, spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', conList=[{'port': '', 'connection': 'filer01.qa.lab.tlv.redhat.com:/Dafna-GA', 'mnt_options': 'vers=3', 'portal': '', 'user': '', 'iqn': '', 'vfs_type': 'nfs', 'password': '******', 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}], options=None)
Thread-151674::INFO::2012-12-02 14:02:20,452::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}]}
Thread-154313::INFO::2012-12-02 15:13:30,268::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154313::INFO::2012-12-02 15:13:31,267::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
Thread-154824::INFO::2012-12-02 15:27:29,349::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154824::INFO::2012-12-02 15:27:30,270::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
Thread-154848::INFO::2012-12-02 15:27:48,662::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-154848::INFO::2012-12-02 15:27:51,253::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None
Thread-154860::INFO::2012-12-02 15:27:54,819::logUtils::37::dispatcher::(wrapper) Run and protect:
refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)

MainThread::INFO::2012-12-02 16:22:04,641::vdsm::70::vds::(run) I am the actual vdsm 4.9-44.0
Thread-17::INFO::2012-12-02 16:22:06,879::logUtils::37::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=6, spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', conList=[{'port': '', 'connection': 'filer01.qa.lab.tlv.redhat.com:/Dafna-GA', 'mnt_options': 'vers=3', 'portal': '', 'user': '', 'iqn': '', 'vfs_type': 'nfs', 'password': '******', 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}], options=None)
Thread-17::INFO::2012-12-02 16:22:06,882::logUtils::39::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '67af501e-506d-4b8e-9119-4287a2e72d62'}]}
Thread-176::INFO::2012-12-02 16:25:38,574::logUtils::37::dispatcher::(wrapper) Run and protect: refreshStoragePool(spUUID='85047de7-c841-4ff0-9ebb-3f722001b0b4', msdUUID='74b2f39e-d374-4bda-b8a1-fb2aca984a5a', masterVersion=1, options=None)
Thread-176::INFO::2012-12-02 16:25:43,156::logUtils::39::dispatcher::(wrapper) Run and protect: refreshStoragePool, Return response: None

vdsm-4.10.2-17.0.el6ev.x86_64. Cluster with 3 hosts. Created an NFS domain. Attached iso/export domains and then detached the iso/export domains. umount succeeded.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html
Created attachment 656044 [details]
logs

Description of problem:
In a 3-host cluster I added iso and export domains. When I detached the domains I got an error:

Host gold-vdsc cannot access one of the Storage Domains attached to the Data Center POSIX. Setting Host state to Non-Operational.

The host becomes non-operational and we cannot activate it, with:

StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('74b2f39e-d374-4bda-b8a1-fb2aca984a5a',)

Version-Release number of selected component (if applicable):
vdsm-4.9.6-44.0.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. In a 3-host cluster, create a posix domain (I used gluster storage but nfs v3 configuration)
2. Attach iso/export domains
3. Detach iso/export domains

Actual results:
The hsm host becomes non-operational and we cannot connect to the storage, with partially-inaccessible errors.

Expected results:
umount should succeed. We should not show the domain as inaccessible when umount fails for iso/export domains.

Additional info: logs. Please note that this is nfs over gluster.
Looking at the logs we can see that umount failed:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2164, in disconnectStorageServer
    conObj.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 199, in disconnect
    self._mount.umount(True, True)
  File "/usr/share/vdsm/storage/mount.py", line 226, in umount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 214, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (1, ';umount: /rhev/data-center/mnt/paikov-rhevm-gluster.qa.lab.tlv.redhat.com:_Dafna-GA1: not found\n')

filer01.qa.lab.tlv.redhat.com:/Dafna-GA on /rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com:_Dafna-GA type nfs (rw,vers=3,addr=10.35.64.21)
wolf:/volumes/wolf/shared/shared_iso_domain on /rhev/data-center/mnt/wolf:_volumes_wolf_shared_shared__iso__domain type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.102)

The domain that appears partially inaccessible:

[root@gold-vdsc ~]# cd /rhev/data-center/mnt/filer01.qa.lab.tlv.redhat.com\:_Dafna-GA/74b2f39e-d374-4bda-b8a1-fb2aca984a5a/dom_md/
[root@gold-vdsc dom_md]# cat metadata
CLASS=Data
DESCRIPTION=Dafna-01
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=POSIX
POOL_DOMAINS=74b2f39e-d374-4bda-b8a1-fb2aca984a5a:Active
POOL_SPM_ID=1
POOL_SPM_LVER=0
POOL_UUID=85047de7-c841-4ff0-9ebb-3f722001b0b4
REMOTE_PATH=filer01.qa.lab.tlv.redhat.com:/Dafna-GA
ROLE=Master
SDUUID=74b2f39e-d374-4bda-b8a1-fb2aca984a5a
TYPE=POSIXFS
VERSION=3
_SHA_CKSUM=79b3b6568dd5e0180a0034daf156e461874d07ec
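The MountError above comes from running umount on a path that is no longer mounted. One defensive pattern, sketched here with hypothetical helpers rather than vdsm's mount module, is to consult /proc/mounts first and treat "not mounted" as success on disconnect, since that is the desired end state anyway:

```python
import subprocess


def is_mounted(path, mounts_file="/proc/mounts"):
    # /proc/mounts lists "<device> <mountpoint> <fstype> <options> ..."
    # one mount per line; the mountpoint is the second field.
    with open(mounts_file) as f:
        for line in f:
            parts = line.split()
            if len(parts) > 1 and parts[1] == path:
                return True
    return False


def safe_umount(path):
    # Hypothetical wrapper: skip the umount (and the resulting MountError)
    # when the path is already unmounted.
    if not is_mounted(path):
        return
    subprocess.check_call(["umount", path])
```

Whether silently skipping is appropriate depends on the flow; for a disconnect of an already-detached iso/export domain, failing the whole operation over an absent mount is what puts the host into a misleading error state here.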