Created attachment 733264 [details]
vdsm log with the error

Description of problem:
Vdsm fails to execute the "getStorageDomainStats" command, with the error "Failed to GetStorageDomainStatsVDS, error = Domain is either partially accessible or entirely inaccessible".

Vdsm checks its cache for which path to use during the stats command. When an edit/update of the path is done, the path changes (different mount); however, the cache does not get refreshed, so vdsm tries to access a mount (based on what is in its cache) that is no longer relevant, and fails.

Version-Release number of selected component (if applicable):

How reproducible:
Always - the second time an edit of a storage connection's path is performed.

Steps to Reproduce:
1. Create a storage domain (POSIX, NFS), for example pointing to /mystorageserver/data1
2. Manually create another folder on some storage and manually copy the content of the storage domain to that newly created path (for example, /mystorageserver/data2).
3. Put the storage domain in maintenance.
4. Edit the path of the storage domain to point to /mystorageserver/data2. At this point it should succeed.
5. Edit the path of the storage domain again, to point back to /mystorageserver/data1. It will fail during "getStorageDomainStats" with the error:

"Failed to GetStorageDomainStatsVDS, error = Domain is either partially accessible or entirely inaccessible"

Actual results:
Editing the path of a storage domain fails from the second time onwards.

Expected results:
It should not fail.

Additional info:
Log attached.
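To make the failure mode concrete, here is a minimal, self-contained Python sketch of the stale-cache pattern described above. This is illustrative only, not vdsm code; all names (ToyPathCache, resolve) are made up:

# Illustrative sketch only (not vdsm code): a domain cache that stores the
# resolved mount path and never invalidates it when the connection changes.

class ToyPathCache(object):
    def __init__(self):
        self._paths = {}  # sdUUID -> mount path resolved at first access

    def produce(self, sdUUID, resolve):
        # BUG in miniature: the cached path wins even after the storage
        # connection was edited and the domain moved to another mount.
        if sdUUID not in self._paths:
            self._paths[sdUUID] = resolve()
        return self._paths[sdUUID]

cache = ToyPathCache()
cache.produce("sd-1", lambda: "/mystorageserver/data1")   # initial connect
# ... connection edited; domain now lives under /mystorageserver/data2 ...
path = cache.produce("sd-1", lambda: "/mystorageserver/data2")
assert path == "/mystorageserver/data1"  # stale path -> the stats call fails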
The problem here is that the storage domain is cached (together with the path used to reach it), so getStorageDomainStats is always executed using the old path. The fix could be removing the domain from the cache before executing the sdCache.produce command:

diff --git a/vdsm/storage/hsm.py b/vdsm/storage/hsm.py
index 32a32c9..e35460d 100644
--- a/vdsm/storage/hsm.py
+++ b/vdsm/storage/hsm.py
@@ -2591,6 +2591,7 @@ class HSM:
         vars.task.setDefaultException(
             se.StorageDomainActionError("sdUUID=%s" % sdUUID))
         vars.task.getSharedLock(STORAGE, sdUUID)
+        sdCache.manuallyRemoveDomain(sdUUID)
         dom = sdCache.produce(sdUUID=sdUUID)
         dom.refresh()
         stats = dom.getStats()

However, this would solve only one very specific flow: disconnect the old storage, check the new one (getting the statistics), and, as a side effect, also clear the cache for that domain. More importantly, this would not solve the problem on all the other hosts where getStorageDomainStats is not called. Maybe the correct solution is to clear the cache on disconnectStoragePool. We can try this new solution tomorrow.
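For reference, a minimal sketch of what the disconnectStoragePool alternative could look like. This is a hypothetical illustration, not a tested patch: it assumes a refresh() method that drops all cached domain objects, and ToyCache is a stand-in for the real cache:

# Hypothetical sketch of the alternative mentioned above -- not a tested
# patch. Assumes refresh() drops all cached domain objects.

class ToyCache(object):
    def __init__(self):
        self._domains = {}

    def refresh(self):
        # The next produce() re-resolves mount paths from what is
        # actually connected, instead of reusing stale objects.
        self._domains.clear()

sdCache = ToyCache()

def disconnectStoragePool(spUUID):
    # ... existing pool teardown elided ...
    # Clearing here would cover every host leaving the pool, not only the
    # one that happens to run getStorageDomainStats afterwards.
    sdCache.refresh()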
Submitted a patch that forces the cache to refresh when a domain is ACTIVATED. IIUC, this should cover all our bases, but let's see what the reviewers think about it.
(In reply to Alissa from comment #0)
> Steps to Reproduce:
> 1. Create a storage domain (POSIX, NFS), for example pointing to
> /mystorageserver/data1
> 2. Manually create another folder on some storage and manually copy the
> content of the storage domain to that newly created path (for example,
> /mystorageserver/data2).
> 3. Put the storage domain in maintenance.
> 4. Edit the path of the storage domain to point to /mystorageserver/data2.
> At this point it should succeed.
> 5. Edit the path of the storage domain again, to point back to
> /mystorageserver/data1.

Step 2 is unsupported and strongly discouraged.

Step 4 either creates two SDs with the same UUID and connects both to the system, or creates an invalid SD with a mismatch between the directory name and the UUID in the MD.

Steps 4 and 5 assume a unique connection for each SD. This will not be guaranteed in the future.

These steps are unrelated to the "edit connections" issue. To test editing connections, the IP of the storage server (file or block) should be changed; the SDs themselves should not be touched.
From the (short and incomplete) log we can learn that the issue here is the SD cache returning a previous version of the SD, despite the fact that the sdc was marked as STORAGE_STALE by connectStorageServer (which is correct).

This has nothing to do with the activation of the SD.

Only the SPM can activate the SD, and it is the only one that can change the SD MD. An attached SD's MD should be changed only by the SPM. If a former SPM has altered the domain, spmStart takes care of refreshing the storage view.
(In reply to Eduardo Warszawski from comment #4)
> From the (short and incomplete) log we can learn that the issue here is
> the SD cache returning a previous version of the SD, despite the fact that
> the sdc was marked as STORAGE_STALE by connectStorageServer (which is
> correct).

This is correct and should indeed be fixed.
Once done, we should review the places which call sdCache.refreshStorage() (e.g. disconnectStorageServer) to see whether we can replace them with sdc.invalidateStorage() to avoid redundant scans.

> This has nothing to do with the activation of the SD.
>
> Only the SPM can activate the SD, and it is the only one that can change
> the SD MD. An attached SD's MD should be changed only by the SPM. If a
> former SPM has altered the domain, spmStart takes care of refreshing the
> storage view.

This is correct but irrelevant, as the problem here is not about the SD MD; it is about the path to the domain being changed. However, this should indeed have been dealt with once connectStorageServer was called.
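If it helps the discussion, here is a minimal sketch of a produce() that honors the staleness marker on lookup. STORAGE_STALE matches the flag name mentioned above; everything else (ToyDomainCache, scan_domain) is made up for illustration and is not the actual vdsm implementation:

# Illustrative sketch of a cache lookup that honors a staleness marker.

import threading

STORAGE_UPDATED, STORAGE_STALE = range(2)

class ToyDomainCache(object):
    def __init__(self, scan_domain):
        self._lock = threading.Lock()
        self._cache = {}
        self._state = STORAGE_UPDATED
        self._scan_domain = scan_domain  # callable: sdUUID -> domain object

    def invalidateStorage(self):
        # What connectStorageServer does (correctly) today: mark stale.
        with self._lock:
            self._state = STORAGE_STALE

    def produce(self, sdUUID):
        with self._lock:
            if self._state == STORAGE_STALE:
                # The missing piece: a stale marker must evict cached
                # domain objects instead of being ignored on lookup.
                self._cache.clear()
                self._state = STORAGE_UPDATED
            if sdUUID not in self._cache:
                self._cache[sdUUID] = self._scan_domain(sdUUID)
            return self._cache[sdUUID]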
The patch was abandoned; this needs a rethink. Eduardo - thanks for the insight!
(In reply to Ayal Baron from comment #5)
> This is correct and should indeed be fixed.
> Once done, we should review the places which call sdCache.refreshStorage()
> (e.g. disconnectStorageServer) to see whether we can replace them with
> sdc.invalidateStorage() to avoid redundant scans.

connectStorageServer adds a new connection; it is not related to cleaning the (entire) current domain cache (at most it would trigger visibility of a new domain, which is already handled).

invalidateStorage() is intended to refresh the storage subsystem (multipath and lvm), not the domain cache. If you want to interact with the domain cache (sdCache.refresh()), you probably want to do it at the domain scope, not at the server connection scope.

Anyway, since connectStorageServer can now prefetch domains, we can use it to selectively clean the relevant domains:

     try:
         doms = self.__prefetchDomains(domType, conObj)
     except:
         self.log.debug("prefetch failed: %s",
                        sdCache.knownSDs, exc_info=True)
     else:
+        for sdUUID in doms.iterkeys():
+            sdCache.manuallyRemoveDomain(sdUUID)
         sdCache.knownSDs.update(doms)

This will allow us to keep invalidateStorage() and refresh() separated, for better flexibility.
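To make the scope separation concrete, here is a toy model of the three operations being contrasted. The method bodies are placeholders, not the real vdsm code; only the method names come from the discussion above:

# Toy model of the three cache scopes argued for above.

class ToySDCache(object):
    def __init__(self):
        self.knownSDs = {}   # sdUUID -> domain factory
        self._domains = {}   # sdUUID -> cached domain object

    def invalidateStorage(self):
        # Storage-subsystem scope: ask multipath/lvm to rescan.
        # Deliberately leaves the domain object cache alone.
        pass

    def refresh(self):
        # Domain-cache scope: drop every cached domain object.
        self._domains.clear()

    def manuallyRemoveDomain(self, sdUUID):
        # Single-domain scope: the selective cleanup used in the snippet
        # above, so only the prefetched domains are re-resolved.
        self._domains.pop(sdUUID, None)

The point of the split is that connectStorageServer pays only for re-resolving the domains it actually prefetched, rather than triggering a full storage rescan.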
*** Bug 1015026 has been marked as a duplicate of this bug. ***
Posted the suggested solution (comment 7; see the tracker for the link). Your input would be highly appreciated.
is24.1. Tested according to the steps to reproduce.
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata, please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html