950055 – Vdsm uses not updated cache when doing getStorageDomainStats

Bug 950055 - Vdsm uses not updated cache when doing getStorageDomainStats

Summary: Vdsm uses not updated cache when doing getStorageDomainStats

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	vdsm
Sub Component:
Version:	3.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	3.3.0
Assignee:	Allon Mureinik
QA Contact:	Leonid Natapov
Docs Contact:
URL:
Whiteboard:	storage
Duplicates (1):	1015026 (view as bug list)
Depends On:
Blocks:	835543 1019461 3.3snap1
TreeView+	depends on / blocked

Reported:	2013-04-09 14:31 UTC by Alissa
Modified:	2016-02-10 16:57 UTC (History)
CC List:	14 users (show)
Fixed In Version:	is20
Doc Type:	Bug Fix
Doc Text:	VdSM was failing to perform the "getStorageDomainStats" command and failed with a "Failed to GetStorageDomainStatsVDS, error = Domain is either partially accessible or entirely inaccessible". The problem was that the storage domain was cached (together with the path to reach it), so getStorageDomainStats was always executed using the old path. This has been fixed by removing pre-existing domains when connecting to a storage server.
Clone Of:
Environment:
Last Closed:	2014-01-21 16:06:26 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:
Flags:	amureini: needinfo- abaron: Triaged+

Attachments	(Terms of Use)
vdsm log with the error (5.64 KB, application/octet-stream) 2013-04-09 14:31 UTC, Alissa	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2014:0040	normal	SHIPPED_LIVE	vdsm bug fix and enhancement update	2014-01-21 20:26:21 UTC
oVirt gerrit	19995	None	None	None	Never
oVirt gerrit	20254	None	None	None	Never

Description Alissa 2013-04-09 14:31:41 UTC

Created attachment 733264 [details]
vdsm log with the error

Description of problem:
Vdsm fails to do "getStorageDomainStats" command and fails with "Failed to GetStorageDomainStatsVDS, error = Domain is either partially
accessible or entirely inaccessible".

Vdsm checks in its cache for which path to use during the stats command. When edit/update path is done, the path changes (different mount), however the cache doesn't get refreshed, so vdsm tries to access a mount (based on what's in its cache) that is no longer relevant and fails.

Version-Release number of selected component (if applicable):


How reproducible:
always - the second time an edit path of a storage connection is performed

Steps to Reproduce:
1. Create a storage domain (Posix, NFS), for example pointing to /mystorageserver/data1
2. Create manually another folder on some storage, amd copy manually the content of the storage domain to that newly created path. (for example - /mystorageserver/data2)
3. Put storage domain in maintenance
4. Edit the path of the storage domain to point to the path /mystorageserver/data2. By now it should succeed
5. Edit again the path of the storage domain to point back to /mystorageserver/data1. It will fail during "getstoragedomainstats" with the error

"Failed to GetStorageDomainStatsVDS, error = Domain is either partially
accessible or entirely inaccessible"  
  
Actual results:
Edit path of storage domain fails starting the second time.

Expected results:
It shouldn't fail.

Additional info:

log attached.

Comment 1 Federico Simoncelli 2013-04-09 21:11:00 UTC

The problem here is that the storage domain is cached (together with the path to reach it) so getStorageDomainStats is always executed using the old path. The fix could be removing the domain from the cache before executing the sdCache.produce command:

diff --git a/vdsm/storage/hsm.py b/vdsm/storage/hsm.py
index 32a32c9..e35460d 100644
--- a/vdsm/storage/hsm.py
+++ b/vdsm/storage/hsm.py
@@ -2591,6 +2591,7 @@ class HSM:
         vars.task.setDefaultException(
             se.StorageDomainActionError("sdUUID=%s" % sdUUID))
         vars.task.getSharedLock(STORAGE, sdUUID)
+        sdCache.manuallyRemoveDomain(sdUUID)
         dom = sdCache.produce(sdUUID=sdUUID)
         dom.refresh()
         stats = dom.getStats()

but this would be solving just a very specific flow: disconnect old storage, check the new one (getting the statistics) and as side-effect also clear the cache for that domain. Anyway more importantly this wouldn't solve the problem on all the other hosts where getStorageDomainStats isn't called.

Maybe the correct solution is to clear the cache on disconnectStoragePool. We can try this new solution tomorrow.

Comment 2 Allon Mureinik 2013-08-25 08:47:16 UTC

Submitted a patch that forces the cache to refresh when a domain is ACTIVATED.
IIUC, this should cover all our bases, but lets see what the reviewers think about it.

Comment 3 Eduardo Warszawski 2013-09-02 13:09:18 UTC

(In reply to Alissa from comment #0)
> Created attachment 733264 [details]
> vdsm log with the error
> 
> Description of problem:
> Vdsm fails to do "getStorageDomainStats" command and fails with "Failed to
> GetStorageDomainStatsVDS, error = Domain is either partially
> accessible or entirely inaccessible".
> 
> Vdsm checks in its cache for which path to use during the stats command.
> When edit/update path is done, the path changes (different mount), however
> the cache doesn't get refreshed, so vdsm tries to access a mount (based on
> what's in its cache) that is no longer relevant and fails.
> 
> Version-Release number of selected component (if applicable):
> 
> 
> How reproducible:
> always - the second time an edit path of a storage connection is performed
> 
> Steps to Reproduce:
> 1. Create a storage domain (Posix, NFS), for example pointing to
> /mystorageserver/data1
> 2. Create manually another folder on some storage, amd copy manually the
> content of the storage domain to that newly created path. (for example -
> /mystorageserver/data2)
> 3. Put storage domain in maintenance
> 4. Edit the path of the storage domain to point to the path
> /mystorageserver/data2. By now it should succeed
> 5. Edit again the path of the storage domain to point back to
> /mystorageserver/data1. It will fail during "getstoragedomainstats" with the
> error
> 
> "Failed to GetStorageDomainStatsVDS, error = Domain is either partially
> accessible or entirely inaccessible"  
>   
> Actual results:
> Edit path of storage domain fails starting the second time.
> 
> Expected results:
> It shouldn't fail.
> 
> Additional info:
> 
> log attached.

Step 2 is unsupported and strongly discouraged.
Step 4 makes two SDs with the same UUID and connect them to the system or makes and invalid SD with a mismatch between the directory name and the UUID in the MD.
Steps 4 and 5 assume a unique connection for each SD. This will not be certain in the future.
This steps are unrelated to the "edit connections" issue.

In order to test edit connections the ip of the storage server (file or block) should be changed. Should not mess with the SDs.

Comment 4 Eduardo Warszawski 2013-09-02 13:16:48 UTC

From the (lacking short) log we can learn that the issue here is the SD cache returning a previous version of the SD in spite that the sdc was marked as STORAGE_STALE by the connectStorageServer (correct).

Has nothing to do with the activation of the SD.

Only SPM can activate the SD and it is the only one that can change the SD MD.
An attached SD MD should be changed only by the SPM. If a former SPM has altered the domain spmStart takes care of refresh the storage view.

Comment 5 Ayal Baron 2013-09-04 07:43:07 UTC

(In reply to Eduardo Warszawski from comment #4)
> From the (lacking short) log we can learn that the issue here is the SD
> cache returning a previous version of the SD in spite that the sdc was
> marked as STORAGE_STALE by the connectStorageServer (correct).

This is correct and should be fixed indeed.
Once done, we should review places which call sdCache.refreshStorage() (e.g. disconnectStorageServer) to see if we can switch it with sdc.invalidateStorage() to avoid redundant scans.

> 
> Has nothing to do with the activation of the SD.
> 
> Only SPM can activate the SD and it is the only one that can change the SD
> MD.
> An attached SD MD should be changed only by the SPM. If a former SPM has
> altered the domain spmStart takes care of refresh the storage view.

This is correct but irrelevant as the problem here is not about SD MD, it is about the path to the domain being changed.

However, this should indeed have been dealt with once connectStorageServer was called.

Comment 6 Allon Mureinik 2013-09-04 08:30:54 UTC

Patch was abandoned, need to rethink.
Eduardo - thanks for the insight!

Comment 7 Federico Simoncelli 2013-09-04 09:12:47 UTC

(In reply to Ayal Baron from comment #5)
> (In reply to Eduardo Warszawski from comment #4)
> > From the (lacking short) log we can learn that the issue here is the SD
> > cache returning a previous version of the SD in spite that the sdc was
> > marked as STORAGE_STALE by the connectStorageServer (correct).
> 
> This is correct and should be fixed indeed.
> Once done, we should review places which call sdCache.refreshStorage() (e.g.
> disconnectStorageServer) to see if we can switch it with
> sdc.invalidateStorage() to avoid redundant scans.

connectStorageServer adds a new connection, it's not related to cleaning the (entire) current domain cache (at most it would trigger the visibility on a new domain, already handled).

invalidateStorage() is intended to refresh the storage subsystem (multipath and lvm), not the domain cache.
If you want to interact with the domain cache (sdCache.refresh()) you probably want to do it at the domain scope (not server connection scope).

Anyway since connectStorageServer can now prefetch domains, we can use it to selectively clean the relevant domains:

  try:
      doms = self.__prefetchDomains(domType, conObj)
  except:
      self.log.debug("prefetch failed: %s",
                     sdCache.knownSDs,  exc_info=True)
  else:
+     for sdUUID in doms.iterkeys():
+         sdCache.manuallyRemoveDomain(sdUUID)
      sdCache.knownSDs.update(doms)

This will allow us to keep invalidateStorage() and refresh() separated for better flexibility.

Comment 9 Daniel Erez 2013-10-07 15:39:25 UTC

*** Bug 1015026 has been marked as a duplicate of this bug. ***

Comment 10 Allon Mureinik 2013-10-08 15:24:34 UTC

posted the suggested solution (comment 7, see tracker for link) - your input would be highly appreciated.

Comment 11 Leonid Natapov 2013-11-24 16:47:29 UTC

is24.1. tested according steps to reproduce.

Comment 12 Charlie 2013-11-28 00:28:45 UTC

This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 17 errata-xmlrpc 2014-01-21 16:06:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

Note You need to log in before you can comment on or make changes to this bug.