Bug 688680

Summary: [vdsm] When trying to activate unreachable storage domain - vms become non-responsive for about 200 seconds.
Product: Red Hat Enterprise Linux 6 Reporter: David Naori <dnaori>
Component: vdsmAssignee: Federico Simoncelli <fsimonce>
Status: CLOSED CURRENTRELEASE QA Contact: David Naori <dnaori>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1CC: abaron, bazulay, danken, dnaori, hateya, iheim, ilvovsky, mgoldboi, syeghiay, ykaul
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: vdsm-4.9-59.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-19 15:25:35 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 679123    
Bug Blocks:    
Attachments:
Description Flags
full vdsm-log none

Description David Naori 2011-03-17 17:29:57 UTC
Created attachment 486074 [details]
full vdsm-log

Description of problem:

When trying to activate unreachable NFS storage domain - vms become non-responsive for about 200 seconds. 

from watching the logs refreshIfDirty is causing the problem 

Thread-4160::DEBUG::2011-03-17 18:09:37,659::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains)

Thread-2871::DEBUG::2011-03-17 18:09:37,660::task::491::TaskManager.Task::(_debug) Task be4a9093-a850-49ba-bd38-9cac62eb8607: moving from state init -> state preparing
Thread-2871::INFO::2011-03-17 18:09:37,661::sp::1410::Storage.StoragePool::(getDomains) Get storage pool domains: {'9908a9cd-c8d2-4bc8-b4c5-57edd2625c7c': 'Active', '74b99101-ae
73-4c38-a2bf-4f8cd930e2f7': 'Attached'}

Thread-2871::DEBUG::2011-03-17 18:09:37,661::sdc::54::Storage.StorageDomainCache::(_refreshIfDirty) Cache is cold-expired
Thread-2871::DEBUG::2011-03-17 18:09:37,661::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains)

...

Thread-5162::DEBUG::2011-03-17 18:12:59,248::
 'monitorResponse': '-1' statsAge': '203.60'

 

Version-Release number of selected component (if applicable):
-vdsm-4.9-54.el6.x86_64
-lvm2-2.02.83-2.el6.x86_64


Steps to Reproduce:
1.Have 2 nfs DSD in the pool from 2 different servers
2.Block connection to non master domain using iptables
3.try to activate the unreachable SD

Full vdsm log attached

Comment 3 Federico Simoncelli 2011-03-24 10:30:43 UTC
The vm stats are not updated (statsAge > 2) because the thread is blocked on getVolumeSize.
This is very easy to reproduce without a vm running (the monitorResponse/statsAge values are only a symptom of the problem).
See also bug 679123.

For debug purpose we can disable the pool.refresh() call in public_activateStorageDomain:

--- a/vdsm/storage/spm.py
+++ b/vdsm/storage/spm.py
@@ -1191,9 +1191,7 @@ class SPM:
             hsm.HSM.validateSdUUID(sdUUID)
             hsm.HSM.validatePoolSD(spUUID, sdUUID)
         except:
-            pool.refresh()
-            hsm.HSM.validateSdUUID(sdUUID)
-            hsm.HSM.validatePoolSD(spUUID, sdUUID)
+            raise
         pool.activateSD(sdUUID)

Having changed that getVolumeSize doesn't get stuck anymore and we finally see the real error in the logs:

Thread-82::ERROR::2011-03-24 10:23:10,380::task::854::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 862, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/spm.py", line 114, in run
    return self.func(*args, **kwargs)
  File "/usr/share/vdsm/storage/spm.py", line 1191, in public_activateStorageDomain
    hsm.HSM.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 111, in validateSdUUID
    SDF.produce(sdUUID=sdUUID).validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 185, in validate
    self.Metadata(cache=False)
  File "/usr/share/vdsm/storage/sd.py", line 482, in Metadata
    return self._metadata.metadata
  File "/usr/share/vdsm/storage/metadata.py", line 110, in __getmetadata
    md = self._get()
  File "/usr/share/vdsm/storage/fileSD.py", line 39, in _get
    return misc.stripNewLines(oop.directReadLines(self.metafile))
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

Comment 5 David Naori 2011-04-28 13:42:47 UTC
Verified on vdsm-4.9-62
watch "vdsClient -s 0 getVmStats b79b5b41-8726-42a7-bb89-c090b01d65b1 | grep statsAge"
statsAge did not raise above 2 while Operation Stuck.