Bug 688680 - [vdsm] When trying to activate unreachable storage domain - vms become non-responsive for about 200 seconds.
Summary: [vdsm] When trying to activate unreachable storage domain - vms become non-re...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.1
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: rc
: ---
Assignee: Federico Simoncelli
QA Contact: David Naori
URL:
Whiteboard:
Depends On: 679123
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-03-17 17:29 UTC by David Naori
Modified: 2011-08-19 15:25 UTC (History)
10 users (show)

Fixed In Version: vdsm-4.9-59.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-08-19 15:25:35 UTC
Target Upstream Version:


Attachments (Terms of Use)
full vdsm-log (1007.94 KB, application/binary)
2011-03-17 17:29 UTC, David Naori
no flags Details

Description David Naori 2011-03-17 17:29:57 UTC
Created attachment 486074 [details]
full vdsm-log

Description of problem:

When trying to activate unreachable NFS storage domain - vms become non-responsive for about 200 seconds. 

from watching the logs refreshIfDirty is causing the problem 

Thread-4160::DEBUG::2011-03-17 18:09:37,659::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains)

Thread-2871::DEBUG::2011-03-17 18:09:37,660::task::491::TaskManager.Task::(_debug) Task be4a9093-a850-49ba-bd38-9cac62eb8607: moving from state init -> state preparing
Thread-2871::INFO::2011-03-17 18:09:37,661::sp::1410::Storage.StoragePool::(getDomains) Get storage pool domains: {'9908a9cd-c8d2-4bc8-b4c5-57edd2625c7c': 'Active', '74b99101-ae
73-4c38-a2bf-4f8cd930e2f7': 'Attached'}

Thread-2871::DEBUG::2011-03-17 18:09:37,661::sdc::54::Storage.StorageDomainCache::(_refreshIfDirty) Cache is cold-expired
Thread-2871::DEBUG::2011-03-17 18:09:37,661::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains)

...

Thread-5162::DEBUG::2011-03-17 18:12:59,248::
 'monitorResponse': '-1' statsAge': '203.60'

 

Version-Release number of selected component (if applicable):
-vdsm-4.9-54.el6.x86_64
-lvm2-2.02.83-2.el6.x86_64


Steps to Reproduce:
1.Have 2 nfs DSD in the pool from 2 different servers
2.Block connection to non master domain using iptables
3.try to activate the unreachable SD

Full vdsm log attached

Comment 3 Federico Simoncelli 2011-03-24 10:30:43 UTC
The vm stats are not updated (statsAge > 2) because the thread is blocked on getVolumeSize.
This is very easy to reproduce without a vm running (the monitorResponse/statsAge values are only a symptom of the problem).
See also bug 679123.

For debug purpose we can disable the pool.refresh() call in public_activateStorageDomain:

--- a/vdsm/storage/spm.py
+++ b/vdsm/storage/spm.py
@@ -1191,9 +1191,7 @@ class SPM:
             hsm.HSM.validateSdUUID(sdUUID)
             hsm.HSM.validatePoolSD(spUUID, sdUUID)
         except:
-            pool.refresh()
-            hsm.HSM.validateSdUUID(sdUUID)
-            hsm.HSM.validatePoolSD(spUUID, sdUUID)
+            raise
         pool.activateSD(sdUUID)

Having changed that getVolumeSize doesn't get stuck anymore and we finally see the real error in the logs:

Thread-82::ERROR::2011-03-24 10:23:10,380::task::854::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 862, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/spm.py", line 114, in run
    return self.func(*args, **kwargs)
  File "/usr/share/vdsm/storage/spm.py", line 1191, in public_activateStorageDomain
    hsm.HSM.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 111, in validateSdUUID
    SDF.produce(sdUUID=sdUUID).validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 185, in validate
    self.Metadata(cache=False)
  File "/usr/share/vdsm/storage/sd.py", line 482, in Metadata
    return self._metadata.metadata
  File "/usr/share/vdsm/storage/metadata.py", line 110, in __getmetadata
    md = self._get()
  File "/usr/share/vdsm/storage/fileSD.py", line 39, in _get
    return misc.stripNewLines(oop.directReadLines(self.metafile))
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

Comment 5 David Naori 2011-04-28 13:42:47 UTC
Verified on vdsm-4.9-62
watch "vdsClient -s 0 getVmStats b79b5b41-8726-42a7-bb89-c090b01d65b1 | grep statsAge"
statsAge did not raise above 2 while Operation Stuck.


Note You need to log in before you can comment on or make changes to this bug.