| Summary: | [vdsm] When trying to activate unreachable storage domain - vms become non-responsive for about 200 seconds. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | David Naori <dnaori> | ||||
| Component: | vdsm | Assignee: | Federico Simoncelli <fsimonce> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | David Naori <dnaori> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 6.1 | CC: | abaron, bazulay, danken, dnaori, hateya, iheim, ilvovsky, mgoldboi, syeghiay, ykaul | ||||
| Target Milestone: | rc | Keywords: | Regression | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | vdsm-4.9-59.el6 | Doc Type: | Bug Fix | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-08-19 15:25:35 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | 679123 | ||||||
| Bug Blocks: | |||||||
| Attachments: |
|
||||||
The vm stats are not updated (statsAge > 2) because the thread is blocked on getVolumeSize. This is very easy to reproduce without a vm running (the monitorResponse/statsAge values are only a symptom of the problem). See also bug 679123. For debug purpose we can disable the pool.refresh() call in public_activateStorageDomain: --- a/vdsm/storage/spm.py +++ b/vdsm/storage/spm.py @@ -1191,9 +1191,7 @@ class SPM: hsm.HSM.validateSdUUID(sdUUID) hsm.HSM.validatePoolSD(spUUID, sdUUID) except: - pool.refresh() - hsm.HSM.validateSdUUID(sdUUID) - hsm.HSM.validatePoolSD(spUUID, sdUUID) + raise pool.activateSD(sdUUID) Having changed that getVolumeSize doesn't get stuck anymore and we finally see the real error in the logs: Thread-82::ERROR::2011-03-24 10:23:10,380::task::854::TaskManager.Task::(_setError) Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 862, in _run return fn(*args, **kargs) File "/usr/share/vdsm/storage/spm.py", line 114, in run return self.func(*args, **kwargs) File "/usr/share/vdsm/storage/spm.py", line 1191, in public_activateStorageDomain hsm.HSM.validateSdUUID(sdUUID) File "/usr/share/vdsm/storage/hsm.py", line 111, in validateSdUUID SDF.produce(sdUUID=sdUUID).validate() File "/usr/share/vdsm/storage/fileSD.py", line 185, in validate self.Metadata(cache=False) File "/usr/share/vdsm/storage/sd.py", line 482, in Metadata return self._metadata.metadata File "/usr/share/vdsm/storage/metadata.py", line 110, in __getmetadata md = self._get() File "/usr/share/vdsm/storage/fileSD.py", line 39, in _get return misc.stripNewLines(oop.directReadLines(self.metafile)) File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper return self.runExternally(func, *args, **kwds) File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally raise Timeout("Operation Stuck") Timeout: Operation Stuck Verified on vdsm-4.9-62 watch "vdsClient -s 0 getVmStats b79b5b41-8726-42a7-bb89-c090b01d65b1 | grep statsAge" statsAge did not raise above 2 while Operation Stuck. |
Created attachment 486074 [details] full vdsm-log Description of problem: When trying to activate unreachable NFS storage domain - vms become non-responsive for about 200 seconds. from watching the logs refreshIfDirty is causing the problem Thread-4160::DEBUG::2011-03-17 18:09:37,659::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains) Thread-2871::DEBUG::2011-03-17 18:09:37,660::task::491::TaskManager.Task::(_debug) Task be4a9093-a850-49ba-bd38-9cac62eb8607: moving from state init -> state preparing Thread-2871::INFO::2011-03-17 18:09:37,661::sp::1410::Storage.StoragePool::(getDomains) Get storage pool domains: {'9908a9cd-c8d2-4bc8-b4c5-57edd2625c7c': 'Active', '74b99101-ae 73-4c38-a2bf-4f8cd930e2f7': 'Attached'} Thread-2871::DEBUG::2011-03-17 18:09:37,661::sdc::54::Storage.StorageDomainCache::(_refreshIfDirty) Cache is cold-expired Thread-2871::DEBUG::2011-03-17 18:09:37,661::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains) ... Thread-5162::DEBUG::2011-03-17 18:12:59,248:: 'monitorResponse': '-1' statsAge': '203.60' Version-Release number of selected component (if applicable): -vdsm-4.9-54.el6.x86_64 -lvm2-2.02.83-2.el6.x86_64 Steps to Reproduce: 1.Have 2 nfs DSD in the pool from 2 different servers 2.Block connection to non master domain using iptables 3.try to activate the unreachable SD Full vdsm log attached