Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 688680

Summary: [vdsm] When trying to activate unreachable storage domain - vms become non-responsive for about 200 seconds.
Product: Red Hat Enterprise Linux 6 Reporter: David Naori <dnaori>
Component: vdsmAssignee: Federico Simoncelli <fsimonce>
Status: CLOSED CURRENTRELEASE QA Contact: David Naori <dnaori>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1CC: abaron, bazulay, danken, dnaori, hateya, iheim, ilvovsky, mgoldboi, syeghiay, ykaul
Target Milestone: rcKeywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: vdsm-4.9-59.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-08-19 15:25:35 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 679123    
Bug Blocks:    
Attachments:
Description Flags
full vdsm-log none

Description David Naori 2011-03-17 17:29:57 UTC
Created attachment 486074 [details]
full vdsm-log

Description of problem:

When trying to activate unreachable NFS storage domain - vms become non-responsive for about 200 seconds. 

from watching the logs refreshIfDirty is causing the problem 

Thread-4160::DEBUG::2011-03-17 18:09:37,659::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains)

Thread-2871::DEBUG::2011-03-17 18:09:37,660::task::491::TaskManager.Task::(_debug) Task be4a9093-a850-49ba-bd38-9cac62eb8607: moving from state init -> state preparing
Thread-2871::INFO::2011-03-17 18:09:37,661::sp::1410::Storage.StoragePool::(getDomains) Get storage pool domains: {'9908a9cd-c8d2-4bc8-b4c5-57edd2625c7c': 'Active', '74b99101-ae
73-4c38-a2bf-4f8cd930e2f7': 'Attached'}

Thread-2871::DEBUG::2011-03-17 18:09:37,661::sdc::54::Storage.StorageDomainCache::(_refreshIfDirty) Cache is cold-expired
Thread-2871::DEBUG::2011-03-17 18:09:37,661::misc::1011::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc._refreshDomains)

...

Thread-5162::DEBUG::2011-03-17 18:12:59,248::
 'monitorResponse': '-1' statsAge': '203.60'

 

Version-Release number of selected component (if applicable):
-vdsm-4.9-54.el6.x86_64
-lvm2-2.02.83-2.el6.x86_64


Steps to Reproduce:
1.Have 2 nfs DSD in the pool from 2 different servers
2.Block connection to non master domain using iptables
3.try to activate the unreachable SD

Full vdsm log attached

Comment 3 Federico Simoncelli 2011-03-24 10:30:43 UTC
The vm stats are not updated (statsAge > 2) because the thread is blocked on getVolumeSize.
This is very easy to reproduce without a vm running (the monitorResponse/statsAge values are only a symptom of the problem).
See also bug 679123.

For debug purpose we can disable the pool.refresh() call in public_activateStorageDomain:

--- a/vdsm/storage/spm.py
+++ b/vdsm/storage/spm.py
@@ -1191,9 +1191,7 @@ class SPM:
             hsm.HSM.validateSdUUID(sdUUID)
             hsm.HSM.validatePoolSD(spUUID, sdUUID)
         except:
-            pool.refresh()
-            hsm.HSM.validateSdUUID(sdUUID)
-            hsm.HSM.validatePoolSD(spUUID, sdUUID)
+            raise
         pool.activateSD(sdUUID)

Having changed that getVolumeSize doesn't get stuck anymore and we finally see the real error in the logs:

Thread-82::ERROR::2011-03-24 10:23:10,380::task::854::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 862, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/spm.py", line 114, in run
    return self.func(*args, **kwargs)
  File "/usr/share/vdsm/storage/spm.py", line 1191, in public_activateStorageDomain
    hsm.HSM.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 111, in validateSdUUID
    SDF.produce(sdUUID=sdUUID).validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 185, in validate
    self.Metadata(cache=False)
  File "/usr/share/vdsm/storage/sd.py", line 482, in Metadata
    return self._metadata.metadata
  File "/usr/share/vdsm/storage/metadata.py", line 110, in __getmetadata
    md = self._get()
  File "/usr/share/vdsm/storage/fileSD.py", line 39, in _get
    return misc.stripNewLines(oop.directReadLines(self.metafile))
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

Comment 5 David Naori 2011-04-28 13:42:47 UTC
Verified on vdsm-4.9-62
watch "vdsClient -s 0 getVmStats b79b5b41-8726-42a7-bb89-c090b01d65b1 | grep statsAge"
statsAge did not raise above 2 while Operation Stuck.