
Bug 717184

Summary: [vdsm] [scale] getStoragePoolInfo returns with partial list of storage domains which corrupts pool on rhevm side
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.2
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Haim <hateya>
Assignee: Saggi Mizrahi <smizrahi>
QA Contact: yeylon <yeylon>
Docs Contact:
CC: abaron, bazulay, dnaori, hateya, iheim, mgoldboi, smizrahi, srevivo, yeylon, ykaul
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-07-12 13:33:17 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 612978    
Attachments:
vdsm log. (flags: none)

Description Haim 2011-06-28 09:55:42 UTC
Created attachment 510235 [details]
vdsm log.

Description of problem:

scenario:

- 1 host
- connected to 100 iSCSI storage domains
- vdsmd restarted; the pool is not connected, and the backend tries to connect
  the pool for several hours, failing on VDSM with a resource timeout
- at a certain point, during reconstruct retries, some "resource unavailable"
  errors appear:
   OSError: [Errno 11] Resource temporarily unavailable
- the backend sends a getStoragePoolInfo command on that pool; VDSM returns a
  partial list of connected storage domains
- as a result, the backend performs a DB action and moves the other storage
  domains, which were not included in that list, to unattached
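The EAGAIN failures in the sequence above are, by definition, transient: the kernel could not allocate a resource (file descriptor, process slot, etc.) at that moment. A caller hitting them during reconstruct retries could wrap the failing call in a bounded retry. This is a generic sketch, not actual VDSM code:

```python
import errno
import time


def retry_on_eagain(fn, attempts=3, delay=0.1):
    """Call fn(), retrying up to `attempts` times when it raises
    OSError EAGAIN (errno 11, 'Resource temporarily unavailable').
    Any other error, or exhausting the attempts, re-raises."""
    for i in range(attempts):
        try:
            return fn()
        except OSError as err:
            if err.errno != errno.EAGAIN or i == attempts - 1:
                raise
            time.sleep(delay)
```

A bounded retry only papers over the symptom; if the EAGAIN comes from descriptor or process exhaustion (see comment 2 below), the leak itself still has to be found.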

This bug is closely related to bug 716714.

This results in a corrupted pool state on the backend side: domains are attached and active on the VDSM side but unattached on the backend side, including the master domain, so the pool cannot be activated. The only way out is to re-initialize the data-center and attach all storage domains again (or manually revert the change in the database).

When I read the metadata of the master domain, I get a list of all 100 SDs.
I would like to understand how a scenario is possible where VDSM returns a partial list of SDs, and how we can defend our system when this occurs.
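One defensive option on the backend side would be to diff the domain list in the getStoragePoolInfo response against the set of domains recorded in the master domain metadata, and refuse to move anything to unattached when the response looks partial. A minimal sketch; the function names are hypothetical and not part of any RHEV-M or VDSM API:

```python
def parse_domain_map(domains_field):
    """Parse the 'domains' string of a getStoragePoolInfo response,
    e.g. u'uuid1:Active,uuid2:Attached', into a {uuid: status} dict."""
    entries = (e for e in domains_field.split(',') if e)
    return dict(e.rsplit(':', 1) for e in entries)


def find_missing_domains(reported_map, expected_uuids):
    """Return the domain UUIDs present in the master-domain metadata
    (expected_uuids) but absent from the response. A non-empty result
    suggests a partial response that should not drive DB updates."""
    return set(expected_uuids) - set(reported_map)


# Example: two reported domains, but three expected from metadata.
reported = parse_domain_map(u'aaa:Active,bbb:Attached')
missing = find_missing_domains(reported, ['aaa', 'bbb', 'ccc'])
print(sorted(missing))  # ['ccc']
```

With such a check, a response listing 15 of 100 domains (as in the log below) would be treated as suspect instead of being written to the database.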

Full log attached.

The problematic getStoragePoolInfo response:

Thread-128::INFO::2011-06-28 04:50:05,351::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getStoragePoolInfo, Return response:
{'status': {'message': 'OK', 'code': 0},
 'info': {'spm_id': -1, 'master_uuid': 'c86f6017-3b24-4a8a-9a08-22d307ba1560', 'name': 'TIGER-SCALE', 'version': '2',
  'domains': u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc:Active,6cb87f01-1273-42f3-af07-3b82b27fb160:Active,42a5685e-28be-4afc-998a-7952521c64ad:Attached,25363726-0f63-48e7-bb0a-31fc0ac6d3d9:Active,d2d6d11e-f27f-4f87-ad93-f0aba2c75bd2:Active,3ed7d660-1a72-430f-b8a3-ff7dadd5b248:Active,c86f6017-3b24-4a8a-9a08-22d307ba1560:Active,91c4d192-ecfb-4135-8947-df79ecf300d9:Active,3bf3f355-9642-4650-8498-88a5304c525c:Active,8a9127e5-621e-4601-ac00-1b6d934b4eb7:Active,fd6b754f-6969-4a6c-9101-c4cd9839a9a3:Active,e300b440-eb94-4b80-95a6-141456c39933:Active,30147201-bb27-4b1e-a2fc-b3ad62c4de10:Active,4b023fae-0427-463a-ae8a-b70aec512cd1:Active,725bb958-c7bf-468f-bb30-c503b2ad5981:Active',
  'pool_status': 'connected', 'isoprefix': '', 'type': 'ISCSI', 'master_ver': 94, 'lver': 0},
 'dominfo': {
  u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'6cb87f01-1273-42f3-af07-3b82b27fb160': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'42a5685e-28be-4afc-998a-7952521c64ad': {'status': u'Attached'},
  u'25363726-0f63-48e7-bb0a-31fc0ac6d3d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'd2d6d11e-f27f-4f87-ad93-f0aba2c75bd2': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'3ed7d660-1a72-430f-b8a3-ff7dadd5b248': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'c86f6017-3b24-4a8a-9a08-22d307ba1560': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'91c4d192-ecfb-4135-8947-df79ecf300d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'3bf3f355-9642-4650-8498-88a5304c525c': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'8a9127e5-621e-4601-ac00-1b6d934b4eb7': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'e300b440-eb94-4b80-95a6-141456c39933': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'fd6b754f-6969-4a6c-9101-c4cd9839a9a3': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'30147201-bb27-4b1e-a2fc-b3ad62c4de10': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'4b023fae-0427-463a-ae8a-b70aec512cd1': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'725bb958-c7bf-468f-bb30-c503b2ad5981': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}}}

Comment 2 Saggi Mizrahi 2011-07-12 13:33:17 UTC
It looks like we ran out of file descriptors, processes, or similar resources.

Edu worked on this; I checked it myself with 101 domains and it worked fine.
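On Linux, the descriptor-exhaustion theory from this comment can be sanity-checked by comparing a process's open descriptors in /proc/<pid>/fd against its RLIMIT_NOFILE soft limit. A sketch, assuming a Linux host with /proc mounted (as on the RHEL 6 machine in this report):

```python
import os
import resource


def fd_usage(pid='self'):
    """Return (open_fds, soft_limit) for a process, using
    /proc/<pid>/fd and RLIMIT_NOFILE. Open-fd counts close to the
    soft limit are a likely precursor to EAGAIN/EMFILE failures
    like the one in this bug."""
    open_fds = len(os.listdir('/proc/%s/fd' % pid))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft


open_fds, soft = fd_usage()
print('%d of %d file descriptors in use' % (open_fds, soft))
```

For another process (e.g. the vdsmd PID), pass its numeric PID as a string; reading /proc/<pid>/fd of a different user's process requires root.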