Bug 717184 - [vdsm] [scale] getStoragePoolInfo returns with partial list of storage domains which corrupts pool on rhevm side
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Saggi Mizrahi
QA Contact: yeylon@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 612978
Reported: 2011-06-28 09:55 UTC by Haim
Modified: 2016-04-18 06:40 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-12 13:33:17 UTC
Target Upstream Version:


Attachments
vdsm log. (5.92 MB, application/x-gzip)
2011-06-28 09:55 UTC, Haim

Description Haim 2011-06-28 09:55:42 UTC
Created attachment 510235
vdsm log.

Description of problem:

scenario:

- 1 host
- connected to 100 iSCSI storage domains
- vdsmd is restarted and the pool is not connected; the backend tries to connect the pool for
  several hours and fails on the VDSM side with a resource timeout.
- at some point during the reconstruct retries, I get "resource unavailable" errors:
   OSError: [Errno 11] Resource temporarily unavailable
- the backend sends a getStoragePoolInfo command for that pool, and VDSM returns a partial
  list of connected storage domains
- as a result, the backend updates the DB and moves the storage domains that
  were not included in that list to unattached.

This bug is a cousin of 716714.

This results in a corrupted pool state on the backend side: domains are attached and active on the VDSM side but unattached on the backend side, including the master domain, so the pool cannot be activated. The only way out is to re-initialize the data center and attach all storage domains again (or revert the change manually in the database).

When I read the metadata of the master domain, I get a list of all 100 SDs.
I would like to understand how it is possible for VDSM to return a partial list of SDs. How can we defend our system when this occurs?
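
One possible defence, sketched in Python purely for illustration: since the master domain metadata still lists all the SDs, a getStoragePoolInfo reply that omits domains named there could be treated as suspect and not used to move anything to unattached. This is not the backend's actual code; every name below (PartialDomainListError, domains_safe_to_detach, and the three arguments) is hypothetical.

```python
class PartialDomainListError(Exception):
    """Raised when a host report omits domains that the master metadata lists."""


def domains_safe_to_detach(db_domains, reported_domains, master_md_domains):
    """Return the DB domains that may be marked unattached.

    Hypothetical guard: refuse to act on a report that is missing any
    domain listed in the master domain metadata.
    """
    db = set(db_domains)
    reported = set(reported_domains)
    expected = set(master_md_domains)

    missing = expected - reported
    if missing:
        # The report disagrees with the master metadata: do not touch the DB.
        raise PartialDomainListError(
            "report omits %d of %d domains listed in master metadata"
            % (len(missing), len(expected)))

    # Only domains unknown to both the host report and the metadata remain.
    return db - reported
```

With a guard like this, a short reply would surface as an error to retry instead of a silent DB update.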

Full log attached.

The problematic getStoragePoolInfo response:

Thread-128::INFO::2011-06-28 04:50:05,351::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getStoragePoolInfo, Return response:
{'status': {'message': 'OK', 'code': 0},
 'info': {'spm_id': -1, 'master_uuid': 'c86f6017-3b24-4a8a-9a08-22d307ba1560', 'name': 'TIGER-SCALE', 'version': '2',
  'domains': u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc:Active,6cb87f01-1273-42f3-af07-3b82b27fb160:Active,42a5685e-28be-4afc-998a-7952521c64ad:Attached,25363726-0f63-48e7-bb0a-31fc0ac6d3d9:Active,d2d6d11e-f27f-4f87-ad93-f0aba2c75bd2:Active,3ed7d660-1a72-430f-b8a3-ff7dadd5b248:Active,c86f6017-3b24-4a8a-9a08-22d307ba1560:Active,91c4d192-ecfb-4135-8947-df79ecf300d9:Active,3bf3f355-9642-4650-8498-88a5304c525c:Active,8a9127e5-621e-4601-ac00-1b6d934b4eb7:Active,fd6b754f-6969-4a6c-9101-c4cd9839a9a3:Active,e300b440-eb94-4b80-95a6-141456c39933:Active,30147201-bb27-4b1e-a2fc-b3ad62c4de10:Active,4b023fae-0427-463a-ae8a-b70aec512cd1:Active,725bb958-c7bf-468f-bb30-c503b2ad5981:Active',
  'pool_status': 'connected', 'isoprefix': '', 'type': 'ISCSI', 'master_ver': 94, 'lver': 0},
 'dominfo': {u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'6cb87f01-1273-42f3-af07-3b82b27fb160': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'42a5685e-28be-4afc-998a-7952521c64ad': {'status': u'Attached'},
  u'25363726-0f63-48e7-bb0a-31fc0ac6d3d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'd2d6d11e-f27f-4f87-ad93-f0aba2c75bd2': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'3ed7d660-1a72-430f-b8a3-ff7dadd5b248': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'c86f6017-3b24-4a8a-9a08-22d307ba1560': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'91c4d192-ecfb-4135-8947-df79ecf300d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'3bf3f355-9642-4650-8498-88a5304c525c': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'8a9127e5-621e-4601-ac00-1b6d934b4eb7': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'e300b440-eb94-4b80-95a6-141456c39933': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'fd6b754f-6969-4a6c-9101-c4cd9839a9a3': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'30147201-bb27-4b1e-a2fc-b3ad62c4de10': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'4b023fae-0427-463a-ae8a-b70aec512cd1': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
  u'725bb958-c7bf-468f-bb30-c503b2ad5981': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}}
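
The reply above lists 15 domains, while the scenario has 100 attached. A minimal Python sketch of how the 'domains' field could be split and checked against an expected count follows; the helper names (parse_domains, summarize) and the expected_count argument are hypothetical and not part of VDSM.

```python
def parse_domains(domains_field):
    """Split a 'uuid:Status,uuid:Status' string into {uuid: status}."""
    result = {}
    for item in domains_field.split(","):
        if not item:
            continue  # tolerate an empty field or a trailing comma
        sd_uuid, _, status = item.partition(":")
        result[sd_uuid] = status
    return result


def summarize(info, dominfo, expected_count):
    """Compare the 'domains' string with 'dominfo' and an expected count."""
    listed = parse_domains(info["domains"])
    return {
        "listed": len(listed),
        "missing_from_dominfo": sorted(set(listed) - set(dominfo)),
        "looks_partial": len(listed) < expected_count,
    }


# Usage with two of the UUIDs from the reply above.
sample_info = {
    "domains": "038c7b5c-b7fe-41db-b908-ea639bb1d3bc:Active,"
               "42a5685e-28be-4afc-998a-7952521c64ad:Attached",
}
sample_dominfo = {
    "038c7b5c-b7fe-41db-b908-ea639bb1d3bc": {"status": "Active"},
    "42a5685e-28be-4afc-998a-7952521c64ad": {"status": "Attached"},
}
report = summarize(sample_info, sample_dominfo, expected_count=100)
```

Here expected_count would have to come from an independent source, such as the master domain metadata mentioned in the description.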

Comment 2 Saggi Mizrahi 2011-07-12 13:33:17 UTC
It looks like we ran out of file descriptors, processes, or some similar resource.

Edu worked on this; I checked it myself with 101 domains and it worked fine.
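
For anyone trying to confirm this kind of exhaustion, a rough Python sketch for Linux is below. It is an assumption-laden diagnostic, not something taken from VDSM: resource_snapshot and is_eagain are hypothetical helper names, and reading /proc/self/fd only counts descriptors of the process that runs the snippet.

```python
import errno
import os
import resource


def resource_snapshot():
    """Collect the current process's fd and process limits (Linux)."""
    nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    nproc_soft, nproc_hard = resource.getrlimit(resource.RLIMIT_NPROC)
    try:
        # Each entry in /proc/self/fd is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # /proc not available
    return {
        "nofile_soft": nofile_soft,
        "nofile_hard": nofile_hard,
        "nproc_soft": nproc_soft,
        "nproc_hard": nproc_hard,
        "open_fds": open_fds,
    }


def is_eagain(exc):
    """True for the 'Resource temporarily unavailable' OSError."""
    return isinstance(exc, OSError) and exc.errno == errno.EAGAIN


snapshot = resource_snapshot()
```

Logging such a snapshot whenever is_eagain() matches would show how close the process was to its limits when the error appeared.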

