| Summary: | [vdsm] [scale] getStoragePoolInfo returns with partial list of storage domains which corrupts pool on rhevm side | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> | ||||
| Component: | vdsm | Assignee: | Saggi Mizrahi <smizrahi> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | yeylon <yeylon> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 6.2 | CC: | abaron, bazulay, dnaori, hateya, iheim, mgoldboi, smizrahi, srevivo, yeylon, ykaul | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-07-12 13:33:17 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Bug Depends On: | |||||||
| Bug Blocks: | 612978 | ||||||
| Attachments: |
It looks like we ran out of file descriptors, processes, or a similar system resource. Edu worked on this; I checked it myself with 101 domains and it worked fine.
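The `OSError: [Errno 11] Resource temporarily unavailable` in the report is `EAGAIN`. Among other causes, `fork(2)` returns it when the `RLIMIT_NPROC` limit is hit. Descriptor exhaustion usually shows up as `EMFILE` (Errno 24) instead. The following is a minimal sketch of how one might inspect both limits for a running process. It assumes a Linux host with `/proc` mounted. The helper names `fd_usage`, `nproc_limit` and `describe` are hypothetical and are not part of VDSM.

```python
import errno
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process (Linux only).

    open_fds counts entries in /proc/self/fd; the count typically includes
    the descriptor briefly opened to list that directory.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft


def nproc_limit():
    """Return the soft RLIMIT_NPROC: max processes/threads for this user."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft


def describe(limit):
    """Render an rlimit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    used, limit = fd_usage()
    print("open fds: %d of %s" % (used, describe(limit)))
    print("RLIMIT_NPROC soft limit: %s" % describe(nproc_limit()))
    # "[Errno 11] Resource temporarily unavailable" from the report is EAGAIN.
    print("errno %d: %s" % (errno.EAGAIN, os.strerror(errno.EAGAIN)))
```

Comparing these numbers against the thread and descriptor count of `vdsmd` while it holds 100 iSCSI domains would show which limit is closer to exhaustion.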
Created attachment 510235: vdsm log.

Description of problem:

Scenario:
- 1 host
- connected to 100 iSCSI storage domains
- vdsmd was restarted and the pool is not connected; the backend tries to connect the pool for several hours, but the connect fails in VDSM on a resource timeout.
- at a certain point, during reconstruct retries, I get resource-unavailable errors: OSError: [Errno 11] Resource temporarily unavailable
- the backend sends a getStoragePoolInfo command for that pool
- VDSM returns a partial list of connected storage domains
- as a result, the backend performs a DB action and moves the other storage domains, which weren't included in that list, to unattached.

This bug is a cousin of bug 716714.

This results in a corrupted pool state on the backend side. The domains are attached and active on the VDSM side but unattached on the backend side, including the master domain. As a result, the pool cannot be activated. The only way out is to re-initialize the data center and attach all storage domains again (or change them back in the database manually).

When I read the metadata of the master domain, I get a list of all 100 SDs. I would like to understand how a scenario is possible in which VDSM returns a partial list of SDs. How can we defend our system when this occurs?

Full log attached. The problematic getStoragePoolInfo command:
Thread-128::INFO::2011-06-28 04:50:05,351::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getStoragePoolInfo, Return response: {'status': {'message': 'OK', 'code': 0}, 'info': {'spm_id': -1, 'master_uuid': 'c86f6017-3b24-4a8a-9a08-22d307ba1560', 'name': 'TIGER-SCALE', 'version': '2', 'domains': u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc:Active,6cb87f01-1273-42f3-af07-3b82b27fb160:Active,42a5685e-28be-4afc-998a-7952521c64ad:Attached,25363726-0f63-48e7-bb0a-31fc0ac6d3d9:Active,d2d6d11e-f27f-4f87-ad93-f0aba2c75bd2:Active,3ed7d660-1a72-430f-b8a3-ff7dadd5b248:Active,c86f6017-3b24-4a8a-9a08-22d307ba1560:Active,91c4d192-ecfb-4135-8947-df79ecf300d9:Active,3bf3f355-9642-4650-8498-88a5304c525c:Active,8a9127e5-621e-4601-ac00-1b6d934b4eb7:Active,fd6b754f-6969-4a6c-9101-c4cd9839a9a3:Active,e300b440-eb94-4b80-95a6-141456c39933:Active,30147201-bb27-4b1e-a2fc-b3ad62c4de10:Active,4b023fae-0427-463a-ae8a-b70aec512cd1:Active,725bb958-c7bf-468f-bb30-c503b2ad5981:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'ISCSI', 'master_ver': 94, 'lver': 0}, 'dominfo': {u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'6cb87f01-1273-42f3-af07-3b82b27fb160': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'42a5685e-28be-4afc-998a-7952521c64ad': {'status': u'Attached'}, u'25363726-0f63-48e7-bb0a-31fc0ac6d3d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'd2d6d11e-f27f-4f87-ad93-f0aba2c75bd2': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'3ed7d660-1a72-430f-b8a3-ff7dadd5b248': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'c86f6017-3b24-4a8a-9a08-22d307ba1560': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'91c4d192-ecfb-4135-8947-df79ecf300d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'},
u'3bf3f355-9642-4650-8498-88a5304c525c': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'8a9127e5-621e-4601-ac00-1b6d934b4eb7': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'e300b440-eb94-4b80-95a6-141456c39933': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'fd6b754f-6969-4a6c-9101-c4cd9839a9a3': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'30147201-bb27-4b1e-a2fc-b3ad62c4de10': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'4b023fae-0427-463a-ae8a-b70aec512cd1': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'725bb958-c7bf-468f-bb30-c503b2ad5981': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}}
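The report asks how the system can defend itself against such a response. One possible guard is sketched below in Python. It assumes the backend can read the domain list from the master domain's metadata, which the reporter says still lists all 100 SDs. Under that assumption, the backend can refuse to demote any domain that the master metadata still lists. The names `parse_domains`, `domains_to_mark_unattached` and `PartialPoolInfo` are hypothetical. This is not the RHEV-M backend's actual logic.

```python
class PartialPoolInfo(Exception):
    """Raised when getStoragePoolInfo appears to omit attached domains."""

    def __init__(self, missing):
        super().__init__("domains missing from getStoragePoolInfo: "
                         + ", ".join(missing))
        self.missing = missing


def parse_domains(domains_field):
    """Parse the 'domains' string ("uuid:Status,uuid:Status,...") into a
    {uuid: status} dict."""
    result = {}
    for entry in domains_field.split(","):
        entry = entry.strip()
        if not entry:
            continue
        sd_uuid, _, status = entry.partition(":")
        result[sd_uuid] = status
    return result


def domains_to_mark_unattached(reported, db_attached, metadata_domains):
    """Return the domains the backend may move to 'unattached'.

    reported:         UUIDs returned by getStoragePoolInfo
    db_attached:      UUIDs the backend database lists as attached to the pool
    metadata_domains: UUIDs listed in the master domain's pool metadata

    A domain missing from the report but still listed in the master
    metadata is taken as a sign of a partial response, so nothing changes.
    """
    absent = set(db_attached) - set(reported)
    suspicious = absent & set(metadata_domains)
    if suspicious:
        # Partial report: refuse to touch the DB and let the caller retry.
        raise PartialPoolInfo(sorted(suspicious))
    return absent


if __name__ == "__main__":
    reported = parse_domains("sd1:Active,sd2:Attached")
    try:
        domains_to_mark_unattached(reported, {"sd1", "sd2", "sd3"},
                                   {"sd1", "sd2", "sd3"})
    except PartialPoolInfo as err:
        print(err)  # domains missing from getStoragePoolInfo: sd3
```

Running `parse_domains` on the `domains` field of the response above yields 15 entries: 14 `Active` and one `Attached`. Suppose all 100 domains appear in both the database and the master metadata. In that case the guard raises instead of moving the 85 missing domains to unattached.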