Created attachment 510235 [details] vdsm log.

Description of problem:

Scenario:
- 1 host
- connected to 100 iSCSI storage domains
- vdsmd is restarted and the pool is not connected; the backend tries to connect the pool for several hours and fails in VDSM on resource timeouts.
- At a certain point, during the reconstruct retries, I get resource-unavailable errors: OSError: [Errno 11] Resource temporarily unavailable
- The backend sends a getStoragePoolInfo command for that pool.
- VDSM returns a partial list of the connected storage domains.
- As a result, the backend updates the database and moves the storage domains that were not included in that list to unattached.

This bug is a cousin of 716714.

The result is a corrupted pool state on the backend side: domains are attached and active on the VDSM side but unattached on the backend side, including the master domain, so the pool cannot be activated. The only way out is to re-initialize the data center and attach all storage domains again (or revert the change in the database manually).

When I read the metadata of the master domain, I get a list of all 100 SDs.

I would like to understand how a scenario in which VDSM returns a partial list of SDs is possible. How can we defend our system when this occurs?

Full log attached. The problematic getStoragePoolInfo response:
Thread-128::INFO::2011-06-28 04:50:05,351::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getStoragePoolInfo, Return response: {'status': {'message': 'OK', 'code': 0}, 'info': {'spm_id': -1, 'master_uuid': 'c86f6017-3b24-4a8a-9a08-22d307ba1560', 'name': 'TIGER-SCALE', 'version': '2', 'domains': u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc:Active,6cb87f01-1273-42f3-af07-3b82b27fb160:Active,42a5685e-28be-4afc-998a-7952521c64ad:Attached,25363726-0f63-48e7-bb0a-31fc0ac6d3d9:Active,d2d6d11e-f27f-4f87-ad93-f0aba2c75bd2:Active,3ed7d660-1a72-430f-b8a3-ff7dadd5b248:Active,c86f6017-3b24-4a8a-9a08-22d307ba1560:Active,91c4d192-ecfb-4135-8947-df79ecf300d9:Active,3bf3f355-9642-4650-8498-88a5304c525c:Active,8a9127e5-621e-4601-ac00-1b6d934b4eb7:Active,fd6b754f-6969-4a6c-9101-c4cd9839a9a3:Active,e300b440-eb94-4b80-95a6-141456c39933:Active,30147201-bb27-4b1e-a2fc-b3ad62c4de10:Active,4b023fae-0427-463a-ae8a-b70aec512cd1:Active,725bb958-c7bf-468f-bb30-c503b2ad5981:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'ISCSI', 'master_ver': 94, 'lver': 0}, 'dominfo': {u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'6cb87f01-1273-42f3-af07-3b82b27fb160': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'42a5685e-28be-4afc-998a-7952521c64ad': {'status': u'Attached'}, u'25363726-0f63-48e7-bb0a-31fc0ac6d3d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'd2d6d11e-f27f-4f87-ad93-f0aba2c75bd2': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'3ed7d660-1a72-430f-b8a3-ff7dadd5b248': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'c86f6017-3b24-4a8a-9a08-22d307ba1560': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'91c4d192-ecfb-4135-8947-df79ecf300d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'3bf3f355-9642-4650-8498-88a5304c525c': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'8a9127e5-621e-4601-ac00-1b6d934b4eb7': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'e300b440-eb94-4b80-95a6-141456c39933': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'fd6b754f-6969-4a6c-9101-c4cd9839a9a3': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'30147201-bb27-4b1e-a2fc-b3ad62c4de10': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'4b023fae-0427-463a-ae8a-b70aec512cd1': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'725bb958-c7bf-468f-bb30-c503b2ad5981': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}}
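
On the question of how to defend against this: one possible guard, sketched here under assumptions, is for the consumer of getStoragePoolInfo to compare the reported 'domains' string against the set of domains it expects (for example, the list read from the master domain metadata) before it moves anything to unattached. The function names below are hypothetical and are not part of VDSM or the backend; only the 'uuid:Status,uuid:Status' format is taken from the response above.

```python
def parse_domains(domains_field):
    """Parse a 'uuid:Status,uuid:Status' string into {uuid: status}."""
    result = {}
    for entry in domains_field.split(","):
        entry = entry.strip()
        if not entry:
            continue
        uuid, _, status = entry.partition(":")
        result[uuid] = status
    return result


def missing_domains(reported_field, expected_uuids):
    """Return expected UUIDs that are absent from the reported list."""
    reported = parse_domains(reported_field)
    return sorted(set(expected_uuids) - set(reported))


def safe_to_sync(reported_field, expected_uuids):
    """True only if every expected domain appears in the report.

    Assumption: a shorter-than-expected list is treated as a suspect
    (possibly partial) answer, not as proof that domains were detached.
    """
    return not missing_domains(reported_field, expected_uuids)


if __name__ == "__main__":
    # Two UUIDs copied from the response above; the third is made up
    # to stand in for a domain missing from a partial answer.
    reported = ("038c7b5c-b7fe-41db-b908-ea639bb1d3bc:Active,"
                "42a5685e-28be-4afc-998a-7952521c64ad:Attached")
    expected = ["038c7b5c-b7fe-41db-b908-ea639bb1d3bc",
                "42a5685e-28be-4afc-998a-7952521c64ad",
                "00000000-0000-0000-0000-000000000000"]
    print(missing_domains(reported, expected))
    print(safe_to_sync(reported, expected))
```

With a check like this, a response listing 15 of 100 expected domains would be flagged for retry or manual review instead of being written to the database.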
It looks like we ran out of file descriptors, processes, or some similar resource. Edu worked on this; I checked it myself with 101 domains and it worked fine.
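
Errno 11 is EAGAIN, which on Linux is commonly returned when a per-process or per-user limit is hit. As a rough diagnostic sketch (Linux only, and not something taken from VDSM itself), the current process can compare its open file descriptor count and process limit against the configured rlimits. The helper names are hypothetical, and whether a limit was the actual cause here is not confirmed in this report.

```python
import os
import resource


def fd_usage():
    """Return (open_fd_count, soft_limit) for the current process.

    Counts entries in /proc/self/fd, so this works on Linux only. The
    count includes the descriptor briefly opened to list the directory.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft


def nproc_limit():
    """Return the (soft, hard) limit on processes for the current user."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    used, limit = fd_usage()
    print("open fds: %d, soft limit: %d" % (used, limit))
    print("nproc limit (soft, hard): %s" % (nproc_limit(),))
```

Logging these values when the OSError is raised would show which resource, if any, was close to exhaustion.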