Created attachment 510235: vdsm log.
Description of problem:
Scenario:
- 1 host
- connected to 100 iSCSI storage domains
- vdsmd was restarted and the pool is not connected; the backend tries to connect the pool for several hours and fails on VDSM with a resource timeout.
- at a certain point, during the reconstruct retries, I get resource-unavailable errors:
OSError: [Errno 11] Resource temporarily unavailable
- the backend sends a getStoragePoolInfo command for that pool, and VDSM returns a partial list of connected storage domains
- as a result, the backend performs a DB action and moves the storage domains that weren't included in that list to unattached.
This bug is a cousin of bug 716714.
The result is a corrupted pool state on the backend side: the domains are attached and active on the VDSM side but unattached on the backend side, including the master domain, so the pool cannot be activated. The only way out is to re-initialize the data center and attach all the storage domains again (or to revert the change in the database manually).
When I read the metadata of the master domain, I get a list of all 100 SDs.
I would like to understand how a scenario is possible in which VDSM returns a partial list of SDs. How can we defend our system when this occurs?
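One possible defense, as a minimal sketch only: before moving anything to unattached, the caller could cross-check the reported list against the master domain metadata and refuse to detach domains that the metadata still lists. The function name `domains_safe_to_detach` and its parameters are hypothetical; they are not part of VDSM or the backend, and the backend is not necessarily written in Python.

```python
def domains_safe_to_detach(known, reported, master_metadata, master_uuid):
    """Return the set of domain UUIDs that may be marked unattached.

    known           -- UUIDs the backend DB currently has attached (assumed input)
    reported        -- UUIDs returned by getStoragePoolInfo
    master_metadata -- UUIDs listed in the master domain metadata
    master_uuid     -- UUID of the pool's master domain
    """
    known = set(known)
    reported = set(reported)
    master_metadata = set(master_metadata)

    # A reply that lacks the master domain is treated as incomplete:
    # change nothing in that case.
    if master_uuid not in reported:
        return set()

    # Domains the backend knows about but the reply did not mention.
    missing = known - reported

    # Only detach domains that the master metadata no longer lists either.
    return missing - master_metadata
```

With the scenario described here (the metadata still lists all 100 SDs), such a check would return an empty set, so no domain would be moved to unattached on the strength of the partial reply.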
Full log attached.
The problematic getStoragePoolInfo command:
Thread-128::INFO::2011-06-28 04:50:05,351::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getStoragePoolInfo, Return response: {'status': {'message': 'OK',
'code': 0}, 'info': {'spm_id': -1, 'master_uuid': 'c86f6017-3b24-4a8a-9a08-22d307ba1560', 'name': 'TIGER-SCALE', 'version': '2', 'domains': u'038c7b5c-b7fe-41db-b908-ea639bb1d
3bc:Active,6cb87f01-1273-42f3-af07-3b82b27fb160:Active,42a5685e-28be-4afc-998a-7952521c64ad:Attached,25363726-0f63-48e7-bb0a-31fc0ac6d3d9:Active,d2d6d11e-f27f-4f87-ad93-f0aba2c
75bd2:Active,3ed7d660-1a72-430f-b8a3-ff7dadd5b248:Active,c86f6017-3b24-4a8a-9a08-22d307ba1560:Active,91c4d192-ecfb-4135-8947-df79ecf300d9:Active,3bf3f355-9642-4650-8498-88a5304
c525c:Active,8a9127e5-621e-4601-ac00-1b6d934b4eb7:Active,fd6b754f-6969-4a6c-9101-c4cd9839a9a3:Active,e300b440-eb94-4b80-95a6-141456c39933:Active,30147201-bb27-4b1e-a2fc-b3ad62c
4de10:Active,4b023fae-0427-463a-ae8a-b70aec512cd1:Active,725bb958-c7bf-468f-bb30-c503b2ad5981:Active', 'pool_status': 'connected', 'isoprefix': '', 'type': 'ISCSI', 'master_ver
': 94, 'lver': 0}, 'dominfo': {u'038c7b5c-b7fe-41db-b908-ea639bb1d3bc': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'6cb87f01-1273-42f3-af07-3
b82b27fb160': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'42a5685e-28be-4afc-998a-7952521c64ad': {'status': u'Attached'}, u'25363726-0f63-48e
7-bb0a-31fc0ac6d3d9': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'd2d6d11e-f27f-4f87-ad93-f0aba2c75bd2': {'status': u'Active', 'diskfree': '8
455716864', 'disktotal': '12616466432'}, u'3ed7d660-1a72-430f-b8a3-ff7dadd5b248': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'c86f6017-3b24-4
a8a-9a08-22d307ba1560': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'91c4d192-ecfb-4135-8947-df79ecf300d9': {'status': u'Active', 'diskfree':
'8455716864', 'disktotal': '12616466432'}, u'3bf3f355-9642-4650-8498-88a5304c525c': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'8a9127e5-621e
-4601-ac00-1b6d934b4eb7': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'e300b440-eb94-4b80-95a6-141456c39933': {'status': u'Active', 'diskfree'
: '8455716864', 'disktotal': '12616466432'}, u'fd6b754f-6969-4a6c-9101-c4cd9839a9a3': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'30147201-bb
27-4b1e-a2fc-b3ad62c4de10': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}, u'4b023fae-0427-463a-ae8a-b70aec512cd1': {'status': u'Active', 'diskfre
e': '8455716864', 'disktotal': '12616466432'}, u'725bb958-c7bf-468f-bb30-c503b2ad5981': {'status': u'Active', 'diskfree': '8455716864', 'disktotal': '12616466432'}}
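The `domains` field in the response above is a comma-separated string of `uuid:Status` pairs, and it carries 15 entries where 100 were expected. A small sketch of how a caller could parse that field and flag a short list follows; `parse_domains` and `is_partial` are hypothetical helper names, and the sample string is a shortened excerpt of the logged value with the wrapped UUIDs rejoined.

```python
def parse_domains(domains_field):
    """Split 'uuid:Status,uuid:Status,...' into a {uuid: status} dict."""
    result = {}
    for entry in domains_field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing comma
        uuid, _, status = entry.partition(":")
        result[uuid] = status
    return result


def is_partial(domains_field, expected_count):
    """True if the reply lists fewer domains than the caller expects."""
    return len(parse_domains(domains_field)) < expected_count


# Shortened excerpt of the 'domains' value from the log above.
sample = (
    "038c7b5c-b7fe-41db-b908-ea639bb1d3bc:Active,"
    "42a5685e-28be-4afc-998a-7952521c64ad:Attached,"
    "c86f6017-3b24-4a8a-9a08-22d307ba1560:Active"
)

if __name__ == "__main__":
    print(parse_domains(sample))
    print(is_partial(sample, expected_count=100))
```

The expected count is an assumption supplied by the caller, for example the number of SDs listed in the master domain metadata.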
It looks like we ran out of file descriptors, processes, or some similar resource.
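To check that suspicion, something like the following sketch could be run inside the affected process. It uses only the Python standard library (`resource`, `os`, `errno`); the helper names `limit_report` and `is_eagain` are hypothetical, the `/proc/self/fd` lookup assumes Linux, and nothing here is taken from the VDSM code.

```python
import errno
import os
import resource


def limit_report():
    """Return soft limits and the current open-fd count for this process."""
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    nproc_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)
    try:
        # Linux-specific: one entry per open file descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # /proc is not available on this system
    return {
        "nofile_soft": nofile_soft,
        "nproc_soft": nproc_soft,
        "open_fds": open_fds,
    }


def is_eagain(exc):
    """True for OSError with EAGAIN ('Resource temporarily unavailable')."""
    return isinstance(exc, OSError) and exc.errno == errno.EAGAIN


if __name__ == "__main__":
    print(limit_report())
```

The report only describes the process that calls it, so the numbers are meaningful for this bug only when gathered from vdsmd itself while the errors are occurring.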
Edu worked on this; I checked it myself with 101 domains and it worked fine.