Bug 1304834
Summary: [scale] Attaching large ISCSI data domain with 60 luns causes VDSM to set Host/DC to unresponsive
Product: [oVirt] ovirt-engine
Component: BLL.Storage
Status: CLOSED WORKSFORME
Severity: high
Priority: medium
Version: 3.6.2.5
Target Milestone: ovirt-4.1.0-beta
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Keywords: Performance
Reporter: mlehrer
Assignee: Adam Litke <alitke>
QA Contact: guy chen <guchen>
CC: amureini, bugs, gklein, mlehrer, nsoffer, tnisan
Flags: amureini: ovirt-4.1? | rule-engine: planning_ack? | rule-engine: devel_ack? | rule-engine: testing_ack?
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Last Closed: 2017-01-18 09:46:07 UTC

Attachments:
Created attachment 1121176 [details]
vdsm, ovirt, and lastcheck logs 1st occurrence of issue
Setting the target tentatively to 4.0 until we have the results of the discussion on whether this is a feasible test case.

Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has already been released and the bug is not ON_QA.

oVirt 4.0 beta has been released, moving to the RC milestone.

Are we still pursuing this bug as a reasonable configuration? Nir has done a lot of work around block storage domain scalability recently. I propose you retest with the latest vdsm.

Block storage domain is limited to 10 luns. Creating a vg with more luns should fail. See bug 648051.

How are you creating the storage domains?

This looks like a duplicate of bug 1081962 and bug 1346012, so the attached patch should help. Please fix your setup to use 10 luns per storage domain and try this patch.

(In reply to Nir Soffer from comment #9)
> Block storage domain is limited to 10 luns. Creating a vg with more luns
> should fail. See bug 648051.
>
> How are you creating the storage domains?

(In reply to Nir Soffer from comment #10)
> This looks like a duplicate of bug 1081962 and bug 1346012, so the attached
> patch should help.
>
> Please fix your setup to use 10 luns per storage domain and try this patch.

Since only 10 luns are supported, it seems that the configuration of a single ISCSI domain of 60 luns is not a realistic scenario. I have verified on RHV 4.0.4.4 that a single iscsi domain backed by 10 luns will detach/attach in around 1 minute and 10 seconds without any timeouts occurring.

(In reply to mlehrer from comment #12)
> Since only 10 luns are supported, it seems that the configuration of a
> single ISCSI domain of 60 luns is not a realistic scenario.
>
> I have verified on RHV 4.0.4.4 that a single iscsi domain backed by 10 luns
> will detach/attach in around 1 minute and 10 seconds without any timeouts
> occurring.

I must correct myself: the 10-device limit applies only to old domains, before version 3. There is no limit in domain version 3 and later, so you can test a domain with 60 luns; I have seen domains with 121 luns in user systems. I'm not sure what practical limit we should support; maybe ydary has more info on this.

I have tested the scenario on 4.1 build 4: created a domain with 60 luns and attached it to a single host. No errors or timeouts occurred, and the bug did not reproduce.
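The discussion above turns on how many LUNs back a block storage domain, which in LVM terms is the number of physical volumes (PVs) in the domain's volume group. A minimal sketch of checking this is below; the VG names and counts in the sample are hypothetical, and in a real setup the input would come from something like `vgs --noheadings -o vg_name,pv_count`:

```python
# Hypothetical sample of `vgs --noheadings -o vg_name,pv_count` output.
# The VG names and PV counts here are illustrative, not from the bug's env.
SAMPLE_VGS_OUTPUT = """\
  sd-60lun-vg   60
  sd-small-vg    1
"""

def vgs_over_limit(vgs_output, limit=10):
    """Return (vg_name, pv_count) pairs whose PV count exceeds `limit`.

    `limit=10` reflects the pre-v3 domain limit discussed in the comments;
    domain version 3 and later have no such limit.
    """
    over = []
    for line in vgs_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, count = parts[0], int(parts[1])
        if count > limit:
            over.append((name, count))
    return over

print(vgs_over_limit(SAMPLE_VGS_OUTPUT))  # → [('sd-60lun-vg', 60)]
```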
Created attachment 1121175 [details]
vdsm, ovirt, and lastcheck logs 2nd occurrence of issue

Description of problem:
When attaching a single ISCSI domain of 60 luns, VDSM reports communication timeouts and then reports the active storage domain as problematic. A series of events is then triggered, including the migration of 13 running VMs to another host; after around 10 minutes, the host on which the domain attachment was attempted returns to responsive status in the oVirt web admin.

The environment contains 2 hosts and 50 total SDs, of which 20 are ISCSI and 30 are NFS. The SD used for attachment has 70 luns, with a total of 125 possible luns that could be attached.

Version-Release number of selected component (if applicable):
vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch

How reproducible:
Each time, but only on SD attachment.

Steps to Reproduce:
1. Create an ISCSI SD of 60 luns, with 125 possible targets
2. Attach the SD
3. Wait 10 minutes or so for the DC to go unresponsive

Actual results:
VDSM sets the host/DC unresponsive, VMs are migrated, and the SPM is reassigned. The DC unresponsiveness is presumably caused by the host's inability to communicate with the other (data) storage domains during attachment.

Expected results:
The DC stays responsive.

Additional info:
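For context on why the engine flags domains during the attachment: VDSM's domain monitor reports, per domain, how long ago its last health check completed ("lastCheck", visible in the attached lastcheck logs), and the engine treats a domain as problematic when that age exceeds a threshold. The sketch below is illustrative only; the domain IDs, the sample values, and the 30-second threshold are assumptions for this example, not values taken from the bug's environment:

```python
# Illustrative repoStats-like mapping: sdUUID -> monitor stats. The IDs and
# numbers are made up for this sketch; 'lastCheck' is seconds since the
# monitor last completed a check on that domain.
SAMPLE_REPO_STATS = {
    "sd-data-1": {"lastCheck": "1.2", "valid": True},
    "sd-data-2": {"lastCheck": "187.0", "valid": True},
}

def problematic_domains(repo_stats, threshold=30.0):
    """Return sdUUIDs whose last check is stale (older than `threshold`
    seconds) or marked invalid — the condition that makes the engine treat
    the domain, and eventually the host/DC, as unresponsive."""
    return [
        sd_uuid
        for sd_uuid, stats in repo_stats.items()
        if not stats.get("valid", False) or float(stats["lastCheck"]) > threshold
    ]

print(problematic_domains(SAMPLE_REPO_STATS))  # → ['sd-data-2']
```

This is consistent with the reporter's theory above: if the attachment of a 60-lun domain starves the monitors of the other 49 domains long enough, their lastCheck ages past the threshold and the whole host is marked unresponsive.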