Bug 1304834 - [scale] Attaching large ISCSI data domain with 60 luns causes VDSM to set Host/DC to unresponsive
Status: CLOSED WORKSFORME
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.2.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.1.0-beta
Target Release: ---
Assigned To: Adam Litke
QA Contact: guy chen
Keywords: Performance
Depends On:
Blocks:
Reported: 2016-02-04 13:07 EST by mlehrer
Modified: 2017-01-18 04:46 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 04:46:07 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
amureini: ovirt‑4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
vdsm, ovirt, and lastcheck logs 2nd occurrence of issue (4.77 MB, application/zip)
2016-02-04 13:07 EST, mlehrer
vdsm, ovirt, and lastcheck logs 1st occurrence of issue (1.71 MB, application/zip)
2016-02-04 13:09 EST, mlehrer


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
oVirt gerrit | 58943 | master | ABANDONED | lvm: Use global_filter instead of filter | 2016-06-28 17:48 EDT
oVirt gerrit | 59100 | ovirt-3.6 | ABANDONED | lvm: Use global_filter instead of filter | 2016-06-28 17:45 EDT

Description mlehrer 2016-02-04 13:07:35 EST
Created attachment 1121175
vdsm, ovirt, and lastcheck logs 2nd occurrence of issue

Description of problem:

When attaching a single ISCSI domain of 60 luns, VDSM reports communication timeouts and then reports the active storage domain as problematic. This triggers a series of events, including the migration of 13 running VMs to another host. The original host, on which the domain attachment was attempted, stays unresponsive for around 10 minutes and then returns to responsive status in the oVirt web admin.



The env contains:

2 Hosts

50 total SDs of which 
   20 are ISCSI
   30 are NFS

The SD used for attachment has 70 luns, with a total of 125 possible luns that could be attached.

Version-Release number of selected component (if applicable):

vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch


How reproducible:
Every time, but only on SD attachment.

Steps to Reproduce:
1. Create ISCSI SD of 60 luns, with 125 possible targets
2. Attach SD
3. Wait 10 minutes or so for DC to go unresponsive

Actual results:

VDSM sets host/DC unresponsive, VMs migrated, SPM reassigned.
Assumption: the DC unresponsiveness is caused by the host's inability to communicate with the other (data) storage domains during attachment.
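
The assumption above can be illustrated with a minimal sketch. It assumes per-domain monitoring data of the kind captured in the attached lastcheck logs: for each storage domain, the number of seconds since its last successful check. The function name, the sample values, and the 30-second threshold are hypothetical and chosen for illustration only; they are not taken from VDSM or the engine.

```python
from typing import Dict, List


def find_stale_domains(last_check_secs: Dict[str, float],
                       threshold_secs: float = 30.0) -> List[str]:
    """Return the domains whose last successful check is older than the threshold.

    Both the input shape and the default threshold are assumptions made for
    this sketch; adjust them to match the real monitoring data.
    """
    return sorted(
        domain for domain, age in last_check_secs.items()
        if age > threshold_secs
    )


# Hypothetical sample: two domains stopped being checked during the attach.
sample = {"nfs-01": 2.1, "iscsi-07": 95.4, "iscsi-12": 61.0}
stale = find_stale_domains(sample)
print(stale)
```

If several unrelated domains show up as stale only while the attach is running, that would be consistent with the attach operation starving the monitoring of the other domains.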

Expected results:

DC to stay responsive 

Additional info:
Comment 1 mlehrer 2016-02-04 13:09 EST
Created attachment 1121176
vdsm, ovirt, and lastcheck logs 1st occurrence of issue
Comment 4 Tal Nisan 2016-02-07 09:51:19 EST
Setting target tentatively to 4.0 until we have the results of the discussion on whether that's a feasible test case.
Comment 5 Sandro Bonazzola 2016-05-02 05:58:06 EDT
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.
Comment 6 Yaniv Lavi (Dary) 2016-05-23 09:19:03 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
Comment 7 Yaniv Lavi (Dary) 2016-05-23 09:26:35 EDT
oVirt 4.0 beta has been released, moving to RC milestone.
Comment 8 Adam Litke 2016-06-13 16:47:47 EDT
Are we still pursuing this bug as a reasonable configuration?  Nir has done a lot of work around block storage domain scalability recently.  I propose you retest with the latest vdsm.
Comment 9 Nir Soffer 2016-06-14 20:33:16 EDT
Block storage domain is limited to 10 luns. Creating a vg with more luns should fail. See bug 648051.

How are you creating the storage domains?
Comment 10 Nir Soffer 2016-06-14 20:51:46 EDT
This looks like a duplicate of bug 1081962 and bug 1346012, so the attached patch should help.

Please fix your setup to use 10 luns per storage domains and try this patch.
Comment 11 Yaniv Kaul 2016-06-15 03:40:55 EDT
(In reply to Nir Soffer from comment #9)
> Block storage domain is limited to 10 luns. Creating a vg with more luns
> should fail. See bug 648051.
> 
> How are you creating the storage domains?
Comment 12 mlehrer 2016-09-27 08:29:30 EDT
(In reply to Nir Soffer from comment #10)
> This looks like a duplicate of bug 1081962 and 1346012, so the attach patch
> should
> help.
> 
> Please fix your setup to use 10 luns per storage domains and try this patch.

Since only 10 luns are supported, it seems that a single ISCSI domain of 60 luns is not a realistic scenario.

I have verified on RHV 4.0.4.4 that a single ISCSI domain backed by 10 luns detaches/attaches in around 1 minute and 10 seconds without any timeouts occurring.
Comment 13 Nir Soffer 2016-09-27 09:12:51 EDT
(In reply to mlehrer from comment #12)
> (In reply to Nir Soffer from comment #10)
> > This looks like a duplicate of bug 1081962 and 1346012, so the attach patch
> > should
> > help.
> > 
> > Please fix your setup to use 10 luns per storage domains and try this patch.
> 
> Since only 10 luns are supported it seems that the configuration of single
> ISCSI domain of 60 luns is not a realistic scenario.
> 
> I have verified on RHV 4.0.4.4 that a single iscsi domain backed by 10 luns
> will dettach/attach around 1 minute and 10 seconds without any timeouts
> occurring.

I must correct myself: the limit of 10 devices applies only to old domains
before version 3. There is no limit in domain version 3 and later.

So you can test a domain with 60 luns; I have seen domains with 121 luns in users'
systems.

I'm not sure what practical limit we should support; maybe ydary
has more info on this.
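
The corrected rule can be summarized in a short sketch, assuming the domain version is available as an integer. The function names are hypothetical; the only facts encoded are the ones stated in this comment: a 10-device limit before domain version 3 and no limit from version 3 onward.

```python
from typing import Optional

LEGACY_MAX_LUNS = 10  # limit stated for domains older than version 3


def max_luns_for_domain(domain_version: int) -> Optional[int]:
    """Return the LUN limit for a block storage domain, or None if unlimited."""
    if domain_version < 3:
        return LEGACY_MAX_LUNS
    return None


def lun_count_allowed(domain_version: int, lun_count: int) -> bool:
    """Check a planned LUN count against the limit for the domain version."""
    limit = max_luns_for_domain(domain_version)
    return limit is None or lun_count <= limit


print(lun_count_allowed(2, 60))
print(lun_count_allowed(3, 60))
```

Under this rule the 60-lun domain from the report is a valid test case on domain version 3 and later, and only older domains would need to be cut down to 10 luns.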
Comment 14 guy chen 2017-01-10 10:35:10 EST
I have tested the scenario on 4.1 build 4: created a domain with 60 luns and attached it to a single host.
No errors or timeouts occurred; the bug did not reproduce.
