Bug 1304834

Summary: [scale] Attaching large iSCSI data domain with 60 LUNs causes VDSM to set Host/DC to unresponsive
Product: [oVirt] ovirt-engine Reporter: mlehrer
Component: BLL.Storage    Assignee: Adam Litke <alitke>
Status: CLOSED WORKSFORME QA Contact: guy chen <guchen>
Severity: high Docs Contact:
Priority: medium    
Version: 3.6.2.5    CC: amureini, bugs, gklein, mlehrer, nsoffer, tnisan
Target Milestone: ovirt-4.1.0-beta    Keywords: Performance
Target Release: ---    Flags: amureini: ovirt-4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2017-01-18 09:46:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
vdsm, ovirt, and lastcheck logs, 2nd occurrence of issue (no flags)
vdsm, ovirt, and lastcheck logs, 1st occurrence of issue (no flags)

Description mlehrer 2016-02-04 18:07:35 UTC
Created attachment 1121175 [details]
vdsm, ovirt, and lastcheck logs 2nd occurrence of issue

Description of problem:

When attaching a single iSCSI domain of 60 LUNs, VDSM reports communication timeouts and then reports the active storage domain as problematic. This triggers a series of events, including the migration of 13 running VMs to another host; after around 10 minutes or so, the original host on which the domain attachment was attempted returns to responsive status in the oVirt web admin.



The environment contains:

2 Hosts

50 total SDs, of which:
   20 are iSCSI
   30 are NFS

The SD used for attachment has 70 LUNs, out of a total of 125 possible LUNs that could be attached.

Version-Release number of selected component (if applicable):

vdsm-hook-vmfex-dev-4.17.17-0.el7ev.noarch
vdsm-python-4.17.17-0.el7ev.noarch
vdsm-yajsonrpc-4.17.17-0.el7ev.noarch
vdsm-4.17.17-0.el7ev.noarch
vdsm-xmlrpc-4.17.17-0.el7ev.noarch
vdsm-jsonrpc-4.17.17-0.el7ev.noarch
vdsm-cli-4.17.17-0.el7ev.noarch
vdsm-infra-4.17.17-0.el7ev.noarch


How reproducible:
Each time, but only on SD attachment.

Steps to Reproduce:
1. Create an iSCSI SD of 60 LUNs, with 125 possible targets
2. Attach the SD
3. Wait 10 minutes or so for the DC to go unresponsive
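As a sketch, the LUN setup behind step 1 could be prepared with standard open-iscsi tooling; the portal address and the RUN_ISCSI guard below are placeholders, not taken from this bug:

```shell
#!/bin/sh
# Sketch of exposing the LUNs for step 1. PORTAL is a hypothetical
# address; the iscsiadm commands only run when RUN_ISCSI is set, so
# the script is safe to dry-run on a machine without iSCSI storage.
PORTAL=${PORTAL:-10.35.0.1:3260}  # placeholder portal address
LUN_COUNT=60                      # LUNs in the single domain (from the bug)

if [ -n "$RUN_ISCSI" ]; then
    # Discover every target on the portal and log in to all of them,
    # exposing the LUNs that will back the storage domain.
    iscsiadm -m discovery -t sendtargets -p "$PORTAL"
    iscsiadm -m node -p "$PORTAL" --login
fi

echo "domain will span $LUN_COUNT LUNs"
```

The domain itself would then be created and attached from the oVirt web admin as usual.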

Actual results:

VDSM sets the host/DC unresponsive, VMs are migrated, and the SPM is reassigned.
Presumably the DC unresponsiveness is caused by the host's inability to communicate with the other (data) storage domains during attachment.

Expected results:

DC to stay responsive 

Additional info:

Comment 1 mlehrer 2016-02-04 18:09:03 UTC
Created attachment 1121176 [details]
vdsm, ovirt, and lastcheck logs 1st occurrence of issue

Comment 4 Tal Nisan 2016-02-07 14:51:19 UTC
Setting target tentatively to 4.0 until we have the results of the discussion on whether this is a feasible test case.

Comment 5 Sandro Bonazzola 2016-05-02 09:58:06 UTC
Moving from 4.0 alpha to 4.0 beta, since 4.0 alpha has already been released and the bug is not ON_QA.

Comment 6 Yaniv Lavi 2016-05-23 13:19:03 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 7 Yaniv Lavi 2016-05-23 13:26:35 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 8 Adam Litke 2016-06-13 20:47:47 UTC
Are we still pursuing this bug as a reasonable configuration? Nir has recently done a lot of work on block storage domain scalability. I propose retesting with the latest VDSM.

Comment 9 Nir Soffer 2016-06-15 00:33:16 UTC
A block storage domain is limited to 10 LUNs; creating a VG with more LUNs should fail. See bug 648051.

How are you creating the storage domains?
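For reference, the LUN (PV) count behind each domain VG can be checked with standard LVM reporting; a sketch (the VG name in the sample line is hypothetical):

```shell
#!/bin/sh
# pv_count is a standard LVM report field; this prints the number of
# physical volumes (LUNs) behind each VG, when LVM is available.
command -v vgs >/dev/null && vgs --noheadings -o vg_name,pv_count

# Flag a domain VG that exceeds the 10-LUN limit, parsing a sample
# report line (the VG name here is hypothetical):
line="5849b030-626e-47cb-ad90-3ce782d831b3 60"
pv_count=$(echo "$line" | awk '{print $2}')
if [ "$pv_count" -gt 10 ]; then
    echo "VG exceeds the 10-LUN limit with $pv_count PVs"
fi
```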

Comment 10 Nir Soffer 2016-06-15 00:51:46 UTC
This looks like a duplicate of bug 1081962 and bug 1346012, so the attached patch should help.

Please fix your setup to use 10 LUNs per storage domain and try this patch.

Comment 11 Yaniv Kaul 2016-06-15 07:40:55 UTC
(In reply to Nir Soffer from comment #9)
> Block storage domain is limited to 10 luns. Creating a vg with more luns
> should fail. See bug 648051.
> 
> How are you creating the storage domains?

Comment 12 mlehrer 2016-09-27 12:29:30 UTC
(In reply to Nir Soffer from comment #10)
> This looks like a duplicate of bug 1081962 and 1346012, so the attach patch
> should
> help.
> 
> Please fix your setup to use 10 luns per storage domains and try this patch.

Since only 10 LUNs are supported, it seems that the configuration of a single iSCSI domain of 60 LUNs is not a realistic scenario.

I have verified on RHV 4.0.4.4 that a single iSCSI domain backed by 10 LUNs will detach/attach in around 1 minute and 10 seconds without any timeouts occurring.

Comment 13 Nir Soffer 2016-09-27 13:12:51 UTC
(In reply to mlehrer from comment #12)
> (In reply to Nir Soffer from comment #10)
> > This looks like a duplicate of bug 1081962 and 1346012, so the attach patch
> > should
> > help.
> > 
> > Please fix your setup to use 10 luns per storage domains and try this patch.
> 
> Since only 10 luns are supported it seems that the configuration of single
> ISCSI domain of 60 luns is not a realistic scenario.
> 
> I have verified on RHV 4.0.4.4 that a single iscsi domain backed by 10 luns
> will dettach/attach around 1 minute and 10 seconds without any timeouts
> occurring.

I must correct myself: the 10-device limit applies only to old domains, before version 3. There is no limit in domain version 3 and later.

So you can test a domain with 60 LUNs; I have seen domains with 121 LUNs in users' systems.

I'm not sure what practical limit we should support; maybe ydary has more info on this.

Comment 14 guy chen 2017-01-10 15:35:10 UTC
I have tested the scenario on 4.1 build 4: created a domain with 60 LUNs and attached it to a single host.
No errors or timeouts occurred; the bug did not reproduce.