Bug 846264 - ovirt-engine-backend[Scalability]: When activating several hosts at once, the engine connects the hosts to the storage one by one, a process that takes a long time while the hosts remain in status 'Unassigned'.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.2.4
Assignee: Barak
QA Contact: yeylon@redhat.com
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2012-08-07 09:39 UTC by Omri Hochman
Modified: 2016-04-18 06:55 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-20 10:11:59 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
engine.log (306 bytes, application/octet-stream), 2012-08-07 09:39 UTC, Omri Hochman
engine.log (596.45 KB, application/x-tar), 2012-08-07 22:10 UTC, Omri Hochman
## Logs rhevm (5.81 MB, text/plain), 2013-05-26 20:18 UTC, vvyazmin@redhat.com

Description Omri Hochman 2012-08-07 09:39:45 UTC
Created attachment 602704 [details]
engine.log

ovirt-engine-backend[Scalability]: When activating several hosts at once, the engine connects the hosts to the storage one by one, a process that takes a long time while the hosts remain in status 'Unassigned'.

Description:
*************
I'm working on an iSCSI Data-Center environment with 5 hosts and 30 data storage domains. One of the hosts is 'Up' and is the SPM; the other 4 hosts are in 'Maintenance'. When I activate all 4 'Maintenance' hosts at once, the engine attempts to connect the hosts to the storage one by one. 'connectStoragePool' takes around 3 minutes per host on the VDSM side, so with the hosts connected serially the whole transaction takes a long time.

Note: 
******
- The transaction timeout on the RHEV-M side is 10 minutes.
- The issue is very noticeable on a large environment with many storage domains: the hosts 'hang' in 'Unassigned' for 10 minutes.
- The 'Unassigned' time could be reduced if the engine connected the hosts to the storage in parallel (see the sketch below).
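
A minimal sketch of the suggested behavior, assuming a hypothetical per-host connect call (the class and method names below are illustrative, not the actual ovirt-engine classes): each host's storage connection is submitted to a thread pool so the slow connectStoragePool calls overlap instead of running back to back.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; names are illustrative, not the real engine classes.
public class ParallelHostActivation {

    // Stands in for the engine's per-host ConnectStorageServer/ConnectStoragePool
    // round trip, which can take minutes when the pool has many storage domains.
    interface HostStorageConnector {
        boolean connectHostToPool(String hostId) throws Exception;
    }

    // Connect every host concurrently so no host waits behind another's slow connect.
    public static void activateAll(List<String> hostIds, HostStorageConnector connector)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, hostIds.size()));
        try {
            for (String hostId : hostIds) {
                pool.submit((Callable<Boolean>) () -> {
                    boolean ok = connector.connectHostToPool(hostId);
                    System.out.println("host " + hostId + (ok ? " connected" : " failed"));
                    return ok;
                });
            }
        } finally {
            pool.shutdown();
            // Bound the wait by the engine-side transaction timeout mentioned above.
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }
}

With this, the 4 'Maintenance' hosts would leave 'Unassigned' after roughly the time of the slowest single connect instead of after the sum of all of them.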

Engine.log (same transaction):
*******************************
2012-08-07 11:21:45,487 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-100) [71bb3bc] Running command: InitVdsOnUpCommand internal: true.
2012-08-07 11:21:45,669 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-100) [21a74ca0] START, ValidateStorageServerConnectionVDSCommand(vdsId = d9f6b526-dee1-11e1-9199-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 539225
2012-08-07 11:21:45,687 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-100) [21a74ca0] FINISH, ValidateStorageServerConnectionVDSCommand, return: {7bfb451c-be5b-4bad-9828-a40a33fed34c=0}, log id: 539225
2012-08-07 11:21:45,688 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-100) [21a74ca0] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5 Type: StoragePool


2012-08-07 11:24:46,836 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-72) [758489a] Running command: InitVdsOnUpCommand internal: true.
2012-08-07 11:24:46,845 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-100) [56ef8645] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: d9f6b526-dee1-11e1-9199-002564b27d78 Type: VDS
2012-08-07 11:24:46,856 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-100) [baadde8] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: d9f6b526-dee1-11e1-9199-002564b27d78 Type: VDS
2012-08-07 11:24:46,859 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-100) [baadde8] Host d9f6b526-dee1-11e1-9199-002564b27d78 : puma07 is already in NonOperational status. SetNonOperationalVds command is skipped.
2012-08-07 11:24:47,030 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-72) [3104d5d6] START, ValidateStorageServerConnectionVDSCommand(vdsId = db3d6bb4-dee1-11e1-9359-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 6c38dc3c
2012-08-07 11:24:47,053 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-72) [3104d5d6] FINISH, ValidateStorageServerConnectionVDSCommand, return: {7bfb451c-be5b-4bad-9828-a40a33fed34c=0}, log id: 6c38dc3c
2012-08-07 11:24:47,053 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-72) [3104d5d6] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5 Type: StoragePool
2012-08-07 11:24:47,055 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-72) [3104d5d6] START, ConnectStorageServerVDSCommand(vdsId = db3d6bb4-dee1-11e1-9359-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 146677f1


2012-08-07 11:27:47,479 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-59) [17347243] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5 Type: StoragePool
2012-08-07 11:27:47,481 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-59) [17347243] START, ConnectStorageServerVDSCommand(vdsId = da944b7e-dee1-11e1-9568-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 3c1e1163
2012-08-07 11:27:48,505 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-59) [17347243] FINISH, ConnectStorageServerVDSCommand, return: {7bfb451c-be5b-4bad-9828-a40a33fed34c=0}, log id: 3c1e1163

Comment 3 Omri Hochman 2012-08-07 22:10:29 UTC
Created attachment 602871 [details]
engine.log

Comment 10 Barak 2013-01-03 14:01:12 UTC
So there are 2 separate issues:

1 - The above scale test - did it happen due to slow storage, or some other
    problem introduced in 3.1?
2 - Making calls to VDSM in parallel - bug 883096 is about making calls to
    VDSM in parallel, but in that bug connectStoragePool is not mentioned as a
    command performed under lock (comment 4).

Omri   - what storage did you use?
Kublin - is connectStoragePool done under a lock (= serialized)?

Comment 11 Omri Hochman 2013-01-06 15:10:57 UTC
(In reply to comment #10)
> Omri   - what storage did you use?
> Kublin - is connectStoragePool done under a lock (= serialized)?

The storage was EMC. The problem wasn't the storage but VDSM, which called 'vgs' twice for each storage domain; I had 30 storage domains, which is why connectStoragePool took a long time.

The bug was opened against the engine backend - the assumption was that the engine can send the connectStoragePool calls in parallel and improve the process regardless of VDSM performance.
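
A rough back-of-envelope estimate from the numbers above: 2 'vgs' invocations per storage domain times 30 domains is about 60 LVM calls per connectStoragePool, which lines up with the ~3 minutes per host visible in the engine.log excerpts; activating 4 hosts serially therefore keeps the last host in 'Unassigned' for on the order of 12 minutes, while connecting them in parallel would keep the wait near the ~3 minutes of a single host.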

Comment 12 mkublin 2013-03-18 08:21:26 UTC
This bug is solved. InitVdsOnUp now runs in parallel; the calls will be serialized only if problems occur that require a reconstruct.
A patch is merged in 3.2; the fix was done as part of the work on the last host in status 'Up'.
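
A minimal sketch of the behavior described above, assuming a simple read/write lock (illustrative only, not the actual engine code): concurrent InitVdsOnUp flows take a shared lock and run in parallel, while a reconstruct takes the exclusive lock, so the calls become serialized only when a reconstruct is required.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only; not the actual ovirt-engine implementation.
public class InitVdsOnUpConcurrency {

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public void initVdsOnUp(String hostId) {
        // Shared (read) lock: any number of hosts may initialize concurrently.
        lock.readLock().lock();
        try {
            System.out.println("connecting host " + hostId + " to the storage pool");
            // ... connectStoragePool / connectStorageServer calls would go here ...
        } finally {
            lock.readLock().unlock();
        }
    }

    public void reconstructMaster() {
        // Exclusive (write) lock: a reconstruct waits for in-flight initializations
        // and blocks new ones, i.e. the calls are serialized while it runs.
        lock.writeLock().lock();
        try {
            System.out.println("reconstructing the master storage domain");
        } finally {
            lock.writeLock().unlock();
        }
    }
}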

Comment 13 vvyazmin@redhat.com 2013-05-26 20:16:51 UTC
No issues were found.

Verified on RHEVM 3.2 - SF17.1 environment:

RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64


When activating 50 hosts, all actions completed within 6 minutes.

Comment 14 vvyazmin@redhat.com 2013-05-26 20:18:17 UTC
Created attachment 753375 [details]
## Logs rhevm

