Created attachment 602704 [details] engine.log

ovirt-engine-backend [Scalability]: When activating several hosts at once, the engine connects the hosts to the storage one by one, a process that takes a long time while the hosts stay in status 'Unassigned'.

Description:
*************
I'm working on an iSCSI Data-Center environment with 5 hosts and 30 data storage domains. One of the hosts is 'UP' and is the SPM; the other 4 hosts are in 'Maintenance'. When I activate the 4 'Maintenance' hosts all at once, the engine connects the hosts to the storage one by one. 'connectStoragePool' takes around 3 minutes per host on the VDSM side, and because the hosts are connected serially, the whole transaction takes a long time.

Note:
******
- The transaction timeout on the RHEVM side is 10 minutes.
- The issue is very noticeable on a large environment with many storage domains: the hosts hang in 'Unassigned' for 10 minutes.
- The 'Unassigned' time can be reduced if the engine connects the hosts to the storage in parallel.

Engine.log (same transaction):
*******************************
2012-08-07 11:21:45,487 INFO [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-100) [71bb3bc] Running command: InitVdsOnUpCommand internal: true.
2012-08-07 11:21:45,669 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-100) [21a74ca0] START, ValidateStorageServerConnectionVDSCommand(vdsId = d9f6b526-dee1-11e1-9199-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 539225
2012-08-07 11:21:45,687 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-100) [21a74ca0] FINISH, ValidateStorageServerConnectionVDSCommand, return: {7bfb451c-be5b-4bad-9828-a40a33fed34c=0}, log id: 539225
2012-08-07 11:21:45,688 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-100) [21a74ca0] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected : ID: 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5 Type: StoragePool
2012-08-07 11:24:46,836 INFO [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (QuartzScheduler_Worker-72) [758489a] Running command: InitVdsOnUpCommand internal: true.
2012-08-07 11:24:46,845 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-100) [56ef8645] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected : ID: d9f6b526-dee1-11e1-9199-002564b27d78 Type: VDS
2012-08-07 11:24:46,856 INFO [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-100) [baadde8] Running command: HandleVdsVersionCommand internal: true. Entities affected : ID: d9f6b526-dee1-11e1-9199-002564b27d78 Type: VDS
2012-08-07 11:24:46,859 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-100) [baadde8] Host d9f6b526-dee1-11e1-9199-002564b27d78 : puma07 is already in NonOperational status. SetNonOperationalVds command is skipped.
2012-08-07 11:24:47,030 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-72) [3104d5d6] START, ValidateStorageServerConnectionVDSCommand(vdsId = db3d6bb4-dee1-11e1-9359-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 6c38dc3c
2012-08-07 11:24:47,053 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ValidateStorageServerConnectionVDSCommand] (QuartzScheduler_Worker-72) [3104d5d6] FINISH, ValidateStorageServerConnectionVDSCommand, return: {7bfb451c-be5b-4bad-9828-a40a33fed34c=0}, log id: 6c38dc3c
2012-08-07 11:24:47,053 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-72) [3104d5d6] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected : ID: 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5 Type: StoragePool
2012-08-07 11:24:47,055 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-72) [3104d5d6] START, ConnectStorageServerVDSCommand(vdsId = db3d6bb4-dee1-11e1-9359-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 146677f1
2012-08-07 11:27:47,479 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (QuartzScheduler_Worker-59) [17347243] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected : ID: 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5 Type: StoragePool
2012-08-07 11:27:47,481 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-59) [17347243] START, ConnectStorageServerVDSCommand(vdsId = da944b7e-dee1-11e1-9568-002564b27d78, storagePoolId = 7f8d8b1f-6881-4cbe-b4f1-ffd6f139b2b5, storageType = ISCSI, connectionList = [{ id: 7bfb451c-be5b-4bad-9828-a40a33fed34c, connection: 10.35.160.107 };]), log id: 3c1e1163
2012-08-07 11:27:48,505 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (QuartzScheduler_Worker-59) [17347243] FINISH, ConnectStorageServerVDSCommand, return: {7bfb451c-be5b-4bad-9828-a40a33fed34c=0}, log id: 3c1e1163
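Back-of-the-envelope arithmetic from the numbers above (no measurements beyond those already in this report): with connectStoragePool taking ~3 minutes per host on the VDSM side, connecting the 4 activated hosts serially costs roughly 4 x 3 min = 12 min of wall-clock time, past the 10-minute transaction timeout, whereas connecting them in parallel would cost roughly the time of the slowest host, ~3 minutes. The serial pattern is visible in the log timestamps: each host's ConnectStorageServerVDSCommand starts ~3 minutes after the previous one (11:21:45 -> 11:24:47 -> 11:27:47).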
Created attachment 602871 [details] engine.log
So there are 2 separate issues:

1 - The above scale test - did it happen due to slow storage, or some other problem introduced in 3.1?
2 - Making calls to VDSM in parallel - bug 883096 is about making calls to VDSM in parallel, but in that bug connectStoragePool is not mentioned as a command performed under a lock (comment 4).

Omri - what storage did you use?
Kublin - is connectStoragePool done under a lock (= serialized)?
(In reply to comment #10)
> So there are 2 separate issues:
>
> 1 - The above scale test - did it happen due to slow storage, or some other
> problem introduced in 3.1?
> 2 - Making calls to VDSM in parallel - bug 883096 is about making calls to
> VDSM in parallel, but in that bug connectStoragePool is not mentioned as a
> command performed under a lock (comment 4).
>
> Omri - what storage did you use?
> Kublin - is connectStoragePool done under a lock (= serialized)?

The storage was EMC. The problem wasn't the storage but VDSM, which called 'vgs' twice for each storage domain; I had 30 SDs, which is why connectStoragePool took a long time. The bug was opened on the engine backend - the assumption was that the engine can send the connectStoragePool calls in parallel and improve the process regardless of VDSM performance.
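For rough intuition on why connectStoragePool took ~3 minutes (assuming, hypothetically, that one 'vgs' scan takes on the order of 3 seconds; the actual per-scan time is not stated in this report): 30 storage domains x 2 'vgs' calls = 60 scans, and 60 x ~3 s ≈ 180 s ≈ 3 minutes per host, matching the observed per-host connect time.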
This bug is solved. InitVdsOnUp now runs in parallel; the calls are serialized only if problems occur that require a reconstruct. A patch is merged in 3.2; the fix was done as part of the work on bringing the last host to status 'UP'.
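To illustrate the shape of the parallel flow (a minimal Java sketch only; the Host type, initHostOnUp method, and pool size are hypothetical stand-ins, not the actual InitVdsOnUpCommand code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelHostInit {

    // Submit one init task per activated host instead of looping serially.
    public static void initHostsInParallel(List<Host> hosts) throws InterruptedException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(hosts.size(), 8)));
        for (Host host : hosts) {
            pool.submit(() -> initHostOnUp(host));
        }
        pool.shutdown();
        // Wall-clock time is now bounded by the slowest host (~3 minutes in
        // this report) rather than by the sum over all hosts.
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    // Hypothetical placeholder for the per-host flow: validate the storage
    // connection, connectStorageServer, then connectStoragePool.
    private static void initHostOnUp(Host host) {
        host.connectToStoragePool();
    }

    // Minimal hypothetical host type for the sketch.
    static class Host {
        void connectToStoragePool() { /* VDSM connectStoragePool call */ }
    }
}

With this structure, the serialized fallback described above (for problems requiring a reconstruct) can be kept by routing the affected hosts through a single-threaded executor instead.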
No issues found.

Verified on a RHEVM 3.2 - SF17.1 environment:
RHEVM: rhevm-3.2.0-11.28.el6ev.noarch
VDSM: vdsm-4.10.2-21.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

When activating 50 hosts, all actions completed within 6 minutes.
Created attachment 753375 [details] ## Logs rhevm