Created attachment 649221 [details] deadlock_console.log

ovirt-engine-backend [scalability]: Mass migration of 1500 VM's caused engine to deadlock.

Environment:
*************
rhevm build si24.4
vdsm-4.9.6-43.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64

Scenario:
*********
1) start ~1500 VM's on 25 Hosts.
2) Attempt to migrate all running VM's (100 by 100, use paging according to page results)

Results:
*********
- 21 VM's stuck in status 'Unknown'.
- 2 VM's stuck in status 'migrating to'.

console.log :
**************
Found one Java-level deadlock
=============================
"pool-4-thread-28":
  waiting to lock monitor 0x00007f95d8444c90 (object 0x00000000c450f778, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-37"
"QuartzScheduler_Worker-37":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"
"QuartzScheduler_Worker-36":
  waiting to lock monitor 0x00007f944400b840 (object 0x00000000c4111e10, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-31"
"QuartzScheduler_Worker-31":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"

Engine.log
***********
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Error code migrateErr and error message VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
2012-11-21 11:41:03,057 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand return value
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
 mStatus Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
 mCode 12
 mMessage Fatal error during migration
2012-11-21 11:41:03,057 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) HostName = puma27
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-25) Command MigrateStatusVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
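The "Found one Java-level deadlock" report in console.log comes from a JVM thread dump. For anyone trying to reproduce this, roughly the same check can be made from inside a running process with the standard java.lang.management API. The sketch below is only an illustration and is not part of ovirt-engine: the class name DeadlockProbe, its methods, and the two demo threads are made up here, and the demo simply takes two monitors in opposite order to produce a cycle like the one between QuartzScheduler_Worker-36 and QuartzScheduler_Worker-31.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper, not ovirt-engine code.
class DeadlockProbe {

    /** One line per thread stuck in a monitor deadlock; empty if none. */
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when no cycle exists
        List<String> lines = new ArrayList<>();
        if (ids == null) {
            return lines;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            lines.add("\"" + info.getThreadName() + "\" waiting to lock "
                    + info.getLockName() + ", held by \""
                    + info.getLockOwnerName() + "\"");
        }
        return lines;
    }

    /** Polls until a deadlock is reported or the timeout elapses. */
    static List<String> waitForDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> found = describeDeadlocks();
        while (found.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            found = describeDeadlocks();
        }
        return found;
    }

    /** Demo only: two daemon threads take two monitors in opposite order. */
    static void startOpposingLockers() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        startLocker("demo-worker-1", a, b, bothHoldFirst);
        startLocker("demo-worker-2", b, a, bothHoldFirst);
    }

    private static void startLocker(String name, Object first, Object second,
                                    CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds 'second'
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the deadlock
        t.start();
    }
}
```

Calling DeadlockProbe.startOpposingLockers() and then DeadlockProbe.waitForDeadlock(5000) should report both demo threads, each waiting on the monitor the other one holds.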
Created attachment 649223 [details] engine.log
Adding vdsm.log. The "Fatal error during migration" in engine.log was an unrelated storage issue; it did not cause the qemu processes to fail, and they remained on the source hosts. I suspect that the deadlock found on the engine side, which can be seen in console.log, is related to the QEMU processes that died during the migration. VMs were stuck on the rhevm side with status 'Migrate to' or 'Unknown' while their qemu processes were no longer running on the hosts.
Created attachment 651457 [details] vdsm_log_source_side
Created attachment 651458 [details] vdsm_destination_side

Problematic VM died during migration: vmId 'd28b6c17-ce33-4036-bfce-011bb95d9d3d'
might be related to issues mentioned in 781975 (cloned as 861918)
Three VdsManagers are deadlocked on each other here. It is caused by a VM that failed to run, which the manager then tries to run on a different VdsManager, which in turn does the same thing. Take a look at RunVmCommandBase.runningSucceded(). We have a situation here where VdsManagers call each other and block waiting. I'll try to run the blocking code in another thread, so that the VdsManager releases its lock instead of blocking on this operation.
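To make the idea concrete, here is a minimal sketch of "run the blocking code in another thread". It is not the content of the gerrit change and does not use the real VdsManager or RunVmCommandBase code: HostManager, RerunDemo, runVm and the always-failing VM start are hypothetical stand-ins. The point is only that each manager updates its own state under its own lock and hands the rerun on the peer to an executor, so it never waits for the peer's monitor while still holding its own.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a per-host manager; not the real VdsManager.
class HostManager {
    private final Object lock = new Object();
    private final ExecutorService rerunPool;
    private HostManager peer;
    private int attempts; // guarded by lock

    HostManager(ExecutorService rerunPool) {
        this.rerunPool = rerunPool;
    }

    void setPeer(HostManager peer) {
        this.peer = peer;
    }

    /** Simulates a VM start that always fails here and is retried on the peer. */
    void runVm(int retriesLeft, CountDownLatch finished) {
        synchronized (lock) {
            attempts++;
            if (retriesLeft == 0) {
                finished.countDown();
                return;
            }
            // Calling peer.runVm(...) directly at this point would wait for the
            // peer's monitor while holding ours, and the peer may be doing the
            // same towards us. Queue the rerun instead and release our lock.
            rerunPool.execute(() -> peer.runVm(retriesLeft - 1, finished));
        }
    }

    int attempts() {
        synchronized (lock) {
            return attempts;
        }
    }
}

class RerunDemo {
    /** Returns the total number of run attempts, or -1 if the chains did not finish. */
    static int run(int chains, int retries) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            HostManager a = new HostManager(pool);
            HostManager b = new HostManager(pool);
            a.setPeer(b);
            b.setPeer(a);
            CountDownLatch finished = new CountDownLatch(2 * chains);
            for (int i = 0; i < chains; i++) {
                pool.execute(() -> a.runVm(retries, finished));
                pool.execute(() -> b.runVm(retries, finished));
            }
            if (!finished.await(10, TimeUnit.SECONDS)) {
                return -1;
            }
            return a.attempts() + b.attempts();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With the direct call in place of rerunPool.execute(...), two managers failing at the same time can each hold their own lock while waiting for the other's, which is the shape of the cycle in console.log. The trade-off of the hand-off is that the rerun becomes asynchronous, so the caller no longer learns the outcome of the rerun before it returns.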
http://gerrit.ovirt.org/#/c/10002/
Created attachment 667941 [details] console_log deadlock after applying http://gerrit.ovirt.org/#/c/10002/
Re-tested mass migration with the new fixes:
http://gerrit.ovirt.org/#/c/7204/
https://docspace.corp.redhat.com/docs/DOC-126555

Performed the migration of 1250 VMs three times - no deadlock occurred! It seems the fix is valid. The only thing that was a bit strange was that, during the migration test, VMs temporarily showed status 'Unknown' for a few seconds; then the migration succeeded and the VM status changed back to 'Up'. Not sure whether the 'Unknown' behavior is related to the fix.
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=e645e9657a1cd212c12add94efbd7af5e97b4d6d
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Since this has already been in by rebase since 3.2, I'd rather keep it untested than risk a revert.
Acking it to allow the customer fix. No verification on QE side.
Currently we do not have the resources (lab) to test it. We will have to push it forward to 3.4.
QE Cannot verify it in 3.3, will verify in 3.4
Bug is not reproducible on version:
RHEVM 3.4.0-0.16.rc.el6ev
OS Version: RHEL - 6Server - 6.5.0.1.el6
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.7-0.2.rc.el6ev

Note: We tried to reproduce it on an environment with 200 fake hosts instead of 15 real ones, since we didn't have enough resources.

Environment: 2 Data Centers (Real and Fake), 2 NFS Storage Domains (Real and Fake), 4 real hosts (24 CPU, 64 G), 200 real VMs, 200 fake hosts, 1500 fake VMs.

Other exceptions occurred - see the newly created Bug 1098763 - Mass migration of 1500 VM's caused VDSErrorException: Failed to GetAllVmStatsVDS.
Closing as part of 3.4.0
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days