Bug 878890 - ovirt-engine-backend [scalability]: Mass migration of 1500 VM's caused engine to deadlock. [NEEDINFO]
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: x86_64 Linux
Priority: high Severity: urgent
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Roy Golan
QA Contact: Yuri Obshansky
Whiteboard: virt
Keywords: Reopened
Depends On:
Blocks: 915537 rhev3.4beta 1142926
 
Reported: 2012-11-21 08:30 EST by Omri Hochman
Modified: 2015-09-22 09 EDT

See Also:
Fixed In Version: sf1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-01-06 03:46:27 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
acathrow: needinfo? (edolinin)


Attachments
deadlock_console.log (245.18 KB, application/octet-stream)
2012-11-21 08:30 EST, Omri Hochman
no flags Details
engine.log (265.29 KB, application/octet-stream)
2012-11-21 08:30 EST, Omri Hochman
no flags Details
vdsm_log_source_side (917.41 KB, application/octet-stream)
2012-11-25 05:16 EST, Omri Hochman
no flags Details
vdsm_destination_side (1.13 MB, application/octet-stream)
2012-11-25 05:23 EST, Omri Hochman
no flags Details
console_log (13.89 KB, application/octet-stream)
2012-12-23 02:40 EST, Omri Hochman
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 10002 None None None Never

Description Omri Hochman 2012-11-21 08:30:21 EST
Created attachment 649221 [details]
deadlock_console.log

ovirt-engine-backend [scalability]: Mass migration of 1500 VM's caused engine to deadlock.

Environment:
*************
rhevm build si24.4
vdsm-4.9.6-43.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64

Scenario:
*********
1) Start ~1500 VMs on 25 hosts.
2) Attempt to migrate all running VMs in batches of 100, paging through the search results page by page.
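The batching in step 2 can be sketched as below. This is only an illustration of splitting a VM list into pages of 100; the class and method names are hypothetical and are not part of the engine or of the test script used for this report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list of VM identifiers into fixed-size pages.
final class MigrationBatches {

    /** Returns consecutive pages of at most pageSize items each. */
    static <T> List<List<T>> pages(List<T> items, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += pageSize) {
            int end = Math.min(i + pageSize, items.size());
            // Copy the view so each page is independent of the source list.
            out.add(new ArrayList<>(items.subList(i, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> vms = new ArrayList<>();
        for (int i = 0; i < 1500; i++) {
            vms.add(i);
        }
        // 1500 VMs in pages of 100 gives 15 migration batches.
        System.out.println(pages(vms, 100).size());
    }
}
```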

Results: 
*********
- 21 VM's stuck in status 'Unknown'. 
- 2 VM's stuck in status 'migrating to'.  

console.log :
**************

Found one Java-level deadlock 
=============================
"pool-4-thread-28":
  waiting to lock monitor 0x00007f95d8444c90 (object 0x00000000c450f778, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-37"
"QuartzScheduler_Worker-37":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"
"QuartzScheduler_Worker-36":
  waiting to lock monitor 0x00007f944400b840 (object 0x00000000c4111e10, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-31"
"QuartzScheduler_Worker-31":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"
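
A report like the one above comes from a JVM thread dump. As a rough sketch, the same monitor-deadlock check can also be done from inside a Java process with the standard java.lang.management API. The class name DeadlockProbe is hypothetical and this code is not part of the engine.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical probe: lists threads that are deadlocked on object monitors.
final class DeadlockProbe {

    /** Returns one description per deadlocked thread; empty if there is no deadlock. */
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked on object monitors.
        long[] ids = mx.findMonitorDeadlockedThreads();
        List<String> out = new ArrayList<>();
        if (ids == null) {
            return out;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread terminated in the meantime
            }
            out.add("\"" + info.getThreadName() + "\" waiting to lock "
                    + info.getLockName() + ", which is held by \""
                    + info.getLockOwnerName() + "\"");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> found = describeDeadlocks();
        if (found.isEmpty()) {
            System.out.println("No monitor deadlock found");
        } else {
            found.forEach(System.out::println);
        }
    }
}
```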


Engine.log
***********
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Error code migrateErr and error message VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
2012-11-21 11:41:03,057 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         12
mMessage                      Fatal error during migration


2012-11-21 11:41:03,057 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) HostName = puma27
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-25) Command MigrateStatusVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
Comment 1 Omri Hochman 2012-11-21 08:30:52 EST
Created attachment 649223 [details]
engine.log
Comment 3 Omri Hochman 2012-11-25 05:15:12 EST
Adding vdsm.log. 
The "Fatal error during migration" in engine.log was an unrelated storage issue; it did not cause the qemu processes to fail, but they remained on the source hosts.
I suspect that the deadlock found on the engine side (visible in console.log) is related to the QEMU processes that died during the migration. VMs were stuck on the rhevm side with status 'Migrating To' or 'Unknown' while the qemu processes were no longer running on the hosts.
Comment 4 Omri Hochman 2012-11-25 05:16:35 EST
Created attachment 651457 [details]
vdsm_log_source_side
Comment 5 Omri Hochman 2012-11-25 05:23:38 EST
Created attachment 651458 [details]
vdsm_destination_side

problematic VM died during migration: 
vmId 'd28b6c17-ce33-4036-bfce-011bb95d9d3d'
Comment 6 Michal Skrivanek 2012-11-27 11:34:53 EST
Might be related to the issues mentioned in bug 781975 (cloned as bug 861918).
Comment 8 Roy Golan 2012-12-12 10:18:03 EST
Three VdsManager instances are deadlocked on each other here. The cause is a VM that failed to run: the manager tries to rerun it on a different VdsManager, which in turn does the same thing.

Take a look at RunVmCommandBase.runningSucceded().

We have a situation here where VdsManager instances call each other and block while waiting.

I'll try to run the blocking code in another thread, so that the VdsManager releases its lock instead of blocking on this operation.
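
The idea can be sketched roughly as follows. This is not the actual patch: HostManager, onVmRunFailed and startVm are hypothetical stand-ins for VdsManager and the rerun path, and the sketch only assumes that each manager guards its work with its own monitor. Handing the rerun to an executor means the failing manager never waits for another manager's lock while holding its own.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of offloading a cross-manager call to another thread.
final class RerunOffload {

    /** Stand-in for a per-host manager guarded by its own monitor. */
    static final class HostManager {
        private final Object lock = new Object();
        final AtomicInteger started = new AtomicInteger();

        /** Called under this manager's lock when a VM failed to run here. */
        void onVmRunFailed(HostManager fallback, ExecutorService executor) {
            synchronized (lock) {
                // Calling fallback.startVm() directly would acquire the other
                // manager's lock while holding ours; two managers doing this
                // to each other can deadlock. Queue the rerun instead.
                executor.execute(fallback::startVm);
            }
        }

        void startVm() {
            synchronized (lock) {
                started.incrementAndGet();
            }
        }
    }

    /** Two managers fail over to each other; returns the number of completed reruns. */
    static int demo() {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        HostManager a = new HostManager();
        HostManager b = new HostManager();
        Thread t1 = new Thread(() -> a.onVmRunFailed(b, executor));
        Thread t2 = new Thread(() -> b.onVmRunFailed(a, executor));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.started.get() + b.started.get();
    }

    public static void main(String[] args) {
        System.out.println("reruns completed: " + demo());
    }
}
```

The trade-off in this sketch is that the rerun becomes asynchronous, so the caller no longer learns the outcome of the rerun before releasing its lock.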
Comment 9 Roy Golan 2012-12-12 10:25:12 EST
http://gerrit.ovirt.org/#/c/10002/
Comment 11 Omri Hochman 2012-12-23 02:40:49 EST
Created attachment 667941 [details]
console_log

deadlock after applying http://gerrit.ovirt.org/#/c/10002/
Comment 12 Omri Hochman 2012-12-26 03:32:55 EST
Re-test mass migration with the new fixes:
http://gerrit.ovirt.org/#/c/7204/
https://docspace.corp.redhat.com/docs/DOC-126555

Migrated 1250 VMs three times - no deadlock occurred!
It seems that the fix is valid.

The only thing that was a bit strange: during the migration test, VMs temporarily showed status 'Unknown' for a few seconds, then the migration succeeded and the VM status changed back to 'Up'.

Not sure whether the 'Unknown' behavior is related to the fix.
Comment 15 RHEL Product and Program Management 2013-01-06 03:46:27 EST
Quality Engineering Management has reviewed and declined this request.
You may appeal this decision by reopening this request.
Comment 25 Michal Skrivanek 2013-11-04 08:05:26 EST
Since this has already been in by rebase since 3.2, I'd rather keep it untested than risk a revert.
Comment 27 Shai Revivo 2013-11-06 04:46:03 EST
Acking it to allow a customer fix.
No verification on the QE side.
Comment 29 Shai Revivo 2013-12-30 04:10:26 EST
Currently we do not have the resources (lab) to test it.
We will have to push it forward to 3.4.
Comment 30 Shai Revivo 2014-01-15 09:43:26 EST
QE cannot verify it in 3.3; will verify in 3.4.
Comment 32 Yuri Obshansky 2014-05-18 04:06:44 EDT
Bug is not reproducible on version:
RHEVM 3.4.0-0.16.rc.el6ev,
OS Version: RHEL - 6Server - 6.5.0.1.el6,
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64,
KVM Version: 0.12.1.2 - 2.415.el6_5.6,
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5,
VDSM Version: vdsm-4.14.7-0.2.rc.el6ev.

Note: We tried to reproduce it on an environment with 200 fake hosts instead of 15 real ones, since we didn't have enough resources.
Environment: 2 Data Centers (Real and Fake),
2 NFS Storage Domains (Real and Fake),
4 real hosts (24 CPU, 64 G),
200 real VMs, 200 fake hosts, 1500 fake VMs.

Other exceptions occurred; see the newly created Bug 1098763 - Mass migration of 1500 VM's caused VDSErrorException: Failed to GetAllVmStatsVDS.
Comment 33 Itamar Heim 2014-06-12 10:10:15 EDT
Closing as part of 3.4.0
