878890 – ovirt-engine-backend [scalability]: Mass migration of 1500 VM's caused engine to deadlock.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.1.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	3.4.0
Assignee:	Roy Golan
QA Contact:	Yuri Obshansky
Docs Contact:
URL:
Whiteboard:	virt
Depends On:
Blocks:	915537 rhev3.4beta 1142926

Reported:	2012-11-21 13:30 UTC by Omri Hochman
Modified:	2023-09-14 01:39 UTC
CC List:	12 users
Fixed In Version:	sf1
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-01-06 08:46:27 UTC
oVirt Team:	---
Target Upstream Version:
Embargoed:

Attachments
deadlock_console.log (245.18 KB, application/octet-stream) 2012-11-21 13:30 UTC, Omri Hochman	no flags	Details
engine.log (265.29 KB, application/octet-stream) 2012-11-21 13:30 UTC, Omri Hochman	no flags	Details
vdsm_log_source_side (917.41 KB, application/octet-stream) 2012-11-25 10:16 UTC, Omri Hochman	no flags	Details
vdsm_destination_side (1.13 MB, application/octet-stream) 2012-11-25 10:23 UTC, Omri Hochman	no flags	Details
console_log (13.89 KB, application/octet-stream) 2012-12-23 07:40 UTC, Omri Hochman	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	10002	0	None	None	None	Never

Description Omri Hochman 2012-11-21 13:30:21 UTC

Created attachment 649221 [details]
deadlock_console.log

ovirt-engine-backend [scalability]: Mass migration of 1500 VM's caused engine to deadlock.

Environment:
*************
rhevm build si24.4
vdsm-4.9.6-43.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64

Scenario:
*********
1) Start ~1500 VMs on 25 hosts.
2) Attempt to migrate all running VMs (100 at a time, paging through the search results).
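
The report does not include the script used to drive step 2, so the following is only a sketch of what migrating "100 at a time" could look like. The class PagedMigrationSketch, its methods, and the migrate callback are hypothetical names for illustration; they are not part of ovirt-engine or its API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: splits a list of VM ids into fixed-size pages and
// hands each id to a caller-supplied migrate action, one page at a time.
final class PagedMigrationSketch {

    // Split items into consecutive pages of at most pageSize elements.
    static <T> List<List<T>> pages(List<T> items, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int from = 0; from < items.size(); from += pageSize) {
            int to = Math.min(from + pageSize, items.size());
            out.add(new ArrayList<>(items.subList(from, to)));
        }
        return out;
    }

    // Invoke the migrate action for every id, page by page.
    // Returns the number of pages that were processed.
    static <T> int migrateInPages(List<T> vmIds, int pageSize, Consumer<? super T> migrate) {
        int pageCount = 0;
        for (List<T> page : pages(vmIds, pageSize)) {
            page.forEach(migrate);
            pageCount++;
        }
        return pageCount;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 1500; i++) {
            ids.add(i);
        }
        // The migrate action is a no-op placeholder here.
        int pageCount = migrateInPages(ids, 100, id -> { });
        System.out.println("pages processed: " + pageCount);
    }
}
```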

Results: 
*********
- 21 VMs stuck in status 'Unknown'.
- 2 VMs stuck in status 'migrating to'.

console.log :
**************

Found one Java-level deadlock 
=============================
"pool-4-thread-28":
  waiting to lock monitor 0x00007f95d8444c90 (object 0x00000000c450f778, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-37"
"QuartzScheduler_Worker-37":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"
"QuartzScheduler_Worker-36":
  waiting to lock monitor 0x00007f944400b840 (object 0x00000000c4111e10, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-31"
"QuartzScheduler_Worker-31":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"
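
Reading the dump above: the cycle is between QuartzScheduler_Worker-36 and QuartzScheduler_Worker-31, each holding the monitor the other is waiting for; QuartzScheduler_Worker-37 and pool-4-thread-28 are queued behind them. A similar report can be obtained from inside a running JVM with the standard java.lang.management API. The sketch below is only an illustration; the class name DeadlockProbe is made up and is not part of ovirt-engine.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that asks the JVM for threads deadlocked on object monitors.
final class DeadlockProbe {

    // Returns one description line per deadlocked thread, or an empty list.
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no monitor deadlock is detected.
        long[] ids = mx.findMonitorDeadlockedThreads();
        List<String> lines = new ArrayList<>();
        if (ids == null) {
            return lines;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // the thread has terminated in the meantime
            }
            lines.add("\"" + info.getThreadName() + "\" waiting to lock "
                    + info.getLockName() + ", held by \""
                    + info.getLockOwnerName() + "\"");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> report = describeDeadlocks();
        if (report.isEmpty()) {
            System.out.println("No monitor deadlock detected");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

Running jstack against the engine process is another way to get a dump in this format.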


Engine.log
***********
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Error code migrateErr and error message VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
2012-11-21 11:41:03,057 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         12
mMessage                      Fatal error during migration


2012-11-21 11:41:03,057 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) HostName = puma27
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-25) Command MigrateStatusVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration

Comment 1 Omri Hochman 2012-11-21 13:30:52 UTC

Created attachment 649223 [details]
engine.log

Comment 3 Omri Hochman 2012-11-25 10:15:12 UTC

Adding vdsm.log. 
The "Fatal error during migration" in engine.log came from an unrelated storage issue; it did not cause the qemu processes to fail, and they remained on the source hosts.
I suspect that the deadlock found on the engine side, which can be seen in console.log, is related to the QEMU processes that died during the migration. VMs were stuck on the rhevm side with status 'Migrate to' or 'Unknown' while the qemu processes were no longer running on the hosts.

Comment 4 Omri Hochman 2012-11-25 10:16:35 UTC

Created attachment 651457 [details]
vdsm_log_source_side

Comment 5 Omri Hochman 2012-11-25 10:23:38 UTC

Created attachment 651458 [details]
vdsm_destination_side

problematic VM died during migration: 
vmId 'd28b6c17-ce33-4036-bfce-011bb95d9d3d'

Comment 6 Michal Skrivanek 2012-11-27 16:34:53 UTC

Might be related to the issues mentioned in bug 781975 (cloned as 861918).

Comment 8 Roy Golan 2012-12-12 15:18:03 UTC

Three VdsManager instances are deadlocked on each other here. It is caused by a VM that failed to run: the manager tries to run it on a different VdsManager, which in turn does the same thing.

take a look at RunVmCommandBase.runningSucceded()

We have a situation here where VdsManager instances call each other and block waiting.

I'll try to run the blocking code in another thread so that the VdsManager releases its lock instead of blocking on this operation.
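
The idea of handing the blocking call to another thread can be sketched as follows. This is not the code from the gerrit change; HostManager, onRunFailed and runVm are hypothetical stand-ins, and the sketch only assumes that each manager guards its state with its own monitor. The retry on the other manager is queued on an executor, so no thread ever holds one manager's monitor while waiting for another's.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class RerunHandoffSketch {

    // Hypothetical stand-in for a per-host manager guarded by its own monitor.
    static final class HostManager {
        private final Object lock = new Object();
        private final ExecutorService rerunPool;
        private final AtomicInteger started = new AtomicInteger();

        HostManager(ExecutorService rerunPool) {
            this.rerunPool = rerunPool;
        }

        // A VM failed to run on this host: retry it on 'other'.
        void onRunFailed(HostManager other, CountDownLatch done) {
            synchronized (lock) {
                // Queue the retry instead of calling other.runVm() inline,
                // which would nest the other manager's monitor inside ours.
                rerunPool.submit(() -> {
                    other.runVm();
                    done.countDown();
                });
            }
        }

        void runVm() {
            synchronized (lock) {
                started.incrementAndGet();
            }
        }

        int startedCount() {
            return started.get();
        }
    }

    // Two managers fail over to each other concurrently.
    // Returns true if both retries finish within the timeout.
    static boolean crossRerunCompletes() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            HostManager a = new HostManager(pool);
            HostManager b = new HostManager(pool);
            CountDownLatch done = new CountDownLatch(2);
            new Thread(() -> a.onRunFailed(b, done)).start();
            new Thread(() -> b.onRunFailed(a, done)).start();
            boolean finished = done.await(5, TimeUnit.SECONDS);
            return finished && a.startedCount() == 1 && b.startedCount() == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("both retries completed: " + crossRerunCompletes());
    }
}
```

The trade-off is that the caller no longer learns the outcome of the retry synchronously; any follow-up has to happen in the submitted task.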

Comment 9 Roy Golan 2012-12-12 15:25:12 UTC

http://gerrit.ovirt.org/#/c/10002/

Comment 11 Omri Hochman 2012-12-23 07:40:49 UTC

Created attachment 667941 [details]
console_log

deadlock after applying http://gerrit.ovirt.org/#/c/10002/

Comment 12 Omri Hochman 2012-12-26 08:32:55 UTC

Re-tested mass migration with the new fixes:
http://gerrit.ovirt.org/#/c/7204/
https://docspace.corp.redhat.com/docs/DOC-126555

Performed the migration of 1250 VMs three times; no deadlock occurred.
It seems the fix is valid.

The only thing that was a bit strange was that, during the migration test, VMs temporarily showed status 'Unknown' for a few seconds; then the migration succeeded and the VM status changed back to 'Up'.

Not sure whether the 'Unknown' behavior is related to the fix.

Comment 13 Omer Frenkel 2012-12-26 13:18:41 UTC

http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=e645e9657a1cd212c12add94efbd7af5e97b4d6d

Comment 15 RHEL Program Management 2013-01-06 08:46:27 UTC

Quality Engineering Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 25 Michal Skrivanek 2013-11-04 13:05:26 UTC

Since this has already been in via rebase since 3.2, I'd rather keep it untested than risk a revert.

Comment 27 Shai Revivo 2013-11-06 09:46:03 UTC

Acking it to allow the customer fix.
No verification on the QE side.

Comment 29 Shai Revivo 2013-12-30 09:10:26 UTC

Currently we do not have the resources (lab) to test it.
We will have to push it forward to 3.4.

Comment 30 Shai Revivo 2014-01-15 14:43:26 UTC

QE cannot verify it in 3.3; will verify in 3.4.

Comment 32 Yuri Obshansky 2014-05-18 08:06:44 UTC

Bug is not reproducible on version:
RHEVM 3.4.0-0.16.rc.el6ev,
OS Version: RHEL - 6Server - 6.5.0.1.el6,
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64,
KVM Version: 0.12.1.2 - 2.415.el6_5.6,
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5,
VDSM Version: vdsm-4.14.7-0.2.rc.el6ev.

Note: We tried to reproduce it in an environment with 200 fake hosts instead of 15 real ones, since we didn't have enough resources.
Environment: 2 Data Centers (Real and Fake),
2 NFS Storage Domains (Real and Fake),
4 real hosts (24 CPU, 64 G),
200 real VMs, 200 fake hosts, 1500 fake VMs.

Other exceptions occurred; see the newly created Bug 1098763 - Mass migration of 1500 VM's caused VDSErrorException: Failed to GetAllVmStatsVDS.

Comment 33 Itamar Heim 2014-06-12 14:10:15 UTC

Closing as part of 3.4.0

Comment 34 Red Hat Bugzilla 2023-09-14 01:39:00 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
