Bug 878890

Summary: ovirt-engine-backend [scalability]: Mass migration of 1500 VM's caused engine to deadlock.
Product: Red Hat Enterprise Virtualization Manager Reporter: Omri Hochman <ohochman>
Component: ovirt-engine Assignee: Roy Golan <rgolan>
Status: CLOSED CURRENTRELEASE QA Contact: Yuri Obshansky <yobshans>
Severity: urgent Docs Contact:
Priority: high    
Version: 3.1.0 CC: acathrow, bazulay, cpelland, eedri, iheim, jkt, michal.skrivanek, ofrenkel, pstehlik, rgolan, srevivo, yeylon
Target Milestone: --- Keywords: Reopened
Target Release: 3.4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: virt
Fixed In Version: sf1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-01-06 08:46:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 915537, 1078909, 1142926    
Attachments:
Description               Flags
deadlock_console.log      none
engine.log                none
vdsm_log_source_side      none
vdsm_destination_side     none
console_log               none

Description Omri Hochman 2012-11-21 13:30:21 UTC
Created attachment 649221 [details]
deadlock_console.log

ovirt-engine-backend [scalability]: Mass migration of 1500 VM's caused engine to deadlock.

Environment:
*************
rhevm build si24.4
vdsm-4.9.6-43.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64

Scenario:
*********
1) Start ~1500 VMs on 25 hosts.
2) Attempt to migrate all running VMs (100 at a time, paging through the page results).

Results: 
*********
- 21 VMs stuck in status 'Unknown'. 
- 2 VMs stuck in status 'migrating to'.  

console.log :
**************

Found one Java-level deadlock 
=============================
"pool-4-thread-28":
  waiting to lock monitor 0x00007f95d8444c90 (object 0x00000000c450f778, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-37"
"QuartzScheduler_Worker-37":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"
"QuartzScheduler_Worker-36":
  waiting to lock monitor 0x00007f944400b840 (object 0x00000000c4111e10, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-31"
"QuartzScheduler_Worker-31":
  waiting to lock monitor 0x00007f9484027108 (object 0x00000000c4418930, a java.lang.Object),
  which is held by "QuartzScheduler_Worker-36"
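
The cycle in the trace is a classic lock-ordering deadlock: Worker-36 waits on a monitor held by Worker-31, while Worker-31 waits on a monitor held by Worker-36. For illustration only, a minimal standalone sketch (not oVirt code) in which two threads take the same two monitors in opposite order and produce this kind of jstack report:

// Minimal standalone reproducer (not oVirt code): two threads take the same
// two monitors in opposite order, yielding the same "Found one Java-level
// deadlock" report that jstack printed above.
public class LockCycleDemo {
    private static final Object monitorA = new Object();
    private static final Object monitorB = new Object();

    public static void main(String[] args) {
        Thread worker36 = new Thread(() -> {
            synchronized (monitorA) {        // "Worker-36" holds monitorA
                sleep(100);
                synchronized (monitorB) { }  // ...and waits for monitorB
            }
        }, "QuartzScheduler_Worker-36");

        Thread worker31 = new Thread(() -> {
            synchronized (monitorB) {        // "Worker-31" holds monitorB
                sleep(100);
                synchronized (monitorA) { }  // ...and waits for monitorA
            }
        }, "QuartzScheduler_Worker-31");

        worker36.start();
        worker31.start();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}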


Engine.log
***********
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Error code migrateErr and error message VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration
2012-11-21 11:41:03,057 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) Command org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand return value 
 Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mStatus                       Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusForXmlRpc
mCode                         12
mMessage                      Fatal error during migration


2012-11-21 11:41:03,057 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase] (pool-4-thread-25) HostName = puma27
2012-11-21 11:41:03,057 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-4-thread-25) Command MigrateStatusVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to MigrateStatusVDS, error = Fatal error during migration

Comment 1 Omri Hochman 2012-11-21 13:30:52 UTC
Created attachment 649223 [details]
engine.log

Comment 3 Omri Hochman 2012-11-25 10:15:12 UTC
Adding vdsm.log. 
The "Fatal error during migration" in engine.log was an unrelated storage issue that didn't cause the qemu processes to fail, but they remained on the source hosts.
I suspect that the deadlock found on the engine side, which can be seen in console.log, is related to the QEMU processes that died during the migration: VMs were stuck on the RHEV-M side with status 'Migrate to' or 'Unknown' while the qemu processes were no longer running on the hosts.

Comment 4 Omri Hochman 2012-11-25 10:16:35 UTC
Created attachment 651457 [details]
vdsm_log_source_side

Comment 5 Omri Hochman 2012-11-25 10:23:38 UTC
Created attachment 651458 [details]
vdsm_destination_side

problematic VM died during migration: 
vmId 'd28b6c17-ce33-4036-bfce-011bb95d9d3d'

Comment 6 Michal Skrivanek 2012-11-27 16:34:53 UTC
might be related to issues mentioned in 781975 (cloned as 861918)

Comment 8 Roy Golan 2012-12-12 15:18:03 UTC
3 VdsManagers are deadlocked here on each other. It is caused by a VM that failed to run, which the manager then tries to rerun on a different VdsManager, which in turn does the same thing.

Take a look at RunVmCommandBase.runningSucceded().

We have a situation here where VdsManagers are calling each other and blocking while waiting. 

I'll try to run the blocking code in another thread so the VdsManager releases its lock instead of blocking on this operation.
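
For illustration, a rough sketch of that approach (hypothetical class and method names, not the actual oVirt code or the gerrit change): do only the local bookkeeping under the manager's monitor, release it, and hand the blocking call into the other VdsManager to a separate thread.

// Sketch under assumptions - names are hypothetical, not the real oVirt classes.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class VdsManagerSketch {
    private static final ExecutorService rerunExecutor = Executors.newCachedThreadPool();
    private final Object monitor = new Object();

    // Before: the rerun call into the other host's manager happens while this
    // manager's monitor is still held; if that manager is doing the same thing
    // back at us, both block forever -> the deadlock seen in console.log.
    void onRunFailedBlocking(VdsManagerSketch otherHostManager, String vmId) {
        synchronized (monitor) {
            otherHostManager.rerunVm(vmId);   // blocks while holding our monitor
        }
    }

    // After: update only this manager's own state under the monitor, then
    // release it and let a separate thread perform the blocking call.
    void onRunFailedAsync(VdsManagerSketch otherHostManager, String vmId) {
        synchronized (monitor) {
            // ...update this manager's own bookkeeping for the failed VM...
        }
        rerunExecutor.submit(() -> otherHostManager.rerunVm(vmId));
    }

    void rerunVm(String vmId) {
        synchronized (monitor) {
            // ...schedule the VM to run on this host...
        }
    }
}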

Comment 9 Roy Golan 2012-12-12 15:25:12 UTC
http://gerrit.ovirt.org/#/c/10002/

Comment 11 Omri Hochman 2012-12-23 07:40:49 UTC
Created attachment 667941 [details]
console_log

deadlock after applying http://gerrit.ovirt.org/#/c/10002/

Comment 12 Omri Hochman 2012-12-26 08:32:55 UTC
Re-test mass migration with the new fixes:
http://gerrit.ovirt.org/#/c/7204/
https://docspace.corp.redhat.com/docs/DOC-126555

Performed the migration of 1250 VMs 3 times - no deadlock occurred! 
It seems the fix is valid.

The only thing that was a bit strange was that during the migration test, VMs temporarily went into status 'Unknown' for a few seconds; then the migration succeeded and the VM status changed back to 'Up'.

Not sure if the 'Unknown' behavior is related to the fix.

Comment 15 RHEL Program Management 2013-01-06 08:46:27 UTC
Quality Engineering Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 25 Michal Skrivanek 2013-11-04 13:05:26 UTC
since this has been in by rebase since 3.2 already, I'd rather keep it untested than risk a revert

Comment 27 Shai Revivo 2013-11-06 09:46:03 UTC
Acking it to allow the customer fix.
No verification on QE side.

Comment 29 Shai Revivo 2013-12-30 09:10:26 UTC
Currently we do not have the resources (Lab) to test it.
We will have to push it forward to 3.4.

Comment 30 Shai Revivo 2014-01-15 14:43:26 UTC
QE cannot verify it in 3.3; will verify in 3.4.

Comment 32 Yuri Obshansky 2014-05-18 08:06:44 UTC
Bug is not reproducible on version:
RHEVM 3.4.0-0.16.rc.el6ev,
OS Version: RHEL - 6Server - 6.5.0.1.el6,
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64,
KVM Version: 0.12.1.2 - 2.415.el6_5.6,
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5,
VDSM Version: vdsm-4.14.7-0.2.rc.el6ev.

Note: We tried to reproduce it on an environment with 200 fake hosts instead of 15 real ones, since we didn't have enough resources. 
Environment: 2 Data Centers (Real and Fake),
2 NFS Storage Domains (Real and Fake),
4 real hosts (24 CPUs, 64 GB),
200 real VMs, 200 fake hosts, 1500 fake VMs. 

Other exceptions occurred - see newly created Bug 1098763 - Mass migration of 1500 VM's caused VDSErrorException: Failed to GetAllVmStatsVDS.

Comment 33 Itamar Heim 2014-06-12 14:10:15 UTC
Closing as part of 3.4.0

Comment 34 Red Hat Bugzilla 2023-09-14 01:39:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days