Bug 840467

Summary: [rhevm] [engine-core] Activate and remove host one after another causes deadlock (system stuck)
Product: Red Hat Enterprise Virtualization Manager Reporter: vvyazmin <vvyazmin>
Component: ovirt-engine    Assignee: mkublin <mkublin>
Status: CLOSED CURRENTRELEASE QA Contact: Haim <hateya>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.1.0    CC: amureini, bazulay, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul, yzaslavs
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: SI13 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments (Description / Flags):
  ## Logs rhevm, postgresql (none)
  engine and server logs (none)

Description vvyazmin@redhat.com 2012-07-16 12:55:35 UTC
Created attachment 598445 [details]
## Logs rhevm, postgresql

Summary: RHEVM gets stuck with a deadlock in the DB when you run “Activate” and “Remove” on a host, one after the other


Description of problem:
RHEVM gets stuck with a deadlock in the DB when you run “Activate” and “Remove” on a host, one after the other

Version-Release number of selected component (if applicable):
Verified on RHEVM 3.1 - SI10

RHEV-M: rhevm-3.1.0-6.el6ev.noarch
VDSM: vdsm-4.9.6-21.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.1.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a new iSCSI DC
2. Add a new host to the “Default” DC
3. Put this host into “Maintenance” mode
4. Edit the host configuration, and change it to the new DC (the one created in the first step)
5. “Activate” the host, and then select the “Remove” option, one after the other
  
Actual results:
The system gets stuck with a deadlock in the DB (rhevm & postgresql logs attached)

Expected results:
The system should prevent running the second command until the first one has finished.
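
For illustration only, a minimal sketch of the expected behaviour (not the engine's actual locking code; the class and method names below are hypothetical): a per-host lock with a fail-fast tryLock would reject the second command while the first one is still running, instead of letting both reach the DB and deadlock.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical helper: serializes commands that operate on the same host.
    public class HostCommandSerializer {
        private final Map<UUID, ReentrantLock> locks = new ConcurrentHashMap<>();

        // Returns false (command rejected) if another command already holds the host lock.
        public boolean runExclusively(UUID vdsId, Runnable command) {
            ReentrantLock lock = locks.computeIfAbsent(vdsId, id -> new ReentrantLock());
            if (!lock.tryLock()) {
                return false; // e.g. "Remove" is refused while "Activate" is still in progress
            }
            try {
                command.run();
                return true;
            } finally {
                lock.unlock();
            }
        }
    }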

Additional info:

2012-07-16 18:08:37,599 ERROR [org.ovirt.engine.core.bll.RemoveVdsCommand] (pool-4-thread-45) [77922e3e] Command org.ovirt.engine.core.bll.RemoveVdsCommand throw exception: org.springframework.dao.DeadlockLoserDataAccessException: Callab
leStatementCallback; SQL [{call deletevdsstatistics(?)}]; ERROR: deadlock detected
  Detail: Process 20617 waits for ShareLock on transaction 100741; blocked by process 18736.
Process 18736 waits for ShareLock on transaction 100668; blocked by process 17663.
Process 17663 waits for ExclusiveLock on tuple (0,35) of relation 16681 of database 16395; blocked by process 20617.
  Hint: See server log for query details.
  Where: SQL statement "DELETE FROM vds_statistics WHERE vds_id =  $1 "
PL/pgSQL function "deletevdsstatistics" line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 20617 waits for ShareLock on transaction 100741; blocked by process 18736.
Process 18736 waits for ShareLock on transaction 100668; blocked by process 17663.
Process 17663 waits for ExclusiveLock on tuple (0,35) of relation 16681 of database 16395; blocked by process 20617.
  Hint: See server log for query details.
  Where: SQL statement "DELETE FROM vds_statistics WHERE vds_id =  $1 "
PL/pgSQL function "deletevdsstatistics" line 3 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:265) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:1030) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.call(JdbcTemplate.java:1064) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.simple.AbstractJdbcCall.executeCallInternal(AbstractJdbcCall.java:388) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.simple.AbstractJdbcCall.doExecute(AbstractJdbcCall.java:351) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
        at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:181) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:124) [engine-dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeModification(SimpleJdbcCallsHandler.java:37) [engine-dal.jar:]
        at org.ovirt.engine.core.dao.VdsStatisticsDAODbFacadeImpl.remove(VdsStatisticsDAODbFacadeImpl.java:110) [engine-dal.jar:]
        at org.ovirt.engine.core.dao.VdsStatisticsDAODbFacadeImpl.remove(VdsStatisticsDAODbFacadeImpl.java:20) [engine-dal.jar:]
        at org.ovirt.engine.core.bll.RemoveVdsCommand.RemoveVdsStatisticsFromDb(RemoveVdsCommand.java:127) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.RemoveVdsCommand.executeCommand(RemoveVdsCommand.java:40) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.ExecuteWithoutTransaction(CommandBase.java:805) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:897) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1204) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:204) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:142) [engine-utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:109) [engine-utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.Execute(CommandBase.java:912) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.ExecuteAction(CommandBase.java:269) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommands(MultipleActionsRunner.java:182) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner.RunCommands(MultipleActionsRunner.java:162) [engine-bll.jar:]
        at org.ovirt.engine.core.bll.MultipleActionsRunner$1.run(MultipleActionsRunner.java:84) [engine-bll.jar:]
        at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:64) [engine-utils.jar:]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_05-icedtea]
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_05-icedtea]
        at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_05-icedtea]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.7.0_05-icedtea]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.7.0_05-icedtea]
        at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_05-icedtea]
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
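
For illustration only (the table and column names below are simplified stand-ins, not the real engine schema), a minimal JDBC sketch of how two concurrent transactions that touch the same rows in opposite order produce exactly this kind of PostgreSQL "deadlock detected" error:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Two transactions updating the same two rows in opposite order: PostgreSQL
    // detects the wait cycle and aborts one of them with "deadlock detected".
    public class DeadlockDemo {
        static void transact(String first, String second) throws Exception {
            try (Connection c = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/demo", "demo", "demo")) {
                c.setAutoCommit(false);
                try (Statement s = c.createStatement()) {
                    s.executeUpdate(first);   // row lock #1 acquired
                    Thread.sleep(1000);       // give the other transaction time to lock row #2
                    s.executeUpdate(second);  // waits on the other transaction -> wait cycle
                }
                c.commit();
            }
        }

        public static void main(String[] args) throws Exception {
            Thread t1 = new Thread(() -> {
                try {
                    transact("UPDATE host_stats   SET val = 1 WHERE host_id = 1",
                             "UPDATE host_dynamic SET val = 1 WHERE host_id = 1");
                } catch (Exception e) { e.printStackTrace(); } // one thread gets the deadlock error
            });
            Thread t2 = new Thread(() -> {
                try {
                    transact("UPDATE host_dynamic SET val = 2 WHERE host_id = 1",
                             "UPDATE host_stats   SET val = 2 WHERE host_id = 1");
                } catch (Exception e) { e.printStackTrace(); }
            });
            t1.start(); t2.start();
            t1.join(); t2.join();
        }
    }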

Comment 1 mkublin 2012-07-17 07:27:41 UTC
http://gerrit.ovirt.org/#/c/6359/

Comment 2 Haim 2012-07-29 14:53:52 UTC
Created attachment 601041 [details]
engine and server logs

The issue is far worse on SI12 - the host is stuck on Unassigned forever - though the deadlock is not printed to the logs.
Please note that the connectStoragePool sent to the host returned successfully.

Comment 3 Haim 2012-07-29 15:01:43 UTC
(In reply to comment #2)
> Created attachment 601041 [details]
> engine and server logs
> 
> The issue is far worse on SI12 - the host is stuck on Unassigned forever -
> though the deadlock is not printed to the logs.
> Please note that the connectStoragePool sent to the host returned successfully.

Michael, the host went up after 10 minutes, when the transaction rolled back.
What is the desired behavior here? Should it get stuck for 10 minutes?

Comment 4 mkublin 2012-07-30 08:12:10 UTC
I don't think that 10 minutes is good behaviour; the case described by Haim is one example of the bad architecture.
The problem is the following: we thought the host was going to Up, so we have code at VdsManager.OnTimer which handles the host's up event.
This code takes a lock, and for some reason the process of starting the host took around 10 minutes.
It can be seen in the logs when it finished:
2012-07-29 21:13:15,941 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-69) [324a1400] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 78e0bf24-d979-11e1-9e8e-001a4a16970e Type: VDS
2012-07-29 21:13:15,962 INFO  [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-69) [ad0e4c8] Running command: HandleVdsVersionCommand internal: true. Entities affected :  ID: 78e0bf24-d979-11e1-9e8e-001a4a16970e Type: VDS

To see the whole process, please follow the thread named QuartzScheduler_Worker-69.

The RemoveVdsVDSCommand is also trying to acquire the same lock; it can be seen that it finishes exactly after the process of thread
QuartzScheduler_Worker-69 has finished.
2012-07-29 21:13:15,970 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (pool-4-thread-50) [78b30eae] vdsManager::disposing
2012-07-29 21:13:15,971 INFO  [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (pool-4-thread-50) [78b30eae] FINISH, RemoveVdsVDSCommand, log id: 6fd9ef63
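
For illustration only (the names below are hypothetical, not the real VdsManager code), a minimal sketch of the contention described above: the monitoring path holds the per-host lock for the whole up-event handling, so the removal path blocks until it finishes. A bounded wait on the removal side would at least surface the stall instead of hanging silently for ~10 minutes.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical per-host manager: one lock shared by monitoring and removal.
    public class HostManagerSketch {
        private final ReentrantLock hostLock = new ReentrantLock();

        // Monitoring path (like the up-event handling on QuartzScheduler_Worker-69):
        // holds the lock for as long as the handling runs - here that was ~10 minutes.
        public void onTimer(Runnable handleUpEvent) {
            hostLock.lock();
            try {
                handleUpEvent.run();
            } finally {
                hostLock.unlock();
            }
        }

        // Removal path (like RemoveVdsVDSCommand): blocks until the monitor releases
        // the lock. A bounded wait makes the stall visible instead of hanging silently.
        public boolean dispose() throws InterruptedException {
            if (!hostLock.tryLock(30, TimeUnit.SECONDS)) {
                return false; // report "host is busy" rather than waiting 10 minutes
            }
            try {
                // remove the host's runtime state here
                return true;
            } finally {
                hostLock.unlock();
            }
        }
    }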

The described problem is not related to this bug, but it raises a couple of problems:
1. RemoveVds should run without a global transaction, because in the end we removed the proxy for the host but it was left in the DB as an orphan record. I think we should remove it from the DB in any case (see the sketch after this list).
2. The host is OK and connected to storage, but we are removing it, so I don't think we will succeed in connecting it to any other pool in the future.
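
A minimal sketch of point 1, under the assumption that JTA is available in the engine's container (the helper and DAO names below are hypothetical; the engine's own TransactionSupport.executeInNewTransaction, visible in the stack trace above, serves the same purpose): do the DB cleanup in a dedicated transaction so the record cannot be left behind as an orphan when the surrounding flow stalls or rolls back.

    import javax.naming.InitialContext;
    import javax.transaction.UserTransaction;

    // Hypothetical cleanup: delete the host's DB records in a dedicated transaction
    // so they cannot be left behind as orphans by the outer, long-running flow.
    public class RemoveVdsDbCleanup {
        public void removeVdsRecords(java.util.UUID vdsId, VdsDao vdsDao) throws Exception {
            UserTransaction tx = (UserTransaction) new InitialContext()
                    .lookup("java:comp/UserTransaction");
            tx.begin();
            try {
                vdsDao.removeStatistics(vdsId); // the call that deadlocked in this bug
                vdsDao.removeDynamic(vdsId);
                vdsDao.removeStatic(vdsId);
                tx.commit();
            } catch (Exception e) {
                tx.rollback();
                throw e;
            }
        }

        // Hypothetical DAO interface, standing in for the engine's VdsStatistics/Dynamic/Static DAOs.
        public interface VdsDao {
            void removeStatistics(java.util.UUID vdsId);
            void removeDynamic(java.util.UUID vdsId);
            void removeStatic(java.util.UUID vdsId);
        }
    }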

Haim, the next steps are your decision.

Comment 5 Haim 2012-07-30 08:52:57 UTC
The deadlock is solved, but we're hitting a different locking issue with the host in that scenario; I will close this bug as verified on SI12 and open a new one with the new information.