Bug 840467
Summary: [rhevm] [engine-core] Activate and remove host one after another causes deadlock (system stuck)

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-engine |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | urgent |
| Version | 3.1.0 |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | infra |
| Fixed In Version | SI13 |
| Doc Type | Bug Fix |
| Type | Bug |
| Regression | --- |
| Story Points | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | Infra |
| Cloudforms Team | --- |
| Reporter | vvyazmin <vvyazmin> |
| Assignee | mkublin <mkublin> |
| QA Contact | Haim <hateya> |
| CC | amureini, bazulay, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, sgrinber, yeylon, ykaul, yzaslavs |
Created attachment 601041 [details]
engine and server logs
The issue is far worse on SI12: the host is stuck in Unassigned forever, though the deadlock is not printed to the logs.
Please note that the connectStoragePool call sent to the host returned successfully.
(In reply to comment #2)
> Created attachment 601041 [details]
> engine and server logs
>
> issue is far worse on SI12 - host is stuck on unassigned for ever - deadlock
> is not printed to logs though.
> please note that connectStoragePool sent to host was returned successfully.

Michael, the host went up after 10 minutes, when the transaction rolled back. What is the desired behavior here? Should it get stuck for 10 minutes?

I don't think that 10 minutes is good behaviour; the case described by Haim is one example of bad architecture. The problem is the following: we thought the host was going Up, so we have code in VdsManager.OnTimer which handles the Up event of a host. This code takes a lock, and for some reason the process of starting the host took around 10 minutes. It can be seen in the logs when it finished:

2012-07-29 21:13:15,941 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (QuartzScheduler_Worker-69) [324a1400] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected : ID: 78e0bf24-d979-11e1-9e8e-001a4a16970e Type: VDS
2012-07-29 21:13:15,962 INFO [org.ovirt.engine.core.bll.HandleVdsVersionCommand] (QuartzScheduler_Worker-69) [ad0e4c8] Running command: HandleVdsVersionCommand internal: true. Entities affected : ID: 78e0bf24-d979-11e1-9e8e-001a4a16970e Type: VDS

To see the whole process, follow the thread named QuartzScheduler_Worker-69. RemoveVdsVDSCommand also tries to acquire the same lock; it can be seen that it finished exactly after the QuartzScheduler_Worker-69 thread did:

2012-07-29 21:13:15,970 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (pool-4-thread-50) [78b30eae] vdsManager::disposing
2012-07-29 21:13:15,971 INFO [org.ovirt.engine.core.vdsbroker.RemoveVdsVDSCommand] (pool-4-thread-50) [78b30eae] FINISH, RemoveVdsVDSCommand, log id: 6fd9ef63

The described problem is not related to this bug, but it raises a couple of issues:
1. RemoveVds should run without a global transaction, because at the end we removed the proxy for the host but it was left in the DB as an orphan record. I think we should remove it from the DB in any case.
2. The host is fine and connected to storage, but we are removing it, so I don't think we will later succeed in connecting it to any other pool.

Haim, the next steps are your decision.

The deadlock is solved, but we hit a different locking issue with the host in that scenario. I will close this bug as verified on SI12 and open a new one with the new information.
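The contention described above (VdsManager.OnTimer holding a per-host lock for the whole "host going Up" flow while RemoveVdsVDSCommand blocks on the same lock) can be sketched with a toy model. This is a minimal Python illustration with hypothetical names, not the ovirt-engine code; it shows how a bounded lock acquisition would let the remove path fail fast instead of hanging for the duration of the Up flow:

```python
import threading
import time

host_lock = threading.Lock()  # stands in for VdsManager's per-host lock

def on_timer_up_flow(done: threading.Event):
    # Simulates VdsManager.OnTimer handling the Up event while holding the lock.
    with host_lock:
        time.sleep(0.5)  # stands in for the ~10-minute host-start flow
    done.set()

def remove_vds(timeout: float) -> bool:
    # Simulates RemoveVdsVDSCommand trying to take the same lock, but with a
    # bounded wait instead of blocking until the Up flow releases it.
    acquired = host_lock.acquire(timeout=timeout)
    if acquired:
        host_lock.release()
    return acquired

done = threading.Event()
t = threading.Thread(target=on_timer_up_flow, args=(done,))
t.start()
time.sleep(0.1)                        # let the timer flow grab the lock first
blocked = not remove_vds(timeout=0.1)  # fails fast while the Up flow holds the lock
done.wait()
t.join()
succeeded = remove_vds(timeout=0.1)    # succeeds once the lock is free
```

With the unbounded acquire the engine actually uses here, the remove simply queues behind the Up flow, which is why RemoveVdsVDSCommand finished only once QuartzScheduler_Worker-69 did.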
Created attachment 598445 [details]
Logs: rhevm, postgresql

Summary: RHEVM gets stuck with a deadlock in the DB when you run "Activate" and "Remove" on a host, one after the other.

Description of problem:
RHEVM gets stuck with a deadlock in the DB when you run "Activate" and "Remove" on a host, one after the other.

Version-Release number of selected component (if applicable):
Verified on RHEVM 3.1 - SI10
RHEV-M: rhevm-3.1.0-6.el6ev.noarch
VDSM: vdsm-4.9.6-21.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.1.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a new iSCSI DC
2. Add a new host to the "Default" DC
3. Put this host into "Maintenance" mode
4. Edit the host configuration and move the host to the new DC (the one created in the first step)
5. "Activate" the host, then select "Remove", one right after the other

Actual results:
The system gets stuck with a deadlock in the DB (rhevm & postgresql logs attached).

Expected results:
The system should prevent running the second command until the first has finished.

Additional info:
2012-07-16 18:08:37,599 ERROR [org.ovirt.engine.core.bll.RemoveVdsCommand] (pool-4-thread-45) [77922e3e] Command org.ovirt.engine.core.bll.RemoveVdsCommand throw exception: org.springframework.dao.DeadlockLoserDataAccessException: CallableStatementCallback; SQL [{call deletevdsstatistics(?)}]; ERROR: deadlock detected
  Detail: Process 20617 waits for ShareLock on transaction 100741; blocked by process 18736.
    Process 18736 waits for ShareLock on transaction 100668; blocked by process 17663.
    Process 17663 waits for ExclusiveLock on tuple (0,35) of relation 16681 of database 16395; blocked by process 20617.
  Hint: See server log for query details.
  Where: SQL statement "DELETE FROM vds_statistics WHERE vds_id = $1" PL/pgSQL function "deletevdsstatistics" line 3 at SQL statement;
nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 20617 waits for ShareLock on transaction 100741; blocked by process 18736.
    Process 18736 waits for ShareLock on transaction 100668; blocked by process 17663.
    Process 17663 waits for ExclusiveLock on tuple (0,35) of relation 16681 of database 16395; blocked by process 20617.
  Hint: See server log for query details.
  Where: SQL statement "DELETE FROM vds_statistics WHERE vds_id = $1" PL/pgSQL function "deletevdsstatistics" line 3 at SQL statement
	at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:265) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
	at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:1030) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.call(JdbcTemplate.java:1064) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
	at org.springframework.jdbc.core.simple.AbstractJdbcCall.executeCallInternal(AbstractJdbcCall.java:388) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
	at org.springframework.jdbc.core.simple.AbstractJdbcCall.doExecute(AbstractJdbcCall.java:351) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
	at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:181) [spring-jdbc-3.1.1.RELEASE.jar:3.1.1.RELEASE]
	at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:124) [engine-dal.jar:]
	at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeModification(SimpleJdbcCallsHandler.java:37) [engine-dal.jar:]
	at org.ovirt.engine.core.dao.VdsStatisticsDAODbFacadeImpl.remove(VdsStatisticsDAODbFacadeImpl.java:110) [engine-dal.jar:]
	at org.ovirt.engine.core.dao.VdsStatisticsDAODbFacadeImpl.remove(VdsStatisticsDAODbFacadeImpl.java:20) [engine-dal.jar:]
	at org.ovirt.engine.core.bll.RemoveVdsCommand.RemoveVdsStatisticsFromDb(RemoveVdsCommand.java:127) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.RemoveVdsCommand.executeCommand(RemoveVdsCommand.java:40) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.ExecuteWithoutTransaction(CommandBase.java:805) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:897) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1204) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:204) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:142) [engine-utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:109) [engine-utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.Execute(CommandBase.java:912) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.ExecuteAction(CommandBase.java:269) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommands(MultipleActionsRunner.java:182) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner.RunCommands(MultipleActionsRunner.java:162) [engine-bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner$1.run(MultipleActionsRunner.java:84) [engine-bll.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:64) [engine-utils.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_05-icedtea]
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [rt.jar:1.7.0_05-icedtea]
	at java.util.concurrent.FutureTask.run(FutureTask.java:166) [rt.jar:1.7.0_05-icedtea]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.7.0_05-icedtea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.7.0_05-icedtea]
	at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_05-icedtea]
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
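The Detail lines in the PostgreSQL error describe a circular wait among three backend processes, which is exactly the condition the server's deadlock detector aborts. A toy wait-for-graph check makes the cycle explicit (process numbers are taken from the log above; the detection code is a simple illustration, not how PostgreSQL implements it):

```python
# Wait-for edges from the log: each process waits on a lock held by the next.
waits_for = {
    20617: 18736,  # waits for ShareLock on transaction 100741
    18736: 17663,  # waits for ShareLock on transaction 100668
    17663: 20617,  # waits for ExclusiveLock on tuple (0,35) of relation 16681
}

def find_cycle(graph, start):
    """Follow wait-for edges from `start`; return the cycle if a node repeats."""
    seen, node = [], start
    while node in graph and node not in seen:
        seen.append(node)
        node = graph[node]
    return seen + [node] if node in seen else None

cycle = find_cycle(waits_for, 20617)
```

Here `find_cycle` returns the full loop 20617 → 18736 → 17663 → 20617, confirming that no process can ever proceed; PostgreSQL resolves this by killing one transaction, which is why the engine sees a DeadlockLoserDataAccessException rather than hanging forever.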