+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1564069 +++
======================================================================

Created attachment 1417608 [details]
Full stacktrace as logged in engine.log

Description of problem:

A HotUnPlugDisk operation may trigger an SQL deadlock. These are the related messages logged in engine.log (full stacktrace attached to the BZ):

2018-04-04 17:17:08,740Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotUnPlugDiskVDSCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] FINISH, HotUnPlugDiskVDSCommand, log id: 63104284

2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch: Batch entry 0 select * from public.updatevmdevicebootorder(CAST ('87880aaf-d326-4253-8636-026f53e6ce6b' AS uuid),CAST ('1bfe3e00-0b33-41f6-b413-dc64b9eda5b8' AS uuid),CAST (0 AS int4)) as result was aborted. Call getNextException to see the cause.

2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch. Next exception is: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
    Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device SET boot_order = v_boot_order, _update_date = current_timestamp WHERE device_id = v_device_id AND vm_id = v_vm_id"
    PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement

2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand' failed: ConnectionCallback; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
    Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device SET boot_order = v_boot_order, _update_date = current_timestamp WHERE device_id = v_device_id AND vm_id = v_vm_id"
    PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
    Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device SET boot_order = v_boot_order, _update_date = current_timestamp WHERE device_id = v_device_id AND vm_id = v_vm_id"
    PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement

2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Exception: org.springframework.dao.DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
    Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device SET boot_order = v_boot_order, _update_date = current_timestamp WHERE device_id = v_device_id AND vm_id = v_vm_id"
    PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
    Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device SET boot_order = v_boot_order, _update_date = current_timestamp WHERE device_id = v_device_id AND vm_id = v_vm_id"
    PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
    at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:263) [spring-jdbc.jar:4.2.4.RELEASE]
    at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) [spring-jdbc.jar:4.2.4.RELEASE]
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:356) [spring-jdbc.jar:4.2.4.RELEASE]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:58) [dal.jar:]
    at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:72) [dal.jar:]
    at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:60) [dal.jar:]
    at org.ovirt.engine.core.dao.VmDeviceDaoImpl.updateBootOrderInBatch(VmDeviceDaoImpl.java:272) [dal.jar:]
    at org.ovirt.engine.core.bll.utils.VmDeviceUtils.updateBootOrder(VmDeviceUtils.java:1205) [bll.jar:]
    at org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand.executeVmCommand(DetachDiskFromVmCommand.java:96) [bll.jar:]
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1255) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1395) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2059) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:137) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:105) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1455) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) [bll.jar:]
    at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:516) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:498) [bll.jar:]
    at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451) [bll.jar:]
    (...)
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
    Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device SET boot_order = v_boot_order, _update_date = current_timestamp WHERE device_id = v_device_id AND vm_id = v_vm_id"
    PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
    at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:714)
    at org.jboss.jca.adapters.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:1180)
    at org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback.doInConnection(BatchProcedureExecutionConnectionCallback.java:62) [dal.jar:]
    at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:349) [spring-jdbc.jar:4.2.4.RELEASE]
    ... 191 more

2018-04-04 17:17:09,788Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Transaction rolled-back for command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand'.

2018-04-04 17:17:09,803Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] EVENT_ID: USER_FAILED_DETACH_DISK_FROM_VM(2,019), Correlation ID: dfd9e427-683c-41bf-afc0-370b12de0581, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Failed to detach Disk hds-monitor2_Disk1 from VM backup-rhv (User: admin@internal-authz).
2018-04-04 17:17:09,804Z ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-42) [] Operation Failed: [Internal Engine Error]

And this is the information logged by PostgreSQL:

ERROR: deadlock detected
DETAIL: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
        Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
        Process 23976: select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result
        Process 24090: select * from public.updatevmdevice(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS varchar),CAST ($4 AS varchar),CAST ($5 AS varchar),CAST ($6 AS int4),CAST ($7 AS text),CAST ($8 AS bool),CAST ($9 AS bool),CAST ($10 AS bool),CAST ($11 AS varchar),CAST ($12 AS text),CAST ($13 AS uuid),CAST ($14 AS varchar),CAST ($15 AS varchar)) as result
HINT: See server log for query details.
CONTEXT: SQL statement "UPDATE vm_device SET boot_order = v_boot_order, _update_date = current_timestamp WHERE device_id = v_device_id AND vm_id = v_vm_id"
        PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
STATEMENT: select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result

It looks like a batch of UpdateVmDeviceBootOrder calls can collide with an UpdateVmDevice call running in parallel, each waiting on a row lock held by the other.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.10.3-0.1.el7.noarch

How reproducible:
Quite frequently while running a Commvault VM backup process.

(Originally by Sergio Lopez Pascual)
The issue should already be resolved in 4.2 as part of:
https://gerrit.ovirt.org/#/c/71346/
https://gerrit.ovirt.org/#/c/71333/

Since the change seems too major to backport to 4.1, moving to ON_QA for verification.

(Originally by Daniel Erez)
Has QE reproduced this? (Originally by Yaniv Kaul)
I don't see clear steps to reproduce here, so I can't tell whether we have this tested.

(Originally by Elad Ben Aharon)
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ] For more info please contact: rhv-devops

(Originally by rhv-bugzilla-bot)
Execute all disk hotplug automation tests for verification. (Originally by Elad Ben Aharon)
acked
Executed storage automation hotplug/unplug disk sanity over iSCSI and NFS. Passed.

For reference, these are the Polarion test cases that were executed:

TestCase5033
TestCase5034
TestCase5035
TestCase5036
TestCase5041
TestCase6231

Used:
rhevm-4.1.11.2-0.1.el7.noarch
vdsm-4.19.51-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.3.x86_64
(In reply to Elad from comment #25)
> Executed storage automation hotplug/unplug disk sanity over iscsi and nfs.
> Passed
> For a reference, these are the tests cases in Polarion that were executed:
>
> TestCase5033
> TestCase5034
> TestCase5035
> TestCase5036
> TestCase5041
> TestCase6231
>
> Used:
> rhevm-4.1.11.2-0.1.el7.noarch
> vdsm-4.19.51-1.el7ev.x86_64
> libvirt-daemon-3.9.0-14.el7_5.3.x86_64

The fix also changes the way the boot sequence is set on run-VM. The backported code has worked fine on the 4.2 branch for quite some time, but don't we want at least some minimal verification that it also works as expected on 4.1?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1219
(In reply to Arik from comment #26)
> (In reply to Elad from comment #25)
> > Executed storage automation hotplug/unplug disk sanity over iscsi and nfs.
> > Passed
> > For a reference, these are the tests cases in Polarion that were executed:
> >
> > TestCase5033
> > TestCase5034
> > TestCase5035
> > TestCase5036
> > TestCase5041
> > TestCase6231
> >
> > Used:
> > rhevm-4.1.11.2-0.1.el7.noarch
> > vdsm-4.19.51-1.el7ev.x86_64
> > libvirt-daemon-3.9.0-14.el7_5.3.x86_64
>
> The fix also changes the way the boot sequence is set on run-VM.
> The code that has been backported works fine on the 4.2 branch for quite
> some time, but don't we want to have at least some minimal verification it
> also works as expected on 4.1?

Verified with both Run Once and Start VM, using all 3 boot options: HD, CD, PXE. All passed.