Bug 1568075 - [downstream clone - 4.1.11] HotUnPlugDisk triggers SQL deadlock
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.1.11
Target Release: ---
Assigned To: Daniel Erez
QA Contact: Elad
Keywords: ZStream
Depends On: 1564069
Blocks:
 
Reported: 2018-04-16 13:37 EDT by RHV Bugzilla Automation and Verification Bot
Modified: 2018-08-02 06:56 EDT
CC List: 21 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1564069
Environment:
Last Closed: 2018-04-24 11:30:28 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: Full stacktrace as logged in engine.log (attachment 1417608)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3410751 None None None 2018-04-16 13:39 EDT
oVirt gerrit 90337 ovirt-engine-4.1 MERGED core: do not persist VmDevice#bootOrder 2018-04-17 04:58 EDT
oVirt gerrit 90338 ovirt-engine-4.1 ABANDONED core: always compute boot order on run vm 2018-04-16 13:39 EDT
Red Hat Product Errata RHBA-2018:1219 None None None 2018-04-24 11:31 EDT

Description RHV Bugzilla Automation and Verification Bot 2018-04-16 13:37:38 EDT
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1564069 +++
======================================================================

Created attachment 1417608 [details]
Full stacktrace as logged in engine.log

Description of problem:

A HotUnPlugDisk operation may trigger an SQL deadlock. These are the related messages logged in engine.log (the full stacktrace is attached to this BZ):

2018-04-04 17:17:08,740Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotUnPlugDiskVDSCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] FINISH, HotUnPlugDiskVDSCommand, log id: 63104284
2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch: Batch entry 0 select * from public.updatevmdevicebootorder(CAST ('87880aaf-d326-4253-8636-026f53e6ce6b' AS uuid),CAST ('1bfe3e00-0b33-41f6-b413-dc64b9eda5b8' AS uuid),CAST (0 AS int4)) as result was aborted.  Call getNextException to see the cause.
2018-04-04 17:17:09,760Z ERROR [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Can't execute batch. Next exception is: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand' failed: ConnectionCallback; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
2018-04-04 17:17:09,761Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Exception: org.springframework.dao.DeadlockLoserDataAccessException: ConnectionCallback; SQL []; ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:263) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:356) [spring-jdbc.jar:4.2.4.RELEASE]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:58) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:72) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:60) [dal.jar:]
        at org.ovirt.engine.core.dao.VmDeviceDaoImpl.updateBootOrderInBatch(VmDeviceDaoImpl.java:272) [dal.jar:]
        at org.ovirt.engine.core.bll.utils.VmDeviceUtils.updateBootOrder(VmDeviceUtils.java:1205) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand.executeVmCommand(DetachDiskFromVmCommand.java:96) [bll.jar:]
        at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1255) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1395) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2059) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:137) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:105) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1455) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:516) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:498) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451) [bll.jar:]
(...)
Caused by: org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
  Hint: See server log for query details.
  Where: SQL statement "UPDATE vm_device
    SET boot_order = v_boot_order,
        _update_date = current_timestamp
    WHERE device_id = v_device_id
        AND vm_id = v_vm_id"
PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2157)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
        at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.executeBatch(CachedPreparedStatement.java:714)
        at org.jboss.jca.adapters.jdbc.WrappedStatement.executeBatch(WrappedStatement.java:1180)
        at org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback.doInConnection(BatchProcedureExecutionConnectionCallback.java:62) [dal.jar:]
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:349) [spring-jdbc.jar:4.2.4.RELEASE]
        ... 191 more

2018-04-04 17:17:09,788Z ERROR [org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] Transaction rolled-back for command 'org.ovirt.engine.core.bll.storage.disk.DetachDiskFromVmCommand'.
2018-04-04 17:17:09,803Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-42) [dfd9e427-683c-41bf-afc0-370b12de0581] EVENT_ID: USER_FAILED_DETACH_DISK_FROM_VM(2,019), Correlation ID: dfd9e427-683c-41bf-afc0-370b12de0581, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Failed to detach Disk hds-monitor2_Disk1 from VM backup-rhv (User: admin@internal-authz).
2018-04-04 17:17:09,804Z ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-42) [] Operation Failed: [Internal Engine Error]


And this is the information logged by PostgreSQL:

ERROR:  deadlock detected
DETAIL:  Process 23976 waits for ShareLock on transaction 37761211; blocked by process 24090.
	Process 24090 waits for ShareLock on transaction 37761182; blocked by process 23976.
	Process 23976: select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result
	Process 24090: select * from public.updatevmdevice(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS varchar),CAST ($4 AS varchar),CAST ($5 AS varchar),CAST ($6 AS int4),CAST ($7 AS text),CAST ($8 AS bool),CAST ($9 AS bool),CAST ($10 AS bool),CAST ($11 AS varchar),CAST ($12 AS text),CAST ($13 AS uuid),CAST ($14 AS varchar),CAST ($15 AS varchar)) as result
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE vm_device
	    SET boot_order = v_boot_order,
	        _update_date = current_timestamp
	    WHERE device_id = v_device_id
	        AND vm_id = v_vm_id"
	PL/pgSQL function updatevmdevicebootorder(uuid,uuid,integer) line 3 at SQL statement
STATEMENT:  select * from public.updatevmdevicebootorder(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS int4)) as result


It looks like the execution of multiple UpdateVmDeviceBootOrder calls in a batch may collide with an UpdateVmDevice call running in parallel.
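
To make the failure mode concrete, below is a minimal JDBC sketch of the lock-ordering pattern the logs suggest: two concurrent transactions update the same two vm_device rows in opposite order, so each ends up waiting for a row lock the other already holds and PostgreSQL aborts one of them. This is an illustration only, not ovirt-engine code; the JDBC URL, credentials and UUIDs are placeholders, and it assumes the referenced rows already exist in vm_device.

// Minimal JDBC sketch of the suspected lock-ordering problem. Placeholders:
// connection URL/credentials and the UUIDs; the target rows are assumed to exist.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class VmDeviceDeadlockSketch {

    private static final String URL = "jdbc:postgresql://localhost/engine"; // placeholder
    private static final String SQL =
            "UPDATE vm_device SET boot_order = ?, _update_date = current_timestamp "
          + "WHERE device_id = ? AND vm_id = ?";

    // Updates the given devices of one VM inside a single transaction,
    // taking row locks in the order the array is passed in.
    static void updateDevices(UUID vmId, UUID[] deviceIds, int bootOrder) throws Exception {
        try (Connection con = DriverManager.getConnection(URL, "engine", "secret")) {
            con.setAutoCommit(false);                  // row locks are held until commit
            try (PreparedStatement ps = con.prepareStatement(SQL)) {
                for (UUID deviceId : deviceIds) {
                    ps.setInt(1, bootOrder);
                    ps.setObject(2, deviceId);
                    ps.setObject(3, vmId);
                    ps.executeUpdate();                // locks this device's row
                    Thread.sleep(100);                 // widen the race window
                }
            }
            con.commit();
        }
    }

    private static void runQuietly(UUID vmId, UUID[] deviceOrder, int bootOrder) {
        try {
            updateDevices(vmId, deviceOrder, bootOrder);
        } catch (Exception e) {
            e.printStackTrace();                       // one side fails with SQLSTATE 40P01
        }
    }

    public static void main(String[] args) throws Exception {
        UUID vmId = UUID.randomUUID();                 // placeholder ids
        UUID diskA = UUID.randomUUID();
        UUID diskB = UUID.randomUUID();

        // Transaction 1 locks diskA's row then diskB's; transaction 2 does the opposite.
        Thread t1 = new Thread(() -> runQuietly(vmId, new UUID[] {diskA, diskB}, 1));
        Thread t2 = new Thread(() -> runQuietly(vmId, new UUID[] {diskB, diskA}, 2));
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}

With this schedule PostgreSQL reports the same "Process X waits for ShareLock on transaction ...; blocked by process Y" cycle seen above and, after deadlock_timeout (1s by default), aborts one of the two transactions; in the engine that abort is what rolled back DetachDiskFromVmCommand.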


Version-Release number of selected component (if applicable):

ovirt-engine-4.1.10.3-0.1.el7.noarch


How reproducible:

Quite frequently while running a Commvault VM backup process.

(Originally by Sergio Lopez Pascual)
Comment 4 RHV Bugzilla Automation and Verification Bot 2018-04-16 13:38:00 EDT
The issue should be already resolved in 4.2 as part of:
https://gerrit.ovirt.org/#/c/71346/
https://gerrit.ovirt.org/#/c/71333/

Since the change seems too major for backporting to 4.1, moving to ON_QA for verification.

(Originally by Daniel Erez)
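
For context on the fix: the 4.1 backport patches listed under External Trackers ("core: do not persist VmDevice#bootOrder" and "core: always compute boot order on run vm") move away from writing boot_order back to the vm_device table on every plug/unplug and instead assign the boot order in memory when the device list is built at VM start, which removes the batched updatevmdevicebootorder calls that collided with UpdateVmDevice. The sketch below only illustrates that general idea; the class, fields and ordering rule are simplified stand-ins, not the actual ovirt-engine code.

// Hedged sketch: boot order is assigned transiently per run and never persisted,
// so hot-plug/unplug no longer has to rewrite boot_order rows in the database.
// VmDeviceSketch is a stand-in type; the ordering rule is a placeholder for the
// engine's real boot-sequence logic.
import java.util.Comparator;
import java.util.List;

final class VmDeviceSketch {
    final String id;
    final boolean managed;
    final boolean bootable;
    int bootOrder;          // computed per run in this sketch, never written to the DB

    VmDeviceSketch(String id, boolean managed, boolean bootable) {
        this.id = id;
        this.managed = managed;
        this.bootable = bootable;
    }
}

final class BootOrderSketch {
    // Assigns boot order to a VM's devices just before launch.
    static void assignBootOrder(List<VmDeviceSketch> devices) {
        devices.sort(Comparator.comparing((VmDeviceSketch d) -> d.id)); // deterministic tie-break
        int next = 1;
        for (VmDeviceSketch d : devices) {
            d.bootOrder = (d.managed && d.bootable) ? next++ : 0;
        }
    }
}

Note that of the two 4.1 backports in the External Trackers above, only "core: do not persist VmDevice#bootOrder" was merged; the sketch is only meant to show why dropping the per-unplug boot-order writes removes the conflicting batch.
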
Comment 10 RHV Bugzilla Automation and Verification Bot 2018-04-16 13:38:32 EDT
Has QE reproduced this?

(Originally by Yaniv Kaul)
Comment 11 RHV Bugzilla Automation and Verification Bot 2018-04-16 13:38:39 EDT
I don't see clear steps to reproduce here, so I can't tell if we have this tested.

(Originally by Elad Ben Aharon)
Comment 19 RHV Bugzilla Automation and Verification Bot 2018-04-16 13:39:25 EDT
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops@redhat.com

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops@redhat.com

(Originally by rhv-bugzilla-bot)
Comment 20 RHV Bugzilla Automation and Verification Bot 2018-04-16 13:39:31 EDT
Execute all disk hotplug automation tests for verification.

(Originally by Elad Ben Aharon)
Comment 24 Raz Tamir 2018-04-23 08:36:05 EDT
acked
Comment 25 Elad 2018-04-23 12:08:20 EDT
Executed storage automation hotplug/unplug disk sanity tests over iSCSI and NFS. Passed.
For reference, these are the test cases in Polarion that were executed:

TestCase5033
TestCase5034
TestCase5035
TestCase5036
TestCase5041
TestCase6231


Used:
rhevm-4.1.11.2-0.1.el7.noarch
vdsm-4.19.51-1.el7ev.x86_64
libvirt-daemon-3.9.0-14.el7_5.3.x86_64
Comment 26 Arik 2018-04-23 12:19:27 EDT
(In reply to Elad from comment #25)
> Executed storage automation hotplug/unplug disk sanity over iscsi and nfs.
> Passed
> For a reference, these are the tests cases in Polarion that were executed:
> 
> TestCase5033
> TestCase5034
> TestCase5035
> TestCase5036
> TestCase5041
> TestCase6231
> 
> 
> Used:
> rhevm-4.1.11.2-0.1.el7.noarch
> vdsm-4.19.51-1.el7ev.x86_64
> libvirt-daemon-3.9.0-14.el7_5.3.x86_64

The fix also changes the way the boot sequence is set on run-VM.
The code that has been backported has been working fine on the 4.2 branch for quite some time, but don't we want at least some minimal verification that it also works as expected on 4.1?
Comment 29 errata-xmlrpc 2018-04-24 11:30:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1219
Comment 30 Israel Pinto 2018-05-16 09:57:49 EDT
(In reply to Arik from comment #26)
> (In reply to Elad from comment #25)
> > Executed storage automation hotplug/unplug disk sanity over iscsi and nfs.
> > Passed
> > For a reference, these are the tests cases in Polarion that were executed:
> > 
> > TestCase5033
> > TestCase5034
> > TestCase5035
> > TestCase5036
> > TestCase5041
> > TestCase6231
> > 
> > 
> > Used:
> > rhevm-4.1.11.2-0.1.el7.noarch
> > vdsm-4.19.51-1.el7ev.x86_64
> > libvirt-daemon-3.9.0-14.el7_5.3.x86_64
> 
> The fix also changes the way the boot sequence is set on run-VM.
> The code that has been backported works fine on the 4.2 branch for quite
> some time, but don't we want to have at least some minimal verification it
> also works as expected on 4.1?

Verified with Run Once and Start VM using all 3 boot options (HD, CD, PXE) - all passed.
