Created attachment 1275242 [details] engine and vdsm logs Description of problem: In a basic DR flow, when detaching a storage domain from a DC with compatibility version lower than 4.1, and attaching it to a 4.1 DC, the operation fails after the OVF_STORE disk is being registered: 2017-04-30 09:31:03,502+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-9) [75185110] EVENT_ID: USER_REGISTER_DISK_FINISHED_SUCCESS(378), Correlation ID: 75185110, Job ID: 209ebced-d9fb-42ab-964e-4cd6314108d8, Call Stack: null, Custom Event ID: -1, Message: Disk 'OVF_STORE' has been successfully registered as a floating disk. 2017-04-30 09:31:03,505+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (org.ovirt.thread.pool-7-thread-9) [75185110] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction Version-Release number of selected component (if applicable): ovirt-engine-4.1.2-0.1.el7 vdsm-4.19.11-1.el7 How reproducible: 100% Steps to Reproduce: 1. Create a DC and cluster with compatibility version 3.6/4.0 2. Add host to the cluster 3. Create a storage domain and attach it 4. Deactivate and detach it 5. Create a new 4.1 DC and cluster, add host 6. Attach the storage domain from (3) to the new DC Actual results: The operation fails and the storage domain remains locked Expected results: Additional info:
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Raz, do you have the full VDSM logs? If so can u please attach those to the bug.
Maor, What do you mean by full vdsm.log? The attached log is not ok?
I see that the VDSM log starts at 2017-04-30 09:01:02 and finishes at 2017-04-30 09:34:31 While the engine log starts at 2017-04-29 03:28:32,233+03 There are some exceptions which I wanted to investigate to understand if those were related... I've been trying to reproduce this bug on master and on ovirt-engine-4.1.2.1 (using commit c003381748861b9c42cd301c35f2a337e9b954e0) The following reproduce steps seems to work on both my envs: 1. Create a DC and cluster with compatibility version 3.6/4.0 2. Add host to the cluster 3. Create a storage domain and attach it 4. Deactivate and detach it 5. Create a new 4.1 DC and cluster, add host 6. Attach the storage domain from (3) to the new DC Are there any additional steps or extra information that should be added? Was this exception also reproduced manually?
The steps are correct and it also reproduced manually
Created attachment 1275938 [details] new engine and vdsm logs Reproduced in my environment: 2017-05-03 19:30:58,377+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-40) [2a97b3dc] EVENT_ID: USER_REGISTER_DISK_FINISHED_SUCCESS(378), Correlation ID: 2a97b3dc, Job ID: 89e8ac49-dd0f-4f86-8a1b-193506aed0d7, Call Stack: null, Custom Event ID: -1, Message: Disk 'OVF_STORE' has been successfully registered as a floating disk. 2017-05-03 19:30:58,388+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (org.ovirt.thread.pool-7-thread-40) [2a97b3dc] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction No errors in vdsm.log
Thanks for the logs Raz! From the new logs it seems that there were no related exceptions before the detach. I still was not able to reproduce this on my both env (also tried now with iSCSI), it might be related to a different engine version. I suggest that tomorrow I will try to check it directly on your env. Thanks for the quick response!
Raz, what version of postgres is in use? I have an idea of the cause for the issue (related to the changes that were made as part of I45f10a82fdf0573). Additionally, can you attach the server.log as well? Regards, Liron
(In reply to Liron Aravot from comment #8) > Raz, what version of postgres is in use? I have an idea of the cause for the > issue (related to the changes that were made as part of I45f10a82fdf0573). It is not related since it was reproduced on a storage domain without any QCOW volumes (only OVF disks). I've managed to reproduce this only on Raz env. It looks like a dead lock (You can see by the 5 minutes that passes when starting to fetch OVF files from tar file) : 2017-05-04 12:11:43,677+03 INFO [org.ovirt.engine.core.utils.archivers.tar.TarInMemoryExport] (default task-3) [496fc47d] Start fetching OVF files from tar file The time of the exception: 2017-05-04 12:16:43,761+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (default task-3) [29d6d0a9] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction Here also : 2017-05-03 19:25:58,316+03 INFO [org.ovirt.engine.core.utils.archivers.tar.TarInMemoryExport] (org.ovirt.thread.pool-7-thread-40) [34d3bce8] Start fetching OVF files from tar file and the exception exactly after 5 minutes: 2017-05-03 19:30:58,388+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (org.ovirt.thread.pool-7-thread-40) [2a97b3dc] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction
*** Bug 1447912 has been marked as a duplicate of this bug. ***
It looks like the bug is an exception in postgres (see [1]), it seems like image_storage_domain_map is blocked by updatestorage_domain_static. From the thread dump (see [2]) it looks like there were two new transaction which were opened in the same thread. The first one is on AttachStorageDomainToPoolCommand.executeCommand and the other one is on RegisterDiskCommand in BaseImagesCommand.addDiskImageToDb. This could be related to the postgres server. This doesn't seem to reproduce on new postgresql server version (That is probably the reason I could not manage to reproduce it on postgres 9.5.6) On the other hand it does reproduce on postgres 9.2.18. The fix which might fix that issue for now is to change the transaction of RegisterDisk to acquired instead of new [1] engine=# SELECT blocked_locks.pid AS blocked_pid, blocked_activity.usename AS blocked_user, blocking_locks.pid AS blocking_pid, blocking_activity.usename AS blocking_user, blocked_activity.query AS blocked_statement, blocking_activity.query AS current_statement_in_blocking_process FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED; blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_statement | current_statemen t_in_blocking_process -------------+--------------+--------------+---------------+------------------------------------------------------------------------+--------------------------------------------------------- -------------------------------------------------------------- 2490 | engine | 2555 | engine | select * from insertimage_storage_domain_map($1, $2, $3, $4) as result | select * from updatestorage_domain_static($1, $2, $3, $4 , $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15) as result (1 row) [2] "default task-11" #166 prio=5 os_prio=0 tid=0x00007f6f2068c000 nid=0xa34 runnable [0x00007f6efbfc5000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112) at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71) at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:269) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1700) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255) - locked <0x00000000e26b4ba0> (a org.postgresql.core.v3.QueryExecutorImpl) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:410) at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.execute(CachedPreparedStatement.java:303) at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.execute(WrappedPreparedStatement.java:442) at org.springframework.jdbc.core.JdbcTemplate$6.doInCallableStatement(JdbcTemplate.java:1133) at org.springframework.jdbc.core.JdbcTemplate$6.doInCallableStatement(JdbcTemplate.java:1130) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:1078) at org.springframework.jdbc.core.JdbcTemplate.call(JdbcTemplate.java:1130) at org.springframework.jdbc.core.simple.AbstractJdbcCall.executeCallInternal(AbstractJdbcCall.java:405) at org.springframework.jdbc.core.simple.AbstractJdbcCall.doExecute(AbstractJdbcCall.java:365) at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198) at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135) at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:130) at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeModification(SimpleJdbcCallsHandler.java:76) at org.ovirt.engine.core.dao.ImageStorageDomainMapDaoImpl.save(ImageStorageDomainMapDaoImpl.java:20) at org.ovirt.engine.core.dao.ImageStorageDomainMapDaoImpl.save(ImageStorageDomainMapDaoImpl.java:14) at org.ovirt.engine.core.bll.storage.disk.image.BaseImagesCommand.addDiskImageToDb(BaseImagesCommand.java:281) at org.ovirt.engine.core.bll.storage.disk.image.RegisterDiskCommand.lambda$executeCommand$0(RegisterDiskCommand.java:131) at org.ovirt.engine.core.bll.storage.disk.image.RegisterDiskCommand$$Lambda$578/117466492.runInTransaction(Unknown Source) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) at org.ovirt.engine.core.bll.storage.disk.image.RegisterDiskCommand.executeCommand(RegisterDiskCommand.java:122) at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1451) at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397) at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:511) at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:493) at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:697) at sun.reflect.GeneratedMethodAccessor129.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437) at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:70) at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.doMethodInterception(Jsr299BindingsInterceptor.java:80) at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.processInvocation(Jsr299BindingsInterceptor.java:93) at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:63) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437) at org.jboss.weld.ejb.AbstractEJBRequestScopeActivationInterceptor.aroundInvoke(AbstractEJBRequestScopeActivationInterceptor.java:64) at org.jboss.as.weld.ejb.EjbRequestScopeActivationInterceptor.processInvocation(EjbRequestScopeActivationInterceptor.java:83) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ee.concurrent.ConcurrentContextInterceptor.processInvocation(ConcurrentContextInterceptor.java:45) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:52) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.component.singleton.SingletonComponentInstanceAssociationInterceptor.processInvocation(SingletonComponentInstanceAssociationInterceptor.java:53) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:254) at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:376) at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:243) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.component.invocationmetrics.WaitTimeInterceptor.processInvocation(WaitTimeInterceptor.java:43) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.security.SecurityContextInterceptor.processInvocation(SecurityContextInterceptor.java:100) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.deployment.processors.StartupAwaitInterceptor.processInvocation(StartupAwaitInterceptor.java:22) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:64) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:66) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.ContextClassLoaderInterceptor.processInvocation(ContextClassLoaderInterceptor.java:64) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.InterceptorContext.run(InterceptorContext.java:356) at org.wildfly.security.manager.WildFlySecurityManager.doChecked(WildFlySecurityManager.java:636) at org.jboss.invocation.AccessCheckingInterceptor.processInvocation(AccessCheckingInterceptor.java:61) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.InterceptorContext.run(InterceptorContext.java:356) at org.jboss.invocation.PrivilegedWithCombinerInterceptor.processInvocation(PrivilegedWithCombinerInterceptor.java:80) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:198) at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:185) at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340) at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61) at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:73) at org.ovirt.engine.core.bll.interfaces.BackendInternal$$$view2.runInternalAction(Unknown Source) at sun.reflect.GeneratedMethodAccessor128.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.jboss.weld.util.reflection.Reflections.invokeAndUnwrap(Reflections.java:433) at org.jboss.weld.bean.proxy.EnterpriseBeanProxyMethodHandler.invoke(EnterpriseBeanProxyMethodHandler.java:128) at org.jboss.weld.bean.proxy.EnterpriseTargetBeanInstance.invoke(EnterpriseTargetBeanInstance.java:56) at org.jboss.weld.bean.proxy.ProxyMethodHandler.invoke(ProxyMethodHandler.java:100) at org.ovirt.engine.core.bll.BackendCommandObjectsHandler$BackendInternal$BackendLocal$2049259618$Proxy$_$$_Weld$EnterpriseProxy$.runInternalAction(Unknown Source) at org.ovirt.engine.core.bll.CommandBase.runInternalAction(CommandBase.java:2452) at org.ovirt.engine.core.bll.storage.StorageHandlingCommandBase.registerAllOvfDisks(StorageHandlingCommandBase.java:370) at org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand.lambda$executeCommand$1(AttachStorageDomainToPoolCommand.java:176) at org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand$$Lambda$577/775861765.runInTransaction(Unknown Source) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) at org.ovirt.engine.core.bll.storage.domain.StorageDomainCommandBase.executeInNewTransaction(StorageDomainCommandBase.java:418) at org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand.executeCommand(AttachStorageDomainToPoolCommand.java:163) at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251) at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391) at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164) at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103) at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1451) .....
(In reply to Maor from comment #9) > (In reply to Liron Aravot from comment #8) > > Raz, what version of postgres is in use? I have an idea of the cause for the > > issue (related to the changes that were made as part of I45f10a82fdf0573). > > It is not related since it was reproduced on a storage domain without any > QCOW volumes (only OVF disks). It's related - since in that commit RegisterDiskCommand was changed to be non transactive. I've asked Raz to confirm the postgres version before writing this as the lock mechanism was changed in the newer postgres version to use more granular locking than row level locking. The following flow now happens: 1. AttachStorageDomainToPool - starts a transaction and updates the domain format/status, in earlier postgres versions that will trigger a lock on the domain row in storage_domain_static. 2. The transaction is suspended for the RegisterDiskCommand execution (as now it's marked as non transactive). 3. RegisterDiskCommand attempts to insert a new image_storage_domain_map record which will attempt to acquire a lock on the storage_domain_static. So we have the flow waiting for RegisterDiskCommand to end while the RegisterDiskCommand is waiting for the lock held by the flow - that's a deadlock. As for the fix, we have two issues - 1. The deadlock described above 2. If the engine is being restarted during the AttachStorageDomain execution, the disks that were registered will remain in the db (as the RegiserDiskCommand transaction were commited). To immediately fix the deadlock we can change the order between the storage domain related updates and the RegisterDisk command calls, the ultimate solution should be to use compensation in RegisterDisk and to remove the calls out of the transaction as they may lead to transaction timeout.
Allon - platform wise, can we increase the version of our postgres we require as dependency?
Already reply to Maor about the needinfo
(In reply to Liron Aravot from comment #12) > (In reply to Maor from comment #9) > > (In reply to Liron Aravot from comment #8) > > > Raz, what version of postgres is in use? I have an idea of the cause for the > > > issue (related to the changes that were made as part of I45f10a82fdf0573). > > > > It is not related since it was reproduced on a storage domain without any > > QCOW volumes (only OVF disks). > > It's related - since in that commit RegisterDiskCommand was changed to be > non transactive. Yes, I thought you meant the update of the qcow version first. The change is indeed related since it changed the RegisterDiskCommand to nontransaction as I also sent it in the patch. > I've asked Raz to confirm the postgres version before writing this as the > lock mechanism was changed in the newer postgres version to use more > granular locking than row level locking. > > The following flow now happens: > 1. AttachStorageDomainToPool - starts a transaction and updates the domain > format/status I don't think that status is related. image_storage_domain_map has foreign key to storage_domain_static = fk_image_storage_domain_map_storage_domain_static There is no foreign key between storage_pool_iso_map and image_storage_domain_map , in earlier postgres versions that will trigger a lock on the > domain row in storage_domain_static. > 2. The transaction is suspended for the RegisterDiskCommand execution (as > now it's marked as non transactive). > 3. RegisterDiskCommand attempts to insert a new image_storage_domain_map > record which will attempt to acquire a lock on the storage_domain_static. > > So we have the flow waiting for RegisterDiskCommand to end while the > RegisterDiskCommand is waiting for the lock held by the flow - that's a > deadlock. > > As for the fix, we have two issues - > 1. The deadlock described above > 2. If the engine is being restarted during the AttachStorageDomain > execution, the disks that were registered will remain in the db (as the > RegiserDiskCommand transaction were commited). > > To immediately fix the deadlock we can change the order between the storage > domain related updates and the RegisterDisk command calls, the ultimate > solution should be to use compensation in RegisterDisk and to remove the > calls out of the transaction as they may lead to transaction timeout. I though about changing the order in AttachStorageDomain but I don't think this is a safe solution for the upcoming build, that is why I posted this fix. I agree that, for the long run, the appropriate approach should be using compensation with RegisterDisk.
(In reply to Liron Aravot from comment #13) > Allon - platform wise, can we increase the version of our postgres we > require as dependency? Good question. Martin - I know there's been some thought around this - can you share some details?
(In reply to Allon Mureinik from comment #16) > (In reply to Liron Aravot from comment #13) > > Allon - platform wise, can we increase the version of our postgres we > > require as dependency? > Good question. > Martin - I know there's been some thought around this - can you share some > details? Plan is to use PostgreSQL 9.5 from software collections for oVirt 4.2 (integration + infra P1 feature). But in 4.1 we can use only PostgreSQL 9.2
(In reply to Martin Perina from comment #17) > (In reply to Allon Mureinik from comment #16) > > (In reply to Liron Aravot from comment #13) > > > Allon - platform wise, can we increase the version of our postgres we > > > require as dependency? > > Good question. > > Martin - I know there's been some thought around this - can you share some > > details? > > Plan is to use PostgreSQL 9.5 from software collections for oVirt 4.2 > (integration + infra P1 feature). But in 4.1 we can use only PostgreSQL 9.2 Thanks Martin. While moving to PG 9.5 in oVirt 4.2 sounds like a great idea, as you noted, it's no help here, and we must solve this bug within PG 9.2's constraints.
Verified on rhevm-4.1.3.1-0.1.el7