Bug 1446878 - Attaching storage domain with lower compatibility version to 4.1 DC fails
Summary: Attaching storage domain with lower compatibility version to 4.1 DC fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.2
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ovirt-4.1.3
: 4.1.3
Assignee: Maor
QA Contact: Raz Tamir
URL:
Whiteboard:
: 1447912 (view as bug list)
Depends On:
Blocks: 1451112
TreeView+ depends on / blocked
 
Reported: 2017-04-30 06:46 UTC by Raz Tamir
Modified: 2017-07-06 13:22 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-06 13:22:39 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments (Terms of Use)
engine and vdsm logs (1.01 MB, application/x-gzip)
2017-04-30 06:46 UTC, Raz Tamir
no flags Details
new engine and vdsm logs (170.09 KB, application/x-gzip)
2017-05-03 16:36 UTC, Raz Tamir
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 76475 0 master MERGED core: Register OVF_STORE disk at the end of the transaction. 2020-04-20 11:28:18 UTC
oVirt gerrit 77000 0 ovirt-engine-4.1 MERGED core: Register OVF_STORE disk at the end of the transaction. 2020-04-20 11:28:18 UTC

Description Raz Tamir 2017-04-30 06:46:37 UTC
Created attachment 1275242 [details]
engine and vdsm logs

Description of problem:
In a basic DR flow, when detaching a storage domain from a DC with compatibility version lower than 4.1, and attaching it to a 4.1 DC, the operation fails after the OVF_STORE disk is being registered:

2017-04-30 09:31:03,502+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-9) [75185110] EVENT_ID: USER_REGISTER_DISK_FINISHED_SUCCESS(378), Correlation ID: 75185110, Job ID: 209ebced-d9fb-42ab-964e-4cd6314108d8, Call Stack: null, Custom Event ID: -1, Message: Disk 'OVF_STORE' has been successfully registered as a floating disk.
2017-04-30 09:31:03,505+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (org.ovirt.thread.pool-7-thread-9) [75185110] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction



Version-Release number of selected component (if applicable):
ovirt-engine-4.1.2-0.1.el7
vdsm-4.19.11-1.el7


How reproducible:
100%

Steps to Reproduce:
1. Create a DC and cluster with compatibility version 3.6/4.0
2. Add host to the cluster
3. Create a storage domain and attach it
4. Deactivate and detach it
5. Create a new 4.1 DC and cluster, add host
6. Attach the storage domain from (3) to the new DC


Actual results:
The operation fails and the storage domain remains locked


Expected results:


Additional info:

Comment 1 Red Hat Bugzilla Rules Engine 2017-04-30 14:32:06 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Maor 2017-05-03 14:47:00 UTC
Raz, do you have the full VDSM logs?
If so can u please attach those to the bug.

Comment 3 Raz Tamir 2017-05-03 16:05:02 UTC
Maor,
What do you mean by full vdsm.log?
The attached log is not ok?

Comment 4 Maor 2017-05-03 16:12:51 UTC
I see that the VDSM log 
starts at       2017-04-30 09:01:02
and finishes at 2017-04-30 09:34:31
While the engine log starts at 2017-04-29 03:28:32,233+03
There are some exceptions which I wanted to investigate to understand if those were related...

I've been trying to reproduce this bug on master and on ovirt-engine-4.1.2.1 (using commit c003381748861b9c42cd301c35f2a337e9b954e0)

The following reproduce steps seems to work on both my envs:
  1. Create a DC and cluster with compatibility version 3.6/4.0
  2. Add host to the cluster
  3. Create a storage domain and attach it
  4. Deactivate and detach it
  5. Create a new 4.1 DC and cluster, add host
  6. Attach the storage domain from (3) to the new DC

Are there any additional steps or extra information that should be added?
Was this exception also reproduced manually?

Comment 5 Raz Tamir 2017-05-03 16:17:52 UTC
The steps are correct and it also reproduced manually

Comment 6 Raz Tamir 2017-05-03 16:36:45 UTC
Created attachment 1275938 [details]
new engine and vdsm logs

Reproduced in my environment:

2017-05-03 19:30:58,377+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-40) [2a97b3dc] EVENT_ID: USER_REGISTER_DISK_FINISHED_SUCCESS(378), Correlation ID: 2a97b3dc, Job ID: 89e8ac49-dd0f-4f86-8a1b-193506aed0d7, Call Stack: null, Custom Event ID: -1, Message: Disk 'OVF_STORE' has been successfully registered as a floating disk.
2017-05-03 19:30:58,388+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (org.ovirt.thread.pool-7-thread-40) [2a97b3dc] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction


No errors in vdsm.log

Comment 7 Maor 2017-05-03 17:21:30 UTC
Thanks for the logs Raz!
From the new logs it seems that there were no related exceptions before the detach.
I still was not able to reproduce this on my both env (also tried now with iSCSI), it might be related to a different engine version.
I suggest that tomorrow I will try to check it directly on your env.

Thanks for the quick response!

Comment 8 Liron Aravot 2017-05-04 01:16:35 UTC
Raz, what version of postgres is in use? I have an idea of the cause for the issue (related to the changes that were made as part of I45f10a82fdf0573).
Additionally, can you attach the server.log as well?

Regards,
Liron

Comment 9 Maor 2017-05-04 09:21:23 UTC
(In reply to Liron Aravot from comment #8)
> Raz, what version of postgres is in use? I have an idea of the cause for the
> issue (related to the changes that were made as part of I45f10a82fdf0573).

It is not related since it was reproduced on a storage domain without any QCOW volumes (only OVF disks).

I've managed to reproduce this only on Raz env.
It looks like a dead lock (You can see by the 5 minutes that passes when starting to fetch OVF files from tar file) :

2017-05-04 12:11:43,677+03 INFO  [org.ovirt.engine.core.utils.archivers.tar.TarInMemoryExport] (default task-3) [496fc47d] Start fetching OVF files from tar file


The time of the exception:

2017-05-04 12:16:43,761+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (default task-3) [29d6d0a9] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not 
get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction


Here also :
2017-05-03 19:25:58,316+03 INFO  [org.ovirt.engine.core.utils.archivers.tar.TarInMemoryExport] (org.ovirt.thread.pool-7-thread-40) [34d3bce8] Start fetching OVF files from tar file

and the exception exactly after 5 minutes:

2017-05-03 19:30:58,388+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (org.ovirt.thread.pool-7-thread-40) [2a97b3dc] Exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction

Comment 10 Tal Nisan 2017-05-04 11:34:34 UTC
*** Bug 1447912 has been marked as a duplicate of this bug. ***

Comment 11 Maor 2017-05-04 14:29:53 UTC
It looks like the bug is an exception in postgres (see [1]), it seems like image_storage_domain_map is blocked by updatestorage_domain_static.
From the thread dump (see [2]) it looks like there were two new transaction which were opened in the same thread.
The first one is on AttachStorageDomainToPoolCommand.executeCommand
and the other one is on RegisterDiskCommand in BaseImagesCommand.addDiskImageToDb.

This could be related to the postgres server.
This doesn't seem to reproduce on new postgresql server version (That is probably the reason I could not manage to reproduce it on postgres 9.5.6)
On the other hand it does reproduce on postgres 9.2.18.

The fix which might fix that issue for now is to change the transaction of RegisterDisk to acquired instead of new


[1]
engine=# SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks 
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid
 
    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.GRANTED;
 blocked_pid | blocked_user | blocking_pid | blocking_user |                           blocked_statement                            |                                         current_statemen
t_in_blocking_process                                         
-------------+--------------+--------------+---------------+------------------------------------------------------------------------+---------------------------------------------------------
--------------------------------------------------------------
        2490 | engine       |         2555 | engine        | select * from insertimage_storage_domain_map($1, $2, $3, $4) as result | select * from updatestorage_domain_static($1, $2, $3, $4
, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15) as result
(1 row)


[2]
"default task-11" #166 prio=5 os_prio=0 tid=0x00007f6f2068c000 nid=0xa34 runnable [0x00007f6efbfc5000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
        at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
        at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71)
        at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:269)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1700)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        - locked <0x00000000e26b4ba0> (a org.postgresql.core.v3.QueryExecutorImpl)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:410)
        at org.jboss.jca.adapters.jdbc.CachedPreparedStatement.execute(CachedPreparedStatement.java:303)
        at org.jboss.jca.adapters.jdbc.WrappedPreparedStatement.execute(WrappedPreparedStatement.java:442)
        at org.springframework.jdbc.core.JdbcTemplate$6.doInCallableStatement(JdbcTemplate.java:1133)
        at org.springframework.jdbc.core.JdbcTemplate$6.doInCallableStatement(JdbcTemplate.java:1130)
        at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:1078)
        at org.springframework.jdbc.core.JdbcTemplate.call(JdbcTemplate.java:1130)
        at org.springframework.jdbc.core.simple.AbstractJdbcCall.executeCallInternal(AbstractJdbcCall.java:405)
        at org.springframework.jdbc.core.simple.AbstractJdbcCall.doExecute(AbstractJdbcCall.java:365)
        at org.springframework.jdbc.core.simple.SimpleJdbcCall.execute(SimpleJdbcCall.java:198)
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:135)
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeImpl(SimpleJdbcCallsHandler.java:130)
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeModification(SimpleJdbcCallsHandler.java:76)
        at org.ovirt.engine.core.dao.ImageStorageDomainMapDaoImpl.save(ImageStorageDomainMapDaoImpl.java:20)
        at org.ovirt.engine.core.dao.ImageStorageDomainMapDaoImpl.save(ImageStorageDomainMapDaoImpl.java:14)
        at org.ovirt.engine.core.bll.storage.disk.image.BaseImagesCommand.addDiskImageToDb(BaseImagesCommand.java:281)
        at org.ovirt.engine.core.bll.storage.disk.image.RegisterDiskCommand.lambda$executeCommand$0(RegisterDiskCommand.java:131)
        at org.ovirt.engine.core.bll.storage.disk.image.RegisterDiskCommand$$Lambda$578/117466492.runInTransaction(Unknown Source)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202)
        at org.ovirt.engine.core.bll.storage.disk.image.RegisterDiskCommand.executeCommand(RegisterDiskCommand.java:122)
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251)
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391)
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1451)
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:397)
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13)
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:511)
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:493)
        at org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:697)
        at sun.reflect.GeneratedMethodAccessor129.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:70)
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.doMethodInterception(Jsr299BindingsInterceptor.java:80)
        at org.jboss.as.weld.ejb.Jsr299BindingsInterceptor.processInvocation(Jsr299BindingsInterceptor.java:93)
        at org.jboss.as.ee.component.interceptors.UserInterceptorFactory$1.processInvocation(UserInterceptorFactory.java:63)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.component.invocationmetrics.ExecutionTimeInterceptor.processInvocation(ExecutionTimeInterceptor.java:43)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:437)
        at org.jboss.weld.ejb.AbstractEJBRequestScopeActivationInterceptor.aroundInvoke(AbstractEJBRequestScopeActivationInterceptor.java:64)
        at org.jboss.as.weld.ejb.EjbRequestScopeActivationInterceptor.processInvocation(EjbRequestScopeActivationInterceptor.java:83)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ee.concurrent.ConcurrentContextInterceptor.processInvocation(ConcurrentContextInterceptor.java:45)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.InitialInterceptor.processInvocation(InitialInterceptor.java:21)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
        at org.jboss.as.ee.component.interceptors.ComponentDispatcherInterceptor.processInvocation(ComponentDispatcherInterceptor.java:52)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.component.singleton.SingletonComponentInstanceAssociationInterceptor.processInvocation(SingletonComponentInstanceAssociationInterceptor.java:53)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:254)
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:376)
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:243)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.component.interceptors.CurrentInvocationContextInterceptor.processInvocation(CurrentInvocationContextInterceptor.java:41)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.component.invocationmetrics.WaitTimeInterceptor.processInvocation(WaitTimeInterceptor.java:43)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.security.SecurityContextInterceptor.processInvocation(SecurityContextInterceptor.java:100)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.deployment.processors.StartupAwaitInterceptor.processInvocation(StartupAwaitInterceptor.java:22)
       at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:64)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:66)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.ContextClassLoaderInterceptor.processInvocation(ContextClassLoaderInterceptor.java:64)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.InterceptorContext.run(InterceptorContext.java:356)
        at org.wildfly.security.manager.WildFlySecurityManager.doChecked(WildFlySecurityManager.java:636)
        at org.jboss.invocation.AccessCheckingInterceptor.processInvocation(AccessCheckingInterceptor.java:61)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.InterceptorContext.run(InterceptorContext.java:356)
        at org.jboss.invocation.PrivilegedWithCombinerInterceptor.processInvocation(PrivilegedWithCombinerInterceptor.java:80)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
        at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:198)
        at org.jboss.as.ee.component.ViewDescription$1.processInvocation(ViewDescription.java:185)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:340)
        at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
        at org.jboss.as.ee.component.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:73)
        at org.ovirt.engine.core.bll.interfaces.BackendInternal$$$view2.runInternalAction(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor128.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.jboss.weld.util.reflection.Reflections.invokeAndUnwrap(Reflections.java:433)
        at org.jboss.weld.bean.proxy.EnterpriseBeanProxyMethodHandler.invoke(EnterpriseBeanProxyMethodHandler.java:128)
        at org.jboss.weld.bean.proxy.EnterpriseTargetBeanInstance.invoke(EnterpriseTargetBeanInstance.java:56)
        at org.jboss.weld.bean.proxy.ProxyMethodHandler.invoke(ProxyMethodHandler.java:100)
        at org.ovirt.engine.core.bll.BackendCommandObjectsHandler$BackendInternal$BackendLocal$2049259618$Proxy$_$$_Weld$EnterpriseProxy$.runInternalAction(Unknown Source)
        at org.ovirt.engine.core.bll.CommandBase.runInternalAction(CommandBase.java:2452)
        at org.ovirt.engine.core.bll.storage.StorageHandlingCommandBase.registerAllOvfDisks(StorageHandlingCommandBase.java:370)
        at org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand.lambda$executeCommand$1(AttachStorageDomainToPoolCommand.java:176)
        at org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand$$Lambda$577/775861765.runInTransaction(Unknown Source)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202)
        at org.ovirt.engine.core.bll.storage.domain.StorageDomainCommandBase.executeInNewTransaction(StorageDomainCommandBase.java:418)
        at org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand.executeCommand(AttachStorageDomainToPoolCommand.java:163)
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1251)
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1391)
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:2055)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:164)
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:103)
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1451)
.....

Comment 12 Liron Aravot 2017-05-04 18:21:02 UTC
(In reply to Maor from comment #9)
> (In reply to Liron Aravot from comment #8)
> > Raz, what version of postgres is in use? I have an idea of the cause for the
> > issue (related to the changes that were made as part of I45f10a82fdf0573).
> 
> It is not related since it was reproduced on a storage domain without any
> QCOW volumes (only OVF disks).

It's related - since in that commit RegisterDiskCommand was changed to be non transactive.
I've asked Raz to confirm the postgres version before writing this as the lock mechanism was changed in the newer postgres version to use more granular locking than row level locking.

The following flow now happens:
1. AttachStorageDomainToPool - starts a transaction and updates the domain format/status, in earlier postgres versions that will trigger a lock on the domain row in storage_domain_static.
2. The transaction is suspended for the RegisterDiskCommand execution (as now it's marked as non transactive).
3. RegisterDiskCommand attempts to insert a new image_storage_domain_map record which will attempt to acquire a lock on the storage_domain_static.

So we have the flow waiting for RegisterDiskCommand to end while the RegisterDiskCommand is waiting for the lock held by the flow - that's a deadlock.

As for the fix, we have two issues - 
1. The deadlock described above
2. If the engine is being restarted during the AttachStorageDomain execution, the disks that were registered will remain in the db (as the RegiserDiskCommand transaction were commited).

To immediately fix the deadlock we can change the order between the storage domain related updates and the RegisterDisk command calls, the ultimate solution should be to use compensation in RegisterDisk and to remove the calls out of the transaction as they may lead to transaction timeout.

Comment 13 Liron Aravot 2017-05-04 19:53:31 UTC
Allon - platform wise, can we increase the version of our postgres we require as dependency?

Comment 14 Raz Tamir 2017-05-04 21:27:27 UTC
Already reply to Maor about the needinfo

Comment 15 Maor 2017-05-05 08:15:26 UTC
(In reply to Liron Aravot from comment #12)
> (In reply to Maor from comment #9)
> > (In reply to Liron Aravot from comment #8)
> > > Raz, what version of postgres is in use? I have an idea of the cause for the
> > > issue (related to the changes that were made as part of I45f10a82fdf0573).
> > 
> > It is not related since it was reproduced on a storage domain without any
> > QCOW volumes (only OVF disks).
> 
> It's related - since in that commit RegisterDiskCommand was changed to be
> non transactive.

Yes, I thought you meant the update of the qcow version first.
The change is indeed related since it changed the RegisterDiskCommand to nontransaction as I also sent it in the patch.

> I've asked Raz to confirm the postgres version before writing this as the
> lock mechanism was changed in the newer postgres version to use more
> granular locking than row level locking.
> 
> The following flow now happens:
> 1. AttachStorageDomainToPool - starts a transaction and updates the domain
> format/status

I don't think that status is related.
image_storage_domain_map has foreign key to storage_domain_static =
fk_image_storage_domain_map_storage_domain_static
There is no foreign key between storage_pool_iso_map and image_storage_domain_map

, in earlier postgres versions that will trigger a lock on the
> domain row in storage_domain_static.
> 2. The transaction is suspended for the RegisterDiskCommand execution (as
> now it's marked as non transactive).
> 3. RegisterDiskCommand attempts to insert a new image_storage_domain_map
> record which will attempt to acquire a lock on the storage_domain_static.
> 
> So we have the flow waiting for RegisterDiskCommand to end while the
> RegisterDiskCommand is waiting for the lock held by the flow - that's a
> deadlock.
> 
> As for the fix, we have two issues - 
> 1. The deadlock described above
> 2. If the engine is being restarted during the AttachStorageDomain
> execution, the disks that were registered will remain in the db (as the
> RegiserDiskCommand transaction were commited).
> 
> To immediately fix the deadlock we can change the order between the storage
> domain related updates and the RegisterDisk command calls, the ultimate
> solution should be to use compensation in RegisterDisk and to remove the
> calls out of the transaction as they may lead to transaction timeout.

I though about changing the order in AttachStorageDomain but I don't think this is a safe solution for the upcoming build, that is why I posted this fix.
I agree that, for the long run, the appropriate approach should be using compensation with RegisterDisk.

Comment 16 Allon Mureinik 2017-05-08 09:45:17 UTC
(In reply to Liron Aravot from comment #13)
> Allon - platform wise, can we increase the version of our postgres we
> require as dependency?
Good question.
Martin - I know there's been some thought around this -  can you share some details?

Comment 17 Martin Perina 2017-05-17 19:22:23 UTC
(In reply to Allon Mureinik from comment #16)
> (In reply to Liron Aravot from comment #13)
> > Allon - platform wise, can we increase the version of our postgres we
> > require as dependency?
> Good question.
> Martin - I know there's been some thought around this -  can you share some
> details?

Plan is to use PostgreSQL 9.5 from software collections for oVirt 4.2 (integration + infra P1 feature). But in 4.1 we can use only PostgreSQL 9.2

Comment 18 Allon Mureinik 2017-05-17 22:28:23 UTC
(In reply to Martin Perina from comment #17)
> (In reply to Allon Mureinik from comment #16)
> > (In reply to Liron Aravot from comment #13)
> > > Allon - platform wise, can we increase the version of our postgres we
> > > require as dependency?
> > Good question.
> > Martin - I know there's been some thought around this -  can you share some
> > details?
> 
> Plan is to use PostgreSQL 9.5 from software collections for oVirt 4.2
> (integration + infra P1 feature). But in 4.1 we can use only PostgreSQL 9.2

Thanks Martin.

While moving to PG 9.5 in oVirt 4.2 sounds like a great idea, as you noted, it's no help here, and we must solve this bug within PG 9.2's constraints.

Comment 19 Raz Tamir 2017-06-02 09:51:34 UTC
Verified on rhevm-4.1.3.1-0.1.el7


Note You need to log in before you can comment on or make changes to this bug.