Bug 1291780 - OVF_STORE not updated when a lock is held on the domain after a 3.5 upgrade
Summary: OVF_STORE not updated when a lock is held on the domain after a 3.5 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.0.0-alpha
Target Release: 4.0.0
Assignee: Liron Aravot
QA Contact: Carlos Mestre González
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-12-15 15:18 UTC by Julio Entrena Perez
Modified: 2021-06-10 11:05 UTC
CC List: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 20:31:01 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2116541 0 None None None 2016-01-08 16:06:36 UTC
Red Hat Product Errata RHEA-2016:1743 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 21:54:01 UTC

Description Julio Entrena Perez 2015-12-15 15:18:02 UTC
Description of problem:
When a 3.5 SD is put into maintenance, the OVF_STORE fails to be updated, but the operation carries on and puts the SD into maintenance without issuing any warning to the user.
Once the SD is detached, all information about the VMs in the SD is lost (gone from the database and absent from the OVF_STORE).

Version-Release number of selected component (if applicable):
rhevm-backend-3.5.4.2-1.3.el6ev

How reproducible:
Unknown (observed once so far).

Steps to Reproduce:
1. Upgrade a DC to compatibility version 3.5 so the SD is able to have an OVF_STORE.
2. Put the SD into maintenance mode.
3.

Actual results:
Update of the OVF_STORE fails but SD is still deactivated.

Expected results:
SD fails to enter maintenance mode due to failure to update OVF_STORE.

Additional info:

- DC is updated to version 3.5 which in turn triggers the update of the SD:

2015-12-13 11:18:11,136 INFO  [org.ovirt.engine.core.bll.storage.UpdateStoragePoolCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Running command: UpdateStoragePoolCommand internal: false. Entities affected :  ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePoolAction group EDIT_STORAGE_POOL_CONFIGURATION with role type ADMIN
2015-12-13 11:18:11,265 INFO  [org.ovirt.engine.core.bll.storage.UpdateStoragePoolCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Updating storage domain b8b4c733-60d3-4f0c-9208-f4d0c846bdc3 (type Data) to format V3
2015-12-13 11:18:11,272 INFO  [org.ovirt.engine.core.bll.storage.UpdateStoragePoolCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Updating storage domain c1b29073-e30b-4398-a9ba-bf214e617491 (type Master) to format V3
2015-12-13 11:18:11,392 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpgradeStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] START, UpgradeStoragePoolVDSCommand( storagePoolId = 6b1fb84b-ef5a-4337-be90-008412c71fe0, poolVersion = 3), log id: 64fe29aa
2015-12-13 11:18:11,393 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpgradeStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Upgrading storage pool 6b1fb84b-ef5a-4337-be90-008412c71fe0 to version 3
2015-12-13 11:18:11,667 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UpgradeStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] FINISH, UpgradeStoragePoolVDSCommand, log id: 64fe29aa

- SD is put into maintenance mode.

2015-12-13 11:24:10,651 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (org.ovirt.thread.pool-7-thread-32) [78c37d18] Running command: DeactivateStorageDomainWithOvfUpdateCommand internal: false. Entities affected :  ID: c1b29073-e30b-4398-a9ba-bf214e617491 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

- That triggers an update of the OVF_STORE:

2015-12-13 11:24:10,677 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (org.ovirt.thread.pool-7-thread-32) [6fb70273] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool

- RHEV-M fails to acquire an exclusive lock on the DC (this message is repeated 1216 times over 24 seconds):

2015-12-13 11:33:43,300 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Failed to acquire lock and wait lock EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]
  
- The lock is released

2015-12-13 11:34:07,281 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (org.ovirt.thread.pool-7-thread-32) [6fb70273] Lock freed to object EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]
  
- The update of the OVF_STORE is retried:
  
2015-12-13 11:34:07,281 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool

- RHEV-M fails to acquire a shared lock on DC:

2015-12-13 11:34:07,340 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [2ac34bea] Failed to Acquire Lock to object EngineLock [exclusiveLocks= , sharedLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE]
  
- The update of the OVF_STORE fails:

2015-12-13 11:34:07,341 WARN  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [2ac34bea] CanDoAction of action ProcessOvfUpdateForStorageDomain failed for user admin@internal. Reasons: VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_OBJECT_LOCKED

- RHEV-M still goes ahead and sets SD to maintenance despite the OVF_STORE update failure above:

2015-12-13 11:34:07,476 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [6c8bc27] Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: c1b29073-e30b-4398-a9ba-bf214e617491 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2015-12-13 11:34:08,073 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-32) [6c8bc27] START, DeactivateStorageDomainVDSCommand( storagePoolId = 6b1fb84b-ef5a-4337-be90-008412c71fe0, ignoreFailoverLimit = false, storageDomainId = c1b29073-e30b-4398-a9ba-bf214e617491, masterDomainId = b8b4c733-60d3-4f0c-9208-f4d0c846bdc3, masterVersion = 2), log id: 352435e5

- An exclusive lock is released:

2015-12-13 11:34:08,108 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Lock freed to object EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]

Comment 2 Liron Aravot 2015-12-23 11:24:23 UTC
The issue seems to be in the lock manager infrastructure, or even deeper than that; perhaps we can check the installed Java version or whether it can be updated.

The domain is being deactivated at 2015-12-13 11:24:10,651. We first attempt to update the OVF data in the DB by executing ProcessOvfUpdateForStoragePoolCommand; this command attempts to acquire a lock with wait (if needed) using our LockManager:

2015-12-13 11:24:10,677 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (org.ovirt.thread.pool-7-thread-32) [6fb70273] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool

The previous logging related to a lock on that resource is from almost 30 minutes before, and the lock should be free, so the operation should succeed immediately.
2015-12-13 10:33:40,418 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-76) [5c02c867] Lock freed to object EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]



The thread gets stuck until the OVF update process runs and attempts to perform the OVF update for that pool as well; at that point the lock is also being attempted from another thread.

2015-12-13 11:33:43,300 INFO  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Failed to acquire lock and wait lock EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]

This message repeats itself multiple times; the thread should be awaiting on the condition (and I don't see logging for any released resources, though I don't have information on whether any other operations ran and took/released locks without logging, so perhaps that's worth checking as well).

Only then (after 10 minutes, and only after the OVF update process attempted to acquire the lock) does the ProcessOvfUpdateForStoragePoolCommand thread log that it failed to acquire the lock and continue.

2015-12-13 11:34:07,340 INFO  [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [2ac34bea] Failed to Acquire Lock to object EngineLock [exclusiveLocks= , sharedLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE]

The bug seems to be in the implementation of our lock manager or deeper than that; moving to infra for further investigation.
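
To illustrate the acquire-with-wait pattern involved here, below is a minimal, self-contained sketch of how such an in-memory lock manager typically works (a condition-variable wait loop). The class and method names are hypothetical - this is not the actual org.ovirt.engine.core.bll.lock.InMemoryLockManager code - but it shows where the repeated "Failed to acquire lock and wait lock" passes and the "Lock freed" signal fit.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch - not the real ovirt-engine lock manager.
public class SimpleLockManager {
    private final ReentrantLock globalLock = new ReentrantLock();
    private final Condition released = globalLock.newCondition();
    // key -> owner tag of the currently held exclusive lock
    private final Map<String, String> exclusiveLocks = new HashMap<>();

    // Try once without waiting (analogue of the non-blocking acquire).
    public boolean tryAcquire(String key, String owner) {
        globalLock.lock();
        try {
            if (exclusiveLocks.containsKey(key)) {
                return false; // caller would log "Failed to Acquire Lock to object ..."
            }
            exclusiveLocks.put(key, owner);
            return true;
        } finally {
            globalLock.unlock();
        }
    }

    // Acquire with wait: block on the condition until the holder releases.
    public void acquireWait(String key, String owner) throws InterruptedException {
        globalLock.lock();
        try {
            while (exclusiveLocks.containsKey(key)) {
                // Each unsuccessful pass corresponds to one
                // "Failed to acquire lock and wait lock EngineLock ..." log line.
                released.await();
            }
            exclusiveLocks.put(key, owner);
        } finally {
            globalLock.unlock();
        }
    }

    // Release and wake up waiters ("Lock freed to object EngineLock ...").
    public void release(String key) {
        globalLock.lock();
        try {
            exclusiveLocks.remove(key);
            released.signalAll();
        } finally {
            globalLock.unlock();
        }
    }
}

If a waiter misses the signal, or the release never happens, it stays blocked exactly as described above, which is why this points at the lock manager (or lower) rather than the command flow.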

Comment 3 Oved Ourfali 2015-12-23 11:29:23 UTC
Moti - can you take a look?

Comment 4 Oved Ourfali 2015-12-23 11:31:14 UTC
(In reply to Liron Aravot from comment #2)
> The issue seems to be in the lock manager infrastructure or even deeper than
> that, perhaps we can check the java version installed or it can be updated.
> 
> The domain is being deactivated on 2015-12-13 11:24:10,651 , we first
> attempt to update the ovf data in the db by executing the
> ProcessOvfUpdateForStoragePoolCommand, this command attempts to acquire a
> lock with wait (if needed) using our LockManager -
> 
> 2015-12-13 11:24:10,677 INFO 
> [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand]
> (org.ovirt.thread.pool-7-thread-32) [6fb70273] Running command: ProcessOvfUpd
> ateForStoragePoolCommand internal: true. Entities affected :  ID:
> 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool
> 
> the previous logging related to lock on that resource is from almost 30
> minutes before and it should be free so the operation should succeed
> immediately.
> 2015-12-13 10:33:40,418 INFO 
> [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand]
> (DefaultQuartzScheduler_Worker-76) [5c02c867] Lock freed to object EngineLock
>  [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE
> , sharedLocks= ]
> 
> 
> 
> the thread gets stuck till the ovf update process runs and attempts to
> perform ovf update for that pool as well, during that it's attempted to
> acquire the lock from another thread as well.
> 
> 2015-12-13 11:33:43,300 INFO 
> [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> (DefaultQuartzScheduler_Worker-13) [20f82bd3] Failed to acquire lock and
> wait lock EngineLock [exclusiveLocks= key:
> 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE
> 
> this message repeats itself multiple times, the thread should await on the
> condition (and i don't see logging for any released resources, though i
> don't have information if any other operations ran and took/released locks
> without logging so perhaps that's worth checking as well).
> 

Liron - Did we ask the customer to run with debug log?

Comment 5 Liron Aravot 2015-12-23 14:14:08 UTC
Adding a note here - as opposed to acquiring a lock without wait, when acquiring a lock with wait there is no log in CommandBase indicating that the lock was taken, so we can't tell whether and when ProcessOvfUpdateForStoragePoolCommand acquired the lock. I'll do some further digging on that as well.
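
For illustration, a minimal, self-contained sketch of the missing log line; the names are hypothetical and this is not the actual CommandBase code.

import java.util.concurrent.locks.ReentrantLock;
import java.util.logging.Logger;

// Hypothetical sketch - the real CommandBase/lock manager API is not shown here.
public class AcquireWithWaitLogging {
    private static final Logger log = Logger.getLogger("AcquireWithWaitLogging");
    private static final ReentrantLock resourceLock = new ReentrantLock();

    public static void acquireLockAndWait(String lockDescription) throws InterruptedException {
        log.info("Waiting for lock to object " + lockDescription);
        resourceLock.lockInterruptibly(); // blocks until the current holder releases
        // This is the missing piece: record the moment the wait actually ends,
        // so the log shows whether and when the command obtained the lock.
        log.info("Lock acquired (after wait) to object " + lockDescription);
    }

    public static void main(String[] args) throws InterruptedException {
        acquireLockAndWait("EngineLock [exclusiveLocks= key: <pool-id> value: OVF_UPDATE]");
        resourceLock.unlock();
    }
}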

Comment 6 Liron Aravot 2015-12-23 14:14:31 UTC
Oved, not that I'm aware of.

Comment 7 Moti Asayag 2015-12-23 15:59:38 UTC
Julio,

Could you try to set the engine log level to 'DEBUG' and try to reproduce the same scenario?

Looking at the log & code, there is no clear evidence that the problem resides in the locking management framework.
It could be that the attempt to get the exclusive lock took ~10 minutes until the former lock was released, which led to the extensive logging of failed attempts to acquire the lock.

I also opened Bug 1293944 to improve the logging of the common acquire-lock actions.

The loss of VM information should be further investigated by the storage team.
However, getting detailed logs will shed more light.

Comment 8 Liron Aravot 2015-12-24 16:07:17 UTC
Julio,
please provide the following information:
1. DB dump (if possible)
2. the output of the query:
select * from vdc_options where option_name like '%Ovf%';
3. I see in the log that the DC of the problematic deactivated domain was upgraded; can you elaborate on what the original version was and what the target version was?

4. output of:
select * from storage_pool where id = '6b1fb84b-ef5a-4337-be90-008412c71fe0';

thanks.

Comment 12 Liron Aravot 2015-12-31 09:40:34 UTC
Hi Julio,
Currently we can't diagnose the root cause of the issue; with the extended logging we should have more information.

Currently the behavior of the OVF update on deactivation is that we only attempt to perform the OVF update, but on failure the deactivation isn't stopped.
What we can do is to have that as a config value (whether the OVF update is a prerequisite before deactivating) or as an option when running the operation.
We can also add an audit log indicating that the update failed because of a concurrent update, which is what happened in this case (this is a very rare case).

Allon/Tal - let's decide which of those issues we want to solve and open bugs with the desired targets; it seems this bug can be closed, as we don't have enough info to proceed.
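
Roughly sketched, with assumed names (this is not actual engine code), the proposal would look like:

// Hypothetical sketch of the proposed behaviour - names are made up for illustration.
public class DeactivateWithOvfUpdateSketch {

    interface OvfUpdater { boolean updateOvfStores(String storageDomainId); }
    interface AuditLog   { void warn(String message); }

    private final OvfUpdater ovfUpdater;
    private final AuditLog auditLog;
    private final boolean ovfUpdateRequiredForMaintenance; // the proposed config value

    DeactivateWithOvfUpdateSketch(OvfUpdater updater, AuditLog audit, boolean required) {
        this.ovfUpdater = updater;
        this.auditLog = audit;
        this.ovfUpdateRequiredForMaintenance = required;
    }

    // Returns true if the domain may proceed to maintenance.
    boolean deactivate(String storageDomainId) {
        boolean ovfUpdated = ovfUpdater.updateOvfStores(storageDomainId);
        if (!ovfUpdated) {
            // The proposed audit entry, raised regardless of the config flag.
            auditLog.warn("OVF update failed for storage domain " + storageDomainId
                    + " before moving it to maintenance");
            if (ovfUpdateRequiredForMaintenance) {
                return false; // block the deactivation instead of silently continuing
            }
        }
        // ... the existing deactivation flow would run here
        return true;
    }
}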

Comment 15 Allon Mureinik 2016-01-04 09:52:32 UTC
(In reply to Liron Aravot from comment #12)
> Hi Julio,
> Currently we can't diagnose the root cause for the issue, with the extended
> logging we should have more information.
> 
> Currently the behavior of the OVF update on deactivation is that we only
> attempt to perform the ovf update, but on failure the deactivation isn't
> stopped.
> what we can do is to have that as a config value (whether the ovf update is a
> prerequisite before deactivating) or as an option when running the operation.
> we can also add an audit log indicating that the update failed because of a
> concurrent update which is what happened in this case (this is a very rare
> case).
> 
> Allon/Tal - let's decide which of those issues we want to solve and open
> bugs with the desired targets; it seems this bug can be closed, as we
> don't have enough info to proceed.

Such an audit log is a must, regardless of anything else we wish to add to the flow.

I'm not a fan of the conf value, though - a system should have one behavior that we stand behind.
Making the OVF update a prereq, however, doesn't sound like a bad idea - deferring to PM to decide if this is a direction we'd want to pursue.

Comment 19 Allon Mureinik 2016-01-13 14:21:47 UTC
Looking forward - the command should take a lock with wait, wait for any external update of the OVF store to finish, and then just deactivate the domain.
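
A minimal sketch of that flow under assumed names (not the real command code):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: take the OVF_UPDATE lock with a wait so any concurrent
// OVF update finishes first, then deactivate while holding it.
public class DeactivateAfterOvfLockSketch {
    // Stand-in for the per-pool OVF_UPDATE engine lock.
    private final ReentrantLock ovfUpdateLock = new ReentrantLock();

    void deactivateDomain(String storageDomainId) throws InterruptedException {
        // Block (here with a timeout) until any concurrent OVF update releases the lock.
        if (!ovfUpdateLock.tryLock(10, TimeUnit.MINUTES)) {
            throw new IllegalStateException("OVF_UPDATE lock not released in time for "
                    + storageDomainId);
        }
        try {
            // The OVF stores are now up to date (or can be updated here), so the
            // domain can be deactivated without losing VM configuration.
            System.out.println("Deactivating storage domain " + storageDomainId);
        } finally {
            ovfUpdateLock.unlock();
        }
    }
}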

Comment 20 Yaniv Lavi 2016-02-08 12:23:05 UTC
(In reply to Allon Mureinik from comment #15)
> 
> Such an audit log is a must, regardless of anything else we wish to add to
> the flow.

+1

> 
> I'm not a fan of the conf value, though - a system should have one behavior
> that we stand behind.
> Making the OVF update a prereq, however, doesn't sound like a bad idea -
> deferring to PM to decide if this is a direction we'd want to pursue.

I would go in the direction of making the default behavior fail the move to maintenance if the update fails, but adding an option to force the action.
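
Sketched with assumed names (not actual engine code), that would be roughly:

// Hypothetical sketch: fail the move to maintenance by default when the OVF
// update fails, with an explicit force flag to proceed anyway.
public class MoveToMaintenanceSketch {

    interface OvfUpdater { boolean updateOvfStores(String storageDomainId); }

    private final OvfUpdater ovfUpdater;

    MoveToMaintenanceSketch(OvfUpdater ovfUpdater) {
        this.ovfUpdater = ovfUpdater;
    }

    // Default: a failed OVF update blocks the move to maintenance;
    // 'force' lets the admin override that and proceed anyway.
    boolean moveToMaintenance(String storageDomainId, boolean force) {
        boolean ovfUpdated = ovfUpdater.updateOvfStores(storageDomainId);
        if (!ovfUpdated && !force) {
            return false; // refuse, so the VM configuration isn't lost
        }
        // ... proceed with deactivation
        return true;
    }
}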

Comment 21 Liron Aravot 2016-03-28 13:49:34 UTC
After talking with Allon, I'm closing this bug; we can reopen it if we encounter a similar issue again. The infra team added logging when acquiring a lock with wait, which will allow us to perform RCA if we encounter this again.

I'm opening an RFE on adding a config value that will prevent a domain from moving to maintenance if at least one OVF store wasn't updated.

Comment 26 Allon Mureinik 2016-03-29 11:06:23 UTC
Liron, please provide some doctext for this BZ.

Comment 27 Liron Aravot 2016-05-08 08:18:00 UTC
Allon, as nothing was actually fixed in this BZ (the exact cause couldn't be determined because of the missing logging), there's no doctext to provide.
I've moved it to MODIFIED so that QE can retest it.

Removing the doctext flag; please update me if it's somehow necessary.

Comment 28 Carlos Mestre González 2016-07-11 10:24:26 UTC
Hi Liron,

If nothing was really fixed for this BZ, I'm not really sure what there is to test here. Confirm whether this scenario still occurs?

Comment 29 Julio Entrena Perez 2016-07-11 10:29:55 UTC
I thought additional logging was added:

(In reply to Liron Aravot from comment #21)
> the infra team added logs when acquiring log
> with wait that'll allow to perform RCA if we encounter that again.

Comment 30 Liron Aravot 2016-07-12 12:31:40 UTC
Carlos/Juan - I'd verify that the described scenario works fine.

Comment 31 Carlos Mestre González 2016-07-12 13:30:07 UTC
Well, since with 4.0 (which this BZ targets) we cannot have a < 3.6 DC, the scenario comes down to the behavior of the OVF_STORE update and whether moving the domains to maintenance is stopped when the update fails. I've tested this scenario and it works fine, as with previous releases.

Comment 32 Julio Entrena Perez 2016-07-13 10:27:22 UTC
When I tested the scenario it worked fine for me too.
But then the customer did the same and ran into the bug where the OVF_STORE was not updated before the SD was detached.

The end result was that all info about the VMs in the SD was lost from the engine DB and from the OVF_STORE, requiring many hours of work to recover it from a DB backup in an auxiliary engine installed for this very recovery purpose.

Troubleshooting of this issue concluded that we didn't have enough information to find the root cause and that extra logging was required to determine it.

The goal of this BZ was then to add this extra logging for this scenario.

(In reply to Liron Aravot from comment #30)
> Carlos/Juan - I'd verify that the described scenario works fine.

You should verify that the extra logging is produced for that scenario; without the extra logging we are not making any progress.

Comment 33 Carlos Mestre González 2016-07-14 08:59:32 UTC
(In reply to Julio Entrena Perez from comment #32)
 
> (In reply to Liron Aravot from comment #30)
> > Carlos/Juan - I'd verify that the described scenario works fine.
> 
> You should verify that the extra logging is produced for that scenario,
> without the extra logging we are not making any progress.

That is for bug 1293944 to test. As I said in comment #31, the scenario in the description works fine (even though it is not exactly the same, as we cannot have the 3.5 upgrade on 4.0). So I'm tempted to close this as verified (the logging issue should be checked in bug 1293944).

Comment 34 Julio Entrena Perez 2016-07-14 09:15:24 UTC
(In reply to Carlos Mestre González from comment #33)
> (the logging issue should be checked on 1293944).
Sorry, makes sense, thanks Carlos.

Comment 35 Carlos Mestre González 2016-07-14 11:47:43 UTC
Verified since the scenario works fine in 4.0.

Comment 36 Liron Aravot 2016-08-09 09:49:39 UTC
(In reply to Carlos Mestre González from comment #31)
> Well, since with 4.0 (which this BZ targets) we cannot have a < 3.6 DC, the
> scenario comes down to the behavior of the OVF_STORE update and whether
> moving the domains to maintenance is stopped when the update fails. I've
> tested this scenario and it works fine, as with previous releases.

That's fine, thanks.

Comment 38 errata-xmlrpc 2016-08-23 20:31:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

Comment 39 Allon Mureinik 2017-02-21 21:16:42 UTC
@James - why change the priority of a BZ that has been closed for half a year? Was this intentional?

