Description of problem:
When a 3.5 SD is put into maintenance, the OVF_STORE fails to be updated, but the operation carries on and puts the SD into maintenance without issuing any warning to the user. Once the SD is detached, all information about the VMs in the SD is lost (gone from the database and absent from the OVF_STORE).

Version-Release number of selected component (if applicable):
rhevm-backend-3.5.4.2-1.3.el6ev

How reproducible:
Unknown (observed once so far).

Steps to Reproduce:
1. Upgrade a DC to compatibility version 3.5 so the SD is able to have an OVF_STORE.
2. Put the SD into maintenance mode.

Actual results:
Update of the OVF_STORE fails but the SD is still deactivated.

Expected results:
SD fails to enter maintenance mode due to the failure to update the OVF_STORE.

Additional info:
- DC is updated to version 3.5, which in turn triggers the update of the SD:
2015-12-13 11:18:11,136 INFO [org.ovirt.engine.core.bll.storage.UpdateStoragePoolCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Running command: UpdateStoragePoolCommand internal: false.
Entities affected : ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool
Action group EDIT_STORAGE_POOL_CONFIGURATION with role type ADMIN
2015-12-13 11:18:11,265 INFO [org.ovirt.engine.core.bll.storage.UpdateStoragePoolCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Updating storage domain b8b4c733-60d3-4f0c-9208-f4d0c846bdc3 (type Data) to format V3
2015-12-13 11:18:11,272 INFO [org.ovirt.engine.core.bll.storage.UpdateStoragePoolCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Updating storage domain c1b29073-e30b-4398-a9ba-bf214e617491 (type Master) to format V3
2015-12-13 11:18:11,392 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UpgradeStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] START, UpgradeStoragePoolVDSCommand( storagePoolId = 6b1fb84b-ef5a-4337-be90-008412c71fe0, poolVersion = 3), log id: 64fe29aa
2015-12-13 11:18:11,393 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UpgradeStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] Upgrading storage pool 6b1fb84b-ef5a-4337-be90-008412c71fe0 to version 3
2015-12-13 11:18:11,667 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UpgradeStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-34) [20de5926] FINISH, UpgradeStoragePoolVDSCommand, log id: 64fe29aa

- SD is put into maintenance mode:
2015-12-13 11:24:10,651 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (org.ovirt.thread.pool-7-thread-32) [78c37d18] Running command: DeactivateStorageDomainWithOvfUpdateCommand internal: false.
Entities affected : ID: c1b29073-e30b-4398-a9ba-bf214e617491 Type: Storage
Action group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

- That triggers an update of the OVF_STORE:
2015-12-13 11:24:10,677 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (org.ovirt.thread.pool-7-thread-32) [6fb70273] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true.
Entities affected : ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool

- RHEV-M fails to acquire an exclusive lock on the DC (this message is repeated 1216 times during 24 seconds):
2015-12-13 11:33:43,300 INFO [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Failed to acquire lock and wait lock EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]

- The lock is released:
2015-12-13 11:34:07,281 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (org.ovirt.thread.pool-7-thread-32) [6fb70273] Lock freed to object EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]

- The update of the OVF_STORE is retried:
2015-12-13 11:34:07,281 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true.
Entities affected : ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool

- RHEV-M fails to acquire a shared lock on the DC:
2015-12-13 11:34:07,340 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [2ac34bea] Failed to Acquire Lock to object EngineLock [exclusiveLocks= , sharedLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE]

- The update of the OVF_STORE fails:
2015-12-13 11:34:07,341 WARN [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [2ac34bea] CanDoAction of action ProcessOvfUpdateForStorageDomain failed for user admin@internal.
Reasons: VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_OBJECT_LOCKED

- RHEV-M still goes ahead and sets the SD to maintenance despite the OVF_STORE update failure above:
2015-12-13 11:34:07,476 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [6c8bc27] Running command: DeactivateStorageDomainCommand internal: true.
Entities affected : ID: c1b29073-e30b-4398-a9ba-bf214e617491 Type: Storage
Action group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2015-12-13 11:34:08,073 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (org.ovirt.thread.pool-7-thread-32) [6c8bc27] START, DeactivateStorageDomainVDSCommand( storagePoolId = 6b1fb84b-ef5a-4337-be90-008412c71fe0, ignoreFailoverLimit = false, storageDomainId = c1b29073-e30b-4398-a9ba-bf214e617491, masterDomainId = b8b4c733-60d3-4f0c-9208-f4d0c846bdc3, masterVersion = 2), log id: 352435e5

- An exclusive lock is released:
2015-12-13 11:34:08,108 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Lock freed to object EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE, sharedLocks= ]
The issue seems to be in the lock manager infrastructure or even deeper than that; perhaps we can check the installed Java version, or it can be updated.

The domain is being deactivated at 2015-12-13 11:24:10,651. We first attempt to update the OVF data in the db by executing ProcessOvfUpdateForStoragePoolCommand; this command attempts to acquire a lock with wait (if needed) using our LockManager:

2015-12-13 11:24:10,677 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (org.ovirt.thread.pool-7-thread-32) [6fb70273] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected : ID: 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool

The previous logging related to a lock on that resource is from almost 30 minutes earlier, so the lock should be free and the operation should succeed immediately:

2015-12-13 10:33:40,418 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler_Worker-76) [5c02c867] Lock freed to object EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE , sharedLocks= ]

Instead, the thread gets stuck until the OVF update process runs and attempts to perform the OVF update for that pool as well; during that time the lock is also attempted from another thread:

2015-12-13 11:33:43,300 INFO [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler_Worker-13) [20f82bd3] Failed to acquire lock and wait lock EngineLock [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE

This message repeats itself multiple times; the thread should await on the condition (and I don't see logging for any released resources, though I don't have information on whether other operations ran and took/released locks without logging, so perhaps that's worth checking as well).
Only then (after 10 minutes, and only after the OVF update process attempted to acquire the lock) does the ProcessOvfUpdateForStoragePoolCommand thread log that it failed to acquire the lock and continue:

2015-12-13 11:34:07,340 INFO [org.ovirt.engine.core.bll.ProcessOvfUpdateForStorageDomainCommand] (org.ovirt.thread.pool-7-thread-32) [2ac34bea] Failed to Acquire Lock to object EngineLock [exclusiveLocks= , sharedLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE

The bug seems to be in the implementation of our lock manager or deeper than that; moving to infra for further investigation.
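For reference, the acquire-with-wait pattern described in this comment can be sketched as below. This is a minimal illustration of the behavior the engine's InMemoryLockManager is expected to have, not the actual engine code; all names here are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of an exclusive lock table with a blocking acquire.
// A waiter loops on the condition and relies on release() to wake it up;
// a release that never signals (or never happens) produces exactly the
// kind of long stall seen in the log above.
class SimpleLockManager {
    private final Set<String> exclusive = new HashSet<>();

    // Block until the key is free, then take it exclusively.
    public synchronized void acquireWait(String key) {
        while (exclusive.contains(key)) {
            try {
                wait(); // woken by notifyAll() in release()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // give up on interrupt (sketch only)
            }
        }
        exclusive.add(key);
    }

    // Releasing must wake all waiters so they can re-check the condition.
    public synchronized void release(String key) {
        exclusive.remove(key);
        notifyAll();
    }

    public synchronized boolean isHeld(String key) {
        return exclusive.contains(key);
    }
}
```

The loop around `wait()` matters: spurious wakeups are allowed by the JVM, so the waiter must re-check the condition each time it wakes.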
Moti - can you take a look?
(In reply to Liron Aravot from comment #2)
> The issue seems to be in the lock manager infrastructure or even deeper than
> that, perhaps we can check the java version installed or it can be updated.
> 
> The domain is being deactivated on 2015-12-13 11:24:10,651 , we first
> attempt to update the ovf data in the db by executing the
> ProcessOvfUpdateForStoragePoolCommand, this command attempts to acquire a
> lock with wait (if needed) using our LockManager -
> 
> 2015-12-13 11:24:10,677 INFO
> [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand]
> (org.ovirt.thread.pool-7-thread-32) [6fb70273] Running command: ProcessOvfUpd
> ateForStoragePoolCommand internal: true. Entities affected : ID:
> 6b1fb84b-ef5a-4337-be90-008412c71fe0 Type: StoragePool
> 
> the pervious logging related to lock on that resource is from almost 30
> minutes before and it should be free so the operation should success
> immediately.
> 2015-12-13 10:33:40,418 INFO
> [org.ovirt.engine.core.bll.ProcessOvfUpdateForStoragePoolCommand]
> (DefaultQuartzScheduler_Worker-76) [5c02c867] Lock freed to object EngineLock
> [exclusiveLocks= key: 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE
> , sharedLocks= ]
> 
> the thread get stuck till the ovf update process runs and attempts to
> perform ovf update for that pool as well, during that it's attempted to
> acquire the lock from another thread as well.
> 
> 2015-12-13 11:33:43,300 INFO
> [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> (DefaultQuartzScheduler_Worker-13) [20f82bd3] Failed to acquire lock and
> wait lock EngineLock [exclusiveLocks= key:
> 6b1fb84b-ef5a-4337-be90-008412c71fe0 value: OVF_UPDATE
> 
> this message repeats itself multiple times, the thread should await on the
> condition (and i don't see logging for any released resources, though i
> don't have information if any other operations ran and took/released locks
> without logging so perhaps that's worth checking as well).
Liron - Did we ask the customer to run with debug log?
Adding a note here - as opposed to acquiring a lock without wait, when acquiring a lock with wait there's no log in CommandBase indicating that the lock was taken, so we can't tell whether and when ProcessOvfUpdateForStoragePoolCommand acquired the lock. I'll do some further digging on that as well.
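The logging gap described here can be illustrated with a small sketch: log both before and after a blocking acquire, so the log records exactly when the lock was granted. This is hypothetical illustration code, not the real CommandBase/LockManager implementation.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: a blocking acquire that logs the grant time. Without the
// "Acquired" line after the blocking call, a log reader cannot tell
// whether the thread is still waiting or already holds the lock --
// which is precisely the ambiguity in this bug's engine log.
class LoggedLock {
    private final ReentrantLock lock = new ReentrantLock();
    final StringBuilder log = new StringBuilder(); // stand-in for the engine log

    void acquireWait(String resource) {
        log.append("Waiting for lock: ").append(resource).append('\n');
        lock.lock(); // may block for a long time
        log.append("Acquired lock (wait): ").append(resource).append('\n'); // the missing evidence
    }

    void release(String resource) {
        lock.unlock();
        log.append("Lock freed: ").append(resource).append('\n');
    }
}
```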
Oved, not that I'm aware of.
Julio, could you try to set the engine log level to 'DEBUG' and try to reproduce the same scenario?

Looking at the log and code, there is no clear evidence that the problem resides in the locking management framework. It could be that the attempt to get the exclusive lock took ~10 minutes until the former lock was released, which led to the extensive logging of failed attempts to acquire the lock. I also opened Bug 1293944 to improve the logging of the common acquire-lock actions.

The loss of VM information should be further investigated by the storage team. However, getting detailed logs will shed more light.
Julio, please provide the following information:
1. db dump (if possible)
2. the output of the query: select * from vdc_options where option_name like '%Ovf%';
3. I see in the log that the DC of the problematic deactivated domain was upgraded; can you elaborate on what the original version and the target version were?
4. the output of: select * from storage_pool where id = '6b1fb84b-ef5a-4337-be90-008412c71fe0';

Thanks.
Hi Julio,
Currently we can't diagnose the root cause of the issue; with the extended logging we should have more information.

The current behavior of the OVF update on deactivation is that we only attempt to perform the OVF update, but on failure the deactivation isn't stopped. What we can do is have that as a config value (whether the OVF update is a prerequisite for deactivating) or as an option when running the operation. We can also add an audit log indicating that the update failed because of a concurrent update, which is what happened in this case (a very rare case).

Allon/Tal - let's decide what we want to solve out of those issues and open bugs with the desired targets; it seems this bug can be closed as we don't have enough info to proceed.
(In reply to Liron Aravot from comment #13)
> Hi Julio,
> Currently we can't diagnose the root cause for the issue, with the extended
> logging we should have more information.
> 
> Currently the behavior of the OVF update on deactivation is that we only
> attempt to perform the ovf update, but on failure the deactivation isn't
> stopped.
> what we can do is to have that as a config value (is ovf update is a
> prerequisite before deactivating) or as an option when running the operation.
> we can also add an audit log indicating that the update failed because of a
> concurrent update which is what happened in this case (this is a very rare
> case).
> 
> Allon/Tal - let's decide to what do we want to solve out of those issues and
> open bugs with the desired target, seems like this bug can be closed as we
> don't have enough info to proceed with.

Such an audit log is a must, regardless of anything else we wish to add to the flow.

I'm not a fan of the conf value, though - a system should have one behavior that we stand behind. Making the OVF update a prereq, however, doesn't sound like a bad idea - deferring to PM to decide if this is a direction we'd want to pursue.
Looking forward, the command should take a lock with a wait, wait for any external update of the OVF store to finish, and then just deactivate the domain.
(In reply to Allon Mureinik from comment #15)
> Such an audit log is a must, regardless of anything else we wish to add to
> the flow.

+1

> I'm not a fan of the conf value, though - a system should have one behavior
> that we stand behind.
> Making the OVF update a prereq, however, doesn't sound like a bad idea -
> deferring to PM to decide if this is a direction we'd want to pursue.

I would go in the direction of making the default be to fail the move to maintenance if the update fails, but adding an option to force the action.
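The behavior proposed above - fail the move to maintenance by default when the OVF update fails, with a force override - boils down to a simple decision rule. A minimal sketch, with illustrative names only (this is not an actual engine API):

```java
// Sketch of the proposed default: a failed OVF update blocks
// deactivation unless the caller explicitly forces the action.
class DeactivatePolicy {
    static boolean mayDeactivate(boolean ovfUpdateSucceeded, boolean force) {
        // Deactivate only when the OVF store is up to date, unless forced.
        return ovfUpdateSucceeded || force;
    }
}
```

This keeps a single default behavior (as Allon asked for) while the force flag covers the rare case where an admin knowingly accepts a stale OVF_STORE.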
After talking with Allon, I'm closing this bug; we can reopen it if we encounter a similar issue again. The infra team added logs when acquiring a lock with wait, which will allow performing RCA if we encounter this again. I'm opening an RFE on adding a config value that will prevent a domain from moving to maintenance if at least one OVF store wasn't updated.
Liron, please provide some doctext for this BZ.
Allon, as nothing was actually fixed in this BZ (the exact cause couldn't be determined because of missing logging), there's no doctext to provide. I've moved it to MODIFIED so that QE can retest it. Removing the doctext flag; please update me if it's somehow necessary.
Hi Liron, if nothing was really fixed for this BZ, I'm not really sure what there is to test here. Confirm whether this scenario still occurs?
I thought additional logging was added:

(In reply to Liron Aravot from comment #21)
> the infra team added logs when acquiring log
> with wait that'll allow to perform RCA if we encounter that again.
Carlos/Juan - I'd verify that the described scenario works fine.
Well, since with 4.0 (which this BZ targets) we cannot have a < 3.6 DC, the scenario comes down to the behavior of the OVF_STORE update and whether a failure stops putting the domain into maintenance. I've tested this scenario and it works fine, as with previous releases.
When I tested the scenario it worked fine for me too. But then the customer did the same and ran into the bug where the OVF_STORE was not updated before the SD was detached. The end result was that all info about the VMs in the SD was lost from the engine db and from the OVF_STORE, requiring many hours of work to recover it from a db backup in an auxiliary engine installed for this very recovery purpose.

Troubleshooting of this issue concluded that we didn't have enough information to find the RCA and that extra logging was required to determine it. The goal of this BZ was then to add this extra logging to this scenario.

(In reply to Liron Aravot from comment #30)
> Carlos/Juan - I'd verify that the described scenario works fine.

You should verify that the extra logging is produced for that scenario; without the extra logging we are not making any progress.
(In reply to Julio Entrena Perez from comment #32)
> (In reply to Liron Aravot from comment #30)
> > Carlos/Juan - I'd verify that the described scenario works fine.
> 
> You should verify that the extra logging is produced for that scenario,
> without the extra logging we are not making any progress.

That is for bug 1293944 to test. As I said in #31, the scenario in the description works fine (even though it is not exactly the same, as we cannot have the 3.5 upgrade on 4.0). So I'm tempted to close this as verified (the logging issue should be checked in 1293944).
(In reply to Carlos Mestre González from comment #33)
> (the logging issue should be checked on 1293944).

Sorry, makes sense. Thanks, Carlos.
Verified since the scenario works fine in 4.0.
(In reply to Carlos Mestre González from comment #31)
> Well, since with 4.0 (what this bz is targetted) we cannot have a < 3.6 DC
> seems the scenario is the behaviour of update OVF_STORE and how put in
> maintenance the domains stops the operation. I've tested this scenario and
> works fine as with previous releases.

That's fine, thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1743.html
@James - why change the priority of a BZ that has been closed for half a year? Was this intentional?