Bug 1693191 - Snapshot creation might fail due to a transaction timeout if it takes too long to process the reply from vdsm
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ovirt-4.3.3
Assignee: Benny Zlotnik
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-27 10:16 UTC by Benny Zlotnik
Modified: 2020-03-03 19:55 UTC (History)
CC List: 4 users

Fixed In Version: ovirt-engine-4.3.3.2
Clone Of:
Environment:
Last Closed: 2019-04-16 13:58:16 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.3+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 98940 0 master MERGED core: make CreateSnapshotDiskCommand NonTransactive 2020-03-03 20:13:34 UTC
oVirt gerrit 99067 0 ovirt-engine-4.3 MERGED core: make CreateSnapshotDiskCommand NonTransactive 2020-03-03 20:13:34 UTC

Description Benny Zlotnik 2019-03-27 10:16:31 UTC
Description of problem:
When snapshot creation is initiated, a vdsm task for volume creation is started. If the engine takes too long to process the task's reply, the operation fails with a transaction timeout; see the additional info below for a detailed log trace.


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Start creating a snapshot
2. Wait for vdsm to reply to the CreateVolumeVDSCommand and make the engine take longer than the transaction timeout (300 seconds by default) to process the reply; this can be achieved by strategically placing a sleep in vdsm (possibly volume.py#create), as sketched below
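
A minimal sketch of that injection (the surrounding class and method layout is an assumption for illustration, not the verbatim vdsm source; only the sleep line is what actually gets added):

    # /usr/lib/python2.7/site-packages/vdsm/storage/volume.py -- assumed layout
    from time import sleep

    class Volume(object):
        @classmethod
        def create(cls, *args, **kwargs):
            # Delay longer than the 300-second default transaction timeout, so the
            # engine is still inside its command transaction when the reply arrives.
            sleep(400)
            # ... original volume creation logic continues here ...

With this in place, the engine's handling of the CreateVolumeVDSCommand reply outlasts the transaction and the failure described below is reproduced.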

Actual results:
The operation will fail

Expected results:
Engine should patiently wait for vdsm to finish the task

Additional info:
We start the snapshot here:
2019-03-27 10:55:30,347+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (default task-3) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9f5e07f2-1664-480c-bc8b-1f28fab56b7a=VM]', sharedLocks=''}'
2019-03-27 10:55:30,655+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Running command: CreateSnapshotForVmCommand internal: false. Entities affected :  ID: 9f5e07f2-1664-480c-bc8b-1f28fab56b7a Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER

2019-03-27 10:55:30,980+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Running command: CreateSnapshotCommand internal: true. Entities affected :  ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2019-03-27 10:55:31,041+02 INFO

We send a create volume command to vdsm:
[org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='d191b902-4f0b-11e9-93ab-507b9dec63c2', ignoreFailoverLimit='false', storageDomainId='d75e9038-54c1-4c24-b79a-776ef336ca36', imageGroupId='62ed902a-056f-4b84-a2eb-c565a522ee96', imageSizeInBytes='1073741824', volumeFormat='COW', newImageId='579053f7-019f-4dbd-be2e-49340c35f7b2', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='873fc779-fd76-4565-9baf-660105197474', sourceImageGroupId='62ed902a-056f-4b84-a2eb-c565a522ee96'}), log id: 30362717

Vdsm replies in a timely manner:
2019-03-27 10:55:31,392+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] FINISH, CreateVolumeVDSCommand, return: 579053f7-019f-4dbd-be2e-49340c35f7b2, log id: 30362717



The transaction is rolled back because it timed out (30 seconds in my env):
2019-03-27 10:56:00,661+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand'.

Only after the transaction has timed out does the engine begin to process vdsm's reply:
2019-03-27 10:56:14,784+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Adding task '36cae13f-7a18-46da-9999-278001b44868' (Parent Command 'CreateSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2019-03-27 10:56:14,872+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] BaseAsyncTask::startPollingTask: Starting to poll task '36cae13f-7a18-46da-9999-278001b44868'.

// We fail because we no longer have a transaction available
2019-03-27 10:56:14,896+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand' failed: Failed to obtain JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000457: Unchecked throwable in managedConnectionReconnected() cl=org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@6f537e03[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@60a21e07 connection handles=0 lastReturned=1553676974893 lastValidated=1553676811726 lastCheckedOut=1553676973899 trackByTx=false pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@32c57360 mcp=SemaphoreConcurrentLinkedQueueManagedConnectionPool@3d136373[pool=ENGINEDataSource] xaResource=LocalXAResourceImpl@7ec10f33[connectionListener=6f537e03 connectionManager=404b5954 warned=false currentXid=null productName=PostgreSQL productVersion=10.7 jndiName=java:/ENGINEDataSource] txSync=null]

2019-03-27 10:56:14,934+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] EVENT_ID: USER_FAILED_CREATE_SNAPSHOT(117), Failed to create Snapshot s2 for VM vm1 (User: admin@internal-authz).
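
To make the sequence above easier to follow, here is a toy model of the race (illustrative Python only, not engine code; all names are invented for this sketch): the command's transaction is reaped after a fixed timeout, and any database work attempted afterwards fails, which is what surfaces as the "Failed to obtain JDBC Connection" error.

    # Toy model: a reaper rolls the transaction back while the worker is still
    # busy processing the (already received) vdsm reply.
    import threading
    import time

    class Transaction(object):
        def __init__(self, timeout):
            self.alive = True
            # Stand-in for the Transaction Reaper Worker.
            threading.Timer(timeout, self.rollback).start()

        def rollback(self):
            self.alive = False
            print("Transaction rolled-back by the reaper")

        def write(self, what):
            if not self.alive:
                raise RuntimeError("Failed to obtain JDBC Connection")
            print("wrote %s" % what)

    tx = Transaction(timeout=3)   # 30 seconds in the log above, shortened here
    print("CreateVolumeVDSCommand finished quickly")
    time.sleep(5)                 # engine takes too long to process the reply
    try:
        tx.write("async task bookkeeping")
    except RuntimeError as error:
        print("command fails: %s" % error)

The merged patches listed above (gerrit 98940 and 99067) make CreateSnapshotDiskCommand non-transactive, so processing the vdsm reply is no longer bound by the enclosing transaction's timeout.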

Comment 1 Evelina Shames 2019-04-07 11:58:05 UTC
> Steps to Reproduce:
> 1. Start creating a snapshot
> 2. Wait for vdsm to reply on the CreateVolumeVDSCommand and make the engine
> process it longer than the transaction timeout (300 by default), this can be
> achieved by strategically placing a sleep command in vdsm (possibly
> volume.py#create)


In step 2, I added "from time import sleep;sleep(400)" in the "create" method in /usr/lib/python2.7/site-packages/vdsm/storage/volume.py.

That's what you meant?

Comment 2 Evelina Shames 2019-04-07 12:27:39 UTC
(In reply to Evelina Shames from comment #1)
> > Steps to Reproduce:
> > 1. Start creating a snapshot
> > 2. Wait for vdsm to reply on the CreateVolumeVDSCommand and make the engine
> > process it longer than the transaction timeout (300 by default), this can be
> > achieved by strategically placing a sleep command in vdsm (possibly
> > volume.py#create)
> 
> 
> In step 2, I added "from time import sleep;sleep(400)" in "create" method in
> /usr/lib/python2.7/site-packages/vdsm/storage/volume.py.
> 
> That's what you meant?

Verified with the steps above on:
vdsm-4.30.12-1.el7ev.x86_64
ovirt-engine-4.3.3.2-0.1.el7.noarch

Tell me if you have any comments.

Comment 3 Benny Zlotnik 2019-04-07 12:39:23 UTC
If the snapshot was successful, it should be good.

Comment 4 Sandro Bonazzola 2019-04-16 13:58:16 UTC
This bugzilla is included in the oVirt 4.3.3 release, published on April 16th 2019.

Since the problem described in this bug report should be
resolved in the oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 5 Daniel de Morais Gurgel 2020-03-03 19:55:37 UTC
This bug seems to affect version 4.3.8. I received the following message in my RHV environment during a backup. We see no error, failure, or disconnection events, and our environment is connected at 10Gb.

The following are the logs:
Failed to complete snapshot 'vbk_590677bf-c292-45d5-8587-fb652651a3e5' creation for VM 'CLIGXXX'.
VDSM hostXXX.domain command SnapshotVDS failed: Message timeout which can be caused by communication issues


I could not identify where in /usr/lib/python2.7/site-packages/vdsm/storage/volume.py I should add the timeout option.
Could you help me?

# Packages:
vdsm-4.30.40-1.el7ev.x86_64
ovirt-ansible-engine-setup-1.1.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
python2-ovirt-host-deploy-1.8.4-1.el7ev.noarch
ovirt-vmconsole-1.0.7-3.el7ev.noarch
ovirt-imageio-common-1.5.3-0.el7ev.x86_64
cockpit-ovirt-dashboard-0.13.9-1.el7ev.noarch
python2-ovirt-setup-lib-1.2.0-1.el7ev.noarch
python-ovirt-engine-sdk4-4.3.2-1.el7ev.x86_64
ovirt-vmconsole-host-1.0.7-3.el7ev.noarch
cockpit-machines-ovirt-195-1.el7.noarch
python2-ovirt-node-ng-nodectl-4.3.7-0.20191031.0.el7ev.noarch
ovirt-provider-ovn-driver-1.2.29-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
ovirt-node-ng-nodectl-4.3.7-0.20191031.0.el7ev.noarch
ovirt-ansible-repositories-1.1.5-1.el7ev.noarch
ovirt-imageio-daemon-1.5.3-0.el7ev.noarch
ovirt-host-4.3.5-1.el7ev.x86_64
ovirt-ansible-hosted-engine-setup-1.0.32-1.el7ev.noarch
ovirt-host-dependencies-4.3.5-1.el7ev.x86_64
ovirt-host-deploy-common-1.8.4-1.el7ev.noarch

