Bug 1693191

Summary: Snapshot creation might fail due to a transaction timeout if takes too long to process the reply from vdsm
Product: [oVirt] ovirt-engine Reporter: Benny Zlotnik <bzlotnik>
Component: BLL.StorageAssignee: Benny Zlotnik <bzlotnik>
Status: CLOSED CURRENTRELEASE QA Contact: Evelina Shames <eshames>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.3.0CC: aefrat, bugs, danieldemoraisgurgel, tnisan
Target Milestone: ovirt-4.3.3Flags: pm-rhel: ovirt-4.3+
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.3.3.2 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-16 13:58:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Benny Zlotnik 2019-03-27 10:16:31 UTC
Description of problem:
When snapshot creation is initiated, a vdsm task for volume creation is created, if engine takes too long to process the task it will fail with a transaction timeout, see additional info for a detailed 


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Start creating a snapshot
2. Wait for vdsm to reply on the CreateVolumeVDSCommand and make the engine process it longer than the transaction timeout (300 by default), this can be achieved by strategically placing a sleep command in vdsm (possibly volume.py#create)

Actual results:
The operation will fail

Expected results:
Engine should patiently wait for vdsm to finish the task

Additional info:
We start the snapshot here:
2019-03-27 10:55:30,347+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (default task-3) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9f5e07f2-1664-480c-bc8b-1f28fab56b7a=VM]', sharedLocks=''}'
2019-03-27 10:55:30,655+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Running command: CreateSnapshotForVmCommand internal: false. Entities affected :  ID: 9f5e07f2-1664-480c-bc8b-1f28fab56b7a Type: VMAction group MANIPULATE_VM_SNAPSHOTS with role type USER

2019-03-27 10:55:30,980+02 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Running command: CreateSnapshotCommand internal: true. Entities affected :  ID: 00000000-0000-0000-0000-000000000000 Type: Storage
2019-03-27 10:55:31,041+02 INFO

We send a create volume command to vdsm:
[org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='d191b902-4f0b-11e9-93ab-507b9dec63c2', ignoreFailoverLimit='false', storageDomainId='d75e9038-54c1-4c24-b79a-776ef336ca36', imageGroupId='62ed902a-056f-4b84-a2eb-c565a522ee96', imageSizeInBytes='1073741824', volumeFormat='COW', newImageId='579053f7-019f-4dbd-be2e-49340c35f7b2', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='873fc779-fd76-4565-9baf-660105197474', sourceImageGroupId='62ed902a-056f-4b84-a2eb-c565a522ee96'}), log id: 30362717

Vdsm replies within a timely manner:
2019-03-27 10:55:31,392+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] FINISH, CreateVolumeVDSCommand, return: 579053f7-019f-4dbd-be2e-49340c35f7b2, log id: 30362717



The transaction is rolled back because it timed out (30 seconds on my env)
2019-03-27 10:56:00,661+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (Transaction Reaper Worker 0) [] Transaction rolled-back for command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand'.

Only after the transaction has timed out engine begins to process vdsm's reply
2019-03-27 10:56:14,784+02 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Adding task '36cae13f-7a18-46da-9999-278001b44868' (Parent Command 'CreateSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2019-03-27 10:56:14,872+02 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] BaseAsyncTask::startPollingTask: Starting to poll task '36cae13f-7a18-46da-9999-278001b44868'.

// We fail because we no longer have a transaction available
2019-03-27 10:56:14,896+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] Command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand' failed: Failed to obtain JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000457: Unchecked throwable in managedConnectionReconnected() cl=org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@6f537e03[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.local.LocalManagedConnection@60a21e07 connection handles=0 lastReturned=1553676974893 lastValidated=1553676811726 lastCheckedOut=1553676973899 trackByTx=false pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@32c57360 mcp=SemaphoreConcurrentLinkedQueueManagedConnectionPool@3d136373[pool=ENGINEDataSource] xaResource=LocalXAResourceImpl@7ec10f33[connectionListener=6f537e03 connectionManager=404b5954 warned=false currentXid=null productName=PostgreSQL productVersion=10.7 jndiName=java:/ENGINEDataSource] txSync=null]

2019-03-27 10:56:14,934+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-20) [a6cebda9-38ee-46c3-b69c-01084f42be6f] EVENT_ID: USER_FAILED_CREATE_SNAPSHOT(117), Failed to create Snapshot s2 for VM vm1 (User: admin@internal-authz).

Comment 1 Evelina Shames 2019-04-07 11:58:05 UTC
> Steps to Reproduce:
> 1. Start creating a snapshot
> 2. Wait for vdsm to reply on the CreateVolumeVDSCommand and make the engine
> process it longer than the transaction timeout (300 by default), this can be
> achieved by strategically placing a sleep command in vdsm (possibly
> volume.py#create)


In step 2, I added "from time import sleep;sleep(400)" in "create" method in /usr/lib/python2.7/site-packages/vdsm/storage/volume.py.

That's what you meant?

Comment 2 Evelina Shames 2019-04-07 12:27:39 UTC
(In reply to Evelina Shames from comment #1)
> > Steps to Reproduce:
> > 1. Start creating a snapshot
> > 2. Wait for vdsm to reply on the CreateVolumeVDSCommand and make the engine
> > process it longer than the transaction timeout (300 by default), this can be
> > achieved by strategically placing a sleep command in vdsm (possibly
> > volume.py#create)
> 
> 
> In step 2, I added "from time import sleep;sleep(400)" in "create" method in
> /usr/lib/python2.7/site-packages/vdsm/storage/volume.py.
> 
> That's what you meant?

Verified by this comment.
vdsm-4.30.12-1.el7ev.x86_64
ovirt-engine-4.3.3.2-0.1.el7.noarch

Tell me if you have any comments.

Comment 3 Benny Zlotnik 2019-04-07 12:39:23 UTC
If the snapshot was successful it should be good

Comment 4 Sandro Bonazzola 2019-04-16 13:58:16 UTC
This bugzilla is included in oVirt 4.3.3 release, published on April 16th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 5 Daniel de Morais Gurgel 2020-03-03 19:55:37 UTC
This bug seems to affect version 4.3.8. I have received in my RHV the following message during backup. We have no error failure or disconnection events, with our environment connected at 10GB.

The following are the logs:
Failed to complete snapshot 'vbk_590677bf-c292-45d5-8587-fb652651a3e5' creation for VM 'CLIGXXX'.
VDSM hostXXX.domain command SnapshotVDS failed: Message timeout which can be caused by communication issues


I did not identify in the usr/lib/python2.7 file/site-packages/vdsm/storage/volume.py where I should add the timeout option.
Could you help me?

# Packages:
vdsm-4.30.40-1.el7ev.x86_64
ovirt-ansible-engine-setup-1.1.9-1.el7ev.noarch
ovirt-hosted-engine-ha-2.3.6-1.el7ev.noarch
python2-ovirt-host-deploy-1.8.4-1.el7ev.noarch
ovirt-vmconsole-1.0.7-3.el7ev.noarch
ovirt-imageio-common-1.5.3-0.el7ev.x86_64
cockpit-ovirt-dashboard-0.13.9-1.el7ev.noarch
python2-ovirt-setup-lib-1.2.0-1.el7ev.noarch
python-ovirt-engine-sdk4-4.3.2-1.el7ev.x86_64
ovirt-vmconsole-host-1.0.7-3.el7ev.noarch
cockpit-machines-ovirt-195-1.el7.noarch
python2-ovirt-node-ng-nodectl-4.3.7-0.20191031.0.el7ev.noarch
ovirt-provider-ovn-driver-1.2.29-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
ovirt-node-ng-nodectl-4.3.7-0.20191031.0.el7ev.noarch
ovirt-ansible-repositories-1.1.5-1.el7ev.noarch
ovirt-imageio-daemon-1.5.3-0.el7ev.noarch
ovirt-host-4.3.5-1.el7ev.x86_64
ovirt-ansible-hosted-engine-setup-1.0.32-1.el7ev.noarch
ovirt-host-dependencies-4.3.5-1.el7ev.x86_64
ovirt-host-deploy-common-1.8.4-1.el7ev.noarch