Description of problem:

Moving vDisks between storage domains (while the VMs are off) can fail because the database transactions are canceled by the TransactionReaper.

Version-Release number of selected component (if applicable):
rhevm-3.5.7-0.1.el6ev

How reproducible:
One occurrence so far, but reports of similar situations exist: bug 1311409 and bug 1275736.

Steps to Reproduce:
1. Shut down some VMs.
2. Move the vDisks of those VMs to another storage domain.

Actual results:

- The task succeeds on the SPM host in vdsm.log:

Thread-72::DEBUG::2016-09-01 14:06:53,662::taskManager::78::Storage.TaskManager::(scheduleJob) scheduled job moveImage_d4c98e82-8a44-40a3-87c5-524e1133cbb3 for task 4c898831-f2b4-4e0b-8dc3-bce03db66928
[...]
4c898831-f2b4-4e0b-8dc3-bce03db66928::DEBUG::2016-09-01 14:12:19,718::task::916::Storage.TaskManager.Task::(_runJobs) Task=`4c898831-f2b4-4e0b-8dc3-bce03db66928`::Task.run: exit - success: result

- Task success is reported in engine.log:

2016-09-01 14:06:53,704 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (org.ovirt.thread.pool-7-thread-14) [50f3696b] CommandMultiAsyncTasks::AttachTask: Attaching task 4c898831-f2b4-4e0b-8dc3-bce03db66928 to command 06abea87-a886-4d96-9c02-89adfa8b9243.
[...]
2016-09-01 14:29:57,706 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-19) [58e0d470] BaseAsyncTask::onTaskEndSuccess: Task 4c898831-f2b4-4e0b-8dc3-bce03db66928 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.

- The TransactionReaper cancels the ongoing transactions in server.log:

2016-09-01 14:40:00,021 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffffac143119:14637ac7:57b42fce:6588193 in state RUN
2016-09-01 14:40:00,079 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 0,5,main] successfully canceled TX 0:ffffac143119:14637ac7:57b42fce:6588193

- The entire operation is reported as failed because the transaction no longer exists:

2016-09-01 14:41:57,731 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) Correlation ID: 50f3696b, Job ID: f247983c-e955-4023-9529-79b40796328d, Call Stack: null, Custom Event ID: -1, Message: User joe have failed to move disk vmname_Disk1 to domain STORAGEDOMAINNAME.
2016-09-01 14:41:57,731 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-15) [within thread]: endAction for action type MoveOrCopyDisk threw an exception.: javax.ejb.EJBException: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction

- Manual database statements are then required to set the correct storage domain for the image and to clear its Locked status (see the sketch after this report).

Expected results:

- Either the transaction timeout is sufficient for long-running operations to complete successfully:

/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in
<subsystem xmlns="urn:jboss:domain:transactions:1.1">
[...]
<coordinator-environment default-timeout="600"/>

- Or database transactions are not used for long-running operations.

Additional info:
See bug 1311409 and bug 1275736.
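For context on the manual cleanup mentioned under Actual results, here is a minimal, hypothetical sketch (psycopg2 against the engine database). The table and column names (images.imagestatus, image_storage_domain_map) and the status value 1 (= OK / unlocked) are assumptions about the engine schema, not taken from this report; verify them for your engine version, or prefer the shipped unlock_entity.sh utility where applicable.

# Hypothetical cleanup sketch -- NOT taken from the report. Table and
# column names and the status values are assumptions; check the schema
# of your engine version before running anything like this.
import psycopg2

IMAGE_GROUP_ID = "d4c98e82-8a44-40a3-87c5-524e1133cbb3"  # disk from the logs above
TARGET_SD_ID = "<target-storage-domain-uuid>"            # placeholder

conn = psycopg2.connect(dbname="engine", user="engine",
                        password="<password>", host="localhost")
try:
    with conn, conn.cursor() as cur:
        # Assumed: imagestatus 1 == OK (unlocked), 2 == Locked.
        cur.execute(
            "UPDATE images SET imagestatus = 1 "
            "WHERE image_group_id = %s AND imagestatus = 2",
            (IMAGE_GROUP_ID,))
        # Assumed mapping table that points the image at its domain.
        cur.execute(
            "UPDATE image_storage_domain_map SET storage_domain_id = %s "
            "WHERE image_id IN (SELECT image_guid FROM images "
            "WHERE image_group_id = %s)",
            (TARGET_SD_ID, IMAGE_GROUP_ID))
finally:
    conn.close()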
Allon, could you please take a look at whether the related storage parts are using transactions with compensation logic correctly?
Liron, please take a look. AFAIK, the relevant code no longer exists in master.
Do we have clear steps to reproduce?
There is no easy reproducer for this one. We can cause the removeImage() call to take a long time (more than 10 minutes) to execute.
Hi Liron, same request as Raz: please provide a clear scenario for how to reproduce this issue.
Hi Avihai,
The only relatively easy way to reproduce is to cause the removeImage() call on vdsm to take a long time, as specified in comment #4 (by modifying the vdsm code), check that failures/retries occur on the engine side, and then remove the code modification and check that the flow completes successfully.
(In reply to Liron Aravot from comment #27)
> Hi Avihai,
> The only relatively easy way to reproduce is to cause the removeImage()
> call on vdsm to take a long time, as specified in comment #4 (by
> modifying the vdsm code), check that failures/retries occur on the
> engine side, and then remove the code modification and check that the
> flow completes successfully.

OK, is modifying the vdsm code something I can do? If so, how?
Searching /usr/share/vdsm/storage/* I did not find a removeImage() method; all I found was a removeImageLinks() method (see below). Is this what I need? Where (at what line) should I insert the time.sleep(600)?

grep results searching for a removeImage method:
> grep -i removeImage /usr/share/vdsm/storage/*
/usr/share/vdsm/storage/blockSD.py: self.removeImageLinks(imgUUID)
Hi Avihai, please move the SPM to maintenance, edit /usr/share/vdsm/API.py, and replace the Image.delete() code with a sleep().
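For illustration, a minimal sketch of the kind of modification meant here, assuming the Image.delete() entry point in /usr/share/vdsm/API.py. The exact signature and return shape vary between vdsm versions, so both are assumptions:

# delay_image_delete.py -- hypothetical sketch, not actual vdsm code.
# Makes Image.delete() sleep instead of deleting, so the engine-side
# operation outlives the 600s default transaction timeout and the
# TransactionReaper cancels the transaction.
import time

import API  # vdsm's /usr/share/vdsm/API.py


def _stalled_delete(self, *args, **kwargs):
    # Sleep well past the engine's 600s default transaction timeout.
    time.sleep(900)
    # Assumed vdsm-style success response; the real shape may differ.
    return {'status': {'code': 0, 'message': 'Done'}}


# Same effect as editing API.py directly, as suggested above.
API.Image.delete = _stalled_delete

Remember to move the SPM host to maintenance first, and to revert the change afterwards so the flow can complete successfully.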
Bug 1415407 blocks this bug's scenario:

Steps to Reproduce:
1. Shut down some VMs -> this step fails due to bug 1415407
2. Move the vDisks of those VMs to another storage domain.
In the current build (engine 4.1.0.3-0.1.el7) this bug rarely reproduces (2 failures out of 20). Trying to verify.
Verified.

Engine: 4.1.0.3-0.1.el7
VDSM: 4.19.2-2
https://access.redhat.com/errata/RHEA-2017:0997