Description of problem:

Moving vDisks between storage domains (while the VMs are off) can fail because the database transactions are canceled by the TransactionReaper.

Version-Release number of selected component (if applicable):
rhevm-3.5.7-0.1.el6ev

How reproducible:
One occurrence so far, but reports of similar situations exist: bug 1311409 and bug 1275736.

Steps to Reproduce:
1. Shut down some VMs.
2. Move the vDisks of those VMs to another storage domain.

Actual results:

- The task succeeds on the SPM host in vdsm.log:

Thread-72::DEBUG::2016-09-01 14:06:53,662::taskManager::78::Storage.TaskManager::(scheduleJob) scheduled job moveImage_d4c98e82-8a44-40a3-87c5-524e1133cbb3 for task 4c898831-f2b4-4e0b-8dc3-bce03db66928
[...]
4c898831-f2b4-4e0b-8dc3-bce03db66928::DEBUG::2016-09-01 14:12:19,718::task::916::Storage.TaskManager.Task::(_runJobs) Task=`4c898831-f2b4-4e0b-8dc3-bce03db66928`::Task.run: exit - success: result

- Task success is reported in engine.log:

2016-09-01 14:06:53,704 INFO [org.ovirt.engine.core.bll.CommandMultiAsyncTasks] (org.ovirt.thread.pool-7-thread-14) [50f3696b] CommandMultiAsyncTasks::AttachTask: Attaching task 4c898831-f2b4-4e0b-8dc3-bce03db66928 to command 06abea87-a886-4d96-9c02-89adfa8b9243.
[...]
2016-09-01 14:29:57,706 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (DefaultQuartzScheduler_Worker-19) [58e0d470] BaseAsyncTask::onTaskEndSuccess: Task 4c898831-f2b4-4e0b-8dc3-bce03db66928 (Parent Command MoveOrCopyDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.

- The TransactionReaper cancels the ongoing transactions in server.log:

2016-09-01 14:40:00,021 WARN [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffffac143119:14637ac7:57b42fce:6588193 in state RUN
2016-09-01 14:40:00,079 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 0,5,main] successfully canceled TX 0:ffffac143119:14637ac7:57b42fce:6588193

- The entire operation is reported as failed because the transaction no longer exists:

2016-09-01 14:41:57,731 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) Correlation ID: 50f3696b, Job ID: f247983c-e955-4023-9529-79b40796328d, Call Stack: null, Custom Event ID: -1, Message: User joe have failed to move disk vmname_Disk1 to domain STORAGEDOMAINNAME.
2016-09-01 14:41:57,731 ERROR [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-7-thread-15) [within thread]: endAction for action type MoveOrCopyDisk threw an exception.: javax.ejb.EJBException: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000460: Error checking for a transaction

- Manual database statements are then required to set the correct storage domain for the image and to clear its Locked status (see the sketch after this report).

Expected results:

- Either the transaction timeout is sufficient for long-running operations to complete successfully:

/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in
<subsystem xmlns="urn:jboss:domain:transactions:1.1">
[...]
<coordinator-environment default-timeout="600"/>

- Or database transactions are not used for long-running operations.

Additional info:
See bug 1311409 and bug 1275736.
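For context on the manual cleanup mentioned under Actual results, here is a minimal, hypothetical sketch (psycopg2 against the engine database). The table and column names (images.imagestatus, image_storage_domain_map) and the status value 1 (= OK / unlocked) are assumptions about the engine schema, not taken from this report; verify them for your engine version, or prefer the shipped unlock_entity.sh utility where applicable.

# Hypothetical cleanup sketch -- NOT taken from the report. Table and
# column names and the status values are assumptions; check the schema
# of your engine version before running anything like this.
import psycopg2

IMAGE_GROUP_ID = "d4c98e82-8a44-40a3-87c5-524e1133cbb3"  # disk from the logs above
TARGET_SD_ID = "<target-storage-domain-uuid>"            # placeholder

conn = psycopg2.connect(dbname="engine", user="engine",
                        password="<password>", host="localhost")
try:
    with conn, conn.cursor() as cur:
        # Assumed: imagestatus 1 == OK (unlocked), 2 == Locked.
        cur.execute(
            "UPDATE images SET imagestatus = 1 "
            "WHERE image_group_id = %s AND imagestatus = 2",
            (IMAGE_GROUP_ID,))
        # Assumed mapping table that points the image at its domain.
        cur.execute(
            "UPDATE image_storage_domain_map SET storage_domain_id = %s "
            "WHERE image_id IN (SELECT image_guid FROM images "
            "WHERE image_group_id = %s)",
            (TARGET_SD_ID, IMAGE_GROUP_ID))
finally:
    conn.close()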
Allon, could you please take a look at whether the related storage parts are using transactions with compensation logic correctly?
Liron, please take a look. AFAIK, the relevant code no longer exists in master.
Do we have clear steps to reproduce?
There is no easy reproducer for this one. We can cause the removeImage() call to take a long time (more than 10 minutes) to execute.
Hi Liron, same request as Raz: please provide a clear scenario for how to reproduce this issue.
Hi Avihai,
The only relatively easy way to reproduce is to cause the removeImage() call on vdsm to take a long time, as specified in comment #4 (by modifying the vdsm code), check that failures/retries occur on the engine side, and then remove the code modification and check that the flow completes successfully.
(In reply to Liron Aravot from comment #27)
> Hi Avihai,
> The only relatively easy way to reproduce is to cause the removeImage()
> call on vdsm to take a long time, as specified in comment #4 (by
> modifying the vdsm code), check that failures/retries occur on the
> engine side, and then remove the code modification and check that the
> flow completes successfully.

OK, is modifying the vdsm code something I can do? If so, how?
Searching /usr/share/vdsm/storage/* I did not find a removeImage() method; all I found was a removeImageLinks() method (see below). Is this what I need? Where (at what line) should I insert the time.sleep(600)?

grep results searching for a removeImage method:
> grep -i removeImage /usr/share/vdsm/storage/*
/usr/share/vdsm/storage/blockSD.py: self.removeImageLinks(imgUUID)
Hi Avihai, please move the SPM to maintenance, edit /usr/share/vdsm/API.py, and replace the Image.delete() code with a sleep().
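For illustration, a minimal sketch of the kind of modification meant here, assuming the Image.delete() entry point in /usr/share/vdsm/API.py. The exact signature and return shape vary between vdsm versions, so both are assumptions:

# delay_image_delete.py -- hypothetical sketch, not actual vdsm code.
# Makes Image.delete() sleep instead of deleting, so the engine-side
# operation outlives the 600s default transaction timeout and the
# TransactionReaper cancels the transaction.
import time

import API  # vdsm's /usr/share/vdsm/API.py


def _stalled_delete(self, *args, **kwargs):
    # Sleep well past the engine's 600s default transaction timeout.
    time.sleep(900)
    # Assumed vdsm-style success response; the real shape may differ.
    return {'status': {'code': 0, 'message': 'Done'}}


# Same effect as editing API.py directly, as suggested above.
API.Image.delete = _stalled_delete

Remember to move the SPM host to maintenance first, and to revert the change afterwards so the flow can complete successfully.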
Bug 1415407 blocks this bug's scenario:

Steps to Reproduce:
1. Shut down some VMs -> this step fails due to bug 1415407
2. Move the vDisks of those VMs to another storage domain.
In the current build (engine 4.1.0.3-0.1.el7) this bug rarely reproduces (2 failures out of 20). Trying to verify.
Verified.

Engine: 4.1.0.3-0.1.el7
VDSM: 4.19.2-2
https://access.redhat.com/errata/RHEA-2017:0997