Description of problem:
A failure during a cold move with the new HSM infrastructure causes the disk to remain locked.

Version-Release number of selected component (if applicable):
vdsm-4.19.2-1.gitd9c3ccb.el7.centos.x86_64
ovirt-engine-4.1.0.3-0.0.master.20170122091652.gitc6fc2c2.el7.centos.noarch

How reproducible:

Steps to Reproduce:
1. Create a VM with a preallocated disk and power the VM off
2. Select the block disk and move it to another storage domain
3. Restart VDSM on the SPM as soon as the CopyDataCommand is reported on the engine (see the timing-helper sketch after the log excerpt below)

Actual results:
The move fails and the disk is reported as locked; the progress bar is also stuck at 16%.

Expected results:
The move should succeed.

Additional info:

Engine log
-----------------------------------
2017-01-22 19:22:57,448+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler5) [49fca5e1] Running command: CopyDataCommand internal: true.

******** VDSM ON THE SPM IS RESTARTED HERE ********

2017-01-22 19:23:05,993+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer
2017-01-22 19:23:09,822+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.35.102.54
2017-01-22 19:23:10,232+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM purple-vds2.qa.lab.tlv.redhat.com command GetStatsVDS failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler9) [6cfa57ec] Command 'GetStatsVDSCommand(HostName = purple-vds2.qa.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='f47abe9f-e366-4004-ab89-fb3ded9c02bd', vds='Host[purple-vds2.qa.lab.tlv.redhat.com,f47abe9f-e366-4004-ab89-fb3ded9c02bd]'})' execution failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler9) [6cfa57ec] Failed getting vds stats, host='purple-vds2.qa.lab.tlv.redhat.com'(f47abe9f-e366-4004-ab89-fb3ded9c02bd): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
2017-01-22 19:23:10,250+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_INITIALIZING(514), Correlation ID: null, Call Stack: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
	at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:42)
	at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21)
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
	at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407)
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:391)
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:115)
	at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:85)
	at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:274)
	at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77)
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

. . . 20 MINUTES GO BY AND THE COMMAND IS STILL WAITING:

2017-01-22 19:48:15,549+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [49fca5e1] Command 'CopyImageGroupVolumesData' (id: '0f1f47d9-f321-44fe-a5e8-ab21f0da59eb') waiting on child command id: '1642bfe7-eae0-4ad1-b20f-73eaff7c1f00' type:'CopyData' to complete
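For step 3, the restart has to land inside the short window while CopyDataCommand is running. A minimal timing-helper sketch, meant to run on the engine machine; the SPM hostname is an assumption taken from the log above, and passwordless root SSH from the engine to the SPM is also assumed:

#!/usr/bin/env python
# Tail the engine log and restart vdsmd on the SPM as soon as
# CopyDataCommand shows up, to hit the window from step 3.
import os
import subprocess
import time

ENGINE_LOG = '/var/log/ovirt-engine/engine.log'
SPM_HOST = 'purple-vds2.qa.lab.tlv.redhat.com'  # assumption: the current SPM

def main():
    with open(ENGINE_LOG) as log:
        log.seek(0, os.SEEK_END)  # only watch new log entries
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.1)
                continue
            if 'Running command: CopyDataCommand' in line:
                # assumes passwordless root ssh from the engine to the SPM
                subprocess.check_call(
                    ['ssh', 'root@' + SPM_HOST, 'systemctl', 'restart', 'vdsmd'])
                print('restarted vdsmd on %s' % SPM_HOST)
                break

if __name__ == '__main__':
    main()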
Created attachment 1243374 [details]
server, vdsm and engine logs

Adding logs.
Kevin, next time please supply trimmed logs; the engine log alone here is 7 MB and contains four days of entries that are completely irrelevant to this bug.
*** Bug 1417456 has been marked as a duplicate of this bug. ***
Verified with the following build:
----------------------------------------------------------------
Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch
ovirt-engine-4.1.0.3-0.1.el7.noarch

Verified with the following scenario:
---------------------------------------------------------------
Steps to Reproduce:
1. Create a VM with several disks: block and NFS, preallocated and thin provisioned
2. Move all the disks to different domains and restart VDSM on the SPM as soon as the CopyDataCommand is reported in the engine log

>>>>> Some disks are moved successfully, as expected, and some fail, but the failed disks are no longer left in a locked state; moving them again works fine (a sketch for checking for leftover locked disks follows below).

Moving to VERIFIED!
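To confirm that no disk is left behind in a locked state after the failed moves, a minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL and credentials are placeholders:

#!/usr/bin/env python
# List any disks still stuck in the LOCKED state after the moves.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: point these at the engine under test.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,  # lab setup; use ca_file in real deployments
)
try:
    disks_service = connection.system_service().disks_service()
    locked = [d for d in disks_service.list()
              if d.status == types.DiskStatus.LOCKED]
    for disk in locked:
        print('%s (%s) is still locked' % (disk.name, disk.id))
    if not locked:
        print('no locked disks left behind')
finally:
    connection.close()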