Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1415502

Summary: New HSM infra - Disk remains locked after failure during Cold Move
Product: [oVirt] ovirt-engine
Component: BLL.Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Keywords: Regression
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.1.0.3
Flags: rule-engine: ovirt-4.1+, rule-engine: blocker+
Reporter: Kevin Alon Goldblatt <kgoldbla>
Assignee: Liron Aravot <laravot>
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Docs Contact:
CC: aefrat, bugs, gklein, laravot, tnisan, ylavi
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-15 14:50:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: server, vdsm, engine.log (flags: none)

Description Kevin Alon Goldblatt 2017-01-22 17:51:00 UTC
Description of problem:
A failure during a cold move with the new HSM infrastructure causes the disk to remain locked.

Version-Release number of selected component (if applicable):
vdsm-4.19.2-1.gitd9c3ccb.el7.centos.x86_64
ovirt-engine-4.1.0.3-0.0.master.20170122091652.gitc6fc2c2.el7.centos.noarch

How reproducible:


Steps to Reproduce:
1. Create a VM with a preallocated disk and power the VM off
2. Move the block disk to another storage domain (a scripted version is sketched below)
3. Restart VDSM on the SPM host as soon as the CopyDataCommand is reported in the engine log
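
(For reference, the cold move in step 2 can also be triggered through the REST API, which makes it easier to time the VDSM restart in step 3. The sketch below is only illustrative and assumes the ovirt-sdk4 Python bindings and the disk "move" action of the v4 API; the engine URL, credentials, disk name and target domain name are placeholders, not values from this bug.)
---------------------------------------------------------------
# Illustrative sketch only -- all names and credentials below are placeholders.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

disks_service = connection.system_service().disks_service()
disk = disks_service.list(search='name=vm1_Disk1')[0]   # placeholder disk name
disk_service = disks_service.disk_service(disk.id)

# Start the cold move; CopyDataCommand should appear in engine.log shortly after.
# Restart vdsmd on the SPM host at that point to reproduce the failure.
disk_service.move(storage_domain=types.StorageDomain(name='iscsi_1'))  # placeholder target domain

connection.close()
---------------------------------------------------------------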

Actual results:
The move fails and the disk remains reported as locked; the progress bar is also stuck at 16%.

Expected results:
The move should succeed.
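
(To make the "locked" outcome concrete, the disk status can be polled after the failure: on a healthy recovery it eventually returns to OK, while in this bug it stays LOCKED. A minimal sketch under the same placeholder assumptions as above:)
---------------------------------------------------------------
import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',   # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
disks_service = connection.system_service().disks_service()
disk = disks_service.list(search='name=vm1_Disk1')[0]    # placeholder disk name
disk_service = disks_service.disk_service(disk.id)

# Poll for up to 30 minutes; in this bug the status never leaves LOCKED.
for _ in range(60):
    status = disk_service.get().status
    print(status)
    if status != types.DiskStatus.LOCKED:
        break
    time.sleep(30)

connection.close()
---------------------------------------------------------------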

Additional info:
Engine log
-----------------------------------
2017-01-22 19:22:57,448+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler5) [49fca5e1] Running command: CopyDataCommand internal: true. 
******** VDSM ON THE SPM IS RESTARTED AT THIS POINT ********

2017-01-22 19:23:05,993+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer
2017-01-22 19:23:09,822+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.35.102.54
2017-01-22 19:23:10,232+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM purple-vds2.qa.lab.tlv.redhat.com command GetStatsVDS failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler9) [6cfa57ec] Command 'GetStatsVDSCommand(HostName = purple-vds2.qa.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='f47abe9f-e366-4004-ab89-fb3ded9c02bd', vds='Host[purple-vds2.qa.lab.tlv.redhat.com,f47abe9f-e366-4004-ab89-fb3ded9c02bd]'})' execution failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler9) [6cfa57ec] Failed getting vds stats, host='purple-vds2.qa.lab.tlv.redhat.com'(f47abe9f-e366-4004-ab89-fb3ded9c02bd): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
2017-01-22 19:23:10,250+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_INITIALIZING(514), Correlation ID: null, Call Stack: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:42)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:391)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:115)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:85)
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:274)
        at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77)
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
.
.
.
.
20 MINUTES LATER, THE COMMAND IS STILL WAITING:

2017-01-22 19:48:15,549+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [49fca5e1] Command 'CopyImageGroupVolumesData' (id: '0f1f47d9-f321-44fe-a5e8-ab21f0da59eb') waiting on child command id: '1642bfe7-eae0-4ad1-b20f-73eaff7c1f00' type:'CopyData' to complete

Comment 1 Kevin Alon Goldblatt 2017-01-22 18:21:17 UTC
Created attachment 1243374
server, vdsm, engine.log

Adding logs

Comment 2 Tal Nisan 2017-01-22 19:24:17 UTC
Kevin, next time please supply trimmed logs; the engine log alone here is 7 MB and contains four days of logs that are totally irrelevant to us.

Comment 3 Liron Aravot 2017-01-30 13:37:04 UTC
*** Bug 1417456 has been marked as a duplicate of this bug. ***

Comment 4 Kevin Alon Goldblatt 2017-02-05 19:41:44 UTC
Verified with the following code:
----------------------------------------------------------------
Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch
ovirt-engine-4.1.0.3-0.1.el7.noarch


Verified with the following scenario:
---------------------------------------------------------------

Steps to Reproduce:
1. Create a VM with several disks: block and NFS, preallocated and thin
2. Move all the disks to different domains and restart VDSM on the SPM as soon as the CopyDataCommand is reported in the engine log
>>>>> Some disks are moved successfully as expected; the disks that fail are no longer left in a locked state, and moving the failed disks again works fine (see the retry sketch below).
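
(The retry mentioned in step 2 could be scripted roughly as follows; again only a sketch with placeholder names and credentials, assuming ovirt-sdk4 and the same disk "move" action as above:)
---------------------------------------------------------------
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',   # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
disks_service = connection.system_service().disks_service()

# Placeholder disk names; retry the move only for disks that failed but
# are no longer locked.
for name in ('vm1_Disk1', 'vm1_Disk2'):
    disk = disks_service.list(search='name=%s' % name)[0]
    if disk.status == types.DiskStatus.OK:
        disks_service.disk_service(disk.id).move(
            storage_domain=types.StorageDomain(name='nfs_1'),   # placeholder target domain
        )

connection.close()
---------------------------------------------------------------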


Moving to VERIFIED!