Bug 1415502 - New HSM infra - Disk remains locked after failure during Cold Move
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.1.0.3
Assignee: Liron Aravot
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Duplicates: 1417456
Depends On:
Blocks:
 
Reported: 2017-01-22 17:51 UTC by Kevin Alon Goldblatt
Modified: 2017-02-15 14:50 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-15 14:50:02 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments
server, vdsm, engine.log (752.10 KB, application/x-gzip)
2017-01-22 18:21 UTC, Kevin Alon Goldblatt


Links
oVirt gerrit 71034 (master, MERGED): core: StorageJobCallback - jobs fencing (last updated 2017-01-24 13:02:01 UTC)
oVirt gerrit 71100 (ovirt-engine-4.1, MERGED): core: StorageJobCallback - jobs fencing (last updated 2017-01-24 15:46:00 UTC)
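
The linked patches add "fencing" of storage jobs to the StorageJobCallback polling logic. Reading the patch title together with the behaviour reported below, the idea appears to be: when the host that was executing a storage job restarts and no longer reports the job, the engine fails the job instead of waiting on it indefinitely, so the parent command can complete and the disk lock is released. A minimal Python sketch of that idea (illustrative only; every class and method name here is made up, and the real implementation is Java code inside ovirt-engine):
-----------------------------------
# Conceptual sketch only -- not the actual (Java) StorageJobCallback code.
# Idea: a storage job whose executing host restarted and no longer reports it
# will never complete, so fence it (mark it failed) instead of polling forever;
# the parent command can then finish and the disk lock is released.
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class Job(object):
    def __init__(self, job_id):
        self.id = job_id
        self.status = JobStatus.RUNNING


class Host(object):
    """Stand-in for the host executing the job; forgets jobs after a restart."""
    def __init__(self):
        self.jobs = {}

    def report_job(self, job_id):
        # None means the host no longer knows about the job.
        return self.jobs.get(job_id)


def poll(job, host):
    reported = host.report_job(job.id)
    if reported is None and job.status is JobStatus.RUNNING:
        # Fencing: the job vanished from the host (e.g. vdsm was restarted),
        # so fail it rather than leaving the parent command waiting forever.
        job.status = JobStatus.FAILED
    return job.status


# Example: the host restarts and loses the job; the next poll fences it.
host = Host()
job = Job("1642bfe7-eae0-4ad1-b20f-73eaff7c1f00")
host.jobs.clear()            # simulate the vdsm restart on the SPM
assert poll(job, host) is JobStatus.FAILED
-----------------------------------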

Description Kevin Alon Goldblatt 2017-01-22 17:51:00 UTC
Description of problem:
Failure during cold move with new HSM infra causes disk to remain locked

Version-Release number of selected component (if applicable):
vdsm-4.19.2-1.gitd9c3ccb.el7.centos.x86_64
ovirt-engine-4.1.0.3-0.0.master.20170122091652.gitc6fc2c2.el7.centos.noarch

How reproducible:


Steps to Reproduce:
1. Create VM with preallocated disk and power VM off
2. Select to Move the block disk to another domain
3. Restart VDSM on the SPM as soon as the CopyDataCommand is reported in the engine log
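
A rough way to time step 3 consistently (an illustrative sketch only: the engine log path, the SPM hostname and passwordless root SSH are assumptions of this illustration, not part of the report) is to tail the engine log and restart vdsmd the moment CopyDataCommand shows up:
-----------------------------------
#!/usr/bin/env python
# Repro helper sketch: watch engine.log for CopyDataCommand and immediately
# restart vdsmd on the SPM. Log path, SPM hostname and passwordless root SSH
# are assumptions for illustration, not taken from this report.
import subprocess
import time

ENGINE_LOG = "/var/log/ovirt-engine/engine.log"   # assumed default location
SPM_HOST = "purple-vds2.qa.lab.tlv.redhat.com"    # SPM host seen in these logs


def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)
                continue
            yield line


for line in follow(ENGINE_LOG):
    if "CopyDataCommand" in line and "Running command" in line:
        print("CopyDataCommand started, restarting vdsmd on the SPM...")
        subprocess.call(["ssh", "root@" + SPM_HOST, "systemctl", "restart", "vdsmd"])
        break
-----------------------------------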

Actual results:
The move fails and the disk remains reported as locked; the progress bar is also stuck at 16%

Expected results:
Move should succeed

Additional info:
Engine log
-----------------------------------
2017-01-22 19:22:57,448+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler5) [49fca5e1] Running command: CopyDataCommand internal: true. 
******** VDSM on the SPM is restarted at this point ********

2017-01-22 19:23:05,993+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer
2017-01-22 19:23:09,822+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.35.102.54
2017-01-22 19:23:10,232+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM purple-vds2.qa.lab.tlv.redhat.com command GetStatsVDS failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler9) [6cfa57ec] Command 'GetStatsVDSCommand(HostName = purple-vds2.qa.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='f47abe9f-e366-4004-ab89-fb3ded9c02bd', vds='Host[purple-vds2.qa.lab.tlv.redhat.com,f47abe9f-e366-4004-ab89-fb3ded9c02bd]'})' execution failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler9) [6cfa57ec] Failed getting vds stats, host='purple-vds2.qa.lab.tlv.redhat.com'(f47abe9f-e366-4004-ab89-fb3ded9c02bd): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
2017-01-22 19:23:10,250+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_INITIALIZING(514), Correlation ID: null, Call Stack: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:42)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:391)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:115)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:85)
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:274)
        at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77)
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[...]

20 minutes later, the engine is still waiting on the child command:

2017-01-22 19:48:15,549+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [49fca5e1] Command 'CopyImageGroupVolumesData' (id: '0f1f47d9-f321-44fe-a5e8-ab21f0da59eb') waiting on child command id: '1642bfe7-eae0-4ad1-b20f-73eaff7c1f00' type:'CopyData' to complete

Comment 1 Kevin Alon Goldblatt 2017-01-22 18:21:17 UTC
Created attachment 1243374 [details]
server, vdsm, engine.log

Adding logs

Comment 2 Tal Nisan 2017-01-22 19:24:17 UTC
Kevin, next time please supply trimmed logs; the engine log alone here is 7MB and contains 4 days of logs which are totally irrelevant for us

Comment 3 Liron Aravot 2017-01-30 13:37:04 UTC
*** Bug 1417456 has been marked as a duplicate of this bug. ***

Comment 4 Kevin Alon Goldblatt 2017-02-05 19:41:44 UTC
Verified with the following versions:
----------------------------------------------------------------
Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch
ovirt-engine-4.1.0.3-0.1.el7.noarch


Verified with the following scenario:
---------------------------------------------------------------

Steps to Reproduce:
1. Create a VM with some disks, block and nfs, preallocated and thin
2. Move all the disks to different domains and restart VDSM on the SPM as soon as the CopyDataCommand is reported in the engine log >>>>> Some disks are moved successfully as expected; the rest fail but are no longer left in a locked state. Moving the failed disks again works fine.


Moving to VERIFIED!

