Bug 1415502 - New HSM infra - Disk remains locked after failure during Cold Move
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.1.0.3
Assignee: Liron Aravot
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Duplicates: 1417456
Depends On:
Blocks:
 
Reported: 2017-01-22 17:51 UTC by Kevin Alon Goldblatt
Modified: 2017-02-15 14:50 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-15 14:50:02 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: blocker+


Attachments
server, vdsm, engine.log (752.10 KB, application/x-gzip)
2017-01-22 18:21 UTC, Kevin Alon Goldblatt


Links
oVirt gerrit 71034 (master, MERGED): core: StorageJobCallback - jobs fencing (last updated 2017-01-24 13:02:01 UTC)
oVirt gerrit 71100 (ovirt-engine-4.1, MERGED): core: StorageJobCallback - jobs fencing (last updated 2017-01-24 15:46:00 UTC)
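
The linked patches add "fencing" of storage jobs to the StorageJobCallback polling logic. Reading the patch title together with the behaviour reported below, the idea appears to be: when the host that was executing a storage job restarts and no longer reports the job, the engine fails the job instead of waiting on it indefinitely, so the parent command can complete and the disk lock is released. A minimal Python sketch of that idea (illustrative only; every class and method name here is made up, and the real implementation is Java code inside ovirt-engine):
-----------------------------------
# Conceptual sketch only -- not the actual (Java) StorageJobCallback code.
# Idea: a storage job whose executing host restarted and no longer reports it
# will never complete, so fence it (mark it failed) instead of polling forever;
# the parent command can then finish and the disk lock is released.
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class Job(object):
    def __init__(self, job_id):
        self.id = job_id
        self.status = JobStatus.RUNNING


class Host(object):
    """Stand-in for the host executing the job; forgets jobs after a restart."""
    def __init__(self):
        self.jobs = {}

    def report_job(self, job_id):
        # None means the host no longer knows about the job.
        return self.jobs.get(job_id)


def poll(job, host):
    reported = host.report_job(job.id)
    if reported is None and job.status is JobStatus.RUNNING:
        # Fencing: the job vanished from the host (e.g. vdsm was restarted),
        # so fail it rather than leaving the parent command waiting forever.
        job.status = JobStatus.FAILED
    return job.status


# Example: the host restarts and loses the job; the next poll fences it.
host = Host()
job = Job("1642bfe7-eae0-4ad1-b20f-73eaff7c1f00")
host.jobs.clear()            # simulate the vdsm restart on the SPM
assert poll(job, host) is JobStatus.FAILED
-----------------------------------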

Description Kevin Alon Goldblatt 2017-01-22 17:51:00 UTC
Description of problem:
Failure during cold move with new HSM infra causes disk to remain locked

Version-Release number of selected component (if applicable):
vdsm-4.19.2-1.gitd9c3ccb.el7.centos.x86_64
ovirt-engine-4.1.0.3-0.0.master.20170122091652.gitc6fc2c2.el7.centos.noarch

How reproducible:


Steps to Reproduce:
1. Create VM with preallocated disk and power VM off
2. Select to Move the block disk to another domain
3. Restart VDSM on the SPM as soon as the CopyDataCommand is reported in the engine log
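
A rough way to time step 3 consistently (an illustrative sketch only: the engine log path, the SPM hostname and passwordless root SSH are assumptions of this illustration, not part of the report) is to tail the engine log and restart vdsmd the moment CopyDataCommand shows up:
-----------------------------------
#!/usr/bin/env python
# Repro helper sketch: watch engine.log for CopyDataCommand and immediately
# restart vdsmd on the SPM. Log path, SPM hostname and passwordless root SSH
# are assumptions for illustration, not taken from this report.
import subprocess
import time

ENGINE_LOG = "/var/log/ovirt-engine/engine.log"   # assumed default location
SPM_HOST = "purple-vds2.qa.lab.tlv.redhat.com"    # SPM host seen in these logs


def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)
                continue
            yield line


for line in follow(ENGINE_LOG):
    if "CopyDataCommand" in line and "Running command" in line:
        print("CopyDataCommand started, restarting vdsmd on the SPM...")
        subprocess.call(["ssh", "root@" + SPM_HOST, "systemctl", "restart", "vdsmd"])
        break
-----------------------------------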

Actual results:
The move fails and the disk remains reported as locked; the progress bar is also stuck at 16%

Expected results:
Move should succeed

Additional info:
Engine log
-----------------------------------
2017-01-22 19:22:57,448+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CopyDataCommand] (DefaultQuartzScheduler5) [49fca5e1] Running command: CopyDataCommand internal: true. 
******** VDSM on the SPM is restarted at this point ********

2017-01-22 19:23:05,993+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Connection reset by peer
2017-01-22 19:23:09,822+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /10.35.102.54
2017-01-22 19:23:10,232+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM purple-vds2.qa.lab.tlv.redhat.com command GetStatsVDS failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler9) [6cfa57ec] Command 'GetStatsVDSCommand(HostName = purple-vds2.qa.lab.tlv.redhat.com, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='f47abe9f-e366-4004-ab89-fb3ded9c02bd', vds='Host[purple-vds2.qa.lab.tlv.redhat.com,f47abe9f-e366-4004-ab89-fb3ded9c02bd]'})' execution failed: Recovering from crash or Initializing
2017-01-22 19:23:10,233+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (DefaultQuartzScheduler9) [6cfa57ec] Failed getting vds stats, host='purple-vds2.qa.lab.tlv.redhat.com'(f47abe9f-e366-4004-ab89-fb3ded9c02bd): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
2017-01-22 19:23:10,250+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [6cfa57ec] EVENT_ID: VDS_INITIALIZING(514), Correlation ID: null, Call Stack: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:42)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111)
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73)
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
        at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14)
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsStats(HostMonitoring.java:391)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:115)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:85)
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:274)
        at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77)
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[...]

20 minutes later, the engine is still waiting on the child command:

2017-01-22 19:48:15,549+02 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler9) [49fca5e1] Command 'CopyImageGroupVolumesData' (id: '0f1f47d9-f321-44fe-a5e8-ab21f0da59eb') waiting on child command id: '1642bfe7-eae0-4ad1-b20f-73eaff7c1f00' type:'CopyData' to complete

Comment 1 Kevin Alon Goldblatt 2017-01-22 18:21:17 UTC
Created attachment 1243374 [details]
server, vdsm, engine.log

Adding logs

Comment 2 Tal Nisan 2017-01-22 19:24:17 UTC
Kevin, next time please supply trimmed logs; the engine log alone here is 7MB and contains 4 days of logs which are totally irrelevant for us

Comment 3 Liron Aravot 2017-01-30 13:37:04 UTC
*** Bug 1417456 has been marked as a duplicate of this bug. ***

Comment 4 Kevin Alon Goldblatt 2017-02-05 19:41:44 UTC
Verified with the following versions:
----------------------------------------------------------------
Version-Release number of selected component (if applicable):
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch
ovirt-engine-4.1.0.3-0.1.el7.noarch


Verified with the following scenario:
---------------------------------------------------------------

Steps to Reproduce:
1. Create a VM with some disks, block and nfs, preallocated and thin
2. Move all the disks to different domains and restart VDSM on the SPM as soon as the CopyDataCommand is reported in the engine log >>>>> Some disks are moved successfully as expected; the rest fail but are no longer left in a locked state. Moving the failed disks again works fine.


Moving to VERIFIED!

