Bug 1565040

Summary: Engine stuck on CopyData despite task completion in vdsm
Summary: Engine stuck on CopyData despite task completion in vdsm
Product: [oVirt] vdsm
Reporter: Kevin Alon Goldblatt <kgoldbla>
Component: Core
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED CURRENTRELEASE
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.23
CC: bugs, bzlotnik, ishaby, kgoldbla, tnisan, ylavi
Target Milestone: ovirt-4.2.5
Flags: rule-engine: ovirt-4.2+
       ylavi: exception+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-31 15:27:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1565052
Attachments:
engine and vdsm logs
engine and vdsm logs

Description Kevin Alon Goldblatt 2018-04-09 08:39:50 UTC
Created attachment 1419151 [details]
engine and vdsm logs

Description of problem:
During a cold move, access to the target storage domain was blocked. After restoring the access, the job is completed by vdsm and reported as finished, but the engine remains stuck waiting for the job to complete and the disk stays locked.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. On a VM with disks, write some data
2. Stop the VM and start a cold move of the disks
3. During the cold move, block access to the target storage with iptables (a sketch of one way to script this follows the list)
4. The host becomes non-operational
5. After about 2 hours, restore the connection; the host comes back up
6. vdsm completes the job and reports it as finished, however the engine never finishes it and the disk remains locked
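
For step 3, a minimal sketch of one way to script the storage block on the host, assuming a single NFS server address (the address, rule placement and helper names below are placeholders, not taken from this report):

#!/usr/bin/env python
# Hypothetical helper for step 3: block/unblock traffic to the target storage
# server with iptables. STORAGE_IP is a placeholder; the report does not state
# the actual address or rules used.
import subprocess

STORAGE_IP = "192.0.2.10"  # placeholder address of the target NFS server

def block_storage():
    # Drop all outgoing packets to the storage server (covers NFS traffic).
    subprocess.check_call(
        ["iptables", "-I", "OUTPUT", "-d", STORAGE_IP, "-j", "DROP"])

def unblock_storage():
    # Remove the rule added above to restore connectivity (step 5).
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", STORAGE_IP, "-j", "DROP"])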

Actual results:
vdsm completes the job and reports it as finished; however, the engine never finishes the operation and the disk remains locked

Expected results:
vdsm completes the job and reports it as finished, and the engine completes the operation. The disk is no longer locked


Additional info:
------ *****  START THE MOVE DISK OPERATION HERE  ***** ----
2018-04-08 16:23:23,709+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] EVENT_ID: USER_MOVED_DISK(2,008), User admin@internal-authz moving disk vm08042018a_Disk1 to domain nfs_2.
2018-04-08 16:23:23,720+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Lock freed to object 'EngineLock:{exclusiveLocks='[6d309216-37e7-4138-b09c-596a2838dfab=DISK]', sharedLocks=''}'
2018-04-08 16:23:23,730+03 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Trying to release exclusive lock which does not exist, lock key: '6d309216-37e7-4138-b09c-596a2838dfabDISK'
2018-04-08 16:23:23,730+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Lock freed to object 'EngineLock:{exclusiveLocks='[6d309216-37e7-4138-b09c-596a2838dfab=DISK]', sharedLocks=''}'
2018-04-08 16:23:23,770+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-25) [7add9547-a996-4797-8cf4-2821e8eb1e69] Lock Acquired to object 'EngineLock:{exclusiveLocks='[874c3126-144e-4dd1-abc3-292a208782d2=DISK]', sharedLocks=''}'
2018-04-08 16:23:23,778+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-25) [7add9547-a996-4797-8cf4-2821e8eb1e69] Running command: MoveDiskCommand internal: false. Entities affected :  ID: 874c3126-144e-4dd1-abc3-292a208782d2 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER,  ID: f3d0fa77-a26f-4e99-99db-2d0a1770acbe Type: StorageAction group CREATE_DISK with role type USER
2018-04-08 16:23:24,116+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command 'MoveOrCopyDisk' (id: '546f807e-6ba4-438b-85a8-cc644e4ccfbd') waiting on child command id: '77d60ad1-4767-4ca6-ab03-81e5626e8599' type:'MoveImageGroup' to complete
2018-04-08 16:23:24,126+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command 'CopyImageGroupWithData' (id: 'f5db3294-c18a-4a97-9ea6-d47eccfe53f1') waiting on child command id: '4feb21b1-695d-46f4-a2b9-d542cb61a1ec' type:'CloneImageGroupVolumesStructure' to complete
2018-04-08 16:23:24,127+03 INFO  [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (default task-25) [7add9547-a996-4797-8cf4-2821e8eb1e69] Running command: MoveOrCopyDiskCommand internal: true. Entities affected :  ID: 874c3126-144e-4dd1-abc3-292a208782d2 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER,  ID: f3d0fa77-a26f-4e99-99db-2d0a1770acbe Type: StorageAction group CREATE_DISK with role type USER
2018-04-08 16:23:24,131+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CloneImageGroupVolumesStructureCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Starting child command 1 of 1, image 'b82bed28-b15d-4c1e-8e0f-08a312697933'
2018-04-08 16:23:24,193+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Running command: CreateVolumeContainerCommand internal: true.
.
.
. ---- ***** ADDING TASK HERE ***** ----
2018-04-08 16:23:25,132+03 INFO  [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Adding task 'afcd1e8e-877e-4035-93ba-b9ff29168239' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2018-04-08 16:23:25,150+03 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] BaseAsyncTask::startPollingTask: Starting to poll task 'afcd1e8e-877e-4035-93ba-b9ff29168239'.
2018-04-08 16:23:25,178+03 INFO  [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command 'MoveImageGroup' (id: '77d60ad1-4767-4ca6-ab03-81e5626e8599') waiting on child command id: 'f5db3294-c18a-4a97-9ea6-d47eccfe53f1' type:'CopyImageGroupWithData' to complete
.
.
.---- ******* STORAGE HEARTBEAT NO LONGER RECEIVED ***** ----
74c3126-144e-4dd1-abc3-292a208782d2', postZeros='false', discard='true', forceDelete='false'}), log id: 17fffde
2018-04-08 16:25:19,107+03 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Heartbeat exceeded for host 'storage-ge10-vdsm1.scl.lab.tlv.redhat.com', last response arrived 7121 ms ago.
2018-04-08 16:25:19,119+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command SetVolumeDescriptionVDS failed: Heartbeat exceeded
2018-04-08 16:25:19,121+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] ERROR, SetVolumeDescriptionVDSCommand( SetVolumeDescriptionVDSCommandParameters:{storagePoolId='a9e2b4b2-6c07-4af8-8fb0-5675878e2376', ignoreFailoverLimit='false', storageDomainId='2d0e3644-2ba7-428b-855c-13d35c4e650f', imageGroupId='ee2bb254-ffab-4d7c-9fc4-46712b2cf0b2', imageId='f369d96d-f77c-44b3-9871-647b479d3c50'}), exception: VDSGenericException: VDSNetworkException: Heartbeat exceeded, log id: 3d9ea579
2018-04-08 16:25:19,121+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] Exception
2018-04-08 16:25:19,149+03 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to storage-ge10-vdsm1.scl.lab.tlv.redhat.com/10.35.82.87
2018-04-08 16:25:19,153+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] Command 'SpmStatusVDSCommand(HostName = host_mixed_1, SpmStatusVDSCommandParameters:{hostId='f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7', storagePoolId='a9e2b4b2-6c07-4af8-8fb0-5675878e2376'})' execution failed: java.net.ConnectException: Connection refused
2018-04-08 16:25:19,153+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] Irs placed on server 'f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7' failed. Proceed Failover
.
.
. ---- ***** HOST BACK UP HERE *** ---
2018-04-08 18:33:00,984+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [247a258f] EVENT_ID: VDS_DETECTED(13), Status of host host_mixed_1 was set to Up.
.
.
. ---- ***** Still waiting for task to complete several hours later - THIS CONTINUES FOR 50 HOURS *** ---
2018-04-09 01:23:54,333+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] START, GetHostJobsVDSCommand(HostName = host_mixed_1, GetHostJobsVDSCommandParameters:{hostId='f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7', type='storage', jobIds='[1babe44f-82e2-451b-9d0d-87bd024c607c]'}), log id: 3fa59fa5
2018-04-09 01:23:54,344+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] FINISH, GetHostJobsVDSCommand, return: {}, log id: 3fa59fa5
2018-04-09 01:23:54,349+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command CopyData id: '5e547444-7c34-4aa1-b50f-518d36d2931c': attempting to determine the job status by polling the entity.
2018-04-09 01:23:54,361+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] START, GetVolumeInfoVDSCommand(HostName = host_mixed_2, GetVolumeInfoVDSCommandParameters:{hostId='fea2403a-5581-4920-80db-a015412f72ef', storagePoolId='a9e2b4b2-6c07-4af8-8fb0-5675878e2376', storageDomainId='f3d0fa77-a26f-4e99-99db-2d0a1770acbe', imageGroupId='6d309216-37e7-4138-b09c-596a2838dfab', imageId='b82bed28-b15d-4c1e-8e0f-08a312697933'}), log id: 61872b66
2018-04-09 01:23:54,436+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] FINISH, GetVolumeInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@177c3c9a, log id: 61872b66
2018-04-09 01:23:54,436+03 INFO  [org.ovirt.engine.core.bll.storage.disk.image.VdsmImagePoller] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command CopyData id: '5e547444-7c34-4aa1-b50f-518d36d2931c': the volume lease is not FREE - the job is running
2018-04-09 01:23:54,436+03 INFO  [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command CopyData id: '5e547444-7c34-4aa1-b50f-518d36d2931c': waiting for job '1babe44f-82e2-451b-9d0d-87bd024c607c' on host 'host_mixed_1' (id: 'f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7') to complete
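
The three messages above show the loop the engine is stuck in: GetHostJobs returns no entry for the job, so StorageJobCallback falls back to polling the destination volume, and because the volume lease never reports FREE the poller keeps concluding that the job is still running. The following is an illustrative Python sketch of that fallback; the real logic is the engine's Java StorageJobCallback/VdsmImagePoller code and all names here are invented for illustration only:

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class VolumeInfo:
    lease_status: str   # e.g. "FREE", or held by some host
    generation: int

def poll_copy_data(job_id: str,
                   expected_generation: int,
                   get_host_jobs: Callable[[str, List[str]], Dict[str, dict]],
                   get_volume_info: Callable[[], VolumeInfo]) -> str:
    # 1. Ask vdsm about the job (GetHostJobsVDSCommand in the log above).
    jobs = get_host_jobs("storage", [job_id])
    if job_id in jobs:
        return jobs[job_id]["status"]   # vdsm still tracks the job

    # 2. Fallback: vdsm no longer reports the job, so poll the entity instead
    #    (GetVolumeInfoVDSCommand in the log above).
    info = get_volume_info()
    if info.lease_status != "FREE":
        # This is where the engine loops in this bug: after sanlock killed
        # vdsm, the on-disk lease owner record is stale, the lease never
        # looks FREE, and the poller keeps reporting "the job is running".
        return "running"
    # Otherwise decide success or failure from the entity itself, e.g. whether
    # the destination volume generation was advanced as expected.
    return "finished" if info.generation >= expected_generation else "failed"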


VDSM.log:
--------------------------------------------------------------
--- *** START COPY OPERATION **** ---
2018-04-08 16:23:44,471+0300 INFO  (jsonrpc/7) [vdsm.api] START sdm_copy_data(job_id='1babe44f-82e2-451b-9d0d-87bd024c607c', source={'img_id': '6d309216-37e7-4138-b09c-596a2838dfab', 'sd_id': '74668615-6112-4f5c-b0f0-8f817da34df6', 'endpoint_type': 'div', 'vol_id': 'b82bed28-b15d-4c1e-8e0f-08a312697933'}, destination={'generation': 0, 'img_id': '6d309216-37e7-4138-b09c-596a2838dfab', 'sd_id': 'f3d0fa77-a26f-4e99-99db-2d0a1770acbe', 'endpoint_type': 'div', 'vol_id': 'b82bed28-b15d-4c1e-8e0f-08a312697933'}) from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a, task_id=b65a563b-e25e-42d1-b9ec-4aeda4a84abd (api:46)
2018-04-08 16:23:44,473+0300 DEBUG (jsonrpc/7) [storage.TaskManager] scheduled job copy_data for task b65a563b-e25e-42d1-b9ec-4aeda4a84abd  (taskManager:79)
2018-04-08 16:23:44,474+0300 INFO  (jsonrpc/7) [vdsm.api] FINISH sdm_copy_data return=None from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a, task_id=b65a563b-e25e-42d1-b9ec-4aeda4a84abd (api:52)
2018-04-08 16:23:44,474+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') Prepare: 1 jobs exist, move to acquiring (task:1193)
2018-04-08 16:23:44,475+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') moving from state preparing -> state acquiring (task:602)
2018-04-08 16:23:44,475+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') moving from state acquiring -> state queued (task:602)
2018-04-08 16:23:44,476+0300 DEBUG (jsonrpc/7) [storage.TaskManager] queuing task: b65a563b-e25e-42d1-b9ec-4aeda4a84abd (taskManager:57)
2018-04-08 16:23:44,477+0300 DEBUG (jsonrpc/7) [storage.TaskManager] task queued: b65a563b-e25e-42d1-b9ec-4aeda4a84abd (taskManager:65)
2018-04-08 16:23:44,478+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') returning (task:1198)
2018-04-08 16:23:44,478+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') ref 0 aborting False (task:1002)
2018-04-08 16:23:44,479+0300 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call SDM.copy_data succeeded in 0.01 seconds (__init__:573)
2018-04-08 16:23:44,477+0300 DEBUG (tasks/4) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2018-04-08 16:23:44,480+0300 INFO  (tasks/4) [storage.ThreadPool.WorkerThread] START task b65a563b-e25e-42d1-b9ec-4aeda4a84abd (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f2391add5f0>>, args=None) (threadPool:208)
2018-04-08 16:23:44,481+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') committing task: b65a563b-e25e-42d1-b9ec-4aeda4a84abd (task:1210)
2018-04-08 16:23:44,482+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') moving from state queued -> state running (task:602)
2018-04-08 16:23:44,482+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') Task.run: running job 0: copy_data: <bound method Job.run of <Job id=1babe44f-82e2-451b-9d0d-87bd024c607c status=pending at 0x139790744739152>> (args: () kwargs: {}) (task:914)
2018-04-08 16:23:44,483+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') Job.run: running copy_data: <bound method Job.run of <Job id=1babe44f-82e2-451b-9d0d-87bd024c607c status=pending at 0x139790744739152>> (args: () kwargs: {}) callback None (task:333)
2018-04-08 16:23:44,483+0300 INFO  (tasks/4) [root] Running job '1babe44f-82e2-451b-9d0d-87bd024c607c'... (jobs:183)
2018-04-08 16:23:44,485+0300 DEBUG (tasks/4) [storage.ResourceManager] Trying to register resource '00_storage.74668615-6112-4f5c-b0f0-8f817da34df6' for lock type 'shared' (resourceManager:495)
2018-04-08 16:23:44,486+0300 DEBUG (tasks/4) [storage.ResourceManager] Resource '00_storage.74668615-6112-4f5c-b0f0-8f817da34df6' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-04-08 16:23:44,487+0300 DEBUG (tasks/4) [storage.ResourceManager.Request] (ResName='00_storage.74668615-6112-4f5c-b0f0-8f817da34df6', ReqID='7c872634-5fa2-4cc7-8a36-7d228734d9c4') Granted request (resourceManager:222)
2018-04-08 16:23:44,488+0300 DEBUG (tasks/4) [storage.ResourceManager] Trying to register resource '00_storage.f3d0fa77-a26f-4e99-99db-2d0a1770acbe' for lock type 'shared' (resourceManager:495)
2018-04-08 16:23:44,489+0300 DEBUG (tasks/4) [storage.ResourceManager] Resource '00_storage.f3d0fa77-a26f-4e99-99db-2d0a1770acbe' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-04-08 16:23:44,489+0300 DEBUG (tasks/4) [storage.ResourceManager.Request] (ResName='00_storage.f3d0fa77-a26f-4e99-99db-2d0a1770acbe', ReqID='1a6cd1d6-bfdb-47e3-a201-d2bf1fde31a6') Granted request (resourceManager:222)
2018-04-08 16:23:44,491+0300 DEBUG (tasks/4) [storage.ResourceManager] Trying to register resource '01_img_74668615-6112-4f5c-b0f0-8f817da34df6.6d309216-37e7-4138-b09c-596a2838dfab' for lock type 'shared' (resourceManager:495)
2018-04-08 16:23:44,493+0300 DEBUG (tasks/4) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a5160049e|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 74668615-6112-4f5c-b0f0-8f817da34df6 (cwd None) (commands:65)
2018-04-08 16:23:44,570+0300 INFO  (jsonrpc/2) [api.host] START getJobs(job_type='storage', job_ids=['1babe44f-82e2-451b-9d0d-87bd024c607c']) from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:46)
2018-04-08 16:23:44,572+0300 INFO  (jsonrpc/2) [api.host] FINISH getJobs return={'status': {'message': 'Done', 'code': 0}, 'jobs': {'1babe44f-82e2-451b-9d0d-87bd024c607c': {'status': 'running', 'job_type': 'storage', 'id': '1babe44f-82e2-451b-9d0d-87bd024c607c', 'description': 'copy_data'}}} from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:52)

--- *** FINISH job reported *** ---
2018-04-08 18:33:06,438+0300 INFO  (jsonrpc/2) [api.host] START getJobs(job_type='storage', job_ids=['1babe44f-82e2-451b-9d0d-87bd024c607c']) from=::ffff:10.35.162.9,42792, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:46)
2018-04-08 18:33:06,439+0300 INFO  (jsonrpc/2) [api.host] FINISH getJobs return={'status': {'message': 'Done', 'code': 0}, 'jobs': {}} from=::ffff:10.35.162.9,42792, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:52)
2018-04-08 18:33:06,440+0300 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getJobs succeeded in 0.00 seconds (__init__:573)
2018-04-08 18:33:06,568+0300 DEBUG (check/loop) [storage.check] START check u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge10__nfs__1/9e8c3e55-8a87-4528-9bbe-fae1f0ee05eb/dom_md/metadata' (delay=0.00) (check:289)

Comment 1 Yaniv Lavi 2018-04-11 08:27:51 UTC
Is this a regression from 4.1?

Comment 2 Idan Shaby 2018-04-15 09:08:10 UTC
Hi Kevin,

I am not able to reproduce this bug.
Can you please share some more info about the steps to reproduce?
How many disks, their size, the domain type, how many hosts in the DC (one or more), etc.?

I blocked the connection to the storage server, the operation failed and after a few minutes the host went into a non-operational state. However, after I brought back the connection to the storage server, the host went up, but the cold move did not start again and nothing else happened.

Thanks

Comment 4 Kevin Alon Goldblatt 2018-04-17 16:11:07 UTC
(In reply to Yaniv Lavi from comment #1)
> Is this a regression from 4.1?

Could not reproduce this on 4.1

Comment 5 Kevin Alon Goldblatt 2018-04-24 08:34:28 UTC
The same issue occurred on the automation tier4 run.

See https://bugzilla.redhat.com/show_bug.cgi?id=1571142

Comment 6 Kevin Alon Goldblatt 2018-04-24 08:46:03 UTC
This seems to be an intermittent issue, as it does not happen every time.
During my first attempt it happened 2 out of 2 times on 4.2
During my second attempt it happened 1 out of 4 times on 4.2
During the automation run on tier 4 it happened 2 out of 6 times
On 4.1 it did not happen in 4 attempts

Regards,

Kevin

Comment 7 Yaniv Kaul 2018-04-24 11:00:28 UTC
This is neither a blocker nor an exception, yet it is targeted for 4.2.3?

Comment 9 Benny Zlotnik 2018-04-28 11:57:35 UTC
In my attempt to reproduce I found that:

The lease was acquired here:
2018-04-26 21:40:34 187302 [23137]: s25:r27 resource 77847bfc-b39c-4406-8377-a9515267318a:b6956d4e-d0e0-4cb1-84f1-373dd661b1b5:/rhev/data-center/mnt/10.35.0.233:_root_storage__domains_sd1/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease:0 for 2,16,29591

This is the sanlock.log, where sanlock kills vdsm after the storage block:
2018-04-26 21:42:26 187414 [19686]: s25 renewal error -2 delta_length 20 last_success 187302
2018-04-26 21:42:27 187415 [22765]: s25 kill 29591 sig 15 count 34
2018-04-26 21:42:28 187416 [22765]: s25 kill 29591 sig 15 count 35
2018-04-26 21:42:29 187417 [22765]: s25 kill 29591 sig 15 count 36
2018-04-26 21:42:30 187418 [22765]: s25 kill 29591 sig 15 count 37
2018-04-26 21:42:31 187419 [22765]: s25 kill 29591 sig 15 count 38
2018-04-26 21:42:32 187420 [22765]: s25 kill 29591 sig 15 count 39
2018-04-26 21:42:33 187421 [22765]: s25 kill 29591 sig 15 count 40
2018-04-26 21:42:34 187422 [22765]: s25 kill 29591 sig 9 count 41
2018-04-26 21:42:34 187422 [22765]: dead 29591 ci 2 count 41


This is what the lease currently holds; the timestamp 187302 is the same as in the log above, done for pid 29591:
[root@localhost ~]# sanlock direct dump /rhev/data-center/55ab06f0-3b0b-11e8-9eaa-507b9dec63c2/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000 77847bfc-b39c-4406-8377-a9515267318a             b6956d4e-d0e0-4cb1-84f1-373dd661b1b5 0000187302 0001 0018 2



Manually acquiring the lease works; the current pid of vdsm is 32231:
[root@localhost ~]# sanlock client acquire -r 77847bfc-b39c-4406-8377-a9515267318a:b6956d4e-d0e0-4cb1-84f1-373dd661b1b5:/rhev/data-center/55ab06f0-3b0b-11e8-9eaa-507b9dec63c2/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease:0 -p 32231
acquire pid 32231
acquire done 0

[root@localhost ~]# sanlock direct dump /rhev/data-center/55ab06f0-3b0b-11e8-9eaa-507b9dec63c2/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease
  offset                            lockspace                                         resource  timestamp  own  gen lver
00000000 77847bfc-b39c-4406-8377-a9515267318a             b6956d4e-d0e0-4cb1-84f1-373dd661b1b5 0000188231 0001 0019 3

Now we see that the generation is 19, while the previous generation (before the storage block) was 18.

So vdsm should probably take the host generation into account when checking the lease status.
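
A minimal sketch of that direction, assuming the sanlock Python bindings expose read_resource_owners() and get_hosts() the way vdsm's clusterlock code uses them (the exact signatures and dictionary keys should be treated as assumptions):

import sanlock

def lease_is_stale(lockspace, resource, lease_path, offset=0):
    # Who does the on-disk lease record say owns the resource?
    owners = sanlock.read_resource_owners(lockspace, resource,
                                          [(lease_path, offset)])
    if not owners:
        return True   # nobody is recorded as holding the lease

    owner = owners[0]
    # What generation does that host currently have in the lockspace?
    hosts = sanlock.get_hosts(lockspace, owner["host_id"])
    if not hosts:
        return True   # the recorded owner is not in the lockspace at all

    # If the lease was taken at an older host generation (generation 18 in
    # the dump above, while the host rejoined the lockspace as generation 19),
    # the process that acquired it was killed and the lease is stale, so the
    # job it belonged to can be treated as no longer running.
    return owner["generation"] < hosts[0]["generation"]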

Comment 10 Sandro Bonazzola 2018-06-15 12:13:39 UTC
This bug is not marked as a blocker, and we have entered the blocker-only phase for 4.2.4.
Please consider re-targeting to 4.2.5.

Comment 15 Kevin Alon Goldblatt 2018-07-12 09:24:56 UTC
Created attachment 1458298 [details]
engine and vdsm logs

Added new logs after running the scenario again with a longer wait before blocking access to the storage

Comment 16 Kevin Alon Goldblatt 2018-07-12 13:45:34 UTC
Based on comment 14 and an additional successful run, moving to VERIFIED

Comment 17 Sandro Bonazzola 2018-07-31 15:27:26 UTC
This bug is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.