Created attachment 1419151 [details]
engine and vdsm logs

Description of problem:
During a cold move, access to the target storage domain was blocked. After
restoring the access, the job is completed by VDSM and reported as finished,
but the engine remains stuck waiting for the job to complete and the disk
stays locked.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. On a VM with disks, write some data
2. Stop the VM and start a cold move of the disks
3. During the cold move, block access to the target storage with iptables
   (see the sketch after the Expected results section)
4. The host becomes non-operational
5. After about 2 hours, restore the connection; the host comes back up
6. The job is completed by VDSM and reported as finished, however the engine
   never finishes the job and the disk remains locked

Actual results:
The job is completed by VDSM and reported as finished, however the engine
never finishes the job and the disk remains locked.

Expected results:
The job is completed by VDSM and reported as finished, and the engine
completes the task. The disk is no longer locked.
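For reference, a minimal sketch of how the storage block in steps 3 and 5
could be scripted from the host. This is illustrative only and not taken from
the actual test run; TARGET_STORAGE is an assumption and must be replaced
with the address of the server backing the target domain (nfs_2 above).

import subprocess

# Hypothetical address of the server backing the target storage domain.
TARGET_STORAGE = "10.35.0.233"

def block_target_storage():
    # Step 3: drop all traffic from the host to the target storage server.
    subprocess.check_call(
        ["iptables", "-I", "OUTPUT", "-d", TARGET_STORAGE, "-j", "DROP"])

def restore_target_storage():
    # Step 5: remove the DROP rule to restore connectivity.
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", TARGET_STORAGE, "-j", "DROP"])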
Additional info:

------ ***** START THE MOVE DISK OPERATION HERE ***** ----
2018-04-08 16:23:23,709+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] EVENT_ID: USER_MOVED_DISK(2,008), User admin@internal-authz moving disk vm08042018a_Disk1 to domain nfs_2.
2018-04-08 16:23:23,720+03 INFO [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Lock freed to object 'EngineLock:{exclusiveLocks='[6d309216-37e7-4138-b09c-596a2838dfab=DISK]', sharedLocks=''}'
2018-04-08 16:23:23,730+03 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Trying to release exclusive lock which does not exist, lock key: '6d309216-37e7-4138-b09c-596a2838dfabDISK'
2018-04-08 16:23:23,730+03 INFO [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-20) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Lock freed to object 'EngineLock:{exclusiveLocks='[6d309216-37e7-4138-b09c-596a2838dfab=DISK]', sharedLocks=''}'
2018-04-08 16:23:23,770+03 INFO [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-25) [7add9547-a996-4797-8cf4-2821e8eb1e69] Lock Acquired to object 'EngineLock:{exclusiveLocks='[874c3126-144e-4dd1-abc3-292a208782d2=DISK]', sharedLocks=''}'
2018-04-08 16:23:23,778+03 INFO [org.ovirt.engine.core.bll.storage.disk.MoveDiskCommand] (default task-25) [7add9547-a996-4797-8cf4-2821e8eb1e69] Running command: MoveDiskCommand internal: false. Entities affected : ID: 874c3126-144e-4dd1-abc3-292a208782d2 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER, ID: f3d0fa77-a26f-4e99-99db-2d0a1770acbe Type: StorageAction group CREATE_DISK with role type USER
2018-04-08 16:23:24,116+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command 'MoveOrCopyDisk' (id: '546f807e-6ba4-438b-85a8-cc644e4ccfbd') waiting on child command id: '77d60ad1-4767-4ca6-ab03-81e5626e8599' type:'MoveImageGroup' to complete
2018-04-08 16:23:24,126+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command 'CopyImageGroupWithData' (id: 'f5db3294-c18a-4a97-9ea6-d47eccfe53f1') waiting on child command id: '4feb21b1-695d-46f4-a2b9-d542cb61a1ec' type:'CloneImageGroupVolumesStructure' to complete
2018-04-08 16:23:24,127+03 INFO [org.ovirt.engine.core.bll.storage.disk.MoveOrCopyDiskCommand] (default task-25) [7add9547-a996-4797-8cf4-2821e8eb1e69] Running command: MoveOrCopyDiskCommand internal: true. Entities affected : ID: 874c3126-144e-4dd1-abc3-292a208782d2 Type: DiskAction group CONFIGURE_DISK_STORAGE with role type USER, ID: f3d0fa77-a26f-4e99-99db-2d0a1770acbe Type: StorageAction group CREATE_DISK with role type USER
2018-04-08 16:23:24,131+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.CloneImageGroupVolumesStructureCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Starting child command 1 of 1, image 'b82bed28-b15d-4c1e-8e0f-08a312697933'
2018-04-08 16:23:24,193+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.CreateVolumeContainerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Running command: CreateVolumeContainerCommand internal: true.
.
.
.
---- ***** ADDING TASK HERE ***** ----
2018-04-08 16:23:25,132+03 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Adding task 'afcd1e8e-877e-4035-93ba-b9ff29168239' (Parent Command 'CreateVolumeContainer', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters'), polling hasn't started yet..
2018-04-08 16:23:25,150+03 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] BaseAsyncTask::startPollingTask: Starting to poll task 'afcd1e8e-877e-4035-93ba-b9ff29168239'.
2018-04-08 16:23:25,178+03 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command 'MoveImageGroup' (id: '77d60ad1-4767-4ca6-ab03-81e5626e8599') waiting on child command id: 'f5db3294-c18a-4a97-9ea6-d47eccfe53f1' type:'CopyImageGroupWithData' to complete
.
.
.
---- ******* STORAGE HEARTBEAT NO LONGER RECEIVED ***** ----
74c3126-144e-4dd1-abc3-292a208782d2', postZeros='false', discard='true', forceDelete='false'}), log id: 17fffde
2018-04-08 16:25:19,107+03 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Heartbeat exceeded for host 'storage-ge10-vdsm1.scl.lab.tlv.redhat.com', last response arrived 7121 ms ago.
2018-04-08 16:25:19,119+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command SetVolumeDescriptionVDS failed: Heartbeat exceeded
2018-04-08 16:25:19,121+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] ERROR, SetVolumeDescriptionVDSCommand( SetVolumeDescriptionVDSCommandParameters:{storagePoolId='a9e2b4b2-6c07-4af8-8fb0-5675878e2376', ignoreFailoverLimit='false', storageDomainId='2d0e3644-2ba7-428b-855c-13d35c4e650f', imageGroupId='ee2bb254-ffab-4d7c-9fc4-46712b2cf0b2', imageId='f369d96d-f77c-44b3-9871-647b479d3c50'}), exception: VDSGenericException: VDSNetworkException: Heartbeat exceeded, log id: 3d9ea579
2018-04-08 16:25:19,121+03 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] Exception
2018-04-08 16:25:19,149+03 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to storage-ge10-vdsm1.scl.lab.tlv.redhat.com/10.35.82.87
2018-04-08 16:25:19,153+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] Command 'SpmStatusVDSCommand(HostName = host_mixed_1, SpmStatusVDSCommandParameters:{hostId='f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7', storagePoolId='a9e2b4b2-6c07-4af8-8fb0-5675878e2376'})' execution failed: java.net.ConnectException: Connection refused
2018-04-08 16:25:19,153+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [5c533c95] Irs placed on server 'f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7' failed. Proceed Failover
.
.
.
---- ***** HOST BACK UP HERE *** ---
2018-04-08 18:33:00,984+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [247a258f] EVENT_ID: VDS_DETECTED(13), Status of host host_mixed_1 was set to Up.
.
.
.
---- ***** Still waiting for task to complete several hours later - THIS CONTINUES FOR 50 HOURS *** ---
2018-04-09 01:23:54,333+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] START, GetHostJobsVDSCommand(HostName = host_mixed_1, GetHostJobsVDSCommandParameters:{hostId='f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7', type='storage', jobIds='[1babe44f-82e2-451b-9d0d-87bd024c607c]'}), log id: 3fa59fa5
2018-04-09 01:23:54,344+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] FINISH, GetHostJobsVDSCommand, return: {}, log id: 3fa59fa5
2018-04-09 01:23:54,349+03 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command CopyData id: '5e547444-7c34-4aa1-b50f-518d36d2931c': attempting to determine the job status by polling the entity.
2018-04-09 01:23:54,361+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] START, GetVolumeInfoVDSCommand(HostName = host_mixed_2, GetVolumeInfoVDSCommandParameters:{hostId='fea2403a-5581-4920-80db-a015412f72ef', storagePoolId='a9e2b4b2-6c07-4af8-8fb0-5675878e2376', storageDomainId='f3d0fa77-a26f-4e99-99db-2d0a1770acbe', imageGroupId='6d309216-37e7-4138-b09c-596a2838dfab', imageId='b82bed28-b15d-4c1e-8e0f-08a312697933'}), log id: 61872b66
2018-04-09 01:23:54,436+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] FINISH, GetVolumeInfoVDSCommand, return: org.ovirt.engine.core.common.businessentities.storage.DiskImage@177c3c9a, log id: 61872b66
2018-04-09 01:23:54,436+03 INFO [org.ovirt.engine.core.bll.storage.disk.image.VdsmImagePoller] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command CopyData id: '5e547444-7c34-4aa1-b50f-518d36d2931c': the volume lease is not FREE - the job is running
2018-04-09 01:23:54,436+03 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-96) [02801675-9a0c-4bf8-ae3b-18e3828ff78a] Command CopyData id: '5e547444-7c34-4aa1-b50f-518d36d2931c': waiting for job '1babe44f-82e2-451b-9d0d-87bd024c607c' on host 'host_mixed_1' (id: 'f6e3b61d-a07a-40f1-820d-d7fd9c67f6f7') to complete

VDSM.log:
--------------------------------------------------------------
--- *** START COPY OPERATION **** ---
2018-04-08 16:23:44,471+0300 INFO (jsonrpc/7) [vdsm.api] START sdm_copy_data(job_id='1babe44f-82e2-451b-9d0d-87bd024c607c', source={'img_id': '6d309216-37e7-4138-b09c-596a2838dfab', 'sd_id': '74668615-6112-4f5c-b0f0-8f817da34df6', 'endpoint_type': 'div', 'vol_id': 'b82bed28-b15d-4c1e-8e0f-08a312697933'}, destination={'generation': 0, 'img_id': '6d309216-37e7-4138-b09c-596a2838dfab', 'sd_id': 'f3d0fa77-a26f-4e99-99db-2d0a1770acbe', 'endpoint_type': 'div', 'vol_id': 'b82bed28-b15d-4c1e-8e0f-08a312697933'}) from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a, task_id=b65a563b-e25e-42d1-b9ec-4aeda4a84abd (api:46)
2018-04-08 16:23:44,473+0300 DEBUG (jsonrpc/7) [storage.TaskManager] scheduled job copy_data for task b65a563b-e25e-42d1-b9ec-4aeda4a84abd (taskManager:79)
2018-04-08 16:23:44,474+0300 INFO (jsonrpc/7) [vdsm.api] FINISH sdm_copy_data return=None from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a, task_id=b65a563b-e25e-42d1-b9ec-4aeda4a84abd (api:52)
2018-04-08 16:23:44,474+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') Prepare: 1 jobs exist, move to acquiring (task:1193)
2018-04-08 16:23:44,475+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') moving from state preparing -> state acquiring (task:602)
2018-04-08 16:23:44,475+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') moving from state acquiring -> state queued (task:602)
2018-04-08 16:23:44,476+0300 DEBUG (jsonrpc/7) [storage.TaskManager] queuing task: b65a563b-e25e-42d1-b9ec-4aeda4a84abd (taskManager:57)
2018-04-08 16:23:44,477+0300 DEBUG (jsonrpc/7) [storage.TaskManager] task queued: b65a563b-e25e-42d1-b9ec-4aeda4a84abd (taskManager:65)
2018-04-08 16:23:44,478+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') returning (task:1198)
2018-04-08 16:23:44,478+0300 DEBUG (jsonrpc/7) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') ref 0 aborting False (task:1002)
2018-04-08 16:23:44,479+0300 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call SDM.copy_data succeeded in 0.01 seconds (__init__:573)
2018-04-08 16:23:44,477+0300 DEBUG (tasks/4) [storage.ThreadPool] Number of running tasks: 1 (threadPool:60)
2018-04-08 16:23:44,480+0300 INFO (tasks/4) [storage.ThreadPool.WorkerThread] START task b65a563b-e25e-42d1-b9ec-4aeda4a84abd (cmd=<bound method Task.commit of <vdsm.storage.task.Task instance at 0x7f2391add5f0>>, args=None) (threadPool:208)
2018-04-08 16:23:44,481+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') committing task: b65a563b-e25e-42d1-b9ec-4aeda4a84abd (task:1210)
2018-04-08 16:23:44,482+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') moving from state queued -> state running (task:602)
2018-04-08 16:23:44,482+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') Task.run: running job 0: copy_data: <bound method Job.run of <Job id=1babe44f-82e2-451b-9d0d-87bd024c607c status=pending at 0x139790744739152>> (args: () kwargs: {}) (task:914)
2018-04-08 16:23:44,483+0300 DEBUG (tasks/4) [storage.TaskManager.Task] (Task='b65a563b-e25e-42d1-b9ec-4aeda4a84abd') Job.run: running copy_data: <bound method Job.run of <Job id=1babe44f-82e2-451b-9d0d-87bd024c607c status=pending at 0x139790744739152>> (args: () kwargs: {}) callback None (task:333)
2018-04-08 16:23:44,483+0300 INFO (tasks/4) [root] Running job '1babe44f-82e2-451b-9d0d-87bd024c607c'... (jobs:183)
2018-04-08 16:23:44,485+0300 DEBUG (tasks/4) [storage.ResourceManager] Trying to register resource '00_storage.74668615-6112-4f5c-b0f0-8f817da34df6' for lock type 'shared' (resourceManager:495)
2018-04-08 16:23:44,486+0300 DEBUG (tasks/4) [storage.ResourceManager] Resource '00_storage.74668615-6112-4f5c-b0f0-8f817da34df6' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-04-08 16:23:44,487+0300 DEBUG (tasks/4) [storage.ResourceManager.Request] (ResName='00_storage.74668615-6112-4f5c-b0f0-8f817da34df6', ReqID='7c872634-5fa2-4cc7-8a36-7d228734d9c4') Granted request (resourceManager:222)
2018-04-08 16:23:44,488+0300 DEBUG (tasks/4) [storage.ResourceManager] Trying to register resource '00_storage.f3d0fa77-a26f-4e99-99db-2d0a1770acbe' for lock type 'shared' (resourceManager:495)
2018-04-08 16:23:44,489+0300 DEBUG (tasks/4) [storage.ResourceManager] Resource '00_storage.f3d0fa77-a26f-4e99-99db-2d0a1770acbe' is free. Now locking as 'shared' (1 active user) (resourceManager:552)
2018-04-08 16:23:44,489+0300 DEBUG (tasks/4) [storage.ResourceManager.Request] (ResName='00_storage.f3d0fa77-a26f-4e99-99db-2d0a1770acbe', ReqID='1a6cd1d6-bfdb-47e3-a201-d2bf1fde31a6') Granted request (resourceManager:222)
2018-04-08 16:23:44,491+0300 DEBUG (tasks/4) [storage.ResourceManager] Trying to register resource '01_img_74668615-6112-4f5c-b0f0-8f817da34df6.6d309216-37e7-4138-b09c-596a2838dfab' for lock type 'shared' (resourceManager:495)
2018-04-08 16:23:44,493+0300 DEBUG (tasks/4) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/lvm lvs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3514f0c5a5160049e|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,vg_name,attr,size,seg_start_pe,devices,tags 74668615-6112-4f5c-b0f0-8f817da34df6 (cwd None) (commands:65)
2018-04-08 16:23:44,570+0300 INFO (jsonrpc/2) [api.host] START getJobs(job_type='storage', job_ids=['1babe44f-82e2-451b-9d0d-87bd024c607c']) from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:46)
2018-04-08 16:23:44,572+0300 INFO (jsonrpc/2) [api.host] FINISH getJobs return={'status': {'message': 'Done', 'code': 0}, 'jobs': {'1babe44f-82e2-451b-9d0d-87bd024c607c': {'status': 'running', 'job_type': 'storage', 'id': '1babe44f-82e2-451b-9d0d-87bd024c607c', 'description': 'copy_data'}}} from=::ffff:10.35.162.9,43264, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:52)

--- *** FINISH job reported *** ---
2018-04-08 18:33:06,438+0300 INFO (jsonrpc/2) [api.host] START getJobs(job_type='storage', job_ids=['1babe44f-82e2-451b-9d0d-87bd024c607c']) from=::ffff:10.35.162.9,42792, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:46)
2018-04-08 18:33:06,439+0300 INFO (jsonrpc/2) [api.host] FINISH getJobs return={'status': {'message': 'Done', 'code': 0}, 'jobs': {}} from=::ffff:10.35.162.9,42792, flow_id=02801675-9a0c-4bf8-ae3b-18e3828ff78a (api:52)
2018-04-08 18:33:06,440+0300 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getJobs succeeded in 0.00 seconds (__init__:573)
2018-04-08 18:33:06,568+0300 DEBUG (check/loop) [storage.check] START check u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge10__nfs__1/9e8c3e55-8a87-4528-9bbe-fae1f0ee05eb/dom_md/metadata' (delay=0.00) (check:289)
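The engine log above shows the decision point that leads to the hang: once
Host.getJobs no longer reports the job, the engine falls back to polling the
entity, and because the destination volume lease is never seen as FREE it
keeps treating the job as running. Below is a minimal, illustrative Python
sketch of that fallback logic. It is not the engine code (which is Java);
get_host_jobs and read_volume_lease are hypothetical stand-ins for the real
calls.

import time

POLL_INTERVAL = 10  # seconds between polls (value chosen for illustration)

def job_status(job_id, volume, get_host_jobs, read_volume_lease):
    # get_host_jobs(job_id) -> dict of jobs as returned by Host.getJobs
    # read_volume_lease(volume) -> lease state, e.g. 'FREE' or 'HELD'
    jobs = get_host_jobs(job_id)
    if job_id in jobs:
        # The host still reports the job; trust its status.
        return jobs[job_id]['status']

    # Fallback seen in the log: "attempting to determine the job status by
    # polling the entity". The destination volume lease is used as the
    # indicator of whether the copy is still in progress.
    if read_volume_lease(volume) != 'FREE':
        # This is where the flow gets stuck in this bug: the lease still
        # carries the pre-outage owner and generation, so it never reads
        # FREE and the engine waits indefinitely.
        return 'running'
    return 'done'

def wait_for_job(job_id, volume, get_host_jobs, read_volume_lease):
    while job_status(job_id, volume, get_host_jobs, read_volume_lease) == 'running':
        time.sleep(POLL_INTERVAL)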
Is this a regression from 4.1?
Hi Kevin,

I am not able to reproduce this bug. Can you please share some more info
about the steps to reproduce? How many disks, what size, which domain type,
how many hosts in the DC (one or more), etc.

I blocked the connection to the storage server, the operation failed, and
after a few minutes the host went into a non-operational state. However,
after I restored the connection to the storage server, the host came back up,
but the cold move did not start again and nothing else happened.

Thanks
(In reply to Yaniv Lavi from comment #1)
> Is this a regression from 4.1?

Could not reproduce this on 4.1.
The same issue occurred in the automation tier 4 run. See https://bugzilla.redhat.com/show_bug.cgi?id=1571142
This seems to be an intermittent issue, as it does not happen every time:

During my first attempt it happened 2 out of 2 times on 4.2.
During my second attempt it happened 1 out of 4 times on 4.2.
During the automation run on tier 4 it happened 2 out of 6 times.
On 4.1 it did not happen in 4 attempts.

Regards,
Kevin
This is neither a blocker nor an exception, yet it is targeted for 4.2.3?
In my attempt to reproduce I found the following.

The lease was acquired here:

2018-04-26 21:40:34 187302 [23137]: s25:r27 resource 77847bfc-b39c-4406-8377-a9515267318a:b6956d4e-d0e0-4cb1-84f1-373dd661b1b5:/rhev/data-center/mnt/10.35.0.233:_root_storage__domains_sd1/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease:0 for 2,16,29591

This is the sanlock.log where sanlock kills vdsm after the storage block:

2018-04-26 21:42:26 187414 [19686]: s25 renewal error -2 delta_length 20 last_success 187302
2018-04-26 21:42:27 187415 [22765]: s25 kill 29591 sig 15 count 34
2018-04-26 21:42:28 187416 [22765]: s25 kill 29591 sig 15 count 35
2018-04-26 21:42:29 187417 [22765]: s25 kill 29591 sig 15 count 36
2018-04-26 21:42:30 187418 [22765]: s25 kill 29591 sig 15 count 37
2018-04-26 21:42:31 187419 [22765]: s25 kill 29591 sig 15 count 38
2018-04-26 21:42:32 187420 [22765]: s25 kill 29591 sig 15 count 39
2018-04-26 21:42:33 187421 [22765]: s25 kill 29591 sig 15 count 40
2018-04-26 21:42:34 187422 [22765]: s25 kill 29591 sig 9 count 41
2018-04-26 21:42:34 187422 [22765]: dead 29591 ci 2 count 41

This is what the lease currently holds; the timestamp 187302 is the same as in the log, acquired for pid 29591:

[root@localhost ~]# sanlock direct dump /rhev/data-center/55ab06f0-3b0b-11e8-9eaa-507b9dec63c2/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease
  offset lockspace resource timestamp own gen lver
00000000 77847bfc-b39c-4406-8377-a9515267318a b6956d4e-d0e0-4cb1-84f1-373dd661b1b5 0000187302 0001 0018 2

Manually acquiring the lease works; the current pid of vdsm is 32231:

[root@localhost ~]# sanlock client acquire -r 77847bfc-b39c-4406-8377-a9515267318a:b6956d4e-d0e0-4cb1-84f1-373dd661b1b5:/rhev/data-center/55ab06f0-3b0b-11e8-9eaa-507b9dec63c2/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease:0 -p 32231
acquire pid 32231
acquire done 0

[root@localhost ~]# sanlock direct dump /rhev/data-center/55ab06f0-3b0b-11e8-9eaa-507b9dec63c2/77847bfc-b39c-4406-8377-a9515267318a/images/6b10a8ab-00d3-4ea6-bd71-b1fd43399db9/b6956d4e-d0e0-4cb1-84f1-373dd661b1b5.lease
  offset lockspace resource timestamp own gen lver
00000000 77847bfc-b39c-4406-8377-a9515267318a b6956d4e-d0e0-4cb1-84f1-373dd661b1b5 0000188231 0001 0019 3

Now we see that the generation is 19; the previous generation was 18 (before the storage block). So vdsm should probably take the host generation into account when checking the lease status.
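Following up on that, here is a rough sketch of the kind of check being
suggested: treat the lease as stale when the owner generation recorded in it
is older than the host's current generation in the lockspace. This is
illustrative Python only; it parses the "sanlock direct dump" output shown
above rather than using the real VDSM/sanlock APIs, and how the current host
generation is obtained is left as an assumption.

import subprocess

def read_lease_record(lease_path):
    # Parse the first resource record from `sanlock direct dump <lease>`.
    # Returns (timestamp, owner_host_id, owner_generation) as integers,
    # based on the column layout shown in the dump above:
    #   offset lockspace resource timestamp own gen lver
    out = subprocess.check_output(
        ["sanlock", "direct", "dump", lease_path]).decode()
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == "00000000":
            timestamp = int(fields[3])
            own = int(fields[4])
            gen = int(fields[5])
            return timestamp, own, gen
    raise RuntimeError("no resource record found in %s" % lease_path)

def lease_is_stale(lease_path, current_host_generation):
    # current_host_generation is assumed to come from the lockspace
    # (e.g. generation 19 in the example above); how it is queried is
    # outside the scope of this sketch.
    _, _, owner_generation = read_lease_record(lease_path)
    # Generation 18 < 19 in the example: the recorded owner belongs to a
    # previous incarnation of the host, so the job that held the lease
    # cannot still be running.
    return owner_generation < current_host_generation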
This bug is not marked as a blocker and we have entered the blocker-only phase for 4.2.4. Please consider re-targeting to 4.2.5.
Created attachment 1458298 [details]
engine and vdsm logs

Added new logs after running the scenario again with a longer wait before blocking access to the storage.
Based on comment 14 and an additional successful run, moving to VERIFIED.
This bugzilla is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.