+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1638540 +++
======================================================================

Description of problem:

Live Storage Migration (LSM) completed the disk move steps, but never issued the live merge sequence. Right after VmReplicateDiskFinishVDSCommand completed, the following error occurred:

2018-09-21 03:01:58,772+02 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [6c0649c7-d37d-408a-a6af-0c50ccf8353d] An exception has occurred while trying to create a command object for command 'LiveMigrateDisk' with parameters 'LiveMigrateDiskParameters:{commandId='bfaabf40-cac3-4771-a569-6baf768aca23', user='admin', commandType='LiveMigrateDisk'}': WELD-000049: Unable to invoke protected final void org.ovirt.engine.core.bll.CommandBase.postConstruct() on org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand@4a6d91c8

The end result was that the disk had been moved, but the disk images remained locked.

Version-Release number of selected component (if applicable):
RHV 4.2.6
RHEL 7.5 hosts w/vdsm-4.20.35-1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

(Originally by Gordon Watson)
I don't have anything to add here; if you have a specific question, please raise it. I know Benny Zlotnik (bzlotnik) is investigating this right now. (Originally by Elad Ben Aharon)
I was able to reproduce the issue. It seems there was high load in the LSM area, which caused commands to wait more than 10 minutes for the lock on the VM (needed in order to perform the snapshot operation); by the time the lock was released, the transaction had already been killed by the transaction reaper. This probably affected live merge too, since it uses the same lock. (Originally by Benny Zlotnik)
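To make the race concrete, here is a minimal, hypothetical Python sketch (this is not engine code; the names, the 600-second reaper timeout, and the RuntimeError are illustrative assumptions): a command thread blocks with no upper bound on a shared per-VM lock, and if the wait outlasts the transaction timeout, the work it finally performs runs against a transaction the reaper has already rolled back.

    import threading
    import time

    TX_TIMEOUT = 600  # assumed transaction-reaper timeout in seconds (illustrative)

    vm_lock = threading.Lock()  # stands in for the engine's per-VM exclusive lock

    def run_command(name, tx_started):
        """Simulate a command that opened a transaction, then queued on the VM lock."""
        with vm_lock:  # blocks with no upper bound, as in the bug
            waited = time.monotonic() - tx_started
            if waited > TX_TIMEOUT:
                # By now the reaper has rolled the transaction back underneath us;
                # any further use of it fails, mirroring the WELD-000049 error above.
                raise RuntimeError("%s: transaction reaped after %.0fs wait" % (name, waited))
            print("%s: lock acquired after %.0fs; transaction still alive" % (name, waited))

Under load, enough commands pile up behind vm_lock that the later ones cross TX_TIMEOUT before they ever acquire it, which matches the failure pattern in the engine log.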
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2.z': '?'}'] For more info please contact: rhv-devops (Originally by rhv-bugzilla-bot)
This bug needs cloning before it can move to ON_QA. (Originally by Eyal Edri)
The live snapshot phase for LSM took too long and then failed; in addition, the live storage migration itself failed.

Steps to reproduce:
1. Create a VM with a disk
2. Run the VM
3. Connect to the host the VM runs on
4. Add a sleep on the host (in the "def snapshot" method, add "import time" and "time.sleep(310)") - Benny did it
5. Apply the edit and restart the host - Benny did it
6. Run LSM on the disk

Errors from engine log:

2019-01-14 14:51:35,568+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand' with failure.
2019-01-14 14:51:35,593+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotDiskCommand' with failure.
2019-01-14 14:51:35,608+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' with failure.
2019-01-14 14:51:36,563+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'New_VM_Disk1 Auto-generated for Live Storage Migration' creation for VM 'New_VM'.
2019-01-14 14:51:37,950+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-64) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2019-01-14 14:51:38,001+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-64) [e2310d2f-389e-43e9-8f24-004baa035c19] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk New_VM_Disk1 to domain iscsi_2.

Errors from vdsm log:

2019-01-14 14:51:35,649+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='1ba137c3-c857-43bf-be44-5bd253dfd1c7') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in revertTask
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2305, in revertTask
    return self.taskMng.revertTask(taskID=taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 165, in revertTask
    t = self.__getTask(taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 85, in __getTask
    raise se.UnknownTask(taskID)
UnknownTask: Task id unknown: (u'd0152eb1-d457-4d13-819c-8a819ca26ec2',)
2019-01-14 14:51:35,662+0200 ERROR (jsonrpc/1) [storage.Dispatcher] FINISH revertTask error=Task id unknown: (u'd0152eb1-d457-4d13-819c-8a819ca26ec2',) (dispatcher:82)

Logs attached
Created attachment 1520595 [details]
Logs
The above failure is a result of a vdsm timeout (vdsm is expected to return an answer within 3 minutes), and the sleep I injected was longer than that. Thus another step was added; the updated steps are:

1. Create a VM with a disk
2. Run the VM
3. Connect to the host the VM runs on
4. Reduce the transaction timeout to 60 seconds (can be done via jboss-cli):
   /subsystem=transactions:write-attribute(name=default-timeout,value=60)
5. Add a sleep on the host (in the "def snapshot" method, add "import time" and "time.sleep(90)")
6. Apply the edit and restart the host
7. Run LSM on the disk
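For reference, a minimal sketch of the fault injection from step 5 (hypothetical: only the two injected lines, "import time" and "time.sleep(90)", come from the steps above; the class and method signature here are simplified stand-ins for vdsm's VM.snapshot verb, not its real API):

    class VM(object):
        # Simplified stand-in for vdsm's VM API class.
        def snapshot(self, snapDrives, frozen=False):
            import time
            time.sleep(90)  # injected delay; longer than the 60s transaction timeout from step 4
            # ... the real snapshot logic would continue here ...
            return {'status': {'code': 0, 'message': 'Done'}}  # placeholder return value

With the engine's transaction default-timeout lowered to 60 seconds, the 90-second stall guarantees the transaction reaper fires while the snapshot verb is still in flight, reproducing the original failure without needing production-scale load.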
Pasting the relevant traceback from the log. Benny, the below seems to be caused by the sleep that was added. Adding needinfo on Shir.

2019-01-14 14:51:34,118+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'frozen': True, u'vmID': u'2075abde-c39a-414a-812c-89e18e5b2fc3', u'snapDrives': [{u'baseVolumeID': u'7505c4f2-1351-4b9e-89c0-f5433e55e7c5', u'domainID': u'194b263e-44a1-429f-8abf-29bb6fbfc2be', u'volumeID': u'f971e3fe-04de-458e-8180-0aec141b205f', u'imageID': u'87ae8d43-789d-4cce-8661-5e5c624b0a90'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'56be48b0-2c29-458d-9b29-6f67df383287'} at 0x7fca7ecac290> timeout=60, duration=180 at 0x7fca7ecacd50> task#=21 at 0x7fca940e5950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 691, in snapshot
  time.sleep(350) (executor:363)
Verified with:
ovirt-engine-4.2.8.2-0.1.el7ev.noarch
vdsm-4.20.46-1.el7ev.x86_64

After talking with Benny, we succeeded in reproducing this bug together using the following steps:
1. Create a VM with a disk
2. Run the VM
3. Connect to the host the VM runs on
4. Reduce the transaction timeout to 60 seconds (can be done via jboss-cli):
   /subsystem=transactions:write-attribute(name=default-timeout,value=60)
5. Add a sleep on the host (in the "def snapshot" method, add "import time" and "time.sleep(90)")
6. Apply the edit and restart the host
7. Run LSM on the disk
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121