+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1638540 +++
======================================================================

Description of problem:

Live Storage Migration (LSM) completed the disk move steps, but never issued the live merge sequence. Right after VmReplicateDiskFinishVDSCommand completed, the following error occurred:

2018-09-21 03:01:58,772+02 ERROR [org.ovirt.engine.core.bll.CommandsFactory] (EE-ManagedThreadFactory-engineScheduled-Thread-79) [6c0649c7-d37d-408a-a6af-0c50ccf8353d] An exception has occurred while trying to create a command object for command 'LiveMigrateDisk' with parameters 'LiveMigrateDiskParameters:{commandId='bfaabf40-cac3-4771-a569-6baf768aca23', user='admin', commandType='LiveMigrateDisk'}': WELD-000049: Unable to invoke protected final void org.ovirt.engine.core.bll.CommandBase.postConstruct() on org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand@4a6d91c8

The end result was that the disk had been moved, but the disk images remained locked.

Version-Release number of selected component (if applicable):
RHV 4.2.6
RHEL 7.5 hosts w/vdsm-4.20.35-1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

(Originally by Gordon Watson)
I don't have anything to add here; if you have a specific question, please raise it. I know Benny Zlotnik (bzlotnik) is investigating this right now. (Originally by Elad Ben Aharon)
I was able to reproduce the issue. It seems there was high load in the LSM area, which caused commands to wait more than 10 minutes for the lock on the VM (needed in order to perform the snapshot operation); by the time the lock was released, the transaction had already been killed by the transaction reaper. This probably affected live merge too, since it uses the same lock. (Originally by Benny Zlotnik)
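To make the race concrete, here is a minimal, hypothetical Python sketch (this is not engine code; the names, the 600-second reaper timeout, and the RuntimeError are illustrative assumptions): a command thread blocks with no upper bound on a shared per-VM lock, and if the wait outlasts the transaction timeout, the work it finally performs runs against a transaction the reaper has already rolled back.

    import threading
    import time

    TX_TIMEOUT = 600  # assumed transaction-reaper timeout in seconds (illustrative)

    vm_lock = threading.Lock()  # stands in for the engine's per-VM exclusive lock

    def run_command(name, tx_started):
        """Simulate a command that opened a transaction, then queued on the VM lock."""
        with vm_lock:  # blocks with no upper bound, as in the bug
            waited = time.monotonic() - tx_started
            if waited > TX_TIMEOUT:
                # By now the reaper has rolled the transaction back underneath us;
                # any further use of it fails, mirroring the WELD-000049 error above.
                raise RuntimeError("%s: transaction reaped after %.0fs wait" % (name, waited))
            print("%s: lock acquired after %.0fs; transaction still alive" % (name, waited))

Under load, enough commands pile up behind vm_lock that the later ones cross TX_TIMEOUT before they ever acquire it, which matches the failure pattern in the engine log.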
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2.z': '?'}'] For more info please contact: rhv-devops (Originally by rhv-bugzilla-bot)
This bug needs cloning before it can move to ON_QA. (Originally by Eyal Edri)
The live snapshot phase for LSM took too long and then failed; in addition, the live storage migration itself failed.

Steps to reproduce:
1. Create a VM with a disk
2. Run the VM
3. Connect to the host the VM runs on
4. Add a sleep on the host (in the "def snapshot" method, add "import time" and "time.sleep(310)") - Benny did it
5. Apply the edit and restart the host - Benny did it
6. Run LSM on the disk

Errors from engine log:

2019-01-14 14:51:35,568+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand' with failure.
2019-01-14 14:51:35,593+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotDiskCommand' with failure.
2019-01-14 14:51:35,608+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' with failure.
2019-01-14 14:51:36,563+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'New_VM_Disk1 Auto-generated for Live Storage Migration' creation for VM 'New_VM'.
2019-01-14 14:51:37,950+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-64) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2019-01-14 14:51:38,001+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-64) [e2310d2f-389e-43e9-8f24-004baa035c19] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk New_VM_Disk1 to domain iscsi_2.

Errors from vdsm log:

2019-01-14 14:51:35,649+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='1ba137c3-c857-43bf-be44-5bd253dfd1c7') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in revertTask
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2305, in revertTask
    return self.taskMng.revertTask(taskID=taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 165, in revertTask
    t = self.__getTask(taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 85, in __getTask
    raise se.UnknownTask(taskID)
UnknownTask: Task id unknown: (u'd0152eb1-d457-4d13-819c-8a819ca26ec2',)
2019-01-14 14:51:35,662+0200 ERROR (jsonrpc/1) [storage.Dispatcher] FINISH revertTask error=Task id unknown: (u'd0152eb1-d457-4d13-819c-8a819ca26ec2',) (dispatcher:82)

Logs attached
Created attachment 1520595 [details]
Logs
The above failure is a result of a vdsm timeout (vdsm is expected to return an answer within 3 minutes), and the sleep I injected was longer than that. Thus another step was added; the updated steps are:

1. Create a VM with a disk
2. Run the VM
3. Connect to the host the VM runs on
4. Reduce the transaction timeout to 60 seconds (can be done via jboss-cli):
   /subsystem=transactions:write-attribute(name=default-timeout,value=60)
5. Add a sleep on the host (in the "def snapshot" method, add "import time" and "time.sleep(90)")
6. Apply the edit and restart the host
7. Run LSM on the disk
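For reference, a minimal sketch of the fault injection from step 5 (hypothetical: only the two injected lines, "import time" and "time.sleep(90)", come from the steps above; the class and method signature here are simplified stand-ins for vdsm's VM.snapshot verb, not its real API):

    class VM(object):
        # Simplified stand-in for vdsm's VM API class.
        def snapshot(self, snapDrives, frozen=False):
            import time
            time.sleep(90)  # injected delay; longer than the 60s transaction timeout from step 4
            # ... the real snapshot logic would continue here ...
            return {'status': {'code': 0, 'message': 'Done'}}  # placeholder return value

With the engine's transaction default-timeout lowered to 60 seconds, the 90-second stall guarantees the transaction reaper fires while the snapshot verb is still in flight, reproducing the original failure without needing production-scale load.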
Pasting the relevant traceback from the log. Benny, the below seems to be caused by the sleep that was added. Adding needinfo on Shir.

2019-01-14 14:51:34,118+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'frozen': True, u'vmID': u'2075abde-c39a-414a-812c-89e18e5b2fc3', u'snapDrives': [{u'baseVolumeID': u'7505c4f2-1351-4b9e-89c0-f5433e55e7c5', u'domainID': u'194b263e-44a1-429f-8abf-29bb6fbfc2be', u'volumeID': u'f971e3fe-04de-458e-8180-0aec141b205f', u'imageID': u'87ae8d43-789d-4cce-8661-5e5c624b0a90'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'56be48b0-2c29-458d-9b29-6f67df383287'} at 0x7fca7ecac290> timeout=60, duration=180 at 0x7fca7ecacd50> task#=21 at 0x7fca940e5950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 691, in snapshot
  time.sleep(350) (executor:363)
Verified with:
ovirt-engine-4.2.8.2-0.1.el7ev.noarch
vdsm-4.20.46-1.el7ev.x86_64

After talking with Benny, we succeeded in reproducing this bug together using the following steps:
1. Create a VM with a disk
2. Run the VM
3. Connect to the host the VM runs on
4. Reduce the transaction timeout to 60 seconds (can be done via jboss-cli):
   /subsystem=transactions:write-attribute(name=default-timeout,value=60)
5. Add a sleep on the host (in the "def snapshot" method, add "import time" and "time.sleep(90)")
6. Apply the edit and restart the host
7. Run LSM on the disk
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121