Bug 1660529
| Summary: | [downstream clone - 4.2.8] LSM encountered WELD-000049 exception and never issued live merge | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
| Component: | ovirt-engine | Assignee: | Benny Zlotnik <bzlotnik> |
| Status: | CLOSED ERRATA | QA Contact: | Shir Fishbain <sfishbai> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.6 | CC: | bzlotnik, ebenahar, gveitmic, gwatson, ratamir, Rhev-m-bugs, sfishbai, tnisan |
| Target Milestone: | ovirt-4.2.8 | Keywords: | Reopened, ZStream |
| Target Release: | --- | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.2.8.2 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1638540 | Environment: | |
| Last Closed: | 2019-01-22 12:44:51 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1638540 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description (RHV bug bot, 2018-12-18 14:51:50 UTC)
I don't have anything to add here; if you have a specific question, please raise it. I know Benny Zlotnik (bzlotnik) is investigating this right now. (Originally by Elad Ben Aharon)

I was able to reproduce the issue. It seems there was a high load of LSM operations, which caused commands to wait more than 10 minutes for the lock on the VM (needed in order to perform the snapshot operation); by the time the lock was released, the transaction had already been killed by the transaction reaper. This probably affected live merge as well, since it uses the same lock. (Originally by Benny Zlotnik)
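To make the timing described above concrete, here is a toy, self-contained sketch of the race: a command queued on the per-VM lock outlives its transaction because the current lock holder keeps the lock longer than the transaction timeout. Every name in it (`vm_lock`, `transaction_reaper`, `snapshot_command`, the timeout values) is invented for illustration and is not engine code.

```python
import threading
import time

TRANSACTION_TIMEOUT = 2   # stand-in for the engine-side transaction timeout
LOCK_HOLD_TIME = 3        # the earlier command holds the per-VM lock even longer

vm_lock = threading.Lock()

def transaction_reaper(state):
    # Background rollback after the timeout, analogous to the transaction reaper.
    time.sleep(TRANSACTION_TIMEOUT)
    state['alive'] = False

def snapshot_command():
    state = {'alive': True}
    threading.Thread(target=transaction_reaper, args=(state,)).start()
    with vm_lock:  # waits for the long-running holder to release the lock
        if not state['alive']:
            raise RuntimeError("transaction already reaped before the lock was acquired")
        print("snapshot issued inside a live transaction")

def long_running_holder():
    # An earlier LSM operation that holds the VM lock past the timeout.
    with vm_lock:
        time.sleep(LOCK_HOLD_TIME)

holder = threading.Thread(target=long_running_holder)
holder.start()
time.sleep(0.1)            # let the holder grab the lock first
try:
    snapshot_command()     # blocks past the timeout, then fails like the LSM commands here
except RuntimeError as exc:
    print("command failed: %s" % exc)
holder.join()
```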
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:
[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]
For more info please contact: rhv-devops
(Originally by rhv-bugzilla-bot)
This bug needs cloning before it can move to ON_QA. (Originally by Eyal Edri)

The live snapshot phase of LSM took too long and then failed; in addition, the live storage migration itself failed.

Steps to reproduce:
1. Create a VM with a disk.
2. Run the VM.
3. Connect to the host the VM is running on.
4. Add a sleep on the host: in the "def snapshot" method, add `import time` and `time.sleep(310)` (done by Benny; see the sketch after this list).
5. Restart the host after the edit (done by Benny).
6. Run LSM on the disk.
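For illustration, a minimal sketch of the delay injected in step 4, assuming it is placed in the `snapshot` method of vdsm's API.py (which is where the traceback later in this report shows the `time.sleep()` call); the class name, method signature, and return value below are simplified placeholders, not the real vdsm code.

```python
import time

class VM(object):
    def snapshot(self, snapDrives, frozen=False):
        # Artificial delay for the reproduction: hold the live-snapshot call
        # long enough that the per-VM lock on the engine side outlives the
        # transaction (310 s here; 90 s once the transaction timeout was
        # lowered to 60 s in the updated steps below).
        time.sleep(310)
        # ... the original snapshot logic would continue here ...
        return {'status': {'code': 0, 'message': 'Done'}}
```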
Errors from the engine log:
```
2019-01-14 14:51:35,568+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand' with failure.
2019-01-14 14:51:35,593+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotDiskCommand' with failure.
2019-01-14 14:51:35,608+02 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotCommand' with failure.
2019-01-14 14:51:36,563+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'New_VM_Disk1 Auto-generated for Live Storage Migration' creation for VM 'New_VM'.
2019-01-14 14:51:37,950+02 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-64) [e2310d2f-389e-43e9-8f24-004baa035c19] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2019-01-14 14:51:38,001+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-64) [e2310d2f-389e-43e9-8f24-004baa035c19] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk New_VM_Disk1 to domain iscsi_2.
```
Errors from the vdsm log:
```
2019-01-14 14:51:35,649+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='1ba137c3-c857-43bf-be44-5bd253dfd1c7') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in revertTask
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2305, in revertTask
    return self.taskMng.revertTask(taskID=taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 165, in revertTask
    t = self.__getTask(taskID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/taskManager.py", line 85, in __getTask
    raise se.UnknownTask(taskID)
UnknownTask: Task id unknown: (u'd0152eb1-d457-4d13-819c-8a819ca26ec2',)
2019-01-14 14:51:35,662+0200 ERROR (jsonrpc/1) [storage.Dispatcher] FINISH revertTask error=Task id unknown: (u'd0152eb1-d457-4d13-819c-8a819ca26ec2',) (dispatcher:82)
```
Logs attached.

Created attachment 1520595: Logs
The above failure is a result of a vdsm timeout (vdsm is expected to return an answer within 3 minutes), and the sleep I injected was longer. Another step was therefore added; the updated steps are:
1. Create a VM with a disk.
2. Run the VM.
3. Connect to the host the VM is running on.
4. Reduce the transaction timeout to 60 seconds (can be done via jboss-cli): /subsystem=transactions:write-attribute(name=default-timeout,value=60)
5. Add a sleep on the host: in the "def snapshot" method, add `import time` and `time.sleep(90)`.
6. Restart the host after the edit.
7. Run LSM on the disk.

Pasting the relevant traceback from the log. Benny, the below seems to be caused by the sleep that was added. Adding needinfo on Shir.
```
2019-01-14 14:51:34,118+0200 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'frozen': True, u'vmID': u'2075abde-c39a-414a-812c-89e18e5b2fc3', u'snapDrives': [{u'baseVolumeID': u'7505c4f2-1351-4b9e-89c0-f5433e55e7c5', u'domainID': u'194b263e-44a1-429f-8abf-29bb6fbfc2be', u'volumeID': u'f971e3fe-04de-458e-8180-0aec141b205f', u'imageID': u'87ae8d43-789d-4cce-8661-5e5c624b0a90'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'56be48b0-2c29-458d-9b29-6f67df383287'} at 0x7fca7ecac290> timeout=60, duration=180 at 0x7fca7ecacd50> task#=21 at 0x7fca940e5950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in snapshot
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 691, in snapshot
  time.sleep(350) (executor:363)
```
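For context on the entry above: the "Worker blocked" warning comes from a watchdog-style check that notices a JSON-RPC worker running past its timeout and dumps that thread's current stack. The snippet below is a rough, assumed simplification of that idea, not vdsm's actual executor code; `report_blocked_workers` and the `workers` records are invented names.

```python
import sys
import time
import traceback

def report_blocked_workers(workers, timeout=60):
    # workers: list of dicts like {'name': 'jsonrpc/0', 'ident': <thread id>,
    # 'started': <epoch seconds when the current task began>}.
    frames = sys._current_frames()          # thread id -> current stack frame
    now = time.time()
    for worker in workers:
        duration = now - worker['started']
        if duration > timeout:
            stack = ''.join(traceback.format_stack(frames[worker['ident']]))
            print("WARN Worker blocked: <Worker name=%s timeout=%d, duration=%d>, "
                  "traceback:\n%s" % (worker['name'], timeout, duration, stack))
```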
Verified with ovirt-engine-4.2.8.2-0.1.el7ev.noarch and vdsm-4.20.46-1.el7ev.x86_64. After talking with Benny, we succeeded in reproducing this bug together with the following steps:
1. Create a VM with a disk.
2. Run the VM.
3. Connect to the host the VM is running on.
4. Reduce the transaction timeout to 60 seconds (can be done via jboss-cli): /subsystem=transactions:write-attribute(name=default-timeout,value=60)
5. Add a sleep on the host: in the "def snapshot" method, add `import time` and `time.sleep(90)`.
6. Restart the host after the edit.
7. Run LSM on the disk.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121