Created attachment 1675007 [details]
scale test vdsm log

Description of problem:

When storage is out of space or not available, creating a volume lease fails, and a volume metadata rollback is called as part of the task recovery. For a block volume, the parameters sent to createVolumeMetadataRollback are the sdUUID and the metadata slot to be cleared (the slot that was already written). The metadata slot number is sent as a str, but it should be an int, since arithmetic is involved in calculating the offset to be cleared in the metadata volume. This results in a type mismatch error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 405, in run
    return function(self.task, *argslist)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 484, in createVolumeMetadataRollback
    sd.clear_metadata_block(slot)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1035, in clear_metadata_block
    self.write_metadata_block(slot, data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1027, in write_metadata_block
    f.seek(self.metadata_offset(slot))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1005, in metadata_offset
    return METADATA_BASE_V5 + slot * METADATA_SLOT_SIZE_V5
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Version-Release number of selected component (if applicable):
vdsm 4.3.9-7

How reproducible:
Reproduced at least once, on a scale test setup as described for snapshot creation:
https://docs.google.com/document/d/1sbMC54-1KfM-g-WF9AIIJv9vXUA0Ls3VIhHB7C4SpLM/edit?ts=5e81f6c2

Steps to Reproduce (running 10 parallel threads):
1. Use existing VMs
2. Create 30 snapshots

The attached log is from scale tests of vdsm 4.3.9-7, but the bad rollback code for block volumes also exists in 4.4.

Actual results:

Sanlock exception about being out of space during lease creation:

2020-03-26 15:09:43,429+0000 INFO (tasks/7) [storage.Volume] The requested size for volume 6f54c381-6f69-48b2-b66a-333faee5920f doesn't match the granularity on domain c157f961-d205-4ed3-9f86-720ad9186e03, updating the volume capacity from 1384120320 to 1476395008 (volume:1258)
2020-03-26 15:09:43,464+0000 ERROR (tasks/7) [storage.Volume] Unexpected error (volume:1278)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1272, in create
    cls.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1555, in newVolumeLease
    return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 296, in newVolumeLease
    sd.create_volume_lease(slot, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 988, in create_volume_lease
    sector=self.block_size)
SanlockException: (28, 'Sanlock resource write failure', 'No space left on device')

Exception about the rollback failure:

2020-03-26 15:09:50,938+0000 INFO (tasks/7) [storage.Volume] Metadata rollback for sdUUID=c157f961-d205-4ed3-9f86-720ad9186e03 slot=1948 (blockVolume:482)
2020-03-26 15:09:50,940+0000 ERROR (tasks/7) [storage.TaskManager.Task] (Task='65810c24-2b41-4bb2-ae11-d380d1b022e1') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 405, in run
    return function(self.task, *argslist)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 484, in createVolumeMetadataRollback
    sd.clear_metadata_block(slot)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1035, in clear_metadata_block
    self.write_metadata_block(slot, data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1027, in write_metadata_block
    f.seek(self.metadata_offset(slot))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1005, in metadata_offset
    return METADATA_BASE_V5 + slot * METADATA_SLOT_SIZE_V5
TypeError: unsupported operand type(s) for +: 'int' and 'str'
2020-03-26 15:09:50,941+0000 WARN (tasks/7) [storage.TaskManager.Task] (Task='65810c24-2b41-4bb2-ae11-d380d1b022e1') task 65810c24-2b41-4bb2-ae11-d380d1b022e1: recovery failed: Task is aborted: u"unsupported operand type(s) for +: 'int' and 'str'" - code 100 (task:818)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 815, in _recover
    self._run(rec.run)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 896, in _run
    raise se.TaskAborted(message, code)

Expected results:
Only the sanlock exception about being out of space; the rollback should work.

Additional info:
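The failure mode is easy to reproduce in isolation. Below is a minimal sketch: the names metadata_offset, METADATA_BASE_V5 and METADATA_SLOT_SIZE_V5 come from the traceback, but the constant values and the rollback wrapper here are hypothetical, chosen only to illustrate the str-vs-int bug and the obvious fix (converting the persisted slot parameter back to int before arithmetic).

```python
# Illustrative values only -- not the real vdsm V5 layout constants.
METADATA_BASE_V5 = 1024 * 1024      # hypothetical base offset (bytes)
METADATA_SLOT_SIZE_V5 = 8 * 1024    # hypothetical slot size (bytes)


def metadata_offset(slot):
    # If slot arrives as a str, `slot * METADATA_SLOT_SIZE_V5` is string
    # repetition (which succeeds), and adding the result to an int then
    # raises the TypeError seen in the log.
    return METADATA_BASE_V5 + slot * METADATA_SLOT_SIZE_V5


def create_volume_metadata_rollback(sdUUID, slot):
    # Hypothetical sketch of the fix: task recovery hands rollback
    # parameters back as strings, so the rollback entry point must
    # convert slot to int before any offset arithmetic.
    return metadata_offset(int(slot))


# Broken path, as in the log (slot=1948 arrives as the string "1948"):
try:
    metadata_offset("1948")
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'int' and 'str'

# Fixed path computes a proper byte offset:
print(create_volume_metadata_rollback("sd-uuid", "1948"))
```

Because `str * int` is valid Python (sequence repetition), the bad type survives one operation and only explodes at the addition, which is why the error surfaces deep inside metadata_offset rather than at the rollback entry point.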
Hi David, I guess hitting SanlockException: (28, 'Sanlock resource write failure', 'No space left on device') is not trivial, but could you provide the snapshot script you use in your scale environment for the relevant reproduction steps? It could be useful for verification. Thanks
Hi Amit, I used JMeter (a load tool) to create the snapshots. If it is relevant for you, I will share the script.
This is a regression introduced in 4.3 when adding 4k support. The error was reported when testing 4.3.9. Lukas, for some reason I cannot add a regression keyword.
It's a new Bugzilla; clear your cache :)
Hi Amit, please provide a clear scenario for verification.
Right, it should run with 10 concurrent users
(In reply to David Vaanunu from comment #7)
> Right, it should run with 10 concurrency users

Dudu, the functional team can verify this if it has a clear functional scenario, which in this case does not exist. This looks like a scale team scenario/issue.

Can you please help with this?

Moving this bug to the scale team.
(In reply to Avihai from comment #8)
> (In reply to David Vaanunu from comment #7)
> > Right, it should run with 10 concurrency users
>
> Dudu, the functional team can verify this if it has a clear functional
> scenario which in this case does not exist.
> This looks like a scale team scenario/issue.
>
> Can you please help with this?
>
> Moving this bug to scale team.

Just verified this one with the following steps
(In reply to Evelina Shames from comment #9)
> Just verified this one with the following steps

* Create VM
* Create 30 snapshots
* Delete VM

-> 10 concurrent users, on:
vdsm-4.40.13-1.el8ev.x86_64 and engine-4.4.0-0.33.master.el8ev
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.