Bug 1819098 - Broken rollback for BlockVolume createVolumeMetadata
Summary: Broken rollback for BlockVolume createVolumeMetadata
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.30.44
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: ovirt-4.4.0
: ---
Assignee: Amit Bawer
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-31 08:20 UTC by Amit Bawer
Modified: 2020-05-20 19:57 UTC (History)
8 users (show)

Fixed In Version: vdsm-4.40.13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 19:57:38 UTC
oVirt Team: Scale
Embargoed:
sbonazzo: ovirt-4.4?


Attachments (Terms of Use)
scale test vdsm log (761.38 KB, application/x-xz)
2020-03-31 08:20 UTC, Amit Bawer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 108143 0 master MERGED blockVolume: parse slot into an int 2020-07-28 07:38:36 UTC
oVirt gerrit 108164 0 ovirt-4.3 MERGED blockVolume: parse slot into an int 2020-07-28 07:38:36 UTC

Description Amit Bawer 2020-03-31 08:20:55 UTC
Created attachment 1675007 [details]
scale test vdsm log

Description of problem: 

When storage is out of space or not avail, creating a volume lease would fail and a volume metadata rollback would be called as part of the task recovery. 
The parameters sent for createVolumeMetadataRollback in case of block volume are the sdUUID and the metadata slot to be cleared for the already written slot.
The metadata slot number is sent as str, which should be an int as there is an arithmetic involved in calculating the offset to be cleared in the metadata volume. This results in a types mismatch error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 405, in run
    return function(self.task, *argslist)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 484, in createVolumeMetadataRollback
    sd.clear_metadata_block(slot)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1035, in clear_metadata_block
    self.write_metadata_block(slot, data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1027, in write_metadata_block
    f.seek(self.metadata_offset(slot))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1005, in metadata_offset
    return METADATA_BASE_V5 + slot * METADATA_SLOT_SIZE_V5
TypeError: unsupported operand type(s) for +: 'int' and 'str'
 




Version-Release number of selected component (if applicable): 
vdsm 4.3.9-7


How reproducible: 
At least on one scale test set as described for snapshots creation

https://docs.google.com/document/d/1sbMC54-1KfM-g-WF9AIIJv9vXUA0Ls3VIhHB7C4SpLM/edit?ts=5e81f6c2


Steps to Reproduce:

Running 10 parallel threads:
1. Used exist VMs
2. Create 30 snapshots 

Log attached is from Scale tests for vdsm 4.3.9-7, 
but issue with bad rollback code for block volume also exists for 4.4


Actual results:

Sanlock Exception about out of space for lease creation

2020-03-26 15:09:43,429+0000 INFO  (tasks/7) [storage.Volume] The requested size for volume 6f54c381-6f69-48b2-b66a-333faee5920f doesn't match the granularity on domain c157f961-d205-4ed3-9f86-720ad9186e03, updating the volume capacity from 1384120320 to 1476395008 (volume:1258)
2020-03-26 15:09:43,464+0000 ERROR (tasks/7) [storage.Volume] Unexpected error (volume:1278)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1272, in create
    cls.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1555, in newVolumeLease
    return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 296, in newVolumeLease
    sd.create_volume_lease(slot, volUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 988, in create_volume_lease
    sector=self.block_size)
SanlockException: (28, 'Sanlock resource write failure', 'No space left on device')


Exception about rollback failure

2020-03-26 15:09:50,938+0000 INFO  (tasks/7) [storage.Volume] Metadata rollback for sdUUID=c157f961-d205-4ed3-9f86-720ad9186e03 slot=1948 (blockVolume:482)
2020-03-26 15:09:50,940+0000 ERROR (tasks/7) [storage.TaskManager.Task] (Task='65810c24-2b41-4bb2-ae11-d380d1b022e1') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 405, in run
    return function(self.task, *argslist)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 484, in createVolumeMetadataRollback
    sd.clear_metadata_block(slot)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1035, in clear_metadata_block
    self.write_metadata_block(slot, data)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1027, in write_metadata_block
    f.seek(self.metadata_offset(slot))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1005, in metadata_offset
    return METADATA_BASE_V5 + slot * METADATA_SLOT_SIZE_V5
TypeError: unsupported operand type(s) for +: 'int' and 'str'
2020-03-26 15:09:50,941+0000 WARN  (tasks/7) [storage.TaskManager.Task] (Task='65810c24-2b41-4bb2-ae11-d380d1b022e1') task 65810c24-2b41-4bb2-ae11-d380d1b022e1: recovery failed: Task is aborted: u"unsupported operand type(s) for +: 'int' and 'str'" - code 100 (task:818)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 815, in _recover
    self._run(rec.run)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 896, in _run
    raise se.TaskAborted(message, code)





Expected results:

Only sanlock exception about out of space.
Rollback should work.




Additional info:

Comment 1 Amit Bawer 2020-03-31 12:10:38 UTC
Hi David, 

I guess hitting SanlockException: (28, 'Sanlock resource write failure', 'No space left on device') is not too trivial,
but could you provide the snapshots script you use in your scale env for the relevant reproduction steps,
maybe it could be useful for verification.

Thanks

Comment 2 David Vaanunu 2020-03-31 12:39:06 UTC
Hi Amit,

I used JMeter(Load tool) for creating the snapshots.
If it is relevant for you, I will share the script.

Comment 3 Nir Soffer 2020-04-02 16:40:03 UTC
This is a regression introduced in 4.3 when adding 4k support.
The error was reported when testing 4.3.9.

Lukas, for some reason I cannot add a regression keyword.

Comment 4 Lukas Svaty 2020-04-03 08:52:01 UTC
It's new bugzilla, clean your cache :)

Comment 5 Evelina Shames 2020-04-12 13:09:29 UTC
Hi Amit, please provide a clear scenario for verification.

Comment 7 David Vaanunu 2020-04-16 06:40:39 UTC
Right, it should run with 10 concurrency users

Comment 8 Avihai 2020-04-20 07:35:58 UTC
(In reply to David Vaanunu from comment #7)
> Right, it should run with 10 concurrency users

Dudu, the functional team can verify this if it has a clear functional scenario which in this case does not exist.
This looks like a scale team scenario/issue.

Can you please help with this?

Moving this bug to scale team.

Comment 9 Evelina Shames 2020-04-20 08:08:33 UTC
(In reply to Avihai from comment #8)
> (In reply to David Vaanunu from comment #7)
> > Right, it should run with 10 concurrency users
> 
> Dudu, the functional team can verify this if it has a clear functional
> scenario which in this case does not exist.
> This looks like a scale team scenario/issue.
> 
> Can you please help with this?
> 
> Moving this bug to scale team.

Just verified this one with the following steps

Comment 10 Evelina Shames 2020-04-20 08:10:24 UTC
(In reply to Evelina Shames from comment #9)
> (In reply to Avihai from comment #8)
> > (In reply to David Vaanunu from comment #7)
> > > Right, it should run with 10 concurrency users
> > 
> > Dudu, the functional team can verify this if it has a clear functional
> > scenario which in this case does not exist.
> > This looks like a scale team scenario/issue.
> > 
> > Can you please help with this?
> > 
> > Moving this bug to scale team.
> 
> Just verified this one with the following steps

* Create VM
* Create 30 snapshots
* Delete VM
 -> 10 concurrency users

on: vdsm-4.40.13-1.el8ev.x86_64 and engine-4.4.0-0.33.master.el8ev

Comment 11 Sandro Bonazzola 2020-05-20 19:57:38 UTC
This bugzilla is included in oVirt 4.4.0 release, published on May 20th 2020.

Since the problem described in this bug report should be
resolved in oVirt 4.4.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.