Bug 1723611
| Field | Value |
|---|---|
| Summary: | VM disk creations fail while changing LV tags. |
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Siddhant Rao <sirao> |
| Component: | lvm2 |
| Assignee: | David Teigland <teigland> |
| lvm2 sub component: | Default / Unclassified |
| QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| CC: | agk, cmarthal, heinzm, jbrassow, lsurette, mcsontos, msnitzer, nkshirsa, prajnoha, rhandlin, srevivo, teigland, ycui, zkabelac |
| Version: | 7.6 |
| Keywords: | ZStream |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Fixed In Version: | lvm2-2.02.186-1.el7 |
| Clones: | 1732994 1740521 (view as bug list) |
| Last Closed: | 2020-03-31 20:04:51 UTC |
| Type: | Bug |
| oVirt Team: | Storage |
| Bug Blocks: | 1732994, 1740521 |
Description
Siddhant Rao
2019-06-25 00:51:56 UTC
Please grab 'lvmdump -a -m' and attach.

The physical block size of a device should usually be 512 bytes (or sometimes 4K), but in this case it's 2MB:

/dev/mapper/36000d31000242e0000000000000001c2: Physical block size is 2097152 bytes

I'm still trying to understand what errors are being produced in LVM from using 2MB, but that analysis is somewhat academic, because the value needs to be fixed for LVM to work. Until we can build a version of LVM that ignores invalid block sizes, you might be able to force the devices or the system to use 512 bytes.

Could you please run the blockdev command with each of the following options on /dev/mapper/36000d31000242e0000000000000001c2:

--getss     get logical block (sector) size
--getpbsz   get physical block (sector) size
--getiomin  get minimum I/O size
--getioopt  get optimal I/O size
--getbsz    get blocksize

The mda_header is a 512-byte block at the start of the metadata area, which lvm writes to commit new metadata (it points to the location of the metadata text within the circular metadata buffer following the header). When lvm writes this 512-byte mda_header, it tells bcache not to write beyond the end of the mda_header (metadata area start + 512). This is the region of disk lvm wants to be writing to, and no more. LVM uses this write-limiting mechanism (set_last_byte) to avoid problems in case a bcache block (128KB) happens to extend past the end of the metadata area. That's not common, but it is possible, and we must not write past the metadata area, or we'd be overwriting user data.

When bcache goes to write the 512-byte mda_header from a 128KB bcache block, the write-limiting code goes into effect and reduces the write size to either 512 bytes (when the mda_header is located at the start of the bcache block) or 4608 bytes (when the mda_header is located 4096 bytes from the start of the bcache block). So bcache is writing a partial bcache block: only 512 or 4608 bytes to disk, rather than 128KB.
There is one more wrinkle in this write-limiting mechanism. After reducing the write size according to the limit imposed by lvm (making the write either 512 or 4608 bytes), the resulting write size is checked to see if it's a multiple of the disk sector size; we do not want to write partial disk sectors. If the reduced write size is not a whole number of sectors, it is extended to be a whole number of sectors. So the write size is first reduced per the lvm limit, and then extended again per the sector size.

The sector size is expected to be either 512 or 4096 bytes. When 512, the reduced write size (either 512 or 4608) is always a multiple of 512 and does not need extending. When 4096, the reduced write size is extended to 4096 or 8192 bytes. In both cases, the reduced+extended write size is far below the 128KB bcache block size, and lvm is still only writing a partial bcache block to disk [1].

However, in some cases the sector size returned from BLKPBSZGET is not 512 or 4096; in one case it is 2MB. When it is 2MB, the reduced write size is extended to 2MB, which is far beyond the size of the 128KB bcache block. This leads to writing unknown data to disk, into the metadata area and often into the data area (since the default metadata area is 1MB, we'd often be writing nearly 1MB of junk into the data area).

LVM needs to check the sector size returned from the ioctl, and use only 512 or 4096, regardless of what other values are returned. (We may also want to use BLKSSZGET instead, which the kernel describes as the "lowest possible block size that the storage device can address". BLKPBSZGET is described as "the lowest possible sector size that the hardware can operate on without reverting to read-modify-write operations".)

[1] There is some question here if lvm were to write within the final 512 bytes of a bcache block, and a 4K sector size extended the write to 4096. This may not happen in practice, but we need to handle it correctly.
From the failing command trace, which shows each mda_header write extended to 2097152 bytes (2MB).

Write of the first mda_header (precommit): the bcache block starts at offset 0, and the mda_header is at offset 4096, so the limit on the number of bytes written from the bcache block is 4608, which is where the first mda_header ends within the bcache block. This write succeeds. RHV uses a very large metadata area (128MB?), so the 2MB written when writing the mda_header wrote junk into the circular metadata buffer, in an area not used by the current copy of the metadata. So there was no damage done by this write.

#device/bcache.c:205 Limit write at 0 len 131072 to len 4608 rounded to 2097152
#format_text/format-text.c:331 Reading mda header sector from /dev/mapper/36000d31000242e0000000000000001c2 at 109937741004
#format_text/format-text.c:800 Pre-Committing 0c06f7f6-bf55-4eff-a23b-72f4886675b7 metadata (500) to /dev/mapper/36000d3100
#device/dev-io.c:609 Opened /dev/mapper/36000d31000242e0000000000000001c2 RO O_DIRECT
#device/dev-io.c:166 /dev/mapper/36000d31000242e0000000000000001c2: Block size is 4096 bytes
#device/dev-io.c:177 /dev/mapper/36000d31000242e0000000000000001c2: Physical block size is 2097152 bytes
#device/dev-io.c:658 Closed /dev/mapper/36000d31000242e0000000000000001c2

Write of the second mda_header (precommit): the bcache block starts at offset 1099377410048, and the mda_header is at offset 0 within the bcache block, so the limit on the number of bytes written from the bcache block is 512, which is where the second mda_header ends within the bcache block. This write fails somewhere in bcache.c, but all we see is false returned by the top-level bcache write function, so we don't know what caused the write error. If this write had succeeded, it would not have caused damage, because the current copy of metadata is far enough into the metadata area to not be clobbered.
(When writing the second mda_header at the end of the disk, we don't have to be concerned about clobbering user data, but the incorrect write extension could lead to writing beyond the end of the device.)

#device/bcache.c:205 Limit write at 1099377410048 len 131072 to len 512 rounded to 2097152
#label/label.c:1304 Error writing device /dev/mapper/36000d31000242e0000000000000001c2 at 1099377410048 length 512.
#format_text/format-text.c:407 Failed to write mda header to /dev/mapper/36000d31000242e0000000000000001c2 fd -1

I can't explain why this VG had no problems earlier in its lifetime, especially when the metadata text was within the range of the errant 2MB writes. One theory is that adjacent bcache blocks in memory contained the current blocks of data from disk, so when unintended bcache blocks were written to disk they happened to contain the correct data. Also, until we can find out which bcache.c error is causing the write error, we don't have a complete explanation of the problem.

Initial fix for further review and testing is here:
https://sourceware.org/git/?p=lvm2.git;a=shortlog;h=refs/heads/dev-dct-fix-sector-size-1

> [1] There is some question here if lvm were to write within the final 512
> bytes of a bcache block, and a 4K sector size extended the write to 4096.
> This may not happen in practice, but we need to handle it correctly.
Looked into this and it's not a problem as long as the bcache block size is a multiple of the 512|4096 sector size.
Hi David,

Thanks for your help. As you rightly mentioned, the physical block size is 2MB. Please find the output you requested:

# blockdev -v --getss /dev/mapper/36000d31000242e0000000000000001c2
get logical block (sector) size: 512
# blockdev -v --getpbsz /dev/mapper/36000d31000242e0000000000000001c2
get physical block (sector) size: 2097152
# blockdev -v --getiomin /dev/mapper/36000d31000242e0000000000000001c2
get minimum I/O size: 2097152
# blockdev -v --getioopt /dev/mapper/36000d31000242e0000000000000001c2
get optimal I/O size: 2097152
# blockdev -v --getbsz /dev/mapper/36000d31000242e0000000000000001c2
get blocksize: 4096

Let me know if you need anything else. Thanks!

Second version of the patch, which saves the logical/physical sizes in the dev struct for reuse:
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=7550665ba49ac7d497d5b212e14b69298ef01361

One thing about this patch that I'm slightly unsure about is how to handle physical_sector_size=512 && logical_sector_size=4096, or physical_sector_size=4096 && logical_sector_size=512. In both of these cases the device should handle 512-byte I/Os, so I'm rounding our I/Os to the nearest 512 bytes.

Ben figured out how to use scsi_debug to test this. Create a 1GB scsi device with a 2MB physical sector size and a 4096-byte logical sector size:

modprobe scsi_debug sector_size=4096 physblk_exp=9 dev_size_mb=1024

Find out which scsi device that is (assuming you're not already using scsi_debug):

SDN=`grep scsi_debug /sys/block/*/device/model | sed 's/^\/sys\/block\/\(.*\)\/device.*/\1/'`
vgcreate test /dev/$SDN
for i in `seq 1 40`; do lvcreate -l1 -an test; done

For me this results in corrupted VG metadata at about lvol38 when using the old version of lvm.

Pushed to the stable-2.02 branch:
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=7550665ba49ac7d497d5b212e14b69298ef01361

Fix verified with the latest rpms using the example in comment #25.
3.10.0-1109.el7.x86_64

lvm2-2.02.186-3.el7                        BUILT: Fri Nov 8 07:07:01 CST 2019
lvm2-libs-2.02.186-3.el7                   BUILT: Fri Nov 8 07:07:01 CST 2019
lvm2-cluster-2.02.186-3.el7                BUILT: Fri Nov 8 07:07:01 CST 2019
lvm2-lockd-2.02.186-3.el7                  BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-1.02.164-3.el7               BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-libs-1.02.164-3.el7          BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-event-1.02.164-3.el7         BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-event-libs-1.02.164-3.el7    BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-persistent-data-0.8.5-1.el7  BUILT: Mon Jun 10 03:58:20 CDT 2019

With fix (no issues):

[...]
  Logical volume "lvol33" created.
  WARNING: Logical volume test/lvol34 not zeroed.
  Logical volume "lvol34" created.
  WARNING: Logical volume test/lvol35 not zeroed.
  Logical volume "lvol35" created.
  WARNING: Logical volume test/lvol36 not zeroed.
  Logical volume "lvol36" created.
  WARNING: Logical volume test/lvol37 not zeroed.
  Logical volume "lvol37" created.
  WARNING: Logical volume test/lvol38 not zeroed.
  Logical volume "lvol38" created.
  WARNING: Logical volume test/lvol39 not zeroed.
  Logical volume "lvol39" created.

Without fix (lvm2-2.02.185-2.el7.x86_64):

  Logical volume "lvol24" created.
  WARNING: Logical volume test/lvol25 not zeroed.
  Logical volume "lvol25" created.
  WARNING: Logical volume test/lvol26 not zeroed.
  Logical volume "lvol26" created.
  Metadata on /dev/sdl at 139264 has wrong VG name "" expected test.
  Metadata on /dev/sdl at 139264 has wrong VG name "" expected test.
  WARNING: Logical volume test/lvol27 not zeroed.
  Logical volume "lvol27" created.
  WARNING: Logical volume test/lvol28 not zeroed.
  Logical volume "lvol28" created.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:1129