Bug 1665575 - Metadata growing over a certain threshold causes failure to write VG metadata to the PV
Summary: Metadata growing over a certain threshold causes failure to write VG metadata to the PV
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1696742
 
Reported: 2019-01-11 20:10 UTC by bugzilla
Modified: 2021-09-03 12:55 UTC
CC List: 11 users

Fixed In Version: lvm2-2.02.184-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1696742
Environment:
Last Closed: 2019-08-06 13:10:44 UTC
Target Upstream Version:
Embargoed:


Attachments
Bulk-delete lvthin logical volumes (4.54 KB, text/plain), 2019-01-17 00:46 UTC, bugzilla


Links
Red Hat Product Errata RHBA-2019:2253, last updated 2019-08-06 13:11:05 UTC

Description bugzilla 2019-01-11 20:10:17 UTC
Description of problem:

We were attempting to resize a disk and started getting this error:
  bcache failed to get block 1035 fd 5
  Error writing device /dev/vg/data-pv at 131594752 length 4072089.
  Failed to write metadata to /dev/vg/data-pv fd -1
  Failed to write VG data.

Notice that data-pv is an LV that serves as a PV for the volume group "data". This volume group has 17MB of VG metadata and we have thousands of snapshots. We did the math to make sure the ring buffer isn't full, and there is still plenty of room.
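
One way to do that math is with the mda reporting fields; a minimal sketch using the names above, assuming the stock pvs/vgs report fields:

  # Size of the PV metadata area and how much of it is still free
  pvs -o pv_name,pv_mda_size,pv_mda_free /dev/vg/data-pv
  # The same information aggregated at the VG level
  vgs -o vg_name,vg_mda_size,vg_mda_free data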


Version-Release number of selected component (if applicable):

lvm2-2.02.180-10.el7_6.2.x86_64


How reproducible:

I don't know how to reproduce it on another system, but we solved the problem with the following. Note that the blkdiscard length is the metadata area of the PV as shown by pvs -o pv_all in the PMdaSize column:

  # Deactivate the VG and back up its current metadata
  lvchange -an data
  vgcfgbackup -f /tmp/pool0-current data
  # Zero the PV metadata area (length = PMdaSize from pvs -o pv_all)
  blkdiscard -z -l 150994944 /dev/vg/data-pv
  pvscan --cache
  # Recreate the PV with its original UUID, then restore the VG metadata
  pvcreate --uuid nocOg0-j6yE-46mt-bSuC-KgoS-geXO-SaKgmY --restorefile /tmp/pool0-current /dev/vg/data-pv --force --force
  vgcfgrestore --file /tmp/pool0-current data --force


Once we had reset the VG metadata to the beginning of the metadata area instead of somewhere out in the middle of it, we were able to delete volumes. It happened once or twice more, but after we had deleted several volumes we did not hit it again.


Actual results:

Unable to delete, resize, or create logical volumes


Expected results:

The ability to create, resize and delete logical volumes


Additional info:

    ACTIVATION_SKIP flag set for LV data/bar, skipping activation.
        /dev/vg/data-pv 0:      0   4048: lvol0_pmspare(0:0)
        /dev/vg/data-pv 1:   4048 204800: pool0_tdata(0:0)
        /dev/vg/data-pv 2: 208848   4048: pool0_tmeta(0:0)
        /dev/vg/data-pv 3: 212896 188416: pool0_tdata(204800:0)
        /dev/vg/data-pv 4: 401312 122940: NULL(0:0)
        Dropping cache for data.
        Unlock: Memlock counters: prioritized:0 locked:0 critical:0 daemon:0 suspended:0
        Reading mda header sector from /dev/vg/data-pv at 4096
        Doubling metadata output buffer to 131072
        Doubling metadata output buffer to 262144
        Doubling metadata output buffer to 524288
        Doubling metadata output buffer to 1048576
        Doubling metadata output buffer to 2097152
        Doubling metadata output buffer to 4194304
        Writing metadata for VG data to /dev/vg/data-pv at 131594752 len 4072089 (wrap 0)
        Opened /dev/vg/data-pv RO O_DIRECT
        /dev/vg/data-pv: Block size is 4096 bytes
        /dev/vg/data-pv: Physical block size is 512 bytes
        Closed /dev/vg/data-pv
        Close and reopen to write /dev/vg/data-pv
  bcache failed to get block 1035 fd 5
  Error writing device /dev/vg/data-pv at 131594752 length 4072089.
  Failed to write metadata to /dev/vg/data-pv fd -1
  Failed to write VG data.
        Unlock: Memlock counters: prioritized:0 locked:0 critical:0 daemon:0 suspended:0
        Syncing device names
        Dropping cache for data.
      Unlocking /run/lock/lvm/V_data
        _undo_flock /run/lock/lvm/V_data
        Freeing VG data at 0x55606e957ed0.
        Freeing VG data at 0x55606e9b9ac0.
        Dropping VG info
        lvmcache has no info for vgname "#orphans_lvm1" with VGID #orphans_lvm1.
        lvmcache has no info for vgname "#orphans_lvm1".
        lvmcache: Initialised VG #orphans_lvm1.
        lvmcache has no info for vgname "#orphans_pool" with VGID #orphans_pool.
        lvmcache has no info for vgname "#orphans_pool".
        lvmcache: Initialised VG #orphans_pool.
        lvmcache has no info for vgname "#orphans_lvm2" with VGID #orphans_lvm2.
        lvmcache has no info for vgname "#orphans_lvm2".
        lvmcache: Initialised VG #orphans_lvm2.
        Completed: lvcreate -vvv -s -n bar data/foo

Comment 2 David Teigland 2019-01-14 15:51:33 UTC
Thanks. I recall seeing something like this once while playing with various sizes of bcache blocks and io sizes; I need to see if I have any record of it.  It may have been related to a single io being split into too many bcache blocks.

Comment 3 David Teigland 2019-01-15 20:00:50 UTC
This is a bug in the way that lvm sets the number of blocks in the bcache, which in this case is too small.  To fix this we need to either use a value that is always sufficiently large (which is hard to do without always allocating a lot of unnecessary memory), or dynamically increase the number of blocks, or fall back to using temporary buffers on demand that are not cached.

lvm sets the number of bcache blocks to the number of devices on the system, and this comment explains when that may not be a good choice:

https://sourceware.org/git/?p=lvm2.git;a=blob;f=lib/label/label.c;h=e01608d2ce571e714669567b179575b28dd9e7f5;hb=refs/heads/2018-06-01-stable#l775

I had assumed that if this value was too small, bcache would fall back to recycling blocks which would just reduce efficiency.  That's true, but in this case, all of the blocks are needed to complete a single write (since the metadata is so large.)

I don't see a good way of temporarily working around this problem until we can put out a fix.  Options include keeping the size of the metadata lower, or possibly adding an artificial number of additional devices that lvm would count when setting the size of the bcache.
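
As a rough, purely hypothetical sketch of the second idea (whether bare loop devices are actually counted depends on the device filter configuration, so treat this as an illustration only), each extra block device lvm scans adds one bcache block:

  # Hypothetical: create a handful of empty loop devices so that lvm's
  # device count, and therefore its bcache block count, is larger.
  for i in $(seq 1 32); do
      truncate -s 16M /var/tmp/lvm-pad-$i.img
      losetup -f /var/tmp/lvm-pad-$i.img
  done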

Comment 4 David Teigland 2019-01-15 20:17:15 UTC
Joe, what are your thoughts on fixing this in RHEL7?  The simplest and quickest solution is to always allocate a larger number of blocks.  In combination with that, we would probably want to add a config setting that could set the bcache size even larger (since I don't think lvm imposes any limit on the max metadata size).  I'd probably go for this option unless you think one of the others looks simple.

Comment 5 bugzilla 2019-01-17 00:43:19 UTC
Could bug #1639470 be related somehow?  We hit bug #1639470 and tried to fix it up by deleting the queued create message1 in pool0.  When attempting to restore the dump, we hit bug #1665575 (this thread).  We are now wiping the PV header and reloading the VG using the procedure above.

Since there are so many snapshots, lvremove takes about 30 seconds per removal, even when multiple LVs are listed on the same command line (e.g., via xargs).  To speed this up, we are deleting them by programmatically updating the metadata and calling `dmsetup message pool0 0 'delete dev_id'` for each ID matching our regex.  (NB: it would be nice to have lvremove handle bulk removes without treating them individually.)

I'll attach the lv-delete-bulk script we use in case it helps anyone else.
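
For context, a minimal sketch of that approach (not the attached script; the pool device name and the snapshot name pattern are placeholders, and it assumes the VG metadata has already been edited to drop the matching volumes):

  # Send a thin-pool "delete" message for each thin device ID whose LV name
  # matches a pattern.  Assumes the active pool device is data-pool0-tpool.
  lvs --noheadings -o thin_id,lv_name data |
  awk 'NF == 2 && $2 ~ /snap_/ { print $1 }' |
  while read id; do
      dmsetup message data-pool0-tpool 0 "delete $id"
  done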

Comment 6 bugzilla 2019-01-17 00:46:15 UTC
Created attachment 1521162 [details]
Bulk-delete lvthin logical volumes

Comment 7 Joe Thornber 2019-02-27 16:09:16 UTC
We could add a method to grow the bcache which would be called when large metadata was discovered.  Have to be careful to call it at a point where memory allocation is allowed though.

Comment 8 David Teigland 2019-02-27 20:30:03 UTC
A related effect that I reproduced with more testing is that the scanning phase will fail to read the metadata when it is larger than the bcache size (i.e., when a VG already exists with metadata larger than the bcache size).  So this bug will prevent existing large VGs from being read, not just from being modified.

Growing the bcache would probably be required in two places:

1. In response to a "bcache full" error returned during the scanning phase, to handle existing vgs that won't fit in bcache.

2. After the scanning phase is done, checking the max metadata size that was seen, and growing the bcache size to some larger size that would accommodate any possible modification the command might make.

That sounds a little too complex for rhel7 at this point.  For rhel7 I think we should add an lvm.conf setting that lets users forcibly increase the bcache size.  To reduce the cases where users have to do this, I think we can add some more intelligence to the bcache size we create.

Another hint we could use to pick a large enough bcache size is the size of the metadata backup files in /etc/lvm/backup: set the bcache size to the largest file seen.  This would not be entirely accurate, but in most cases it should be effective.  (Backups can be disabled, and backup files are not removed when a VG is removed, which could lead to unnecessarily large bcache sizes after a large VG is removed.)  I'm not yet sure how this would look, or if it's a good idea.
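
For illustration, that check could be as simple as the following sketch (hypothetical, not code that went into lvm):

  # Size in bytes of the largest metadata backup file, as a rough proxy for
  # the largest single copy of VG metadata on this host.
  stat -c '%s %n' /etc/lvm/backup/* | sort -n | tail -1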

Comment 9 David Teigland 2019-03-04 17:30:35 UTC
Pushed three commits to stable-2.02 branch:

https://sourceware.org/git/?p=lvm2.git;a=commit;h=8dbfdb5b737cf916a8b95b8d19eec67a960a6392
https://sourceware.org/git/?p=lvm2.git;a=commit;h=863a2e693ee95b95463d60fa8b21f4c7c084292c
https://sourceware.org/git/?p=lvm2.git;a=commit;h=590a1ebcf78b8aae2a1e5ebaba1ac24a54435690

commit 8dbfdb5b737cf916a8b95b8d19eec67a960a6392
Author: David Teigland <teigland>
Date:   Fri Mar 1 13:55:59 2019 -0600

    config: add new setting io_memory_size
    
    which defines the amount of memory that lvm will allocate
    for bcache.  Increasing this setting is required if it is
    smaller than a single copy of VG metadata.

commit 863a2e693ee95b95463d60fa8b21f4c7c084292c
Author: David Teigland <teigland>
Date:   Mon Mar 4 10:57:52 2019 -0600

    io: warn when metadata size approaches io memory size
    
    When a single copy of metadata gets within 1MB of the
    current io_memory_size value, begin printing a warning
    that the io_memory_size should be increased.

commit 590a1ebcf78b8aae2a1e5ebaba1ac24a54435690
Author: David Teigland <teigland>
Date:   Mon Mar 4 11:18:34 2019 -0600

    io: increase the default io memory from 4 to 8 MiB
    
    This is the default bcache size that is created at the
    start of the command.  It needs to be large enough to
    hold a single copy of metadata for a given VG, or the
    VG cannot be read or written (since the entire VG would
    not fit into available memory.)
    
    Increasing the default reduces the chances of anyone
    needing to increase the default to use their VG.
    
    The size can be set in lvm.conf global/io_memory_size;
    the lower limit is 4 MiB and the upper limit is 128 MiB.

Comment 13 Corey Marthaler 2019-07-02 22:07:12 UTC
Marking this verified (SanityOnly) with the latest rpms.  I know that in the original comment for this bug the circular buffer was not filled, but I never saw a point while filling it up where I couldn't create, and after it was full I was still able to remove, create, and resize.

3.10.0-1057.el7.x86_64
lvm2-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-libs-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-cluster-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-lockd-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-python-boom-0.9-18.el7    BUILT: Fri Jun 21 04:18:58 CDT 2019
cmirror-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-persistent-data-0.8.5-1.el7    BUILT: Mon Jun 10 03:58:20 CDT 2019



[root@hayes-02 ~]# systemctl status lvm2-lvmetad
● lvm2-lvmetad.service - LVM2 metadata daemon
   Loaded: loaded (/usr/lib/systemd/system/lvm2-lvmetad.service; static; vendor preset: enabled)
   Active: active (running) since Tue 2019-07-02 16:32:36 CDT; 31min ago
     Docs: man:lvmetad(8)
 Main PID: 1043 (lvmetad)
   CGroup: /system.slice/lvm2-lvmetad.service
           └─1043 /usr/sbin/lvmetad -f

Jul 02 16:32:36 hayes-02.lab.msp.redhat.com systemd[1]: Started LVM2 metadata daemon.


[root@hayes-02 ~]# pvcreate --metadatasize 100k /dev/sdb1
  Physical volume "/dev/sdb1" successfully created.

[root@hayes-02 ~]# vgcreate vg /dev/sdb1
  Volume group "vg" successfully created

[root@hayes-02 ~]#  pvs -a -o +devices,vg_mda_size
  PV         VG Fmt  Attr PSize    PFree    Devices VMdaSize 
  /dev/sdb1  vg lvm2 a--  <931.25g <931.25g          1020.00k


[root@hayes-02 ~]# for i in `seq 1 2000`; do lvcreate -an -l1 -n lv$i vg; done
  [...]
  VG vg metadata on /dev/sdb1 (522150 bytes) too large for circular buffer (1043968 bytes with 521842 used)
  Failed to write VG vg.
  VG vg metadata on /dev/sdb1 (522150 bytes) too large for circular buffer (1043968 bytes with 521842 used)
  Failed to write VG vg.
  VG vg metadata on /dev/sdb1 (522150 bytes) too large for circular buffer (1043968 bytes with 521842 used)
  Failed to write VG vg.
  VG vg metadata on /dev/sdb1 (522150 bytes) too large for circular buffer (1043968 bytes with 521842 used)
  Failed to write VG vg.
  VG vg metadata on /dev/sdb1 (522150 bytes) too large for circular buffer (1043968 bytes with 521842 used)
  Failed to write VG vg.
  VG vg metadata on /dev/sdb1 (522150 bytes) too large for circular buffer (1043968 bytes with 521842 used)
  Failed to write VG vg.
  VG vg metadata on /dev/sdb1 (522150 bytes) too large for circular buffer (1043968 bytes with 521842 used)
  Failed to write VG vg.

# removes worked
[root@hayes-02 ~]# for i in `seq 901 1000`; do lvremove -f vg/lv$i; done
  Logical volume "lv901" successfully removed
  Logical volume "lv902" successfully removed
  Logical volume "lv903" successfully removed
  [...]
  Logical volume "lv998" successfully removed
  Logical volume "lv999" successfully removed
  Logical volume "lv1000" successfully removed

# creates worked
[root@hayes-02 ~]# for i in `seq 2000 2100`; do lvcreate -an -l1 -n lv$i vg; done
  [...]
  WARNING: Logical volume vg/lv2097 not zeroed.
  Logical volume "lv2097" created.
  WARNING: Logical volume vg/lv2098 not zeroed.
  Logical volume "lv2098" created.
  WARNING: Logical volume vg/lv2099 not zeroed.
  Logical volume "lv2099" created.
  VG vg metadata on /dev/sdb1 (522249 bytes) too large for circular buffer (1043968 bytes with 521941 used)
  Failed to write VG vg.


[root@hayes-02 ~]# vgextend vg /dev/sdc1
  Physical volume "/dev/sdc1" successfully created.
  VG vg metadata on /dev/sdb1 (522110 bytes) too large for circular buffer (1043968 bytes with 521941 used)
  WARNING: Failed to write an MDA of VG vg.
  Volume group "vg" successfully extended


[root@hayes-02 ~]# vgextend -vvvv vg /dev/sde1 > /tmp/vgextend  2>&1
[root@hayes-02 ~]# grep bcache /tmp/vgextend
#device/bcache.c:211           Limit write at 0 len 131072 to len 8192
#device/bcache.c:211           Limit write at 0 len 131072 to len 2048
#device/bcache.c:211           Limit write at 0 len 131072 to len 4608
#device/bcache.c:211           Limit write at 0 len 131072 to len 1024
#device/bcache.c:211           Limit write at 0 len 131072 to len 1024
#device/bcache.c:211           Limit write at 0 len 131072 to len 4608
#device/bcache.c:211           Limit write at 0 len 131072 to len 4608

Comment 15 errata-xmlrpc 2019-08-06 13:10:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2253

