Bug 1671964 - Second extend of second lvmraid mirror does not sync
Summary: Second extend of second lvmraid mirror does not sync
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: LVM and device-mapper
Classification: Community
Component: lvm2
Version: 2.02.176
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Heinz Mauelshagen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Duplicates: 1669999 (view as bug list)
Depends On:
Blocks:
 
Reported: 2019-02-02 20:36 UTC by Steve D
Modified: 2020-06-10 12:53 UTC
CC List: 9 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-06-10 12:53:56 UTC
Embargoed:
pm-rhel: lvm-technical-solution?
pm-rhel: lvm-test-coverage?




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1814389 0 None None None 2019-02-02 20:36:32 UTC

Description Steve D 2019-02-02 20:36:33 UTC
Description of problem:

Extending an lvmraid(7) raid1 mirror for the second time seems to result in the newly added space not being synced across the mirror legs, *if* there is another raid1 mirror in the VG.

Version-Release number of selected component (if applicable):

2.02.176 (4.1ubuntu3)

How reproducible:

Seems to be reliably reproducible here on Ubuntu 18.04 at least.

Steps to Reproduce:

# quickly fill two 10G files with random data
openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | dd bs=$((1024*1024*1024)) count=10 of=pv1.img iflag=fullblock
openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | dd bs=$((1024*1024*1024)) count=10 of=pv2.img iflag=fullblock

# change loop devices if you have loads of snaps in use
losetup /dev/loop10 pv1.img
losetup /dev/loop11 pv2.img
pvcreate /dev/loop10
pvcreate /dev/loop11
vgcreate testvg /dev/loop10 /dev/loop11

lvcreate --type raid1 -L2G -n test testvg
watch lvs -o +raid_sync_action,sync_percent,raid_mismatch_count testvg

# wait for sync

lvcreate --type raid1 -L2G -n test2 testvg
watch lvs -o +raid_sync_action,sync_percent,raid_mismatch_count testvg

# wait for sync

# the following will sync OK; observe the kernel messages from the md subsystem noting the time taken
#
lvextend -L+2G testvg/test2
watch lvs -o +raid_sync_action,sync_percent,raid_mismatch_count testvg

# wait for sync

# the following will FAIL to sync; the sync will seem to complete instantly, e.g.:
# Feb 02 15:22:50 asr-host kernel: md: resync of RAID array mdX
# Feb 02 15:22:50 asr-host kernel: md: mdX: resync done.
#
lvextend -L+2G testvg/test2

lvchange --syncaction check testvg/test2
watch lvs -o +raid_sync_action,sync_percent,raid_mismatch_count testvg

# observe error count
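
# optional cleanup when finished (not part of the original report; assumes the loop devices and names used above)
lvremove -y testvg
vgremove testvg
pvremove /dev/loop10 /dev/loop11
losetup -d /dev/loop10 /dev/loop11
rm pv1.img pv2.img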

Actual results:

The sync after the final lvextend completes instantly, and a subsequent lvchange --syncaction check reports a high number for raid_mismatch_count

Expected results:

The sync after the final lvextend should take at least a few seconds, and a subsequent lvchange --syncaction check should not report any errors for raid_mismatch_count (unless the underlying hardware has failed.)

Additional info:

Launchpad bug: https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/1814389

Comment 1 Jonathan Earl Brassow 2019-08-19 21:21:12 UTC
ok, so it seems you are creating 2 RAID1 LVs and extending the second one twice - ultimately leaving the second one 3x its original size.

I've repeated the commands that you ran on RHEL7.7.

I found that the second extend (which is where the problem began in your case) did not instantly return for me and took about the same amount of time as the first; proceeding from ~66% to 100% performing the resync.  HOWEVER, I too encountered the strange mismatch count.

  LV    VG     Attr	  LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert SyncAction Cpy%Sync Mismatches
  test  testvg rwi-a-r--- 2.00g                                    100.00           idle       100.00            0
  test2 testvg rwi-a-r-m- 6.00g                                    100.00           idle       100.00	   1986560

Importantly, I noticed that 'Mismatches' was 0 up to when the 'check' got to 66% - then the number began increasing.  This is highly suggestive that the second extend did not have an effective resync.  Further, when I wrote out 5GiB to 'testvg/test2', and rechecked 'Mismatches', I got:
  LV    VG     Attr	  LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert SyncAction Cpy%Sync Mismatches
  test  testvg rwi-a-r--- 2.00g                                    100.00           idle       100.00            0
  test2 testvg rwi-a-r-m- 6.00g                                    100.00           idle       100.00       983040
with 'Mismatches' only starting to count at 80+% (~5/6ths), right where you would expect if the last 1GiB was left alone.
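
For reference, the 5GiB of writes could be generated with something along these lines (the exact command used is not shown above, so this is only an illustration):

dd if=/dev/urandom of=/dev/testvg/test2 bs=1M count=5120 oflag=direct status=progress

Anything written this way passes through the raid1 layer and is mirrored to both legs, which is consistent with mismatches only appearing past the ~5/6 point.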

There is definitely something to be investigated here.

Comment 2 Jonathan Earl Brassow 2019-08-19 21:33:58 UTC
Importantly, I do not believe there is any danger of corruption or data loss.  New writes seem to be operating correctly, so only the unused areas of the disk might contain different data and thus report mismatches.  It is a bug and worth investigating, but it shouldn't cause any problems other than scaring the user with a non-zero 'Mismatches' count.

Comment 3 Steve D 2019-08-20 08:51:44 UTC
It's been a while since I looked at all this; I assume bug #1640630 is related. I'm pretty sure I still have LVs where my monthly check does not in fact check the whole volume; this potentially runs the risk of not spotting e.g. hardware data corruption.

Comment 4 Heinz Mauelshagen 2019-08-20 12:00:16 UTC
This does not need 2 RaidLVs in the VG, 4TiB is the magical size after which extents are not being synchronized any more.

On Fedora 30 for instance (5.1.7-300.fc30.x86_64 / 2.02.183(2)), creating a 4GiB 2-legged raid1 LV
in an empty VG initially syncs fine, but any extension after that does not (it immediately jumps
to 100 sync percent); the extended, unsynced segments are thus most likely subject to check failures:

[root@vm161 ~]# vgs --noh t
  t   64   0   0 wz--n- 63.75g 63.75g

[root@vm161 ~]# lvcreate -L4g -nt -m1 t;lvs --noh t/t
  Logical volume "t" created.
  t    t            rwi-a-r---  4.00g                                    0.00            

 [root@vm161 ~]# lvs --noh t/t
  t    t            rwi-a-r---  4.00g                                    53.98

[root@vm161 ~]# lvextend -y -L+1g t/t;lvs -noh t/t # <-- this immediately jumps to 100 sync percent!
  Extending 2 mirror images.
  Size of logical volume t/t changed from 4.00 GiB (1024 extents) to 5.00 GiB (1280 extents).
  Logical volume t/t successfully resized.
  t    t            rwi-a-r---  5.00g                                    100.00
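
A follow-up scrub of the extended LV (not shown in the transcript above; same commands as in the original report) would be expected to flag mismatches in the unsynced extension:

lvchange --syncaction check t/t
lvs --noh -o name,raid_sync_action,sync_percent,raid_mismatch_count t/t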


As Jon elaborated on in comment #2, this is not a corruption or data loss issue,
because any newly written data to the extended segment will be mirrored.

The extended segment is similar to creating a new raid1 LV with the --nosync option
_and_ avoiding reading what has not been written before.

You'd also get 'check' failures for such a --nosync-created raid1, unless you removed
a previous, synchronized raid1 of the same or larger size and recreated it anew
with --nosync, which is likely to result in the exact same allocation (unless you
also changed allocation arguments).
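
As a possible workaround until a fix is in place (this is not from the upstream discussion, so treat it as an assumption), the unsynced range of an affected LV could be dealt with manually, either by repairing in place or by forcing a full resynchronization:

# rewrite mismatched regions from the first leg while the LV stays active
lvchange --syncaction repair t/t

# or force a complete resync (the LV should be inactive for --resync)
lvchange -an t/t
lvchange --resync -y t/t
lvchange -ay t/t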


Jon,
relative to your '...does not check the whole volume...' remark in comment #2:
got examples/logs of that?

Comment 5 Steve D 2019-08-20 12:45:24 UTC
> Jon, relative to your '...does not check the whole volume...' remark in comment #2:
> got examples/logs of that?

Not sure if this was actually meant for me - I've just manually scrubbed my RAID LVs and it's working OK at the moment, but I swear I've seen problems in the past where RAID LVs would get "stuck" and repeatedly not fully check/scrub. Hard to be certain though, as my health is variable and it's been a while since I was looking at this. I've had a couple of disk changes in my array over the last few months, and when that happened I likely consolidated the volumes in question, which may have made a difference.

Comment 6 Heinz Mauelshagen 2019-08-28 21:29:19 UTC
(In reply to Heinz Mauelshagen from comment #4)
> This does not need 2 RaidLVs in the VG, 4TiB is the magical size after which

Typo: it is 4GiB in my testing

Comment 7 Heinz Mauelshagen 2019-08-28 21:30:20 UTC
*** Bug 1669999 has been marked as a duplicate of this bug. ***

Comment 8 Heinz Mauelshagen 2019-08-28 21:55:43 UTC
Analysis so far shows that extending past 4GiB causes the bitmap layer to return the extended segment's 2GiB as skipped sectors to the MD sync thread, thus preventing synchronization.  Found a solution by
imperatively requesting recovery in the dm-raid target via MD's MD_RECOVERY_REQUESTED flag.
Will investigate a bit more, aiming to avoid issues with the MD recovery sync request path into the bitmap layer.

Comment 9 XiaoNi 2019-08-30 09:31:20 UTC
After the second extension, the bitmap skips too much.

[ 1007.906146] md/raid1:mdX: not clean -- starting background reconstruction
[ 1007.945422] md/raid1:mdX: active with 2 out of 2 mirrors
[ 1008.046904] md: resync of RAID array mdX
[ 1008.069887] /* We can skip this 0:4194304 block, and probably several more */
[ 1008.111110] /* We can skip this 4194304:4194304 block, and probably several more */
[ 1008.155622] /* We can skip this 8388608:4194304 block, and probably several more */
[ 1008.200560] md: mdX: resync done.

I'll update more information.

Comment 10 XiaoNi 2019-09-10 08:13:10 UTC
Hi all

md uses a bitmap to check whether it needs to resync. Each bit in the bitmap describes
one area of disk space; the bitmap chunk size is the size of that area, 2MB in this case.
The bitmap is stored on each member disk of the raid device.

The bitmap is read when the raid device is created; md_bitmap_load->md_bitmap_init_from_disk
does this job. For the first creation, it reads the bitmap from one member disk and sets
every bit to 1, so a full resync is needed. In this case, there are 1024 bits in the bitmap.
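
The bit counts in the kernel messages below follow directly from that 2MB chunk size (a quick sanity check in shell arithmetic, using the sizes from the reproduction above):

echo $(( 2 * 1024**3 / (2 * 1024**2) ))   # 1024 bits for the initial 2G LV
echo $(( 4 * 1024**3 / (2 * 1024**2) ))   # 2048 bits after the first +2G extend
echo $(( 6 * 1024**3 / (2 * 1024**2) ))   # 3072 bits after the second +2G extend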

For the first extension, lvextend -L+2G testvg/test:
It reads the bitmap from the disk. The first 1024 bits are not set, so it skips this area (2G).
The following 1024 bits are set, so it resyncs this area (2G).
[ 1199.510291] mdX: bitmap initialized from disk: read 1 pages, set 1024 of 2048 bits
[ 1199.531596] md: resync of RAID array mdX
[ 1209.863815] md: mdX: resync done.

For the second extension, lvextend -L+2G testvg/test:
It reads the bitmap from one member disk and finds that all the bits are zero.
[ 1370.240692] mdX: bitmap initialized from disk: read 1 pages, set 0 of 3072 bits
[ 1370.263343] md: resync of RAID array mdX
[ 1370.267883] md: mdX: resync done.

                paddr = kmap_atomic(page);
                if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags))
                        b = test_bit(bit, paddr);
                else
                        b = test_bit_le(bit, paddr);
                kunmap_atomic(paddr);
                if (b) {                                            <<<<---------------- all bits are zero
                        /* if the disk bit is set, set the memory bit */
                        int needed = ((sector_t)(i+1) << bitmap->counts.chunkshift
                                      >= start);
                        md_bitmap_set_memory_bits(bitmap,
                                                  (sector_t)i << bitmap->counts.chunkshift,
                                                  needed);
                        bit_cnt++;
                }


So there is code in lvm that sets the bitmap bits to 1 during lvextend, but it does not
do this during the second extension.

Thanks
Xiao

Comment 17 Steve D 2019-10-01 16:38:16 UTC
My brain's not working well at the moment for health reasons - will this have affected mdadm / mdraid as well as LVM? Or does mdadm have its own interface? I'm confused about the underlying architecture here..

Comment 18 Steve D 2019-10-01 16:53:46 UTC
Also, meant to say - thank you all for looking into this - looks like it was quite involved. I'm one of those people who seem to be irredeemably attractive to obscure corner cases in software :)

Comment 19 Heinz Mauelshagen 2019-10-01 20:44:49 UTC
(In reply to Steve D from comment #17)
> My brain's not working well at the moment for health reasons - will this
> have affected mdadm / mdraid as well as LVM? Or does mdadm have its own
> interface? I'm confused about the underlying architecture here..

All the best :-)

No mdraid issues here, just the device-mapper kernel side; i.e.,
the dm-raid target interfacing with kernel mdraid needs the fix.

Comment 20 Heinz Mauelshagen 2020-06-10 12:53:56 UTC
The fix has been in since kernel v5.5-rc7.

