Bug 1966712
| | | | |
|---|---|---|---|
| Summary: | mdadm erroneously reports incorrect checksum | | |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Chris Moore <christopherm> |
| Component: | mdadm | Assignee: | XiaoNi <xni> |
| Status: | CLOSED ERRATA | QA Contact: | Fine Fan <ffan> |
| Severity: | high | Priority: | high |
| Version: | CentOS Stream | Keywords: | Triaged |
| Target Milestone: | beta | Target Release: | 8.5 |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | mdadm-4.2-rc1_3.el8 | Type: | Bug |
| Last Closed: | 2021-11-09 20:02:50 UTC | | |
| CC: | alex.iribarren, bstinson, carl, daniel.vanderster, davide, dledford, ffan, jamien, janguyen, jdonohue, jwboyer, mharri, ncroxon, ngompa13, pcahyna, rmeggins, xni, yizhan | | |
Description
Chris Moore
2021-06-01 17:55:32 UTC
The patch is at https://marc.info/?l=linux-raid&m=162259662926315&w=2. It needs to wait for an ack from the maintainer.

---

I think I've found the issue. sb->bitmap_offset can be negative (in my case it's -16), but struct mdp_superblock_1 defines it as __u32, so it has to be cast to an int32_t before doing the math on it. mdadm version 4.1 had the cast, but it disappeared in version 4.2. The following change seems to fix the problem:

```
$ git diff super1.c
diff --git a/super1.c b/super1.c
index c05e6237..f7981e3d 100644
--- a/super1.c
+++ b/super1.c
@@ -2631,7 +2631,7 @@ static int locate_bitmap1(struct supertype *st, int fd, int node_num)
 	else
 		ret = -1;
-	offset = __le64_to_cpu(sb->super_offset) + __le32_to_cpu(sb->bitmap_offset);
+	offset = __le64_to_cpu(sb->super_offset) + (int32_t) __le32_to_cpu(sb->bitmap_offset);
 	if (node_num) {
 		bms = (bitmap_super_t*)(((char*)sb)+MAX_SB_SIZE);
 		bm_sectors_per_node = calc_bitmap_size(bms, 4096) >> 9;
```
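A minimal sketch of the arithmetic behind that cast, using shell arithmetic and illustrative values (the -16 matches the bitmap offset mentioned above; the super offset matches the reproducer output further down): read back as an unsigned 32-bit value, -16 becomes 0xfffffff0 and the computed bitmap location lands far past the end of the device instead of 16 sectors before the superblock.

```
# -16 stored in a __u32 field reads back as 0xfffffff0 (4294967280)
$ printf '0x%x\n' $(( 2**32 - 16 ))
0xfffffff0
# buggy 4.2 math: the u32 is zero-extended before adding to the 64-bit super offset
$ echo $(( 524272 + 4294967280 ))
4295491552
# fixed math: sign-extend via int32_t first, i.e. effectively add -16
$ echo $(( 524272 - 16 ))
524256
```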
---

Hi Chris

The patch is right. I've sent the link at comment 1, and I've made comment 1 not private. Sorry for this.

By the way, are you only testing with super 1.0, or do you use super 1.0 in production? If it's for production, why not use super 1.2? The reason I ask this question is that I want to know more responses from different people :)

Regards
Xiao

---

(In reply to XiaoNi from comment #3)
> By the way, are you only testing with super 1.0, or do you use super 1.0 in production? If it's for production, why not use super 1.2?

That's an interesting question. This RAID was set up by the RHEL Anaconda installer. On the installer GUI we create a 512 MiB /boot/efi partition, and the remainder of the disk is mounted to /. Both are created as RAID 1 by selecting "RAID" as the type in the GUI. I don't know of any place where we can select the superblock format. But it's odd that we get 1.0, since I think 1.2 is the default for mdadm --create.

---

(In reply to Chris Moore from comment #4)
> I don't know of any place where we can select the superblock format. But it's odd that we get 1.0, since I think 1.2 is the default for mdadm --create.

The installer automatically creates a RAID1 /boot partition as version 1.0 so that it can be read by a non-RAID-aware boot loader. That used to be a requirement before grub2 was the norm.

---

(In reply to Doug Ledford from comment #5)
> The installer automatically creates a RAID1 /boot partition as version 1.0 so that it can be read by a non-RAID-aware boot loader.

Sorry, misread your prior statement. We create /boot/efi as a 1.0-superblock array because the EFI partition must be BIOS readable, and the BIOS doesn't know how to skip a superblock at the beginning of the device. It is a hard requirement that an EFI partition be superblock 1.0 as a result. This doesn't change regardless of the grub version in use (although we also used to create /boot partitions as superblock 1.0 when grub 1 was in use too).
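The placement difference is easy to see on an existing member device (a minimal sketch; /dev/sda2 is an illustrative device name, not one from this report): metadata 1.0 puts the superblock at the end of the member, so the data area, and hence the ESP's FAT filesystem, starts at sector 0, while metadata 1.2 keeps the superblock 8 sectors from the start and reserves a data offset before the filesystem.

```
# 1.0 members report a Super Offset near the device size and no Data Offset;
# 1.2 members report a small Super Offset plus a reserved Data Offset.
$ mdadm -E /dev/sda2 | grep -E 'Version|Super Offset|Data Offset'
```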
---

Exactly the same issue here, and it causes Stream 8 anaconda installation to fail on our hardware. Our anaconda RAIDs are defined like this:

```
# partition table
%pre
#!/bin/sh
DISKS=$(lsblk -d -o name,rota,fstype --noheadings | grep ^sd | grep -v -i 'LVM2_member' | awk '{if ($2=='0') print $1}' | head -n 2)
ONE=$(echo ${DISKS} | cut -d ' ' -f 1)
TWO=$(echo ${DISKS} | cut -d ' ' -f 2)

# it is very important to only clearpart on the first two drives.
# there are often other drives sdc, etc., which can be OSD journals
# or OSD data disks. We must not overwrite those partition tables.
echo "ignoredisk --only-use=${ONE},${TWO}" > /tmp/part-include
echo "clearpart --all --initlabel --drives ${ONE},${TWO}" >> /tmp/part-include

# for /boot
echo "partition raid.01 --size 1024 --ondisk ${ONE}" >> /tmp/part-include
echo "partition raid.02 --size 1024 --ondisk ${TWO}" >> /tmp/part-include
# for /boot/efi
echo "partition raid.11 --size 256 --ondisk ${ONE}" >> /tmp/part-include
echo "partition raid.12 --size 256 --ondisk ${TWO}" >> /tmp/part-include
# for /
echo "partition raid.21 --size 1 --ondisk ${ONE} --grow" >> /tmp/part-include
echo "partition raid.22 --size 1 --ondisk ${TWO} --grow" >> /tmp/part-include

echo "raid /boot --level=1 --device=boot --fstype=xfs raid.01 raid.02" >> /tmp/part-include
echo "raid /boot/efi --level=1 --device=boot_efi --fstype=efi raid.11 raid.12" >> /tmp/part-include
echo "raid / --level=1 --device=root --fstype=xfs raid.21 raid.22" >> /tmp/part-include
%end

# use the partition table defined above and dumped to file
%include /tmp/part-include
```

Installation fails with:

```
dasbus.error.DBusError: Process reported exit code 1: mdadm: RUN_ARRAY failed: Invalid argument
```

dmesg shows:

```
md126: bitmap superblock UUID mismatch
md126: failed to create bitmap (-22)
```

And mdadm -E shows 1-bit checksum errors on the members of boot_efi, just like Chris posted.

---

(In reply to Doug Ledford from comment #6)

Hi Doug

Thanks for the explanation.

---

(In reply to Dan van der Ster from comment #7)
> Exactly the same issue here, and it causes Stream 8 anaconda installation to fail on our hardware.

Hi Dan

Could you try this patch: https://marc.info/?l=linux-raid&m=162259662926315&w=2

---

> Could you try this patch: https://marc.info/?l=linux-raid&m=162259662926315&w=2
First, a clear reproducer for you with 4.2-rc1_1, maybe to add to some test framework:
```
# rpm -q mdadm
mdadm-4.2-rc1_1.el8.x86_64
# dd if=/dev/zero of=a.dat bs=1M count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.107383 s, 2.5 GB/s
# dd if=/dev/zero of=b.dat bs=1M count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.108689 s, 2.5 GB/s
# losetup /dev/loop0 a.dat
# losetup /dev/loop1 b.dat
# mdadm --create /dev/md0 --level=1 --metadata=1.0 --bitmap=internal --raid-devices=2 /dev/loop0 /dev/loop1
mdadm: RUN_ARRAY failed: Invalid argument
# mdadm -E /dev/loop1
/dev/loop1:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x0
Array UUID : c5d9f7f7:53213a6c:90828f3b:c5f8ba22
Name : 0
Creation Time : Wed Jun 9 11:08:45 2021
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 524256 sectors (255.98 MiB 268.42 MB)
Array Size : 262080 KiB (255.94 MiB 268.37 MB)
Used Dev Size : 524160 sectors (255.94 MiB 268.37 MB)
Super Offset : 524272 sectors
Unused Space : before=0 sectors, after=104 sectors
State : active
Device UUID : d9ca4edc:50ee669f:c3b6dd74:f42d1b02
Update Time : Wed Jun 9 11:08:45 2021
Bad Block Log : 512 entries available at offset -8 sectors
Checksum : 1ff86d78 - expected 1ff86d77
Events : 0
Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
# dmesg | tail
[161542.795797] md/raid1:md0: not clean -- starting background reconstruction
[161542.795798] md/raid1:md0: active with 2 out of 2 mirrors
[161542.795813] md0: invalid bitmap file superblock: bad magic
[161542.795815] md0: failed to create bitmap (-22)
[161542.795852] md: md0 stopped.
```
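On an affected build, the same symptom can be spot-checked on any suspect member without repeating the whole sequence (a minimal sketch; the device name follows the example above):

```
# A "Checksum : xxxxxxxx - expected yyyyyyyy" mismatch on a freshly created
# 1.0-metadata member with an internal bitmap is the failure shown above.
$ mdadm -E /dev/loop1 | grep -i checksum
```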
Now I built with that fix and it works:
```
# rpm -Uvh /tmp/mdadm/mdadm-4.2-rc1_2.el8.x86_64.rpm
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:mdadm-4.2-rc1_2.el8 ################################# [ 50%]
Cleaning up / removing...
2:mdadm-4.2-rc1_1.el8 ################################# [100%]
# dd if=/dev/zero of=/dev/loop0
dd: writing to '/dev/loop0': No space left on device
524289+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.50245 s, 179 MB/s
# dd if=/dev/zero of=/dev/loop1
dd: writing to '/dev/loop1': No space left on device
524289+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 1.65947 s, 162 MB/s
# mdadm --create /dev/md0 --level=1 --metadata=1.0 --bitmap=internal --raid-devices=2 /dev/loop0 /dev/loop1
mdadm: array /dev/md0 started.
# mdadm -E /dev/loop0
/dev/loop0:
Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
Array UUID : 719d639a:d16c5b0b:188ca971:cbb9e114
Name : 0
Creation Time : Wed Jun 9 11:50:04 2021
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 524256 sectors (255.98 MiB 268.42 MB)
Array Size : 262080 KiB (255.94 MiB 268.37 MB)
Used Dev Size : 524160 sectors (255.94 MiB 268.37 MB)
Super Offset : 524272 sectors
Unused Space : before=0 sectors, after=96 sectors
State : clean
Device UUID : e6aa2d8d:43097efd:40b6e6fc:b1922f80
Internal Bitmap : -16 sectors from superblock
Update Time : Wed Jun 9 11:50:05 2021
Bad Block Log : 512 entries available at offset -8 sectors
Checksum : 9ed9b9d9 - correct
Events : 17
Device Role : Active device 0
Array State : AA ('A' == active, '.' == missing, 'R' == replacing)
# dmesg | tail
[164021.431804] md/raid1:md0: not clean -- starting background reconstruction
[164021.431805] md/raid1:md0: active with 2 out of 2 mirrors
[164021.433174] md0: detected capacity change from 0 to 268369920
[164021.433229] md: resync of RAID array md0
[164022.621793] md: md0: resync done.
```
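For completeness, the repaired array can also be checked from the running-md side and then torn down (a minimal sketch; device and file names follow the example above):

```
$ cat /proc/mdstat                          # md0 should be listed with a bitmap line
$ mdadm --detail /dev/md0 | grep -i bitmap  # "Intent Bitmap : Internal"
$ mdadm --stop /dev/md0                     # stop the test array when done
$ losetup -d /dev/loop0 /dev/loop1
$ rm a.dat b.dat
```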
So I assume anaconda will also work.
(In reply to Dan van der Ster from comment #10)
> First, a clear reproducer for you with 4.2-rc1_1, maybe to add to some test framework:

Hi Fine

Could you add the test case to our regression tests?

> Now I built with that fix and it works:

Thanks
Xiao

---

Recorded. Adding.

---

We're unable to install new machines with CS8 due to this bug. What is the ETA for a fix? I see @ncroxon set a target release of 8.6, which I'm not sure if that means a year...

---

I'll ping the upstream maintainer again.

---

I have sent the patch to upstream some days ago. We still have some time to fix this in 8.5. So change target to 8.5 again.

---

Hello, I have encountered the problem with creating 1.0 metadata on an EFI partition (from inside anaconda) too:

```
# mdadm --create /dev/md/boot_efi --run --level=raid1 --raid-devices=2 --metadata=1.0 --bitmap=internal --chunk=512 /dev/vdb1 /dev/vdb2
mdadm: RUN_ARRAY failed: Invalid argument
```

The kernel says:

```
[  119.232426] md/raid1:md127: not clean -- starting background reconstruction
[  119.232426] md/raid1:md127: active with 2 out of 2 mirrors
[  119.233852] md127: invalid bitmap file superblock: bad magic
[  119.233856] md127: failed to create bitmap (-22)
[  119.233942] md: md127 stopped.
```

With the updated mdadm-4.2-rc1_3.el8, I don't get this problem anymore, but I get another one:

```
# mdadm --create /dev/md/boot_efi --run --level=raid1 --raid-devices=2 --metadata=1.0 --bitmap=internal --chunk=512 /dev/vdb1 /dev/vdb2
mdadm: specifying chunk size is forbidden for this level
```

Note that this is not specific to 1.0 metadata; with 1.2 the same thing happens. In the previous version, mdadm-4.2-rc1_2.el8, I did not have this problem when using 1.2 metadata.

---

*** Bug 1917308 has been marked as a duplicate of this bug. ***

---

No new issue found on mdadm-4.2-rc1_3.el8.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (mdadm bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4494

---

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
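Regarding the "specifying chunk size is forbidden for this level" error reported above: chunk size has no meaning for RAID1, which does not stripe, so the likely adjustment (an assumption, not something stated in this report) is simply to drop --chunk from the create command:

```
# same invocation as in the comment above, minus --chunk=512 (device names as in the report)
# mdadm --create /dev/md/boot_efi --run --level=raid1 --raid-devices=2 --metadata=1.0 --bitmap=internal /dev/vdb1 /dev/vdb2
```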