Bug 1917308
| Summary: | storage/tests_raid_volume_options.yml failed because mdadm super1.0 bitmap offset is wrong |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | mdadm |
| Status: | CLOSED DUPLICATE |
| Severity: | unspecified |
| Priority: | high |
| Version: | 8.4 |
| Target Milestone: | rc |
| Target Release: | 8.5 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | role:storage |
| Reporter: | Zhang Yi <yizhan> |
| Assignee: | XiaoNi <xni> |
| QA Contact: | Storage QE <storage-qe> |
| CC: | dledford, dlehman, hwkernel-mgr, japokorn, jdonohue, ncroxon, rmeggins, sergey.korobitsin, vtrefny, xni |
| Keywords: | Triaged |
| Flags: | pm-rhel: mirror+ |
| Doc Type: | If docs needed, set a value |
| Last Closed: | 2021-07-23 09:20:56 UTC |
| Type: | Bug |
Description (Zhang Yi, 2021-01-18 09:40:24 UTC)
Created attachment 1748414 [details]
Fail log
Created attachment 1748415 [details]
pass log
This looks like a hardware/configuration issue. See dmes.log from failure tarball:
[ 2619.698082] sd 2:0:1:1: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
[ 2620.519262] blk_update_request: I/O error, dev sdc, sector 20971488 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 2620.519278] blk_update_request: I/O error, dev sdc, sector 20971488 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 2620.519287] md: super_written gets error=10
[ 2620.519293] md/raid1:md127: Disk failure on sdc1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
[ 2620.522863] md: super_written gets error=10
[ 2620.522869] md/raid1:md127: Disk failure on sdd1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
[ 2620.522899] sd 2:0:1:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
[ 2620.642967] sd 2:0:1:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
[ 2620.686518] sd 2:0:1:0: [sdb] Synchronizing SCSI cache
[ 2620.751501] md: super_written gets error=10
[ 2620.795754] md: super_written gets error=10
[ 2620.795773] md: super_written gets error=10
[ 2620.795795] md: md127 still in use.
[ 2620.855754] scsi 2:0:1:2: alua: Detached
[ 2620.855835] md: super_written gets error=10
[ 2620.855847] md: super_written gets error=10
[ 2620.895741] scsi 2:0:1:1: alua: Detached
[ 2620.895817] md: super_written gets error=10
[ 2620.895829] md: super_written gets error=10
[ 2621.061079] md: super_written gets error=10
[ 2621.061100] md: super_written gets error=10
[ 2621.061106] md: super_written gets error=10
[ 2621.061218] md127: detected capacity change from 10728898560 to 0
[ 2621.061229] md: md127 stopped.
[ 2621.135771] scsi 2:0:1:0: alua: Detached
Are we sure this is a bug in the storage role?
(In reply to David Lehman from comment #4)
> This looks like a hardware/configuration issue. See dmes.log from failure
> tarball:
> [dmesg excerpt snipped - quoted in full in comment #4 above]
> Are we sure this is a bug in the storage role?

I'm not sure, but from the log it seems the disks stopped before md stopped.

This is working for me with the rhel 8.4.0 GA bits. However, it fails for me with the rhel 8.5 nightly bits - the error is still in the raid test, but the error message is different:
TASK [linux-system-roles.storage : manage the pools and volumes to match the specified state] ***
fatal: [/home/rmeggins/.cache/libvirt/rhel-8-y.qcow2]: FAILED! => {
    "actions": [],
    "changed": false,
    "crypts": [],
    "leaves": [],
    "mounts": [],
    "packages": [
        "dosfstools",
        "lvm2",
        "xfsprogs",
        "mdadm"
    ],
    "pools": [],
    "volumes": []
}
MSG:
Failed to commit changes to disk: Process reported exit code 1: mdadm: RUN_ARRAY failed: Invalid argument
Can a storage dev please take a look at this? This is a test blocker for rhel 8.5 (or I'll need to add an exception for the raid tests).
Steps to reproduce:
0) git clone https://github.com/linux-system-roles/storage; cd storage/tests
1) create a _setup.yml like this:
- hosts: all
  tasks:
    - name: Set up internal repositories for RHEL 8
      copy:
        content: |-
          {% for repo, url in repourls.items() %}[{{repo}}]
          name={{repo}}
          baseurl={{ url }}
          enabled=1
          gpgcheck=0
          {% endfor %}
        dest: /etc/yum.repos.d/rhel.repo
      vars:
        repourls:
          rhel-baseos: "http://download.devel.redhat.com/rhel-8/nightly/RHEL-8/latest-RHEL-8.5/compose/BaseOS/x86_64/os/"
          rhel-appstream: "http://download.devel.redhat.com/rhel-8/nightly/RHEL-8/latest-RHEL-8.5/compose/AppStream/x86_64/os/"
2) grab the latest rhel 8.5 cloud image:
curl -o rhel-8-y.qcow2 http://download.devel.redhat.com/rhel-8/nightly/RHEL-8/latest-RHEL-8.5/compose/BaseOS/x86_64/images/rhel-guest-image-8.5-514.x86_64.qcow2
3) run the test
TEST_SUBJECTS=rhel-8-y.qcow2 ANSIBLE_STDOUT_CALLBACK=debug ansible-playbook -i /usr/share/ansible/inventory/standard-inventory-qcow2 _setup.yml tests_raid_pool_options.yml 2>&1 | tee output
If you want to leave the machine around to ssh into it, then use TEST_DEBUG=true TEST_SUBJECTS...
The test output will contain instructions about how to ssh into the machine
here is the dmesg output:

[   92.066391] vdd:
[   92.176553] vdd:
[   92.205674] vdd: vdd1
[   92.460778] vdc:
[   92.577499] vdc:
[   92.603789] vdc: vdc1
[   92.885760] vdb:
[   93.003420] vdb:
[   93.030983] vdb: vdb1
[   93.338363] md/raid1:md127: not clean -- starting background reconstruction
[   93.339333] md/raid1:md127: active with 2 out of 2 mirrors
[   93.340266] md127: invalid bitmap file superblock: bad magic
[   93.341036] md127: failed to create bitmap (-22)
[   93.341742] md: md127 stopped.

I got the tests_raid_pool_options.yml test to pass like this:
diff --git a/tests/tests_raid_pool_options.yml b/tests/tests_raid_pool_options.yml
index 2743ef0..425a6ef 100644
--- a/tests/tests_raid_pool_options.yml
+++ b/tests/tests_raid_pool_options.yml
@@ -31,7 +31,7 @@
         raid_level: "raid1"
         raid_device_count: 2
         raid_spare_count: 1
-        raid_metadata_version: "1.0"
+        raid_metadata_version: "1.2"
         state: present
         volumes:
           - name: lv1
@@ -72,7 +72,7 @@
           blivet_output.pools[0].raid_level == 'raid1' and
           blivet_output.pools[0].raid_device_count == 2 and
           blivet_output.pools[0].raid_spare_count == 1 and
-          blivet_output.pools[0].raid_metadata_version == '1.0'
+          blivet_output.pools[0].raid_metadata_version == '1.2'
         msg: "Failure to preserve RAID settings for preexisting pool."
     - include_tasks: verify-role-results.yml
@@ -88,7 +88,7 @@
         raid_level: "raid1"
         raid_device_count: 2
         raid_spare_count: 1
-        raid_metadata_version: "1.0"
+        raid_metadata_version: "1.2"
         state: absent
         volumes:
           - name: lv1
So it looks like the metadata version has changed? Or perhaps 1.0 is no longer supported in rhel 8.5? Is there a list of "minimum supported metadata version" by platform?
Additional data: the only platform that requires raid_metadata_version: "1.2" is rhel 8.5.0 - F34 and RHEL9 work just fine with raid_metadata_version: "1.0".

I'm including the rhel mdadm maintainer Xiao Ni - what's up with rhel 8.5? Is it due to mdadm-4.2-rc1_1.el8.x86_64?

> Is it due to mdadm-4.2-rc1_1.el8.x86_64?

Could be - a quick search using brew and koji shows that the only platform using mdadm-4.2-rc1 is rhel 8.5.0.
(In reply to Rich Megginson from comment #7)
> here is the dmesg output:
> [dmesg excerpt snipped - see comment #7 above]

Hi Rich

Could you give a summary of how to use the md device? If not, I have to read the code in the test case to find out how the raid device is created and used. If you can give a reproducer in a simple script, that is much better, so we can save a lot of time understanding the problem itself.

By the way, super 1.2 is the most widely used version now. Super 1.0 is still supported.

Thanks
Xiao

(In reply to XiaoNi from comment #11)
> Could you give a summary of how to use the md device? If not, I have to
> read the code in the test case to find out how the raid device is created
> and used.

I'm not familiar with the blivet library.
This is the code that is creating the raid array:

    raid_array = self._blivet.new_mdarray(
        name=raid_name,
        level=self._spec_dict["raid_level"],
        member_devices=active_count,
        total_devices=len(members),
        parents=members,
        chunk_size=chunk_size,
        metadata_version=self._spec_dict.get("raid_metadata_version"),
        fmt=self._get_format())

I'm not sure how this gets translated into an equivalent call to mdadm - maybe something like this?

    mdadm --create /dev/md0 -e 1.0 --level=raid1 --raid-devices=2 --spare-devices=1 /dev/vdb /dev/vdc /dev/vdd

When I run the above command, I see this:

    mdadm: partition table exists on /dev/vdb
    mdadm: metadata will over-write last partition on /dev/vdb.
    mdadm: partition table exists on /dev/vdc
    mdadm: metadata will over-write last partition on /dev/vdc.
    mdadm: partition table exists on /dev/vdd
    mdadm: metadata will over-write last partition on /dev/vdd.
    Continue creating array? y

Then I don't see any output. How do I check the status of this command, and see the output and errors?

NOTE that the test creates virtual devices for /dev/vdb, /dev/vdc, and /dev/vdd by creating 10GB files in /tmp - so if you are running low on disk space in /, please be aware of this.

> If you can give a reproducer in a simple script, that is much better. So
> we can save much time to understand the problem itself.
>
> By the way, super 1.2 is the mostly used version now. Super 1.0 is still
> supported.

    mdadm --create /dev/md0 -e 1.0 --level=raid1 --raid-devices=2 --spare-devices=1 /dev/vdb /dev/vdc /dev/vdd
This seems to work:
> mdadm --detail /dev/md0
/dev/md0:
Version : 1.0
Creation Time : Mon May 24 12:30:36 2021
Raid Level : raid1
Array Size : 10485632 (10.00 GiB 10.74 GB)
Used Dev Size : 10485632 (10.00 GiB 10.74 GB)
Raid Devices : 2
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Mon May 24 12:31:28 2021
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1
Consistency Policy : resync
Name : 0
UUID : 521214a2:4d9704eb:662865f1:09f19fe7
Events : 17
Number Major Minor RaidDevice State
0 252 16 0 active sync /dev/vdb
1 252 32 1 active sync /dev/vdc
2 252 48 - spare /dev/vdd
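As an aside, the key/value fields of the `--detail` output above can be checked programmatically. A small hypothetical helper (the function name and structure are my own, not part of blivet or the storage role) that parses this kind of output into a dict:

```python
# Hypothetical helper, not part of blivet or the storage role: parse the
# "Key : Value" fields of `mdadm --detail` output so a script can assert
# on array state instead of eyeballing it.
def parse_mdadm_detail(text):
    fields = {}
    for line in text.splitlines():
        # Field lines use " : " as a separator; device-table rows and the
        # "/dev/md0:" banner line do not, so they are skipped.
        if " : " in line:
            key, _, value = line.partition(" : ")
            fields[key.strip()] = value.strip()
    return fields

# Sample taken from the --detail output shown above.
sample = """\
/dev/md0:
           Version : 1.0
        Raid Level : raid1
             State : clean
    Active Devices : 2
     Spare Devices : 1
"""

detail = parse_mdadm_detail(sample)
assert detail["Version"] == "1.0"
assert detail["State"] == "clean"
```

A check like `detail["State"] == "clean"` is one way to answer the "how do I check the status" question above without depending on exit codes.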
Is the problem that the blivet library needs to be updated to work with mdadm 4.2?
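For background on why a bad bitmap offset would bite only 1.0 metadata: with super 1.0 the superblock sits near the end of the device and the internal bitmap is placed relative to it, so a miscomputed offset can point bitmap I/O at a bogus sector. The sketch below models the placement arithmetic loosely on mdadm's super1.c; the constants and rounding are illustrative simplifications, not the exact mdadm code:

```python
# Illustrative sketch only - modeled loosely on mdadm's super1.c placement
# logic; the real code has more cases (data offset, bitmap sizing, etc.).

def super1_0_sb_offset(dev_size_sectors):
    # v1.0: the superblock lives at least 8 KiB (16 sectors) before the
    # end of the device, rounded down to a 4 KiB (8-sector) boundary.
    offset = dev_size_sectors - 16
    return offset & ~7

def super1_2_sb_offset():
    # v1.2: the superblock sits at a fixed 4 KiB (8 sectors) from the
    # start of the device, which is one reason the same bitmap-placement
    # bug would not show up with 1.2 metadata.
    return 8

# A 10 GiB device expressed in 512-byte sectors.
sectors = 10 * 1024 * 1024 * 1024 // 512
sb = super1_0_sb_offset(sectors)
# For 1.0 metadata the internal bitmap is addressed by a (negative) offset
# relative to this superblock; if that offset is computed wrongly, the
# kernel reads garbage where it expects the bitmap superblock and fails
# with "invalid bitmap file superblock: bad magic", as in the logs above.
print(sb)  # 20971504 for a 10 GiB device
```

This is only meant to illustrate the end-of-device placement that makes 1.0 metadata sensitive to offset arithmetic; the actual fix belongs in mdadm (see the duplicate bug referenced below).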
The exact command we are running in blivet is:

    mdadm --create --run test1 --level=raid1 --raid-devices=2 --spare-devices=1 --metadata=1.0 --bitmap=internal --chunk=512 /dev/vdb1 /dev/vdc1 /dev/vdd1

This also fails when run manually on 8.5; the internal bitmap location is the culprit here - without it the command works and the array is started. We have always added this option for arrays with redundancy, so something must have changed in mdadm. If this was an intentional change, we can stop adding the option with 1.0 metadata, but right now it looks like an mdadm bug to me.

I'm hitting the same problem on the latest CentOS Stream 8.

    # mdadm --version
    mdadm - v4.2-rc1 - 2021-04-14

Everything is as Vojtech Trefny described, but one more detail: I'm getting it from this kind of partitioning:

    part raid.11 --size 200 --asprimary --ondrive=/dev/disk/by-path/pci-0000:00:11.5-ata-5
    part raid.21 --size 200 --asprimary --ondrive=/dev/disk/by-path/pci-0000:00:11.5-ata-6
    raid /boot/efi --fstype="efi" --device efi --level=RAID1 raid.11 raid.21 --noformat --fsoptions="umask=0077,shortname=winnt"

I need the EFI partition mirrored and working, so blivet generates an mdadm 1.0-metadata array, which stopped working with an internal bitmap for some reason. For now I'll try to work around it using %pre, but it would be great if a fix were released.

*** This bug has been marked as a duplicate of bug 1966712 ***

*** Bug 1986630 has been marked as a duplicate of this bug. ***