Bug 1917308
| Summary: | storage/tests_raid_volume_options.yml failed because mdadm super1.0 bitmap offset is wrong |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | mdadm |
| Status: | CLOSED DUPLICATE |
| Severity: | unspecified |
| Priority: | high |
| Version: | 8.4 |
| Target Milestone: | rc |
| Target Release: | 8.5 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | role:storage |
| Reporter: | Zhang Yi <yizhan> |
| Assignee: | XiaoNi <xni> |
| QA Contact: | Storage QE <storage-qe> |
| CC: | dledford, dlehman, hwkernel-mgr, japokorn, jdonohue, ncroxon, rmeggins, sergey.korobitsin, vtrefny, xni |
| Keywords: | Triaged |
| Flags: | pm-rhel: mirror+ |
| Doc Type: | If docs needed, set a value |
| Last Closed: | 2021-07-23 09:20:56 UTC |
| Type: | Bug |
Description (Zhang Yi, 2021-01-18 09:40:24 UTC)
Created attachment 1748414 [details]
Fail log
Created attachment 1748415 [details]
pass log
This looks like a hardware/configuration issue. See dmes.log from failure tarball:
[ 2619.698082] sd 2:0:1:1: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
[ 2620.519262] blk_update_request: I/O error, dev sdc, sector 20971488 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 2620.519278] blk_update_request: I/O error, dev sdc, sector 20971488 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 2620.519287] md: super_written gets error=10
[ 2620.519293] md/raid1:md127: Disk failure on sdc1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
[ 2620.522863] md: super_written gets error=10
[ 2620.522869] md/raid1:md127: Disk failure on sdd1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
[ 2620.522899] sd 2:0:1:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
[ 2620.642967] sd 2:0:1:0: Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatical
[ 2620.686518] sd 2:0:1:0: [sdb] Synchronizing SCSI cache
[ 2620.751501] md: super_written gets error=10
[ 2620.795754] md: super_written gets error=10
[ 2620.795773] md: super_written gets error=10
[ 2620.795795] md: md127 still in use.
[ 2620.855754] scsi 2:0:1:2: alua: Detached
[ 2620.855835] md: super_written gets error=10
[ 2620.855847] md: super_written gets error=10
[ 2620.895741] scsi 2:0:1:1: alua: Detached
[ 2620.895817] md: super_written gets error=10
[ 2620.895829] md: super_written gets error=10
[ 2621.061079] md: super_written gets error=10
[ 2621.061100] md: super_written gets error=10
[ 2621.061106] md: super_written gets error=10
[ 2621.061218] md127: detected capacity change from 10728898560 to 0
[ 2621.061229] md: md127 stopped.
[ 2621.135771] scsi 2:0:1:0: alua: Detached
Are we sure this is a bug in the storage role?
(In reply to David Lehman from comment #4)
> This looks like a hardware/configuration issue. See dmes.log from failure
> tarball:
> [dmesg excerpt snipped - quoted in full in comment #4 above]
> Are we sure this is a bug in the storage role?

I'm not sure, but from the log it seems the disks stopped before md stopped.

This is working for me with the rhel 8.4.0 GA bits. However, it fails for me with the rhel 8.5 nightly bits - the error is still in the raid test, but the error message is different:
TASK [linux-system-roles.storage : manage the pools and volumes to match the specified state] ***
fatal: [/home/rmeggins/.cache/libvirt/rhel-8-y.qcow2]: FAILED! => {
    "actions": [],
    "changed": false,
    "crypts": [],
    "leaves": [],
    "mounts": [],
    "packages": [
        "dosfstools",
        "lvm2",
        "xfsprogs",
        "mdadm"
    ],
    "pools": [],
    "volumes": []
}
MSG:
Failed to commit changes to disk: Process reported exit code 1: mdadm: RUN_ARRAY failed: Invalid argument
Can a storage dev please take a look at this? This is a test blocker for rhel 8.5 (or I'll need to add an exception for the raid tests).
Steps to reproduce:
0) git clone https://github.com/linux-system-roles/storage; cd storage/tests
1) create a _setup.yml like this:
- hosts: all
  tasks:
    - name: Set up internal repositories for RHEL 8
      copy:
        content: |-
          {% for repo, url in repourls.items() %}[{{repo}}]
          name={{repo}}
          baseurl={{ url }}
          enabled=1
          gpgcheck=0
          {% endfor %}
        dest: /etc/yum.repos.d/rhel.repo
      vars:
        repourls:
          rhel-baseos: "http://download.devel.redhat.com/rhel-8/nightly/RHEL-8/latest-RHEL-8.5/compose/BaseOS/x86_64/os/"
          rhel-appstream: "http://download.devel.redhat.com/rhel-8/nightly/RHEL-8/latest-RHEL-8.5/compose/AppStream/x86_64/os/"
2) grab the latest rhel 8.5 cloud image:
curl -o rhel-8-y.qcow2 http://download.devel.redhat.com/rhel-8/nightly/RHEL-8/latest-RHEL-8.5/compose/BaseOS/x86_64/images/rhel-guest-image-8.5-514.x86_64.qcow2
3) run the test
TEST_SUBJECTS=rhel-8-y.qcow2 ANSIBLE_STDOUT_CALLBACK=debug ansible-playbook -i /usr/share/ansible/inventory/standard-inventory-qcow2 _setup.yml tests_raid_pool_options.yml 2>&1 | tee output
If you want to leave the machine around to ssh into it, then use TEST_DEBUG=true TEST_SUBJECTS...
The test output will contain instructions about how to ssh into the machine
here is the dmesg output:

[   92.066391] vdd:
[   92.176553] vdd:
[   92.205674] vdd: vdd1
[   92.460778] vdc:
[   92.577499] vdc:
[   92.603789] vdc: vdc1
[   92.885760] vdb:
[   93.003420] vdb:
[   93.030983] vdb: vdb1
[   93.338363] md/raid1:md127: not clean -- starting background reconstruction
[   93.339333] md/raid1:md127: active with 2 out of 2 mirrors
[   93.340266] md127: invalid bitmap file superblock: bad magic
[   93.341036] md127: failed to create bitmap (-22)
[   93.341742] md: md127 stopped.

I got the tests_raid_pool_options.yml test to pass like this:
diff --git a/tests/tests_raid_pool_options.yml b/tests/tests_raid_pool_options.yml
index 2743ef0..425a6ef 100644
--- a/tests/tests_raid_pool_options.yml
+++ b/tests/tests_raid_pool_options.yml
@@ -31,7 +31,7 @@
         raid_level: "raid1"
         raid_device_count: 2
         raid_spare_count: 1
-        raid_metadata_version: "1.0"
+        raid_metadata_version: "1.2"
         state: present
         volumes:
           - name: lv1
@@ -72,7 +72,7 @@
           blivet_output.pools[0].raid_level == 'raid1' and
           blivet_output.pools[0].raid_device_count == 2 and
           blivet_output.pools[0].raid_spare_count == 1 and
-          blivet_output.pools[0].raid_metadata_version == '1.0'
+          blivet_output.pools[0].raid_metadata_version == '1.2'
         msg: "Failure to preserve RAID settings for preexisting pool."
     - include_tasks: verify-role-results.yml
@@ -88,7 +88,7 @@
         raid_level: "raid1"
         raid_device_count: 2
         raid_spare_count: 1
-        raid_metadata_version: "1.0"
+        raid_metadata_version: "1.2"
         state: absent
         volumes:
           - name: lv1
So it looks like the metadata version has changed? Or perhaps 1.0 is no longer supported in rhel 8.5? Is there a list of "minimum supported metadata version" by platform?
Additional data: the only platform that requires raid_metadata_version: "1.2" is rhel 8.5.0 - F34 and RHEL9 work just fine with raid_metadata_version: "1.0".

I'm including the rhel mdadm maintainer Xiao Ni - what's up with rhel 8.5? Is it due to mdadm-4.2-rc1_1.el8.x86_64?

> Is it due to mdadm-4.2-rc1_1.el8.x86_64?

Could be - a quick search using brew and koji shows that the only platform using mdadm-4.2-rc1 is rhel 8.5.0.
(In reply to Rich Megginson from comment #7)
> here is the dmesg output:
> [dmesg excerpt snipped - see comment #7 above]

Hi Rich

Could you give a summary of how to use the md device? If not, I have to read the code in the test case to find out how the raid device is created and used. If you can give a reproducer in a simple script, that is much better, so we can save a lot of time understanding the problem itself.

By the way, super 1.2 is the most widely used version now. Super 1.0 is still supported.

Thanks
Xiao

(In reply to XiaoNi from comment #11)
> Could you give a summary of how to use the md device? If not, I have to
> read the code in the test case to find out how the raid device is created
> and used.

I'm not familiar with the blivet library.
This is the code that is creating the raid array:

    raid_array = self._blivet.new_mdarray(
        name=raid_name,
        level=self._spec_dict["raid_level"],
        member_devices=active_count,
        total_devices=len(members),
        parents=members,
        chunk_size=chunk_size,
        metadata_version=self._spec_dict.get("raid_metadata_version"),
        fmt=self._get_format())

I'm not sure how this gets translated into an equivalent call to mdadm - maybe something like this?

    mdadm --create /dev/md0 -e 1.0 --level=raid1 --raid-devices=2 --spare-devices=1 /dev/vdb /dev/vdc /dev/vdd

When I run the above command, I see this:

    mdadm: partition table exists on /dev/vdb
    mdadm: metadata will over-write last partition on /dev/vdb.
    mdadm: partition table exists on /dev/vdc
    mdadm: metadata will over-write last partition on /dev/vdc.
    mdadm: partition table exists on /dev/vdd
    mdadm: metadata will over-write last partition on /dev/vdd.
    Continue creating array? y

Then I don't see any output. How do I check the status of this command, and see the output and errors?

NOTE that the test creates virtual devices for /dev/vdb, /dev/vdc, and /dev/vdd by creating 10GB files in /tmp - so if you are running low on disk space in /, please be aware of this.

> If you can give a reproducer in a simple script, that is much better. So
> we can save much time to understand the problem itself.
>
> By the way, super 1.2 is the mostly used version now. Super 1.0 is still
> supported.

    mdadm --create /dev/md0 -e 1.0 --level=raid1 --raid-devices=2 --spare-devices=1 /dev/vdb /dev/vdc /dev/vdd
This seems to work:
> mdadm --detail /dev/md0
/dev/md0:
Version : 1.0
Creation Time : Mon May 24 12:30:36 2021
Raid Level : raid1
Array Size : 10485632 (10.00 GiB 10.74 GB)
Used Dev Size : 10485632 (10.00 GiB 10.74 GB)
Raid Devices : 2
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Mon May 24 12:31:28 2021
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1
Consistency Policy : resync
Name : 0
UUID : 521214a2:4d9704eb:662865f1:09f19fe7
Events : 17
Number Major Minor RaidDevice State
0 252 16 0 active sync /dev/vdb
1 252 32 1 active sync /dev/vdc
2 252 48 - spare /dev/vdd
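As an aside, the key/value fields of the `--detail` output above can be checked programmatically. A small hypothetical helper (the function name and structure are my own, not part of blivet or the storage role) that parses this kind of output into a dict:

```python
# Hypothetical helper, not part of blivet or the storage role: parse the
# "Key : Value" fields of `mdadm --detail` output so a script can assert
# on array state instead of eyeballing it.
def parse_mdadm_detail(text):
    fields = {}
    for line in text.splitlines():
        # Field lines use " : " as a separator; device-table rows and the
        # "/dev/md0:" banner line do not, so they are skipped.
        if " : " in line:
            key, _, value = line.partition(" : ")
            fields[key.strip()] = value.strip()
    return fields

# Sample taken from the --detail output shown above.
sample = """\
/dev/md0:
           Version : 1.0
        Raid Level : raid1
             State : clean
    Active Devices : 2
     Spare Devices : 1
"""

detail = parse_mdadm_detail(sample)
assert detail["Version"] == "1.0"
assert detail["State"] == "clean"
```

A check like `detail["State"] == "clean"` is one way to answer the "how do I check the status" question above without depending on exit codes.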
Is the problem that the blivet library needs to be updated to work with mdadm 4.2?
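For background on why a bad bitmap offset would bite only 1.0 metadata: with super 1.0 the superblock sits near the end of the device and the internal bitmap is placed relative to it, so a miscomputed offset can point bitmap I/O at a bogus sector. The sketch below models the placement arithmetic loosely on mdadm's super1.c; the constants and rounding are illustrative simplifications, not the exact mdadm code:

```python
# Illustrative sketch only - modeled loosely on mdadm's super1.c placement
# logic; the real code has more cases (data offset, bitmap sizing, etc.).

def super1_0_sb_offset(dev_size_sectors):
    # v1.0: the superblock lives at least 8 KiB (16 sectors) before the
    # end of the device, rounded down to a 4 KiB (8-sector) boundary.
    offset = dev_size_sectors - 16
    return offset & ~7

def super1_2_sb_offset():
    # v1.2: the superblock sits at a fixed 4 KiB (8 sectors) from the
    # start of the device, which is one reason the same bitmap-placement
    # bug would not show up with 1.2 metadata.
    return 8

# A 10 GiB device expressed in 512-byte sectors.
sectors = 10 * 1024 * 1024 * 1024 // 512
sb = super1_0_sb_offset(sectors)
# For 1.0 metadata the internal bitmap is addressed by a (negative) offset
# relative to this superblock; if that offset is computed wrongly, the
# kernel reads garbage where it expects the bitmap superblock and fails
# with "invalid bitmap file superblock: bad magic", as in the logs above.
print(sb)  # 20971504 for a 10 GiB device
```

This is only meant to illustrate the end-of-device placement that makes 1.0 metadata sensitive to offset arithmetic; the actual fix belongs in mdadm (see the duplicate bug referenced below).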
The exact command we are running in blivet is:

    mdadm --create --run test1 --level=raid1 --raid-devices=2 --spare-devices=1 --metadata=1.0 --bitmap=internal --chunk=512 /dev/vdb1 /dev/vdc1 /dev/vdd1

This also fails when run manually on 8.5; the internal bitmap location is the culprit here - without it the command works and the array is started. We have always added this option for arrays with redundancy, so something must have changed in mdadm. If this was an intentional change, we can stop adding the option with 1.0 metadata, but right now it looks like an mdadm bug to me.

I'm hitting the same problem on the latest CentOS Stream 8.

    # mdadm --version
    mdadm - v4.2-rc1 - 2021-04-14

Everything is as Vojtech Trefny described, but one more detail: I'm getting it from this kind of partitioning:

    part raid.11 --size 200 --asprimary --ondrive=/dev/disk/by-path/pci-0000:00:11.5-ata-5
    part raid.21 --size 200 --asprimary --ondrive=/dev/disk/by-path/pci-0000:00:11.5-ata-6
    raid /boot/efi --fstype="efi" --device efi --level=RAID1 raid.11 raid.21 --noformat --fsoptions="umask=0077,shortname=winnt"

I need the EFI partition mirrored and working, so blivet generates an mdadm 1.0-metadata array, which stopped working with an internal bitmap for some reason. For now I'll try to work around it using %pre, but it would be great if a fix were released.

*** This bug has been marked as a duplicate of bug 1966712 ***

*** Bug 1986630 has been marked as a duplicate of this bug. ***