Bug 1897350 - Anaconda does not recognize pre-existing mdadm software RAID partitions
Summary: Anaconda does not recognize pre-existing mdadm software RAID partitions
Keywords:
Status: CLOSED DUPLICATE of bug 1960798
Alias: None
Product: Fedora
Classification: Fedora
Component: python-blivet
Version: 34
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Blivet Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-12 20:45 UTC by Jason Herring
Modified: 2024-01-21 23:55 UTC
CC List: 17 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-06-25 18:01:07 UTC
Type: Bug
Embargoed:
bcotton: fedora_prioritized_bug?


Attachments
Log collection from install (153.38 KB, application/gzip), 2021-03-24 00:32 UTC, Kevin Masaryk
Initial boot with missing md device (83.90 KB, image/png), 2021-03-24 00:48 UTC, Kevin Masaryk
Console showing md device disappears (24.77 KB, image/png), 2021-03-24 00:49 UTC, Kevin Masaryk
Error popup (109.48 KB, image/png), 2021-03-24 17:35 UTC, Kevin Masaryk
VM install logs with exception (107.19 KB, application/gzip), 2021-03-24 17:36 UTC, Kevin Masaryk

Description Jason Herring 2020-11-12 20:45:36 UTC
Description of problem:

The installer on both Fedora Server (standard) and Fedora Workstation Live fails to recognize pre-built software RAID partitions.

On a multi-disk system running an existing Fedora installation, RAID partitions that are accessible, mountable, and available in Fedora 27 are not seen by the Fedora 33 installer.

This system has 8 separate active mirrors.

In Fedora 27:

# cat /proc/mdstat
Personalities : [raid1] 
md3 : active raid1 sdh1[1] sda1[2]
      2930133824 blocks super 1.2 [2/2] [UU]
      
md16 : active raid1 sdb6[0] sdf6[2]
      81723392 blocks super 1.2 [2/2] [UU]
      
md14 : active raid1 sdb4[3] sdf4[2]
      81723392 blocks super 1.2 [2/2] [UU]
      
md18 : active raid1 sdb8[0] sdf8[2]
      206187520 blocks super 1.2 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md4 : active raid1 sdi1[0] sdc1[2]
      5860389888 blocks super 1.2 [2/2] [UU]
      bitmap: 9/44 pages [36KB], 65536KB chunk

md11 : active raid1 sdf9[2] sdb9[0]
      843200 blocks super 1.2 [2/2] [UU]
      
md15 : active raid1 sdb5[0] sdf5[2]
      81723392 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

Running on top of these RAID mirrors are LVM physical volumes, volume groups, and logical volumes.  This scheme has been in use in this environment for a long time:

# pvs
  PV         VG          Fmt  Attr PSize   PFree 
  /dev/md14  f23root_vg  lvm2 a--   77.93g 14.49g
  /dev/md15  root27_vg   lvm2 a--   77.93g     0 
  /dev/md3   vg_parchive lvm2 a--   <2.73t <9.11g
  /dev/md4   vg_pdata    lvm2 a--   <5.46t     0 


In Fedora 33 installer:

None of the RAID devices are visible to the Workstation or Server installer.  

Quit installer, run terminal (Workstation Live):

Running mdadm --detail --scan finds all the devices.  
Run installer - scan finds no MD devices.
Quit installer.

Added the results from mdadm --detail --scan to /etc/mdadm.conf.
Run installer - scan finds no MD devices.
Quit installer.

Running mdadm --assemble --scan activates all the devices at the CLI.
cat /proc/mdstat shows all devices.
Run installer - scan finds no MD devices.
Quit installer.
cat /proc/mdstat shows no devices present.
Checking the logs, it appears the devices are being purposefully stopped:

[  199.327835] md125: detected capacity change from 83684753408 to 0
[  199.327845] md: md125 stopped.
[  199.641588] md122: detected capacity change from 83684753408 to 0
[  199.641598] md: md122 stopped.
[  199.694222] md123: detected capacity change from 83684753408 to 0
[  199.694230] md: md123 stopped.
[  199.763487] md121: detected capacity change from 211136020480 to 0
[  199.763495] md: md121 stopped.
[  199.866266] md124: detected capacity change from 863436800 to 0
[  199.866274] md: md124 stopped.
[  200.189653] md127: detected capacity change from 6001039245312 to 0
[  200.189663] md: md127 stopped.
[  200.567074] md: md3 stopped.

This may be happening when I run the installer and it attempts to detect devices.  Each time I assemble the devices and run the installer I see all the md devices stopping.


Version-Release number of selected component (if applicable): 33


How reproducible: 100%


Steps to Reproduce:
1. create md devices on running Fedora system
2. boot from Fedora 33 installer
3. verify md devices are not detected via CLI or installer
4. manually assemble devices using mdadm --assemble --scan
5. verify devices are visible to the system: cat /proc/mdstat
6. run the installer and verify md devices are not detected
7. check dmesg for logs confirming md devices are being stopped
8. cat /proc/mdstat to confirm all devices are no longer visible
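
The CLI portion of steps 4-8, condensed into a rough shell sketch (a sketch only; array names and log text will vary by system):

# assemble every array described by the on-disk metadata, then confirm
mdadm --assemble --scan
cat /proc/mdstat                 # arrays listed as active

# ...start the installer and let it scan storage, then re-check...
dmesg | grep -i "stopped"        # shows the "md: mdNNN stopped." messages
cat /proc/mdstat                 # arrays are gone again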

Actual results:


Expected results:


Additional info:

Comment 1 Jason Herring 2020-11-13 05:04:35 UTC
New clue:

Whatever process is disassembling my arrays during install time *appears* to be doing so to any array consisting of partitions - all my previous arrays were of the type /dev/sda1 and /dev/sdc1, for example. If the md array is built from the raw disk device such as below (/dev/sdj and /dev/sdg) it recognizes it!

I created an array on a full disk and the installer stopped all my arrays, then restarted selectively only the one array which was built on the raw unpartitioned device.

[root@localhost-live log]# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md120 : active (auto-read-only) raid1 sdj[1] sdg[0]
234299968 blocks super 1.2 [2/2] [UU]
bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

[root@localhost-live log]# fdisk -l /dev/sdj
Disk /dev/sdj: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: ST240HM000-1G515
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x000700b1

Comment 2 Kevin Masaryk 2021-02-14 21:42:51 UTC
I can confirm similar behavior when trying to install onto an existing raid1 mirror consisting of two devices. I boot the installation CD with "md.auto" appended to the kernel boot line and can see that the kernel finds and activates the md device while booting. Once the anaconda installation starts, however, the md device is not present. Switching to a terminal with Ctrl-F1 and then the tmux shell window, a 'cat /proc/mdstat' shows that the array is no longer active. If I run 'mdadm --assemble --scan' then the md device is immediately found and activated without issue. At this point, /proc/mdstat shows the active md device. Switching back to the GUI installer and selecting 'refresh' on the partitioning screen still does not show the md device. Sure enough, switching back to the shell at this point shows that the md device has been deactivated, since it no longer shows up in /proc/mdstat.

So it seems pretty clear that Anaconda is deactivating any md devices it finds for some reason which makes it impossible to install onto an existing md device.

I can also say that this bug surfaced somewhere between F31 and F33 since I'm currently running F30 and had no issue installing on the existing md device back when I performed the initial install of that version.

A few more details:
* There are two members of the raid1 array.
* Each member is a partition in a SATA-attached SSD drive; /dev/sda3 and /dev/sdb3.
* The md device (/dev/md0) is configured as an LVM device with numerous LVs on it. This is the location of all required partitions except /boot, which is on /dev/sda2.

Comment 3 Kevin Masaryk 2021-02-14 22:15:33 UTC
Wondering if it has something to do with this change: https://docs.fedoraproject.org/en-US/fedora/f33/release-notes/sysadmin/Storage/#dmraid-systemd-udev

Comment 4 Vendula Poncova 2021-02-15 13:08:59 UTC
It seems to be an issue in the storage configuration library. Reassigning to blivet.

Comment 5 Kevin Masaryk 2021-03-18 23:39:00 UTC
Could we please get an update on this?

Comment 6 David Lehman 2021-03-19 00:11:47 UTC
Please attach the logs, collected after hitting the problem, preferably after the simplest possible reproducer (rather than a ton of twiddling/tweaking).

I don't know anymore where the logs are in a live installation. They will either be in /var/log/anaconda or /tmp. The files to collect are syslog, storage.log, program.log, and anaconda.log. Thanks!

Comment 7 Kevin Masaryk 2021-03-20 19:09:18 UTC
We've already outlined exactly how to reproduce the issue with even the simplest RAID setup. Are you saying you don't have an environment you can test RAID in?

Something along these lines should work:
1. Spin up a VM and attach two small virtual disks.
2. Configure the two disks to be a RAID 1 (mirror) array with the mdadm tools.
3. Shut down that VM and detach the RAID disks.
4. Create a new VM with a boot disk and also the two RAID disks attached.
5. Attempt to install Fedora on the new VM following the procedures we've already outlined earlier in this BZ.
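
A minimal sketch of step 2, assuming the two attached disks appear in the VM as /dev/vdb and /dev/vdc (adjust names as needed):

# build a RAID 1 mirror across the two whole disks
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/vdb /dev/vdc
mdadm --detail --scan            # note the ARRAY line / UUID for later
cat /proc/mdstat                 # mirror should be active (possibly resyncing)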

Comment 8 Vojtech Trefny 2021-03-22 13:23:33 UTC
(In reply to Kevin Masaryk from comment #3)
> Wondering if it has something to do with this change:
> https://docs.fedoraproject.org/en-US/fedora/f33/release-notes/sysadmin/
> Storage/#dmraid-systemd-udev

No, this shouldn't be related, DM RAID is needed for firmware/BIOS RAID setups, not for mdadm RAID devices.

(In reply to Kevin Masaryk from comment #7)
> We've already outlined exactly how to reproduce the issue with even the
> simplest RAID setup. Are you saying you don't have an environment you can
> test RAID in?
> 
> Something along these lines should work:
> 1. Spin up a VM and attach two small virtual disks.
> 2. Configure the two disks to be a RAID 1 (mirror) array with the mdadm
> tools.
> 3. Shut down that VM and detach the RAID disks.
> 4. Create a new VM with a boot disk and also the two RAID disks attached.
> 5. Attempt to install Fedora on the new VM following the procedures we've
> already outlined earlier in this BZ.

I wasn't able to reproduce this issue with Fedora 33 or Fedora 34.

I have two RAID 1 devices: md127 created directly on disks and md126 created on partitions, and I can see both in the installer. Note: the two RAIDs are displayed differently in the installer. RAID on top of disks is considered to be a disk and is displayed in the "Installation destination" spoke together with the other disks[1]. RAID on top of partitions is treated the same way as other "higher level" devices like LVs, so it is displayed in the "Manual partitioning" spoke after selecting the disks[2].

We had some issues with some RAID configurations in the past, so it's possible your setup doesn't work, but we need the logs to be able to see what's wrong or to be able to replicate your configuration for debugging. Default RAID 1 configured with mdadm seems to be working as expected.


[1] https://vtrefny.fedorapeople.org/images/raid-disk.png
[2] https://vtrefny.fedorapeople.org/images/raid-part.png

Comment 9 Kevin Masaryk 2021-03-24 00:32:09 UTC
Created attachment 1765765 [details]
Log collection from install

Includes syslog, storage.log, program.log and anaconda.log. All logs were found under /tmp.

Comment 10 Kevin Masaryk 2021-03-24 00:47:12 UTC
I was able to easily reproduce the issue with a simple VM setup as I outlined earlier. Exact same behavior as with the real hardware. I just created a new VM with vda as the root and then two extra disks built as a mirror array. Each disk has a GPT table with three partitions (this is similar to how my real drives are partitioned). The third partition is configured as a RAID member. The RAID device is md42. I kept the disks small, only 1 GB, and the md42 device is about 860M. None of the partitions are formatted and there is no LVM, to keep it simple. The RAID device had been created in a separate VM with 'mdadm --create --level=1 --raid-devices=2 /dev/md42 /dev/sdb3 /dev/sdc3', so nothing out of the ordinary there.

As with the real hardware, you can clearly see in the logs that the md dev is found but then deactivated. The installer GUI shows the individual disks but never the md device. I switched to the terminal for an 'mdadm --assemble --scan', confirmed the md device is present, switched back to the GUI and clicked 'rescan'...the md device never shows up and switching back to the terminal shows that it's been deactivated again.

If you need more detail about how I created the VM and disk devices, I can walk you through that.
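
For reference, the member-disk layout described above could be recreated with something like the following; the tool, partition sizes, and device names here are illustrative assumptions, not necessarily exactly what was used:

# per member disk: GPT with three partitions, the third typed as Linux RAID
sgdisk --new=1:0:+64M --new=2:0:+64M --new=3:0:0 --typecode=3:fd00 /dev/sdb
sgdisk --new=1:0:+64M --new=2:0:+64M --new=3:0:0 --typecode=3:fd00 /dev/sdc
# mirror the two third partitions, matching the mdadm command quoted above
mdadm --create /dev/md42 --level=1 --raid-devices=2 /dev/sdb3 /dev/sdc3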

Comment 11 Kevin Masaryk 2021-03-24 00:48:14 UTC
Created attachment 1765766 [details]
Initial boot with missing md device

Both md member disks are present but md device is missing.

Comment 12 Kevin Masaryk 2021-03-24 00:49:58 UTC
Created attachment 1765767 [details]
Console showing md device disappears

Comment 13 David Lehman 2021-03-24 13:58:44 UTC
The screen you are showing there is only for selecting disks -- not the actual block devices for the file systems. Select all three disks, activate 'Custom' (or "blivet-gui"), click on "Done", and then see if '42' appears on the next screen. I expect it will based on the logs.

Comment 14 Kevin Masaryk 2021-03-24 17:34:34 UTC
Selecting all three disks, 'Custom' and 'Done' throws a Python exception in Anaconda. I'm attaching the logs and a screenshot of it. This is on the VM. I never saw this exact behavior on the real hardware, the md device just never appears.

Comment 15 Kevin Masaryk 2021-03-24 17:35:38 UTC
Created attachment 1765989 [details]
Error popup

Comment 16 Kevin Masaryk 2021-03-24 17:36:20 UTC
Created attachment 1765990 [details]
VM install logs with exception

Comment 17 David Lehman 2021-03-25 15:06:08 UTC
You are hitting a bug in some code that resolves a device name to an internal device object. It can easily be worked around by giving your md array a name like '/dev/md/mydata' instead of '/dev/md42'. We will work to fix the bug, of course, but in the meantime you should be able to work around it without much difficulty.
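
A sketch of that workaround for a newly created array; 'mydata' and the member partitions are just example names:

# create the array under a textual name instead of a bare number
mdadm --create /dev/md/mydata --level=1 --raid-devices=2 /dev/sdb3 /dev/sdc3
ls -l /dev/md/mydata             # appears as a symlink to the underlying mdXXX node
# for an existing array, the stored name can be changed at assemble time with
# mdadm's --update=name option; see the mdadm man page for details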

Comment 18 Neal Gompa 2021-06-25 11:48:27 UTC
This still affects Fedora Linux 34, per one prominent user: https://twitter.com/tekwendell/status/1408255117506326529

Comment 19 Adam Williamson 2021-06-25 17:17:04 UTC
does it? what is "this"? I just read through the whole report and so far the only concrete bug that was identified was the problem with names like /dev/md42 . beyond that there hasn't been a clear issue pinpointed, so I'm not sure how that person knows they're hitting the same issue.

Comment 20 Vojtech Trefny 2021-06-25 18:01:07 UTC
There are multiple issues in this bug:

1. The blivet `devicetree.resolve_device` function doesn't work correctly for devices with a name containing only two numbers; unfortunately this also applies to MD devices, because udev returns "42" as MD_DEVNAME for "/dev/md42" (a quick way to inspect that value is sketched after this list). This is already tracked in bug 1960798, so I'm closing this as a duplicate. I've created an updates image for Fedora 34 with the upstream fix posted for this issue and added it to 1960798. With this updates image I was able to successfully finish an installation with / on a pre-existing md42 array, so I'm pretty sure this is the same problem and the proposed patch fixes it. Feel free to reopen this bug if you think it's a different issue or a combination of this bug and something else.

2. MD arrays are deactivated by Anaconda. This is "normal"; Anaconda has always behaved like that: all devices are unmounted/locked/deactivated during the initial storage scan and later mounted/unlocked/activated when/if needed. This is not ideal and it causes some issues, but it would be really complicated to change this behaviour, and it generally works. Yes, we stop the array, but we keep the information about the array's existence internally and assemble it later if you choose to use it during the installation.

3. MD arrays on top of partitions are not visible on the Installation destination spoke. I tried to explain this in comment #8: arrays on top of partitions should be visible in the Custom or blivet-gui partitioning. The tweet linked in comment #18 clearly shows the array visible in the blivet-gui spoke (yes, it has a size of 0 B, but that might be a different bug, possibly only in blivet-gui; really hard to tell without logs). The screenshot of the error in comment #15 also shows the "42" array visible on the custom partitioning spoke. Again, I'm not saying this is the ideal situation. The split makes sense to me ("disk array" vs "partition array"), but that might be only me, and I assume it can be changed if we agree on that; either way, that's mostly a UI/UX decision, not a bug in blivet.
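
For anyone who wants to check what udev reports for their own array, something along these lines shows the MD_DEVNAME value blivet ends up with (assuming the array node is /dev/md42):

udevadm info --query=property --name=/dev/md42 | grep MD_DEVNAME
# e.g. MD_DEVNAME=42  <- a purely numeric name, which is what trips up resolve_device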

Issue 1 should already be fixed upstream; I'm sorry it took us that long. The fix will be available in F35. If you think 2 and 3 are bugs, please report them separately against Anaconda.

*** This bug has been marked as a duplicate of bug 1960798 ***

Comment 21 Sean Rhone 2024-01-21 19:54:26 UTC
I believe this is still a problem with Fedora Workstation 39.

From old notes I think back on F20-something, I had "mdadm --create '/dev/md0' --name='RAID' --level='0' --raid-devices='3' '/dev/sda' '/dev/sdb' '/dev/sdc'". This works on F39 to create the array, but Anaconda didn't show any drives. I saw the comment about doing /dev/md/name and did "mdadm --create '/dev/md/raid0'" instead which then had the array showing in Anaconda.

Comment 22 Jason Herring 2024-01-21 23:55:01 UTC
(In reply to Sean Rhone from comment #21)
> I believe this is still a problem with Fedora Workstation 39.
> 
> From old notes I think back on F20-something, I had "mdadm --create
> '/dev/md0' --name='RAID' --level='0' --raid-devices='3' '/dev/sda'
> '/dev/sdb' '/dev/sdc'". This works on F39 to create the array, but Anaconda
> didn't show any drives. I saw the comment about doing /dev/md/name and did
> "mdadm --create '/dev/md/raid0'" instead which then had the array showing in
> Anaconda.

This is still a problem. It's a little better, but not for my use case. I'm the original poster.

If you use whole-disk (no partition) arrays, it seems to work OK now. For arrays on partitioned disks (which I use for almost everything), it's still just as much of a problem.

Here is how I worked around it when I just installed F39:

Start the installer.  Don't do anything before this, because the installer seems to "stop" ALL arrays and then restart ONLY the arrays that are whole-disk/unpartitioned (why??).

Go into the 'Storage config' screen.  Select the disks that hold your partition-level md devices, but pause there.  DO NOT RESCAN FOR CHANGES!  That will do the thing it did earlier: stop all arrays and restart only whole-disk arrays.

Open a shell and run "mdadm --assemble --scan" (the shell commands are condensed after these steps).  This will start your arrays, but they may appear as (auto-read-only).
In the shell, run "mdadm --readwrite </dev/array>" for each array you want to install onto to make it read-write.

Check the box for Blivet/advanced config, then click "DONE".

On the next screen (the button should say "NEXT" and not "DONE", IMHO) you will see your array devices listed with folder icons next to them.  Select them and install as you would on any partition.
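
The shell portion of that workaround, condensed (the array device name is just an example):

mdadm --assemble --scan          # start all arrays found on the attached disks
cat /proc/mdstat                 # arrays may show up as active (auto-read-only)
mdadm --readwrite /dev/md14      # repeat for each array you intend to install onto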

I have done this twice successfully.

The bottom line is that, for whatever reason, the installer is purposely ignoring valid md arrays on partitions each time it queries them.

Note that after installing and booting into F39 you may also want to run 'mdadm --detail --scan > /etc/mdadm.conf', because F39 will not pick up any other pre-existing arrays properly, starting them as "auto-read-only" without config entries.  This ought to be fixable, as my F33 system just has "AUTO +imsm +1.x -all" in /etc/mdadm.conf, but that didn't work for F39 for me, so there must be some other config I'm missing.

