When deleting an MD RAID array using the Cockpit Storage UI, the removal process does not properly wipe the array metadata. This causes problems when a new RAID array is created using the same name and disks: a "broken" GPT table may appear on the newly created array, leading to partitioning errors.

Reproducible: Always

Steps to Reproduce:
1. Create a RAID0 array using /dev/vda, /dev/vdb, and /dev/vdc.
2. Apply a GPT label and create three partitions.
3. Delete the RAID array using the Cockpit Storage UI.
4. Create a RAID1 array using the same disks and the same name.
5. Apply a GPT label and create three partitions.
6. Run parted -l

Actual Results:
This error shows up:

  Not all of the space available to /dev/md/raid0 appears to be used, you can fix the GPT to use all of the space (an extra 62877696 blocks) or continue with the current setting?

The newly created RAID1 array has leftover metadata from the previously deleted RAID0 array, leading to GPT corruption and partitioning errors.

Expected Results:
Deleting a RAID array in Cockpit Storage should properly clean up all metadata, ensuring no residual partition tables or RAID metadata interfere with future operations.

This caused a bug in anaconda, as seen in the attached screenshot and journal.
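For reference, a rough command-line equivalent of the steps above, written in the same os.system style as the minimal reproducer later in this bug. This is only a sketch: the disk sizes, partition boundaries, and the exact commands Cockpit runs under the hood are assumptions.

```
import os

# Steps 1-2: RAID0 with a GPT label and three partitions (boundaries are placeholders)
os.system("mdadm --create --run /dev/md/raid0 --level=raid0 --raid-devices=3 /dev/vda /dev/vdb /dev/vdc")
os.system("parted --script /dev/md/raid0 mklabel gpt mkpart p1 1MiB 10GiB mkpart p2 10GiB 20GiB mkpart p3 20GiB 30GiB")

# Step 3: delete the array, including the metadata cleanup Cockpit is expected to do
os.system("mdadm --stop /dev/md/raid0")
os.system("mdadm --zero-superblock /dev/vda /dev/vdb /dev/vdc")

# Steps 4-5: RAID1 with the same name on the same disks, GPT label and three partitions
os.system("mdadm --create --run /dev/md/raid0 --level=raid1 --raid-devices=3 /dev/vda /dev/vdb /dev/vdc")
os.system("parted --script /dev/md/raid0 mklabel gpt mkpart p1 1MiB 5GiB mkpart p2 5GiB 10GiB mkpart p3 10GiB 14GiB")

# Step 6: inspect the result
os.system("parted -l")
```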
Created attachment 2083251 [details] journalctl output
Created attachment 2083252 [details] anaconda crash
*** Bug 2357128 has been marked as a duplicate of this bug. ***
I reported bug 2357128, which was marked as a duplicate of this one, so some additional logs etc. can be found there. It was proposed as a blocker, and the proposal got transferred here.
We have done some debugging using this reproducing test case: https://github.com/KKoukiou/anaconda-webui/tree/rhbz%232357214

I can not find anything wrong with the things that Cockpit and UDisks2 have done up to the point where Anaconda takes over again with the "Checking storage configuration" dialog. All udev properties are as expected and parted correctly identifies the partition labels on all devices. Parted also does not complain about "Not all of the space available appears to be used" when run on the command line.

The first time "Checking storage configuration" happens with the raid0 (striping) mdraid called "SOMERAID", everything succeeds. Note that this time /dev/md/SOMERAID is 30 GiB in size (two times 15 GiB because of striping).

Then the test deletes /dev/md/SOMERAID and creates a new mdraid with the same name but level raid1 (mirroring). The second time "Checking storage configuration" is done with this array, /dev/md/SOMERAID is only 15 GiB (one times 15 GiB because of mirroring). This time parted (or probably rather libparted) reports:

INFO:blivet:parted exception: Not all of the space available to /dev/md/SOMERAID appears to be used, you can fix the GPT to use all of the space (an extra 31438848 blocks) or continue with the current setting?
INFO:blivet:parted exception: Invalid argument during seek for write on /dev/md/SOMERAID

Parted (or libparted) thinks that /dev/md/SOMERAID is 31438848 blocks larger than what the partition table on it says. Note that 31438848 times 512 bytes is 15 GiB. My engineering pinky tells me that somehow libparted is still working with the assumption that /dev/md/SOMERAID is 30 GiB in size.

Anaconda tells libparted to go ahead and fix this inconsistency, but then we see "Invalid argument during seek for write on /dev/md/SOMERAID". Did libparted try to write something at what it thinks is the end of the block device? That would fail with such an error because the block device is of course only 15 GiB and not 30 GiB as parted might assume.

And indeed, the fixing does not result in something consistent: Afterwards, the partition table type of /dev/md/SOMERAID is detected as "PMBR" by blkid. This is not a partition table type that we should ever see. It's a naked "Protective Master Boot Record" that should normally always be followed by the rest of the GPT structure. Confusion arises from this, with the block device for the partition /dev/md/SOMERAID1 still existing but not having any of the expected udev properties of partitions.

If we change the test to create the first (striped, 30 GiB) mdraid with name "SOMERAID" and the second (mirrored, 15 GiB) mdraid with name "OTHERRAID", then there is no message from parted. No "fixing" happens and all devices keep their expected udev properties. Anaconda goes on to do its things and then fails due to another bug: https://bugzilla.redhat.com/show_bug.cgi?id=2354798
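As a cross-check of the size mismatch described above, here is a minimal sketch (not part of the test case, and the device path is the one from the test) that asks the kernel directly for its current view of the device size via the BLKGETSIZE64 ioctl, bypassing anything libparted or blivet may have cached:

```
import fcntl
import os
import struct

BLKGETSIZE64 = 0x80081272  # _IOR(0x12, 114, size_t) on Linux

def device_size_bytes(path):
    # Ask the kernel for the current byte size of the block device.
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, BLKGETSIZE64, b"\0" * 8)
        return struct.unpack("Q", buf)[0]
    finally:
        os.close(fd)

size = device_size_bytes("/dev/md/SOMERAID")
print("kernel reports %d bytes (%.1f GiB)" % (size, size / 2**30))
print("parted's 'extra' 31438848 sectors would be %.1f GiB" % (31438848 * 512 / 2**30))
# If the kernel says ~15 GiB while the "extra" space is also ~15 GiB, something in the
# stack is evidently still assuming the old 30 GiB (raid0) geometry for this device name.
```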
(In reply to Marius Vollmer from comment #5)
> Anaconda goes on to do its things

I forgot to mention that while doing its thing, we see this error 12 times or so:

WARNING:dasbus.server.handler:The call org.fedoraproject.Anaconda.Modules.Storage.DeviceTree.Viewer.GetDiskTotalSpace has failed with an exception:
Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/dasbus/server/handler.py", line 455, in _method_callback
    result = self._handle_call(
        interface_name,
        ...<2 lines>...
        **additional_args
    )
  File "/usr/lib/python3.13/site-packages/dasbus/server/handler.py", line 265, in _handle_call
    return handler(*parameters, **additional_args)
  File "/usr/lib64/python3.13/site-packages/pyanaconda/modules/storage/devicetree/viewer_interface.py", line 185, in GetDiskTotalSpace
    return self.implementation.get_disk_total_space(disk_ids)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/usr/lib64/python3.13/site-packages/pyanaconda/modules/storage/devicetree/viewer.py", line 495, in get_disk_total_space
    disks = self._get_devices(disk_ids)
  File "/usr/lib64/python3.13/site-packages/pyanaconda/modules/storage/devicetree/viewer.py", line 292, in _get_devices
    return list(map(self._get_device, device_ids))
  File "/usr/lib64/python3.13/site-packages/pyanaconda/modules/storage/devicetree/viewer.py", line 282, in _get_device
    raise UnknownDeviceError(device_id)
pyanaconda.modules.common.errors.storage.UnknownDeviceError: MDRAID-SOMERAID

Anaconda is still trying to access the "SOMERAID" device, which no longer exists. I take this as a hint that Anaconda or Blivet or libparted might indeed keep outdated information about block devices, maybe including their size.
The error from the comment above is a red herring. Anaconda-webui indeed keeps some stale state and tries to read information about devices that no longer exist, but that is not the problem here. I have an open PR to fix it [1], and with that change the error no longer appears in the journal (we ignore it now anyway).

[1] https://github.com/rhinstaller/anaconda-webui/pull/754
when we delete the first SOMERAID, do we do the equivalent of a `wipefs -a` on the disks to ensure all RAID metadata related to it is wiped? If not, I can certainly see that might cause issues.
(In reply to Adam Williamson from comment #8)
> when we delete the first SOMERAID, do we do the equivalent of a `wipefs -a`
> on the disks to ensure all RAID metadata related to it is wiped?

There is wiping, but it is done with the libblkid function blkid_do_wipe after using libblkid to probe for superblocks. I don't know whether that is equivalent to the "-a" flag. The same kind of wiping also happens on a newly created mdraid device.
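For comparison, this is what an explicit full wipe of the member disks would look like, as a sketch in the style of the reproducer below. Whether blkid_do_wipe already covers the same signatures, and whether Cockpit/UDisks2 should do this at all, is exactly the open question; the device names are assumptions.

```
import os

MEMBERS = ["/dev/vda", "/dev/vdb", "/dev/vdc"]  # assumed member disks

os.system("mdadm --stop /dev/md/SOMERAID")
for disk in MEMBERS:
    # wipefs -a erases all signatures libblkid can detect (GPT, PMBR, mdraid superblock, ...)
    os.system("wipefs -a %s" % disk)
    # --zero-superblock targets the mdraid metadata specifically
    # (redundant after wipefs -a, kept here for illustration)
    os.system("mdadm --zero-superblock %s" % disk)
```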
(In reply to Marius Vollmer from comment #5)
> Afterwards, the partition table type of /dev/md/SOMERAID is detected as
> "PMBR" by blkid. This is not a partition table type that we should ever see.

I have noticed this during my anaconda workflows. Sometimes, after deleting an MDRAID device (IIRC), one of the former raid-member disks was marked as having a PMBR partition table. I paid no special attention to it, and I don't know how to trigger it intentionally, but I have seen it multiple times.
Here is another observation. The reproducing test does these steps:

1) Cockpit: Create a level 0 mdraid of size 30 GiB named "SOMERAID" on two devices, with a root partition on it.
2) Anaconda: "Check storage configuration", then cancel and return to Cockpit.
3) Cockpit: Delete SOMERAID.
4) Cockpit: Create a level 1 mdraid of size 15 GiB named "SOMERAID" on the same two devices, with a root partition on it.
5) Anaconda: "Check storage configuration".

With these steps, parted does the autofixing in step 5 and destroys the partitions on SOMERAID in the process. (This is what I have described in comment 5.)

The new observation is: if we remove step 2) from this, then parted does no autofixing, the partitions on SOMERAID stay intact, and the test passes. Thus, if Anaconda/Blivet/Parted never "see" SOMERAID while it is 30 GiB, they accept it without complaint when it is 15 GiB.
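One experiment that might narrow down where the stale size lives is a before/after comparison with pyparted. This is only a sketch under my own assumptions (array name, member devices, and whether libparted caches Device objects per path are exactly what it would probe); it is not part of the reproducing test.

```
import os
import parted

def report(label):
    # Ask pyparted/libparted for its current idea of the device geometry.
    dev = parted.getDevice("/dev/md/SOMERAID")
    print("%s: %s is %d sectors of %d bytes" % (label, dev.path, dev.length, dev.sectorSize))

# 30 GiB striped array (two assumed 15 GiB devices)
os.system("mdadm --create --run /dev/md/SOMERAID --level=raid0 --raid-devices=2 /dev/vdb /dev/vdc")
os.system("udevadm settle")
report("raid0")

# Delete and re-create as a 15 GiB mirrored array with the same name
os.system("mdadm --stop /dev/md/SOMERAID")
os.system("mdadm --zero-superblock /dev/vdb /dev/vdc")
os.system("mdadm --create --run /dev/md/SOMERAID --level=raid1 --raid-devices=2 /dev/vdb /dev/vdc")
os.system("udevadm settle")
report("raid1")

# If the second report still shows the raid0 size, the stale state sits in libparted's
# device handling; if it shows ~15 GiB, it lives higher up (blivet or anaconda).
```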
I was able to reproduce this with only blivet and mdadm. I am not sure if this is a blivet or py/libparted issue yet, but I am moving the bug to blivet for now, because anaconda and cockpit are not involved.

Minimal reproducer:

```
import os
import blivet

# RAID 0 creation with GPT and single 50 GiB partition
os.system("mdadm --create --run SOMERAID --level=raid0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd")
os.system("parted --script /dev/md/SOMERAID mklabel gpt mkpart primary 1MiB 50GiB")

# blivet rescan
b = blivet.Blivet()
b.reset()

# Remove MD metadata from disks
os.system("mdadm --stop /dev/md/SOMERAID")
os.system("mdadm --zero-superblock /dev/sdb /dev/sdc /dev/sdd")

# RAID 1 creation with GPT and single 15 GiB partition
os.system("mdadm --create --run SOMERAID --level=raid1 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd")
os.system("parted --script /dev/md/SOMERAID mklabel gpt mkpart primary 1MiB 15GiB")

# blivet rescan
try:
    b.reset()
except Exception as e:
    print("Rescan failed: %s" % str(e))

    # check the partition table with fdisk
    os.system("fdisk -l /dev/md/SOMERAID")
finally:
    # cleanup
    os.system("wipefs -a /dev/md/SOMERAID")
    os.system("mdadm --stop /dev/md/SOMERAID")
    os.system("mdadm --zero-superblock /dev/sdb /dev/sdc /dev/sdd")
```
Discussed during the 2025-04-07 blocker review meeting [1]:

* AGREED: 2357214 - Accepted FinalBlocker - This violates our criterion: "...installer must be able to: Correctly interpret, and modify ... software RAID arrays at RAID levels 0, 1 and 5 containing ext4 partitions".

[1] https://meetbot.fedoraproject.org/blocker-review_matrix_fedoraproject-org/2025-04-07/f42-blocker-review.2025-04-07-16.01.log.html
upstream PR: https://github.com/storaged-project/blivet/pull/1365
FEDORA-2025-c88bd0d892 (python-blivet-3.12.1-2.fc42) has been submitted as an update to Fedora 42. https://bodhi.fedoraproject.org/updates/FEDORA-2025-c88bd0d892
With F42 RC 1.1, I've made an installation where I changed MDRAID0 to MDRAID1 while keeping the device name. No crash occurred and the system works fine. It looks resolved.
FEDORA-2025-c88bd0d892 has been pushed to the Fedora 42 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2025-c88bd0d892`

You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2025-c88bd0d892

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2025-c88bd0d892 (python-blivet-3.12.1-2.fc42) has been pushed to the Fedora 42 stable repository. If problem still persists, please make note of it in this bug report.