629719 – random failures when notifying kernel of changes to partitioned md (isw) device's partition table

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	parted
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Brian Lane
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	anaconda_trace_hash:0d2614a39e7bcd6cc...
Duplicates (1):	629730 (view as bug list)
Depends On:
Blocks:	F14Beta, F14BetaBlocker

Reported:	2010-09-02 19:31 UTC by Sandro Mathys
Modified:	2010-09-22 04:09 UTC (History)
CC List:	18 users (show)
Fixed In Version:	parted-2.3-2.fc14
Clone Of:
Environment:
Last Closed:	2010-09-22 04:09:16 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Attached traceback automatically from anaconda. (495.51 KB, text/plain) 2010-09-02 19:31 UTC, Sandro Mathys	no flags	Details
anaconda traceback (F14 Beta TC1) (285.34 KB, text/plain) 2010-09-11 09:17 UTC, Sandro Mathys	no flags	Details
traceback from reproducer using different case (249.84 KB, text/plain) 2010-09-15 22:50 UTC, David Lehman	no flags	Details
dm-1.udevdb (1.05 KB, text/plain) 2010-09-16 14:10 UTC, James Laska	no flags	Details
dm-2.udevdb (940 bytes, text/plain) 2010-09-16 14:33 UTC, James Laska	no flags	Details
PATCH 1/2 for parted which might fix / workaround this (1.63 KB, patch) 2010-09-17 13:29 UTC, Hans de Goede	no flags	Details \| Diff
PATCH 2/2 for parted which might fix / workaround this (1.65 KB, patch) 2010-09-17 13:30 UTC, Hans de Goede	no flags	Details \| Diff

Description Sandro Mathys 2010-09-02 19:31:04 UTC

The following was filed automatically by anaconda:
anaconda 14.15 exception report
Traceback (most recent call first):
  File "/usr/lib64/python2.7/site-packages/pyanaconda/storage/formats/__init__.py", line 266, in create
    raise FormatCreateError("invalid device specification", self.device)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/storage/formats/fs.py", line 849, in create
    DeviceFormat.create(self, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/storage/deviceaction.py", line 290, in execute
    options=self.device.formatArgs)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/storage/devicetree.py", line 700, in processActions
    action.execute(intf=self.intf)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/storage/__init__.py", line 313, in doIt
    self.devicetree.processActions()
  File "/usr/lib64/python2.7/site-packages/pyanaconda/packages.py", line 109, in turnOnFilesystems
    anaconda.storage.doIt()
  File "/usr/lib64/python2.7/site-packages/pyanaconda/dispatch.py", line 212, in moveStep
    rc = stepFunc(self.anaconda)
  File "/usr/lib64/python2.7/site-packages/pyanaconda/dispatch.py", line 131, in gotoNext
    self.moveStep()
  File "/usr/lib64/python2.7/site-packages/pyanaconda/gui.py", line 1174, in nextClicked
    self.anaconda.dispatch.gotoNext()
FormatCreateError: ('invalid device specification', '/dev/md127p3')

Comment 1 Sandro Mathys 2010-09-02 19:31:10 UTC

Created an attachment (id=442710)
Attached traceback automatically from anaconda.

Comment 2 Sandro Mathys 2010-09-02 19:59:42 UTC

I'm not sure I can reproduce this. I just tried and noticed /dev/md127p3 was of type 'unknown'. So I'm not sure if it was left like that by the previous installation (which generated the traceback above) or the installation before that (which ended in a traceback caused with the iscsi target).

I now deleted all linux partitions and will see whether I can reproduce this with no unknown-type partitions :)

Comment 3 Sandro Mathys 2010-09-02 20:10:23 UTC

*** Bug 629730 has been marked as a duplicate of this bug. ***

Comment 4 Sandro Mathys 2010-09-02 20:13:22 UTC

I first tried deleting the unknown-type partition within anaconda with fdisk and then just continuing - this led to the same error and I was not sure whether the fdisk changes were being reflected.

So I rebooted and gave it another try - and the unknown-type partition was back (I figure this bug causes that type of partition). I went ahead and deleted the unknown-type partition in the anaconda partitioning tool - but the same happened anyway, see the duplicate bug.

Comment 5 Sandro Mathys 2010-09-02 20:14:49 UTC

By the way, this is Fedora 14 Alpha RC4 x86_64 with 4 SATA-II disks in a RAID 10 :)

Comment 6 Sandro Mathys 2010-09-02 20:21:16 UTC

Adding Fedora 14 Beta Blocker according to https://fedoraproject.org/wiki/QA:Testcase_Install_to_BIOS_RAID which has Beta release level on https://fedoraproject.org/wiki/Test_Results:Fedora_14_Alpha_RC4_Install#General_Tests

Comment 7 Ben Boeckel 2010-09-03 14:07:05 UTC

I encountered this with F14-Alpha x86_64 on 2 RAID 1 disks. The partitioning also failed on /md127p3.

=============================
 Partition  Mount  FS
=============================
 md127p1    /boot  ext2
 md127p2    --     swap
 md127p3    /      ext4
 md127p4    --     extended
 md127p5    /home  ext4
=============================

Comment 8 Adam Williamson 2010-09-03 16:56:51 UTC

Discussed at today's blocker review meeting. Accepted as a Beta blocker under the criterion "The installer must be able to create and install to software, hardware or BIOS RAID-0, RAID-1 or RAID-5 partitions for anything except /boot".

Comment 9 John Poelstra 2010-09-09 04:33:08 UTC

What are the plans for addressing and assessing this bug?  We'd love to have more information before the next blocker meeting this Friday.

Comment 10 David Lehman 2010-09-09 22:18:56 UTC

(In reply to comment #7)
> I encountered this with F14-Alpha x86_64 on 2 RAID 1 disks. The partitioning
> also failed on /md127p3.

Ben, can you please attach the exception report to this bug so we can use that information to isolate the problem?

Comment 11 Ben Boeckel 2010-09-10 00:20:42 UTC

Unfortunately, it's my work machine and I had to get it installed. I didn't see a way to save the results from Anaconda before rebooting (well, I suppose a USB stick would have worked, but hindsight is 20/20) and it's not really up for installing again soon.

I ended up dropping to a TTY while Anaconda asked the earlier questions and using fdisk and mkfs to get the partitioning right. That went without a hitch. I don't have any other RAID instances. Maybe a call out to the devel list will get someone. I can get the specs (hard drive and motherboard) when I get in tomorrow.

Comment 12 Adam Williamson 2010-09-10 16:53:38 UTC

Discussed at 2010-09-10 blocker review meeting. No essential change from last week, but given Ben's inability to test further, we will have to close this as INSUFFICIENT_DATA if we cannot reproduce it in TC1 testing.

Comment 13 Kevin Kofler 2010-09-11 02:03:03 UTC

Is the information attached by Sandro Mathys not sufficient?

Comment 14 Ben Boeckel 2010-09-11 03:00:26 UTC

I just got it to go through without a crash (other than bug #632799). I imagine anaconda doesn't like the partition followed by a format.

Comment 15 Sandro Mathys 2010-09-11 09:17:04 UTC

Created attachment 446642 [details]
anaconda traceback (F14 Beta TC1)

Reproduced with F14 Beta TC1 (x86_64 install DVD)

Comment 16 David Lehman 2010-09-13 17:12:50 UTC

(In reply to comment #4)
> I first tried deleting the unknown-type partition within anaconda with fdisk
> and then just continuing - this led to the same error and I was not sure
> whether the fdisk changes were being reflected.

Please do not modify the partition table outside of anaconda without forcing anaconda to reset its storage objects. It will not help you.

(In reply to comment #15)
> Reproduced with F14 Beta TC1 (x86_64 install DVD)

Please describe in detail what you did: all partitioning-related operations and choices made in anaconda, as well as any activities carried out on the shell on tty2.

Comment 17 Sandro Mathys 2010-09-13 17:29:24 UTC

> (In reply to comment #15)
> > Reproduced with F14 Beta TC1 (x86_64 install DVD)
> 
> Please describe in detail what you did: all partitioning-related operations and
> choices made in anaconda, as well as any activities carried out on the shell on
> tty2.

I really didn't do anything fancy. This time I kept it as simple as possible, i.e. I didn't really change anything or use tty2. I used a Swiss German keyboard layout, used the BIOS RAID and no single HDD/SSD, set the Europe/Zurich (non-UTC) timezone, and chose to review the partitioning layout but didn't change anything, and the install failed.

So, the partitioning layout looked like this:
- md127p1: ntfs (windows hidden system whatever)
- md127p2: ntfs (windows 7 C:\)
- md127p3: ext4 (/boot)
- md127p4: extended
- md127p5: LVM

Comment 18 Adam Williamson 2010-09-14 13:59:01 UTC

sandro, when you posted comment #17, are you talking about the attempt logged in comment #13, or did you do a new attempt but not post the logs from that attempt yet?

Comment 19 Sandro Mathys 2010-09-14 14:19:29 UTC

(In reply to comment #18)
> sandro, when you posted comment #17, are you talking about the attempt logged
> in comment #13, or did you do a new attempt but not post the logs from that
> attempt yet?

comment #17 refers to comment #15, sorry for the confusion.

Comment 20 David Lehman 2010-09-15 22:47:56 UTC

I found a test machine with the same raid controller, although mine will only do basic striping or mirroring. I ran the f14-alpha installer on it, starting with an uninitialized (unpartitioned/unformatted) mirror array. I specified /boot, /, swap, and /home partitions on the mirror and hit the same error when trying to create the ext4 filesystem on md127p2 for /.

The problem appears to be somewhere below anaconda: pyparted, parted, or the kernel (md). Since this is only happening on these partitioned md devices, my money is on the kernel/md.

Immediately after doing the parted commit to add the second partition, we try to zero out the beginning and end of the new partition to wipe any residual metadata -- this fails with the error -ENOENT. If I switch to the shell on tty2, /dev and /proc/partitions agree that md127 has one partition, while parted shows it having two partitions. The system's still up, so feel free to ask for more data.
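The mismatch described above (the kernel listing fewer partitions than parted) can be checked mechanically. The following is only a rough sketch, not anything anaconda or parted does: the function names are made up for illustration, and it assumes the usual four-column /proc/partitions layout and the kernel's naming rule that a "p" separator is used when the disk name ends in a digit (md127 -> md127p2).

```python
import re


def kernel_partitions(proc_partitions_text, disk):
    """Return the partition numbers the kernel lists for `disk`.

    `proc_partitions_text` is the content of /proc/partitions.
    Hypothetical helper, for illustration only.
    """
    # Disks whose name ends in a digit get a "p" separator (md127p1);
    # disks ending in a letter do not (sda1).
    sep = "p" if disk[-1].isdigit() else ""
    pattern = re.compile(r"^%s%s(\d+)$" % (re.escape(disk), sep))
    numbers = []
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Skip the blank line and the "major minor #blocks name" header.
        if len(fields) != 4 or fields[0] == "major":
            continue
        match = pattern.match(fields[3])
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_from_kernel(proc_partitions_text, disk, expected):
    """Partition numbers that are on disk (e.g. per parted) but unknown to the kernel."""
    seen = set(kernel_partitions(proc_partitions_text, disk))
    return sorted(set(expected) - seen)
```

On a live system this could be called as missing_from_kernel(open("/proc/partitions").read(), "md127", [1, 2]), where the expected list is whatever parted reports.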

Will upload anaconda traceback shortly...

Comment 21 David Lehman 2010-09-15 22:50:27 UTC

Created attachment 447585 [details]
traceback from reproducer using different case

Comment 22 David Lehman 2010-09-16 00:33:28 UTC

We are seeing seemingly random failures when parted is trying to notify the kernel of changes to the md device's partition table. This is with intel isw fwraid via md, which is the only fwraid type anaconda uses md for. The others use dmraid.

The failures are also reproducible using the parted utility from the shell -- no errors are displayed but, after exiting, the kernel shows fewer partitions than are actually present.

We are assembling the array using mdadm -I on each of the member disks in turn, in case that matters.

Comment 23 Adam Williamson 2010-09-16 09:36:17 UTC

Given that this is a random kernel failure (as I guessed! yay me!), how practical is it for us to say 'just try again' is a workaround? Does anyone who's hit this have a feel for how likely you are to get a success in a couple of tries? Is there any other workaround?



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 24 Sandro Mathys 2010-09-16 10:27:32 UTC

So far I tried to install F14 Alpha/Beta at least 4 times and the issue always stopped me, i.e. trying again doesn't seem to be a valid workaround.

Comment 25 Barry Donahue 2010-09-16 13:31:11 UTC

I tried this test case on F-14-Beta-TC1. When I created the RAID set the first time, the install failed. I tried it again and it passed the second time.

Comment 26 James Laska 2010-09-16 14:10:03 UTC

Created attachment 447753 [details]
dm-1.udevdb

Attaching udevadm info per request from dlehman.  This is from a system with a "nVidia Corporation CK804 Serial ATA Controller"

# udevadm info --query=all --name=dm-1 > /tmp/dm-1.udevdb

Comment 27 David Lehman 2010-09-16 14:24:24 UTC

(In reply to comment #23)
> of tries? Is there any other workaround?

This is only with intel isw bios/fw raid, so we should have the workaround of passing "noiswmd" on boot command line to force dmraid instead of md for these arrays, but we've turned up some bugs in dmraid and/or the device-mapper udev rules that prevent the workaround from being productive.

Comment 28 James Laska 2010-09-16 14:33:11 UTC

Created attachment 447757 [details]
dm-2.udevdb

(In reply to comment #26)
> Created attachment 447753 [details]
> dm-1.udevdb
> 
> Attaching udevadm info per request from dlehman.  This is from a system with a
> "nVidia Corporation CK804 Serial ATA Controller"
> 
> # udevadm info --query=all --name=dm-1 > /tmp/dm-1.udevdb

Oops, using correct device this time (dm-2)

Comment 29 Dave Jones 2010-09-16 20:16:59 UTC

(In reply to comment #22)

> The failures are also reproducible using the parted utility from the shell --
> no errors are displayed but, after exiting, the kernel shows fewer partitions
> than are actually present.
> 
> We are assembling the array using mdadm -I on each of the member disks in turn,
> in case that matters.

Can you give me a list of commands to reproduce this? I'll ask the upstream md maintainer to take a look.

Comment 30 Hans de Goede 2010-09-17 09:59:18 UTC

Some insight on this from my side (as the ex anaconda bios raid and the ex parted maintainer). F-14 contains parted-2.3, which has switched from using the partition table reread ioctl, which fdisk and parted-2.1 (in F-13) use, to using blkpg. blkpg allows parted to remove the kernel's knowledge of partitions one partition at a time and to add new partitions on the fly even though some existing partitions are busy. These 2 commits are especially relevant:

http://git.debian.org/?p=parted/parted.git;a=commitdiff;h=0e04d17386274fc218a9e6f9ae17d75510e632a3
http://git.debian.org/?p=parted/parted.git;a=commitdiff;h=7165951dfb584aae2901ac3f1a28fe3624667f19

It could very well be that blkpg does not play well with mdraid devices. I'll create a parted patch that makes parted use the reread partition table ioctl again for mdraid sets, which can then be used to test this theory.
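For readers unfamiliar with the two mechanisms, here is a rough Python sketch of the difference. It is not parted's code: the helper names are invented for illustration, and it assumes the ioctl numbers and struct layouts from the Linux headers <linux/fs.h> and <linux/blkpg.h>. Actually issuing either ioctl needs root and a real block device.

```python
import ctypes
import fcntl
import os

# Both requests are _IO(0x12, nr) in the Linux headers.
BLKRRPART = 0x125F  # _IO(0x12, 95): kernel re-reads the whole partition table
BLKPG = 0x1269      # _IO(0x12, 105): add/delete one partition at a time
BLKPG_ADD_PARTITION = 1
BLKPG_DEL_PARTITION = 2


class BlkpgPartition(ctypes.Structure):
    """Mirror of struct blkpg_partition."""
    _fields_ = [
        ("start", ctypes.c_longlong),   # start offset in bytes
        ("length", ctypes.c_longlong),  # length in bytes
        ("pno", ctypes.c_int),          # partition number
        ("devname", ctypes.c_char * 64),
        ("volname", ctypes.c_char * 64),
    ]


class BlkpgIoctlArg(ctypes.Structure):
    """Mirror of struct blkpg_ioctl_arg."""
    _fields_ = [
        ("op", ctypes.c_int),
        ("flags", ctypes.c_int),
        ("datalen", ctypes.c_int),
        ("data", ctypes.c_void_p),
    ]


def build_add_request(pno, start_bytes, length_bytes):
    """Build the argument for a BLKPG add (hypothetical helper)."""
    part = BlkpgPartition(start=start_bytes, length=length_bytes, pno=pno)
    arg = BlkpgIoctlArg(op=BLKPG_ADD_PARTITION, flags=0,
                        datalen=ctypes.sizeof(part),
                        data=ctypes.addressof(part))
    # The caller must keep `part` alive while `arg` is in use.
    return part, arg


def reread_partition_table(disk_path):
    """Old style: all or nothing, fails if any partition is busy."""
    fd = os.open(disk_path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)
    finally:
        os.close(fd)


def add_partition(disk_path, pno, start_bytes, length_bytes):
    """New style: tell the kernel about a single partition."""
    part, arg = build_add_request(pno, start_bytes, length_bytes)
    fd = os.open(disk_path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKPG, arg)
    finally:
        os.close(fd)
```

The per-partition path means the caller has to diff its own view against the kernel's and issue one request per change, so any add or delete that fails leaves the two views out of sync.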

Comment 31 Hans de Goede 2010-09-17 13:27:53 UTC

Here is a scratch build which I think might fix this:
http://koji.fedoraproject.org/koji/taskinfo?taskID=2472964

Dave Lehman, can you test this? Dropping the libparted.so.0 file into an updates.img should do the trick for anaconda; for testing from tty2 you need to set LD_LIBRARY_PATH.

I'll attach the 2 patches which I've added to the srpm from which this scratch build was done.

Comment 32 Hans de Goede 2010-09-17 13:29:42 UTC

Created attachment 448010 [details]
PATCH 1/2 for parted which might fix / workaround this

Comment 33 Hans de Goede 2010-09-17 13:30:55 UTC

Created attachment 448011 [details]
PATCH 2/2 for parted which might fix / workaround this

Comment 34 Hans de Goede 2010-09-17 13:42:07 UTC

Thinking more about this, the use of blkext majors rather than
the main disk major for partitions (which the kernel does for partitions > 15
on scsi disks and for any and all md partitions) is likely the cause of these
mdraid partition issues.

So we may have similar issues with normal disks with > 15 partitions.
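A quick way to see which partitions fall into this category is to look at the major number of their device nodes. This is just a sketch with made-up helper names; it assumes the extended block major is 259 (BLOCK_EXT_MAJOR in the kernel's major.h).

```python
import os

# Assumed value of BLOCK_EXT_MAJOR from the kernel's include/linux/major.h.
BLOCK_EXT_MAJOR = 259


def uses_blkext_major(rdev):
    """True if a device number belongs to the extended block major."""
    return os.major(rdev) == BLOCK_EXT_MAJOR


def partition_uses_blkext(dev_path):
    """Check a device node, e.g. /dev/md127p3 or /dev/sda16."""
    return uses_blkext_major(os.stat(dev_path).st_rdev)
```

If the theory holds, partition_uses_blkext("/dev/md127p1") would be true on an affected system, while /dev/sda1 would report the regular sd major (8).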

Comment 35 Fedora Update System 2010-09-17 19:09:20 UTC

parted-2.3-2.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/parted-2.3-2.fc14

Comment 36 Brian Lane 2010-09-17 19:13:14 UTC

Ends up this is actually parted, see bug 634980

Comment 37 Adam Williamson 2010-09-18 08:56:53 UTC

RC1 is now out and should include a fix for this:

http://serverbeach1.fedoraproject.org/pub/alt/stage/14-Beta.RC1/Fedora/

Please test and confirm this. I've tested a case which we think is hitting the same issue - >15 partitions on a single disk - but we need to make sure this RAID issue is actually the same bug, and the fix works for it. Thanks!

Comment 38 Sandro Mathys 2010-09-19 19:16:09 UTC

Can confirm that this is fixed in Beta RC2 :)

Comment 39 Adam Williamson 2010-09-19 21:42:25 UTC

Awesome, thanks for testing!

Also, I'm sure Dave is very sorry for accusing you of modifying things and denying it ;)

Comment 40 James Laska 2010-09-20 15:28:02 UTC

Moving to VERIFIED based on comment#38

Comment 41 Fedora Update System 2010-09-20 18:41:26 UTC

parted-2.3-2.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update parted'
You can provide feedback for this update here: https://admin.fedoraproject.org/updates/parted-2.3-2.fc14

Comment 42 James Laska 2010-09-20 19:15:42 UTC

Moving back to VERIFIED, still based on comment#38

Comment 43 Adam Williamson 2010-09-20 20:24:54 UTC

Sandro, can you +1 the update in Bodhi? Thanks!

Comment 44 Fedora Update System 2010-09-22 04:09:05 UTC

parted-2.3-2.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
