Bug 1111442 - Under 3.16 one of my partition devices has zero size, but is normal under 3.15
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-06-20 04:07 UTC by Bruno Wolff III
Modified: 2014-06-28 13:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-28 13:19:50 UTC
Type: Bug
Embargoed:


Attachments
boot log info (78.94 KB, text/plain)
2014-06-20 04:07 UTC, Bruno Wolff III
dmesg output from successful boot (130.01 KB, text/plain)
2014-06-20 14:40 UTC, Bruno Wolff III
initramfs differences (48.41 KB, text/plain)
2014-06-20 17:35 UTC, Bruno Wolff III
blkid output from live image instance (2.24 KB, text/plain)
2014-06-25 00:20 UTC, Bruno Wolff III

Description Bruno Wolff III 2014-06-20 04:07:04 UTC
Created attachment 910627 [details]
boot log info

Description of problem:
There isn't a kernel crash, but the boot fails. I rebuilt the initramfs for the 3.15 kernel, and it still boots. So it seems likely the issue is kernel related rather than caused by, say, dracut.

There is a message about /dev/resume already existing and an error reported for not being able to read a floppy (there was none in the drive), that I don't usually notice.

Version-Release number of selected component (if applicable):
kernel-PAE-3.16.0-0.rc1.git2.2.fc21.i686 (the .1 debug kernel and the rc1.git0 kernels have the same problem).

Comment 1 Josh Boyer 2014-06-20 14:04:31 UTC
It's failing because it can't find what it was told is root:

root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9

Nothing in the output from the various blkid commands seems to list anything with a UUID like that.

What does the 3.15 boot output look like?

Comment 2 Bruno Wolff III 2014-06-20 14:40:27 UTC
Created attachment 910796 [details]
dmesg output from successful boot

This is the dmesg output from the current boot.

I have an x86_64 machine that works just fine with a similar setup.

I have another i686 machine that the kernel crashed on with 3.16-0.rc1.git0, but I haven't captured the traceback info from that machine yet. It also has a similar setup.

The root partition on all of these machines is ext4 on luks on md raid 1.

Comment 3 Bruno Wolff III 2014-06-20 14:42:15 UTC
There is one other odd thing that might have triggered a problem if there were md changes. The root partition is on a raid array with just one element (not degraded), because I was using the matching partition on the other disk to test older kernels. Perhaps there was an md change that caused this setup to break.

Comment 4 Josh Boyer 2014-06-20 14:58:43 UTC
What does a diff of the 3.15 and 3.16 initramfses show?  Can you extract them and run a diff to see if somehow the config files contained within them are different?  Since you regenerated the 3.15 initramfs, I would expect them to be identical except the /lib/modules/ directory.
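
A minimal sketch of the kind of comparison being asked for, assuming both initramfs images have already been unpacked into two directories by some other means. The function names `list_files` and `diff_trees` and the `skip_prefixes` parameter are hypothetical, introduced only for illustration. It reports files that exist on one side only and files whose contents differ, optionally ignoring a subtree such as the kernel modules directory.

```python
import filecmp
import os


def list_files(root):
    """Return the set of regular-file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so that only real file contents are compared.
            if os.path.isfile(full) and not os.path.islink(full):
                found.add(os.path.relpath(full, root))
    return found


def diff_trees(left, right, skip_prefixes=()):
    """Compare two extracted trees.

    Returns (only_left, only_right, changed) as sorted lists of relative
    paths. Paths equal to or below any entry of skip_prefixes are ignored.
    """
    left_files = list_files(left)
    right_files = list_files(right)

    def skipped(rel):
        return any(rel == p or rel.startswith(p + os.sep) for p in skip_prefixes)

    only_left = sorted(f for f in left_files - right_files if not skipped(f))
    only_right = sorted(f for f in right_files - left_files if not skipped(f))
    changed = sorted(
        f
        for f in left_files & right_files
        if not skipped(f)
        # shallow=False forces a byte-by-byte comparison of the contents.
        and not filecmp.cmp(os.path.join(left, f), os.path.join(right, f), shallow=False)
    )
    return only_left, only_right, changed
```

Calling `diff_trees("initramfs-3.15", "initramfs-3.16", skip_prefixes=("lib/modules",))` (directory names are placeholders) would leave out the module trees, which are expected to differ between kernel versions.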

Comment 5 Bruno Wolff III 2014-06-20 17:35:00 UTC
Created attachment 910843 [details]
initramfs differences

A number of the binary files are different, but I don't see any script differences. I can rebuild both the latest 3.16 and the 3.15 initramfs without doing any updates in between and retest this over the weekend. That will make it easier to rule out non-kernel effects if 3.15 still works and 3.16 doesn't.

Comment 6 Bruno Wolff III 2014-06-22 16:51:10 UTC
I used the kernel scripts to rebuild a 3.15 and a 3.16 initramfs right after one another, and a bunch of binaries were still different. My guess is that they are statically linked and there end up being small differences between them.

I played with mdadm after a 3.16 boot failed and found that mdadm doesn't recognize the superblock on /dev/sda3. I have two other raid 1 arrays (/home and a frozen copy of /home I want to recover some stuff from) and I was able to manually start both of those.

It looks like there is something about the superblock on /dev/sda3 that passes a check on 3.15 kernels, but not on 3.16 kernels. I could try rewriting the superblock and see if this fixes things, but it would be nice to know if the superblock is corrupt or the kernel check is incorrect before losing its contents.

Comment 7 Bruno Wolff III 2014-06-22 17:01:08 UTC
My other rawhide i686 system is still crashing before it gets to asking for the luks passwords. Likely it's a nouveau regression. So I can't check if I see similar raid issues on that machine. My x86_64 machine appears to be working OK.

Comment 8 Bruno Wolff III 2014-06-22 17:24:28 UTC
Here is what mdadm says about /dev/sda3:
mdadm --examine /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 9e06cb82:14b726e2:af554f00:b9b73901
           Name : bruno.wolff.to:13  (local to host bruno.wolff.to)
  Creation Time : Thu Jun 30 06:22:05 2011
     Raid Level : raid1
   Raid Devices : 1

 Avail Dev Size : 167770112 (80.00 GiB 85.90 GB)
     Array Size : 83884984 (80.00 GiB 85.90 GB)
  Used Dev Size : 167769968 (80.00 GiB 85.90 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=144 sectors
          State : clean
    Device UUID : 350f0baa:0622d7da:bc485abd:a4b765f9

    Update Time : Sun Jun 22 12:22:42 2014
       Checksum : 3e3671a9 - correct
         Events : 7031153


   Device Role : Active device 0
   Array State : A ('A' == active, '.' == missing, 'R' == replacing)
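
The size fields in the output above can be cross-checked against each other. The sketch below assumes that mdadm reports "Avail Dev Size", "Used Dev Size" and "Unused Space" in 512-byte sectors and "Array Size" in KiB; that reading is an assumption, though it is consistent with the GiB/GB figures printed next to the numbers. The function name `check_superblock_sizes` is hypothetical.

```python
SECTOR_BYTES = 512

# Values copied from the `mdadm --examine /dev/sda3` output above.
AVAIL_DEV_SECTORS = 167770112
USED_DEV_SECTORS = 167769968
ARRAY_SIZE_KIB = 83884984
UNUSED_AFTER_SECTORS = 144


def check_superblock_sizes(avail, used, array_kib, unused_after):
    """Return named consistency checks for a single-device raid1 member."""
    return {
        # Space left after the used region should be avail minus used.
        "unused_after_matches": avail - used == unused_after,
        # For raid1 the array size equals the used size of one member
        # (assumed units: sectors for the member, KiB for the array).
        "array_size_matches": used * SECTOR_BYTES == array_kib * 1024,
    }


if __name__ == "__main__":
    checks = check_superblock_sizes(
        AVAIL_DEV_SECTORS, USED_DEV_SECTORS, ARRAY_SIZE_KIB, UNUSED_AFTER_SECTORS
    )
    for name, ok in checks.items():
        print(f"{name}: {'ok' if ok else 'MISMATCH'}")
    print(f"used size: {USED_DEV_SECTORS * SECTOR_BYTES / 2**30:.2f} GiB")
```

Under those assumptions both checks hold for the values shown, which fits with mdadm reporting the checksum as correct and the state as clean.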

Comment 9 Bruno Wolff III 2014-06-22 17:34:58 UTC
I opened a kernel bug about this:
https://bugzilla.kernel.org/show_bug.cgi?id=78711

Comment 10 Bruno Wolff III 2014-06-24 11:32:38 UTC
The problem is still happening in kernel-PAE-3.16.0-0.rc2.git0.1.fc21.i686.

Comment 11 Bruno Wolff III 2014-06-25 00:20:24 UTC
Created attachment 911900 [details]
blkid output from live image instance

I got a live image to boot, but it was pretty unstable and locked up a few times (requiring reboots) while I was collecting info.

dd indicated that /dev/sda3 was 0 bytes long. So that suggests that the difference isn't in md, but rather in some other part of I/O.

The device is created and blkid reports some information about it.
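
A minimal sketch of two ways to ask how large the kernel thinks a device node is, for comparison with the dd result. The function names `device_size_bytes` and `sysfs_size_bytes` are hypothetical. The first seeks to the end of the opened node; the second reads the `size` attribute under /sys/class/block, which counts 512-byte sectors. Opening a block device normally requires root, and this report does not say what either method would have returned on the affected kernel.

```python
import os


def device_size_bytes(path):
    """Return the size of path as seen by seeking to its end."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)


def sysfs_size_bytes(dev_name, sysfs_root="/sys/class/block"):
    """Return the size recorded in sysfs for dev_name (for example "sda3").

    The sysfs `size` file holds a count of 512-byte sectors.
    """
    with open(os.path.join(sysfs_root, dev_name, "size")) as handle:
        return int(handle.read().strip()) * 512


if __name__ == "__main__":
    # Device names are examples; adjust them for the machine being checked.
    print("lseek:", device_size_bytes("/dev/sda3"))
    print("sysfs:", sysfs_size_bytes("sda3"))
```

If the two numbers disagree, that would point at how the device node is being read rather than at the partition table itself.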

Comment 12 Bruno Wolff III 2014-06-25 04:34:35 UTC
fdisk doesn't seem to show anything odd about the partition table:
fdisk -l /dev/sda

Disk /dev/sda: 298,1 GiB, 320072933376 bytes, 625142448 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x6c3efcfe

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sda1  *         2048   2099199   2097152     1G fd Linux raid autodetect
/dev/sda2         2099200  23070719  20971520    10G fd Linux raid autodetect
/dev/sda3        23070720 190842879 167772160    80G fd Linux raid autodetect
/dev/sda4       190842880 625142447 434299568 207,1G fd Linux raid autodetect
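
The listing can be checked mechanically. The sketch below uses only the numbers printed above plus the 2048-sector data offset and the "Avail Dev Size" from the earlier mdadm output; the function name `check_table` is hypothetical. It verifies that each partition's sector count matches its start and end, that the partitions do not overlap, and that the last one ends inside the disk.

```python
SECTOR_BYTES = 512
DISK_SECTORS = 625142448

# (name, start, end, sectors) copied from the fdisk listing above.
PARTITIONS = [
    ("/dev/sda1", 2048, 2099199, 2097152),
    ("/dev/sda2", 2099200, 23070719, 20971520),
    ("/dev/sda3", 23070720, 190842879, 167772160),
    ("/dev/sda4", 190842880, 625142447, 434299568),
]


def check_table(partitions, disk_sectors):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    previous_end = None
    for name, start, end, sectors in partitions:
        if end - start + 1 != sectors:
            problems.append(f"{name}: sector count does not match start/end")
        if end >= disk_sectors:
            problems.append(f"{name}: extends past the end of the disk")
        if previous_end is not None and start <= previous_end:
            problems.append(f"{name}: overlaps the previous partition")
        previous_end = end
    return problems


if __name__ == "__main__":
    print(check_table(PARTITIONS, DISK_SECTORS) or "partition table is consistent")
    sda3_sectors = PARTITIONS[2][3]
    print("sda3 bytes:", sda3_sectors * SECTOR_BYTES)
    # Partition size minus the md data offset should equal Avail Dev Size.
    print("sda3 minus data offset:", sda3_sectors - 2048)
```

For /dev/sda3 this gives 167772160 * 512 = 85899345920 bytes (exactly 80 GiB), and 167772160 - 2048 = 167770112 sectors, which matches the "Avail Dev Size" mdadm reported. So the on-disk partition table and the md superblock agree with each other.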

Comment 13 Josh Boyer 2014-06-25 13:30:20 UTC
Do either of the block or raid teams have any ideas on this one?

Comment 14 Bruno Wolff III 2014-06-25 14:20:03 UTC
I haven't heard anything from the block team. Neil concurred that the partition device being zero size would account for the raid behavior and that we need to figure out why this one partition is consistently zero length under 3.16 but works fine with 3.15.

Comment 15 Jeff Moyer 2014-06-25 14:29:57 UTC
I'm not certain this is your issue, but could you try with Al's patch applied?

https://lkml.org/lkml/2014/6/23/96

Comment 16 Bruno Wolff III 2014-06-25 15:23:16 UTC
Yes I can test this tonight.

Comment 17 Bruno Wolff III 2014-06-25 19:37:58 UTC
I have built a kernel from d91d66e88ea95b6dd21958834414009614385153 with the patch linked to above applied using the config from config-3.16.0-0.rc2.git1.2.fc21.i686+PAE. I need to wait until I get home to reboot the machine and verify if it fixes the partition device issue.

Comment 18 Bruno Wolff III 2014-06-25 23:55:50 UTC
When I tested this I got past where I had been getting stuck. It looked like the raid array using the problem partition did start. The system didn't finish booting, but likely due to some other issue. (I hadn't included any patches currently being used by Fedora kernels so I might have been missing other important fixes.)

Comment 19 Bruno Wolff III 2014-06-26 15:09:50 UTC
I am working on testing this with a rebuild of the Fedora srpm for kernel-3.16.0-0.rc2.git0.1.fc21 (modified by the patch), but it takes a long time on my machine and didn't finish overnight. I won't be able to see if that kernel works a bit better until late tonight or early tomorrow.

Comment 20 Bruno Wolff III 2014-06-26 20:55:27 UTC
The fix for this is now in Linus' tree:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0b86dbf675e0170a191a9ca18e5e99fd39a678c0

Comment 21 Bruno Wolff III 2014-06-27 13:23:18 UTC
I tested a rebuilt 3.16.0-0.rc2.git0.1.fc21.i686 with the patch to make sure nothing else new picked up in my previous test fixed the problem. (And to also have a kernel with some other fixes, so that I could really test the kernel.) It has been running overnight without any problems.
I expect the next rawhide kernel will pick up the fix from Linus' tree and I'll be able to close this bug.

Comment 22 Josh Boyer 2014-06-27 15:10:06 UTC
Kicked off the build for tomorrow's rawhide.  I'll let you verify and close the bug out if it resolves your issue.  Thanks Bruno.

Comment 23 Bruno Wolff III 2014-06-28 13:19:50 UTC
Testing with kernel-PAE-3.16.0-0.rc2.git4.2.fc21.i686 (from the nodebug repo) went well and I believe this issue is resolved.

