Created attachment 910627 [details]
boot log info

Description of problem:
There isn't a kernel crash, but the boot fails. I rebuilt the initramfs for the 3.15 kernel and it still boots, so it seems likely the issue is kernel related rather than, say, dracut. There is a message about /dev/resume already existing, and an error about not being able to read a floppy (there was none in the drive), that I don't usually see.

Version-Release number of selected component (if applicable):
kernel-PAE-3.16.0-0.rc1.git2.2.fc21.i686 (the .1 debug kernel and the rc1.git0 kernels have the same problem).
It's failing because it can't find what it was told is root:

root=/dev/mapper/luks-9a976b86-8aaa-40d9-8039-89d710eac5c9

Nothing in the output from the various blkid commands seems to list anything with a UUID like that. What does the 3.15 boot output look like?
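For context, /dev/mapper/luks-<UUID> only appears after blkid can see a crypto_LUKS device with that UUID; with the raid member unreadable there is nothing for it to find. A minimal sketch of that lookup follows. The blkid capture is fabricated (only the failing UUID is from this report); on a real system the input would be live `blkid` output.

```shell
# Sketch of the check dracut effectively performs when resolving root=.
# The capture below is a made-up stand-in for `blkid` output on the
# failing boot, where no md array (and hence no LUKS container) is visible.
set -eu
uuid=9a976b86-8aaa-40d9-8039-89d710eac5c9
out=$(mktemp)
cat > "$out" <<'EOF'
/dev/sda1: UUID="0f1e2d3c-0000-0000-0000-000000000001" TYPE="ext4"
/dev/sda2: UUID="0f1e2d3c-0000-0000-0000-000000000002" TYPE="linux_raid_member"
EOF
if grep -q "$uuid" "$out"; then found=yes; else found=no; fi
# stays "no" until the md array assembles and the crypto_LUKS
# container with that UUID becomes visible to blkid
echo "LUKS UUID present: $found"
rm -f "$out"
```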
Created attachment 910796 [details]
dmesg output from successful boot

This is the dmesg output from the current boot. I have an x86_64 machine that works just fine with a similar setup. I have another i686 machine where the kernel crashed with 3.16-0.rc1.git0, but I haven't captured the traceback info from that machine yet. It also has a similar setup. The root partition on all of these machines is ext4 on luks on md raid 1.
There is one other odd thing that might have triggered a problem if there were md changes. The root partition is on a raid array with just one element (not degraded), because I was using the matching partition on the other disk to test older kernels. Perhaps there was an md change that caused this setup to break.
What does a diff of the 3.15 and 3.16 initramfses show? Can you extract them and run a diff to see if somehow the config files contained within them are different? Since you regenerated the 3.15 initramfs, I would expect them to be identical except the /lib/modules/ directory.
Created attachment 910843 [details]
initramfs differences

A number of the binary files are different, but I don't see any script differences. I can rebuild both the latest 3.16 and the 3.15 initramfs without doing any updates in between and retest this over the weekend. That will make it easier to rule out non-kernel effects if 3.15 still works and 3.16 doesn't.
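The comparison above can be sketched as below. This is a stand-in, not the actual images: the real trees would come from extracting each initramfs first (for a plain gzip-compressed image, `zcat initramfs-....img | cpio -idm` inside an empty scratch directory; `lsinitrd` from dracut can also list an image's contents). Two toy trees mimic the extracted images so the diff step itself runs.

```shell
# Illustrative sketch: two toy "extracted initramfs" trees with identical
# config files but differing binaries, mirroring what was seen in the report.
set -eu
work=$(mktemp -d)
mkdir -p "$work/3.15/etc" "$work/3.16/etc"
echo 'rd.luks=1' > "$work/3.15/etc/cmdline"    # identical config files...
echo 'rd.luks=1' > "$work/3.16/etc/cmdline"
echo 'build-a'   > "$work/3.15/mdadm"          # ...but differing binaries
echo 'build-b'   > "$work/3.16/mdadm"
# -r recurses, -q only names the files that differ (useful for binaries)
changed=$(diff -rq "$work/3.15" "$work/3.16" || true)
echo "$changed"
rm -rf "$work"
```

With -q, only the differing binary is reported and the identical config files stay silent, which matches the pattern seen in the attached diff.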
I used the kernel scripts to rebuild a 3.15 and a 3.16 initramfs right after one another, and a bunch of the binaries were still different. My guess is that they are statically linked and there end up being small differences between them. I played with mdadm after a 3.16 boot failed and found that mdadm doesn't recognize the superblock on /dev/sda3. I have two other raid 1 arrays (/home and a frozen copy of /home I want to recover some stuff from) and I was able to manually start both of those. It looks like there is something about the superblock on /dev/sda3 that passes a check on 3.15 kernels, but not on 3.16 kernels. I could try rewriting the superblock and see if that fixes things, but it would be nice to know whether the superblock is corrupt or the kernel check is incorrect before risking the array's contents.
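The superblock check can be sketched as follows. The only layout facts assumed are the ones mdadm itself reports for this array (Super Offset = 8 sectors, Magic = a92b4efc, stored little-endian on disk); a scratch file stands in for /dev/sda3.

```shell
# Minimal sketch of locating the md v1.2 superblock magic, using a scratch
# file in place of /dev/sda3. Offset and magic are from the --examine output.
set -eu
img=$(mktemp)
truncate -s 1M "$img"
# place the magic 8 sectors (4096 bytes) in, little-endian, as on disk
printf '\xfc\x4e\x2b\xa9' | dd of="$img" bs=1 seek=4096 conv=notrunc status=none
magic=$(dd if="$img" bs=1 skip=4096 count=4 status=none | od -An -tx1 | tr -d ' ')
echo "$magic"   # fc4e2ba9: the on-disk byte order of magic a92b4efc
rm -f "$img"
```

If the kernel registers the partition as zero length, the same read returns no bytes at all, which would explain mdadm failing to find a superblock even though the on-disk data is intact.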
My other rawhide i686 system is still crashing before it gets to asking for the luks passwords. Likely it's a nouveau regression. So I can't check if I see similar raid issues on that machine. My x86_64 machine appears to be working OK.
Here is what mdadm says about /dev/sda3:

mdadm --examine /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 9e06cb82:14b726e2:af554f00:b9b73901
           Name : bruno.wolff.to:13  (local to host bruno.wolff.to)
  Creation Time : Thu Jun 30 06:22:05 2011
     Raid Level : raid1
   Raid Devices : 1

 Avail Dev Size : 167770112 (80.00 GiB 85.90 GB)
     Array Size : 83884984 (80.00 GiB 85.90 GB)
  Used Dev Size : 167769968 (80.00 GiB 85.90 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
   Unused Space : before=1968 sectors, after=144 sectors
          State : clean
    Device UUID : 350f0baa:0622d7da:bc485abd:a4b765f9

    Update Time : Sun Jun 22 12:22:42 2014
       Checksum : 3e3671a9 - correct
         Events : 7031153

    Device Role : Active device 0
    Array State : A ('A' == active, '.' == missing, 'R' == replacing)
I opened a kernel bug about this: https://bugzilla.kernel.org/show_bug.cgi?id=78711
The problem is still happening in kernel-PAE-3.16.0-0.rc2.git0.1.fc21.i686 .
Created attachment 911900 [details]
blkid output from live image instance

I got a live image to boot, but it was pretty unstable and locked up a few times (requiring reboots) while I was collecting info. dd indicated that /dev/sda3 was 0 bytes long. So that suggests that the difference isn't in md, but rather in some other part of I/O. The device is created and blkid reports some information about it.
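The zero-length observation can be reproduced in miniature; a scratch file stands in for the block device (on the live image the real check was dd against /dev/sda3):

```shell
# Reproduce the length check in miniature with a scratch file in place
# of the device node.
set -eu
f=$(mktemp)
truncate -s 0 "$f"                      # mimic a device registered as 0 bytes
bytes=$(dd if="$f" bs=512 2>/dev/null | wc -c)
echo "device reports $bytes bytes"      # zero: nothing for mdadm/blkid to scan
rm -f "$f"
```

A device the kernel registered at zero length returns no data to any reader, so the missing md superblock is a symptom, not the cause.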
fdisk doesn't seem to show anything odd about the partition table:

fdisk -l /dev/sda
Disk /dev/sda: 298,1 GiB, 320072933376 bytes, 625142448 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x6c3efcfe

Device    Boot     Start       End   Sectors   Size Id Type
/dev/sda1 *         2048   2099199   2097152     1G fd Linux raid autodetect
/dev/sda2        2099200  23070719  20971520    10G fd Linux raid autodetect
/dev/sda3       23070720 190842879 167772160    80G fd Linux raid autodetect
/dev/sda4      190842880 625142447 434299568 207,1G fd Linux raid autodetect
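Cross-checking the table arithmetic for /dev/sda3 (start/end sectors from the fdisk output, 512-byte sectors) confirms the on-disk partition table itself is sane:

```shell
# Sanity-check the /dev/sda3 entry from the fdisk table above.
set -eu
start=23070720
end=190842879
sectors=$((end - start + 1))
echo "$sectors sectors"                       # matches the fdisk Sectors column
gib=$((sectors * 512 / 1024 / 1024 / 1024))
echo "$gib GiB"                               # 80 GiB, as fdisk reports
```

So the table says 80 GiB while the kernel-created device reads as 0 bytes, pointing at partition parsing or device setup in 3.16 rather than the disk contents.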
Do either of the block or raid teams have any ideas on this one?
I haven't heard anything from the block team. Neil concurred that the partition device being zero size would account for the raid behavior, and that we need to figure out why this one partition is consistently zero length under 3.16 but works fine with 3.15.
I'm not certain this is your issue, but could you try with Al's patch applied? https://lkml.org/lkml/2014/6/23/96
Yes I can test this tonight.
I have built a kernel from d91d66e88ea95b6dd21958834414009614385153 with the patch linked to above applied using the config from config-3.16.0-0.rc2.git1.2.fc21.i686+PAE. I need to wait until I get home to reboot the machine and verify if it fixes the partition device issue.
When I tested this I got past where I had been getting stuck. It looked like the raid array using the problem partition did start. The system didn't finish booting, but that was likely due to some other issue. (I hadn't included any patches currently being used by Fedora kernels, so I might have been missing other important fixes.)
I am working on testing this with a rebuild of the Fedora srpm for kernel-3.16.0-0.rc2.git0.1.fc21 (modified by the patch), but the build takes a long time on my machine and didn't finish overnight. I won't be able to see if that kernel works a bit better until late tonight or early tomorrow.
The fix for this is now in Linus' tree: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0b86dbf675e0170a191a9ca18e5e99fd39a678c0
I tested a rebuilt 3.16.0-0.rc2.git0.1.fc21.i686 with the patch to make sure nothing else new picked up in my previous test fixed the problem. (And to also have a kernel with some other fixes, so that I could really test the kernel.) It has been running overnight without any problems. I expect the next rawhide kernel will pick up the fix from Linus' tree and I'll be able to close this bug.
Kicked off the build for tomorrow's rawhide. I'll let you verify and close the bug out if it resolves your issue. Thanks Bruno.
Testing with kernel-PAE-3.16.0-0.rc2.git4.2.fc21.i686 (from the nodebug repo) went well and I believe this issue is resolved.