Description of problem: From kernel-3.2.1, boot fails with a timeout and drops to the maintenance console. Up to kernel-3.1.x, boot succeeds and all RAIDs are activated correctly: all HDDs are scanned and all RAIDs come up quickly. With kernel-3.2.x, the kernel scans the HDDs repeatedly, spends a long time doing so, and finally fails with a timeout. I checked the RAID status from the maintenance console: some RAIDs are activated successfully, but some HDDs cannot be accessed and some RAIDs fail to activate.

Version-Release number of selected component (if applicable): kernel-PAE-3.2.1-3.fc16.i686.rpm, kernel-PAE-3.2.2-1.fc16.rpm

How reproducible: Build several RAIDs with many HDDs. I use 40 HDDs; 2 of them contain 3 partitions each, and the others have no partitions. This is the content of /proc/mdstat with kernel-3.1.x:
----
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0]
md11 : active raid0 sdm[1] sdk[0]
      273508864 blocks super 1.2 512k chunks

md12 : active raid0 sdy[0] sdaa[1]
      372600832 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0] sdb1[1]
      204736 blocks [2/2] [UU]

md7 : active raid5 sds[0] sdt[3] sdr[2] sdq[1]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md9 : active raid6 md12[8] sdz[4] sdu[0] sdad[7] sdx[3] sdac[2] sdw[6] sdab[5] sdv[1]
      2187980032 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

md8 : active raid6 md11[8] sdo[5] sdg[1] sdp[0] sdh[2] sdl[6] sdn[7] sdi[3] sdj[4]
      1709382080 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

md6 : active raid5 sdd[2] sdf[0] sde[1] sdc[3]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md2 : active raid1 sdb3[1] sda3[0]
      8388544 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      968166528 blocks [2/2] [UU]
----

Steps to Reproduce:
1. install kernel-3.2.x
2. reboot

Actual results: Booting spends a long time scanning the HDDs many times, then drops to the maintenance console. Some HDDs cannot be accessed via their device files (/dev/sd?). Some RAIDs fail to activate.
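For reference, the failure state described above can be inspected from the maintenance console with read-only commands along these lines (a sketch; /dev/md9 and /dev/sdu are example devices taken from the mdstat listing above):

```shell
# Overview of all md arrays, their member devices, and sync state
cat /proc/mdstat

# Per-array detail: state, degraded/failed members, event counts
mdadm --detail /dev/md9

# Check whether a member disk is reachable at all (read-only probe)
dd if=/dev/sdu of=/dev/null bs=512 count=1
```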
Expected results: Boot completes quickly and all RAIDs are activated as shown above. Additional info: kernel-3.1.x works correctly.
Sometimes the kernel corrupts superblocks while booting, so a recovery like this is required:
$ /sbin/mdadm --zero-superblock /dev/sda2
$ /sbin/mdadm /dev/md1 --add /dev/sda2
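A fuller version of that recovery sequence might look like the following (a sketch using the same /dev/sda2 and /dev/md1 from the commands above; note that --zero-superblock destroys the md metadata on the device, so it must only be run on a member you intend to re-add):

```shell
# Wipe the stale/corrupted md superblock from the member device
/sbin/mdadm --zero-superblock /dev/sda2

# Re-add the device to the array; md starts a resync onto it
/sbin/mdadm /dev/md1 --add /dev/sda2

# Watch the resync progress until the array reports [2/2] [UU]
cat /proc/mdstat
```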
kernel-PAE-3.2.3-2.fc16.i686.rpm also fails.
we need to see the debug logs from dracut to really figure out what's going wrong here.
(In reply to comment #3) > we need to see the debug logs from dracut to really figure out what's going > wrong here. How can I get the debug logs? I think dracut is invoked while installing the kernel. I'll try to gather the information the way you describe and send it.
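As an aside, dracut's own debug output can usually be enabled from the bootloader rather than at kernel install time. This is a sketch; the exact option names vary between dracut versions, so treat them as assumptions to check against the dracut man page for your release:

```shell
# Append to the kernel command line in the GRUB entry:
#   rd.debug rd.shell
# (older dracut releases used the spellings: rdinitdebug rdshell)
# rd.debug traces the initramfs scripts; rd.shell drops to a shell on failure.
```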
When you get dropped to the debug shell, you should be able to redirect the dmesg output somewhere.
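Capturing the log from the debug shell could look like this (a sketch; /dev/sdz1 is a hypothetical USB stick, and any writable filesystem reachable from the initramfs works):

```shell
# Save the kernel ring buffer to a file
dmesg > /tmp/dmesg.txt

# Copy it somewhere persistent, e.g. a mounted USB stick
mount /dev/sdz1 /mnt   # hypothetical USB device node
cp /tmp/dmesg.txt /mnt/
umount /mnt
```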
Created attachment 559652 [details] I
Sorry, I'll try it soon.
Created attachment 559662 [details] dmesg output with kernel-PAE-3.2.3-2.fc16.i686 I installed kernel-PAE-3.2.3-2.fc16.i686 and kernel-PAE-devel-3.2.3-2.fc16.i686 and rebooted. Booting failed and entered the debug shell. I ran dmesg; this is the dmesg output. P.S. Sorry for the delay; e2fsck was running on a huge RAID file system.
Created attachment 560046 [details] dmesg output with kernel 3.1.x ( works fine ) After that, the machine started failing to boot because the root partition had some errors. The RAID0 holding the root partition was also broken. I finally fixed the problems and captured a dmesg sample. The kernel is not an official package, but the dmesg output might still help identify the problem. Thank you for your support.
Created attachment 565229 [details] dmesg output with kernel-PAE-3.2.5-3.fc16.i686
Created attachment 565230 [details] dmesg output with kernel-PAE-3.2.6-3.fc16.i686
I changed the RAID configuration because of some HDD problems. This is the content of /proc/mdstat with kernel-3.1.x:
----
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0]
md0 : active raid1 sda1[0] sdb1[1]
      204736 blocks [2/2] [UU]

md11 : active raid0 sdk[0] sdy[2] sdm[1]
      333539840 blocks super 1.2 512k chunks

md7 : active raid5 sdq[1] sdt[3] sdr[2] sds[0]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md9 : active raid6 sdz[4] sdu[0] sdx[3] sdw[6] sdab[5] sdad[7] sdaa[8] sdac[2] sdv[1]
      2187980032 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

md8 : active raid6 md11[8] sdh[2] sdo[0] sdg[1] sdi[3] sdl[6] sdp[5] sdn[7] sdj[4]
      1709382080 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

md6 : active raid5 sdf[0] sde[1] sdd[2] sdc[3]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sdb2[1] sda2[0]
      968166528 blocks [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      8388544 blocks [2/2] [UU]

unused devices: <none>
----
Created attachment 565233 [details] dmesg output with kernel-PAE-3.2.7-1.fc16.i686 I got this log after I changed the RAID configuration.
Created attachment 565236 [details] dmesg output with kernel 3.1.x ( works fine ) I got this log after I changed the RAID configuration. Some superblocks were corrupted after trying kernel-PAE-3.2.7-1, so this log contains some error messages about the /dev/md9 RAID device.
Do you have any additional information about this problem?
(In reply to comment #15) > Do you have any additional information about this problem? Sorry, I made a mistake. Do you want any additional information about this problem?
Created attachment 569384 [details] This patch appears to solve my problem. At least in my environment it does: the OS boots quickly and all RAIDs are assembled. However, I don't fully understand the ATA subsystem and I'm afraid of side effects. Could you consider this patch?
Created attachment 569414 [details] original dmesg output with kernel-PAE-3.2.9-2.fc16.i686 dmesg log before applying my patch. Linux version 3.2.9-2.fc16.i686.PAE (mockbuild.fedoraproject.org) (gcc version 4.6.2 20111027 (Red Hat 4.6.2-1) (GCC) ) #1 SMP Mon Mar 5 21:04:30 UTC 2012
Created attachment 569415 [details] dmesg output with kernel-PAE-3.2.9-2.fc16.i686 after applying my patch dmesg output after applying my patch
Can you post this patch upstream to linux-kernel.org? If it is acceptable there, we will add it to the Fedora kernel.
(In reply to comment #20) > can you post this patch upstream to linux-kernel.org ? > if it is acceptable there, we will add it to the Fedora kernel. I have now sent a mail explaining the situation along with the patch.
[mass update] kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. Please retest with this update.
(In reply to comment #24) > [mass update] > kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. > Please retest with this update. The situation is unchanged. I have just started discussing it with the developer, but he doesn't think it is a problem. I sent a dmesg log from the machine with many HDDs and am waiting for his reply.
I just received a reply from the developer. I have to try some operations, which will take a few days.
Since kernel 3.2, the driver for the SiI3726 has been changed to use staggered spin-up; other port multipliers are unchanged. From a technical point of view, staggered spin-up is better, so he is opposed to reverting to the old driver behavior. But it causes boot to take a long time and some HDDs connected via the SiI3726 to be lost. So I have asked why a machine with 40 HDDs loses some of the HDDs connected to the SiI3726. The SiI3726 driver cannot identify some HDDs, and I cannot determine the reason from the log. It looks like this will take a long time...
The engineer wrote that it is a SiI3726 firmware bug related to SRST (Software Reset). Since the old driver doesn't send SRST, the bug never became apparent; from kernel 3.2.x and 3.3, the driver sends SRST and the bug surfaces. This means we have to wait for the firmware bug to be fixed. I am confirming how he will handle this problem. However, another, more important problem has been found. The engineer responded that it is not a SiI3726 problem, so I've registered it as a new bug: https://bugzilla.redhat.com/show_bug.cgi?id=807958 The patch is also a workaround for that bug.
It seems that the engineer has no plan to do anything about this problem. He has not yet denied this conclusion about the SRST problem and the ways to avoid it:
- Wait until the firmware bug is fixed
- Replace these HDD boxes
- Continue to use the kernel 3.1.x driver
So, I do not know what to do any more.
The engineer hasn't denied the conclusion; it seems he considers the discussion finished. I think that if continuing to use the kernel 3.1.x driver is OK, then the patch is OK, because the patch reverts to the 3.1.x driver behavior.
# Mass update to all open bugs. Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed. In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered. If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).
The situation is unchanged. The patch still works well.
This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.