787468 – boot fails by timeout while activating RAIDs with many HDDs

Summary: boot fails by timeout while activating RAIDs with many HDDs

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	16
Hardware:	i686
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-02-05 10:54 UTC by ANEZAKI, Akira
Modified:	2013-02-13 15:35 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-02-13 15:35:32 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
I (12.16 KB, application/x-bzip) 2012-02-06 14:50 UTC, ANEZAKI, Akira	no flags	Details
dmesg output with kernel-PAE-3.2.3-2.fc16.i686 (22.60 KB, application/x-bzip) 2012-02-06 15:33 UTC, ANEZAKI, Akira	no flags	Details
dmesg output with kernel 3.1.x ( works fine ) (24.53 KB, application/x-bzip) 2012-02-07 20:10 UTC, ANEZAKI, Akira	no flags	Details
dmesg output with kernel-PAE-3.2.5-3.fc16.i686 (22.75 KB, application/x-bzip2) 2012-02-23 08:51 UTC, ANEZAKI, Akira	no flags	Details
dmesg output with kernel-PAE-3.2.6-3.fc16.i686 (22.66 KB, application/x-bzip2) 2012-02-23 08:53 UTC, ANEZAKI, Akira	no flags	Details
dmesg output with kernel-PAE-3.2.7-1.fc16.i686 (23.24 KB, application/x-bzip2) 2012-02-23 09:09 UTC, ANEZAKI, Akira	no flags	Details
dmesg output with kernel 3.1.x ( works fine ) (23.20 KB, application/x-bzip2) 2012-02-23 09:14 UTC, ANEZAKI, Akira	no flags	Details
This patch looks to solve my problem. (467 bytes, patch) 2012-03-12 13:05 UTC, ANEZAKI, Akira	no flags	Details \| Diff
original dmesg output with kernel-PAE-3.2.9-2.fc16.i686 (22.95 KB, application/x-bzip2) 2012-03-12 14:06 UTC, ANEZAKI, Akira	no flags	Details
dmesg output with kernel-PAE-3.2.9-2.fc16.i686 after applying my patch (23.50 KB, application/x-bzip2) 2012-03-12 14:08 UTC, ANEZAKI, Akira	no flags	Details

Description ANEZAKI, Akira 2012-02-05 10:54:42 UTC

Description of problem:
Starting with kernel-3.2.1, boot fails with a timeout and drops into the maintenance console.
Up to kernel-3.1.x, boot succeeds and all RAIDs are activated correctly.
With kernel-3.1.x, all HDDs are scanned and all RAIDs are activated quickly.
With kernel-3.2.x, however, the kernel scans the HDDs repeatedly, which takes a long time,
and boot finally fails with a timeout.
I checked the RAID status from the maintenance console.
Some RAIDs were activated successfully, but some HDDs could not be accessed and some RAIDs failed to activate.

Version-Release number of selected component (if applicable):
kernel-PAE-3.2.1-3.fc16.i686.rpm, kernel-PAE-3.2.2-1.fc16.rpm

How reproducible:
Build some RAIDs with many HDDs.
I use 40 HDDs. Two of them have three partitions each; the others have no partitions.

These are the contents of /proc/mdstat with kernel-3.1.x.
----
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0] 
md11 : active raid0 sdm[1] sdk[0]
      273508864 blocks super 1.2 512k chunks
      
md12 : active raid0 sdy[0] sdaa[1]
      372600832 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0] sdb1[1]
      204736 blocks [2/2] [UU]
      
md7 : active raid5 sds[0] sdt[3] sdr[2] sdq[1]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
md9 : active raid6 md12[8] sdz[4] sdu[0] sdad[7] sdx[3] sdac[2] sdw[6] sdab[5] sdv[1]
      2187980032 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      
md8 : active raid6 md11[8] sdo[5] sdg[1] sdp[0] sdh[2] sdl[6] sdn[7] sdi[3] sdj[4]
      1709382080 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      
md6 : active raid5 sdd[2] sdf[0] sde[1] sdc[3]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
md2 : active raid1 sdb3[1] sda3[0]
      8388544 blocks [2/2] [UU]
      
md1 : active raid1 sdb2[1] sda2[0]
      968166528 blocks [2/2] [UU]
----

Steps to Reproduce:
1. Install kernel-3.2.x
2. Reboot
  
Actual results:
Boot spends a long time scanning the HDDs many times and then drops into the maintenance console. Some HDDs cannot be accessed via their device files (/dev/sd?). Some RAIDs fail to activate.

Expected results:
Boot completes quickly and all RAIDs are activated as shown above.

Additional info:
kernel-3.1.x works correctly.

Comment 1 ANEZAKI, Akira 2012-02-06 07:12:31 UTC

Sometimes the kernel corrupts superblocks while booting, so recovery requires commands like these:
$ /sbin/mdadm --zero-superblock /dev/sda2
$ /sbin/mdadm /dev/md1 --add /dev/sda2
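
Before zeroing a superblock it helps to confirm which arrays are actually degraded or inactive. A minimal sketch, assuming the /proc/mdstat layout shown in the description; `degraded_arrays` is a hypothetical helper name, not an mdadm feature:

```shell
# Print md arrays that are inactive or have a missing member.
# $1: path to an mdstat-format file (defaults to /proc/mdstat).
degraded_arrays() {
    awk '
        # Array header line, e.g. "md1 : active raid1 sdb2[1] sda2[0]"
        /^md[0-9]+ :/ {
            name = $1
            if ($3 == "inactive") print name, "inactive"
            next
        }
        # Status line with a member map such as [UU] or [U_];
        # an underscore marks a missing member.
        name != "" && match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_") > 0) print name, "degraded", map
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Called with no argument it reads /proc/mdstat; passing a file name checks a saved copy instead. No output means every listed array is active with all members present.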

Comment 2 ANEZAKI, Akira 2012-02-06 11:52:02 UTC

kernel-PAE-3.2.3-2.fc16.i686.rpm also fails.

Comment 3 Dave Jones 2012-02-06 14:11:58 UTC

we need to see the debug logs from dracut to really figure out what's going wrong here.

Comment 4 ANEZAKI, Akira 2012-02-06 14:30:04 UTC

(In reply to comment #3)
> we need to see the debug logs from dracut to really figure out what's going
> wrong here.

How can I get the debug logs? I thought dracut was only invoked while installing a kernel.
I'll try to collect the information the way you describe and send it.

Comment 5 Dave Jones 2012-02-06 14:48:54 UTC

when you get dropped to the debug shell, you should be able to redirect dmesg output somewhere.
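
The redirect itself is just `dmesg > FILE`, where FILE is any writable path of your choosing. Once a log is saved, a rough way to see which ATA ports produce the most messages (for instance when the same disks are probed repeatedly) is to count lines per port. This is only a sketch: `ata_message_counts` is a hypothetical helper, and it assumes the usual libata message prefixes of the form `ataN:` or `ataN.MM:`.

```shell
# Read a kernel log on stdin and print "<port> <count>" for each ATA
# port or link that appears as a message prefix.
ata_message_counts() {
    awk '
        # Loose match on prefixes such as "ata7: " or "ata7.00: ".
        match($0, /ata[0-9]+(\.[0-9]+)?: /) {
            port = substr($0, RSTART, RLENGTH - 2)   # drop the trailing ": "
            count[port]++
        }
        END { for (p in count) print p, count[p] }
    ' | LC_ALL=C sort
}
```

For example, `ata_message_counts < FILE` on logs from a working and a failing kernel gives two short tables that are easier to compare than the full dmesg output.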

Comment 6 ANEZAKI, Akira 2012-02-06 14:50:39 UTC

Created attachment 559652 [details]
I

Comment 7 ANEZAKI, Akira 2012-02-06 14:51:52 UTC

Sorry, I'll try it soon.

Comment 8 ANEZAKI, Akira 2012-02-06 15:33:18 UTC

Created attachment 559662 [details]
dmesg output with kernel-PAE-3.2.3-2.fc16.i686

I installed kernel-PAE-3.2.3-2.fc16.i686 and kernel-PAE-devel-3.2.3-2.fc16.i686 and rebooted. Booting failed and dropped into the debug shell. I ran dmesg; the output is attached.

PS.
Sorry for the delay. e2fsck was running on a huge RAID file system.

Comment 9 ANEZAKI, Akira 2012-02-07 20:10:24 UTC

Created attachment 560046 [details]
dmesg output with kernel 3.1.x ( works fine )

After that, the machine began to fail to boot because the root partition had some errors. The RAID0 for the root partition was also broken.

Finally, I fixed the problems and got a dmesg sample. The kernel was not an official package, but the dmesg output may help identify the problem.

Thank you for your support.

Comment 10 ANEZAKI, Akira 2012-02-23 08:51:34 UTC

Created attachment 565229 [details]
dmesg output with kernel-PAE-3.2.5-3.fc16.i686

Comment 11 ANEZAKI, Akira 2012-02-23 08:53:45 UTC

Created attachment 565230 [details]
dmesg output with kernel-PAE-3.2.6-3.fc16.i686

Comment 12 ANEZAKI, Akira 2012-02-23 09:03:50 UTC

I changed the RAID configuration because of some HDD problems.
These are the contents of /proc/mdstat with kernel-3.1.x.
----
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0] 
md0 : active raid1 sda1[0] sdb1[1]
      204736 blocks [2/2] [UU]
      
md11 : active raid0 sdk[0] sdy[2] sdm[1]
      333539840 blocks super 1.2 512k chunks
      
md7 : active raid5 sdq[1] sdt[3] sdr[2] sds[0]
      1465151808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
md9 : active raid6 sdz[4] sdu[0] sdx[3] sdw[6] sdab[5] sdad[7] sdaa[8] sdac[2] sdv[1]
      2187980032 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]

      
md8 : active raid6 md11[8] sdh[2] sdo[0] sdg[1] sdi[3] sdl[6] sdp[5] sdn[7] sdj[4]
      1709382080 blocks level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      
md6 : active raid5 sdf[0] sde[1] sdd[2] sdc[3]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
md1 : active raid1 sdb2[1] sda2[0]
      968166528 blocks [2/2] [UU]
      
md2 : active raid1 sda3[0] sdb3[1]
      8388544 blocks [2/2] [UU]
      
unused devices: <none>
----

Comment 13 ANEZAKI, Akira 2012-02-23 09:09:19 UTC

Created attachment 565233 [details]
dmesg output with kernel-PAE-3.2.7-1.fc16.i686

I captured this log after changing the RAID configuration.

Comment 14 ANEZAKI, Akira 2012-02-23 09:14:50 UTC

Created attachment 565236 [details]
dmesg output with kernel 3.1.x ( works fine )

I captured this log after changing the RAID configuration.
Some superblocks were corrupted after trying kernel-PAE-3.2.7-1, so this log contains some error messages about the /dev/md9 RAID device.

Comment 15 ANEZAKI, Akira 2012-02-23 09:19:39 UTC

Do you have any additional information about this problem?

Comment 16 ANEZAKI, Akira 2012-02-23 15:20:49 UTC

(In reply to comment #15)
> Do you have any additional information about this problem?

Sorry, I made a mistake.

Do you want any additional information about this problem?

Comment 17 ANEZAKI, Akira 2012-03-12 13:05:00 UTC

Created attachment 569384 [details]
This patch looks to solve my problem.

At least, this patch solves my problem: the OS boots quickly and all RAIDs are assembled.
However, I do not understand the ata subsystem, and I am worried about side effects.

Could you consider this patch?

Comment 18 ANEZAKI, Akira 2012-03-12 14:06:07 UTC

Created attachment 569414 [details]
original dmesg output with kernel-PAE-3.2.9-2.fc16.i686

dmesg log before applying my patch.
 Linux version 3.2.9-2.fc16.i686.PAE (mockbuild.fedoraproject.org) (gcc version 4.6.2 20111027 (Red Hat 4.6.2-1) (GCC) ) #1 SMP Mon Mar 5 21:04:30 UTC 2012

Comment 19 ANEZAKI, Akira 2012-03-12 14:08:44 UTC

Created attachment 569415 [details]
dmesg output with kernel-PAE-3.2.9-2.fc16.i686 after applying my patch

dmesg output after applying my patch

Comment 20 Dave Jones 2012-03-12 20:30:43 UTC

can you post this patch upstream to linux-kernel.org ?
if it is acceptable there, we will add it to the Fedora kernel.

Comment 21 ANEZAKI, Akira 2012-03-13 04:37:33 UTC

(In reply to comment #20)
> can you post this patch upstream to linux-kernel.org ?
> if it is acceptable there, we will add it to the Fedora kernel.

I have now sent a mail that explains the situation and includes the patch.

Comment 22 Dave Jones 2012-03-22 16:54:04 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 23 Dave Jones 2012-03-22 16:57:42 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 24 Dave Jones 2012-03-22 17:08:55 UTC

[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 25 ANEZAKI, Akira 2012-03-22 17:11:08 UTC

(In reply to comment #24)
> [mass update]
> kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
> Please retest with this update.

The situation has not changed.

I have just started discussing it with the developer, but he does not think it is a problem.
I sent him the dmesg log with many HDDs and am waiting for his reply.

Comment 26 ANEZAKI, Akira 2012-03-22 17:25:53 UTC

I have just received a reply from the developer. I have to try some operations, which will take a few days.

Comment 27 ANEZAKI, Akira 2012-03-24 02:57:21 UTC

Since kernel 3.2, the driver for the SiI3726 has been changed to use staggered spin-up. Other port multipliers are unchanged.
From a technical point of view, staggered spin-up is better, so he is opposed to going back to the old driver behavior. But it makes boot take a long time and causes some HDDs connected via the SiI3726 to be lost.

So I have asked why the 40-HDD machine loses some HDDs connected to the SiI3726. The SiI3726 driver cannot identify some HDDs, and I cannot work out the reason from the log. It looks like this will take a long time...
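
To check whether a machine has this port multiplier at all, one option is to look for its identification line in a saved kernel log. The sketch below assumes that libata reports port multipliers in the form `ataN.MM: Port Multiplier ..., 0xVVVV:0xDDDD ...` and that the SiI3726 identifies itself as 0x1095:0x3726; both are assumptions to verify against your own dmesg output, and `sii3726_links` is a hypothetical helper name.

```shell
# Read a kernel log on stdin and print the ATA link name (e.g. "ata7.15")
# of every port multiplier reported with the IDs 0x1095:0x3726.
# The message layout and the ID pair are assumptions; see the note above.
sii3726_links() {
    sed -n 's/.*\(ata[0-9][0-9]*\.[0-9][0-9]*\): Port Multiplier .*0x1095:0x3726.*/\1/p'
}
```

Running `sii3726_links < FILE` on a saved log prints one link per matching port multiplier and nothing if none is found.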

Comment 28 ANEZAKI, Akira 2012-03-30 03:49:24 UTC

The engineer wrote that this is a SiI3726 firmware bug involving SRST (software reset).
Since the old driver does not send SRST, the bug does not show up. In kernel 3.2.x and 3.3, the driver sends SRST and the bug shows up. This means we have to wait until the firmware bug is fixed.

I am checking how he intends to handle this problem.

However, another, more important problem has been found. The engineer told me that it is not a SiI3726 problem, so I have filed it as a new bug.

https://bugzilla.redhat.com/show_bug.cgi?id=807958

The patch is also a workaround for this bug.

Comment 29 ANEZAKI, Akira 2012-03-30 16:45:08 UTC

It seems that the engineer has no plan to do anything about this problem.

He has not yet rejected my conclusion about the SRST problem or the ways to avoid it:
  Wait until the firmware bug is fixed
  Replace these HDD boxes
  Continue to use the kernel 3.1.x driver

So I do not know what more to do.

Comment 30 ANEZAKI, Akira 2012-04-01 00:37:52 UTC

The engineer has not rejected the conclusion. It seems that he considers the discussion finished.

I think that if continuing to use the kernel 3.1.x driver is acceptable, then the patch is acceptable too, because it simply goes back to the 3.1.x driver behavior.

Comment 31 Dave Jones 2012-10-23 15:39:11 UTC

# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).

Comment 32 ANEZAKI, Akira 2012-10-23 18:57:54 UTC

The situation has not changed. The patch still works well.

Comment 33 Fedora End Of Life 2013-01-16 14:32:13 UTC

This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 34 Fedora End Of Life 2013-02-13 15:35:37 UTC

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
