Bug 807958 - Sometimes fails to assemble RAID1 when Sil3726 Port Multipliers are connected
Summary: Sometimes fails to assemble RAID1 when Sil3726 Port Multipliers are connected
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: i686
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-03-29 08:49 UTC by ANEZAKI, Akira
Modified: 2013-02-13 15:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-13 15:36:17 UTC
Type: ---
Embargoed:


Attachments
patch for libata-pmp.c in 3.2.x and 3.3 to back in 3.1.x. (467 bytes, patch)
2012-03-29 08:49 UTC, ANEZAKI, Akira
One external HDD box may cause this problem. (17.87 KB, application/x-bzip)
2012-03-29 08:55 UTC, ANEZAKI, Akira
2 external HDDs cause the problem. (19.17 KB, application/x-bzip)
2012-03-29 08:59 UTC, ANEZAKI, Akira
2 external HDDs cause the problem ( boot failed ) (16.05 KB, application/x-bzip)
2012-03-29 09:05 UTC, ANEZAKI, Akira
2 external HDDs cause the problem with kernel 3.3.0-4 (18.47 KB, application/x-bzip)
2012-03-29 09:11 UTC, ANEZAKI, Akira
2 external HDDs cause the problem in warm reboot (19.17 KB, application/x-bzip)
2012-03-29 09:18 UTC, ANEZAKI, Akira

Description ANEZAKI, Akira 2012-03-29 08:49:19 UTC
Created attachment 573573 [details]
patch for libata-pmp.c in 3.2.x and 3.3 to back in 3.1.x.

Description of problem:
Some RAID1 arrays fail to assemble during boot.
A RAID1 array is assembled with only one partition, or sometimes dracut does not assemble some RAID1 arrays at all. If the array that is not assembled holds the root file system, boot fails.
This problem happens on a PC to which external HDD boxes with a Sil3726 chip are connected. The more HDD boxes there are, and the more HDDs in each box, the more frequently the problem seems to occur. It happens with kernel 3.2.x or later.

Version-Release number of selected component (if applicable):
kernel 3.2.x, 3.3

How reproducible:
HDD boxes that use the Sil3726 are needed. I use 2 boxes, each containing 4 HDDs, connected to the PC with 2 eSATA cables.
The 2 HDDs inside the PC make up 3 RAID1 arrays: /dev/md0 for /boot, /dev/md1 for root, and /dev/md2 for swap. The 2 HDD boxes are used for 2 RAID5 arrays.

Steps to Reproduce:
1. Boot (the problem occurs more frequently after a power cycle, but also after a warm reboot)
  
Actual results:
A RAID1 array is assembled with only a single partition.
In rare cases, dracut doesn't assemble some RAID1 arrays at all.

Expected results:
All RAID1 arrays are assembled correctly, each with 2 partitions.
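
One way to check for this symptom after boot is to look at /proc/mdstat, where a RAID1 array assembled with a single partition shows fewer active members than expected (for example "[2/1] [U_]"). The following is a minimal Python sketch under stated assumptions: the sample text is made up in the usual mdstat layout and is not taken from the attached logs, and the helper name degraded_raid1 is hypothetical. An array that dracut did not assemble at all does not appear in /proc/mdstat, so this only detects the single-partition case.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (not from the attached logs).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda1[0] sdb1[1]
      511988 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sda2[0]
      52427704 blocks super 1.1 [2/1] [U_]

unused devices: <none>
"""


def degraded_raid1(mdstat_text):
    """Return {array: (expected, active)} for active RAID1 arrays missing members."""
    degraded = {}
    current = None  # name of the RAID1 array whose status line we expect next
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : active (\w+)", line)
        if header:
            # Only track RAID1 arrays; other levels are ignored in this sketch.
            current = header.group(1) if header.group(2) == "raid1" else None
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded


if __name__ == "__main__":
    print(degraded_raid1(SAMPLE_MDSTAT))
```

On a real system the same function could be fed the contents of /proc/mdstat, for example degraded_raid1(open("/proc/mdstat").read()).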

Additional info:
I do not understand why this patch avoids the problem, but I have not yet seen the trouble with the patched kernel or with kernel 3.1.x.

I'll attach some dmesg logs with more detailed information.

Comment 1 ANEZAKI, Akira 2012-03-29 08:55:36 UTC
Created attachment 573574 [details]
One external HDD box may cause this problem.

This is a sample in which only one external HDD box causes the problem.

The RAID1 arrays do not contain any HDDs from the external boxes.

/dev/md1 is activated with only one partition.

Comment 2 ANEZAKI, Akira 2012-03-29 08:59:34 UTC
Created attachment 573575 [details]
2 external HDDs cause the problem.

This is a sample in which 2 external HDD boxes cause the problem.
This is the log of a boot after a power cycle.

The RAID1 arrays do not contain any HDDs from the external boxes.

/dev/md1 and /dev/md2 are each activated with a single partition.

Comment 3 ANEZAKI, Akira 2012-03-29 09:05:03 UTC
Created attachment 573576 [details]
2 external HDDs cause the problem ( boot failed )

This is a sample in which boot failed.
This is the log of a boot after a power cycle.

/dev/md0 is activated, but neither /dev/md1 nor /dev/md2 is.
/dev/md1 is root, so boot failed because there is no root device.
There are no error messages, so it seems that dracut skips activating /dev/md1 and /dev/md2.

Comment 4 ANEZAKI, Akira 2012-03-29 09:11:09 UTC
Created attachment 573577 [details]
2 external HDDs cause the problem with kernel 3.3.0-4

This is the log of a boot after a power cycle.

The RAID1 arrays do not contain any HDDs from the external boxes.

/dev/md1 and /dev/md2 are each activated with a single partition.

Comment 5 ANEZAKI, Akira 2012-03-29 09:18:21 UTC
Created attachment 573590 [details]
2 external HDDs cause the problem in warm reboot

2 external HDDs cause the problem.

This is a sample in which 2 external HDD boxes cause the problem.
This is the log of a warm reboot.

The RAID1 arrays do not contain any HDDs from the external boxes.

/dev/md1 and /dev/md2 are each activated with a single partition.

Comment 6 ANEZAKI, Akira 2012-03-31 16:46:19 UTC
The problem occurs in about 30% of power cycles on the PC from which I collected the logs.

I have configured some other PCs similarly to that PC. I tried them, but the problem has not occurred on some of them. I haven't found the common condition yet.

Comment 7 Dave Jones 2012-10-23 15:33:17 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).

Comment 8 ANEZAKI, Akira 2012-10-23 16:20:35 UTC
The problem still sometimes occurs with kernel-PAE-3.6.2-1.fc16.i686.

Comment 9 ANEZAKI, Akira 2012-10-23 18:02:13 UTC
The patch still works well.

Comment 10 Christian Pandolfi 2012-12-14 18:43:54 UTC
I have the same Bug on two different machines using 
03:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)
Is this patch only for Sil 3726?

Comment 11 ANEZAKI, Akira 2012-12-14 22:13:54 UTC
(In reply to comment #10)
> I have the same Bug on two different machines using 
> 03:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA
> Raid II Controller (rev 01)
> Is this patch only for Sil 3726?

Yes, it is. The modification sets a flag so that device soft reset is not used during bootup.
The Sil 3132 is a PCI-E to SATA interface, not a port multiplier, so I think this patch will have no effect.

Comment 12 Christian Pandolfi 2012-12-15 08:56:41 UTC
afaik Sil 3132 is a port multiplier device connected with PCI-E  
ata4: controller in dubious state, performing PORT_RST
[  610.619623] ata4.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
[  610.619998] ata4.00: hard resetting link
[  610.957179] ata4.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  610.957231] ata4.01: hard resetting link
[  611.294585] ata4.01: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  611.294637] ata4.02: hard resetting link
[  611.632001] ata4.02: SATA link up 3.0 Gbps (SStatus 123 SControl 320)

and i notice the same error on three of those cards
[  560.133869] ata4.00: failed command: READ FPDMA QUEUED

	Kernel driver in use: sata_sil24
	Kernel modules: sata_sil24

Comment 13 ANEZAKI, Akira 2012-12-15 11:24:48 UTC
(In reply to comment #12)
> afaik Sil 3132 is a port multiplier device connected with PCI-E  
> ata4: controller in dubious state, performing PORT_RST
> [  610.619623] ata4.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
> [  610.619998] ata4.00: hard resetting link
> [  610.957179] ata4.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
> [  610.957231] ata4.01: hard resetting link
> [  611.294585] ata4.01: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
> [  611.294637] ata4.02: hard resetting link
> [  611.632001] ata4.02: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
> 
> and i notice the same error on three of those cards
> [  560.133869] ata4.00: failed command: READ FPDMA QUEUED
> 
> 	Kernel driver in use: sata_sil24
> 	Kernel modules: sata_sil24

Starting with kernel 3.2, libata-pmp.c was changed, only for the Sil3726, to do staggered spin-up: it sends a soft reset to each HDD. But this code change triggers a real bug somewhere (possibly including the Sil3726 firmware) on some systems that have a Sil3726 chip. I tested two more PCs with a similar configuration to see whether the bug appears on them, but it did not.
At the code level, this patch only stops the staggered spin-up, because the soft reset sometimes triggers the Sil3726 firmware bug. So the patch has no effect on FPDMA or NCQ. If the Sil3132 returns 0x3726 as its Product ID, the patch may have some effect, but I'm sorry, I don't know the Product ID value that the Sil3132 returns. If the Product ID differs (and I think it will differ), this patch has no effect because patch effects inside of Sil3132 initialization block only.

I think it is better to report it as a new bug.

Comment 14 ANEZAKI, Akira 2012-12-15 17:02:44 UTC
(In reply to comment #13)

Oops!

> this patch has no effect because patch effects inside of Sil3132
> initialization block only.

this patch has no effect because the patch takes effect only inside the Sil3726 initialization block.

is correct.

Comment 15 ANEZAKI, Akira 2012-12-15 19:17:29 UTC
(In reply to comment #10)
> I have the same Bug on two different machines using 
> 03:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA
> Raid II Controller (rev 01)
> Is this patch only for Sil 3726?

According to http://www.siliconimage.com/products/product.aspx?pid=32 , the SiI 3132 is a two-port SATA host controller that supports port multipliers. I also use these cards in some PCs, including the PC on which the bug occurs, but my cards have no port multiplier chip on them.

I don't know what chip you are using to connect the Sil 3132 to the HDDs.

In the output of the dmesg command, there may be some lines like this:
> [    7.806493] ata6.15: Port Multiplier 1.1, 0x1095:0x3726 r23, 6 ports, feat 0x1/0x9
0x1095 is the Vendor ID of Silicon Image, and 0x3726 is the Product ID. This means that the port multiplier uses a Sil3726.

If the dmesg output includes lines like the one above, the patch may have some effect.
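
The dmesg check described above can be sketched in Python as follows. This is only an illustration, assuming the "Port Multiplier" line keeps the layout quoted in this comment; the function names find_port_multipliers and has_sil3726 are hypothetical, and the sample text simply reuses the line quoted above plus one unrelated line from comment #12.

```python
import re

# Sample lines: the first is the one quoted in this comment, the second is
# an unrelated libata message from comment #12.
SAMPLE_DMESG = """\
[    7.806493] ata6.15: Port Multiplier 1.1, 0x1095:0x3726 r23, 6 ports, feat 0x1/0x9
[  610.619998] ata4.00: hard resetting link
"""

# Assumed layout: "<link>: Port Multiplier <version>, 0x<vendor>:0x<product> ..."
PMP_LINE = re.compile(
    r"(ata\d+\.\d+): Port Multiplier \S+, 0x([0-9a-fA-F]{4}):0x([0-9a-fA-F]{4})"
)

SILICON_IMAGE_VENDOR_ID = 0x1095
SIL3726_PRODUCT_ID = 0x3726


def find_port_multipliers(dmesg_text):
    """Return a list of (link, vendor_id, product_id) for each PMP line found."""
    found = []
    for line in dmesg_text.splitlines():
        match = PMP_LINE.search(line)
        if match:
            link = match.group(1)
            vendor = int(match.group(2), 16)
            product = int(match.group(3), 16)
            found.append((link, vendor, product))
    return found


def has_sil3726(dmesg_text):
    """True if any reported port multiplier identifies itself as 0x1095:0x3726."""
    return any(
        vendor == SILICON_IMAGE_VENDOR_ID and product == SIL3726_PRODUCT_ID
        for _, vendor, product in find_port_multipliers(dmesg_text)
    )


if __name__ == "__main__":
    print(find_port_multipliers(SAMPLE_DMESG))
    print(has_sil3726(SAMPLE_DMESG))
```

If has_sil3726 returns False for the full dmesg output of a machine, the port multiplier (if any) reports a different ID, and by the reasoning in comment #13 the patch would not be expected to change anything there.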

Comment 16 Fedora End Of Life 2013-01-16 14:32:46 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 17 Fedora End Of Life 2013-02-13 15:36:24 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

