Bug 807632 - Hot-swapping e-sata disks fails in kernel 3.3
Summary: Hot-swapping e-sata disks fails in kernel 3.3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-03-28 11:40 UTC by Kriton Kyrimis
Modified: 2012-04-26 03:27 UTC
CC List: 7 users

Fixed In Version: kernel-2.6.43.2-6.fc15
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-04-24 04:29:03 UTC
Type: ---
Embargoed:


Attachments
Output of dmesg for kernels 3.2.10-3 and 3.3.0-5.6 (25.48 KB, application/x-bzip2)
2012-03-28 15:31 UTC, Kriton Kyrimis
Lin Ming Patch to resolve E-Sata Hotplugging (3.18 KB, patch)
2012-03-29 10:14 UTC, Matthias Hensler

Description Kriton Kyrimis 2012-03-28 11:40:32 UTC
Description of problem:
I have two external disks, which I connect via eSATA, to take backups, alternating between the two disks. Up until kernel 3.2.10-3.fc16, this would work just fine. With kernel 3.3.0-4.fc16, the disk is recognized the first time after booting (I have only tried it with the disk already connected and powered up.) When I change the disk, I cannot get it to be recognized, until I reboot.

I have reverted to kernel 3.2.10-3.fc16 for the time being.

Version-Release number of selected component (if applicable):
3.3.0-4.fc16.x86_64

How reproducible:
Almost always. If I unplug and re-plug the disk very quickly, it might get recognized again.

Steps to Reproduce:
1. Boot the computer, with one disk already connected and powered up.
2. Mount the disk.
3. Unmount the disk and power it down.
4. Disconnect the disk, wait a bit (half a minute should be more than enough), connect the second disk, and power it up.
5. Mount the disk.
  
Actual results:
The disk is not mounted. In the console, I see messages like the following:

Mar 28 12:24:57 ######### kernel: [92113.439787] ata1: exception Emask 0x10 SAct 0x0 SErr 0x990000 action 0xe frozen
Mar 28 12:24:57 ######### kernel: [92113.439918] ata1: irq_stat 0x00400000, PHY RDY changed
Mar 28 12:24:57 ######### kernel: [92113.439997] ata1: SError: { PHYRdyChg 10B8B Dispar LinkSeq }
Mar 28 12:24:57 ######### kernel: [92113.440118] ata1: hard resetting link
Mar 28 12:24:57 ######### kernel: [92114.163030] ata1: SATA link down (SStatus 0 SControl 300)

(I have redacted the host name.)
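
To pick the link state out of a long log, a small filter like the following may help. It is only a sketch: the function name last_link_state is hypothetical, and it assumes the "ataN: SATA link up/down" wording shown in the messages above.

```shell
# Hypothetical helper: read kernel log text on stdin and print the most
# recent "SATA link up/down" state seen for each ataN port.
last_link_state() {
    awk '
        match($0, /ata[0-9]+: SATA link (up|down)/) {
            # Keep only the matched text, e.g. "ata1: SATA link down"
            s = substr($0, RSTART, RLENGTH)
            split(s, parts, ": SATA link ")
            state[parts[1]] = parts[2]   # later lines overwrite earlier ones
        }
        END { for (p in state) print p, state[p] }
    ' | sort
}
```

It would be used as `dmesg | last_link_state`; a port that stays "down" after the disk is reconnected and powered up matches the failure described here.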

Expected results:
The disk should be mounted.

Additional info:
The SATA controller on which the disks are connected is a JMicron Technology Corp. JMB360 AHCI Controller (rev 02)

I have tried rescanning with
echo "- - -" > /sys/class/scsi_host/hostN/scan
for all valid values of N, but it did not help.
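
A minimal sketch of that rescan as a loop, assuming a root shell. The function name and the optional directory argument are hypothetical, added only so the loop can be tried against a scratch directory.

```shell
# Hypothetical helper: ask every SCSI host to rescan all channels, targets and LUNs.
# The optional first argument overrides the sysfs directory (an assumption for testing).
rescan_all_hosts() {
    root="${1:-/sys/class/scsi_host}"
    for host in "$root"/host*; do
        # Skip entries without a scan file (also covers an unmatched glob)
        [ -e "$host/scan" ] || continue
        echo "- - -" > "$host/scan" && echo "rescanned $(basename "$host")"
    done
}
```

Calling `rescan_all_hosts` with no argument writes to the real /sys/class/scsi_host/hostN/scan files, which is the same thing I did by hand.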

Comment 1 Josh Boyer 2012-03-28 13:17:59 UTC
Could you try this kernel:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3937168

and post dmesg output from 3.2.10 and 3.3 as well.

Comment 2 Kriton Kyrimis 2012-03-28 15:31:51 UTC
Created attachment 573377 [details]
Output of dmesg for kernels 3.2.10-3 and 3.3.0-5.6

Comment 3 Kriton Kyrimis 2012-03-28 15:32:50 UTC
The problem persists with kernel 3.3.0-5.6. I have uploaded an attachment with the output of dmesg for kernels 3.2.10-3 and 3.3.0-5.6.

Comment 4 Matthias Hensler 2012-03-29 07:55:28 UTC
External eSATA devices are badly broken with kernel 3.3 (see also the related bug report https://bugzilla.redhat.com/show_bug.cgi?id=806676).

After a fresh boot of a 3.3 kernel you can attach an external eSATA device and use it only once. If you detach the device, you will not be able to attach a new one. Sometimes detaching also gets stuck, with a kernel irq task running at 100% (see bug 806676).

After one suspend cycle the eSATA port is no longer recognized.

The problem was already mentioned on LKML and a patch was suggested by Lin Ming (see https://lkml.org/lkml/2012/3/12/798). The last update was a week ago; as far as I can see, the patch has not yet been accepted.

Comment 5 Kriton Kyrimis 2012-03-29 09:12:29 UTC
So, I guess we wait until the bug is fixed upstream. Meanwhile, would it make sense to reinstate kernel 3.2.10-3 in the repositories, to make downgrading easier for those affected by this bug?

Comment 6 Matthias Hensler 2012-03-29 10:14:56 UTC
Created attachment 573616 [details]
Lin Ming Patch to resolve E-Sata Hotplugging

I can now confirm that the attached patch indeed resolves the problem. I added it to the current kernel from koji (3.3.0-5.fc16) and rebuilt it for x86_64. I can provide the rpms for testing.

However, since the problem exists upstream, I would suggest pushing it there.

Comment 7 Josh Boyer 2012-03-29 11:43:24 UTC
(In reply to comment #5)
> So, I guess we wait until the bug is fixed upstream. Meanwhile, would it make
> sense to reinstate kernel 3.2.10-3 in the repositories, to make downgrading
> easier for those affected by this bug?

We can't do that.

We'll look at the attached patch and see if it's suitable for backporting.

Comment 8 Kriton Kyrimis 2012-03-29 13:16:35 UTC
I can confirm that kernel-3.3.0-5, with the attached patch, fixes the problem I reported. (I used the sources from kernel-3.3.0-5.fc17.src.rpm)

Comment 9 Kriton Kyrimis 2012-04-02 12:32:08 UTC
The patch also works with kernel 3.3.0-8.

Comment 10 Josh Boyer 2012-04-04 14:50:15 UTC
I've applied the submitted patch to all Fedora branches.  It will be in the
next submitted update.

Comment 11 Fedora Update System 2012-04-05 12:51:06 UTC
kernel-3.3.1-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.3.1-3.fc17

Comment 12 Fedora Update System 2012-04-05 12:53:37 UTC
kernel-3.3.1-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.1-3.fc16

Comment 13 Fedora Update System 2012-04-05 18:25:14 UTC
Package kernel-3.3.1-3.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.3.1-3.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-5346/kernel-3.3.1-3.fc17
then log in and leave karma (feedback).

Comment 14 Fedora Update System 2012-04-08 03:27:25 UTC
kernel-3.3.1-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2012-04-11 00:27:49 UTC
kernel-3.3.1-5.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.1-5.fc16

Comment 16 Fedora Update System 2012-04-11 00:29:09 UTC
kernel-3.3.1-5.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.3.1-5.fc17

Comment 17 Fedora Update System 2012-04-11 00:30:01 UTC
kernel-2.6.43.1-5.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.43.1-5.fc15

Comment 18 Dr J Austin 2012-04-12 08:52:01 UTC
This problem is still present in
kernel-3.3.1-5.fc16 and
ja@minix ~ 1$ uname -a
Linux minix 3.3.1-5.fc17.x86_64 #1 SMP Tue Apr 10 20:42:28 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I have been providing success/failure feedback against bug 806676 (still open).
That was probably not the correct location.
Is there more useful testing I can perform?

Comment 19 Fedora Update System 2012-04-13 21:33:31 UTC
kernel-3.3.1-5.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 20 Fedora Update System 2012-04-14 00:40:59 UTC
kernel-2.6.43.2-2.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.43.2-2.fc15

Comment 21 Fedora Update System 2012-04-14 04:33:55 UTC
kernel-3.3.1-5.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 22 Dr J Austin 2012-04-14 10:12:58 UTC
Double rechecked on my machine - the problem is still present

ja@minix ~ 22$ uname -a
Linux minix 3.3.1-5.fc16.x86_64 #1 SMP Tue Apr 10 19:56:52 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Absolutely no indication in dmesg that the disk has been plugged in the second time around

----------------------------------------------------------------------
ja@minix ~ 24$ cat dmesg_disk_add_remove_add_3.3.1-5.fc16.x86_64
Clean reboot

ja@minix ~ 2$ uname -a
Linux minix 3.3.1-5.fc16.x86_64 #1 SMP Tue Apr 10 19:56:52 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Plug in disk the first time
ja@minix ~ 7$ mount |grep sd
/dev/sda4 on / type ext4 (rw,noatime,seclabel,user_xattr,acl,barrier=1,data=ordered,discard)
/dev/sda3 on /boot type ext4 (rw,noatime,seclabel,user_xattr,acl,barrier=1,data=ordered,discard)
/dev/sdb1 on /media/wd250 type ext4 (rw,relatime,seclabel,user_xattr,acl,barrier=1,data=ordered)

--------------------------------------------------------------
[  136.296735] ata4: exception Emask 0x10 SAct 0x0 SErr 0x40c0000 action 0xe frozen
[  136.296744] ata4: irq_stat 0x00000040, connection status changed
[  136.296753] ata4: SError: { CommWake 10B8B DevExch }
[  136.296767] ata4: hard resetting link
[  137.174196] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  137.174823] ata4.00: ATA-7: WDC WD2500KS-00MJB0, 02.01C03, max UDMA/133
[  137.174833] ata4.00: 488397168 sectors, multi 0: LBA48 
[  137.175519] ata4.00: configured for UDMA/133
[  137.175534] ata4: EH complete
[  137.175732] scsi 3:0:0:0: Direct-Access     ATA      WDC WD2500KS-00M 02.0 PQ: 0 ANSI: 5
[  137.176181] sd 3:0:0:0: [sdb] 488397168 512-byte logical blocks: (250 GB/232 GiB)
[  137.176435] sd 3:0:0:0: [sdb] Write Protect is off
[  137.176443] sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[  137.176634] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  137.176650] sd 3:0:0:0: Attached scsi generic sg1 type 0
[  137.221163]  sdb: sdb1
[  137.223369] sd 3:0:0:0: [sdb] Attached SCSI disk
[  137.570148] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[  137.570155] SELinux: initialized (dev sdb1, type ext4), uses xattr
-----------------------------------------------------------------
[root@minix ~]# umount /dev/sdb1
No new messages in dmesg 

[  136.296735] ata4: exception Emask 0x10 SAct 0x0 SErr 0x40c0000 action 0xe frozen
[  136.296744] ata4: irq_stat 0x00000040, connection status changed
[  136.296753] ata4: SError: { CommWake 10B8B DevExch }
[  136.296767] ata4: hard resetting link
[  137.174196] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  137.174823] ata4.00: ATA-7: WDC WD2500KS-00MJB0, 02.01C03, max UDMA/133
[  137.174833] ata4.00: 488397168 sectors, multi 0: LBA48 
[  137.175519] ata4.00: configured for UDMA/133
[  137.175534] ata4: EH complete
[  137.175732] scsi 3:0:0:0: Direct-Access     ATA      WDC WD2500KS-00M 02.0 PQ: 0 ANSI: 5
[  137.176181] sd 3:0:0:0: [sdb] 488397168 512-byte logical blocks: (250 GB/232 GiB)
[  137.176435] sd 3:0:0:0: [sdb] Write Protect is off
[  137.176443] sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[  137.176634] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  137.176650] sd 3:0:0:0: Attached scsi generic sg1 type 0
[  137.221163]  sdb: sdb1
[  137.223369] sd 3:0:0:0: [sdb] Attached SCSI disk
[  137.570148] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[  137.570155] SELinux: initialized (dev sdb1, type ext4), uses xattr
------------------------------------------------------------------
Unplug disk

[  372.979293] ata4: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
[  372.979302] ata4: irq_stat 0x00400000, PHY RDY changed
[  372.979312] ata4: SError: { RecovComm Persist PHYRdyChg 10B8B }
[  372.979323] ata4: hard resetting link
[  373.702176] ata4: SATA link down (SStatus 0 SControl 300)
[  378.702211] ata4: hard resetting link
[  379.007177] ata4: SATA link down (SStatus 0 SControl 300)
[  379.007195] ata4: limiting SATA link speed to 1.5 Gbps
[  384.007200] ata4: hard resetting link
[  384.312177] ata4: SATA link down (SStatus 0 SControl 310)
[  384.312194] ata4.00: disabled
[  384.312220] ata4: EH complete
[  384.312237] ata4.00: detaching (SCSI 3:0:0:0)
[  384.314742] sd 3:0:0:0: [sdb] Synchronizing SCSI cache
[  384.314810] sd 3:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  384.314820] sd 3:0:0:0: [sdb] Stopping disk
[  384.314837] sd 3:0:0:0: [sdb] START_STOP FAILED
[  384.314842] sd 3:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
-------------------------------------------------------------
Plug in disk again
No change in dmesg !!
No indication that the disk has been plugged in

[  372.979293] ata4: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
[  372.979302] ata4: irq_stat 0x00400000, PHY RDY changed
[  372.979312] ata4: SError: { RecovComm Persist PHYRdyChg 10B8B }
[  372.979323] ata4: hard resetting link
[  373.702176] ata4: SATA link down (SStatus 0 SControl 300)
[  378.702211] ata4: hard resetting link
[  379.007177] ata4: SATA link down (SStatus 0 SControl 300)
[  379.007195] ata4: limiting SATA link speed to 1.5 Gbps
[  384.007200] ata4: hard resetting link
[  384.312177] ata4: SATA link down (SStatus 0 SControl 310)
[  384.312194] ata4.00: disabled
[  384.312220] ata4: EH complete
[  384.312237] ata4.00: detaching (SCSI 3:0:0:0)
[  384.314742] sd 3:0:0:0: [sdb] Synchronizing SCSI cache
[  384.314810] sd 3:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  384.314820] sd 3:0:0:0: [sdb] Stopping disk
[  384.314837] sd 3:0:0:0: [sdb] START_STOP FAILED
[  384.314842] sd 3:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

Comment 23 Dr J Austin 2012-04-14 12:59:52 UTC
It looks as if this may be my/the problem

http://www.spinics.net/lists/linux-ide/msg43173.html
(Date: Fri, 13 Apr 2012 09:24:07 +0800)

"...
> The fundamental problem with this patch is that all SATA ports are 
> hotpluggable... even the ones the firmware/silicon failed to mark as 
> hotpluggable via AHCI's PORT_CMD_MPSP | PORT_CMD_HPCP

So the acceptable solution is to add runtime pm support for hotpluggable
port.

I'll send new patches.

Thanks,
Lin Ming"

Comment 24 Dr J Austin 2012-04-18 09:47:40 UTC
There is more activity on this bug here (it is obviously not closed!)
(I am unsure of the relationship between Red Hat Bugzilla and the people
referenced below)

http://www.spinics.net/lists/linux-ide/msg43236.html

To: Mark Lord <kernel@xxxxxxxxxxxx>
Subject: Re: Hotplug borked after suspend/resume in Linux-3.3 ?
From: Jeff Garzik <jgarzik@xxxxxxxxx>
Date: Wed, 18 Apr 2012 02:18:40 -0400
Cc: Lin Ming <ming.m.lin@xxxxxxxxx>, Tejun Heo <htejun@xxxxxxxxx>, linux-ide@xxxxxxxxxxxxxxx
In-reply-to: <4F8E1D0A.8000203>
List-id: <linux-ide.vger.kernel.org>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329 Thunderbird/11.0.1
On 04/17/2012 09:46 PM, Mark Lord wrote:
On 12-04-17 09:37 PM, Mark Lord wrote:
On 12-04-17 09:29 PM, Lin Ming wrote:

I'm working on the hotplug issue fix.

Before the fix is ready, here is the one-line patch.
Could you give it a try?
..
--- a/drivers/ata/libata-transport.c
+++ b/drivers/ata/libata-transport.c
@@ -294,6 +294,7 @@ int ata_tport_add(struct device *parent,
  	device_enable_async_suspend(dev);
  	pm_runtime_set_active(dev);
  	pm_runtime_enable(dev);
+	pm_runtime_forbid(dev);

..

I'm rebuilding the kernel right now.. should take about 5min or less to test.

Yeah, that (by itself) is enough to make things work again.
This looks like the one-liner that we really need upstream and in -stable.

Jeff?

I'll now use it instead of the (much larger) "v2 disable runtime pm" thing.

Yeah I like that a -whole- lot better...

Will push upstream tomorrow.
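
The effect of that one-liner can be checked from user space through the power/control attribute, which reads "on" when runtime PM is forbidden for a device and "auto" when it is allowed. A rough sketch, with a hypothetical function name; which sysfs directories hold the ATA port devices is an assumption and is left to the caller:

```shell
# Hypothetical helper: print the runtime-PM policy of each device directory given.
# "on" means runtime PM is forbidden (what pm_runtime_forbid() sets);
# "auto" means the device may be runtime-suspended.
show_runtime_pm() {
    for dev in "$@"; do
        ctl="$dev/power/control"
        # Skip directories that do not expose the attribute
        [ -r "$ctl" ] || continue
        printf '%s: %s\n' "$(basename "$dev")" "$(cat "$ctl")"
    done
}
```

If the port devices appear under the controller's PCI device, the call might look like `show_runtime_pm /sys/bus/pci/devices/*/ata*`; that path is a guess and may differ between kernel versions.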

Comment 25 Josh Boyer 2012-04-18 12:40:16 UTC
(In reply to comment #24)
> There is more activity on this bug here (it is obviously not closed!)
> (I am unsure of the relationship between Red Hat Bugzilla and the people
> referenced below)
> 
> http://www.spinics.net/lists/linux-ide/msg43236.html
> 
> To: Mark Lord <kernel@xxxxxxxxxxxx>
> Subject: Re: Hotplug borked after suspend/resume in Linux-3.3 ?
> From: Jeff Garzik <jgarzik@xxxxxxxxx>
> Date: Wed, 18 Apr 2012 02:18:40 -0400
> Cc: Lin Ming <ming.m.lin@xxxxxxxxx>, Tejun Heo <htejun@xxxxxxxxx>,
> linux-ide@xxxxxxxxxxxxxxx
> In-reply-to: <4F8E1D0A.8000203>
> List-id: <linux-ide.vger.kernel.org>
> User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329
> Thunderbird/11.0.1
> On 04/17/2012 09:46 PM, Mark Lord wrote:
> On 12-04-17 09:37 PM, Mark Lord wrote:
> On 12-04-17 09:29 PM, Lin Ming wrote:
> 
> I'm working on the hotplug issue fix.
> 
> Before the fix is ready, here is the one-line patch.
> Could you give it a try?
> ..
> --- a/drivers/ata/libata-transport.c
> +++ b/drivers/ata/libata-transport.c
> @@ -294,6 +294,7 @@ int ata_tport_add(struct device *parent,
>    device_enable_async_suspend(dev);
>    pm_runtime_set_active(dev);
>    pm_runtime_enable(dev);
> + pm_runtime_forbid(dev);

Here is a scratch build with the original patch removed, and the one above added.  When it completes, I would appreciate if you could test it and let me know how it works.

http://koji.fedoraproject.org/koji/taskinfo?taskID=4001505

Comment 26 Dr J Austin 2012-04-18 17:13:19 UTC
ja@minix ~ 7$ uname -a
Linux minix 3.3.2-3.1.fc16.x86_64 #1 SMP Wed Apr 18 12:51:06 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

It is now possible to plugin/mount/umount/unplug external eSATAp devices
(I have tried two, SSD and mechanical disks, several times)
as and when required without a reboot being necessary!
Many thanks
John

Comment 27 Josh Boyer 2012-04-18 17:31:42 UTC
(In reply to comment #26)
> ja@minix ~ 7$ uname -a
> Linux minix 3.3.2-3.1.fc16.x86_64 #1 SMP Wed Apr 18 12:51:06 UTC 2012 x86_64
> x86_64 x86_64 GNU/Linux
> 
> It is now possible to plugin/mount/umount/unplug external eSATAp devices
> (I have tried two, SSD and mechanical disks, several times)
> as and when required without a reboot being necessary!

Great.  Thank you for testing.  I'll get this patch rolled into the next update later today.

Comment 28 Kriton Kyrimis 2012-04-19 07:22:23 UTC
Kernel 3.3.2-3.1.fc16.x86_64 fixed the bug in my case, as well, where kernel 3.3.1-5.fc16.x86_64 had also worked.

Comment 29 Fedora Update System 2012-04-21 15:22:03 UTC
kernel-3.3.2-8.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.3.2-8.fc17

Comment 30 Fedora Update System 2012-04-21 16:26:06 UTC
kernel-3.3.2-6.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.2-6.fc16

Comment 31 Fedora Update System 2012-04-21 16:46:09 UTC
kernel-2.6.43.2-6.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.43.2-6.fc15

Comment 32 Fedora Update System 2012-04-21 21:07:41 UTC
Package kernel-3.3.2-8.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.3.2-8.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-6344/kernel-3.3.2-8.fc17
then log in and leave karma (feedback).

Comment 33 Fedora Update System 2012-04-24 04:29:03 UTC
kernel-3.3.2-8.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 34 Fedora Update System 2012-04-24 14:53:53 UTC
kernel-3.3.2-6.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 35 Fedora Update System 2012-04-26 03:27:13 UTC
kernel-2.6.43.2-6.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.

