Bug 807632

Summary: Hot-swapping e-sata disks fails in kernel 3.3
Product: [Fedora] Fedora
Reporter: Kriton Kyrimis <kyrimis>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 16
CC: gansalmon, itamar, ja, jonathan, kernel-maint, madhu.chinakonda, mails.bugzilla.redhat.com
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.43.2-6.fc15
Doc Type: Bug Fix
Last Closed: 2012-04-24 04:29:03 UTC Type: ---
Attachments:
Description | Flags
Output of dmesg for kernels 3.2.10-3 and 3.3.0-5.6 | none
Lin Ming Patch to resolve E-Sata Hotplugging | none

Description Kriton Kyrimis 2012-03-28 11:40:32 UTC
Description of problem:
I have two external disks, which I connect via eSATA, to take backups, alternating between the two disks. Up until kernel 3.2.10-3.fc16, this would work just fine. With kernel 3.3.0-4.fc16, the disk is recognized the first time after booting (I have only tried it with the disk already connected and powered up.) When I change the disk, I cannot get it to be recognized, until I reboot.

I have reverted to kernel 3.2.10-3.fc16 for the time being.

Version-Release number of selected component (if applicable):
3.3.0-4.fc16.x86_64

How reproducible:
Almost always. If I unplug and re-plug the disk very quickly, it might get recognized again.

Steps to Reproduce:
1. Boot the computer, with one disk already connected and powered up.
2. Mount the disk.
3. Unmount the disk and power it down.
4. Disconnect the disk, wait a bit (half a minute should be more than enough), connect the second disk, and power it up.
5. Mount the disk.
  
Actual results:
The disk is not mounted. In the console, I see messages like the following:

Mar 28 12:24:57 ######### kernel: [92113.439787] ata1: exception Emask 0x10 SAct 0x0 SErr 0x990000 action 0xe frozen
Mar 28 12:24:57 ######### kernel: [92113.439918] ata1: irq_stat 0x00400000, PHY RDY changed
Mar 28 12:24:57 ######### kernel: [92113.439997] ata1: SError: { PHYRdyChg 10B8B Dispar LinkSeq }
Mar 28 12:24:57 ######### kernel: [92113.440118] ata1: hard resetting link
Mar 28 12:24:57 ######### kernel: [92114.163030] ata1: SATA link down (SStatus 0 SControl 300)

(I have redacted the host name.)
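
The SErr value in these messages is a bitmask, and the names inside the SError braces are its decoded bits. A minimal sketch of that decoding follows. It assumes the SATA SError bit positions and short names as libata prints them; the table is reproduced from memory rather than from this report, and `decode_serror` is a name made up for illustration.

```python
# Bit positions in the SATA SError register and the short names libata
# prints for them. Assumed from the libata sources, not stated in this report.
SERROR_BITS = [
    (0, "RecovData"),
    (1, "RecovComm"),
    (8, "UnrecovData"),
    (9, "Persist"),
    (10, "Proto"),
    (11, "HostInt"),
    (16, "PHYRdyChg"),
    (17, "PHYInt"),
    (18, "CommWake"),
    (19, "10B8B"),
    (20, "Dispar"),
    (21, "BadCRC"),
    (22, "Handshk"),
    (23, "LinkSeq"),
    (24, "TrStaTrns"),
    (25, "UnrecFIS"),
    (26, "DevExch"),
]


def decode_serror(value):
    """Return the names of the bits set in an SErr value, lowest bit first."""
    return [name for bit, name in SERROR_BITS if value & (1 << bit)]


if __name__ == "__main__":
    # 0x990000 is the SErr value from the log excerpt above.
    print(" ".join(decode_serror(0x990000)))
```

For 0x990000 this gives PHYRdyChg, 10B8B, Dispar and LinkSeq, which matches the SError line in the log.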

Expected results:
The disk should be mounted.

Additional info:
The SATA controller on which the disks are connected is a JMicron Technology Corp. JMB360 AHCI Controller (rev 02)

I have tried rescanning with
echo "- - -" > /sys/class/scsi_host/hostN/scan
for all valid values of N, but it did not help.
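
For anyone repeating this test, the rescan over every host can be scripted. This is only a sketch of the manual step above: the function names and the `sysfs_root` parameter are made up for illustration (the parameter exists so the code can be tried against a scratch directory), and on a real system it has to run as root.

```python
import glob
import os


def scan_files(sysfs_root="/sys"):
    """Return the scan file of every SCSI host found under sysfs_root."""
    pattern = os.path.join(sysfs_root, "class", "scsi_host", "host*", "scan")
    return sorted(glob.glob(pattern))


def rescan_all(sysfs_root="/sys"):
    """Write the wildcard "- - -" (any channel, target, LUN) to each scan file."""
    touched = []
    for path in scan_files(sysfs_root):
        with open(path, "w") as f:
            f.write("- - -\n")
        touched.append(path)
    return touched


if __name__ == "__main__":
    for path in rescan_all():
        print("rescanned", path)
```

As reported above, the rescan did not bring the disk back on the affected kernel, so this is a diagnostic step rather than a workaround.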

Comment 1 Josh Boyer 2012-03-28 13:17:59 UTC
Could you try this kernel:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3937168

and post dmesg output from 3.2.10 and 3.3 as well.

Comment 2 Kriton Kyrimis 2012-03-28 15:31:51 UTC
Created attachment 573377 [details]
Output of dmesg for kernels 3.2.10-3 and 3.3.0-5.6

Comment 3 Kriton Kyrimis 2012-03-28 15:32:50 UTC
The problem persists with kernel 3.3.0-5.6. I have uploaded an attachment with the output of dmesg for kernels 3.2.10-3 and 3.3.0-5.6.

Comment 4 Matthias Hensler 2012-03-29 07:55:28 UTC
External eSATA devices are heavily broken with kernel 3.3 (see also the related bug report https://bugzilla.redhat.com/show_bug.cgi?id=806676).

After a fresh boot of a 3.3 kernel, you can attach and use an external eSATA device only once. If you detach the device, you will not be able to attach a new one. Sometimes detaching also gets stuck, with a kernel irq task running at 100% (see bug 806676).

After one suspend cycle, the eSATA port is no longer recognized.

The problem was already mentioned on LKML, and a patch was suggested by Lin Ming (see https://lkml.org/lkml/2012/3/12/798). The last update was a week ago; as far as I can see, the patch has not yet been accepted.

Comment 5 Kriton Kyrimis 2012-03-29 09:12:29 UTC
So, I guess we wait until the bug is fixed upstream. Meanwhile, would it make sense to reinstate kernel 3.2.10-3 in the repositories, to make downgrading easier for those affected by this bug?

Comment 6 Matthias Hensler 2012-03-29 10:14:56 UTC
Created attachment 573616 [details]
Lin Ming Patch to resolve E-Sata Hotplugging

I can now confirm that the attached patch indeed resolves the problem. I added it to the current kernel from koji (3.3.0-5.fc16) and rebuilt it for x86_64. I can provide the rpms for testing.

However, since the problem exists upstream, I would suggest pushing it there.

Comment 7 Josh Boyer 2012-03-29 11:43:24 UTC
(In reply to comment #5)
> So, I guess we wait until the bug is fixed upstream. Meanwhile, would it make
> sense to reinstate kernel 3.2.10-3 in the repositories, to make downgrading
> easier for those affected by this bug?

We can't do that.

We'll look at the attached patch and see if it's suitable for backporting.

Comment 8 Kriton Kyrimis 2012-03-29 13:16:35 UTC
I can confirm that kernel-3.3.0-5, with the attached patch, fixes the problem I reported. (I used the sources from kernel-3.3.0-5.fc17.src.rpm)

Comment 9 Kriton Kyrimis 2012-04-02 12:32:08 UTC
The patch also works with kernel 3.3.0-8.

Comment 10 Josh Boyer 2012-04-04 14:50:15 UTC
I've applied the submitted patch to all Fedora branches.  It will be in the
next submitted update.

Comment 11 Fedora Update System 2012-04-05 12:51:06 UTC
kernel-3.3.1-3.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.3.1-3.fc17

Comment 12 Fedora Update System 2012-04-05 12:53:37 UTC
kernel-3.3.1-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.1-3.fc16

Comment 13 Fedora Update System 2012-04-05 18:25:14 UTC
Package kernel-3.3.1-3.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.3.1-3.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-5346/kernel-3.3.1-3.fc17
then log in and leave karma (feedback).

Comment 14 Fedora Update System 2012-04-08 03:27:25 UTC
kernel-3.3.1-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2012-04-11 00:27:49 UTC
kernel-3.3.1-5.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.1-5.fc16

Comment 16 Fedora Update System 2012-04-11 00:29:09 UTC
kernel-3.3.1-5.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.3.1-5.fc17

Comment 17 Fedora Update System 2012-04-11 00:30:01 UTC
kernel-2.6.43.1-5.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.43.1-5.fc15

Comment 18 Dr J Austin 2012-04-12 08:52:01 UTC
This problem is still present in
kernel-3.3.1-5.fc16 and
ja@minix ~ 1$ uname -a
Linux minix 3.3.1-5.fc17.x86_64 #1 SMP Tue Apr 10 20:42:28 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I have been providing success/failure feedback against 806676 (still open).
That was probably not the correct location.
Is there more useful testing I can perform?

Comment 19 Fedora Update System 2012-04-13 21:33:31 UTC
kernel-3.3.1-5.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 20 Fedora Update System 2012-04-14 00:40:59 UTC
kernel-2.6.43.2-2.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.43.2-2.fc15

Comment 21 Fedora Update System 2012-04-14 04:33:55 UTC
kernel-3.3.1-5.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 22 Dr J Austin 2012-04-14 10:12:58 UTC
Double rechecked on my machine - the problem is still present

ja@minix ~ 22$ uname -a
Linux minix 3.3.1-5.fc16.x86_64 #1 SMP Tue Apr 10 19:56:52 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Absolutely no indication in dmesg that the disk has been plugged in the second time around

----------------------------------------------------------------------
ja@minix ~ 24$ cat dmesg_disk_add_remove_add_3.3.1-5.fc16.x86_64
Clean reboot

ja@minix ~ 2$ uname -a
Linux minix 3.3.1-5.fc16.x86_64 #1 SMP Tue Apr 10 19:56:52 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Plug in disk the first time
ja@minix ~ 7$ mount |grep sd
/dev/sda4 on / type ext4 (rw,noatime,seclabel,user_xattr,acl,barrier=1,data=ordered,discard)
/dev/sda3 on /boot type ext4 (rw,noatime,seclabel,user_xattr,acl,barrier=1,data=ordered,discard)
/dev/sdb1 on /media/wd250 type ext4 (rw,relatime,seclabel,user_xattr,acl,barrier=1,data=ordered)

--------------------------------------------------------------
[  136.296735] ata4: exception Emask 0x10 SAct 0x0 SErr 0x40c0000 action 0xe frozen
[  136.296744] ata4: irq_stat 0x00000040, connection status changed
[  136.296753] ata4: SError: { CommWake 10B8B DevExch }
[  136.296767] ata4: hard resetting link
[  137.174196] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  137.174823] ata4.00: ATA-7: WDC WD2500KS-00MJB0, 02.01C03, max UDMA/133
[  137.174833] ata4.00: 488397168 sectors, multi 0: LBA48 
[  137.175519] ata4.00: configured for UDMA/133
[  137.175534] ata4: EH complete
[  137.175732] scsi 3:0:0:0: Direct-Access     ATA      WDC WD2500KS-00M 02.0 PQ: 0 ANSI: 5
[  137.176181] sd 3:0:0:0: [sdb] 488397168 512-byte logical blocks: (250 GB/232 GiB)
[  137.176435] sd 3:0:0:0: [sdb] Write Protect is off
[  137.176443] sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[  137.176634] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  137.176650] sd 3:0:0:0: Attached scsi generic sg1 type 0
[  137.221163]  sdb: sdb1
[  137.223369] sd 3:0:0:0: [sdb] Attached SCSI disk
[  137.570148] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[  137.570155] SELinux: initialized (dev sdb1, type ext4), uses xattr
-----------------------------------------------------------------
[root@minix ~]# umount /dev/sdb1
No new messages in dmesg 

------------------------------------------------------------------
Unplug disk

[  372.979293] ata4: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
[  372.979302] ata4: irq_stat 0x00400000, PHY RDY changed
[  372.979312] ata4: SError: { RecovComm Persist PHYRdyChg 10B8B }
[  372.979323] ata4: hard resetting link
[  373.702176] ata4: SATA link down (SStatus 0 SControl 300)
[  378.702211] ata4: hard resetting link
[  379.007177] ata4: SATA link down (SStatus 0 SControl 300)
[  379.007195] ata4: limiting SATA link speed to 1.5 Gbps
[  384.007200] ata4: hard resetting link
[  384.312177] ata4: SATA link down (SStatus 0 SControl 310)
[  384.312194] ata4.00: disabled
[  384.312220] ata4: EH complete
[  384.312237] ata4.00: detaching (SCSI 3:0:0:0)
[  384.314742] sd 3:0:0:0: [sdb] Synchronizing SCSI cache
[  384.314810] sd 3:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  384.314820] sd 3:0:0:0: [sdb] Stopping disk
[  384.314837] sd 3:0:0:0: [sdb] START_STOP FAILED
[  384.314842] sd 3:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
-------------------------------------------------------------
Plug in disk again
No change in dmesg !!
No indication that the disk has been plugged in

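
The symptom in this log is that no "SATA link up" line follows the re-plug. A rough sketch for checking that mechanically is shown below. It assumes only the message format seen in the dmesg output in this report; `link_events` and `last_state` are names invented for illustration.

```python
import re

# Matches e.g. "[  373.702176] ata4: SATA link down (SStatus 0 SControl 300)"
LINK_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+(?P<port>ata\d+): SATA link (?P<state>up|down)\b"
)


def link_events(dmesg_text):
    """Return (time, port, state) tuples for every SATA link message."""
    events = []
    for line in dmesg_text.splitlines():
        m = LINK_RE.search(line)
        if m:
            events.append(
                (float(m.group("time")), m.group("port"), m.group("state"))
            )
    return events


def last_state(dmesg_text, port):
    """Return the most recent link state logged for port, or None."""
    states = [s for _, p, s in link_events(dmesg_text) if p == port]
    return states[-1] if states else None


if __name__ == "__main__":
    # Lines copied from the dmesg output in this comment.
    sample = """\
[  137.174196] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  373.702176] ata4: SATA link down (SStatus 0 SControl 300)
[  384.312177] ata4: SATA link down (SStatus 0 SControl 310)
"""
    print(last_state(sample, "ata4"))
```

If the last state for the port is still "down" after the disk has been plugged in again, the hotplug event was not seen by the kernel.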

Comment 23 Dr J Austin 2012-04-14 12:59:52 UTC
It looks as if this may be my/the problem

http://www.spinics.net/lists/linux-ide/msg43173.html
(Date: Fri, 13 Apr 2012 09:24:07 +0800)

"...
> The fundamental problem with this patch is that all SATA ports are 
> hotpluggable... even the ones the firmware/silicon failed to mark as 
> hotpluggable via AHCI's PORT_CMD_MPSP | PORT_CMD_HPCP

So the acceptable solution is to add runtime pm support for hotpluggable
port.

I'll send new patches.

Thanks,
Lin Ming"
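
For reference, the two flags named in the quoted objection are bits in the AHCI port command register (PxCMD). A minimal sketch of the check under discussion follows. The bit positions are taken from the AHCI specification as an assumption (they are not stated in the mail), and `marked_hotpluggable` is a hypothetical helper name.

```python
# Bit positions in the AHCI port command register (PxCMD), per the AHCI
# specification. Assumed here; the quoted mail does not give them.
PORT_CMD_HPCP = 1 << 18  # Hot Plug Capable Port
PORT_CMD_MPSP = 1 << 19  # Mechanical Presence Switch attached to Port


def marked_hotpluggable(port_cmd):
    """Return True if firmware flagged the port via MPSP or HPCP."""
    return bool(port_cmd & (PORT_CMD_MPSP | PORT_CMD_HPCP))


if __name__ == "__main__":
    # Hypothetical register values, for illustration only.
    print(marked_hotpluggable(PORT_CMD_HPCP))  # flagged port
    print(marked_hotpluggable(0))              # unflagged port
```

As the quoted objection points out, a False result does not prove that the port cannot be hot-plugged, because firmware may simply have failed to set the flags.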

Comment 24 Dr J Austin 2012-04-18 09:47:40 UTC
There is more activity on this bug here (it is obviously not closed!)
(I am unsure of the relationship between Red Hat Bugzilla and the people
referenced below)

http://www.spinics.net/lists/linux-ide/msg43236.html

To: Mark Lord <kernel@xxxxxxxxxxxx>
Subject: Re: Hotplug borked after suspend/resume in Linux-3.3 ?
From: Jeff Garzik <jgarzik@xxxxxxxxx>
Date: Wed, 18 Apr 2012 02:18:40 -0400
Cc: Lin Ming <ming.m.lin@xxxxxxxxx>, Tejun Heo <htejun@xxxxxxxxx>, linux-ide@xxxxxxxxxxxxxxx
In-reply-to: <4F8E1D0A.8000203>
List-id: <linux-ide.vger.kernel.org>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329 Thunderbird/11.0.1
On 04/17/2012 09:46 PM, Mark Lord wrote:
> On 12-04-17 09:37 PM, Mark Lord wrote:
>> On 12-04-17 09:29 PM, Lin Ming wrote:
>>>
>>> I'm working on the hotplug issue fix.
>>>
>>> Before the fix is ready, here is the one-line patch.
>>> Could you give it a try?
>> ..
>>> --- a/drivers/ata/libata-transport.c
>>> +++ b/drivers/ata/libata-transport.c
>>> @@ -294,6 +294,7 @@ int ata_tport_add(struct device *parent,
>>>  	device_enable_async_suspend(dev);
>>>  	pm_runtime_set_active(dev);
>>>  	pm_runtime_enable(dev);
>>> +	pm_runtime_forbid(dev);
>> ..
>>
>> I'm rebuilding the kernel right now.. should take about 5min or less to test.
>
> Yeah, that (by itself) is enough to make things work again.
> This looks like the one-liner that we really need upstream and in -stable.
>
> Jeff?
>
> I'll now use it instead of the (much larger) "v2 disable runtime pm" thing.

Yeah I like that a -whole- lot better...

Will push upstream tomorrow.
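
pm_runtime_forbid() keeps a device out of runtime suspend. From user space, the same state shows up as "on" instead of "auto" in the device's power/control file in sysfs. A sketch for inspecting that value on each ATA port follows. The sysfs path pattern is an assumption and may differ between kernel versions, and `runtime_pm_controls` is a name made up for illustration.

```python
import glob
import os

# Assumed sysfs layout; the location may differ between kernel versions.
DEFAULT_PATTERN = "/sys/class/ata_port/ata*/device/power/control"


def runtime_pm_controls(pattern=DEFAULT_PATTERN):
    """Map each port name to its power/control value ("on" or "auto")."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        # .../ataN/device/power/control -> "ataN"
        port = path.split(os.sep)[-4]
        with open(path) as f:
            result[port] = f.read().strip()
    return result


if __name__ == "__main__":
    for port, value in runtime_pm_controls().items():
        print(port, value)
```

A port that reports "on" is excluded from runtime power management, which is the state the one-line patch sets up for every port.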

Comment 25 Josh Boyer 2012-04-18 12:40:16 UTC
(In reply to comment #24)
> There is more activity on this bug here (it is obviously not closed!)
> (I am unsure of the relationship between Red Hat Bugzilla and the people
> referenced below)
> 
> http://www.spinics.net/lists/linux-ide/msg43236.html
> 
> To: Mark Lord <kernel@xxxxxxxxxxxx>
> Subject: Re: Hotplug borked after suspend/resume in Linux-3.3 ?
> From: Jeff Garzik <jgarzik@xxxxxxxxx>
> Date: Wed, 18 Apr 2012 02:18:40 -0400
> Cc: Lin Ming <ming.m.lin@xxxxxxxxx>, Tejun Heo <htejun@xxxxxxxxx>,
> linux-ide@xxxxxxxxxxxxxxx
> In-reply-to: <4F8E1D0A.8000203>
> List-id: <linux-ide.vger.kernel.org>
> User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329
> Thunderbird/11.0.1
> On 04/17/2012 09:46 PM, Mark Lord wrote:
> On 12-04-17 09:37 PM, Mark Lord wrote:
> On 12-04-17 09:29 PM, Lin Ming wrote:
> 
> I'm working on the hotplug issue fix.
> 
> Before the fix is ready, here is the one-line patch.
> Could you give it a try?
> ..
> --- a/drivers/ata/libata-transport.c
> +++ b/drivers/ata/libata-transport.c
> @@ -294,6 +294,7 @@ int ata_tport_add(struct device *parent,
>    device_enable_async_suspend(dev);
>    pm_runtime_set_active(dev);
>    pm_runtime_enable(dev);
> + pm_runtime_forbid(dev);

Here is a scratch build with the original patch removed, and the one above added.  When it completes, I would appreciate if you could test it and let me know how it works.

http://koji.fedoraproject.org/koji/taskinfo?taskID=4001505

Comment 26 Dr J Austin 2012-04-18 17:13:19 UTC
ja@minix ~ 7$ uname -a
Linux minix 3.3.2-3.1.fc16.x86_64 #1 SMP Wed Apr 18 12:51:06 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

It is now possible to plugin/mount/umount/unplug external eSATAp devices
(I have tried two, SSD and mechanical disks, several times)
as and when required without a reboot being necessary!
Many thanks
John

Comment 27 Josh Boyer 2012-04-18 17:31:42 UTC
(In reply to comment #26)
> ja@minix ~ 7$ uname -a
> Linux minix 3.3.2-3.1.fc16.x86_64 #1 SMP Wed Apr 18 12:51:06 UTC 2012 x86_64
> x86_64 x86_64 GNU/Linux
> 
> It is now possible to plugin/mount/umount/unplug external eSATAp devices
> (I have tried two, SSD and mechanical disks, several times)
> as and when required without a reboot being necessary!

Great.  Thank you for testing.  I'll get this patch rolled into the next update later today.

Comment 28 Kriton Kyrimis 2012-04-19 07:22:23 UTC
Kernel 3.3.2-3.1.fc16.x86_64 also fixes the bug in my case; kernel 3.3.1-5.fc16.x86_64 had worked for me as well.

Comment 29 Fedora Update System 2012-04-21 15:22:03 UTC
kernel-3.3.2-8.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/kernel-3.3.2-8.fc17

Comment 30 Fedora Update System 2012-04-21 16:26:06 UTC
kernel-3.3.2-6.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.2-6.fc16

Comment 31 Fedora Update System 2012-04-21 16:46:09 UTC
kernel-2.6.43.2-6.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.43.2-6.fc15

Comment 32 Fedora Update System 2012-04-21 21:07:41 UTC
Package kernel-3.3.2-8.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.3.2-8.fc17'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-6344/kernel-3.3.2-8.fc17
then log in and leave karma (feedback).

Comment 33 Fedora Update System 2012-04-24 04:29:03 UTC
kernel-3.3.2-8.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 34 Fedora Update System 2012-04-24 14:53:53 UTC
kernel-3.3.2-6.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 35 Fedora Update System 2012-04-26 03:27:13 UTC
kernel-2.6.43.2-6.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.