Bug 1011362 - ALX NIC driver dies after resume. Regression from 3.10
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 19
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: fedora-kernel-ethernet-ath
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1011777
Depends On:
Blocks: 1034952
 
Reported: 2013-09-24 07:15 UTC by Gilboa Davara
Modified: 2013-11-29 06:54 UTC (History)
CC List: 15 users

Fixed In Version: kernel-3.11.9-100.fc18
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1034952 (view as bug list)
Environment:
Last Closed: 2013-11-24 03:48:10 UTC
Type: Bug
Embargoed:


Attachments
Callstack SS 1/2 (1.54 MB, image/jpeg)
2013-09-24 07:15 UTC, Gilboa Davara
lspci log. (26.44 KB, text/x-log)
2013-10-03 16:08 UTC, Gilboa Davara
dmesg (boot) (3.29 KB, text/x-log)
2013-10-03 16:21 UTC, Gilboa Davara

Description Gilboa Davara 2013-09-24 07:15:31 UTC
Created attachment 802064 [details]
Callstack SS 1/2

Description of problem:
Since updating to 3.11 the ALX driver no longer successfully resumes from suspend.
After resume, the kernel log gets flooded with the following message (thousands of times):
alx 0000:04:00.0: invalid PHY speed/duplex: 0xffff
....
Attempting to shut down the machine results in a slowpath oops (screenshot attached; sorry for the poor quality, there is no serial port on this laptop).

Switching back to 3.10 solves the issue.


Version-Release number of selected component (if applicable):
3.11.1-200.fc19.x86_64


Additional info:
Also reported by Ubuntu users:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1213009

Comment 1 Josh Boyer 2013-09-25 17:09:12 UTC
*** Bug 1011777 has been marked as a duplicate of this bug. ***

Comment 2 Josh Boyer 2013-09-25 17:18:25 UTC
I emailed upstream about this.  It's possible this was introduced when WoL support was removed, but I'm unsure at the moment.

http://thread.gmane.org/gmane.linux.network/284929

Comment 3 Gilboa Davara 2013-09-26 11:12:10 UTC
Let me know if you want me to open an upstream bug about it.

Thanks.
- Gilboa

Comment 4 markusN 2013-10-02 18:05:32 UTC
The problem persists in the latest kernel:

...
[ 2992.767858] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.768989] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.770126] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.771228] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.772355] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.773521] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.774662] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.775797] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[ 2992.776924] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff

[neteler@oboe ~]$ uname -a
Linux oboe.localdomain 3.11.2-201.fc19.x86_64 #1 SMP Fri Sep 27 19:20:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I need to keep using 3.10.6, which is the last working kernel
in F19, since the following 3.11 kernels were affected by bug 917081.

Comment 5 John Greene 2013-10-03 15:02:38 UTC
The callstack is helpful. Could you please upload the dmesg output so I can get a fuller picture?

The output of this would help me too (just the part with the alx device is all I need).

lspci -vvnn

Comment 6 Gilboa Davara 2013-10-03 16:08:02 UTC
Created attachment 807162 [details]
lspci log.

(Slowpath oops callstack already attached as a screenshot, due to the lack of a serial port on the laptop)

Comment 7 Gilboa Davara 2013-10-03 16:21:28 UTC
Created attachment 807164 [details]
dmesg (boot)

dmesg post resume is useless (invalid PHY speed/duplex: 0xffff by the millions)
Here's a fresh dmesg (pre-suspend).

- Gilboa

Comment 8 Jeff Gold 2013-10-07 22:16:41 UTC
I see the same symptoms as the original report using the 3.11.3-201.fc19.x86_64 kernel.  I ended up with over 5GiB of these "invalid PHY speed/duplex" lines in /var/log/messages after a suspend.  I'm using an ASUS X202E (Intel i3-3217U and Atheros AR816x/AR817x).
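
A rough way to measure how badly a log has been flooded is to count the offending lines. The sketch below assumes a plain-text syslog file; the function name count_alx_flood is made up for illustration, and the default path is the one mentioned in this comment.

```shell
#!/bin/sh
# Count "invalid PHY speed/duplex" lines in a log file.
# count_alx_flood is a hypothetical helper name. The default path is the one
# from comment 8; pass another file as $1 if your system logs elsewhere
# (assumption: a plain-text log, not the binary systemd journal).
count_alx_flood() {
    log="${1:-/var/log/messages}"
    # -c prints the number of matching lines; -F matches the text literally.
    grep -cF 'invalid PHY speed/duplex' "$log"
}
```

Calling count_alx_flood with no argument reads /var/log/messages, which usually requires root.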

Comment 9 Gil Forcada 2013-10-10 18:25:11 UTC
Same here: after suspending, my CPU was running at full speed writing to the journal...

Using a 3.10 kernel has solved the issue for now.

Comment 10 John Greene 2013-10-10 19:36:27 UTC
So it appears a regression has occurred from 3.10 to 3.11.  

Small change list 
c3eb7a7 alx: remove redundant D0 power state set
a8798a5 alx: fix lockdep annotation
bc2bebe alx: remove WoL support
7ec5689 alx: fix ethtool support code
46ab9b3 alx: fix MAC address alignment problem
a5b87cc alx: separate link speed/duplex fields
4a134c3 alx: make sizes unsigned
17fdd35 alx: fix 100mbit/half duplex speed translation
ef0cc4b alx: treat flow control correctly in alx_set_pauseparam()

Educated guess is the problem is one of the above..
c3eb7a7 alx: remove redundant D0 power state set
7ec5689 alx: fix ethtool support code
a5b87cc alx: separate link speed/duplex fields
17fdd35 alx: fix 100mbit/half duplex speed translation
a5b87cc alx: separate link speed/duplex fields

Do any of you have the ability to build and test kernels?

Try reverting these on 3.11...
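
For anyone who does build kernels, the revert experiment could be planned roughly as below. This is only a sketch: it assumes a git clone of the kernel with a 3.11 branch checked out in which these abbreviated hashes resolve, and the helper names unique_candidates and print_revert_plan are made up. The functions only print the commands so the plan can be reviewed first; the reverts themselves may conflict and need manual resolution.

```shell
#!/bin/sh
# Candidate commits from the "educated guess" list in this comment
# (a5b87cc is listed twice there).
candidates="c3eb7a7 7ec5689 a5b87cc 17fdd35 a5b87cc"

# Print each hash once, keeping the original order.
unique_candidates() {
    # $candidates is intentionally unquoted so it splits into words.
    printf '%s\n' $candidates | awk '!seen[$0]++'
}

# Print (do not run) one revert command per candidate commit.
print_revert_plan() {
    unique_candidates | while read -r hash; do
        echo "git revert --no-edit $hash"
    done
}
```

Reverting one commit at a time and testing suspend/resume after each build would show which change, if any, introduced the regression.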

Comment 11 digger vermont 2013-10-14 17:16:59 UTC
Looks like it is still an issue with a kernel upgrade in F20

Linux localhost.localdomain 3.11.4-302.fc20.x86_64 #1 SMP Fri Oct 11 17:43:41 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Also this looks like the bugreport upstream:
https://bugzilla.kernel.org/show_bug.cgi?id=62491

As of Oct 11 they had no idea what it is.

Comment 12 Gilboa Davara 2013-10-17 19:09:54 UTC
(In reply to John Greene from comment #10)
> Do any of you have the ability to build and test kernels?
> Try reverting these on 3.11...

I'll try and free some time next week to revert each patch and see what breaks.

- Gilboa

Comment 13 markusN 2013-10-23 19:19:14 UTC
Still unsolved:

[  928.222244] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.223338] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.224448] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.225572] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.226722] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.227844] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.228971] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.230081] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff
[  928.231190] alx 0000:03:00.0: invalid PHY speed/duplex: 0xffff

uname -a
Linux oboe.localdomain 3.11.4-201.fc19.x86_64 #1 SMP Thu Oct 10 14:11:18 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

The only usable kernel for F19 remains 3.10.6. I will preserve it carefully...

Comment 14 kevin 2013-11-13 21:08:45 UTC
Any news / progress on this? Just had yet another hit, the 2nd in two days.

Comment 15 John Greene 2013-11-14 19:25:56 UTC
No fix at this point upstream, but I did some additional looking at the history.
It appears that the 3.10.x family all seem to share the same history wrt the 
ALX driver.  You may be able to 
 (In reply to John Greene from comment #10)
> So it appears a regression has occurred from 3.10 to 3.11.  
> 
> Small change list 
> c3eb7a7 alx: remove redundant D0 power state set
> a8798a5 alx: fix lockdep annotation
> bc2bebe alx: remove WoL support
> 7ec5689 alx: fix ethtool support code
> 46ab9b3 alx: fix MAC address alignment problem
> a5b87cc alx: separate link speed/duplex fields
> 4a134c3 alx: make sizes unsigned
> 17fdd35 alx: fix 100mbit/half duplex speed translation
> ef0cc4b alx: treat flow control correctly in alx_set_pauseparam()
> 
> Educated guess is the problem is one of the above..
> c3eb7a7 alx: remove redundant D0 power state set
> 7ec5689 alx: fix ethtool support code
> a5b87cc alx: separate link speed/duplex fields
> 17fdd35 alx: fix 100mbit/half duplex speed translation
> a5b87cc alx: separate link speed/duplex fields
> 
> Do any of you have the ability to build and test kernels?
> 
> Try reverting these on 3.11...

I don't have access to this device internally, so if I get time I might be able to revert these for you.   It may take a bit to get around to that. Hence the question:  Do any of you have the ability to build and test kernels?
Or at least willingness to test a kernel I might be able to generate?

Comment 16 John Greene 2013-11-14 19:28:54 UTC
Oh, somebody try this: add the following to the kernel command line and see if it helps at all:

pcie_aspm=off

Let me know what you come up with.
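
One possible way to apply and verify the parameter on Fedora is sketched below. The helper name has_aspm_off is made up; grubby is the stock Fedora boot-entry tool, and the commands that change the boot configuration are left as comments because they need root and a reboot.

```shell
#!/bin/sh
# Succeed if a kernel command line contains pcie_aspm=off.
# has_aspm_off is a hypothetical helper name. With no argument it checks the
# running kernel via /proc/cmdline.
has_aspm_off() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Pad with spaces so the parameter only matches as a whole word.
    case " $cmdline " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# To add the parameter to all installed kernels (run as root, then reboot):
#   grubby --update-kernel=ALL --args="pcie_aspm=off"
# To remove it again once testing is done:
#   grubby --update-kernel=ALL --remove-args="pcie_aspm=off"
```

After rebooting, running has_aspm_off && echo "ASPM disabled" confirms the parameter actually reached the kernel before the suspend/resume test.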

Comment 17 kevin 2013-11-14 21:21:35 UTC
I've just had another hit so willing to try anything. As time is pretty scarce for me ATM I've taken the easy route with the pcie_aspm option. Obviously will not know if that fixes the problem or just minimises it.

If that doesn't work I'm happy to test a kernel - but don't have the time to build it myself.

Comment 18 John Greene 2013-11-15 14:52:10 UTC
Kevin,

Great..If you could try that and let me know if the problem does go away.  It's a common workaround.  It will tell me a bit to focus the bisect for the problem.

Comment 19 digger vermont 2013-11-15 16:27:14 UTC
Here is the comment from kernel.org with the attached patch:

https://bugzilla.kernel.org/show_bug.cgi?id=62491#c7

Link to the patch:

https://bugzilla.kernel.org/attachment.cgi?id=114381&action=diff#a/drivers/net/ethernet/atheros/alx/main.c_sec1

and here's the patch:

==============================================

From 27744b24f9291782c1342dbd6cac511e68da907c Mon Sep 17 00:00:00 2001
From: hahnjo <hahnjo>
Date: Tue, 12 Nov 2013 18:19:24 +0100
Subject: [PATCH] alx: Reset phy speed after resume

This fixes bug 62491 (https://bugzilla.kernel.org/show_bug.cgi?id=62491).
After resuming some users got the following error flooding the kernel log:
alx 0000:02:00.0: invalid PHY speed/duplex: 0xffff
---
 drivers/net/ethernet/atheros/alx/main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
index fc95b23..6305a5d 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -1389,6 +1389,9 @@ static int alx_resume(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct alx_priv *alx = pci_get_drvdata(pdev);
+	struct alx_hw *hw = &alx->hw;
+
+	alx_reset_phy(hw);
 
 	if (!netif_running(alx->dev))
 		return 0;
-- 
1.8.4.2

======================================

Comment 20 John Greene 2013-11-15 17:48:03 UTC
Good find.  Will check this out soon.

Comment 21 digger vermont 2013-11-15 18:03:23 UTC
I do have time to install a patched kernel and check it out. But unfortunately no time for building.

Comment 22 kevin 2013-11-15 21:01:38 UTC
I hope that patch works  because the kernel parameter has not - 2 * hits this morning. :(

Also willing to try the custom kernel.

Comment 23 Charles R. Anderson 2013-11-16 06:02:42 UTC
I made a local mock build with the patch from comment #19 and tested it on my Asus X202E.  It fixes the problem for me.

I also have some koji scratch builds submitted with the patch applied.  Try these out when they finish:

F19:

http://koji.fedoraproject.org/koji/taskinfo?taskID=6187061

F18:

http://koji.fedoraproject.org/koji/taskinfo?taskID=6186733

Comment 24 markusN 2013-11-16 17:12:08 UTC
(In reply to Charles R. Anderson from comment #23)
> F19: 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=6187061

I would be happy to test it. How do I install it? Is there any documentation for these builds?
Or simply download all relevant RPMs from there?

Comment 25 Michele Baldessari 2013-11-16 18:59:37 UTC
For Fedora kernel folks, this has now hit the 'net' tree, so it will trickle down to Linus:
commit b54629e226d196e802abdd30c5e34f2a47cddcf2
Author: hahnjo <hahnjo>
Date:   Tue Nov 12 18:19:24 2013 +0100

    alx: Reset phy speed after resume
    
    This fixes bug 62491 (https://bugzilla.kernel.org/show_bug.cgi?id=62491).
    After resuming some users got the following error flooding the kernel log:
    alx 0000:02:00.0: invalid PHY speed/duplex: 0xffff
    
    Signed-off-by: Jonas Hahnfeld <linux>
    Signed-off-by: David S. Miller <davem>

I don't currently see it in Davem's stable patchwork, so it might be worth adding
to Fedora's tree for the time being
(http://patchwork.ozlabs.org/bundle/davem/stable/?state=*)

Comment 26 Charles R. Anderson 2013-11-16 19:44:07 UTC
(In reply to markusN from comment #24)
> (In reply to Charles R. Anderson from comment #23)
> > F19: 
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=6187061
> 
> I would be happy to test it. How to install, any RTFM for these builds?
> Or simply download all relevant RPMs from there?

On your system, find out which ones you need first by doing this:

rpm -qa kernel\* | sort

Then download the ones you need (typically only kernel-3.* and kernel-modules-extra-3.* for your arch either i686 or x86_64) and do:

yum update kernel*

and reboot to the new kernel.
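
To work out which RPMs to fetch from the scratch build, the rpm -qa output can be reduced to bare package names. This is a sketch only; kernel_package_names is a made-up helper name, and it assumes the usual name-version-release.arch form that rpm -qa prints.

```shell
#!/bin/sh
# Reduce `rpm -qa kernel\*` output (name-version-release.arch, one per line)
# to the distinct package names, i.e. the RPMs worth downloading.
# kernel_package_names is a hypothetical helper name.
kernel_package_names() {
    # Version and release never contain '-', so stripping the last two
    # '-'-separated fields leaves the package name.
    sed -E 's/-[^-]+-[^-]+$//' | sort -u
}
```

Usage: rpm -qa kernel\* | kernel_package_names. Note that yum update kernel* only picks up the downloaded files when it is run from the directory that contains them, because the shell expands kernel* to the local file names.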

Comment 27 markusN 2013-11-17 11:09:03 UTC
(In reply to markusN from comment #24)
> (In reply to Charles R. Anderson from comment #23)
> > F19: 
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=6187061
> 
> I would be happy to test it. How to install, any RTFM for these builds?
> Or simply download all relevant RPMs from there?

As per Comment #26 I have updated to 3.11.8-200.bz1011362.fc19.x86_64
and resumed successfully from suspend already twice.
No more message flooding and the wireless device works!

Looks good, thanks for the test kernel which I'll continue to test.

Comment 28 John Greene 2013-11-18 15:09:16 UTC
Nice start to the week..thanks Charles.
I'll close the loop and see to it this flows into Fedora asap, if not in process already.  Please update your testing status here as you go.

Comment 29 Josh Boyer 2013-11-18 15:57:13 UTC
Thanks for testing everyone.  I've applied the patch Michele pointed to with comment #25.

Comment 31 markusN 2013-11-19 23:51:00 UTC
I see that kernel 3.11.8-200 has been released. Does it contain the bugfix
which continues to work fine on my ASUS X202E?

(I ask since I don't see it mentioned in
 http://koji.fedoraproject.org/koji/buildinfo?buildID=478117 )

Thanks again for the fix.

Comment 32 digger vermont 2013-11-20 00:48:02 UTC
(In reply to markusN from comment #31)
> I see that kernel 3.11.8-200 has been released. Does it contain the bugfix
> which continues to work fine on my ASUS X202E?
> 
> (I ask since I don't see it mentioned in
>  http://koji.fedoraproject.org/koji/buildinfo?buildID=478117 )
> 
> Thanks again for the fix.

The last date on the changelog for the link you gave is Nov 13. The patch was applied on Nov 18. See comment 29.
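
The same check can be made locally against an installed kernel package's changelog. The sketch below assumes the packager referenced this bug number in the changelog entry, which is not guaranteed; changelog_mentions_bug is a made-up helper name.

```shell
#!/bin/sh
# Succeed if the changelog text on stdin mentions the given bug number.
# changelog_mentions_bug is a hypothetical helper name.
changelog_mentions_bug() {
    # -q: no output, exit status only; -F: match the number literally.
    grep -qF "$1"
}
```

Usage: rpm -q --changelog kernel-3.11.9-200.fc19 | changelog_mentions_bug 1011362 && echo "fix listed". A miss does not prove the patch is absent, only that the changelog does not name the bug.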

Comment 33 Fedora Update System 2013-11-21 14:45:32 UTC
kernel-3.11.9-300.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.11.9-300.fc20

Comment 34 Fedora Update System 2013-11-21 14:48:31 UTC
kernel-3.11.9-200.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.11.9-200.fc19

Comment 35 Fedora Update System 2013-11-21 14:53:58 UTC
kernel-3.11.9-100.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/kernel-3.11.9-100.fc18

Comment 36 markusN 2013-11-21 18:09:35 UTC
(In reply to Fedora Update System from comment #34)
> kernel-3.11.9-200.fc19 has been submitted as an update for Fedora 19.
> https://admin.fedoraproject.org/updates/kernel-3.11.9-200.fc19

Thanks, suspend/resume works with this kernel. I left stable karma.

Comment 37 Fedora Update System 2013-11-23 19:41:13 UTC
Package kernel-3.11.9-100.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.11.9-100.fc18'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-21822/kernel-3.11.9-100.fc18
then log in and leave karma (feedback).

Comment 38 Fedora Update System 2013-11-24 03:48:10 UTC
kernel-3.11.9-200.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 39 kevin 2013-11-24 20:39:09 UTC
Installed the new kernel last night and this is the first morning in over a week that I've not had to do a reboot.

THANK YOU!!!!!!!

Comment 40 Fedora Update System 2013-11-24 23:47:05 UTC
kernel-3.11.9-300.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 41 Fedora Update System 2013-11-29 06:54:55 UTC
kernel-3.11.9-100.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

