Bug 1251434

Summary: alx driver doesn't work with kernel >= 4.1.2
Product: [Fedora] Fedora Reporter: Dave M <dave.nerd>
Component: kernel Assignee: Jarod Wilson <jarod>
Status: CLOSED UPSTREAM QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: high    
Version: 23 CC: dave.nerd, gansalmon, itamar, jonathan, kernel-maint, labbott, ldap.tester, madhu.chinakonda, mchehab, pradhiguru, raphael.slagmolen, reddy, zoot1612
Target Milestone: --- Flags: pradhiguru: needinfo?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel 4.7 Doc Type: Bug Fix
Last Closed: 2016-08-11 14:45:29 UTC Type: Bug
Attachments:
Description Flags
output_from_lspci none
networkmanager_info none

Description Dave M 2015-08-07 09:45:50 UTC
Created attachment 1060319
output_from_lspci

Description of problem:
With kernel <= 4.0.8, the alx ethernet driver works automagically.  Starting with 4.1.2 and continuing through 4.1.3, only the wireless connection works.

Version-Release number of selected component (if applicable):
$ uname -a
Linux hero.x.org 4.1.3-201.fc22.x86_64 #1 SMP Wed Jul 29 19:50:22 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -qa |grep NetworkManager
NetworkManager-team-1.0.2-1.fc22.x86_64
NetworkManager-vpnc-gnome-1.0.2-1.fc22.x86_64
NetworkManager-openconnect-1.0.2-1.fc22.x86_64
NetworkManager-1.0.2-1.fc22.x86_64
NetworkManager-glib-1.0.2-1.fc22.x86_64
NetworkManager-wifi-1.0.2-1.fc22.x86_64
NetworkManager-vpnc-1.0.2-1.fc22.x86_64
NetworkManager-pptp-gnome-1.1.0-1.20150428git695d4f2.fc22.x86_64
NetworkManager-openvpn-gnome-1.0.2-2.fc22.x86_64
NetworkManager-wwan-1.0.2-1.fc22.x86_64
NetworkManager-libnm-1.0.2-1.fc22.x86_64
NetworkManager-openvpn-1.0.2-2.fc22.x86_64
NetworkManager-pptp-1.1.0-1.20150428git695d4f2.fc22.x86_64
NetworkManager-adsl-1.0.2-1.fc22.x86_64
NetworkManager-config-connectivity-fedora-1.0.2-1.fc22.x86_64
NetworkManager-bluetooth-1.0.2-1.fc22.x86_64

Actual results:
The wired interface shows as connected but apparently is not.

Expected results:
Work automagically like older kernels.

Additional info:
See attachments. Please let me know if additional information is required.

Comment 1 Dave M 2015-08-07 09:46:38 UTC
Created attachment 1060320
networkmanager_info

Comment 2 Ldap Tester 2015-08-11 15:47:40 UTC
I have two machines with an Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 08).  Up to and including kernel 4.0.8, this interface was working fine.  With kernel 4.1.3, it has stopped working properly.  If I boot back into the 4.0.8 kernel, the interface works fine again.  It is not a hardware problem.

Upon booting up with the 4.1.3 kernel, the network works for about a minute and then stops communicating.  I can't ping the machine from another machine on the network.  After about 25 minutes I get the message:
kernel: alx 0000:06:00.0 p5p1: fatal interrupt 0x8400, resetting
and then the network works again for about a minute and then stops working.  This scenario repeats about every 25 minutes.  The number sometimes is 0x0400 or 0x0401.

This bug is severe and high priority as these machines are basically unusable with the latest kernel.
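
For anyone comparing the interrupt status values quoted above (0x8400, 0x0400, 0x0401), a minimal Python sketch that lists which bit positions are set in each value and which are common to all of them. The report does not say which interrupt source each bit corresponds to, so no names are assigned here; the helper name set_bits is illustrative only.

```python
def set_bits(value):
    """Return the positions of the bits that are set in value, lowest first."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]

# Interrupt status values quoted in the comment above.
reported = [0x8400, 0x0400, 0x0401]

for value in reported:
    print(f"0x{value:04x}: bits {set_bits(value)}")

# Bits that are set in every reported value.
common = set(set_bits(reported[0]))
for value in reported[1:]:
    common &= set(set_bits(value))
print("common bits:", sorted(common))
```

The one bit shared by all three values is probably the most useful thing to look up in the driver's register definitions.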

Comment 3 Dave M 2015-08-11 16:14:49 UTC
@Ldap Tester - thanks for posting.  I changed the priority levels; hope it gets some attention now.  If my system didn't have a wireless card, I'd be out of luck too.

Comment 4 John 2015-08-14 01:10:24 UTC
Same issue with DELL One 27 with Atheros AR8161 PCIe 10/100/1000 Mbps Ethernet.

Currently defaulting to the old 4.0 kernel to work around the issue.

I believe Bug report 1249493 is the same bug.

Comment 5 Laura Abbott 2015-08-17 23:21:26 UTC
Can you try a bisect using the scripts at https://pagure.io/fedbisect ? That's going to be the easiest way to narrow down the problem

Comment 6 Ldap Tester 2015-08-21 21:33:48 UTC
Laura, I looked at the bisect procedure and it seems to be a bit over my head.  I would be willing to try it, but the two machines I have with the AR8161 are in a remote location.  If I were to reboot into a bad kernel, I would lose contact with the machine.  Recovering is logistically problematic.  Maybe one of the other posters can do the bisect.  

All I can tell you now is that the AR8161 was (and is) working fine with kernel-4.0.8-200.fc21.x86_64 and fails under 4.1.3-100.fc21 or 4.1.3-200.fc22.  I realize there are many kernel changes between those two versions, but can there be that many changes to the alx driver?  The error message I posted could give you an additional clue.  I consider this regression to be urgent in that my only workaround is to run a kernel without the latest security fixes.

Comment 7 John 2015-08-23 22:29:09 UTC
I have started with fedbisect and gone through three iterations of the kernel at this point. As the computer is at home it may take me some time to complete.

Comment 8 John 2015-08-24 00:02:47 UTC
(In reply to Laura Abbott from comment #5)
> Can you try a bisect using the scripts at https://pagure.io/fedbisect ?
> That's going to be the easiest way to narrow down the problem

Some clarifications regarding the fedbisect instructions. I know these are simple things, but I had trouble with them as a newbie:
-------------
git clone <location of repo>
should be
git clone <location of fedbisect repo>
-------------
Missing step after cloning fedbisect repository.
cd <fedbisect repo>
-------------
This will clone a kernel tree in <subdir> of the repo.
should be
This will clone a kernel tree in <subdir> of the fedbisect repo.
-------------
./fedbisect good

or

./fedbisect bad

should be

./fedbisect.sh good

or

./fedbisect.sh bad
-------------
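
For readers new to bisecting, a minimal Python sketch of the idea behind the good/bad steps above. It assumes a simplified linear history in which every commit after the first bad one is also bad; real git history is a graph, so git's own bisect is more involved. The commit list and the regression position (613) are made up for illustration.

```python
def first_bad(commits, is_bad):
    """Return (index of the first bad commit, number of tests run).

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1  # one build, boot and "good"/"bad" answer per iteration
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return bad, tests

# Hypothetical history: 1000 commits, regression introduced at commit 613.
commits = list(range(1000))
index, tests = first_bad(commits, lambda commit: commit >= 613)
print(index, tests)
```

Each loop iteration corresponds to one build-and-boot cycle followed by ./fedbisect.sh good or ./fedbisect.sh bad, which is why the number of kernels to test grows only with the logarithm of the number of commits.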

Comment 9 John 2015-08-24 12:19:05 UTC
Hi all
Here is what I ended up with from the bisect. It doesn't mean anything to me.


bash-4.3$ ./fedbisect.sh bad
HEAD is now at 387d375 PCI: Don't clear ASPM bits when the FADT declares it's unsupported
387d37577fdd05e9472c20885464c2a53b3c945f is the first bad commit
commit 387d37577fdd05e9472c20885464c2a53b3c945f
Author: Matthew Garrett <mjg59>
Date:   Tue Apr 7 11:07:00 2015 -0700

    PCI: Don't clear ASPM bits when the FADT declares it's unsupported
    
    Communications with a hardware vendor confirm that the expected behaviour
    on systems that set the FADT ASPM disable bit but which still grant full
    PCIe control is for the OS to leave any BIOS configuration intact and
    refuse to touch the ASPM bits.  This mimics the behaviour of Windows.
    
    Signed-off-by: Matthew Garrett <mjg59>
    Signed-off-by: Bjorn Helgaas <bhelgaas>

:040000 040000 733931aa65713217b4ee3c2ffe6952a961229e62 4e55b70cb187a3705734a04d16a675cd21d974ff M	drivers
:040000 040000 100d49d0e01d4719f851e9cdb6a4ac6333b4eb01 2f3abbee84c7f4e697993078a80ea3310f3bcded M	include
# first bad commit: [387d37577fdd05e9472c20885464c2a53b3c945f] PCI: Don't clear ASPM bits when the FADT declares it's unsupported
Found your commit!

Comment 10 John 2015-08-25 06:17:38 UTC
Built a kernel with this patch reverted, but it didn't make any difference: still no Ethernet connection.

Comment 11 Laura Abbott 2015-08-25 15:48:19 UTC
If the bisect didn't work then this should be reported to the maintainers upstream. There haven't been any changes to the alx driver recently.

Jay Cliburn <jcliburn> (maintainer:ATLX ETHERNET DRIVERS)
Chris Snook <chris.snook> (maintainer:ATLX ETHERNET DRIVERS)
netdev.org (open list:ATLX ETHERNET DRIVERS)
linux-kernel.org (open list)

Comment 12 John 2015-08-26 00:47:51 UTC
Hi all
Found this as an interim solution. Works for me (currently)

https://bugzilla.kernel.org/show_bug.cgi?id=70761

Comment 13 Ldap Tester 2015-09-08 19:49:41 UTC
On one of my two machines that use the alx driver, I have been testing kernel 4.1.3-200.fc22.x86_64 with MTU=9000 for the last six days, and I have seen no failures yet.  I do not consider this a workaround, but rather a dodge, in that it appears to dodge the bug but we do not know why, and have no assurance that it will continue to do so.

As John said, this severe regression has already been reported to the upstream maintainers, but there has been no response.  I think the Fedora maintainers should exert some pressure on the upstream maintainers, because the users aren't getting any attention from them.  This regression is urgent.  I cannot update my kernel to include the latest security fixes.

Comment 14 Dave M 2015-09-11 11:23:20 UTC
(In reply to Ldap Tester from comment #13)
> 
> As John said, this severe regression has already been reported to the
> upstream maintainers, but there has been no response.  I think the Fedora
> maintainers should exert some pressure on the upstream maintainers, because
> the users aren't getting any attention from them.  This regression is
> urgent.  I cannot update my kernel to include the latest security fixes.

Has anyone tried to directly contact the people/group Laura posted in #11?  If not, I will.

Thanks,
Dave M

Comment 15 Ldap Tester 2015-09-11 15:16:47 UTC
I have sent email to the people mentioned, but everyone with this problem should do so also.  The more voices they hear, the more likely we'll see some action.

Comment 16 Ldap Tester 2015-09-14 16:25:38 UTC
Here's the email I received from Jay Cliburn.

From: J. K. Cliburn <jcliburn>	Fri, Sep 11, 2015 at 12:40 PM
Cc: Chris Snook <chris.snook>, "Huang, Xiong" <xiong.com>

I'm pretty sure the alx driver is maintained by Qualcomm. Chris Snook and I (neither of us Qualcomm employees) worked to integrate the atl1 and atl2 drivers into the kernel several years ago. Since that time, however, Qualcomm has provided kernel support for all subsequent drivers (atl1c, atl1e, alx, and so on).

I've included on the cc list my only Qualcomm contact, and I'm not sure he's still there. The link below also provides a couple of email addresses that may help.

http://www.linuxfoundation.org/collaborate/workgroups/networking/alx

Comment 17 Ldap Tester 2015-10-02 15:16:01 UTC
I received this patch from Matthew Garrett mjg59. I really can't try the patch myself.  My affected machines are in a remote location and in almost constant use. John, can you (or anyone else) test this patch?

diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c
index c8af3ce..fb562cc 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -1242,6 +1242,8 @@ static int alx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	bool phy_configured;
 	int bars, err;
 
+	pci_disable_link_state(pdev, PCIE_LINK_STATE_CLKPM);
+
 	err = pci_enable_device_mem(pdev);
 	if (err)
 		return err;

Comment 18 Ldap Tester 2015-10-02 16:41:26 UTC
Stand by on the above patch.  Matthew says it won't work.  He will be sending me a new one.

Comment 19 Justin M. Forbes 2015-10-20 19:44:22 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.

Comment 20 Ldap Tester 2015-11-04 15:42:01 UTC
I can now confirm that this issue is still present under kernel-4.2.3-200.fc22.x86_64.

I also note that the MTU=9000 dodge doesn't always work.  While it seems to work for kernel-4.2.3-200.fc22.x86_64, it didn't work for me under kernel-4.1.8-200.fc22.x86_64.

I can also confirm what others have reported: when this issue occurs, packets are successfully transmitted out of the machine, but none are received into the machine.

I do note that there appears to be some progress in resolving this bug, but we need a kernel maintainer to pick up the ball on this.  Please see: https://bugzilla.kernel.org/show_bug.cgi?id=70761
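
One way to check for the "transmits but does not receive" symptom described above is to sample the interface packet counters twice, a few seconds apart. The Python sketch below assumes the usual Linux sysfs layout (/sys/class/net/<interface>/statistics/); the function names are illustrative, and the interface name p5p1 in the usage note is just the one from the log line in comment 2.

```python
from pathlib import Path

def read_counters(iface, sysfs_root="/sys/class/net"):
    """Read rx_packets and tx_packets for iface from its sysfs statistics directory."""
    stats = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats / name).read_text())
            for name in ("rx_packets", "tx_packets")}

def rx_stalled(before, after):
    """True if packets were sent between the two samples but none were received."""
    sent = after["tx_packets"] - before["tx_packets"]
    received = after["rx_packets"] - before["rx_packets"]
    return sent > 0 and received == 0
```

Usage would be along the lines of taking before = read_counters("p5p1"), pinging another host for a few seconds, taking after = read_counters("p5p1"), and then checking rx_stalled(before, after).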

Comment 21 Ldap Tester 2015-11-16 21:25:29 UTC
Apparently openSUSE has already produced a patch for this bug and it has been reported to work.  Please see https://bugzilla.kernel.org/show_bug.cgi?id=70761  Can we get a Fedora kernel maintainer to do the same for us?

Comment 22 Josh Boyer 2015-11-16 21:43:49 UTC
(In reply to Ldap Tester from comment #21)
> Apparently openSUSE has already produced a patch for this bug and it has

Actually, that isn't what is in the opensuse tracker.  All Takashi did is create a one-off side repo with a separate ALX module that contains the 3 patches from pru in the kernel.org bug.  Fedora doesn't provide kmod packages.

As the original patch author says in https://bugzilla.kernel.org/show_bug.cgi?id=70761#c31, the patches need to be reviewed upstream.

Comment 23 Ldap Tester 2015-11-16 23:20:58 UTC
Well, upstream isn't doing anything at all about this bug.

Comment 24 Josh Boyer 2015-11-17 14:12:54 UTC
It might be because upstream isn't actually aware of it.  The kernel.org bugzilla service isn't mandatory and isn't widely followed by a majority of the maintainers.  In any case, the patches still need to be sent to the actual maintainers on the netdev list.

Comment 25 Ldap Tester 2015-11-20 21:29:45 UTC
On September 14, I did send email to netdev.org, advising them of this bug.  I have received no response.

Comment 26 Jarod Wilson 2015-11-26 21:34:33 UTC
I'm chasing this some upstream now. I've got a laptop with a nic driven by alx, but it has no problems to speak of, so I can really only regression-test, if I can ever get a custom built kernel to actually boot -- currently tripping over a grub2 double-free whenever I try to boot a kernel I built with patches added (which are based on the ones in the upstream bug with some modifications).

Comment 27 Jarod Wilson 2015-12-02 16:28:34 UTC
I've attached a patch to the upstream bug that could use some testing. I've actually already got an F23 x86_64 build around here with the patch included that I could push somewhere for testing.

Comment 28 Ville Skyttä 2015-12-04 11:51:02 UTC
(In reply to Jarod Wilson from comment #27)
> I've
> actually already got an F23 x86_64 build around here with the patch included
> that I could push somewhere for testing.

If you can get it pushed somewhere (maybe a temporary copr repo?), I can take a look if I can test it.

Comment 29 Jarod Wilson 2015-12-07 22:24:49 UTC
(In reply to Ville Skyttä from comment #28)
> (In reply to Jarod Wilson from comment #27)
> > I've
> > actually already got an F23 x86_64 build around here with the patch included
> > that I could push somewhere for testing.
> 
> If you can get it pushed somewhere (maybe a temporary copr repo?), I can
> take a look if I can test it.

I've never set up copr before, I just shoved the local mock build onto prc:
  http://people.redhat.com/jwilson/kernels/alx/

Comment 30 Ville Skyttä 2015-12-08 08:31:22 UTC
(In reply to Jarod Wilson from comment #29)
> I've never set up copr before, I just shoved the local mock build onto prc:
>   http://people.redhat.com/jwilson/kernels/alx/

Seems to make a slight difference here, but doesn't seem to be a definitive fix. I did get network working once across several reboots and cable detach/attaches, but only once; otherwise it exhibits the same problems as 4.2.6-301.fc23.x86_64. Setting MTU to 9000 works the same as with other kernels and fixes connectivity.

Comment 31 Jarod Wilson 2015-12-09 04:45:06 UTC
Does it *only* work with a 9000 MTU, or do other things larger than 1500, smaller than 9000 also work? Wish I actually had affected hardware (and documentation) here... Kind of wondering how big the packets you receive with MTU set to 9000 actually are, might be able to figure out just how much padding might be needed. Might also need to write sans-header MTU into the hardware registers, not sure. Firing from the hip, in the dark, hoping to hit something. :)

Comment 32 Ville Skyttä 2015-12-09 16:53:53 UTC
I bisected the MTU some (with kernel-4.2.6-300.fc23.x86_64, not yours):

1500 does not work
1501 does not work
2000 does not work
2500 does not work
2999 works
3000 works
8000 works
9000 works

I can continue testing, just let me know what data you want and how I can get it. Feel also free to contact me in PM if you think that's more appropriate than discussing stuff here.
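
The MTU results above bracket the point where the link starts working. A small Python sketch that extracts that bracket from the listed data and suggests the next value to try; it only restates the numbers in this comment and assumes there is a single threshold.

```python
# Results as listed above: MTU -> whether the link worked.
results = {1500: False, 1501: False, 2000: False, 2500: False,
           2999: True, 3000: True, 8000: True, 9000: True}

largest_failing = max(mtu for mtu, ok in results.items() if not ok)
smallest_working = min(mtu for mtu, ok in results.items() if ok)

# The data is only consistent with a single threshold if every failing
# MTU is below every working one.
assert largest_failing < smallest_working

print(f"threshold lies in ({largest_failing}, {smallest_working}]")

# Halving the remaining interval gives the next MTU worth testing.
midpoint = (largest_failing + smallest_working) // 2
print("next MTU to try:", midpoint)
```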

Comment 33 Jarod Wilson 2016-01-07 19:07:39 UTC
(In reply to Ville Skyttä from comment #32)
> I bisected the MTU some (with kernel-4.2.6-300.fc23.x86_64, not yours):
> 
> 1500 does not work
> 1501 does not work
> 2000 does not work
> 2500 does not work
> 2999 works
> 3000 works
> 8000 works
> 9000 works
> 
> I can continue testing, just let me know what data you want and how I can
> get it. Feel also free to contact me in PM if you think that's more
> appropriate than discussing stuff here.

Exactly which hardware is it that you have? I've got two reports in the upstream bug that my patch (which is now in net-next, and should make its way into kernel 4.5) does help them considerably. One of them was an AR8162, not sure what the other was.

Something else you could try is adding that patch from net-next and increasing ALX_FRAME_PAD from 16 to larger values, to see if there's a reasonable larger padding that behaves for your hardware. We could then tweak the driver code to set different padding amounts for different hardware, if that solves this particular issue for your hardware.

Comment 34 Jarod Wilson 2016-02-04 17:52:21 UTC
I've got multiple reports from the upstream bug that the patch that went into 4.5 fixes things for 8161 and 8162 users.

Comment 35 Ville Skyttä 2016-02-04 19:15:31 UTC
Good to hear. BTW I don't think I ever got a mail for comment 33, so I didn't know there was a question waiting for me. HW here is ASUS N56VZ laptop and the NIC is:

$ lspci -nnv | grep -A 1 Ethernet 
04:00.0 Ethernet controller [0200]: Qualcomm Atheros AR8161 Gigabit Ethernet [1969:1091] (rev 08)
	Subsystem: ASUSTeK Computer Inc. N56VZ [1043:1477]
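
Since several reports in this bug differ only in device ID and revision, a short Python sketch for pulling those fields out of an lspci -nn device line like the one above may help when comparing hardware. The regular expression and function name are illustrative, and the pattern assumes the "[vendor:device] (rev NN)" tail shown in this output.

```python
import re

# Matches the "[vendor:device] (rev NN)" tail of an `lspci -nn` device line.
# The class code, e.g. "[0200]", has no colon and is therefore skipped.
ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]\s+\(rev ([0-9a-f]{2})\)")

def parse_pci_id(line):
    """Return (vendor, device, revision) as integers, or None if no ID is found."""
    match = ID_RE.search(line)
    if match is None:
        return None
    vendor, device, rev = (int(group, 16) for group in match.groups())
    return vendor, device, rev

line = ("04:00.0 Ethernet controller [0200]: Qualcomm Atheros AR8161 "
        "Gigabit Ethernet [1969:1091] (rev 08)")
vendor, device, rev = parse_pci_id(line)
print(f"vendor={vendor:04x} device={device:04x} rev={rev:02x}")
```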

Comment 36 Jarod Wilson 2016-02-04 19:52:16 UTC
(In reply to Ville Skyttä from comment #35)
> Good to hear. BTW I don't think I ever got a mail for comment 33, so I
> didn't know there was a question waiting for me. HW here is ASUS N56VZ
> laptop and the NIC is:
> 
> $ lspci -nnv | grep -A 1 Ethernet 
> 04:00.0 Ethernet controller [0200]: Qualcomm Atheros AR8161 Gigabit Ethernet
> [1969:1091] (rev 08)
> 	Subsystem: ASUSTeK Computer Inc. N56VZ [1043:1477]

Hm. Just got another report today specifically for an AR8161 that said it worked perfectly without the MTU work-around as of 4.5, so not sure why yours would still be misbehaving. :\

Can you try straight 4.5-rc2 or later, and possibly with an increased ALX_FRAME_PAD, if things still misbehave?

I could really use some hardware in front of me that reproduces the issue...

Comment 37 Ville Skyttä 2016-02-04 20:07:30 UTC
(In reply to Jarod Wilson from comment #36)
> Hm. Just got another report today specifically for an AR8161 that said it
> worked perfectly without the MTU work-around as of 4.5, so not sure why
> yours would still be misbehaving. :\

Hm, there might be a misunderstanding here. I haven't tried with anything newer than 4.3.4-300.fc23.x86_64, so no idea whether it's misbehaving with 4.5 or not.

> Can you try straight 4.5-rc2 or later, and possibly with an increased
> ALX_FRAME_PAD, if things still misbehave?

I'll see if I can get 4.5.0-0.rc2.git2.1.fc24 from koji installed first, hopefully testing that is useful. Haven't tried building vanilla kernels myself in a long time and would rather stick with packaged ones.

Comment 38 Jarod Wilson 2016-02-04 23:21:04 UTC
(In reply to Ville Skyttä from comment #37)
> (In reply to Jarod Wilson from comment #36)
> > Hm. Just got another report today specifically for an AR8161 that said it
> > worked perfectly without the MTU work-around as of 4.5, so not sure why
> > yours would still be misbehaving. :\
> 
> Hm, there might be a misunderstanding here. I haven't tried with anything
> newer than 4.3.4-300.fc23.x86_64, so no idea whether it's misbehaving with
> 4.5 or not.

Oh, yes, my mistake, I hadn't read closely enough. I thought you were still seeing the problems while using a build with that patch included.

> > Can you try straight 4.5-rc2 or later, and possibly with an increased
> > ALX_FRAME_PAD, if things still misbehave?
> 
> I'll see if I can get 4.5.0-0.rc2.git2.1.fc24 from koji installed first,
> hopefully testing that is useful. Haven't tried building vanilla kernels
> myself in a long time and would rather stick with packaged ones.

Yes, that should be useful for testing out the fix. Fingers crossed that it works! :)

Comment 39 Ville Skyttä 2016-02-05 06:33:50 UTC
(In reply to Jarod Wilson from comment #38)
> > I'll see if I can get 4.5.0-0.rc2.git2.1.fc24 from koji installed first,
> > hopefully testing that is useful. Haven't tried building vanilla kernels
> > myself in a long time and would rather stick with packaged ones.
> 
> Yes, that should be useful for testing out the fix. Fingers crossed that it
> works! :)

In its unmodified form, it doesn't help. (I'm just testing switching between MTU automatic and MTU 4000, not anything in between. 4000 works with everything I have.)

Bumping ALX_FRAME_PAD to 32 doesn't help either, but 128 seems to make things work. I'm currently doing a build with it set to 64 and will test that once ready. Builds tend to take quite a long time, so bisecting this way is quite slow progress. I'm doing it in mock (and ccache does help a lot, but anyway) -- what's the command I could run in the chroot just to get the modified alx.ko recompiled? I'd just modify and recompile it in the last build tree which is still sitting there and could work with just the recompiled module instead of doing a full rpm rebuild.

Maybe we should take the discussion to private mail, wonder if people are getting annoyed already? Anyway I don't mind Bugzilla, but whichever way, let me know the above and how much bisecting to find the right ALX_FRAME_PAD makes sense or where should I stop?

Comment 40 Ville Skyttä 2016-02-08 06:32:37 UTC
I've tested a bunch of different ALX_FRAME_PAD values now with 4.5.0-0.rc2.git2.1.fc24, results:

32, 64, 96, 112, 128, 160: Works for some time, then stops working, may start working again later, break again etc. 128 seems to work longest (IIRC even some tens of minutes or so), then 112. Others stop working almost immediately (within seconds).

16, 256: Does not seem to work at all.

Comment 41 Jarod Wilson 2016-02-09 05:56:35 UTC
I have no problem continuing the discussion in bugzilla, it could help others if we continue the discussion out in the open.

I'd have to go poke around for exact steps, but loosely... If you have kernel-devel for the running kernel installed, do an rpmbuild -bp of its src.rpm, and go down into the alx directory of the prepped tree, you can invoke make to build the alx kernel module against kernel-devel, and build an alx.ko that you can load into the running kernel.

Section 2.1 of this actually covers it pretty well:

https://www.kernel.org/doc/Documentation/kbuild/modules.txt

I believe it's kernel-devel that lays down /lib/modules/`uname -r`/build for you.
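
As a rough illustration of the steps above, the Python sketch below only assembles the make invocation described in the kbuild document (make -C <kernel build dir> M=<module dir>) and prints it; it does not run anything. The function name is made up, and the module directory is an assumption taken from the path in the patch in comment 17, relative to wherever the prepped tree lives.

```python
import os
import platform

def module_build_command(module_dir, release=None):
    """Return the argv for building an external module against installed kernel headers."""
    # platform.release() reports the running kernel release, like `uname -r`.
    release = release or platform.release()
    build_dir = f"/lib/modules/{release}/build"
    return ["make", "-C", build_dir, f"M={os.path.abspath(module_dir)}", "modules"]

# Assumed location of the alx sources inside the prepped kernel tree.
cmd = module_build_command("drivers/net/ethernet/atheros/alx")
print(" ".join(cmd))
```

Printing the command rather than executing it keeps the sketch harmless on machines without kernel-devel installed.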

Comment 42 Jarod Wilson 2016-02-09 05:58:20 UTC
(In reply to Ville Skyttä from comment #40)
> I've tested a bunch of different ALX_FRAME_PAD values now with
> 4.5.0-0.rc2.git2.1.fc24, results:
> 
> 32, 64, 96, 112, 128, 160: Works for some time, then stops working, may
> start working again later, break again etc. 128 seems to work longest (IIRC
> even some tens of minutes or so), then 112. Others stop working almost
> immediately (within seconds).
> 
> 16, 256: Does not seem to work at all.

Weird. I'm not quite sure yet what to make of those results... I'll try to take another look at the alx driver tomorrow with that information in mind and see if anything jumps out...

Comment 43 Ldap Tester 2016-03-01 21:20:49 UTC
Jarod,
I just tried kernel-4.5.0-0.rc5.git0.2.fc25.x86_64 from fedora rawhide.  It did not solve my problem.  Same symptoms as I have reported before.  I have a Dell 2710 All in One with an AR8161.

Comment 44 Jarod Wilson 2016-03-01 22:57:36 UTC
Got a little sidetracked. Have been comparing some of the alx code with other drivers, and noticed many other drivers have rx buffer alignments of 1024, 2048 or 4096, while the alx driver is using only 8... So, maybe:

#define ALX_MAX_FRAME_LEN(_mtu) (ALIGN((ALX_RAW_MTU(_mtu) + ALX_FRAME_PAD), 8)) 

  |
  v

#define ALX_MAX_FRAME_LEN(_mtu) (ALIGN((ALX_RAW_MTU(_mtu) + ALX_FRAME_PAD), 1024))

(or 2048 or 4096 instead of 1024)

Other than that, I've got nothing. Wish I had affected hardware myself to poke at.
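
To see what the alignment change above would do to the receive buffer size, here is a small Python sketch that mirrors the macro arithmetic. It assumes ALX_RAW_MTU adds the Ethernet header, FCS and VLAN tag lengths to the MTU, which is not shown in this bug, and it uses ALX_FRAME_PAD = 16 as mentioned in comment 33. The function names are illustrative.

```python
ETH_HLEN = 14     # Ethernet header length in bytes
ETH_FCS_LEN = 4   # frame check sequence
VLAN_HLEN = 4     # 802.1Q VLAN tag

def align(value, boundary):
    """Round value up to a multiple of boundary (boundary must be a power of two)."""
    return (value + boundary - 1) & ~(boundary - 1)

def max_frame_len(mtu, pad=16, boundary=8):
    """Mirror of ALX_MAX_FRAME_LEN, under the assumed definition of ALX_RAW_MTU."""
    raw_mtu = mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN
    return align(raw_mtu + pad, boundary)

# Buffer size at the default MTU for the current and the proposed alignments.
for boundary in (8, 1024, 2048, 4096):
    print(boundary, max_frame_len(1500, boundary=boundary))
```

Under these assumptions, a 1024-byte or larger alignment changes the buffer size far more than the ALX_FRAME_PAD values tested earlier, so the two experiments are not equivalent.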

Comment 45 Jarod Wilson 2016-03-01 23:05:59 UTC
Just noticed something else that could be relevant to the AR8161. In alx_init_sw(), there's some junk that sets alx->hw.lnk_patch for the AR8161 with device ID 0091 and revision 0, and causes some different register settings to be applied. That bit has been there since the driver first appeared, wondering if there are additional AR8161 devices that need the same treatment. Ville's appears to be an AR8161 with device ID 1091, revision 8 though, which I suspect is far newer, and probably less likely to need what look like work-arounds for flaky early chip versions, but I'm just theorizing here.

Comment 46 Ville Skyttä 2016-03-02 07:15:50 UTC
Tried with different ALX_MAX_FRAME_LEN values along with some different ALX_FRAME_PAD combinations, ditto (separately) setting alx->hw.lnk_patch unconditionally to true, but unfortunately still no joy.

Comment 47 Ville Skyttä 2016-03-02 07:16:39 UTC
(Instructions in comment 41 make testing things a breeze though, thanks for that and the continuous stream of ideas to try out!)

Comment 48 Radek Valasek 2016-03-05 12:40:39 UTC
*** Bug 1305243 has been marked as a duplicate of this bug. ***

Comment 49 Radek Valasek 2016-03-05 13:10:21 UTC
Found a working solution (https://bbs.archlinux.org/viewtopic.php?id=201459). It's most probably related to the jumbo frame setup. I set the MTU to 9000 and my wired connection now works without issues.

Comment 50 Ville Skyttä 2016-06-06 05:55:46 UTC
Problems persist with 4.5.5-201.fc23.x86_64

Comment 51 Jarod Wilson 2016-07-25 16:27:28 UTC
The 4.7 kernel should carry two additional patches from Intel's Feng Tang, which will hopefully finally resolve things for everyone:

commit 26c5f03b2ae8018418ceb25b2e6a48560e8c2f5b
Author: Feng Tang <feng.tang>
Date:   Wed May 25 14:49:54 2016 +0800

    net: alx: use custom skb allocator

commit 881d0327db37ad917a367c77aff1afa1ee41e0a9
Author: Feng Tang <feng.tang>
Date:   Sun Jun 12 17:36:37 2016 +0800

    net: alx: Work around the DMA RX overflow issue

I believe these may also be headed to -stable trees.

Comment 52 Ville Skyttä 2016-08-09 09:22:56 UTC
(In reply to Jarod Wilson from comment #51)
> The 4.7 kernel should carry two additional patches from Intel's Feng Tang,
> which will hopefully finally resolve things for everyone:

Tested 4.7.0-2.fc25.x86_64 on F-23: seems fixed for me!

Comment 53 Jarod Wilson 2016-08-11 14:45:29 UTC
Going to go ahead and close this bug then. I see at least one of the two patches already in 4.6.3 (and 4.6.6 was just released), I believe it went to older stable branches too, and it's definitely in 4.7 and later.

Comment 54 Ville Skyttä 2016-08-11 17:46:45 UTC
Ah yes, seems to work for me with at least 4.6.6-200.fc23.x86_64 too.

Comment 55 Dave M 2016-08-12 17:10:46 UTC
Just to clarify, we'll no longer need things like "MTU=9000", right?

Thanks,
Dave M

Comment 56 Ville Skyttä 2016-08-12 17:37:40 UTC
(In reply to Dave M from comment #55)
> Just to clarify, we'll no longer need things like "MTU=9000", right?

Yes, that's what I mean by "works".

Comment 57 Guruprasad 2018-05-04 07:54:20 UTC
I also have the same problem with kernel version 3.10.0-693, with these network cards:
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-V
02:00.0 Ethernet controller: Qualcomm Atheros AR8161 Gigabit Ethernet (rev 10)

Below are the logs I am getting:

Apr 30 15:56:05 localhost kernel: alx 0000:02:00.0 enp2s0: fatal interrupt 0x4001607, resetting
Apr 30 15:56:05 localhost kernel: alx 0000:02:00.0 enp2s0: fatal interrupt 0x4001607, resetting
Apr 30 15:56:05 localhost kernel: alx 0000:02:00.0 enp2s0: fatal interrupt 0x4001607, resetting