Bug 632650 - Intel 82574L NIC failure (e1000e module)
Summary: Intel 82574L NIC failure (e1000e module)
Status: CLOSED DUPLICATE of bug 562273
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
Reported: 2010-09-10 16:03 UTC by Akemi Yagi
Modified: 2014-06-29 23:02 UTC (History)
Doc Type: Bug Fix
Last Closed: 2010-10-26 19:26:40 UTC

Attachments
lspci -vvv snapshot (47.04 KB, text/plain)
2012-11-27 19:19 UTC, R G

Description Akemi Yagi 2010-09-10 16:03:19 UTC
Description of problem:
[ This issue was originally reported by user dma in the ELRepo bug tracker at
http://elrepo.org/bugs/view.php?id=69 ]

A brief summary of the report by dma is as follows.

After a random (?) amount of time / traffic, the Intel 82574L-based onboard NIC (e1000e module) will fail catastrophically. The only way to "fix" the fault is to reboot. I have tested this on two different (but identical) machines, and on three physically different networks with different switching equipment, cabling, and traffic profiles. The problem has (at least) two components :

1. Amount / type of traffic on LAN / segment.
2. Boot media.

Version-Release number of selected component (if applicable):
kernel-2.6.18-x.y.z.el5

How reproducible: Always

Steps to Reproduce:
1. Boot a system with Intel 82574L NIC (e1000e module) 
2. Use network
  
Actual results: The NIC fails with errors.


Expected results: No errors.


Additional info:
As detailed in the ELRepo bug tracker, the original reporter, dma, tried the latest e1000e driver (1.2.10_NAPI) from Intel and confirmed that that version, but no earlier version, fixed the problem. The e1000e kernel module in the RHEL-5 kernel is at version 1.0.2-k3. The OP's NIC has the Vendor:Device pairing 8086:10d3, which is listed as 'supported' by the e1000e module:

$ grep 8086 /lib/modules/*/modules.alias | grep -i 10d3
/lib/modules/2.6.18-194.el5/modules.alias:alias pci:v00008086d000010D3sv*sd*bc*sc*i* e1000e
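
The same lookup can be scripted for any Vendor:Device pair. The following is only a sketch: the function names pci_alias_pattern and module_for_pci_id are made up for illustration, and it assumes the usual "alias <pattern> <module>" line layout shown in the grep output above.

```shell
#!/bin/sh
# Build the modalias prefix that modules.alias uses for a PCI vendor:device
# pair. IDs are zero-padded to 8 hex digits and upper-cased,
# e.g. 8086 10d3 -> pci:v00008086d000010D3
# (hypothetical helper name, not part of any tool)
pci_alias_pattern() {
    printf 'pci:v%08Xd%08X\n' "0x$1" "0x$2"
}

# Print the module(s) whose alias starts with that prefix.
# $1 = vendor ID, $2 = device ID, $3 = path to a modules.alias file
module_for_pci_id() {
    pattern=$(pci_alias_pattern "$1" "$2")
    awk -v p="$pattern" \
        '$1 == "alias" && index($2, p) == 1 { print $3 }' "$3" | sort -u
}
```

On a live system it could be called as: module_for_pci_id 8086 10d3 "/lib/modules/$(uname -r)/modules.alias"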

ELRepo now offers the updated kmod-e1000e package ( http://elrepo.org/tiki/kmod-e1000e ). The OP verified that it rectified the issue.

Please update the kernel e1000e module to version 1.2.10.

Comment 1 Akemi Yagi 2010-09-10 17:17:23 UTC
The RHEL 6beta2 refresh kernel 2.6.32-44.2.el6 has version 1.2.7-k2. Hope the GA release gets the latest driver.

Comment 3 Andy Gospodarek 2010-09-11 00:54:08 UTC
Akemi, we do not and will not include the Intel driver from SourceForge in RHEL5 or RHEL6.

If the fix needed is already included in the upstream driver (read: one that is in Linus' or Dave Miller's git trees on kernel.org) please let us know the commit ID and we can make sure this is included as soon as possible.

If it is not already available upstream, we will have to wait until it is made available before we can use it in RHEL or Fedora.  Thanks!

Comment 4 Alan Bartlett 2010-09-11 09:37:33 UTC
(In reply to comment #3)

<snip>
> will not include the Intel driver from SourceForge in RHEL5 or RHEL6.
<snip>

Just to correct the misunderstanding, the sources used were direct from Intel and *not* SourceForge.

(And, for the record, the ELRepo Project will always use the original manufacturer's sources.)

Comment 5 Jon Masters 2010-09-11 17:30:30 UTC
Alan: btw, I pinged a few folks after I got your IRC about this. Looks like it's all in hand, so I'll leave it with you guys.

Comment 6 Akemi Yagi 2010-09-11 18:06:57 UTC
(In reply to comment #3)
 
> If the fix needed is already included in the upstream driver (read: one that is
> in Linus' or Dave Miller's git trees on kernel.org) please let us know the
> commit ID and we can make sure this is included as soon as possible.
> 
> If it is not already available upstream, we will have to wait until it is made
> available before we can use it in RHEL or Fedora.  Thanks!

Thanks for the note. To summarize, 

(1) RHEL/Fedora pulls code only from kernel.org. ( If/when I see patches/fixes in the git tree, I will update this thread. )

(2) ELRepo provides the latest driver available from the manufacturers. Until RHEL gets the updated driver, the affected users can count on packages from ELRepo. :)

Comment 7 Alan Bartlett 2010-09-12 15:04:26 UTC
(In reply to comment #5)
> Alan: btw, I pinged a few folks after I got your IRC about this. Looks like
> it's all in hand, so I'll leave it with you guys.

Thanks, Jon. 

However, it was toracat (Akemi), not burakkucat (me), who pinged you on IRC. :-)

Comment 8 Jon Masters 2010-09-12 15:35:56 UTC
Ah yes. I suck at IRC names. That's why I'm "jcm" or "jonmasters" :)

Comment 9 Andy Gospodarek 2010-09-13 13:50:00 UTC
(In reply to comment #5)
> Alan: btw, I pinged a few folks after I got your IRC about this. Looks like
> it's all in hand, so I'll leave it with you guys.

Please elaborate.

Comment 10 Andy Gospodarek 2010-09-13 14:12:18 UTC
(In reply to comment #6)
> (In reply to comment #3)
> 
> > If the fix needed is already included in the upstream driver (read: one that is
> > in Linus' or Dave Miller's git trees on kernel.org) please let us know the
> > commit ID and we can make sure this is included as soon as possible.
> > 
> > If it is not already available upstream, we will have to wait until it is made
> > available before we can use it in RHEL or Fedora.  Thanks!
> 
> Thanks for the note. To summarize, 
> 
> (1) RHEL/Fedora pulls code only from kernel.org. ( If/when I see patches/fixes
> in the git tree, I will update this thread. )

Correct.
 
> (2) ELRepo provides the latest driver available from the manufacturers. Until
> RHEL gets the updated driver, the affected users can count on packages from
> ELRepo. :)

I'm obviously fine with that, but we do work extremely hard with Intel to support all hardware that will be out before the next update with the driver included in RHEL.

If you find that hardware support is lacking in the latest update or a critical bug fix is missing, feel free to do as you have done here and open a bug on bugzilla.redhat.com.  You can assign any network driver bugs to me (agospoda@redhat.com) and I will fix them or get them routed to the correct person.

(In reply to comment #4)
> (In reply to comment #3)
> 
> <snip>
> > will not include the Intel driver from SourceForge in RHEL5 or RHEL6.
> <snip>
> 
> Just to correct the misunderstanding, the sources used were direct from Intel
> and *not* SourceForge.
> 
> (And, for the record, the ELRepo Project will always use the original
> manufacturer's sources.)

Sorry for the confusion on the source of the driver.  Intel regularly refers to the driver they distribute from sourceforge here:

http://sourceforge.net/projects/e1000/

as their official driver.  They might offer a download location from intel.com as well, but I suspect the sources are identical.

Comment 11 Akemi Yagi 2010-09-13 14:22:09 UTC
(In reply to comment #10)

> If you find that hardware support is lacking in the latest update or a
> critical bug fix is missing, feel free to do as you have done here and open a
> bug on bugzilla.redhat.com.  You can assign any network driver bugs to me
> (agospoda@redhat.com) and I will fix them or get them routed to the correct
> person.

Thanks. Will do.

> (In reply to comment #4)
> > (In reply to comment #3)

> > (And, for the record, the ELRepo Project will always use the original
> > manufacturer's sources.)
> 
> Sorry for the confusion on the source of the driver.  Intel regularly refers to
> the driver they distribute from sourceforge here:
> 
> http://sourceforge.net/projects/e1000/
> 
> as their official driver.  They might offer a download location from intel.com
> as well, but I suspect the sources are identical.

Yes, I noticed that the tarball from sourceforge and that from the Intel site were identical when I did the packaging for ELRepo's kmod-e1000e. Your note explains it.

Comment 12 Andy Gospodarek 2010-10-26 19:26:40 UTC
We are shipping an updated e1000e driver that contains all of the upstream (not necessarily Intel's driver) fixes that currently exist.  If you can pass along these kernels to your users I suspect they will resolve the issue.  The test kernels are available here:

http://people.redhat.com/agospoda/#rhel5

I am going to close this as a duplicate of the bug that added the driver update, but please reopen this if it does not resolve their issue.  Thanks!

*** This bug has been marked as a duplicate of bug 562273 ***

Comment 13 Alan Bartlett 2010-10-27 00:30:41 UTC
(In reply to comment #12)
> We are shipping an updated e1000e driver that contains all of the upstream (not
> necessarily Intel's driver) fixes that currently exist.  If you can pass along
> these kernels to your users I suspect they will resolve the issue.  The test
> kernels are available here:
> 
> http://people.redhat.com/agospoda/#rhel5
> 
> I am going to close this as a duplicate of the bug that added the driver
> update, but please reopen this if it does not resolve their issue.  Thanks!
> 
> *** This bug has been marked as a duplicate of bug 562273 ***

This is another case of being provided with a bug reference and upon attempting to read it, seeing --

"You are not authorized to access bug #562273."

As Homer (Jay Simpson) is prone to utter "D'oh!".  :-)

Comment 14 Andy Gospodarek 2010-10-27 00:45:05 UTC
(In reply to comment #13)
> 
> This is another case of being provided with a bug reference and upon attempting
> to read it, seeing --
> 
> "You are not authorized to access bug #562273."
> 
> As Homer (Jay Simpson) is prone to utter "D'oh!".  :-)

Ugh, sorry about that.  There isn't really anything special in that bug now except an indication that we updated it to upstream version 1.2.7-k2 (currently the latest), and have all needed upstream changes through:

19833b5dffe2f2e92a1b377f9aae9d5f32239512 e1000e: disable ASPM L1 on 82573

That bug probably started out as one with limited access because it may contain information about the existence of hardware that didn't exist at the time of bug creation.

Comment 15 Charlie Brady 2011-08-03 22:51:54 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > 
> > This is another case of being provided with a bug reference and upon attempting
> > to read it, seeing --
> > 
> > "You are not authorized to access bug #562273."
> > 
> > As Homer (Jay Simpson) is prone to utter "D'oh!".  :-)
> 
> Ugh, sorry about that.

It's great that you're sorry about that, but is there something which Red Hat can do to fix this recurring problem? It's very, very frustrating when you say "problem fixed" and then show us a black box.

Comment 16 Akemi Yagi 2011-08-03 23:24:34 UTC
From what I can see, the driver version of e1000e in the latest kernel is:

EL5: 1.3.10-k2
EL6: 1.2.20-k2
ELRepo: 1.4.4

According to the original report, version 1.2.10 solved the issue. So, I think that the issue has been fixed both in EL5 and EL6.

Comment 17 Andy Gospodarek 2011-08-04 18:42:32 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > 
> > Ugh, sorry about that.
> 
> It's great that you're sorry about that, but is there something which Red Hat
> can do to fix this recurring problem? It's very, very frustrating when you say
> "problem fixed" and then show us a black box.

Thanks for that feedback, Charlie.

I don't know if this is the case for all bugs that you are seeing, but the bug referenced is a private bug because it contains some information that was considered confidential at the time it was created.  The non-confidential information that would be available in the bug would be the list of upstream patches that were contained in the update.  Would that be helpful in this case?

Comment 18 Charlie Brady 2011-08-04 18:52:49 UTC
(In reply to comment #17)

> considered confidential at the time it was created.  The non-confidential
> information that would be available in the bug would be the list of upstream
> patches that were contained in the update.  Would that be helpful in this case?

That would be helpful, but I would also expect to see a kernel version reference and/or a reference to an update advisory.

Rather than dup a public bug to a private one, could the private one not be cloned (with no confidential information in the cloned public bug), and then both public and private bugs be duped to the public one which can be tracked to resolution?

Comment 19 Andy Gospodarek 2011-08-22 18:29:58 UTC
(In reply to comment #18)
> 
> That would be helpful, but I would also expect to see a kernel version
> reference and/or a reference to an update advisory.
> 
> Rather than dup a public bug to a private one, could the private one not be
> cloned (with no confidential information in the cloned public bug), and then
> both public and private bugs be duped to the public one which can be tracked to
> resolution?

Thanks, Charlie.  I'll see what our product management folks think about how we could do this so more information is intentionally publicly available.

Comment 20 Charlie Brady 2011-08-22 19:56:46 UTC
Thanks Andy. I know I wouldn't be the only one grateful if you could do that.

Comment 21 Levente Farkas 2011-09-28 09:30:44 UTC
At least can you tell us in which kernel/RHEL release this bug will be fixed?

Comment 22 Andy Gospodarek 2011-09-28 13:39:08 UTC
(In reply to comment #21)
> At least can you tell us in which kernel/RHEL release this bug will be fixed?

Sure, this was fixed in kernel-2.6.18-221.el5 (which was a RHEL5.6 devel kernel), so it is fixed in all RHEL5.6 kernels (since the released kernel for RHEL5.6 was kernel-2.6.18-238.el5).
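
For anyone who wants to check a RHEL5 box against that build number, here is a rough sketch. The helper names el5_build_number and has_e1000e_update are hypothetical, and it assumes the release string has the form printed by uname -r on RHEL5 (for example 2.6.18-238.el5).

```shell
#!/bin/sh
# Extract the RHEL5 build number from a kernel release string,
# e.g. "2.6.18-238.el5" -> 238. Fails for strings that are not 2.6.18-*.el5*.
# (hypothetical helper name)
el5_build_number() {
    case "$1" in
        2.6.18-*.el5*) ;;
        *) return 1 ;;
    esac
    rest=${1#2.6.18-}            # "238.el5" or "238.1.1.el5"
    printf '%s\n' "${rest%%.*}"  # keep only the leading build number
}

# Succeeds when the release is at or after build 221, the first kernel
# named above as carrying the e1000e update. Returns 2 for non-el5 input.
has_e1000e_update() {
    build=$(el5_build_number "$1") || return 2
    [ "$build" -ge 221 ]
}
```

Typical use would be: has_e1000e_update "$(uname -r)" && echo "driver update present"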

Comment 23 Levente Farkas 2011-09-28 15:38:52 UTC
The same problem exists in rhel-6.1. When will it be fixed in rhel-6? Or should I open a new bz?

Comment 24 L.G. 2012-01-11 21:44:48 UTC
Same issue in RHEL 6.2 (of course)!

Comment 25 Charlie Brady 2012-01-11 21:55:38 UTC
(In reply to comment #23)
> The same problem exists in rhel-6.1. When will it be fixed in rhel-6? Or should I
> open a new bz?

You should open a new bz.

Comment 26 Charlie Brady 2012-01-11 21:56:32 UTC
(In reply to comment #24)
> Same issue in RHEL 6.2 (of course)!

Not 'of course'. But perhaps 'not surprisingly'. Hopefully somebody at RH is now listening.

Comment 27 Andy Gospodarek 2012-01-11 22:38:21 UTC
We are always here listening.  Some of us more often than others for CLOSED bugs.

I would be *extremely* surprised if this issue was fixed in RHEL5.6 and not fixed in RHEL6.1, but maybe it wasn't actually fixed in RHEL5.6?  The reporter who was on this bug never said it was not fixed in RHEL5.6, so I have no way to know if it was not.

Please open a new bug if you are still seeing issues on RHEL6.2 with the latest RHEL drivers.  You can assign it to dnelson@redhat.com, and add me to the cc-list.

Detailed reproduction steps as well as the system seeing the issue will be needed to properly debug it.  A sosreport would not hurt either.

Comment 28 Akemi Yagi 2012-01-11 22:56:53 UTC
(In reply to comment #16)
 
> EL5: 1.3.10-k2
> EL6: 1.2.20-k2
> ELRepo: 1.4.4
> 
> According to the original report, version 1.2.10 solved the issue. So, I think
> that the issue has been fixed both in EL5 and EL6.

Let me update the info for the record:

EL5.7: 1.3.10-k2
EL6.2: 1.4.4-k
ELRepo: 1.9.5

Comment 29 Andy Gospodarek 2012-01-13 22:53:05 UTC
(In reply to comment #28)
> 
> Let me update the info for the record:
> 
> EL5.7: 1.3.10-k2
> EL6.2: 1.4.4-k
> ELRepo: 1.9.5

Was this a problem you could reproduce or a problem that someone else reported to you?

Comment 30 Akemi Yagi 2012-01-14 17:05:09 UTC
I filed the original report on behalf of people who had a problem with the e1000e module in the RHEL kernel. The kmod package provided by ELRepo was a temporary solution. According to the info available to me, the issue was fixed in version 1.2.10. Therefore the current RHEL kernels should not present the problem this bug report dealt with.

So, I suggest users having issues with the e1000e driver open a new bug report.

Comment 31 v 2012-02-04 20:04:19 UTC
I'm new here, but I came across this bug report, since it exactly describes the same problem I have in EL6.2:
I have a board with two 82574L adapters (actually I have two of these main boards, with exactly the same problem each). The eth0 adapter works fine initially, but after an undetermined amount of traffic, the adapter suddenly stops working. 

I'm very familiar with ip addr add / ip route add / arp, and I'm very certain that the problem is somewhere with the adapter or the driver. And since the adapter on the same host is doing fine in another OS (W XP)..., and the RH6.2 does fine on other hosts (with different type/brand nics), I must conclude that the problem is in the nic driver.

In any case I'd be happy to assist to find the cause of this issue, but so far I haven't been able to find anything, and additionally I was unable to get the adapter working, except by rebooting. ip link set dev eth0 down / up, doesn't work, modprobe -r doesn't work either.

01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

e1000e driver version: 1.4.4-k (kernel 2.6.32-220.4.1.el6.i686)

Is there anything I can check when this adapter is in this non-communicative state?

Comment 33 v 2012-02-04 20:41:33 UTC
Thank you for your response. I had to find out how to install this ELRepo e1000e driver, but I think I got it right.

Unfortunately, this newer ElRepo version does not change the behaviour. The adapter still goes in a state where no frames are being transmitted.

ethtool -i eth0:
  driver: e1000e
  version: 1.9.5-NAPI (the original version read 1.4.4-k here)
  firmware-version: 0.3-0
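
When comparing reports like this one it helps to pull the driver name and version out of the ethtool -i output mechanically. This is a sketch only: driver_and_version and version_flavor are made-up names, and the "-k means in-kernel, NAPI means Intel's standalone tarball" rule is an assumption drawn from the version strings quoted in this bug, not something ethtool guarantees.

```shell
#!/bin/sh
# Read `ethtool -i <iface>` output on stdin and print "<driver> <version>".
# Leading indentation is tolerated; the firmware-version line is ignored.
# (hypothetical helper name)
driver_and_version() {
    awk -F': *' '
        $1 ~ /^[ \t]*driver$/  { driver = $2 }
        $1 ~ /^[ \t]*version$/ { version = $2 }
        END { if (driver != "") print driver, version }
    '
}

# Guess where a driver build came from, based on its version suffix.
# ASSUMPTION: suffixes as seen in this report (1.4.4-k, 1.0.2-k3, 1.9.5-NAPI).
version_flavor() {
    case "$1" in
        *-k|*-k[0-9]*) echo in-kernel ;;
        *NAPI*)        echo vendor ;;
        *)             echo unknown ;;
    esac
}
```

For example: ethtool -i eth0 | driver_and_version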

Comment 34 Akemi Yagi 2012-02-04 21:02:00 UTC
This is the Red Hat bug tracker, so may not be the best place to get help. You might want to try the Intel Wired Ethernet site at sourceforge.net:

http://sourceforge.net/tracker/?group_id=42302&atid=447449

Comment 35 Andy Gospodarek 2012-02-06 14:14:23 UTC
(In reply to comment #34)
> This is the Red Hat bug tracker, so may not be the best place to get help. You
> might want to try the Intel Wired Ethernet site at sourceforge.net:
> 
> http://sourceforge.net/tracker/?group_id=42302&atid=447449

Really?!?!?!?!  Since the user is using RHEL6.2, this *is* the best place to get direct access to RH and Intel developers -- especially for those with active subscriptions.

The difficult part is that this is a closed bug for RHEL5, so it is unlikely to receive as much attention as an open one.

v@penguinmail.com, feel free to open a new bug with details and assign it to me (agospoda@redhat.com) and I can offer some debugging suggestions.

Comment 36 v 2012-02-06 14:52:25 UTC
Considering the track record of this bug, the number of people experiencing the same problem with this driver in combination with Intel 82574L, with all sorts of different drivers, I think there is no point in reporting it, since it is unlikely that it will ever be fixed. And it is not RHEL specific either. CentOS 6.2 has the problem (no surprise), and also all versions of Fedora as of 10, and probably one version earlier. I think the problem did not exist in FC8.

Comment 37 Andy Gospodarek 2012-02-06 15:16:53 UTC
(In reply to comment #36)
> Considering the track record of this bug, the number of people experiencing the
> same problem with this driver in combination with Intel 82574L, with all sorts
> of different drivers, I think there is no point in reporting it, since it is
> unlikely that it will ever be fixed. And it is not RHEL specific either. CentOS
> 6.2 has the problem (no surprise), and also all versions of Fedora as of 10, and
> probably one version earlier. I think the problem did not exist in FC8.

Based on comment #31 this is a surprise, but I will respect your wishes.

If you decide that you want to test it on the latest Fedora (Fedora 16) with a pure upstream driver you can always run one of the liveUSB[1] images and see how it works when booting from a USB key.  You may know about these already, but I figured I would mention it anyway as it can be a nice testing/debugging tool.

You also might want to check your BIOS and see if ASPM is completely disabled and boot with 'pcie_aspm=off' added to the kernel command line.  We saw lots of problems with ASPM on systems made in the last few years, and making sure the kernel and BIOS both agreed it was actually disabled was helpful.  I thought most of these were resolved in RHEL6.2, but there is a chance they are not all fixed.

1. Go to http://fedoraproject.org/wiki/How_to_create_and_use_Live_USB and follow the 'Quick Start' instructions.
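
To confirm that the running kernel really was booted with the parameter, the command line can be checked directly. A minimal sketch follows; the function name cmdline_has_aspm_off is hypothetical, and it only looks for the exact word pcie_aspm=off, so it says nothing about what the BIOS has configured.

```shell
#!/bin/sh
# Succeeds if the given kernel command line contains the exact,
# space-delimited word "pcie_aspm=off". (hypothetical helper name)
cmdline_has_aspm_off() {
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *)                   return 1 ;;
    esac
}
```

On a running system: cmdline_has_aspm_off "$(cat /proc/cmdline)" && echo "booted with pcie_aspm=off". To make the setting persistent on RHEL, one option (an assumption about the reader's setup) is grubby --update-kernel=ALL --args="pcie_aspm=off", followed by a reboot.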

Comment 38 v 2012-02-13 20:57:59 UTC
For what it is worth:

adding pcie_aspm=off to the kernel cmd line (and nothing else), renders the system completely stable indeed. No more strange NIC hangs.

Comment 40 Andy Gospodarek 2012-04-13 14:24:46 UTC
hahacc, please don't come to RHEL bugs and recommend using out of tree drivers from non-RH repositories.

If you can suggest an upstream patch that will resolve this, we can take it.  

Asking to use code that will NOT be put in RHEL is not a way to get the problem fixed on RHEL or CentOS.

Thanks.

Comment 41 Levente Farkas 2012-04-13 16:02:10 UTC
First of all, closing a bug as a duplicate of another bug which is not public does not help anyone who would like to find a solution, and that is the case here.

Second, until you (upstream) provide a solution to a bug, it's a big help to anyone to be given a workaround for any kind of bug, even in Bugzilla.

Of course the best would be if you could fix this problem (which still exists in rhel-6.2 too), but until then it's still better than nothing.

Comment 42 Charlie Brady 2012-04-13 16:12:15 UTC
(In reply to comment #41)

> Of course the best would be if you could fix this problem (which still exists in
> rhel-6.2 too), but until then it's still better than nothing.

If the bug exists in rhel-6.2, then open a RHEL6 bug and provide details.

Comment 44 R G 2012-11-27 15:05:06 UTC
Why can't I/we access "bug 562273" ?

I'm trying to find a solution to this nasty bug in 6.3

Comment 45 Charlie Brady 2012-11-27 15:41:44 UTC
(In reply to comment #44)
> Why can't I/we access "bug 562273" ?

Already answered above.

> I'm trying to find a solution to this nasty bug in 6.3

Then you are wasting your time here. See advice in comment #35 and comment #42.

Comment 46 R G 2012-11-27 15:43:43 UTC
I was trying to find out if there is a bug report already opened for 6.3.

Should I assume there is none? Given the number of people complaining about the issue, it would be really odd if nobody had reported it.

Comment 47 Andy Gospodarek 2012-11-27 15:47:36 UTC
(In reply to comment #44)
> Why can't I/we access "bug 562273" ?

You cannot access it as it contains private customer data, partner data, or because the person that opened it wanted it to be private.  The bug would not contain any specific details that would really help with this problem as this was closed as a duplicate of the bug that was used to track the major driver update for RHEL5.  This major update included upstream commit 19833b5dffe2f2e92a1b377f9aae9d5f32239512 which was thought to resolve this problem based on initial reports.  This is essentially what I said in comment #14.

> I'm trying to find a solution to this nasty bug in 6.3

It appears some systems still have issues, but applying upstream commit d4a4206ebbaf48b55803a7eb34e330530d83a889 or booting with 'pcie_aspm=off' on the kernel command line appears to resolve additional problems that exist.  This commit mentioned above will be included in 6.4.

Comment 49 R G 2012-11-27 16:07:17 UTC
Booting with pcie_aspm=off helped in 6.2 from what I can see. I have a 6.3 box though which crashed yesterday, and it was booted with pcie_aspm=off.

Is there any other solution besides pcie_aspm=off, or wait for 6.4 to be released?

Comment 50 Charlie Brady 2012-11-27 16:24:08 UTC
> I have a 6.3 box though which crashed yesterday, and it was booted
> with pcie_aspm=off.

IMO you should create a new bug containing details. "crashed" != "NIC did not pass packets".

Comment 52 Andy Gospodarek 2012-11-27 18:09:22 UTC
(In reply to comment #51)
> 
> 2.1.4 is the latest stable version of the e1000e driver from Intel. The
> changelog can be found here:
> 

I would not suggest trying anything unless you can be sure that the changes from upstream commit d4a4206ebbaf48b55803a7eb34e330530d83a889 are included.

Comment 53 Akemi Yagi 2012-11-27 18:16:45 UTC
(In reply to comment #52)

> I would not suggest trying anything unless you can be sure that the changes
> from upstream commit d4a4206ebbaf48b55803a7eb34e330530d83a889 are included.

According to the changelog for e1000e-2.1.4 (please see the link provided) :

* Upstream commit d4a4206ebbaf48b55803a7eb34e330530d83a889 - e1000e: Disable ASPM L1 on 82574

So, it is indeed included. :)

Comment 55 R G 2012-11-27 18:45:40 UTC
Thanks for all the replies! Much appreciated!

What would be interesting to know, is how to actually reproduce the issue without waiting days/weeks for it to happen randomly.

I have enough boxes I can test this on; I'm just looking for a reliable way to replicate the issue, and then confirm it's not happening with the new driver.
If anyone knows how it can be replicated quickly, I would appreciate it if you could let me know.

Thanks!

Comment 56 Jesse Brandeburg 2012-11-27 19:03:07 UTC
you can check ASPM state directly via lspci -vvv with the new driver vs the old driver.

you need to check the ASPM state of the upstream PCIe port as well as the device for e1000e.

if you attach full lspci -vvv from before/after I can confirm that ASPM is disabled correctly.

we can also disable ASPM manually using setpci after driver loads (to test the change on an older driver)
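
As a rough illustration of the checks described above, the ASPM state can be pulled out of the LnkCtl lines of lspci -vvv, for both the NIC and its upstream port. This is a sketch under assumptions: the helper names aspm_states and aspm_disable_cmd are made up, the parser relies on the usual "LnkCtl: ASPM <state>; ..." layout, and the setpci line assumes a pciutils version that understands the CAP_EXP register name and the value:mask form. It clears bits 1:0 of the PCIe Link Control register (capability offset 0x10) and is only printed here, not executed.

```shell
#!/bin/sh
# Read `lspci -vvv` output on stdin and print "<bus address> <ASPM state>"
# for every device that has a PCIe Link Control (LnkCtl) line.
# (hypothetical helper name)
aspm_states() {
    awk '
        /^[0-9a-fA-F]/ { dev = $1 }   # device header, e.g. "09:00.0 Ethernet ..."
        /LnkCtl:/ {
            state = $0
            sub(/^.*LnkCtl:[ \t]*ASPM[ \t]*/, "", state)  # drop text before the state
            sub(/;.*$/, "", state)                        # drop text after the state
            print dev, state
        }
    '
}

# Print (do NOT run) a setpci command that would clear the two ASPM
# control bits for the device at bus address $1.
# ASSUMPTION: CAP_EXP and value:mask are supported by the installed setpci.
aspm_disable_cmd() {
    printf 'setpci -s %s CAP_EXP+10.b=0:3\n' "$1"
}
```

For example: lspci -vvv | aspm_states, run as root so the capability lines are visible. The printed setpci command would need to be reviewed and run by hand, for the upstream port as well as the NIC.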

Comment 57 R G 2012-11-27 19:19:39 UTC
Created attachment 653052 [details]
lspci -vvv snapshot

Comment 58 R G 2012-11-27 19:20:49 UTC
I have added a snapshot of lspci -vvv , taken on the server where I had this issue with the e1000e module.

The interface on which i had the issue is 09:00.0

It seems to be Disabled for me? Or am I not reading it correctly?

