Bug 809706 - atl1c sometimes gives absurd network statistics, dhclient gets confused
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: Unspecified
Priority: unspecified
Severity: medium
Assigned To: Neil Horman
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened
Reported: 2012-04-04 02:56 EDT by Stefano Cavallari
Modified: 2014-06-16 19:18 EDT
Doc Type: Bug Fix
Last Closed: 2014-03-17 14:42:41 EDT
Type: Bug


Attachments:
ethtool, ifconfig, dmesg, var/log/messages (186.51 KB, text/plain), 2014-06-16 09:14 EDT, AWF
lspci -vvvn (29.40 KB, text/plain), 2014-06-16 19:18 EDT, AWF

Description Stefano Cavallari 2012-04-04 02:56:31 EDT
Description of problem:
Sometimes, on resume, the eth0 interface (which is not physically connected) reports absurd values in ifconfig:

eth0      Link encap:Ethernet  HWaddr 00:26:22:XX:XX:XX  
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1271860075133760 errors:7631173335704445 dropped:2543728740202110 overruns:1271864370101055 frame:6359313260570685
          TX packets:1271864370101055 errors:5087457480404220 dropped:0 overruns:1271864370101055 carrier:2543728740202110
          collisions:6359321850505275 txqueuelen:1000 
          RX bytes:1271860075133760 (1.1 PiB)  TX bytes:1271864370101055 (1.1 PiB)
          Interrupt:45

This breaks dhclient, and thus NetworkManager, for every interface.


Version-Release number of selected component (if applicable):
3.3.0-4.fc16.x86_64

Apr  4 08:27:15 KiwiBook dhclient[14703]: Bad line reading interface information
Apr  4 08:27:15 KiwiBook dhclient[14703]: Error getting interface information.
Apr  4 08:27:15 KiwiBook NetworkManager[1044]: Bad line reading interface information
...
Apr  4 08:27:15 KiwiBook dhclient[14703]: exiting.
Apr  4 08:27:15 KiwiBook NetworkManager[1044]: <info> (wlan0): DHCPv4 client pid 14703 exited with status 1

How reproducible:
Sometimes.

Steps to Reproduce:
1. Suspend the system
2. Wake up the system
  
Actual results:
NetworkManager won't reconnect and stops working for every interface.

Expected results:
Everything is fine

Additional info:
Smolt of the machine: http://www.smolts.org/client/show/pub_5b3d2153-a96a-4833-bd41-e19763fa6671
rmmod / insmod of atl1c resets the interface statistics, after which NetworkManager works again.
Comment 1 Stefano Cavallari 2012-04-13 02:50:34 EDT
Happened again, using kernel 3.3.0-8.fc16.x86_64
Comment 2 Stefano Cavallari 2012-05-11 04:32:40 EDT
Happened again on kernel-3.3.4-3.fc16.x86_64, this time after a reboot for upgrading the kernel, so this is not strictly related to suspend/resume.
Comment 3 Stefano Cavallari 2012-05-14 12:13:53 EDT
output of cat /proc/net/dev:
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
…
  eth0: 88274463624106 88274462819165 529646776884810 176548925628270 88274462814135 441372314070675          0 88274462814135 88274463554144 88274462823115 353115031125720    0 88278757781430 441393788907150 176557515562863          0
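
For anyone who wants to pull the individual counters out of a line like the one above, here is a minimal Python sketch. It assumes the 16-column layout shown in the header lines (8 receive fields, then 8 transmit fields). The function names and the "errors exceed packets" check are illustrative assumptions, not something taken from dhclient or the atl1c driver.

```python
# Field names follow the /proc/net/dev header lines quoted in this comment.
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]

# The eth0 line from this comment, with the two header lines.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 88274463624106 88274462819165 529646776884810 176548925628270 88274462814135 441372314070675          0 88274462814135 88274463554144 88274462823115 353115031125720    0 88278757781430 441393788907150 176557515562863          0
"""


def parse_proc_net_dev(text):
    """Return {iface: {"rx": {...}, "tx": {...}}} for every well-formed line."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # header lines have no "iface:" prefix
        name, _, rest = line.partition(":")
        values = rest.split()
        if len(values) != 16 or not all(v.isdigit() for v in values):
            continue  # skip anything that does not match the 16-column layout
        numbers = [int(v) for v in values]
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, numbers[:8])),
            "tx": dict(zip(TX_FIELDS, numbers[8:])),
        }
    return stats


def errors_exceed_packets(iface_stats):
    """Hypothetical sanity check: more receive errors than received packets."""
    rx = iface_stats["rx"]
    return rx["errs"] > rx["packets"]


stats = parse_proc_net_dev(SAMPLE)
print(stats["eth0"]["rx"]["packets"], errors_exceed_packets(stats["eth0"]))
```

On a live system the same function could be fed the contents of /proc/net/dev instead of SAMPLE.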
Comment 4 Stanislaw Gruszka 2012-05-18 07:01:48 EDT
We updated the atl1c driver in kernel-3.3.6-3.fc17; perhaps the update fixes this issue. Please test.
Comment 5 Stanislaw Gruszka 2012-06-21 16:25:01 EDT
So, do you still hit this issue on updated kernels?
Comment 6 Stefano Cavallari 2012-06-24 12:25:18 EDT
Experienced this again on Fedora 17 (I did an upgrade), using kernel 3.4.3-1.fc17.x86_64.
Comment 7 Stefano Cavallari 2012-07-13 10:05:37 EDT
Still happens using kernel 3.4.4-5.fc17.x86_64.
Comment 8 Stefano Cavallari 2012-07-13 10:07:57 EDT
I see now that once the problem appears, the numbers keep growing:

eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:26:22:51:eb:9b  txqueuelen 1000  (Ethernet)
        RX packets 251285651528565  bytes 251285651528565 (228.5 TiB)
        RX errors 1507713909171390  dropped 502571303057130  overruns 251285651528565  frame 1256428257642825
        TX packets 251285651528565  bytes 251285651528565 (228.5 TiB)
        TX errors 1005142606114260  dropped 0 overruns 251285651528565  carrier 502571303057130  collisions 1256428257642825

eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 00:26:22:51:eb:9b  txqueuelen 1000  (Ethernet)
        RX packets 251298536430450  bytes 251298536430450 (228.5 TiB)
        RX errors 1507791218582700  dropped 502597072860900  overruns 251298536430450  frame 1256492682152250
        TX packets 251298536430450  bytes 251298536430450 (228.5 TiB)
        TX errors 1005194145721800  dropped 0 overruns 251298536430450  carrier 502597072860900  collisions 1256492682152250
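
One thing the two dumps above make easy to check is how the fields relate to each other. The Python sketch below uses the values from the first dump and divides every nonzero field by the RX packets value. The dictionary keys and the function name are illustrative labels, not ifconfig or driver identifiers. What the result implies about the driver or hardware is not established in this report; the code only does the arithmetic.

```python
# Values copied from the first ifconfig dump in this comment.
first_dump = {
    "rx_packets": 251285651528565,
    "rx_errors": 1507713909171390,
    "rx_dropped": 502571303057130,
    "rx_overruns": 251285651528565,
    "rx_frame": 1256428257642825,
    "tx_packets": 251285651528565,
    "tx_errors": 1005142606114260,
    "tx_overruns": 251285651528565,
    "tx_carrier": 502571303057130,
    "collisions": 1256428257642825,
}


def multiples_of(base, counters):
    """Return {name: (quotient, remainder)} of each counter divided by base."""
    return {name: divmod(value, base) for name, value in counters.items()}


ratios = multiples_of(first_dump["rx_packets"], first_dump)
for name, (quotient, remainder) in ratios.items():
    print(f"{name:12s} = {quotient} x rx_packets, remainder {remainder}")
```

Every remainder comes out as 0, with quotients between 1 and 6, so each field is an exact small multiple of the same base value.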
Comment 9 Stefano Cavallari 2013-01-23 04:26:47 EST
Still happens in Fedora 18, kernel 3.7.2-204.fc18.x86_64.
Comment 10 Neil Horman 2013-06-03 16:32:50 EDT
Hmm, it appears that this may be a hardware issue, as we're just reading hardware registers into our software stats block. Maybe the hw regs stop clearing properly. When they start to grow large like this, do they take a big jump and then grow linearly again, or do they start to grow exponentially?
Comment 11 Neil Horman 2013-06-13 11:03:59 EDT
ping, any response here?
Comment 12 Stefano Cavallari 2013-06-13 11:46:59 EDT
I'm sorry, but I'm not currently using that system. I hope to find some time next week to look at how the numbers grow.
Comment 13 Neil Horman 2013-06-26 10:33:36 EDT
ping, any update?
Comment 14 Fedora End Of Life 2013-07-03 21:19:15 EDT
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 15 Stefano Cavallari 2014-02-24 07:32:48 EST
Hello, I just experienced this again, on kernel 3.13.3-201.fc20.x86_64.
The counters are growing linearly at a huge rate:
# ( for i in {1..3}; do ifconfig eth0; sleep 1; done ) | grep -E '[TR]X'
        RX packets 20169166417320  bytes 20169166417320 (18.3 TiB)
        RX errors 121014998503920  dropped 40338332834640  overruns 20169166417320  frame 100845832086600
        TX packets 20169166417320  bytes 20169166417320 (18.3 TiB)
        TX errors 80676665669280  dropped 0 overruns 20169166417320  carrier 40338332834640  collisions 100845832086600
        RX packets 20182051319205  bytes 20182051319205 (18.3 TiB)
        RX errors 121092307915230  dropped 40364102638410  overruns 20182051319205  frame 100910256596025
        TX packets 20182051319205  bytes 20182051319205 (18.3 TiB)
        TX errors 80728205276820  dropped 0 overruns 20182051319205  carrier 40364102638410  collisions 100910256596025
        RX packets 20194936221090  bytes 20194936221090 (18.3 TiB)
        RX errors 121169617326540  dropped 40389872442180  overruns 20194936221090  frame 100974681105450
        TX packets 20194936221090  bytes 20194936221090 (18.3 TiB)
        TX errors 80779744884360  dropped 0 overruns 20194936221090  carrier 40389872442180  collisions 100974681105450

Sorry for the long delay. I stopped using that laptop for some time, but I'm now starting to use it roughly daily.
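
To make "growing linearly" concrete, the three RX packets samples above (taken one second apart) can be differenced. This is a minimal Python sketch; the function name is illustrative. The per-second step happens to be an exact multiple of 2^32 - 1, which the code checks, but that is an arithmetic observation only and the report does not confirm what causes it.

```python
# RX packets values from the three one-second samples in this comment.
rx_packets = [20169166417320, 20182051319205, 20194936221090]

ALL_ONES_32 = 0xFFFFFFFF  # 2**32 - 1, the largest value a 32-bit register can hold


def deltas(samples):
    """Differences between consecutive samples."""
    return [b - a for a, b in zip(samples, samples[1:])]


steps = deltas(rx_packets)
print("per-second steps:", steps)
print("steps are equal (linear growth):", len(set(steps)) == 1)
print("step / (2**32 - 1):", divmod(steps[0], ALL_ONES_32))
print("first sample / (2**32 - 1):", divmod(rx_packets[0], ALL_ONES_32))
```

Both steps are 12884901885, which is 3 x (2^32 - 1), and the first sample itself divides evenly by 2^32 - 1.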
Comment 16 Neil Horman 2014-02-24 16:35:36 EST
Ok, that's great, but you still need to answer my question in comment 10; just re-opening this doesn't help.
Comment 17 Stefano Cavallari 2014-02-25 04:23:09 EST
Sorry, I forgot to answer that part. It makes a big jump and then grows linearly (with no cable attached since boot).
Comment 18 Neil Horman 2014-02-25 08:26:18 EST
Ok, thank you. The stats update mechanism relies on the notion that all the stats registers are read-clear. The fact that the stats take a big jump and then grow linearly suggests that occasionally some of the read-clear registers don't clear, resulting in a large double count. I think you're seeing a hardware problem. You may want to check with Atheros to see if you can find an errata sheet or firmware update for your NIC.
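
The read-clear reasoning above can be illustrated with a toy model. The Python sketch below is not the atl1c driver code; the class, its parameters, and the numbers are all hypothetical. It only shows why a software total that adds the register value on every poll stays correct when the register clears on read, and grows by the same amount on every poll when it does not.

```python
class ReadClearRegister:
    """Toy hardware counter that is supposed to reset to zero when read."""

    def __init__(self, initial=0, clears_on_read=True):
        self.value = initial
        self.clears_on_read = clears_on_read

    def read(self):
        value = self.value
        if self.clears_on_read:
            self.value = 0  # healthy behaviour: reading consumes the count
        return value


def poll(register, polls):
    """Software running total after each poll: total += register.read()."""
    total = 0
    history = []
    for _ in range(polls):
        total += register.read()
        history.append(total)
    return history


# No traffic arrives between polls (as with no cable attached).
healthy = poll(ReadClearRegister(initial=1000, clears_on_read=True), polls=5)
stuck = poll(ReadClearRegister(initial=1000, clears_on_read=False), polls=5)

print("healthy:", healthy)  # counted once, then flat
print("stuck:  ", stuck)    # the same value is re-added on every poll
```

In this model the stuck register makes the total grow linearly even with no traffic at all, which is the shape of growth described in comment 17.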
Comment 19 Justin M. Forbes 2014-03-17 14:42:41 EDT
*********** MASS BUG UPDATE **************

This bug has been in a needinfo state for several weeks and is being closed with insufficient data due to inactivity. If this is still an issue with Fedora 20, please feel free to reopen the bug and provide the additional information requested.
Comment 20 Ruslan 2014-04-25 02:40:53 EDT
I experience the same in Ubuntu Precise 32-bit (with linux 3.2.0-58-generic-pae and the kernels Precise originally shipped with), at least when my EEE PC 1015PN is unplugged from Ethernet. I ran a test to answer the question in comment 10: the error count stays at 0 for several days, and suspend/resume cycles don't change it. Then it suddenly jumps to almost 2^32, but not quite: 4294967274 (it may have been higher between samples, which I took once every 60 seconds). After that the error count goes down linearly. If I suspend, wait some time, and then resume, the error count continues from its previous value, i.e. it doesn't change during suspend.

At least this is true for the "RX packets" column. I can check the others, but the main fact is that they all suddenly go from 0 to ~2^32.
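
A small Python sketch of how a jump like this could be picked out of periodic samples, and how far the reported value sits from 2^32. The helper names and the threshold are illustrative assumptions, and the last sample in the list is made up purely to show a decreasing value. Reading 4294967274 as a signed 32-bit number is only one possible interpretation, not something confirmed here.

```python
def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def first_jump(samples, threshold=1 << 31):
    """Index of the first sample exceeding its predecessor by more than
    threshold, or None if there is no such jump."""
    for i in range(1, len(samples)):
        if samples[i] - samples[i - 1] > threshold:
            return i
    return None


# 4294967274 is the value reported in this comment; the final entry is a
# hypothetical later sample, included only to mimic the count going down.
samples = [0, 0, 0, 4294967274, 4294967200]

print("jump at sample index:", first_jump(samples))
print("distance below 2**32:", (1 << 32) - 4294967274)
print("as signed 32-bit:", to_signed32(4294967274))
```

The reported value is 22 below 2^32, which is the same bit pattern as -22 in a signed 32-bit integer.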
Comment 21 Ruslan 2014-04-25 02:45:40 EDT
Also, if I reload atl1c module ("modprobe -r atl1c", then "modprobe atl1c"), then all stats reset to zeroes.
Comment 22 AWF 2014-06-16 09:03:24 EDT
I am also having the same issue, and can reproduce it (within anywhere from 2 minutes to one day).

I have tested using the last 4 non-debug kernels for Fedora 19, but there are few or no errors in messages/dmesg, as the kernel thinks everything is OK. I believe this hardware worked without problems until roughly November 2013, but since the problem was sporadic it was attributed to something else. Now it is much worse.

I have all logs, dmesg, ifconfig, ethtool. atl1c appears to be up, and ifconfig reports it has link but with massive errors.

It appears the PCI bus may get reset or fail, as modules complain that they cannot talk to their hardware, or that it is in a bad state. I have removed/disabled possibly suspect PCI devices and the problem continues.

I have run memtest86 and memtester, turned off NetworkManager, disabled all other networking such as bluetooth, and removed/disabled all utilities such as network monitoring tools. I have tried unplugging the Ethernet cable. I can ifdown/rmmod/modprobe atl1c/ifup the device and I can access the network again, most of the time. But it has knocked other devices offline, so a restart is necessary.

Thank You for any assistance.
Comment 23 AWF 2014-06-16 09:14:41 EDT
Created attachment 909121
ethtool, ifconfig, dmesg, var/log/messages
Comment 24 AWF 2014-06-16 19:18:44 EDT
Created attachment 909314
lspci -vvvn
