489541 – Wrong bandwidth values reported by the kernel

Bug 489541 - Wrong bandwidth values reported by the kernel

Summary: Wrong bandwidth values reported by the kernel

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Neil Horman
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-03-10 16:29 UTC by Matthias Saou
Modified:	2017-11-16 00:53 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-10-03 00:57:07 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Bandwidth monitor script (3.59 KB, text/plain) 2009-03-10 16:29 UTC, Matthias Saou	no flags	Details
Bandwidth monitor script (3.78 KB, application/octet-stream) 2009-05-07 16:19 UTC, Matthias Saou	no flags	Details
Bandwidth monitor script (3.79 KB, application/octet-stream) 2009-05-07 16:58 UTC, Matthias Saou	no flags	Details
Bandwidth monitor script (4.90 KB, application/octet-stream) 2009-05-07 18:57 UTC, Matthias Saou	no flags	Details
patch to detect gso frames using proper api calls (678 bytes, patch) 2009-05-11 18:53 UTC, Neil Horman	no flags	Details \| Diff
Show Obsolete (3) View All

Description Matthias Saou 2009-03-10 16:29:56 UTC

Created attachment 334672 [details]
Bandwidth monitor script

I've already detailed the issue on the rhelv5-list with no replies so far :

https://www.redhat.com/archives/rhelv5-list/2009-February/msg00166.html

Basically, I see impossibly high bandwidth usage values when looking at /proc, ifconfig or snmp. But I do see values which match the ones I see on the network switches when looking at values returned by tcpdump and iptraf.

Attached is a quick python script which displays in real time the total network traffic since its start using two methods : Reading /proc and using tcpdump. On many busy servers I see *huge* differences.

I really feel like something is wrong at a very low level, but I'd love to be proven wrong. I've seen this with every RHEL5 kernel I've ever tested, including from the latest 2.6.18-128.el5 series.

Comment 1 Matthias Saou 2009-04-02 14:31:04 UTC

No one is curious enough to have tried looking at what this script measures and see where the differences might come from?

I'm still seeing some really unrealistic values reported by snmpd on *many* servers. I'm convinced that many others are probably affected by this without noticing it.

Comment 2 Neil Horman 2009-05-06 14:25:45 UTC

I've only looked briefly, But I don't see anything programatically wrong with your script.  That said, I think you might be failing to account for lost packets in the tcpdump portion of your script.  Looking at the message you reference above, I see that you describe your problem as a large discrepancy between the octet counter in /proc/net/dev for an interface and the octet counter in tcpdump:

>But on a busy web server, RHEL x86_64 5.3 domU, I see :
>TCPDUMP: 1959818423 (1563119 packets)
>PROC:    5320731075
>
>This is quite a big difference.

Two thoughts occur:
1) 5320731075 is 5 billion bytes.  Depending on how long the capture lasted this is a completely reasonable number.  A gigabit interface running at line rate would only take about 40 seconds to tally 5GB of data transfer

2) The proc interface querys stats at the device, while tcpdump records packet counts at the application. There are plenty of opportunities for the stack to discard frames between the two (and IIRC tcpdump reports after taking a capture how many frames had to be discarded for various reasons).

I'd say start by adjusting the math on the tcpdump portion of your script.  If possible determine how many frames were lost during the tcpdump capture, and how many octets that amounts to.  If that plus the capture octet number is about equal to the value in the proc interface, there is no problem.  If they don't match, then we can look a bit deeper.

Comment 3 Matthias Saou 2009-05-06 15:15:01 UTC

There was unfortunately something slightly wrong with the python script. See :
http://lkml.org/lkml/2009/5/5/66

From there, I need to re-do all of my testing, as it may turn out that I wasn't reproducing the problem if things were being counted wrong.

The logic in what you write seems to be the opposite to what could explain the issue : I see much higher values for the _outgoing_ traffic reported by proc than by tcpdump. If tcpdump was capturing _less_ traffic than what proc exposes, then the difference would be the other way around, wouldn't it?

I can re-explain what the original situation was, which lead to trying to find the cause :
 * We have unrealistic bandwidth graphs using net-snmp for the interfaces of our streaming servers, some go higher than the physical interface's speed
 * Looking at the real usage using iptraf, the values reported are realistic and seem correct
 * Looking at the snmp bandwidth graphs from the network switches, the values are realistic and match the ones seen in real time with iptraf

Then when we started digging, we saw that :
 * iptraf uses the same way of reporting traffic than tcpdump does
 * net-snmp uses the proc kernel interface for reporting traffic

Which is why we then decided to create a simple test script to run on various servers in order to compare the values collected using tcpdump and using /proc.

I'm still suspecting it might be indirectly caused by something the streaming application is doing, but still, I don't think it's normal to have /proc report traffic for an interface if that traffic isn't "real"... hence this bug report.

Comment 4 Neil Horman 2009-05-06 16:56:44 UTC

Sorry, I didn't see that you had munged your script to do tx traffic rather than rx traffic (as the comment in the script indicates). Given what Eric said, I also would retest just to be sure the proc/net/dev values weren't erroneous

That all said however, its still entirely possible for more data to go out the hardware than what your script covers. For example, if your web server preforms source NAT translation for instance your tcpdump filter won't pick up on that, given that your matching on source address, and the outgoing tap happens right before we pass the frame to the driver in dev_hard_start_xmit.

Theres also the possibility that gso is affecting your computation. If you have any offload features enabled in your hardware, One very large packet will be sent through the tcp stack, and fragmented withing the hardware, scaling up the number of ethernet frame bytes actually on the wire. Those extra bytes get picked up in the hardware counters in /proc/net/dev, but would never be seen by tcpdump running locally on the system.

There is also the possibilty of latency issues within the script itself. The parsing code for the tcpdump output seems significantly more complex computationally than the parsing of /proc/net/dev. Given that the former thread has to yield the cpu to wait for a child process to produce output, which then must be parsed, while the latter thread is much less likely to yield while quickly reading a proc file. It may be the case that for any given output issued from the main routine of the script, the tcpdump thread is simply lagging the output from /proc/net/dev. I think it would be more accurate if you were to modify the script so that you captured /proc/net/dev at the start of a run, then ran tcpdump, capturing the output, then captured /proc/net/dev again. Then you could post-process the tcpdump output to determine frame/octet count, and compare that to the difference in values in /proc/net/dev from before and after the tcpdump capture. Its still not perfect, but I think it would eliminate a good deal of latency stemming from the real time processing your doing.

There is also of course the possibility that the driver has a bug in how it record statistics. If you send me a sysreport of the system, I can start poking about in the driver while you modify the script and re-test.

Comment 5 Matthias Saou 2009-05-06 17:25:34 UTC

Thanks for all those details. There are indeed many possibilities, so let's start to rule out as many as possible.

The only iptables rule other than trivial INPUT rules is :
-A PREROUTING -i eth3 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 8800

This is because the streaming server is running as non root by default, but needs to be reachable on port 80. This shouldn't affect outgoing traffic at all. No NAT whatsoever involved here.

I've sent you the sosreport output of an affected server as well as png images of the same gigabit link, one made polling net-snmp on the server for its eth3 and the other made polling a Cisco 6500 router on the gigabit interface where the server is connected. There is't easy to see that the server's incoming traffic matches the router's outgoing traffic, but the opposite isn't true... at all.

Comment 6 Matthias Saou 2009-05-06 17:27:59 UTC

Here are links to those cacti snmp graphs so that anyone can have a look :

http://ftp.egwn.net/tmp/streamer001-eth3.png
http://ftp.egwn.net/tmp/streamer001-gi1-16.png

Comment 7 Neil Horman 2009-05-06 20:40:30 UTC

as per the email I just sent you, could you please provide ethtool -S stats of the interface in quesiton on the server.  Its an e1000e NIC which updates stats seen in /proc/net/dev from software, so I'd like to compare those to the real hw counters that ethtool retrieves.  Thanks!

Comment 8 Matthias Saou 2009-05-06 21:32:16 UTC

# ethtool -S eth3
NIC statistics:
     rx_packets: 57283646614
     tx_packets: 64706864418
     rx_bytes: 35905148664355
     tx_bytes: 47677757449792
     rx_broadcast: 39908964
     tx_broadcast: 15014
[...]

At the exact same moment, /proc/net/dev indicated :

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
eth3:2029134087 1449071704    0  185    0     0          0         0 2292325960 282354921    0    0    0     0       0          0

Which would be :
rx packets : 1449071704
tx packets : 282354921
rx bytes : 2029134087
tx bytes : 2292325960

Given the huge difference, I'm assuming the /proc/net/dev counters have looped already quite a few times (busy server with 181 days of uptime). I'll try to extract the same data from a similar but recently rebooted server.

Comment 9 Neil Horman 2009-05-07 00:26:50 UTC

That would be good please.  Thanks!

Comment 10 Matthias Saou 2009-05-07 16:19:53 UTC

Created attachment 342870 [details]
Bandwidth monitor script

Attaching a new version of monitor.py :
 * Fixed the problem where /proc/net/dev wasn't split properly
 * Added packet count from /proc/net/dev to the output
 * Print a warning line if a wrap in the /proc/net/dev counter is detected

Comment 11 Matthias Saou 2009-05-07 16:58:15 UTC

Created attachment 342881 [details]
Bandwidth monitor script

Sorry for the noise. The previous version was discarding tcpdump lines with high traffic. This one fixes this. Now I get this on an affected server after a couple of minutes :

TCPDUMP: 2620549364 (1756458 packets)
PROC:    27688700317 (1859516 packets)

As you can see, the number of packets matches more or less, but the reported bandwidth is ten times higher with /proc/net/dev!

On a 64bit RHEL5 domU (xennet) with the same script, I get :

TCPDUMP: 853991340 (956356 packets)
PROC:    822447554 (597497 packets)

It doesn't surprise me that much to have tcpdump "see" more packets than proc, nor that it reports slightly more usage. Very different result.

Comment 12 Neil Horman 2009-05-07 17:01:11 UTC

do you that ethtool/proc comparison that we discuessed?  That would be good confirmation that we had a problem in the e1000e driver

Comment 13 Matthias Saou 2009-05-07 18:03:32 UTC

(In reply to comment #12)
> do you that ethtool/proc comparison that we discuessed?  That would be good
> confirmation that we had a problem in the e1000e driver  

Sorry, but no : No servers will be rebooted before next week. But I've added a message to the current script's output each time the /proc/net/dev interface's bytes counter wraps, and on the server where I was testing it, the wrapping occurs currently after... 14 seconds. Even if the real traffic 10 times less, it would wrap in little more than 2 minutes.

So I think it'll be quite hard to get any kind of reliable "initial" ethtool/proc comparison... the best alternative that comes to mind would be to add a counter based on the ethtool -S output to the monitor.py script. I'll do that.

Comment 14 Neil Horman 2009-05-07 18:36:03 UTC

please do, that will help me figure out what I'm looking for.

Comment 15 Matthias Saou 2009-05-07 18:57:58 UTC

Created attachment 342916 [details]
Bandwidth monitor script

I think we're getting somewhere! This new monitor.py adds a field with a counter using the ethtool -S output. I've noticed that the strings to look for are quite different between EL5's e1000e and my Fedora 10's tg3, so it might only work with EL5 and/or e1000e ("tx_bytes" and "tx_packets").

Here is the output after running for a few minutes :
TCPDUMP: 13081564852 (8100964 packets)
PROC:    124778066271 (9144477 packets)
ETHTOOL: 12513698149 (9145118 packets)

Pretty obvious at last! Please let me know if/when you need me to try anything else.

Comment 16 Neil Horman 2009-05-07 19:08:20 UTC

Ok, so that seems to support the notion that the tx bytes counter in e1000e is off by a factor 10 about 10.  I'll start digging into the driver.

Comment 17 Matthias Saou 2009-05-07 20:17:19 UTC

(In reply to comment #16)
> Ok, so that seems to support the notion that the tx bytes counter in e1000e is
> off by a factor 10 about 10.  I'll start digging into the driver.  

This seems true only for some values. I think the graphs I provided in comment #6 are quite good to give an idea : It's not directly proportional. For the "real" 500Mbps peak, it was reporting little more than 600Mbps, yet it also reports 600Mbps at times when the traffic is below 150Mbps. It does look much more proportional for "real" traffic up to 100Mbps, with something close to a 10 factor. These are just observations, though...

Comment 18 Matthias Saou 2009-05-08 07:46:58 UTC

Maybe a silly question : I see that it's RHEL5 defaulting to use e1000e, but it also ships with e1000. Would switching to e1000 work? (it would be interesting to check if the problem is also there with it)

Comment 19 Neil Horman 2009-05-08 10:58:45 UTC

depends on the pci id.  whats the vendor/dev id on the card?

Comment 20 Matthias Saou 2009-05-08 11:37:28 UTC

This is the lspci output :

0a:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0a:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)

And this is the full lshal output for the firs port :

udi = '/org/freedesktop/Hal/devices/pci_8086_105e_0'
  info.udi = '/org/freedesktop/Hal/devices/pci_8086_105e_0'  (string)
  linux.subsystem = 'pci'  (string)
  linux.hotplug_type = 1  (0x1)  (int)
  pci.subsys_product = 'PRO/1000 PT Dual Port Server Adapter'  (string)
  pci.subsys_vendor = 'Intel Corporation'  (string)
  info.product = '82571EB Gigabit Ethernet Controller'  (string)
  pci.product = '82571EB Gigabit Ethernet Controller'  (string)
  info.vendor = 'Intel Corporation'  (string)
  pci.vendor = 'Intel Corporation'  (string)
  pci.device_protocol = 0  (0x0)  (int)
  pci.device_subclass = 0  (0x0)  (int)
  pci.device_class = 2  (0x2)  (int)
  pci.subsys_vendor_id = 32902  (0x8086)  (int)
  pci.subsys_product_id = 4958  (0x135e)  (int)
  pci.vendor_id = 32902  (0x8086)  (int)
  pci.product_id = 4190  (0x105e)  (int)
  info.linux.driver = 'e1000e'  (string)
  pci.linux.sysfs_path = '/sys/devices/pci0000:00/0000:00:04.0/0000:0a:00.0'  (string)
  info.parent = '/org/freedesktop/Hal/devices/pci_8086_25f8'  (string)
  info.bus = 'pci'  (string)
  linux.sysfs_path_device = '/sys/devices/pci0000:00/0000:00:04.0/0000:0a:00.0'  (string)
  linux.sysfs_path = '/sys/devices/pci0000:00/0000:00:04.0/0000:0a:00.0'  (string)

udi = '/org/freedesktop/Hal/devices/net_00_15_17_76_71_06'
  info.udi = '/org/freedesktop/Hal/devices/net_00_15_17_76_71_06'  (string)
  linux.subsystem = 'net'  (string)
  linux.hotplug_type = 2  (0x2)  (int)
  net.80203.mac_address = 90587951366  (0x1517767106)  (uint64)
  info.product = 'Networking Interface'  (string)
  net.arp_proto_hw_id = 1  (0x1)  (int)
  net.linux.ifindex = 4  (0x4)  (int)
  net.address = '00:15:17:76:71:06'  (string)
  net.interface = 'eth2'  (string)
  net.physical_device = '/org/freedesktop/Hal/devices/pci_8086_105e_0'  (string)
  info.capabilities = {'net', 'net.80203'} (string list)
  info.category = 'net.80203'  (string)
  info.parent = '/org/freedesktop/Hal/devices/pci_8086_105e_0'  (string)
  linux.sysfs_path = '/sys/class/net/eth2'  (string)

Comment 21 Neil Horman 2009-05-08 13:45:37 UTC

negative, the e1000 driver isn't going to recognize that card.  You might be able to get it do something if you used the sysfs new_id interface, but I would not recommend it.  I'll check the e1000e driver for errors in the stats recording here shortly.

Comment 22 Neil Horman 2009-05-08 15:24:24 UTC

Theres one thing that immediately stands out to me. In e1000_clean_tx_irq from the e1000e driver:
if (cleaned) {
                                struct sk_buff *skb = buffer_info->skb;
                                unsigned int segs, bytecount;
                                segs = skb_shinfo(skb)->gso_segs ?: 1;
                                /* multiply data chunks by size of headers */
                                bytecount = ((segs - 1) * skb_headlen(skb)) +
                                            skb->len;
                                total_tx_packets += segs;
                                total_tx_bytes += bytecount;
                        }

This is attempting to multiply the tx byte counter to reflect what actually gets sent on the wire rather than what the packet actually looks like in the event that gso is enabled.  I wonder if a bogus gso value is getting set in that skb.  Its interesting to note that only e1000[e] and ixgb[e] drivers do this, everyone else just pulls stats from the hardware, which seems much more sane

This will be a good test to see if we're on the right track.  issue this command:
ethtool -k ethX
where X is the number of the interface you are sending all this traffic on.  If gso or tso come back indicating they are on, try turning them off.  The stats from that point should remain at a constant offset from their corresponding hardware/ethtool recorded statistics if this is the problem.

Comment 23 Matthias Saou 2009-05-08 15:50:28 UTC

Well, I just ran :
ethtool -K eth3 tso off

And now my script reports sane values!
TCPDUMP: 783583073 (705250 packets)
PROC:    692208750 (559020 packets)
ETHTOOL: 696191360 (560424 packets)

So the problem clearly lies right there. I will use this as a workaround, though, once I've made sure it doesn't have any ill side effects.

Comment 24 Neil Horman 2009-05-08 18:43:24 UTC

It will potentially have a performance side effect, in that more processing will have to be done in the cpu than without tso/gso.  Thats good confirmation though.  I'll try to dig up an e1000e card here that I can reproduce/fix this bug with.

Comment 26 Andy Gospodarek 2009-05-08 20:35:21 UTC

Sounds like there must be something in the tso path that is mis-counting frames, but it sounds like Neil is on it.

Comment 27 Jesse Brandeburg 2009-05-11 17:36:02 UTC

Please note: tcpdump with TSO enabled will not show the correct number of "on the wire" transmitted packets.  This is expected behavior because the stack is allowing the hardware to "segment" packets.  The byte counts won't even match because the hardware is inserting a header for each packet, whereas the stack only sent down ONE header and a bunch of data.

Neil, that code in e1000_tx_clean_irq is doing on the fly counting in order figure out the "transmit bandwidth" of the last interrupt.  This is used in controlling the dynamic interrupt throttling in the driver.

since reading the hardware "tx bytes/tx packets" counters would take thousands of cpu cycles per read it is way too much overhead to read those registers in every interrupt.

likewise on the receive size if LRO or GRO is being done, the stack (tcpdump) will only see "jumbo frames" but the hardware counters will be correct.

I'm not sure this is playing into your problem, as I'm still confused whether you're counting transmits or receives.

Comment 28 Neil Horman 2009-05-11 18:01:13 UTC

Its transmits that hes looking at. And while I agree that the packet count will be incorrect, I don't think the byte counters should be off as they are in comment #15.  The proc counters are a reflection of the dev->net_stats.tx_bytes value that the driver maintains, which shadows what the transmit bandwidth is computing.  From e1000_tx_clean_irq:

        adapter->total_tx_bytes += total_tx_bytes;
        adapter->total_tx_packets += total_tx_packets;
        adapter->net_stats.tx_bytes += total_tx_bytes;
        adapter->net_stats.tx_packets += total_tx_packets;
        return cleaned;

The first 2 lines are tracking transmit bandwidth, while the latter are recording statistics.  Earlier in the clean routine, there is this code:
         segs = skb_shinfo(skb)->gso_segs ?: 1;
         /* multiply data chunks by size of headers */
         bytecount = ((segs - 1) * skb_headlen(skb)) +
                                            skb->len;
         total_tx_packets += segs;
         total_tx_bytes += bytecount;

So it appears to me that the driver is attempting to determine how many segments the hardware will split a given packet into, so that the driver count will match the hardware count.  However, given comment 15 that doesn't seem to be happening, which suggests to me that perhaps we're receiving packets at the driver in which the gso_segs value is set incorrectly (i.e. a large value for a relatively small packet.  Interestingly I've got an e1000e card here (not the same model though), and I can't reproduce the above problem.

Thoughts welcome!

Comment 29 Neil Horman 2009-05-11 18:51:59 UTC

Ok, so I've tried all 4 combinations of tso & gso being off and on, and I'm unable to reproduce this with the e1000e card I have available.  I do see a slight divergence between ethtool and driver stats, but its in the opposite direction from whats described above.  Matthias, can you try a patch for me?  The one thing I did notice was that the skb_is_gso api uses gso_size to detect gso frames, while the e1000e driver uses gso_segs.  That shouldn't make a difference, but if some part of the stack is leaving garbage data in the gso_segs field, that might be the source some confusion.  This patch I'm attaching should fix it, if thats the problem.  Can you give it a go?  Thanks!

Comment 30 Neil Horman 2009-05-11 18:53:54 UTC

Created attachment 343491 [details]
patch to detect gso frames using proper api calls

Comment 31 Neil Horman 2009-06-15 13:12:45 UTC

ping, any update?

Comment 32 Neil Horman 2009-10-03 00:57:07 UTC

closing for lack of response

Note You need to log in before you can comment on or make changes to this bug.