Bug 66497 - too much interrupts using e1000 make gigabit ethernet totally useless
Status: CLOSED NOTABUG
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686 Linux
Priority: medium Severity: high
Assigned To: Arjan van de Ven
Brian Brock
Reported: 2002-06-11 09:35 EDT by Renato
Modified: 2007-04-18 12:43 EDT (History)
Last Closed: 2002-06-11 17:27:18 EDT
Attachments (Terms of Use)
Mhz.c (1.24 KB, text/plain)
2002-06-11 16:53 EDT, Arjan van de Ven
Description Renato 2002-06-11 09:35:28 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461; .NET 
CLR 1.0.3705)

Description of problem:
Upgrading from 7.2 to 7.3 causes a severe loss in network performance.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Use a gigabit Ethernet NIC
2. Generate traffic of just 55 Mbps

Actual Results:  "ksoftirqd_CPU0" eats up 100% of a CPU (and I'm using a
dual Xeon 2 GHz!)

Running 'vmstat 1':
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  1      0 1988284   2528  17052   0   0     0     0  826   352   0  16  84
 0  0  0      0 1988264   2528  17052   0   0     0     0 17486    51   0  25  75
 0  0  1      0 1988264   2528  17052   0   0     0     0 17661    29   0  25  75
 0  0  1      0 1988264   2528  17052   0   0     0     0 17424    32   0  25  75
 0  0  1      0 1988264   2528  17052   0   0     0     0 17428    42   0  25  75
 0  0  1      0 1988264   2528  17052   0   0     0     0 17572    46   0  25  75



Expected Results:  With Red Hat 7.2 and SuSE 8.0, more than 50,000 interrupts/s
are handled and the CPU is nowhere near 100%.


Additional info:

lsmod:

Module                  Size  Used by    Tainted: P  
ipchains               46184 162 
e1000                  57508   2 
raid1                  15780   2 
aic7xxx               125440   6 
sd_mod                 12896  12 
scsi_mod              112272   2  [aic7xxx sd_mod]

(I don't have any rules loaded, even though the ipchains module is loaded.)

I'm downgrading to 7.2. 7.3 is unfortunately totally useless.
Comment 1 Renato 2002-06-11 10:13:37 EDT
One more piece of info: I think the kernel misdetects the CPU:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 798.663
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 1595.80

Maybe this is the reason why it slowed down...
Comment 2 Arjan van de Ven 2002-06-11 10:19:15 EDT
OK I think I see what's going on. You have a HyperThreading enabled system and
7.3 sees 2x the amount of cpus (compared to the physical count). We know that
under some loads things get slower with HyperThreading.

Can you try disabling that in the BIOS and see if that makes things go faster again?
Comment 3 Renato 2002-06-11 10:40:35 EDT
OK, I disabled it.

It got a little better, but at 60 Mbps I start losing packets and ksoftirqd is
again eating up all the CPU.

new entries:

'vmstat 1': despite the 50% idle, one CPU is 100%.

 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  1      0 2022164   1216  11608   0   0    53    45 3603   136   2  28  69
 0  0  1      0 2022160   1216  11608   0   0     0     0 17724   118   0  50  50
 0  0  1      0 2022108   1216  11608   0   0     0     0 17742    93   0  50  50
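
A rough way to read the two vmstat samples: the sy column is averaged over all
logical CPUs, so 25% with four logical CPUs (HyperThreading on) and 50% with two
(HyperThreading off) are both consistent with a single CPU being saturated. The
Python sketch below parses one vmstat row using the column header shown above and
computes that upper bound. The function names are illustrative only and are not
part of vmstat or any other tool.

```python
# Column layout taken from the vmstat header in this report (procps of that era).
VMSTAT_COLUMNS = "r b w swpd free buff cache si so bi bo in cs us sy id".split()


def parse_vmstat_row(row: str) -> dict:
    """Turn one whitespace-separated vmstat data row into a column -> int dict."""
    values = [int(v) for v in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(f"expected {len(VMSTAT_COLUMNS)} fields, got {len(values)}")
    return dict(zip(VMSTAT_COLUMNS, values))


def busiest_cpu_upper_bound(sy_percent: float, logical_cpus: int) -> float:
    """Upper bound on one CPU's system load, assuming all of it lands on one CPU.

    vmstat averages over every logical CPU, so the worst case for a single
    CPU is the average multiplied by the CPU count, capped at 100%.
    """
    return min(100.0, sy_percent * logical_cpus)


if __name__ == "__main__":
    row = parse_vmstat_row(
        "0 0 1 0 2022160 1216 11608 0 0 0 0 17724 118 0 50 50"
    )
    print(row["in"], "interrupts/s,",
          busiest_cpu_upper_bound(row["sy"], 2), "% on the busiest CPU at most")
```

This is only a bound: it says one CPU could be fully busy, which matches the
ksoftirqd observation, but it does not prove which CPU handles the interrupts.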

/proc/cpuinfo:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 798.668
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 1595.80


It's still at the wrong clock speed.


Thanks for the promptness.

Comment 4 Renato 2002-06-11 12:13:48 EDT
I downloaded the latest driver from Intel (4.2.17) and changed
/etc/modules.conf to:

options e1000 RxDescriptors=256 RxIntDelay=4

And now it's running OK :))

Is there anything else I could do?
Comment 5 Arjan van de Ven 2002-06-11 12:17:03 EDT
It could well be that Intel changed the RxIntDelay default between the driver
they asked us to ship in 7.2 and the one they gave us for 7.3......
Comment 6 Renato 2002-06-11 13:00:38 EDT
I think I was too excited by the preliminary results. It's still not good. What 
I see right now is just one CPU of my server being used, while I have 2 NICs ( 
it's a firewall ). Is there a way to split the load between both CPUs ? I mean 
having each of them processing the traffic of each NIC and not just one CPU ?
Comment 7 Arjan van de Ven 2002-06-11 13:26:26 EDT
> Is there a way to split the load between both CPUs ? 
Yes

If you look in /proc/irq/<irq number>/ as root, there is a file "smp_affinity".
That file is a bitmask of which CPUs the irq is allowed on (default is ALL).
So if you do echo "1" > /proc/irq/11/smp_affinity then irq 11 only goes to the
first cpu.
echo 2 > /proc/irq/12/smp_affinity will make irq 12 go to the second one, and so
on.
(note you'll have to look up the irqs for your NICs and replace "11" and "12"
with the real numbers; the /proc/interrupts file will have this info)
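
The lookup and bitmask steps above can be sketched in Python. This is a minimal
illustration, not a tool from the report: it assumes the usual /proc/interrupts
layout (an "NN:" prefix, per-CPU counters, then device names), assumes the NICs
are named eth0, eth1, and so on, and only builds the echo commands rather than
writing to /proc. All function names are hypothetical.

```python
import re


def affinity_mask(cpus):
    """Hex bitmask for smp_affinity: bit N set means CPU N may take the IRQ."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers start at 0")
        mask |= 1 << cpu
    return format(mask, "x")


def find_irqs(interrupts_text, name_prefix="eth"):
    """Map device name -> IRQ number from /proc/interrupts-style text."""
    irqs = {}
    for line in interrupts_text.splitlines():
        match = re.match(r"\s*(\d+):(.*)", line)
        if not match:
            continue  # header row, or named rows such as NMI
        irq = int(match.group(1))
        # Devices sharing an IRQ are comma separated at the end of the line.
        for token in match.group(2).replace(",", " ").split():
            if token.startswith(name_prefix):
                irqs[token] = irq
    return irqs


def plan_commands(irqs, cpu_count):
    """Spread the devices over the CPUs round-robin, one CPU per IRQ."""
    commands = []
    for index, device in enumerate(sorted(irqs)):
        mask = affinity_mask([index % cpu_count])
        commands.append(f"echo {mask} > /proc/irq/{irqs[device]}/smp_affinity")
    return commands


if __name__ == "__main__":
    with open("/proc/interrupts") as handle:
        for command in plan_commands(find_irqs(handle.read()), cpu_count=2):
            print(command)  # run the printed commands as root after reviewing them
```

With two NICs on IRQs 11 and 12 and two CPUs, this yields the same two commands
given above: mask 1 for the first IRQ and mask 2 for the second.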
Comment 8 Renato 2002-06-11 13:55:22 EDT
It worked. My CPU load went down, my traffic is now at 80 Mbps, and the CPU is
fine. I now believe that the kernel is somehow misidentifying my CPU clock. If
you look at my previous report, there is probably a bug in it:

model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 798.668
cache size      : 512 KB

As far as I know the kernel adjusts itself to the reported speed, so if my
processor is actually 1800 MHz, I might be losing clock cycles, right?
Comment 9 Arjan van de Ven 2002-06-11 13:58:32 EDT
Ingo: is there any chance the cpu mhz timer can get confused by HT ?
Comment 10 Renato 2002-06-11 14:01:25 EDT
One more data point. I have a Red Hat 7.2 machine on my network:

uname -a 
Linux hm43 2.4.9-21smp #1 SMP Thu Jan 17 14:01:48 EST 2002 i686 unknown

cat /proc/cpuinfo: ( this one has hyperthreading on )

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 1796.998
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3591.37

And it reports fine.

I think there is definitely a bug with 2.4.18-4smp.
Comment 11 Ingo Molnar 2002-06-11 14:14:27 EDT
the CPU mhz calibration cannot get confused by HT if HT is disabled (which as
far as i understand is the case here).
Comment 12 Renato 2002-06-11 14:20:07 EDT
Yes, I think the scenario is reduced to:

same hardware (dual Xeon 1.8 GHz)

'cat /proc/cpuinfo'

2.4.9-21smp:
model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 1796.998

2.4.18-4smp:
model name      : Intel(R) XEON(TM) CPU 1.80GHz
stepping        : 4
cpu MHz         : 798.668
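
The comparison above can be automated with a short Python sketch. It assumes
that the "model name" field carries the nominal clock as a "<number>GHz" suffix,
which holds for the Xeon entries in this report but not for every CPU, and the
10% tolerance is an arbitrary choice. The function name is hypothetical.

```python
import re


def clock_mismatch(cpuinfo_text, tolerance=0.10):
    """Compare the nominal clock in 'model name' with the measured 'cpu MHz'.

    Returns (nominal_mhz, measured_mhz, mismatch) for the first CPU entry
    that provides both fields. mismatch is True when the measured value
    deviates from the nominal one by more than the given fraction.
    """
    nominal = measured = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "model name":
            found = re.search(r"(\d+(?:\.\d+)?)\s*GHz", value, re.IGNORECASE)
            if found:
                nominal = float(found.group(1)) * 1000.0
        elif key == "cpu MHz":
            measured = float(value)
        if nominal is not None and measured is not None:
            break
    if nominal is None or measured is None:
        raise ValueError("could not find both 'model name' GHz and 'cpu MHz'")
    mismatch = abs(measured - nominal) / nominal > tolerance
    return nominal, measured, mismatch


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:
        print(clock_mismatch(handle.read()))
```

Fed the two excerpts above, it flags the 798.668 MHz reading as a mismatch and
accepts 1796.998 MHz. A mismatch only says the CPU is not running at its nominal
speed; it does not say whether the kernel, the BIOS, or thermal throttling is
responsible.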


Comment 13 Arjan van de Ven 2002-06-11 16:53:17 EDT
Created attachment 60572 [details]
Mhz.c
Comment 14 Arjan van de Ven 2002-06-11 16:54:06 EDT
The MHz.c program is a CPU speed benchmark program. Might be worth trying to see
if the cpu actually runs slower...
Comment 15 Renato 2002-06-11 17:04:29 EDT
I ran it on both machines.

Run on 2.4.9-21smp
1796MHz processor (estimate).
1797MHz processor (estimate).
...

Run on 2.4.18-4smp:
798MHz processor (estimate).
798MHz processor (estimate).
....
Comment 16 Arjan van de Ven 2002-06-11 17:13:58 EDT
OK, so it looks like the CPU really runs slower, not just in the /proc info
*boggle*
Comment 17 Arjan van de Ven 2002-06-11 17:20:27 EDT
Just to rule stuff out: pIV's run slower when they get hot; are you sure it's
not cooling related ?
Comment 18 Renato 2002-06-11 17:27:13 EDT
I don't believe it could be this, but I'll check. Also, I have another new
machine here on which I will run some more tests, just to make sure I'm not
doing anything wrong, or have a buggy BIOS, etc. I'll keep you posted.

BTW, thanks for all the help !!
Comment 19 Renato 2002-06-13 10:20:47 EDT
Brown paper bag time... I personally supervised a new installation on a server,
and apparently when you upgrade the BIOS, it resets the actual clock to the
lowest setting (in my case, exactly 800 MHz). My technicians overlooked it...

I want to apologize and thank you for all the help.
