Bug 481715 - BCM5704 NIC results in CPU 100%SI , sluggish system performance
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: i686 Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Andy Gospodarek
QA Contact: Red Hat Kernel QE team
Keywords: Regression, ZStream
Duplicates: 520183
Depends On: 469772
Blocks: 502837
 
Reported: 2009-01-27 05:37 EST by Andre ten Bohmer
Modified: 2014-06-29 19:01 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-02 04:01:49 EDT

Attachments:
sosreport (16.62 MB, application/x-bzip)
2009-01-28 05:44 EST, Andre ten Bohmer
mpstat, ps, interrupts (25.46 KB, text/x-log)
2009-02-16 10:04 EST, Roland Friedwagner
mpstat, ps, interrupts (2.6.18-92.1.22) (24.99 KB, text/x-log)
2009-02-16 10:06 EST, Roland Friedwagner
rhel5-tg3-softirq-fixup.patch (3.40 KB, patch)
2009-04-13 22:47 EDT, Andy Gospodarek

Description Andre ten Bohmer 2009-01-27 05:37:53 EST
Description of problem:
Updated an HP ProLiant DL140 (no HP PSP installed) from 5.2 to 5.3, and after a reboot to activate kernel 2.6.18-128.el5 the system is very sluggish.
Stopping services like smb and dhcpd does not improve system performance at all.
With only services like ssh and syslog running, the top command still shows 100% CPU in SI (software interrupts, if I understand correctly) every few seconds.
Rebooting the system into the previous (5.2) kernel 2.6.18-92.1.22.el5 resolves the sluggish behavior and no longer shows "CPU 100%si".

After some more Googling I found that SI can be related to device drivers, so I disabled one of the NICs. Now the system runs smoothly with kernel 2.6.18-128.el5. The disabled NIC is connected to a 1G snifpoint and a lot of traffic passes by (between 0.3 Gbit/s and 0.9 Gbit/s). This was never a problem before, but it seems to be one with this kernel version.

2 NICs: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
Module used: Broadcom Tigon3 ethernet driver 3.93, /lib/modules/2.6.18-128.el5/kernel/drivers/net/tg3.ko

1 NIC for management (IP configured), 1 NIC as snifpoint (no IP configured).


Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 5.3 (Tikanga)


Additional info:
HP ProLiant DL140
Dual Intel(R) Xeon(TM) CPU 2.40GHz
MemTotal:      1295332 kB (SwapTotal:     1052216 kB)
Linux raid (1) 2 IDE disks (80GB  Seagate Barracuda ST380011A and 80GB Maxtor 6Y080L0)
Comment 1 Andy Gospodarek 2009-01-27 10:44:48 EST
Andre, thank you for the report and the debugging work.  This is quite helpful.

It is a bit odd that your system is spending 100% of its time servicing softirqs.  I would guess this is because there is so much traffic on the 5704 devices and something is wrong in the tg3_poll routine that is used.

I don't have your exact system, but I do have a tg3-based system (only one port though) that I can try.  If you don't mind, I'm curious if you could try a few tests for me as well.

1. Could you capture the contents of /proc/interrupts at 60 sec intervals before starting the capture and after?

So something like this (all on one line):

# cat /proc/interrupts > /tmp/int.log; date >> /tmp/int.log; sleep 60; cat /proc/interrupts >> /tmp/int.log; date >> /tmp/int.log

then start your capture and do almost the same commands again:

# cat /proc/interrupts > /tmp/int2.log; date >> /tmp/int2.log; sleep 60; cat /proc/interrupts >> /tmp/int2.log; date >> /tmp/int2.log

This would at least let me see that it's probably your ethernet device that is doing the sniffing and it would help to know if you are getting more interrupts on that device.

If you were also able to repeat these tests (with different log files) on the older kernel that would also be helpful.

I am going to test my system now and I will report back the results.
Comment 2 Andy Gospodarek 2009-01-27 11:22:52 EST
I see no significant performance difference when I compare the 2 kernels.  This might be because I do not have an identical system to yours, but it also could be related to something different.

On the device that is the 'snifpoint', are you running tcpdump/wireshark to capture the data, or is data just coming toward the device and the device is expected to drop it?

If you are capturing data have you tried to send the output to a different filesystem (or even sending it to /dev/null)?
Comment 3 Andre ten Bohmer 2009-01-27 11:33:09 EST
Indeed, even without running a packet capture tool the system performs poorly and feels sluggish. It looks like the high rate of traffic passing by is the problem, even if the NIC has to drop most of it. I just posted the test results but they are not showing up; maybe we added comments at the same time and mine was dropped? I'll try again in a few minutes.
Comment 4 Andre ten Bohmer 2009-01-27 11:34:12 EST
Andy, the test results.

---------
Old kernel 2.6.18-92.1.22.el5 (SMP)

* No Snort:
           CPU0       CPU1       
  0:   89674136          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       5057    4165221    IO-APIC-edge  ide0
 15:         27     805124    IO-APIC-edge  ide1
169:         27 2136067101   IO-APIC-level  eth1
177:        856    4973475   IO-APIC-level  eth0
NMI:          0          0 
LOC:   89681421   89681420 
ERR:          0
MIS:          0
Tue Jan 27 17:01:49 CET 2009

           CPU0       CPU1       
  0:   89734153          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       5057    4168547    IO-APIC-edge  ide0
 15:         27     805664    IO-APIC-edge  ide1
169:         27 2136067101   IO-APIC-level  eth1
177:        856    4977139   IO-APIC-level  eth0
NMI:          0          0 
LOC:   89741444   89741443 
ERR:          0
MIS:          0
Tue Jan 27 17:02:49 CET 2009


* Snort running:
           CPU0       CPU1       
  0:   89843939          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       5057    4177364    IO-APIC-edge  ide0
 15:         27     806654    IO-APIC-edge  ide1
169:         27 2137905574   IO-APIC-level  eth1
177:        856    4992378   IO-APIC-level  eth0
NMI:          0          0 
LOC:   89851239   89851238 
ERR:          0
MIS:          0
Tue Jan 27 17:04:39 CET 2009

           CPU0       CPU1       
  0:   89903959          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       5057    4180771    IO-APIC-edge  ide0
 15:         27     807190    IO-APIC-edge  ide1
169:         27 2139526328   IO-APIC-level  eth1
177:        856    4996373   IO-APIC-level  eth0
NMI:          0          0 
LOC:   89911264   89911263 
ERR:          0
MIS:          0
Tue Jan 27 17:05:39 CET 2009


-----------
New kernel 2.6.18-128.el5 (SMP)

* No Snort:
           CPU0       CPU1       
  0:     215550          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       9290      10920    IO-APIC-edge  ide0
 15:        293        502    IO-APIC-edge  ide1
169:     100401          2   IO-APIC-level  eth1
177:       4594      15088   IO-APIC-level  eth0
NMI:          0          0 
LOC:     215354     215352 
ERR:          0
MIS:          0
Tue Jan 27 17:12:07 CET 2009

           CPU0       CPU1       
  0:     276446          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       9290      13906    IO-APIC-edge  ide0
 15:        293        724    IO-APIC-edge  ide1
169:     100401          2   IO-APIC-level  eth1
177:       4594      17496   IO-APIC-level  eth0
NMI:          0          0 
LOC:     276255     276253 
ERR:          0
MIS:          0
Tue Jan 27 17:13:08 CET 2009



* Snort running:
           CPU0       CPU1       
  0:     601113          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       9290      31733    IO-APIC-edge  ide0
 15:        293       2766    IO-APIC-edge  ide1
169:     100401    2834858   IO-APIC-level  eth1
177:       4594      42280   IO-APIC-level  eth0
NMI:          0          0 
LOC:     600949     600947 
ERR:          0
MIS:          0
Tue Jan 27 17:18:33 CET 2009

           CPU0       CPU1       
  0:     661442          0    IO-APIC-edge  timer
  1:          9          0    IO-APIC-edge  i8042
  8:          1          0    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 10:          0          0   IO-APIC-level  ohci_hcd:usb1
 12:        103          0    IO-APIC-edge  i8042
 14:       9290      35349    IO-APIC-edge  ide0
 15:        293       3072    IO-APIC-edge  ide1
169:     100401    3492872   IO-APIC-level  eth1
177:       4594      45312   IO-APIC-level  eth0
NMI:          0          0 
LOC:     661284     661282 
ERR:          0
MIS:          0
Tue Jan 27 17:19:33 CET 2009
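
The per-device interrupt rate can be read off these snapshots by subtracting the two counts for an IRQ and dividing by the 60 seconds between them. Below is a minimal Python 3 sketch of that arithmetic. The function names are made up for illustration, and the sample rows are the eth0/eth1 lines from the "Snort running" snapshots above.

```python
import re

# A data row starts with an IRQ label ("169", "LOC", ...) followed by a colon.
ROW = re.compile(r"^\s*(\w+):\s+(.*)$")

def parse_interrupts(snapshot):
    """Map each IRQ label to (count summed over all CPUs, description)."""
    table = {}
    for line in snapshot.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # CPU header row, blank lines, output of `date`
        label, rest = match.groups()
        tokens = rest.split()
        counts = []
        for token in tokens:
            if not token.isdigit():
                break  # reached the controller type / device name columns
            counts.append(int(token))
        table[label] = (sum(counts), " ".join(tokens[len(counts):]))
    return table

def interrupt_rates(before, after, seconds):
    """Interrupts per second for every IRQ present in both snapshots."""
    first, second = parse_interrupts(before), parse_interrupts(after)
    rates = {}
    for label, (count, description) in second.items():
        if label in first:
            rates[label] = ((count - first[label][0]) / seconds, description)
    return rates

# eth1/eth0 rows copied from the "Snort running" snapshots in this comment.
OLD_BEFORE = """169:         27 2137905574   IO-APIC-level  eth1
177:        856    4992378   IO-APIC-level  eth0
Tue Jan 27 17:04:39 CET 2009"""
OLD_AFTER = """169:         27 2139526328   IO-APIC-level  eth1
177:        856    4996373   IO-APIC-level  eth0
Tue Jan 27 17:05:39 CET 2009"""
NEW_BEFORE = """169:     100401    2834858   IO-APIC-level  eth1
177:       4594      42280   IO-APIC-level  eth0
Tue Jan 27 17:18:33 CET 2009"""
NEW_AFTER = """169:     100401    3492872   IO-APIC-level  eth1
177:       4594      45312   IO-APIC-level  eth0
Tue Jan 27 17:19:33 CET 2009"""

for kernel, before, after in (("2.6.18-92.1.22.el5", OLD_BEFORE, OLD_AFTER),
                              ("2.6.18-128.el5", NEW_BEFORE, NEW_AFTER)):
    rates = interrupt_rates(before, after, 60)
    for label, (rate, description) in sorted(rates.items()):
        print(f"{kernel} IRQ {label} ({description}): {rate:.0f} interrupts/s")
```

On these numbers eth1 took roughly 27,000 interrupts/s under the old kernel and roughly 11,000 interrupts/s under the new one, so the extra softirq time does not appear to come from a higher hardware interrupt rate.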


]# mpstat -P ALL 1

05:17:48 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:17:49 PM  all    0.00    0.00    0.00    0.00    0.00   50.00    0.00   50.00   1012.00
05:17:49 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1002.00
05:17:49 PM    1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00     10.00

05:17:49 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:17:50 PM  all    1.50    0.00    0.00    0.00    0.00   49.00    0.00   49.50   1922.00
05:17:50 PM    0    1.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00   1002.00
05:17:50 PM    1    2.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    928.00

05:17:50 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:17:51 PM  all   30.85    0.00    3.48   10.95    2.99   13.93    0.00   37.81  29691.92
05:17:51 PM    0    1.01    0.00    1.01   22.22    0.00    0.00    0.00   75.76   1012.12
05:17:51 PM    1   61.00    0.00    5.00    0.00    6.00   28.00    0.00    0.00  28679.80

05:17:51 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:17:52 PM  all   32.34    0.00    5.47    0.50    2.99   15.92    0.00   42.79  27378.43
05:17:52 PM    0    6.86    0.00    6.86    0.98    0.00    0.00    0.00   85.29    982.35
05:17:52 PM    1   58.00    0.00    5.00    0.00    6.00   31.00    0.00    0.00  26388.24

05:17:52 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:17:53 PM  all    6.00    0.00    2.50    0.00    0.00   50.00    0.00   41.50   1011.11
05:17:53 PM    0   12.12    0.00    5.05    0.00    0.00    0.00    0.00   82.83   1011.11
05:17:53 PM    1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00      0.00

05:17:53 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:17:54 PM  all    0.00    0.00    0.00    0.00    0.00   50.00    0.00   50.00   1019.00
05:17:54 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1001.00
05:17:54 PM    1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00     18.00
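
Rows like these are easier to scan if the %soft column is pulled out per CPU. The sketch below (Python 3, hypothetical helper names) finds the column positions from the mpstat header row, so it should cope with both the AM/PM timestamps here and 24-hour timestamps. The sample is the first interval of the output above.

```python
def soft_irq_samples(mpstat_output):
    """Return a list of (timestamp, cpu, %soft) for every data row of mpstat output."""
    cpu_col = soft_col = None
    samples = []
    for line in mpstat_output.splitlines():
        tokens = line.split()
        if "CPU" in tokens and "%soft" in tokens:
            # Header row: remember where the CPU and %soft columns sit.
            cpu_col, soft_col = tokens.index("CPU"), tokens.index("%soft")
            continue
        if cpu_col is None or len(tokens) <= soft_col:
            continue  # banner line, blank line, or anything before the first header
        if tokens[0].startswith("Average"):
            continue  # summary row printed when a sample count is given
        try:
            soft = float(tokens[soft_col])
        except ValueError:
            continue
        samples.append((" ".join(tokens[:cpu_col]), tokens[cpu_col], soft))
    return samples

def busy_cpus(samples, threshold=90.0):
    """CPUs (ignoring the 'all' summary) that reach `threshold` %soft at least once."""
    return sorted({cpu for _, cpu, soft in samples if cpu != "all" and soft >= threshold})

SAMPLE = """\
05:17:48 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
05:17:49 PM  all    0.00    0.00    0.00    0.00    0.00   50.00    0.00   50.00   1012.00
05:17:49 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1002.00
05:17:49 PM    1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00     10.00
"""

print(busy_cpus(soft_irq_samples(SAMPLE)))
```

For this sample the only CPU reported is CPU 1, which matches the CPU that services the eth1 interrupt in the /proc/interrupts output.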
Comment 5 Andy Gospodarek 2009-01-27 12:48:40 EST
Can you tell me a little more about the type of traffic you have flowing on eth0 and eth1?  eth1 should have no work to do since there was no traffic destined for that interface, so we should really not see any interrupts.  I could see a possible problem if you are using multicast traffic and more addresses were added to the hardware list than the hardware can support (I'll have to check on the limit for 5704) because after you cross that point the interface goes into promiscuous mode.  This would send all traffic to the CPU.

Can you send me the output from the sosreport/sysreport command?  It will give me some detailed information about the system and its configuration (from a few specific files).
Comment 6 Andy Gospodarek 2009-01-27 13:43:02 EST
I'm also curious if these chips use ASF.  You can tell from the driver initialization printouts.  There will be either ASF[0] or ASF[1] in the output of dmesg or /var/log/messages.
Comment 7 Andre ten Bohmer 2009-01-28 05:40:29 EST
Regarding Comment #5:
eth0 is connected to a management VLAN with not much traffic. eth1 is linked to an interface which monitors all traffic to and from the Internet for over 5000 systems.

]# ethtool -S eth1 | grep -v ": 0"
NIC statistics:
     rx_octets: 2769081372
     rx_ucast_packets: 2432544982
     rx_mcast_packets: 71882
     rx_bcast_packets: 73813
     rx_fcs_errors: 2
     rx_64_or_less_octet_packets: 1351734485
     rx_65_to_127_octet_packets: 1080348058
     rx_128_to_255_octet_packets: 309887052
     rx_256_to_511_octet_packets: 132635119
     rx_512_to_1023_octet_packets: 193703695
     rx_1024_to_1522_octet_packets: 3659349515
     tx_octets: 29568
     tx_xon_sent: 212
     tx_xoff_sent: 250
     tx_flow_control: 250
     dma_writeq_full: 115742734
     rx_discards: 46861
     rx_errors: 1
     rx_threshold_hit: 1144211626
     ring_status_update: 1588697619
     nic_irqs: 1461566302
     nic_avoided_irqs: 127131317

]# ethtool -S eth0 | grep -v ": 0"
NIC statistics:
     rx_octets: 375477887
     rx_ucast_packets: 2477756
     rx_mcast_packets: 644
     rx_bcast_packets: 331019
     rx_64_or_less_octet_packets: 1051402
     rx_65_to_127_octet_packets: 1320782
     rx_128_to_255_octet_packets: 268658
     rx_256_to_511_octet_packets: 109783
     rx_512_to_1023_octet_packets: 5042
     rx_1024_to_1522_octet_packets: 53752
     tx_octets: 2663152195
     tx_collisions: 3
     tx_single_collisions: 2
     tx_deferred: 5
     tx_late_collisions: 1
     tx_ucast_packets: 2698592
     tx_bcast_packets: 1398
     rx_threshold_hit: 3950
     dma_readq_full: 29241
     ring_set_send_prod_index: 2699991
     ring_status_update: 4176706
     nic_irqs: 3951672
     nic_avoided_irqs: 225034
     nic_tx_threshold_hit: 47527


Regarding Comment #6 :
kernel: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
kernel: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
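
The same flags can be extracted from a saved dmesg or /var/log/messages file with a short script. This is only a sketch in Python 3: the helper name is hypothetical, and the regular expressions are based on the two lines quoted above, so they may need adjusting for other tg3 versions.

```python
import re

FLAG = re.compile(r"(\w+)\[(\d)\]")

def tg3_flags(log_line):
    """Return (interface, {flag: bool}) for a tg3 feature line, or None if it does not match."""
    match = re.search(r"(eth\d+): (.*\bASF\[\d\].*)", log_line)
    if not match:
        return None
    interface, rest = match.groups()
    return interface, {name: value == "1" for name, value in FLAG.findall(rest)}

LINES = [
    "kernel: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]",
    "kernel: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]",
]

for line in LINES:
    interface, flags = tg3_flags(line)
    print(interface, "ASF enabled" if flags["ASF"] else "ASF disabled")
```

Here ASF is enabled on eth0 (the management port) and disabled on eth1 (the sniffing port).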
Comment 8 Andre ten Bohmer 2009-01-28 05:44:51 EST
Created attachment 330219
sosreport

md5 c836c0752ae729dd1032c89c3a8a6ed5
Comment 9 Andre ten Bohmer 2009-02-11 07:44:04 EST
The latest kernel, 2.6.18-128.1.1.el5, is less sluggish than 2.6.18-128.el5, but still not as 'smooth' as running 2.6.18-92.1.22.el5.

Linux 2.6.18-128.1.1.el5 (scomp1001.wur.nl) 	02/11/2009

01:41:21 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
01:41:22 PM  all    0.50    0.00    0.50    0.00    2.97   60.89    0.00   35.15  28118.81
01:41:22 PM    0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    992.08
01:41:22 PM    1    0.00    0.00    0.00    1.00    6.00   23.00    0.00   70.00  27124.75

01:41:22 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
01:41:23 PM  all    1.00    0.00    0.00    0.00    2.50   60.50    0.00   36.00  28648.00
01:41:23 PM    0    0.00    0.00    0.00    0.00    0.00   97.00    0.00    3.00   1001.00
01:41:23 PM    1    3.00    0.00    0.00    0.00    4.00   23.00    0.00   70.00  27644.00

01:41:23 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
01:41:24 PM  all   32.66    0.00    2.51    1.51    2.51   12.56    0.00   48.24  28277.78
01:41:24 PM    0    0.00    0.00    0.00    3.03    0.00    0.00    0.00   96.97   1011.11
01:41:24 PM    1   63.73    0.00    4.90    0.00    5.88   25.49    0.00    0.00  27265.66

01:41:24 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
01:41:27 PM  all   29.91    0.00    2.80    0.13    2.67   51.54    0.00   12.95  27587.97
01:41:27 PM    0    0.00    0.00    0.00    0.00    0.00   74.06    0.00   25.94    999.73
01:41:27 PM    1   59.79    0.00    5.63    0.00    5.36   29.22    0.00    0.00  26588.24
Comment 10 Roland Friedwagner 2009-02-16 10:02:16 EST
Hello Sirs,

I can confirm this issue/bug for an HP DL585G1 server.

Nothing changed except upgrading to 5.3 and booting into
the 2.6.18-128.1.1 kernel.

Only the first Ethernet interface is connected here:
Feb 13 21:04:33 bach-s40 kernel: tg3.c:v3.93 (May 22, 2008)
Feb 13 21:04:33 bach-s40 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Feb 13 21:04:33 bach-s40 kernel: tg3: eth0: Flow control is on for TX and on for RX.
There is no sniffing and no high network traffic workload.

Interactive behavior:
Running "top -d 1" in an ssh session to the server hangs for about 1-2 seconds every 5-10 seconds.

I managed to collect these mpstat samples for the buggy kernel
on the freshly booted and idle server
(you can see there is something odd going on on cpu 4 if you look at the %soft
column; the complete data is in attachment mpstat_2.6.18-128.1.1.el5.log):
---------------------
[root@x1 ~]# uname -a
Linux x1.wu-wien.ac.at 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@x1 ~]# mpstat 1 15
Linux 2.6.18-128.1.1.el5 (x1.wu-wien.ac.at)       02/13/09

21:12:07     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
21:12:08     all    0.00    0.00    0.00    0.00    0.00    8.60    0.00   91.40   1025.00
21:12:09     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.07
21:12:10     all    0.12    0.00    0.12    0.00    0.12    4.11    0.00   95.51   1034.00
21:12:11     all    0.00    0.00    0.00    0.00    0.00   12.50    0.00   87.50   1062.00
21:12:12     all    0.00    0.00    0.00    0.00    0.00   12.50    0.00   87.50   1028.00
21:12:13     all    0.00    0.00    0.00    0.00    0.00    2.37    0.00   97.63    993.07
21:12:14     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1025.00
21:12:15     all    0.00    0.00    0.00    0.00    0.00   10.12    0.00   89.88   1011.00
21:12:16     all    0.00    0.00    0.00    0.00    0.00   12.48    0.00   87.52   1027.00
21:12:17     all    0.00    0.00    0.00    0.00    0.00    8.74    0.00   91.26   1003.00
21:12:18     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1021.00
21:12:19     all    0.00    0.00    0.00    0.00    0.00    3.88    0.00   96.12   1004.00
21:12:20     all    0.00    0.00    0.00    0.00    0.00   12.48    0.00   87.52   1029.00
21:12:21     all    0.12    0.00    0.00    0.00    0.00   12.73    0.00   87.14   1010.00
21:12:22     all    0.00    0.00    0.00    0.00    0.00    2.38    0.00   97.62   1021.00
Average:     all    0.02    0.00    0.01    0.00    0.01    6.86    0.00   93.11   1019.04

After rebooting into the last 5.2 kernel (2.6.18-92.1.22), all is fine again
(also freshly booted and idle; detailed stats are in attachment mpstat_2.6.18-92.1.22.el5.log):
[root@x1 ~]# uname -a
Linux x1.wu-wien.ac.at 2.6.18-92.1.22.el5 #1 SMP Fri Dec 5 09:28:22 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@x1 ~]# mpstat 1 15
Linux 2.6.18-92.1.22.el5 (x1.wu-wien.ac.at)       02/13/09

21:27:41     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
21:27:42     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1004.00
21:27:43     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1022.00
21:27:44     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1003.00
21:27:45     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1033.00
21:27:46     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1046.00
21:27:47     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1021.00
21:27:48     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1003.00
21:27:49     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1011.88
21:27:50     all    0.12    0.00    0.12    0.00    0.00    0.00    0.00   99.75   1015.00
21:27:51     all    0.00    0.00    0.00    0.00    0.00    0.12    0.00   99.88   1021.00
21:27:52     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00
21:27:53     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1021.00
21:27:54     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1003.00
21:27:55     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1092.00
21:27:56     all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00
Average:     all    0.01    0.00    0.01    0.00    0.00    0.01    0.00   99.98   1020.52

I also collected sysreports for both kernels (if someone at Red Hat is
interested in them).

I'm currently in the process of upgrading our Oracle database servers
(the other hardware is DL580G3 and DL580G4)
and cannot use the RHEL 5.3 kernels (2.6.18-128.1.1).
I also found that the RHEL 5.3 mkinitrd is unable to rebuild the initrd (for the HP qla drivers) for the 5.2 kernels without patching it like this :-O
~~~~~~~~~~~~~~~~~~~
--- /sbin/mkinitrd      2008-12-17 20:49:36.000000000 +0100
+++ /sbin/mkinitrd.fixed_for_pre_5.3_release     2009-02-13 21:52:22.000000000 +0100
@@ -1303,12 +1303,12 @@
     fi
 fi

-if [ "$withdmraid" == "1" ]; then
-    findmodule dm-mem-cache
-    findmodule dm-region_hash
-    findmodule dm-message
-    findmodule dm-raid45
-fi
+#if [ "$withdmraid" == "1" ]; then
+#    findmodule dm-mem-cache
+#    findmodule dm-region_hash
+#    findmodule dm-message
+#    findmodule dm-raid45
+#fi

 for n in $basicmodules; do
     findmodule $n
~~~~~~~~~~~~~~~~~~~~~
So using the 5.2 kernels to run 5.3 does not seem to be an option?

Kind Regards,
Roland
Comment 11 Roland Friedwagner 2009-02-16 10:04:16 EST
Created attachment 332045
mpstat, ps, interrupts
Comment 12 Roland Friedwagner 2009-02-16 10:06:21 EST
Created attachment 332046
mpstat, ps, interrupts (2.6.18-92.1.22)
Comment 15 RHEL Product and Program Management 2009-03-16 20:51:51 EDT
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 19 Allen Hewes 2009-04-06 15:28:49 EDT
I can confirm this bug on an HP DL140 G1 server. The box is unusable on any 128 series kernel! I am not using snort, only serving up static web content with Apache. I am using both interfaces on gigabit switches. This DL140 sees very light usage. The 92 series kernels don't exhibit this behavior.

Here is what a ping looks like after booting with a 128 series kernel:

Reply from x.x.x.x: bytes=32 time=2673ms TTL=57
Reply from x.x.x.x: bytes=32 time=15ms TTL=57
Reply from x.x.x.x: bytes=32 time=14ms TTL=57
Reply from x.x.x.x: bytes=32 time=2673ms TTL=57
Reply from x.x.x.x: bytes=32 time=15ms TTL=57
Reply from x.x.x.x: bytes=32 time=13ms TTL=57
Reply from x.x.x.x: bytes=32 time=2673ms TTL=57
Reply from x.x.x.x: bytes=32 time=13ms TTL=57
Reply from x.x.x.x: bytes=32 time=12ms TTL=57
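
The pattern in these replies can be made explicit by flagging every reply that is far slower than the median. The following is a rough Python 3 sketch; the helper names and the factor of 10 are arbitrary choices for illustration, and the input is the ping output quoted above.

```python
import re
from statistics import median

def reply_times_ms(ping_output):
    """Round-trip times in milliseconds, in the order the replies were printed."""
    return [int(ms) for ms in re.findall(r"time[=<](\d+)ms", ping_output)]

def spikes(times, factor=10):
    """Indexes of replies slower than `factor` times the median reply."""
    limit = factor * median(times)
    return [index for index, value in enumerate(times) if value > limit]

PING_OUTPUT = """\
Reply from x.x.x.x: bytes=32 time=2673ms TTL=57
Reply from x.x.x.x: bytes=32 time=15ms TTL=57
Reply from x.x.x.x: bytes=32 time=14ms TTL=57
Reply from x.x.x.x: bytes=32 time=2673ms TTL=57
Reply from x.x.x.x: bytes=32 time=15ms TTL=57
Reply from x.x.x.x: bytes=32 time=13ms TTL=57
Reply from x.x.x.x: bytes=32 time=2673ms TTL=57
Reply from x.x.x.x: bytes=32 time=13ms TTL=57
Reply from x.x.x.x: bytes=32 time=12ms TTL=57
"""

times = reply_times_ms(PING_OUTPUT)
slow = spikes(times)
print("slow replies at positions:", slow)
print("spacing between slow replies:", [b - a for a, b in zip(slow, slow[1:])])
```

In this capture every third reply is slow, and all three slow replies report the same 2673 ms, which suggests a regular stall rather than random network jitter.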

$ sudo mpstat -P ALL 1 
Linux 2.6.18-128.1.6.el5PAE (websvr6.vzw.decisiv.net)   04/06/2009

03:24:35 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:36 PM  all    0.00    0.00    0.99   99.01    0.00    0.00    0.00    0.00   1086.14
03:24:36 PM    0    0.00    0.00    0.99   99.01    0.00    0.00    0.00    0.00   1085.15

03:24:36 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:40 PM  all    0.00    0.00    0.00    9.86    0.00   75.49    0.00   14.65   1011.27
03:24:40 PM    0    0.00    0.00    0.00    9.86    0.00   75.49    0.00   14.65   1011.55

03:24:40 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:41 PM  all    0.00    0.00    0.00    8.00    0.00    0.00    0.00   92.00   1010.00
03:24:41 PM    0    0.00    0.00    0.00    8.00    0.00    0.00    0.00   92.00   1009.00

03:24:41 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:45 PM  all    0.00    0.00    0.00    0.00    0.00   72.75    0.00   27.25   1001.36
03:24:45 PM    0    0.00    0.00    0.00    0.00    0.00   72.75    0.00   27.25   1001.36

03:24:45 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:46 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00
03:24:46 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00

03:24:46 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:49 PM  all    0.00    0.00    0.00    0.00    0.00   72.83    0.00   27.17   1011.14
03:24:49 PM    0    0.00    0.00    0.00    0.00    0.00   72.83    0.00   27.17   1011.14

03:24:49 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:50 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00
03:24:50 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00

03:24:50 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:54 PM  all    0.00    0.00    0.00    0.00    0.00   72.75    0.00   27.25   1001.36
03:24:54 PM    0    0.00    0.00    0.00    0.00    0.00   72.75    0.00   27.25   1001.36

03:24:54 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:55 PM  all    0.00    0.00    0.00    2.97    0.00    0.00    0.00   97.03    998.02
03:24:55 PM    0    0.00    0.00    0.00    2.97    0.00    0.00    0.00   97.03    998.02

03:24:55 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:24:59 PM  all    0.00    0.00    0.00    0.00    0.00   72.95    0.00   27.05   1006.28
03:24:59 PM    0    0.00    0.00    0.00    0.00    0.00   72.95    0.00   27.05   1006.28

03:24:59 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:25:00 PM  all    0.00    0.00    0.00    3.96    0.00    0.00    0.00   96.04    997.03
03:25:00 PM    0    0.00    0.00    0.00    3.96    0.00    0.00    0.00   96.04    997.03

03:25:00 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:25:03 PM  all    0.00    0.00    0.00    0.00    0.00   72.83    0.00   27.17   1001.09
03:25:03 PM    0    0.00    0.00    0.00    0.00    0.00   72.83    0.00   27.17   1001.09
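
One detail in this output is that mpstat was asked for 1-second samples, yet every sample with a high %soft value arrives 3-4 seconds after the previous one. The sketch below (Python 3, hypothetical helper name) computes the gap between consecutive "all" rows; the timestamp and %soft pairs are copied from the rows above.

```python
from datetime import datetime, timedelta

def stalled_samples(rows, expected=1, fmt="%I:%M:%S %p"):
    """Return (timestamp, gap in seconds, %soft) for rows that arrived late.

    `rows` is a list of (timestamp, %soft) pairs in output order. A row counts
    as late when it follows the previous row by more than `expected` seconds.
    """
    late = []
    for (prev_stamp, _), (stamp, soft) in zip(rows, rows[1:]):
        delta = datetime.strptime(stamp, fmt) - datetime.strptime(prev_stamp, fmt)
        if delta < timedelta(0):
            delta += timedelta(days=1)  # timestamps wrapped past midnight
        gap = int(delta.total_seconds())
        if gap > expected:
            late.append((stamp, gap, soft))
    return late

# (timestamp, %soft) of each "all" row in the mpstat output above.
ROWS = [
    ("03:24:36 PM", 0.00), ("03:24:40 PM", 75.49),
    ("03:24:41 PM", 0.00), ("03:24:45 PM", 72.75),
    ("03:24:46 PM", 0.00), ("03:24:49 PM", 72.83),
    ("03:24:50 PM", 0.00), ("03:24:54 PM", 72.75),
    ("03:24:55 PM", 0.00), ("03:24:59 PM", 72.95),
    ("03:25:00 PM", 0.00), ("03:25:03 PM", 72.83),
]

for stamp, gap, soft in stalled_samples(ROWS):
    print(f"{stamp}: sample took {gap}s, %soft={soft}")
```

Every late sample has %soft above 70, and every on-time sample has %soft at 0, so the softirq bursts appear to stall the whole box for a few seconds at a time. That is the same order of magnitude as the 2673 ms ping replies.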
Comment 20 Andy Gospodarek 2009-04-06 17:04:37 EDT
Based on the fact that several people are seeing increases in the amount of time spent in softirq context, I'm guessing that tg3_poll is running longer than it should.  I know we took a large update to tg3 for 5.3 and I think some NAPI enhancements may be causing some of this pain.  I'll take a look and see what I can find.  I will also try to wrangle up a system with a tg3 card to see if I see the same symptoms.
Comment 21 Allen Hewes 2009-04-06 18:55:58 EDT
I think it has something to do with the generation of the BCM chipset on the motherboard. I have a bunch of DL360 G4p's and they run just fine with the RHEL 5.3 update. I am not sure if this matters, but for some reason the 92.1.22 kernel detects the card as 133MHz whereas the 128.1.6 kernel detects it as 66MHz. The box running the 128.1.6 kernel is a pretty recent model (2007/2008); the box running 92.1.22 is pretty old (2002/2003). I would have expected both to be clocked at 133MHz. I think you are going to need a pretty old BCM5704 card to experience this.
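
The bus speed the driver reports is the middle field of the parenthesised group in each "Tigon3" probe line from dmesg. A small Python 3 sketch for pulling those fields out is shown here; the regular expression is an assumption derived from the eth0 probe lines quoted in this comment and is not taken from the driver source.

```python
import re

PROBE = re.compile(
    r"(?P<iface>eth\d+): Tigon3 \[partno\((?P<partno>[^)]*)\) rev (?P<rev>\w+) "
    r"PHY\((?P<phy>[^)]*)\)\] \((?P<bus>[^:)]+):(?P<speed>[^:)]+):(?P<width>[^)]+)\)"
)

def probe_info(line):
    """Return the named fields of a tg3 probe line as a dict, or None if it does not match."""
    match = PROBE.search(line)
    return match.groupdict() if match else None

DL140_G1 = ("eth0: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) "
            "10/100/1000Base-T Ethernet 00:12:79:8f:43:d4")
DL360_G4P = ("eth0: Tigon3 [partno(349321-001) rev 2100 PHY(5704)] (PCIX:66MHz:64-bit) "
             "10/100/1000Base-T Ethernet 00:18:fe:30:c8:8a")

for host, line in (("DL140 G1", DL140_G1), ("DL360 G4p", DL360_G4P)):
    info = probe_info(line)
    print(host, info["partno"], "rev", info["rev"], info["bus"], info["speed"], info["width"])
```

Note that the two lines come from different machines with different chip revisions (2002 and 2100) as well as different kernels, so the 133MHz versus 66MHz difference cannot be attributed to the kernel from this data alone.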

DL140 G1:
$ uname -s -r -v -m -p -i
Linux 2.6.18-92.1.22.el5PAE #1 SMP Tue Dec 16 12:36:25 EST 2008 i686 i686 i386

$ sudo dmesg | grep '\(eth\|tg3\)'
tg3.c:v3.86 (November 9, 2007)
eth0: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:12:79:8f:43:d4
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
eth1: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:12:79:8f:43:d5
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.

$ sudo lspci -vvv -s 02:00
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
        Subsystem: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 177
        Region 0: Memory at febc0000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at febb0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:00.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 0010000038000020  Data: 0004

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
        Subsystem: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 185
        Region 0: Memory at febf0000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at febe0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:00.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: d410020040040000  Data: 6202


DL360 G4p:
$ uname -s -r -v -m -p -i
Linux 2.6.18-128.1.6.el5PAE #1 SMP Tue Mar 24 12:39:24 EDT 2009 i686 i686 i386

$ sudo dmesg | grep '\(eth\|tg3\)'
tg3.c:v3.93 (May 22, 2008)
eth0: Tigon3 [partno(349321-001) rev 2100 PHY(5704)] (PCIX:66MHz:64-bit) 10/100/1000Base-T Ethernet 00:18:fe:30:c8:8a
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f0000] dma_mask[64-bit]
eth1: Tigon3 [partno(349321-001) rev 2100 PHY(5704)] (PCIX:66MHz:64-bit) 10/100/1000Base-T Ethernet 00:18:fe:30:c8:89
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f0000] dma_mask[64-bit]
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.

$ sudo lspci -s 02:02 -vvv
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
        Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter (PCI-X, 10,100,1000-T)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 209
        Region 0: Memory at fdef0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:02.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: ffb7dffa97fbffec  Data: f95d

02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
        Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter (PCI-X, 10,100,1000-T)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 217
        Region 0: Memory at fdee0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:02.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 0ebf7fe7effbeff4  Data: ffdf
Comment 22 Allen Hewes 2009-04-06 19:23:33 EDT
Hi Andy,

I was able to get a 128 series kernel running on my DL140 with the tg3 drivers provided by Broadcom at 

http://www.broadcom.com/support/ethernet_nic/netxtreme_server.php

I built the SRPM and installed it. The driver version from upstream is 3.92n.

$ sudo rpm -qi tg3      
Name        : tg3                          Relocations: (not relocatable)
Version     : 3.92n                             Vendor: Broadcom Corporation
Release     : 1                             Build Date: Mon 06 Apr 2009 07:05:48 PM EDT
Install Date: Mon 06 Apr 2009 07:07:49 PM EDT      Build Host: rhel-dev-1.decisiv.net
Group       : System/Kernel                 Source RPM: tg3-3.92n-1.src.rpm
Size        : 797079                           License: GPL
Signature   : (none)
Packager    : Allen Hewes<allen@decisiv.net>
Summary     : Broadcom NetXtreme Gigabit ethernet driver
Description :
This package contains the Broadcom NetXtreme Gigabit ethernet driver.

$ uname -s -r -v -m -p -i
Linux 2.6.18-128.1.6.el5PAE #1 SMP Wed Apr 1 10:02:22 EDT 2009 i686 i686 i386

$ sudo dmesg | grep '\(eth\|tg3\)'
tg3.c:v3.92n (September 29, 2008)
eth0: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:12:79:8f:43:d4
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
eth1: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:12:79:8f:43:d5
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.

$ sudo lspci -vvv -s 02:00
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
        Subsystem: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 177
        Region 0: Memory at febc0000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at febb0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:00.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 0010000038000020  Data: 0004

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
        Subsystem: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 185
        Region 0: Memory at febf0000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at febe0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:00.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: d410020040040000  Data: 6202

ping to the 128 series kernel

Reply from x.x.x.x: bytes=32 time=15ms TTL=58
Reply from x.x.x.x: bytes=32 time=15ms TTL=58
Reply from x.x.x.x: bytes=32 time=15ms TTL=58
Reply from x.x.x.x: bytes=32 time=15ms TTL=58
Reply from x.x.x.x: bytes=32 time=15ms TTL=58
Reply from x.x.x.x: bytes=32 time=15ms TTL=58
Reply from x.x.x.x: bytes=32 time=15ms TTL=58
Reply from x.x.x.x: bytes=32 time=15ms TTL=58

$ sudo mpstat -P ALL 1
Linux 2.6.18-128.1.6.el5PAE (websvr6.vzw.decisiv.net)   04/06/2009

07:22:55 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:22:56 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    997.03
07:22:56 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    996.04

07:22:56 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:22:57 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00
07:22:57 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00

07:22:57 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:22:58 PM  all    0.00    0.00    0.00    4.00    0.00    0.00    0.00   96.00   1008.00
07:22:58 PM    0    0.00    0.00    0.00    4.00    0.00    0.00    0.00   96.00   1008.00

07:22:58 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:22:59 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00
07:22:59 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00

07:22:59 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:23:00 PM  all    0.00    0.00    0.00    6.00    0.00    0.00    0.00   94.00   1009.00
07:23:00 PM    0    0.00    0.00    0.00    6.00    0.00    0.00    0.00   94.00   1009.00

07:23:00 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:23:01 PM  all    0.00    0.00    0.00    7.00    0.00    0.00    0.00   93.00   1012.00
07:23:01 PM    0    0.00    0.00    0.00    7.00    0.00    0.00    0.00   93.00   1012.00

07:23:01 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:23:02 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00
07:23:02 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00

07:23:02 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:23:03 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    995.05
07:23:03 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    995.05
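
For comparing kernels, it can help to reduce mpstat samples like the ones above to the single column this bug is about, %soft. The following is a minimal Python sketch, assuming the column layout pasted above (12-hour timestamps with AM/PM and a header row that names the columns). The helper names `parse_mpstat` and `max_soft` are illustrative and not part of sysstat.

```python
def parse_mpstat(text):
    """Return a list of (cpu, {column: value}) rows from mpstat text output.

    Assumes the layout pasted in this bug: a 12-hour timestamp with AM/PM,
    a header row whose third field is CPU, then one row per CPU.
    """
    rows = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] not in ("AM", "PM"):
            continue  # banner line, blank line, or unrelated output
        if fields[2] == "CPU":
            columns = fields[3:]  # %user, %nice, ..., intr/s
            continue
        if columns is None:
            continue
        values = [float(v) for v in fields[3:]]
        rows.append((fields[2], dict(zip(columns, values))))
    return rows


def max_soft(text, cpu="all"):
    """Highest %soft value seen for the given CPU label, or None if absent."""
    samples = [cols["%soft"] for label, cols in parse_mpstat(text)
               if label == cpu and "%soft" in cols]
    return max(samples) if samples else None


SAMPLE = """\
07:22:55 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
07:22:56 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    997.03
07:22:56 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    996.04
"""

print(max_soft(SAMPLE))
```

Feeding it the captured output of `mpstat -P ALL 1` from an affected and an unaffected kernel gives one number per run to compare.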
Comment 23 Andrew Hecox 2009-04-06 20:32:11 EDT
Seeing this here with a tg3 5703:

03:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
Comment 25 Roland Friedwagner 2009-04-07 05:03:49 EDT
The HP DL585 here that shows the problem has this controller chip:

02:06.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
        Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter (PCI-X, 10,100,1000-T)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 193
        Region 0: Memory at f7df0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:06.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: dd3b7d185acbf7ec  Data: c1ff

(lspci run on 2.6.18-92.1.22.el5 kernel)
Kind regards, Roland
Comment 26 Roland Friedwagner 2009-04-07 05:15:18 EDT
I extracted this lspci info from sosreport of the 128 kernel:

02:06.0 0200: 14e4:1648 (rev 10)
        Subsystem: 0e11:00d0
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 201
        Region 0: Memory at f7df0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:06.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: dd3b7d185acbf7ec  Data: c1ff
Comment 28 masanari iida 2009-04-10 06:45:10 EDT
I have encountered a similar problem on the following systems.

DL380(G5) + RHEL5.3 (x86)
DL360(G5) + RHEL5.3 (x86_64)

Both systems use Broadcom NICs, and the driver is bnx2 (v.1.8.2c from the RHEL kernel).
Comment 29 masanari iida 2009-04-12 23:46:37 EDT
Hello.

On our reproduction box, it turned out that this symptom was caused by HP's
hp-snmp-agents.

How I troubleshot it:

(1) Renamed the bnx2.ko driver to bnx2_1.8.2c.ko.

(2) Booted the system in single-user mode.
Made sure the bnx2 driver was not loaded.
OK.
Load average stayed low.

(3) Changed the bnx2 driver back to its original name.
Ran modprobe bnx2 to load the driver.
OK.
Load average stayed low.

(4) Changed the runlevel to 2.
The symptom started to reproduce.
Load average increased to around 1.

(5) Stopped hp-snmp-agents
# /etc/init.d/hp-snmp-agents stop
This command lowered the load average.

(6) Started hp-snmp-agents
# /etc/init.d/hp-snmp-agents start
This command raised the load average to around 1.

(7) Disabled hp-snmp-agents and booted into runlevel 2
# /sbin/chkconfig --level 2345 hp-snmp-agents off

The symptom did not occur.

So my suggestion is to check whether the HP ProLiant Support Pack
is installed.
FYI, hp-snmp-agents-8.2.0-284.rhel5 is installed on this box.
Comment 30 Andy Gospodarek 2009-04-13 14:54:28 EDT
(In reply to comment #29)
> Hello.
> 
> On our reproduction box, it turned out that this symptom was caused by HP's
> hp-snmp-agents.
> 
> How I troubleshot it:
> 
> (1) Renamed the bnx2.ko driver to bnx2_1.8.2c.ko.
> 


For bnx2 that might be the case, but this issue is related to the tg3 driver.

There is a separate issue related to the bnx2 driver when using legacy interrupts that you might find helpful though.  You can examine bug 489519 for more information.
Comment 31 Andy Gospodarek 2009-04-13 22:47:29 EDT
Created attachment 339409 [details]
rhel5-tg3-softirq-fixup.patch

It seems this problem only exists on systems that are using on-board NICs and using ASF.  The hint was that the code comment said 2.5ms (2500usec), but the delay would clearly be 2500ms (2500000usec), which was 2.5s!  A simple one-liner could be used to fix this, but it turns out this was also fixed upstream already with this commit:

commit 4ba526ced990f4d61ee8d65fe8a6f0745e8e455c
Author: Matt Carlson <mcarlson@broadcom.com>
Date:   Fri Aug 15 14:10:04 2008 -0700

    tg3: Fix firmware event timeouts

Attached is the backported version that I've tested on RHEL5 and can confirm clean mpstat output as well as more reliable ping performance.
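
The unit mismatch described above can be checked by multiplying out a polling loop's worst-case budget. The sketch below is illustrative Python, not the tg3 code. The iteration count and per-iteration delay are hypothetical values chosen only so that their product matches the 2500000usec figure quoted above, and `poll_budget_usec` is a made-up helper name.

```python
def poll_budget_usec(iterations, delay_usec):
    """Worst-case time a busy-wait loop spends before giving up, in microseconds."""
    return iterations * delay_usec


INTENDED_USEC = 2500  # 2.5 ms, the budget the code comment promised

# Hypothetical loop parameters (not taken from the driver source) whose
# product reproduces the 2,500,000 usec delay reported in this comment.
actual_usec = poll_budget_usec(iterations=250000, delay_usec=10)

print(actual_usec / 1000.0, "ms worst case")
print(actual_usec // INTENDED_USEC, "times longer than intended")
```

A budget that is three orders of magnitude too large would let a single timed-out wait occupy the CPU for seconds, which would be consistent with the periodic %soft spikes reported earlier in this bug.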
Comment 34 Issue Tracker 2009-04-15 21:51:53 EDT
A customer from IT272401 confirmed this fix in their environment.

Masahiro


This event sent from IssueTracker by mmatsuya 
 issue 272401
Comment 35 Andre ten Bohmer 2009-04-16 03:42:16 EDT
Andy,

Do you have instructions on how to apply this fix? I tried it, but to no avail, so I'm doing something wrong.

]# wget -v ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/kernel-2.6.18-128.1.6.el5.src.rpm
]# rpm -Uhv kernel-2.6.18-128.1.6.el5.src.rpm 
]# cd /usr/src/redhat/
]# rpmbuild -bp --target=i686 SPECS/kernel-2.6.spec
]# cd BUILD/kernel-2.6.18/linux-2.6.18.i686/
]# make oldconfig ; make menuconfig
]# make scripts
]# patch -p1 < rhel5-tg3-softirq-fixup.patch.txt 
]# make M=drivers/net/
]# cd /lib/modules/2.6.18-128.1.6.el5/kernel/drivers/net/
]# mv tg3.ko tg3.ko_org
]# cp /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.i686/drivers/net/tg3.ko .
]# depmod -a 2.6.18-128.1.6.el5

]# ls -al /lib/modules/2.6.18-128.1.6.el5/kernel/drivers/net/tg3*
-rwxr-xr-x 1 root root 127884 Mar 24 18:43 tg3.ko
-rwxr-xr-x 1 root root 570287 Apr 14 15:49 tg3.ko_new

Rebooting the system to activate 2.6.18-128.1.6.el5 with the fixed tg3 driver results in the following kernel messages and no enabled NICs:

Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_ethtool_sset
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_ethtool_sset
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_mii_ioctl
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_mii_ioctl
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_connect
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_connect
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol mdiobus_register
Apr 14 16:01:48  kernel: tg3: Unknown symbol mdiobus_register
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_start
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_start
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_start_aneg
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_start_aneg
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol mdiobus_unregister
Apr 14 16:01:48  kernel: tg3: Unknown symbol mdiobus_unregister
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_ethtool_gset
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_ethtool_gset
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_stop
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_stop
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_disconnect
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_disconnect

TIA,
Andre
Comment 36 Andy Gospodarek 2009-04-16 10:56:28 EDT
What kernel were you running before?  Are you using a 5785?

You will probably also need to build and install the modules in drivers/net/phy to test this.
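
The log in comment #35 names every symbol the rebuilt tg3.ko could not resolve, which hints at which other modules have to be rebuilt alongside it. Below is a small Python sketch that collects those names. The regular expression assumes the exact message wording quoted in that comment, and `unresolved_symbols` is an illustrative helper name.

```python
import re

# Matches the two message forms quoted in comment #35.
_SYMBOL_RE = re.compile(
    r"kernel: (?P<module>\S+): "
    r"(?:disagrees about version of symbol|Unknown symbol) (?P<symbol>\S+)"
)


def unresolved_symbols(log_text):
    """Map module name -> sorted list of symbols it failed to resolve."""
    found = {}
    for match in _SYMBOL_RE.finditer(log_text):
        found.setdefault(match.group("module"), set()).add(match.group("symbol"))
    return {module: sorted(symbols) for module, symbols in found.items()}


SAMPLE = """\
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol phy_connect
Apr 14 16:01:48  kernel: tg3: Unknown symbol phy_connect
Apr 14 16:01:48  kernel: tg3: disagrees about version of symbol mdiobus_register
Apr 14 16:01:48  kernel: tg3: Unknown symbol mdiobus_register
"""

print(unresolved_symbols(SAMPLE))
```

In the log from comment #35, every reported name starts with phy_ or mdiobus_, which fits the suggestion to rebuild the modules in drivers/net/phy as well.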
Comment 37 Allen Hewes 2009-04-16 11:08:52 EDT
Couldn't a compiled test kernel be produced for this? I am willing to test the fix if it's as easy as installing a new kernel. The servers that are having this issue are in production, and I don't have any test servers to patch and build a new kernel on.
Comment 38 Andy Gospodarek 2009-04-16 14:28:38 EDT
(In reply to comment #37)
> Couldn't a compiled test kernel be produced for this? I am willing to test the
> fix if it's as easy as installing a new kernel. The servers that are having
> this issue are in production, I don't have any test servers to patch and
> build a new kernel on.  

It will be, but I just haven't had a chance to put together a test kernel recently.  I decided to post the patch in the meantime for anyone that wanted to try it out.  I will post to this bug when a test kernel is available, so make sure you are on the cc-list and you can try it when it's ready.  Thanks!
Comment 39 RHEL Product and Program Management 2009-04-16 14:48:48 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 40 Andre ten Bohmer 2009-04-17 09:27:42 EDT
(In reply to comment #36)
> What kernel were you running before?  Are you using a 5785?
> You will probably also need to build and install the modules in drivers/net/phy
> to test this.  

Indeed, compiling and installing drivers/net/phy did the trick (kernel 2.6.18-128.1.6.el5). I can confirm this fix in our environment (HP DL-140, BCM5704).
Comment 42 Don Zickus 2009-05-06 13:15:58 EDT
in kernel-2.6.18-144.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 44 Jim 2009-05-08 16:12:13 EDT
Verified that kernel-2.6.18-144.el5 fixes this problem on our Sun workstations using the BCM5703X chip.
Comment 45 Allen Hewes 2009-05-08 16:55:05 EDT
Andy and Don,

I can confirm that the 144 kernel fixes this issue. Thanks!

$ uname -s -r -v -m -p -i
Linux 2.6.18-144.el5PAE #1 SMP Tue May 5 20:56:42 EDT 2009 i686 i686 i386

$ sudo dmesg | grep '\(eth\|tg3\)'
tg3.c:v3.96-1 (November 21, 2008)
eth0: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:12:79:8f:43:d4
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
eth1: Tigon3 [partno(BCM95704A6) rev 2002 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:12:79:8f:43:d5
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.

$ sudo lspci -vvv -s 02:00
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
        Subsystem: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 177
        Region 0: Memory at febc0000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at febb0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:00.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: 0010000038000020  Data: 0004

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
        Subsystem: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (16000ns min), Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 185
        Region 0: Memory at febf0000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at febe0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [40] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=2048 OST=1
                Status: Dev=02:00.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
                Address: d410020040040000  Data: 6202

$ sudo mpstat -P ALL 1
Linux 2.6.18-144.el5PAE (x.x.x.x)       05/08/2009

04:51:30 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:31 PM  all    0.00    0.00    0.00    1.00    0.00    0.00    0.00   99.00   1007.00
04:51:31 PM    0    0.00    0.00    0.00    1.00    0.00    0.00    0.00   99.00   1006.00

04:51:31 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:32 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00
04:51:32 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00

04:51:32 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:33 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    996.04
04:51:33 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    996.04

04:51:33 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:34 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00
04:51:34 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00

04:51:34 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:35 PM  all    0.00    0.00    0.00    6.00    0.00    0.00    0.00   94.00   1009.00
04:51:35 PM    0    0.00    0.00    0.00    6.00    0.00    0.00    0.00   94.00   1009.00

04:51:35 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:36 PM  all    0.00    0.00    0.00    7.00    0.00    0.00    0.00   93.00   1011.00
04:51:36 PM    0    0.00    0.00    0.00    7.00    0.00    0.00    0.00   93.00   1011.00

04:51:36 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:37 PM  all    0.00    0.00    0.00    2.00    0.00    0.00    0.00   98.00   1049.00
04:51:37 PM    0    0.00    0.00    0.00    2.00    0.00    0.00    0.00   98.00   1049.00

04:51:37 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:38 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00
04:51:38 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00

04:51:38 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:39 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00
04:51:39 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00

04:51:39 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:40 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00
04:51:40 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1005.00

04:51:40 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:51:41 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00
04:51:41 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1006.00

Reply from x.x.x.x: bytes=32 time=24ms TTL=58
Reply from x.x.x.x: bytes=32 time=25ms TTL=58
Reply from x.x.x.x: bytes=32 time=24ms TTL=58
Reply from x.x.x.x: bytes=32 time=24ms TTL=58
Reply from x.x.x.x: bytes=32 time=24ms TTL=58
Reply from x.x.x.x: bytes=32 time=29ms TTL=58
Reply from x.x.x.x: bytes=32 time=25ms TTL=58
Reply from x.x.x.x: bytes=32 time=24ms TTL=58
Reply from x.x.x.x: bytes=32 time=25ms TTL=58
Reply from x.x.x.x: bytes=32 time=24ms TTL=58
Reply from x.x.x.x: bytes=32 time=24ms TTL=58
Comment 52 Andre ten Bohmer 2009-06-18 09:48:11 EDT
Bug seems to be solved in "RHSA-2009:1106 - Security Advisory"

quote:
* using the Broadcom NetXtreme BCM5704 network device with the tg3 driver
caused high system load and very bad performance. (BZ#502837)

Installed and activated kernel 2.6.18-128.1.14.el5, and system performance and response are OK, thanks.
Comment 53 Detlef Graef 2009-07-18 03:35:33 EDT
I have the same problem with a HP NC320T PCI Express gigabit server adapter with Broadcom BCM5721 chip.

http://h18006.www1.hp.com/products/servers/networking/nc320t/index.html

I'm using Fedora 11 with this kernel:

2.6.29.5-191.fc11.x86_64 #1 SMP Tue Jun 16 23:23:21 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

When network traffic occurs, the mouse and keyboard become unusable.
Comment 56 Andy Gospodarek 2009-07-20 11:27:34 EDT
(In reply to comment #53)
> I have the same problem with a HP NC320T PCI Express gigabit server adapter
> with Broadcom BCM5721 chip.
> 
> http://h18006.www1.hp.com/products/servers/networking/nc320t/index.html
> 
> I'm using Fedora 11 with this kernel:
> 
> 2.6.29.5-191.fc11.x86_64 #1 SMP Tue Jun 16 23:23:21 EDT 2009 x86_64 x86_64
> x86_64 GNU/Linux
> 
> When network traffic occurs, the mouse and keyboard become unusable.

This is really a RHEL bug, so bugs with Fedora should be discussed in their own bug.  I'll ask you to try something though. :-)

Have you ever looked at the output from 'mpstat 10 1' (or some other interval) when the system is running poorly?

Are there any other devices that are sharing an interrupt with the tg3 device? Take a look at /proc/interrupts and see if there is more than one device using the same irq number as the tg3 device.  If so, this patch might be helpful:

commit 624f8e5082efd0348ccf7e3d3f4bfc41efead26c
Author: Matt Carlson <mcarlson@broadcom.com>
Date:   Mon Apr 20 06:55:01 2009 +0000

    tg3: Allow screaming interrupt detection

and that patch is not in the latest F11 kernel yet (it first appeared in 2.6.30-rc1).
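
The /proc/interrupts check can be scripted. The sketch below is Python and assumes the usual layout: an IRQ number followed by a colon, one counter per CPU, a controller type, then a comma-separated list of handler names. The sample text and the device names in it are invented for illustration, and the controller column differs between kernels, so treat the parsing as approximate. `shared_irqs` is an illustrative helper name.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [handler names]} for IRQ rows listing more than one handler."""
    shared = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # CPU header row, or named rows such as NMI/LOC/ERR
        fields = rest.split()
        # Drop the per-CPU counters at the front of the row.
        while fields and fields[0].isdigit():
            fields.pop(0)
        if not fields:
            continue
        # Handlers are comma-separated; the first chunk still carries the
        # controller type in front of it, so keep only its last word.
        chunks = [c.strip() for c in " ".join(fields).split(",")]
        if not chunks[0]:
            continue
        names = [chunks[0].split()[-1]] + chunks[1:]
        if len(names) > 1:
            shared[int(label)] = names
    return shared


# Invented sample; real output comes from reading /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
 17:      98765          0   IO-APIC-level  eth0, uhci_hcd:usb3
 18:       4321          0   IO-APIC-level  libata
NMI:          0          0
"""

print(shared_irqs(SAMPLE))
```

On a live system the input would be the contents of /proc/interrupts; an entry for the tg3 interface's IRQ in the result means another handler is registered on the same line.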
Comment 57 Detlef Graef 2009-07-21 14:02:03 EDT
I was not sure if the bug will be counted as duplicate if I open a new one for Fedora 11. But I can open a new one for Fedora 11, no problem.

Now some information.

output from "lspci -vvv" of the network card:

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
        Subsystem: Hewlett-Packard Company NC320T PCIe Gigabit Server Adapter                            
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 
        Latency: 0, Cache Line Size: 64 bytes                                                                
        Interrupt: pin A routed to IRQ 17                                                                    
        Region 0: Memory at feaf0000 (64-bit, non-prefetchable) [size=64K]                                   
        Expansion ROM at feae0000 [disabled] [size=64K]                                                      
        Capabilities: [48] Power Management version 2                                                        
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)                   
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-                                                  
        Capabilities: [50] Vital Product Data                                                                
                Product Name: HP NC320T GIGABIT SERVER ADAPTER                                               
                Read-only fields:                                                                            
                        [PN] Part number: 012429-001                                                         
                        [EC] Engineering changes: 0A                                                         
                        [SN] Serial number: 0123456789                                                       
                        [MN] Manufacture ID: 31 30 33 43                                                     
                        [RV] Reserved: checksum bad, 47 byte(s) reserved                                     
                Read/write fields:                                                                           
                        [YA] Asset tag: XYZ01234567                                                          
                        [RW] Read-write area: 107 byte(s) free                                               
                End                                                                                          
        Capabilities: [58] MSI: Mask- 64bit+ Count=1/8 Enable-                                               
                Address: 1000700200002804  Data: 1008                                                        
        Capabilities: [d0] Express (v1) Endpoint, MSI 00                                                     
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited                    
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE- FLReset-                                      
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-                           
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-                                         
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes                                          
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-                          
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <2us, L1 <64us                
                        ClockPM- Surprise- LLActRep- BwNot-                                                  
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-                              
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-                                       
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-           
        Capabilities: [100] Advanced Error Reporting                                                         
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-                                   
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-                                   
                AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-                                   
        Capabilities: [13c] Virtual Channel <?>                                                                   
        Kernel driver in use: tg3                                                                                 
        Kernel modules: tg3           


Output from mpstat:

I first started "mpstat 2 60", then started a download with "wget", stopped the download, and then stopped "mpstat 2 60".

[robin@robin ~]$ mpstat 2 60
Linux 2.6.29.5-191.fc11.x86_64 (robin.homeunix.net)     21.07.2009

19:38:01     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
19:38:03     all    0,50    0,00    0,25    0,00    0,00    0,00    0,00   99,25     66,00
19:38:05     all    0,50    0,00    0,50    0,00    0,00    0,00    0,00   99,00    168,00
19:38:07     all    1,40    0,00    0,56    0,00    0,00    0,00    0,00   98,04    149,00
19:38:09     all    0,95    0,00    0,47    0,00    0,24    0,24    0,00   98,10    107,91
19:38:12     all    2,82    0,00    1,76    0,00    1,76    3,17    0,00   90,49    224,58
19:38:14     all    3,06    0,00    1,02    1,02    0,51    2,55    0,00   91,84    109,00
19:38:16     all    4,55    0,00    2,27    0,00    1,52    9,09    0,00   82,58    140,59
19:38:18     all   16,67    0,00   12,50    0,00   25,00   45,83    0,00    0,00    144,19
19:38:20     all   13,64    0,00    4,55    4,55   27,27   50,00    0,00    0,00    146,80
19:38:22     all   15,38    0,00   11,54    0,00   30,77   34,62    0,00    7,69    176,00
19:38:24     all    9,52    0,00   19,05    0,00   23,81   47,62    0,00    0,00    144,50
19:38:26     all   19,05    0,00   14,29    0,00   19,05   47,62    0,00    0,00    126,67
19:38:28     all   14,71    0,00   14,71    0,00   23,53   47,06    0,00    0,00    209,76
19:38:30     all   22,73    0,00   13,64    0,00   18,18   40,91    0,00    4,55    119,73
19:38:32     all   20,00    0,00   12,00    0,00   24,00   44,00    0,00    0,00    136,32
19:38:35     all    9,09    0,00    9,09    0,00   27,27   54,55    0,00    0,00     75,98
19:38:37     all    6,25    0,00    6,25    0,00   37,50   50,00    0,00    0,00    154,21
19:38:39     all    6,67    0,00    0,00    0,00   40,00   53,33    0,00    0,00    141,87
19:38:41     all   12,50    0,00    6,25    0,00   18,75   62,50    0,00    0,00    104,29
19:38:43     all    6,25    0,00   12,50    0,00   25,00   50,00    0,00    6,25    104,88
19:38:45     all    0,00    0,00    6,67    0,00   33,33   60,00    0,00    0,00    145,02
19:38:47     all    7,14    0,00    0,00    0,00   35,71   57,14    0,00    0,00    134,95
19:38:49     all    0,00    0,00    0,00    0,00   33,33   60,00    0,00    6,67    129,41
19:38:51     all    5,26    0,00    5,26    0,00   42,11   42,11    0,00    5,26    165,87
19:38:53     all    0,00    0,00    6,67    0,00   26,67   66,67    0,00    0,00    135,35
19:38:55     all   11,76    0,00   11,76    0,00   29,41   47,06    0,00    0,00    125,50
19:38:57     all    5,26    0,00   10,53   15,79   21,05   47,37    0,00    0,00    135,92
19:38:59     all    2,86    0,00    1,43    0,00    2,86    8,57    0,00   84,29    281,82
19:39:01     all    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00     29,00
19:39:03     all    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00    138,24
19:39:05     all    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00    101,46
19:39:07     all    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00     75,50
19:39:09     all    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00    213,50
19:39:11     all    0,00    0,00    0,00    0,00    0,00    0,00    0,00  100,00    221,50
^C

The download performance was very poor at 60 Kbytes/sec. Normally I get 600 Kbytes/sec (I have a download bandwidth of 6 MBit/sec).
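
To put a number on how much CPU time goes to interrupt handling in an mpstat run like this one, a small Python sketch along these lines could be used. It is an illustration only: the names parse_mpstat and interrupt_share are hypothetical, and it assumes the column header contains "CPU" and "%soft" and that values may use a decimal comma, as in this locale.

```python
def parse_mpstat(text):
    """Parse mpstat text into a list of {column name: float} dicts.

    Column names are taken from the header row (the one containing
    both "CPU" and "%soft"). Values are read from the end of each
    data row, so an extra AM/PM field after the timestamp is tolerated.
    """
    names = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if "CPU" in fields and "%soft" in fields:
            names = fields[fields.index("CPU") + 1:]
            continue
        if names is None or len(fields) < len(names) + 1:
            continue  # banner, prompt, blank line, or "^C"
        try:
            # Accept a decimal comma ("45,83") as well as a decimal point.
            values = [float(f.replace(",", ".")) for f in fields[-len(names):]]
        except ValueError:
            continue
        rows.append(dict(zip(names, values)))
    return rows


def interrupt_share(row):
    """Percentage of CPU time spent in hard plus soft interrupt handling."""
    return row["%irq"] + row["%soft"]


# Three rows copied from the mpstat output in this comment.
SAMPLE = """\
19:38:01     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
19:38:03     all    0,50    0,00    0,25    0,00    0,00    0,00    0,00   99,25     66,00
19:38:18     all   16,67    0,00   12,50    0,00   25,00   45,83    0,00    0,00    144,19
19:38:20     all   13,64    0,00    4,55    4,55   27,27   50,00    0,00    0,00    146,80
"""

if __name__ == "__main__":
    for row in parse_mpstat(SAMPLE):
        print(row["%idle"], interrupt_share(row))
```

Summing %irq and %soft per sample makes it easy to compare the idle period with the download period without reading the table by eye.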

Kernel is:  2.6.29.5-191.fc11.x86_64 #1 SMP Tue Jun 16 23:23:21 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux


I think the interrupt is not shared:
(output after the download)

[root@robin ~]# cat /proc/interrupts
           CPU0       CPU1
  0:        133          3   IO-APIC-edge      timer
  1:        719         40   IO-APIC-edge      i8042
  4:          0          2   IO-APIC-edge
  5:          0          0   IO-APIC-edge      MPU401 UART
  7:          1          0   IO-APIC-edge      parport0
  8:          0         36   IO-APIC-edge      rtc0
  9:         16          0   IO-APIC-fasteoi   acpi
 12:          0      20163   IO-APIC-edge      i8042
 14:         33      14407   IO-APIC-edge      pata_amd
 15:          0          0   IO-APIC-edge      pata_amd
 17:          0          8   IO-APIC-fasteoi   firewire_ohci
 18:        898         46   IO-APIC-fasteoi   eth0
 19:          0          0   IO-APIC-fasteoi   radeon
 20:        432        174   IO-APIC-fasteoi   HDA Intel
 21:          1         19   IO-APIC-fasteoi   ohci_hcd:usb2
 22:          0          2   IO-APIC-fasteoi   ehci_hcd:usb1
 28:          1       3081   PCI-MSI-edge      ahci
NMI:          0          0   Non-maskable interrupts
LOC:      74575      52341   Local timer interrupts
RES:       8968       6332   Rescheduling interrupts
CAL:        109         45   Function call interrupts
TLB:        582        297   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          1
MIS:          0
Comment 58 Andy Gospodarek 2009-07-21 14:07:52 EDT
It won't be closed as a duplicate, I promise.

If you cut and paste the contents of comment #53, comment #56, and comment #57 into a new bug and assign it to 'agospoda@redhat.com', we can work on it there.
Comment 60 Andy Gospodarek 2009-08-31 16:14:41 EDT
*** Bug 520183 has been marked as a duplicate of this bug. ***
Comment 61 errata-xmlrpc 2009-09-02 04:01:49 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html