Bug 451112 - MRG machines losing ethernet connectivity
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: beta
Hardware: x86_64 All
Priority: low, Severity: high
Assigned To: Red Hat Real Time Maintenance
Reported: 2008-06-12 16:32 EDT by IBM Bug Proxy
Modified: 2008-09-15 23:17 EDT (History)
Doc Type: Bug Fix
Last Closed: 2008-09-15 23:17:27 EDT
External Trackers
Tracker ID Priority Status Summary Last Updated
IBM Linux Technology Center 44627 None None None Never

Description IBM Bug Proxy 2008-06-12 16:32:20 EDT
Problem description:
JTC Realtime test machines running Red Hat MRG are losing network connectivity
after a few days of running realtime Java tests. The type of tests does not seem
to matter.
We have seen this several times on kernels 2.6.24.3-29, 2.6.24.4-30 and
2.6.24.4-32. I will post further comments here if 2.6.24.4-47 makes any difference.

Hardware Environment
We've seen this on two machines:
rtj-opt6.hursley.ibm.com - eServer 326m
Model number 7969-76G
2 x 2.4 GHz Opteron 280 (dual core)
5 GB RAM

rtj-opt22.hursley.ibm.com - eServer x3455
Model number 7984-52G
2 x 2.6 GHz Opteron 2218 (dual core)
10 GB RAM

The kernel outputs the following messages:
May  4 04:12:34 rtj-opt6 kernel: NETDEV WATCHDOG: eth0: transmit timed out
May  4 04:12:34 rtj-opt6 kernel: tg3: eth0: transmit timed out, resetting
May  4 04:12:34 rtj-opt6 kernel: tg3: DEBUG: MAC_TX_STATUS[ffffffff] MAC_RX_STATUS[ffffffff]
May  4 04:12:34 rtj-opt6 kernel: tg3: DEBUG: RDMAC_STATUS[ffffffff] WDMAC_STATUS[ffffffff]
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=2000 enable_bit=2
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=2800 enable_bit=2
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=3000 enable_bit=2
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
May  4 04:12:34 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
May  4 04:12:35 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
May  4 04:12:35 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=1000 enable_bit=2
May  4 04:12:35 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=1c00 enable_bit=2
May  4 04:12:35 rtj-opt6 kernel: tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
May  4 04:12:35 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=3c00 enable_bit=2
May  4 04:12:35 rtj-opt6 kernel: tg3: tg3_stop_block timed out, ofs=4c00 enable_bit=2
May  4 04:12:36 rtj-opt6 kernel: tg3: eth0: No firmware running.
May  4 04:12:37 rtj-opt6 kernel: tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
May  4 04:12:49 rtj-opt6 kernel: tg3: eth0: Link is down.


After this the machine has to be rebooted to get the network to work again.

The two machines have different network hardware. Both have the latest firmware.
rtj-opt6:
04:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit
Ethernet (rev 03)
04:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit
Ethernet (rev 03)

rtj-opt22:
02:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
02:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)

This problem reminds me of the one that was fixed in Bug #25487. However, this
time round the problem takes several days to appear - I am unable to make it
happen immediately as described in comment 7 of that bug.
=Comment: #3=================================================
P. N. Stanton <pstanton@uk.ibm.com> - 2008-05-29 07:02 EDT
Seen again on rtj-opt22.hursley.ibm.com which is now running 2.6.24.7-54.el5rt

There were some other messages in the log this time, produced immediately before
the tg3 messages we've seen before:

May 28 05:06:32 rtj-opt22 kernel: irq 11: nobody cared (try booting with the "irqpoll" option)
May 28 05:06:32 rtj-opt22 kernel: irq 11: Some systems using an IO-APIC require a special quirk to workaround
May 28 05:06:32 rtj-opt22 kernel: irq 11: problems with interrupt routing. If your system requires such a quirk,
May 28 05:06:32 rtj-opt22 kernel: irq 11: please try booting with the "ioapic_level_quirk=1" option.
May 28 05:06:32 rtj-opt22 kernel: Pid: 555, comm: IRQ-11 Not tainted 2.6.24.7-54.el5rt #1
May 28 05:06:32 rtj-opt22 kernel:
May 28 05:06:32 rtj-opt22 kernel: Call Trace:
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8808cf66>] ? :libata:ata_interrupt+0x1c5/0x1dd
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8107b998>] __report_bad_irq+0x71/0xc2
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8107bbf2>] note_interrupt+0x209/0x243
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8107aee6>] thread_simple_irq+0x80/0x9d
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8107b5c1>] do_irqd+0xc2/0x2c2
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8107b4ff>] ? do_irqd+0x0/0x2c2
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8107b4ff>] ? do_irqd+0x0/0x2c2
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff81050fab>] kthread+0x49/0x76
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8100cfe8>] child_rip+0xa/0x12
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff81050f62>] ? kthread+0x0/0x76
May 28 05:06:32 rtj-opt22 kernel:  [<ffffffff8100cfde>] ? child_rip+0x0/0x12
May 28 05:06:32 rtj-opt22 kernel:
May 28 05:06:32 rtj-opt22 kernel: handlers:
May 28 05:06:32 rtj-opt22 kernel: [<ffffffff8808cda1>] (ata_interrupt+0x0/0x1dd [libata])

Should I try booting the machine with the options that the kernel suggests?
=Comment: #4=================================================
Vernon Mauery <mauery@us.ibm.com> - 2008-05-29 10:28 EDT
I am not the APIC expert, but last I understood, irqpoll does nothing in an -rt
kernel.  But you could try the "ioapic_level_quirk=1" option.  You might also
try "acpi=noirq".

=Comment: #6=================================================
P. N. Stanton <pstanton@uk.ibm.com> - 2008-05-30 04:37 EDT
Machine is now running with kernel option ioapic_level_quirk=1


As requested in comment 5:
[root@rtj-opt22 ~]# cat /proc/interrupts; sleep 5; cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        323          0          0          1   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0         12   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 10:          0          0          0         80   IO-APIC-fasteoi   ohci_hcd:usb1, ohci_hcd:usb2, ehci_hcd:usb3
 11:          0          0         22      60028   IO-APIC-fasteoi   sata_svw
 12:          0          0          0          4   IO-APIC-edge      i8042
 14:          0          0          0          0   IO-APIC-edge      libata
 15:          0          0          0          0   IO-APIC-edge      libata
 18:          0          0         62     302266   IO-APIC-pcix-fasteoi  eth0
NMI:          0          0          0          0   Non-maskable interrupts
LOC:   61245573   61232035   61230121   61352349   Local timer interrupts
RES:      58516      10291      11188      11819   Rescheduling interrupts
CAL:        425        486        512        197   function call interrupts
TLB:        561       1753        527       1784   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
           CPU0       CPU1       CPU2       CPU3
  0:        323          0          0          1   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0         12   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 10:          0          0          0         80   IO-APIC-fasteoi   ohci_hcd:usb1, ohci_hcd:usb2, ehci_hcd:usb3
 11:          0          0         22      60031   IO-APIC-fasteoi   sata_svw
 12:          0          0          0          4   IO-APIC-edge      i8042
 14:          0          0          0          0   IO-APIC-edge      libata
 15:          0          0          0          0   IO-APIC-edge      libata
 18:          0          0         62     302290   IO-APIC-pcix-fasteoi  eth0
NMI:          0          0          0          0   Non-maskable interrupts
LOC:   61250577   61237039   61235125   61357363   Local timer interrupts
RES:      58516      10299      11191      11821   Rescheduling interrupts
CAL:        425        486        512        197   function call interrupts
TLB:        561       1753        529       1784   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
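
The two snapshots above are easier to compare as per-IRQ rates. The following is a minimal Python sketch, not part of the original report, that sums each IRQ's counters across CPUs and reports only the lines that changed. The function names `parse_interrupts` and `interrupt_rates` are hypothetical, and the parser assumes the `/proc/interrupts` layout shown in this comment.

```python
def parse_interrupts(text):
    """Map each IRQ label (e.g. '18', 'LOC') to its count summed over all CPUs."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = {}
    for ln in lines[1:]:
        label, sep, rest = ln.partition(":")
        if not sep:
            continue                      # no label on this line
        counts = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break                     # reached the chip/handler name
            counts.append(int(field))
        if counts:                        # skips wrapped handler-name lines
            totals[label.strip()] = sum(counts)
    return totals


def interrupt_rates(before, after, seconds):
    """Return {irq: interrupts per second} for IRQs whose count changed."""
    a, b = parse_interrupts(before), parse_interrupts(after)
    return {irq: (b[irq] - a[irq]) / seconds
            for irq in a if irq in b and b[irq] != a[irq]}
```

Applied to the output above with `seconds=5`, eth0 (IRQ 18) advanced by 24 interrupts and sata_svw (IRQ 11) by 3, so both devices were still raising interrupts at the time of the capture.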

=Comment: #7=================================================
P. N. Stanton <pstanton@uk.ibm.com> - 2008-06-12 11:55 EDT
The ioapic_level_quirk=1 kernel option stopped the messages about IRQ 11
appearing, but rtj-opt22 still lost ethernet as before. Machine is running
2.6.24.7-62.el5rt, and this time only took 18 hours from boot to losing ethernet
connection - previously it has taken several days.

I will try the acpi=noirq option next.
=Comment: #9=================================================
Nivedita Singhvi <nivedita@us.ibm.com> - 2008-06-12 12:27 EDT
When you say "losing ethernet" - I assume you are still seeing 
messages in /var/log/messages saying the interface went down?

Can you provide the following from both failing machines:

1. ethtool -i $interface
2. ifconfig
3. netstat -s
4. lsmod
5. ps -fe | grep irq
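
For anyone gathering the same five items from several machines, they can be collected in one pass. This is a rough Python sketch under stated assumptions: the helper names `diagnostic_commands` and `collect` are hypothetical, the tools are assumed to be on the PATH, and the `ps -fe | grep irq` pipeline is replaced by filtering the `ps -fe` output in Python.

```python
import subprocess


def diagnostic_commands(interface):
    """Return the argv lists for the five requested diagnostics."""
    return [
        ["ethtool", "-i", interface],
        ["ifconfig"],
        ["netstat", "-s"],
        ["lsmod"],
        ["ps", "-fe"],
    ]


def collect(interface, pattern="irq"):
    """Run each command and return {command line: captured stdout}."""
    report = {}
    for argv in diagnostic_commands(interface):
        try:
            out = subprocess.run(argv, capture_output=True,
                                 text=True, check=False).stdout
        except FileNotFoundError:
            out = "(command not found)\n"
        if argv[0] == "ps":
            # Equivalent of piping through `grep irq`.
            out = "".join(line for line in out.splitlines(keepends=True)
                          if pattern in line)
        report[" ".join(argv)] = out
    return report

# Example: for cmd, out in collect("eth0").items(): print("#", cmd); print(out)
```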



=Comment: #10=================================================
P. N. Stanton <pstanton@uk.ibm.com> - 2008-06-12 12:35 EDT
Yes, messages produced were as before and eth0 was down.

As requested:

[root@rtj-opt22 ~]# ethtool -i eth0
driver: tg3
version: 3.86
firmware-version: 5704-v3.41, ASFIPMIc v2.45
bus-info: 0000:02:01.0

[root@rtj-opt22 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:14:5E:55:A1:F0
          inet addr:9.71.34.145  Bcast:9.71.34.255  Mask:255.255.255.0
          inet6 addr: fe80::214:5eff:fe55:a1f0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5181 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3985 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1310750 (1.2 MiB)  TX bytes:580892 (567.2 KiB)
          Interrupt:18

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:22 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1872 (1.8 KiB)  TX bytes:1872 (1.8 KiB)

[root@rtj-opt22 ~]# netstat -s
Ip:
    3257 total packets received
    0 forwarded
    0 incoming packets discarded
    3257 incoming packets delivered
    3960 requests sent out
Icmp:
    338 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 40
        echo requests: 298
    300 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 2
        echo replies: 298
IcmpMsg:
        InType3: 40
        InType8: 298
        OutType0: 298
        OutType3: 2
Tcp:
    52 active connections openings
    4 passive connection openings
    0 failed connection attempts
    2 connection resets received
    16 connections established
    2355 segments received
    2516 segments send out
    9 segments retransmited
    0 bad segments received.
    38 resets sent
Udp:
    242 packets received
    0 packets to unknown port received.
    0 packet receive errors
    1135 packets sent
UdpLite:
error parsing /proc/net/netstat: Success

[root@rtj-opt22 ~]# lsmod
Module                  Size  Used by
autofs4                29064  2
hidp                   24704  2
nfs                   260184  6
lockd                  72496  2 nfs
nfs_acl                11392  1 nfs
rfcomm                 44064  0
l2cap                  29952  10 hidp,rfcomm
bluetooth              62180  5 hidp,rfcomm,l2cap
sunrpc                183912  22 nfs,lockd,nfs_acl
ipv6                  284136  28
dm_multipath           25744  0
video                  27284  0
output                 11776  1 video
sbs                    22416  0
sbshc                  14208  1 sbs
battery                21384  0
ac                     13704  0
parport_pc             35496  0
lp                     20612  0
parport                43788  2 parport_pc,lp
sg                     40832  0
tg3                   119556  0
pata_acpi              14336  0
k8_edac                25544  0
button                 16032  0
edac_core              52240  4 k8_edac
serio_raw              14340  0
pata_serverworks       17024  0
ata_generic            14596  0
shpchp                 38556  0
k8temp                 13440  0
hwmon                  11224  1 k8temp
pcspkr                 11264  0
dm_snapshot            23704  0
dm_zero                10240  0
dm_mirror              27904  0
dm_mod                 63600  9 dm_multipath,dm_snapshot,dm_zero,dm_mirror
sata_svw               14980  3
libata                146736  4 pata_acpi,pata_serverworks,ata_generic,sata_svw
sd_mod                 33872  5
scsi_mod              153432  3 sg,libata,sd_mod
ext3                  130448  2
jbd                    53288  1 ext3
mbcache                16128  1 ext3
ehci_hcd               40076  0
ohci_hcd               29444  0
uhci_hcd               31008  0

[root@rtj-opt22 ~]# ps -fe | grep irq
root         5     2  0 16:58 ?        00:00:00 [sirq-high/0]
root         6     2  0 16:58 ?        00:00:01 [sirq-timer/0]
root         7     2  0 16:58 ?        00:00:00 [sirq-net-tx/0]
root         8     2  0 16:58 ?        00:00:00 [sirq-net-rx/0]
root         9     2  0 16:58 ?        00:00:00 [sirq-block/0]
root        10     2  0 16:58 ?        00:00:00 [sirq-tasklet/0]
root        11     2  0 16:58 ?        00:00:00 [sirq-sched/0]
root        12     2  0 16:58 ?        00:00:00 [sirq-hrtimer/0]
root        13     2  0 16:58 ?        00:00:00 [sirq-rcu/0]
root        18     2  0 16:58 ?        00:00:00 [sirq-high/1]
root        19     2  0 16:58 ?        00:00:01 [sirq-timer/1]
root        20     2  0 16:58 ?        00:00:00 [sirq-net-tx/1]
root        21     2  0 16:58 ?        00:00:00 [sirq-net-rx/1]
root        22     2  0 16:58 ?        00:00:00 [sirq-block/1]
root        23     2  0 16:58 ?        00:00:00 [sirq-tasklet/1]
root        24     2  0 16:58 ?        00:00:00 [sirq-sched/1]
root        25     2  0 16:58 ?        00:00:00 [sirq-hrtimer/1]
root        26     2  0 16:58 ?        00:00:00 [sirq-rcu/1]
root        31     2  0 16:58 ?        00:00:00 [sirq-high/2]
root        32     2  0 16:58 ?        00:00:01 [sirq-timer/2]
root        33     2  0 16:58 ?        00:00:00 [sirq-net-tx/2]
root        34     2  0 16:58 ?        00:00:00 [sirq-net-rx/2]
root        35     2  0 16:58 ?        00:00:00 [sirq-block/2]
root        36     2  0 16:58 ?        00:00:00 [sirq-tasklet/2]
root        37     2  0 16:58 ?        00:00:00 [sirq-sched/2]
root        38     2  0 16:58 ?        00:00:00 [sirq-hrtimer/2]
root        39     2  0 16:58 ?        00:00:00 [sirq-rcu/2]
root        44     2  0 16:58 ?        00:00:00 [sirq-high/3]
root        45     2  0 16:58 ?        00:00:01 [sirq-timer/3]
root        46     2  0 16:58 ?        00:00:00 [sirq-net-tx/3]
root        47     2  0 16:58 ?        00:00:00 [sirq-net-rx/3]
root        48     2  0 16:58 ?        00:00:00 [sirq-block/3]
root        49     2  0 16:58 ?        00:00:00 [sirq-tasklet/3]
root        50     2  0 16:58 ?        00:00:00 [sirq-sched/3]
root        51     2  0 16:58 ?        00:00:00 [sirq-hrtimer/3]
root        52     2  0 16:58 ?        00:00:00 [sirq-rcu/3]
root      5719  5672  0 17:34 pts/2    00:00:00 grep irq

=Comment: #11=================================================
Nivedita Singhvi <nivedita@us.ibm.com> - 2008-06-12 12:49 EDT
Peter, was that information from the e326 or the x3455? Could you
post the same output from the other machine as well? The fact that two
different boxes here are seeing the same error points to elsewhere in the
stack, not the driver/NIC FW, unless it's the same on both?
Comment 1 IBM Bug Proxy 2008-06-13 13:16:29 EDT
------- Comment From pstanton@uk.ibm.com 2008-06-13 13:12 EDT-------
rtj-opt22 is the x3455. Here is the same information from the e326m. I've
noticed that the e326m isn't displaying this problem as often as the x3455.

[root@rtj-opt6 ~]# ethtool -i eth0
driver: tg3
version: 3.86
firmware-version: 5780-v3.29, ASFIPMI v6.21
bus-info: 0000:04:04.0

[root@rtj-opt6 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:11:25:C4:34:F4
          inet addr:9.71.34.108  Bcast:9.71.34.255  Mask:255.255.255.0
          inet6 addr: fe80::211:25ff:fec4:34f4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1253557 errors:0 dropped:0 overruns:0 frame:0
          TX packets:359494 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:171229079 (163.2 MiB)  TX bytes:61588103 (58.7 MiB)
          Interrupt:26

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:242 errors:0 dropped:0 overruns:0 frame:0
          TX packets:242 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:40351 (39.4 KiB)  TX bytes:40351 (39.4 KiB)

[root@rtj-opt6 ~]# netstat -s
Ip:
    448082 total packets received
    0 forwarded
    0 incoming packets discarded
    448082 incoming packets delivered
    356231 requests sent out
Icmp:
    13241 ICMP messages received
    8 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 1687
        echo requests: 11554
    11565 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 11
        echo replies: 11554
IcmpMsg:
        InType3: 1687
        InType8: 11554
        OutType0: 11554
        OutType3: 11
Tcp:
    1333 active connections openings
    147 passive connection openings
    0 failed connection attempts
    67 connection resets received
    12 connections established
    294920 segments received
    334087 segments send out
    908 segments retransmited
    0 bad segments received.
    2180 resets sent
Udp:
    7580 packets received
    9 packets to unknown port received.
    0 packet receive errors
    9669 packets sent
UdpLite:
error parsing /proc/net/netstat: Success

[root@rtj-opt6 ~]# lsmod
Module                  Size  Used by
autofs4                29064  2
hidp                   24704  2
nfs                   260184  6
lockd                  72496  2 nfs
nfs_acl                11392  1 nfs
rfcomm                 44064  0
l2cap                  29952  10 hidp,rfcomm
bluetooth              62180  5 hidp,rfcomm,l2cap
sunrpc                183912  22 nfs,lockd,nfs_acl
ipv6                  284136  30
dm_multipath           25744  0
video                  27284  0
output                 11776  1 video
sbs                    22416  0
sbshc                  14208  1 sbs
battery                21384  0
ac                     13704  0
parport_pc             35496  0
lp                     20612  0
parport                43788  2 parport_pc,lp
sr_mod                 23844  0
sg                     40832  0
cdrom                  40616  1 sr_mod
tg3                   119556  0
pata_serverworks       17024  0
pata_acpi              14336  0
k8_edac                25544  0
ata_generic            14596  0
edac_core              52240  4 k8_edac
shpchp                 38556  0
k8temp                 13440  0
button                 16032  0
serio_raw              14340  0
hwmon                  11224  1 k8temp
pcspkr                 11264  0
dm_snapshot            23704  0
dm_zero                10240  0
dm_mirror              27904  0
dm_mod                 63600  9 dm_multipath,dm_snapshot,dm_zero,dm_mirror
sata_svw               14980  2
libata                146736  4 pata_serverworks,pata_acpi,ata_generic,sata_svw
mptspi                 26000  0
mptscsih               39040  1 mptspi
scsi_transport_spi     32768  1 mptspi
mptbase                77540  2 mptspi,mptscsih
sd_mod                 33872  3
scsi_mod              153432  7 sr_mod,sg,libata,mptspi,mptscsih,scsi_transport_spi,sd_mod
ext3                  130448  2
jbd                    53288  1 ext3
mbcache                16128  1 ext3
ehci_hcd               40076  0
ohci_hcd               29444  0
uhci_hcd               31008  0

[root@rtj-opt6 ~]# ps -fe | grep irq
root         5     2  0 Jun04 ?        00:00:00 [sirq-high/0]
root         6     2  0 Jun04 ?        00:10:58 [sirq-timer/0]
root         7     2  0 Jun04 ?        00:00:00 [sirq-net-tx/0]
root         8     2  0 Jun04 ?        00:00:00 [sirq-net-rx/0]
root         9     2  0 Jun04 ?        00:00:00 [sirq-block/0]
root        10     2  0 Jun04 ?        00:00:00 [sirq-tasklet/0]
root        11     2  0 Jun04 ?        00:00:11 [sirq-sched/0]
root        12     2  0 Jun04 ?        00:00:00 [sirq-hrtimer/0]
root        13     2  0 Jun04 ?        00:02:31 [sirq-rcu/0]
root        18     2  0 Jun04 ?        00:00:00 [sirq-high/1]
root        19     2  0 Jun04 ?        00:27:12 [sirq-timer/1]
root        20     2  0 Jun04 ?        00:00:00 [sirq-net-tx/1]
root        21     2  0 Jun04 ?        00:00:00 [sirq-net-rx/1]
root        22     2  0 Jun04 ?        00:00:00 [sirq-block/1]
root        23     2  0 Jun04 ?        00:00:00 [sirq-tasklet/1]
root        24     2  0 Jun04 ?        00:01:08 [sirq-sched/1]
root        25     2  0 Jun04 ?        00:00:00 [sirq-hrtimer/1]
root        26     2  0 Jun04 ?        00:02:43 [sirq-rcu/1]
root        31     2  0 Jun04 ?        00:00:00 [sirq-high/2]
root        32     2  0 Jun04 ?        00:11:09 [sirq-timer/2]
root        33     2  0 Jun04 ?        00:00:00 [sirq-net-tx/2]
root        34     2  0 Jun04 ?        00:00:04 [sirq-net-rx/2]
root        35     2  0 Jun04 ?        00:00:22 [sirq-block/2]
root        36     2  0 Jun04 ?        00:00:00 [sirq-tasklet/2]
root        37     2  0 Jun04 ?        00:00:14 [sirq-sched/2]
root        38     2  0 Jun04 ?        00:00:03 [sirq-hrtimer/2]
root        39     2  0 Jun04 ?        00:02:51 [sirq-rcu/2]
root        44     2  0 Jun04 ?        00:00:00 [sirq-high/3]
root        45     2  0 Jun04 ?        00:10:57 [sirq-timer/3]
root        46     2  0 Jun04 ?        00:00:00 [sirq-net-tx/3]
root        47     2  0 Jun04 ?        00:00:19 [sirq-net-rx/3]
root        48     2  0 Jun04 ?        00:01:52 [sirq-block/3]
root        49     2  0 Jun04 ?        00:00:00 [sirq-tasklet/3]
root        50     2  0 Jun04 ?        00:00:15 [sirq-sched/3]
root        51     2  0 Jun04 ?        00:00:03 [sirq-hrtimer/3]
root        52     2  0 Jun04 ?        00:02:47 [sirq-rcu/3]
root     21645 21611  0 18:07 pts/1    00:00:00 grep irq
Comment 2 IBM Bug Proxy 2008-06-13 13:32:23 EDT
------- Comment From dvhltc@us.ibm.com 2008-06-13 13:30 EDT-------
So they both use the same network card driver. The e326 appears to have more
recent tg3 firmware, but we need to confirm like chipsets for that to be
relevant.  Can you run the following on your systems and compare that to ours:

x3455
[root@rt-ash ~]# lspci | grep -i ethernet
02:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
02:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)

e326m
[root@rt-birch ~]# lspci | grep -i ethernet
04:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit
Ethernet (rev 03)
04:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit
Ethernet (rev 03)

The chipsets are different on our platforms; I don't know whether that means the
firmware levels are not comparable.  I suspect it does.
Comment 3 IBM Bug Proxy 2008-06-13 13:40:43 EDT
------- Comment From mauery@us.ibm.com 2008-06-13 13:36 EDT-------
Just curious, but could this possibly be a duplicate of bug 33820?  It
mysteriously disappeared and we were unable to reproduce it, so we closed the
bug. Maybe this is it.
Comment 4 IBM Bug Proxy 2008-06-13 15:16:35 EDT
------- Comment From pstanton@uk.ibm.com 2008-06-13 15:11 EDT-------
(In reply to comment #15)
> Can you run the following on your systems and compare that to ours:

See comment 1. The chipsets and revision levels are the same as your systems.
Upgrading the firmware was the first thing I tried when I found this problem -
it had no effect.

------- Comment From pstanton@uk.ibm.com 2008-06-13 15:13 EDT-------
(In reply to comment #18)

> See comment 1.

Sorry - meant the original submission *before* comment 1.
Comment 5 IBM Bug Proxy 2008-06-18 09:48:24 EDT
------- Comment From sripathi@in.ibm.com 2008-06-18 09:40 EDT-------
Googling for the console messages posted in this bug results in a number of
hits. While the suggestion in some of the threads is to upgrade the tg3 driver,
in some others the recommendation is to do the following:

ethtool -K eth0 tso off

I am no expert in networking, so I don't know the effect of this command.
Comment 6 IBM Bug Proxy 2008-06-18 10:33:13 EDT
------- Comment From pstanton@uk.ibm.com 2008-06-18 10:26 EDT-------
(In reply to comment #26)

> in some others the recommendation is to do the following:
>
> ethtool -K eth0 tso off

It seems to default to being off on the x3455:

[root@rtj-opt22 ~]# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

and defaults to being on on the e326m:

[root@rtj-opt6 ~]# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
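
The difference between the two machines is easier to spot when the `ethtool -k` output is turned into a mapping. The sketch below is illustrative only: `parse_offloads` is a hypothetical helper, and it assumes the older "name: on/off" output format shown in this comment (newer ethtool releases print different feature names and extra annotations, which this parser would skip).

```python
def parse_offloads(text):
    """Parse `ethtool -k` output into {feature name: True if on, False if off}."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The "Offload parameters for eth0:" header has no on/off value.
        if sep and value in ("on", "off"):
            features[name.strip()] = (value == "on")
    return features


def differing_offloads(text_a, text_b):
    """Return the feature names whose setting differs between two hosts."""
    a, b = parse_offloads(text_a), parse_offloads(text_b)
    return sorted(name for name in a if name in b and a[name] != b[name])
```

Fed the two listings above, `differing_offloads` returns only `tcp segmentation offload`, which matches the observation that TSO is off on the x3455 and on on the e326m.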
Comment 7 IBM Bug Proxy 2008-06-18 10:40:38 EDT
------- Comment From sripathi@in.ibm.com 2008-06-18 10:37 EDT-------
Taking a quick, naive look at tg3.c, I see this definition that defines the
timeout after which it is assumed that the hardware has malfunctioned and it
needs to be reset:
#define TG3_TX_TIMEOUT                  (5 * HZ)

I wonder whether we need to increase this.
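
For reference, HZ is the number of kernel timer ticks (jiffies) per second, so `5 * HZ` works out to five seconds whatever HZ the kernel was built with. A small Python illustration of that arithmetic follows. The function names and the `seconds` parameter are hypothetical and only model the macro; they are not driver code.

```python
def tg3_tx_timeout_jiffies(hz, seconds=5):
    """Model of `#define TG3_TX_TIMEOUT (5 * HZ)`; `seconds` is a hypothetical knob."""
    return seconds * hz


def jiffies_to_seconds(jiffies, hz):
    """Convert a jiffies count to seconds for a kernel ticking at `hz` Hz."""
    return jiffies / hz


for hz in (100, 250, 1000):
    timeout = tg3_tx_timeout_jiffies(hz)
    print(f"HZ={hz}: {timeout} jiffies = {jiffies_to_seconds(timeout, hz)} s")
```

Raising the multiplier would only lengthen the wait before the reset is attempted. It would not by itself explain why the registers read back as ffffffff.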
Comment 8 IBM Bug Proxy 2008-06-23 02:16:31 EDT
------- Comment From sripathi@in.ibm.com 2008-06-23 02:14 EDT-------
The original tg3 hack by Ted cannot be easily ported to current kernels.
However, Keith's patch still applies with some complaints. I am building a
kernel with this patch. Keith had reported some kernel panics with this patch in
bug 33820, so I am not sure how good my mileage will be.
Comment 9 IBM Bug Proxy 2008-06-23 07:16:56 EDT
------- Comment From sripathi@in.ibm.com 2008-06-23 07:11 EDT-------
I compiled a kernel with Keith's patch. It booted up fine, but the system does
not seem to come on the network. I logged in through the console and did
"service network restart". I see no errors. However, the system doesn't respond
through network. I also see the following message in dmesg:

ADDRCONF(NETDEV_UP): eth0: link is not ready

ethtool also says link is not detected. I am not sure what is causing this.
Comment 10 IBM Bug Proxy 2008-06-23 08:40:45 EDT
------- Comment From sripathi@in.ibm.com 2008-06-23 08:33 EDT-------
The message I mentioned in the last comment appears even on working kernels,
where the link comes up later. With the patch, however, the link never comes
up. I have verified that the thread that pings the card is indeed running.
Comment 11 IBM Bug Proxy 2008-06-24 04:16:31 EDT
------- Comment From sripathi@in.ibm.com 2008-06-24 04:12 EDT-------
I can recreate this problem easily. Woohoo! However, I don't see most of the
messages posted in the opening comment of this bug.

The difference is that I had been trying this on e326m all these days, but today
I tried it on an x3455 (llm55.in.ibm.com in Bangalore lab). I ran the sched_football
testcase with 4 threads for a few seconds to recreate this problem
(sched_football -n 4 -l 20). The system goes off the network. Console can be used
after the test finishes. I restarted the network (which involves ifdown and
ifup), but that did not fix the problem. I needed to remove and re-insert tg3
module to get over the problem (or reboot the system, of course).

dmesg at various stages:
-----------------------
When the problem happened:
Clocksource tsc unstable (delta = 28122181979 ns)

Nothing about tg3

When I did "service network stop":
tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear
MAC_TX_MODE=ffffffff
tg3: eth0: No firmware running.
tg3: eth0: Link is down.

When I did "service network start":
ADDRCONF(NETDEV_UP): eth0: link is not ready

The machine did not come up on the network at this stage.

When I did "modprobe -r tg3":
ACPI: PCI interrupt for device 0000:02:01.1 disabled
tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear
MAC_TX_MODE=ffffffff
ACPI: PCI interrupt for device 0000:02:01.0 disabled

When I did "modprobe tg3":
tg3.c:v3.86 (November 9, 2007)
PCI: Enabling device 0000:02:01.0 (0000 -> 0002)
ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 18 (level, low) -> IRQ 18
eth0: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:100MHz:64-bit)
10/100/1000Base-T Ethernet 00:14:5e:55:ce:55
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
ACPI: PCI Interrupt 0000:02:01.1[B] -> GSI 17 (level, low) -> IRQ 17
eth1: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:100MHz:64-bit)
10/100/1000Base-T Ethernet 00:14:5e:55:ce:56
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
ADDRCONF(NETDEV_UP): eth0: link is not ready
ADDRCONF(NETDEV_UP): eth1: link is not ready
tg3: eth0: Link is up at 100 Mbps, full duplex.
tg3: eth0: Flow control is off for TX and off for RX.
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present

and the network was up again.

I will dig into this a bit more. The machine doesn't have a serial console set
up, so it is a bit inconvenient. All x3455 machines on ABAT are taken.
Comment 12 IBM Bug Proxy 2008-06-24 04:48:27 EDT
------- Comment From pstanton@uk.ibm.com 2008-06-24 04:43 EDT-------
(In reply to comment #32)
> I can recreate this problem easily. Woohoo! However, I don't see most of the
> messages posted in the opening comment of this bug.

Is your x3455 accessing the network at all? Ours was reading/writing files via
NFS, and before the tg3 messages I've posted we got "NFS server not responding"
messages.
Comment 13 IBM Bug Proxy 2008-06-24 05:24:37 EDT
------- Comment From sripathi@in.ibm.com 2008-06-24 05:17 EDT-------
Okay, I think I know the root cause and solution for this problem now!

I looked at the softirq threads on llm55.in and realized that they are running
as SCHED_OTHER threads, not SCHED_FIFO. The cause seems to be an old
installation on the system, done when the softirq threads were named
"softirq-xxxxx". RH has since changed these names to "sirq-xxxxx" in the
kernel. We updated /etc/set_kthread_prio.conf to reflect this change. However,
we never updated llm55.in thoroughly; we just upgraded the kernel on it. Hence
we were left with softirqs running as SCHED_OTHER threads. I have now updated
the system properly, which fixed the softirq priorities. After that, I cannot
recreate the problem with sched_football.

I checked this on JTC's rtj-opt6 and rtj-opt22 systems. They both have the same
problem. I am pretty sure updating these systems will fix the problem. Peter,
can you please try this out and confirm?
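
To check a machine for this condition, something like the following sketch could be used. It assumes Python 3 on Linux and is not the tooling actually used here. It walks /proc, picks threads whose names start with "sirq-" or "softirq-", and lists those that are not SCHED_FIFO. The function names are made up for illustration.

```python
import os

# Linux scheduling policy numbers as defined in <sched.h>.
POLICY_NAMES = {0: "SCHED_OTHER", 1: "SCHED_FIFO", 2: "SCHED_RR"}
SCHED_FIFO = 1

def parse_comm(stat_line):
    """Extract the thread name from a /proc/<pid>/stat line: 'pid (name) state ...'."""
    return stat_line[stat_line.index("(") + 1 : stat_line.rindex(")")]

def is_softirq_thread(comm):
    """Match both the old 'softirq-' and the newer 'sirq-' name prefixes."""
    return comm.startswith(("sirq-", "softirq-"))

def misconfigured(threads):
    """From (pid, comm, policy) tuples, keep softirq threads that are not SCHED_FIFO."""
    return [
        (pid, comm, POLICY_NAMES.get(policy, str(policy)))
        for pid, comm, policy in threads
        if is_softirq_thread(comm) and policy != SCHED_FIFO
    ]

def scan_proc():
    """Collect (pid, comm, policy) for every process listed in /proc."""
    reset_flag = getattr(os, "SCHED_RESET_ON_FORK", 0)
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            with open("/proc/%d/stat" % pid) as f:
                comm = parse_comm(f.read())
            # Strip the reset-on-fork flag so only the policy number remains.
            policy = os.sched_getscheduler(pid) & ~reset_flag
        except OSError:
            continue  # the process exited while we were scanning
        found.append((pid, comm, policy))
    return found

def report():
    """Print every softirq thread that is not running under SCHED_FIFO."""
    for pid, comm, policy in misconfigured(scan_proc()):
        print("%5d %-20s %s" % (pid, comm, policy))
```

Calling report() on an affected system should list the softirq threads still running as SCHED_OTHER. An empty listing suggests the priorities were applied. "chrt -p <pid>" shows the same information for a single thread.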
Comment 14 IBM Bug Proxy 2008-06-24 10:48:36 EDT
------- Comment From pstanton@uk.ibm.com 2008-06-24 10:40 EDT-------
rtj-opt6 and rtj-opt22 updated and running tests.
Comment 15 IBM Bug Proxy 2008-06-25 04:40:41 EDT
------- Comment From pstanton@uk.ibm.com 2008-06-25 04:32 EDT-------
Both machines rebooted within minutes of each other early this morning whilst
running Realtime stress tests. rtj-opt6 has a crash dump in
/var/crash/2008-06-25-06:43
Comment 16 IBM Bug Proxy 2008-06-25 04:56:38 EDT
------- Comment From sripathi@in.ibm.com 2008-06-25 04:51 EDT-------
(In reply to comment #37)
> Both machines rebooted within minutes of each other early this morning whilst
> running Realtime stress tests. rtj-opt6 has a crash dump in
> /var/crash/2008-06-25-06:43

I am trying to look at this dump, but I am having some difficulty with the
'crash' utility on the system. It will need to be upgraded.
Comment 17 IBM Bug Proxy 2008-06-25 06:08:29 EDT
------- Comment From sripathi@in.ibm.com 2008-06-25 06:06 EDT-------
I upgraded 'crash' on the system. However, I am still unable to open the dump
under /var/crash/2008-06-25-06:43 using crash. Through some heroics with gdb, I
got the dmesg buffer out of it. It appears to be a new problem, related to
read-write locks:

<1>Unable to handle kernel NULL pointer dereference at 0000000000000006 RIP:
<1> [<ffffffff8113eeda>] plist_del+0x26/0x70
<4>PGD 158cc1067 PUD 158cc0067 PMD 158c7c067 PTE 0
<0>Oops: 0002 [1] PREEMPT SMP
<4>CPU 2
<4>Modules linked in: autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth
sunrpc ipv6 dm_multipath video output sbs sbshc battery ac parport_pc lp parport
sr_mod cdrom sg k8_edac edac_core tg3 button pata_serverworks k8temp hwmon
pata_acpi serio_raw ata_generic shpchp pcspkr dm_snapshot dm_zero dm_mirror
dm_mod sata_svw libata mptspi mptscsih scsi_transport_spi mptbase sd_mod
scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
<4>Pid: 5276, comm: java Not tainted 2.6.24.7-65ibmrt2.4 #1
<4>RIP: 0010:[<ffffffff8113eeda>]  [<ffffffff8113eeda>] plist_del+0x26/0x70
<4>RSP: 0018:ffff81009f91bd98  EFLAGS: 00210086
<4>RAX: 0000000000000006 RBX: ffff81015c01a9d0 RCX: ffff81009eebbe50
<4>RDX: ffff81009eebbe58 RSI: ffff81009eeb4080 RDI: ffff810158c25be0
<4>RBP: ffff81009f91bd98 R08: ffff810158c25be8 R09: 00000000bbdf380b
<4>R10: ffff810152c575e0 R11: 000000039f91bbc8 R12: ffff81015c01a9d0
<4>R13: ffff810158c25bb8 R14: 0000000000000000 R15: ffff81015c01a9d0
<4>FS:  00002b932b3ec880(0000) GS:ffff81015faaa7c0(0063) knlGS:00000000c907eb90
<4>CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>CR2: 0000000000000006 CR3: 0000000158cab000 CR4: 00000000000006e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process java (pid: 5276, threadinfo ffff81009f91a000, task ffff81009f914040)
<4>Stack:  ffff81009f91bdd8 ffffffff8105dd9b ffff81009f91be58 ffff81015c01a9d0
<4> 0000000000200202 0000000000000000 000000008000149c ffff81015c01a9d0
<4> ffff81009f91bdf8 ffffffff81288357 ffff81015c01a9c0 ffff81015fb63cd8
<4>Call Trace:
<4> [<ffffffff8105dd9b>] wakeup_next_waiter+0x65/0x1b2
<4> [<ffffffff81288357>] rt_mutex_slowunlock+0x3b/0x59
<4> [<ffffffff81288136>] rt_mutex_unlock+0x28/0x2a
<4> [<ffffffff8105ccea>] do_futex+0x9d5/0xb42
<4> [<ffffffff8105f398>] ? rt_mutex_up_read+0x22d/0x232
<4> [<ffffffff8128ba32>] ? do_page_fault+0x3f6/0x76d
<4> [<ffffffff810317c7>] ? post_schedule_rt+0x31/0x35
<4> [<ffffffff810368a2>] ? finish_task_switch+0x4c/0xdc
<4> [<ffffffff8105d3e9>] compat_sys_futex+0xed/0x10b
<4> [<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
<4> [<ffffffff81027a94>] cstar_do_call+0x1b/0x65
<4>
<4>
<0>Code: 5f c9 c3 90 90 4c 8d 47 08 4c 39 47 08 55 48 89 e5 74 45 48 8b 4f 18 48
83 e9 18 48 8d 51 08 48 8b 71 08 48 8b 42 08 48 89 46 08 <48> 89 30 49 8b 40 08
4c 89 41 08 49 89 50 08 48 89 10 48 89 42
<1>RIP  [<ffffffff8113eeda>] plist_del+0x26/0x70
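
As a reading aid for the trace above: the number after "Oops:" is the x86 page-fault error code in hex. The sketch below decodes it using the architectural bit layout. The function name is made up for illustration. For "0002" it reports a kernel-mode write to a page that is not present, which agrees with CR2 = 0000000000000006 and "PTE 0" in the dump.

```python
# Bit layout of the x86 page-fault error code printed after "Oops:".
# Each entry: (mask, meaning when set, meaning when clear or None).
FLAGS = (
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
)

def decode_oops_code(text):
    """Turn the hex error code from an 'Oops: NNNN' line into readable labels."""
    code = int(text, 16)
    labels = []
    for mask, if_set, if_clear in FLAGS:
        label = if_set if code & mask else if_clear
        if label is not None:
            labels.append(label)
    return labels

print(decode_oops_code("0002"))
# ['page not present', 'write access', 'kernel mode']
```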
Comment 18 IBM Bug Proxy 2008-06-25 06:16:49 EDT
------- Comment From sripathi@in.ibm.com 2008-06-25 06:09 EDT-------
I noticed another dump on the system, taken 8 hours earlier, again with the
2.6.24.7-65ibmrt2.4 kernel, under the /var/crash/2008-06-24-20:42 directory.
The backtrace in it also shows a problem with read-write locks. These could be
new problems related to the rwlocks-multi patches.

Unable to handle kernel paging request at 0000000000002625 RIP:
[<ffffffff8113efa3>] plist_add+0x7f/0xa6
PGD 14f899067 PUD 14f8e7067 PMD 13edfa067 PTE 0
Oops: 0002 [1] PREEMPT SMP
CPU 2
Modules linked in: autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc
ipv6 dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg
sr_mod cdrom tg3 k8_edac pata_serverworks shpchp edac_core pata_acpi button
k8temp hwmon serio_raw ata_generic pcspkr dm_snapshot dm_zero dm_mirror dm_mod
sata_svw libata mptspi mptscsih scsi_transport_spi mptbase sd_mod scsi_mod ext3
jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 9640, comm: java Not tainted 2.6.24.7-65ibmrt2.4 #1
RIP: 0010:[<ffffffff8113efa3>]  [<ffffffff8113efa3>] plist_add+0x7f/0xa6
RSP: 0018:ffff8100bbd97e18  EFLAGS: 00010083
RAX: ffff81015e0c75a0 RBX: ffff81009e52dba0 RCX: 0000000000002625
RDX: ffff81009e52dba8 RSI: ffff81015e0c7598 RDI: ffff81009e52dba0
RBP: ffff8100bbd97e28 R08: ffff81009acc1ba8 R09: 0000000000000000
R10: ffff8100bdfc24d8 R11: ffff8100bbd97db8 R12: ffff8100bbd92820
R13: ffff81009acc1b78 R14: ffff81009e52db78 R15: ffff8100bbd92818
FS:  00002ae7ce277480(0000) GS:ffff81015faaa7c0(0063) knlGS:00000000c4638b90
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000002625 CR3: 000000014f89a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 9640, threadinfo ffff8100bbd96000, task ffff8100bbd92a60)
Stack:  ffff8100bbd92040 ffff81015e0c75a8 ffff8100bbd97e68 ffffffff8105ded4
ffff8100bbd97e48 ffff81009acc1b78 ffff81015e0c75a8 0000000000000001
ffff81015e0c75a0 ffff81015e0c75c0 ffff8100bbd97ec8 ffffffff81288995
Call Trace:
[<ffffffff8105ded4>] wakeup_next_waiter+0x19e/0x1b2
[<ffffffff81288995>] rt_write_slowunlock+0x82/0x1c6
[<ffffffff8105f166>] rt_mutex_up_write+0x69/0x6e
[<ffffffff8105fcb7>] rt_up_write+0x9/0xb
[<ffffffff8109a243>] sys_mprotect+0x210/0x22d
[<ffffffff8100f895>] ? syscall_trace_enter+0xb7/0xbb
[<ffffffff81028244>] sys32_mprotect+0x9/0xb
[<ffffffff81027a94>] cstar_do_call+0x1b/0x65

Code: 89 c6 eb 2e 48 89 c6 48 8b 56 08 48 8d 46 08 4c 39 e0 0f 18 0a 75 dc 48 8d
46 08 48 8d 53 08 48 8b 48 08
48 89 43 08 48 89 50 08 <48> 89 11 48 89 4a 08 48 8d 46 18 48 8d 53 18 48 8b 48
08 48 89
RIP  [<ffffffff8113efa3>] plist_add+0x7f/0xa6
RSP <ffff8100bbd97e18>

------- Comment From sripathi@in.ibm.com 2008-06-25 06:11 EDT-------
Both backtraces above seem to show a different problem from the one reported in
this bug, namely the ethernet connection getting dropped. Hence I am going to
open a new bug to cover these problems.

Peter, I still think the ethernet problem is now fixed. Please let us know if
you see that problem again.
Comment 19 Luis Claudio R. Goncalves 2008-07-02 11:17:55 EDT
Have you tried this test with 2.6.24.7-69.el5rt? This kernel has several rwlock
fixes that were written to address a few issues with similar backtraces.
Comment 20 IBM Bug Proxy 2008-07-22 01:50:40 EDT
------- Comment From sripathi@in.ibm.com 2008-07-22 01:40 EDT-------
To RH:

The original problem reported by this bug, that of loss of ethernet
connectivity, has now been resolved. We are using bug RH452974 to track
crashes/warning messages related to rwlocks.
Comment 21 Clark Williams 2008-09-15 23:17:27 EDT
closing
