473846 – [5.3] Network Not Working in the Second Kernel


Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Red Hat Kernel Manager
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	5.4, TechnicalNotes

Reported:	2008-12-01 02:44 UTC by Qian Cai
Modified:	2009-09-09 05:05 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Some <filename>forcedeth</filename> based devices may encounter difficulty accessing memory above 4GB during operation in a <filename>kdump</filename> kernel. To work around this issue, add the following line to the <filename>/etc/sysconfig/kdump</filename> file: <screen> KDUMP_COMMANDLINE_APPEND="dma_64bit=0" </screen> This workaround prevents the forcedeth network driver from using high memory resources in the kdump kernel, allowing the network to function properly.
Clone Of:
Environment:
Last Closed:	2009-07-23 15:37:29 UTC
Target Upstream Version:
Embargoed:
Dependent Products:
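
The workaround in the Doc Text field could be applied with something like the sketch below. The helper name add_kdump_workaround is hypothetical, and the file path is passed as an argument so the change can be tried on a copy first. If the file already sets KDUMP_COMMANDLINE_APPEND, a second assignment would override the first, so the values would need merging by hand; this sketch does not attempt that.

```shell
#!/bin/sh
# Sketch only: add_kdump_workaround is a hypothetical helper, not part of kexec-tools.
# Appends the workaround line from the Doc Text to the given file, once.
add_kdump_workaround() {
    conf=$1
    line='KDUMP_COMMANDLINE_APPEND="dma_64bit=0"'
    # Skip the append if the exact line is already present.
    if grep -qxF "$line" "$conf" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$line" >> "$conf"
}

# Example (assumed usage, run as root on the affected machine):
#   add_kdump_workaround /etc/sysconfig/kdump
```

The kdump service presumably has to be restarted afterwards so that the new command line is used when the kdump kernel is loaded.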

Attachments
dmidecode for hp-xw4550-01 (07/20/2007) (17.02 KB, text/plain) 2008-12-01 08:00 UTC, Qian Cai	no flags
dmidecode for hp-xw9300-01 (11/28/2006) (20.49 KB, text/plain) 2008-12-01 08:01 UTC, Qian Cai	no flags
Ifconfig from Normal Kernel on hp-xw9300 (907 bytes, text/plain) 2008-12-09 12:45 UTC, Qian Cai	no flags
Ethtool from Normal Kernel on hp-xw9300 (2.50 KB, text/plain) 2008-12-09 12:46 UTC, Qian Cai	no flags
Ifconfig from Kdump Kernel on hp-xw9300 (379 bytes, text/plain) 2008-12-09 12:46 UTC, Qian Cai	no flags
Ethtool from Kdump Kernel on hp-xw9300 (2.51 KB, text/plain) 2008-12-09 12:47 UTC, Qian Cai	no flags
Tcpdump from Kdump Kernel on hp-xw9300 (1.07 KB, text/plain) 2008-12-09 12:47 UTC, Qian Cai	no flags
patch to map all reserved region of ram into kdump kernel (2.23 KB, patch) 2008-12-17 16:25 UTC, Neil Horman	no flags

Description Qian Cai 2008-12-01 02:44:33 UTC

Description of problem:
On two HP XW machines (hp-xw9300-01.rhts.bos.redhat.com and hp-xw4550-01.rhts.bos.redhat.com), the second (kdump) kernel sometimes could not receive any incoming packets. As a result, it was impossible to save the VMCore to any remote host.

The server was configured to obtain an IP address via DHCP.

eth0 Link Up.  Waiting 60 Seconds
+sleep 60
+echo Continuing
Continuing
+[ 0000:00:0a.0 == Bonding ]
+[ 0000:00:0a.0 == Vlan ]
+exit 0
+shift 1
+/bin/msh -c udhcpc -n -p /var/run/udhcpc.eth0.pid -i eth0
udhcpc (v1.2.0) started
udhcpc[1292]: udhcpc (v1.2.0) started
+[ -z deconfig ]
+/sbin/ifconfig eth0 0.0.0.0
+exit 0
Sending discover...
udhcpc[1292]: Sending discover...
Sending discover...
udhcpc[1292]: Sending discover...
Sending discover...
udhcpc[1292]: Sending discover...
+[ -z leasefail ]
+exit 0
No lease, failing.
udhcpc[1292]: No lease, failing.
root:/>

As the log above shows, all DHCP requests failed.

Interestingly, when I configured a static IP address on this server and
ran tcpdump on another host B in the same subnet, pinging B from the
server produced ARP requests and replies in the tcpdump output, but
neither DHCP nor ICMP replies. It looked like something was broken in the IP stack.

06:07:19.791705 IP (tos 0x0, ttl  64, id 0, offset 0, flags [none],
proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps:
BOOTP/DHCP, Request from 00:17:08:2a:08:34 (oui Unknown), length: 548,
xid:0x8c59213c, flags: [none] (0x0000)
                  Client Ethernet Address: 00:17:08:2a:08:34 (oui
                  Unknown) [|bootp]
06:07:22.797101 IP (tos 0x0, ttl  64, id 0, offset 0, flags [none],
proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps:
BOOTP/DHCP, Request from 00:17:08:2a:08:34 (oui Unknown), length: 548,
xid:0x8c59213c, flags: [none] (0x0000)
                  Client Ethernet Address: 00:17:08:2a:08:34 (oui
                  Unknown) [|bootp]

"00:17:08:2a:08:34" was the affected machine's MAC address.

root:/> ifconfig eth0 10.16.64.84 netmask 255.255.248.0 broadcast 10.16.71.255

root:/> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:17:08:2A:08:34
          inet addr:10.16.64.84  Bcast:10.16.71.255
          Mask:255.255.248.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:1 dropped:0 overruns:0 frame:1
          Interrupt:225 Base address:0xe000

root:/> ping -c 3 10.16.64.121
PING 10.16.64.121 (10.16.64.121): 56 data bytes

--- 10.16.64.121 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

The server could not even ping itself.
root:/> ping -c 3 10.16.64.84
PING 10.16.64.84 (10.16.64.84): 56 data bytes

--- 10.16.64.84 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

I doubt this was caused by a transient network problem, because as far as I am aware there was no such problem in the normal kernel.

Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5
kernel-2.6.18-124.el5
kexec-tools-1.102pre-51.el5

How reproducible:
Usually 50%.

Steps to Reproduce:
1. configure kdump with crashkernel=128M@16M.
2. use the following kdump.conf

net server@nfs
default shell

3. echo c >/proc/sysrq-trigger
  
Actual results:
Kdump kernel failed to get an IP address via DHCP.

Expected results:
Kdump kernel got an IP address via DHCP and saved the VMCore to the remote host.
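
Before triggering the crash in step 3, it may be worth confirming that the crashkernel= reservation from step 1 actually reached the running kernel. The sketch below extracts it from a kernel command line string; crashkernel_arg is a hypothetical helper name, and reading /proc/cmdline in the usage comment is an assumption about how it would be called.

```shell
#!/bin/sh
# Sketch only: crashkernel_arg is a hypothetical helper.
# Prints the value of crashkernel= found in the given kernel command line.
# Returns non-zero if the parameter is absent.
crashkernel_arg() {
    for word in $1; do
        case $word in
            crashkernel=*)
                printf '%s\n' "${word#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example (assumed usage on the machine under test):
#   crashkernel_arg "$(cat /proc/cmdline)"
```

With the configuration from step 1, the function would be expected to print 128M@16M.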

Comment 1 Qian Cai 2008-12-01 04:17:51 UTC

Some network driver information:

hp-xw9300-01:
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
	Subsystem: Hewlett-Packard Company Unknown device 1500
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0 (250ns min, 5000ns max)
	Interrupt: pin A routed to IRQ 193
	Region 0: Memory at f2104000 (32-bit, non-prefetchable) [size=4K]
	Region 1: I/O ports at 28f0 [size=8]
	Capabilities: [44] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 PME-Enable+ DSel=0 DScale=0 PME-

hp-xw4550-01:
3f:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5755 Gigabit Ethernet PCI Express (rev 02)
	Subsystem: Hewlett-Packard Company Unknown device 12ff
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 225
	Region 0: Memory at d8800000 (64-bit, non-prefetchable) [size=64K]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Vendor Specific Information
	Capabilities: [e8] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
		Address: 00000000fee01000  Data: 40e1
	Capabilities: [d0] Express Endpoint IRQ 0
		Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+
		Device: Latency L0s <4us, L1 unlimited
		Device: AtnBtn- AtnInd- PwrInd-
		Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
		Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
		Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes
		Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
		Link: Latency L0s <4us, L1 <64us
		Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch-
		Link: Speed 2.5Gb/s, Width x1
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [13c] Virtual Channel
	Capabilities: [160] Device Serial Number 3b-88-0c-fe-ff-4b-1a-00
	Capabilities: [16c] Power Budgeting

Comment 2 Qian Cai 2008-12-01 06:08:27 UTC

On hp-xw4550-01, it looks like the tg3 driver does not function at all.

We started a tcpdump capture with the following command, then triggered the crash:
# tcpdump -envvv 'ether host 00:1A:4B:0C:88:3B'
# echo c >/proc/sysrq-trigger

...
00:1a:4b:0c:88:3b0: Tigon3 [partno(BCM95755) rev a002 PHY(5755)] (PCI Express) 10/100/1000Base-T Ethernet pshot.ko module
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]

Loading libphy.eth0: dma_rwctrl[76180000] dma_mask[64-bit]
ko module
Loading tg3.ko module
...
udhcpc (v1.2.0) started
udhcpc[1042]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1042]: Sending discover...
Sending discover...
udhcpc[1042]: Sending discover...
Sending discover...
udhcpc[1042]: Sending discover...
No lease, failing.
udhcpc[1042]: No lease, failing.
eth0 failed to come up
Dropping to shell. exit to reboot
root:/> ifup eth0 
udhcpc (v1.2.0) started
udhcpc[1071]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1071]: Sending discover...
Sending discover...
udhcpc[1071]: Sending discover...
Sending discover...
udhcpc[1071]: Sending discover...
No lease, failing.
udhcpc[1071]: No lease, failing.
root:/>
<Tcpdump did not output anything at this point.>

root:/> mii-tool -v
eth0: negotiated 100baseTx-FD, link ok
  product info: vendor 00:50:ef, model 12 rev 0
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD

Manually set up the IP address from here.
root:/> ifconfig eth0 10.16.65.42 netmask 255.255.248.0 broadcast 10.16.71.255
root:/> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1A:4B:0C:88:3B  
          inet addr:10.16.65.42  Bcast:10.16.71.255  Mask:255.255.248.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:5 

root:/> route add default gw 10.16.71.254
root:/> route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.16.64.0      *               255.255.248.0   U     0      0        0 eth0
default         10.16.71.254    0.0.0.0         UG    0      0        0 eth0
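
As a sanity check on the static configuration, the network and broadcast addresses follow from the address and netmask by bitwise AND and OR. The sketch below redoes that arithmetic for the values used above; net_and_bcast is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: net_and_bcast is a hypothetical helper.
# Prints "<network> <broadcast>" for a dotted-quad IPv4 address and netmask.
net_and_bcast() {
    ip=$1
    mask=$2
    old_ifs=$IFS
    IFS=.
    # Split both dotted quads: $1..$4 hold the address, $5..$8 the netmask.
    set -- $ip $mask
    IFS=$old_ifs
    # network = address AND mask; broadcast = address OR inverted mask.
    printf '%d.%d.%d.%d %d.%d.%d.%d\n' \
        $(( $1 & $5 )) $(( $2 & $6 )) $(( $3 & $7 )) $(( $4 & $8 )) \
        $(( $1 | (255 - $5) )) $(( $2 | (255 - $6) )) \
        $(( $3 | (255 - $7) )) $(( $4 | (255 - $8) ))
}

net_and_bcast 10.16.65.42 255.255.248.0
# prints: 10.16.64.0 10.16.71.255
```

This matches the 10.16.64.0 route and the 10.16.71.255 broadcast address shown above, so the manual configuration itself looks consistent.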

Ping ourselves.
root:/> ping -c 1 10.16.65.42
PING 10.16.65.42 (10.16.65.42): 56 data bytes

--- 10.16.65.42 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
<Tcpdump did not output anything at this point.>

Ping the gateway.
root:/> ping -c 1 10.16.71.254
PING 10.16.71.254 (10.16.71.254): 56 data bytes

--- 10.16.71.254 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
<Tcpdump did not output anything at this point.>

Ethtool did not show any packets.
root:/> ethtool -S eth0
NIC statistics:
     rx_octets: 0
     rx_fragments: 0
     rx_ucast_packets: 0
     rx_mcast_packets: 0
     rx_bcast_packets: 0
     rx_fcs_errors: 0
     rx_align_errors: 0
     rx_xon_pause_rcvd: 0
     rx_xoff_pause_rcvd: 0
     rx_mac_ctrl_rcvd: 0
     rx_xoff_entered: 0
     rx_frame_too_long_errors: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_in_length_errors: 0
     rx_out_length_errors: 0
     rx_64_or_less_octet_packets: 0
     rx_65_to_127_octet_packets: 0
     rx_128_to_255_octet_packets: 0
     rx_256_to_511_octet_packets: 0
     rx_512_to_1023_octet_packets: 0
     rx_1024_to_1522_octet_packets: 0
     rx_1523_to_2047_octet_packets: 0
     rx_2048_to_4095_octet_packets: 0
     rx_4096_to_8191_octet_packets: 0
     rx_8192_to_9022_octet_packets: 0
     tx_octets: 0
     tx_collisions: 0
     tx_xon_sent: 0
     tx_xoff_sent: 0
     tx_flow_control: 0
     tx_mac_errors: 0
     tx_single_collisions: 0
     tx_mult_collisions: 0
     tx_deferred: 0
     tx_excessive_collisions: 0
     tx_late_collisions: 0
     tx_collide_2times: 0
     tx_collide_3times: 0
     tx_collide_4times: 0
     tx_collide_5times: 0
     tx_collide_6times: 0
     tx_collide_7times: 0
     tx_collide_8times: 0
     tx_collide_9times: 0
     tx_collide_10times: 0
     tx_collide_11times: 0
     tx_collide_12times: 0
     tx_collide_13times: 0
     tx_collide_14times: 0
     tx_collide_15times: 0
     tx_ucast_packets: 0
     tx_mcast_packets: 0
     tx_bcast_packets: 0
     tx_carrier_sense_errors: 0
     tx_discards: 0
     tx_errors: 0
     dma_writeq_full: 0
     dma_write_prioq_full: 0
     rxbds_empty: 0
     rx_discards: 0
     rx_errors: 0
     rx_threshold_hit: 0
     dma_readq_full: 0
     dma_read_prioq_full: 0
     tx_comp_queue_full: 0
     ring_set_send_prod_index: 0
     ring_status_update: 0
     nic_irqs: 0
     nic_avoided_irqs: 0
     nic_tx_threshold_hit: 0

Ifconfig also did not show any packets.
root:/> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1A:4B:0C:88:3B  
          inet addr:10.16.65.42  Bcast:10.16.71.255  Mask:255.255.248.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:5

root:/> cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 1 0 0 0 0 2 0 0 0 0 0
IcmpMsg: OutType3 OutType8
IcmpMsg: 1 2
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 0 0 0 0 0 0 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 0 0 0 0
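
The /proc/net/snmp output above pairs a header line with a value line for each protocol, which is hard to read by eye. A small sketch for pulling out a single counter by name follows; snmp_field is a hypothetical helper name. For the output above, it gives InReceives = 0 and OutRequests = 3, i.e. the IP layer sent datagrams but received none.

```shell
#!/bin/sh
# Sketch only: snmp_field is a hypothetical helper.
# $1 = protocol prefix (for example Ip or Icmp), $2 = counter name.
# Reads /proc/net/snmp-style text on stdin and prints the counter value.
snmp_field() {
    awk -v proto="$1:" -v field="$2" '
        # First matching line is the header: remember the column of the counter.
        $1 == proto && !seen {
            for (i = 2; i <= NF; i++) if ($i == field) col = i
            seen = 1
            next
        }
        # Second matching line carries the values.
        $1 == proto && seen {
            if (col) print $col
            exit
        }
    '
}

# Example (assumed usage):
#   snmp_field Ip InReceives < /proc/net/snmp
```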

root:/> ethtool -i eth0
driver: tg3
version: 3.93
firmware-version: 5755-v3.29
bus-info: 0000:3f:00.0

root:/> tc -d qdisc
qdisc pfifo_fast 0: dev eth0 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

root:/> ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		200
RX Mini:	0
RX Jumbo:	0
TX:		511

Comment 3 Qian Cai 2008-12-01 06:25:18 UTC

On hp-xw4550-01, tg3 driver worked fine with a normal kernel.

# modinfo tg3
filename:       /lib/modules/2.6.18-124.el5/kernel/drivers/net/tg3.ko
version:        3.93
license:        GPL
description:    Broadcom Tigon3 ethernet driver
author:         David S. Miller (davem) and Jeff Garzik (jgarzik)
srcversion:     9F10E7BFA7D69F890110EAC
alias:          pci:v0000106Bd00001645sv*sd*bc*sc*i*
alias:          pci:v0000173Bd000003EAsv*sd*bc*sc*i*
alias:          pci:v0000173Bd000003EBsv*sd*bc*sc*i*
alias:          pci:v0000173Bd000003E9sv*sd*bc*sc*i*
alias:          pci:v0000173Bd000003E8sv*sd*bc*sc*i*
alias:          pci:v00001148d00004500sv*sd*bc*sc*i*
alias:          pci:v00001148d00004400sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001699sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001680sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001681sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000165Bsv*sd*bc*sc*i*
alias:          pci:v000014E4d00001684sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001698sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001713sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001712sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016DDsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000166Bsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000166Asv*sd*bc*sc*i*
alias:          pci:v000014E4d00001679sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001678sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001669sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001668sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000167Fsv*sd*bc*sc*i*
alias:          pci:v000014E4d00001693sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000169Bsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000169Asv*sd*bc*sc*i*
alias:          pci:v000014E4d00001674sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001673sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000167Bsv*sd*bc*sc*i*
alias:          pci:v000014E4d00001672sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000167Asv*sd*bc*sc*i*
alias:          pci:v000014E4d000016FEsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016FDsv*sd*bc*sc*i*
alias:          pci:v000014E4d000016F7sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001601sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001600sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000167Esv*sd*bc*sc*i*
alias:          pci:v000014E4d0000167Dsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000167Csv*sd*bc*sc*i*
alias:          pci:v000014E4d00001677sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001676sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000165Asv*sd*bc*sc*i*
alias:          pci:v000014E4d00001659sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001658sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000166Esv*sd*bc*sc*i*
alias:          pci:v000014E4d00001649sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000170Esv*sd*bc*sc*i*
alias:          pci:v000014E4d0000170Dsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000169Dsv*sd*bc*sc*i*
alias:          pci:v000014E4d0000169Csv*sd*bc*sc*i*
alias:          pci:v000014E4d00001696sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016C7sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016C6sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016A8sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016A7sv*sd*bc*sc*i*
alias:          pci:v000014E4d000016A6sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000165Esv*sd*bc*sc*i*
alias:          pci:v000014E4d0000165Dsv*sd*bc*sc*i*
alias:          pci:v000014E4d00001654sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001653sv*sd*bc*sc*i*
alias:          pci:v000014E4d0000164Dsv*sd*bc*sc*i*
alias:          pci:v000014E4d00001648sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001647sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001646sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001645sv*sd*bc*sc*i*
alias:          pci:v000014E4d00001644sv*sd*bc*sc*i*
depends:        libphy
vermagic:       2.6.18-124.el5 SMP mod_unload gcc-4.1
parm:           tg3_debug:Tigon3 bitmapped debugging message enable value (int)
module_sig:	883f3504921ea653afb531d9ea12bf11216109e3d4aba9af922865f2833869db7fe3418b632263c09f415b131f372f4d93a6ff1a5b39f2937a08bbe6a

# ifconfig 
eth0      Link encap:Ethernet  HWaddr 00:1A:4B:0C:88:3B  
          inet addr:10.16.65.42  Bcast:10.16.71.255  Mask:255.255.248.0
          inet6 addr: fe80::21a:4bff:fe0c:883b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:39257 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1352 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4128450 (3.9 MiB)  TX bytes:413490 (403.7 KiB)
          Interrupt:193 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:56 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7048 (6.8 KiB)  TX bytes:7048 (6.8 KiB)

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.16.64.0      *               255.255.248.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 eth0
default         10.16.71.254    0.0.0.0         UG    0      0        0 eth0

# ifdown eth0

# ifup eth0

Determining IP information for eth0... done.

00:01:27.515218 00:1a:4b:0c:88:3b > Broadcast, ethertype IPv4 (0x0800), length 342: (tos 0x10, ttl  16, id 0, offset 0, flags [none], proto: UDP (17), length: 328) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:1a:4b:0c:88:3b, length: 300, xid:0x7649d206, flags: [none] (0x0000)
	  Client Ethernet Address: 00:1a:4b:0c:88:3b [|bootp]
00:01:27.515741 00:16:3e:4b:a5:4a > 00:1a:4b:0c:88:3b, ethertype IPv4 (0x0800), length 361: (tos 0x10, ttl  16, id 0, offset 0, flags [none], proto: UDP (17), length: 347) 10.16.64.14.bootps > 10.16.65.42.bootpc: BOOTP/DHCP, Reply, length: 319, xid:0x7649d206, flags: [none] (0x0000)
	  Your IP: 10.16.65.42
	  Server IP: 10.16.64.10
	  Client Ethernet Address: 00:1a:4b:0c:88:3b [|bootp]
00:01:27.685826 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 234: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 220) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 4/0/0 _services._dns-sd._udp.local. PTR[|domain]
00:01:27.709374 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 415: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 401) 10.16.65.42.mdns > 224.0.0.251.mdns:  0 [4q] [7n][|domain]
00:01:27.959654 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 415: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 401) 10.16.65.42.mdns > 224.0.0.251.mdns:  0 [4q] [7n][|domain]
00:01:28.209559 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 415: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 401) 10.16.65.42.mdns > 224.0.0.251.mdns:  0 [4q] [7n][|domain]
00:01:28.409556 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 194: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 180) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 2/0/0 SFTP File Transfer on hp-xw4550-01._sftp-[|domain]
00:01:28.409831 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 274: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 260) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 5/0/0 hp-xw4550-01 [00:1a:4b:0c:88:3b]._worksta[|domain]
00:01:28.705561 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 234: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 220) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 4/0/0 _services._dns-sd._udp.local. PTR[|domain]
00:01:29.428629 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 353: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 339) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 6/0/0 SFTP File Transfer on hp-xw4550-01._sftp-[|domain]
00:01:29.428746 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 110: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 96) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 1/0/0 42.65.16.10.in-addr.arpa. (Cache flush) PTR[|domain]
00:01:30.724522 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 407: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 393) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 9/0/0 _services._dns-sd._udp.local. PTR[|domain]
00:01:31.447461 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 353: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 339) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 6/0/0 SFTP File Transfer on hp-xw4550-01._sftp-[|domain]
00:01:31.447566 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 110: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 96) 10.16.65.42.mdns > 224.0.0.251.mdns:  0*- [0q] 1/0/0 42.65.16.10.in-addr.arpa. (Cache flush) PTR[|domain]
00:01:38.357727 00:1a:4b:0c:88:3b > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.65.42

Comment 4 Qian Cai 2008-12-01 07:53:25 UTC

Doing the same thing on hp-xw9300-01 showed different results.

We started a tcpdump capture with the following command, then triggered the crash:
# tcpdump -envvv 'ether host 00:17:08:2A:08:34'
# echo c >/proc/sysrq-trigger
...
udhcpc[1126]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1126]: Sending discover...
eth0: link up.
Sending discover...
udhcpc[1126]: Sending discover...
Sending discover...
udhcpc[1126]: Sending discover...
No lease, failing.
udhcpc[1126]: No lease, failing.
eth0 failed to come up
Dropping to shell. exit to reboot
root:/>
...

The tcpdump output showed that the DHCP server had replied with an IP address:

01:40:03.118708 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl  64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xab5c5740, flags: [none] (0x0000)
	  Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:40:03.119272 00:16:3e:4b:a5:4a > 00:17:08:2a:08:34, ethertype IPv4 (0x0800), length 355: (tos 0x10, ttl  16, id 0, offset 0, flags [none], proto: UDP (17), length: 341) 10.16.64.14.bootps > 10.16.64.84.bootpc: BOOTP/DHCP, Reply, length: 313, xid:0xab5c5740, flags: [none] (0x0000)
	  Your IP: 10.16.64.84
	  Server IP: 10.16.64.10
	  Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:40:06.124582 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl  64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xab5c5740, flags: [none] (0x0000)
	  Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]

However, the interface was not assigned the address.
root:/> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:17:08:2A:08:34  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:1 dropped:0 overruns:0 frame:1
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:1492 (1.4 KiB)
          Interrupt:11 

Tried to obtain an IP address again; this time there was no DHCP reply at all.
root:/> ifup eth0 
udhcpc (v1.2.0) started
udhcpc[1158]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1158]: Sending discover...
Sending discover...
udhcpc[1158]: Sending discover...
Sending discover...
udhcpc[1158]: Sending discover...
No lease, failing.
udhcpc[1158]: No lease, failing.
root:/> 

01:42:34.027588 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl  64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xb988d033, flags: [none] (0x0000)
	  Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:42:37.033458 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl  64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xb988d033, flags: [none] (0x0000)
	  Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:42:40.038334 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl  64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xb988d033, flags: [none] (0x0000)
	  Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]

root:/> mii-tool -v
SIOCGMIIPHY on 'eth0' failed: Operation not supported
no MII interfaces found

Configured a static IP address manually,
root:/> ifconfig eth0 10.16.64.84 netmask 255.255.248.0 broadcast 10.16.71.255
root:/> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:17:08:2A:08:34  
          inet addr:10.16.64.84  Bcast:10.16.71.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:1 dropped:0 overruns:0 frame:1
          TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:3274 (3.1 KiB)
          Interrupt:11 

root:/> route add default gw 10.16.71.254
root:/> route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.16.64.0      *               255.255.248.0   U     0      0        0 eth0
default         10.16.71.254    0.0.0.0         UG    0      0        0 eth0

root:/> ping -c 1 10.16.64.84
PING 10.16.64.84 (10.16.64.84): 56 data bytes

--- 10.16.64.84 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
<Tcpdump output nothing.>

root:/> ping -c 1 10.16.71.254
PING 10.16.71.254 (10.16.71.254): 56 data bytes

--- 10.16.71.254 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
root:/> 

02:37:38.600377 00:17:08:2a:08:34 > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.64.84
02:37:39.600326 00:17:08:2a:08:34 > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.64.84
02:37:40.600285 00:17:08:2a:08:34 > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.64.84

root:/> ethtool -S eth0
NIC statistics:
     tx_bytes: 1572
     tx_zero_rexmt: 8
     tx_one_rexmt: 0
     tx_many_rexmt: 0
     tx_late_collision: 0
     tx_fifo_errors: 0
     tx_carrier_errors: 0
     tx_excess_deferral: 0
     tx_retry_error: 0
     rx_frame_error: 0
     rx_extra_byte: 0
     rx_late_collision: 0
     rx_runt: 0
     rx_frame_too_long: 0
     rx_over_errors: 1
     rx_crc_errors: 0
     rx_frame_align_error: 0
     rx_length_error: 0
     rx_unicast: 0
     rx_multicast: 47
     rx_broadcast: 199
     rx_packets: 246
     rx_errors_total: 1
     tx_errors_total: 0

root:/> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:17:08:2A:08:34  
          inet addr:10.16.64.84  Bcast:10.16.71.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:1 dropped:0 overruns:0 frame:1
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:1572 (1.5 KiB)
          Interrupt:11 
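
In the forcedeth output above, the NIC statistics report rx_packets: 246 while ifconfig reports RX packets:0. That suggests frames were counted by the hardware but never delivered to the kernel network stack. A sketch for comparing the two counters is below; ethtool_stat and ifconfig_rx_packets are hypothetical helper names, and the interface name in the usage comment is an assumption.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and just parse text on stdin.

# $1 = counter name; reads `ethtool -S` output and prints that counter.
ethtool_stat() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Reads `ifconfig` output and prints the first "RX packets:" count.
ifconfig_rx_packets() {
    sed -n 's/^ *RX packets:\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example (assumed usage in the kdump shell):
#   nic=$(ethtool -S eth0 | ethtool_stat rx_packets)
#   stack=$(ifconfig eth0 | ifconfig_rx_packets)
#   [ "$nic" -gt 0 ] && [ "$stack" -eq 0 ] && echo "NIC saw $nic frames, stack saw none"
```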

root:/> cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 2 0 0 0 0 4 0 0 0 0 0
IcmpMsg: OutType3 OutType8
IcmpMsg: 2 4
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 0 0 0 0 0 0 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 0 0 0 0

root:/> ethtool -i eth0
driver: forcedeth
version: 0.60
firmware-version: 
bus-info: 0000:00:0a.0

root:/> tc -d qdisc
qdisc pfifo_fast 0: dev eth0 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

root:/> ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		16384
RX Mini:	0
RX Jumbo:	0
TX:		16384
Current hardware settings:
RX:		128
RX Mini:	0
RX Jumbo:	0
TX:		256

Comment 5 Qian Cai 2008-12-01 07:54:11 UTC

# modinfo forcedeth
filename:       /lib/modules/2.6.18-124.el5/kernel/drivers/net/forcedeth.ko
license:        GPL
description:    Reverse Engineered nForce ethernet driver
RHEL driver based on upstream driver version 0.60
Also includes additional upstream commits:
3ba4d093fe8a26f5f2da94411bf8732fa6e9da86 forcedeth: fix tx timeout
fcc5f2665c81e087fb95143325ed769a41128d50 forcedeth: fix nic poll
6fedae1f6e66ab5f169bf58064e23e015fc1307d forcedeth: fix checksum feature in mcp65
caf96469e8ab57170cc8ca9c59809132d38e529e forcedeth: disable msix
e0379a14fc80cb98978fa86989dab77b522a8106 forcedeth: fixed missing call in napi poll
a7475906bc496456ded9e4b062f94067fb93057a forcedeth: msi bugfix
9e555930bd873d238f5f7b9d76d3bf31e6e3ce93 forcedeth: boot delay fix
author:         Manfred Spraul <manfred>
srcversion:     52F782A3071D2A58B8F4D65
alias:          pci:v000010DEd0000054Fsv*sd*bc*sc*i*
alias:          pci:v000010DEd0000054Esv*sd*bc*sc*i*
alias:          pci:v000010DEd0000054Dsv*sd*bc*sc*i*
alias:          pci:v000010DEd0000054Csv*sd*bc*sc*i*
alias:          pci:v000010DEd00000453sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000452sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000451sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000450sv*sd*bc*sc*i*
alias:          pci:v000010DEd000003EFsv*sd*bc*sc*i*
alias:          pci:v000010DEd000003EEsv*sd*bc*sc*i*
alias:          pci:v000010DEd000003E6sv*sd*bc*sc*i*
alias:          pci:v000010DEd000003E5sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000373sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000372sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000269sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000268sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000038sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000037sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000057sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000056sv*sd*bc*sc*i*
alias:          pci:v000010DEd000000DFsv*sd*bc*sc*i*
alias:          pci:v000010DEd000000E6sv*sd*bc*sc*i*
alias:          pci:v000010DEd0000008Csv*sd*bc*sc*i*
alias:          pci:v000010DEd00000086sv*sd*bc*sc*i*
alias:          pci:v000010DEd000000D6sv*sd*bc*sc*i*
alias:          pci:v000010DEd00000066sv*sd*bc*sc*i*
alias:          pci:v000010DEd000001C3sv*sd*bc*sc*i*
depends:        
vermagic:       2.6.18-124.el5 SMP mod_unload gcc-4.1
parm:           max_interrupt_work:forcedeth maximum events handled per interrupt (int)
parm:           optimization_mode:In throughput mode (0), every tx & rx packet will generate an interrupt. In CPU mode (1), interrupts are controlled by a timer. (int)
parm:           poll_interval:Interval determines how frequent timer interrupt is generated by [(time_in_micro_secs * 100) / (2^10)]. Min is 0 and Max is 65535. (int)
parm:           msi:MSI interrupts are enabled by setting to 1 and disabled by setting to 0. (int)
parm:           msix:MSIX interrupts are enabled by setting to 1 and disabled by setting to 0. (int)
parm:           dma_64bit:High DMA is enabled by setting to 1 and disabled by setting to 0. (int)
module_sig:	883f3504921ea663afb531d9ea12bf1125ac309f7fdd3b5d1bc63ec6dcdebdd2ca2a22da9ade38009f59264b749fe7bab54e8f2bb1c322eca4e753739

Comment 6 Qian Cai 2008-12-01 08:00:31 UTC

Created attachment 325184 [details]
dmidecode for hp-xw4550-01 (07/20/2007)

Comment 7 Qian Cai 2008-12-01 08:01:20 UTC

Created attachment 325185 [details]
dmidecode for hp-xw9300-01 (11/28/2006)

Comment 8 Neil Horman 2008-12-01 11:35:13 UTC

Cai, I'm really not sure what you want me to do with this problem. You claim it wasn't a transient network error, yet when we spoke about it over email, I went to try it on this system and it worked just fine, several dumps over.  I'll look at it again if you want, but if it's working properly, there's really not much I can do.

And I really don't think that a transient problem noticed on one machine is a high priority problem.

Comment 9 Qian Cai 2008-12-02 11:07:05 UTC

Neil, I have seen it on two HP XW machines, as you can see from the above. The failure rate is around 50% (it felt much higher on hp-xw4550 on IA-32, where I reproduced it almost every time). I have already reproduced it no fewer than 10 times, so I believe it is fully reproducible.

It might be a transient network problem, but I don't know why it is only noticed in kdump kernel. Because of this, I don't think RHTS administrators will believe me that it is a RHTS issue. I can try though. Also, do you have any suggestion?

Comment 10 Neil Horman 2008-12-02 12:01:14 UTC

I understand your frustration, and when I say transient network error, it may just as well be the nic not able to negotiate link on the wire, or a transient error in resetting the NIC.  But regardless both of those problems are screaming hardware to me. I just don't see what I'm going to be able to do about them.

Given that we seem to be seeing so many odd behaviors on the hp xw series, it's possible that this is that bios bug that prarit found in bz 456638.  As such the workaround may help there if we expand its coverage. I'm building a kernel for bz 473038 already, so I can expand its coverage to touch these systems as well.  No promises, but it's worth a shot.

Comment 12 Qian Cai 2008-12-03 09:48:57 UTC

(In reply to comment #10)
> I understand your frustration, and when I say transient network error, it may
> just as well be the nic not able to negotiate link on the wire, or a transient
> error in resetting the NIC.  But regardless both of those problems are
> screaming hardware to me. I just don't see what I'm going to be able to do
> about them.
> 

Well, if you seriously suspect it is a hardware problem, we can ask administrators to replace the network cards on them. Otherwise, they are blocking kexec/kdump testing to a remote host. If it is a software problem, customers will lose the ability to reliably save a VMCore to remote hosts. Therefore, if it is not clear to you how to fix it at the moment, we can leave it open for now.

Comment 14 Neil Horman 2008-12-03 12:15:09 UTC

It's not magic Cai, it's a wide ranging bios issue that Prarit fixed in the -125 kernel (as I was mentioning in comment #11). I was under the impression that it got fixed in -124, but apparently there was an issue and it's really in -125.  You can tell it's there by the dmesg entry I gave you in comment 11.  Given that the workaround prarit has introduced for this bug has fixed several issues thus far with odd behavior on HP systems, I think it's worth a shot here.  So please test with -125.el5.

Comment 15 Qian Cai 2008-12-04 03:34:46 UTC

Same thing with kernel-2.6.18-125.el5 and kexec-tools-1.102pre-54.el5.

Red Hat Enterprise Linux Server release 5.3 Beta (Tikanga)
Kernel 2.6.18-125.el5 on an x86_64

hp-xw9300-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18-125.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Mon Dec 1 17:38:25 EST 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  hda=cdrom memmap=exactmap memmap=640K@0K memmap=5176K@16384K memmap=125240K@22200K elfcorehdr=147440K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fff9300 (usable)
 BIOS-e820: 000000003fff9300 - 0000000040000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 000000000150e000 (usable)
 user: 00000000015ae000 - 0000000008ffc000 (usable)
...
mapping eth0 to eth0
eth0: no link during initialization.
udhcpc (v1.2.0) started
irq 11: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ>  [<ffffffff800b8383>] __report_bad_irq+0x30/0x7d
 [<ffffffff800b85b6>] note_interrupt+0x1e6/0x227
 [<ffffffff800b7ab2>] __do_IRQ+0xbd/0x103
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff801b2d67>] serial8250_start_tx+0x0/0x90
 [<ffffffff80010bc2>] handle_IRQ_event+0x42/0xa6
 [<ffffffff800b7a99>] __do_IRQ+0xa4/0x103
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff801b2d67>] serial8250_start_tx+0x0/0x90
 [<ffffffff80012117>] __do_softirq+0x51/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cada>] do_softirq+0x2c/0x85
 [<ffffffff8006c962>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff801b2d67>] serial8250_start_tx+0x0/0x90
 [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff80052d0c>] uart_write+0xdf/0xee
 [<ffffffff80019d65>] write_chan+0x212/0x305
 [<ffffffff8008a41f>] default_wake_function+0x0/0xe
 [<ffffffff800284f8>] tty_write+0x177/0x20e
 [<ffffffff80019b53>] write_chan+0x0/0x305
 [<ffffffff80016734>] vfs_write+0xce/0x174
 [<ffffffff80016fe9>] sys_write+0x2d/0x6e
 [<ffffffff80017001>] sys_write+0x45/0x6e
 [<ffffffff8005d116>] system_call+0x7e/0x83

handlers:
[<ffffffff881657b7>] (nv_nic_irq_optimized+0x0/0x227 [forcedeth])
Disabling IRQ #11

udhcpc[1125]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1125]: Sending discover...
eth0: link up.
Sending discover...
udhcpc[1125]: Sending discover...
Sending discover...
udhcpc[1125]: Sending discover...
No lease, failing.
udhcpc[1125]: No lease, failing.
eth0 failed to come up
Dropping to shell. exit to reboot
root:/>

Comment 16 Neil Horman 2008-12-06 20:50:18 UTC

So, looking at the above output, this appears to be not so much a problem with dhcp failing, but rather a problem with link getting detected during restart on the forcedeth driver:

...
mapping eth0 to eth0
eth0: no link during initialization.

Looks like the NIC had an issue talking to the phy over the mii interface after the kdump operation started.  I wonder if perhaps an unconditional reset of the phy would fix the problem. I'll try to put together a patch.

Comment 17 Neil Horman 2008-12-08 15:16:32 UTC

Cai, I was just talking to gospo about this, and he pointed out a few interesting details.  Most notably, Nvidia NICs have a habit of creating byte-swapped MAC addresses.  Could you please attach the tcpdump that you mentioned in the initial comment of this bug?  Also could you attach the output of ethtool -e and -d, as well as the ifconfig output of the interface in question.  I'd like to compare MAC addresses to make sure that's not happening here.  Thanks!
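
The MAC comparison asked for above can be sketched in shell. This is a minimal sketch assuming bash; reverse_mac and macs_byte_swapped are hypothetical helper names, not part of ethtool or ifconfig, and the sample address is the HWaddr from the ifconfig output earlier in this bug.

```shell
#!/bin/bash
# Hypothetical helper: print a colon-separated MAC address with its
# bytes in reverse order.
reverse_mac() {
    local -a octets
    local out="" octet
    IFS=: read -r -a octets <<< "$1"
    for octet in "${octets[@]}"; do
        # Prepend each octet so the final string is byte-reversed.
        out="${octet}${out:+:}${out}"
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: succeed if the first MAC is the byte-reversed
# form of the second (comparison ignores hex digit case).
macs_byte_swapped() {
    local a b
    a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    b=$(reverse_mac "$2" | tr 'A-F' 'a-f')
    [ "$a" = "$b" ]
}

# Example with the HWaddr reported by ifconfig in this bug:
# reverse_mac 00:17:08:2A:08:34    # prints 34:08:2A:08:17:00
```

If the address seen in the kdump kernel (or in the tcpdump) matches the reversed form of the address from the normal kernel, that would point at the byte-swap problem rather than at DHCP itself.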

Comment 18 Qian Cai 2008-12-09 12:45:19 UTC

Created attachment 326293 [details]
Ifconfig from Normal Kernel on hp-xw9300

Comment 19 Qian Cai 2008-12-09 12:46:08 UTC

Created attachment 326294 [details]
Ethtool from Normal Kernel on hp-xw9300

Comment 20 Qian Cai 2008-12-09 12:46:35 UTC

Created attachment 326295 [details]
Ifconfig from Kdump Kernel on hp-xw9300

Comment 21 Qian Cai 2008-12-09 12:47:09 UTC

Created attachment 326296 [details]
Ethtool from Kdump Kernel on hp-xw9300

Comment 22 Qian Cai 2008-12-09 12:47:40 UTC

Created attachment 326297 [details]
Tcpdump from Kdump Kernel on hp-xw9300

Comment 23 Neil Horman 2008-12-17 16:25:47 UTC

Created attachment 327265 [details]
patch to map all reserved region of ram into kdump kernel

Cai, given that this is happening on hp machines that have had a slew of problems until recently, I think it's worth giving this patch a test.  Doug Chapman recently found that kdump doesn't map reserved e820 sections in the kdump kernel, but probably should (some bios vendors mark acpi space as reserved erroneously, or map other important config data there).  Anywho, I wonder if this bug isn't some weird result of not having all the acpi tables present during kdump boot.  This patch, applied to the latest kexec tools (-56.el5) should map those regions on x86_64 hardware.  We're planning to propose something like this upstream soon, and a test here to see if this is another of those bugs that would be solved with this patch would be good.  If you could please give this patch a try, I'd appreciate it.  Thanks!

Oh, btw, it's best to use the latest kexec-tools with at least kernel-2.6.18-127.el5, as that has the kernel patch to allow masking of the gart region of these systems (to avoid potential resets during vmcore copying).

Comment 24 Qian Cai 2009-02-01 09:34:16 UTC

Neil, I have just tried kernel-2.6.18-128.el5 and kexec-tools-1.102pre-56.el5_3.1 (which I believe includes the patch you mentioned) on hp-xw9300-01.rhts.bos.redhat.com, but I still see the same problem.

Comment 25 Neil Horman 2009-02-01 23:43:41 UTC

Yeah, that should have what you need.  Ok, do you have this system reserved?  Can I hop on it and poke around?

Comment 26 Qian Cai 2009-02-02 01:03:52 UTC

Sorry, I don't have the machine reserved at the moment. The machine can be reserved via RHTS webUI,

http://rhts.redhat.com/cgi-bin/rhts/reserve_workflow.cgi

If you have any problem, let me know.

Comment 27 Neil Horman 2009-02-02 03:26:19 UTC

Ok, I'm reserving it.  Thanks

Note to self: there's a bunch of stuff upstream about forcedeth that's interesting.  I should try rebasing.

Comment 28 Neil Horman 2009-02-02 15:40:13 UTC

Ok, this is definitely isolated to the forcedeth driver.  Specifically to its ability to dma to highmem regions.  I added:
options forcedeth dma_64bit=0
to /etc/kdump.conf and this system was able to dump over a network a-ok.
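
A sketch of applying that change so it is only added once, assuming bash and the kdump.conf "options <module> <option list>" directive used above. add_kdump_module_option is a hypothetical helper, not part of kexec-tools, and the assumption that restarting the kdump service rebuilds the kdump initrd should be checked against the installed kexec-tools.

```shell
#!/bin/bash
# Hypothetical helper: append "options <module> <option>" to a
# kdump.conf-style file unless that exact line is already there.
add_kdump_module_option() {
    local conf=$1 module=$2 option=$3
    local line="options ${module} ${option}"
    # -x whole-line match, -F fixed string, -q quiet
    if ! grep -qxF -- "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Example, as root on the affected machine (left commented out so that
# sourcing this sketch has no side effects):
# add_kdump_module_option /etc/kdump.conf forcedeth dma_64bit=0
# service kdump restart   # assumption: this rebuilds the kdump initrd
```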

Given that this is a dma location problem, I'm guessing that we are looking at a hardware idiosyncrasy here.  As I noted before there are several upstream commits that may have an influence on this, which we can look into.  I see three possible options:

1) Investigate the upstream changes and cherry pick any that we find that correct this situation.  Or simply perform a wholesale update of the driver from upstream

2) Add code to forcedeth to trigger on the reset_devices kernel command line option to suppress the use of 64 bit dma

3) Add a release note indicating that this driver may need to have the dma_64bit module option added to kdump.conf

Of those three, I think (3) is the best option.  Option (1) is definitely good, but there's no guarantee that the upstream suspects fix this problem.  Option (2) just seems like a hack.
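
Option (2) keys on the reset_devices flag, which is present on the kdump kernel command line quoted in comment 15. The following is only a userspace illustration of that check, assuming bash; it is not driver code, and has_cmdline_flag is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: succeed if FLAG appears as a whole word in a
# kernel command-line string.
has_cmdline_flag() {
    local flag=$1 cmdline=$2 word
    local -a words
    # Split on whitespace without glob expansion.
    read -r -a words <<< "$cmdline"
    for word in "${words[@]}"; do
        [ "$word" = "$flag" ] && return 0
    done
    return 1
}

# On a running system the string would come from /proc/cmdline:
# has_cmdline_flag reset_devices "$(cat /proc/cmdline)" && echo "booted with reset_devices"
```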

I would say, let's go with option (3).  I'll try a wholesale backport of the latest forcedeth driver and see if that fixes the problem.  If so, I'll talk to gospo about pushing to 5.4.  In the interim, this can be release-noted (or a kbase article written so that people on 5.3 will know how to get it working).

Thoughts?

Comment 29 Qian Cai 2009-02-03 15:20:06 UTC

Yes, adding a release note sounds good to me. Since the other affected machine, hp-xw4550, is using tg3 instead, I guess I'll need to file another bug for it. Is that correct? Thanks.

Comment 30 Neil Horman 2009-02-03 16:35:26 UTC

Ok, I'll talk to andy about the forcedeth driver update, and write a release note here.  As for tg3, yeah a new bug would be good.

Comment 31 Neil Horman 2009-02-03 18:50:10 UTC

Ok, I've spoken with andy about a possible update of the forcedeth driver in 5.4 and we will look into it.  In the interim, I'm closing this as deferred, and I've added the release note text above.  Thanks!

Comment 32 Neil Horman 2009-02-03 18:50:10 UTC

Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Some forcedeth based cards in 5.3 may have some difficulty in accessing memory above 4GB during operation in a kdump kernel.  There is a direct workaround for this bug.  In the file /etc/sysconfig/kdump, add the following:
KDUMP_COMMANDLINE_APPEND="dma_64bit=0"

This will prevent the forcedeth network driver from using high memory resources in the kdump kernel and allow the network to function properly.
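
If /etc/sysconfig/kdump already defines KDUMP_COMMANDLINE_APPEND with other arguments (an assumption; check the file first), the new argument can be appended to the existing value rather than replacing it. A minimal sketch, assuming bash; append_kdump_cmdline_arg is a hypothetical helper and expects an argument without regular-expression metacharacters, such as dma_64bit=0.

```shell
#!/bin/bash
# Hypothetical helper: make sure ARG is part of the quoted
# KDUMP_COMMANDLINE_APPEND value in FILE, keeping existing arguments.
append_kdump_cmdline_arg() {
    local file=$1 arg=$2 tmp
    # No such variable yet: add a fresh line.
    if ! grep -q '^KDUMP_COMMANDLINE_APPEND="' "$file" 2>/dev/null; then
        printf 'KDUMP_COMMANDLINE_APPEND="%s"\n' "$arg" >> "$file"
        return 0
    fi
    # Already present as a whole word: nothing to do.
    if grep -Eq "^KDUMP_COMMANDLINE_APPEND=\"(.* )?${arg}( .*)?\"" "$file"; then
        return 0
    fi
    # Otherwise append ARG inside the existing quotes.
    tmp=$(mktemp) || return 1
    sed "s|^KDUMP_COMMANDLINE_APPEND=\"\(.*\)\"|KDUMP_COMMANDLINE_APPEND=\"\1 ${arg}\"|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example, as root (left commented out so that sourcing this sketch has
# no side effects):
# append_kdump_cmdline_arg /etc/sysconfig/kdump dma_64bit=0
```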

Comment 33 Qian Cai 2009-07-23 02:17:53 UTC

Still seeing it on RHEL5.4. I guess we'll still need to carry the release note.

hp-xw9300-01.rhts.bos.redhat.com (x86_64)
kexec-tools-1.102pre-77.el5
kernel-2.6.18-159.el5
RHEL5.4-Server-20090715.0

Serial console messages from the kdump kernel:
....
mapping eth0 to eth0
eth0: no link during initialization.
eth0 Link Up.  Waiting 60 Seconds
eth0: link up.
Continuing
udhcpc (v1.2.0) started
udhcpc[1222]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1222]: Sending discover...
Sending discover...
udhcpc[1222]: Sending discover...
Sending discover...
udhcpc[1222]: Sending discover...
No lease, failing.
udhcpc[1222]: No lease, failing.
eth0 failed to come up
md: stopping all md devices.
Synchronizing SCSI cache for disk sda: 
ACPI: PCI interrupt for device 0000:00:0a.0 disabled
Restarting system.
.
machine restart

Comment 35 Neil Horman 2009-07-23 11:05:19 UTC

Hmm, I wonder if the dma issue is hardware based rather than simply a code issue. I'll talk to gospo.

Comment 39 Ryan Lerch 2009-08-19 02:24:18 UTC

Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1,5 @@
-Some forcedeth based cards in 5.3 may have some difficulty in accessing memory above 4GB during operation in a kdump kernel.  There is a direct workaround for this bug.  In the file /etc/sysconfig/kdump, add the following:
+Some <filename>forcedeth</filename> based devices may encounter difficulty accessing memory above 4GB during operation in a <filename>kdump</filename> kernel. To work around this issue, add the following line to the <filename>/etc/sysconfig/kdump</filename> file:
+<screen>
 KDUMP_COMMANDLINE_APPEND="dma_64bit=0"
-
+</screen>
-This will prevent the forcedeth network driver from using high memory resources in the kdump kernel and allow the network to function properly.
+This work around prevents the forcedeth network driver from using high memory resources in the kdump kernel, allowing the network to function properly.
