Description of problem:
The second (kdump) kernel sometimes cannot receive any incoming packets on two HP XW machines (hp-xw9300-01.rhts.bos.redhat.com and hp-xw4550-01.rhts.bos.redhat.com). As a result, it is impossible to save the VMCore to any remote host. The server was configured to obtain an IP address via DHCP.

eth0 Link Up. Waiting 60 Seconds
+sleep 60
+echo Continuing
Continuing
+[ 0000:00:0a.0 == Bonding ]
+[ 0000:00:0a.0 == Vlan ]
+exit 0
+shift 1
+/bin/msh -c udhcpc -n -p /var/run/udhcpc.eth0.pid -i eth0
udhcpc (v1.2.0) started
udhcpc[1292]: udhcpc (v1.2.0) started
+[ -z deconfig ]
+/sbin/ifconfig eth0 0.0.0.0
+exit 0
Sending discover...
udhcpc[1292]: Sending discover...
Sending discover...
udhcpc[1292]: Sending discover...
Sending discover...
udhcpc[1292]: Sending discover...
+[ -z leasefail ]
+exit 0
No lease, failing.
udhcpc[1292]: No lease, failing.
root:/>

As shown above, all DHCP requests failed. The interesting thing was what happened when I configured a static IP address on this server and ran tcpdump on another host B in the same subnet. When the server pinged B, tcpdump showed ARP requests and replies, but no DHCP or ICMP replies. It looked as if something was broken in the IP stack.

06:07:19.791705 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34 (oui Unknown), length: 548, xid:0x8c59213c, flags: [none] (0x0000)
    Client Ethernet Address: 00:17:08:2a:08:34 (oui Unknown) [|bootp]
06:07:22.797101 IP (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34 (oui Unknown), length: 548, xid:0x8c59213c, flags: [none] (0x0000)
    Client Ethernet Address: 00:17:08:2a:08:34 (oui Unknown) [|bootp]

"00:17:08:2a:08:34" was the affected machine's MAC address.
root:/> ifconfig eth0 10.16.64.84 netmask 255.255.248.0 broadcast 10.16.71.255
root:/> ifconfig
eth0 Link encap:Ethernet HWaddr 00:17:08:2A:08:34
     inet addr:10.16.64.84 Bcast:10.16.71.255 Mask:255.255.248.0
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
     RX packets:0 errors:1 dropped:0 overruns:0 frame:1
     Interrupt:225 Base address:0xe000
root:/> ping -c 3 10.16.64.121
PING 10.16.64.121 (10.16.64.121): 56 data bytes
--- 10.16.64.121 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

The server could not even ping itself.

root:/> ping -c 3 10.16.64.84
PING 10.16.64.84 (10.16.64.84): 56 data bytes
--- 10.16.64.84 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss

I doubt this was caused by a transient network problem, because I am not aware of any such problem with the normal kernel.

Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5
kernel-2.6.18-124.el5
kexec-tools-1.102pre-51.el5

How reproducible:
Usually 50%.

Steps to Reproduce:
1. Configure kdump with crashkernel=128M@16M.
2. Use the following kdump.conf:
   net server@nfs
   default shell
3. echo c >/proc/sysrq-trigger

Actual results:
The kdump kernel failed to get an IP address via DHCP.

Expected results:
The kdump kernel gets an IP address via DHCP and saves the VMCore to the remote host.
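The static address, netmask and broadcast used in the manual test above can be cross-checked with a few lines of Python. This is only a sketch to rule out a misconfigured netmask as the reason for the failed ping; the addresses are copied from the output in this report, and the variable names are illustrative.

```python
import ipaddress

# Address and netmask as passed to ifconfig in the kdump shell.
iface = ipaddress.ip_interface("10.16.64.84/255.255.248.0")
network = iface.network

# Network the interface belongs to and its broadcast address;
# the latter should agree with the Bcast field printed by ifconfig.
print("network:  ", network)
print("broadcast:", network.broadcast_address)

# The host that was pinged from the kdump shell.
peer = ipaddress.ip_address("10.16.64.121")
print("peer on-link:", peer in network)
```

If the peer is on-link, no gateway is involved in the ping, so a routing mistake is an unlikely explanation for the 100% packet loss.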
Some network driver information: hp-xw9300-01: 00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3) Subsystem: Hewlett-Packard Company Unknown device 1500 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 0 (250ns min, 5000ns max) Interrupt: pin A routed to IRQ 193 Region 0: Memory at f2104000 (32-bit, non-prefetchable) [size=4K] Region 1: I/O ports at 28f0 [size=8] Capabilities: [44] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 PME-Enable+ DSel=0 DScale=0 PME- hp-xw4550-01: 3f:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5755 Gigabit Ethernet PCI Express (rev 02) Subsystem: Hewlett-Packard Company Unknown device 12ff Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 225 Region 0: Memory at d8800000 (64-bit, non-prefetchable) [size=64K] Expansion ROM at <ignored> [disabled] Capabilities: [48] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Capabilities: [58] Vendor Specific Information Capabilities: [e8] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+ Address: 00000000fee01000 Data: 40e1 Capabilities: [d0] Express Endpoint IRQ 0 Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag+ Device: Latency L0s <4us, L1 unlimited Device: AtnBtn- AtnInd- PwrInd- Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported- Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 128 bytes, MaxReadReq 4096 bytes Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0 
Link: Latency L0s <4us, L1 <64us Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch- Link: Speed 2.5Gb/s, Width x1 Capabilities: [100] Advanced Error Reporting Capabilities: [13c] Virtual Channel Capabilities: [160] Device Serial Number 3b-88-0c-fe-ff-4b-1a-00 Capabilities: [16c] Power Budgeting
On hp-xw4550-01, looks like the tg3 driver does not function at all. We opened a tcpdump server using the following command, # tcpdump -envvv 'ether host 00:1A:4B:0C:88:3B' # echo c >/proc/sysrq-trigger ... 00:1a:4b:0c:88:3b0: Tigon3 [partno(BCM95755) rev a002 PHY(5755)] (PCI Express) 10/100/1000Base-T Ethernet pshot.ko module eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1] Loading libphy.eth0: dma_rwctrl[76180000] dma_mask[64-bit] ko module Loading tg3.ko module ... udhcpc (v1.2.0) started udhcpc[1042]: udhcpc (v1.2.0) started Sending discover... udhcpc[1042]: Sending discover... Sending discover... udhcpc[1042]: Sending discover... Sending discover... udhcpc[1042]: Sending discover... No lease, failing. udhcpc[1042]: No lease, failing. eth0 failed to come up Dropping to shell. exit to reboot root:/> ifup eth0 udhcpc (v1.2.0) started udhcpc[1071]: udhcpc (v1.2.0) started Sending discover... udhcpc[1071]: Sending discover... Sending discover... udhcpc[1071]: Sending discover... Sending discover... udhcpc[1071]: Sending discover... No lease, failing. udhcpc[1071]: No lease, failing. root:/> <Tcpdump did not output anything at this point.> root:/> mii-tool -v eth0: negotiated 100baseTx-FD, link ok product info: vendor 00:50:ef, model 12 rev 0 basic mode: autonegotiation enabled basic status: autonegotiation complete, link ok capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD advertising: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD Manually setup IP address from here. 
root:/> ifconfig eth0 10.16.65.42 netmask 255.255.248.0 broadcast 10.16.71.255
root:/> ifconfig
eth0 Link encap:Ethernet HWaddr 00:1A:4B:0C:88:3B
     inet addr:10.16.65.42 Bcast:10.16.71.255 Mask:255.255.248.0
     UP BROADCAST MULTICAST MTU:1500 Metric:1
     RX packets:0 errors:0 dropped:0 overruns:0 frame:0
     TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
     Interrupt:5
root:/> route add default gw 10.16.71.254
root:/> route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.16.64.0 * 255.255.248.0 U 0 0 0 eth0
default 10.16.71.254 0.0.0.0 UG 0 0 0 eth0

Ping ourselves.

root:/> ping -c 1 10.16.65.42
PING 10.16.65.42 (10.16.65.42): 56 data bytes
--- 10.16.65.42 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
<Tcpdump did not output anything at this point.>

Ping the gateway.

root:/> ping -c 1 10.16.71.254
PING 10.16.71.254 (10.16.71.254): 56 data bytes
--- 10.16.71.254 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
<Tcpdump did not output anything at this point.>

ethtool did not show any packets.
root:/> ethtool -S eth0 NIC statistics: rx_octets: 0 rx_fragments: 0 rx_ucast_packets: 0 rx_mcast_packets: 0 rx_bcast_packets: 0 rx_fcs_errors: 0 rx_align_errors: 0 rx_xon_pause_rcvd: 0 rx_xoff_pause_rcvd: 0 rx_mac_ctrl_rcvd: 0 rx_xoff_entered: 0 rx_frame_too_long_errors: 0 rx_jabbers: 0 rx_undersize_packets: 0 rx_in_length_errors: 0 rx_out_length_errors: 0 rx_64_or_less_octet_packets: 0 rx_65_to_127_octet_packets: 0 rx_128_to_255_octet_packets: 0 rx_256_to_511_octet_packets: 0 rx_512_to_1023_octet_packets: 0 rx_1024_to_1522_octet_packets: 0 rx_1523_to_2047_octet_packets: 0 rx_2048_to_4095_octet_packets: 0 rx_4096_to_8191_octet_packets: 0 rx_8192_to_9022_octet_packets: 0 tx_octets: 0 tx_collisions: 0 tx_xon_sent: 0 tx_xoff_sent: 0 tx_flow_control: 0 tx_mac_errors: 0 tx_single_collisions: 0 tx_mult_collisions: 0 tx_deferred: 0 tx_excessive_collisions: 0 tx_late_collisions: 0 tx_collide_2times: 0 tx_collide_3times: 0 tx_collide_4times: 0 tx_collide_5times: 0 tx_collide_6times: 0 tx_collide_7times: 0 tx_collide_8times: 0 tx_collide_9times: 0 tx_collide_10times: 0 tx_collide_11times: 0 tx_collide_12times: 0 tx_collide_13times: 0 tx_collide_14times: 0 tx_collide_15times: 0 tx_ucast_packets: 0 tx_mcast_packets: 0 tx_bcast_packets: 0 tx_carrier_sense_errors: 0 tx_discards: 0 tx_errors: 0 dma_writeq_full: 0 dma_write_prioq_full: 0 rxbds_empty: 0 rx_discards: 0 rx_errors: 0 rx_threshold_hit: 0 dma_readq_full: 0 dma_read_prioq_full: 0 tx_comp_queue_full: 0 ring_set_send_prod_index: 0 ring_status_update: 0 nic_irqs: 0 nic_avoided_irqs: 0 nic_tx_threshold_hit: 0 Ifconfig also did not show any packets. 
root:/> ifconfig eth0 Link encap:Ethernet HWaddr 00:1A:4B:0C:88:3B inet addr:10.16.65.42 Bcast:10.16.71.255 Mask:255.255.248.0 UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) Interrupt:5 root:/> cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 2 64 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 1 0 0 0 0 2 0 0 0 0 0 IcmpMsg: OutType3 OutType8 IcmpMsg: 1 2 Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts Tcp: 1 200 120000 -1 0 0 0 0 0 0 0 0 0 0 Udp: InDatagrams NoPorts InErrors OutDatagrams Udp: 0 0 0 0 root:/> ethtool -i eth0 driver: tg3 version: 3.93 firmware-version: 5755-v3.29 bus-info: 0000:3f:00.0 root:/> tc -d qdisc qdisc pfifo_fast 0: dev eth0 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 root:/> ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 511 RX Mini: 0 RX Jumbo: 0 TX: 511 Current hardware settings: RX: 200 RX Mini: 0 RX Jumbo: 0 TX: 511
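The /proc/net/snmp dump above is hard to read by eye because each protocol is printed as a header line followed by a line of values. A minimal Python sketch that pairs the two is given here; the sample text is copied from the Ip and Udp rows above, and the function name parse_snmp is hypothetical.

```python
SNMP_SAMPLE = """\
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 2 64 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 0 0 0 0
"""


def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    tables = {}
    # Lines come in pairs: counter names first, then the matching values.
    for names, values in zip(lines[0::2], lines[1::2]):
        proto = names[0].rstrip(":")
        tables[proto] = {n: int(v) for n, v in zip(names[1:], values[1:])}
    return tables


ip = parse_snmp(SNMP_SAMPLE)["Ip"]
print("InReceives:", ip["InReceives"], "OutRequests:", ip["OutRequests"])
```

For the values in this report, the IP layer handed packets down for transmission (OutRequests) but never saw one arrive (InReceives), which matches the zero RX counters from ifconfig and ethtool.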
On hp-xw4550-01, tg3 driver worked fine with a normal kernel. # modinfo tg3 filename: /lib/modules/2.6.18-124.el5/kernel/drivers/net/tg3.ko version: 3.93 license: GPL description: Broadcom Tigon3 ethernet driver author: David S. Miller (davem) and Jeff Garzik (jgarzik) srcversion: 9F10E7BFA7D69F890110EAC alias: pci:v0000106Bd00001645sv*sd*bc*sc*i* alias: pci:v0000173Bd000003EAsv*sd*bc*sc*i* alias: pci:v0000173Bd000003EBsv*sd*bc*sc*i* alias: pci:v0000173Bd000003E9sv*sd*bc*sc*i* alias: pci:v0000173Bd000003E8sv*sd*bc*sc*i* alias: pci:v00001148d00004500sv*sd*bc*sc*i* alias: pci:v00001148d00004400sv*sd*bc*sc*i* alias: pci:v000014E4d00001699sv*sd*bc*sc*i* alias: pci:v000014E4d00001680sv*sd*bc*sc*i* alias: pci:v000014E4d00001681sv*sd*bc*sc*i* alias: pci:v000014E4d0000165Bsv*sd*bc*sc*i* alias: pci:v000014E4d00001684sv*sd*bc*sc*i* alias: pci:v000014E4d00001698sv*sd*bc*sc*i* alias: pci:v000014E4d00001713sv*sd*bc*sc*i* alias: pci:v000014E4d00001712sv*sd*bc*sc*i* alias: pci:v000014E4d000016DDsv*sd*bc*sc*i* alias: pci:v000014E4d0000166Bsv*sd*bc*sc*i* alias: pci:v000014E4d0000166Asv*sd*bc*sc*i* alias: pci:v000014E4d00001679sv*sd*bc*sc*i* alias: pci:v000014E4d00001678sv*sd*bc*sc*i* alias: pci:v000014E4d00001669sv*sd*bc*sc*i* alias: pci:v000014E4d00001668sv*sd*bc*sc*i* alias: pci:v000014E4d0000167Fsv*sd*bc*sc*i* alias: pci:v000014E4d00001693sv*sd*bc*sc*i* alias: pci:v000014E4d0000169Bsv*sd*bc*sc*i* alias: pci:v000014E4d0000169Asv*sd*bc*sc*i* alias: pci:v000014E4d00001674sv*sd*bc*sc*i* alias: pci:v000014E4d00001673sv*sd*bc*sc*i* alias: pci:v000014E4d0000167Bsv*sd*bc*sc*i* alias: pci:v000014E4d00001672sv*sd*bc*sc*i* alias: pci:v000014E4d0000167Asv*sd*bc*sc*i* alias: pci:v000014E4d000016FEsv*sd*bc*sc*i* alias: pci:v000014E4d000016FDsv*sd*bc*sc*i* alias: pci:v000014E4d000016F7sv*sd*bc*sc*i* alias: pci:v000014E4d00001601sv*sd*bc*sc*i* alias: pci:v000014E4d00001600sv*sd*bc*sc*i* alias: pci:v000014E4d0000167Esv*sd*bc*sc*i* alias: pci:v000014E4d0000167Dsv*sd*bc*sc*i* alias: 
pci:v000014E4d0000167Csv*sd*bc*sc*i* alias: pci:v000014E4d00001677sv*sd*bc*sc*i* alias: pci:v000014E4d00001676sv*sd*bc*sc*i* alias: pci:v000014E4d0000165Asv*sd*bc*sc*i* alias: pci:v000014E4d00001659sv*sd*bc*sc*i* alias: pci:v000014E4d00001658sv*sd*bc*sc*i* alias: pci:v000014E4d0000166Esv*sd*bc*sc*i* alias: pci:v000014E4d00001649sv*sd*bc*sc*i* alias: pci:v000014E4d0000170Esv*sd*bc*sc*i* alias: pci:v000014E4d0000170Dsv*sd*bc*sc*i* alias: pci:v000014E4d0000169Dsv*sd*bc*sc*i* alias: pci:v000014E4d0000169Csv*sd*bc*sc*i* alias: pci:v000014E4d00001696sv*sd*bc*sc*i* alias: pci:v000014E4d000016C7sv*sd*bc*sc*i* alias: pci:v000014E4d000016C6sv*sd*bc*sc*i* alias: pci:v000014E4d000016A8sv*sd*bc*sc*i* alias: pci:v000014E4d000016A7sv*sd*bc*sc*i* alias: pci:v000014E4d000016A6sv*sd*bc*sc*i* alias: pci:v000014E4d0000165Esv*sd*bc*sc*i* alias: pci:v000014E4d0000165Dsv*sd*bc*sc*i* alias: pci:v000014E4d00001654sv*sd*bc*sc*i* alias: pci:v000014E4d00001653sv*sd*bc*sc*i* alias: pci:v000014E4d0000164Dsv*sd*bc*sc*i* alias: pci:v000014E4d00001648sv*sd*bc*sc*i* alias: pci:v000014E4d00001647sv*sd*bc*sc*i* alias: pci:v000014E4d00001646sv*sd*bc*sc*i* alias: pci:v000014E4d00001645sv*sd*bc*sc*i* alias: pci:v000014E4d00001644sv*sd*bc*sc*i* depends: libphy vermagic: 2.6.18-124.el5 SMP mod_unload gcc-4.1 parm: tg3_debug:Tigon3 bitmapped debugging message enable value (int) module_sig: 883f3504921ea653afb531d9ea12bf11216109e3d4aba9af922865f2833869db7fe3418b632263c09f415b131f372f4d93a6ff1a5b39f2937a08bbe6a # ifconfig eth0 Link encap:Ethernet HWaddr 00:1A:4B:0C:88:3B inet addr:10.16.65.42 Bcast:10.16.71.255 Mask:255.255.248.0 inet6 addr: fe80::21a:4bff:fe0c:883b/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:39257 errors:0 dropped:0 overruns:0 frame:0 TX packets:1352 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:4128450 (3.9 MiB) TX bytes:413490 (403.7 KiB) Interrupt:193 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: 
::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:56 errors:0 dropped:0 overruns:0 frame:0 TX packets:56 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:7048 (6.8 KiB) TX bytes:7048 (6.8 KiB) # route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 10.16.64.0 * 255.255.248.0 U 0 0 0 eth0 169.254.0.0 * 255.255.0.0 U 0 0 0 eth0 default 10.16.71.254 0.0.0.0 UG 0 0 0 eth0 # ifdown eth0 # ifup eth0 Determining IP information for eth0... done. 00:01:27.515218 00:1a:4b:0c:88:3b > Broadcast, ethertype IPv4 (0x0800), length 342: (tos 0x10, ttl 16, id 0, offset 0, flags [none], proto: UDP (17), length: 328) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:1a:4b:0c:88:3b, length: 300, xid:0x7649d206, flags: [none] (0x0000) Client Ethernet Address: 00:1a:4b:0c:88:3b [|bootp] 00:01:27.515741 00:16:3e:4b:a5:4a > 00:1a:4b:0c:88:3b, ethertype IPv4 (0x0800), length 361: (tos 0x10, ttl 16, id 0, offset 0, flags [none], proto: UDP (17), length: 347) 10.16.64.14.bootps > 10.16.65.42.bootpc: BOOTP/DHCP, Reply, length: 319, xid:0x7649d206, flags: [none] (0x0000) Your IP: 10.16.65.42 Server IP: 10.16.64.10 Client Ethernet Address: 00:1a:4b:0c:88:3b [|bootp] 00:01:27.685826 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 234: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 220) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 4/0/0 _services._dns-sd._udp.local. 
PTR[|domain] 00:01:27.709374 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 415: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 401) 10.16.65.42.mdns > 224.0.0.251.mdns: 0 [4q] [7n][|domain] 00:01:27.959654 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 415: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 401) 10.16.65.42.mdns > 224.0.0.251.mdns: 0 [4q] [7n][|domain] 00:01:28.209559 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 415: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 401) 10.16.65.42.mdns > 224.0.0.251.mdns: 0 [4q] [7n][|domain] 00:01:28.409556 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 194: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 180) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 2/0/0 SFTP File Transfer on hp-xw4550-01._sftp-[|domain] 00:01:28.409831 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 274: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 260) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 5/0/0 hp-xw4550-01 [00:1a:4b:0c:88:3b]._worksta[|domain] 00:01:28.705561 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 234: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 220) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 4/0/0 _services._dns-sd._udp.local. 
PTR[|domain] 00:01:29.428629 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 353: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 339) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 6/0/0 SFTP File Transfer on hp-xw4550-01._sftp-[|domain] 00:01:29.428746 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 110: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 96) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 1/0/0 42.65.16.10.in-addr.arpa. (Cache flush) PTR[|domain] 00:01:30.724522 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 407: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 393) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 9/0/0 _services._dns-sd._udp.local. PTR[|domain] 00:01:31.447461 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 353: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 339) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 6/0/0 SFTP File Transfer on hp-xw4550-01._sftp-[|domain] 00:01:31.447566 00:1a:4b:0c:88:3b > 01:00:5e:00:00:fb, ethertype IPv4 (0x0800), length 110: (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto: UDP (17), length: 96) 10.16.65.42.mdns > 224.0.0.251.mdns: 0*- [0q] 1/0/0 42.65.16.10.in-addr.arpa. (Cache flush) PTR[|domain] 00:01:38.357727 00:1a:4b:0c:88:3b > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.65.42
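In the normal-kernel capture above, the DHCP request and the reply carry the same transaction id (xid), which is how a completed exchange can be told apart from an unanswered one. The sketch below pairs requests and replies by xid in tcpdump text. The two samples are shortened copies of lines from this report (one from the normal kernel on hp-xw4550-01, one from the kdump kernel on hp-xw9300-01); the regular expression and function names are illustrative and only cover the output format seen here.

```python
import re
from collections import defaultdict

# Matches the message kind and transaction id on one tcpdump line.
XID_RE = re.compile(r"BOOTP/DHCP, (Request|Reply).*?xid:(0x[0-9a-fA-F]+)")

NORMAL_KERNEL = """\
00:01:27.515218 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:1a:4b:0c:88:3b, length: 300, xid:0x7649d206, flags: [none] (0x0000)
00:01:27.515741 10.16.64.14.bootps > 10.16.65.42.bootpc: BOOTP/DHCP, Reply, length: 319, xid:0x7649d206, flags: [none] (0x0000)
"""

KDUMP_KERNEL = """\
06:07:19.791705 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34 (oui Unknown), length: 548, xid:0x8c59213c, flags: [none] (0x0000)
06:07:22.797101 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34 (oui Unknown), length: 548, xid:0x8c59213c, flags: [none] (0x0000)
"""


def dhcp_exchanges(capture):
    """Map each DHCP transaction id to the message kinds seen for it."""
    seen = defaultdict(list)
    for kind, xid in XID_RE.findall(capture):
        seen[xid].append(kind)
    return dict(seen)


def unanswered(capture):
    """Return the transaction ids that have requests but no reply."""
    return sorted(x for x, kinds in dhcp_exchanges(capture).items()
                  if "Reply" not in kinds)


print("normal kernel unanswered:", unanswered(NORMAL_KERNEL))
print("kdump kernel unanswered: ", unanswered(KDUMP_KERNEL))
```

Note that a reply visible on the wire only shows that the server answered; whether the client actually received it has to be checked on the client side.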
Doing the same thing on hp-xw9300-01 showed a different result. We started a tcpdump capture using the following command,

# tcpdump -envvv 'ether host 00:17:08:2A:08:34'

# echo c >/proc/sysrq-trigger
...
udhcpc[1126]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1126]: Sending discover...
eth0: link up.
Sending discover...
udhcpc[1126]: Sending discover...
Sending discover...
udhcpc[1126]: Sending discover...
No lease, failing.
udhcpc[1126]: No lease, failing.
eth0 failed to come up
Dropping to shell. exit to reboot
root:/>
...

The tcpdump output suggested that an IP address had been obtained successfully,

01:40:03.118708 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xab5c5740, flags: [none] (0x0000)
    Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:40:03.119272 00:16:3e:4b:a5:4a > 00:17:08:2a:08:34, ethertype IPv4 (0x0800), length 355: (tos 0x10, ttl 16, id 0, offset 0, flags [none], proto: UDP (17), length: 341) 10.16.64.14.bootps > 10.16.64.84.bootpc: BOOTP/DHCP, Reply, length: 313, xid:0xab5c5740, flags: [none] (0x0000)
    Your IP: 10.16.64.84
    Server IP: 10.16.64.10
    Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:40:06.124582 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xab5c5740, flags: [none] (0x0000)
    Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]

However, it had not.
root:/> ifconfig
eth0 Link encap:Ethernet HWaddr 00:17:08:2A:08:34
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
     RX packets:0 errors:1 dropped:0 overruns:0 frame:1
     TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:0 (0.0 B) TX bytes:1492 (1.4 KiB)
     Interrupt:11

I tried to obtain an IP address again, but this time there was no DHCP reply at all.

root:/> ifup eth0
udhcpc (v1.2.0) started
udhcpc[1158]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1158]: Sending discover...
Sending discover...
udhcpc[1158]: Sending discover...
Sending discover...
udhcpc[1158]: Sending discover...
No lease, failing.
udhcpc[1158]: No lease, failing.
root:/>

01:42:34.027588 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xb988d033, flags: [none] (0x0000)
    Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:42:37.033458 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xb988d033, flags: [none] (0x0000)
    Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]
01:42:40.038334 00:17:08:2a:08:34 > Broadcast, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto: UDP (17), length: 576) 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:17:08:2a:08:34, length: 548, xid:0xb988d033, flags: [none] (0x0000)
    Client Ethernet Address: 00:17:08:2a:08:34 [|bootp]

root:/> mii-tool -v
SIOCGMIIPHY on 'eth0' failed: Operation not supported
no MII interfaces found

Configured a static IP address manually,

root:/> ifconfig eth0 10.16.64.84 netmask 255.255.248.0 broadcast 10.16.71.255
root:/> ifconfig
eth0 Link encap:Ethernet HWaddr 00:17:08:2A:08:34
inet addr:10.16.64.84 Bcast:10.16.71.255 Mask:255.255.248.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:1 dropped:0 overruns:0 frame:1 TX packets:5 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:3274 (3.1 KiB) Interrupt:11 root:/> route add default gw 10.16.71.254 root:/> route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 10.16.64.0 * 255.255.248.0 U 0 0 0 eth0 default 10.16.71.254 0.0.0.0 UG 0 0 0 eth0 root:/> ping -c 1 10.16.64.84 PING 10.16.64.84 (10.16.64.84): 56 data bytes --- 10.16.64.84 ping statistics --- 1 packets transmitted, 0 packets received, 100% packet loss <Tcpdump output nothing.> root:/> ping -c 1 10.16.71.254 PING 10.16.71.254 (10.16.71.254): 56 data bytes --- 10.16.71.254 ping statistics --- 1 packets transmitted, 0 packets received, 100% packet loss root:/> 02:37:38.600377 00:17:08:2a:08:34 > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.64.84 02:37:39.600326 00:17:08:2a:08:34 > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.64.84 02:37:40.600285 00:17:08:2a:08:34 > Broadcast, ethertype ARP (0x0806), length 60: arp who-has 10.16.71.254 tell 10.16.64.84 root:/> ethtool -S eth0 NIC statistics: tx_bytes: 1572 tx_zero_rexmt: 8 tx_one_rexmt: 0 tx_many_rexmt: 0 tx_late_collision: 0 tx_fifo_errors: 0 tx_carrier_errors: 0 tx_excess_deferral: 0 tx_retry_error: 0 rx_frame_error: 0 rx_extra_byte: 0 rx_late_collision: 0 rx_runt: 0 rx_frame_too_long: 0 rx_over_errors: 1 rx_crc_errors: 0 rx_frame_align_error: 0 rx_length_error: 0 rx_unicast: 0 rx_multicast: 47 rx_broadcast: 199 rx_packets: 246 rx_errors_total: 1 tx_errors_total: 0 root:/> ifconfig eth0 Link encap:Ethernet HWaddr 00:17:08:2A:08:34 inet addr:10.16.64.84 Bcast:10.16.71.255 Mask:255.255.248.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:1 dropped:0 overruns:0 frame:1 TX packets:8 errors:0 
dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:1572 (1.5 KiB) Interrupt:11 root:/> cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 2 64 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps Icmp: 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 2 0 0 0 0 4 0 0 0 0 0 IcmpMsg: OutType3 OutType8 IcmpMsg: 2 4 Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts Tcp: 1 200 120000 -1 0 0 0 0 0 0 0 0 0 0 Udp: InDatagrams NoPorts InErrors OutDatagrams Udp: 0 0 0 0 root:/> ethtool -i eth0 driver: forcedeth version: 0.60 firmware-version: bus-info: 0000:00:0a.0 root:/> tc -d qdisc qdisc pfifo_fast 0: dev eth0 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 root:/> ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 16384 RX Mini: 0 RX Jumbo: 0 TX: 16384 Current hardware settings: RX: 128 RX Mini: 0 RX Jumbo: 0 TX: 256
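On hp-xw9300-01 the two receive counters above disagree: the NIC statistics from ethtool -S report received frames, while ifconfig reports none. The following Python sketch extracts both numbers so the gap is explicit. The sample text is copied from the forcedeth output in this report; the function names are hypothetical.

```python
import re

ETHTOOL_SAMPLE = """\
NIC statistics:
     rx_over_errors: 1
     rx_unicast: 0
     rx_multicast: 47
     rx_broadcast: 199
     rx_packets: 246
     rx_errors_total: 1
"""

IFCONFIG_SAMPLE = "RX packets:0 errors:1 dropped:0 overruns:0 frame:1"


def ethtool_stats(text):
    """Parse `ethtool -S` output into a dict of integer counters."""
    pairs = re.findall(r"^\s*(\w+):\s*(\d+)\s*$", text, flags=re.MULTILINE)
    return {name: int(value) for name, value in pairs}


def ifconfig_rx_packets(text):
    """Return the RX packet count that ifconfig reports."""
    return int(re.search(r"RX packets:(\d+)", text).group(1))


stats = ethtool_stats(ETHTOOL_SAMPLE)
kernel_rx = ifconfig_rx_packets(IFCONFIG_SAMPLE)
print("frames counted by the NIC:   ", stats["rx_packets"])
print("packets counted by ifconfig: ", kernel_rx)
```

A non-zero hardware count next to a zero ifconfig count would be consistent with frames reaching the card but never being processed by the driver, for example because receive interrupts are not being serviced. That is only one possible reading and is not proven by these counters alone.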
# modinfo forcedeth filename: /lib/modules/2.6.18-124.el5/kernel/drivers/net/forcedeth.ko license: GPL description: Reverse Engineered nForce ethernet driver RHEL driver based on upstream driver version 0.60 Also includes additional upstream commits: 3ba4d093fe8a26f5f2da94411bf8732fa6e9da86 forcedeth: fix tx timeout fcc5f2665c81e087fb95143325ed769a41128d50 forcedeth: fix nic poll 6fedae1f6e66ab5f169bf58064e23e015fc1307d forcedeth: fix checksum feature in mcp65 caf96469e8ab57170cc8ca9c59809132d38e529e forcedeth: disable msix e0379a14fc80cb98978fa86989dab77b522a8106 forcedeth: fixed missing call in napi poll a7475906bc496456ded9e4b062f94067fb93057a forcedeth: msi bugfix 9e555930bd873d238f5f7b9d76d3bf31e6e3ce93 forcedeth: boot delay fix author: Manfred Spraul <manfred> srcversion: 52F782A3071D2A58B8F4D65 alias: pci:v000010DEd0000054Fsv*sd*bc*sc*i* alias: pci:v000010DEd0000054Esv*sd*bc*sc*i* alias: pci:v000010DEd0000054Dsv*sd*bc*sc*i* alias: pci:v000010DEd0000054Csv*sd*bc*sc*i* alias: pci:v000010DEd00000453sv*sd*bc*sc*i* alias: pci:v000010DEd00000452sv*sd*bc*sc*i* alias: pci:v000010DEd00000451sv*sd*bc*sc*i* alias: pci:v000010DEd00000450sv*sd*bc*sc*i* alias: pci:v000010DEd000003EFsv*sd*bc*sc*i* alias: pci:v000010DEd000003EEsv*sd*bc*sc*i* alias: pci:v000010DEd000003E6sv*sd*bc*sc*i* alias: pci:v000010DEd000003E5sv*sd*bc*sc*i* alias: pci:v000010DEd00000373sv*sd*bc*sc*i* alias: pci:v000010DEd00000372sv*sd*bc*sc*i* alias: pci:v000010DEd00000269sv*sd*bc*sc*i* alias: pci:v000010DEd00000268sv*sd*bc*sc*i* alias: pci:v000010DEd00000038sv*sd*bc*sc*i* alias: pci:v000010DEd00000037sv*sd*bc*sc*i* alias: pci:v000010DEd00000057sv*sd*bc*sc*i* alias: pci:v000010DEd00000056sv*sd*bc*sc*i* alias: pci:v000010DEd000000DFsv*sd*bc*sc*i* alias: pci:v000010DEd000000E6sv*sd*bc*sc*i* alias: pci:v000010DEd0000008Csv*sd*bc*sc*i* alias: pci:v000010DEd00000086sv*sd*bc*sc*i* alias: pci:v000010DEd000000D6sv*sd*bc*sc*i* alias: pci:v000010DEd00000066sv*sd*bc*sc*i* alias: 
pci:v000010DEd000001C3sv*sd*bc*sc*i* depends: vermagic: 2.6.18-124.el5 SMP mod_unload gcc-4.1 parm: max_interrupt_work:forcedeth maximum events handled per interrupt (int) parm: optimization_mode:In throughput mode (0), every tx & rx packet will generate an interrupt. In CPU mode (1), interrupts are controlled by a timer. (int) parm: poll_interval:Interval determines how frequent timer interrupt is generated by [(time_in_micro_secs * 100) / (2^10)]. Min is 0 and Max is 65535. (int) parm: msi:MSI interrupts are enabled by setting to 1 and disabled by setting to 0. (int) parm: msix:MSIX interrupts are enabled by setting to 1 and disabled by setting to 0. (int) parm: dma_64bit:High DMA is enabled by setting to 1 and disabled by setting to 0. (int) module_sig: 883f3504921ea663afb531d9ea12bf1125ac309f7fdd3b5d1bc63ec6dcdebdd2ca2a22da9ade38009f59264b749fe7bab54e8f2bb1c322eca4e753739
Created attachment 325184 [details] dmidecode for hp-xw4550-01 (07/20/2007)
Created attachment 325185 [details] dmidecode for hp-xw9300-01 (11/28/2006)
Cai, I'm really not sure what you want me to do with this problem. You claim it wasn't a transient network error, yet when we spoke about it over email, I tried it on this system and it worked just fine, several dumps over. I'll look at it again if you want, but if it's working properly, there's really not much I can do. And I really don't think that a transient problem noticed on one machine is a high-priority problem.
Neil, I have seen it on two HP XW machines, as you can see from the above. The failure rate is around 50% (it felt much higher on hp-xw4550 on IA-32, where I reproduced it almost every time). I have already reproduced it no fewer than 10 times, so I believe it is fully reproducible. It might be a transient network problem, but then I don't know why it is only seen in the kdump kernel. Because of this, I don't think the RHTS administrators will believe me that it is an RHTS issue. I can try, though. Also, do you have any suggestions?
I understand your frustration, and when I say transient network error, it may just as well be the NIC not being able to negotiate link on the wire, or a transient error in resetting the NIC. But regardless, both of those problems are screaming hardware to me. I just don't see what I'm going to be able to do about them. Given that we seem to be seeing so many odd behaviors on the HP XW series, it's possible that this is the BIOS bug that Prarit found in bz 456638. As such, the workaround may help here if we expand its coverage. I'm building a kernel for bz 473038 already, so I can expand its coverage to touch these systems as well. No promises, but it's worth a shot.
(In reply to comment #10)
> I understand your frustration, and when I say transient network error, it may
> just as well be the nic not able to negotiate link on the wire, or a transient
> error in resetting the NIC. But regardless both of those problems are
> screaming hardware to me. I just don't see what I'm going to be able to do
> about them.
>
Well, if you seriously suspect it is a hardware problem, we can ask the administrators to replace the network cards in these machines. Otherwise, they are blocking kexec/kdump testing to a remote host. If it is a software problem, customers will lose the ability to reliably save a VMCore to remote hosts. Therefore, if it is not clear to you how to fix it at the moment, we can leave it open for now.
It's not magic, Cai. It's a wide-ranging BIOS issue that Prarit fixed in the -125 kernel (as I was mentioning in comment #11). I was under the impression that it got fixed in -124, but apparently there was an issue and it's really in -125. You can tell it's there by the dmesg entry I gave you in comment 11. Given that the workaround Prarit has introduced for this bug has fixed several issues thus far with odd behavior on HP systems, I think it's worth a shot here. So please test with -125.el5.
Same thing with kernel-2.6.18-125.el5 and kexec-tools-1.102pre-54.el5.

Red Hat Enterprise Linux Server release 5.3 Beta (Tikanga)
Kernel 2.6.18-125.el5 on an x86_64

hp-xw9300-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18-125.el5 (mockbuild.redhat.com) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Mon Dec 1 17:38:25 EST 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 irqpoll maxcpus=1 reset_devices hda=cdrom memmap=exactmap memmap=640K@0K memmap=5176K@16384K memmap=125240K@22200K elfcorehdr=147440K
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fff9300 (usable)
 BIOS-e820: 000000003fff9300 - 0000000040000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 000000000150e000 (usable)
 user: 00000000015ae000 - 0000000008ffc000 (usable)
...
mapping eth0 to eth0
eth0: no link during initialization.
udhcpc (v1.2.0) started
irq 11: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 <IRQ> [<ffffffff800b8383>] __report_bad_irq+0x30/0x7d
 [<ffffffff800b85b6>] note_interrupt+0x1e6/0x227
 [<ffffffff800b7ab2>] __do_IRQ+0xbd/0x103
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff801b2d67>] serial8250_start_tx+0x0/0x90
 [<ffffffff80010bc2>] handle_IRQ_event+0x42/0xa6
 [<ffffffff800b7a99>] __do_IRQ+0xa4/0x103
 [<ffffffff8006c95d>] do_IRQ+0xe7/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 [<ffffffff801b2d67>] serial8250_start_tx+0x0/0x90
 [<ffffffff80012117>] __do_softirq+0x51/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cada>] do_softirq+0x2c/0x85
 [<ffffffff8006c962>] do_IRQ+0xec/0xf5
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff801b2d67>] serial8250_start_tx+0x0/0x90
 [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff80052d0c>] uart_write+0xdf/0xee
 [<ffffffff80019d65>] write_chan+0x212/0x305
 [<ffffffff8008a41f>] default_wake_function+0x0/0xe
 [<ffffffff800284f8>] tty_write+0x177/0x20e
 [<ffffffff80019b53>] write_chan+0x0/0x305
 [<ffffffff80016734>] vfs_write+0xce/0x174
 [<ffffffff80016fe9>] sys_write+0x2d/0x6e
 [<ffffffff80017001>] sys_write+0x45/0x6e
 [<ffffffff8005d116>] system_call+0x7e/0x83

handlers:
[<ffffffff881657b7>] (nv_nic_irq_optimized+0x0/0x227 [forcedeth])
Disabling IRQ #11
udhcpc[1125]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1125]: Sending discover...
eth0: link up.
Sending discover...
udhcpc[1125]: Sending discover...
Sending discover...
udhcpc[1125]: Sending discover...
No lease, failing.
udhcpc[1125]: No lease, failing.
eth0 failed to come up
Dropping to shell. exit to reboot
root:/>
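For reference, the memmap= arguments on the kdump kernel command line in this log appear to line up with the "user-defined physical RAM map" the kernel prints. A minimal Python sketch of that arithmetic follows. It assumes only the K-suffixed size@start form seen in this log, and parse_memmap is a made-up helper name, not anything from kexec-tools or the kernel.

```python
import re

def parse_memmap(cmdline):
    """Return (start, end) byte ranges for memmap=<size>K@<start>K arguments.

    Only the K-suffixed size@start form seen in the log is handled;
    'memmap=exactmap' and other suffixes are ignored in this sketch.
    """
    ranges = []
    for size_k, start_k in re.findall(r"memmap=(\d+)K@(\d+)K", cmdline):
        start = int(start_k) * 1024
        ranges.append((start, start + int(size_k) * 1024))
    return ranges

# Arguments copied from the "Command line:" entry in the console log.
cmdline = ("memmap=exactmap memmap=640K@0K memmap=5176K@16384K "
           "memmap=125240K@22200K elfcorehdr=147440K")

for start, end in parse_memmap(cmdline):
    # Same 16-digit hex layout the kernel uses for its RAM map lines.
    print(f"{start:016x} - {end:016x}")
```

The three ranges printed match the three "user:" lines in the log, and the last range ends at 147440K, which is where elfcorehdr= points.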
So, looking at the above output, this appears to be not so much a problem with DHCP failing, but rather a problem with link detection during restart in the forcedeth driver:

...
mapping eth0 to eth0
eth0: no link during initialization.

It looks like the NIC had an issue talking to the PHY over the MII interface after the kdump operation started. I wonder if perhaps an unconditional reset of the PHY would fix the problem. I'll try to put together a patch.
Cai, I was just talking to gospo about this, and he pointed out a few interesting details. Most notably, Nvidia NICs have a habit of creating byte-swapped MAC addresses. Could you please attach the tcpdump that you mentioned in the initial comment of this bug? Also, could you attach the output of ethtool -e and -d, as well as the ifconfig output of the interface in question? I'd like to compare MAC addresses to make sure that's not happening here. Thanks!
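The comparison being asked for can be sketched as a check of whether one MAC address is the byte-reversed form of another. This is a minimal Python sketch, assuming "byte swapped" means the six bytes appear in reverse order. The function names are made up for illustration, and the second address in the first usage line is simply the reversal of the MAC from the original report, not a value taken from the attachments.

```python
def mac_bytes(mac):
    """Parse 'aa:bb:cc:dd:ee:ff' into a list of six ints."""
    parts = mac.strip().split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return [int(p, 16) for p in parts]

def is_byte_swapped(mac_a, mac_b):
    """True if mac_b is mac_a with its six bytes in reverse order."""
    return mac_bytes(mac_a) == mac_bytes(mac_b)[::-1]

# MAC from the original report versus its hypothetical reversed form.
print(is_byte_swapped("00:17:08:2a:08:34", "34:08:2a:08:17:00"))  # True
# Same address in both places (case differs only): not swapped.
print(is_byte_swapped("00:17:08:2A:08:34", "00:17:08:2a:08:34"))  # False
```

Feeding it the HWaddr from ifconfig in the normal kernel and in the kdump kernel would show whether the two views of the address disagree in this particular way.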
Created attachment 326293 [details] Ifconfig from Normal Kernel on hp-xw9300
Created attachment 326294 [details] Ethtool from Normal Kernel on hp-xw9300
Created attachment 326295 [details] Ifconfig from Kdump Kernel on hp-xw9300
Created attachment 326296 [details] Ethtool from Kdump Kernel on hp-xw9300
Created attachment 326297 [details] Tcpdump from Kdump Kernel on hp-xw9300
Created attachment 327265 [details]
patch to map all reserved region of ram into kdump kernel

Cai, given that this is happening on HP machines that have had a slew of problems until recently, I think it's worth giving this patch a test. Doug Chapman recently found that kdump doesn't map reserved e820 sections in the kdump kernel, but probably should (some BIOS vendors mark ACPI space as reserved erroneously, or map other important config data there). Anywho, I wonder if this bug isn't some weird result of not having all the ACPI tables present during kdump boot. This patch, applied to the latest kexec-tools (-56.el5), should map those regions on x86_64 hardware. We're planning to propose something like this upstream soon, and a test here to see if this is another of those bugs that would be solved with this patch would be good. If you could please give this patch a try, I'd appreciate it. Thanks!

Oh, btw, it's best to use the latest kexec-tools with at least kernel-2.6.18-127.el5, as that has the kernel patch to allow masking of the GART region of these systems (to avoid potential resets during vmcore copying).
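To see which regions such a patch would be concerned with on this machine, the reserved entries can be pulled out of the BIOS-e820 lines in the earlier console log. This is a rough Python sketch, assuming the "BIOS-e820: start - end (type)" line format shown in that log. The helper reserved_regions is hypothetical and is not part of kexec-tools or the attached patch.

```python
import re

# Matches lines such as:
#   BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
E820_LINE = re.compile(
    r"BIOS-e820:\s+([0-9a-f]{16}) - ([0-9a-f]{16}) \(([^)]+)\)")

def reserved_regions(boot_log):
    """Return (start, end) pairs for every e820 entry typed 'reserved'."""
    return [(int(start, 16), int(end, 16))
            for start, end, kind in E820_LINE.findall(boot_log)
            if kind == "reserved"]

# e820 map copied from the hp-xw9300 console log earlier in this bug.
boot_log = """\
BIOS-e820: 0000000000000100 - 000000000009f000 (usable)
BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
BIOS-e820: 0000000000100000 - 000000003fff9300 (usable)
BIOS-e820: 000000003fff9300 - 0000000040000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
"""

for start, end in reserved_regions(boot_log):
    print(f"{start:#012x} - {end:#012x}  ({(end - start) // 1024} KiB)")
```

None of the three reserved ranges it finds appear in the "user-defined physical RAM map" of that same log, which is consistent with the observation that the kdump kernel is not given the reserved sections.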
Neil, I have just tried kernel-2.6.18-128.el5 and kexec-tools-1.102pre-56.el5_3.1, which I believe includes the patch you mentioned, on hp-xw9300-01.rhts.bos.redhat.com, but the problem is the same.
Yeah, that should have what you need. Ok, do you have this system reserved? Can I hop on it and poke around?
Sorry, I don't have the machine reserved at the moment. The machine can be reserved via RHTS webUI, http://rhts.redhat.com/cgi-bin/rhts/reserve_workflow.cgi If you have any problem, let me know.
Ok, I'm reserving it. Thanks.

Note to self: There's a bunch of stuff upstream about forcedeth that's interesting. I should try rebasing.
Ok, this is definitely isolated to the forcedeth driver, specifically to its ability to DMA to highmem regions. I added:

options forcedeth dma_64bit=0

to /etc/kdump.conf and this system was able to dump over a network a-ok. Given that this is a DMA location problem, I'm guessing that we are looking at a hardware idiosyncrasy here. As I noted before, there are several upstream commits that may have an influence on this, which we can look into. I see three possible options:

1) Investigate the upstream changes and cherry-pick any that we find that correct this situation, or simply perform a wholesale update of the driver from upstream.
2) Add code to forcedeth to trigger on the reset_devices kernel command line option to suppress the use of 64-bit DMA.
3) Add a release note indicating that this driver may need to have the dma_64bit module option added to kdump.conf.

Of those three, I think (3) is the best option. Option (1) is definitely good, but there's no guarantee that the upstream suspects fix this problem. Option (2) just seems like a hack. I would say, let's go with option (3). I'll also try a wholesale backport of the latest forcedeth driver and see if that fixes the problem. If so, I'll talk to gospo about pushing to 5.4. In the interim, this can be release-noted (or a kbase article written so that people on 5.3 will know how to get it working). Thoughts?
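The workaround described here amounts to one extra line in /etc/kdump.conf. This is a small Python sketch of adding it without duplicating it on repeated runs. The function name is made up, the sample text reuses the kdump.conf contents from the original report, and the sketch works on a string rather than touching the real file.

```python
WORKAROUND = "options forcedeth dma_64bit=0"

def add_workaround(conf_text, line=WORKAROUND):
    """Return conf_text with `line` appended unless it is already present."""
    lines = conf_text.splitlines()
    if any(existing.strip() == line for existing in lines):
        return conf_text  # already configured, leave the text untouched
    lines.append(line)
    return "\n".join(lines) + "\n"

# kdump.conf contents as given in the steps to reproduce.
sample = "net server@nfs\ndefault shell\n"
updated = add_workaround(sample)
print(updated, end="")
```

Running add_workaround on its own output returns the text unchanged, so the step is safe to repeat. The kdump service would still need to be restarted afterwards so the kdump initrd picks up the changed configuration.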
Yes, adding a release note sounds good to me. Then, because the other affected machine hp-xw4550 is using tg3 instead, I guess I'll need to file another bug for it. Is that correct? Thanks.
Ok, I'll talk to andy about the forcedeth driver update, and write a release note here. As for tg3, yeah a new bug would be good.
Ok, I've spoken with Andy about a possible update of the forcedeth driver in 5.4 and we will look into it. In the interim, I'm closing this as deferred, and I've added the release note text above. Thanks!
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Some forcedeth based cards in 5.3 may have some difficulty in accessing memory above 4GB during operation in a kdump kernel. There is a direct workaround for this bug. In the file /etc/sysconfig/kdump, add the following:

KDUMP_COMMANDLINE_APPEND="dma_64bit=0"

This will prevent the forcedeth network driver from using high memory resources in the kdump kernel and allow the network to function properly.
Still seeing it on RHEL5.4. I guess we'll still need to carry the release note.

hp-xw9300-01.rhts.bos.redhat.com (x86_64)
kexec-tools-1.102pre-77.el5
kernel-2.6.18-159.el5
RHEL5.4-Server-20090715.0

Serial console messages from the kdump kernel:
....
mapping eth0 to eth0
eth0: no link during initialization.
eth0 Link Up. Waiting 60 Seconds
eth0: link up.
Continuing
udhcpc (v1.2.0) started
udhcpc[1222]: udhcpc (v1.2.0) started
Sending discover...
udhcpc[1222]: Sending discover...
Sending discover...
udhcpc[1222]: Sending discover...
Sending discover...
udhcpc[1222]: Sending discover...
No lease, failing.
udhcpc[1222]: No lease, failing.
eth0 failed to come up
md: stopping all md devices.
Synchronizing SCSI cache for disk sda:
ACPI: PCI interrupt for device 0000:00:0a.0 disabled
Restarting system.
.
machine restart
Hmm, I wonder if the dma issue is hardware based rather than simply a code issue. I'll talk to gospo.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1,5 @@
-Some forcedeth based cards in 5.3 may have some difficulty in accessing memory above 4GB during operation in a kdump kernel. There is a direct workaround for this bug. In the file /etc/sysconfig/kdump, add the following:
+Some <filename>forcedeth</filename> based devices may encounter difficulty accessing memory above 4GB during operation in a <filename>kdump</filename> kernel. To work around this issue, add the following line to the <filename>/etc/sysconfig/kdump</filename> file:
+<screen>
 KDUMP_COMMANDLINE_APPEND="dma_64bit=0"
-
+</screen>
-This will prevent the forcedeth network driver from using high memory resources in the kdump kernel and allow the network to function properly.
+This work around prevents the forcedeth network driver from using high memory resources in the kdump kernel, allowing the network to function properly.