=Comment: #0=================================================
Chirag H. Jog1 <chirag.jog.com> - 2008-05-27 09:06 EDT
Triggering a kdump causes the system to hang midway. This is seen on both a-pros (rt-apple and elm3b252). Console log is attached.
=Comment: #1=================================================
Chirag H. Jog1 <chirag.jog.com> - 2008-05-27 09:11 EDT
Console log while triggering kdump
=Comment: #2=================================================
Chirag H. Jog1 <chirag.jog.com> - 2008-05-27 11:45 EDT
A newer version of kexec-tools from the RHEL5.2 repo doesn't solve the problem.
Created attachment 306804 [details] Console log while triggering kdump
------- Comment From dvhltc.com 2008-05-27 17:45 EDT------- IBM Intellistation A-pro is a 2 way AMD Opteron x86_64 workstation, typically 4 to 8 GB of memory. Chirag can you confirm?
------- Comment From chirag.jog.com 2008-05-28 01:20 EDT------- (In reply to comment #9) > IBM Intellistation A-pro is a 2 way AMD Opteron x86_64 workstation, typically 4 > to 8 GB of memory. Chirag can you confirm? Ack, 2 way AMD Opteron with 4 GB memory
------- Comment From ankigarg.com 2008-06-09 04:57 EDT------- So I tried kdump on another a-pro today. The kdump kernel gave an oops. Attaching the kdump log.
Created attachment 308672 [details] kdump_boot_log_with_oops
------- Comment From chirag.jog.com 2008-07-08 05:47 EDT------- Using RHEL stock kernel as the kdump kernel works fine.
------- Comment From chirag.jog.com 2008-07-08 06:39 EDT------- Using 2.6.25.10 relocatable kernel as both first and kdump kernel works fine.
------- Comment From ssant.com 2008-07-10 01:14 EDT------- Could you please add the initcall_debug boot parameter to the kdump kernel command line? This would tell us which init routine is stuck.
Created attachment 311488 [details] Console log while triggering kdump with initcall_debug
Attaching logs with initcall_debug added as a parameter to the kdump kernel.
------- Comment From ankigarg.com 2008-07-30 23:59 EDT------- Sent mail to the kexec list regarding this issue. http://lists.infradead.org/pipermail/kexec/2008-July/002263.html
------- Comment From ankigarg.com 2008-07-31 07:15 EDT------- Am currently working on this by instrumenting the kernel. Will report my findings once I have something.
Created attachment 313162 [details] kdump kernel boot log with CONFIG_PCI_DEBUG on
Attached is the kdump kernel boot log with CONFIG_PCI_DEBUG on and the following command line passed to the kdump kernel:

root=/dev/sda3 ro console=tty1 console=ttyS0,19200 irqpoll maxcpus=1 reset_devices initcall_debug acpi.debug_layer=acpi acpi.debug_level=acpi debug
------- Comment From ssant.com 2008-08-01 03:18 EDT-------
Looking at the latest attached log, I see that the ACPI-related information is correctly passed to the kdump kernel.

cat /proc/iomem | grep ACPI shows

aff60000-aff71fff : ACPI Tables
aff72000-aff7ffff : ACPI Non-volatile Storage

These values are reflected in the kdump kernel command line.

Command line: root=/dev/sda3 ro console=tty1 console=ttyS0,19200 irqpoll maxcpus=1 reset_devices initcall_debug acpi.debug_layer=acpi acpi.debug_level=acpi debug memmap=exactmap memmap=640K@0K memmap=130416K@33408K elfcorehdr=163824K memmap=72K#2882944K memmap=56K#2883016K

The last two memmap= options correspond to the ACPI tables.

So that probably eliminates one more suspect.

Another interesting thing from the attached log:

The machine seems to hang while scanning PCI devices (at least those are the last messages printed on the console). From the lspci output, the device in question seems to be an nVidia card.

PCI: Scanning behind PCI bridge 0000:85:00.0, config 868685, pass 0
PCI: Scanning bus 0000:86
PCI: Found 0000:86:00.0 [10de/029f] 000300 00
^^^^^^^^^^^^^^^^^^^^^^^

Here is the corresponding output from lspci:

85:01.0 PCI bridge: nVidia Corporation Unknown device 01b3 (rev a3)
86:00.0 VGA compatible controller: nVidia Corporation G70 [Quadro FX 4500 X2] (rev a1)
^^^^^^^^^^^^^^^^^^^^^^^^^
87:00.0 3D controller: nVidia Corporation G70 [Quadro FX 4500 X2] (rev a1)

Just out of curiosity, do all the other machines on which we are facing this kdump issue also have this same controller?

Are there any differences in the rt-level nVidia driver code as compared to vanilla kernels or RHEL kernels?

We can't conclude that this is the root cause of this kdump bug, but it might help in arriving at the root cause.
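For anyone who wants to re-check the ACPI arithmetic: the sketch below converts a /proc/iomem range into the memmap=<size>K#<start>K form seen on the kdump command line. The function name iomem_to_memmap is made up for this note, and the code assumes the ranges are 1 KiB aligned and that the end address is inclusive, which is what the two ranges quoted above suggest. It is not how kexec-tools itself builds the command line, only a way to verify the numbers by hand.

```python
def iomem_to_memmap(line):
    """Turn one '/proc/iomem'-style line into a 'memmap=<size>K#<start>K' string.

    Hypothetical helper for checking numbers by hand; not taken from kexec-tools.
    """
    span, _, _label = line.partition(" : ")
    start_hex, end_hex = span.strip().split("-")
    start = int(start_hex, 16)
    end = int(end_hex, 16)          # end address is inclusive
    size = end - start + 1
    # Assumption: both the start and the size fall on 1 KiB boundaries.
    if start % 1024 or size % 1024:
        raise ValueError("range is not 1 KiB aligned: " + line)
    return "memmap={}K#{}K".format(size // 1024, start // 1024)


if __name__ == "__main__":
    for entry in (
        "aff60000-aff71fff : ACPI Tables",
        "aff72000-aff7ffff : ACPI Non-volatile Storage",
    ):
        print(iomem_to_memmap(entry))
```

Run against the two ACPI ranges, this should print memmap=72K#2882944K and memmap=56K#2883016K, matching the last two options on the command line.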
Created attachment 313166 [details] first kernel boot log with CONFIG_PCI_DEBUG on
First kernel boot log for comparison.
------- Comment From ankigarg.com 2008-08-01 03:24 EDT-------
(In reply to comment #22)
> Looking at the latest log attached i see that the ACPI related information is
> correctly passed to the kdump kernel.
>
> cat /proc/iomem | grep ACPI shows
>
> aff60000-aff71fff : ACPI Tables
> aff72000-aff7ffff : ACPI Non-volatile Storage
>
> These values are reflected in the kdump kernel command line.
>
> Command line: root=/dev/sda3 ro console=tty1 console=ttyS0,19200 irqpoll
> maxcpus=1 reset_devices initcall_debug acpi.debug_layer=acpi
> acpi.debug_level=acpi debug memmap=exactmap memmap=640K@0K
> memmap=130416K@33408K elfcorehdr=163824K memmap=72K#2882944K
> memmap=56K#2883016K
>
> The last two memmap= options corresponds to the ACPI tables.
>
Hi Sachin, yes, that is true.

> So that probably eliminates one more suspect.
>
> Other interesting thing from the attached log ..
>
> Seems like the machine hangs while scanning PCI devices ( at least those are the
> last messages printed on the console ). From lspci output the device in question
> seems to be a nVidia card.
>
> PCI: Scanning behind PCI bridge 0000:85:00.0, config 868685, pass 0
> PCI: Scanning bus 0000:86
> PCI: Found 0000:86:00.0 [10de/029f] 000300 00
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> Here is the corresponding o/p from lspci
>
> 85:01.0 PCI bridge: nVidia Corporation Unknown device 01b3 (rev a3)
> 86:00.0 VGA compatible controller: nVidia Corporation G70 [Quadro FX 4500 X2]
> (rev a1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> 87:00.0 3D controller: nVidia Corporation G70 [Quadro FX 4500 X2] (rev a1)
>
So I am now trying on an a-pro which does not have the Quadro FX card. Will report my findings.

> Just out of curiosity are all other machine on which we are facing kdump issue
> also have this same controller ?
>
> Are there any differences in the rt level nVidia driver code as compared to
> vanilla kernels or RHEL kernels ? Although we can't conclude that this is the
> root cause for this kdump bug but might help in arriving at the root cause.
>
I think that is correct. We are apparently using out-of-tree nvidia drivers, provided by nvidia.
------- Comment From ankigarg.com 2008-08-01 04:32 EDT-------
Nice..so looks like the issue is with the nvidia drivers!

I tried kdump on elm3b251, which is an intellistation a-pro.

[root@elm3b251 2008-08-01-04:04]#lspci
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:07.5 Multimedia audio controller: Advanced Micro Devices [AMD] AMD-8111 AC97 Audio (rev 03)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:03.0 USB Controller: NEC Corporation USB (rev 43)
01:03.1 USB Controller: NEC Corporation USB (rev 43)
01:03.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
01:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
02:01.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
02:01.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
03:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
80:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
81:00.0 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 03)
81:00.1 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 03)
83:04.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
83:04.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
84:00.0 VGA compatible controller: nVidia Corporation NV45GL [Quadro FX 3400/4400] (rev a2)

lspci output from rt-apple that has the problem nvidia card:

[root@rt-apple ~]# lspci
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:07.5 Multimedia audio controller: Advanced Micro Devices [AMD] AMD-8111 AC97 Audio (rev 03)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:02.0 Mass storage controller: Silicon Image, Inc. SiI 3512 [SATALink/SATARaid] Serial ATA Controller (rev 01)
01:03.0 USB Controller: NEC Corporation USB (rev 43)
01:03.1 USB Controller: NEC Corporation USB (rev 43)
01:03.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
01:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
03:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit Ethernet (rev 02)
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
80:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
81:00.0 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 03)
81:00.1 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 03)
83:04.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
83:04.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
84:00.0 PCI bridge: nVidia Corporation Unknown device 01b3 (rev a3)
85:00.0 PCI bridge: nVidia Corporation Unknown device 01b3 (rev a3)
85:01.0 PCI bridge: nVidia Corporation Unknown device 01b3 (rev a3)
86:00.0 VGA compatible controller: nVidia Corporation G70 [Quadro FX 4500 X2] (rev a1)
87:00.0 3D controller: nVidia Corporation G70 [Quadro FX 4500 X2] (rev a1)
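The two lspci listings above are long enough that the differences are easy to miss by eye. A rough sketch of how they could be compared: the helper names parse_lspci and only_in_second are invented for this note, and the comparison is done on the device description rather than the slot number, on the assumption that slot numbers may shift between machines. The sample strings are short excerpts of the listings, not the full output.

```python
def parse_lspci(text):
    """Map each PCI slot (e.g. '86:00.0') to its description text."""
    devices = {}
    for line in text.splitlines():
        slot, sep, description = line.strip().partition(" ")
        # Keep only lines that start with something shaped like 'bb:dd.f'.
        if sep and slot.count(":") == 1 and "." in slot:
            devices[slot] = description
    return devices


def only_in_second(first_text, second_text):
    """List devices whose description appears in the second listing only."""
    known = set(parse_lspci(first_text).values())
    return [
        "{} {}".format(slot, description)
        for slot, description in parse_lspci(second_text).items()
        if description not in known
    ]


# Excerpts only; paste the full listings to compare whole machines.
ELM3B251 = """\
83:04.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
84:00.0 VGA compatible controller: nVidia Corporation NV45GL [Quadro FX 3400/4400] (rev a2)
"""
RT_APPLE = """\
83:04.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
84:00.0 PCI bridge: nVidia Corporation Unknown device 01b3 (rev a3)
86:00.0 VGA compatible controller: nVidia Corporation G70 [Quadro FX 4500 X2] (rev a1)
87:00.0 3D controller: nVidia Corporation G70 [Quadro FX 4500 X2] (rev a1)
"""

if __name__ == "__main__":
    for entry in only_in_second(ELM3B251, RT_APPLE):
        print(entry)
```

Because identical descriptions collapse into one, a device that is present twice on one machine and once on the other will not be flagged; that is a limitation of this quick check.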
------- Comment From ankigarg.com 2008-08-01 08:19 EDT------- Updated the results on the kexec mailing list.
------- Comment From willschm.com 2008-08-01 11:39 EDT-------
(In reply to comment #21)
PCI: Scanning bus 0000:86
PCI: Found 0000:86:00.0 [10de/029f] 000300 00
PCI: Fixups for bus 0000:86
PCI: Bus scan for 0000:86 returning with max=86
PCI: Scanning behind PCI bridge 0000:85:01.0, config 878785, pass 0
PCI: Scanning bus 0000:87
PCI: Found 0000:87:00.0 [10de/029f] 000302 00
PCI: Fixups for bus 0000:87
... <continues>

(In reply to comment #20)
PCI: Scanning behind PCI bridge 0000:85:00.0, config 868685, pass 0
PCI: Scanning bus 0000:86
PCI: Found 0000:86:00.0 [10de/029f] 000300 00
<hang occurs here>

(just a couple random questions, no solutions :-( )

Can you enable/do you see any of the KERN_INFO output that would come from the nvidia_bugs function in early-quirks.c?

That kernel boots OK normally, right? It's just in the kdump scenario that it hangs?

One can manually walk through some of the PCI probing by xxd'ing the /sys/devices/pcixxxx contents. Not sure if that would be helpful.
Changing the summary to indicate that kdump fails only on a-pro with the NVIDIA Quadro FX 4500 X2 card.
Since I was not sure whether the driver code is even called at the time of PCI subsystem initialization, I tried kdump without the nvidia driver compiled as a module, passing a generic vga argument to the kdump kernel. The kdump kernel still hung. So maybe some part of the PCI code is not able to initialize the card properly. Not at all sure; we should seek help from the nvidia folks.
I don't know if Red Hat has easy contact with NVIDIA, but I do. I have other reasons for contacting them, anyway, so will piggyback this issue with them today.
One clarification that I just confirmed:
- rt-apple has an NVIDIA Quadro FX 4500 X2, but
- elm3b252 has an NVIDIA Quadro FX 4500 (not a 4500 X2)

I've updated the subject to remove "X2", as the X2 is essentially just two 4500s strapped together, so only one is really required for this bug, apparently.
(In reply to comment #30)
> one clarification that I just confirmed:
> - rt-apple has an NVIDIA Quadro FX 4500 X2, but
> - elm3b252 has an NVIDIA Quadro FX 4500
> (not a 4500 X2)
>
> I've updated the subject to remove "X2", as the X2 is essentially just two 4500s
> strapped together, so only one is really required for this bug, apparently.

I tried kdump on elm3b252. The kdump kernel did hang, but the hang somehow looked different than the one seen on rt-apple (with two 4500s).
Paul, will it be possible for you to take out the nvidia card from rt-apple for a day ? Might serve as a good confirmation...
I collected some debug information related to IO/local APIC register values using the print_local_APIC() and print_IO_APIC() functions. I am not very well versed in the APIC code, so I am not sure how to interpret these results. Is there someone who could help decode the information?

Here is the output I collected. This data was collected from the kdump kernel just before PCI subsystem initialization.

ACPI: Interpreter enabled
***Here is the info you requested
number of MP IRQ sources: 15.
number of IO-APIC #2 registers: 24.
number of IO-APIC #3 registers: 4.
number of IO-APIC #4 registers: 4.
number of IO-APIC #5 registers: 24.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
....... : physical APIC id: 02
.... register #01: 00170011
....... : max redirection entries: 0017
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 02000000
....... : arbitration: 02
.... IRQ redirection table:
NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
00 000 1 0 0 0 0 0 0 00
01 001 0 0 0 0 0 1 1 31
02 001 0 0 0 0 0 1 1 30
03 001 0 0 0 0 0 1 1 33
04 001 0 0 0 0 0 1 1 34
05 001 0 0 0 0 0 1 1 35
06 001 0 0 0 0 0 1 1 36
07 001 0 0 0 0 0 1 1 37
08 001 0 0 0 0 0 1 1 38
09 001 0 1 0 1 0 1 1 39
0a 001 0 0 0 0 0 1 1 3A
0b 001 0 0 0 0 0 1 1 3B
0c 001 0 0 0 0 0 1 1 3C
0d 001 0 0 0 0 0 1 1 3D
0e 001 0 0 0 0 0 1 1 3E
0f 001 0 0 0 0 0 1 1 3F
10 000 1 0 0 0 0 0 0 00
11 000 1 0 0 0 0 0 0 00
12 000 1 0 0 0 0 0 0 00
13 000 1 0 0 0 0 0 0 00
14 000 1 0 0 0 0 0 0 00
15 000 1 0 0 0 0 0 0 00
16 000 1 0 0 0 0 0 0 00
17 000 1 0 0 0 0 0 0 00

IO APIC #3......
.... register #00: 03000000
....... : physical APIC id: 03
.... register #01: 00030011
....... : max redirection entries: 0003
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
00 000 1 0 0 0 0 0 0 00
01 000 1 0 0 0 0 0 0 00
02 000 1 0 0 0 0 0 0 00
03 000 1 0 0 0 0 0 0 00

IO APIC #4......
.... register #00: 04000000
....... : physical APIC id: 04
.... register #01: 00030011
....... : max redirection entries: 0003
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
00 000 1 0 0 0 0 0 0 00
01 000 1 0 0 0 0 0 0 00
02 000 1 0 0 0 0 0 0 00
03 000 1 0 0 0 0 0 0 00

IO APIC #5......
.... register #00: 05000000
....... : physical APIC id: 05
.... register #01: 00170011
....... : max redirection entries: 0017
....... : PRQ implemented: 0
....... : IO APIC version: 0011
.... register #02: 05000000
....... : arbitration: 05
.... IRQ redirection table:
NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
00 000 1 0 0 0 0 0 0 00
01 000 1 0 0 0 0 0 0 00
02 000 1 0 0 0 0 0 0 00
03 000 1 0 0 0 0 0 0 00
04 000 1 0 0 0 0 0 0 00
05 000 1 0 0 0 0 0 0 00
06 000 1 0 0 0 0 0 0 00
07 000 1 0 0 0 0 0 0 00
08 000 1 0 0 0 0 0 0 00
09 000 1 0 0 0 0 0 0 00
0a 000 1 0 0 0 0 0 0 00
0b 000 1 0 0 0 0 0 0 00
0c 000 1 0 0 0 0 0 0 00
0d 000 1 0 0 0 0 0 0 00
0e 000 1 0 0 0 0 0 0 00
0f 000 1 0 0 0 0 0 0 00
10 000 1 0 0 0 0 0 0 00
11 000 1 0 0 0 0 0 0 00
12 000 1 0 0 0 0 0 0 00
13 000 1 0 0 0 0 0 0 00
14 000 1 0 0 0 0 0 0 00
15 000 1 0 0 0 0 0 0 00
16 000 1 0 0 0 0 0 0 00
17 000 1 0 0 0 0 0 0 00

IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:5
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:9
IRQ10 -> 0:10
IRQ11 -> 0:11
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
.................................... done.

Does the above collected data suggest anything which could help us resolve this bug?
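On the question of decoding the dump: as a partial answer, the three raw register words of each IO-APIC can be split into the fields the dump prints next to them. The sketch below is only that, under the assumption that the fields sit at the usual IO-APIC bit positions (ID in bits 24 and up of register #00, version in bits 0-7, PRQ in bit 15 and max redirection entry in bits 16-23 of register #01, arbitration in bits 24-27 of register #02). The function name is made up for this note, and it says nothing about why the hang happens.

```python
def decode_ioapic_registers(reg00, reg01, reg02):
    """Split the three IO-APIC register words into the fields shown in the dump.

    Bit positions are an assumption based on the usual IO-APIC register layout.
    """
    max_entry = (reg01 >> 16) & 0xFF
    return {
        "physical_apic_id": (reg00 >> 24) & 0xFF,
        "version": reg01 & 0xFF,
        "prq_implemented": (reg01 >> 15) & 0x1,
        "max_redirection_entries": max_entry,
        # The redirection table is indexed 0..max_entry, so it has one more row.
        "redirection_table_rows": max_entry + 1,
        "arbitration": (reg02 >> 24) & 0x0F,
    }


if __name__ == "__main__":
    # Register values copied from the dump above.
    dumps = {
        2: (0x02000000, 0x00170011, 0x02000000),
        3: (0x03000000, 0x00030011, 0x00000000),
        4: (0x04000000, 0x00030011, 0x00000000),
        5: (0x05000000, 0x00170011, 0x05000000),
    }
    for apic, regs in sorted(dumps.items()):
        print("IO APIC #%d: %r" % (apic, decode_ioapic_registers(*regs)))
```

With the values from the dump, IO APIC #2 and #5 each decode to 24 redirection table rows and #3 and #4 to 4 rows, which agrees with the "number of IO-APIC #n registers" lines at the top of the output.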
For now we have adopted a workaround for this issue: we are using the RHEL kernel as the kdump kernel. However, since the issue is still valid, I am leaving it open and transferring ownership to Chandru from the IS team, as kdump bugs are owned by the IS team.
Chandru, please feel free to change the component and other details as per your requirement.
We have decided to go with the workaround of using the RHEL kernel as the kdump kernel for the current release. Hence I am rejecting this bug as ALT_SOLUTION_AVAIL. We are going to work on the problem of using the real-time kernel as the kdump kernel, but we will raise a new bug for that.
closing