| Summary: | Crashes occurring when motherboard sound occurs during high network load | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Bruno Wolff III <bruno> | ||||
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||
| Severity: | unspecified | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | rawhide | CC: | bruno, dledford, gansalmon, itamar, jforbes, jonathan, kernel-maint, madhu.chinakonda | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | i686 | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2015-02-24 18:44:48 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
|
Description
Bruno Wolff III
2011-03-12 13:31:12 UTC
Does this still happen with the latest kernels (and this really is a kernel issue, not an mdadm issue)?

I can use 2.6.38 kernels now as I can run my AGP video card in PCI mode to avoid those crashes. This could also be related to problems I have had related to the driver for onboard sound. (That seems to trigger crashes when simultaneously transferring large files over the network. I am not sure if it is the disk IO or the network IO.) I'll keep an eye out for recurrences, and when I have time to deal with it, I can trigger a resync and try activities to make the system crash.

Yes, I think this is really kernel related, rather than raid related. The problem typically occurs either when raid arrays are rebuilding or when I am transferring a large file over the local network (usually not when transferring files remotely at 1.5 Mb/s). I just had the problem occur when I backspaced an extra time, causing an audible bell to be played while a large scp was running in another window to another machine on the LAN. The bell sound appeared to complete, but then the system was dead. I am running 2.6.38.2-10.fc15.i686.PAE.

I do not have this problem when I am using my USB headset instead of the motherboard sound card. (But when I do that, I have a different problem related to the USB drivers.)
[root@bruno bruno]# lspci -v
00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
    Flags: bus master, 66MHz, medium devsel, latency 32
    Memory at d0000000 (32-bit, prefetchable) [size=256M]
    Memory at f5000000 (32-bit, prefetchable) [size=4K]
    I/O ports at e800 [disabled] [size=4]
    Capabilities: [a0] AGP version 2.0
    Kernel driver in use: agpgart-amdk7
    Kernel modules: amd76x_edac

00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge (prog-if 00 [Normal decode])
    Flags: bus master, 66MHz, medium devsel, latency 32
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
    I/O behind bridge: 0000b000-0000bfff
    Memory behind bridge: f0000000-f1ffffff
    Prefetchable memory behind bridge: e0000000-efffffff

00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
    Flags: bus master, 66MHz, medium devsel, latency 0

00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04) (prog-if 8a [Master SecP PriP])
    Flags: bus master, medium devsel, latency 32
    [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8]
    [virtual] Memory at 000003f0 (type 3, non-prefetchable) [size=1]
    [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8]
    [virtual] Memory at 00000370 (type 3, non-prefetchable) [size=1]
    I/O ports at f000 [size=16]
    Kernel driver in use: pata_amd
    Kernel modules: pata_amd

00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
    Flags: medium devsel
    Kernel driver in use: amd756_smbus
    Kernel modules: i2c-amd756, amd-rng

00:07.5 Multimedia audio controller: Advanced Micro Devices [AMD] AMD-768 [Opus] Audio (rev 03)
    Flags: bus master, medium devsel, latency 32, IRQ 17
    I/O ports at c000 [size=256]
    I/O ports at c400 [size=64]
    Kernel driver in use: Intel ICH
    Kernel modules: snd-intel8x0

00:08.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface
    Subsystem: Device b1d9:0003
    Flags: bus master, medium devsel, latency 32, IRQ 16
    I/O ports at d000 [size=256]
    Memory at f5001000 (32-bit, non-prefetchable) [size=4K]
    Capabilities: [40] Power Management version 2
    Kernel driver in use: wctdm
    Kernel modules: wctdm, hisax, netjet

00:09.0 RAID bus controller: HighPoint Technologies, Inc. HPT302/302N (rev 01)
    Subsystem: HighPoint Technologies, Inc. Device 0001
    Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 17
    I/O ports at d400 [size=8]
    I/O ports at d800 [size=4]
    I/O ports at dc00 [size=8]
    I/O ports at e000 [size=4]
    I/O ports at e400 [size=256]
    [virtual] Expansion ROM at 80100000 [disabled] [size=128K]
    Capabilities: [60] Power Management version 2
    Kernel driver in use: pata_hpt37x
    Kernel modules: pata_hpt3x2n, pata_hpt37x

00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05) (prog-if 00 [Normal decode])
    Flags: bus master, 66MHz, medium devsel, latency 32
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=32
    I/O behind bridge: 00008000-0000afff
    Memory behind bridge: f3000000-f4ffffff
    Prefetchable memory behind bridge: 80000000-800fffff

01:05.0 VGA compatible controller: ATI Technologies Inc RV280 [Radeon 9200] (rev 01) (prog-if 00 [VGA controller])
    Subsystem: C.P. Technology Co. Ltd Device 2062
    Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 17
    Memory at e0000000 (32-bit, prefetchable) [size=128M]
    I/O ports at b000 [size=256]
    Memory at f1000000 (32-bit, non-prefetchable) [size=64K]
    [virtual] Expansion ROM at f0000000 [disabled] [size=128K]
    Capabilities: [58] AGP version 2.0
    Capabilities: [50] Power Management version 2
    Kernel driver in use: radeon
    Kernel modules: radeon, radeonfb

01:05.1 Display controller: ATI Technologies Inc RV280 [Radeon 9200] (Secondary) (rev 01)
    Subsystem: C.P. Technology Co. Ltd Device 2063
    Flags: bus master, 66MHz, medium devsel, latency 32
    Memory at e8000000 (32-bit, prefetchable) [size=128M]
    Memory at f1010000 (32-bit, non-prefetchable) [size=64K]
    Capabilities: [50] Power Management version 2

02:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-768 [Opus] USB (rev 07) (prog-if 10 [OHCI])
    Flags: bus master, medium devsel, latency 32, IRQ 19
    Memory at f4024000 (32-bit, non-prefetchable) [size=4K]
    Kernel driver in use: ohci_hcd

02:05.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 37)
    Subsystem: LSI Logic / Symbios Logic LSI22801 PCI to Dual Channel Ultra SCSI host adapter
    Flags: bus master, medium devsel, latency 72, IRQ 17
    I/O ports at 8000 [size=256]
    Memory at f4029000 (32-bit, non-prefetchable) [size=256]
    Memory at f402a000 (32-bit, non-prefetchable) [size=4K]
    [virtual] Expansion ROM at 80000000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 1
    Kernel driver in use: sym53c8xx
    Kernel modules: sym53c8xx

02:05.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 37)
    Subsystem: LSI Logic / Symbios Logic LSI22801 PCI to Dual Channel Ultra SCSI host adapter
    Flags: bus master, medium devsel, latency 72, IRQ 18
    I/O ports at 8400 [size=256]
    Memory at f4025000 (32-bit, non-prefetchable) [size=256]
    Memory at f4026000 (32-bit, non-prefetchable) [size=4K]
    [virtual] Expansion ROM at 80010000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 1
    Kernel driver in use: sym53c8xx
    Kernel modules: sym53c8xx

02:06.0 Ethernet controller: D-Link System Inc RTL8139 Ethernet (rev 10)
    Subsystem: D-Link System Inc DFE-530TX+ 10/100 Ethernet Adapter
    Flags: bus master, medium devsel, latency 32, IRQ 18
    I/O ports at 8800 [size=256]
    Memory at f4027000 (32-bit, non-prefetchable) [size=256]
    Capabilities: [50] Power Management version 2
    Kernel driver in use: 8139too
    Kernel modules: 8139too

02:07.0 Ethernet controller: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 0d)
    Subsystem: Intel Corporation EtherExpress PRO/100 Server Adapter
    Flags: bus master, medium devsel, latency 32, IRQ 16
    Memory at f4028000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at 8c00 [size=64]
    Memory at f4000000 (32-bit, non-prefetchable) [size=128K]
    [virtual] Expansion ROM at 80020000 [disabled] [size=64K]
    Capabilities: [dc] Power Management version 2
    Kernel driver in use: e100
    Kernel modules: e100

02:08.0 RAID bus controller: Promise Technology, Inc. PDC20276 (MBFastTrak133 Lite) (rev 01) (prog-if 85)
    Subsystem: Giga-byte Technology MBUltra 133
    Flags: bus master, 66MHz, slow devsel, latency 32, IRQ 18
    I/O ports at 9000 [size=8]
    I/O ports at 9400 [size=4]
    I/O ports at 9800 [size=8]
    I/O ports at 9c00 [size=4]
    I/O ports at a000 [size=16]
    Memory at f4020000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [60] Power Management version 1
    Kernel driver in use: pata_pdc2027x
    Kernel modules: pata_pdc2027x

Can you get kernel oops messages and backtraces from these crashes?

In theory, though I didn't have much luck with this the last time I tried. I'll work on it over the weekend.

I tried setting up kexec/kdump, but the kdump service appears to be broken due to something related to systemd. (It looks like both systemd and sysvinit think the other is supposed to handle this service.) I tried working around it and got an initrd file created, but kexec didn't start. This might be because of the workaround I used, or it might be some other problem. When I look at it again, I'll see if I can start it myself.

I got a little further. kdump does seem to start up during boot, even if it claims it fails when run from the command line. I tried sending c to /proc/sysrq-trigger and the system froze for a while, then the screen turned black and some information relating to md devices scrolled by, then the monitor lost signal, and a bit after that the signal returned and BIOS messages for a reboot started. Nothing ended up on the machine I was trying to send the dump file to. My local machine has encrypted partitions except for /boot, which is too small to hold a dump of memory. So I am trying to scp the dump over to another system. The kdump test of this seems to succeed, but it doesn't copy anything there after a crash.

I just noticed this doesn't happen when my system comes up with only one CPU. There is a bug where about 1/3 of the time I boot, only one CPU gets used. When this happened I was unable to reproduce the issue.

I have kdump working with sysrq-trigger, but not with this bug. I am going to look at enabling the watchdog timer feature for kdump to see if that will catch the crash/lockup.

I have not been able to get kdump to work when the problem happens. Alt-SysRq-c doesn't work from the keyboard once the bug is triggered. I tried using nmi_watchdog=1 and nmi_watchdog=2 as kernel parameters and echoing 1 to the various panic variables in /proc/sys/kernel. Nothing changes on the screen and pings from another system are unanswered.

I am not seeing lots of NMI interrupts when I add the nmi_watchdog parameters. So I don't think adding those parameters is really turning on that feature. Do I need to do a custom kernel build to get that to work? Here are interrupt counts after the machine has been up for a bit over 3 hours:
           CPU0       CPU1
  0:        126          0   IO-APIC-edge     timer
  1:       4764       4654   IO-APIC-edge     i8042
  5:          0          0   IO-APIC-edge     MPU401 UART
  7:          0          0   IO-APIC-edge     parport0
  8:          0          0   IO-APIC-edge     rtc0
  9:          0          0   IO-APIC-fasteoi  acpi
 12:     132168     131388   IO-APIC-edge     i8042
 14:      21461      24098   IO-APIC-edge     pata_amd
 15:      23817      24142   IO-APIC-edge     pata_amd
 16:    1187191    1187264   IO-APIC-fasteoi  eth1
 17:     307876     303199   IO-APIC-fasteoi  radeon, pata_hpt37x, sym53c8xx, AMD AMD768
 18:         14         16   IO-APIC-fasteoi  pata_pdc2027x, sym53c8xx, eth0
 19:      31115      31183   IO-APIC-fasteoi  ohci_hcd:usb1
NMI:         15         16   Non-maskable interrupts
LOC:    2409945    2575412   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:         15         16   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RES:     314699     293729   Rescheduling interrupts
CAL:     113030     138124   Function call interrupts
TLB:      38515      34873   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:         41         41   Machine check polls
ERR:          0
MIS:          0
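For anyone comparing listings like the one above across boots, a small parser can make the NMI and shared-IRQ checks less error-prone. The following is only a minimal Python sketch and is not part of the original report. The function names (`parse_interrupts`, `shared_irqs`, `total`) are made up for illustration, and it assumes the layout shown here, where the interrupt type is a single token such as `IO-APIC-fasteoi`; other kernel versions may lay this column out differently.

```python
# Hypothetical helpers for reading /proc/interrupts-style text.
# The sample rows are copied from the listing in this report.
SAMPLE = """\
           CPU0       CPU1
 16:    1187191    1187264   IO-APIC-fasteoi   eth1
 17:     307876     303199   IO-APIC-fasteoi   radeon, pata_hpt37x, sym53c8xx, AMD AMD768
 18:         14         16   IO-APIC-fasteoi   pata_pdc2027x, sym53c8xx, eth0
NMI:         15         16   Non-maskable interrupts
LOC:    2409945    2575412   Local timer interrupts
ERR:          0
"""


def parse_interrupts(text):
    """Return {label: (per_cpu_counts, description)} for each row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR and MIS carry one global count, so stop
        # at the first non-numeric field or after one count per CPU.
        while fields and len(counts) < ncpus and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = (counts, " ".join(fields))
    return table


def shared_irqs(table):
    """Return {irq: [device, ...]} for numbered IRQs listing several devices."""
    shared = {}
    for label, (_, desc) in table.items():
        if not label.isdigit() or "," not in desc:
            continue
        # Assumption: the first token is the interrupt type, the rest
        # is a comma-separated device list.
        _, _, devices = desc.partition(" ")
        shared[int(label)] = [d.strip() for d in devices.split(",")]
    return shared


def total(table, label):
    """Sum a row's counts across all CPUs."""
    return sum(table[label][0])


if __name__ == "__main__":
    parsed = parse_interrupts(SAMPLE)
    print("NMI total:", total(parsed, "NMI"))
    print("LOC total:", total(parsed, "LOC"))
    print("Shared IRQs:", shared_irqs(parsed))
```

On a live system the same functions could be fed `open("/proc/interrupts").read()`. For the listing above this gives 31 NMIs in total against roughly five million local timer interrupts, which is the comparison behind the remark about not seeing many NMIs. The sketch does not say what NMI count would be expected with the watchdog active, and whether the IRQ sharing it reports has anything to do with the hang is not established in this report.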
I retested this with the 3.1.0-0.rc9.git0.0.fc17.i686.PAE kernel and am still seeing hangs without a traceback.

I am still seeing this with a 3.2.1 kernel. I tried rebuilding a debug kernel with CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y, but I still didn't get a backtrace.

I retested this with kernel-PAE-3.6.0-0.rc3.git1.1.fc18.i686 and am still seeing the problem. I had netconsole running when I did the test, but there was no output as a result of the crash/hang.

I am trying to narrow down when this bug was introduced and, based on testing tonight, it looks like it is not in the 2.6.29.1-68.fc11 kernel, but is definitely in the 2.6.29.1-69.fc11 kernel. As far as I can tell the only difference between -68 and -69 is that the patch linux-2.6-iwl3945-rely-on-priv-_lock-to-protect-priv-access.patch got dropped. It is a wireless patch and the affected machine isn't using wireless. But it does seem to involve networking and locking, which kind of makes sense if there is a way that code can get used when there isn't wireless hardware connected.

Created attachment 607067: Patch that was dropped between -68 and -69
I looked at the difference between -68 and -69 and it looks like the only difference is that the attached patch (linux-2.6-iwl3945-rely-on-priv-_lock-to-protect-priv-access.patch) got dropped.
This seems odd because the affected machine doesn't have any wireless devices. However the change does affect locking and is related to networking, so it kind of seems like it could cause what I am seeing.
Just on the off chance there was a gcc change between those two builds, I went back and took a look, and it doesn't appear that a gcc change could have broken -69 without also breaking -68. There were no gcc builds in between the builds of -68 and -69. I also see that 3945 support is just being built as a module, which makes it seem even less likely that removing that patch should have broken things.

I tried not building 3945 support at all and the crash/hang still happened. I am going to look and see if I can find anything else that might have changed so as to affect the code generated in kernel builds.

redhat-rpm-config doesn't look like it changed around that time. glibc didn't have builds in between the two kernel builds. However, I forgot that there might have been an updates-testing repo at the time those builds were being made (it would have been around the f11 beta), so some builds from before the first kernel build might not have been used until after the first kernel build. I don't seem to be able to go back far enough in bodhi to see exactly when builds were moved into stable. I think I need to pursue looking at the generated code in the drivers for those two kernels. It might take me a while to figure out where to look.

Given all of your description, I would be tempted to say that I suspect either CPU or main RAM is your problem, not your kernel. Can you download the memory tester script from my web page at people.redhat.com/dledford/ and see if that script will trip up your computer without the need to involve motherboard sound? The thing is, if your hardware is marginal, motherboard sound output may just be enough to put it over the edge without really being relevant to the problem. The memory test script I linked above generally will find these marginal systems.

I ran the test using a kernel tar.gz file and it finished 20 passes without reporting an error.
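The memory tester script itself is not reproduced in this report. Purely as an illustration of the general idea behind a pass-based test driven by a compressed archive (repeat a deterministic, memory-heavy operation and check that every pass gives the same result), here is a small Python sketch. It is an assumption that the real script works along broadly similar lines; the function names and the default pass count are hypothetical, and this is in no way a substitute for the script referenced above.

```python
import gzip
import hashlib


def digest_of_decompressed(blob: bytes) -> str:
    """Decompress a gzip blob and return the SHA-256 hex digest of the result."""
    return hashlib.sha256(gzip.decompress(blob)).hexdigest()


def count_mismatches(blob: bytes, passes: int = 20) -> int:
    """Decompress `blob` `passes` times and count passes that differ from the first.

    On healthy hardware every pass should produce the same digest, so the
    expected return value is 0. A corrupted pass may also surface as an
    exception from gzip's own CRC check rather than as a digest mismatch.
    """
    reference = digest_of_decompressed(blob)
    mismatches = 0
    for _ in range(passes - 1):
        if digest_of_decompressed(blob) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Stand-in payload; a real run would read a large file such as a
    # kernel tarball, e.g. open(path, "rb").read() (path not given here).
    payload = gzip.compress(bytes(range(256)) * 4096)
    print("mismatching passes:", count_mismatches(payload))
```

A clean result from a test of this kind only shows that this particular workload did not trip the hardware; it does not rule out a fault that needs a different trigger.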
I don't think the problem is really an intermittent hardware error, since when using the 2.6.29.1-68 kernel sound consistently works (I installed Fedora 11 again to test this a few days ago) and with later kernels it hangs pretty consistently after about half a second of sound with heavy network traffic. Sometimes shorter pulses of sound (the audible bell) don't hang the system. I'd believe there could be a hardware bug or problem of some kind that is for some reason not triggered by the older kernel.

Thanks for confirming that. Both the -68 and -69 builds were built with the same version of gcc (4.4.0-0.32).

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and the reason for this action are here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Is this still an issue with the 3.9 kernels in F19?

I'd be shocked if it was fixed. I don't want to crash my desktop right this minute, but I'll test it on a 3.9 kernel by the end of the weekend.

I got a crash using 3.9.0-0.rc5.git3.2.fc20.i686.PAE (from rawhide-nodebug). I wasn't actually getting sound out of my unamped speakers, but I am not sure if it was just very quiet, a hardware issue or a driver issue. Pulse showed output going to the device, and when I started up scp the system crashed pretty quickly.

I checked out the hardware using f11 and got sound. So the lack of sound in rawhide/f20 is probably due to software. I don't normally have the speakers enabled since that causes problems, so that issue may go back a ways. But it is an issue for another bug.

The sound issue turned out to be pulseaudio related.
I was able to change the config for the device and get it to work, though I think there is a bug in how the config for this device is handled in pulseaudio.

Note this problem was originally filed as bug 496536. I wasn't able to get a kdump and the bug was eventually closed due to insufficient info.

The 2.6.29.1-68.fc11.i686.PAE kernel still shows this problem in some cases. I have also had 2.6.28 builds on f10 show the problem. It may be that the problem really went back further and I just wasn't triggering it.

I retested this on 3.12.0-0.rc6.git1.2.fc21.i686+PAE (rawhide nodebug) and it is definitely still a problem. I also suspect that the problem goes back a lot farther than I thought. I probably didn't notice it until I started doing a lot of local file transfers. In some of my earlier testing I also might not have been as sure about how to reproduce the problem.

I have now also reproduced this with a Fedora 2.6.27 kernel on F10. So I may need to go back to trying to get some sort of traceback or other information about where the system is locking up.

I just retried netconsole, but no luck. The crash/hang event doesn't send over any output, though other events do.

I retested that more than one CPU needs to be active for this bug to manifest itself. When I booted with maxcpus=1, playing sound on the motherboard device and doing a large local network transfer did not result in a crash/hang. I tried using nmi_watchdog=2, but that didn't help me capture any output with netconsole. I also tried nohz=off and that didn't prevent the issue from happening.

I have found the following erratum, which could plausibly be related to the problem I am seeing:

AMD Athlon™ Processor Model 8 Revision Guide, 25703E, October 2003, Preliminary Information

17 Deadlock May Occur in a Two-Processor System in the Presence of Probe to Memory-Mapped I/O

Products Affected. A0, B0

Normal Specified Operation. Processor should not hang.

Non-conformance. In a multiprocessor system, if one processor (A) is continuously writing to a cacheable memory-mapped I/O block while the other processor (B) is trying to read the same cacheable I/O block, and at the same time both processors are also trying to write a different memory-based cache block, then processor B may hang. Should this occur and processor A fields an interrupt, the deadlock is resolved.

Potential Effect on System. System will hang or exhibit performance degradation.

Suggested Workaround. The current processor design assumes that memory mapped I/O is incoherent and does not handle all deadlock cases. System logic should not generate probes for memory mapped I/O addresses.

Resolution Status. No fix planned.

I tested this with both NICs and loopback. The lockup didn't happen during the loopback transfer and did happen with both NICs. (One uses 8139too and the other e100.)

I have a strong suspicion that the issue is due to the above-mentioned processor bug. While it might be nice to have a quirk set up to work around this issue, I doubt there are enough people still running this type of processor to make the effort worthwhile. So I am going to close this. |