From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
When running I/O-intensive applications, such as MySQL's myisamchk on a 600MB table, on this dual 1GHz system w/1GB RAM and VIA 694DP chipset (MSI 694D Pro-@ motherboard), the system will hardlock (with no SysRq response) on various stock Red Hat SMP kernels.

How reproducible:
Always

Steps to Reproduce:
1. Run RH7.1 w/2.4.2+ SMP kernel on a 2-CPU MSI VIA M/B
2. Boot
3. Run myisamchk -So on a very large table, OR run mysqld and pound it with ~150 queries per second

Actual Results: System hardlocks

Expected Results: System performs duties normally

Additional info:
The URL contains a more detailed description of things I've tried doing to remedy the problem. It seems only the uniprocessor Linux kernel works without locking. A discussion with Mark Hanhs brought up possible issues with PCI arbitration on the VIA bus. Apparently the chipset has issues. I do not have the resources to investigate. Please note that setting maxcpus=1 on an SMP kernel makes the system stable, but it stalls often, kicking the loadavg into the high 100's with mysql receiving ~150 qps. This behavior is not present in a uniprocessor kernel; the system handles high loads gracefully on one CPU.
I also have a new MSI 6321 VIA 694D-IR v2 board running dual P3-733. I had the system running fine for a week with RH7.1, but under very light load. I just installed a Megaraid Express 500 (475) card last night and reinstalled RH7.1 using only RAID disks (make sure you upgrade the Megaraid firmware to 159!!). Left the machine on all night doing useless greps and cats of 1GB files. Came back in the morning and the screen was black and would not wake up. Num-lock would not respond. Hardlock. No hints in /var/log/messages (anywhere else I should look?). One thing they'll tell you to do is make sure you're booting with ide=noudma on these VIA boards. That I am doing (though I run no IDE hard drives). Anyways, if the problem turns out to be heavy I/O on SMP on these boards, it would be interesting to see whether it matters if one is using IDE or SCSI. Right now I am blasting the machine with concurrent make-work disk-intensive tasks to try to reproduce the crash while I'm watching.
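For anyone else trying to shake this loose, here is roughly the kind of make-work disk load I'm generating, sketched as a small Python script. The worker counts and file sizes are made up for illustration; on a real test you'd want gigabyte-sized files and hours of runtime:

```python
import hashlib
import os
import tempfile
import threading

def hammer(worker_id, size_mb, rounds, errors):
    """Repeatedly write a file of random bytes, fsync it, read it back,
    and compare checksums. A mismatch, or a hang here, is the symptom
    we're fishing for."""
    path = os.path.join(tempfile.gettempdir(), "iostress.%d" % worker_id)
    try:
        for _ in range(rounds):
            data = os.urandom(size_mb * 1024 * 1024)
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())     # force it out to the disk
            with open(path, "rb") as f:
                readback = f.read()
            if hashlib.md5(readback).digest() != hashlib.md5(data).digest():
                errors.append(worker_id)  # data corruption on this worker
                return
    finally:
        if os.path.exists(path):
            os.unlink(path)

def run_stress(workers=4, size_mb=64, rounds=10):
    """Run several hammer() workers concurrently; True means no
    corruption was detected (a hardlock obviously never returns)."""
    errors = []
    threads = [threading.Thread(target=hammer, args=(i, size_mb, rounds, errors))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return not errors
```

On the box in question I'd crank it up, something like `run_stress(workers=8, size_mb=1024, rounds=100)`, and leave it overnight.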
My workstation is an SMP VIA board, and it works fine. VIA admitted that about 25% of their chips had a flaw and posted a (sort of) workaround; that workaround is in the 2.4.3-12 kernel we released a few weeks ago.
Can you post a link to, or tell us about, the chipset flaw that 2.4.3-12 fixes? I'm under the impression VIA issues are mostly IDE-related. Strange that the system never crashed when I used an old, crappy IDE drive before I got the Megaraid 500; but then again, it was only lightly loaded.
It's not so much IDE that caused it as high load. It just happens that IDE doing UDMA100 generates a rather high load.
Great. That makes me feel good. I don't know how VIA gets away with it. Well, actually, I do: there is no modern mainstream-priced dual-CPU board available besides VIA's. Where can I find the details of the bug and/or fix?

Not sure if this helps any, but I spotted some weirdness in the message log. (The system hasn't crashed yet since I restarted this morning, BTW, and the RAID array is sure getting a good workout!)

Jul  6 09:22:03 www kernel: Uhhuh. NMI received for unknown reason 30.
Jul  6 09:22:04 www kernel: Dazed and confused, but trying to continue
Jul  6 09:22:04 www kernel: Do you have a strange power saving mode enabled?

So I went into the BIOS and completely disabled everything PM-related. Perhaps that was the problem last night? The console was all black screen (but the monitor was NOT asleep or in a power-saving mode) when I tried to revive it this morning. Crossing my fingers.
Oops. I got another NMI error, this time reason 20. It seems to happen after I leave the machine alone for about an hour, so it's probably PM-related. Strange that this still occurs after I disabled all PM in the BIOS. I'll go check out the other NMI bug reports. BTW, the system still hasn't crashed, and it's been at load 5+ for 2 hours now.
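For what it's worth, those "unknown reason" numbers come from the byte the kernel reads off I/O port 0x61 when an NMI fires. A rough sketch of how the 2.4-era kernel classifies it (the function and message strings below are my own paraphrase, not actual kernel code):

```python
# Bits of the NMI status/control byte at port 0x61 that the kernel
# knows how to attribute. Anything else falls through to the
# "NMI received for unknown reason XX" message seen in the log.
PARITY_ERROR = 0x80   # bit 7: memory parity error
IO_CHECK     = 0x40   # bit 6: I/O channel check (SERR)

def classify_nmi(reason):
    """Mimic the kernel's NMI triage on the port-0x61 reason byte."""
    if reason & PARITY_ERROR:
        return "memory parity error"
    if reason & IO_CHECK:
        return "I/O check error"
    # Neither known cause bit set: the kernel has no idea either.
    return "unknown reason %02x" % reason

print(classify_nmi(0x30))  # the NMI from my log
print(classify_nmi(0x20))  # the second one
```

Both 0x30 and 0x20 leave bits 6 and 7 clear, which is exactly why the kernel calls them "unknown"; it points away from the classic parity/SERR causes and toward the chipset doing something odd.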
Do you know if that fix in 2.4.3-12 fixes the PCI bugs or does it just fix the IDE issue?
It is the workaround for the PCI issues. However, VIA isn't totally open about this, and there are signs (see an average day of the Linux kernel list) that it isn't a full workaround. If only promise gave more info...
I should have asked this too: is this PCI fixup also included in the upstream kernel releases from kernel.org? I've tried both RH and stock SMP kernels from kernel.org, and both hardlock in SMP mode. It is possible VIA fixed their issues in the chipset used on the v2 MSI board that tcordes pointed out as working.
2.4.5 and later from kernel.org have the identical fix.
Well, my system made it through a whole day of pretty rough tests without any problems. Perhaps the freeze I had was caused by the thunderstorm here last night; I don't have this box on a UPS yet. Reading the other notes about VIA's south bridge, it would seem I also have the buggy version "B", unless this board uses a newer rev under "B".

For reference: my board is the MSI 6321-IR v2 VIA 694D, and the BIOS level is 5.0 (the newest). I'm running the AMI Megaraid 500 successfully now with RAID 5. The video is an ATI Xpert 98. The ethernet is a Dlink DFE-538TX. The RAM is ECC Kingston PC133.
I was able to access the system yesterday and found that removing one 256MB DIMM from the last memory slot made the immediate crashes go away. However, the system now locks up intermittently in SMP mode with some error messages. The following oops and error message were produced on the 2.4.5-0.4smp RH kernel from rawhide. The Machine Check Exception hardlocked the system. The oops occurred after a few hours' uptime following a cold reboot from the hardlock, on the same kernel.

CPU 1: Machine Check Exception: 0000000000000004
Bank 1: b200000000000115
Kernel panic: CPU context corrupt

Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: CPU 1: Machine Check Exception: 0000000000000004
Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: Bank 1: b200000000000115
Message from syslogd@delta at Sat Jul 7 13:18:36 2001 ...
delta kernel: Kernel panic: CPU context corrupt

The oops (run thru ksymoops, very messy):

Using defaults from ksymoops -t elf32-i386 -a i386

Unable to handle kernel paging request at virtual address fffffff6
c01165d0
*pde = 00004063
Oops: 0000
CPU:    1
EIP:    0010:[<c01165d0>]
EFLAGS: 00010003
eax: 00000002   ebx: e1651fc4   ecx: 00000001   edx: 00000002
esi: c1cd2000   edi: be750000   ebp: e1653e70   esp: e1653e44
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 790, stackpage=e1653000)
Stack: e1652000 00000002 00000002 e1652000 fffffc18 00000001 e1652000 c02b8460
       d108cda0 e1652000 e1652000 00000002 c0139416 d108cda0 00000000 00000000
       00000000 00000000 e1652000 d108cdec d108cdec 00000000 00001000 00000400
Call Trace: [<c0139416>] [<c013bb59>] [<f087fb6b>] [<c013c382>] [<f0863850>]
       [<c012c993>] [<f0863850>] [<c01cb80c>] [<c0138a4e>] [<c011412c>] [<c0108c65>]
       [<c010718b>]
Code: 8b 7a f4 8d 5a c4 85 ff 75 6c 8b 45 dc 85 42 fc 74 64 8b 42

>>EIP; c01165d0 <schedule+150/550>   <=====
Trace; c0139416 <__wait_on_buffer+76/a0>
Trace; c013bb59 <__block_prepare_write+219/2b0>
Trace; f087fb6b <END_OF_CODE+3054a67b/????>
Trace; c013c382 <block_prepare_write+22/70>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c012c993 <generic_file_write+3c3/620>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c01cb80c <kfree_skbmem+c/70>
Trace; c0138a4e <sys_pwrite+ae/f0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0108c65 <do_IRQ+e5/f0>
Trace; c010718b <system_call+33/38>
Code;  c01165d0 <schedule+150/550>
00000000 <_EIP>:
Code;  c01165d0 <schedule+150/550>   <=====
   0:   8b 7a f4          mov    0xfffffff4(%edx),%edi   <=====
Code;  c01165d3 <schedule+153/550>
   3:   8d 5a c4          lea    0xffffffc4(%edx),%ebx
Code;  c01165d6 <schedule+156/550>
   6:   85 ff             test   %edi,%edi
Code;  c01165d8 <schedule+158/550>
   8:   75 6c             jne    76 <_EIP+0x76> c0116646 <schedule+1c6/550>
Code;  c01165da <schedule+15a/550>
   a:   8b 45 dc          mov    0xffffffdc(%ebp),%eax
Code;  c01165dd <schedule+15d/550>
   d:   85 42 fc          test   %eax,0xfffffffc(%edx)
Code;  c01165e0 <schedule+160/550>
  10:   74 64             je     76 <_EIP+0x76> c0116646 <schedule+1c6/550>
Code;  c01165e2 <schedule+162/550>
  12:   8b 42 00          mov    0x0(%edx),%eax

NMI Watchdog detected LOCKUP on CPU1, registers:
CPU:    1
EIP:    0010:[<c021c67c>]
EFLAGS: 00000082
eax: 00000000   ebx: e37dbf64   ecx: e37da000   edx: 00000001
esi: 00000046   edi: c025cdc0   ebp: e1653d0c   esp: e1653cf4
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 790, stackpage=e1653000)
Stack: 00000001 00000086 00000001 c02de5a1 00000046 00000001 00000000 c0119558
       0000000f c0228863 e1653e10 c01157b0 c022bcc7 00000000 c0107704 00000000
       c022d592 c0226fed c0228863 00000000 00000001 00004000 c0234020 e1653e10
Call Trace: [<c0119558>] [<c01157b0>] [<c0107704>] [<c0115c38>] [<c0186cd9>]
       [<c01873cc>] [<f082cd79>] [<c0115890>] [<c01072d0>] [<f0810018>] [<c01165d0>]
       [<c0139416>] [<c013bb59>] [<f087fb6b>] [<c013c382>] [<f0863850>] [<c012c993>]
       [<f0863850>] [<c01cb80c>] [<c0138a4e>] [<c011412c>] [<c0108c65>] [<c010718b>]
Code: 80 3d 00 84 2b c0 00 f3 90 7e f5 e9 83 a3 ef ff 80 38 00 f3

>>EIP; c021c67c <stext_lock+680/6451>   <=====
Trace; c0119558 <printk+148/160>
Trace; c01157b0 <bust_spinlocks+50/60>
Trace; c0107704 <die+54/70>
Trace; c0115c38 <do_page_fault+3a8/4b0>
Trace; c0186cd9 <req_new_io+49/60>
Trace; c01873cc <__make_request+4dc/6c0>
Trace; f082cd79 <END_OF_CODE+304f7889/????>
Trace; c0115890 <do_page_fault+0/4b0>
Trace; c01072d0 <error_code+38/40>
Trace; f0810018 <END_OF_CODE+304dab28/????>
Trace; c01165d0 <schedule+150/550>
Trace; c0139416 <__wait_on_buffer+76/a0>
Trace; c013bb59 <__block_prepare_write+219/2b0>
Trace; f087fb6b <END_OF_CODE+3054a67b/????>
Trace; c013c382 <block_prepare_write+22/70>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c012c993 <generic_file_write+3c3/620>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c01cb80c <kfree_skbmem+c/70>
Trace; c0138a4e <sys_pwrite+ae/f0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0108c65 <do_IRQ+e5/f0>
Trace; c010718b <system_call+33/38>
Code;  c021c67c <stext_lock+680/6451>
00000000 <_EIP>:
Code;  c021c67c <stext_lock+680/6451>   <=====
   0:   80 3d 00 84 2b c0 00   cmpb   $0x0,0xc02b8400   <=====
Code;  c021c683 <stext_lock+687/6451>
   7:   f3 90                  repz nop
Code;  c021c685 <stext_lock+689/6451>
   9:   7e f5                  jle    0 <_EIP>
Code;  c021c687 <stext_lock+68b/6451>
   b:   e9 83 a3 ef ff         jmp    ffefa393 <_EIP+0xffefa393> c0116a0f <__wake_up+3f/c0>
Code;  c021c68c <stext_lock+690/6451>
  10:   80 38 00               cmpb   $0x0,(%eax)
Code;  c021c68f <stext_lock+693/6451>
  13:   f3 00 00               repz add %al,(%eax)

NMI Watchdog detected LOCKUP on CPU0, registers:
NMI Watchdog detected LOCKUP on CPU1, registers:
CPU:    1
EIP:    0010:[<c021cdb6>]
EFLAGS: 00000082
eax: 00000000   ebx: e27c4000   ecx: 00000000   edx: c1cf1c00
esi: 00000021   edi: 00000000   ebp: 00000000   esp: e1653b74
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 790, stackpage=e1653000)
Stack: e27c4000 00000021 00000086 c0121d86 00000021 e1653bb0 e27c4000 e1652000
       00000021 e1652000 0000000b c012226a 00000021 e1653bb0 e27c4000 00000021
       00000000 00040002 00000316 00000064 0000000b 000016e1 000005cb 00000000
Call Trace: [<c0121d86>] [<c012226a>] [<c011412c>] [<c0107356>] [<c011be75>]
       [<c011c1cc>] [<c01195fe>] [<c0119558>] [<c011450d>] [<c0107356>] [<c021c67c>]
       [<c0119558>] [<c01157b0>] [<c0107704>] [<c0115c38>] [<c0186cd9>] [<c01873cc>]
       [<f082cd79>] [<c0115890>] [<c01072d0>] [<f0810018>] [<c01165d0>] [<c0139416>]
       [<c013bb59>] [<f087fb6b>] [<c013c382>] [<f0863850>] [<c012c993>] [<f0863850>]
       [<c01cb80c>] [<c0138a4e>] [<c011412c>] [<c0108c65>] [<c010718b>]
Code: 80 3d 00 84 2b c0 00 f3 90 7e f5 e9 fa 4e f0 ff 80 bb 38 06

>>EIP; c021cdb6 <stext_lock+dba/6451>   <=====
Trace; c0121d86 <send_sig_info+86/b0>
Trace; c012226a <do_notify_parent+9a/b0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0107356 <nmi+1e/38>
Trace; c011be75 <exit_notify+195/2c0>
Trace; c011c1cc <do_exit+22c/240>
Trace; c01195fe <release_console_sem+4e/a0>
Trace; c0119558 <printk+148/160>
Trace; c011450d <nmi_watchdog_tick+8d/e0>
Trace; c0107356 <nmi+1e/38>
Trace; c021c67c <stext_lock+680/6451>
Trace; c0119558 <printk+148/160>
Trace; c01157b0 <bust_spinlocks+50/60>
Trace; c0107704 <die+54/70>
Trace; c0115c38 <do_page_fault+3a8/4b0>
Trace; c0186cd9 <req_new_io+49/60>
Trace; c01873cc <__make_request+4dc/6c0>
Trace; f082cd79 <END_OF_CODE+304f7889/????>
Trace; c0115890 <do_page_fault+0/4b0>
Trace; c01072d0 <error_code+38/40>
Trace; f0810018 <END_OF_CODE+304dab28/????>
Trace; c01165d0 <schedule+150/550>
Trace; c0139416 <__wait_on_buffer+76/a0>
Trace; c013bb59 <__block_prepare_write+219/2b0>
Trace; f087fb6b <END_OF_CODE+3054a67b/????>
Trace; c013c382 <block_prepare_write+22/70>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c012c993 <generic_file_write+3c3/620>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c01cb80c <kfree_skbmem+c/70>
Trace; c0138a4e <sys_pwrite+ae/f0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0108c65 <do_IRQ+e5/f0>
Trace; c010718b <system_call+33/38>
Code;  c021cdb6 <stext_lock+dba/6451>
00000000 <_EIP>:
Code;  c021cdb6 <stext_lock+dba/6451>   <=====
   0:   80 3d 00 84 2b c0 00   cmpb   $0x0,0xc02b8400   <=====
Code;  c021cdbd <stext_lock+dc1/6451>
   7:   f3 90                  repz nop
Code;  c021cdbf <stext_lock+dc3/6451>
   9:   7e f5                  jle    0 <_EIP>
Code;  c021cdc1 <stext_lock+dc5/6451>
   b:   e9 fa 4e f0 ff         jmp    fff04f0a <_EIP+0xfff04f0a> c0121cc0 <deliver_signal+50/90>
Code;  c021cdc6 <stext_lock+dca/6451>
  10:   80 bb 38 06 00 00 00   cmpb   $0x0,0x638(%ebx)
A machine check exception is the CPU telling you its hardware self-tests failed. That's BAD!
The server runs fine for a few hours, but then the machine check exception occurs. What could be contributing to this failure? Could it be the motherboard itself?
If it happens after a few hours, it sounds like a temperature thing... Check that all fans are turning and that the airflow isn't blocked by cables.
I've installed lm_sensors and the temperatures look fine (~28-32°C for both CPUs). However, I am running 1GHz CPUs, which Intel's docs say require at least 1.70V (this was found after a bit of research into decoding the machine check exception code). The 1st CPU is running at 1.70V, but the 2nd CPU is running at 1.65V. It looks like this might have been the culprit all this time.
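For reference, here is the bit-picking involved in decoding the MCi_STATUS value (b200000000000115) from the panic above. This is a rough sketch based on the IA-32 machine-check architecture docs; the field names follow the manual, but the helper itself is mine and only pulls out the architectural flag bits plus the low 16-bit MCA error code:

```python
def decode_mci_status(status):
    """Split an IA-32 MCi_STATUS register value into its architectural
    flag bits and the 16-bit MCA error code."""
    flags = {
        "VAL":   bool(status >> 63 & 1),  # register holds valid info
        "OVER":  bool(status >> 62 & 1),  # an earlier error was overwritten
        "UC":    bool(status >> 61 & 1),  # uncorrected error
        "EN":    bool(status >> 60 & 1),  # reporting was enabled for it
        "MISCV": bool(status >> 59 & 1),  # MCi_MISC has extra info
        "ADDRV": bool(status >> 58 & 1),  # MCi_ADDR has the faulting address
        "PCC":   bool(status >> 57 & 1),  # processor context corrupt
    }
    mca_code = status & 0xFFFF            # model-independent error code
    return flags, mca_code

flags, code = decode_mci_status(0xb200000000000115)
# UC and PCC both set is exactly why the kernel panicked with
# "CPU context corrupt"; the low code field is what I chased through
# the Intel docs to get to the voltage question.
```

The decode shows an enabled, uncorrected error with processor context corrupt, i.e. the CPU itself reporting a fatal internal fault, which is consistent with a marginal core voltage.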
Dear People,

Sorry to interrupt, but I was wondering if you could help me out. I have an MSI 694DPRO AR motherboard with RAIDed 2+0 IDE hard drives and an additional Quantum hard drive. The RAIDed hard drives are NTFS; the Quantum is FAT32. The BIOS is the latest for the board. I have a problem with installation and was wondering if you guys who succeeded in installing Linux might tell me why the installation hangs after it detects the hard disks? Please help. Thank you.

smtan.au

Warmest regards,
Su Min Tan
UPDATE: the machine I talked about in my earlier posts ran fine from that time up until about 3 months ago, with only 1 hard crash that I can remember. Now all of a sudden I've had 3 hard crashes in 2-3 months. I suspect the 2(?) up2date kernel updates I've done since then, as nothing else has changed on the system. It's running kernel-smp-2.4.9-34 now, but it also crashed under kernel-smp-2.4.9-31. As I said, for all of 2001 and some of 2002 it ran with only 1 crash on the older kernel. Were there any changes recently in the kernel that would affect these dual VIA boards? Would I gain anything by upgrading to RH7.3? (We're still at 7.1 on the production boxes.) Note: the crashes were in our peak hours (it's a web server), so it may still be some load-dependent crash. But our peak load isn't much above average, so it's not like it's being hammered.
I have the same problem with the 694DP motherboard. It will crash within 30 minutes if there is over 512 megs of RAM and it is doing something CPU-intensive (a few hours otherwise). If I put a gig of RAM in it and load a huge file into emacs, it will lock up with the same paging errors you guys are getting, but with 2 sticks of 256MB RAM (512MB total) it works OK. It's only when I add the third stick that it crashes. I'm using kernel 2.4.7-10enterprise #1 SMP. Has anyone found a workaround/fix for this?
jeff: 1) Why are you using the enterprise kernel with only a gig of RAM, and 2) have you tried upgrading to 2.4.9-34?
Don't ever use the third DIMM slot! VIA is known to have problems with the 3rd slot. Reportedly it works with single-sided DIMMs only, but I wouldn't even trust that! Just use the first 2 slots. I'm seriously tempted to replace all 694D boards with ServerWorks-based ones... See my other bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=71204
I looked on eBay for a replacement motherboard, and all the dual P3 motherboards I looked at listed a maximum of 512MB RAM. It must be some limitation with the P3 chips; I guess 512MB will have to do. I wonder if any of them would support a gig with 2 x 512MB DIMMs. Has anybody tried this?
The only dual boards I have found that support 133MHz-bus Coppermines are the VIA-based and ServerWorks-based ones. The ServerWorks boards are 4 to 6 times as expensive as the VIA boards. If my latest update to our systems doesn't stop the crashing, we will be going with SW-based boards. Both the VIA and SW chipsets should handle a minimum of 2 x 512MB DIMMs for 1GB of RAM. Most of the SW boards support registered memory, or have more banks, for a total of 2GB or maybe even 4GB with 1GB DIMMs. I personally have run 512MB DIMMs on VIA dual boards with no problem. Just don't try to use the 3rd slot (if any)!!