Bug 47160 - SMP on VIA motherboard hardlocks consistently
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.1
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brock Organ
URL: http://www.uwsg.indiana.edu/hypermail...
Reported: 2001-07-03 18:14 UTC by Vibol Hou
Modified: 2007-04-18 16:34 UTC (History)

Doc Type: Bug Fix
Last Closed: 2003-06-06 14:17:04 UTC

Description Vibol Hou 2001-07-03 18:14:19 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
When running I/O intensive applications, such as MySQL's myisamchk on a 
600MB table, on this dual 1GHz system w/1GB RAM and VIA 694DP chipset (MSI 
694D Pro-@ Motherboard), the system will hardlock (with no SysRQ response) 
on various RH stock SMP kernels.

How reproducible:
Always

Steps to Reproduce:
1. Run RH7.1 w/2.4.2+ SMP kernel on a 2-CPU MSI VIA M/B
2. Boot
3. Run myisamchk -So on a very large table
   OR
   Run mysqld and pound it with ~150 queries per second

Actual Results:  System hardlocks

Expected Results:  System performs duties normally

Additional info:

The URL contains a more detailed description of things I've tried doing to 
remedy the problem.  It seems only the uniproc linux kernel works without 
locking.

A discussion with Mark Hanhs brought up possible issues with PCI 
arbitration on the VIA bus.  Apparently the chipset has issues.  I do not 
have resources to investigate.

Please note that setting maxcpus=1 on an SMP kernel makes the system 
stable, but it stalls often, kicking the loadavg into the high 100's with 
mysql receiving ~150qps.  This behavior is not present in a uniproc 
kernel; the system handles high loads gracefully on one cpu.

Comment 1 Trevor Cordes 2001-07-06 13:59:43 UTC
I also have a new MSI 6321 VIA 694D-IR v2 board running dual P3-733.  I had the 
system running fine for a week with RH7.1 but with very light load.  I just 
installed a Megaraid Express 500 (475) card last night and reinstalled (make 
sure you upgrade the Megaraid firmware to 159!!) RH7.1 using only RAID disks.  
Left the machine on all night doing useless greps and cats of 1GB files.  Came 
back in the morning and the screen was black and would not wake up.  Num-lock 
would not respond.  Hardlock.  No hints in /var/log/messages (anywhere else I 
should look?).

One thing they'll tell you to do is make sure you're booting with ide=noudma 
with these VIA boards.  That I am doing (though I run no IDE hard drives).

Anyways, if the problem turns out to be heavy I/O on SMP on these boards, then 
it would be interesting to see if it matters whether one is using IDE or SCSI.  
Right now I am blasting the machine with concurrent make-work disk-intensive 
tasks to try to reproduce the crash while I'm watching.


Comment 2 Arjan van de Ven 2001-07-06 14:10:11 UTC
My workstation is a SMP VIA board, and works fine. VIA admitted about 25% of
their chips had a flaw and posted a workaround (sort of), that workaround is
in the 2.4.3-12 kernel we released a few weeks ago.

Comment 3 Trevor Cordes 2001-07-06 14:21:25 UTC
Can you post a link to or tell us about the chipset flaw that 2.4.3-12 fixes?  
I'm under the impression VIA issues are mostly IDE related.  Strange that the 
system never crashed when I used an old crappy IDE drive before I got the 
Megaraid 500, but it was only lightly loaded.


Comment 4 Arjan van de Ven 2001-07-06 14:23:11 UTC
It's not so much IDE that caused it, but high load. It just happens that IDE
doing UDMA100 generates a rather high load.

Comment 5 Trevor Cordes 2001-07-06 15:02:09 UTC
Great.  That makes me feel good.  I don't know how VIA gets away with it.  
Well, actually, I do: there is no modern mainstream-priced dual CPU board 
available besides VIA's.

Where can I find the details of the bug and/or fix?

Not sure if this helps any, but I spotted some weirdness in the message log.  
(The system hasn't crashed yet since I restarted this morning, BTW, and the 
RAID array is sure getting a good workout!)

Jul  6 09:22:03 www kernel: Uhhuh. NMI received for unknown reason 30.
Jul  6 09:22:04 www kernel: Dazed and confused, but trying to continue
Jul  6 09:22:04 www kernel: Do you have a strange power saving mode enabled?

So I went into the BIOS and completely disabled everything PM related.  Perhaps 
that was the problem last night?  The console was all black-screen (but the 
monitor was NOT asleep or in a power saving mode) when I tried to revive it 
this morning.  Crossing my fingers.


Comment 6 Trevor Cordes 2001-07-06 16:29:39 UTC
Oops.  I got another NMI error, this time reason 20.  Seems to happen after I 
leave the machine for about an hour, so it's probably PM related.  Strange this 
still occurs after I disabled all PM in the BIOS.
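
For reference, a rough sketch of how the two reason codes in these messages could be decoded. It assumes the number the kernel prints is the hexadecimal value read from system control port 0x61, where bit 7 flags a memory parity/SERR error and bit 6 an I/O channel check. The function name and the bit labels are illustrative, not taken from the kernel source.

```python
# Sketch: decode the "reason" byte from "NMI received for unknown reason XX".
# Assumption: XX is the hex value of system control port B (0x61) at NMI time.

NMI_BITS = {
    0x80: "memory parity / SERR error",
    0x40: "I/O channel check (IOCHK)",
    0x20: "timer 2 output",
    0x10: "refresh toggle",
}


def decode_nmi_reason(reason_hex: str) -> dict:
    """Return the raw value, the set status bits, and whether a known
    NMI source (bit 7 or bit 6) is flagged."""
    reason = int(reason_hex, 16)
    flags = [name for bit, name in NMI_BITS.items() if reason & bit]
    known_source = bool(reason & 0xC0)
    return {"value": reason, "flags": flags, "known_source": known_source}


if __name__ == "__main__":
    # The two codes reported in Comments 5 and 6.
    for code in ("30", "20"):
        print(code, decode_nmi_reason(code))
```

Under that assumption, neither 30 nor 20 has bit 7 or bit 6 set, which would be why the kernel calls the reason "unknown". The code by itself does not confirm or rule out power management as the source.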

I'll go check out the other NMI bug reports.

BTW, the system still hasn't crashed, and it's been at load 5+ for 2 hours now.


Comment 7 Vibol Hou 2001-07-06 19:59:56 UTC
Do you know if that fix in 2.4.3-12 fixes the PCI bugs or does it just fix the 
IDE issue?

Comment 8 Arjan van de Ven 2001-07-06 20:03:04 UTC
It is the workaround for the PCI issues. However, VIA isn't totally open about
this and there are signals (see an average day of the linux kernel list) that it 
isn't a full workaround. If only promise gave more info.....

Comment 9 Vibol Hou 2001-07-06 20:23:17 UTC
I should have asked this also, but is this PCI fixup also included in the 
upstream kernel releases from kernel.org?  I've tried both RH and stock SMP 
kernels from kernel.org and both hardlock in SMP mode.

It is possible VIA fixed their issues with the chipset used in the v2 MSI board 
that tcordes pointed out as working.

Comment 10 Arjan van de Ven 2001-07-06 20:28:53 UTC
2.4.5 and later from kernel.org have the identical fix

Comment 11 Trevor Cordes 2001-07-07 02:53:08 UTC
Well, my system made it through a whole day of pretty rough tests without any 
problems.  Perhaps the freeze I had was caused by the thunderstorm here last 
night.  I don't have this box on a UPS yet.

Reading the other notes about VIA's south bridge, it would seem I also have the 
buggy version "B".  Unless this board uses a newer rev under "B".

For reference: my board is the MSI 6321-IR v2 VIA 694D, and the BIOS level is 
5.0 (the newest).  I'm running the AMI Megaraid 500 successfully now with RAID 
5.  The video is ATI Xpert 98.  The ethernet is Dlink DFE-538TX.  The RAM is 
ECC Kingston PC133.


Comment 12 Vibol Hou 2001-07-07 21:01:39 UTC
I was able to access the system yesterday and found that removing one 256MB 
DIMM from the last memory slot made the immediate crashes go away.  However, 
the system now locks up intermittently in SMP mode with some error messages.  
The following oops and error message were produced on the 2.4.5-0.4smp RH 
kernel from rawhide.  The Machine Check Exception hardlocked the system.  The 
oops occurred after a few hours of uptime following a cold reboot from the 
hardlock on the same kernel.

CPU 1: Machine Check Exception: 0000000000000004
Bank 1: b200000000000115
Kernel panic: CPU context corrupt

Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...
delta kernel: CPU 1: Machine Check Exception: 0000000000000004

Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...                  
delta kernel: Bank 1: b200000000000115                                      
                                                                            
Message from syslogd@delta at Sat Jul  7 13:18:36 2001 ...                  
delta kernel: Kernel panic: CPU context corrupt

The oops (run through ksymoops; very messy):

Unable to handle kernel paging request at virtual address fffffff6
c01165d0
*pde = 00004063
Oops: 0000
CPU:    1
EIP:    0010:[<c01165d0>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010003
eax: 00000002   ebx: e1651fc4   ecx: 00000001   edx: 00000002
esi: c1cd2000   edi: be750000   ebp: e1653e70   esp: e1653e44
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 790, stackpage=e1653000)
Stack: e1652000 00000002 00000002 e1652000 fffffc18 00000001 e1652000 c02b8460 
       d108cda0 e1652000 e1652000 00000002 c0139416 d108cda0 00000000 00000000 
       00000000 00000000 e1652000 d108cdec d108cdec 00000000 00001000 00000400 
Call Trace: [<c0139416>] [<c013bb59>] [<f087fb6b>] [<c013c382>] [<f0863850>] 
   [<c012c993>] [<f0863850>] [<c01cb80c>] [<c0138a4e>] [<c011412c>] [<c0108c65> 
   [<c010718b>] 
Code: 8b 7a f4 8d 5a c4 85 ff 75 6c 8b 45 dc 85 42 fc 74 64 8b 42 

>>EIP; c01165d0 <schedule+150/550>   <=====
Trace; c0139416 <__wait_on_buffer+76/a0>
Trace; c013bb59 <__block_prepare_write+219/2b0>
Trace; f087fb6b <END_OF_CODE+3054a67b/????>
Trace; c013c382 <block_prepare_write+22/70>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c012c993 <generic_file_write+3c3/620>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c01cb80c <kfree_skbmem+c/70>
Trace; c0138a4e <sys_pwrite+ae/f0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0108c65 <do_IRQ+e5/f0>
Trace; c010718b <system_call+33/38>
Code;  c01165d0 <schedule+150/550>
00000000 <_EIP>:
Code;  c01165d0 <schedule+150/550>   <=====
   0:   8b 7a f4                  mov    0xfffffff4(%edx),%edi   <=====
Code;  c01165d3 <schedule+153/550>
   3:   8d 5a c4                  lea    0xffffffc4(%edx),%ebx
Code;  c01165d6 <schedule+156/550>
   6:   85 ff                     test   %edi,%edi
Code;  c01165d8 <schedule+158/550>
   8:   75 6c                     jne    76 <_EIP+0x76> c0116646 <schedule+1c6/550>
Code;  c01165da <schedule+15a/550>
   a:   8b 45 dc                  mov    0xffffffdc(%ebp),%eax
Code;  c01165dd <schedule+15d/550>
   d:   85 42 fc                  test   %eax,0xfffffffc(%edx)
Code;  c01165e0 <schedule+160/550>
  10:   74 64                     je     76 <_EIP+0x76> c0116646 <schedule+1c6/550>
Code;  c01165e2 <schedule+162/550>
  12:   8b 42 00                  mov    0x0(%edx),%eax

 NMI Watchdog detected LOCKUP on CPU1, registers:
CPU:    1
EIP:    0010:[<c021c67c>]
EFLAGS: 00000082
eax: 00000000   ebx: e37dbf64   ecx: e37da000   edx: 00000001
esi: 00000046   edi: c025cdc0   ebp: e1653d0c   esp: e1653cf4
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 790, stackpage=e1653000)
Stack: 00000001 00000086 00000001 c02de5a1 00000046 00000001 00000000 c0119558 
       0000000f c0228863 e1653e10 c01157b0 c022bcc7 00000000 c0107704 00000000 
       c022d592 c0226fed c0228863 00000000 00000001 00004000 c0234020 e1653e10 
Call Trace: [<c0119558>] [<c01157b0>] [<c0107704>] [<c0115c38>] [<c0186cd9>] 
   [<c01873cc>] [<f082cd79>] [<c0115890>] [<c01072d0>] [<f0810018>] [<c01165d0> 
   [<c0139416>] [<c013bb59>] [<f087fb6b>] [<c013c382>] [<f0863850>] [<c012c993> 
   [<f0863850>] [<c01cb80c>] [<c0138a4e>] [<c011412c>] [<c0108c65>] [<c010718b> 
Code: 80 3d 00 84 2b c0 00 f3 90 7e f5 e9 83 a3 ef ff 80 38 00 f3 

>>EIP; c021c67c <stext_lock+680/6451>   <=====
Trace; c0119558 <printk+148/160>
Trace; c01157b0 <bust_spinlocks+50/60>
Trace; c0107704 <die+54/70>
Trace; c0115c38 <do_page_fault+3a8/4b0>
Trace; c0186cd9 <req_new_io+49/60>
Trace; c01873cc <__make_request+4dc/6c0>
Trace; f082cd79 <END_OF_CODE+304f7889/????>
Trace; c0115890 <do_page_fault+0/4b0>
Trace; c01072d0 <error_code+38/40>
Trace; f0810018 <END_OF_CODE+304dab28/????>
Trace; c01165d0 <schedule+150/550>
Trace; c0139416 <__wait_on_buffer+76/a0>
Trace; c013bb59 <__block_prepare_write+219/2b0>
Trace; f087fb6b <END_OF_CODE+3054a67b/????>
Trace; c013c382 <block_prepare_write+22/70>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c012c993 <generic_file_write+3c3/620>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c01cb80c <kfree_skbmem+c/70>
Trace; c0138a4e <sys_pwrite+ae/f0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0108c65 <do_IRQ+e5/f0>
Trace; c010718b <system_call+33/38>
Code;  c021c67c <stext_lock+680/6451>
00000000 <_EIP>:
Code;  c021c67c <stext_lock+680/6451>   <=====
   0:   80 3d 00 84 2b c0 00      cmpb   $0x0,0xc02b8400   <=====
Code;  c021c683 <stext_lock+687/6451>
   7:   f3 90                     repz nop 
Code;  c021c685 <stext_lock+689/6451>
   9:   7e f5                     jle    0 <_EIP>
Code;  c021c687 <stext_lock+68b/6451>
   b:   e9 83 a3 ef ff            jmp    ffefa393 <_EIP+0xffefa393> c0116a0f <__wake_up+3f/c0>
Code;  c021c68c <stext_lock+690/6451>
  10:   80 38 00                  cmpb   $0x0,(%eax)
Code;  c021c68f <stext_lock+693/6451>
  13:   f3 00 00                  repz add %al,(%eax)

 NMI Watchdog detected LOCKUP on CPU0, registers:
 NMI Watchdog detected LOCKUP on CPU1, registers:
CPU:    1
EIP:    0010:[<c021cdb6>]
EFLAGS: 00000082
eax: 00000000   ebx: e27c4000   ecx: 00000000   edx: c1cf1c00
esi: 00000021   edi: 00000000   ebp: 00000000   esp: e1653b74
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 790, stackpage=e1653000)
Stack: e27c4000 00000021 00000086 c0121d86 00000021 e1653bb0 e27c4000 e1652000 
       00000021 e1652000 0000000b c012226a 00000021 e1653bb0 e27c4000 00000021 
       00000000 00040002 00000316 00000064 0000000b 000016e1 000005cb 00000000 
Call Trace: [<c0121d86>] [<c012226a>] [<c011412c>] [<c0107356>] [<c011be75>] 
   [<c011c1cc>] [<c01195fe>] [<c0119558>] [<c011450d>] [<c0107356>] [<c021c67c> 
   [<c0119558>] [<c01157b0>] [<c0107704>] [<c0115c38>] [<c0186cd9>] [<c01873cc> 
   [<f082cd79>] [<c0115890>] [<c01072d0>] [<f0810018>] [<c01165d0>] [<c0139416> 
   [<c013bb59>] [<f087fb6b>] [<c013c382>] [<f0863850>] [<c012c993>] [<f0863850> 
   [<c01cb80c>] [<c0138a4e>] [<c011412c>] [<c0108c65>] [<c010718b>] 
Code: 80 3d 00 84 2b c0 00 f3 90 7e f5 e9 fa 4e f0 ff 80 bb 38 06 

>>EIP; c021cdb6 <stext_lock+dba/6451>   <=====
Trace; c0121d86 <send_sig_info+86/b0>
Trace; c012226a <do_notify_parent+9a/b0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0107356 <nmi+1e/38>
Trace; c011be75 <exit_notify+195/2c0>
Trace; c011c1cc <do_exit+22c/240>
Trace; c01195fe <release_console_sem+4e/a0>
Trace; c0119558 <printk+148/160>
Trace; c011450d <nmi_watchdog_tick+8d/e0>
Trace; c0107356 <nmi+1e/38>
Trace; c021c67c <stext_lock+680/6451>
Trace; c0119558 <printk+148/160>
Trace; c01157b0 <bust_spinlocks+50/60>
Trace; c0107704 <die+54/70>
Trace; c0115c38 <do_page_fault+3a8/4b0>
Trace; c0186cd9 <req_new_io+49/60>
Trace; c01873cc <__make_request+4dc/6c0>
Trace; f082cd79 <END_OF_CODE+304f7889/????>
Trace; c0115890 <do_page_fault+0/4b0>
Trace; c01072d0 <error_code+38/40>
Trace; f0810018 <END_OF_CODE+304dab28/????>
Trace; c01165d0 <schedule+150/550>
Trace; c0139416 <__wait_on_buffer+76/a0>
Trace; c013bb59 <__block_prepare_write+219/2b0>
Trace; f087fb6b <END_OF_CODE+3054a67b/????>
Trace; c013c382 <block_prepare_write+22/70>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c012c993 <generic_file_write+3c3/620>
Trace; f0863850 <END_OF_CODE+3052e360/????>
Trace; c01cb80c <kfree_skbmem+c/70>
Trace; c0138a4e <sys_pwrite+ae/f0>
Trace; c011412c <smp_apic_timer_interrupt+ec/100>
Trace; c0108c65 <do_IRQ+e5/f0>
Trace; c010718b <system_call+33/38>
Code;  c021cdb6 <stext_lock+dba/6451>
00000000 <_EIP>:
Code;  c021cdb6 <stext_lock+dba/6451>   <=====
   0:   80 3d 00 84 2b c0 00      cmpb   $0x0,0xc02b8400   <=====
Code;  c021cdbd <stext_lock+dc1/6451>
   7:   f3 90                     repz nop 
Code;  c021cdbf <stext_lock+dc3/6451>
   9:   7e f5                     jle    0 <_EIP>
Code;  c021cdc1 <stext_lock+dc5/6451>
   b:   e9 fa 4e f0 ff            jmp    fff04f0a <_EIP+0xfff04f0a> c0121cc0 <deliver_signal+50/90>
Code;  c021cdc6 <stext_lock+dca/6451>
  10:   80 bb 38 06 00 00 00      cmpb   $0x0,0x638(%ebx)



Comment 13 Arjan van de Ven 2001-07-07 21:10:18 UTC
A machine check exception is the cpu telling you hardware selftests failed.
That's BAD!
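
A hedged sketch of how the two values in the Machine Check message from Comment 12 could be picked apart. The bit positions are those of the IA-32 machine check architecture as best I recall them and should be checked against the Intel manuals. The dictionary labels and function names are illustrative only.

```python
# Sketch: decode the flag bits of the machine check values from Comment 12.
# Assumption: the first number is the global status (MCG_STATUS) and the
# "Bank 1" number is that bank's status register (MCi_STATUS).

MCG_STATUS_BITS = {
    0: "RIPV (restart IP valid)",
    1: "EIPV (error IP valid)",
    2: "MCIP (machine check in progress)",
}

MCI_STATUS_BITS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def set_flags(value: int, names: dict) -> list:
    """List the labels of all set bits, highest bit first."""
    return [label for bit, label in sorted(names.items(), reverse=True)
            if (value >> bit) & 1]


def decode_bank_status(status_hex: str) -> dict:
    """Split a bank status value into flags and the two 16-bit code fields."""
    status = int(status_hex, 16)
    return {
        "flags": set_flags(status, MCI_STATUS_BITS),
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    print(set_flags(int("0000000000000004", 16), MCG_STATUS_BITS))
    print(decode_bank_status("b200000000000115"))
```

Read this way, the bank status has the valid, uncorrected, enabled and context-corrupt bits set, which lines up with the "CPU context corrupt" panic. The low 16 bits (0x0115) are the error code proper and are left undecoded here.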


Comment 14 Vibol Hou 2001-07-07 21:20:32 UTC
The server runs fine for a few hours, but then the machine check exception 
occurs.  What could be contributing to this failure?  Could it be the 
motherboard itself?

Comment 15 Arjan van de Ven 2001-07-07 21:22:37 UTC
If it's after a few hours it sounds like a temperature thing..... Check that
all fans are turning and that the airflow isn't blocked by cables.

Comment 16 Vibol Hou 2001-07-09 23:01:36 UTC
I've installed lm_sensors and the temperature looks fine (~28-32 °C for both 
CPUs).  However, I am running 1GHz CPUs which Intel docs say require at least 
1.70v (this was found after a bit of research into decoding the machine 
exception code).  The 1st CPU is running at 1.70v, but the 2nd CPU is running 
at 1.65v.  It looks like this might have been the culprit all this time.
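
The comparison above can be sketched as a small check. The readings and the 1.70 V minimum are the figures quoted in this comment. The function name, the CPU labels and the tolerance parameter are hypothetical, and real sensor readings are coarse enough that a small shortfall may just be measurement error.

```python
# Sketch: flag CPUs whose core voltage reads below the required minimum.
# The readings and the 1.70 V minimum are the figures quoted in Comment 16;
# everything else (names, tolerance) is an assumption for illustration.

def undervolted(readings: dict, minimum: float, tolerance: float = 0.0) -> list:
    """Return the CPUs whose reading is below (minimum - tolerance)."""
    return sorted(cpu for cpu, volts in readings.items()
                  if volts < minimum - tolerance)


if __name__ == "__main__":
    vcore = {"CPU 1": 1.70, "CPU 2": 1.65}
    print(undervolted(vcore, minimum=1.70))
```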

Comment 17 Need Real Name 2001-07-31 00:29:01 UTC
Dear People,

  Sorry to interrupt, but I was wondering if you could help me out.

I have MSI 694DPRO AR motherboard with raided 2+0 ide hard drives and an 
additional quantum hard drive.

the raided hard drives are ntfs. the quantum is fat32.

The bios is the latest on the board.

I have a problem with installation and was wondering if you guys who succeeded 
in installing Linux might tell me why the installation hangs after it 
detects the hard disks?

Please help thank you.

smtan.au

Warmest regards,
Su Min Tan

Comment 18 Trevor Cordes 2002-07-05 07:22:19 UTC
UPDATE: the machine I talked about in my earlier posts ran fine from that time
up until about 3 months ago, with only 1 hard crash that I can remember.  Now
all of a sudden I've had 3 hard crashes in 2-3 months.  I'm suspecting the 2(?)
up2date kernel updates I've done since then, as nothing else has changed on the
system.

It's running kernel-smp-2.4.9-34 now, but it did crash also under
kernel-smp-2.4.9-31.  As I said, for all of 2001 and some of 2002 it ran with
only 1 crash on the older kernel.

Were there any changes recently in the kernel that would affect these dual VIA
boards?

Would I gain anything by upgrading to RH7.3?  (We're still at 7.1 on the
production boxes.)

Note: the crashes were in our peak hours (it's a web server) so it may still be
some load-dependent crash.  But our peak load isn't much above average, so it's
not like it's being hammered.


Comment 19 Jeffrey Moss 2002-09-06 22:09:01 UTC
I have the same problem with the 694DP motherboard. It will crash within 30 
minutes if there is over 512 megs of ram, and it is doing something CPU 
intensive (a few hours otherwise). If I put a gig of ram in it, and load a huge 
file into emacs, it will lock up with the same paging errors you guys are 
getting, but with 2 sticks of 256meg ram (total of 512) it works ok. It's only 
when I add the third stick that it crashes. I'm using kernel 
v.2.4.7-10enterprise #1 SMP

Anyone found a work-around/fix for this?

Comment 20 Arjan van de Ven 2002-09-06 22:12:52 UTC
Jeff: 1) why are you using the enterprise kernel with only a gig
of RAM, and 2) have you tried upgrading to 2.4.9-34?

Comment 21 Trevor Cordes 2002-09-07 04:54:41 UTC
Don't ever use the third DIMM slot!  VIA is known to have problems with the 3rd
slot.  Reportedly it works with single-sided DIMMs only, but I wouldn't even
trust that!  Just use the first 2 slots.  I'm seriously tempted to replace all
694D boards with Server Works-based...

See my other bug https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=71204


Comment 22 Jeffrey Moss 2002-09-09 21:13:05 UTC
I looked on ebay for a replacement motherboard, and all the dual p3 
motherboards I looked at listed a maximum of 512M ram. It must be some 
limitation of the P3 chips; I guess 512M will have to do. I wonder if it 
would support a gig with 2 512M dimms. Anybody tried this?

Comment 23 Trevor Cordes 2002-09-10 00:39:01 UTC
The only dual boards I have found that will support 133MHz bus coppermines are
the VIA-based and ServerWorks.  The ServerWorks-based boards are 4 to 6 times as
expensive as VIA boards.  If my latest update to our systems doesn't stop the
crashing, we will be going with SW-based.

Both the VIA and SW chipsets should handle at least 2 x 512MB DIMMs for 1GB of
RAM.  Most of the SW boards support registered DIMMs or have more banks, for a
total of 2GB or maybe even 4GB with 1GB DIMMs.

I personally have run 512MB DIMMs on VIA dual boards with no problem.  Just
don't try to use the 3rd slot (if any)!!


