From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)

Description of problem:
The system crashes approximately every 20 hours. It still responds to ping, but the console and the network services (telnet, ftp, ...) are no longer available. The console output shows that the kjournald process is what crashes, with the message "EIP is at journal_commit_transaction [kernel]". Only a reboot solves the problem (for another 20 hours...).

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Reboot and wait one day; the system crashes.
2.
3.

Actual Results:
I have already tried to:
- update to 2.4.19 and 2.4-pre20 => same bug
- deactivate hyperthreading so that only two CPUs are visible (my 2 Xeon 1.8 GHz on a DELL P2650) => no success, same bug
Since this morning I have been trying a uniprocessor (mono-proc) kernel; I'll know tomorrow whether that seems to solve the problem.

Additional info:

Here is the output on the console:

kernel bug at commit.c : 535 !
invalid operand : 0000
ide-cd cdrom nls-iso8859-1 nls-cp437 vfat fat eepro100 usb_ohci usbcore ide-disk
CPU : 2
EIP : 0010 [<c01753f4>] Not tainted
EFLAGS : 00010286
EIP is at journal_Commit_transaction [kernel] 0xcf4 (2.4.19-SG)
eax: 0000001c ebx: 0000000a ecx: c029aec0 edx: 00004c04
esi: f7312960 edi: f7b34a00 ebp: f7b4e0000 esp: f7b34fe78
ds: 0018 es: 0018 ss: 0018
Process kjournald (pid: 212, stackpage=f7b4f000)
Stack : c0213464 00000217 000dfef0 00000000 00000fdc f52dd024 00000000 f777c200
        f5acc2d0 00000df4 37363524 42413938 46454443 4a494847 f5286700 f5286680
        f5286700 f5931380 f5931780 f5931700 f5931680 f5931600 f5931580 f5738a80
Call Trace:
[<c0122a95>] update_process_times [kernel] 0x25
[<c01148d9>] smp_apic_timer_interrupt [kernel] 0xa9
[<c010a756>] do_IRQ [kernel] 0xc6
[<c01174e8>] schedule [kernel] 0x348
[<c0178053>] kjournald [kernel] 0x1a3
[<c0177e90>] do_IRQ [kernel] 0xa5
[<c0107a48>] commit_timeout [kernel] 0x0
[<c0107286>] kernel_thread [kernel] 0x26
[<c0177eb0>] kjournald [kernel] 0x0
Code: 0f 0b 5a 59 6a 04 8b 44 24 18 50 56 e8 4b ef ff ff 8d 47 48

Here is an lsmod:

[root@proxy]# lsmod
Module                  Size  Used by    Not tainted
eepro100               19504   2
usb_ohci               19040   0 (unused)
usbcore                71840   1 [usb_ohci]
ide-disk               11968   0
ide-probe-mod          10476   0
ide-mod                67584   0 [ide-disk ide-probe-mod]
aacraid                26536   7
sd_mod                 12736  14
scsi_mod              106784   2 [aacraid sd_mod]

Help appreciated; this system has to be stable by ... Friday!
FWIW, SMP problems with ext3 have been fixed in the 7.3 errata kernels. You really should update.
Do you mean that upgrading to 2.4.18-10 (RHBA-2002:085-11 and RHSA-2002:158-09) can solve this problem? If so, I don't understand why 2.4.19 and 2.4-pre20 did not solve it. Is that to be expected? In other words, do I have a chance of fixing it with 2.4.18-10?

Best regards,
The uniprocessor (mono-proc) kernel seems to be stable (2 days and 9 hours up without a crash). But since I need the performance, I would really like to test the stability of SMP. Can you help me find out:
- whether the 2.4.18-10 kernel really fixes it?
- if so, whether it is possible that 2.4.19 and 2.4-pre20 do not fix it?
Thanks very much for your help; this is getting very urgent!
I have the same problem on a dual-processor server (2 x 1.4 GHz PIII, SDRAM133, on a Tyan Thunder HEsl-T motherboard with a RAID-10 system of 4 x 18 GB internal disks). It doesn't happen as often, only about once every few days. I also have another identical server with the same Linux but with RAID-5 instead of RAID-10, and it has not had this problem. After kjournald crashed, my Oracle server was still alive (judging from the logs) but without network services. I think hardware hyperthreading is not what matters for this problem; the kernel's SMP support and the disk subsystem support code matter more.