Description of problem:
- AMD64-based system, ASUS A8V Deluxe motherboard, NVIDIA GeForce 5200 128 MB.
- Occurs randomly while the system is idle, i.e. nobody is logged in. So usually at night.
- System booted in runlevel 3 (text-mode login screen).
- No X Window related software running (see previous point).
- Kernel crash: the console is found filled with a kernel stack dump. The machine is dead.
- Updated to the latest official FC4 kernel; no difference from the initial kernel on the FC4 DVD distribution:

[delmott@PCLinux2 ~]$ rpm -q kernel
kernel-2.6.14-1.1644_FC4

- Found in /var/log/messages:

Dec 8 21:06:07 PCLinux2 kernel: Unable to handle kernel paging request at ffff80003ffa1108 RIP:
Dec 8 21:06:07 PCLinux2 kernel: <ffffffff80197754>{writeback_inodes+68}
Dec 8 21:06:07 PCLinux2 kernel: PGD 0
Dec 8 21:06:07 PCLinux2 kernel: Oops: 0000 [1]
Dec 8 21:06:07 PCLinux2 kernel: CPU 0
Dec 8 21:06:07 PCLinux2 kernel: Modules linked in: ipv6 parport_pc lp parport autofs4 rfcomm l2cap bluetooth sunrpc pcmcia yenta_socket rsrc_nonstatic pcmcia_core video button battery ac ohci1394 ieee1394 uhci_hcd ehci_hcd shpchp i2c_viapro i2c_core snd_via82xx gameport snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore skge sk98lin floppy ext3 jbd dm_mod sata_promise sata_via libata sd_mod scsi_mod
Dec 8 21:06:07 PCLinux2 kernel: Pid: 148, comm: pdflush Not tainted 2.6.14-1.1644_FC4 #1
Dec 8 21:06:07 PCLinux2 kernel: RIP: 0010:[<ffffffff80197754>] <ffffffff80197754>{writeback_inodes+68}
Dec 8 21:06:07 PCLinux2 kernel: RSP: 0018:ffff81003fa15e28 EFLAGS: 00010287
Dec 8 21:06:07 PCLinux2 kernel: RAX: ffff80003ffa1108 RBX: ffff80003ffa1000 RCX: 00000000ffffffff
Dec 8 21:06:07 PCLinux2 kernel: RDX: ffffffffffffffff RSI: 0000000000000246 RDI: ffff81003ed69000
Dec 8 21:06:07 PCLinux2 kernel: RBP: ffff81003ed69078 R08: 00000000000001f4 R09: 0000000000000003
Dec 8 21:06:07 PCLinux2 kernel: R10: 00000000000009f8 R11: 0000000000000000 R12: ffff81003fa15e48
Dec 8 21:06:07 PCLinux2 kernel: R13: ffffffff8015bcdd R14: 0000000000000206 R15: ffffffff80145ee0
Dec 8 21:06:07 PCLinux2 kernel: FS: 00002aaaaaac8900(0000) GS:ffffffff804f8000(0000) knlGS:0000000000000000
Dec 8 21:06:07 PCLinux2 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Dec 8 21:06:07 PCLinux2 kernel: CR2: ffff80003ffa1108 CR3: 0000000039f51000 CR4: 00000000000006e0
Dec 8 21:06:07 PCLinux2 kernel: Process pdflush (pid: 148, threadinfo ffff81003fa14000, task ffff81003f9c3800)
Dec 8 21:06:07 PCLinux2 kernel: Stack: 0000000000001065 0000000100426049 ffff810002141d38 ffffffff8015b4e5
Dec 8 21:06:07 PCLinux2 kernel: 0000000000000000 0000000000000000 ffff81003fa15eb0 0000000000000400
Dec 8 21:06:07 PCLinux2 kernel: 0000000000000000 0000000000000000
Dec 8 21:06:07 PCLinux2 kernel: Call Trace:<ffffffff8015b4e5>{wb_kupdate+283} <ffffffff8015be0f>{pdflush+306}
Dec 8 21:06:07 PCLinux2 kernel: <ffffffff8015b3ca>{wb_kupdate+0} <ffffffff8014610c>{kthread+191}
Dec 8 21:06:07 PCLinux2 kernel: <ffffffff8012e66b>{schedule_tail+66} <ffffffff8010f20e>{child_rip+8}
Dec 8 21:06:07 PCLinux2 kernel: <ffffffff80145ee0>{keventd_create_kthread+0} <ffffffff8014604d>{kthread+0}
Dec 8 21:06:07 PCLinux2 kernel: <ffffffff8010f206>{child_rip+0}
Dec 8 21:06:07 PCLinux2 kernel:
Dec 8 21:06:07 PCLinux2 kernel: Code: 48 3b 83 08 01 00 00 75 10 48 8d 83 18 01 00 00 48 3b 83 18
Dec 8 21:06:07 PCLinux2 kernel: RIP <ffffffff80197754>{writeback_inodes+68} RSP <ffff81003fa15e28>
Dec 8 21:06:07 PCLinux2 kernel: CR2: ffff80003ffa1108

- See the complete log of various system parameters in the attachment.

Version-Release number of selected component (if applicable):
Fedora Core 4 x86_64 with updated kernel: kernel-2.6.14-1.1644_FC4

How reproducible:
- Every day: leave the system powered up overnight. It will reproduce before you get back the next day.

Steps to Reproduce:
1. Leave the system alone, with all users logged out.

Actual results:
- Kernel stack trace.

Expected results:
- System stays up for years!

Additional info:
- I will check whether I can update the motherboard BIOS.
- I am suspicious about acpid. If the problem persists, I will recompile my own kernel, stripping out as many unnecessary options as possible. But I would rather not have to customize the kernel for each hardware flavor.
Created attachment 122073
Various information on the system
Can you try running memtest on this for a while? Oopses in this area are usually either hit by quite a few people (which isn't the case here), or, quite often, they're the result of a bit error in a bad DIMM. It'd be good to rule out bad hardware before digging deeper, as this is quite puzzling at first sight.
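As a rough illustration of why a bit error is a plausible suspect, the faulting base address in the oops (RBX, ffff80003ffa1000) can be compared with a pointer in the ffff8100... range, where the other kernel data pointers in the dump (RSP, RDI, RBP, R12) sit. This is only a sketch: the "intended" value below is an assumption made for illustration and is not confirmed by the log, and the helper name `differing_bits` is hypothetical.

```python
def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# RBX as reported in the oops; CR2 is this value plus the 0x108 offset.
faulting_base = 0xFFFF80003FFA1000

# ASSUMPTION: a hypothetical intended pointer in the same ffff8100... range
# as RSP/RDI/RBP/R12 in the dump. The log does not confirm this value.
assumed_intended = 0xFFFF81003FFA1000

print(differing_bits(faulting_base, assumed_intended))  # prints [40]
```

Under that assumption the two values differ in exactly one bit, which is the pattern a single flipped bit would produce. It does not prove a hardware fault on its own, which is why a memtest run is the sensible next step.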
First of all: thanks for your support.

Some news from the investigations on my side, for your information and your files:

1) As you suggested, Memtest86+ v1.55 ran for 2 days on the faulty workstation (called WS_1 for simplicity): errors were reported (3 in 48 h), but they were spread randomly over the whole address space! (2 DIMMs are installed in WS_1.)
2) By chance, we have another workstation with very similar hardware (same motherboard, same CPU, etc.). Let's call it WS_2. WS_2 has never shown any problem.
3) DIMMs swapped between WS_1 and WS_2.
4) Memtest relaunched on WS_1 and WS_2. Although the errors were expected to move from WS_1 to WS_2, THIS WAS NOT THE CASE! WS_1 with the new DIMMs was still showing errors, though less frequently (1 error in 4 days). WS_2 had no problem after 4 days, although it was equipped with the suspect DIMMs.
5) As confirmation: WS_1 (configured as described in point 3) was rebooted and left running over the Christmas/New Year period. It was dead when we returned to the office on January 4th.

==> Conclusion at that moment: it is highly probable that the problem is due to faulty hardware. The DIMMs do not seem to be faulty. Our suspicions are therefore narrowed down to the motherboard and the CPU.

Recently:

6) DIMMs swapped back to their original places. Memtest relaunched for one weekend, confirming that WS_1 has memory errors and WS_2 does not.
7) Our investigations are now focusing on a timing problem occurring in the motherboard chipset or on the front-side bus of the CPU (WS_1).
8) Today: CPUs swapped between WS_1 and WS_2... Wait and see.

I'll keep you updated.
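The reasoning behind the DIMM swap can be written down as a small elimination sketch: a component is a suspect only if it was present in every failing Memtest run and in no passing one. The labels `dimms_A` (the DIMMs originally in WS_1) and `dimms_B` (those originally in WS_2), and the function name `suspects`, are hypothetical names introduced here; the pass/fail outcomes are the ones reported in points 1, 4 and 6.

```python
def suspects(runs):
    """Components present in every failing run and absent from every passing run."""
    failing = [set(parts) for parts, had_errors in runs if had_errors]
    passing = [set(parts) for parts, had_errors in runs if not had_errors]
    if not failing:
        return set()
    in_all_failures = set.intersection(*failing)
    cleared = set().union(*passing)  # anything seen in a clean run is cleared
    return in_all_failures - cleared


# (components in the run, did Memtest report errors?)
runs = [
    (("WS_1", "dimms_A"), True),   # points 1 and 6: original configuration
    (("WS_1", "dimms_B"), True),   # point 4: WS_1 with WS_2's DIMMs
    (("WS_2", "dimms_A"), False),  # point 4: WS_2 with the suspect DIMMs
    (("WS_2", "dimms_B"), False),  # point 6: WS_2 in original configuration
]

print(suspects(runs))  # prints {'WS_1'}
```

Here "WS_1" stands for everything in that machine other than the DIMMs (motherboard, CPU, and so on), which is why the CPU swap in point 8 is the natural next elimination step.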
Hmm, definitely starting to sound like a hardware problem, especially as this is a code path that every user runs through every day, and yours is the only report of an oops here.
This is a mass-update to all currently open kernel bugs. A new kernel update has been released (Version: 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. Thank you.
Definitely a hardware problem, which has been solved by changing the motherboard:

- A one-week-long Memtest run shows that the errors always arise on data bit 7 of a random byte in the address space.
==> This sounds like a timing problem or a bad contact on the data bus.
==> The motherboard has been exchanged; all the rest of the hardware has been left unchanged.
==> A subsequent 5-day Memtest run did not report any errors, and the system has now been up and running fine for one week.

Thanks again for the support and sorry for the disturbance.
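For anyone repeating this kind of analysis, the "always bit 7 of some byte" pattern can be checked mechanically by XOR-ing the expected and observed words from each reported error and reducing every differing bit position modulo 8. This is a minimal sketch: the sample values are made up for illustration (they are not the actual Memtest output from WS_1), and the function name `failing_bit_lanes` is hypothetical.

```python
def failing_bit_lanes(expected, actual):
    """Return the sorted bit-within-byte positions (0-7) where the words differ."""
    diff = expected ^ actual
    return sorted({bit % 8 for bit in range(diff.bit_length()) if (diff >> bit) & 1})


# HYPOTHETICAL (expected, observed) pairs, chosen only to illustrate the pattern.
samples = [
    (0x00000000, 0x00008000),  # bit 15 differs: bit 7 of byte 1
    (0xFFFFFFFF, 0x7FFFFFFF),  # bit 31 differs: bit 7 of byte 3
    (0x55555555, 0x555555D5),  # bit 7 differs:  bit 7 of byte 0
]

lanes = set()
for expected, observed in samples:
    lanes.update(failing_bit_lanes(expected, observed))

print(sorted(lanes))  # prints [7]
```

If every error collapses onto the same bit-within-byte position while the failing addresses are scattered, a shared data line or its timing is a more likely culprit than individual memory cells, which matches the conclusion above.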