175357 – writeback_inodes() oops.

Summary: writeback_inodes() oops.

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	Closed
Depends On:
Blocks:

Reported:	2005-12-09 11:26 UTC by Thierry Delmot
Modified:	2015-01-04 22:23 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-02-14 12:40:05 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Various informations on the system (4.00 KB, text/plain) 2005-12-09 11:26 UTC, Thierry Delmot

Description Thierry Delmot 2005-12-09 11:26:33 UTC

Description of the problem:
- AMD64-based system, ASUS A8V Deluxe motherboard, NVIDIA GeForce 5200 128 MB.
- Occurs randomly while the system is idle, i.e. nobody is logged in, so
usually at night.
- System booted into runlevel 3 (text-mode login screen).
- No X Window components are running (see previous point).
- Kernel crashes; the console is found filled with a kernel stack dump and the
machine is dead.
- Updated to the latest official FC4 kernel; no difference from the initial
kernel on the FC4 DVD:
[delmott@PCLinux2 ~]$ rpm -q kernel
kernel-2.6.14-1.1644_FC4

- Found in /var/log/messages:
Dec  8 21:06:07 PCLinux2 kernel: Unable to handle kernel paging request at
ffff80003ffa1108 RIP:
Dec  8 21:06:07 PCLinux2 kernel: <ffffffff80197754>{writeback_inodes+68}
Dec  8 21:06:07 PCLinux2 kernel: PGD 0
Dec  8 21:06:07 PCLinux2 kernel: Oops: 0000 [1]
Dec  8 21:06:07 PCLinux2 kernel: CPU 0
Dec  8 21:06:07 PCLinux2 kernel: Modules linked in: ipv6 parport_pc lp parport
autofs4 rfcomm l2cap bluetooth sunrpc pcmcia yenta_socket rsrc_nonstatic
pcmcia_core video button battery ac ohci1394 ieee1394 uhci_hcd ehci_hcd shpchp
i2c_viapro i2c_core snd_via82xx gameport snd_ac97_codec snd_ac97_bus
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss
snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd
soundcore skge sk98lin floppy ext3 jbd dm_mod sata_promise sata_via libata
sd_mod scsi_mod
Dec  8 21:06:07 PCLinux2 kernel: Pid: 148, comm: pdflush Not tainted
2.6.14-1.1644_FC4 #1
Dec  8 21:06:07 PCLinux2 kernel: RIP: 0010:[<ffffffff80197754>]
<ffffffff80197754>{writeback_inodes+68}
Dec  8 21:06:07 PCLinux2 kernel: RSP: 0018:ffff81003fa15e28  EFLAGS: 00010287
Dec  8 21:06:07 PCLinux2 kernel: RAX: ffff80003ffa1108 RBX: ffff80003ffa1000
RCX: 00000000ffffffff
Dec  8 21:06:07 PCLinux2 kernel: RDX: ffffffffffffffff RSI: 0000000000000246
RDI: ffff81003ed69000
Dec  8 21:06:07 PCLinux2 kernel: RBP: ffff81003ed69078 R08: 00000000000001f4
R09: 0000000000000003
Dec  8 21:06:07 PCLinux2 kernel: R10: 00000000000009f8 R11: 0000000000000000
R12: ffff81003fa15e48
Dec  8 21:06:07 PCLinux2 kernel: R13: ffffffff8015bcdd R14: 0000000000000206
R15: ffffffff80145ee0
Dec  8 21:06:07 PCLinux2 kernel: FS:  00002aaaaaac8900(0000)
GS:ffffffff804f8000(0000) knlGS:0000000000000000
Dec  8 21:06:07 PCLinux2 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Dec  8 21:06:07 PCLinux2 kernel: CR2: ffff80003ffa1108 CR3: 0000000039f51000
CR4: 00000000000006e0
Dec  8 21:06:07 PCLinux2 kernel: Process pdflush (pid: 148, threadinfo
ffff81003fa14000, task ffff81003f9c3800)
Dec  8 21:06:07 PCLinux2 kernel: Stack: 0000000000001065 0000000100426049
ffff810002141d38 ffffffff8015b4e5
Dec  8 21:06:07 PCLinux2 kernel:        0000000000000000 0000000000000000
ffff81003fa15eb0 0000000000000400
Dec  8 21:06:07 PCLinux2 kernel:        0000000000000000 0000000000000000
Dec  8 21:06:07 PCLinux2 kernel: Call Trace:<ffffffff8015b4e5>{wb_kupdate+283}
<ffffffff8015be0f>{pdflush+306}
Dec  8 21:06:07 PCLinux2 kernel:        <ffffffff8015b3ca>{wb_kupdate+0}
<ffffffff8014610c>{kthread+191}
Dec  8 21:06:07 PCLinux2 kernel:        <ffffffff8012e66b>{schedule_tail+66}
<ffffffff8010f20e>{child_rip+8}
Dec  8 21:06:07 PCLinux2 kernel:       
<ffffffff80145ee0>{keventd_create_kthread+0} <ffffffff8014604d>{kthread+0}
Dec  8 21:06:07 PCLinux2 kernel:        <ffffffff8010f206>{child_rip+0}
Dec  8 21:06:07 PCLinux2 kernel:
Dec  8 21:06:07 PCLinux2 kernel: Code: 48 3b 83 08 01 00 00 75 10 48 8d 83 18 01
00 00 48 3b 83 18
Dec  8 21:06:07 PCLinux2 kernel: RIP <ffffffff80197754>{writeback_inodes+68} RSP
<ffff81003fa15e28>
Dec  8 21:06:07 PCLinux2 kernel: CR2: ffff80003ffa1108

- See complete log of different system parameters in attachment


Version-Release number of selected component (if applicable):

Fedora Core 4 x86_64 with updated kernel: kernel-2.6.14-1.1644_FC4


How reproducible:
- Every day: leave the system powered up at night. It will reproduce before you
get back the day after.


Steps to Reproduce:
1. Leave the system alone, with all users logged out.
  
Actual results:
- Kernel stack trace


Expected results:
- System stays up for years!

Additional info:
- I will check whether I can update the motherboard BIOS.
- I am suspicious about ACPId. If the problem persists, I will compile my own
kernel with as many unnecessary options as possible stripped out. But I would
rather not have to customize the kernel for each hardware flavor.

Comment 1 Thierry Delmot 2005-12-09 11:26:33 UTC

Created attachment 122073
Various informations on the system

Comment 2 Dave Jones 2005-12-28 05:24:46 UTC

Can you try running memtest on this for a while? Oopses in this area are
usually either hit by quite a few people (which isn't the case here), or quite a
lot of the time they're the result of a bit error in a bad DIMM.

It'd be good to rule out bad hardware before digging deeper, as this is quite
puzzling at first sight.
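
The bit-error hypothesis can be checked against the oops itself. The faulting
address in CR2 (ffff80003ffa1108) starts with ffff8000, whereas the other kernel
data pointers in the register dump (RDI, RBP, RSP, R12) start with ffff8100. A
minimal sketch in Python, assuming the intended pointer was ffff81003ffa1108
(that value is a guess and does not appear in the log), shows how many bits
separate the two:

```python
def flipped_bits(observed: int, expected: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = observed ^ expected
    return [bit for bit in range(64) if (diff >> bit) & 1]


# CR2 as reported in the oops.
CR2_OBSERVED = 0xFFFF80003FFA1108
# ASSUMPTION: hypothetical intended pointer, chosen to match the ffff8100...
# prefix of the other pointers in the register dump. Not taken from the log.
ASSUMED_INTENDED = 0xFFFF81003FFA1108

bits = flipped_bits(CR2_OBSERVED, ASSUMED_INTENDED)
print(bits)  # -> [40]
```

A one-bit difference is consistent with a flipped bit, but does not by itself
prove one.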

Comment 3 Thierry Delmot 2006-01-19 15:17:06 UTC

First of all: thanks for your support.

Some news from the investigation on my side, for your information and records:

1) As you suggested, Memtest86+ v1.55 ran for 2 days on the faulty workstation
(called WS_1 for simplicity): errors were reported (3 in 48 h), but they were
randomly spread over the whole address space! (2 DIMMs are installed in WS_1.)

2) By chance, we have another workstation with very similar hardware (same
motherboard, same CPU, etc.). Let's call it WS_2. WS_2 has never shown any
problem.

3) DIMMs swapped between WS_1 and WS_2.

4) Memtest relaunched on WS_1 and WS_2. Although the errors were expected to
move from WS_1 to WS_2, THIS WAS NOT THE CASE! WS_1 with the new DIMMs was still
showing errors, though at a lower rate (1 error in 4 days). WS_2 had no problems
after 4 days, although it was fitted with the suspect DIMMs.

5) As confirmation: WS_1 (configured as described in point 3) was rebooted for
the Christmas and New Year period. WS_1 was dead when we returned to the office
on January 4th.

==> Conclusion at that point: it is highly probable that the problem is due to
faulty hardware. The DIMMs do not seem to be faulty. Our suspicions are
therefore narrowed down to the motherboard and the CPU.

Recently:
6) DIMMs swapped back to their original places. Memtest relaunched for one
weekend, confirming that WS_1 has memory errors and WS_2 does not.

7) Our investigation is now focusing on a timing problem occurring in the
chipset of the motherboard or on the front-side bus of the CPU (WS_1).

8) Today: CPUs swapped between WS_1 and WS_2. Wait and see.

I'll keep you updated.

Comment 4 Dave Jones 2006-01-20 01:34:36 UTC

Hmm, definitely starting to sound like a hardware problem,
especially as this is a code path that every user runs through every day, and
yours is the only report of an oops here.

Comment 5 Dave Jones 2006-02-03 05:21:20 UTC

This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.

Comment 6 Thierry Delmot 2006-02-14 12:40:05 UTC

Definitely a hardware problem, which has been solved by replacing the
motherboard:

- A week-long Memtest run shows that the errors always occur on data bit 7
of a random byte in the address space.
==> This sounds like a timing problem or a bad contact on the data bus.

==> The motherboard has been replaced; the rest of the hardware has been left
unchanged.
==> A subsequent 5-day Memtest run did not report any errors, and the system has
now been up and running fine for one week.
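
The bit position within a byte for a Memtest failure can be worked out by
XOR-ing the expected and actual values. A minimal Python sketch follows; the
sample (expected, actual) pairs are made up for illustration and are not taken
from the Memtest runs described here, and the function name is hypothetical.

```python
def bad_bits_in_byte(expected: int, actual: int, width_bytes: int = 4) -> set[int]:
    """Return the per-byte bit positions (0..7) at which the two words differ."""
    diff = expected ^ actual
    return {bit % 8 for bit in range(width_bytes * 8) if (diff >> bit) & 1}


# HYPOTHETICAL sample failures as (expected, actual) 32-bit words.
samples = [
    (0x00000000, 0x00008000),
    (0xFFFFFFFF, 0xFF7FFFFF),
    (0x12345678, 0x123456F8),
]

# Collect every per-byte bit position seen across all failures.
seen = set()
for expected, actual in samples:
    seen |= bad_bits_in_byte(expected, actual)

print(sorted(seen))  # -> [7]
```

If every failure maps to the same bit position regardless of address, a single
data line is a more likely suspect than an individual memory cell.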

Thanks again for the support and sorry for the disturbance.
