177951 – kernel 2.6.15-1.185*_FC5 eats my filesystem

Summary: kernel 2.6.15-1.185*_FC5 eats my filesystem

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeff Garzik
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	FC5Blocker FCMETA_SATA

Reported:	2006-01-16 19:26 UTC by Nicolas Mailhot
Modified:	2013-07-03 02:26 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-02-03 13:18:56 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
lspci (22.05 KB, text/plain) 2006-01-16 19:26 UTC, Nicolas Mailhot
/var/log/dmesg with working kernel (27.34 KB, text/plain) 2006-01-16 19:38 UTC, Nicolas Mailhot
mdadm for /dev/md0 (716 bytes, text/plain) 2006-01-16 19:40 UTC, Nicolas Mailhot
mdadm for /dev/md1 (715 bytes, text/plain) 2006-01-16 19:41 UTC, Nicolas Mailhot
lvm info (998 bytes, text/plain) 2006-01-16 19:42 UTC, Nicolas Mailhot
lsmod on working system (2.90 KB, text/plain) 2006-01-16 19:43 UTC, Nicolas Mailhot
dmesg for one problem kernel (kernel-2.6.15-1.1859_FC5) (34.20 KB, text/plain) 2006-01-17 23:18 UTC, Nicolas Mailhot
smart info for sda (5.19 KB, text/plain) 2006-01-24 07:30 UTC, Nicolas Mailhot
smart info for sdb (5.16 KB, text/plain) 2006-01-24 07:31 UTC, Nicolas Mailhot
Simple patch to disable fua (524 bytes, patch) 2006-01-27 22:46 UTC, Nicolas Mailhot
Fua blacklisting (1.38 KB, patch) 2006-01-31 22:38 UTC, Nicolas Mailhot
dmesg for kernel patched with patch #123940 (21.68 KB, text/plain) 2006-01-31 22:41 UTC, Nicolas Mailhot

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	5914	0	None	None	None	Never

Description Nicolas Mailhot 2006-01-16 19:26:27 UTC

Description of problem:

After 6 days of uptime I decided to try the latest rawhide kernel
Result -> instant corruption (it starts by refusing to use some raid array
members, then barfs about ATA, and more info may have ended in the logs except
they were eaten by the last attempted boot)

My current kernel works fine (after cleaning up the mess)
It's kernel-2.6.15-1.1819_FC5.nim (meaning built from the 2.6.15-1.1819 srpm
with latest v4l patched in, about at the time 2.6.15-1.1819 was released)

Last changelog says :
* mar jan 03 2006 Dave Jones <davej>
- Silence some gcc4.1 warnings.

I don't really have all the intermediate kernels here to test and I have little
wish to play Russian roulette till an important file is nuked, so if you could
fix this without more testing on my part that would be great ;)

This is an x86_64 raid + lvm system

Version-Release number of selected component (if applicable):

kernel-2.6.15-1.1857_FC5 is bad bad bad
as is the previous one (I think), except I didn't remember to note its number and my
system logs are a mess

How reproducible:
Always (but I won't again)

Steps to Reproduce:
1. boot on rawhide kernel
2. watch the error messages scroll by
3. reboot under trusty kernel, get dumped in the "filesystem b0rked" admin
rescue prompt

Comment 1 Nicolas Mailhot 2006-01-16 19:26:27 UTC

Created attachment 123251 [details]
lspci

Comment 2 Nicolas Mailhot 2006-01-16 19:38:57 UTC

Created attachment 123252 [details]
/var/log/dmesg with working kernel

Comment 3 Nicolas Mailhot 2006-01-16 19:40:07 UTC

Created attachment 123253 [details]
mdadm for /dev/md0

Comment 4 Nicolas Mailhot 2006-01-16 19:41:07 UTC

Created attachment 123254 [details]
mdadm for /dev/md1

Comment 5 Nicolas Mailhot 2006-01-16 19:42:10 UTC

Created attachment 123255 [details]
lvm info

Comment 6 Nicolas Mailhot 2006-01-16 19:43:25 UTC

Created attachment 123256 [details]
lsmod on working system

Comment 7 Nicolas Mailhot 2006-01-17 23:18:54 UTC

Created attachment 123343 [details]
dmesg for one problem kernel (kernel-2.6.15-1.1859_FC5)

I hope this helps - this just cost me 2h of cleanup after the attempted boot
(single mode) corrupted the filesystem again

Comment 8 Dave Jones 2006-01-24 05:28:10 UTC

this really looks like a hardware problem. Either a bad cable, or worse, a dying
drive.  Those ata warnings are a really big sign..

"Unrecovered read error - auto reallocate failed"

Means it couldn't read a sector, and when it tried to reallocate it from the
spare pool, it couldn't, which usually means it's already reallocated a bunch of
sectors.

Looks like RMA time.

Comment 9 Nicolas Mailhot 2006-01-24 06:53:15 UTC

It may look like a dying drive but :
1. smart reports 0 error
2. the system is solid with 2.6.15 kernel, even after several days of I/O
3. the drives are new (ok weak point)
4. and anyway what's the probability for *two* new drives going bad at *exactly*
the same moment (being SATA BTW

Comment 10 Nicolas Mailhot 2006-01-24 06:54:25 UTC

It may look like a dying drive but :
1. smart reports 0 error
2. the system is solid when rebooted with 2.6.15 kernel, even after several days
of I/O
3. the drives are new (ok weak point)
4. and anyway what's the probability for *two* new drives going bad at *exactly*
the same moment (being SATA BTW they don't share cabling)
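
For anyone wanting to repeat the "smart reports 0 error" check in a scripted way, here is a minimal Python sketch. It assumes the text comes from `smartctl -A` in its usual ten-column attribute table layout; the sample text, the helper names and the choice of watched attributes are illustrative assumptions, not data from the drives in this bug.

```python
# Sketch only: parse `smartctl -A` style text and flag non-zero sector counters.
# SAMPLE is made-up illustrative input, not the attachment from this bug.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# Attributes that usually move when a drive starts remapping or losing sectors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def smart_raw_values(smartctl_output):
    """Return {attribute name: raw value} for rows whose raw value is a plain integer."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, ignore this row
    return values


def suspicious_attributes(smartctl_output):
    """Return the watched attributes whose raw counter is above zero."""
    raw = smart_raw_values(smartctl_output)
    return {name: raw[name] for name in WATCHED if raw.get(name, 0) > 0}


if __name__ == "__main__":
    print(suspicious_attributes(SAMPLE) or "no reallocated/pending sectors reported")
```

An empty result is consistent with a healthy drive but does not prove it; it only shows that the drive itself has not logged remapped or pending sectors.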

Comment 11 Nicolas Mailhot 2006-01-24 07:30:25 UTC

Created attachment 123604 [details]
smart info for sda

Comment 12 Nicolas Mailhot 2006-01-24 07:31:09 UTC

Created attachment 123605 [details]
smart info for sdb

Comment 13 Nicolas Mailhot 2006-01-24 20:19:18 UTC

Just let me know if you need more logs / test results

Comment 14 Nicolas Mailhot 2006-01-26 21:03:47 UTC

2.6.15-1.1872_FC5 patched to disable FUA (as suggested by Tejun Heo there :
http://marc.theaimsgroup.com/?l=linux-ide&m=113825474609128) boots fine

Comment 15 Dave Jones 2006-01-27 20:49:57 UTC

I've been unable to connect to marc.theaimsgroup.com for weeks, from multiple
locations around the world.  Can you attach that patch to the bugzilla please ?

Comment 16 Nicolas Mailhot 2006-01-27 22:43:48 UTC

Strange, it works fine there. You can find the whole thread on any other
linux-ide archive (Title is : regarding bug #5914 - fs corruption on SATA)

I'll attach the patch but it's very preliminary and useful mainly to check if
FUA is causing problems on a system (it short-circuits it). People are talking
about drive-specific FUA blacklisting now (but the fuller patch is not cooked yet)

Comment 17 Nicolas Mailhot 2006-01-27 22:46:27 UTC

Created attachment 123808 [details]
Simple patch to disable fua

Comment 18 Nicolas Mailhot 2006-01-31 22:38:50 UTC

Created attachment 123940 [details]
Fua blacklisting

The following (tested) patch implements fua drive blacklisting (specifically,
for my drive model). It was posted in the aforementioned thread
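
To make the idea concrete, here is a small user-space Python model of the decision logic, not the kernel patch itself. FUA (Force Unit Access) marks a write that must reach the media before it completes; blacklisting means refusing to use it on drives known to mishandle it, even though they advertise support. The `Drive` class, the function names and the blacklist entries are all hypothetical placeholders; the real model string is in the attached patch and is not reproduced here.

```python
# Sketch only: models "global FUA switch" plus "per-drive FUA blacklist".
from dataclasses import dataclass

# Hypothetical entries: (model string, firmware revision or None for any revision).
# These are placeholders, not the drive that was blacklisted upstream.
FUA_BLACKLIST = [
    ("EXAMPLE-DISK 1000", "FW1.0"),
    ("EXAMPLE-DISK 2000", None),
]


@dataclass(frozen=True)
class Drive:
    model: str
    firmware: str
    reports_fua: bool  # what the drive claims about FUA support


def fua_blacklisted(drive):
    """True if the drive matches a blacklist entry (exact model, optional firmware)."""
    for model, firmware in FUA_BLACKLIST:
        if drive.model.strip() == model and firmware in (None, drive.firmware.strip()):
            return True
    return False


def use_fua(drive, fua_enabled=True):
    """Decide whether FUA writes should be issued to this drive."""
    if not fua_enabled:
        return False  # global off switch, in the spirit of the simple test patch
    if not drive.reports_fua:
        return False  # drive does not advertise FUA at all
    return not fua_blacklisted(drive)  # advertised, but possibly known-bad


if __name__ == "__main__":
    print(use_fua(Drive("EXAMPLE-DISK 1000", "FW1.0", True)))
    print(use_fua(Drive("SOME OTHER DISK", "A1", True)))
```

The global switch corresponds to the earlier test patch, which is handy for checking whether FUA is the culprit on a given system; the per-drive list is the narrower fix that leaves FUA enabled for drives that handle it correctly.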

Comment 19 Nicolas Mailhot 2006-01-31 22:41:32 UTC

Created attachment 123941 [details]
dmesg for kernel patched with patch #123940

Comment 20 Nicolas Mailhot 2006-02-03 13:18:56 UTC

Closing as the blacklisting patch was merged into the latest upstream git snapshot
