Bug 177951 - kernel 2.6.15-1.185*_FC5 eats my filesystem
Summary: kernel 2.6.15-1.185*_FC5 eats my filesystem
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeff Garzik
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: FC5Blocker FCMETA_SATA
Reported: 2006-01-16 19:26 UTC by Nicolas Mailhot
Modified: 2013-07-03 02:26 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-02-03 13:18:56 UTC
Type: ---
Embargoed:


Attachments
lspci (22.05 KB, text/plain)
2006-01-16 19:26 UTC, Nicolas Mailhot
/var/log/dmesg with working kernel (27.34 KB, text/plain)
2006-01-16 19:38 UTC, Nicolas Mailhot
mdadm for /dev/md0 (716 bytes, text/plain)
2006-01-16 19:40 UTC, Nicolas Mailhot
mdadm for /dev/md1 (715 bytes, text/plain)
2006-01-16 19:41 UTC, Nicolas Mailhot
lvm info (998 bytes, text/plain)
2006-01-16 19:42 UTC, Nicolas Mailhot
lsmod on working system (2.90 KB, text/plain)
2006-01-16 19:43 UTC, Nicolas Mailhot
dmesg for one problem kernel (kernel-2.6.15-1.1859_FC5) (34.20 KB, text/plain)
2006-01-17 23:18 UTC, Nicolas Mailhot
smart info for sda (5.19 KB, text/plain)
2006-01-24 07:30 UTC, Nicolas Mailhot
smart info for sdb (5.16 KB, text/plain)
2006-01-24 07:31 UTC, Nicolas Mailhot
Simple patch to disable fua (524 bytes, patch)
2006-01-27 22:46 UTC, Nicolas Mailhot
Fua blacklisting (1.38 KB, patch)
2006-01-31 22:38 UTC, Nicolas Mailhot
dmesg for kernel patched with patch #123940 (21.68 KB, text/plain)
2006-01-31 22:41 UTC, Nicolas Mailhot


Links
System        ID    Private  Priority  Status  Summary  Last Updated
Linux Kernel  5914  0        None      None    None     Never

Description Nicolas Mailhot 2006-01-16 19:26:27 UTC
Description of problem:

After 6 days of uptime I decided to try the latest rawhide kernel.
Result -> instant corruption (it starts by refusing to use some RAID array
members, then barfs about ATA; more info may have ended up in the logs, except
they were eaten by the last attempted boot).

My current kernel works fine (after cleaning up the mess).
It's kernel-2.6.15-1.1819_FC5.nim (meaning built from the 2.6.15-1.1819 srpm
with the latest v4l patched in, at about the time 2.6.15-1.1819 was released).

The last changelog entry says:
* Tue Jan 03 2006 Dave Jones <davej>
- Silence some gcc4.1 warnings.

I don't really have all the intermediate kernels here to test, and I have little
wish to play Russian roulette until an important file is nuked, so if you could
fix this without more testing on my part that would be great ;)

This is an x86_64 RAID + LVM system.

Version-Release number of selected component (if applicable):

kernel-2.6.15-1.1857_FC5 is bad bad bad,
as is the previous one (I think), except I didn't remember to note its number and my
system logs are a mess.

How reproducible:
Always (but I won't try again)

Steps to Reproduce:
1. Boot the rawhide kernel
2. Watch the error messages scroll by
3. Reboot under the trusty kernel, get dumped into the "filesystem b0rked" admin
rescue prompt

Comment 1 Nicolas Mailhot 2006-01-16 19:26:27 UTC
Created attachment 123251
lspci

Comment 2 Nicolas Mailhot 2006-01-16 19:38:57 UTC
Created attachment 123252
/var/log/dmesg with working kernel

Comment 3 Nicolas Mailhot 2006-01-16 19:40:07 UTC
Created attachment 123253
mdadm for /dev/md0

Comment 4 Nicolas Mailhot 2006-01-16 19:41:07 UTC
Created attachment 123254
mdadm for /dev/md1

Comment 5 Nicolas Mailhot 2006-01-16 19:42:10 UTC
Created attachment 123255
lvm info

Comment 6 Nicolas Mailhot 2006-01-16 19:43:25 UTC
Created attachment 123256
lsmod on working system

Comment 7 Nicolas Mailhot 2006-01-17 23:18:54 UTC
Created attachment 123343
dmesg for one problem kernel (kernel-2.6.15-1.1859_FC5)

I hope this helps. It just cost me 2h of cleanup after the attempted boot
(single-user mode) corrupted the filesystem again.

Comment 8 Dave Jones 2006-01-24 05:28:10 UTC
This really looks like a hardware problem: either a bad cable or, worse, a dying
drive. Those ATA warnings are a really big sign.

"Unrecovered read error - auto reallocate failed"

This means it couldn't read a sector, and when it tried to reallocate it from the
spare pool, it couldn't, which usually means it has already reallocated a bunch of
sectors.

Looks like RMA time.


Comment 9 Nicolas Mailhot 2006-01-24 06:53:15 UTC
It may look like a dying drive, but:
1. SMART reports 0 errors
2. the system is solid with the 2.6.15 kernel, even after several days of I/O
3. the drives are new (OK, weak point)
4. and anyway, what's the probability of *two* new drives going bad at *exactly*
the same moment (being SATA BTW

Comment 10 Nicolas Mailhot 2006-01-24 06:54:25 UTC
It may look like a dying drive, but:
1. SMART reports 0 errors
2. the system is solid when rebooted with the 2.6.15 kernel, even after several days
of I/O
3. the drives are new (OK, weak point)
4. and anyway, what's the probability of *two* new drives going bad at *exactly*
the same moment? (Being SATA, BTW, they don't share cabling.)

Comment 11 Nicolas Mailhot 2006-01-24 07:30:25 UTC
Created attachment 123604
smart info for sda

Comment 12 Nicolas Mailhot 2006-01-24 07:31:09 UTC
Created attachment 123605
smart info for sdb

Comment 13 Nicolas Mailhot 2006-01-24 20:19:18 UTC
Just let me know if you need more logs or test results.

Comment 14 Nicolas Mailhot 2006-01-26 21:03:47 UTC
2.6.15-1.1872_FC5 patched to disable FUA (as suggested by Tejun Heo here:
http://marc.theaimsgroup.com/?l=linux-ide&m=113825474609128) boots fine.

Comment 15 Dave Jones 2006-01-27 20:49:57 UTC
I've been unable to connect to marc.theaimsgroup.com for weeks, from multiple
locations around the world. Can you attach that patch to the bugzilla, please?


Comment 16 Nicolas Mailhot 2006-01-27 22:43:48 UTC
Strange, it works fine here. You can find the whole thread in any other
linux-ide archive (the title is: regarding bug #5914 - fs corruption on SATA).

I'll attach the patch, but it's very preliminary and useful mainly for checking whether
FUA is causing problems on a system (it short-circuits it). People are now talking
about drive-specific FUA blacklisting (but the fuller patch is not cooked yet).



Comment 17 Nicolas Mailhot 2006-01-27 22:46:27 UTC
Created attachment 123808
Simple patch to disable fua

Comment 18 Nicolas Mailhot 2006-01-31 22:38:50 UTC
Created attachment 123940
Fua blacklisting

The following (tested) patch implements FUA drive blacklisting (specifically,
for my drive model). It was posted in the aforementioned thread.
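
For readers unfamiliar with the idea, here is a minimal sketch of what model-based FUA blacklisting amounts to. It is not the attached patch (which is kernel C code and is not reproduced here) and only illustrates the decision logic. The model strings, the Drive type and the use_fua helper are all hypothetical names made up for this example.

```python
from dataclasses import dataclass

# Hypothetical model strings, for illustration only. The real blacklist
# entry lives in the attached patch and is not reproduced here.
FUA_BLACKLIST = ("EXAMPLE-DRIVE-1000", "EXAMPLE-DRIVE-2000")


@dataclass
class Drive:
    model: str        # model string as reported by the drive
    claims_fua: bool  # whether the drive advertises FUA support


def use_fua(drive: Drive) -> bool:
    """Return True only if the drive advertises FUA and is not blacklisted."""
    if not drive.claims_fua:
        return False
    # ATA model strings are space-padded, so compare the stripped form.
    return drive.model.strip() not in FUA_BLACKLIST
```

With this logic, a blacklisted drive is treated exactly like one that never advertised FUA, so the caller would presumably fall back to ordinary writes followed by a cache flush instead of issuing FUA writes.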

Comment 19 Nicolas Mailhot 2006-01-31 22:41:32 UTC
Created attachment 123941
dmesg for kernel patched with patch #123940

Comment 20 Nicolas Mailhot 2006-02-03 13:18:56 UTC
Closing, as the blacklisting patch was merged into the latest upstream git snapshot.

