Bug 177951

Summary: kernel 2.6.15-1.185*_FC5 eats my filesystem
Product: [Fedora] Fedora
Reporter: Nicolas Mailhot <nicolas.mailhot>
Component: kernel
Assignee: Jeff Garzik <jgarzik>
Status: CLOSED UPSTREAM
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: rawhide
CC: davej, dcantrell, dwmw2, k.georgiou, peterm, sundaram, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2006-02-03 13:18:56 UTC
Bug Blocks: 150222, 172490    
Attachments (flags: none on all):
lspci
/var/log/dmesg with working kernel
mdadm for /dev/md0
mdadm for /dev/md1
lvm info
lsmod on working system
dmesg for one problem kernel (kernel-2.6.15-1.1859_FC5)
smart info for sda
smart info for sdb
Simple patch to disable fua
Fua blacklisting
dmesg for kernel patched with patch #123940

Description Nicolas Mailhot 2006-01-16 19:26:27 UTC
Description of problem:

After 6 days of uptime I decided to try the latest rawhide kernel.
Result: instant corruption (it starts by refusing to use some RAID array
members, then barfs about ATA; more info may have ended up in the logs, except
they were eaten by the last attempted boot).

My current kernel works fine (after cleaning up the mess).
It's kernel-2.6.15-1.1819_FC5.nim (meaning built from the 2.6.15-1.1819 srpm
with the latest v4l patched in, at about the time 2.6.15-1.1819 was released).

The last changelog entry says:
* Tue Jan 03 2006 Dave Jones <davej>
- Silence some gcc4.1 warnings.

I don't really have all the intermediate kernels here to test, and I have little
wish to play Russian roulette till an important file is nuked, so if you could
fix this without more testing on my part that would be great ;)

This is an x86_64 RAID + LVM system.

Version-Release number of selected component (if applicable):

kernel-2.6.15-1.1857_FC5 is bad bad bad,
as is the previous one (I think), except I didn't remember to note its number and my
system logs are a mess

How reproducible:
Always (but I won't again)

Steps to Reproduce:
1. boot the rawhide kernel
2. watch the error messages scroll by
3. reboot under the trusty kernel, get dumped in the "filesystem b0rked" admin
rescue prompt

Comment 1 Nicolas Mailhot 2006-01-16 19:26:27 UTC
Created attachment 123251 [details]
lspci

Comment 2 Nicolas Mailhot 2006-01-16 19:38:57 UTC
Created attachment 123252 [details]
/var/log/dmesg with working kernel

Comment 3 Nicolas Mailhot 2006-01-16 19:40:07 UTC
Created attachment 123253 [details]
mdadm for /dev/md0

Comment 4 Nicolas Mailhot 2006-01-16 19:41:07 UTC
Created attachment 123254 [details]
mdadm for /dev/md1

Comment 5 Nicolas Mailhot 2006-01-16 19:42:10 UTC
Created attachment 123255 [details]
lvm info

Comment 6 Nicolas Mailhot 2006-01-16 19:43:25 UTC
Created attachment 123256 [details]
lsmod on working system

Comment 7 Nicolas Mailhot 2006-01-17 23:18:54 UTC
Created attachment 123343 [details]
dmesg for one problem kernel (kernel-2.6.15-1.1859_FC5)

I hope this helps - this just cost me 2h of cleanup after the attempted boot
(single mode) corrupted the filesystem again

Comment 8 Dave Jones 2006-01-24 05:28:10 UTC
This really looks like a hardware problem: either a bad cable or, worse, a dying
drive. Those ATA warnings are a really big sign.

"Unrecovered read error - auto reallocate failed"

means the drive couldn't read a sector, and when it tried to reallocate it from the
spare pool, it couldn't, which usually means it has already reallocated a bunch of
sectors.

Looks like RMA time.


Comment 10 Nicolas Mailhot 2006-01-24 06:54:25 UTC
It may look like a dying drive, but:
1. SMART reports 0 errors
2. the system is solid when rebooted with the 2.6.15 kernel, even after several days
of I/O
3. the drives are new (ok, weak point)
4. and anyway, what's the probability of *two* new drives going bad at *exactly*
the same moment (being SATA, BTW, they don't share cabling)?

Comment 11 Nicolas Mailhot 2006-01-24 07:30:25 UTC
Created attachment 123604 [details]
smart info for sda

Comment 12 Nicolas Mailhot 2006-01-24 07:31:09 UTC
Created attachment 123605 [details]
smart info for sdb

Comment 13 Nicolas Mailhot 2006-01-24 20:19:18 UTC
Just let me know if you need more logs / test results

Comment 14 Nicolas Mailhot 2006-01-26 21:03:47 UTC
2.6.15-1.1872_FC5 patched to disable FUA (as suggested by Tejun Heo here:
http://marc.theaimsgroup.com/?l=linux-ide&m=113825474609128) boots fine

Comment 15 Dave Jones 2006-01-27 20:49:57 UTC
I've been unable to connect to marc.theaimsgroup.com for weeks, from multiple
locations around the world. Can you attach that patch to the bugzilla, please?


Comment 16 Nicolas Mailhot 2006-01-27 22:43:48 UTC
Strange, it works fine here. You can find the whole thread on any other
linux-ide archive (the title is: regarding bug #5914 - fs corruption on SATA)

I'll attach the patch, but it's very preliminary and useful mainly to check whether
FUA is causing problems on a system (it short-circuits it). People are now talking
about drive-specific FUA blacklisting (but the fuller patch is not cooked yet)



Comment 17 Nicolas Mailhot 2006-01-27 22:46:27 UTC
Created attachment 123808 [details]
Simple patch to disable fua

Comment 18 Nicolas Mailhot 2006-01-31 22:38:50 UTC
Created attachment 123940 [details]
Fua blacklisting

The following (tested) patch implements FUA drive blacklisting (specifically,
for my drive model). It was posted in the aforementioned thread.
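
For readers who have not seen the attachment, the idea behind drive-specific
blacklisting can be sketched as follows. FUA (Force Unit Access) is a write flag
that asks the drive to commit data to the media before reporting completion.
This is only an illustration of the lookup logic in Python, not the kernel code
from attachment 123940: the model strings, the trailing-"*" prefix convention and
the function names are all assumptions made for the example.

```python
# Hypothetical model strings, for illustration only. The real entries
# are in the attached patch and are not reproduced here.
FUA_BLACKLIST = ("EXAMPLE DISK 1000", "EXAMPLE DISK 2*")


def model_matches(pattern: str, model: str) -> bool:
    """Exact match, or prefix match when the pattern ends in '*' (assumed convention)."""
    model = model.strip()  # identify strings are often space-padded
    if pattern.endswith("*"):
        return model.startswith(pattern[:-1])
    return model == pattern


def fua_allowed(model: str, drive_reports_fua: bool) -> bool:
    """Use FUA only if the drive advertises it and its model is not blacklisted."""
    if not drive_reports_fua:
        return False
    return not any(model_matches(p, model) for p in FUA_BLACKLIST)
```

Compared with the earlier patch that short-circuits FUA for every drive, a
lookup like this only turns it off for models known to misbehave and leaves it
enabled elsewhere.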

Comment 19 Nicolas Mailhot 2006-01-31 22:41:32 UTC
Created attachment 123941 [details]
dmesg for kernel patched with patch #123940

Comment 20 Nicolas Mailhot 2006-02-03 13:18:56 UTC
Closing, as the blacklisting patch was merged into the latest upstream git snapshot.