Description of problem:
After 6 days of uptime I decided to try the latest rawhide kernel.
Result -> instant corruption (it starts by refusing to use some RAID array members, then barfs about ATA; more info may have ended up in the logs, except they were eaten by the last attempted boot).

My current kernel works fine (after cleaning up the mess). It's kernel-2.6.15-1.1819_FC5.nim (meaning built from the 2.6.15-1.1819 srpm with the latest v4l patched in, at about the time 2.6.15-1.1819 was released).
Last changelog says:
* mar jan 03 2006 Dave Jones <davej>
- Silence some gcc4.1 warnings.

I don't really have all the intermediate kernels here to test, and I have little wish to play Russian roulette till an important file is nuked, so if you could fix this without more testing on my part that would be great ;)

This is an x86_64 RAID + LVM system.

Version-Release number of selected component (if applicable):
kernel-2.6.15-1.1857_FC5 is bad bad bad, as is the previous one (I think), except I didn't remember to note its number and my system logs are a mess.

How reproducible:
Always (but I won't again)

Steps to Reproduce:
1. boot on rawhide kernel
2. watch the error messages scroll by
3. reboot under trusty kernel, get dumped in the "filesystem b0rked" admin rescue prompt
Created attachment 123251 [details] lspci
Created attachment 123252 [details] /var/log/dmesg with working kernel
Created attachment 123253 [details] mdadm for /dev/md0
Created attachment 123254 [details] mdadm for /dev/md1
Created attachment 123255 [details] lvm info
Created attachment 123256 [details] lsmod on working system
Created attachment 123343 [details] dmesg for one problem kernel (kernel-2.6.15-1.1859_FC5) I hope this helps - this just cost me 2h of cleanup after the attempted boot (single mode) corrupted the filesystem again
This really looks like a hardware problem: either a bad cable or, worse, a dying drive. Those ATA warnings are a really big sign. "Unrecovered read error - auto reallocate failed" means it couldn't read a sector, and when it tried to reallocate it from the spare pool, it couldn't, which usually means it has already reallocated a bunch of sectors. Looks like RMA time.
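For anyone triaging a similar report, here is a rough sketch of how the error phrases quoted above could be pulled out of a saved dmesg. The function name and the idea of matching on these two phrases are assumptions made for illustration, not an existing tool. A match only says the kernel logged the error; it does not say whether the drive, the cabling, or the driver is at fault.

```python
# Phrases taken from the error quoted in the comment above.
# Matching on them is an illustrative heuristic, not an official diagnostic.
SIGNATURES = (
    "unrecovered read error",
    "auto reallocate failed",
)


def find_media_errors(log_text):
    """Return (line_number, line) pairs for log lines containing a signature."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(signature in lowered for signature in SIGNATURES):
            hits.append((number, line.strip()))
    return hits
```

It could be fed the contents of an attached dmesg file, for example `find_media_errors(open("dmesg.txt").read())`, where the file name is a placeholder.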
It may look like a dying drive, but:
1. SMART reports 0 errors
2. the system is solid when rebooted with the 2.6.15 kernel, even after several days of I/O
3. the drives are new (ok, weak point)
4. and anyway, what's the probability of *two* new drives going bad at *exactly* the same moment? (Being SATA, BTW, they don't share cabling.)
Created attachment 123604 [details] smart info for sda
Created attachment 123605 [details] smart info for sdb
Just let me know if you need more logs / test results
2.6.15-1.1872_FC5, patched to disable FUA (as suggested by Tejun Heo here: http://marc.theaimsgroup.com/?l=linux-ide&m=113825474609128), boots fine.
I've been unable to connect to marc.theaimsgroup.com for weeks, from multiple locations around the world. Can you attach that patch to the bugzilla please ?
Strange, it works fine here. You can find the whole thread in any other linux-ide archive (the title is: regarding bug #5914 - fs corruption on SATA). I'll attach the patch, but it's very preliminary and useful mainly to check whether FUA is causing problems on a system (it short-circuits it). People are now talking about drive-specific FUA blacklisting (but the fuller patch is not cooked yet).
Created attachment 123808 [details] Simple patch to disable fua
Created attachment 123940 [details] Fua blacklisting The following (tested) patch implements FUA drive blacklisting (specifically, for my drive model). It was posted in the aforementioned thread.
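For readers who do not want to open the attachment, the general idea of per-model blacklisting can be sketched as below. This is not the attached patch and not kernel code: the function names and the model string are hypothetical placeholders, and the real blacklist entry is whatever attachment 123940 contains. The sketch assumes only that the model string reported by the drive may be padded with spaces and so needs normalizing before comparison.

```python
# Hypothetical placeholder entry; the real model string is in attachment 123940.
FUA_BLACKLIST = ("EXAMPLE DISK 1000",)


def normalize_model(raw_model):
    """Collapse runs of whitespace and strip padding from a reported model string."""
    return " ".join(raw_model.split())


def fua_allowed(raw_model, drive_reports_fua, blacklist=FUA_BLACKLIST):
    """Decide whether FUA writes should be used for a drive.

    FUA is used only if the drive advertises it and its model is not blacklisted.
    """
    if not drive_reports_fua:
        return False
    return normalize_model(raw_model) not in blacklist
```

Compared with the simple patch in attachment 123808, which short-circuits FUA for every drive, this kind of check leaves FUA enabled on drives that are not listed.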
Created attachment 123941 [details] dmesg for kernel patched with patch #123940
Closing, as the blacklisting patch was merged into the latest upstream git snapshot.