Bug 624293
| Field | Value |
|---|---|
| Summary | XFS internal error / mount: Structure needs cleaning |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 13 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | low |
| Reporter | Michael Hagmann <michael.hagmann> |
| Assignee | Eric Sandeen <esandeen> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | anton, aquini, dchinner, dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda |
| Doc Type | Bug Fix |
| Last Closed | 2010-08-18 15:23:21 UTC |
Description
Michael Hagmann
2010-08-15 18:37:19 UTC
Created attachment 438852 [details]
SOS Report
Created attachment 438853 [details]
MD5 Sum
Trying to repair:

```
[root@scheat tmp]# xfs_check /dev/vdb
xfs_check: cannot init perag data (117)
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_check. If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

[root@scheat tmp]# xfs_repair /dev/vdb
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

[root@scheat tmp]# xfs_metadump -g /dev/vdb ./dev-vdb.dump
xfs_metadump: cannot init perag data (117)
Copying log
```

Nothing helped, so I went forward with `xfs_repair -L /dev/vdb` -- lots of errors!
```
resetting inode 710004 nlinks from 1 to 2
resetting inode 710005 nlinks from 1 to 2
resetting inode 710006 nlinks from 1 to 2
resetting inode 710922 nlinks from 1 to 2
resetting inode 710923 nlinks from 1 to 2
resetting inode 710966 nlinks from 1 to 2
resetting inode 710967 nlinks from 1 to 2
resetting inode 710968 nlinks from 1 to 2
resetting inode 710969 nlinks from 1 to 2
resetting inode 710970 nlinks from 1 to 2
resetting inode 710971 nlinks from 1 to 2
resetting inode 711047 nlinks from 1 to 2
resetting inode 711048 nlinks from 1 to 2
resetting inode 711049 nlinks from 1 to 2
resetting inode 711050 nlinks from 1 to 2
resetting inode 711081 nlinks from 1 to 2
resetting inode 711082 nlinks from 1 to 2
resetting inode 711203 nlinks from 1 to 2
resetting inode 749502 nlinks from 10 to 1
resetting inode 749541 nlinks from 1 to 2
resetting inode 752106 nlinks from 3 to 2
resetting inode 754580 nlinks from 1 to 2
resetting inode 754591 nlinks from 6 to 4
resetting inode 755216 nlinks from 1 to 2
resetting inode 2235301695 nlinks from 1 to 2
resetting inode 2235301697 nlinks from 1 to 2
resetting inode 2235301718 nlinks from 1 to 2
resetting inode 2235301722 nlinks from 1 to 2
resetting inode 2235301727 nlinks from 1 to 2
resetting inode 2235301869 nlinks from 16 to 17
resetting inode 2235305415 nlinks from 1 to 2
resetting inode 2235305416 nlinks from 1 to 2
resetting inode 2235305820 nlinks from 1 to 2
resetting inode 2236609836 nlinks from 1 to 2
resetting inode 2236609864 nlinks from 1 to 2
resetting inode 2236610413 nlinks from 1 to 2
resetting inode 2236610415 nlinks from 1 to 2
resetting inode 2360490579 nlinks from 6 to 2
resetting inode 2367668281 nlinks from 3 to 2
cache_purge: shake on cache 0x1d1b030 left 1 nodes!?
done
```

Timeline of the problem:

- Everything went fine while I was installing a new virtual fileserver.
- The host has a 3Ware controller in it: a 3Ware 9690SA-8I with 4 x 2TB disks (RAID 10 for data) and 2 x 320GB (for the OS).
Then I rebooted to clean the system and check that everything was OK. During that, one disk disappeared from the RAID 10, most likely because I had not set it to a fixed link speed of 1.5 Gbps. I then rebuilt the array, but I could not mount it because of metadata problems!

I also see the message:

```
Aug 15 20:30:05 scheat kernel: Filesystem "vdb": Disabling barriers, trial barrier write failed
```

Do these filesystem problems happen only because of the disappeared disk and the wrong link speed, or do I need to change something else? Thanks for the help.

```
[root@enif tmp]# tw_cli /c6/ show all
/c6 Driver Version = 2.26.02.013
/c6 Model = 9690SA-8I
/c6 Available Memory = 448MB
/c6 Firmware Version = FH9X 4.10.00.007
/c6 Bios Version = BE9X 4.08.00.002
/c6 Boot Loader Version = BL9X 3.08.00.001
/c6 Serial Number = L340503B8130265
/c6 PCB Version = Rev 041
/c6 PCHIP Version = 2.00
/c6 ACHIP Version = 1.31A6
/c6 Controller Phys = 8
/c6 Connections = 6 of 128
/c6 Drives = 6 of 128
/c6 Units = 2 of 128
/c6 Active Drives = 6 of 128
/c6 Active Units = 2 of 32
/c6 Max Drives Per Unit = 32
/c6 Total Optimal Units = 1
/c6 Not Optimal Units = 1
/c6 Disk Spinup Policy = 2
/c6 Spinup Stagger Time Policy (sec) = 2
/c6 Auto-Carving Policy = off
/c6 Auto-Carving Size = 2048 GB
/c6 Auto-Rebuild Policy = on
/c6 Rebuild Mode = Adaptive
/c6 Rebuild Rate = 1
/c6 Verify Mode = Adaptive
/c6 Verify Rate = 1
/c6 Controller Bus Type = PCIe
/c6 Controller Bus Width = 8 lanes
/c6 Controller Bus Speed = 2.5 Gbps/lane

Unit  UnitType  Status      %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK          -       -       -       298.013   RiW    ON
u1    RAID-10   REBUILDING  93%     -       256K    3725.27   RiW    ON

VPort  Status    Unit  Size       Type  Phy  Encl-Slot  Model
------------------------------------------------------------------------------
p0     OK        u1    1.82 TB    SATA  0    -          WDC WD2002FYPS-01U1
p1     DEGRADED  u1    1.82 TB    SATA  1    -          WDC WD2002FYPS-01U1
p2     OK        u1    1.82 TB    SATA  2    -          WDC WD2002FYPS-01U1
p3     OK        u1    1.82 TB    SATA  3    -          WDC WD2002FYPS-01U1
p6     OK        u0    298.09 GB  SATA  6    -          Hitachi HTS545032B9
p7     OK        u0    298.09 GB  SATA  7    -          Hitachi HTS543232L9

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    191    10-Jul-2010

[root@enif tmp]#
```

Created attachment 438863 [details]
SOS Report Enif KVM Host
Created attachment 438864 [details]
SOS Report MD5 Sum Enif KVM Host
Created attachment 438866 [details]
LSI Support Infos
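As context for the xfs_check/xfs_repair errors quoted in the description: the tools' own error text prescribes a fixed recovery order for a filesystem with a dirty log. A minimal sketch of that decision order (the `next_step` helper name and its state labels are hypothetical; the logic is taken from the error message above):

```shell
# Recovery order for an XFS filesystem whose log still holds
# unreplayed metadata changes, per xfs_repair's error message:
#   1. mount to replay the log, unmount, then re-run xfs_repair;
#   2. only if the mount fails, fall back to xfs_repair -L, which
#      destroys the log and may lose the metadata changes held in it.
next_step() {
  case "$1" in
    dirty-log-mountable)   echo "mount, unmount, then xfs_repair" ;;
    dirty-log-unmountable) echo "xfs_repair -L (log destroyed, possible corruption)" ;;
    clean)                 echo "xfs_repair" ;;
    *)                     echo "unknown state" ;;
  esac
}

next_step dirty-log-unmountable   # the situation in this report
```

In this report the filesystem could not be mounted at all, so `-L` was the remaining option, at the cost of whatever metadata changes were still in the log.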
The array controller should be taking care of any data integrity problems.

(In reply to comment #5)
> Timeline of the Problem:
>
> - everything went fine I installing a new virtual Fileserver
> - The Host has a 3Ware Controller in:
>
> I have a 3Ware 9690SA-8I Controller with 4 x 2TB Disks ( RAID 10 for data ) and
> 2 x 320GB ( for OS ).
>
> Then I do a reboot to clean the system and checks if all OK. There one Disks
> disappear from the RAID 10. Most likly because I don't set it to fix Link Speed
> = 1.5 Gbps. Then I rebuild the array but I couldn't mount it because of
> Metadata Problems !

Looks like the array has not recovered properly from whatever went wrong - the sosreport shows lots of directory read errors where the directory blocks contain NULLs rather than the correct headers. There are also WANT_CORRUPTED_GOTO errors, which indicate free space btree corruptions as well.

> I also see the message:
> Aug 15 20:30:05 scheat kernel: Filesystem "vdb": Disabling barriers, trial
> barrier write failed

Start reading here:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

> Does this filesystem Problems only happen because of the disapperd Disk and the
> wrong Link Speed ? or do I need to change something other ?

The filesystem problems are most likely caused by whatever went wrong with your array, and by the array failing to recover correctly. I'd be pointing fingers at the RAID hardware here, not XFS...

That's clear; I already mentioned that the controller may have triggered the problem. But tonight I got another XFS internal error during an rsync job:

```
Aug 16 00:02:47 scheat kernel: ffff88001574c000: 2f 50 6d 0a 52 65 64 75 7a 69 65 72 65 6e 2f 53  /Pm.Reduzieren/S
Aug 16 00:02:47 scheat kernel: Filesystem "vdb": XFS internal error xfs_da_do_buf(2) at line 2113 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffffa0092cdd
Aug 16 00:02:47 scheat kernel:
Aug 16 00:02:47 scheat kernel: Pid: 12526, comm: rsync Not tainted 2.6.33.6-147.2.4.fc13.x86_64 #1
Aug 16 00:02:47 scheat kernel: Call Trace:
Aug 16 00:02:47 scheat kernel: [<ffffffffa009c3aa>] xfs_error_report+0x3c/0x3e [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] ? xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa009c3fa>] xfs_corruption_error+0x4e/0x59 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092b90>] xfs_da_do_buf+0x53e/0x627 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] ? xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffff811cdd0b>] ? symcmp+0xf/0x11
Aug 16 00:02:47 scheat kernel: [<ffffffff811cd985>] ? hashtab_search+0x61/0x67
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] ? xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0095b2e>] ? xfs_dir2_block_lookup_int+0x46/0x193 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0095b2e>] xfs_dir2_block_lookup_int+0x46/0x193 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffff811ce2c8>] ? sidtab_context_to_sid+0x25/0xd0
Aug 16 00:02:47 scheat kernel: [<ffffffffa009611f>] xfs_dir2_block_lookup+0x47/0xda [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa00945d3>] ? xfs_dir2_isblock+0x1c/0x49 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0094cf4>] xfs_dir_lookup+0xde/0x14c [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa00b81c1>] xfs_lookup+0x59/0xbb [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa00c0fea>] xfs_vn_lookup+0x40/0x7f [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffff81109303>] do_lookup+0xf0/0x186
Aug 16 00:02:47 scheat kernel: [<ffffffff811c49d0>] ? selinux_inode_permission+0x3b/0x40
Aug 16 00:02:47 scheat kernel: [<ffffffff8110b0be>] link_path_walk+0x442/0x598
Aug 16 00:02:47 scheat kernel: [<ffffffff8110b39b>] path_walk+0x64/0xd4
Aug 16 00:02:47 scheat kernel: [<ffffffff8110b51b>] do_path_lookup+0x25/0x88
Aug 16 00:02:47 scheat kernel: [<ffffffff8110bf65>] user_path_at+0x51/0x8e
Aug 16 00:02:47 scheat kernel: [<ffffffff81104722>] ? might_fault+0x1c/0x1e
Aug 16 00:02:47 scheat kernel: [<ffffffff81104816>] ? cp_new_stat+0xf2/0x108
Aug 16 00:02:47 scheat kernel: [<ffffffff811049fc>] vfs_fstatat+0x32/0x5d
Aug 16 00:02:47 scheat kernel: [<ffffffff81104a78>] vfs_lstat+0x19/0x1b
Aug 16 00:02:47 scheat kernel: [<ffffffff81104a94>] sys_newlstat+0x1a/0x38
Aug 16 00:02:47 scheat kernel: [<ffffffff81109057>] ? path_put+0x1d/0x22
Aug 16 00:02:47 scheat kernel: [<ffffffff81095cff>] ? audit_syscall_entry+0x119/0x145
Aug 16 00:02:47 scheat kernel: [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Aug 16 00:02:47 scheat kernel: ffff88001574c000: 2f 50 6d 0a 52 65 64 75 7a 69 65 72 65 6e 2f 53  /Pm.Reduzieren/S
Aug 16 00:02:47 scheat kernel: Filesystem "vdb": XFS internal error xfs_da_do_buf(2) at line 2113 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffffa0092cdd
Aug 16 00:02:47 scheat kernel:
Aug 16 00:02:47 scheat kernel: Pid: 12526, comm: rsync Not tainted 2.6.33.6-147.2.4.fc13.x86_64 #1
Aug 16 00:02:47 scheat kernel: Call Trace:
Aug 16 00:02:47 scheat kernel: [<ffffffffa009c3aa>] xfs_error_report+0x3c/0x3e [xfs]
```

There were no problems on the host. I have now disabled the write cache according to the FAQ (`/cX/uX set cache=off`):

```
[root@enif lsigetlinux_062010]# tw_cli /c6/u1 show all
/c6/u1 status = OK
/c6/u1 is not rebuilding, its current state is OK
/c6/u1 is not verifying, its current state is OK
/c6/u1 is initialized.
/c6/u1 Write Cache = on
/c6/u1 Read Cache = Intelligent
/c6/u1 volume(s) = 1
/c6/u1 name = data1
/c6/u1 serial number = Y4000754586455004ED2
/c6/u1 Ignore ECC policy = off
/c6/u1 Auto Verify Policy = on
/c6/u1 Storsave Policy = balance
/c6/u1 Command Queuing Policy = on
/c6/u1 Rapid RAID Recovery setting = all

Unit    UnitType  Status  %RCmpl  %V/I/M  VPort  Stripe  Size(GB)
------------------------------------------------------------------------
u1      RAID-10   OK      -       -       -      256K    3725.27
u1-0    RAID-1    OK      -       -       -      -       -
u1-0-0  DISK      OK      -       -       p3     -       1862.63
u1-0-1  DISK      OK      -       -       p2     -       1862.63
u1-1    RAID-1    OK      -       -       -      -       -
u1-1-0  DISK      OK      -       -       p1     -       1862.63
u1-1-1  DISK      OK      -       -       p0     -       1862.63
u1/v0   Volume    -       -       -       -      -       3725.27

[root@enif lsigetlinux_062010]# tw_cli /c6/u1 set cache=off
Setting Write Cache Policy on /c6/u1 to [off] ... Done.

[root@enif lsigetlinux_062010]# tw_cli /c6/u1 show all
/c6/u1 status = OK
/c6/u1 is not rebuilding, its current state is OK
/c6/u1 is not verifying, its current state is OK
/c6/u1 is initialized.
/c6/u1 Write Cache = off
/c6/u1 Read Cache = Intelligent
/c6/u1 volume(s) = 1
/c6/u1 name = data1
/c6/u1 serial number = Y4000754586455004ED2
/c6/u1 Ignore ECC policy = off
/c6/u1 Auto Verify Policy = on
/c6/u1 Storsave Policy = balance
/c6/u1 Command Queuing Policy = on
/c6/u1 Rapid RAID Recovery setting = all

Unit    UnitType  Status  %RCmpl  %V/I/M  VPort  Stripe  Size(GB)
------------------------------------------------------------------------
u1      RAID-10   OK      -       -       -      256K    3725.27
u1-0    RAID-1    OK      -       -       -      -       -
u1-0-0  DISK      OK      -       -       p3     -       1862.63
u1-0-1  DISK      OK      -       -       p2     -       1862.63
u1-1    RAID-1    OK      -       -       -      -       -
u1-1-0  DISK      OK      -       -       p1     -       1862.63
u1-1-1  DISK      OK      -       -       p0     -       1862.63
u1/v0   Volume    -       -       -       -      -       3725.27

[root@enif lsigetlinux_062010]#
```

But I am not sure how to disable the caches of the individual hard disks. Does anyone have an idea how to do this?

Thanks, Michael

Created attachment 438894 [details]
Second Problem SOS Report KVM Host Enif
Created attachment 438895 [details]
Second Problem SOS Report KVM Host Enif / MD5 Sum
Created attachment 438896 [details]
Second Problem SOS Report KVM Guest Scheat
Created attachment 438897 [details]
Second Problem SOS Report KVM Guest Scheat / MD5 Sum
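As a side note on the cache question above: whether the unit write cache actually ended up off can be double-checked by parsing a saved `tw_cli /cX/uX show all` capture. A minimal, generic sketch (the sample lines are copied from the output quoted above; this is not a 3ware tool, just text parsing):

```shell
# Extract the unit write-cache state from a saved tw_cli capture.
# The sample below is taken verbatim from the report above.
sample='/c6/u1 status = OK
/c6/u1 Write Cache = off
/c6/u1 Read Cache = Intelligent'

cache_state=$(printf '%s\n' "$sample" | awk -F' = ' '/Write Cache/ {print $2}')
echo "write cache: $cache_state"
```

For the per-drive caches, note (an assumption, not something stated in this report) that host-side tools such as `hdparm -W` generally cannot reach disks sitting behind a hardware RAID controller; on 3ware hardware the drive caches are typically managed by the controller firmware, so that question is best confirmed with LSI support.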
(In reply to comment #11)
> Thats clear, I already mention that the maybe the Controller trigger the
> Problem.
>
> But this night I get another XFS internal error during a rsync Job:
>
> Aug 16 00:02:47 scheat kernel: ffff88001574c000: 2f 50 6d 0a 52 65 64 75 7a 69
> 65 72 65 6e 2f 53 /Pm.Reduzieren/S
> Aug 16 00:02:47 scheat kernel: Filesystem "vdb": XFS internal error
> xfs_da_do_buf(2) at line 2113 of file fs/xfs/xfs_da_btree.c. Caller
> 0xffffffffa0092cdd

Once again, that is not directory block data that is being dumped there. It looks like a partial path name ("/Pm.Reduzieren/S"), which tends to indicate that the directory read has returned uninitialised data.

Did the filesystem repair cleanly? If you run xfs_repair a second time, does it find more errors, or is it clean? i.e. is this still corruption left over from the original incident, or is it new corruption?

Cheers, Dave.

The filesystem repair worked fine; afterwards everything was OK. The second error was a new problem. LSI / 3Ware are now replacing the controller, the BBU board, and the battery, because they don't know what happened.

******************************************************************
Hi Michael,

File system errors can be a little tricky to narrow down. In some of the more rare cases a drive might be writing out bad data. However, per the logs I didn't see any indication of a drive problem, and not one has reallocated a sector. I see that all four are running at the 1.5Gb/s link speed now.

Sometimes the problem can be traced back to the controller and/or the BBU. I did notice something pretty interesting in the driver message log and the controller's advanced diagnostic. According to the driver message log, the last Health Check [capacity test] was done on Aug 10th:

```
Aug 10 21:40:35 enif kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x0051): Battery health check started:.
```

However, the controller's advanced log shows this:

```
/c6/bbu Last Capacity Test = 10-Jul-2010
```

There is an issue between the controller and the BBU, and we need to understand which component is at fault. If this is a live server you may want to replace both components. Or, if you can perform some troubleshooting, power the system down and remove the BBU and its daughter PCB from the RAID controller. Then ensure the write cache setting remains enabled and see if there is a reoccurrence. If so, the controller is bad. If not, it is the BBU that we need to replace.

Thank you,
Technical Support Engineer
Global Support Services
******************************************************************

Hope that helps. Thanks anyway for the help.

Mike

Just for information: the problem was a bug in the virtio driver with disks over 2 TB! Bug 605757 - 2tb virtio disk gets massively corrupted filesystems

*** This bug has been marked as a duplicate of bug 605757 ***