Bug 624293
| Field | Value |
|---|---|
| Summary | XFS internal error / mount: Structure needs cleaning |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 13 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | low |
| Reporter | Michael Hagmann <michael.hagmann> |
| Assignee | Eric Sandeen <esandeen> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | anton, aquini, dchinner, dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda |
| Doc Type | Bug Fix |
| Last Closed | 2010-08-18 15:23:21 UTC |
Description
Michael Hagmann
2010-08-15 18:37:19 UTC
Created attachment 438852 [details]
SOS Report
Created attachment 438853 [details]
MD5 Sum
Trying to repair:

```
[root@scheat tmp]# xfs_check /dev/vdb
xfs_check: cannot init perag data (117)
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_check. If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

[root@scheat tmp]# xfs_repair /dev/vdb
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

[root@scheat tmp]# xfs_metadump -g /dev/vdb ./dev-vdb.dump
xfs_metadump: cannot init perag data (117)
Copying log
```

Nothing helped, so I went forward with `xfs_repair -L /dev/vdb` -- lots of errors!
```
resetting inode 710004 nlinks from 1 to 2
resetting inode 710005 nlinks from 1 to 2
resetting inode 710006 nlinks from 1 to 2
resetting inode 710922 nlinks from 1 to 2
resetting inode 710923 nlinks from 1 to 2
resetting inode 710966 nlinks from 1 to 2
resetting inode 710967 nlinks from 1 to 2
resetting inode 710968 nlinks from 1 to 2
resetting inode 710969 nlinks from 1 to 2
resetting inode 710970 nlinks from 1 to 2
resetting inode 710971 nlinks from 1 to 2
resetting inode 711047 nlinks from 1 to 2
resetting inode 711048 nlinks from 1 to 2
resetting inode 711049 nlinks from 1 to 2
resetting inode 711050 nlinks from 1 to 2
resetting inode 711081 nlinks from 1 to 2
resetting inode 711082 nlinks from 1 to 2
resetting inode 711203 nlinks from 1 to 2
resetting inode 749502 nlinks from 10 to 1
resetting inode 749541 nlinks from 1 to 2
resetting inode 752106 nlinks from 3 to 2
resetting inode 754580 nlinks from 1 to 2
resetting inode 754591 nlinks from 6 to 4
resetting inode 755216 nlinks from 1 to 2
resetting inode 2235301695 nlinks from 1 to 2
resetting inode 2235301697 nlinks from 1 to 2
resetting inode 2235301718 nlinks from 1 to 2
resetting inode 2235301722 nlinks from 1 to 2
resetting inode 2235301727 nlinks from 1 to 2
resetting inode 2235301869 nlinks from 16 to 17
resetting inode 2235305415 nlinks from 1 to 2
resetting inode 2235305416 nlinks from 1 to 2
resetting inode 2235305820 nlinks from 1 to 2
resetting inode 2236609836 nlinks from 1 to 2
resetting inode 2236609864 nlinks from 1 to 2
resetting inode 2236610413 nlinks from 1 to 2
resetting inode 2236610415 nlinks from 1 to 2
resetting inode 2360490579 nlinks from 6 to 2
resetting inode 2367668281 nlinks from 3 to 2
cache_purge: shake on cache 0x1d1b030 left 1 nodes!?
done
```

Timeline of the problem:

- Everything went fine while I was installing a new virtual fileserver.
- The host has a 3Ware controller in it: a 3Ware 9690SA-8I with 4 x 2TB disks (RAID 10 for data) and 2 x 320GB (for the OS).
Then I rebooted to clean the system and check that everything was OK. During that, one disk disappeared from the RAID 10, most likely because I had not set it to a fixed link speed of 1.5 Gbps. I then rebuilt the array, but I could not mount it because of metadata problems!

I also see the message:

```
Aug 15 20:30:05 scheat kernel: Filesystem "vdb": Disabling barriers, trial barrier write failed
```

Do these filesystem problems happen only because of the disappeared disk and the wrong link speed, or do I need to change something else? Thanks for the help.

```
[root@enif tmp]# tw_cli /c6/ show all
/c6 Driver Version = 2.26.02.013
/c6 Model = 9690SA-8I
/c6 Available Memory = 448MB
/c6 Firmware Version = FH9X 4.10.00.007
/c6 Bios Version = BE9X 4.08.00.002
/c6 Boot Loader Version = BL9X 3.08.00.001
/c6 Serial Number = L340503B8130265
/c6 PCB Version = Rev 041
/c6 PCHIP Version = 2.00
/c6 ACHIP Version = 1.31A6
/c6 Controller Phys = 8
/c6 Connections = 6 of 128
/c6 Drives = 6 of 128
/c6 Units = 2 of 128
/c6 Active Drives = 6 of 128
/c6 Active Units = 2 of 32
/c6 Max Drives Per Unit = 32
/c6 Total Optimal Units = 1
/c6 Not Optimal Units = 1
/c6 Disk Spinup Policy = 2
/c6 Spinup Stagger Time Policy (sec) = 2
/c6 Auto-Carving Policy = off
/c6 Auto-Carving Size = 2048 GB
/c6 Auto-Rebuild Policy = on
/c6 Rebuild Mode = Adaptive
/c6 Rebuild Rate = 1
/c6 Verify Mode = Adaptive
/c6 Verify Rate = 1
/c6 Controller Bus Type = PCIe
/c6 Controller Bus Width = 8 lanes
/c6 Controller Bus Speed = 2.5 Gbps/lane

Unit  UnitType  Status      %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK          -       -       -       298.013   RiW    ON
u1    RAID-10   REBUILDING  93%     -       256K    3725.27   RiW    ON

VPort  Status    Unit  Size       Type  Phy  Encl-Slot  Model
------------------------------------------------------------------------------
p0     OK        u1    1.82 TB    SATA  0    -          WDC WD2002FYPS-01U1
p1     DEGRADED  u1    1.82 TB    SATA  1    -          WDC WD2002FYPS-01U1
p2     OK        u1    1.82 TB    SATA  2    -          WDC WD2002FYPS-01U1
p3     OK        u1    1.82 TB    SATA  3    -          WDC WD2002FYPS-01U1
p6     OK        u0    298.09 GB  SATA  6    -          Hitachi HTS545032B9
p7     OK        u0    298.09 GB  SATA  7    -          Hitachi HTS543232L9

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    191    10-Jul-2010

[root@enif tmp]#
```

Created attachment 438863 [details]
SOS Report Enif KVM Host
Created attachment 438864 [details]
SOS Report MD5 Sum Enif KVM Host
Created attachment 438866 [details]
LSI Support Infos
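As context for the xfs_check/xfs_repair errors quoted in the description: the tools' own error text prescribes a fixed recovery order for a filesystem with a dirty log. A minimal sketch of that decision order (the `next_step` helper name and its state labels are hypothetical; the logic is taken from the error message above):

```shell
# Recovery order for an XFS filesystem whose log still holds
# unreplayed metadata changes, per xfs_repair's error message:
#   1. mount to replay the log, unmount, then re-run xfs_repair;
#   2. only if the mount fails, fall back to xfs_repair -L, which
#      destroys the log and may lose the metadata changes held in it.
next_step() {
  case "$1" in
    dirty-log-mountable)   echo "mount, unmount, then xfs_repair" ;;
    dirty-log-unmountable) echo "xfs_repair -L (log destroyed, possible corruption)" ;;
    clean)                 echo "xfs_repair" ;;
    *)                     echo "unknown state" ;;
  esac
}

next_step dirty-log-unmountable   # the situation in this report
```

In this report the filesystem could not be mounted at all, so `-L` was the remaining option, at the cost of whatever metadata changes were still in the log.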
The array controller should be taking care of any data integrity problems.

(In reply to comment #5)
> Timeline of the Problem:
>
> - everything went fine I installing a new virtual Fileserver
> - The Host has a 3Ware Controller in:
>
> I have a 3Ware 9690SA-8I Controller with 4 x 2TB Disks ( RAID 10 for data ) and
> 2 x 320GB ( for OS ).
>
> Then I do a reboot to clean the system and checks if all OK. There one Disks
> disappear from the RAID 10. Most likly because I don't set it to fix Link Speed
> = 1.5 Gbps. Then I rebuild the array but I couldn't mount it because of
> Metadata Problems !

Looks like the array has not recovered properly from whatever went wrong - the sosreport shows lots of directory read errors where the directory blocks contain NULLs rather than the correct headers. There are also WANT_CORRUPTED_GOTO errors, which indicate free space btree corruptions as well.

> I also see the message:
> Aug 15 20:30:05 scheat kernel: Filesystem "vdb": Disabling barriers, trial
> barrier write failed

Start reading here:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F

> Does this filesystem Problems only happen because of the disapperd Disk and the
> wrong Link Speed ? or do I need to change something other ?

The filesystem problems are most likely caused by whatever went wrong with your array, and by the array failing to recover correctly. I'd be pointing fingers at the RAID hardware here, not XFS...

That's clear; I already mentioned that the controller may have triggered the problem. But tonight I got another XFS internal error during an rsync job:

```
Aug 16 00:02:47 scheat kernel: ffff88001574c000: 2f 50 6d 0a 52 65 64 75 7a 69 65 72 65 6e 2f 53  /Pm.Reduzieren/S
Aug 16 00:02:47 scheat kernel: Filesystem "vdb": XFS internal error xfs_da_do_buf(2) at line 2113 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffffa0092cdd
Aug 16 00:02:47 scheat kernel:
Aug 16 00:02:47 scheat kernel: Pid: 12526, comm: rsync Not tainted 2.6.33.6-147.2.4.fc13.x86_64 #1
Aug 16 00:02:47 scheat kernel: Call Trace:
Aug 16 00:02:47 scheat kernel: [<ffffffffa009c3aa>] xfs_error_report+0x3c/0x3e [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] ? xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa009c3fa>] xfs_corruption_error+0x4e/0x59 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092b90>] xfs_da_do_buf+0x53e/0x627 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] ? xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffff811cdd0b>] ? symcmp+0xf/0x11
Aug 16 00:02:47 scheat kernel: [<ffffffff811cd985>] ? hashtab_search+0x61/0x67
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] ? xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0092cdd>] xfs_da_read_buf+0x25/0x27 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0095b2e>] ? xfs_dir2_block_lookup_int+0x46/0x193 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0095b2e>] xfs_dir2_block_lookup_int+0x46/0x193 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffff811ce2c8>] ? sidtab_context_to_sid+0x25/0xd0
Aug 16 00:02:47 scheat kernel: [<ffffffffa009611f>] xfs_dir2_block_lookup+0x47/0xda [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa00945d3>] ? xfs_dir2_isblock+0x1c/0x49 [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa0094cf4>] xfs_dir_lookup+0xde/0x14c [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa00b81c1>] xfs_lookup+0x59/0xbb [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffffa00c0fea>] xfs_vn_lookup+0x40/0x7f [xfs]
Aug 16 00:02:47 scheat kernel: [<ffffffff81109303>] do_lookup+0xf0/0x186
Aug 16 00:02:47 scheat kernel: [<ffffffff811c49d0>] ? selinux_inode_permission+0x3b/0x40
Aug 16 00:02:47 scheat kernel: [<ffffffff8110b0be>] link_path_walk+0x442/0x598
Aug 16 00:02:47 scheat kernel: [<ffffffff8110b39b>] path_walk+0x64/0xd4
Aug 16 00:02:47 scheat kernel: [<ffffffff8110b51b>] do_path_lookup+0x25/0x88
Aug 16 00:02:47 scheat kernel: [<ffffffff8110bf65>] user_path_at+0x51/0x8e
Aug 16 00:02:47 scheat kernel: [<ffffffff81104722>] ? might_fault+0x1c/0x1e
Aug 16 00:02:47 scheat kernel: [<ffffffff81104816>] ? cp_new_stat+0xf2/0x108
Aug 16 00:02:47 scheat kernel: [<ffffffff811049fc>] vfs_fstatat+0x32/0x5d
Aug 16 00:02:47 scheat kernel: [<ffffffff81104a78>] vfs_lstat+0x19/0x1b
Aug 16 00:02:47 scheat kernel: [<ffffffff81104a94>] sys_newlstat+0x1a/0x38
Aug 16 00:02:47 scheat kernel: [<ffffffff81109057>] ? path_put+0x1d/0x22
Aug 16 00:02:47 scheat kernel: [<ffffffff81095cff>] ? audit_syscall_entry+0x119/0x145
Aug 16 00:02:47 scheat kernel: [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Aug 16 00:02:47 scheat kernel: ffff88001574c000: 2f 50 6d 0a 52 65 64 75 7a 69 65 72 65 6e 2f 53  /Pm.Reduzieren/S
Aug 16 00:02:47 scheat kernel: Filesystem "vdb": XFS internal error xfs_da_do_buf(2) at line 2113 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffffa0092cdd
Aug 16 00:02:47 scheat kernel:
Aug 16 00:02:47 scheat kernel: Pid: 12526, comm: rsync Not tainted 2.6.33.6-147.2.4.fc13.x86_64 #1
Aug 16 00:02:47 scheat kernel: Call Trace:
Aug 16 00:02:47 scheat kernel: [<ffffffffa009c3aa>] xfs_error_report+0x3c/0x3e [xfs]
```

There were no problems on the host. I have now disabled the write cache according to the FAQ (`/cX/uX set cache=off`):

```
[root@enif lsigetlinux_062010]# tw_cli /c6/u1 show all
/c6/u1 status = OK
/c6/u1 is not rebuilding, its current state is OK
/c6/u1 is not verifying, its current state is OK
/c6/u1 is initialized.
/c6/u1 Write Cache = on
/c6/u1 Read Cache = Intelligent
/c6/u1 volume(s) = 1
/c6/u1 name = data1
/c6/u1 serial number = Y4000754586455004ED2
/c6/u1 Ignore ECC policy = off
/c6/u1 Auto Verify Policy = on
/c6/u1 Storsave Policy = balance
/c6/u1 Command Queuing Policy = on
/c6/u1 Rapid RAID Recovery setting = all

Unit    UnitType  Status  %RCmpl  %V/I/M  VPort  Stripe  Size(GB)
------------------------------------------------------------------------
u1      RAID-10   OK      -       -       -      256K    3725.27
u1-0    RAID-1    OK      -       -       -      -       -
u1-0-0  DISK      OK      -       -       p3     -       1862.63
u1-0-1  DISK      OK      -       -       p2     -       1862.63
u1-1    RAID-1    OK      -       -       -      -       -
u1-1-0  DISK      OK      -       -       p1     -       1862.63
u1-1-1  DISK      OK      -       -       p0     -       1862.63
u1/v0   Volume    -       -       -       -      -       3725.27

[root@enif lsigetlinux_062010]# tw_cli /c6/u1 set cache=off
Setting Write Cache Policy on /c6/u1 to [off] ... Done.

[root@enif lsigetlinux_062010]# tw_cli /c6/u1 show all
/c6/u1 status = OK
/c6/u1 is not rebuilding, its current state is OK
/c6/u1 is not verifying, its current state is OK
/c6/u1 is initialized.
/c6/u1 Write Cache = off
/c6/u1 Read Cache = Intelligent
/c6/u1 volume(s) = 1
/c6/u1 name = data1
/c6/u1 serial number = Y4000754586455004ED2
/c6/u1 Ignore ECC policy = off
/c6/u1 Auto Verify Policy = on
/c6/u1 Storsave Policy = balance
/c6/u1 Command Queuing Policy = on
/c6/u1 Rapid RAID Recovery setting = all

Unit    UnitType  Status  %RCmpl  %V/I/M  VPort  Stripe  Size(GB)
------------------------------------------------------------------------
u1      RAID-10   OK      -       -       -      256K    3725.27
u1-0    RAID-1    OK      -       -       -      -       -
u1-0-0  DISK      OK      -       -       p3     -       1862.63
u1-0-1  DISK      OK      -       -       p2     -       1862.63
u1-1    RAID-1    OK      -       -       -      -       -
u1-1-0  DISK      OK      -       -       p1     -       1862.63
u1-1-1  DISK      OK      -       -       p0     -       1862.63
u1/v0   Volume    -       -       -       -      -       3725.27

[root@enif lsigetlinux_062010]#
```

But I am not sure how to disable the caches of the individual hard disks. Does anyone have an idea how to do this?

Thanks, Michael

Created attachment 438894 [details]
Second Problem SOS Report KVM Host Enif
Created attachment 438895 [details]
Second Problem SOS Report KVM Host Enif / MD5 Sum
Created attachment 438896 [details]
Second Problem SOS Report KVM Guest Scheat
Created attachment 438897 [details]
Second Problem SOS Report KVM Guest Scheat / MD5 Sum
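As a side note on the cache question above: whether the unit write cache actually ended up off can be double-checked by parsing a saved `tw_cli /cX/uX show all` capture. A minimal, generic sketch (the sample lines are copied from the output quoted above; this is not a 3ware tool, just text parsing):

```shell
# Extract the unit write-cache state from a saved tw_cli capture.
# The sample below is taken verbatim from the report above.
sample='/c6/u1 status = OK
/c6/u1 Write Cache = off
/c6/u1 Read Cache = Intelligent'

cache_state=$(printf '%s\n' "$sample" | awk -F' = ' '/Write Cache/ {print $2}')
echo "write cache: $cache_state"
```

For the per-drive caches, note (an assumption, not something stated in this report) that host-side tools such as `hdparm -W` generally cannot reach disks sitting behind a hardware RAID controller; on 3ware hardware the drive caches are typically managed by the controller firmware, so that question is best confirmed with LSI support.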
(In reply to comment #11)
> Thats clear, I already mention that the maybe the Controller trigger the
> Problem.
>
> But this night I get another XFS internal error during a rsync Job:
>
> Aug 16 00:02:47 scheat kernel: ffff88001574c000: 2f 50 6d 0a 52 65 64 75 7a 69
> 65 72 65 6e 2f 53 /Pm.Reduzieren/S
> Aug 16 00:02:47 scheat kernel: Filesystem "vdb": XFS internal error
> xfs_da_do_buf(2) at line 2113 of file fs/xfs/xfs_da_btree.c. Caller
> 0xffffffffa0092cdd

Once again, that is not directory block data that is being dumped there. It looks like a partial path name ("/Pm.Reduzieren/S"), which tends to indicate that the directory read has returned uninitialised data.

Did the filesystem repair cleanly? If you run xfs_repair a second time, does it find more errors, or is it clean? i.e. is this still corruption left over from the original incident, or is it new corruption?

Cheers, Dave.

The filesystem repair worked fine; afterwards everything was OK. The second error was a new problem. LSI / 3Ware are now replacing the controller, the BBU board, and the battery, because they don't know what happened.

******************************************************************
Hi Michael,

File system errors can be a little tricky to narrow down. In some of the more rare cases a drive might be writing out bad data. However, per the logs I didn't see any indication of a drive problem, and not one has reallocated a sector. I see that all four are running at the 1.5Gb/s link speed now.

Sometimes the problem can be traced back to the controller and/or the BBU. I did notice something pretty interesting in the driver message log and the controller's advanced diagnostic. According to the driver message log, the last Health Check [capacity test] was done on Aug 10th:

```
Aug 10 21:40:35 enif kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x0051): Battery health check started:.
```

However, the controller's advanced log shows this:

```
/c6/bbu Last Capacity Test = 10-Jul-2010
```

There is an issue between the controller and the BBU, and we need to understand which component is at fault. If this is a live server you may want to replace both components. Or, if you can perform some troubleshooting, power the system down and remove the BBU and its daughter PCB from the RAID controller. Then ensure the write cache setting remains enabled and see if there is a reoccurrence. If so, the controller is bad. If not, it is the BBU that we need to replace.

Thank you,
Technical Support Engineer
Global Support Services
******************************************************************

Hope that helps. Thanks anyway for the help.

Mike

Just for information: the problem was a bug in the virtio driver with disks over 2 TB! Bug 605757 - 2tb virtio disk gets massively corrupted filesystems

*** This bug has been marked as a duplicate of bug 605757 ***