Bug 626684

Summary: Filesystem corruption in both xfs & ext4 with KVM guest
Product: [Fedora] Fedora
Reporter: Michael Hagmann <michael.hagmann>
Component: kvm
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED DUPLICATE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: low
Version: 13
CC: anton, aquini, berrange, clalance, dougsland, ehabkost, extras-orphan, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, markmc, quintela, virt-maint
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-08-27 16:37:55 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (flags: none for all):
  Guest Sosreport Scheat
  Host Sosreport Enif
  Raid controller Logs Host System
  Controller Logs
  SOSreport KVM Host Enif
  SOSreport Guest System Scheat

Description Michael Hagmann 2010-08-24 05:42:01 UTC
Created attachment 440563 [details]
Guest Sosreport Scheat

Description of problem:

I have a Fedora 13 KVM host (Enif, 2 cores / 6 GB RAM) with a Fedora KVM guest (Scheat, 1 core / 2 GB RAM). The guest has a 4 TB LVM LUN (from a 3Ware 3690SA-8i controller, RAID 10) attached and formatted with XFS.

When I try to copy a complete directory with

 cp -a data data-test

XFS shuts down after a few minutes:

Aug 23 23:47:08 scheat kernel: Pid: 13255, comm: cp Not tainted 2.6.33.6-147.2.4.fc13.x86_64 #1
Aug 23 23:47:08 scheat kernel: Call Trace:
Aug 23 23:47:08 scheat kernel: [<ffffffffa009c3aa>] xfs_error_report+0x3c/0x3e [xfs]
Aug 23 23:47:08 scheat kernel: [<ffffffffa00b80d9>] ? xfs_create+0x4b8/0x547 [xfs]
Aug 23 23:47:08 scheat kernel: [<ffffffffa00b39d8>] xfs_trans_cancel+0x5f/0xea [xfs]
Aug 23 23:47:08 scheat kernel: [<ffffffffa00b80d9>] xfs_create+0x4b8/0x547 [xfs]
Aug 23 23:47:08 scheat kernel: [<ffffffffa00c120b>] xfs_vn_mknod+0xd0/0x16d [xfs]
Aug 23 23:47:08 scheat kernel: [<ffffffffa00c12c3>] xfs_vn_create+0xb/0xd [xfs]
Aug 23 23:47:08 scheat kernel: [<ffffffff81109e66>] vfs_create+0x73/0x95
Aug 23 23:47:08 scheat kernel: [<ffffffff8110c445>] do_filp_open+0x36c/0xad5
Aug 23 23:47:08 scheat kernel: [<ffffffff8120396d>] ? might_fault+0x1c/0x1e
Aug 23 23:47:08 scheat kernel: [<ffffffff81114fdd>] ? alloc_fd+0x76/0x11f
Aug 23 23:47:08 scheat kernel: [<ffffffff810ff79a>] do_sys_open+0x5e/0x10a
Aug 23 23:47:08 scheat kernel: [<ffffffff810ff86f>] sys_open+0x1b/0x1d
Aug 23 23:47:08 scheat kernel: [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Aug 23 23:47:08 scheat kernel: xfs_force_shutdown(vdb,0x8) called from line 1163 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffffa00b39f1
Aug 23 23:47:08 scheat kernel: Filesystem "vdb": Corruption of in-memory data detected.  Shutting down filesystem: vdb
Aug 23 23:47:08 scheat kernel: Please umount the filesystem, and rectify the problem(s)
Aug 23 23:47:11 scheat kernel: Filesystem "vdb": xfs_log_force: error 5 returned.
Aug 23 23:47:41 scheat kernel: Filesystem "vdb": xfs_log_force: error 5 returned.
Aug 23 23:48:05 scheat abrtd: Can't load '/usr/lib64/abrt/libKerneloopsScanner.so': /usr/lib64/abrt/libKerneloopsScanner.so: cannot open shared object file: No such file or directory
Aug 23 23:48:05 scheat abrtd: Plugin 'KerneloopsScanner' is not registered
Aug 23 23:48:11 scheat kernel: Filesystem "vdb": xfs_log_force: error 5 returned.
Aug 23 23:48:41 scheat kernel: Filesystem "vdb": xfs_log_force: error 5 returned.
Aug 23 23:49:11 scheat kernel: Filesystem "vdb": xfs_log_force: error 5 returned.
Aug 23 23:49:41 scheat kernel: Filesystem "vdb": xfs_log_force: error 5 returned.


Version-Release number of selected component (if applicable):


How reproducible:

 cp -a data data-test

Steps to Reproduce:
1. 
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Michael Hagmann 2010-08-24 05:42:52 UTC
Created attachment 440564 [details]
Host Sosreport Enif

Comment 2 Michael Hagmann 2010-08-24 05:43:43 UTC
Created attachment 440565 [details]
Raid controller Logs Host System

Comment 3 Eric Sandeen 2010-08-25 14:48:39 UTC
(the raid controller logs are all from months ago, but that's ok)

So it shut down on this path:

sys_open
 do_sys_open
   do_filp_open
    vfs_create
     xfs_vn_create
      xfs_vn_mknod
       xfs_create
        xfs_trans_cancel

We got the error & shut down due to canceling a dirty transaction.

It's not clear where in xfs_create things failed, but it's interesting that you've gone down the mknod path, i.e. creating a device special file.

This rings a bell for me but I can't remember why.  :(

Is it always failing in mknod? If you do cp -v, can you see which file it is copying, and provide details (name, permissions, major/minor, etc.)? If you mount/unmount the filesystem that shut down and then run xfs_repair (-n for a dry run), does it find corruption?

-Eric
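
A rough sketch of the checks requested above (the cp and xfs_repair flags are standard; the mount point used here is an assumption, and the device name vdb comes from the log):

 # re-run the copy verbosely to see which file triggers the shutdown
 cp -v -a data data-test

 # unmount the affected filesystem, then check it read-only;
 # -n is a dry run, so xfs_repair makes no changes
 umount /export/data
 xfs_repair -n /dev/vdb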

Comment 4 Michael Hagmann 2010-08-26 15:34:59 UTC
Hi Eric

I have no idea why there would be special files in my data / home directory.

I will try to rerun this test. xfs_repair found errors, but I have to clear the XFS log first.

In the meantime I tried another test. From my four 2 TB disks I built two RAID 1 arrays of 2 TB gross capacity each and formatted them with ext4 (on the assumption that the XFS filesystem was bad or the disk size was too large).

The disks are presented to the host system Enif (Fedora 13), and one of them, with LVM on it and ext4 on top, is exported to the guest Scheat (Fedora 13) as a disk.

[root@enif ~]# pvs
  PV         VG       Fmt  Attr PSize   PFree 
  /dev/sda3  vg_local lvm2 a-   288.01g 68.01g
  /dev/sdb   vg_data1 lvm2 a-     1.82t     0 
  /dev/sdc   vg_data2 lvm2 a-     1.82t     0 
[root@enif ~]# vgs
  VG       #PV #LV #SN Attr   VSize   VFree 
  vg_data1   1   1   0 wz--n-   1.82t     0 
  vg_data2   1   1   0 wz--n-   1.82t     0 
  vg_local   1   4   0 wz--n- 288.01g 68.01g
[root@enif ~]# lvs
  LV               VG       Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  lv_data1         vg_data1 -wi-ao   1.82t                                      
  lv_data2         vg_data2 -wi-ao   1.82t                                      
  lv_enif_crash    vg_local -wi-ao  10.00g                                      
  lv_enif_root     vg_local -wi-ao  30.00g                                      
  lv_old_enif_root vg_local -wi-a-  30.00g                                      
  lv_virt          vg_local -wi-ao 150.00g                                      
[root@enif ~]# 

Config snippet from the KVM domain XML:

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/mapper/vg_data1-lv_data1'/>
      <target dev='vdb' bus='virtio'/>
    </disk>

Now I copy the data to the 2 TB disk inside the guest, and to the other 2 TB disk on the host, as follows (a rough sketch of the commands follows the list):

- mount the data over NFS on both the host and the guest
- rsync the data from the NFS-mounted share to the attached disk
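
A sketch of those two steps; the NFS server name and paths are assumptions, only the general procedure is described above:

 # on the host Enif and on the guest Scheat (server and paths are hypothetical)
 mount -t nfs fileserver:/export/data /mnt/nfs-data

 # copy from the NFS share to the locally attached 2 TB disk
 rsync -a /mnt/nfs-data/ /export/data1/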

Result:

- on the host Enif: no problem, all data are there without errors!
- on the KVM guest Scheat: lots of problems!
- fsck ran for a very long time


 EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #67502197: rec_len is too small for name_len - offset=0, inode=8388608, rec_len=16, name_len=128
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #69206100: inode out of bounds - offset=0, inode=4294967295, rec_len=4096, name_len=255
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #69206099: inode out of bounds - offset=0, inode=4294967295, rec_len=4096, name_len=255
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #70516741: inode out of bounds - offset=0, inode=4294967295, rec_len=4096, name_len=255
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #70516739: inode out of bounds - offset=0, inode=4294967295, rec_len=4096, name_len=255
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #69206101: inode out of bounds - offset=0, inode=4294967295, rec_len=4096, name_len=255
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73400854: rec_len is too small for name_len - offset=0, inode=12582912, rec_len=16, name_len=192
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531481: directory entry across blocks - offset=0, inode=262672436, rec_len=248684, name_len=169
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531484: directory entry across blocks - offset=0, inode=3230352654, rec_len=119404, name_len=187
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531941: directory entry across blocks - offset=0, inode=1284650619, rec_len=56464, name_len=143
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531488: directory entry across blocks - offset=0, inode=2094283311, rec_len=176972, name_len=73
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531945: directory entry across blocks - offset=0, inode=4200826031, rec_len=195440, name_len=11
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531949: directory entry across blocks - offset=0, inode=2307910799, rec_len=40712, name_len=108
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73533776: rec_len is too small for name_len - offset=0, inode=12582912, rec_len=16, name_len=192
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531955: directory entry across blocks - offset=0, inode=2974024306, rec_len=45480, name_len=211
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531954: directory entry across blocks - offset=0, inode=2359655960, rec_len=64764, name_len=19
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531953: directory entry across blocks - offset=0, inode=2773650414, rec_len=40312, name_len=7
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73531957: directory entry across blocks - offset=0, inode=2563861061, rec_len=125588, name_len=200
EXT4-fs error (device vdb): htree_dirblock_to_tree: bad entry in directory #73269736: inode out of bounds - offset=0, inode=4294967295, rec_len=4096, name_len=2


IMHO it looks like the KVM virtualization layer is causing the problem.

Mike

Comment 5 Michael Hagmann 2010-08-26 15:46:57 UTC
Created attachment 441261 [details]
Controller Logs

Comment 6 Michael Hagmann 2010-08-26 15:47:39 UTC
Created attachment 441262 [details]
SOSreport KVM Host Enif

Comment 7 Michael Hagmann 2010-08-26 15:48:24 UTC
Created attachment 441263 [details]
SOSreport Guest System Scheat

Comment 8 Michael Hagmann 2010-08-26 15:50:57 UTC
No idea what was going wrong before; fsck found lots of errors and restarted a few times:

Directory inode 71174624, block #4, offset 0: directory corrupted
Salvage? yes

Entry 'M-d^Zm_^\Fu|M-}}M-cM-WM-k>M-;M-^M-qM-*GM-VM-dM-]M-X[M-^TM-^GkM-EM-PM-tM-$M-FM-^Z^HM-^u^HM-9%^TM-^GM-+M-jFU@b-' in ??? (71174624) references inode 1310720 in group 159 where _INODE_UNINIT is set.
Fix? yes

Entry 'M-d^Zm_^\Fu|M-}}M-cM-WM-k>M-;M-^M-qM-*GM-VM-dM-]M-X[M-^TM-^GkM-EM-PM-tM-$M-FM-^Z^HM-^u^HM-9%^TM-^GM-+M-jFU@b-' in ??? (71174624) has deleted/unused inode 1310720.  Clear? yes

Directory inode 71177913, block #6, offset 0: directory corrupted
Salvage? yes

Directory inode 73273050, block #1, offset 0: directory corrupted
Salvage? yes

Directory inode 71177913, block #11, offset 0: directory corrupted
Salvage? yes

Entry '^E)^AM-^IM-CM-x^@^@^AM-`^GM-lM-^A^@^@M-^]M-l@[M-^LM-^[M-,^NM-W?^[M-^B^M?w GM-^D^L@^HM-OM-#M-B^R^TM-nM-^YM-^BRM-]' in ??? (71570850) has invalid inode #: 3120627712.
Clear? yes

Directory inode 71570850, block #2, offset 1604: directory corrupted
Salvage? yes

Directory inode 71571305, block #8, offset 0: directory corrupted
Salvage? yes

Directory inode 71177644, block #8, offset 0: directory corrupted
Salvage? yes

Restarting e2fsck from the beginning...
Group descriptor 7 checksum is invalid.  FIXED.
Group descriptor 159 checksum is invalid.  FIXED.
Group descriptor 6055 checksum is invalid.  FIXED.
Group descriptor 8768 checksum is invalid.  FIXED.
Group descriptor 8784 checksum is invalid.  FIXED.
/dev/vdb contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes

Illegal block number passed to ext2fs_test_block_bitmap #3341098265 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #1023337331 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #2457660973 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #3334471979 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #666343607 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #3543293982 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #2248889662 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #615565675 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #3403100805 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #2778754129 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #3211825734 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #777616486 for multiply claimed block map
Illegal block number passed to ext2fs_test_block_bitmap #823570077 for multiply claimed block map
Multiply-claimed block(s) in inode 77463665: 26810069
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks

Comment 9 Michael Hagmann 2010-08-26 15:53:05 UTC
What do you think, Eric?

Should I replace the LVM layer?

I really have no idea what the problem is.

thanks Mike

Comment 10 Eric Sandeen 2010-08-26 16:00:26 UTC
OK, sorry, I was wrong about mknod. Ignore that part.

In any case, yes, this looks a lot like a kvm storage/setup error, not a filesystem error.

Any chance you have something else accessing the backing storage for the guest?

You may need to talk to some KVM folks to see if there is anything wrong with your setup or any known bugs...

I'm changing the subject since it's not xfs-specific.

Comment 11 Michael Hagmann 2010-08-26 16:53:28 UTC
OK.

Yes, on the host (the disks are in the host).

I will try to find someone.

Mike

Comment 12 Michael Hagmann 2010-08-26 16:55:04 UTC
That's very bad. As far as I know, Fedora 13 is the base for RHEL 6, and we are evaluating KVM as a successor to VMware, but with these problems I don't feel very comfortable.

Mike

Comment 13 Michael Hagmann 2010-08-26 17:18:33 UTC
Is there anyone from the KVM team who could help with this problem?

thanks Mike

Comment 14 Daniel Berrangé 2010-08-26 17:30:31 UTC
> and now I copy the data to a 2TB disk inside the guest and the other 2TB on the
> host as follow:

Ah, the magic phrase "2 TB guest disk". You might well be hitting this bug:

  "2tb virtio disk gets massively corrupted filesystems "

  https://bugzilla.redhat.com/show_bug.cgi?id=605757
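
For reference, a quick way to check whether a guest block device is near the 2 TiB mark that bug is about (a sketch; run inside the guest, the device name vdb comes from the logs above):

 # size of the virtio disk in bytes; 2 TiB = 2199023255552 bytes
 blockdev --getsize64 /dev/vdb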

Comment 15 Michael Hagmann 2010-08-26 19:06:40 UTC
thanks !

update in progress ....

 Installed:
  kernel.x86_64 0:2.6.33.8-149.fc13                                                                                                                                                            

Updated:
  SDL.x86_64 0:1.2.14-7.fc13                           augeas-libs.x86_64 0:0.7.3-1.fc13           cronie.x86_64 0:1.4.5-2.fc13              cronie-anacron.x86_64 0:1.4.5-2.fc13             
  curl.x86_64 0:7.20.1-4.fc13                          dbus-glib.x86_64 0:0.86-4.fc13              gnupg2.x86_64 0:2.0.14-6.fc13             gpxe-roms-qemu.noarch 0:1.0.1-1.fc13             
  grubby.x86_64 0:7.0.16-1.fc13                        kernel-headers.x86_64 0:2.6.33.8-149.fc13   libcurl.x86_64 0:7.20.1-4.fc13            libudev.x86_64 0:153-3.fc13                      
  libusb.x86_64 0:0.1.12-23.fc13                       linux-firmware.noarch 0:20100806-4.fc13     mc.x86_64 1:4.7.3-1.fc13                  nss.x86_64 0:3.12.6-12.fc13                      
  nss-sysinit.x86_64 0:3.12.6-12.fc13                  openldap.x86_64 0:2.4.21-10.fc13            patch.x86_64 0:2.6.1-4.fc13               qemu.x86_64 2:0.12.5-1.fc13                      
  qemu-common.x86_64 2:0.12.5-1.fc13                   qemu-img.x86_64 2:0.12.5-1.fc13             qemu-kvm.x86_64 2:0.12.5-1.fc13           qemu-system-arm.x86_64 2:0.12.5-1.fc13           
  qemu-system-cris.x86_64 2:0.12.5-1.fc13              qemu-system-m68k.x86_64 2:0.12.5-1.fc13     qemu-system-mips.x86_64 2:0.12.5-1.fc13   qemu-system-ppc.x86_64 2:0.12.5-1.fc13           
  qemu-system-sh4.x86_64 2:0.12.5-1.fc13               qemu-system-sparc.x86_64 2:0.12.5-1.fc13    qemu-system-x86.x86_64 2:0.12.5-1.fc13    qemu-user.x86_64 2:0.12.5-1.fc13                 
  ruby-libs.x86_64 0:1.8.6.399-6.fc13                  seabios-bin.noarch 0:0.6.0-1.fc13           selinux-policy.noarch 0:3.7.19-49.fc13    selinux-policy-targeted.noarch 0:3.7.19-49.fc13  
  system-config-firewall-base.noarch 0:1.2.27-1.fc13   udev.x86_64 0:153-3.fc13                    yum.noarch 0:3.2.28-3.fc13               

Complete!

Comment 16 Michael Hagmann 2010-08-27 04:42:18 UTC
Now it looks OK!

[root@scheat ~]# umount /export/data1/
[root@scheat ~]# fsck /dev/vdb
fsck from util-linux-ng 2.17.2
e2fsck 1.41.10 (10-Feb-2009)
/dev/vdb: clean, 960601/122077184 files, 160360170/488278016 blocks
[root@scheat ~]# 


no problem at all

Mike

Comment 17 Michael Hagmann 2010-08-27 04:44:32 UTC
Also from the host, there are no problems any more:

[root@enif data2]# fsck /dev/mapper/vg_data1-lv_data1 
fsck from util-linux-ng 2.17.2
e2fsck 1.41.10 (10-Feb-2009)
/dev/mapper/vg_data1-lv_data1: clean, 960601/122077184 files, 160360170/488278016 blocks
[root@enif data2]# 


thanks Mike

Comment 18 Chuck Ebbert 2010-08-27 16:37:55 UTC

*** This bug has been marked as a duplicate of bug 605757 ***