Bug 735105
| Summary: | ext4 corruption via Ceph userspace program | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Harald Klein <hklein> |
| Component: | kernel | Assignee: | Eric Sandeen <esandeen> |
| Status: | CLOSED ERRATA | QA Contact: | Eryu Guan <eguan> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | bmr, bstein, dchinner, eguan, esandeen, jdarcy, kkeithle, kzhang, lczerner, msvoboda, rwheeler, sage |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | kernel-2.6.32-229.el6 | Doc Type: | Bug Fix |
| Doc Text: | When running a userspace program, such as the Ceph client, on the ext4 file system, a race condition between the sync/flush thread and the xattr-set thread could occur. This was caused by an incorrectly-set state flag on an inode. As a consequence, memory for the file system was incorrectly allocated, which resulted in file system corruption. With this update, the ext4 code has been modified to prevent this race condition from occurring and file systems are no longer corrupted under these circumstances. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 844854 (view as bug list) | Environment: | |
| Last Closed: | 2012-06-20 07:49:03 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 756082, 758829, 844854 | | |
Description
Harald Klein
2011-09-01 14:02:17 UTC
From the e2i:

    # e2fsck -fn /dev/vg_data/tlv0
    e2fsck 1.41.12 (17-May-2010)
    Pass 1: Checking inodes, blocks, and sizes
    Inode 43, i_blocks is 8, should be 16.  Fix? no
    Inode 2078, i_blocks is 24, should be 16.  Fix? no
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    /dev/vg_data/tlv0: ********** WARNING: Filesystem still has errors **********
    /dev/vg_data/tlv0: 772/10368 files (0.1% non-contiguous), 40206/2622603 blocks

    debugfs: ncheck 43
    Inode   Pathname
    43      /current/2.1c_head/rbd_directory_head
    debugfs: ncheck 2078
    Inode   Pathname
    2078    /current/2.3a_head/rbd_info_head

    debugfs: stat /current/2.3a_head/rbd_info_head
    Inode: 2078   Type: regular   Mode: 0644   Flags: 0x80000
    Generation: 3929733629   Version: 0x00000000:00000001
    User: 0   Group: 0   Size: 8
    File ACL: 524506   Directory ACL: 0
    Links: 1   Blockcount: 24
    Fragment:  Address: 0   Number: 0   Size: 0
    ctime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
    atime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
    mtime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
    crtime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      ceph.snapset = "01 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (26)
    EXTENTS:
    (0): 1114116

    debugfs: stat /current/2.1c_head/rbd_directory_head
    Inode: 43   Type: regular   Mode: 0644   Flags: 0x80000
    Generation: 3929733628   Version: 0x00000000:00000001
    User: 0   Group: 0   Size: 8
    File ACL: 862   Directory ACL: 0
    Links: 1   Blockcount: 8
    Fragment:  Address: 0   Number: 0   Size: 0
    ctime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
    atime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
    mtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
    crtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      ceph.snapset = "01 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (26)
    EXTENTS:
    (0): 1114113

Created attachment 521015 [details]
e2image of the corrupted ext4 fs
Created attachment 521016 [details]
strace of cosd
Customer did point to http://marc.info/?t=131201875400001&r=1&w=2 but they still hit this issue with this applied:

    commit 7132de744ba76930d13033061018ddd7e3e8cd91
    Author: Maxim Patlasov <maxim.patlasov>
    Date:   Sun Jul 10 19:37:48 2011 -0400

        ext4: fix i_blocks/quota accounting when extent insertion fails

bmr pointed out that the above fixes situations where "is" < "should". But the customer is seeing "is" (24) > "should" (16).

http://68.183.106.108/lists/ceph-devel/msg03296.html seems to be the upstream thread on this issue. That issue was resolved with:

    commit 7132de744ba76930d13033061018ddd7e3e8cd91
    Author: Maxim Patlasov <maxim.patlasov@xxxxxxxxx>
    Date:   Sun Jul 10 19:37:48 2011 -0400

        ext4: fix i_blocks/quota accounting when extent insertion fails

Just to clarify, the bug was harder to hit with that patch, but it wasn't fully resolved.

I think there's more than one problem here (they just happen to present with somewhat similar fsck errors). The problem fixed by the patch in comment #9 should only ever produce inodes that have less i_blocks than expected (because the value is decremented incorrectly), but the output in comment #0 has a mix of greater and less than expected:

    Inode 222447209, i_blocks is 8208, should be 8200.  Fix<y>? no
    Inode 222447547, i_blocks is 6144, should be 6152.  Fix<y>? no

The first has i_blocks > should (could be the quota accounting problem fixed in 7132de74) but the second has i_blocks < should - I don't see how this can be the same problem at all.

Typo - < > wrong way around in comment #15 - the second line could be the quota accounting problem, but I can't see how the first is, since i_blocks is greater than it should be.

So on the image:

    Pass 1: Checking inodes, blocks, and sizes
    Inode 43, i_blocks is 8, should be 16.  Fix? no
    Inode 2078, i_blocks is 24, should be 16.  Fix? no

This is the first file in question:

    debuge4fs: ncheck 43
    Inode   Pathname
    43      /current/2.1c_head/rbd_directory_head

In the strace, we open/create this file:

    [pid 17206] unlink("/ceph/osd.001/current/2.1c_head/rbd_directory_head" <unfinished ...>
    [pid 17206] <... unlink resumed> ) = -1 ENOENT (No such file or directory)
    [pid 17206] open("/ceph/osd.001/current/2.1c_head/rbd_directory_head", O_WRONLY|O_CREAT, 0644 <unfinished ...>
    [pid 17206] <... open resumed> ) = 16

seek to 0 on the fd,

    [pid 17206] lseek(16, 0, SEEK_SET <unfinished ...>
    [pid 17206] <... lseek resumed> ) = 0

and writev 8 bytes:

    [pid 17206] writev(16, [{"\0\0\0\0\0\0\0\0", 8}], 1 <unfinished ...>
    [pid 17206] <... writev resumed> ) = 8

Then we setxattr the user.ceph._ (154 bytes) attribute:

    [pid 17206] setxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", ..., 154, 0 <unfinished ...>
    [pid 17206] <... setxattr resumed> ) = 0

close the fd from another pid:

    [pid 17208] close(16 <unfinished ...>
    [pid 17208] <... close resumed> ) = 0

and set the user.ceph.snapset (26 bytes) attribute:

    [pid 17206] setxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph.snapset", ..., 26, 0 <unfinished ...>
    [pid 17206] <... setxattr resumed> ) = 0

Later we stat it & listxattr / getxattr:

    [pid 17223] stat("/ceph/osd.001/current/2.1c_head/rbd_directory_head", {st_mode=S_IFREG|0644, st_size=8, ...}) = 0

(I suppose full reporting here might be good, to get st_blocks at this point?)

    [pid 17223] listxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", (nil), 0) = 30

names are 30 bytes long, looks like both xattrs are there.

    [pid 17223] listxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", 0xb61180, 60) = 30
    [pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph.snapset", ..., 100) = 26
    [pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", 0x7f93120754f0, 100) = -1 ERANGE (Numerical result out of range)
    [pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", 0x0, 0) = 154
    [pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", ..., 154) = 154

Seems like we got both xattrs back ok, and that's all that's done with the file in the strace here.

    debuge4fs: stat <43>
    Inode: 43   Type: regular   Mode: 0644   Flags: 0x80000
    Generation: 3929733628   Version: 0x00000000:00000001
    User: 0   Group: 0   Size: 8
    File ACL: 862   Directory ACL: 0
    Links: 1   Blockcount: 8
    Fragment:  Address: 0   Number: 0   Size: 0
    ctime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
    atime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
    mtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
    crtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      ceph.snapset = "01 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (26)
    EXTENTS:
    (0): 1114113

So it is size 8 (expected), with 1 block at 1114113, and that is consistent; it also has an xattr at block 862. And yet the blockcount is only 8.

FWIW the xattr block seems unique on the filesystem, not shared elsewhere:

    debuge4fs: icheck 862
    Block   Inode number
    862     43

The inode size is 256:

    debuge4fs: stats
    ...
    Inode size: 256

so the first xattr ("user.ceph._" + 154-byte value) should have bumped out to an external block. The other, shorter attribute is stored in the inode body, as the stat command above shows ...

Anyway, doing the same simple sequence of events manually doesn't yield any corruption here on that same kernel, so ... maybe a race somewhere?

comment #19 was my first attempt to reproduce this. The strace has a whole bunch of threads running with xattr syscalls being interleaved among threads (see all the unfinished/resumed), so a race does seem likely.
A few more questions:

- Is quota in use at all in this case?
- If you configure ceph with 'filestore flusher = false' does it change anything?
- If you mount with -o noauto_da_alloc does it change anything? (probably not...)

From http://www.spinics.net/lists/ceph-devel/msg03819.html -

> The 'filestore flusher = false' does fix the ext4 problems, which where
> reported from Christian, but this option has quite an impact of the osd
> performance.
> The '-o noauto_da_alloc' option did not solve the fsck problem.

So, interesting. As Sage says:

> When filestore flusher = true (default), after every write the fd is
> handed off to another thread that uses sync_file_range() to push the data
> out to disk quickly before closing the file. The purpose is to limit the
> latency for the eventual snapshot or sync. Eric suspected the handoff
> between threads may be what was triggering the bug in ext4.

Oops, where is my delete comment button; I typo'd the journal device name and it overwrote the filesystem. Please ignore the last comment...

I've hacked streamtest.cc to the point where I can reproduce this now, after a 3-minute run. The only new datapoint so far is that if I disable delayed allocation, the problem goes away. And it does (unsurprisingly) seem to be related to writing both data blocks and xattr blocks. With 0-length files I don't see it, and with 2 short xattrs I don't see it. 1 data block and 1 attr block will do the trick, though.

I sent this to the ext4 list today. It does seem to fix it, but I'm not terribly happy with it; we'll see if anyone has suggestions.

Ceph users reported that when using Ceph on ext4, the filesystem would often become corrupted, containing inodes with incorrect i_blocks counters. I managed to reproduce this with a very hacked-up "streamtest" binary from the Ceph tree.

Ceph is doing a lot of xattr writes, to out-of-inode blocks. There is also another thread which does sync_file_range and close, of the same files.
The problem appears to happen due to this race:

    sync/flush thread                     xattr-set thread
    -----------------                     ----------------
    do_writepages                         ext4_xattr_set
     ext4_da_writepages                    ext4_xattr_set_handle
      mpage_da_map_blocks                   ext4_xattr_block_set
       set DELALLOC_RESERVE                  ext4_new_meta_blocks
                                              ext4_mb_new_blocks
                                               if (!i_delalloc_reserved_flag)
                                                 vfs_dq_alloc_block
       ext4_get_blocks
        down_write(i_data_sem)
        set i_delalloc_reserved_flag
        ...
        up_write(i_data_sem)
                                               if (i_delalloc_reserved_flag)
                                                 vfs_dq_alloc_block_nofail

In other words, the sync/flush thread pops in and sets i_delalloc_reserved_flag on the inode, which makes the xattr thread think that it's in a delalloc path in ext4_new_meta_blocks(), and add the block for a second time, after already having added it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks.

I think this can be as simple as taking i_data_sem before we go down the new metablock path, so that we won't be testing i_delalloc_reserved_flag in a race with ext4_get_blocks setting it.

Signed-off-by: Eric Sandeen <sandeen>
---

This seems like slight abuse of i_data_sem, though, since we are not modifying i_data[] in the inode on the xattr path. But nothing else protects readers of i_delalloc_reserved_flag in other threads. Thoughts?

p.s. this almost feels like it needs quite a rework; I'm trying to remember why we set a "we are in a delalloc path" flag in an inode rather than passing it through the callchain arguments, but I'm fuzzy on these twisty paths.
    diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
    index e4d0fca..dcebdbd 100644
    --- a/fs/ext4/xattr.c
    +++ b/fs/ext4/xattr.c
    @@ -819,8 +819,14 @@ inserted:
     			if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
     				goal = goal & EXT4_MAX_BLOCK_FILE_PHYS;
     
    +			/*
    +			 * take i_data_sem because we will test
    +			 * i_delalloc_reserved_flag in ext4_mb_new_blocks
    +			 */
    +			down_read((&EXT4_I(inode)->i_data_sem));
     			block = ext4_new_meta_blocks(handle, inode, goal,
     						     NULL, &error);
    +			up_read((&EXT4_I(inode)->i_data_sem));
     			if (error)
     				goto cleanup;

Patch(es) available on kernel-2.6.32-229.el6

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When running a userspace program, such as the Ceph client, on the ext4 file system, a race condition between the sync/flush thread and the xattr-set thread could occur. This was caused by an incorrectly-set state flag on an inode. As a consequence, memory for the file system was incorrectly allocated, which resulted in file system corruption. With this update, the ext4 code has been modified to prevent this race condition from occurring and file systems are no longer corrupted under these circumstances.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html