Bug 735105

Summary: ext4 corruption via Ceph userspace program
Product: Red Hat Enterprise Linux 6
Reporter: Harald Klein <hklein>
Component: kernel
Assignee: Eric Sandeen <esandeen>
Status: CLOSED ERRATA
QA Contact: Eryu Guan <eguan>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.1
CC: bmr, bstein, dchinner, eguan, esandeen, jdarcy, kkeithle, kzhang, lczerner, msvoboda, rwheeler, sage
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.32-229.el6
Doc Type: Bug Fix
Doc Text:
When running a userspace program, such as the Ceph client, on the ext4 file system, a race condition between the sync/flush thread and the xattr-set thread could occur. This was caused by an incorrectly-set state flag on an inode. As a consequence, file system blocks were accounted incorrectly, which resulted in file system corruption. With this update, the ext4 code has been modified to prevent this race condition from occurring, and file systems are no longer corrupted under these circumstances.
Story Points: ---
Clone Of:
Clones: 844854 (view as bug list)
Environment:
Last Closed: 2012-06-20 07:49:03 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Bug Depends On:    
Bug Blocks: 756082, 758829, 844854    

Description Harald Klein 2011-09-01 14:02:17 UTC
Description of problem:

Running Ceph on ext4 leads to filesystem corruption:

# service ceph start
# sleep 300
# service ceph stop
# umount /dev/sda
# fsck /dev/sda
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
/dev/sda: clean, 89514/244187136 files, 103328298/976740886 blocks

# fsck -f /dev/sda
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Inode 222447151, i_blocks is 8208, should be 8200.  Fix<y>? no
Inode 222447161, i_blocks is 8208, should be 8200.  Fix<y>? no
Inode 222447175, i_blocks is 8208, should be 8200.  Fix<y>? no
Inode 222447207, i_blocks is 8208, should be 8200.  Fix<y>? no
Inode 222447209, i_blocks is 8208, should be 8200.  Fix<y>? no
Inode 222447547, i_blocks is 6144, should be 6152.  Fix<y>? no
Inode 222447616, i_blocks is 8208, should be 8200.  Fix<y>?

Version-Release number of selected component (if applicable):

2.6.32-131.6.1.el6.x86_64

How reproducible:

To reproduce the problem, set up a Ceph cluster on a single machine with replication between /ceph/osd.000 and /ceph/osd.001. A small setup would have only two active placement groups with 2 objects. The corruption happens when replication is started from osd.000 to osd.001. It is reproducible most of the time (but not always) with the following steps:

# mkfs.ext4 -T largefile /dev/sdb1
# mount -o noatime,user_xattr /dev/sdb1 /ceph/osd.001/ 
# ceph mon getmap -p /tmp/monmap 
# cosd -i 001 --mkjournal --mkfs --monmap /tmp/monmap 
# /usr/bin/cosd -d -i 001 -c /etc/ceph/ceph.conf

### wait until replication has finished and then stop the cosd

# umount /dev/sdb1

Actual results:
# fsck.ext4 -f /dev/sdb
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Inode 43, i_blocks is 8, should be 16.  Fix<y>? no

Inode 2078, i_blocks is 24, should be 16.  Fix<y>? no

Expected results:

no ext4 corruption

Additional info:
An e2image of the corrupted filesystem is attached, along with an strace of cosd.

Comment 1 Harald Klein 2011-09-01 14:03:05 UTC
From the e2image:

# e2fsck -fn /dev/vg_data/tlv0
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Inode 43, i_blocks is 8, should be 16.  Fix? no

Inode 2078, i_blocks is 24, should be 16.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/vg_data/tlv0: ********** WARNING: Filesystem still has errors **********

/dev/vg_data/tlv0: 772/10368 files (0.1% non-contiguous), 40206/2622603 blocks

debugfs:  ncheck 43
Inode   Pathname
43      /current/2.1c_head/rbd_directory_head
debugfs:  ncheck 2078
Inode   Pathname
2078    /current/2.3a_head/rbd_info_head
debugfs:  stat /current/2.3a_head/rbd_info_head
Inode: 2078   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 3929733629    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 8
File ACL: 524506    Directory ACL: 0
Links: 1   Blockcount: 24
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
 atime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
 mtime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
crtime: 0x4e4033e0:a801bbc8 -- Mon Aug  8 20:07:12 2011
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  ceph.snapset = "01 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (26)
EXTENTS:
(0): 1114116
debugfs:  stat /current/2.1c_head/rbd_directory_head
Inode: 43   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 3929733628    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 8
File ACL: 862    Directory ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
 atime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
 mtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
crtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 20:07:12 2011
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  ceph.snapset = "01 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (26)
EXTENTS:
(0): 1114113

Comment 2 Harald Klein 2011-09-01 14:05:04 UTC
Created attachment 521015 [details]
e2image of the corrupted ext4 fs

Comment 3 Harald Klein 2011-09-01 14:06:51 UTC
Created attachment 521016 [details]
strace of cosd

Comment 4 Harald Klein 2011-09-01 14:11:49 UTC
The customer pointed to http://marc.info/?t=131201875400001&r=1&w=2, but they still hit this issue with the following commit applied:

commit 7132de744ba76930d13033061018ddd7e3e8cd91
Author: Maxim Patlasov <maxim.patlasov>
Date:   Sun Jul 10 19:37:48 2011 -0400

  ext4: fix i_blocks/quota accounting when extent insertion fails


bmr pointed out that the above fixes situations where "is" < "should". But the customer is seeing "is" (24) > should (16).

Comment 9 Ric Wheeler 2011-09-01 15:30:42 UTC
http://68.183.106.108/lists/ceph-devel/msg03296.html

seems to be the upstream thread on this issue.

That issue was resolved with:

commit 7132de744ba76930d13033061018ddd7e3e8cd91
Author: Maxim Patlasov <maxim.patlasov@xxxxxxxxx>
Date:   Sun Jul 10 19:37:48 2011 -0400

   ext4: fix i_blocks/quota accounting when extent insertion fails

Comment 14 Sage Weil 2011-09-02 15:55:08 UTC
Just to clarify, the bug was harder to hit with that patch, but it wasn't fully resolved.

Comment 15 Bryn M. Reeves 2011-09-02 16:04:22 UTC
I think there's more than one problem here (that just happen to present with somewhat similar fsck errors).

The problem fixed by the patch in comment #9 should only ever produce inodes that have less i_blocks than expected (because the value is decremented incorrectly) but the output in comment #0 has a mix of greater and less than expected:

Inode 222447209, i_blocks is 8208, should be 8200.  Fix<y>? no
Inode 222447547, i_blocks is 6144, should be 6152.  Fix<y>? no

The first has i_blocks > should (could be the quota accounting problem fixed in 7132de74) but the second has i_blocks < should - I don't see how this can be the same problem at all.

Comment 16 Bryn M. Reeves 2011-09-02 16:13:58 UTC
Typo - < >  wrong way around in comment #15 - the second line could be the quota accounting problem but I can't see how the first is since i_blocks is greater than it should be.

Comment 19 Eric Sandeen 2011-09-06 20:58:35 UTC
So on the image:

Pass 1: Checking inodes, blocks, and sizes
Inode 43, i_blocks is 8, should be 16.  Fix? no

Inode 2078, i_blocks is 24, should be 16.  Fix? no

this is the first file in question:

debuge4fs:  ncheck 43
Inode	Pathname
43	/current/2.1c_head/rbd_directory_head

In the strace, we open/create this file:

[pid 17206] unlink("/ceph/osd.001/current/2.1c_head/rbd_directory_head" <unfinished ...>
[pid 17206] <... unlink resumed> )      = -1 ENOENT (No such file or directory)
[pid 17206] open("/ceph/osd.001/current/2.1c_head/rbd_directory_head", O_WRONLY|O_CREAT, 0644 <unfinished ...>
[pid 17206] <... open resumed> )        = 16

seek to 0 on the fd,
[pid 17206] lseek(16, 0, SEEK_SET <unfinished ...>
[pid 17206] <... lseek resumed> )       = 0

and writev 8 bytes:
[pid 17206] writev(16, [{"\0\0\0\0\0\0\0\0", 8}], 1 <unfinished ...>
[pid 17206] <... writev resumed> )      = 8

Then we setxattr the user.ceph._ (154 bytes) attribute:
[pid 17206] setxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", ..., 154, 0 <unfinished ...>
[pid 17206] <... setxattr resumed> )    = 0

close the fd from another pid:
[pid 17208] close(16 <unfinished ...>
[pid 17208] <... close resumed> )       = 0

and set the user.ceph.snapset (26 bytes) attribute:
[pid 17206] setxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph.snapset", ... , 26, 0 <unfinished ...>
[pid 17206] <... setxattr resumed> )    = 0


Later we stat it & listxattr / getxattr:

[pid 17223] stat("/ceph/osd.001/current/2.1c_head/rbd_directory_head", {st_mode=S_IFREG|0644, st_size=8, ...}) = 0

(I suppose full reporting here might be good, to get st_blocks at this point?)

[pid 17223] listxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", (nil), 0) = 30

names are 30 bytes long, looks like both xattrs are there.

[pid 17223] listxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", 0xb61180, 60) = 30

[pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph.snapset", ... , 100) = 26

[pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", 0x7f93120754f0, 100) = -1 ERANGE (Numerical result out of range)

[pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", 0x0, 0) = 154

[pid 17223] getxattr("/ceph/osd.001/current/2.1c_head/rbd_directory_head", "user.ceph._", ..., 154) = 154

seems like we got both xattrs back ok, and that's all that's done with the file in the strace here.

debuge4fs:  stat <43>
Inode: 43   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 3929733628    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 8
File ACL: 862    Directory ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
 atime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
 mtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
crtime: 0x4e4033e0:39d3c808 -- Mon Aug  8 14:07:12 2011
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  ceph.snapset = "01 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (26)
EXTENTS:
(0): 1114113

So it is size 8 (expected), with 1 block at 1114113 and that is consistent, and also has an xattr at block 862.  And yet the blockcount is only 8.

FWIW the xattr block seems unique on the filesystem, not shared elsewhere:

debuge4fs:  icheck 862
Block	Inode number
862	43

The inode size is 256:

debuge4fs:  stats
...
Inode size:	          256

so the first xattr  ("user.ceph._" + 154 byte value) should have bumped out to an external block.  The other shorter attribute is stored in the inode body, as the stat command above shows ...

Anyway, doing the same simple sequence of events manually doesn't yield any corruption here on that same kernel, so ... maybe a race somewhere?
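
For reference, the sequence walked through above boils down to roughly the following C. This is only a sketch (paths and sizes copied from the strace, the cross-pid close() approximated with a second thread), not the actual cosd or streamtest code, and as noted it does not reproduce the corruption when run by hand:

/* sketch of the strace sequence above; not the real cosd code */
#include <fcntl.h>
#include <pthread.h>
#include <sys/uio.h>
#include <sys/xattr.h>
#include <unistd.h>

static const char *path = "/ceph/osd.001/current/2.1c_head/rbd_directory_head";

/* stands in for the other pid that closes the fd in the trace */
static void *closer(void *arg)
{
        close(*(int *)arg);
        return NULL;
}

int main(void)
{
        char data[8] = { 0 }, big[154] = { 0 }, small[26] = { 0 };
        struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
        pthread_t t;
        int fd;

        unlink(path);                           /* ENOENT the first time, as in the trace */
        fd = open(path, O_WRONLY | O_CREAT, 0644);
        lseek(fd, 0, SEEK_SET);
        writev(fd, &iov, 1);                    /* 8 bytes of file data */
        setxattr(path, "user.ceph._", big, sizeof(big), 0);     /* 154 bytes, bumps to an external xattr block */
        pthread_create(&t, NULL, closer, &fd);  /* close handed off to another thread */
        pthread_join(t, NULL);
        setxattr(path, "user.ceph.snapset", small, sizeof(small), 0);  /* 26 bytes, stays in the inode body */
        return 0;
}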

Comment 21 Bryn M. Reeves 2011-09-07 10:32:22 UTC
comment #19 was my first attempt too. The strace has a whole bunch of threads running, with xattr syscalls interleaved among them (see all the unfinished/resumed markers), so a race does seem likely.

Comment 25 Eric Sandeen 2011-10-12 23:09:41 UTC
A few more questions:

Is quota in use at all in this case?

If you configure ceph with 'filestore flusher = false' does it change anything?

If you mount with -o noauto_da_alloc does it change anything?  (probably not...)

Comment 26 Eric Sandeen 2011-10-17 20:04:19 UTC
From http://www.spinics.net/lists/ceph-devel/msg03819.html -

> The 'filestore flusher = false' does fix the ext4 problems, which were
> reported by Christian, but this option has quite an impact on the osd
> performance.

> The '-o noauto_da_alloc' option did not solve the fsck problem.

So, interesting.  As Sage says:

> When filestore flusher = true (default), after every write the fd is 
> handed off to another thread that uses sync_file_range() to push the data 
> out to disk quickly before closing the file.  The purpose is to limit the 
> latency for the eventual snapshot or sync.  Eric suspected the handoff 
> between threads may be what was triggering the bug in ext4.
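
The flusher hand-off Sage describes amounts to a pattern like the one below. This is only a sketch of the idea, not Ceph's FileStore code, and the exact sync_file_range() flags shown are illustrative:

/* sketch of the "filestore flusher" hand-off: the writer queues the fd,
 * and a flusher thread pushes the dirty range to disk with
 * sync_file_range() before closing it (illustrative, not Ceph code) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

struct flush_req {
        int   fd;
        off_t off;
        off_t len;
};

static void flush_and_close(struct flush_req *req)
{
        /* start writeback of the range and wait for it, limiting how much
         * dirty data the eventual snapshot/sync has to flush */
        sync_file_range(req->fd, req->off, req->len,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
        close(req->fd);
}

With 'filestore flusher = false' that hand-off does not happen, which is presumably why toggling the option changes the behavior reported above.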

Comment 29 Eric Sandeen 2011-10-18 17:30:35 UTC
Oops where is my delete comment button; I typo'd the journal device name and it overwrote the filesystem.  Please ignore the last comment...

Comment 30 Eric Sandeen 2011-10-19 19:02:32 UTC
I've hacked streamtest.cc to the point where I can reproduce this now, after a 3 minute run.

Only new datapoint so far is that if I disable delayed allocation, the problem goes away.

Comment 31 Eric Sandeen 2011-10-19 21:05:52 UTC
And it does (unsurprisingly) seem to be related to writing both data blocks and xattr blocks.  With 0-length files I don't see it, and with 2 short xattrs I don't see it.  1 data block and 1 attr block will do the trick, though.

Comment 32 Eric Sandeen 2011-10-24 22:31:24 UTC
I sent this to the ext4 list today.  It does seem to fix it, but I'm not terribly happy with it; we'll see if anyone has suggestions.

Ceph users reported that when using Ceph on ext4, the filesystem
would often become corrupted, containing inodes with incorrect
i_blocks counters.

I managed to reproduce this with a very hacked-up "streamtest"
binary from the Ceph tree.

Ceph is doing a lot of xattr writes, to out-of-inode blocks.
There is also another thread which does sync_file_range and close,
of the same files.  The problem appears to happen due to this race:

sync/flush thread               xattr-set thread
-----------------               ----------------

do_writepages                   ext4_xattr_set
ext4_da_writepages              ext4_xattr_set_handle
mpage_da_map_blocks             ext4_xattr_block_set
        set DELALLOC_RESERVE
                                ext4_new_meta_blocks
                                        ext4_mb_new_blocks
                                                if (!i_delalloc_reserved_flag)
                                                        vfs_dq_alloc_block
ext4_get_blocks
	down_write(i_data_sem)
        set i_delalloc_reserved_flag
	...
	up_write(i_data_sem)
                                        if (i_delalloc_reserved_flag)
                                                vfs_dq_alloc_block_nofail


In other words, the sync/flush thread pops in and sets
i_delalloc_reserved_flag on the inode, which makes the xattr thread
think that it's in a delalloc path in ext4_new_meta_blocks(),
and add the block for a second time, after already having added
it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks().

I think this can be as simple as taking i_data_sem before we
go down the new metablock path, so that we won't be testing
i_delalloc_reserved_flag in a race with ext4_get_blocks setting it.

Signed-off-by: Eric Sandeen <sandeen>
---

This seems like slight abuse of i_data_sem though, since we are
not modifying i_data[] in the inode on the xattr path.  But nothing
else protects readers of i_delalloc_reserved_flag in other
threads.  Thoughts?

p.s. this almost feels like it needs quite a rework; I'm trying
to remember why we track "we are in a delalloc path" in an inode
flag rather than as something passed through the callchain
arguments, but I'm fuzzy on these twisty paths.

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e4d0fca..dcebdbd 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -819,8 +819,14 @@ inserted:
 			if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
 				goal = goal & EXT4_MAX_BLOCK_FILE_PHYS;
+			/*
+			 * take i_data_sem because we will test
+			 * i_delalloc_reserved_flag in ext4_mb_new_blocks
+			 */
+			down_read((&EXT4_I(inode)->i_data_sem));
 			block = ext4_new_meta_blocks(handle, inode,
 						  goal, NULL, &error);
+			up_read((&EXT4_I(inode)->i_data_sem));
 			if (error)
 				goto cleanup;
 



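Stripped of the filesystem specifics, the race in the table above is a plain unsynchronized-flag problem and can be mimicked in user space. The toy below is only an illustration of the pattern, none of it is ext4 code: the rwlock plays i_data_sem, the flag plays i_delalloc_reserved_flag, the counter plays the i_blocks/quota charge, and taking the read lock around the whole test/alloc/test sequence (as the patch does with i_data_sem) closes the window:

/* toy model of the race; user-space illustration only, not ext4 code */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int flag;                    /* plays i_delalloc_reserved_flag */
static int charged;                 /* how many times one block got accounted */
static pthread_rwlock_t sem = PTHREAD_RWLOCK_INITIALIZER;  /* plays i_data_sem */

static void *xattr_thread(void *arg)
{
        (void)arg;
        /* the fix: hold the lock for reading around the whole sequence */
        /* pthread_rwlock_rdlock(&sem); */
        if (!flag)
                charged++;          /* "vfs_dq_alloc_block" */
        sched_yield();              /* window where the flusher can flip the flag */
        if (flag)
                charged++;          /* "vfs_dq_alloc_block_nofail": second charge */
        /* pthread_rwlock_unlock(&sem); */
        return NULL;
}

static void *flush_thread(void *arg)
{
        (void)arg;
        pthread_rwlock_wrlock(&sem);  /* down_write(i_data_sem) */
        flag = 1;                     /* set i_delalloc_reserved_flag */
        pthread_rwlock_unlock(&sem);  /* up_write(i_data_sem) */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, xattr_thread, NULL);
        pthread_create(&b, NULL, flush_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("block charged %d time(s)\n", charged);  /* 2 == the i_blocks overcount */
        return 0;
}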

Comment 40 Aristeu Rozanski 2012-02-10 19:33:41 UTC
Patch(es) available on kernel-2.6.32-229.el6

Comment 44 Miroslav Svoboda 2012-06-18 14:31:23 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When running a userspace program, such as the Ceph client, on the ext4 file system, a race condition between the sync/flush thread and the xattr-set thread could occur. This was caused by an incorrectly-set state flag on an inode. As a consequence, file system blocks were accounted incorrectly, which resulted in file system corruption. With this update, the ext4 code has been modified to prevent this race condition from occurring, and file systems are no longer corrupted under these circumstances.

Comment 46 errata-xmlrpc 2012-06-20 07:49:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html