Bug 762750 (GLUSTER-1018) - Using GlusterFS for FS migration fails on mainline
Summary: Using GlusterFS for FS migration fails on mainline
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1018
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL: http://pl.atyp.us/wordpress/?p=2908
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-06-22 16:04 UTC by Jeff Darcy
Modified: 2011-03-10 16:58 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
Volfiles, logs, and gdb traces for two failure scenarios (4.93 KB, application/octet-stream)
2010-06-22 13:06 UTC, Jeff Darcy

Description Jeff Darcy 2010-06-22 13:06:06 UTC
Created attachment 236 [details]
Volfiles, logs, and gdb traces for two failure scenarios

Comment 1 Jeff Darcy 2010-06-22 16:04:55 UTC
As described in the URL above, using GlusterFS for local-filesystem migration used to work and could be quite useful.  I checked the version I had tested with, and it's older than I thought - glusterfs 3.0.0git built on Feb 18 2010.  Running on a freshly installed RHEL 5.5 system using a fresh git clone from today, I get different results.  I've attached a tar file containing the relevant volfiles, logs, etc.

When using replicate directly over storage/posix subvolumes, the migration succeeds some of the time but with delays and with errors in the logs (direct.log).  Other times it seems to hang (gdb trace in direct.script).  When interposing the client/server modules, which used to be the formula to make this work, it always hangs (network.log and network.script).
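
For reference, the "direct" layout means a hand-written volfile along these lines (volume names and paths here are made up for illustration; the actual volfiles are in the attached tarball):

# existing local filesystem being migrated away from
volume src-posix
  type storage/posix
  option directory /mnt/old-fs
end-volume

# afr's self-heal takes locks, so a features/locks layer sits above each posix volume
volume src
  type features/locks
  subvolumes src-posix
end-volume

# new filesystem being migrated to
volume dst-posix
  type storage/posix
  option directory /mnt/new-fs
end-volume

volume dst
  type features/locks
  subvolumes dst-posix
end-volume

# replicate pairs the two; self-heal copies src -> dst as files are accessed
volume migrate
  type cluster/replicate
  subvolumes src dst
end-volume

Mounting that volume and walking the tree is what drives the self-heal copy from the old filesystem to the new one; the "network" variant just interposes protocol/client and protocol/server between replicate and the bricks.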

I know this is not a supported use of GlusterFS, but since it's a regression it might be indicative of other problems that could turn out to be more pressing so I figured you might want to know anyway.

Comment 2 Pranith Kumar K 2011-03-10 01:11:12 UTC
Putting afr on top of posix will make the self-heal read-write calls go into deep recursion, consuming all the stack space, so it will always run out of stack for a sufficiently large file. Please use the replace-brick functionality in 3.1.x to do the migration.
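
To illustrate the shape of the problem, here is a standalone sketch (plain C, not the actual afr/posix code): with posix loaded directly below afr there is no event-loop hop, so every call "completes" inside the caller's own stack, and the read/write callbacks keep nesting until the stack is gone.

/* sketch.c - build with "gcc -O0 sketch.c" so the compiler does not turn
 * the tail calls into jumps. Hypothetical code, not the real translators. */
#include <stdio.h>
#include <stdint.h>

#define CHUNK      (64 * 1024)              /* 64 KiB per read/write, as in the trace */
#define FILE_SIZE  (1536LL * 1024 * 1024)   /* ~1.5 GB, like the file being healed */

static void read_chunk(int64_t offset);
static void write_chunk(int64_t offset);

/* Write "completion callback": kicks off the read of the next chunk. */
static void write_cbk(int64_t offset)
{
    read_chunk(offset + CHUNK);
}

/* Read "completion callback": issues the write of the chunk just read. */
static void read_cbk(int64_t offset)
{
    write_chunk(offset);
}

/* With storage/posix in the same stack there is no event loop in between:
 * each call finishes immediately and invokes its callback right here, so
 * nothing below ever returns until the whole file has been copied. */
static void write_chunk(int64_t offset)
{
    write_cbk(offset);                      /* synchronous completion */
}

static void read_chunk(int64_t offset)
{
    if (offset >= FILE_SIZE) {
        printf("finished at offset %lld\n", (long long) offset);
        return;
    }
    read_cbk(offset);                       /* synchronous completion */
}

int main(void)
{
    /* ~1.5 GB / 64 KiB = 24576 chunks, each adding four nested frames here;
     * the real translators push many more and much larger frames per chunk,
     * so the stack is exhausted long before the copy completes. */
    read_chunk(0);
    return 0;
}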

Comment 3 Jeff Darcy 2011-03-10 11:08:42 UTC
(In reply to comment #2)
> putting afr on top of posix will make the self-heal read-write calls go into
> deep recursion consuming all the stack space. So it will always run out of
> space for sufficiently large file. So please use replace-brick functionality in
> 3.1.x to do migration.

Just curious: why would this cause deep recursion on read/write calls?  Is it because the self-heal code uses the completion callback for a read on subvolume A to trigger the write on subvolume B, and the completion callback for the write on B to trigger the next read on A, all within the same context?  If so, why doesn't this show up in the stack trace and why doesn't interposing client+server work?  How could such an approach work in the normal case where client and server are on separate machines?

There might be good reasons why this won't work and shouldn't be expected to work, but I'm not convinced that the read/write recursion you mention is such a reason.

Comment 4 Pranith Kumar K 2011-03-10 12:15:50 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > putting afr on top of posix will make the self-heal read-write calls go into
> > deep recursion consuming all the stack space. So it will always run out of
> > space for sufficiently large file. So please use replace-brick functionality in
> > 3.1.x to do migration.
> 
> Just curious: why would this cause deep recursion on read/write calls?  Is it
> because the self-heal code uses the completion callback for a read on subvolume
> A to trigger the write on subvolume B, and the completion callback for the
> write on B to trigger the next read on A, all within the same context?  If so,
> why doesn't this show up in the stack trace and why doesn't interposing
> client+server work?  How could such an approach work in the normal case where
> client and server are on separate machines?
> 
> There might be good reasons why this won't work and shouldn't be expected to
> work, but I'm not convinced that the read/write recursion you mention is such a
> reason.

Hi Jeff,
I've pasted below the backtrace from self-healing a file of around 1.5 GB:
pranith @ ~/Desktop/gluster_repl_bug
761752:43:35 $ free -k
             total       used       free     shared    buffers     cached
Mem:       2019944    1984928      35016          0      15820    1115504
-/+ buffers/cache:     853604    1166340
Swap:      4880376          0    4880376

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7fac8e77f700 (LWP 2471)]
0x00007fac926951a7 in inode_ctx_get (inode=0x7fac8ea091f8, key=0x1b47cc0, value=0x7fac8df80060) at ../../../libglusterfs/src/inode.c:1421
1421	        return inode_ctx_get2 (inode, key, value, 0);
(gdb) bt
#0  0x00007fac926951a7 in inode_ctx_get (inode=0x7fac8ea091f8, key=0x1b47cc0, value=0x7fac8df80060) at ../../../libglusterfs/src/inode.c:1421
#1  0x00007fac8feef86f in pl_inode_get (this=0x1b47cc0, inode=0x7fac8ea091f8) at ../../../../../xlators/features/locks/src/common.c:425
#2  0x00007fac8fef4f13 in pl_writev (frame=0x7fac91013d08, this=0x1b47cc0, fd=0x7fac8e780024, vector=0x7fac8df80380, count=1, offset=220725248, iobref=0x1bb39d0)
    at ../../../../../xlators/features/locks/src/posix.c:779
#3  0x00007fac8fcbe546 in sh_full_read_cbk (rw_frame=0x7fac90dea1c0, cookie=0xd280000, this=0x1b48700, op_ret=65536, op_errno=0, vector=0x7fac8df80380, count=1, 
    buf=0x7fac8df803f0, iobref=0x1bb39d0) at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:246
#4  0x00007fac8fef4054 in pl_readv_cbk (frame=0x7fac91013c00, cookie=0x7fac91013c84, this=0x1b46140, op_ret=65536, op_errno=0, vector=0x7fac8df80380, count=1, 
    stbuf=0x7fac8df803f0, iobref=0x1bb39d0) at ../../../../../xlators/features/locks/src/posix.c:587
#5  0x00007fac90114938 in posix_readv (frame=0x7fac91013c84, this=0x1b44f10, fd=0x7fac8e780024, size=65536, offset=220725248)
    at ../../../../../xlators/storage/posix/src/posix.c:2370
#6  0x00007fac8fef4aef in pl_readv (frame=0x7fac91013c00, this=0x1b46140, fd=0x7fac8e780024, size=65536, offset=220725248)
    at ../../../../../xlators/features/locks/src/posix.c:738
#7  0x00007fac8fcbe92d in sh_full_read_write (frame=0x7fac90dea0ac, this=0x1b48700, offset=220725248)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:291
#8  0x00007fac8fcbeb77 in sh_full_loop_driver (frame=0x7fac90dea0ac, this=0x1b48700, is_first_call=_gf_false)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:359
#9  0x00007fac8fcbde7b in sh_full_loop_return (rw_frame=0x7fac90dea1c0, this=0x1b48700, offset=220659712)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:125
#10 0x00007fac8fcbe0b8 in sh_full_write_cbk (rw_frame=0x7fac90dea1c0, cookie=0x1, this=0x1b48700, op_ret=65536, op_errno=0, prebuf=0x7fac8df809a0, 
    postbuf=0x7fac8df80930) at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:178
#11 0x00007fac8fef4199 in pl_writev_cbk (frame=0x7fac91013d08, cookie=0x7fac91013d8c, this=0x1b47cc0, op_ret=65536, op_errno=0, prebuf=0x7fac8df809a0, 
    postbuf=0x7fac8df80930) at ../../../../../xlators/features/locks/src/posix.c:598
#12 0x00007fac901154fd in posix_writev (frame=0x7fac91013d8c, this=0x1b471e0, fd=0x7fac8e780024, vector=0x7fac8df80d40, count=1, offset=220659712, iobref=0x1bb3980)
    at ../../../../../xlators/storage/posix/src/posix.c:2550
#13 0x00007fac8fef5417 in pl_writev (frame=0x7fac91013d08, this=0x1b47cc0, fd=0x7fac8e780024, vector=0x7fac8df80d40, count=1, offset=220659712, iobref=0x1bb3980)
    at ../../../../../xlators/features/locks/src/posix.c:837
#14 0x00007fac8fcbe546 in sh_full_read_cbk (rw_frame=0x7fac90dea1c0, cookie=0xd270000, this=0x1b48700, op_ret=65536, op_errno=0, vector=0x7fac8df80d40, count=1, 
    buf=0x7fac8df80db0, iobref=0x1bb3980) at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:246
#15 0x00007fac8fef4054 in pl_readv_cbk (frame=0x7fac91013c00, cookie=0x7fac91013c84, this=0x1b46140, op_ret=65536, op_errno=0, vector=0x7fac8df80d40, count=1, 
    stbuf=0x7fac8df80db0, iobref=0x1bb3980) at ../../../../../xlators/features/locks/src/posix.c:587
#16 0x00007fac90114938 in posix_readv (frame=0x7fac91013c84, this=0x1b44f10, fd=0x7fac8e780024, size=65536, offset=220659712)
    at ../../../../../xlators/storage/posix/src/posix.c:2370
#17 0x00007fac8fef4aef in pl_readv (frame=0x7fac91013c00, this=0x1b46140, fd=0x7fac8e780024, size=65536, offset=220659712)
    at ../../../../../xlators/features/locks/src/posix.c:738
#18 0x00007fac8fcbe92d in sh_full_read_write (frame=0x7fac90dea0ac, this=0x1b48700, offset=220659712)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:291
......

#18668 0x00007fac901154fd in posix_writev (frame=0x7fac91013d8c, this=0x1b471e0, fd=0x7fac8e780024, vector=0x7fac8e38a540, count=1, offset=109510656, iobref=0x1b808a0)
    at ../../../../../xlators/storage/posix/src/posix.c:2550
#18669 0x00007fac8fef5417 in pl_writev (frame=0x7fac91013d08, this=0x1b47cc0, fd=0x7fac8e780024, vector=0x7fac8e38a540, count=1, offset=109510656, iobref=0x1b808a0)
    at ../../../../../xlators/features/locks/src/posix.c:837
#18670 0x00007fac8fcbe546 in sh_full_read_cbk (rw_frame=0x7fac90dea1c0, cookie=0x6870000, this=0x1b48700, op_ret=65536, op_errno=0, vector=0x7fac8e38a540, count=1, 
    buf=0x7fac8e38a5b0, iobref=0x1b808a0) at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:246
#18671 0x00007fac8fef4054 in pl_readv_cbk (frame=0x7fac91013c00, cookie=0x7fac91013c84, this=0x1b46140, op_ret=65536, op_errno=0, vector=0x7fac8e38a540, count=1, 
    stbuf=0x7fac8e38a5b0, iobref=0x1b808a0) at ../../../../../xlators/features/locks/src/posix.c:587
#18672 0x00007fac90114938 in posix_readv (frame=0x7fac91013c84, this=0x1b44f10, fd=0x7fac8e780024, size=65536, offset=109510656)
    at ../../../../../xlators/storage/posix/src/posix.c:2370
#18673 0x00007fac8fef4aef in pl_readv (frame=0x7fac91013c00, this=0x1b46140, fd=0x7fac8e780024, size=65536, offset=109510656)
    at ../../../../../xlators/features/locks/src/posix.c:738
#18674 0x00007fac8fcbe92d in sh_full_read_write (frame=0x7fac90dea0ac, this=0x1b48700, offset=109510656)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:291
#18675 0x00007fac8fcbeb77 in sh_full_loop_driver (frame=0x7fac90dea0ac, this=0x1b48700, is_first_call=_gf_false)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:359
#18676 0x00007fac8fcbde7b in sh_full_loop_return (rw_frame=0x7fac90dea1c0, this=0x1b48700, offset=109445120)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:125
#18677 0x00007fac8fcbe0b8 in sh_full_write_cbk (rw_frame=0x7fac90dea1c0, cookie=0x1, this=0x1b48700, op_ret=65536, op_errno=0, prebuf=0x7fac8e38ab60, 
    postbuf=0x7fac8e38aaf0) at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:178
#18678 0x00007fac8fef4199 in pl_writev_cbk (frame=0x7fac91013d08, cookie=0x7fac91013d8c, this=0x1b47cc0, op_ret=65536, op_errno=0, prebuf=0x7fac8e38ab60, 
    postbuf=0x7fac8e38aaf0) at ../../../../../xlators/features/locks/src/posix.c:598
#18679 0x00007fac901154fd in posix_writev (frame=0x7fac91013d8c, this=0x1b471e0, fd=0x7fac8e780024, vector=0x7fac8e38af00, count=1, offset=109445120, iobref=0x1b80850)
    at ../../../../../xlators/storage/posix/src/posix.c:2550
#18680 0x00007fac8fef5417 in pl_writev (frame=0x7fac91013d08, this=0x1b47cc0, fd=0x7fac8e780024, vector=0x7fac8e38af00, count=1, offset=109445120, iobref=0x1b80850)
    at ../../../../../xlators/features/locks/src/posix.c:837
#18681 0x00007fac8fcbe546 in sh_full_read_cbk (rw_frame=0x7fac90dea1c0, cookie=0x6860000, this=0x1b48700, op_ret=65536, op_errno=0, vector=0x7fac8e38af00, count=1, 
    buf=0x7fac8e38af70, iobref=0x1b80850) at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:246
#18682 0x00007fac8fef4054 in pl_readv_cbk (frame=0x7fac91013c00, cookie=0x7fac91013c84, this=0x1b46140, op_ret=65536, op_errno=0, vector=0x7fac8e38af00, count=1, 
    stbuf=0x7fac8e38af70, iobref=0x1b80850) at ../../../../../xlators/features/locks/src/posix.c:587
#18683 0x00007fac90114938 in posix_readv (frame=0x7fac91013c84, this=0x1b44f10, fd=0x7fac8e780024, size=65536, offset=109445120)
    at ../../../../../xlators/storage/posix/src/posix.c:2370
#18684 0x00007fac8fef4aef in pl_readv (frame=0x7fac91013c00, this=0x1b46140, fd=0x7fac8e780024, size=65536, offset=109445120)
    at ../../../../../xlators/features/locks/src/posix.c:738
#18685 0x00007fac8fcbe92d in sh_full_read_write (frame=0x7fac90dea0ac, this=0x1b48700, offset=109445120)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-algorithm.c:291
.......
There are still more frames, but I think you get the point.

Please let me know if I can close this bug.

Comment 5 Pranith Kumar K 2011-03-10 12:24:11 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > putting afr on top of posix will make the self-heal read-write calls go into
> > deep recursion consuming all the stack space. So it will always run out of
> > space for sufficiently large file. So please use replace-brick functionality in
> > 3.1.x to do migration.
> 
> Just curious: why would this cause deep recursion on read/write calls?  Is it
> because the self-heal code uses the completion callback for a read on subvolume
> A to trigger the write on subvolume B, and the completion callback for the
> write on B to trigger the next read on A, all within the same context?  If so,
> why doesn't this show up in the stack trace and why doesn't interposing
> client+server work?  How could such an approach work in the normal case where
> client and server are on separate machines?
> 
> There might be good reasons why this won't work and shouldn't be expected to
> work, but I'm not convinced that the read/write recursion you mention is such a
> reason.

Sorry, I did not answer the full question:
>> If so, why doesn't this show up in the stack trace and why doesn't interposing client+server work?  How could such an approach work in the normal case where client and server are on separate machines?

The client is loaded in the mount-glusterfs process and the server is loaded in the brick-glusterfs process. So client_writev transmits the data to the server, which writes the data to disk and then sends the response, which triggers client_writev_cbk from the main event loop. The same happens for all the fops.
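
The same sketch from comment 2, reworked with an event-loop hop in the middle (again plain C, not GlusterFS code), shows why that path stays flat: each completion is dispatched from the top of the loop, so the stack depth never grows with file size.

#include <stdio.h>
#include <stdint.h>

#define CHUNK     (64 * 1024)
#define FILE_SIZE (1536LL * 1024 * 1024)

/* One pending "completion"; a real transport would carry the reply data. */
struct event {
    void  (*cbk)(int64_t offset);
    int64_t offset;
    int     pending;
};

static struct event queue;              /* a single slot is enough for this sketch */

static void post(void (*cbk)(int64_t), int64_t offset)
{
    queue.cbk     = cbk;
    queue.offset  = offset;
    queue.pending = 1;
}

static void read_cbk(int64_t offset);
static void write_cbk(int64_t offset);

/* The "RPC" calls: instead of completing inline, they queue the completion. */
static void read_chunk(int64_t offset)  { post(read_cbk, offset); }
static void write_chunk(int64_t offset) { post(write_cbk, offset); }

/* Read finished: issue the write of the chunk just read. */
static void read_cbk(int64_t offset)
{
    write_chunk(offset);
}

/* Write finished: move on to the next chunk, if any. */
static void write_cbk(int64_t offset)
{
    if (offset + CHUNK < FILE_SIZE)
        read_chunk(offset + CHUNK);
}

int main(void)
{
    read_chunk(0);                      /* kick off the first read */

    /* The "main event loop": every callback unwinds back to here before the
     * next one runs, so stack depth stays constant regardless of file size. */
    while (queue.pending) {
        queue.pending = 0;
        queue.cbk(queue.offset);
    }
    printf("copied %lld bytes with bounded stack depth\n", (long long) FILE_SIZE);
    return 0;
}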

Comment 6 Jeff Darcy 2011-03-10 13:46:38 UTC
(In reply to comment #5)
> >> If so, why doesn't this show up in the stack trace and why doesn't interposing client+server work?  How could such an approach work in the normal case where client and server are on separate machines?
> 
> client is loaded on the mount-glusterfs process and the server is loaded on the
> brick-glusterfs process. So client_writev transmits the data to server which
> will write the data on to the disk then it sends the responds which triggers
> client_writev_cbk from the main event loop. Same happens for all the fops.

OK, thanks for the explanation.  This does look like a different failure mode than I originally saw, but is consistent with what we had discussed.  I'm guessing that the reason loading client+server in one process (see network.vol) doesn't work is that we recognize the request as local and turn it around as a direct callback - a good optimization, but kills us in this case.

It's OK to close this.

Comment 7 Pranith Kumar K 2011-03-10 13:58:11 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > >> If so, why doesn't this show up in the stack trace and why doesn't interposing client+server work?  How could such an approach work in the normal case where client and server are on separate machines?
> > 
> > client is loaded on the mount-glusterfs process and the server is loaded on the
> > brick-glusterfs process. So client_writev transmits the data to server which
> > will write the data on to the disk then it sends the responds which triggers
> > client_writev_cbk from the main event loop. Same happens for all the fops.
> 
> OK, thanks for the explanation.  This does look like a different failure mode
> than I originally saw, but is consistent with what we had discussed.  I'm
> guessing that the reason loading client+server in one process (see network.vol)
> doesn't work is that we recognize the request as local and turn it around as a
> direct callback - a good optimization, but kills us in this case.
> 
> It's OK to close this.

Yes, when you reported the problem it was also hitting bug 762920 (memory corruption), which I have already fixed. Closing this bug.

