+++ This bug was initially created as a clone of Bug #1469029 +++

Description of problem:
The rebalance process hangs on a remove-brick operation if the original hashed subvol fails the min-free-disk check and a different target is selected.

Version-Release number of selected component (if applicable):

How reproducible:
Consistently

Steps to Reproduce:
1. Create a 1x3 distribute volume with 1 GB bricks
2. Create enough 5MB files on the volume such that no 2 bricks can accommodate all the files.
3. Run a remove-brick to remove one of the bricks

Actual results:
The rebalance hangs.

Expected results:
The rebalance process should terminate once all files are processed.

Additional info:

--- Additional comment from Nithya Balachandran on 2017-07-10 06:13:12 EDT ---

RCA:

From gdb:

Thread 13 (Thread 0x7f7740823700 (LWP 27258)):
#0 0x00007f7748006bdd in nanosleep () from /lib64/libpthread.so.0
#1 0x00007f77491bd8cb in gf_timer_proc (data=0xd0af80) at timer.c:176
#2 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7740022700 (LWP 27259)):
#0 0x00007f7748007101 in sigwait () from /lib64/libpthread.so.0
#1 0x0000000000409e72 in glusterfs_sigwaiter (arg=0x7fff5bfe2da0) at glusterfsd.c:2069
#2 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f773f821700 (LWP 27260)):
#0 0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1 0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2 0x00007f77491dff7b in pool_sweeper (arg=0x0) at mem-pool.c:464
#3 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f773f020700 (LWP 27261)):
#0 0x00007f7748003a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f77491f54ab in syncenv_task (proc=0xd0b7d0) at syncop.c:603
#2
0x00007f77491f5746 in syncenv_processor (thdata=0xd0b7d0) at syncop.c:695 #3 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f774794473d in clone () from /lib64/libc.so.6 Thread 9 (Thread 0x7f773e81f700 (LWP 27262)): #0 0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0 #1 0x00007f773b66410f in gf_defrag_start_crawl (data=0x7f773400dfc0) at dht-rebalance.c:4479 #2 0x00007f77491f4c7a in synctask_wrap (old_task=0x7f7724001400) at syncop.c:375 #3 0x00007f7747893cf0 in ?? () from /lib64/libc.so.6 #4 0x0000000000000000 in ?? () Thread 8 (Thread 0x7f773c357700 (LWP 27263)): #0 0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6 #1 0x00007f774921b1ef in event_dispatch_epoll_worker (data=0xd49290) at event-epoll.c:638 #2 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f774794473d in clone () from /lib64/libc.so.6 Thread 7 (Thread 0x7f773ab14700 (LWP 27264)): ---Type <return> to continue, or q <return> to quit--- #0 0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6 #1 0x00007f774921b1ef in event_dispatch_epoll_worker (data=0x7f773401c720) at event-epoll.c:638 #2 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #3 0x00007f774794473d in clone () from /lib64/libc.so.6 Thread 6 (Thread 0x7f77395fa700 (LWP 27266)): #0 0x00007f774790b66d in nanosleep () from /lib64/libc.so.6 #1 0x00007f774790b504 in sleep () from /lib64/libc.so.6 #2 0x00007f773b6632af in dht_file_counter_thread (args=0x7f773401b430) at dht-rebalance.c:4158 #3 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #4 0x00007f774794473d in clone () from /lib64/libc.so.6 Thread 5 (Thread 0x7f7738df9700 (LWP 27267)): #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0 #2 0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, 
fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303 #4 0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459 #5 0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472 #6 0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091 #7 0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319 #8 0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c) at dht-rebalance.c:1750 #9 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645 #10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812 #11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #12 0x00007f774794473d in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7f7723fff700 (LWP 27268)): #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0 #2 0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303 #4 0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008) at client-common.c:375 #5 0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361 #6 0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, 
off=4194304, flags=0, iobref=0x7f7730002600, xdata=0x0) at client.c:1036 #7 0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975 #8 0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, ---Type <return> to continue, or q <return> to quit--- fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028 #9 0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1733 #10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645 #11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812 #12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #13 0x00007f774794473d in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7f77237fe700 (LWP 27269)): #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0 #2 0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984 #4 0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73 #5 0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305 #6 0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683 #7 0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606 #8 0x00007f773b8fe8ba in 
client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626 #9 0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261 #10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c) at dht-rebalance.c:2200 #11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645 #12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812 #13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #14 0x00007f774794473d in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7f771affd700 (LWP 27270)): #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0 #2 0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303 #4 0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008) at client-common.c:375 #5 0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361 #6 0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, iobref=0x7f773000e800, xdata=0x0) at client.c:1036 #7 0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975 #8 0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, 
src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9 0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f774969a780 (LWP 27257)):
#0 0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1 0x00007f774921b446 in event_dispatch_epoll (event_pool=0xd01f70) at event-epoll.c:732
#2 0x00007f77491de754 in event_dispatch (event_pool=0xd01f70) at event.c:124
#3 0x000000000040ab6a in main (argc=31, argv=0x7fff5bfe3ff8) at glusterfsd.c:2479

There are 4 threads (2, 3, 4, 5) which are hung.

(gdb) t 2 [Switching to thread 2 (Thread 0x7f771affd700 (LWP 27270))] #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0 #2 0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303 #4 0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008) at client-common.c:375 #5 0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361 #6 0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, iobref=0x7f773000e800, xdata=0x0) at client.c:1036 #7 0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975 #8 0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028 #9 0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1733 #10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645 #11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812 #12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #13 0x00007f774794473d in clone () from /lib64/libc.so.6 (gdb) f 3 #3 0x00007f773b90c765 
in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303 303 pthread_mutex_lock (&conf->lock); (gdb) t 3 [Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))] #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0 #2 0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984 #4 0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73 #5 0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305 #6 0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683 #7 0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606 #8 0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626 #9 0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261 #10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c) at dht-rebalance.c:2200 #11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645 #12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812 #13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #14 0x00007f774794473d in clone () from /lib64/libc.so.6 (gdb) f 3 #3 0x00007f77491dd5d6 
in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984 984 LOCK (&fd->lock); (gdb) t 4 [Switching to thread 4 (Thread 0x7f7723fff700 (LWP 27268))] #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0 #2 0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303 #4 0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008) at client-common.c:375 #5 0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361 #6 0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, off=4194304, flags=0, iobref=0x7f7730002600, xdata=0x0) at client.c:1036 #7 0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975 #8 0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028 #9 0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1733 #10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645 #11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812 #12 0x00007f7747fffdc5 in start_thread () from 
/lib64/libpthread.so.0 #13 0x00007f774794473d in clone () from /lib64/libc.so.6 (gdb) f 3 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303 303 pthread_mutex_lock (&conf->lock); (gdb) t 5 [Switching to thread 5 (Thread 0x7f7738df9700 (LWP 27267))] #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) bt #0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0 #2 0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303 #4 0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459 #5 0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472 #6 0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091 #7 0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319 #8 0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c) at dht-rebalance.c:1750 #9 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645 #10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812 #11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0 #12 0x00007f774794473d in clone () from /lib64/libc.so.6 (gdb) f 3 #3 0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303 
303 pthread_mutex_lock (&conf->lock);

Threads 2, 4 and 5 are waiting on conf->lock which is held by thread 3.
Thread 3 is waiting on fd->lock. However, it does not look like any other thread is holding fd->lock.

From thread 3:

(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2 0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4 0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5 0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6 0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7 0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8 0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9 0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c) at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 10
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c) at dht-rebalance.c:2200
2200 lk_ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
(gdb) p *loc
$17 = {path = 0x7f771c000da0 "/xfile-198", name = 0x7f771c000da1 "xfile-198", inode = 0x7f771c001050, parent = 0x7f7724003520, gfid = "\261\020w\327\370\251I\205\242,\202\031Z\322#A", pargfid = '\000' <repeats 15 times>, "\001"}

From the rebalance log:

2990 [2017-07-10 04:45:24.741931] I [dht-rebalance.c:1515:dht_migrate_file] 0-vol1-dht: /xfile-198: attempting to move from vol1-client-2 to vol1-client-1
2991 [2017-07-10 04:45:24.770386] W [MSGID: 0] [dht-rebalance.c:926:__dht_check_free_space] 0-vol1-dht: Write will cross min-free-disk for file - /xfile-198 on subvol - vol1-client-1. Looking for new subvol
2992 [2017-07-10 04:45:24.770428] I [MSGID: 0] [dht-rebalance.c:985:__dht_check_free_space] 0-vol1-dht: new target found - vol1-client-0 for file - /xfile-198
2993 [2017-07-10 04:45:24.778057] W [MSGID: 114031] [client-rpc-fops.c:2004:client3_3_fallocate_cbk] 0-vol1-client-0: remote operation failed [No space left on device]
2994 [2017-07-10 04:45:24.778094] E [MSGID: 109023] [dht-rebalance.c:789:__dht_rebalance_create_dst_file] 0-vol1-dht: fallocate failed for /xfile-198 on vol1-client-0 (No space left on device)
2995 [2017-07-10 04:45:24.778122] E [dht-rebalance.c:1670:dht_migrate_file] 0-vol1-dht: Create dst failed on - vol1-client-0 for file - /xfile-198

These are the last messages logged.

From the dht_migrate_file code:

dht_migrate_file ()
{
...
        /* create the destination, with required modes/xattr */
        ret = __dht_rebalance_create_dst_file (this, to, from, loc, &stbuf,
                                               &dst_fd, xattr, fop_errno);
        if (ret) {
                gf_msg (this->name, GF_LOG_ERROR, 0, 0, "Create dst failed"
                        " on - %s for file - %s", to->name, loc->path);
                goto out;
        }

        clean_dst = _gf_true;     <-- the dst file will be cleaned up

        ret = __dht_check_free_space (this, to, from, loc, &stbuf, flag, conf,
                                      &target_changed, &new_target,
                                      &ignore_failure, fop_errno);
        if (target_changed) {
                /* Can't handle for hardlinks. Marking this as failure */
                if (flag == GF_DHT_MIGRATE_HARDLINK_IN_PROGRESS ||
                    stbuf.ia_nlink > 1) {
                        gf_msg (this->name, GF_LOG_ERROR, 0,
                                DHT_MSG_SUBVOL_INSUFF_SPACE, "Exiting migration for"
                                " file - %s. flag - %d, stbuf.ia_nlink - %d",
                                loc->path, flag, stbuf.ia_nlink);
                        ret = -1;
                        goto out;
                }

                ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
                if (ret) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "%s: failed to perform truncate on %s (%s)",
                                loc->path, to->name, strerror (-ret));
                        ret = -1;
                }

                syncop_close (dst_fd);     <-- this is now an invalid fd for the dst cleanup

                old_target = to;
                to = new_target;

                /* if the file migration is successful to this new target, then
                 * update the xattr on the old destination to point the new
                 * destination. We need to do update this only post migration
                 * as in case of failure the linkto needs to point to the source
                 * subvol */
                ret = __dht_rebalance_create_dst_file (this, to, from, loc, &stbuf,
                                                       &dst_fd, xattr, fop_errno);
                if (ret) {
                        gf_log (this->name, GF_LOG_ERROR, "Create dst failed"
                                " on - %s for file - %s", to->name, loc->path);
                        goto out;     <<< If this fails here, clean_dst is set to true but the fd is invalid, causing the hang on fd->lock>>>
                } else {
                        gf_msg (this->name, GF_LOG_INFO, 0, 0, "destination for file "
                                "- %s is changed to - %s", loc->path, to->name);
                }
        }
...

--- Additional comment from Worker Ant on 2017-07-10 06:17:54 EDT ---

REVIEW: https://review.gluster.org/17735 (cluster/dht: Clear clean_dst flag on target change) posted (#1) for review on master by N Balachandran (nbalacha)
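
For readers who want to see the failure mode in isolation, here is a minimal sketch of the stale-flag pattern described in the RCA. It is not the GlusterFS code and not the patch under review: struct fake_fd, create_dst, close_dst, cleanup_dst, migrate and stale_cleanups are all hypothetical names invented for this illustration, and the "fix" branch is only an assumption based on the review title (clearing clean_dst when the target changes). In the real process the cleanup touches a released fd and blocks on fd->lock; the sketch only counts such accesses so that it stays well-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a destination fd; NOT the GlusterFS fd_t. */
struct fake_fd {
    bool open;
};

/* Counts cleanup attempts made on an fd that was already closed. */
static int stale_cleanups = 0;

/* Model of creating the destination file; fails when succeed is false. */
static struct fake_fd *create_dst(bool succeed)
{
    if (!succeed)
        return NULL;
    struct fake_fd *fd = malloc(sizeof(*fd));
    if (fd)
        fd->open = true;
    return fd;
}

/* Model of closing the fd: only marks it closed, memory is freed later. */
static void close_dst(struct fake_fd *fd)
{
    assert(fd != NULL);
    fd->open = false;
}

/* Model of the cleanup done at "out" when clean_dst is set. */
static void cleanup_dst(struct fake_fd *fd)
{
    assert(fd != NULL);
    if (!fd->open) {
        stale_cleanups++;   /* the real code would touch a released fd here */
        return;
    }
    fd->open = false;       /* truncate and close would happen here */
}

/*
 * Model of a migration in which the target always changes after the first
 * destination was created. Returns 0 on success and -1 on failure.
 * clear_flag_on_target_change selects the assumed fixed behaviour.
 */
static int migrate(bool second_create_ok, bool clear_flag_on_target_change)
{
    int ret = -1;
    bool clean_dst = false;
    struct fake_fd *first = create_dst(true);
    struct fake_fd *second = NULL;
    struct fake_fd *dst_fd = first;

    if (!dst_fd)
        goto out;
    clean_dst = true;               /* first destination needs cleanup on error */

    /* Target changed: the first destination fd is closed. */
    close_dst(dst_fd);
    if (clear_flag_on_target_change)
        clean_dst = false;          /* nothing valid is left to clean up */

    second = create_dst(second_create_ok);
    if (!second)
        goto out;                   /* dst_fd still refers to the closed fd */

    dst_fd = second;
    /* Data migration would happen here; on success no cleanup is needed. */
    close_dst(dst_fd);
    clean_dst = false;
    ret = 0;
out:
    if (clean_dst)
        cleanup_dst(dst_fd);
    free(first);
    free(second);
    return ret;
}
```

With the second create failing, migrate(false, false) leaves stale_cleanups at 1 (the cleanup ran against the closed fd), while migrate(false, true) leaves it at 0 because the flag no longer claims there is a destination to clean up.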
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/111944/
On glusterfs version 3.8.4-35.el7rhgs.x86_64, I ran the steps in the description and did not see any rebalance hang during the remove-brick operation. Hence, moving this BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774