+++ This bug was initially created as a clone of Bug #1469029 +++

Description of problem:
The rebalance process hangs on a remove-brick operation if the original hashed subvol fails the min-free-disk check and a different target is selected.

Version-Release number of selected component (if applicable):

How reproducible:
Consistently

Steps to Reproduce:
1. Create a 1x3 distribute volume with 1 GB bricks.
2. Create enough 5 MB files on the volume that no two bricks can accommodate all the files.
3. Run a remove-brick operation to remove one of the bricks.

Actual results:
The rebalance hangs.

Expected results:
The rebalance process should terminate once all files are processed.

Additional info:

--- Additional comment from Nithya Balachandran on 2017-07-10 06:13:12 EDT ---

RCA:

From gdb:

Thread 13 (Thread 0x7f7740823700 (LWP 27258)):
#0  0x00007f7748006bdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f77491bd8cb in gf_timer_proc (data=0xd0af80) at timer.c:176
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7740022700 (LWP 27259)):
#0  0x00007f7748007101 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000409e72 in glusterfs_sigwaiter (arg=0x7fff5bfe2da0) at glusterfsd.c:2069
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f773f821700 (LWP 27260)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f77491dff7b in pool_sweeper (arg=0x0) at mem-pool.c:464
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f773f020700 (LWP 27261)):
#0  0x00007f7748003a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f77491f54ab in syncenv_task (proc=0xd0b7d0) at syncop.c:603
#2  0x00007f77491f5746 in syncenv_processor (thdata=0xd0b7d0) at syncop.c:695
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f773e81f700 (LWP 27262)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f773b66410f in gf_defrag_start_crawl (data=0x7f773400dfc0) at dht-rebalance.c:4479
#2  0x00007f77491f4c7a in synctask_wrap (old_task=0x7f7724001400) at syncop.c:375
#3  0x00007f7747893cf0 in ?? () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f773c357700 (LWP 27263)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0xd49290) at event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f773ab14700 (LWP 27264)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0x7f773401c720) at event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f77395fa700 (LWP 27266)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f773b6632af in dht_file_counter_thread (args=0x7f773401b430) at dht-rebalance.c:4158
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7738df9700 (LWP 27267)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c) at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f7723fff700 (LWP 27268)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008) at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, off=4194304, flags=0, iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f77237fe700 (LWP 27269)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c) at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f771affd700 (LWP 27270)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008) at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f774969a780 (LWP 27257)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f774921b446 in event_dispatch_epoll (event_pool=0xd01f70) at event-epoll.c:732
#2  0x00007f77491de754 in event_dispatch (event_pool=0xd01f70) at event.c:124
#3  0x000000000040ab6a in main (argc=31, argv=0x7fff5bfe3ff8) at glusterfsd.c:2479

There are four threads (2, 3, 4 and 5) that are hung.
(gdb) t 2
[Switching to thread 2 (Thread 0x7f771affd700 (LWP 27270))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008) at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c) at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
303             pthread_mutex_lock (&conf->lock);

(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c) at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
984             LOCK (&fd->lock);

(gdb) t 4
[Switching to thread 4 (Thread 0x7f7723fff700 (LWP 27268))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008) at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, off=4194304, flags=0, iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
303             pthread_mutex_lock (&conf->lock);

(gdb) t 5
[Switching to thread 5 (Thread 0x7f7738df9700 (LWP 27267))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c) at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
303             pthread_mutex_lock (&conf->lock);

Threads 2, 4 and 5 are waiting on conf->lock, which is held by thread 3. Thread 3 is waiting on fd->lock. However, no other thread appears to be holding fd->lock.

From thread 3:

(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) f 10
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c) at dht-rebalance.c:2200
2200            lk_ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
(gdb) p *loc
$17 = {path = 0x7f771c000da0 "/xfile-198", name = 0x7f771c000da1 "xfile-198", inode = 0x7f771c001050, parent = 0x7f7724003520, gfid = "\261\020w\327\370\251I\205\242,\202\031Z\322#A", pargfid = '\000' <repeats 15 times>, "\001"}

From the rebalance log:

[2017-07-10 04:45:24.741931] I [dht-rebalance.c:1515:dht_migrate_file] 0-vol1-dht: /xfile-198: attempting to move from vol1-client-2 to vol1-client-1
[2017-07-10 04:45:24.770386] W [MSGID: 0] [dht-rebalance.c:926:__dht_check_free_space] 0-vol1-dht: Write will cross min-free-disk for file - /xfile-198 on subvol - vol1-client-1. Looking for new subvol
[2017-07-10 04:45:24.770428] I [MSGID: 0] [dht-rebalance.c:985:__dht_check_free_space] 0-vol1-dht: new target found - vol1-client-0 for file - /xfile-198
[2017-07-10 04:45:24.778057] W [MSGID: 114031] [client-rpc-fops.c:2004:client3_3_fallocate_cbk] 0-vol1-client-0: remote operation failed [No space left on device]
[2017-07-10 04:45:24.778094] E [MSGID: 109023] [dht-rebalance.c:789:__dht_rebalance_create_dst_file] 0-vol1-dht: fallocate failed for /xfile-198 on vol1-client-0 (No space left on device)
[2017-07-10 04:45:24.778122] E [dht-rebalance.c:1670:dht_migrate_file] 0-vol1-dht: Create dst failed on - vol1-client-0 for file - /xfile-198

These are the last messages logged.

From the dht_migrate_file code:

dht_migrate_file ()
{
        ...
        /* create the destination, with required modes/xattr */
        ret = __dht_rebalance_create_dst_file (this, to, from, loc, &stbuf,
                                               &dst_fd, xattr, fop_errno);
        if (ret) {
                gf_msg (this->name, GF_LOG_ERROR, 0, 0, "Create dst failed"
                        " on - %s for file - %s", to->name, loc->path);
                goto out;
        }

        clean_dst = _gf_true;    <-- the dst file will be cleaned up

        ret = __dht_check_free_space (this, to, from, loc, &stbuf, flag, conf,
                                      &target_changed, &new_target,
                                      &ignore_failure, fop_errno);

        if (target_changed) {
                /* Can't handle for hardlinks. Marking this as failure */
                if (flag == GF_DHT_MIGRATE_HARDLINK_IN_PROGRESS ||
                    stbuf.ia_nlink > 1) {
                        gf_msg (this->name, GF_LOG_ERROR, 0,
                                DHT_MSG_SUBVOL_INSUFF_SPACE,
                                "Exiting migration for"
                                " file - %s. flag - %d, stbuf.ia_nlink - %d",
                                loc->path, flag, stbuf.ia_nlink);
                        ret = -1;
                        goto out;
                }

                ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
                if (ret) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "%s: failed to perform truncate on %s (%s)",
                                loc->path, to->name, strerror (-ret));
                        ret = -1;
                }

                syncop_close (dst_fd);   <-- this is now an invalid fd for the dst cleanup

                old_target = to;
                to = new_target;

                /* if the file migration is successful to this new target, then
                 * update the xattr on the old destination to point the new
                 * destination. We need to do update this only post migration
                 * as in case of failure the linkto needs to point to the source
                 * subvol */
                ret = __dht_rebalance_create_dst_file (this, to, from, loc,
                                                       &stbuf, &dst_fd, xattr,
                                                       fop_errno);
                if (ret) {
                        gf_log (this->name, GF_LOG_ERROR, "Create dst failed"
                                " on - %s for file - %s", to->name, loc->path);
                        goto out;   <<< If this fails here, clean_dst is still set to true but the fd is invalid, causing the hang on fd->lock >>>
                } else {
                        gf_msg (this->name, GF_LOG_INFO, 0, 0,
                                "destination for file "
                                "- %s is changed to - %s",
                                loc->path, to->name);
                }
        }
        ...
--- Additional comment from Worker Ant on 2017-07-10 06:17:54 EDT ---

REVIEW: https://review.gluster.org/17735 (cluster/dht: Clear clean_dst flag on target change) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-07-11 01:06:17 EDT ---

COMMIT: https://review.gluster.org/17735 committed in master by Raghavendra G (rgowdapp)

------

commit bd71ca4fdf2554dd22c0db70af132a11b966ef38
Author: N Balachandran <nbalacha>
Date:   Mon Jul 10 15:45:04 2017 +0530

    cluster/dht: Clear clean_dst flag on target change

    If the target of a file migration was changed because of
    min-free-disk limits, the dst_fd was closed but the clean_dst
    flag was not set to false. If the file could not be created on
    the new target for some reason, the ftruncate call to clean up
    the dst was sent on the now invalid fd, causing the process to
    deadlock.

    Change-Id: I5bfa80f519b04567413d84229cf62d143c6e2f04
    BUG: 1469029
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17735
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Amar Tumballi <amarts>
    Reviewed-by: Raghavendra G <rgowdapp>
REVIEW: https://review.gluster.org/17745 (cluster/dht: Clear clean_dst flag on target change) posted (#1) for review on release-3.11 by N Balachandran (nbalacha)
COMMIT: https://review.gluster.org/17745 committed in release-3.11 by Shyamsundar Ranganathan (srangana)

------

commit 59d29cfc4300251b8bc92e41e5bc29790d8f2db6
Author: N Balachandran <nbalacha>
Date:   Mon Jul 10 15:45:04 2017 +0530

    cluster/dht: Clear clean_dst flag on target change

    If the target of a file migration was changed because of
    min-free-disk limits, the dst_fd was closed but the clean_dst
    flag was not set to false. If the file could not be created on
    the new target for some reason, the ftruncate call to clean up
    the dst was sent on the now invalid fd, causing the process to
    deadlock.

    > BUG: 1469029
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: https://review.gluster.org/17735
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Amar Tumballi <amarts>
    > Reviewed-by: Raghavendra G <rgowdapp>

    (cherry picked from commit bd71ca4fdf2554dd22c0db70af132a11b966ef38)
    Change-Id: I5bfa80f519b04567413d84229cf62d143c6e2f04
    BUG: 1469459
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17745
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Susant Palai <spalai>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.11.2, please open a new bug report.

glusterfs-3.11.2 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-July/031908.html
[2] https://www.gluster.org/pipermail/gluster-users/