Bug 1469459

Summary: Rebalance hangs on remove-brick if the target volume changes
Product: [Community] GlusterFS
Reporter: Nithya Balachandran <nbalacha>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.11
CC: bugs
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.11.2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1469029
Environment:
Last Closed: 2017-08-12 13:07:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1469029    
Bug Blocks: 1469041    

Description Nithya Balachandran 2017-07-11 10:06:11 UTC
+++ This bug was initially created as a clone of Bug #1469029 +++

Description of problem:
The rebalance process hangs on a remove-brick operation if the original hashed subvol fails the min-free-disk check and a different target is selected.

Version-Release number of selected component (if applicable):


How reproducible:
Consistently

Steps to Reproduce:
1. Create a 1x3 (three-brick) distribute volume with 1 GB bricks.
2. Create enough 5 MB files on the volume that no two bricks can accommodate all of them.
3. Run remove-brick to remove one of the bricks.

Actual results:

The rebalance hangs.

Expected results:
The rebalance process should terminate once all files are processed.

Additional info:

--- Additional comment from Nithya Balachandran on 2017-07-10 06:13:12 EDT ---

RCA:


From gdb:


Thread 13 (Thread 0x7f7740823700 (LWP 27258)):
#0  0x00007f7748006bdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f77491bd8cb in gf_timer_proc (data=0xd0af80) at timer.c:176
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7740022700 (LWP 27259)):
#0  0x00007f7748007101 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000409e72 in glusterfs_sigwaiter (arg=0x7fff5bfe2da0) at glusterfsd.c:2069
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f773f821700 (LWP 27260)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f77491dff7b in pool_sweeper (arg=0x0) at mem-pool.c:464
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f773f020700 (LWP 27261)):
#0  0x00007f7748003a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f77491f54ab in syncenv_task (proc=0xd0b7d0) at syncop.c:603
#2  0x00007f77491f5746 in syncenv_processor (thdata=0xd0b7d0) at syncop.c:695
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f773e81f700 (LWP 27262)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f773b66410f in gf_defrag_start_crawl (data=0x7f773400dfc0) at dht-rebalance.c:4479
#2  0x00007f77491f4c7a in synctask_wrap (old_task=0x7f7724001400) at syncop.c:375
#3  0x00007f7747893cf0 in ?? () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f773c357700 (LWP 27263)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0xd49290) at event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f773ab14700 (LWP 27264)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0x7f773401c720) at event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f77395fa700 (LWP 27266)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f773b6632af in dht_file_counter_thread (args=0x7f773401b430) at dht-rebalance.c:4158
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7738df9700 (LWP 27267)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f7723fff700 (LWP 27268)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, 
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f77237fe700 (LWP 27269)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f771affd700 (LWP 27270)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f774969a780 (LWP 27257)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f774921b446 in event_dispatch_epoll (event_pool=0xd01f70) at event-epoll.c:732
#2  0x00007f77491de754 in event_dispatch (event_pool=0xd01f70) at event.c:124
#3  0x000000000040ab6a in main (argc=31, argv=0x7fff5bfe3ff8) at glusterfsd.c:2479


Four threads (2, 3, 4 and 5) are hung.

(gdb) t 2
[Switching to thread 2 (Thread 0x7f771affd700 (LWP 27270))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
303	        pthread_mutex_lock (&conf->lock);



(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
984	        LOCK (&fd->lock);



(gdb) t 4
[Switching to thread 4 (Thread 0x7f7723fff700 (LWP 27268))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, 
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
303	        pthread_mutex_lock (&conf->lock);



(gdb) t 5
[Switching to thread 5 (Thread 0x7f7738df9700 (LWP 27267))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
303	        pthread_mutex_lock (&conf->lock);



Threads 2, 4 and 5 are waiting on conf->lock, which is held by thread 3. Thread 3 is itself waiting on fd->lock, yet no other thread appears to hold fd->lock.
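
The shape of the hang can be reproduced outside GlusterFS. The standalone program below is only an illustration (the names conf_lock/fd_lock mirror the locks in the backtraces; none of this is GlusterFS source, and the "stale fd" is simulated by a mutex that is locked and never released): one worker takes the connection-wide lock and then blocks on the per-fd lock, and every other worker queues up behind the connection lock. Build with gcc -pthread; it hangs by design.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* conf_lock stands in for conf->lock (per protocol/client connection),
 * fd_lock for fd->lock of the stale fd that thread 3 is stuck on. */
static pthread_mutex_t conf_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t fd_lock   = PTHREAD_MUTEX_INITIALIZER;

/* mirrors thread 3: client_get_remote_fd() takes conf->lock, then
 * fd_ctx_get() tries fd->lock and never returns */
static void *
ftruncate_worker (void *arg)
{
        (void) arg;
        pthread_mutex_lock (&conf_lock);
        pthread_mutex_lock (&fd_lock);          /* blocks forever while holding conf_lock */
        pthread_mutex_unlock (&fd_lock);
        pthread_mutex_unlock (&conf_lock);
        return NULL;
}

/* mirrors threads 2, 4 and 5: they only need conf->lock but pile up
 * behind the thread above */
static void *
writev_worker (void *arg)
{
        (void) arg;
        pthread_mutex_lock (&conf_lock);
        pthread_mutex_unlock (&conf_lock);
        return NULL;
}

int
main (void)
{
        pthread_t t[4];
        int       i;

        /* simulate the stale fd: its lock is in a state nobody will release */
        pthread_mutex_lock (&fd_lock);

        pthread_create (&t[0], NULL, ftruncate_worker, NULL);
        sleep (1);                              /* let it grab conf_lock first */
        for (i = 1; i < 4; i++)
                pthread_create (&t[i], NULL, writev_worker, NULL);

        printf ("all workers started; the joins below never complete\n");
        for (i = 0; i < 4; i++)
                pthread_join (t[i], NULL);      /* same hang shape as the rebalance */

        return 0;
}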



From thread 3:


(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 10 
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
2200	                lk_ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
(gdb) p *loc
$17 = {path = 0x7f771c000da0 "/xfile-198", name = 0x7f771c000da1 "xfile-198", inode = 0x7f771c001050, parent = 0x7f7724003520, 
  gfid = "\261\020w\327\370\251I\205\242,\202\031Z\322#A", pargfid = '\000' <repeats 15 times>, "\001"}



From the rebalance log:

2990 [2017-07-10 04:45:24.741931] I [dht-rebalance.c:1515:dht_migrate_file] 0-vol1-dht: /xfile-198: attempting to move from vol1-client-2 to vol1-client-1
2991 [2017-07-10 04:45:24.770386] W [MSGID: 0] [dht-rebalance.c:926:__dht_check_free_space] 0-vol1-dht: Write will cross min-free-disk for file - /xfile-198 on subvol - vol1-client-1. Looking for new subvol
2992 [2017-07-10 04:45:24.770428] I [MSGID: 0] [dht-rebalance.c:985:__dht_check_free_space] 0-vol1-dht: new target found - vol1-client-0 for file - /xfile-198
2993 [2017-07-10 04:45:24.778057] W [MSGID: 114031] [client-rpc-fops.c:2004:client3_3_fallocate_cbk] 0-vol1-client-0: remote operation failed [No space left on device]
2994 [2017-07-10 04:45:24.778094] E [MSGID: 109023] [dht-rebalance.c:789:__dht_rebalance_create_dst_file] 0-vol1-dht: fallocate failed for /xfile-198 on vol1-client-0 (No space left on device)
2995 [2017-07-10 04:45:24.778122] E [dht-rebalance.c:1670:dht_migrate_file] 0-vol1-dht: Create dst failed on - vol1-client-0 for file - /xfile-198


These are the last messages logged.

From the dht_migrate_file code:


dht_migrate_file () {

...
                                                                               
        /* create the destination, with required modes/xattr */                 
        ret = __dht_rebalance_create_dst_file (this, to, from, loc, &stbuf,     
                                               &dst_fd, xattr, fop_errno);      
        if (ret) {                                                              
                gf_msg (this->name, GF_LOG_ERROR, 0, 0, "Create dst failed"     
                        " on - %s for file - %s", to->name, loc->path);         
                goto out;                                                       
        }                                                                       
                                                                                
        clean_dst = _gf_true;   <-- the dst file will be cleaned up                                                
                                                                                
        ret = __dht_check_free_space (this, to, from, loc, &stbuf, flag, conf,  
                                      &target_changed, &new_target, &ignore_failure, fop_errno);
        if (target_changed) {                                                   
                /* Can't handle for hardlinks. Marking this as failure */           
                if (flag == GF_DHT_MIGRATE_HARDLINK_IN_PROGRESS || stbuf.ia_nlink > 1) { 
                        gf_msg (this->name, GF_LOG_ERROR, 0,                    
                                DHT_MSG_SUBVOL_INSUFF_SPACE, "Exiting migration for"
                                " file - %s. flag - %d, stbuf.ia_nlink - %d",   
                               loc->path,  flag, stbuf.ia_nlink);               
                        ret = -1;                                               
                        goto out;                                               
                }                                                               
                                                                                
                                                                                
                ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);             
                if (ret) {                                                      
                        gf_log (this->name, GF_LOG_WARNING,                     
                                "%s: failed to perform truncate on %s (%s)",    
                                loc->path, to->name, strerror (-ret));          
                        ret = -1;                                               
                }                                                               
                                                                                
                syncop_close (dst_fd);      <-- this is now an invalid fd for the dst cleanup                                    
                                                                                
                old_target = to;                                                
                to = new_target;                                                

                /* if the file migration is successful to this new target, then 
                 * update the xattr on the old destination to point the new         
                 * destination. We need to do update this only post migration   
                 * as in case of failure the linkto needs to point to the source
                 * subvol */                                                    
                ret = __dht_rebalance_create_dst_file (this, to, from, loc, 
                                                       &stbuf,
                                                       &dst_fd, xattr, 
                                                       fop_errno);
                if (ret) {                                                      
                        gf_log (this->name, GF_LOG_ERROR, "Create dst failed"   
                                " on - %s for file - %s", to->name, loc->path); 
                        goto out;                      

<<< If this second create fails, clean_dst is still true but dst_fd was already closed above; the cleanup at "out" then calls syncop_ftruncate on that stale fd and hangs on fd->lock >>>

                } else {                                                        
                        gf_msg (this->name, GF_LOG_INFO, 0, 0, "destination for file "
                                "- %s is changed to - %s", loc->path, to->name);
                }                                                               
        }          
...
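
Resetting the flag at the point where the old dst_fd is closed avoids the stale-fd cleanup. The following is a minimal sketch of the corrected section, reconstructed from the commit message quoted below rather than copied from the actual patch; re-arming clean_dst after the create on the new target succeeds is an assumption added here so that later failures still clean up the new destination:

                ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
                if (ret) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "%s: failed to perform truncate on %s (%s)",
                                loc->path, to->name, strerror (-ret));
                        ret = -1;
                }

                syncop_close (dst_fd);
                /* the old dst_fd is gone; clear the flag so the cleanup at
                 * "out" does not issue an ftruncate on a stale fd */
                clean_dst = _gf_false;

                old_target = to;
                to = new_target;

                ret = __dht_rebalance_create_dst_file (this, to, from, loc,
                                                       &stbuf, &dst_fd, xattr,
                                                       fop_errno);
                if (ret) {
                        gf_log (this->name, GF_LOG_ERROR, "Create dst failed"
                                " on - %s for file - %s", to->name, loc->path);
                        goto out;
                } else {
                        gf_msg (this->name, GF_LOG_INFO, 0, 0, "destination for"
                                " file - %s is changed to - %s", loc->path,
                                to->name);
                        /* assumption: re-arm cleanup now that a valid dst_fd
                         * exists on the new target */
                        clean_dst = _gf_true;
                }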

--- Additional comment from Worker Ant on 2017-07-10 06:17:54 EDT ---

REVIEW: https://review.gluster.org/17735 (cluster/dht: Clear clean_dst flag on target change) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2017-07-11 01:06:17 EDT ---

COMMIT: https://review.gluster.org/17735 committed in master by Raghavendra G (rgowdapp) 
------
commit bd71ca4fdf2554dd22c0db70af132a11b966ef38
Author: N Balachandran <nbalacha>
Date:   Mon Jul 10 15:45:04 2017 +0530

    cluster/dht: Clear clean_dst flag on target change
    
    If the target of a file migration was changed because
    of min-free-disk limits, the dst_fd was closed but the
    clean_dst flag was not set to false. If the file could
    not be created on the new target for some reason, the
    ftruncate call to clean up the dst was sent on the now
    invalid fd causing the process to deadlock.
    
    Change-Id: I5bfa80f519b04567413d84229cf62d143c6e2f04
    BUG: 1469029
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17735
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Amar Tumballi <amarts>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 1 Worker Ant 2017-07-11 10:12:39 UTC
REVIEW: https://review.gluster.org/17745 (cluster/dht: Clear clean_dst flag on target change) posted (#1) for review on release-3.11 by N Balachandran (nbalacha)

Comment 2 Worker Ant 2017-07-19 11:25:48 UTC
COMMIT: https://review.gluster.org/17745 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit 59d29cfc4300251b8bc92e41e5bc29790d8f2db6
Author: N Balachandran <nbalacha>
Date:   Mon Jul 10 15:45:04 2017 +0530

    cluster/dht: Clear clean_dst flag on target change
    
    If the target of a file migration was changed because
    of min-free-disk limits, the dst_fd was closed but the
    clean_dst flag was not set to false. If the file could
    not be created on the new target for some reason, the
    ftruncate call to clean up the dst was sent on the now
    invalid fd causing the process to deadlock.
    
    > BUG: 1469029
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: https://review.gluster.org/17735
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Amar Tumballi <amarts>
    > Reviewed-by: Raghavendra G <rgowdapp>
    
    (cherry picked from commit bd71ca4fdf2554dd22c0db70af132a11b966ef38)
    Change-Id: I5bfa80f519b04567413d84229cf62d143c6e2f04
    BUG: 1469459
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17745
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Susant Palai <spalai>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

Comment 3 Shyamsundar 2017-08-12 13:07:33 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.2, please open a new bug report.

glusterfs-3.11.2 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-July/031908.html
[2] https://www.gluster.org/pipermail/gluster-users/