Bug 1469041 - Rebalance hangs on remove-brick if the target volume changes
Rebalance hangs on remove-brick if the target volume changes
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: distribute (Show other bugs)
3.3
Unspecified Unspecified
unspecified Severity unspecified
: ---
: RHGS 3.3.0
Assigned To: Nithya Balachandran
Prasad Desala
:
Depends On: 1469029 1469459
Blocks: 1417151
  Show dependency treegraph
 
Reported: 2017-07-10 06:25 EDT by Nithya Balachandran
Modified: 2017-09-21 01:02 EDT (History)
6 users (show)

See Also:
Fixed In Version: glusterfs-3.8.4-34
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1469029
Environment:
Last Closed: 2017-09-21 01:02:13 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Nithya Balachandran 2017-07-10 06:25:30 EDT
+++ This bug was initially created as a clone of Bug #1469029 +++

Description of problem:
The rebalance process hangs on a remove-brick operation if the original hashed subvol fails the min-free-disk check and a different target is selected.

Version-Release number of selected component (if applicable):


How reproducible:
Consistently

Steps to Reproduce:
1. Create a 1x3 distribute volume with 1 GB bricks
2. Create enough 5MB files on the volume such that no 2 bricks can accommodate all the files.
3. Run a remove-brick to remove one of the bricks

Actual results:

The rebalance hangs.

Expected results:
The rebalance process should terminate once all files are processed.

Additional info:

--- Additional comment from Nithya Balachandran on 2017-07-10 06:13:12 EDT ---

RCA:


From gdb:


Thread 13 (Thread 0x7f7740823700 (LWP 27258)):
#0  0x00007f7748006bdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f77491bd8cb in gf_timer_proc (data=0xd0af80) at timer.c:176
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f7740022700 (LWP 27259)):
#0  0x00007f7748007101 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000409e72 in glusterfs_sigwaiter (arg=0x7fff5bfe2da0) at glusterfsd.c:2069
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f773f821700 (LWP 27260)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f77491dff7b in pool_sweeper (arg=0x0) at mem-pool.c:464
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f773f020700 (LWP 27261)):
#0  0x00007f7748003a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f77491f54ab in syncenv_task (proc=0xd0b7d0) at syncop.c:603
#2  0x00007f77491f5746 in syncenv_processor (thdata=0xd0b7d0) at syncop.c:695
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f773e81f700 (LWP 27262)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f773b66410f in gf_defrag_start_crawl (data=0x7f773400dfc0) at dht-rebalance.c:4479
#2  0x00007f77491f4c7a in synctask_wrap (old_task=0x7f7724001400) at syncop.c:375
#3  0x00007f7747893cf0 in ?? () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 8 (Thread 0x7f773c357700 (LWP 27263)):
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0xd49290) at event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f773ab14700 (LWP 27264)):
---Type <return> to continue, or q <return> to quit---
#0  0x00007f7747944d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f774921b1ef in event_dispatch_epoll_worker (data=0x7f773401c720) at event-epoll.c:638
#2  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f77395fa700 (LWP 27266)):
#0  0x00007f774790b66d in nanosleep () from /lib64/libc.so.6
#1  0x00007f774790b504 in sleep () from /lib64/libc.so.6
#2  0x00007f773b6632af in dht_file_counter_thread (args=0x7f773401b430) at dht-rebalance.c:4158
#3  0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f7738df9700 (LWP 27267)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f7723fff700 (LWP 27268)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, 
---Type <return> to continue, or q <return> to quit---
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f77237fe700 (LWP 27269)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f771affd700 (LWP 27270)):
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c)
---Type <return> to continue, or q <return> to quit---
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f774969a780 (LWP 27257)):
#0  0x00007f7748000ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f774921b446 in event_dispatch_epoll (event_pool=0xd01f70) at event-epoll.c:732
#2  0x00007f77491de754 in event_dispatch (event_pool=0xd01f70) at event.c:124
#3  0x000000000040ab6a in main (argc=31, argv=0x7fff5bfe3ff8) at glusterfsd.c:2479


There are 4 threads (2,3,4,5) which are hung.

(gdb) t 2
[Switching to thread 2 (Thread 0x7f771affd700 (LWP 27270))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f771affae30, fd=0x7f770c001050, size=1048576, offset=4194304, flags=0, xdata=0x7f771affb008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f770c002c80, this=0x7f77340089c0, data=0x7f771affaf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f770c002c80, this=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, off=4194304, flags=0, 
    iobref=0x7f773000e800, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f770c001050, vector=0x7f77300022a0, count=1, offset=4194304, iobref=0x7f773000e800, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f770c005820, dst=0x7f770c001050, ia_size=5242880, hole_exists=0, 
    fop_errno=0x7f771affcd1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f771affcda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f771affcd1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100225d0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f770c001050, flags=1, remote_fd=0x7f771affadb8) at client-helpers.c:303
303	        pthread_mutex_lock (&conf->lock);



(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
984	        LOCK (&fd->lock);



(gdb) t 4
[Switching to thread 4 (Thread 0x7f7723fff700 (LWP 27268))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
#4  0x00007f773b940233 in client_pre_writev (this=0x7f77340089c0, req=0x7f7723ffce30, fd=0x7f77140032c0, size=1048576, offset=4194304, flags=0, xdata=0x7f7723ffd008)
    at client-common.c:375
#5  0x00007f773b92a71e in client3_3_writev (frame=0x7f7714005810, this=0x7f77340089c0, data=0x7f7723ffcf40) at client-rpc-fops.c:4361
#6  0x00007f773b90167d in client_writev (frame=0x7f7714005810, this=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, off=4194304, flags=0, 
    iobref=0x7f7730002600, xdata=0x0) at client.c:1036
#7  0x00007f77491fd73e in syncop_writev (subvol=0x7f77340089c0, fd=0x7f77140032c0, vector=0x7f7730009260, count=1, offset=4194304, iobref=0x7f7730002600, flags=0, 
    xdata_in=0x0, xdata_out=0x0) at syncop.c:1975
#8  0x00007f773b659f39 in __dht_rebalance_migrate_data (from=0x7f773400c4d0, to=0x7f77340089c0, src=0x7f7714001af0, dst=0x7f77140032c0, ia_size=5242880, hole_exists=0, 
    fop_errno=0x7f7723ffed1c) at dht-rebalance.c:1028
#9  0x00007f773b65c247 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7723ffeda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7723ffed1c)
    at dht-rebalance.c:1733
#10 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f77100220b0) at dht-rebalance.c:2645
#11 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#12 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f77140032c0, flags=1, remote_fd=0x7f7723ffcdb8) at client-helpers.c:303
303	        pthread_mutex_lock (&conf->lock);



(gdb) t 5
[Switching to thread 5 (Thread 0x7f7738df9700 (LWP 27267))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f7748001c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
#4  0x00007f773b940782 in client_pre_fsync (this=0x7f77340089c0, req=0x7f7738df6ef0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client-common.c:459
#5  0x00007f773b92b0b6 in client3_3_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, data=0x7f7738df6fe0) at client-rpc-fops.c:4472
#6  0x00007f773b901d16 in client_fsync (frame=0x7f77280053b0, this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, xdata=0x0) at client.c:1091
#7  0x00007f7749200c3e in syncop_fsync (subvol=0x7f77340089c0, fd=0x7f7728005a00, dataonly=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2319
#8  0x00007f773b65c2e3 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f7738df8da0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f7738df8d1c)
    at dht-rebalance.c:1750
#9  0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022340) at dht-rebalance.c:2645
#10 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#11 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007f773b90c765 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f7728005a00, flags=0, remote_fd=0x7f7738df6e78) at client-helpers.c:303
303	        pthread_mutex_lock (&conf->lock);



Threads 2, 4 and 5 are waiting on conf->lock which is held by thread 3.
Thread 3 is waiting on fd->lock. However, it does not look like any other thread is holding fd->lock.



From thread 3:


(gdb) t 3
[Switching to thread 3 (Thread 0x7f77237fe700 (LWP 27269))]
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f77480061bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7748001d1d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007f7748001c3a in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f77491dd5d6 in fd_ctx_get (fd=0x7f771c0056c0, xlator=0x7f77340089c0, value=0x7f77237fbd90) at fd.c:984
#4  0x00007f773b90bd81 in this_fd_get_ctx (file=0x7f771c0056c0, this=0x7f77340089c0) at client-helpers.c:73
#5  0x00007f773b90c778 in client_get_remote_fd (this=0x7f77340089c0, fd=0x7f771c0056c0, flags=0, remote_fd=0x7f77237fbe68) at client-helpers.c:305
#6  0x00007f773b941650 in client_pre_ftruncate (this=0x7f77340089c0, req=0x7f77237fbee0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client-common.c:683
#7  0x00007f773b9264a1 in client3_3_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, data=0x7f77237fbfe0) at client-rpc-fops.c:3606
#8  0x00007f773b8fe8ba in client_ftruncate (frame=0x7f771c005400, this=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata=0x0) at client.c:626
#9  0x00007f774920018e in syncop_ftruncate (subvol=0x7f77340089c0, fd=0x7f771c0056c0, offset=0, xdata_in=0x0, xdata_out=0x0) at syncop.c:2261
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
#11 0x00007f773b65ee64 in gf_defrag_migrate_single_file (opaque=0x7f7710022af0) at dht-rebalance.c:2645
#12 0x00007f773b65f620 in gf_defrag_task (opaque=0x7f773401b430) at dht-rebalance.c:2812
#13 0x00007f7747fffdc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f774794473d in clone () from /lib64/libc.so.6
(gdb) f 10 
#10 0x00007f773b65da30 in dht_migrate_file (this=0x7f773400dfc0, loc=0x7f77237fdda0, from=0x7f773400c4d0, to=0x7f77340089c0, flag=2, fop_errno=0x7f77237fdd1c)
    at dht-rebalance.c:2200
2200	                lk_ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);
(gdb) p *loc
$17 = {path = 0x7f771c000da0 "/xfile-198", name = 0x7f771c000da1 "xfile-198", inode = 0x7f771c001050, parent = 0x7f7724003520, 
  gfid = "\261\020w\327\370\251I\205\242,\202\031Z\322#A", pargfid = '\000' <repeats 15 times>, "\001"}



From the rebalance log:

2990 [2017-07-10 04:45:24.741931] I [dht-rebalance.c:1515:dht_migrate_file] 0-vol1-dht: /xfile-198: attempting to move from vol1-client-2 to vol1-client-1
2991 [2017-07-10 04:45:24.770386] W [MSGID: 0] [dht-rebalance.c:926:__dht_check_free_space] 0-vol1-dht: Write will cross min-free-disk for file - /xfile-198 on subvol - vol1     -client-1. Looking for new subvol
2992 [2017-07-10 04:45:24.770428] I [MSGID: 0] [dht-rebalance.c:985:__dht_check_free_space] 0-vol1-dht: new target found - vol1-client-0 for file - /xfile-198
2993 [2017-07-10 04:45:24.778057] W [MSGID: 114031] [client-rpc-fops.c:2004:client3_3_fallocate_cbk] 0-vol1-client-0: remote operation failed [No space left on device]
2994 [2017-07-10 04:45:24.778094] E [MSGID: 109023] [dht-rebalance.c:789:__dht_rebalance_create_dst_file] 0-vol1-dht: fallocate failed for /xfile-198 on vol1-client-0 (No sp     ace left on device)
2995 [2017-07-10 04:45:24.778122] E [dht-rebalance.c:1670:dht_migrate_file] 0-vol1-dht: Create dst failed on - vol1-client-0 for file - /xfile-198


These are the last messages logged.

From the dht_migrate_file code:


dht_migrate_file () {

...
                                                                               
        /* create the destination, with required modes/xattr */                 
        ret = __dht_rebalance_create_dst_file (this, to, from, loc, &stbuf,     
                                               &dst_fd, xattr, fop_errno);      
        if (ret) {                                                              
                gf_msg (this->name, GF_LOG_ERROR, 0, 0, "Create dst failed"     
                        " on - %s for file - %s", to->name, loc->path);         
                goto out;                                                       
        }                                                                       
                                                                                
        clean_dst = _gf_true;   <-- the dst file will be cleaned up                                                
                                                                                
        ret = __dht_check_free_space (this, to, from, loc, &stbuf, flag, conf,  
                                      &target_changed, &new_target, &ignore_failure, fop_errno);
        if (target_changed) {                                                   
                /* Can't handle for hardlinks. Marking this as failure */           
                if (flag == GF_DHT_MIGRATE_HARDLINK_IN_PROGRESS || stbuf.ia_nlink > 1) { 
                        gf_msg (this->name, GF_LOG_ERROR, 0,                    
                                DHT_MSG_SUBVOL_INSUFF_SPACE, "Exiting migration for"
                                " file - %s. flag - %d, stbuf.ia_nlink - %d",   
                               loc->path,  flag, stbuf.ia_nlink);               
                        ret = -1;                                               
                        goto out;                                               
                }                                                               
                                                                                
                                                                                
                ret = syncop_ftruncate (to, dst_fd, 0, NULL, NULL);             
                if (ret) {                                                      
                        gf_log (this->name, GF_LOG_WARNING,                     
                                "%s: failed to perform truncate on %s (%s)",    
                                loc->path, to->name, strerror (-ret));          
                        ret = -1;                                               
                }                                                               
                                                                                
                syncop_close (dst_fd);      <-- this is now an invalid fd for the dst cleanup                                    
                                                                                
                old_target = to;                                                
                to = new_target;                                                

                /* if the file migration is successful to this new target, then 
                 * update the xattr on the old destination to point the new         
                 * destination. We need to do update this only post migration   
                 * as in case of failure the linkto needs to point to the source
                 * subvol */                                                    
                ret = __dht_rebalance_create_dst_file (this, to, from, loc, 
                                                       &stbuf,
                                                       &dst_fd, xattr, 
                                                       fop_errno);
                if (ret) {                                                      
                        gf_log (this->name, GF_LOG_ERROR, "Create dst failed"   
                                " on - %s for file - %s", to->name, loc->path); 
                        goto out;                      

<<< If this fails here, clean_dst is set to true but the fd is invalid, causing the hang on fd->lock>>>                         

                } else {                                                        
                        gf_msg (this->name, GF_LOG_INFO, 0, 0, "destination for file "
                                "- %s is changed to - %s", loc->path, to->name);
                }                                                               
        }          
...

--- Additional comment from Worker Ant on 2017-07-10 06:17:54 EDT ---

REVIEW: https://review.gluster.org/17735 (cluster/dht: Clear clean_dst flag on target change) posted (#1) for review on master by N Balachandran (nbalacha@redhat.com)
Comment 7 Atin Mukherjee 2017-07-13 01:25:05 EDT
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/111944/
Comment 9 Prasad Desala 2017-07-31 08:01:26 EDT
On glusterfs version 3.8.4-35.el7rhgs.x86_64, ran the steps in the description and didn't see any rebalance hangs during remove-brick operation. 

Hence, moving this BZ to Verified.
Comment 11 errata-xmlrpc 2017-09-21 01:02:13 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

Note You need to log in before you can comment on or make changes to this bug.