Description of problem: After adding a node to a 2-node setup, I performed a `gluster volume rebalance live fix-layout start`, and after some time, the fix-layout fails with the following error. How reproducible: We tried 3-4 times and it happened everytime. Additional info: Log entries from the 3 nodes: stor106: [2015-08-13 15:49:26.557880] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2925913100' stor106: [2015-08-13 15:49:26.557908] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '97952e77-bf0a-4227-a529-2d1ae787ec3e' stor106: [2015-08-13 15:49:26.557924] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4' stor106: [2015-08-13 15:49:26.557939] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on' stor106: [2015-08-13 15:49:26.557955] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes' stor106: [2015-08-13 15:49:26.557969] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor106: [2015-08-13 15:49:26.557983] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes' stor106: [2015-08-13 21:36:05.535730] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-4: server 192.168.123.106:49156 has not responded in the last 42 seconds, disconnecting. stor106: [2015-08-13 21:36:05.536519] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f69b957b9e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f69b93469be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69b9346ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f69b934847c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f69b9348c38] ))))) 0-live-client-4: forced unwinding frame type(GlusterFS 3.3) op(READDIRP(40)) called at 2015-08-13 21:35:06.999805 (xid=0x245a8f) stor106: [2015-08-13 21:36:05.536943] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f69b957b9e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f69b93469be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69b9346ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f69b934847c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f69b9348c38] ))))) 0-live-client-4: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-13 21:35:23.532921 (xid=0x245a90) stor106: [2015-08-13 21:36:05.864351] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases/marker_dil/rnpB/0.3-B1-thermo stor106: [2015-08-13 21:36:05.864594] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases/marker_dil/rnpB stor106: [2015-08-13 21:36:05.864783] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases/marker_dil stor106: [2015-08-13 21:36:05.865190] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases stor106: [2015-08-13 21:36:05.865351] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology stor106: [2015-08-13 21:36:05.865674] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia stor106: [2015-08-13 21:36:05.865854] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb stor106: [2015-08-14 09:38:17.498230] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=97952e77-bf0a-4227-a529-2d1ae787ec3e --xlator-option *dht.commit-hash=2926426186 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/97952e77-bf0a-4227-a529-2d1ae787ec3e.pid -l /var/log/glusterfs/live-rebalance.log) stor106: [2015-08-14 09:38:17.519827] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 stor106: [2015-08-14 09:38:22.493300] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926426186' stor106: [2015-08-14 09:38:22.493331] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '97952e77-bf0a-4227-a529-2d1ae787ec3e' stor106: [2015-08-14 09:38:22.493350] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4' stor106: [2015-08-14 09:38:22.493367] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on' stor106: [2015-08-14 09:38:22.493384] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes' stor106: [2015-08-14 09:38:22.493400] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor106: [2015-08-14 09:38:22.493415] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes' stor106: [2015-08-14 09:42:47.621544] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields/Meas_01(2013-12-04_20-16-03) stor106: [2015-08-14 09:42:47.621710] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields stor106: [2015-08-14 09:42:47.622220] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol stor106: [2015-08-14 09:42:47.622341] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs stor106: [2015-08-14 09:42:47.622457] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs stor106: [2015-08-14 12:11:27.121727] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=97952e77-bf0a-4227-a529-2d1ae787ec3e --xlator-option *dht.commit-hash=2926499706 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/97952e77-bf0a-4227-a529-2d1ae787ec3e.pid -l /var/log/glusterfs/live-rebalance.log) stor106: [2015-08-14 12:11:27.143975] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 stor106: [2015-08-14 12:11:32.122635] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926499706' stor106: [2015-08-14 12:11:32.129824] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '97952e77-bf0a-4227-a529-2d1ae787ec3e' stor106: [2015-08-14 12:11:32.129868] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4' stor106: [2015-08-14 12:11:32.129889] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on' stor106: [2015-08-14 12:11:32.129909] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes' stor106: [2015-08-14 12:11:32.129926] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor106: [2015-08-14 12:11:32.129942] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes' stor104: [2015-08-13 15:49:26.561286] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor104: [2015-08-13 15:49:26.561301] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes' stor104: [2015-08-13 20:22:32.106903] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-5: server 192.168.123.106:49157 has not responded in the last 42 seconds, disconnecting. stor104: [2015-08-13 20:22:32.107488] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fede0b249e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fede08ef9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fede08eface] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fede08f147c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fede08f1c38] ))))) 0-live-client-5: forced unwinding frame type(GlusterFS 3.3) op(READDIRP(40)) called at 2015-08-13 20:21:19.267201 (xid=0x1a21ae) stor104: [2015-08-13 20:22:32.108236] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fede0b249e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fede08ef9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fede08eface] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fede08f147c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fede08f1c38] ))))) 0-live-client-5: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-13 20:21:50.102656 (xid=0x1a21af) stor104: [2015-08-13 20:22:32.111687] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2/motu.profiles/mOTU.v1.padded/reads.processed.solexaqa.solexaqa.allbest.l45.p97 stor104: [2015-08-13 20:22:32.112024] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2/motu.profiles/mOTU.v1.padded stor104: [2015-08-13 20:22:32.112365] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2/motu.profiles stor104: [2015-08-13 20:22:32.112735] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2 stor104: [2015-08-13 20:22:32.113071] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman stor104: [2015-08-13 20:22:32.113405] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs stor104: [2015-08-13 20:22:32.113764] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis stor104: [2015-08-13 20:22:32.114065] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG stor104: [2015-08-13 20:22:32.114386] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn stor104: [2015-08-13 20:22:32.114731] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology stor104: [2015-08-13 20:22:32.115051] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia stor104: [2015-08-13 20:22:32.115433] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb stor104: [2015-08-14 09:38:17.500715] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=305c0f00-0f11-4da3-a470-50b6e6c14976 --xlator-option *dht.commit-hash=2926426186 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/305c0f00-0f11-4da3-a470-50b6e6c14976.pid -l /var/log/glusterfs/live-rebalance.log) stor104: [2015-08-14 09:38:17.514523] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 stor104: [2015-08-14 09:38:22.498426] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926426186' stor104: [2015-08-14 09:38:22.498446] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '305c0f00-0f11-4da3-a470-50b6e6c14976' stor104: [2015-08-14 09:38:22.498456] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4' stor104: [2015-08-14 09:38:22.498465] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on' stor104: [2015-08-14 09:38:22.498475] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes' stor104: [2015-08-14 09:38:22.498483] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor104: [2015-08-14 09:38:22.498492] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes' stor104: [2015-08-14 09:42:47.642919] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields/Meas_01(2013-12-04_20-16-03) stor104: [2015-08-14 09:42:47.643189] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields stor104: [2015-08-14 09:42:47.644244] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol stor104: [2015-08-14 09:42:47.644404] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs stor104: [2015-08-14 09:42:47.644549] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs stor104: [2015-08-14 12:11:27.130086] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=305c0f00-0f11-4da3-a470-50b6e6c14976 --xlator-option *dht.commit-hash=2926499706 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/305c0f00-0f11-4da3-a470-50b6e6c14976.pid -l /var/log/glusterfs/live-rebalance.log) stor104: [2015-08-14 12:11:27.152078] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 stor104: [2015-08-14 12:11:32.145533] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926499706' stor104: [2015-08-14 12:11:32.145593] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '305c0f00-0f11-4da3-a470-50b6e6c14976' stor104: [2015-08-14 12:11:32.145612] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4' stor104: [2015-08-14 12:11:32.145629] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on' stor104: [2015-08-14 12:11:32.145646] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes' stor104: [2015-08-14 12:11:32.145661] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor104: [2015-08-14 12:11:32.145675] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes' stor105: [2015-08-14 01:18:06.680271] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-5: server 192.168.123.106:49157 has not responded in the last 42 seconds, disconnecting. stor105: [2015-08-14 01:18:06.680939] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f8549c409e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8549a0b9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8549a0bace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f8549a0d47c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f8549a0dc38] ))))) 0-live-client-5: forced unwinding frame type(GlusterFS 3.3) op(READDIRP(40)) called at 2015-08-14 01:17:17.396967 (xid=0x3b966a) stor105: [2015-08-14 01:18:06.681340] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f8549c409e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8549a0b9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8549a0bace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f8549a0d47c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f8549a0dc38] ))))) 0-live-client-5: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-14 01:17:24.676542 (xid=0x3b966b) stor105: [2015-08-14 01:18:06.684419] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology/local_tools/specI_v1.0/sequences stor105: [2015-08-14 01:18:06.684710] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology/local_tools/specI_v1.0 stor105: [2015-08-14 01:18:06.685996] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology/local_tools stor105: [2015-08-14 01:18:06.686349] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology stor105: [2015-08-14 01:18:06.686602] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects stor105: [2015-08-14 01:18:06.686946] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work stor105: [2015-08-14 01:18:06.687147] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu: stor105: [2015-08-14 01:18:06.687372] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32 stor105: [2015-08-14 01:18:06.687619] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10 stor105: [2015-08-14 01:18:06.687848] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11 stor105: [2015-08-14 01:18:06.688077] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014 stor105: [2015-08-14 01:18:06.688290] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia stor105: [2015-08-14 01:18:06.688505] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup stor105: [2015-08-14 01:18:06.688752] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb stor105: [2015-08-14 09:38:17.485415] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=b80cca53-bfd7-442a-a0a2-c846c978ea30 --xlator-option *dht.commit-hash=2926426186 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/b80cca53-bfd7-442a-a0a2-c846c978ea30.pid -l /var/log/glusterfs/live-rebalance.log) stor105: [2015-08-14 09:38:17.507689] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 stor105: [2015-08-14 09:38:22.480316] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926426186' stor105: [2015-08-14 09:38:22.480333] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value 'b80cca53-bfd7-442a-a0a2-c846c978ea30' stor105: [2015-08-14 09:38:22.480342] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4' stor105: [2015-08-14 09:38:22.480351] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on' stor105: [2015-08-14 09:38:22.480360] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes' stor105: [2015-08-14 09:38:22.480368] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor105: [2015-08-14 09:38:22.480375] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes' stor105: [2015-08-14 09:42:47.631749] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/LB 20130307_MitoSox_exp_conc_incub_24h/Meas_01(2013-03-07_14-07-12) stor105: [2015-08-14 09:42:47.632020] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/LB 20130307_MitoSox_exp_conc_incub_24h stor105: [2015-08-14 09:42:47.633021] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol stor105: [2015-08-14 09:42:47.633226] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs stor105: [2015-08-14 09:42:47.633439] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs stor105: [2015-08-14 12:11:27.085008] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=b80cca53-bfd7-442a-a0a2-c846c978ea30 --xlator-option *dht.commit-hash=2926499706 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/b80cca53-bfd7-442a-a0a2-c846c978ea30.pid -l /var/log/glusterfs/live-rebalance.log) stor105: [2015-08-14 12:11:27.107815] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 stor105: [2015-08-14 12:11:32.086382] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926499706' stor105: [2015-08-14 12:11:32.092152] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value 'b80cca53-bfd7-442a-a0a2-c846c978ea30' stor105: [2015-08-14 12:11:32.092187] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4' stor105: [2015-08-14 12:11:32.092220] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on' stor105: [2015-08-14 12:11:32.092239] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes' stor105: [2015-08-14 12:11:32.092256] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes' stor105: [2015-08-14 12:11:32.092291] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
Steps to reproduce: 1. Add a new node to a 2-node setup (already full with files) 2. Start a `gluster volume rebalance live fix-layout start` 3. Wait 20k-40k seconds 4. rebalance failed
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#2) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#4) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#5) for review on master by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#6) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#7) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#8) for review on master by mohammed rafi kc (rkavunga)
we block reading from socket till event-handler completes. This might cause spurious disconnects due to ping-timer expiry if handlers take more time. In this bug, the load seems to be from readdirp. I just looked at readdirp reply path. It involves looping over dentry list from various translators: 1. protocol/client construct dentry list and hence it traverses the list. 2. afr does a loop over dentries 3. dht does a loop over dentries 4. syncop_readdirp_cbk (rebalance process use syncops) copies each dentry and constructs a new list. I am suspecting whether such heavy processing in handler might've prevented the client from reading the ping response from socket (if ping response was queued behind readdirp response), resulting in timeout of ping-timer. One solution is that it would be better if we start reading from socket once we read a complete rpc msg. We need not wait till rpc-program/rpc-clnt above transport to process the reply.
<rpc_clnt_ping_timer_expired> gettimeofday (¤t, NULL); if (((current.tv_sec - conn->last_received.tv_sec) < conn->ping_timeout) || ((current.tv_sec - conn->last_sent.tv_sec) < conn->ping_timeout)) { transport_activity = 1; } if (transport_activity) { gf_log (trans->name, GF_LOG_TRACE, "ping timer expired but transport activity " "detected - not bailing transport"); if (__rpc_clnt_rearm_ping_timer (rpc, rpc_clnt_ping_timer_expired) == -1) { gf_log (trans->name, GF_LOG_WARNING, "unable to setup ping timer"); } } else { conn->ping_started = 0; disconnect = 1; } </rpc_clnt_ping_timer_expired> As can be seen above, ping_timer_expired takes "transport_activity" into account before actually disconnecting (Its not just ping response we are looking at). Hence RCA in previous comment is most likely wrong.
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#10) for review on master by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#11) for review on master by Vijay Bellur (vbellur)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#12) for review on master by mohammed rafi kc (rkavunga)
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#13) for review on master by mohammed rafi kc (rkavunga)
REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#14) for review on master by mohammed rafi kc (rkavunga)
This bug is getting closed because GlusteFS-3.7 has reached its end-of-life. Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.