1254137 – Rebalance fix-layout fails after some time with a timeout

Bug 1254137 - Rebalance fix-layout fails after some time with a timeout

Summary: Rebalance fix-layout fails after some time with a timeout

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	rpc
Sub Component:
Version:	3.7.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Mohammed Rafi KC
QA Contact:
Docs Contact:
URL:
Whiteboard:	rpc-ping-timeout
Depends On:
Blocks:	1254138
TreeView+	depends on / blocked

Reported:	2015-08-17 09:17 UTC by Trefex
Modified:	2017-03-08 10:48 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:
Clones:	1254138 (view as bug list)
Environment:
Last Closed:	2017-03-08 10:48:24 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Trefex 2015-08-17 09:17:47 UTC

Description of problem:

After adding a node to a 2-node setup, I performed a `gluster volume rebalance live fix-layout start`, and after some time, the fix-layout fails with the following error.

How reproducible:

We tried 3-4 times and it happened everytime.

Additional info:

Log entries from the 3 nodes:

stor106: [2015-08-13 15:49:26.557880] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2925913100'
stor106: [2015-08-13 15:49:26.557908] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '97952e77-bf0a-4227-a529-2d1ae787ec3e'
stor106: [2015-08-13 15:49:26.557924] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4'
stor106: [2015-08-13 15:49:26.557939] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on'
stor106: [2015-08-13 15:49:26.557955] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes'
stor106: [2015-08-13 15:49:26.557969] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor106: [2015-08-13 15:49:26.557983] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
stor106: [2015-08-13 21:36:05.535730] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-4: server 192.168.123.106:49156 has not responded in the last 42 seconds, disconnecting.
stor106: [2015-08-13 21:36:05.536519] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f69b957b9e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f69b93469be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69b9346ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f69b934847c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f69b9348c38] ))))) 0-live-client-4: forced unwinding frame type(GlusterFS 3.3) op(READDIRP(40)) called at 2015-08-13 21:35:06.999805 (xid=0x245a8f)
stor106: [2015-08-13 21:36:05.536943] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f69b957b9e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f69b93469be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69b9346ace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f69b934847c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f69b9348c38] ))))) 0-live-client-4: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-13 21:35:23.532921 (xid=0x245a90)
stor106: [2015-08-13 21:36:05.864351] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases/marker_dil/rnpB/0.3-B1-thermo
stor106: [2015-08-13 21:36:05.864594] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases/marker_dil/rnpB
stor106: [2015-08-13 21:36:05.864783] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases/marker_dil
stor106: [2015-08-13 21:36:05.865190] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/databases
stor106: [2015-08-13 21:36:05.865351] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology
stor106: [2015-08-13 21:36:05.865674] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia
stor106: [2015-08-13 21:36:05.865854] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb
stor106: [2015-08-14 09:38:17.498230] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=97952e77-bf0a-4227-a529-2d1ae787ec3e --xlator-option *dht.commit-hash=2926426186 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/97952e77-bf0a-4227-a529-2d1ae787ec3e.pid -l /var/log/glusterfs/live-rebalance.log)
stor106: [2015-08-14 09:38:17.519827] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
stor106: [2015-08-14 09:38:22.493300] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926426186'
stor106: [2015-08-14 09:38:22.493331] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '97952e77-bf0a-4227-a529-2d1ae787ec3e'
stor106: [2015-08-14 09:38:22.493350] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4'
stor106: [2015-08-14 09:38:22.493367] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on'
stor106: [2015-08-14 09:38:22.493384] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes'
stor106: [2015-08-14 09:38:22.493400] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor106: [2015-08-14 09:38:22.493415] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
stor106: [2015-08-14 09:42:47.621544] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields/Meas_01(2013-12-04_20-16-03)
stor106: [2015-08-14 09:42:47.621710] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields
stor106: [2015-08-14 09:42:47.622220] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol
stor106: [2015-08-14 09:42:47.622341] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs
stor106: [2015-08-14 09:42:47.622457] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs
stor106: [2015-08-14 12:11:27.121727] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=97952e77-bf0a-4227-a529-2d1ae787ec3e --xlator-option *dht.commit-hash=2926499706 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/97952e77-bf0a-4227-a529-2d1ae787ec3e.pid -l /var/log/glusterfs/live-rebalance.log)
stor106: [2015-08-14 12:11:27.143975] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
stor106: [2015-08-14 12:11:32.122635] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926499706'
stor106: [2015-08-14 12:11:32.129824] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '97952e77-bf0a-4227-a529-2d1ae787ec3e'
stor106: [2015-08-14 12:11:32.129868] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4'
stor106: [2015-08-14 12:11:32.129889] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on'
stor106: [2015-08-14 12:11:32.129909] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes'
stor106: [2015-08-14 12:11:32.129926] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor106: [2015-08-14 12:11:32.129942] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
stor104: [2015-08-13 15:49:26.561286] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor104: [2015-08-13 15:49:26.561301] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
stor104: [2015-08-13 20:22:32.106903] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-5: server 192.168.123.106:49157 has not responded in the last 42 seconds, disconnecting.
stor104: [2015-08-13 20:22:32.107488] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fede0b249e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fede08ef9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fede08eface] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fede08f147c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fede08f1c38] ))))) 0-live-client-5: forced unwinding frame type(GlusterFS 3.3) op(READDIRP(40)) called at 2015-08-13 20:21:19.267201 (xid=0x1a21ae)
stor104: [2015-08-13 20:22:32.108236] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fede0b249e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fede08ef9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fede08eface] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fede08f147c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fede08f1c38] ))))) 0-live-client-5: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-13 20:21:50.102656 (xid=0x1a21af)
stor104: [2015-08-13 20:22:32.111687] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2/motu.profiles/mOTU.v1.padded/reads.processed.solexaqa.solexaqa.allbest.l45.p97
stor104: [2015-08-13 20:22:32.112024] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2/motu.profiles/mOTU.v1.padded
stor104: [2015-08-13 20:22:32.112365] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2/motu.profiles
stor104: [2015-08-13 20:22:32.112735] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman/P1_2
stor104: [2015-08-13 20:22:32.113071] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs/withHuman
stor104: [2015-08-13 20:22:32.113405] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis/mOTUs
stor104: [2015-08-13 20:22:32.113764] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG/Analysis
stor104: [2015-08-13 20:22:32.114065] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn/metaG
stor104: [2015-08-13 20:22:32.114386] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology/IMDyn
stor104: [2015-08-13 20:22:32.114731] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia/ecosystem_biology
stor104: [2015-08-13 20:22:32.115051] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/gaia
stor104: [2015-08-13 20:22:32.115433] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb
stor104: [2015-08-14 09:38:17.500715] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=305c0f00-0f11-4da3-a470-50b6e6c14976 --xlator-option *dht.commit-hash=2926426186 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/305c0f00-0f11-4da3-a470-50b6e6c14976.pid -l /var/log/glusterfs/live-rebalance.log)
stor104: [2015-08-14 09:38:17.514523] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
stor104: [2015-08-14 09:38:22.498426] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926426186'
stor104: [2015-08-14 09:38:22.498446] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '305c0f00-0f11-4da3-a470-50b6e6c14976'
stor104: [2015-08-14 09:38:22.498456] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4'
stor104: [2015-08-14 09:38:22.498465] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on'
stor104: [2015-08-14 09:38:22.498475] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes'
stor104: [2015-08-14 09:38:22.498483] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor104: [2015-08-14 09:38:22.498492] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
stor104: [2015-08-14 09:42:47.642919] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields/Meas_01(2013-12-04_20-16-03)
stor104: [2015-08-14 09:42:47.643189] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/CT 20131204_Fluo4 1 Well other Fields
stor104: [2015-08-14 09:42:47.644244] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol
stor104: [2015-08-14 09:42:47.644404] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs
stor104: [2015-08-14 09:42:47.644549] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs
stor104: [2015-08-14 12:11:27.130086] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=305c0f00-0f11-4da3-a470-50b6e6c14976 --xlator-option *dht.commit-hash=2926499706 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/305c0f00-0f11-4da3-a470-50b6e6c14976.pid -l /var/log/glusterfs/live-rebalance.log)
stor104: [2015-08-14 12:11:27.152078] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
stor104: [2015-08-14 12:11:32.145533] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926499706'
stor104: [2015-08-14 12:11:32.145593] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value '305c0f00-0f11-4da3-a470-50b6e6c14976'
stor104: [2015-08-14 12:11:32.145612] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4'
stor104: [2015-08-14 12:11:32.145629] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on'
stor104: [2015-08-14 12:11:32.145646] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes'
stor104: [2015-08-14 12:11:32.145661] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor104: [2015-08-14 12:11:32.145675] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
stor105: [2015-08-14 01:18:06.680271] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-live-client-5: server 192.168.123.106:49157 has not responded in the last 42 seconds, disconnecting.
stor105: [2015-08-14 01:18:06.680939] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f8549c409e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8549a0b9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8549a0bace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f8549a0d47c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f8549a0dc38] ))))) 0-live-client-5: forced unwinding frame type(GlusterFS 3.3) op(READDIRP(40)) called at 2015-08-14 01:17:17.396967 (xid=0x3b966a)
stor105: [2015-08-14 01:18:06.681340] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7f8549c409e6] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f8549a0b9be] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f8549a0bace] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7f8549a0d47c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7f8549a0dc38] ))))) 0-live-client-5: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-08-14 01:17:24.676542 (xid=0x3b966b)
stor105: [2015-08-14 01:18:06.684419] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology/local_tools/specI_v1.0/sequences
stor105: [2015-08-14 01:18:06.684710] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology/local_tools/specI_v1.0
stor105: [2015-08-14 01:18:06.685996] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology/local_tools
stor105: [2015-08-14 01:18:06.686349] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects/ecosystem_biology
stor105: [2015-08-14 01:18:06.686602] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work/projects
stor105: [2015-08-14 01:18:06.686946] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:/work
stor105: [2015-08-14 01:18:06.687147] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32/access-gaia.uni.lu:
stor105: [2015-08-14 01:18:06.687372] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10/10:32
stor105: [2015-08-14 01:18:06.687619] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11/10
stor105: [2015-08-14 01:18:06.687848] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014/11
stor105: [2015-08-14 01:18:06.688077] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia/2014
stor105: [2015-08-14 01:18:06.688290] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup/gaia
stor105: [2015-08-14 01:18:06.688505] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb/esb_backup
stor105: [2015-08-14 01:18:06.688752] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /esb
stor105: [2015-08-14 09:38:17.485415] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=b80cca53-bfd7-442a-a0a2-c846c978ea30 --xlator-option *dht.commit-hash=2926426186 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/b80cca53-bfd7-442a-a0a2-c846c978ea30.pid -l /var/log/glusterfs/live-rebalance.log)
stor105: [2015-08-14 09:38:17.507689] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
stor105: [2015-08-14 09:38:22.480316] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926426186'
stor105: [2015-08-14 09:38:22.480333] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value 'b80cca53-bfd7-442a-a0a2-c846c978ea30'
stor105: [2015-08-14 09:38:22.480342] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4'
stor105: [2015-08-14 09:38:22.480351] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on'
stor105: [2015-08-14 09:38:22.480360] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes'
stor105: [2015-08-14 09:38:22.480368] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor105: [2015-08-14 09:38:22.480375] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'
stor105: [2015-08-14 09:42:47.631749] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/LB 20130307_MitoSox_exp_conc_incub_24h/Meas_01(2013-03-07_14-07-12)
stor105: [2015-08-14 09:42:47.632020] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol/LB 20130307_MitoSox_exp_conc_incub_24h
stor105: [2015-08-14 09:42:47.633021] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs/OperaArchiveCol
stor105: [2015-08-14 09:42:47.633226] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs/hcs
stor105: [2015-08-14 09:42:47.633439] E [MSGID: 109016] [dht-rebalance.c:2554:gf_defrag_fix_layout] 0-live-dht: Fix layout failed for /hcs
stor105: [2015-08-14 12:11:27.085008] I [MSGID: 100030] [glusterfsd.c:2301:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.7.3 (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/live --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=4 --xlator-option *dht.node-uuid=b80cca53-bfd7-442a-a0a2-c846c978ea30 --xlator-option *dht.commit-hash=2926499706 --socket-file /var/run/gluster/gluster-rebalance-1328637d-7730-4627-8945-bbe43626d527.sock --pid-file /var/lib/glusterd/vols/live/rebalance/b80cca53-bfd7-442a-a0a2-c846c978ea30.pid -l /var/log/glusterfs/live-rebalance.log)
stor105: [2015-08-14 12:11:27.107815] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
stor105: [2015-08-14 12:11:32.086382] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'commit-hash' for volume 'live-dht' with value '2926499706'
stor105: [2015-08-14 12:11:32.092152] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'node-uuid' for volume 'live-dht' with value 'b80cca53-bfd7-442a-a0a2-c846c978ea30'
stor105: [2015-08-14 12:11:32.092187] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'rebalance-cmd' for volume 'live-dht' with value '4'
stor105: [2015-08-14 12:11:32.092220] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'readdir-optimize' for volume 'live-dht' with value 'on'
stor105: [2015-08-14 12:11:32.092239] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'assert-no-child-down' for volume 'live-dht' with value 'yes'
stor105: [2015-08-14 12:11:32.092256] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'lookup-unhashed' for volume 'live-dht' with value 'yes'
stor105: [2015-08-14 12:11:32.092291] I [graph.c:269:gf_add_cmdline_options] 0-live-dht: adding option 'use-readdirp' for volume 'live-dht' with value 'yes'

Comment 1 Trefex 2015-08-17 09:25:06 UTC

Steps to reproduce:

1. Add a new node to a 2-node setup (already full with files)
2. Start a `gluster volume rebalance live fix-layout start`
3. Wait 20k-40k seconds
4. rebalance failed

Comment 2 Anand Avati 2015-08-17 11:01:14 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#2) for review on master by mohammed rafi  kc (rkavunga)

Comment 3 Anand Avati 2015-08-20 08:37:23 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#4) for review on master by mohammed rafi  kc (rkavunga)

Comment 4 Anand Avati 2015-08-31 11:30:54 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#5) for review on master by Vijay Bellur (vbellur)

Comment 5 Vijay Bellur 2015-09-09 10:30:37 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#6) for review on master by mohammed rafi  kc (rkavunga)

Comment 6 Vijay Bellur 2015-09-10 13:24:54 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#7) for review on master by mohammed rafi  kc (rkavunga)

Comment 7 Vijay Bellur 2015-09-15 06:05:50 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#8) for review on master by mohammed rafi  kc (rkavunga)

Comment 8 Raghavendra G 2015-10-09 06:43:35 UTC

we block reading from socket till event-handler completes. This might cause spurious disconnects due to ping-timer expiry if handlers take more time. In this bug, the load seems to be from readdirp. I just looked at readdirp reply path. It involves looping over dentry list from various translators:

1. protocol/client construct dentry list and hence it traverses the list.
2. afr does a loop over dentries
3. dht does a loop over dentries
4. syncop_readdirp_cbk (rebalance process use syncops) copies each dentry and constructs a new list.

I am suspecting whether such heavy processing in handler might've prevented the client from reading the ping response from socket (if ping response was queued behind readdirp response), resulting in timeout of ping-timer.

One solution is that it would be better if we start reading from socket once we read a complete rpc msg. We need not wait till rpc-program/rpc-clnt above transport to process the reply.

Comment 9 Raghavendra G 2015-10-09 06:54:17 UTC

<rpc_clnt_ping_timer_expired>

                gettimeofday (&current, NULL);
                if (((current.tv_sec - conn->last_received.tv_sec) <
                     conn->ping_timeout)
                    || ((current.tv_sec - conn->last_sent.tv_sec) <
                        conn->ping_timeout)) {
                        transport_activity = 1;
		}

                if (transport_activity) {
                        gf_log (trans->name, GF_LOG_TRACE,
                                "ping timer expired but transport activity "
                                "detected - not bailing transport");

                        if (__rpc_clnt_rearm_ping_timer (rpc,
                                         rpc_clnt_ping_timer_expired) == -1) {
                                gf_log (trans->name, GF_LOG_WARNING,
					"unable to setup ping timer");
                        }

                } else {
                        conn->ping_started = 0;
                        disconnect = 1;
                }

</rpc_clnt_ping_timer_expired>

As can be seen above, ping_timer_expired takes "transport_activity" into account before actually disconnecting (Its not just ping response we are looking at). Hence RCA in previous comment is most likely wrong.

Comment 11 Vijay Bellur 2015-11-23 03:00:07 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#10) for review on master by Vijay Bellur (vbellur)

Comment 12 Vijay Bellur 2015-11-23 18:22:08 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#11) for review on master by Vijay Bellur (vbellur)

Comment 13 Vijay Bellur 2016-02-22 17:56:07 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#12) for review on master by mohammed rafi  kc (rkavunga)

Comment 14 Mike McCune 2016-03-28 23:28:22 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 15 Worker Ant 2016-11-16 12:00:50 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#13) for review on master by mohammed rafi  kc (rkavunga)

Comment 16 Worker Ant 2016-11-17 09:15:39 UTC

REVIEW: http://review.gluster.org/11935 (socket: Add ping packets into beginning of ioq list) posted (#14) for review on master by mohammed rafi  kc (rkavunga)

Comment 17 Kaushal 2017-03-08 10:48:24 UTC

This bug is getting closed because GlusteFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

Note You need to log in before you can comment on or make changes to this bug.