Description of problem:

Distributed-Replicate 8x2 volume on 3.3.1. Moving a healthy brick to another server. While the source brick was overloaded with I/O, making the whole volume unavailable (I'll file another bug for that), the glusterfs process for the destination brick hangs itself in __inode_find in inode.c:

20130409_114419 root@stor3-idc1-lga:~/ gdb //sbin/glusterfs
(gdb) attach 24685
(gdb) bt
#0  0x00007fa947885b40 in uuid_compare@plt () from //lib/libglusterfs.so.0
#1  0x00007fa9478a008d in __inode_find (table=0x23cf660, gfid=0x2687410 "\003\267\245~\246\321@\024\274\372\230\310\002\365\306,") at inode.c:765
#2  0x00007fa9478a0649 in inode_find (table=0x23cf660, gfid=0x2687410 "\003\267\245~\246\321@\024\274\372\230\310\002\365\306,") at inode.c:788
#3  0x00007fa94372ab5f in resolve_entry_simple (frame=<optimized out>) at server-resolve.c:251
#4  0x00007fa94372b32e in resolve_continue (frame=0x7fa94590679c) at server-resolve.c:215
#5  0x00007fa94372b71c in resolve_gfid_entry_cbk (frame=0x7fa94590679c, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>, buf=0x7fffd7390820, xdata=0x0, postparent=0x7fffd7390890) at server-resolve.c:107
#6  0x00007fa943964886 in pl_lookup_cbk (frame=0x7fa945b102dc, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=0, inode=0x7fa9424fd7bc, buf=0x7fffd7390820, xdata=0x0, postparent=0x7fffd7390890) at posix.c:1611
#7  0x00007fa943b7b17e in posix_lookup (frame=0x7fa945b10230, this=<optimized out>, loc=0x2687438, xdata=<optimized out>) at posix.c:188
#8  0x00007fa94395f464 in pl_lookup (frame=<optimized out>, this=0x23c0050, loc=0x2687438, xdata=0x0) at posix.c:1653
#9  0x00007fa94372b573 in resolve_gfid_cbk (frame=0x7fa94590679c, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=0x7fa9424fd7bc, buf=0x7fffd7390b40, xdata=0x0, postparent=0x7fffd7390bb0) at server-resolve.c:158
#10 0x00007fa943964886 in pl_lookup_cbk (frame=0x7fa945b1002c, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, inode=0x7fa9424fd7bc, buf=0x7fffd7390b40, xdata=0x0, postparent=0x7fffd7390bb0) at posix.c:1611
#11 0x00007fa943b7b17e in posix_lookup (frame=0x7fa945b100d8, this=<optimized out>, loc=0x2687438, xdata=<optimized out>) at posix.c:188
#12 0x00007fa94395f464 in pl_lookup (frame=<optimized out>, this=0x23c0050, loc=0x2687438, xdata=0x0) at posix.c:1653
#13 0x00007fa94372aa91 in resolve_gfid (frame=0x7fa94590679c) at server-resolve.c:190
#14 0x00007fa94372b160 in server_resolve_entry (frame=0x7fa94590679c) at server-resolve.c:325
#15 0x00007fa94372b218 in server_resolve (frame=0x7fa94590679c) at server-resolve.c:502
#16 0x00007fa94372af0e in server_resolve_all (frame=<optimized out>) at server-resolve.c:559
#17 0x00007fa94372b7f4 in resolve_and_resume (frame=<optimized out>, fn=<optimized out>) at server-resolve.c:589
#18 0x00007fa943745d41 in server_lookup (req=<optimized out>) at server3_1-fops.c:5571
#19 0x00007fa9476641c8 in rpcsvc_handle_rpc_call (svc=0x23c2ec0, trans=<optimized out>, msg=<optimized out>) at rpcsvc.c:513
#20 0x00007fa9476647fb in rpcsvc_notify (trans=0x24a5970, mydata=<optimized out>, event=<optimized out>, data=0x23cebf0) at rpcsvc.c:612
#21 0x00007fa947668367 in rpc_transport_notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at rpc-transport.c:489
#22 0x00007fa942df1c24 in socket_event_poll_in (this=0x24a5970) at socket.c:1677
#23 0x00007fa942df1f77 in socket_event_handler (fd=<optimized out>, idx=1, data=0x24a5970, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:1792
#24 0x00007fa9478b2917 in event_dispatch_epoll_handler (i=<optimized out>, events=0x23cd620, event_pool=0x23b5d30) at event.c:785
#25 event_dispatch_epoll (event_pool=0x23b5d30) at event.c:847
#26 0x000000000040475d in main (argc=<optimized out>, argv=0x7fffd7391598) at glusterfsd.c:1689

After some stepping I found that __inode_find never finishes: the process never exits the "list_for_each_entry (tmp, &table->inode_hash[hash], hash) {" loop of __inode_find in inode.c. The process was still busy writing out 2M directories to the new brick; it had gotten to 1%.
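For context on where the spin happens, below is a minimal standalone sketch of the kind of hash-bucket walk that frame #1 is stuck in. The names (sketch_table, sketch_inode, sketch_inode_find), the bucket count, and the uuid type are illustrative assumptions, not copies of the shipped 3.3.1 source; only the loop shape mirrors the list_for_each_entry walk quoted above, built on the kernel-style intrusive list macros that libglusterfs uses.

/* Sketch only: names, bucket count and uuid type are assumptions,
 * not the actual GlusterFS 3.3.1 code. */
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Kernel-style intrusive doubly linked list. */
struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof (type, member)))

#define list_for_each_entry(pos, head, member)                              \
        for (pos = container_of ((head)->next, __typeof__ (*pos), member);  \
             &pos->member != (head);                                        \
             pos = container_of (pos->member.next, __typeof__ (*pos), member))

typedef unsigned char sketch_uuid_t[16];   /* stand-in for libuuid's uuid_t */

struct sketch_inode {
        sketch_uuid_t     gfid;
        struct list_head  hash;    /* links the inode into one hash bucket */
};

#define SKETCH_BUCKETS 4096        /* illustrative; not the real table size */

struct sketch_table {
        struct list_head  inode_hash[SKETCH_BUCKETS];
};

/*
 * Walk one hash bucket looking for a matching gfid.  The loop terminates
 * only when the traversal wraps back around to the bucket's list head; if
 * the next/prev links of an entry are corrupted, or an entry is re-linked
 * while a walker is traversing the bucket, the walk can cycle forever.
 */
static struct sketch_inode *
sketch_inode_find (struct sketch_table *table, sketch_uuid_t gfid, int bucket)
{
        struct sketch_inode *tmp   = NULL;
        struct sketch_inode *found = NULL;

        list_for_each_entry (tmp, &table->inode_hash[bucket], hash) {
                if (memcmp (tmp->gfid, gfid, sizeof (sketch_uuid_t)) == 0) {
                        found = tmp;
                        break;
                }
        }
        return found;
}

int
main (void)
{
        static struct sketch_table table;   /* static to keep it off the stack */
        sketch_uuid_t gfid = {0};
        int i;

        /* Empty buckets point back at themselves, so the walk ends at once. */
        for (i = 0; i < SKETCH_BUCKETS; i++) {
                table.inode_hash[i].next = &table.inode_hash[i];
                table.inode_hash[i].prev = &table.inode_hash[i];
        }

        printf ("found: %p\n", (void *) sketch_inode_find (&table, gfid, 0));
        return 0;
}

The point of the sketch: the walk has no counter or bound, it ends only when it comes back around to the bucket's list head. So if an entry's links stop forming a proper cycle through the head (corruption, or concurrent re-linking without the table lock), the traversal spins among entries indefinitely, which would match the observed 100% CPU with uuid_compare at the top of the stack.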
The user has been stuck for 24h; raising the priority accordingly.
Created attachment 733672 [details]
The -etc-glusterfs-glusterd.vol.log

This is the -etc-glusterfs-glusterd.vol.log of the destination node that this bug is about. The glusterfs process is still spinning at 100% CPU. Bug 950024, among others, holds the source brick log and commands.
Any update? It is now 7 days later, and I have killed the glusterfs process that was spinning at 100% CPU to free up resources.
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will be closed automatically.