Bug 982181
Summary: DHT: NFS process crashed on a node in a cluster when another storage node in the cluster went offline

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Product | [Red Hat Storage] Red Hat Gluster Storage | Reporter | spandura |
| Component | glusterfs | Assignee | Amar Tumballi <amarts> |
| Status | CLOSED ERRATA | QA Contact | spandura |
| Severity | high | Docs Contact | |
| Priority | high | | |
| Version | 2.1 | CC | rhs-bugs, spandura, spradhan, vbellur, vraman |
| Target Milestone | --- | Target Release | --- |
| Hardware | Unspecified | OS | Unspecified |
| Whiteboard | | | |
| Fixed In Version | glusterfs-3.4.0.12rhs.beta5 | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | Clones | 983431 (view as bug list) |
| Environment | | Last Closed | 2013-09-23 22:35:42 UTC |
| Type | Bug | Regression | --- |
| Mount Type | --- | Documentation | --- |
| CRM | | Verified Versions | |
| Category | --- | oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | | Cloudforms Team | --- |
| Target Upstream Version | | Embargoed | |
| Bug Depends On | | Bug Blocks | 983431, 1139996 |
| Attachments | SOS Reports, BT of core, commands execution history on nodes (attachment 770453) | | |
Description spandura 2013-07-08 10:54:17 UTC
Created attachment 770453 [details]
SOS Reports, BT of core, commands execution history on nodes
- Could you just clarify: what is the need of a 4-node cluster for a 1x2 dist-rep volume?
- Is the core file included in the attachment?

Necessary information has been provided to Santosh, hence removing the needinfo flag.

Thanks, Shwetha, for sharing the machine; I had a quick look at the core file. The stack trace (from the core) looks like:

```
#0  _gf_log (domain=0x1cae510 "vol_rep-replicate-0", file=0x7f85fe9aec3f "afr-common.c", function=0x7f85fe9b1350 "afr_get_call_child", line=745, level=GF_LOG_DEBUG, fmt=0x7f85fe9af268 "Returning %d, call_child: %d, last_index: %d") at logging.c:442
#1  0x00007f85fe99d012 in afr_get_call_child (this=0x1caf160, child_up=0x26248800 "", read_child=<value optimized out>, fresh_children=<value optimized out>, call_child=<value optimized out>, last_index=0x26247970) at afr-common.c:744
#2  0x00007f85fe96d9bf in afr_access (frame=0x262463cc, this=0x1caf160, loc=0x7f85f6afafec, mask=7, xdata=0x0) at afr-inode-read.c:136
#3  0x00007f85fe73fe6e in dht_access_cbk (frame=0x7f8602255434, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, xdata=<value optimized out>) at dht-inode-read.c:516
..... ..... ..... .....
#130898 0x00007f85fe96dbda in afr_access (frame=0x7f8602255184, this=<value optimized out>, loc=<value optimized out>, mask=<value optimized out>, xdata=0x0) at afr-inode-read.c:157
#130899 0x00007f85fe73fe6e in dht_access_cbk (frame=0x7f8602255434, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, xdata=<value optimized out>) at dht-inode-read.c:516
#130900 0x00007f85fe96dbda in afr_access (frame=0x7f86022554e0, this=<value optimized out>, loc=<value optimized out>, mask=<value optimized out>, xdata=0x0) at afr-inode-read.c:157
#130901 0x00007f85fe73fe6e in dht_access_cbk (frame=0x7f8602255434, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, xdata=<value optimized out>) at dht-inode-read.c:516
#130902 0x00007f85fe96dbda in afr_access (frame=0x7f86022556e4, this=<value optimized out>, loc=<value optimized out>, mask=<value optimized out>, xdata=0x0) at afr-inode-read.c:157
#130903 0x00007f85fe73fe6e in dht_access_cbk (frame=0x7f8602255434, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, xdata=<value optimized out>) at dht-inode-read.c:516
#130904 0x00007f85fe96dbda in afr_access (frame=0x7f8602255388, this=<value optimized out>, loc=<value optimized out>, mask=<value optimized out>, xdata=0x0) at afr-inode-read.c:157
#130905 0x00007f85fe73fe6e in dht_access_cbk (frame=0x7f8602255434, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=<value optimized out>, xdata=<value optimized out>) at dht-inode-read.c:516
#130906 0x00007f85fe96dbda in afr_access (frame=0x7f8602255638, this=<value optimized out>, loc=<value optimized out>, mask=<value optimized out>, xdata=0x0) at afr-inode-read.c:157
#130907 0x00007f85fe73d44d in dht_access (frame=0x7f8602255434, this=<value optimized out>, loc=0x7f85f7678414, mask=7, xdata=0x0) at dht-inode-read.c:593
#130908 0x00007f860340e228 in default_access (frame=0x7f8602255434, this=0x1cb12f0, loc=0x7f85f7678414, mask=7, xdata=<value optimized out>) at defaults.c:1066
#130909 0x00007f85fe2e91b4 in io_stats_access (frame=0x7f86022558e8, this=0x1cb2580, loc=0x7f85f7678414, mask=7, xdata=0x0) at io-stats.c:2344
#130910 0x00007f85fe0a6a31 in nfs_fop_access (nfsx=<value optimized out>, xl=0x1cb2580, nfu=<value optimized out>, loc=0x7f85f7678414, accesstest=<value optimized out>, cbk=<value optimized out>, local=0x7f85f767802c) at nfs-fops.c:467
#130911 0x00007f85fe0b9efb in nfs3_access_resume (carg=0x7f85f767802c) at nfs3.c:1545
#130912 0x00007f85fe0c0f0b in nfs3_fh_resolve_inode_done (cs=0x7f85f767802c, inode=<value optimized out>) at nfs3-helpers.c:3544
#130913 0x00007f85fe0c0fdb in nfs3_fh_resolve_inode (cs=0x7f85f767802c) at nfs3-helpers.c:3735
#130914 0x00007f85fe0c1085 in nfs3_fh_resolve_resume (cs=0x7f85f767802c) at nfs3-helpers.c:3767
#130915 0x00007f85fe0c1198 in nfs3_fh_resolve_root (cs=0x7f85f767802c) at nfs3-helpers.c:3821
#130916 0x00007f85fe0b9a61 in nfs3_access (req=0x7f85fdd1002c, fh=0x7fffd1d573b0, accbits=31) at nfs3.c:1582
#130917 0x00007f85fe0b9d95 in nfs3svc_access (req=0x7f85fdd1002c) at nfs3.c:1616
#130918 0x00007f86031e6655 in rpcsvc_handle_rpc_call (svc=<value optimized out>, trans=<value optimized out>, msg=0x1d39930) at rpcsvc.c:549
#130919 0x00007f86031e67a3 in rpcsvc_notify (trans=0x1d2b8d0, mydata=<value optimized out>, event=<value optimized out>, data=0x1d39930) at rpcsvc.c:643
#130920 0x00007f86031e78d8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:497
#130921 0x00007f85ffa1dbd6 in socket_event_poll_in (this=0x1d2b8d0) at socket.c:2119
#130922 0x00007f85ffa1f4ed in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1d2b8d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2231
#130923 0x00007f860344d407 in event_dispatch_epoll_handler (event_pool=0x1c89eb0) at event-epoll.c:384
#130924 event_dispatch_epoll (event_pool=0x1c89eb0) at event-epoll.c:445
#130925 0x00000000004067c6 in main (argc=11, argv=0x7fffd1d58ba8) at glusterfsd.c:1962
```

The number of frames in the stack (~131,000) explains itself: a stack overflow. Something goes wrong in afr_access() or dht_access_cbk() and they end up in a loop: afr_access() fails somewhere (return value < 0) and does a stack unwind (AFR_STACK_UNWIND()), but dht_access_cbk() winds the call again and the cycle repeats (see the sketch at the end of this report). I don't know much about AFR/DHT; maybe Pranith or Shishir will be able to help. It does not look like an NFS issue.

Thanks,
Santosh

Verified the fix on the build:
============================
glusterfs 3.4.0.15rhs built on Aug 4 2013 22:34:17

Bug is fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
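To make the loop in the analysis above concrete, here is a minimal, hypothetical C sketch of the synchronous wind/unwind cycle the backtrace suggests. It is an illustration under stated assumptions, not GlusterFS code: the function names echo the real translators, but the signatures, the child_up flag, and the depth cap are invented for the demo.

```c
/* Hypothetical sketch of the afr_access()/dht_access_cbk() cycle:
 * AFR fails the access FOP and unwinds an error; DHT's callback reacts
 * by winding the call again, synchronously, on the same C stack. */
#include <stdio.h>

static int afr_access(int depth);

/* In the buggy path, an error from AFR triggers an immediate re-wind,
 * with nothing to stop the retry while the replica pair stays down. */
static int dht_access_cbk(int op_ret, int depth)
{
    if (op_ret < 0) {
        if (depth > 1000) { /* cap for the demo only; the real code had
                             * none, so recursion ran to ~131,000 frames */
            printf("gave up at depth %d (real process overflowed here)\n",
                   depth);
            return -1;
        }
        return afr_access(depth); /* wind access() down into AFR again */
    }
    return 0; /* success: unwind and reply to the NFS client */
}

/* With the storage node offline, afr_get_call_child() finds no usable
 * child, so AFR unwinds an error straight back to DHT. */
static int afr_access(int depth)
{
    int child_up = 0; /* replica children are down */

    if (!child_up)
        return dht_access_cbk(-1, depth + 1); /* error unwind to DHT */

    return 0; /* would wind on to the client translator here */
}

int main(void)
{
    return afr_access(0) < 0; /* exits 1 after hitting the depth cap */
}
```

Compiled and run, the sketch gives up after about a thousand afr_access()/dht_access_cbk() round trips; the crashed NFS process had no such cap, so the mutual recursion grew until it exhausted the stack.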