Bug 851109
| Summary: | glusterd core dumps when statedump is taken under heavy load | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Sachidananda Urs <sac> |
| Component: | glusterd | Assignee: | Raghavendra Bhat <rabhat> |
| Status: | CLOSED ERRATA | QA Contact: | Sudhir D <sdharane> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.0 | CC: | amarts, gluster-bugs, kparthas, shaines, vbellur, vinaraya |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-09-11 14:24:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 844802 | | |
| Attachments: | | | |
Description
Sachidananda Urs 2012-08-23 09:28:49 UTC
Created attachment 606486 [details]
Gluster core dump
Fixed on master - http://review.gluster.com/3864

CHANGE: http://review.gluster.org/3876 (glusterd: Fixed incorrect assumptions in rpcsvc actors of glusterd) merged in release-3.3 by Vijay Bellur (vbellur)

Root Cause:
-----------
Some rpc procedure handlers (esp. statedump) of glusterd and glusterfsd return -1 even when a reply may already have been submitted to the remote caller. This has the side effect of the rpc layer sending an error reply using the _same_ request object, potentially resulting in a double free.

Fix:
-----
RPC handler functions are modified to return 0 (success for the rpc layer) once a reply has been submitted, irrespective of other kinds of failures (ENOMEM, a dict set failing, etc.) that may happen before returning.

Crashes yet again. This time sooner than expected, under moderate load.

```
#0  0x00007f99a9491f7b in getmntent_r () from /lib64/libc.so.6
#1  0x00007f99a5e0c1d3 in glusterd_add_brick_mount_details (volinfo=<value optimized out>, brickinfo=0x1aef4a0, dict=0x7f99a80585c0, count=0) at glusterd-utils.c:3788
#2  glusterd_add_brick_detail_to_dict (volinfo=<value optimized out>, brickinfo=0x1aef4a0, dict=0x7f99a80585c0, count=0) at glusterd-utils.c:3906
#3  0x00007f99a5df8c1a in glusterd_op_status_volume (op=<value optimized out>, dict=<value optimized out>, op_errstr=<value optimized out>, rsp_dict=<value optimized out>) at glusterd-op-sm.c:1529
#4  glusterd_op_commit_perform (op=<value optimized out>, dict=<value optimized out>, op_errstr=<value optimized out>, rsp_dict=<value optimized out>) at glusterd-op-sm.c:3039
#5  0x00007f99a5df9d34 in glusterd_op_ac_send_commit_op (event=<value optimized out>, ctx=<value optimized out>) at glusterd-op-sm.c:2348
#6  0x00007f99a5df6906 in glusterd_op_sm () at glusterd-op-sm.c:4620
#7  0x00007f99a5e1400f in glusterd3_1_stage_op_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f99a81e1fc0) at glusterd-rpc-ops.c:923
#8  0x00007f99a9f1c0c5 in rpc_clnt_handle_reply (clnt=0x1adee20, pollin=0x1b06be0) at rpc-clnt.c:788
#9  0x00007f99a9f1c8c0 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1adee50, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:907
#10 0x00007f99a9f18018 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:489
#11 0x00007f99a557d954 in socket_event_poll_in (this=0x1aebd40) at socket.c:1677
#12 0x00007f99a557da37 in socket_event_handler (fd=<value optimized out>, idx=4, data=0x1aebd40, poll_in=1, poll_out=0, poll_err=<value optimized out>) at socket.c:1792
#13 0x00007f99aa162d84 in event_dispatch_epoll_handler (event_pool=0x1ac1630) at event.c:785
#14 event_dispatch_epoll (event_pool=0x1ac1630) at event.c:847
#15 0x00000000004073ca in main (argc=<value optimized out>, argv=0x7fff27810308) at glusterfsd.c:1689
(gdb)
```

Created attachment 610787 [details]
New core
Created attachment 610788 [details]
Logs
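
The double-free scenario fixed by http://review.gluster.org/3876 comes down to a return-value convention. A minimal sketch of that convention follows; `fake_req_t`, `submit_reply` and `statedump_handler` are hypothetical stand-ins for the real rpcsvc request API, which is not reproduced in this report:

```c
#include <stdio.h>

/* Stand-in request type: once a reply is submitted, the rpc layer
 * owns (and will free) the request object. */
typedef struct { int replied; } fake_req_t;

static int
submit_reply (fake_req_t *req, int op_ret)
{
        req->replied = 1;   /* request now belongs to the reply path */
        (void) op_ret;
        return 0;
}

/* After a reply has been submitted, the handler must return 0 even if
 * later bookkeeping fails: returning -1 would make the rpc layer send
 * an error reply on the same request object, freeing it a second time. */
static int
statedump_handler (fake_req_t *req, int bookkeeping_ret)
{
        submit_reply (req, 0);

        if (bookkeeping_ret < 0)
                fprintf (stderr, "post-reply failure logged, not returned\n");

        return 0;   /* success as far as the rpc layer is concerned */
}
```

The point of the sketch: failures that occur after the reply is on the wire are logged, never propagated as -1 to the rpc layer.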
In glusterd_add_brick_mount_details, setmntent is called on /etc/mtab to get the mount options. It returns a file pointer, which getmntent then uses to read the information about all mounted filesystems. If setmntent fails, it returns a NULL file pointer, and if that pointer is passed to getmntent without being checked, the process segfaults. That is what happened in this case. I was able to reproduce the bug by renaming the /etc/mtab file; it produced the same backtrace.

It can be fixed by checking the file pointer after setmntent is called: if it is NULL, return from the function with a -1 return value. The errno can also be logged. The diff below fixes the issue:

```diff
diff --git a/xlators/mgmt/glusterd/src/glusterd-utils.c b/xlators/mgmt/glusterd/src/glusterd-utils.c
index 9112458..ce7e74a 100644
--- a/xlators/mgmt/glusterd/src/glusterd-utils.c
+++ b/xlators/mgmt/glusterd/src/glusterd-utils.c
@@ -3785,6 +3785,12 @@ glusterd_add_brick_mount_details (glusterd_brickinfo_t *brickinfo,
                 goto out;

         mtab = setmntent (_PATH_MOUNTED, "r");
+        if (!mtab) {
+                gf_log ("glusterd", GF_LOG_ERROR, "setmnt of %s failed (%s)",
+                        _PATH_MOUNTED, strerror (errno));
+                ret = -1;
+                goto out;
+        }
         entry = getmntent (mtab);

         while (1) {
@@ -5239,8 +5245,8 @@ glusterd_nfs_statedump (char *options, int option_cnt, char **op_errstr)
         pidfile = fopen (pidfile_path, "r");

         if (!pidfile) {
-                gf_log ("", GF_LOG_ERROR, "Unable to open pidfile: %s",
-                        pidfile_path);
+                gf_log ("", GF_LOG_ERROR, "Unable to open pidfile: %s (%s)",
+                        pidfile_path, strerror (errno));
                 ret = -1;
                 goto out;
         }
```

But the question is why setmntent failed.

Raghavendra Bhat, the reason setmntent failed is most likely too many open fds. setmntent(3) 'opens' /etc/mtab every time glusterd_add_brick_mount_details is called; we should have 'closed' the fd on exit using endmntent(3).
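
The corrected pattern can be shown as a small standalone program. `dump_mount_details` is a hypothetical helper name, not the actual glusterd function; it applies the same NULL check on setmntent and adds the endmntent call whose omission leaks the fd:

```c
#include <errno.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper following the fixed pattern: NULL-check the
 * stream returned by setmntent() and always release it with
 * endmntent() so repeated calls do not leak file descriptors. */
static int
dump_mount_details (const char *mtab_path)
{
        FILE          *mtab  = NULL;
        struct mntent *entry = NULL;

        mtab = setmntent (mtab_path, "r");
        if (!mtab) {
                fprintf (stderr, "setmntent of %s failed (%s)\n",
                         mtab_path, strerror (errno));
                return -1;      /* never hand a NULL stream to getmntent() */
        }

        while ((entry = getmntent (mtab)) != NULL)
                printf ("%s on %s type %s (%s)\n", entry->mnt_fsname,
                        entry->mnt_dir, entry->mnt_type, entry->mnt_opts);

        endmntent (mtab);       /* the call the leaking code path omitted */
        return 0;
}
```

Calling this in a loop holds at most one mtab stream at a time, whereas the unfixed code opened a new one per "volume status" and never closed it.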
On missing this, running "volume status" in a loop results in too many open fds, which can cause setmntent(3) to return NULL (FILE *) and eventually leads to glusterd crashing with the backtrace observed.

See http://review.gluster.com/3920

Pushed into downstream; needs one more round of testing on 3.3.0rhs-28.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-1253.html