Created attachment 1043182 [details]
core dump

Description of problem:
------------------------
In a 3-node cluster, after upgrading one of the nodes from RHS 3.0.4 to RHGS 3.1, one of the bricks was seen to have crashed. See the backtrace below:

#0  0x00007f0220182792 in posix_getxattr (frame=0x7f022b448520, this=0x7f021c006410, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", xdata=0x0) at posix.c:3729
#1  0x00007f022d8941ab in default_getxattr (frame=0x7f022b448520, this=0x7f021c008c90, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", xdata=<value optimized out>) at defaults.c:1969
#2  0x00007f021b3b8a43 in posix_acl_getxattr (frame=0x7f022b448474, this=0x7f021c00a960, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", xdata=0x0) at posix-acl.c:1978
#3  0x00007f021b1a1c0b in pl_getxattr (frame=0x7f022b4483c8, this=0x7f021c00bc80, loc=0x7f022aece0ac, name=<value optimized out>, xdata=0x0) at posix.c:777
#4  0x00007f022d89798a in default_getxattr_resume (frame=0x7f022b44831c, this=0x7f021c00cfa0, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", xdata=0x0) at defaults.c:1530
#5  0x00007f022d8b2b90 in call_resume (stub=0x7f022aece06c) at call-stub.c:2576
#6  0x00007f021af8e3d8 in iot_worker (data=0x7f021c0478a0) at io-threads.c:214
#7  0x00007f022c977a51 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f022c2e196d in clone () from /lib64/libc.so.6

See the volume configuration below:

# gluster v info 3-test

Volume Name: 3-test
Type: Distributed-Replicate
Volume ID: 72bb16e7-31bb-4b16-aa62-1697b8296280
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.119:/rhs/brick3/b1
Brick2: 10.70.37.218:/rhs/brick2/b1
Brick3: 10.70.37.200:/rhs/brick2/b1
Brick4: 10.70.37.119:/rhs/brick4/b1
Brick5: 10.70.37.218:/rhs/brick3/b1
Brick6: 10.70.37.200:/rhs/brick3/b1
Options Reconfigured:
features.uss: on
features.quota: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.1-5.el6rhs.x86_64

How reproducible:
------------------
Tried the upgrade once and observed the crash.

Steps to Reproduce:
--------------------
1. Set up a 3-node cluster running RHS 3.0.4, with 2 volumes (2x2 and 2x3).
2. Both volumes were being accessed over FUSE with continuous I/O.
3. One of the nodes was upgraded to RHGS 3.1 using the ISO and rebooted.
4. After the reboot, all bricks started and self-heal was in progress when 2 of the bricks were OOM-killed (BZ #1224177).
5. The volume was started with the force option, but one of the 2 bricks that were OOM-killed failed to start and was found to have crashed.

Actual results:
----------------
Brick process crashed.

Expected results:
------------------
Brick process should not crash.

Additional info:
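For illustration only, the sketch below shows the general defensive pattern relevant to a crash like the one in posix_getxattr above: a getxattr-style handler that dereferences per-translator private state should tolerate that state being absent (for example while the brick process is still starting up) and fail the request instead of segfaulting. This is a self-contained, hypothetical C program; the names brick_ctx_t, fake_priv_t and handle_getxattr are invented and this is not the actual posix_getxattr code or the shipped fix.

/*
 * Minimal sketch of a NULL-guard on translator private data.
 * All types and function names are hypothetical.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char node_uuid[37];      /* value served for "node-uuid" requests */
} fake_priv_t;

typedef struct {
    fake_priv_t *priv;       /* NULL until initialization finishes */
} brick_ctx_t;

/* Returns 0 on success, -1 with an errno-style code in *op_errno on failure. */
static int
handle_getxattr(brick_ctx_t *ctx, const char *name,
                char *value, size_t len, int *op_errno)
{
    /* Guard: private data may not be set up yet. */
    if (ctx == NULL || ctx->priv == NULL) {
        *op_errno = ENOTCONN;   /* "try again later" rather than SIGSEGV */
        return -1;
    }

    if (strcmp(name, "trusted.glusterfs.node-uuid") == 0) {
        snprintf(value, len, "%s", ctx->priv->node_uuid);
        return 0;
    }

    *op_errno = ENODATA;
    return -1;
}

int
main(void)
{
    brick_ctx_t ctx = { .priv = NULL };   /* initialization not finished yet */
    char buf[64];
    int op_errno = 0;

    if (handle_getxattr(&ctx, "trusted.glusterfs.node-uuid",
                        buf, sizeof(buf), &op_errno) < 0)
        fprintf(stderr, "getxattr refused: %s\n", strerror(op_errno));

    return 0;
}

With the guard in place the request is refused cleanly instead of dereferencing a NULL pointer; without it the same call would crash the process, which is the failure mode reported here.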
The reason the crash was initially attributed to changelog is that this->private of changelog is NULL. But from the dump it is clear that the changelog translator is still in its initialization path, as can be seen from the following backtrace of thread 8, and hence this->private of changelog is NULL at that point. So the crash is not related to changelog, and there is no dereference of changelog's this->private either.

Thread 8 (Thread 0x7f0220d97700 (LWP 14934)):
#0  0x00007f022c2d42d7 in mkdir () from /lib64/libc.so.6
#1  0x00007f022d8a9a38 in mkdir_p (path=<value optimized out>, mode=384, allow_symlinks=_gf_true) at common-utils.c:96
#2  0x00007f021b7e0a2b in changelog_init_options (this=0x7f021c008c90, priv=0x7f021c04d660) at changelog.c:2498
#3  0x00007f021b7e0d94 in init (this=0x7f021c008c90) at changelog.c:2632
#4  0x00007f022d888f82 in __xlator_init (xl=0x7f021c008c90) at xlator.c:397
#5  xlator_init (xl=0x7f021c008c90) at xlator.c:420
#6  0x00007f022d8c90c1 in glusterfs_graph_init (graph=<value optimized out>) at graph.c:319
#7  0x00007f022d8c91f7 in glusterfs_graph_activate (graph=<value optimized out>, ctx=0x7f022f7bb010) at graph.c:659
#8  0x00007f022dd4cd4b in glusterfs_process_volfp (ctx=0x7f022f7bb010, fp=0x7f021c0008e0) at glusterfsd.c:2174
#9  0x00007f022dd549e5 in mgmt_getspec_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f022b27c6d4) at glusterfsd-mgmt.c:1560
#10 0x00007f022d65ade5 in rpc_clnt_handle_reply (clnt=0x7f022f820e80, pollin=0x7f021c0016a0) at rpc-clnt.c:766
#11 0x00007f022d65c282 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x7f022f820eb0, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:894
#12 0x00007f022d657928 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:543
#13 0x00007f02223bdc6d in socket_event_poll_in (this=0x7f022f8229d0) at socket.c:2290
#14 0x00007f02223bf79d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7f022f8229d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
#15 0x00007f022d8e7fb0 in event_dispatch_epoll_handler (data=0x7f022f823dc0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x7f022f823dc0) at event-epoll.c:674
#17 0x00007f022c977a51 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f022c2e196d in clone () from /lib64/libc.so.6
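To make the race above concrete: thread 8 shows the graph still inside xlator_init (changelog is in mkdir_p) while the crashing thread is already executing a getxattr fop, i.e. requests are being served before initialization has completed. The toy program below is a hedged, self-contained sketch of one way to close such a window: gate fop dispatch on an "initialization finished" flag protected by a mutex and condition variable. The names graph_ready, wait_for_graph and fop_worker are invented for illustration; this is not the actual GlusterFS fix, which may take a different approach.

/*
 * Hypothetical sketch: workers wait until graph init is signalled
 * before touching any translator state.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  init_cond = PTHREAD_COND_INITIALIZER;
static int graph_ready = 0;

/* Block until every translator has finished init(). */
static void
wait_for_graph(void)
{
    pthread_mutex_lock(&init_lock);
    while (!graph_ready)
        pthread_cond_wait(&init_cond, &init_lock);
    pthread_mutex_unlock(&init_lock);
}

static void *
fop_worker(void *arg)
{
    (void)arg;
    wait_for_graph();                 /* do not touch xlator state early */
    printf("worker: graph ready, serving fops\n");
    return NULL;
}

int
main(void)
{
    pthread_t worker;
    pthread_create(&worker, NULL, fop_worker, NULL);

    sleep(1);                         /* stand-in for translator init work */

    pthread_mutex_lock(&init_lock);
    graph_ready = 1;                  /* init finished: open the gate */
    pthread_cond_broadcast(&init_cond);
    pthread_mutex_unlock(&init_lock);

    pthread_join(worker, NULL);
    return 0;
}

Compiled with gcc -pthread, the worker prints only after the main thread signals that initialization is done, illustrating the ordering that is missing in the crash scenario above.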
*** Bug 1238535 has been marked as a duplicate of this bug. ***
https://code.engineering.redhat.com/gerrit/#/c/52206/
There is a crash uncovered by [1]. Note that this crash is a day-zero bug and hence not a regression caused by [1]. Hence moving this bug back to ASSIGNED.

http://blog.gmane.org/gmane.comp.file-systems.gluster.devel

[1] http://review.gluster.org/11490
Bug 1239280 (https://bugzilla.redhat.com/show_bug.cgi?id=1239280) has already been filed for that crash. Hence moving this bug back to MODIFIED.
Verified as fixed in glusterfs-3.7.1-7. Upgraded 3 nodes from 3.0.4 to 3.1 and did not observe the crash.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html