Bug 1235735

Summary: glusterfsd crash observed after upgrading from 3.0.4 to 3.1
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shruti Sampat <ssampat>
Component: core
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED ERRATA
QA Contact: Shruti Sampat <ssampat>
Severity: high
Priority: urgent
Version: rhgs-3.1
CC: annair, asrivast, khiremat, nbalacha, rcyriac, rgowdapp, rhs-bugs, sasundar, storage-qa-internal, vagarwal
Target Release: RHGS 3.1.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.7.1-8
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-07-29 05:08:02 UTC
Bug Blocks: 1186580, 1202842, 1223915, 1236945
Attachments: core dump

Description Shruti Sampat 2015-06-25 15:35:03 UTC
Created attachment 1043182 [details]
core dump

Description of problem:
------------------------

In a 3-node cluster, after upgrading one of the nodes from 3.0.4 to 3.1, one of the bricks was seen to have crashed. See backtrace below -

#0  0x00007f0220182792 in posix_getxattr (frame=0x7f022b448520, this=0x7f021c006410, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", xdata=0x0)
    at posix.c:3729
#1  0x00007f022d8941ab in default_getxattr (frame=0x7f022b448520, this=0x7f021c008c90, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", 
    xdata=<value optimized out>) at defaults.c:1969
#2  0x00007f021b3b8a43 in posix_acl_getxattr (frame=0x7f022b448474, this=0x7f021c00a960, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", 
    xdata=0x0) at posix-acl.c:1978
#3  0x00007f021b1a1c0b in pl_getxattr (frame=0x7f022b4483c8, this=0x7f021c00bc80, loc=0x7f022aece0ac, name=<value optimized out>, xdata=0x0) at posix.c:777
#4  0x00007f022d89798a in default_getxattr_resume (frame=0x7f022b44831c, this=0x7f021c00cfa0, loc=0x7f022aece0ac, name=0x7f02140040d0 "trusted.glusterfs.node-uuid", 
    xdata=0x0) at defaults.c:1530
#5  0x00007f022d8b2b90 in call_resume (stub=0x7f022aece06c) at call-stub.c:2576
#6  0x00007f021af8e3d8 in iot_worker (data=0x7f021c0478a0) at io-threads.c:214
#7  0x00007f022c977a51 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f022c2e196d in clone () from /lib64/libc.so.6
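
For context, trusted.glusterfs.node-uuid is one of GlusterFS's virtual xattrs: the brick-side translators answer it from in-memory state rather than from an on-disk xattr. The sketch below is a minimal, hypothetical handler in that style (my_getxattr, struct my_private and priv->node_uuid are made-up names, not the code at posix.c:3729); it illustrates a plausible failure class for a crash at this point in the stack, namely dereferencing state that is not guaranteed to be set while the brick is still coming up.

/* Illustrative only: a virtual-xattr getxattr handler written in the style
 * of a GlusterFS brick-side translator. Identifiers such as my_getxattr,
 * struct my_private and priv->node_uuid are hypothetical; this is not the
 * actual posix.c code. */
static int32_t
my_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc,
             const char *name, dict_t *xdata)
{
        struct my_private *priv     = this->private;
        dict_t            *dict     = NULL;
        char              *uuid_str = NULL;
        int32_t            op_ret   = -1;
        int32_t            op_errno = EINVAL;

        /* Defensive checks: a fop can be resumed (see default_getxattr_resume
         * in frame #4 above) while the brick is still initializing, so priv
         * and loc->inode are not guaranteed to be non-NULL. */
        if (!priv || !loc || !loc->inode)
                goto unwind;

        dict = dict_new ();
        if (!dict) {
                op_errno = ENOMEM;
                goto unwind;
        }

        if (name && strcmp (name, "trusted.glusterfs.node-uuid") == 0) {
                /* Answer the virtual xattr from in-memory state instead of
                 * calling getxattr(2) on the backend file. */
                uuid_str = gf_strdup (uuid_utoa (priv->node_uuid));
                if (!uuid_str ||
                    dict_set_dynstr (dict, (char *)name, uuid_str) != 0) {
                        GF_FREE (uuid_str);
                        op_errno = ENOMEM;
                        goto unwind;
                }
                op_ret   = 0;
                op_errno = 0;
        }

unwind:
        STACK_UNWIND_STRICT (getxattr, frame, op_ret, op_errno, dict, xdata);
        if (dict)
                dict_unref (dict);
        return 0;
}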

See volume configuration below -

# gluster v info 3-test
 
Volume Name: 3-test
Type: Distributed-Replicate
Volume ID: 72bb16e7-31bb-4b16-aa62-1697b8296280
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.119:/rhs/brick3/b1
Brick2: 10.70.37.218:/rhs/brick2/b1
Brick3: 10.70.37.200:/rhs/brick2/b1
Brick4: 10.70.37.119:/rhs/brick4/b1
Brick5: 10.70.37.218:/rhs/brick3/b1
Brick6: 10.70.37.200:/rhs/brick3/b1
Options Reconfigured:
features.uss: on
features.quota: on
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.1-5.el6rhs.x86_64

How reproducible:
------------------
Tried the upgrade once and observed the crash.

Steps to Reproduce:
--------------------
1. Setup a 3-node cluster running RHS 3.0.4, with 2 volumes (2x2 and 2x3).
2. Both volumes were being accessed using fuse with continuous I/O.
3. One of the nodes was upgraded to RHGS 3.1 using ISO and rebooted.
4. After reboot, all bricks started and self-heal was running when 2 of the bricks were OOM-killed. (BZ #1224177)
5. The volume was started with the force option, but one of the 2 bricks that had been OOM-killed failed to start and was found to have crashed.

Actual results:
----------------
Brick process crashed.

Expected results:
------------------
Brick process should not crash.

Additional info:

Comment 2 Kotresh HR 2015-06-29 09:05:51 UTC
The crash was initially suspected to be in changelog because this->private of changelog is NULL.
However, the dump makes it clear that the changelog translator is still in its initialization path, as seen in the backtrace of thread 8 below, so this->private being NULL for changelog is expected at this point. The crash is therefore not related to changelog, and nothing in the crashing thread dereferences changelog's this->private.


Thread 8 (Thread 0x7f0220d97700 (LWP 14934)):
#0  0x00007f022c2d42d7 in mkdir () from /lib64/libc.so.6
#1  0x00007f022d8a9a38 in mkdir_p (path=<value optimized out>, mode=384, allow_symlinks=_gf_true) at common-utils.c:96
#2  0x00007f021b7e0a2b in changelog_init_options (this=0x7f021c008c90, priv=0x7f021c04d660) at changelog.c:2498
#3  0x00007f021b7e0d94 in init (this=0x7f021c008c90) at changelog.c:2632
#4  0x00007f022d888f82 in __xlator_init (xl=0x7f021c008c90) at xlator.c:397
#5  xlator_init (xl=0x7f021c008c90) at xlator.c:420
#6  0x00007f022d8c90c1 in glusterfs_graph_init (graph=<value optimized out>) at graph.c:319
#7  0x00007f022d8c91f7 in glusterfs_graph_activate (graph=<value optimized out>, ctx=0x7f022f7bb010) at graph.c:659
#8  0x00007f022dd4cd4b in glusterfs_process_volfp (ctx=0x7f022f7bb010, fp=0x7f021c0008e0) at glusterfsd.c:2174
#9  0x00007f022dd549e5 in mgmt_getspec_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f022b27c6d4) at glusterfsd-mgmt.c:1560
#10 0x00007f022d65ade5 in rpc_clnt_handle_reply (clnt=0x7f022f820e80, pollin=0x7f021c0016a0) at rpc-clnt.c:766
#11 0x00007f022d65c282 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x7f022f820eb0, event=<value optimized out>, data=<value optimized out>) at rpc-clnt.c:894
#12 0x00007f022d657928 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:543
#13 0x00007f02223bdc6d in socket_event_poll_in (this=0x7f022f8229d0) at socket.c:2290
#14 0x00007f02223bf79d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7f022f8229d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2403
#15 0x00007f022d8e7fb0 in event_dispatch_epoll_handler (data=0x7f022f823dc0) at event-epoll.c:572
#16 event_dispatch_epoll_worker (data=0x7f022f823dc0) at event-epoll.c:674
#17 0x00007f022c977a51 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f022c2e196d in clone () from /lib64/libc.so.6
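
For illustration, here is a minimal, hypothetical xlator init() in the pattern the comment above describes (struct my_priv and my_init_options are made-up names, not changelog's actual code): this->private is only published once init() completes, so while init() is still blocked inside option handling (compare changelog_init_options -> mkdir_p in frames #1 and #2 of thread 8), a NULL this->private is the expected state rather than evidence of a changelog bug.

/* Hypothetical translator init(), illustrating why this->private is NULL
 * while initialization is still in progress. Not the actual changelog.c. */
int32_t
init (xlator_t *this)
{
        struct my_priv *priv = NULL;
        int32_t         ret  = -1;

        priv = GF_CALLOC (1, sizeof (*priv), gf_common_mt_char);
        if (!priv)
                goto out;

        /* Option processing can do real work on disk (for changelog this is
         * where mkdir_p is called); until it returns, this->private stays
         * unset. */
        ret = my_init_options (this, priv);
        if (ret)
                goto out;

        this->private = priv;   /* published only after init succeeds */
        ret = 0;
out:
        if (ret)
                GF_FREE (priv);
        return ret;
}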

Comment 4 Nagaprasad Sathyanarayana 2015-07-02 06:38:35 UTC
*** Bug 1238535 has been marked as a duplicate of this bug. ***

Comment 6 Raghavendra G 2015-07-06 05:39:23 UTC
There is a crash uncovered by [1]. Note that this crash is a day-zero bug and not a regression caused by [1]. Moving this bug back to ASSIGNED.

http://blog.gmane.org/gmane.comp.file-systems.gluster.devel

[1] http://review.gluster.org/11490

Comment 7 Raghavendra G 2015-07-06 05:43:04 UTC
Bug 1239280 (https://bugzilla.redhat.com/show_bug.cgi?id=1239280) has already been filed for that crash. Hence moving this bug back to MODIFIED.

Comment 8 Shruti Sampat 2015-07-08 09:25:19 UTC
Verified as fixed in glusterfs-3.7.1-7.

Upgraded 3 nodes from 3.0.4 to 3.1 and did not observe the crash.

Comment 9 errata-xmlrpc 2015-07-29 05:08:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html