Bug 1583937 - Brick process crashed after upgrade from RHGS-3.3.1 async(7.4) to RHGS-3.4(7.5)
Summary: Brick process crashed after upgrade from RHGS-3.3.1 async(7.4) to RHGS-3.4(7.5)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: protocol
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard: brick-multiplexing
Duplicates: 1520374
Depends On: 1545277
Blocks: 1584633
Reported: 2018-05-30 03:50 UTC by Raghavendra G
Modified: 2018-10-23 15:10 UTC (History)

Fixed In Version: glusterfs-5.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1545277
Clones: 1584633
Environment:
Last Closed: 2018-10-23 15:10:12 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Comment 1 Raghavendra G 2018-05-30 03:52:23 UTC
Description of problem:
======================

Brick process crashed after upgrade from RHGS-3.3.1 async(7.4)  to RHGS-3.4(7.5)

Version-Release number of selected component (if applicable):
------------------------------------------------------------
RHGS version:
------------
from version glusterfs-3.8.4-54.el7 to glusterfs-3.12.2-4.el7

OS version:
----------
from RHEL 7.4 to RHEL7.5

How reproducible:
----------------

Tried once. Only one of the 5 nodes upgraded so far in the 6-node cluster hit this issue.

Steps to Reproduce:
------------------

1. Create 6 RHEL-7.4 machines.
2. Install RHGS-3.3.1 async build on RHEL-7.4 machines.
3. Add the firewall services (glusterfs, nfs, rpc-bind) on all the cluster servers.
4. Peer probe the remaining 5 servers from one node.
5. Confirm that all peers are in the connected state.
6. Create around 50 volumes of different topologies, including two-way distributed-replicate, three-way distributed-replicate, arbitrated-replicate, and distributed-dispersed volumes.
7. Mount 5 volumes on a RHEL-7.4 client and 5 volumes on a RHEL-7.5 client.
8. Keep 5 volumes offline.
9. Copy the RHEL 7.5 repos and RHGS-3.4 repos into /etc/yum.repos.d.
10. Stop the glusterd, glusterfs, and glusterfsd services on the node being upgraded.
11. Run yum update on that node.
12. After the upgrade, all bricks on the upgraded node went offline.
13. A core file named 'core.6282' was generated in the '/' directory.
14. Core details are below.
  
*************************************************************************
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/66/a1ad12474aef1b8a3aac8363ef99e4c06ca5ab
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfsd -s 10.70.37.208 --volfile-id arbtr_10.10.70.37.208.bricks-'.
Program terminated with signal 11, Segmentation fault.
#0  server_inode_new (itable=0x0, gfid=gfid@entry=0x7f1824022070 "") at server-helpers.c:1314
1314	                return itable->root;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-18.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-11.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.16.0-16.el7.x86_64 zlib-1.2.7-17.el7.x86_64

********************************************************************************
15. Backtrace (bt) details:

********************************************************************************
#0  server_inode_new (itable=0x0, gfid=gfid@entry=0x7f1824022070 "") at server-helpers.c:1314
#1  0x00007f184cd1c13d in resolve_gfid (frame=frame@entry=0x7f182401fa30) at server-resolve.c:205
#2  0x00007f184cd1d038 in server_resolve_inode (frame=frame@entry=0x7f182401fa30)
    at server-resolve.c:418
#3  0x00007f184cd1d2b0 in server_resolve (frame=0x7f182401fa30) at server-resolve.c:559
#4  0x00007f184cd1c88e in server_resolve_all (frame=frame@entry=0x7f182401fa30)
    at server-resolve.c:611
#5  0x00007f184cd1d344 in resolve_and_resume (frame=frame@entry=0x7f182401fa30, 
    fn=fn@entry=0x7f184cd2a910 <server_getxattr_resume>) at server-resolve.c:642
#6  0x00007f184cd3f638 in server3_3_getxattr (req=0x7f181c0132b0) at server-rpc-fops.c:5121
#7  0x00007f1861c9a246 in rpcsvc_request_handler (arg=0x7f1850040c90) at rpcsvc.c:1899
#8  0x00007f1860d37dd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1860600b3d in clone () from /lib64/libc.so.6

********************************************************************************

Note: Only one node hit this issue. The first 4 nodes upgraded without problems, the crash was seen while upgrading the 5th node, and one more node in the 6-node cluster is yet to be upgraded.


Actual results:

    All bricks went offline on the upgraded node, and a core file was found.

Expected results:

    All bricks should be online and no core files should be generated.

Comment 2 Raghavendra G 2018-05-30 03:55:47 UTC
A getxattr fop was received before SETVOLUME completed. While handling a CONNECT event, protocol/client sets conf->connected=1 before SETVOLUME is complete. This means non-handshake fops can reach the brick and try to access the brick stack before it is initialized. Since only a successful SETVOLUME guarantees that the brick stack is initialized and ready to be consumed, getxattr ends up accessing a still-to-be-initialized brick stack, resulting in a crash.
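
To make the failure in frame #0 concrete, here is a minimal sketch of the condition described above. It is not GlusterFS source: inode_t, inode_table_t, brick_conn_t and all three functions are hypothetical stand-ins. The only details taken from the backtrace are that the table pointer was NULL (itable=0x0) and that the faulting statement dereferenced it to return the root inode. The checked variant is shown only for contrast and is not the fix that was merged for this bug.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the brick's inode types; not GlusterFS source. */
typedef struct inode { int placeholder; } inode_t;
typedef struct inode_table { inode_t *root; } inode_table_t;

/* Models per-connection brick state: the table pointer stays NULL until
 * the SETVOLUME handshake binds the client to an initialized brick stack. */
typedef struct brick_conn { inode_table_t *itable; } brick_conn_t;

/* Assumed name: called once SETVOLUME has succeeded for this connection. */
void brick_setvolume_done(brick_conn_t *conn, inode_table_t *table)
{
    conn->itable = table;
}

/* Same shape as the faulting line: dereferences itable unconditionally,
 * so calling it with itable == NULL is undefined behavior (SIGSEGV here). */
inode_t *root_inode_unchecked(inode_table_t *itable)
{
    return itable->root;
}

/* Defensive variant, for contrast only: reports "not ready" as NULL
 * instead of dereferencing an unbound table. */
inode_t *root_inode_checked(inode_table_t *itable)
{
    return itable ? itable->root : NULL;
}
```

A fop that arrives before brick_setvolume_done() has run sees conn->itable == NULL, which matches the itable=0x0 argument in the core.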

Comment 3 Worker Ant 2018-05-30 04:10:31 UTC
REVIEW: https://review.gluster.org/20101 (protocol/client: Don't send fops till SETVOLUME is complete) posted (#1) for review on master by Raghavendra G

Comment 4 Worker Ant 2018-05-31 01:53:02 UTC
COMMIT: https://review.gluster.org/20101 committed in master by "Raghavendra G" <rgowdapp@redhat.com> with a commit message- protocol/client: Don't send fops till SETVOLUME is complete

An earlier commit set conf->connected just after rpc layer sends
RPC_CLNT_CONNECT event. However, success of socket level
connection doesn't indicate brick stack is ready to receive fops, as
a handshake has to be done b/w client and server after
RPC_CLNT_CONNECT event. Any fop sent to brick in the window between,
* protocol/client receiving RPC_CLNT_CONNECT event
* protocol/client receiving a successful setvolume response

can end up accessing an uninitialized brick stack. So, set
conf->connected only after a successful SETVOLUME.

Change-Id: I139a03d2da6b0d95a0d68391fcf54b00e749decf
fixes: bz#1583937
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
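
The window described in the commit message can be modeled with a small state machine. This is a sketch under stated assumptions, not the patch itself: client_conf_t, the event and policy enums, client_notify() and can_send_fop() are hypothetical names, and only the `connected` field mirrors conf->connected from the commit message. MARK_ON_CONNECT models the earlier behavior and MARK_ON_SETVOLUME models the behavior after this change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event names; the first two model RPC_CLNT_CONNECT and a
 * successful setvolume response. */
typedef enum { EV_RPC_CONNECT, EV_SETVOLUME_OK, EV_DISCONNECT } client_event_t;

/* Which event is allowed to mark the client as connected. */
typedef enum { MARK_ON_CONNECT, MARK_ON_SETVOLUME } mark_policy_t;

typedef struct {
    mark_policy_t policy;
    bool connected; /* models conf->connected */
} client_conf_t;

/* Update the connected flag according to the chosen policy. */
void client_notify(client_conf_t *conf, client_event_t ev)
{
    switch (ev) {
    case EV_RPC_CONNECT:
        /* Earlier behavior: socket-level connect alone sets the flag. */
        if (conf->policy == MARK_ON_CONNECT)
            conf->connected = true;
        break;
    case EV_SETVOLUME_OK:
        /* Handshake finished: the brick stack is ready under either policy. */
        conf->connected = true;
        break;
    case EV_DISCONNECT:
        conf->connected = false;
        break;
    }
}

/* A non-handshake fop is sent to the brick only while the flag is set. */
bool can_send_fop(const client_conf_t *conf)
{
    return conf->connected;
}
```

With MARK_ON_CONNECT, can_send_fop() already returns true between EV_RPC_CONNECT and EV_SETVOLUME_OK, which is the window in which the getxattr reached the brick. With MARK_ON_SETVOLUME it returns false until the handshake succeeds.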

Comment 5 Raghavendra G 2018-06-15 06:19:31 UTC
*** Bug 1520374 has been marked as a duplicate of this bug. ***

Comment 6 Shyamsundar 2018-10-23 15:10:12 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

