Bug 1545277

Summary: Brick process crashed after upgrade from RHGS-3.3.1 async(7.4) to RHGS-3.4(7.5)
Product: Red Hat Gluster Storage Reporter: Rajesh Madaka <rmadaka>
Component: core Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA QA Contact: Rajesh Madaka <rmadaka>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.4 CC: amukherj, rhinduja, rhs-bugs, sasundar, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: brick-multiplexing
Fixed In Version: glusterfs-3.12.2-8 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1583937 Environment:
Last Closed: 2018-09-04 06:42:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 1503137, 1583937    

Description Rajesh Madaka 2018-02-14 14:21:13 UTC
Description of problem:
======================

The brick process crashed after an upgrade from RHGS-3.3.1 async (RHEL 7.4) to RHGS-3.4 (RHEL 7.5).

Version-Release number of selected component (if applicable):
------------------------------------------------------------
RHGS version:
------------
from version glusterfs-3.8.4-54.el7 to glusterfs-3.12.2-4.el7

OS version:
----------
from RHEL 7.4 to RHEL 7.5

How reproducible:
----------------

Tried once. Only one of the 5 nodes upgraded so far in the 6-node cluster hit this issue.

Steps to Reproduce:
------------------

1. Create 6 RHEL-7.4 machines.
2. Install the RHGS-3.3.1 async build on the RHEL-7.4 machines.
3. Add the firewall services (glusterfs, nfs, rpc-bind) on all the cluster servers.
4. Perform a peer probe from one node to the remaining 5 servers.
5. Confirm that the peer status of all servers is connected.
6. Create around 50 volumes with different topologies, including two-way distributed-replicate, three-way distributed-replicate, arbitrated-replicate, and distributed-dispersed volumes.
7. Mount 5 volumes on a RHEL-7.4 client and 5 volumes on a RHEL-7.5 client.
8. Keep 5 volumes offline.
9. Copy the RHEL 7.5 repos and RHGS-3.4 repos into /etc/yum.repos.d.
10. Stop the glusterd, glusterfs, and glusterfsd services on the node that is being upgraded.
11. Perform a yum update on that node.
12. After the upgrade, all bricks on the upgraded node went offline.
13. A core file named 'core.6282' was generated in the '/' directory.
14. Core details are below:
  
*************************************************************************
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/66/a1ad12474aef1b8a3aac8363ef99e4c06ca5ab
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfsd -s 10.70.37.208 --volfile-id arbtr_10.10.70.37.208.bricks-'.
Program terminated with signal 11, Segmentation fault.
#0  server_inode_new (itable=0x0, gfid=gfid@entry=0x7f1824022070 "") at server-helpers.c:1314
1314	                return itable->root;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-18.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-11.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.16.0-16.el7.x86_64 zlib-1.2.7-17.el7.x86_64

********************************************************************************
15. Backtrace (bt) details:

********************************************************************************
#0  server_inode_new (itable=0x0, gfid=gfid@entry=0x7f1824022070 "") at server-helpers.c:1314
#1  0x00007f184cd1c13d in resolve_gfid (frame=frame@entry=0x7f182401fa30) at server-resolve.c:205
#2  0x00007f184cd1d038 in server_resolve_inode (frame=frame@entry=0x7f182401fa30)
    at server-resolve.c:418
#3  0x00007f184cd1d2b0 in server_resolve (frame=0x7f182401fa30) at server-resolve.c:559
#4  0x00007f184cd1c88e in server_resolve_all (frame=frame@entry=0x7f182401fa30)
    at server-resolve.c:611
#5  0x00007f184cd1d344 in resolve_and_resume (frame=frame@entry=0x7f182401fa30, 
    fn=fn@entry=0x7f184cd2a910 <server_getxattr_resume>) at server-resolve.c:642
#6  0x00007f184cd3f638 in server3_3_getxattr (req=0x7f181c0132b0) at server-rpc-fops.c:5121
#7  0x00007f1861c9a246 in rpcsvc_request_handler (arg=0x7f1850040c90) at rpcsvc.c:1899
#8  0x00007f1860d37dd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1860600b3d in clone () from /lib64/libc.so.6

********************************************************************************
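
Frame #0 of the backtrace above shows `server_inode_new` called with `itable=0x0` and crashing on `return itable->root;`, a NULL pointer dereference. The following is a minimal C sketch of that failure mode and of a defensive guard. The types and the names `is_root_gfid` and `server_inode_lookup_guarded` are simplified, hypothetical stand-ins and not the real GlusterFS definitions. It is also not the fix shipped in glusterfs-3.12.2-8, which this report does not describe.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the GlusterFS inode types. */
typedef struct inode { int id; } inode_t;
typedef struct inode_table { inode_t *root; } inode_table_t;

/* Assumption: the root GFID is 15 zero bytes followed by 0x01. */
static int is_root_gfid(const unsigned char gfid[16])
{
    static const unsigned char root[16] =
        {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1};
    return memcmp(gfid, root, sizeof(root)) == 0;
}

/*
 * Guarded lookup: returns NULL instead of dereferencing a NULL table.
 * The crashing line in the backtrace reads itable->root with no such check.
 */
static inode_t *server_inode_lookup_guarded(inode_table_t *itable,
                                            const unsigned char gfid[16])
{
    if (itable == NULL)
        return NULL; /* caller should fail the request rather than crash */
    if (is_root_gfid(gfid))
        return itable->root;
    return NULL; /* allocation of non-root inodes is omitted in this sketch */
}
```

With a guard like this, a request that arrives before the inode table exists would be rejected by the caller instead of terminating the brick process with signal 11. Why `itable` was NULL in the first place still needs to be diagnosed separately.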

Note: Only one of the 5 nodes upgraded so far in the 6-node cluster hit this issue. The first 4 nodes upgraded without it, and the crash appeared during the upgrade of the 5th node. One more node is yet to be upgraded.


Actual results:

    All bricks went offline on the upgraded node, and a core file was found.

Expected results:

    All bricks should be online, and no core files should be found.


Additional info:

Comment 3 Rajesh Madaka 2018-02-15 10:53:50 UTC
Copied the brick logs and sosreport of the upgraded node to the path below:

qe@rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/rajesh/1545277

Comment 9 Rajesh Madaka 2018-05-08 10:55:34 UTC
I followed the steps in the description above and created the same setup (a 6-node cluster). I did not see any brick crashes, and all bricks came online after the upgrade. No cores were found on any cluster node.

Verified in the version below:

glusterfs-server-3.12.2-8

Comment 11 errata-xmlrpc 2018-09-04 06:42:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607