Bug 1545277 - Brick process crashed after upgrade from RHGS-3.3.1 async(7.4) to RHGS-3.4(7.5)
Summary: Brick process crashed after upgrade from RHGS-3.3.1 async(7.4) to RHGS-3.4(7.5)
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: RHGS 3.4.0
Assignee: Mohit Agrawal
QA Contact: Rajesh Madaka
Whiteboard: brick-multiplexing
: 1763865 (view as bug list)
Depends On:
Blocks: 1503137 1583937
TreeView+ depends on / blocked
Reported: 2018-02-14 14:21 UTC by Rajesh Madaka
Modified: 2019-11-04 09:00 UTC (History)
6 users (show)

Fixed In Version: glusterfs-3.12.2-8
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1583937 (view as bug list)
Last Closed: 2018-09-04 06:42:41 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:43:59 UTC

Description Rajesh Madaka 2018-02-14 14:21:13 UTC
Description of problem:

Brick process crashed after upgrade from RHGS-3.3.1 async(7.4)  to RHGS-3.4(7.5)

Version-Release number of selected component (if applicable):
RHGS version:
from version glusterfs-3.8.4-54.el7 to glusterfs-3.12.2-4.el7

OS version:
from RHEL 7.4 to RHEL7.5

How reproducible:

Tried once, Only one node faced this issue out of 5 nodes in 6 node cluster

Steps to Reproduce:

1. Create 6 RHEL-7.4 machines.
2. Install RHGS-3.3.1 async build on RHEL-7.4 machines.
3. Then add firewall-services(glusterfs, nfs, rpc-bind) to all the cluster servers
4. Then perform peer probe from one node to remaining all 5 servers.
5. Now all servers peer status is in connected state.
6. Create around 50 volumes which consisted of different topologies including two-way distributed-replica volumes, three way distributed-replica volumes, Arbitrated-replicate volumes, Distributed dispersed volumes.
7. Then mount 5 volumes to RHEL-7.4 client and 5 volumes to RHEL-7.5 client.
8. Kept 5 volumes in offline
9. Copy RHLE 7.5 repos and RHGS-3.4 repos into /etc/yum.repos.d
10. Stop glusterd, glusterfs, glusterfsd services of one node which is getting upgrade.
11. Then perform yum update of that particular node.
12. After upgrade, upgraded node all bricks went to offline.
13. Core file generated in '/' directory with name of 'core.6282'
14.below is core details
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/66/a1ad12474aef1b8a3aac8363ef99e4c06ca5ab
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfsd -s --volfile-id arbtr_10.'.
Program terminated with signal 11, Segmentation fault.
#0  server_inode_new (itable=0x0, gfid=gfid@entry=0x7f1824022070 "") at server-helpers.c:1314
1314	                return itable->root;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-18.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-11.el7.x86_64 libgcc-4.8.5-28.el7.x86_64 libselinux-2.5-12.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.16.0-16.el7.x86_64 zlib-1.2.7-17.el7.x86_64

15. bt details

#0  server_inode_new (itable=0x0, gfid=gfid@entry=0x7f1824022070 "") at server-helpers.c:1314
#1  0x00007f184cd1c13d in resolve_gfid (frame=frame@entry=0x7f182401fa30) at server-resolve.c:205
#2  0x00007f184cd1d038 in server_resolve_inode (frame=frame@entry=0x7f182401fa30)
    at server-resolve.c:418
a#3  0x00007f184cd1d2b0 in server_resolve (frame=0x7f182401fa30) at server-resolve.c:559
#4  0x00007f184cd1c88e in server_resolve_all (frame=frame@entry=0x7f182401fa30)
    at server-resolve.c:611
#5  0x00007f184cd1d344 in resolve_and_resume (frame=frame@entry=0x7f182401fa30, 
    fn=fn@entry=0x7f184cd2a910 <server_getxattr_resume>) at server-resolve.c:642
#6  0x00007f184cd3f638 in server3_3_getxattr (req=0x7f181c0132b0) at server-rpc-fops.c:5121
#7  0x00007f1861c9a246 in rpcsvc_request_handler (arg=0x7f1850040c90) at rpcsvc.c:1899
#8  0x00007f1860d37dd5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f1860600b3d in clone () from /lib64/libc.so.6


Note : Only one node faced this issue out of 5 nodes in 6 node cluster, for first 4 nodes didn't face this issue,in 5th node upgrade seen this issue,still one more node yet to upgrade 

Actual results:

    All bricks went to offline in upgraded node, and core found.

Expected results:

    All bricks should be in online , no cores should found

Additional info:

Comment 3 Rajesh Madaka 2018-02-15 10:53:50 UTC
copied brick logs and sosreport of upgraded in below path:


Comment 9 Rajesh Madaka 2018-05-08 10:55:34 UTC
I have followed the steps mentioned in above description, i have created same setup(6 node cluster) which is mentioned in desc. i didn't find any brick crashes and all bricks came to online after upgrade. No cores found in all cluster nodes.

Verified in below version:


Comment 11 errata-xmlrpc 2018-09-04 06:42:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 12 Pranith Kumar K 2019-11-04 09:00:32 UTC
*** Bug 1763865 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.