Bug 1273728 - Crash while bringing down the bricks and self heal
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.2
Assigned To: Joseph Elwin Fernandes
Keywords: ZStream
Depends On:
Blocks: 1260783 1260923
Reported: 2015-10-21 02:41 EDT by Bhaskarakiran
Modified: 2016-11-23 18:12 EST (History)
CC List: 8 users

See Also:
Fixed In Version: glusterfs-3.7.5-7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2016-03-01 00:43:55 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Bhaskarakiran 2015-10-21 02:41:34 EDT
Description of problem:

No core file was generated, but tier.log shows the crash:

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-10-21 06:12:00
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5

Steps performed:

Created a 1x(8+4) disperse volume and attached a replica-2 tier.
Started I/O (file creation and a Linux kernel untar).
Brought down tier bricks and EC bricks one at a time and triggered heal.
Checked tier status for promotions and demotions.
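The steps above can be sketched as gluster CLI commands. This is a dry-run sketch: `GLUSTER` echoes each command instead of executing it, so no glusterd is needed. The host names, volume name, and brick paths are hypothetical, and the exact `attach-tier` syntax may vary across 3.7.x releases.

```shell
#!/bin/bash
# Dry-run sketch of the reproduction steps. GLUSTER echoes each
# command instead of running it, so no glusterd is required.
# Host names and brick paths below are hypothetical.
GLUSTER="echo gluster"

# 1x(8+4) disperse (EC) volume: 12 bricks, redundancy 4.
$GLUSTER volume create ecvol disperse 12 redundancy 4 \
    server{1..12}:/bricks/ec/b1
$GLUSTER volume start ecvol

# Attach a replica-2 hot tier.
$GLUSTER volume attach-tier ecvol replica 2 \
    server1:/bricks/ssd/hot server2:/bricks/ssd/hot

# After killing a brick's glusterfsd and bringing it back,
# trigger a full heal and check promotion/demotion counters.
$GLUSTER volume heal ecvol full
$GLUSTER volume tier ecvol status
```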

Version-Release number of selected component (if applicable):
glusterfs 3.7.5 (per the package-string in the crash log above)

How reproducible:
Seen once 

Steps to Reproduce:
As in description

Actual results:
The process crashed with signal 11 (SIGSEGV); see the tier.log excerpt above.

Expected results:
No crash should be seen

Additional info:
Attaching the tier log file.
Comment 3 Joseph Elwin Fernandes 2015-11-24 06:19:20 EST
1) Tested the following but couldn't reproduce the crash:
   a) Created a volume with 1000 files already on it.
   b) Attached a hot tier and created another 1000 files.
[root@fedora1 test]# gluster vol info
Volume Name: test
Type: Tier
Volume ID: bb7a3b77-063d-4334-9e60-862ce4f90bd0
Status: Started
Number of Bricks: 10
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: fedora1:/home/ssd/small_brick3/s3
Brick2: fedora1:/home/ssd/small_brick2/s2
Brick3: fedora1:/home/ssd/small_brick1/s1
Brick4: fedora1:/home/ssd/small_brick0/s0
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick5: fedora1:/home/disk/d1
Brick6: fedora1:/home/disk/d2
Brick7: fedora1:/home/disk/d3
Brick8: fedora1:/home/disk/d4
Brick9: fedora1:/home/disk/d5
Brick10: fedora1:/home/disk/d6
Options Reconfigured:
diagnostics.brick-log-level: TRACE
cluster.self-heal-daemon: enable
cluster.disperse-self-heal-daemon: enable
cluster.tier-mode: test
features.record-counters: on
features.ctr-enabled: on
performance.readdir-ahead: on
[root@fedora1 test]# 

 c) During promotion and demotion, stopped and restarted the EC bricks.
    Did not find any crash.
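Step (c), cycling an EC brick while the tier daemon is promoting and demoting, can be sketched the same way. Again a dry-run (commands are echoed, not executed); the brick path is taken from the volume info above, and `start ... force` restarts only bricks that are down.

```shell
#!/bin/bash
# Dry-run sketch of step (c): stop one cold-tier (EC) brick while the
# tier daemon promotes/demotes, then restart it and trigger heal.
# Commands are echoed, not executed.
GLUSTER="echo gluster"

# Locate the brick's glusterfsd PID in the status output, then kill it.
$GLUSTER volume status test fedora1:/home/disk/d1
# kill -TERM <pid from the status table>   # brick goes offline

# "start ... force" restarts only the bricks that are down.
$GLUSTER volume start test force

# Trigger heal and watch promotions/demotions.
$GLUSTER volume heal test full
$GLUSTER volume tier test status
```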

2) The code path where this crash was previously seen has completely changed in this patch: https://code.engineering.redhat.com/gerrit/#/c/61006/
Similar crashes were seen previously, and the above fix is expected to address them.

Changing the status to ON_QA.
Comment 6 errata-xmlrpc 2016-03-01 00:43:55 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

