Bugzilla will be upgraded to version 5.0. The upgrade date is tentatively scheduled for 2 December 2018, pending final testing and feedback.
Bug 1385606 - 4 of 8 bricks (2 dht subvols) crashed on systemic setup
4 of 8 bricks (2 dht subvols) crashed on systemic setup
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core (Show other bugs)
3.2
Unspecified Unspecified
unspecified Severity urgent
: ---
: RHGS 3.2.0
Assigned To: Pranith Kumar K
nchilaka
:
: 1397667 (view as bug list)
Depends On: 1409472 1410313
Blocks: 1351528 1386097 1396778 1399196
  Show dependency treegraph
 
Reported: 2016-10-17 08:09 EDT by nchilaka
Modified: 2017-03-23 02:11 EDT (History)
7 users (show)

See Also:
Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1386097 (view as bug list)
Environment:
Last Closed: 2017-03-23 02:11:30 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 05:18:45 EDT

  None (edit)
Description nchilaka 2016-10-17 08:09:26 EDT
Description of problem:
=========================
(looks like an eventing bug)
In my systemic testing (progress and what I do can be found at https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609)

I saw that 4 of the 8 bricks in a distrepvol crashed simultaneously.
The four bricks are part of 2 dht subvols (complete dht subvol 1 and 2 )


(IO : for more information on exact IO refer to the work-sheet "IOs" in the google doc shared)
On the server side the below actions were running(in screen sessions):
1) heal info --xml ====>for last 4 days for which I am waiting for o/p (refer BZ#1382686)
2) healing is going on from about 3 hours back as bricks were brought online after about a week
3)snapshot is scheduled every 1 hour

Client side IO:
=============
from 4 clients : lookups using ls -lRt
for the same 4 clients: symlinks were being created for same target directories
From another 4 clients: 2 clients are creating same directory structure , while other 2 are renaming directories
from another 2 client2: append to same file




Backtrace using gdb:

(gdb) bt
#0  0x00007fe1e15cbab4 in vfprintf () from /lib64/libc.so.6
#1  0x00007fe1e168ee25 in __vsnprintf_chk () from /lib64/libc.so.6
#2  0x00007fe1e2ef9598 in vsnprintf (__ap=0x7fe152cf8a70, __fmt=<optimized out>, __n=0, __s=0x0)
    at /usr/include/bits/stdio2.h:77
#3  gf_vasprintf (string_ptr=string_ptr@entry=0x7fe152cf8b78, 
    format=format@entry=0x7fe1d52c01b0 "op=%s;path=%s;error=%s;brick=%s:%s", 
    arg=arg@entry=0x7fe152cf8b90) at mem-pool.c:219
#4  0x00007fe1e2f482da in gf_event (event=event@entry=EVENT_POSIX_HEALTH_CHECK_FAILED, 
    fmt=fmt@entry=0x7fe1d52c01b0 "op=%s;path=%s;error=%s;brick=%s:%s") at events.c:84
#5  0x00007fe1d52b7660 in posix_fs_health_check (this=this@entry=0x7fe1d0006dd0)
    at posix-helpers.c:1795
#6  0x00007fe1d52b77e4 in posix_health_check_thread_proc (data=0x7fe1d0006dd0)
    at posix-helpers.c:1833
#7  0x00007fe1e1d34dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fe1e1679ced in clone () from /lib64/libc.so.6



Probable Root cause(based on initial findings by Raghavendra Gowdappa)
===================================================================
op_errno is supposed to be an integer but is being assigned a string


1794	                        "%s() on %s returned", op, file_path);
1795	                gf_event (EVENT_POSIX_HEALTH_CHECK_FAILED,
1796	                          "op=%s;path=%s;error=%s;brick=%s:%s", op, file_path,
1797	                          op_errno, priv->hostname, priv->base_path);
1798	        }




Version-Release number of selected component (if applicable):
=================================
[root@dhcp37-150 bricks]# rpm -qa|grep gluster
glusterfs-libs-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-1.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-3.8.4-1.el7rhgs.x86_64
glusterfs-api-3.8.4-1.el7rhgs.x86_64
glusterfs-cli-3.8.4-1.el7rhgs.x86_64
glusterfs-events-3.8.4-1.el7rhgs.x86_64
glusterfs-rdma-3.8.4-1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-1.el7rhgs.x86_64
glusterfs-server-3.8.4-1.el7rhgs.x86_64
python-gluster-3.8.4-1.el7rhgs.noarch
glusterfs-devel-3.8.4-1.el7rhgs.x86_64


[root@dhcp35-191 bricks]# file /core.17523 
/core.17523: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfsd -s 10.70.35.191 --volfile-id distrepvol.10.70.35.191.rhs-b'
[root@dhcp35-191 bricks]# 


Brick logs:
==============



tat on parent /rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output error]
[2016-10-17 11:26:48.124026] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink [Input/output error]
[2016-10-17 11:26:48.124037] E [MSGID: 113018] [posix.c:237:posix_lookup] 0-distrepvol-posix: post-operation lstat on parent /rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output error]
[2016-10-17 11:26:48.124051] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-distrepvol-server: 2701980: LOOKUP /rootdir1/symlink/file.559585 (603542a5-8221-4bde-8869-09f0167ecb80/file.559585) ==> (Input/output error) [Input/output error]
[2016-10-17 11:26:48.124075] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-distrepvol-server: 2653096: LOOKUP /rootdir1/symlink/file.561231 (603542a5-8221-4bde-8869-09f0167ecb80/file.561231) ==> (Input/output error) [Input/output error]
[2016-10-17 11:26:48.124326] W [MSGID: 113075] [posix-helpers.c:1794:posix_fs_health_check] 0-distrepvol-posix: open() on /rhs/brick1/distrepvol/.glusterfs/health_check returned [Input/output error]
[2016-10-17 11:26:48.124523] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink/file.520670 [Input/output error]
[2016-10-17 11:26:48.124549] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink/file.520670 [Input/output error]
[2016-10-17 11:26:48.124550] W [MSGID: 113018] [posix.c:199:posix_lookup] 0-distrepvol-posix: lstat on /rhs/brick1/distrepvol/rootdir1/symlink/file.520670 failed [Input/output error]
[2016-10-17 11:26:48.124568] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink [Input/output error]
[2016-10-17 11:26:48.124593] E [MSGID: 113018] [posix.c:237:posix_lookup] 0-distrepvol-posix: post-operation lstat on parent /rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output error]
pending frames:
frame : type(0) op(27)
frame : type(0) op(27)
frame : type(0) op(27)
patchset: git://git.gluster.com/glusterfs.git
[2016-10-17 11:26:48.124625] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-distrepvol-server: 2757018: LOOKUP /rootdir1/symlink/file.520670 (603542a5-8221-4bde-8869-09f0167ecb80/file.520670) ==> (Input/output error) [Input/output error]
[2016-10-17 11:26:48.124978] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink/file.561236 [Input/output error]
[2016-10-17 11:26:48.125012] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink/file.561236 [Input/output error]
[2016-10-17 11:26:48.125025] W [MSGID: 113018] [posix.c:199:posix_lookup] 0-distrepvol-posix: lstat on /rhs/brick1/distrepvol/rootdir1/symlink/file.561236 failed [Input/output error]
[2016-10-17 11:26:48.125035] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink [Input/output error]
[2016-10-17 11:26:48.125042] E [MSGID: 113018] [posix.c:237:posix_lookup] 0-distrepvol-posix: post-operation lstat on parent /rhs/brick1/distrepvol/rootdir1/symlink failed [Input/output error]
[2016-10-17 11:26:48.125057] W [MSGID: 113018] [posix-helpers.c:667:posix_pstat] 0-distrepvol-posix: lstat failed on /rhs/brick1/distrepvol/rootdir1/symlink/file.559585 [Input/output error]
signal received: 11
time of crash: 
2016-10-17 11:26:48
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f9e28ae3832]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f9e28aed2c4]
/lib64/libc.so.6(+0x35670)[0x7f9e271c8670]
/lib64/libc.so.6(_IO_vfprintf+0x1564)[0x7f9e271dbab4]
/lib64/libc.so.6(__vsnprintf_chk+0x95)[0x7f9e2729ee25]
/lib64/libglusterfs.so.0(gf_vasprintf+0x68)[0x7f9e28b09598]
/lib64/libglusterfs.so.0(gf_event+0x1aa)[0x7f9e28b582da]
/usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so(+0x29660)[0x7f9e1aec7660]
/usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so(+0x297e4)[0x7f9e1aec77e4]
/lib64/libpthread.so.0(+0x7dc5)[0x7f9e27944dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f9e27289ced]
---------
Comment 2 Raghavendra G 2016-10-18 03:30:40 EDT
Just separating out Nag's comment on RCA in comment 1

Probable Root cause(based on initial findings by Raghavendra Gowdappa)
===================================================================
op_errno is supposed to be an integer but is being assigned a string


1794	                        "%s() on %s returned", op, file_path);
1795	                gf_event (EVENT_POSIX_HEALTH_CHECK_FAILED,
1796	                          "op=%s;path=%s;error=%s;brick=%s:%s", op, file_path,
1797	                          op_errno, priv->hostname, priv->base_path);
1798	        }
Comment 3 Atin Mukherjee 2016-10-18 03:47:40 EDT
upstream patch http://review.gluster.org/#/c/15667 posted for review.
Comment 8 Pranith Kumar K 2016-11-18 05:00:28 EST
https://code.engineering.redhat.com/gerrit/90550
Comment 9 Ashish Pandey 2016-11-23 02:11:47 EST
*** Bug 1397667 has been marked as a duplicate of this bug. ***
Comment 10 Atin Mukherjee 2016-11-23 02:46:49 EST
*** Bug 1397681 has been marked as a duplicate of this bug. ***
Comment 12 nchilaka 2017-01-02 01:56:56 EST
on_qa validation blocked due to 1409472 - brick crashed on systemic setup
Comment 14 nchilaka 2017-01-03 03:20:01 EST
I am blocked with validation of this bug due to 1409472	brick crashed on systemic setup
Comment 15 nchilaka 2017-02-23 08:08:39 EST
Reran my systemic testing on 3.8.4-14
Not hit any brick crash for 4 days, hence moving to verified
Comment 17 errata-xmlrpc 2017-03-23 02:11:30 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Note You need to log in before you can comment on or make changes to this bug.