1806838 – Disperse volume : Ganesha crash with IO in 4+2 config when one glusterfsd restart every 600s

Bug 1806838 - Disperse volume : Ganesha crash with IO in 4+2 config when one glusterfsd restart every 600s

Summary: Disperse volume : Ganesha crash with IO in 4+2 config when one glusterfsd res...

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	disperse
Sub Component:
Version:	6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1729772 1806843
Blocks:	1805052 1806846
TreeView+	depends on / blocked

Reported:	2020-02-25 06:35 UTC by Pranith Kumar K
Modified:	2020-03-03 14:07 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Clone Of:	1729772
Environment:
Last Closed:	2020-02-26 11:09:40 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Gluster.org Gerrit	24168	0	None	Merged	cluster/ec: skip updating ctx->loc again when ec_fix_open/opendir	2020-02-26 11:09:39 UTC

Description Pranith Kumar K 2020-02-25 06:35:43 UTC

+++ This bug was initially created as a clone of Bug #1729772 +++

Description of problem:

LTP ftestxx tests at a 4+2 disperse volume.
When running the test at nfs client, a bash scripts running which reboot one node(the cluster node Ganesha.nfsd is not running on) every 600s.

ganesha.nfsd crash when healing name,

Core was generated by `/usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N N'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f0d5ae8c5a9 in ec_heal_name (frame=0x7f0d57c6ca28, 
    ec=0x7f0d5b62d280, parent=0x0, name=0x7f0d57537d31 "b", 
    participants=0x7f0d0dfffe30 "\001\001\001") at ec-heal.c:1685
1685	    loc.inode = inode_new(parent->table);
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 dbus-libs-1.10.24-12.el7.x86_64 elfutils-libelf-0.172-2.el7.x86_64 elfutils-libs-0.172-2.el7.x86_64 glibc-2.17-260.el7.x86_64 gssproxy-0.7.0-21.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-34.el7.x86_64 libacl-2.2.51-14.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-59.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libnfsidmap-0.25-19.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-59.el7.x86_64 lz4-1.7.5-2.el7.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007f0d5ae8c5a9 in ec_heal_name (frame=0x7f0d57c6ca28, 
    ec=0x7f0d5b62d280, parent=0x0, name=0x7f0d57537d31 "b", 
    participants=0x7f0d0dfffe30 "\001\001\001") at ec-heal.c:1685
#1  0x00007f0d5ae93cae in ec_heal_do (this=0x7f0d5b65ac00, 
    data=0x7f0d24e3c028, loc=0x7f0d24e3c358, partial=0) at ec-heal.c:3050
#2  0x00007f0d5ae94455 in ec_synctask_heal_wrap (opaque=0x7f0d24e3c028)
    at ec-heal.c:3139
#3  0x00007f0d6d1268c9 in synctask_wrap () at syncop.c:369
#4  0x00007f0d6c6bf010 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) frame 1
#1  0x00007f0d5ae93cae in ec_heal_do (this=0x7f0d5b65ac00, 
    data=0x7f0d24e3c028, loc=0x7f0d24e3c358, partial=0) at ec-heal.c:3050
3050	        ret = ec_heal_name(frame, ec, loc->parent, (char *)loc->name,
(gdb) p loc
$1 = (loc_t *) 0x7f0d24e3c358
(gdb) p *loc
$2 = {
  path = 0x7f0d57537d00 "/nfsshare/ltp-eZQlnozjnX/ftegVRmbT/ftest05.20436/b", 
  name = 0x7f0d57537d31 "b", inode = 0x7f0d24255b28, parent = 0x0, 
  gfid = "\263\341\223\031\301\245I\260\234\334\017\to%\305^", 
  pargfid = '\000' <repeats 15 times>}

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from Worker Ant on 2019-07-14 13:03:24 UTC ---

REVIEW: https://review.gluster.org/23029 (cluster/ec: do loc_copy from ctx->loc in fd->lock) posted (#2) for review on master by Kinglong Mee

--- Additional comment from Worker Ant on 2019-07-17 16:24:24 UTC ---

REVIEW: https://review.gluster.org/23029 (cluster/ec: skip updating ctx->loc again when ec_fix_open/opendir) merged (#4) on master by Kinglong Mee

Comment 1 Worker Ant 2020-02-25 06:38:49 UTC

REVIEW: https://review.gluster.org/24168 (cluster/ec: skip updating ctx->loc again when ec_fix_open/opendir) posted (#1) for review on release-6 by Pranith Kumar Karampuri

Comment 2 Worker Ant 2020-02-26 11:09:40 UTC

REVIEW: https://review.gluster.org/24168 (cluster/ec: skip updating ctx->loc again when ec_fix_open/opendir) merged (#2) on release-6 by hari gowtham

Note You need to log in before you can comment on or make changes to this bug.