Bug 2128703
Summary: | [GSS] VDSM Problem while trying to mount target | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rafrojas <rafrojas>
Component: | core | Assignee: | Mohit Agrawal <moagrawa>
Status: | CLOSED DUPLICATE | QA Contact: |
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | rhgs-3.4 | CC: | moagrawa, rhs-bugs, sajmoham
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-09-23 05:12:43 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Rafrojas
2022-09-21 12:51:43 UTC
Hi,

As per the logs, this appears to be a known issue; it is most likely the same as bug https://bugzilla.redhat.com/show_bug.cgi?id=1917488. I can confirm after checking the coredump once the setup is available.

```
[2022-09-20 11:33:05.591614] E [MSGID: 133010] [shard.c:2299:shard_common_lookup_shards_cbk] 0-data-shard: Lookup on shard 1729 failed. Base file gfid = 98f326c2-6a81-48c1-81e5-d93b41edb543 [Stale file handle]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2022-09-20 11:33:05
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7f6df6b11bdd]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6df6b1c154]
/lib64/libc.so.6(+0x363f0)[0x7f6df514b3f0]
/lib64/libuuid.so.1(+0x2570)[0x7f6df6272570]
/lib64/libuuid.so.1(+0x2606)[0x7f6df6272606]
/lib64/libglusterfs.so.0(uuid_utoa+0x1c)[0x7f6df6b1b2ec]
```

This is a known issue, and we have already backported the patch in a downstream release (6.0.57). A FUSE process crashes due to a bug in write-behind while truncating a file. The patch is merged in the downstream build (glusterfs-fuse-6.0-57, from bug https://bugzilla.redhat.com/show_bug.cgi?id=1917488). Either the user has to upgrade to the latest downstream release, or we can suggest disabling write-behind to avoid the crash.

Thanks,
Mohit Agrawal

Hi,

Thanks for sharing the environment to debug the core. The client process is crashing because the shard xlator is trying to access an inode that has already been unlinked while shard reattempts cleanup during a remount. This is a known issue, and it is already fixed in release glusterfs-6.0.35 (https://bugzilla.redhat.com/show_bug.cgi?id=1836233).
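The write-behind workaround suggested in the first comment above is normally applied with the `gluster volume set` CLI; a sketch, assuming a volume named `data` (substitute the real volume name):

```shell
# Disable the write-behind performance translator on the affected volume
# ("data" is a placeholder volume name, not taken from this bug report).
gluster volume set data performance.write-behind off

# Confirm the option is now off.
gluster volume get data performance.write-behind
```

This avoids the write-behind truncate crash at some cost to write performance, so it is only a stopgap until the upgrade.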
```
(gdb) bt
#0  0x00007f916ca2e570 in uuid_unpack () from /lib64/libuuid.so.1
#1  0x00007f916ca2e606 in uuid_unparse_x () from /lib64/libuuid.so.1
#2  0x00007f916d2d72ec in gf_uuid_unparse (out=0x7f9130006cd0 "98f326c2-6a81-48c1-81e5-d93b41edb543", uuid=0x8 <Address 0x8 out of bounds>) at compat-uuid.h:57
#3  uuid_utoa (uuid=0x8 <Address 0x8 out of bounds>) at common-utils.c:2852
#4  0x00007f915e805596 in shard_post_lookup_shards_unlink_handler (frame=<optimized out>, this=0x7f915801e8d0) at shard.c:2915
#5  0x00007f915e803fa5 in shard_common_lookup_shards (frame=frame@entry=0x7f914801b598, this=this@entry=0x7f915801e8d0, inode=<optimized out>, handler=handler@entry=0x7f915e805540 <shard_post_lookup_shards_unlink_handler>) at shard.c:2458
#6  0x00007f915e80561c in shard_post_resolve_unlink_handler (frame=frame@entry=0x7f914801b598, this=this@entry=0x7f915801e8d0) at shard.c:2939
#7  0x00007f915e801b47 in shard_common_resolve_shards (frame=frame@entry=0x7f914801b598, this=this@entry=0x7f915801e8d0, post_res_handler=post_res_handler@entry=0x7f915e8055f0 <shard_post_resolve_unlink_handler>) at shard.c:1069
#8  0x00007f915e805721 in shard_regulated_shards_deletion (cleanup_frame=cleanup_frame@entry=0x7f914801b598, this=this@entry=0x7f915801e8d0, now=now@entry=100, first_block=first_block@entry=1701, entry=entry@entry=0x7f914c021c30) at shard.c:3178
#9  0x00007f915e805d84 in __shard_delete_shards_of_entry (cleanup_frame=cleanup_frame@entry=0x7f914801b598, this=this@entry=0x7f915801e8d0, entry=entry@entry=0x7f914c021c30, inode=inode@entry=0x7f914c00f888) at shard.c:3339
#10 0x00007f915e806196 in shard_delete_shards_of_entry (cleanup_frame=cleanup_frame@entry=0x7f914801b598, this=this@entry=0x7f915801e8d0, entry=entry@entry=0x7f914c021c30, inode=inode@entry=0x7f914c00f888) at shard.c:3395
#11 0x00007f915e80687f in shard_delete_shards (opaque=0x7f914801b598) at shard.c:3619
#12 0x00007f916d307840 in synctask_wrap () at syncop.c:375
#13 0x00007f916b919180 in ?? () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
(gdb) f 4
#4  0x00007f915e805596 in shard_post_lookup_shards_unlink_handler (frame=<optimized out>, this=0x7f915801e8d0) at shard.c:2915
2915                gf_msg (this->name, GF_LOG_ERROR, local->op_errno,
(gdb) l
2910        shard_local_t *local = NULL;
2911
2912        local = frame->local;
2913
2914        if ((local->op_ret < 0) && (local->op_errno != ENOENT)) {
2915                gf_msg (this->name, GF_LOG_ERROR, local->op_errno,
2916                        SHARD_MSG_FOP_FAILED, "failed to delete shards of %s",
2917                        uuid_utoa (local->resolver_base_inode->gfid));
2918                return 0;
2919        }
(gdb) p local->resolver_base_inode
$2 = (inode_t *) 0x0
(gdb) p local->resolver_base_inode->gfid
Cannot access memory at address 0x8
```

Can we ask them to upgrade the environment to avoid the crash? The earlier suggested workaround will not work in this case. They were also hitting the write-behind issue (its traceback was captured in the logs), but no coredump is available for it, so we should suggest upgrading the environment to a release at or after 6.0.57 to avoid both issues.

Thanks,
Mohit Agrawal

*** This bug has been marked as a duplicate of bug 1836233 ***