Bug 1601331 - dht: Crash seen in thread dht_dir_attr_heal
Summary: dht: Crash seen in thread dht_dir_attr_heal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Nithya Balachandran
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On:
Blocks: 1503137 1602866
 
Reported: 2018-07-16 06:18 UTC by Kotresh HR
Modified: 2018-09-18 09:06 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.12.2-15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1602866 (view as bug list)
Environment:
Last Closed: 2018-09-04 06:50:24 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 0 None None None 2018-09-04 06:51:58 UTC

Description Kotresh HR 2018-07-16 06:18:05 UTC
Description of problem:

While testing two of my geo-rep patches [1] and [2], I saw the geo-rep mount process crash. I was running the upstream geo-rep regression test suite, modified to run on a replica 3 (6x3) volume. The geo-rep client process crashed as below. Note that the geo-rep mounts are aux-gfid mounts.

I looked into the backtrace: the crash happens during the dht directory attr heal, and the gfid is null in both loc and loc->inode. I don't have much context on afr/dht, so I could not debug it further.

(gdb) 
#0  0x00007f260e71b765 in raise () from /lib64/libc.so.6
#1  0x00007f260e71d36a in abort () from /lib64/libc.so.6
#2  0x00007f260e713f97 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f260e714042 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f2602149ec2 in client_pre_inodelk (this=0x7f25fc00ef20, req=0x7f25e4217670, loc=0x7f25e400a298, cmd=6, flock=0x7f25e400a4b8, volume=0x7f25fc0132b0 "slave-replicate-1", xdata=0x0)
    at client-common.c:841
#5  0x00007f2602138b24 in client3_3_inodelk (frame=0x7f25e4015290, this=0x7f25fc00ef20, data=0x7f25e4217760) at client-rpc-fops.c:5307
#6  0x00007f260210d9d9 in client_inodelk (frame=0x7f25e4015290, this=0x7f25fc00ef20, volume=0x7f25fc0132b0 "slave-replicate-1", loc=0x7f25e400a298, cmd=6, lock=0x7f25e400a4b8, xdata=0x0)
    at client.c:1679
#7  0x00007f2601ea4444 in afr_nonblocking_inodelk (frame=0x7f25e400f680, this=0x7f25fc015230) at afr-lk-common.c:1093
#8  0x00007f2601e9d149 in afr_lock (frame=0x7f25e400f680, this=0x7f25fc015230) at afr-transaction.c:1652
#9  0x00007f2601e9eb84 in afr_transaction_start (local=0x7f25e4009e60, this=0x7f25fc015230) at afr-transaction.c:2333
#10 0x00007f2601e9eec0 in afr_transaction (frame=0x7f25e400f680, this=0x7f25fc015230, type=AFR_METADATA_TRANSACTION) at afr-transaction.c:2402
#11 0x00007f2601e875d7 in afr_setattr (frame=0x7f25e400ece0, this=0x7f25fc015230, loc=0x7f25e4008e58, buf=0x7f25e4008f58, valid=7, xdata=0x0) at afr-inode-write.c:895
#12 0x00007f261011681d in syncop_setattr (subvol=0x7f25fc015230, loc=0x7f25e4008e58, iatt=0x7f25e4008f58, valid=7, preop=0x0, postop=0x0, xdata_in=0x0, xdata_out=0x0) at syncop.c:1811
#13 0x00007f2601bc0448 in dht_dir_attr_heal (data=0x7f25e4007c60) at dht-selfheal.c:2497
#14 0x00007f261010f894 in synctask_wrap () at syncop.c:375
#15 0x00007f260e72fb60 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) f 4
#4  0x00007f2602149ec2 in client_pre_inodelk (this=0x7f25fc00ef20, req=0x7f25e4217670, loc=0x7f25e400a298, cmd=6, flock=0x7f25e400a4b8, volume=0x7f25fc0132b0 "slave-replicate-1", xdata=0x0)
    at client-common.c:841
841	        GF_ASSERT_AND_GOTO_WITH_ERROR (this->name,
(gdb) p *loc
$1 = {path = 0x7f25e40102f0 "/.gfid/00000000-0000-0000-0000-", '0' <repeats 11 times>, "1/rsnapshot_symlinkbug", name = 0x7f25e401031c "rsnapshot_symlinkbug", inode = 0x7f25ec030d30, 
  parent = 0x7f25fc078870, gfid = '\000' <repeats 15 times>, pargfid = '\000' <repeats 15 times>, "\001"}
(gdb) p *loc->inode
$2 = {table = 0x7f25fc078770, gfid = '\000' <repeats 15 times>, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
        __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, nlookup = 0, fd_count = 0, active_fd_count = 0, ref = 3, 
  ia_type = IA_INVAL, fd_list = {next = 0x7f25ec030d88, prev = 0x7f25ec030d88}, dentry_list = {next = 0x7f25ec030d98, prev = 0x7f25ec030d98}, hash = {next = 0x7f25ec030da8, 
    prev = 0x7f25ec030da8}, list = {next = 0x7f25ec03aa98, prev = 0x7f25fc0787d0}, _ctx = 0x7f25ec032580}
(gdb) 
$3 = {table = 0x7f25fc078770, gfid = '\000' <repeats 15 times>, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
        __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, nlookup = 0, fd_count = 0, active_fd_count = 0, ref = 3, 
  ia_type = IA_INVAL, fd_list = {next = 0x7f25ec030d88, prev = 0x7f25ec030d88}, dentry_list = {next = 0x7f25ec030d98, prev = 0x7f25ec030d98}, hash = {next = 0x7f25ec030da8, 
    prev = 0x7f25ec030da8}, list = {next = 0x7f25ec03aa98, prev = 0x7f25fc0787d0}, _ctx = 0x7f25ec032580}
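
The abort itself comes from the GF_ASSERT_AND_GOTO_WITH_ERROR at client-common.c:841: client_pre_inodelk needs a non-null gfid, taken from either loc->inode->gfid or loc->gfid, before it can wind the inodelk, and in this core both are all zeros. The following is a minimal, self-contained sketch of that kind of null-gfid guard; the types and names are hypothetical and it uses libuuid rather than gluster internals, so it only illustrates the check and is not the actual gluster code.

/* Hypothetical sketch of the null-gfid guard that trips here.
 * Build with: gcc guard.c -luuid */
#include <errno.h>
#include <stdio.h>
#include <uuid/uuid.h>

struct fake_inode { uuid_t gfid; };
struct fake_loc   { uuid_t gfid; struct fake_inode *inode; };

static int
pre_inodelk_check (const struct fake_loc *loc)
{
        /* A usable gfid must come from loc->gfid or loc->inode->gfid. */
        if (!uuid_is_null (loc->gfid))
                return 0;
        if (loc->inode && !uuid_is_null (loc->inode->gfid))
                return 0;
        return -EINVAL;   /* fail the fop instead of crashing */
}

int
main (void)
{
        struct fake_inode inode;
        struct fake_loc loc = { .inode = &inode };

        uuid_clear (loc.gfid);
        uuid_clear (inode.gfid);      /* the state seen in this core dump */
        printf ("all-zero gfids -> %d\n", pre_inodelk_check (&loc));

        uuid_generate (inode.gfid);   /* what a healthy looked-up inode carries */
        printf ("inode has gfid -> %d\n", pre_inodelk_check (&loc));
        return 0;
}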

Version-Release number of selected component (if applicable):

3.4 source install:
Had two patches [1] and [2] on top of commit 8c9028b560b1f0fd816e7d2a9e0bec70cc526c1a.


How reproducible:
Rarely; I have hit it only once.

Steps to Reproduce:
1. Run the upstream regression test suite, modifying the volume type of both master and slave to 6x3:
   # prove -v tests/00-geo-rep/georep-basic-dr-rsync.t


Actual results:
The mount process crashed.


[1]  https://code.engineering.redhat.com/gerrit/143400
[2]  https://code.engineering.redhat.com/gerrit/143826

Expected results:
No crash should be seen

Additional info:
The mount is done by the geo-rep worker process and it is a gfid-access FUSE mount, mounted with the option "-o aux-gfid-mount".
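
For context, "aux-gfid-mount" enables the gfid-access translator, which exposes a virtual .gfid/ directory so entries can be addressed by gfid; that is why the path in the crashing loc starts with /.gfid/. A hedged example of such a mount (the host, volume and mount point below are placeholders, not taken from this setup):

    # mount -t glusterfs -o aux-gfid-mount slavehost:/slavevol /mnt/slave
    # stat /mnt/slave/.gfid/<parent-gfid>/rsnapshot_symlinkbug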

Comment 2 Nithya Balachandran 2018-07-16 06:23:11 UTC
Please provide access to the coredump

Comment 3 Kotresh HR 2018-07-16 06:36:37 UTC
I have just uploaded the core to the QE machine. Prasad will share the details.
The host is Fedora 24, not RHEL, and it is my local VM. So if you can't use the core file, let me know whether I need to update the other gluster binaries.

Comment 7 Nithya Balachandran 2018-07-17 08:01:04 UTC
I am unable to see any symbols in the core file when I try to open it.

Comment 8 Susant Kumar Palai 2018-07-17 09:46:08 UTC
Will check and update whether dht misses any gfid update in the healing code path.

Comment 9 Susant Kumar Palai 2018-07-17 10:21:17 UTC
Nithya, if you are already working on this, could you move this to the ASSIGNED state?

Susant

Comment 10 Nithya Balachandran 2018-07-17 11:23:46 UTC
(In reply to Susant Kumar Palai from comment #9)
> Nithya, If you are working on this already, could you move this to assigned
> state.
> 
> Susant

Done. I suspect the heal triggered from dht_lookup_dir_cbk() - the gfid is not set in the loc.
loc->inode->gfid is also NULL, which is what causes the crash.
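
For illustration only, and not the actual fix: the direction implied above is to make sure the loc used by the dht attr heal carries a usable gfid before the setattr is wound (copying it from the inode or from the lookup reply), and to skip the heal for that entry when no gfid is available. A standalone sketch with hypothetical names, using libuuid instead of gluster's gf_uuid_* helpers:

/* Hypothetical heal-side guard -- not the real dht patch.
 * Build with: gcc fixup.c -luuid */
#include <stdio.h>
#include <uuid/uuid.h>

struct fake_inode { uuid_t gfid; };
struct fake_loc   { uuid_t gfid; struct fake_inode *inode; };

/* Returns 0 if loc->gfid is now usable, -1 if the heal should be skipped. */
static int
fixup_loc_gfid (struct fake_loc *loc, const uuid_t lookup_gfid)
{
        if (!uuid_is_null (loc->gfid))
                return 0;                                 /* already set */
        if (loc->inode && !uuid_is_null (loc->inode->gfid)) {
                uuid_copy (loc->gfid, loc->inode->gfid);  /* borrow from the inode */
                return 0;
        }
        if (!uuid_is_null (lookup_gfid)) {
                uuid_copy (loc->gfid, lookup_gfid);       /* borrow from the lookup iatt */
                return 0;
        }
        return -1;                                        /* nothing to heal against */
}

int
main (void)
{
        struct fake_inode inode;
        struct fake_loc loc = { .inode = &inode };
        uuid_t from_lookup;

        uuid_clear (loc.gfid);
        uuid_clear (inode.gfid);
        uuid_clear (from_lookup);
        printf ("no gfid anywhere -> %d\n", fixup_loc_gfid (&loc, from_lookup));

        uuid_generate (from_lookup);    /* gfid arrives in the lookup reply */
        printf ("gfid from lookup -> %d\n", fixup_loc_gfid (&loc, from_lookup));
        return 0;
}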

Comment 11 Kotresh HR 2018-07-17 11:29:11 UTC
Hi,

The lib64 directory, which was missing, has been uploaded here:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1601331/

core/ 	                2018-07-16 12:04 	- 	 
georep-basic-dr-rsyn..>	2018-07-16 12:37 	131M	 
gluster-binares.tar 	2018-07-16 12:37 	1.6M	 
lib64.tar 	        2018-07-17 16:50 	333M	 
libraries.tar 	        2018-07-16 12:44 	30M	

 
Steps to use the core for debugging.

1. Create a directory on the local machine and change into it:
    #mkdir /dht-crash
    #cd /dht-crash
 
2. Download libraries.tar, gluster-binaries.tar and core/core-glustersproc0-6-0-0-13668-1531717404 into the /dht-crash directory

3. untar all the tar files

4. gdb usr/local/sbin/glusterfs core-glustersproc0-6-0-0-13668-1531717404

    (gdb) set solib-absolute-prefix /dht-crash
    (gdb) bt

Comment 12 Kotresh HR 2018-07-17 11:29:58 UTC
Mid-air collision; setting the status back to ASSIGNED.

Comment 13 Nithya Balachandran 2018-07-17 13:26:12 UTC
(In reply to Kotresh HR from comment #11)
> Hi,
> 
> The lib64 directory which was missing is uploaded here
> 
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1601331/
> 
> core/ 	                2018-07-16 12:04 	- 	 
> georep-basic-dr-rsyn..>	2018-07-16 12:37 	131M	 
> gluster-binares.tar 	2018-07-16 12:37 	1.6M	 
> lib64.tar 	        2018-07-17 16:50 	333M	 
> libraries.tar 	        2018-07-16 12:44 	30M	
> 
>  
> Steps to use the core for debugging.
> 
> 1. Create a directory on local machine and change directory
>     #mkdir /dht-crash
>     #cd /dht-crash
>  
> 2. Download all libraries.tar, gluster-binaries.tar and
> core/core-glustersproc0-6-0-0-13668-1531717404 into /dht-crash directory
> 
> 3. untar all the tar files
> 
> 4. gdb usr/local/sbin/glusterfs core-glustersproc0-6-0-0-13668-1531717404
> 
>     (gdb) set solib-absolute-prefix /dht-crash
>     (gdb) bt

Thank you. I can now see the symbols.

Comment 14 Nithya Balachandran 2018-07-18 07:37:57 UTC
Still looking into this. I shall update by tomorrow.

Comment 24 Prasad Desala 2018-08-10 12:15:13 UTC
On glusterfs version 3.12.2-15.el7rhgs.x86_64, I ran the same test case mentioned in the description multiple times and did not hit this issue.

Hence, moving this BZ to the Verified state.

Comment 25 errata-xmlrpc 2018-09-04 06:50:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

