1559886 – Brick process hung, and looks like a deadlock in inode locks

Bug 1559886 - Brick process hung, and looks like a deadlock in inode locks

Summary: Brick process hung, and looks like a deadlock in inode locks

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	core
Sub Component:
Version:	rhgs-3.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	RHGS 3.4.0
Assignee:	Sanoj Unnikrishnan
QA Contact:	Vinayak Papnoi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1503137
TreeView+	depends on / blocked

Reported:	2018-03-23 13:46 UTC by Shyamsundar
Modified:	2018-10-01 06:24 UTC (History)
CC List:	11 users (show)
Fixed In Version:	glusterfs-3.12.2-8
Doc Type:	Bug Fix
Doc Text:	Earlier, inode dictionaries were redundantly accessed under client_table lock while dumping clients. As a result of behavior, 'gluster volume status inode' suspended the brick. With this fix, the inode dictionaries under client_table lock will not be redundantly populated.
Clone Of:
Environment:
Last Closed:	2018-09-04 06:45:40 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
gstack output (239.96 KB, text/plain) 2018-03-23 13:46 UTC, Shyamsundar	no flags	Details
Full backtrace (353.53 KB, text/plain) 2018-03-26 09:58 UTC, Sanoj Unnikrishnan	no flags	Details
reproducer_script (651 bytes, text/plain) 2018-04-27 13:54 UTC, Sanoj Unnikrishnan	no flags	Details
Show Obsolete (1) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2018:2607	0	None	None	None	2018-09-04 06:46:54 UTC

Description Shyamsundar 2018-03-23 13:46:39 UTC

Created attachment 1412108 [details]
gstack output

Description of problem:
This is on an OSIO (openshift.io) setup, which is a 1x3 (pure replicate) volume on a 3 node cluster, with quota enabled, and having 5k sub-directories with a 1GB quota set on them, and a daily nightly snapshot schedule setup.

A particular instance when the clients (which are Openshift pods using a sub-directory each as a PV, from across ~75 nodes, and each node having more than one active pod) started issuing a slew of IOs to the Gluster bricks, and the brick processes started going to 100% CPU busyness, one of the brick processes stopped serving IO and dropped its CPU utilization.

At this instance, investigating the brick process, looked like it was hung/deadlocked.

A gstack and a subsequent gcore was gathered from this process and is attached to this bug (the gstack at least).

gcore can be found here: http://file.rdu.redhat.com/srangana/Mar-14-2018/ (redhat internal only)

Version-Release number of selected component (if applicable):
This is running 3.3.1 + errata, and the RPM versions are: glusterfs-3.8.4-54.el7rhgs.x86_64

How reproducible:
Happened once, filing the bug with the most relevant information, it is also not possible to reproduce this at will in this setup.

Additional info:
Additional logs from this node can be gathered if needed, but considering the hang and the core provided, hopefully this is not required.

Filing this against quota to start with, as the gstack (or core) is analyzed it may change components.

Comment 2 Sanoj Unnikrishnan 2018-03-26 09:54:36 UTC

Attaching full back trace obtained from core (bt.log)

From the backtrace it seems that user has trigerred,                        
"gluster volume status inode"                                                
                                                                                               
Thread 33 (Thread 0x7f29cbff7700 (LWP 4734)):      
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:165                        
#1  0x00007f2a6f7e5a51 in dict_lookup_common (this=<optimized out>, key=<optimized out>) at dict.c:288
#2  0x00007f2a6f7e6bcd in dict_set_lk (this=<optimized out>, key=<optimized out>, value=<optimized out>, replace=<optimized out>) at dict.c:346
#3  0x00007f2a6f7e6d61 in dict_set (this=<optimized out>, key=<optimized out>, value=<optimized out>) at dict.c:424
#4  0x00007f2a6f7e868b in dict_set_int32 (this=<optimized out>, key=<optimized out>, val=<optimized out>) at dict.c:1797
#5  0x00007f2a6f800cde in inode_dump_to_dict (inode=inode@entry=0x7f29ed24dc00, prefix=prefix@entry=0x7f2a54c0c6f0 "conn0.itable.active64903", dict=dict@entry=0x7f2a1409c710) at inode.c:2445
#6  0x00007f2a6f800ead in inode_table_dump_to_dict (itable=0x7f2a2c0037d0, prefix=prefix@entry=0x7f2a54c0d770 "conn0", dict=dict@entry=0x7f2a1409c710) at inode.c:2486
#7  0x00007f2a6f847452 in gf_client_dump_inodes_to_dict (this=<optimized out>, dict=0x7f2a1409c710) at client_t.c:866
#8  0x00005611d69e69e6 in glusterfs_handle_brick_status (req=0x7f2a54687f20) at glusterfsd-mgmt.c:1059
#9  0x00007f2a6f8254f2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#10 0x00007f2a6de68d40 in ?? () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()


This in turn takes a lock on the clienttable and a lock on the itables.
gf_client_dump_inodes_to_dict (xlator_t *this, dict_t *dict)..
 
Due to this there are about 121 threads stuck in gf_client_get calls
# cat bt.log | grep gf_client_get | wc -l
121
            
the gf_client_dump_inodes_to_dict calls inode_table_dump_to_dict which takes a lock on the inode table, It can be seen the inode_find calls are stuck waiting for this lock.   
                                                                                               
[root@dhcp42-178 Mar-14-2018]# cat bt.log | grep table=0x7f2a2c0037d0 
>>> notice the table field in inode_table_dump_to_dict and inode_find calls below.             

#6  0x00007f2a6f800ead in inode_table_dump_to_dict (itable=0x7f2a2c0037d0, prefix=prefix@entry=0x7f2a54c0d770 "conn0", dict=dict@entry=0x7f2a1409c710) at inode.c:2486
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f29e8624fb8 "") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a0c6f9698 "BZ\036\363\300\250G=\231j\232|5\206}\250") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a080776b8 "BZ\036\363\300\250G=\231j\232|5\206}\250") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a2c168b18 "S\032\005\327\272\330G") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a2103da58 "\021a\343f\222\230AZ\213U`\325\331r\372\060") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a19140c38 "") at inode.c:910
...                                                                             
... 


The gf_client_dump_inodes has to be rewritten with a reduced locking window to fix this.

Changing component to core.

Comment 3 Sanoj Unnikrishnan 2018-03-26 09:58:41 UTC

Created attachment 1413044 [details]
Full backtrace

Comment 8 Sanoj Unnikrishnan 2018-04-05 09:07:39 UTC

In gf_client_dump_inodes_to_dict, we iterate over itables for all clients..
the itables however could be the same and we have the below check to prevent duplicate listing.
/* Presently every brick contains only
* one bound_xl for all connections.
* This will lead to duplicating of
* the inode lists, if listing is
* done for every connection. This
* simple check prevents duplication
* in the present case. If need arises
* the check can be improved.
*/
if (client->bound_xl == prev_bound_xl)
continue;

This however may not be sufficient ..
assume the following scenario.
1) many clients (say 128) are come up and go down..
leading to clienttable entries. loking like
1,3 ,5 ,7, ....

THe intermediate entries are unrefed.. and hence marked free..

2) graph switch leads to a new bound_xl and new itable being used (say Itable_2)

3) new clients use itable_2 and clients that bind to it may get entries like 2,
8,6,4,2...

now there are only 2 itables .. but the check will fail for each iteration..since alternative clients use the same itable.. and the loop assumes only consecutive clients could use the same itable.

Will need to reproduce this scenario.

Fix would be to avoid the duplication..
1) loop over the clienttable using nextfree (instead of a counter based loop that we do now.)
this would ensure successive entries of clients are the ones that came after each other, then we can use the same check if successive entries have same bound_xl ..
2) keep a list tracking all inode tables we have seen and check for duplicates.

Comment 11 Sanoj Unnikrishnan 2018-04-27 13:51:16 UTC

I was able to reproduce the scenario with attached repro_script 
The issue seems to be fixed with below patch . 
>>> 
===
commit be26b0da2f1a7fe336400de6a1c016716983bd38
Author: hari gowtham <hgowtham>
Date:   Wed Apr 11 17:38:26 2018 +0530

    glusterd: volume inode/fd status broken with brick mux
====


As noted earlier the issue was because the same itable was being dumped multiples (as many as number of clients) under clienttable lock.. 
This resulted in lock being held for long time.


TIME_TAKEN_WITHOUT_PATCH

[root@node1 ~]# time gluster v status v2 inode
Error : Request timed out

real    2m0.124s
user    0m0.095s
sys     0m0.041s

>>> Note: that the request time out at this moment inspecting the brick stack showed that itable dump was still being processed.

(gdb) where
#0  0x00007f0e0028c2a5 in dict_lookup_common (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", hash=4248157189) at dict.c:307
#1  0x00007f0e0028c506 in dict_set_lk (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", value=0x7f0dd2acaab8, replace=true) at dict.c:370
#2  0x00007f0e0028c821 in dict_set (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", value=0x7f0dd2acaab8) at dict.c:452
#3  0x00007f0e00290b86 in dict_set_uint32 (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", val=0) at dict.c:1970
#4  0x00007f0e002b13e0 in inode_dump_to_dict (inode=0x7f0dac06e548, prefix=0x7f0db0457550 "conn20.itable.lru888", dict=0x7f0dd0001ae8) at inode.c:2461
#5  0x00007f0e002b1798 in inode_table_dump_to_dict (itable=0x7f0db0008df0, prefix=0x7f0db04585a0 "conn20", dict=0x7f0dd0001ae8) at inode.c:2516
#6  0x00007f0e0031029d in gf_client_dump_inodes_to_dict (this=0x7f0dec02f530, dict=0x7f0dd0001ae8) at client_t.c:880
#7  0x000000000040e667 in glusterfs_handle_brick_status (req=0x7f0db0143648) at glusterfsd-mgmt.c:1197
#8  0x00007f0e002e92db in synctask_wrap () at syncop.c:375
#9  0x00007f0dfeb0ad40 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()


>>> Any new mount got hung due to clienttable lock being held.. 

root@node1 glusterfs]# mount -t glusterfs self:v2 /mnt128
HUNG
 
This lasted for many minutes and explains the scenario seen earlier.


TIME_TAKEN_WITH_PATCH
[root@node1 ~]# time gluster v status v2 inode
---
---
real    0m1.193s
user    0m0.451s
sys     0m0.043s


with no redundant listing

Comment 12 Sanoj Unnikrishnan 2018-04-27 13:54:07 UTC

Created attachment 1427669 [details]
reproducer_script

Comment 18 hari gowtham 2018-05-11 08:04:24 UTC

Sanoj, could you please update the "Fixed in version" ?

keeping the needinfo put on Sanoj by Vinayak.

Comment 23 Sanoj Unnikrishnan 2018-05-15 09:01:34 UTC

As discussed in comment 21, we have two issues here:

1) Redundant processing of itable in a brick-mux setup. This is fixed in
glusterfs-3.12.2-8  (hence updating the fixed-in-version field here)

2) Inode table dump to dictionary taking long time (due to lock held over span of lru list). We need a new bug for this. Requesting vinayak to open the same.

Comment 28 errata-xmlrpc 2018-09-04 06:45:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

Note You need to log in before you can comment on or make changes to this bug.