Created attachment 1412108 [details]
gstack output

Description of problem:
This is on an OSIO (openshift.io) setup: a 1x3 (pure replicate) volume on a 3-node cluster, with quota enabled, about 5k sub-directories each carrying a 1GB quota limit, and a nightly snapshot schedule.

At one point the clients (OpenShift pods, each using a sub-directory as a PV, spread across ~75 nodes with more than one active pod per node) started issuing a burst of IO to the Gluster bricks. The brick processes climbed to 100% CPU, and then one of them stopped serving IO and its CPU utilization dropped. Investigating that brick process at the time, it appeared to be hung/deadlocked. A gstack and a subsequent gcore were gathered from the process; the gstack is attached to this bug, and the gcore can be found here (Red Hat internal only):
http://file.rdu.redhat.com/srangana/Mar-14-2018/

Version-Release number of selected component (if applicable):
This is running 3.3.1 + errata; the RPM version is glusterfs-3.8.4-54.el7rhgs.x86_64.

How reproducible:
Happened once. Filing the bug with the most relevant information; it is not possible to reproduce this at will on this setup.

Additional info:
Additional logs from this node can be gathered if needed, but given the hang and the core provided, hopefully that will not be required. Filing this against quota to start with; as the gstack (or core) is analyzed the component may change.
Attaching the full backtrace obtained from the core (bt.log).

From the backtrace it seems that a user has triggered "gluster volume status <vol> inode":

Thread 33 (Thread 0x7f29cbff7700 (LWP 4734)):
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:165
#1  0x00007f2a6f7e5a51 in dict_lookup_common (this=<optimized out>, key=<optimized out>) at dict.c:288
#2  0x00007f2a6f7e6bcd in dict_set_lk (this=<optimized out>, key=<optimized out>, value=<optimized out>, replace=<optimized out>) at dict.c:346
#3  0x00007f2a6f7e6d61 in dict_set (this=<optimized out>, key=<optimized out>, value=<optimized out>) at dict.c:424
#4  0x00007f2a6f7e868b in dict_set_int32 (this=<optimized out>, key=<optimized out>, val=<optimized out>) at dict.c:1797
#5  0x00007f2a6f800cde in inode_dump_to_dict (inode=inode@entry=0x7f29ed24dc00, prefix=prefix@entry=0x7f2a54c0c6f0 "conn0.itable.active64903", dict=dict@entry=0x7f2a1409c710) at inode.c:2445
#6  0x00007f2a6f800ead in inode_table_dump_to_dict (itable=0x7f2a2c0037d0, prefix=prefix@entry=0x7f2a54c0d770 "conn0", dict=dict@entry=0x7f2a1409c710) at inode.c:2486
#7  0x00007f2a6f847452 in gf_client_dump_inodes_to_dict (this=<optimized out>, dict=0x7f2a1409c710) at client_t.c:866
#8  0x00005611d69e69e6 in glusterfs_handle_brick_status (req=0x7f2a54687f20) at glusterfsd-mgmt.c:1059
#9  0x00007f2a6f8254f2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#10 0x00007f2a6de68d40 in ?? () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()

This handler, gf_client_dump_inodes_to_dict (xlator_t *this, dict_t *dict), takes a lock on the clienttable and then a lock on each itable while dumping.

Because of this, about 121 threads are stuck in gf_client_get calls:

# cat bt.log | grep gf_client_get | wc -l
121

gf_client_dump_inodes_to_dict also calls inode_table_dump_to_dict, which takes the inode-table lock; the inode_find calls can be seen stuck waiting for that same lock.

[root@dhcp42-178 Mar-14-2018]# cat bt.log | grep table=0x7f2a2c0037d0

>>> Notice the table field in the inode_table_dump_to_dict and inode_find frames below.

#6  0x00007f2a6f800ead in inode_table_dump_to_dict (itable=0x7f2a2c0037d0, prefix=prefix@entry=0x7f2a54c0d770 "conn0", dict=dict@entry=0x7f2a1409c710) at inode.c:2486
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f29e8624fb8 "") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a0c6f9698 "BZ\036\363\300\250G=\231j\232|5\206}\250") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a080776b8 "BZ\036\363\300\250G=\231j\232|5\206}\250") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a2c168b18 "S\032\005\327\272\330G") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a2103da58 "\021a\343f\222\230AZ\213U`\325\331r\372\060") at inode.c:910
#3  0x00007f2a6f7fe6c6 in inode_find (table=0x7f2a2c0037d0, gfid=gfid@entry=0x7f2a19140c38 "") at inode.c:910
...
...

gf_client_dump_inodes_to_dict has to be rewritten with a reduced locking window to fix this. Changing component to core.
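To make the contention pattern concrete, below is a minimal, self-contained C sketch of the locking shape implied by these backtraces: the status handler holds the client-table lock across all connections and each inode-table lock while dumping that table, so anything that needs the client-table lock (like gf_client_get) or the inode-table lock (like inode_find) blocks until the whole dump finishes. The mutexes, counts and sleeps are illustrative stand-ins, not the actual GlusterFS locks or code.

/* Simplified model of the contention in the backtraces above. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t clienttable_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t itable_lock      = PTHREAD_MUTEX_INITIALIZER;

/* "gluster volume status <vol> inode" handler: client-table lock held for the
 * whole dump, inode-table lock taken per connection inside it. */
static void *dump_inodes_to_dict(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&clienttable_lock);
    for (int conn = 0; conn < 10; conn++) {          /* one pass per connection */
        pthread_mutex_lock(&itable_lock);
        usleep(100 * 1000);        /* dumping a large active/lru list takes time */
        pthread_mutex_unlock(&itable_lock);
    }
    pthread_mutex_unlock(&clienttable_lock);
    return NULL;
}

/* Incoming RPCs need the client-table lock (like gf_client_get). */
static void *rpc_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&clienttable_lock);           /* blocks until the dump ends */
    pthread_mutex_unlock(&clienttable_lock);
    return NULL;
}

/* FOPs resolving gfids need the inode-table lock (like inode_find). */
static void *fop_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&itable_lock);                /* blocks while the table is dumped */
    pthread_mutex_unlock(&itable_lock);
    return NULL;
}

int main(void)
{
    pthread_t dumper, rpc, fop;
    pthread_create(&dumper, NULL, dump_inodes_to_dict, NULL);
    usleep(10 * 1000);                  /* let the dump grab its locks first */
    pthread_create(&rpc, NULL, rpc_thread, NULL);
    pthread_create(&fop, NULL, fop_thread, NULL);
    pthread_join(dumper, NULL);
    pthread_join(rpc, NULL);
    pthread_join(fop, NULL);
    printf("dump finished; the blocked RPC and FOP threads resumed\n");
    return 0;
}

In the real brick the "dump" side ran for minutes, which is why the 121 gf_client_get callers and the inode_find callers all appear parked in the core.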
Created attachment 1413044 [details] Full backtrace
In gf_client_dump_inodes_to_dict we iterate over the itables of all clients. The itables, however, could all be the same, and we have the below check to prevent duplicate listing:

                        /* Presently every brick contains only
                         * one bound_xl for all connections.
                         * This will lead to duplicating of
                         * the inode lists, if listing is
                         * done for every connection. This
                         * simple check prevents duplication
                         * in the present case. If need arises
                         * the check can be improved.
                         */
                        if (client->bound_xl == prev_bound_xl)
                                continue;

This, however, may not be sufficient. Assume the following scenario:

1) Many clients (say 128) come up and go down, so the clienttable entries end up looking like 1, 3, 5, 7, ... The intermediate entries are unrefed and hence marked free.
2) A graph switch leads to a new bound_xl and a new itable being used (say itable_2).
3) New clients bind to itable_2 and get the freed entries, e.g. 2, 4, 6, 8, ...

Now there are only two itables, but the check fails on every iteration, since alternating clients share the same itable while the loop assumes only consecutive clients could do so. Will need to reproduce this scenario.

The fix would be to avoid the duplication; two options (a sketch of the second one follows below):

1) Loop over the clienttable using nextfree (instead of the counter-based loop we do now). This would ensure successive entries are clients that came up one after another, so the existing check on successive entries having the same bound_xl would hold.
2) Keep a list tracking all inode tables we have seen and check for duplicates.
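A minimal, self-contained sketch of option 2, i.e. remembering every inode table that has already been dumped so clients bound to the same itable are skipped no matter how they are interleaved in the clienttable. The struct names below are simplified placeholders, not the real client_t / inode_table_t types.

/* Illustrative sketch of fix option 2: dump each distinct itable once. */
#include <stddef.h>
#include <stdio.h>

struct itable { int id; };                        /* placeholder for inode_table_t */
struct client { struct itable *bound_itable; };   /* placeholder for client_t      */

#define MAX_ITABLES 1024

static int already_seen(struct itable **seen, size_t nseen, struct itable *it)
{
    for (size_t i = 0; i < nseen; i++)
        if (seen[i] == it)
            return 1;
    return 0;
}

/* Dump each distinct inode table exactly once, regardless of client order. */
static void dump_all_itables(struct client **clients, size_t nclients,
                             void (*dump)(struct itable *))
{
    struct itable *seen[MAX_ITABLES];
    size_t nseen = 0;

    for (size_t i = 0; i < nclients; i++) {
        struct itable *it = clients[i] ? clients[i]->bound_itable : NULL;
        if (!it || already_seen(seen, nseen, it))
            continue;                     /* same itable as an earlier client */
        if (nseen < MAX_ITABLES)
            seen[nseen++] = it;
        dump(it);
    }
}

static void dump_stub(struct itable *it)
{
    printf("dumping itable %d once\n", it->id);
}

int main(void)
{
    /* Two itables (e.g. before/after a graph switch), clients interleaved
     * between them -- the existing prev_bound_xl check would re-dump both. */
    struct itable t1 = { 1 }, t2 = { 2 };
    struct client a = { &t1 }, b = { &t2 }, c = { &t1 }, d = { &t2 };
    struct client *clients[] = { &a, &b, &c, &d };
    dump_all_itables(clients, 4, dump_stub);      /* prints exactly two lines */
    return 0;
}

Only a handful of graph switches ever happen on a brick, so the linear scan over the "seen" list is negligible compared with the cost of dumping an inode table.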
I was able to reproduce the scenario with the attached repro script.

The issue seems to be fixed with the below patch:

====
commit be26b0da2f1a7fe336400de6a1c016716983bd38
Author: hari gowtham <hgowtham>
Date:   Wed Apr 11 17:38:26 2018 +0530

    glusterd: volume inode/fd status broken with brick mux
====

As noted earlier, the issue was that the same itable was being dumped multiple times (as many times as there are clients) under the clienttable lock, which resulted in the lock being held for a long time.

TIME TAKEN WITHOUT PATCH:

[root@node1 ~]# time gluster v status v2 inode
Error : Request timed out

real    2m0.124s
user    0m0.095s
sys     0m0.041s

>>> Note that the request timed out; inspecting the brick stack at that moment showed the itable dump was still being processed:

(gdb) where
#0  0x00007f0e0028c2a5 in dict_lookup_common (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", hash=4248157189) at dict.c:307
#1  0x00007f0e0028c506 in dict_set_lk (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", value=0x7f0dd2acaab8, replace=true) at dict.c:370
#2  0x00007f0e0028c821 in dict_set (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", value=0x7f0dd2acaab8) at dict.c:452
#3  0x00007f0e00290b86 in dict_set_uint32 (this=0x7f0dd0001ae8, key=0x7f0db0456510 "conn20.itable.lru888.ref", val=0) at dict.c:1970
#4  0x00007f0e002b13e0 in inode_dump_to_dict (inode=0x7f0dac06e548, prefix=0x7f0db0457550 "conn20.itable.lru888", dict=0x7f0dd0001ae8) at inode.c:2461
#5  0x00007f0e002b1798 in inode_table_dump_to_dict (itable=0x7f0db0008df0, prefix=0x7f0db04585a0 "conn20", dict=0x7f0dd0001ae8) at inode.c:2516
#6  0x00007f0e0031029d in gf_client_dump_inodes_to_dict (this=0x7f0dec02f530, dict=0x7f0dd0001ae8) at client_t.c:880
#7  0x000000000040e667 in glusterfs_handle_brick_status (req=0x7f0db0143648) at glusterfsd-mgmt.c:1197
#8  0x00007f0e002e92db in synctask_wrap () at syncop.c:375
#9  0x00007f0dfeb0ad40 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

>>> Any new mount also hung because the clienttable lock was held:

[root@node1 glusterfs]# mount -t glusterfs self:v2 /mnt128
HUNG

This lasted for many minutes and explains the scenario seen earlier.

TIME TAKEN WITH PATCH:

[root@node1 ~]# time gluster v status v2 inode
---
---

real    0m1.193s
user    0m0.451s
sys     0m0.043s

with no redundant listing.
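For illustration, a hedged sketch of why the work multiplies with the number of clients: every connection gets its own "connN.*" key prefix even when all of them point at the same inode table, so without deduplication the dict ends up with roughly (connections x inodes) entries, all built while the locks are held. Only the key layout ("conn20.itable.lru888.ref") is taken from the backtrace above; the set_uint() helper and loop bounds are invented for the example and are not the real dict_t API.

/* Hedged sketch: per-connection prefixes multiply the dump work. */
#include <stdio.h>

static void set_uint(const char *key, unsigned val)
{
    /* stand-in for a dict_set_uint32(dict, key, val) style call */
    printf("%s = %u\n", key, val);
}

static void dump_one_inode(const char *itable_prefix, int lru_idx, unsigned ref)
{
    char key[128];
    snprintf(key, sizeof(key), "%s.lru%d.ref", itable_prefix, lru_idx);
    set_uint(key, ref);
    /* the real dump adds several more fields (gfid, lookups, ...) per inode */
}

int main(void)
{
    /* Real case: conn0..conn20 and hundreds of lru entries; small numbers
     * are used here just to keep the output readable. */
    for (int conn = 0; conn < 3; conn++) {
        char itable_prefix[32];
        snprintf(itable_prefix, sizeof(itable_prefix), "conn%d.itable", conn);
        for (int lru_idx = 0; lru_idx < 5; lru_idx++)
            dump_one_inode(itable_prefix, lru_idx, 0);
    }
    return 0;
}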
Created attachment 1427669 [details] reproducer_script
Sanoj, could you please update the "Fixed in version" field? Keeping the needinfo set on Sanoj by Vinayak.
As discussed in comment 21, we have two issues here:

1) Redundant processing of the itable in a brick-mux setup. This is fixed in glusterfs-3.12.2-8 (hence updating the fixed-in-version field here).
2) The inode-table dump to dictionary taking a long time (due to the lock being held over the span of the lru list; a sketch of this lock window follows below). We need a new bug for this; requesting Vinayak to open it.
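For issue (2), a minimal sketch of why the window grows with the lru size: the table lock is held while every lru entry is serialized into the dict. The list type and dump body below are simplified stand-ins, not the real inode_table_t code, and the "collect under the lock, format outside it" note in the comment is a general pattern, not a fix proposed in this bug.

/* Simplified model of issue (2): O(lru_size) work inside one lock. */
#include <pthread.h>
#include <stdio.h>

struct lru_node { struct lru_node *next; unsigned ref; };

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Current shape of the dump: all of the per-entry string formatting and
 * dict insertion happens while the table lock is held. */
static void dump_lru_locked(struct lru_node *lru_head)
{
    pthread_mutex_lock(&table_lock);
    for (struct lru_node *n = lru_head; n; n = n->next)
        printf("lru entry ref=%u\n", n->ref);   /* key/value building happens here */
    pthread_mutex_unlock(&table_lock);
    /* Shrinking the window would mean collecting (referenced) entries under
     * the lock and doing the expensive serialization after releasing it. */
}

int main(void)
{
    struct lru_node c = { NULL, 1 }, b = { &c, 0 }, a = { &b, 2 };
    dump_lru_locked(&a);
    return 0;
}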
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607