Bug 1464327 - glusterfs client crashes when reading large directory
glusterfs client crashes when reading large directory
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: fuse (Show other bugs)
mainline
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Csaba Henk
:
Depends On:
Blocks: 1452513
  Show dependency treegraph
 
Reported: 2017-06-23 02:04 EDT by Csaba Henk
Modified: 2017-10-26 10:35 EDT (History)
3 users (show)

See Also:
Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-05 13:34:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Diagnostic patch (6.50 KB, patch)
2017-07-05 11:34 EDT, Csaba Henk
no flags Details | Diff

  None (edit)
Description Csaba Henk 2017-06-23 02:04:56 EDT
Description of problem:

Set up a glusterfs mount with following parameters:

- performance.client-io-threads = on
- performance.stat-prefetch = on
(these are defaults)
- client is mounted with -oattribute_timeout=0,gid_timeout=0

Then, when doing READDPRP on a large directory (5000 files are usually enough, 10000 is always enough), a crash occurs.

Version-Release number of selected component (if applicable):

Seems to occur from 3.8 up.

How reproducible:

Deterministically.

Steps to Reproduce:

In the gluster mount, do:

$ for i in `seq 10000`; do echo -ne "\r$i        "; printf "foof%05d\n" $i |xargs touch; done
$ ls -l | wc =l


Actual results:

Thread 7 "glusterfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff0ee2700 (LWP 13811)]
0x00007ffff7b0743c in __dentry_grep (table=0x7fffec032c80, parent=0x7fffec030ce0, name=0x7fffec046d88 "foof07345") at inode.c:774
warning: Source file is more recent than executable.
774                     if (tmp->parent == parent && !strcmp (tmp->name, name)) {
(gdb) bt
#0  0x00007ffff7b0743c in __dentry_grep (table=0x7fffec032c80, parent=0x7fffec030ce0, name=0x7fffec046d88 "foof07345") at inode.c:774
#1  0x00007ffff7b07c92 in __inode_link (inode=0x7fffec042420, parent=0x7fffec030ce0, name=0x7fffec046d88 "foof07345", iatt=0x7fffec046d08)
    at inode.c:1049
#2  0x00007ffff7b07e72 in inode_link (inode=0x7fffec042420, parent=0x7fffec030ce0, name=0x7fffec046d88 "foof07345", iatt=0x7fffec046d08)
    at inode.c:1096
#3  0x00007ffff51d0e8a in fuse_readdirp_cbk (frame=0x7fffe00016c0, cookie=0x7fffe00017d0, this=0x656e60, op_ret=46, op_errno=0,
    entries=0x7fffe4002200, xdata=0x0) at fuse-bridge.c:2944
#4  0x00007fffeb9b7ef2 in io_stats_readdirp_cbk (frame=0x7fffe00017d0, cookie=0x7fffe00018e0, this=0x7fffec010240, op_ret=46, op_errno=0,
    buf=0x7fffe4002200, xdata=0x0) at io-stats.c:2132
#5  0x00007ffff7b9bbd5 in default_readdirp_cbk (frame=0x7fffe00018e0, cookie=0x7fffe40039d0, this=0x7fffec00ec70, op_ret=46, op_errno=0,
    entries=0x7fffe4002200, xdata=0x0) at defaults.c:1403
#6  0x00007fffebdf6fda in mdc_readdirp_cbk (frame=0x7fffe40039d0, cookie=0x7fffe4001320, this=0x7fffec00d680, op_ret=46, op_errno=0,
    entries=0x7fffe4002200, xdata=0x0) at md-cache.c:2393
#7  0x00007ffff0224698 in dht_readdirp_cbk (frame=0x7fffe4001320, cookie=0x7fffec009610, this=0x7fffec00bfe0, op_ret=23, op_errno=0,
    orig_entries=0x7ffff0ee19e0, xdata=0x0) at dht-common.c:5230
#8  0x00007ffff04ad7ca in client3_3_readdirp_cbk (req=0x7fffec02f1b0, iov=0x7fffec02f1f0, count=1, myframe=0x7fffec03d610)
    at client-rpc-fops.c:2580
#9  0x00007ffff78ba533 in rpc_clnt_handle_reply (clnt=0x7fffec02c900, pollin=0x7fffec002b60) at rpc-clnt.c:778
#10 0x00007ffff78baace in rpc_clnt_notify (trans=0x7fffec02cad0, mydata=0x7fffec02c930, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fffec002b60)
    at rpc-clnt.c:971
#11 0x00007ffff78b6bc5 in rpc_transport_notify (this=0x7fffec02cad0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7fffec002b60)
    at rpc-transport.c:538
#12 0x00007ffff279f265 in socket_event_poll_in (this=0x7fffec02cad0, notify_handled=_gf_true) at socket.c:2315
#13 0x00007ffff279f884 in socket_event_handler (fd=11, idx=2, gen=1, data=0x7fffec02cad0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#14 0x00007ffff7b61fed in event_dispatch_epoll_handler (event_pool=0x64f080, event=0x7ffff0ee1ea0) at event-epoll.c:572
#15 0x00007ffff7b622c1 in event_dispatch_epoll_worker (data=0x693760) at event-epoll.c:648
#16 0x00007ffff693f5ba in start_thread (arg=0x7ffff0ee2700) at pthread_create.c:333
#17 0x00007ffff62177cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)

Expected results:

10002

Additional info:
Comment 1 Csaba Henk 2017-07-04 12:27:33 EDT
Further notes about reproduction:

- In the bug descriptpion the mount options were listed *incorrectly*. They should be: -oentry_timeout=0,gid_timeout=0
- It's sufficient to use a single brick volume
- The volfile can be stripped down as follows:

volume moz-client-0
    type protocol/client
#   Following settings were just left at default
    option send-gids true
    option transport.socket.keepalive-count 9
    option transport.socket.keepalive-interval 2
    option transport.socket.keepalive-time 20
    option transport.tcp-user-timeout 0
    option password <PASSWORD> 
    option username <USERNAME>
    option transport.address-family inet
    option transport-type tcp
    option remote-subvolume /bricks/moz0
    option remote-host <HOST> 
    option ping-timeout 42
end-volume

volume moz-md-cache
    type performance/md-cache
    subvolumes moz-client-0
end-volume

volume moz
    type performance/io-threads
    subvolumes moz-md-cache
end-volume

- So the glusterfs commandline for a manually created volfile can be like this: 

# glusterfs  --entry-timeout=0 --gid-timeout=0 --volfile=<FILE> <MOUNTPOINT>
Comment 2 Csaba Henk 2017-07-05 11:18:11 EDT
# Analysis 1/2

The stack trace shown in the bug description indicates a memory corruption, but the corruption
itself occurred earlier. (It's a stack overflow, as we are to see.)

The inode table comes with a hash table of dentries. table->name_hash is the array of buckets,
each of which holds list of dentry_t objects (dentries). The crash in the __dentry_grep function
occurs because list pointers are corrupt in the bucket in which a certain dentry is sought.

(gdb) l
669                     return NULL;
770
771             hash = hash_dentry (parent, name, table->hashsize);
772
773             list_for_each_entry (tmp, &table->name_hash[hash], hash) {
774                     if (tmp->parent == parent && !strcmp (tmp->name, name)) {
775                             dentry = tmp;
776                             break;
777                     }
778             }
(gdb) p tmp
$1 = (dentry_t *) 0xfffffffffffffff0
(gdb) p table->name_hash[hash]
$2 = {next = 0x0, prev = 0x0}
(gdb) p (dentry_t *)((char *)(table->name_hash[hash].next) - sizeof(struct list_head))
$3 = (dentry_t *) 0xfffffffffffffff0

So it can be seen that tmp is the dentry pointer which comes as the first element in the bucket
list at the hash key, and its invalid, so when its parent member is attempted to be retrieved, a
crash occurs. In a list, the list pointers should always be valid and consist a circle.
When an inode table is created in inode_table_new(), all the buckets list are properly
initialized, so they should be always be valid during the lifetime of the table. That it's not
the case indicates a memory corruption. So question is where the corruption occurs.

An interesting other stack trace can be provoked if we slightly alter the reproduction
instructions: run the two commands, the for loop for file creation and the ls for listing the
files, simultaneously (maybe starting the ls with some delay). Most of the times the first kind
of stack trace will be seen, but sometimes the following comes up:

(gdb) bt
#0  frame_fill_groups (frame=frame@entry=0x7f44bc040bb0) at fuse-helpers.c:158
#1  0x00007f44fa08f1d6 in get_groups (frame=0x7f44bc040bb0, priv=0x7f4503405040) at fuse-helpers.c:321
#2  get_call_frame_for_req (state=state@entry=0x7f44d0004aa0) at fuse-helpers.c:366
#3  0x00007f44fa0977d0 in fuse_unlink_resume (state=0x7f44d0004aa0) at fuse-bridge.c:1631
#4  0x00007f44fa0915c5 in fuse_resolve_done (state=<optimized out>) at fuse-resolve.c:663
#5  fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:690
#6  0x00007f44fa0912d8 in fuse_resolve (state=0x7f44d0004aa0) at fuse-resolve.c:654
#7  0x00007f44fa09160e in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:686
#8  0x00007f44fa0908f3 in fuse_resolve_continue (state=state@entry=0x7f44d0004aa0) at fuse-resolve.c:706
#9  0x00007f44fa090ae7 in fuse_resolve_entry_cbk (frame=<optimized out>, cookie=<optimized out>, this=0x7f45033feef0, op_ret=0, op_errno=0, inode=0x7f44e6937830,
    buf=0x7f44ec6d6c60, xattr=0x0, postparent=0x7f44ec6d6cd0) at fuse-resolve.c:76
#10 0x00007f44ef9cd069 in io_stats_lookup_cbk (frame=0x7f44d0065800, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, inode=0x7f44e6937830,
    buf=0x7f44ec6d6c60, xdata=0x0, postparent=0x7f44ec6d6cd0) at io-stats.c:2190
#11 0x00007f4502cb14d1 in default_lookup_cbk (frame=frame@entry=0x7f44d0060840, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=0,
    op_errno=op_errno@entry=0, inode=0x7f44e6937830, buf=buf@entry=0x7f44ec6d6c60, xdata=0x0, postparent=postparent@entry=0x7f44ec6d6cd0) at defaults.c:1265
#12 0x00007f44efdf8933 in mdc_lookup (frame=0x7f44bc040bb0, this=<optimized out>, loc=0x7f44b8d324e0, xdata=<optimized out>) at md-cache.c:1123
#13 0x00007f4502cc5b92 in default_lookup_resume (frame=0x7f44d0060840, this=0x7f44f001d280, loc=0x7f44b8d324e0, xdata=0x0) at defaults.c:1872
#14 0x00007f4502c55b25 in call_resume (stub=0x7f44b8d32490) at call-stub.c:2508
#15 0x00007f44efbe3957 in iot_worker (data=0x7f44f002c900) at io-threads.c:220
#16 0x00007f4501a92dc5 in start_thread (arg=0x7f44ec6d7700) at pthread_create.c:308
#17 0x00007f45013d773d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) l
153             char            line[4096];
154             char           *ptr           = NULL;
155             FILE           *fp            = NULL;
156             int             idx           = 0;
157             long int        id            = 0;
158             char           *saveptr       = NULL;
159             char           *endptr        = NULL;
160             int             ret           = 0;
161             int             ngroups       = FUSE_MAX_AUX_GROUPS;
162             gid_t           mygroups[GF_MAX_AUX_GROUPS];

What is interesting about it is that the crash is indicated at a variable declaration which doesn't
have too much "crash potential".

We can explore the scenario further with Electric Fence (http://elinux.org/Electric_Fence). (In
RHEL/Centos/Fedora it's packaged as ElectricFence).

Start glusterfs in gdb and set up libefence for preloading, and run then glusterfs:

(gdb) set exec-wrapper env  LD_PRELOAD=/usr/lib64/libefence.so
(gdb) run --entry-timeout=0 --gid-timeout=0 --volfile=<VOLFILE> -N --log-file=- --log-level=INFO <MOUNTPOINT>

Then performing the step of reproduction (no need to do them in parallel this time) we'll hit this:

Thread 7 "glusterfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7e87700 (LWP 17452)]
0x00007ffff4eb450c in frame_fill_groups (frame=<error reading variable: Cannot access memory at address 0x7ffff7e45468>) at fuse-helpers.c:148
148     {
(gdb) bt
#0  0x00007ffff4eb450c in frame_fill_groups (frame=<error reading variable: Cannot access memory at address 0x7ffff7e45468>) at fuse-helpers.c:148
#1  0x00007ffff4eb4938 in get_groups (priv=0x7ffff4de4df0, frame=0x7fffedf26f20) at fuse-helpers.c:302
...
(gdb) l
143
144
145     #define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups from /proc */
146     void
147     frame_fill_groups (call_frame_t *frame)
148     {
149     #if defined(GF_LINUX_HOST_OS)
150             xlator_t       *this          = frame->this;
151             fuse_private_t *priv          = this->private;
152             char            filename[32];

A crash with a very similar stack trace, just now the point of crash is indicated at the opening
brace of the offending function, and its first argument is an unreadable address. The safeguard
mechanisms of libefence hint that this is the location of the corruption, and "this" can be
identified as the entry to frame_fill_groups, ie. the point when the runtime sets up the stack for
calling frame_fill_groups. So a stack issue is quite likely at this point.

Let's look further into frame_fill_groups!

#define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups from /proc */
void
frame_fill_groups (call_frame_t *frame)
{
#if defined(GF_LINUX_HOST_OS)
        xlator_t       *this          = frame->this;
        fuse_private_t *priv          = this->private;
        char            filename[32];
        char            line[4096];
        char           *ptr           = NULL;
        FILE           *fp            = NULL;
        int             idx           = 0;
        long int        id            = 0;
        char           *saveptr       = NULL;
        char           *endptr        = NULL;
        int             ret           = 0;
        int             ngroups       = FUSE_MAX_AUX_GROUPS;
        gid_t           mygroups[GF_MAX_AUX_GROUPS];

        if (priv->resolve_gids) {

There is one thing allocated on stack that's bigger than a scalar: the mygroups array. How big is
it?

/* GlusterFS's maximum supported Auxiliary GIDs */
#define GF_MAX_AUX_GROUPS   65535

(gdb) p sizeof(gid_t)
$26 = 4

So a 64k sized integer array is allocated on stack, that means a buffer of 256k bytes. That's a
likely culprit for a stack overflow. The hypothesis can be quickly tested with replacing the stack
allocation with a heap one, and we can see that it eliminates the crash and Electric Fence is also
made happy with that.

Notes:

- The "gid_t mygroups[GF_MAX_AUX_GROUPS]" pattern occurs at a few other places too, it's not clear
  why at this location did the stack overflow occur.
- It's also not clear why exactly the issue occurs with this configuration (io-threads and md-cache
  and the {entry,gid}_timeout=0 settings).
- The "culprit" is change I7ede90d0e41bcf55755cced5747fa0fb1699edb2
  (https://review.gluster.org/#/q/I7ede90d0e41bcf55755cced5747fa0fb1699edb2), which is present in
  GlusterFS 3.8.0 and also backported to 3.6 and 3.7 branches.
- It's also not clear at this point what's the relationship with the stack overflow and the
  corruption in the inode table. In the followup comment we'll explore that further.
Comment 3 Csaba Henk 2017-07-05 11:34 EDT
Created attachment 1294659 [details]
Diagnostic patch

# Analysis 2/2

To explore the connection between the inode table corruption and the stack overflow with
frame_fill_groups, we need to get our hands a bit more dirty. We'll add a routine that does a basic
sanity check on inode tables: it runs through the bucket list of the name_hash and check if the
list_{next,prev} pointers are zero. If none of them is zero, then it logs a message that the table
is "all good", otherwise logs the offending buckets. ("All good" is euphemistic though-- this is of
course just a heuristic, as the list pointers can be corrupt in other ways than being 0.) Then we
insert this sanity check to a number of locations which are hit during various fops. Also we add
logs that display the hash value of the dentries being added to the table. These changes are
contained in the attached patch. Re-running the test with it, we get logs like this:

[2017-07-04 23:04:57.798261] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #1 fuse_getattr_resume/950: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:04:57.798683] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #2 fuse_attr_cbk/852: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:04:57.809346] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #3 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.809892] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #4 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.810612] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #5 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.811149] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #6 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.811852] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #7 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.812440] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #8 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.813398] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #9 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
...
[2017-07-04 23:05:08.291447] W [inode.c:94:__dentry_hash] 0-DEBUG: table 0x7fffe4006050 (inode 0x7fffe400b6e0): adding dentry to name hash at 0x7ffff7e36600, index 9819
[2017-07-04 23:05:08.291718] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #40 fuse_readdirp_cbk/3018: table 0x7fffe4006050 (inode 0x7fffe400b6e0): all good!
[2017-07-04 23:05:08.291991] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #41 fuse_readdirp_cbk/3012: table 0x7fffe4006050 (inode 0x7fffe400bb00): all good!
[2017-07-04 23:05:08.292249] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #42 fuse_readdirp_cbk/3013: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:05:08.292322] W [inode.c:94:__dentry_hash] 0-DEBUG: table 0x7fffe4006050 (inode 0x7fffe400bb00): adding dentry to name hash at 0x7ffff7e2d080, index 7427
[2017-07-04 23:05:08.292810] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #43 fuse_readdirp_cbk/3018: table 0x7fffe4006050 (inode 0x7fffe400bb00): all good!
[2017-07-04 23:05:08.293270] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #44 fuse_readdirp_cbk/3012: table 0x7fffe4006050 (inode 0x7fffe400bf20): all good!
[2017-07-04 23:05:08.293570] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #45 fuse_readdirp_cbk/3013: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:05:08.293637] W [inode.c:94:__dentry_hash] 0-DEBUG: table 0x7fffe4006050 (inode 0x7fffe400bf20): adding dentry to name hash at 0x7ffff7e46710, index 13932
[2017-07-04 23:05:08.294232] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #46 fuse_readdirp_cbk/3018: table 0x7fffe4006050 (inode 0x7fffe400bf20): all good!
[2017-07-04 23:05:08.294882] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #47 fuse_readdirp_cbk/3012: table 0x7fffe4006050 (inode 0x7fffe400c340): all good!
[2017-07-04 23:05:08.295142] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #48 fuse_readdirp_cbk/3013: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
...
14957 [2017-07-04 23:05:08.303786] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #102 fuse_attr_cbk/852: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
14958 [2017-07-04 23:05:08.304070] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #103 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe400b6e0): all good!
14959 [2017-07-04 23:05:08.304259] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44cc0, index 13511
14960 [2017-07-04 23:05:08.304279] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44cd0, index 13512
14961 [2017-07-04 23:05:08.304292] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44cf0, index 13514
14962 [2017-07-04 23:05:08.304304] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44d00, index 13515
14963 [2017-07-04 23:05:08.304316] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44d60, index 13521
14964 [2017-07-04 23:05:08.304351] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e451c0, index 13591
14965 [2017-07-04 23:05:08.304368] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45230, index 13598
14966 [2017-07-04 23:05:08.304383] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45240, index 13599
14967 [2017-07-04 23:05:08.304398] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45250, index 13600
14968 [2017-07-04 23:05:08.304410] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45260, index 13601
14969 [2017-07-04 23:05:08.304421] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45270, index 13602
14970 [2017-07-04 23:05:08.304432] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45280, index 13603
14971 [2017-07-04 23:05:08.304442] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e1): corrupt entry in name hash at 0x7ffff7e452a0, index 13605
14972 [2017-07-04 23:05:08.304457] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45300, index 13611
14973 [2017-07-04 23:05:08.304471] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45320, index 13613
14974 [2017-07-04 23:05:08.304486] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45350, index 13616
14975 [2017-07-04 23:05:08.304498] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45370, index 13618
14976 [2017-07-04 23:05:08.304509] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453a0, index 13621
14977 [2017-07-04 23:05:08.304520] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453b0, index 13622
14978 [2017-07-04 23:05:08.304531] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453c0, index 13623
14979 [2017-07-04 23:05:08.304542] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453f0, index 13626
14980 [2017-07-04 23:05:08.304553] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45400, index 13627
14981 [2017-07-04 23:05:08.304564] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45420, index 13629
14982 [2017-07-04 23:05:08.304575] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45440, index 13631

Breaking after corruption was detected -- which is possible via the __inode_check_trap_bad hook
provided by the patch, we can investigate the degree of corruption in gdb. There is another
heuristic check that can err to the opposite direction (reporting false negatives): check if the
list_{next,prev} pointers of given list_head (the kind of struct that populates the buckets) point
to itself. If yes, that's definitely a valid entry, representing an empty list (in the context of
the hash table, meaning no item was hashed there). If no, then it can be either a corrupted
list_head or a valid non-empty list -- in case of the name_hash, the latter can be filtered out by
matching against the log entries on the hash indices of the hashed dentries. In gdb, we can perform
these investigations in an interval that contains the noticed corruptions as follows:

(gdb)  set $i=0
(gdb)  while ($i < 150)
> printf "%d %d %p ", $i, $i+13500, $inode->table->name_hash+13500+$i
> p $inode->table->name_hash[13500+$i++]
> end

This would print lines like this:

21 13521 0x7ffff7e44d60 $11 = {next = 0x0, prev = 0x7ffff51dc3b4}
22 13522 0x7ffff7e44d70 $12 = {next = 0xffffffffffffffff, prev = 0x1600000000}
23 13523 0x7ffff7e44d80 $13 = {next = 0x7ffff51dc3bb, prev = 0x7ffff7e44d80}
24 13524 0x7ffff7e44d90 $14 = {next = 0x7ffff7e44d90, prev = 0x7ffff7e44d90}
25 13525 0x7ffff7e44da0 $15 = {next = 0x7ffff7e44da0, prev = 0x7ffff7e44da0}
26 13526 0x7ffff7e44db0 $16 = {next = 0x7ffff7e44db0, prev = 0x3000000018}
27 13527 0x7ffff7e44dc0 $17 = {next = 0x7ffff7e45460, prev = 0x7ffff7e453a0}
28 13528 0x7ffff7e44dd0 $18 = {next = 0x7ffff7e44dd0, prev = 0x7ffff7e44dd0}

(here lines of index 24, 25 and 29 are sane as the addresses agree (and we checked that no reported
hash index falls into the interval), the rest are corrupt).

What makes our investigation easy is that these addresses do not vary between various runs of the
same setup (same glusterfs build tested against same directory with same glusterfs configuration).
So we can re-run the test with making use of the findings so far. In the hope of catching the
corruption "red-handed", we can use watchpoints to notice a change at the memory addresses we
collected. For example, to see when the bogus value prev = 0x3000000018 gets inserted to entry 26
(hash 13526, address 0x7ffff7e44db0), we could use the following watch directive:

(gdb) watch *0x7ffff7e44db8

It would be great to set watchpoints on all list pointers occurring in the interval we investigate.
Alas, the instruction set of the Intel architectures support only two such watchpoints. So we need
to re-run the test with various address pairs under watch, and see what is corrupt at the time of a
change in a value there. With some endurance it becomes clear that the corruption spreads from
higher addresses towards the lower ones, and thus to capture the Original Corruption we need to set
a watchpoint on the highest corrupted address -- in the example setup it was at:

133 13633 0x7ffff7e45460 $123 = {next = 0x7ffff7e45460, prev = 0x7fffec000ac0}

where prev = 0x7fffec000ac0 is bogus and thus

(gdb) watch *0x7ffff7e45468

is the relevant watch directive. With this we get:


Thread 7 "glusterfs" hit Hardware watchpoint 1: *0x7ffff7e45468

Old value = -136031136
New value = -335541568
frame_fill_groups (frame=0x7fffec000ac0) at fuse-helpers.c:150
150             xlator_t       *this          = frame->this;
(gdb) bt
#0  frame_fill_groups (frame=0x7fffec000ac0) at fuse-helpers.c:150
#1  0x00007ffff51b9938 in get_groups (priv=0x65d550, frame=0x7fffec000ac0) at fuse-helpers.c:302
#2  0x00007ffff51b9c71 in get_call_frame_for_req (state=0x7fffe0003910) at fuse-helpers.c:368
#3  0x00007ffff51bf575 in fuse_lookup_resume (state=0x7fffe0003910) at fuse-bridge.c:705
#4  0x00007ffff51bf2c4 in fuse_fop_resume (state=0x7fffe0003910) at fuse-bridge.c:666
#5  0x00007ffff51bc972 in fuse_resolve_done (state=0x7fffe0003910) at fuse-resolve.c:663
#6  0x00007ffff51bca48 in fuse_resolve_all (state=0x7fffe0003910) at fuse-resolve.c:690
#7  0x00007ffff51bc953 in fuse_resolve (state=0x7fffe0003910) at fuse-resolve.c:654
#8  0x00007ffff51bca1f in fuse_resolve_all (state=0x7fffe0003910) at fuse-resolve.c:686
#9  0x00007ffff51bcaa6 in fuse_resolve_continue (state=0x7fffe0003910) at fuse-resolve.c:706
#10 0x00007ffff51bad4e in fuse_resolve_entry_cbk (frame=0x7fffe00016c0, cookie=0x7fffe00017d0, this=0x656d90, op_ret=0, op_errno=0, inode=0x7fffdc007e80, buf=0x7ffff7e86ce0, xattr=0x0, postparent=0x7ffff7e86c70) at fuse-resolve.c:76
#11 0x00007ffff7b99862 in default_lookup_cbk (frame=0x7fffe00017d0, cookie=0x7fffec000ac0, this=0x67ad30, op_ret=0, op_errno=0, inode=0x7fffdc007e80, buf=0x7ffff7e86ce0, xdata=0x0, postparent=0x7ffff7e86c70) at defaults.c:1265
#12 0x00007ffff2537bae in mdc_lookup (frame=0x7fffec000ac0, this=0x679820, loc=0x7fffe0001930, xdata=0x0) at md-cache.c:1116
#13 0x00007ffff7ba2a9c in default_lookup_resume (frame=0x7fffe00017d0, this=0x67ad30, loc=0x7fffe0001930, xdata=0x0) at defaults.c:1872
#14 0x00007ffff7b14e66 in call_resume_wind (stub=0x7fffe00018e0) at call-stub.c:2149
#15 0x00007ffff7b23c39 in call_resume (stub=0x7fffe00018e0) at call-stub.c:2512
#16 0x00007ffff232bfd4 in iot_worker (data=0x683410) at io-threads.c:224
#17 0x00007ffff693f5ba in start_thread (arg=0x7ffff7e87700) at pthread_create.c:333
#18 0x00007ffff62177cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) l
145     #define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups from /proc */
146     void
147     frame_fill_groups (call_frame_t *frame)
148     {
149     #if defined(GF_LINUX_HOST_OS)
150             xlator_t       *this          = frame->this;
151             fuse_private_t *priv          = this->private;
152             char            filename[32];
153             char            line[4096];
154             char           *ptr           = NULL;


So in accordance with the ElectricFence hint, the corruption happens at / near to entering
frame_fill_groups(). Let's see the offending bucket entry at this point:

(gdb) fr 2
(gdb) p state->loc.inode->table->name_hash[13633]
$23 = {next = 0x7ffff7e45460, prev = 0x7fffec000ac0}

Looking at it carefully: the bogus value that overwrites the prev member of this list_head
("prev = 0x7fffec000ac0") is nothing else but the address of the frame argument of
frame_fill_groups ("#0  frame_fill_groups (frame=0x7fffec000ac0) at fuse-helpers.c:150")!
That is, the stack of thread 7, when it is extended to host frame_fill_groups()
is overflowing and what it starts to flood is the inode table.
Comment 5 Worker Ant 2017-07-05 11:57:48 EDT
REVIEW: https://review.gluster.org/17706 (Don't allocate auxiliary gid list on stack) posted (#1) for review on master by Csaba Henk (csaba@redhat.com)
Comment 6 Worker Ant 2017-07-06 06:39:20 EDT
REVIEW: https://review.gluster.org/17706 (groups: don't allocate auxiliary gid list on stack) posted (#2) for review on master by Csaba Henk (csaba@redhat.com)
Comment 7 Worker Ant 2017-07-06 14:26:11 EDT
COMMIT: https://review.gluster.org/17706 committed in master by Jeff Darcy (jeff@pl.atyp.us) 
------
commit e14ea3f5c37475e12a3b7fb7bd3165b0a4e77c51
Author: Csaba Henk <csaba@redhat.com>
Date:   Wed Jul 5 17:48:37 2017 +0200

    groups: don't allocate auxiliary gid list on stack
    
    When glusterfs wants to retrieve the list of auxiliary gids
    of a user, it typically allocates a sufficiently big gid_t
    array on stack and calls getgrouplist(3) with it. However,
    "sufficiently big" means to be of maximum supported gid list
    size, which in GlusterFS is GF_MAX_AUX_GROUPS = 64k.
    That means a 64k * sizeof(gid_t) = 256k allocation, which is
    big enough to overflow the stack in certain cases.
    
    A further observation is that stack allocation of the gid list
    brings no gain, as in all cases the content of the gid list
    eventually gets copied over to a heap allocated buffer.
    
    So we add a convenience wrapper of getgrouplist to libglusterfs
    called gf_getgrouplist which calls getgrouplist with a sufficiently
    big heap allocated buffer (it takes care of the allocation too).
    We are porting all the getgrouplist invocations to gf_getgrouplist
    and thus eliminate the huge stack allocation.
    
    BUG: 1464327
    Change-Id: Icea76d0d74dcf2f87d26cb299acc771ca3b32d2b
    Signed-off-by: Csaba Henk <csaba@redhat.com>
    Reviewed-on: https://review.gluster.org/17706
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Niels de Vos <ndevos@redhat.com>
    Reviewed-by: Amar Tumballi <amarts@redhat.com>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
Comment 8 Shyamsundar 2017-09-05 13:34:44 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.