Bug 1464327
Summary: glusterfs client crashes when reading large directory
Product: [Community] GlusterFS
Component: fuse
Reporter: Csaba Henk <csaba>
Assignee: Csaba Henk <csaba>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: mainline
CC: bugs, nbalacha, pasik
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.12.0
Type: Bug
Last Closed: 2017-09-05 17:34:44 UTC
Bug Blocks: 1452513
Description
Csaba Henk
2017-06-23 06:04:56 UTC
Further notes about reproduction:

- In the bug description the mount options were listed *incorrectly*. They should be: -oentry_timeout=0,gid_timeout=0
- It's sufficient to use a single brick volume.
- The volfile can be stripped down as follows:

volume moz-client-0
    type protocol/client
    # Following settings were just left at default
    option send-gids true
    option transport.socket.keepalive-count 9
    option transport.socket.keepalive-interval 2
    option transport.socket.keepalive-time 20
    option transport.tcp-user-timeout 0
    option password <PASSWORD>
    option username <USERNAME>
    option transport.address-family inet
    option transport-type tcp
    option remote-subvolume /bricks/moz0
    option remote-host <HOST>
    option ping-timeout 42
end-volume

volume moz-md-cache
    type performance/md-cache
    subvolumes moz-client-0
end-volume

volume moz
    type performance/io-threads
    subvolumes moz-md-cache
end-volume

- So the glusterfs command line for a manually created volfile can be like this:

# glusterfs --entry-timeout=0 --gid-timeout=0 --volfile=<FILE> <MOUNTPOINT>

# Analysis 1/2

The stack trace shown in the bug description indicates a memory corruption, but the corruption itself occurred earlier. (It's a stack overflow, as we are about to see.) The inode table comes with a hash table of dentries: table->name_hash is the array of buckets, each of which holds a list of dentry_t objects (dentries). The crash in the __dentry_grep function occurs because the list pointers are corrupt in the bucket in which a certain dentry is sought.
(gdb) l
769             return NULL;
770
771         hash = hash_dentry (parent, name, table->hashsize);
772
773         list_for_each_entry (tmp, &table->name_hash[hash], hash) {
774                 if (tmp->parent == parent && !strcmp (tmp->name, name)) {
775                         dentry = tmp;
776                         break;
777                 }
778         }
(gdb) p tmp
$1 = (dentry_t *) 0xfffffffffffffff0
(gdb) p table->name_hash[hash]
$2 = {next = 0x0, prev = 0x0}
(gdb) p (dentry_t *)((char *)(table->name_hash[hash].next) - sizeof(struct list_head))
$3 = (dentry_t *) 0xfffffffffffffff0

So it can be seen that tmp is the dentry pointer that comes as the first element in the bucket list at the hash key, and it's invalid, so the process crashes when its parent member is dereferenced. In a list, the list pointers should always be valid and form a circle. When an inode table is created in inode_table_new(), all the bucket lists are properly initialized, so they should always be valid during the lifetime of the table. That this is not the case indicates a memory corruption. So the question is where the corruption occurs.

An interesting other stack trace can be provoked if we slightly alter the reproduction instructions: run the two commands, the for loop for file creation and the ls for listing the files, simultaneously (maybe starting the ls with some delay).
Most of the time the first kind of stack trace will be seen, but sometimes the following comes up:

(gdb) bt
#0  frame_fill_groups (frame=frame@entry=0x7f44bc040bb0) at fuse-helpers.c:158
#1  0x00007f44fa08f1d6 in get_groups (frame=0x7f44bc040bb0, priv=0x7f4503405040) at fuse-helpers.c:321
#2  get_call_frame_for_req (state=state@entry=0x7f44d0004aa0) at fuse-helpers.c:366
#3  0x00007f44fa0977d0 in fuse_unlink_resume (state=0x7f44d0004aa0) at fuse-bridge.c:1631
#4  0x00007f44fa0915c5 in fuse_resolve_done (state=<optimized out>) at fuse-resolve.c:663
#5  fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:690
#6  0x00007f44fa0912d8 in fuse_resolve (state=0x7f44d0004aa0) at fuse-resolve.c:654
#7  0x00007f44fa09160e in fuse_resolve_all (state=<optimized out>) at fuse-resolve.c:686
#8  0x00007f44fa0908f3 in fuse_resolve_continue (state=state@entry=0x7f44d0004aa0) at fuse-resolve.c:706
#9  0x00007f44fa090ae7 in fuse_resolve_entry_cbk (frame=<optimized out>, cookie=<optimized out>, this=0x7f45033feef0, op_ret=0, op_errno=0, inode=0x7f44e6937830, buf=0x7f44ec6d6c60, xattr=0x0, postparent=0x7f44ec6d6cd0) at fuse-resolve.c:76
#10 0x00007f44ef9cd069 in io_stats_lookup_cbk (frame=0x7f44d0065800, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, inode=0x7f44e6937830, buf=0x7f44ec6d6c60, xdata=0x0, postparent=0x7f44ec6d6cd0) at io-stats.c:2190
#11 0x00007f4502cb14d1 in default_lookup_cbk (frame=frame@entry=0x7f44d0060840, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=0, op_errno=op_errno@entry=0, inode=0x7f44e6937830, buf=buf@entry=0x7f44ec6d6c60, xdata=0x0, postparent=postparent@entry=0x7f44ec6d6cd0) at defaults.c:1265
#12 0x00007f44efdf8933 in mdc_lookup (frame=0x7f44bc040bb0, this=<optimized out>, loc=0x7f44b8d324e0, xdata=<optimized out>) at md-cache.c:1123
#13 0x00007f4502cc5b92 in default_lookup_resume (frame=0x7f44d0060840, this=0x7f44f001d280, loc=0x7f44b8d324e0, xdata=0x0) at defaults.c:1872
#14 0x00007f4502c55b25 in call_resume (stub=0x7f44b8d32490) at call-stub.c:2508
#15 0x00007f44efbe3957 in iot_worker (data=0x7f44f002c900) at io-threads.c:220
#16 0x00007f4501a92dc5 in start_thread (arg=0x7f44ec6d7700) at pthread_create.c:308
#17 0x00007f45013d773d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) l
153         char line[4096];
154         char *ptr = NULL;
155         FILE *fp = NULL;
156         int idx = 0;
157         long int id = 0;
158         char *saveptr = NULL;
159         char *endptr = NULL;
160         int ret = 0;
161         int ngroups = FUSE_MAX_AUX_GROUPS;
162         gid_t mygroups[GF_MAX_AUX_GROUPS];

What is interesting about it is that the crash is indicated at a variable declaration, which doesn't have much "crash potential". We can explore the scenario further with Electric Fence (http://elinux.org/Electric_Fence). (In RHEL/CentOS/Fedora it's packaged as ElectricFence.) Start glusterfs in gdb, set up libefence for preloading, and then run glusterfs:

(gdb) set exec-wrapper env LD_PRELOAD=/usr/lib64/libefence.so
(gdb) run --entry-timeout=0 --gid-timeout=0 --volfile=<VOLFILE> -N --log-file=- --log-level=INFO <MOUNTPOINT>

Then, performing the reproduction steps (no need to run them in parallel this time), we'll hit this:

Thread 7 "glusterfs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7e87700 (LWP 17452)]
0x00007ffff4eb450c in frame_fill_groups (frame=<error reading variable: Cannot access memory at address 0x7ffff7e45468>) at fuse-helpers.c:148
148     {
(gdb) bt
#0  0x00007ffff4eb450c in frame_fill_groups (frame=<error reading variable: Cannot access memory at address 0x7ffff7e45468>) at fuse-helpers.c:148
#1  0x00007ffff4eb4938 in get_groups (priv=0x7ffff4de4df0, frame=0x7fffedf26f20) at fuse-helpers.c:302
...
(gdb) l
143
144
145     #define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups from /proc */
146     void
147     frame_fill_groups (call_frame_t *frame)
148     {
149     #if defined(GF_LINUX_HOST_OS)
150             xlator_t *this = frame->this;
151             fuse_private_t *priv = this->private;
152             char filename[32];

This is a crash with a very similar stack trace, but now the point of the crash is indicated at the opening brace of the offending function, and its first argument is an unreadable address. The safeguard mechanisms of libefence hint that this is the location of the corruption, and "this" can be identified as the entry to frame_fill_groups, i.e. the point when the runtime sets up the stack for calling frame_fill_groups. So a stack issue is quite likely at this point. Let's look further into frame_fill_groups!

#define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups from /proc */
void
frame_fill_groups (call_frame_t *frame)
{
#if defined(GF_LINUX_HOST_OS)
        xlator_t *this = frame->this;
        fuse_private_t *priv = this->private;
        char filename[32];
        char line[4096];
        char *ptr = NULL;
        FILE *fp = NULL;
        int idx = 0;
        long int id = 0;
        char *saveptr = NULL;
        char *endptr = NULL;
        int ret = 0;
        int ngroups = FUSE_MAX_AUX_GROUPS;
        gid_t mygroups[GF_MAX_AUX_GROUPS];

        if (priv->resolve_gids) {

There is one thing allocated on the stack that's bigger than a scalar: the mygroups array. How big is it?

/* GlusterFS's maximum supported Auxiliary GIDs */
#define GF_MAX_AUX_GROUPS 65535

(gdb) p sizeof(gid_t)
$26 = 4

So a 64k-element integer array is allocated on the stack, which means a buffer of 256k bytes. That's a likely culprit for a stack overflow. The hypothesis can be quickly tested by replacing the stack allocation with a heap one; doing so eliminates the crash, and Electric Fence is also made happy.

Notes:

- The "gid_t mygroups[GF_MAX_AUX_GROUPS]" pattern occurs in a few other places too; it's not clear why the stack overflow occurred at this particular location.
- It's also not clear why exactly the issue occurs with this configuration (io-threads and md-cache and the {entry,gid}_timeout=0 settings).
- The "culprit" is change I7ede90d0e41bcf55755cced5747fa0fb1699edb2 (https://review.gluster.org/#/q/I7ede90d0e41bcf55755cced5747fa0fb1699edb2), which is present in GlusterFS 3.8.0 and was also backported to the 3.6 and 3.7 branches.
- It's also not clear at this point what the relationship is between the stack overflow and the corruption in the inode table. In the followup comment we'll explore that further.

Created attachment 1294659 [details] Diagnostic patch

# Analysis 2/2

To explore the connection between the inode table corruption and the stack overflow in frame_fill_groups, we need to get our hands a bit dirtier. We'll add a routine that does a basic sanity check on inode tables: it runs through the bucket lists of the name_hash and checks whether any of the list_{next,prev} pointers are zero. If none of them is zero, it logs a message that the table is "all good"; otherwise it logs the offending buckets. ("All good" is euphemistic, though -- this is of course just a heuristic, as the list pointers can be corrupt in other ways than being 0.) Then we insert this sanity check at a number of locations which are hit during various fops. We also add logs that display the hash value of the dentries being added to the table. These changes are contained in the attached patch. Re-running the test with it, we get logs like this:

[2017-07-04 23:04:57.798261] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #1 fuse_getattr_resume/950: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:04:57.798683] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #2 fuse_attr_cbk/852: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:04:57.809346] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #3 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.809892] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #4 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.810612] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #5 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.811149] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #6 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.811852] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #7 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.812440] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #8 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
[2017-07-04 23:04:57.813398] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #9 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe4004730): all good!
...
[2017-07-04 23:05:08.291447] W [inode.c:94:__dentry_hash] 0-DEBUG: table 0x7fffe4006050 (inode 0x7fffe400b6e0): adding dentry to name hash at 0x7ffff7e36600, index 9819
[2017-07-04 23:05:08.291718] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #40 fuse_readdirp_cbk/3018: table 0x7fffe4006050 (inode 0x7fffe400b6e0): all good!
[2017-07-04 23:05:08.291991] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #41 fuse_readdirp_cbk/3012: table 0x7fffe4006050 (inode 0x7fffe400bb00): all good!
[2017-07-04 23:05:08.292249] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #42 fuse_readdirp_cbk/3013: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:05:08.292322] W [inode.c:94:__dentry_hash] 0-DEBUG: table 0x7fffe4006050 (inode 0x7fffe400bb00): adding dentry to name hash at 0x7ffff7e2d080, index 7427
[2017-07-04 23:05:08.292810] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #43 fuse_readdirp_cbk/3018: table 0x7fffe4006050 (inode 0x7fffe400bb00): all good!
[2017-07-04 23:05:08.293270] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #44 fuse_readdirp_cbk/3012: table 0x7fffe4006050 (inode 0x7fffe400bf20): all good!
[2017-07-04 23:05:08.293570] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #45 fuse_readdirp_cbk/3013: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:05:08.293637] W [inode.c:94:__dentry_hash] 0-DEBUG: table 0x7fffe4006050 (inode 0x7fffe400bf20): adding dentry to name hash at 0x7ffff7e46710, index 13932
[2017-07-04 23:05:08.294232] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #46 fuse_readdirp_cbk/3018: table 0x7fffe4006050 (inode 0x7fffe400bf20): all good!
[2017-07-04 23:05:08.294882] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #47 fuse_readdirp_cbk/3012: table 0x7fffe4006050 (inode 0x7fffe400c340): all good!
[2017-07-04 23:05:08.295142] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #48 fuse_readdirp_cbk/3013: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
...
[2017-07-04 23:05:08.303786] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #102 fuse_attr_cbk/852: table 0x7fffe4006050 (inode 0x7fffe4005b10): all good!
[2017-07-04 23:05:08.304070] W [fuse-bridge.c:78:inode_check] 0-DEBUG: #103 fuse_lookup_resume/704: table 0x7fffe4006050 (inode 0x7fffe400b6e0): all good!
[2017-07-04 23:05:08.304259] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44cc0, index 13511
[2017-07-04 23:05:08.304279] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44cd0, index 13512
[2017-07-04 23:05:08.304292] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44cf0, index 13514
[2017-07-04 23:05:08.304304] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44d00, index 13515
[2017-07-04 23:05:08.304316] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e44d60, index 13521
[2017-07-04 23:05:08.304351] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e451c0, index 13591
[2017-07-04 23:05:08.304368] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45230, index 13598
[2017-07-04 23:05:08.304383] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45240, index 13599
[2017-07-04 23:05:08.304398] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45250, index 13600
[2017-07-04 23:05:08.304410] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45260, index 13601
[2017-07-04 23:05:08.304421] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45270, index 13602
[2017-07-04 23:05:08.304432] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45280, index 13603
[2017-07-04 23:05:08.304442] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e1): corrupt entry in name hash at 0x7ffff7e452a0, index 13605
[2017-07-04 23:05:08.304457] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45300, index 13611
[2017-07-04 23:05:08.304471] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45320, index 13613
[2017-07-04 23:05:08.304486] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45350, index 13616
[2017-07-04 23:05:08.304498] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45370, index 13618
[2017-07-04 23:05:08.304509] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453a0, index 13621
[2017-07-04 23:05:08.304520] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453b0, index 13622
[2017-07-04 23:05:08.304531] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453c0, index 13623
[2017-07-04 23:05:08.304542] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e453f0, index 13626
[2017-07-04 23:05:08.304553] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45400, index 13627
[2017-07-04 23:05:08.304564] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45420, index 13629
[2017-07-04 23:05:08.304575] W [fuse-bridge.c:73:inode_check] 0-DEBUG: #104 fuse_entry_cbk/502: table 0x7fffe4006050 (inode 0x7fffe400b6e0): corrupt entry in name hash at 0x7ffff7e45440, index 13631

Breaking after corruption is detected -- which is possible via the __inode_check_trap_bad hook provided by the patch -- we can investigate the degree of corruption in gdb. There is another heuristic check, one that can err in the opposite direction (reporting false negatives): check whether the list_{next,prev} pointers of a given list_head (the kind of struct that populates the buckets) point to itself. If yes, that's definitely a valid entry, representing an empty list (in the context of the hash table, meaning no item was hashed there). If no, it can be either a corrupted list_head or a valid non-empty list -- in the case of the name_hash, the latter can be filtered out by matching against the log entries on the hash indices of the hashed dentries.
In gdb, we can perform these investigations over an interval that contains the noticed corruptions as follows:

(gdb) set $i=0
(gdb) while ($i < 150)
 > printf "%d %d %p ", $i, $i+13500, $inode->table->name_hash+13500+$i
 > p $inode->table->name_hash[13500+$i++]
 > end

This would print lines like these:

21 13521 0x7ffff7e44d60 $11 = {next = 0x0, prev = 0x7ffff51dc3b4}
22 13522 0x7ffff7e44d70 $12 = {next = 0xffffffffffffffff, prev = 0x1600000000}
23 13523 0x7ffff7e44d80 $13 = {next = 0x7ffff51dc3bb, prev = 0x7ffff7e44d80}
24 13524 0x7ffff7e44d90 $14 = {next = 0x7ffff7e44d90, prev = 0x7ffff7e44d90}
25 13525 0x7ffff7e44da0 $15 = {next = 0x7ffff7e44da0, prev = 0x7ffff7e44da0}
26 13526 0x7ffff7e44db0 $16 = {next = 0x7ffff7e44db0, prev = 0x3000000018}
27 13527 0x7ffff7e44dc0 $17 = {next = 0x7ffff7e45460, prev = 0x7ffff7e453a0}
28 13528 0x7ffff7e44dd0 $18 = {next = 0x7ffff7e44dd0, prev = 0x7ffff7e44dd0}

(Here the lines of index 24, 25 and 28 are sane, as the addresses agree (and we checked that no reported hash index falls into the interval); the rest are corrupt.)

What makes our investigation easy is that these addresses do not vary between runs of the same setup (the same glusterfs build tested against the same directory with the same glusterfs configuration). So we can re-run the test making use of the findings so far. In the hope of catching the corruption "red-handed", we can use watchpoints to notice a change at the memory addresses we collected. For example, to see when the bogus value prev = 0x3000000018 gets inserted into entry 26 (hash 13526, address 0x7ffff7e44db0), we could use the following watch directive:

(gdb) watch *0x7ffff7e44db8

It would be great to set watchpoints on all list pointers occurring in the interval we investigate. Alas, the instruction set of the Intel architectures supports only two such watchpoints. So we need to re-run the test with various address pairs under watch, and see what is corrupt at the time of a change in a value there.
With some endurance it becomes clear that the corruption spreads from higher addresses towards lower ones, and thus to capture the Original Corruption we need to set a watchpoint on the highest corrupted address -- in the example setup it was at:

133 13633 0x7ffff7e45460 $123 = {next = 0x7ffff7e45460, prev = 0x7fffec000ac0}

where prev = 0x7fffec000ac0 is bogus, and thus

(gdb) watch *0x7ffff7e45468

is the relevant watch directive. With this we get:

Thread 7 "glusterfs" hit Hardware watchpoint 1: *0x7ffff7e45468

Old value = -136031136
New value = -335541568
frame_fill_groups (frame=0x7fffec000ac0) at fuse-helpers.c:150
150             xlator_t *this = frame->this;
(gdb) bt
#0  frame_fill_groups (frame=0x7fffec000ac0) at fuse-helpers.c:150
#1  0x00007ffff51b9938 in get_groups (priv=0x65d550, frame=0x7fffec000ac0) at fuse-helpers.c:302
#2  0x00007ffff51b9c71 in get_call_frame_for_req (state=0x7fffe0003910) at fuse-helpers.c:368
#3  0x00007ffff51bf575 in fuse_lookup_resume (state=0x7fffe0003910) at fuse-bridge.c:705
#4  0x00007ffff51bf2c4 in fuse_fop_resume (state=0x7fffe0003910) at fuse-bridge.c:666
#5  0x00007ffff51bc972 in fuse_resolve_done (state=0x7fffe0003910) at fuse-resolve.c:663
#6  0x00007ffff51bca48 in fuse_resolve_all (state=0x7fffe0003910) at fuse-resolve.c:690
#7  0x00007ffff51bc953 in fuse_resolve (state=0x7fffe0003910) at fuse-resolve.c:654
#8  0x00007ffff51bca1f in fuse_resolve_all (state=0x7fffe0003910) at fuse-resolve.c:686
#9  0x00007ffff51bcaa6 in fuse_resolve_continue (state=0x7fffe0003910) at fuse-resolve.c:706
#10 0x00007ffff51bad4e in fuse_resolve_entry_cbk (frame=0x7fffe00016c0, cookie=0x7fffe00017d0, this=0x656d90, op_ret=0, op_errno=0, inode=0x7fffdc007e80, buf=0x7ffff7e86ce0, xattr=0x0, postparent=0x7ffff7e86c70) at fuse-resolve.c:76
#11 0x00007ffff7b99862 in default_lookup_cbk (frame=0x7fffe00017d0, cookie=0x7fffec000ac0, this=0x67ad30, op_ret=0, op_errno=0, inode=0x7fffdc007e80, buf=0x7ffff7e86ce0, xdata=0x0, postparent=0x7ffff7e86c70) at defaults.c:1265
#12 0x00007ffff2537bae in mdc_lookup (frame=0x7fffec000ac0, this=0x679820, loc=0x7fffe0001930, xdata=0x0) at md-cache.c:1116
#13 0x00007ffff7ba2a9c in default_lookup_resume (frame=0x7fffe00017d0, this=0x67ad30, loc=0x7fffe0001930, xdata=0x0) at defaults.c:1872
#14 0x00007ffff7b14e66 in call_resume_wind (stub=0x7fffe00018e0) at call-stub.c:2149
#15 0x00007ffff7b23c39 in call_resume (stub=0x7fffe00018e0) at call-stub.c:2512
#16 0x00007ffff232bfd4 in iot_worker (data=0x683410) at io-threads.c:224
#17 0x00007ffff693f5ba in start_thread (arg=0x7ffff7e87700) at pthread_create.c:333
#18 0x00007ffff62177cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) l
145     #define FUSE_MAX_AUX_GROUPS 32 /* We can get only up to 32 aux groups from /proc */
146     void
147     frame_fill_groups (call_frame_t *frame)
148     {
149     #if defined(GF_LINUX_HOST_OS)
150             xlator_t *this = frame->this;
151             fuse_private_t *priv = this->private;
152             char filename[32];
153             char line[4096];
154             char *ptr = NULL;

So, in accordance with the Electric Fence hint, the corruption happens at or near the entry of frame_fill_groups(). Let's look at the offending bucket entry at this point:

(gdb) fr 2
(gdb) p state->loc.inode->table->name_hash[13633]
$23 = {next = 0x7ffff7e45460, prev = 0x7fffec000ac0}

Looking at it carefully: the bogus value that overwrites the prev member of this list_head ("prev = 0x7fffec000ac0") is nothing else but the address of the frame argument of frame_fill_groups ("#0 frame_fill_groups (frame=0x7fffec000ac0) at fuse-helpers.c:150")! That is, when the stack of thread 7 is extended to host frame_fill_groups(), it overflows, and what it starts to flood is the inode table.
REVIEW: https://review.gluster.org/17706 (Don't allocate auxiliary gid list on stack) posted (#1) for review on master by Csaba Henk (csaba)

REVIEW: https://review.gluster.org/17706 (groups: don't allocate auxiliary gid list on stack) posted (#2) for review on master by Csaba Henk (csaba)

COMMIT: https://review.gluster.org/17706 committed in master by Jeff Darcy (jeff.us)
------
commit e14ea3f5c37475e12a3b7fb7bd3165b0a4e77c51
Author: Csaba Henk <csaba>
Date:   Wed Jul 5 17:48:37 2017 +0200

    groups: don't allocate auxiliary gid list on stack

    When glusterfs wants to retrieve the list of auxiliary gids of a user,
    it typically allocates a sufficiently big gid_t array on stack and calls
    getgrouplist(3) with it. However, "sufficiently big" means to be of
    maximum supported gid list size, which in GlusterFS is
    GF_MAX_AUX_GROUPS = 64k. That means a 64k * sizeof(gid_t) = 256k
    allocation, which is big enough to overflow the stack in certain cases.

    A further observation is that stack allocation of the gid list brings
    no gain, as in all cases the content of the gid list eventually gets
    copied over to a heap allocated buffer.

    So we add a convenience wrapper of getgrouplist to libglusterfs called
    gf_getgrouplist which calls getgrouplist with a sufficiently big heap
    allocated buffer (it takes care of the allocation too). We are porting
    all the getgrouplist invocations to gf_getgrouplist and thus eliminate
    the huge stack allocation.

    BUG: 1464327
    Change-Id: Icea76d0d74dcf2f87d26cb299acc771ca3b32d2b
    Signed-off-by: Csaba Henk <csaba>
    Reviewed-on: https://review.gluster.org/17706
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
    Reviewed-by: Amar Tumballi <amarts>
    CentOS-regression: Gluster Build System <jenkins.org>

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.
glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/