Created attachment 1177253 [details]
"thread apply all backtrace" output for 100% CPU usage

Description of problem:

Given a distributed-replicated volume (we did not test other layouts), multiple brick processes can crash under load while the "volume status clients" command is being run and the brick ports are being probed.

Version-Release number of selected component (if applicable):

CentOS 7.2, GlusterFS 3.7.12 with the following patches:

===
Jiffin Tony Thottan (1):
      gfapi : check the value "iovec" in glfs_io_async_cbk only for read

Kaleb S KEITHLEY (1):
      build: RHEL7 unpackaged files .../hooks/S57glusterfind-delete-post.{pyc,pyo}

Kotresh HR (1):
      changelog/rpc: Fix rpc_clnt_t mem leaks

Pranith Kumar K (1):
      features/index: Exclude gfid-type for '.', '..'

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Raghavendra Talur (1):
      gfapi: update count when glfs_buf_copy is used

Ravishankar N (1):
      afr:Don't wind reads for files in metadata split-brain

Soumya Koduri (1):
      gfapi/handleops: Avoid using glfd during create
===

How reproducible:

Reliably (see below).

Steps to Reproduce:

All the actions below were performed on one node. The other node in the replica was not used (except for maintaining the replica itself), and its bricks did not crash.

1. Create a distributed-replicated (or, we suspect, any other) volume and start it.
2. Mount the volume on some client via FUSE.
3. Find out which TCP ports are used by the volume on one of the hosts where the crash is to be triggered.
4. Start nmap'ing those ports in a loop: while true; do nmap -Pn -p49163-49167 127.0.0.1; done
5. Start invoking the status command in a loop: while true; do sudo gluster volume status test; sudo gluster volume status test clients; done
6. Start generating some workload on the volume (we wrote lots of zero-filled files and stat'ed them in parallel).
7. Wait.
8. Observe one or more bricks crash on the node where the status command is run.

Actual results:

Two variants:
1. a brick could crash and generate a core file;
2. a brick could hang consuming 100% of CPU time.

Expected results:

Do not crash, of course :).

Additional info:

If a brick crashes and generates a core file, gdb gives us the following stack trace:

===
#0  0x00007fefa9f1cda1 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007fefab7d8465 in str_to_data (value=value@entry=0x66726574737562eb <Address 0x66726574737562eb out of bounds>) at dict.c:904
#2  0x00007fefab7d9e16 in dict_set_str (this=this@entry=0x7fefababa048, key=key@entry=0x7fef8c225280 "client2896.hostname", str=str@entry=0x66726574737562eb <Address 0x66726574737562eb out of bounds>) at dict.c:2224
#3  0x00007fef96d2e244 in server_priv_to_dict (this=<optimized out>, dict=0x7fefababa048) at server.c:262
#4  0x00007fefabcc311a in glusterfs_handle_brick_status (req=0x7fefad9942dc) at glusterfsd-mgmt.c:890
#5  0x00007fefab82d4a2 in synctask_wrap (old_task=<optimized out>) at syncop.c:380
#6  0x00007fefa9edd110 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
===

Additionally, we attach two compressed core files for the stack trace above.

If a brick hangs consuming 100% of CPU time: we attached to the brick process using gdb and got stack traces of all threads (see the attached "all_threads_stacktrace.log.xz" file).
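For convenience, the loops from steps 4-5 can be wrapped in a small helper script. This is only a sketch: the volume name "test" and the port range 49163-49167 come from this report, and the RUN dry-run switch is our own addition; adjust all of them for your deployment.

```shell
#!/bin/sh
# Sketch of the reproduction loops from steps 4-5 above.
# VOLNAME and PORTS are taken from this report; adjust for your setup
# ("gluster volume status VOLNAME" lists the brick ports).
VOLNAME=${VOLNAME:-test}
PORTS=${PORTS:-49163-49167}
RUN=${RUN:-}    # empty = dry run (print commands); RUN=1 = execute them

run() { if [ -n "$RUN" ]; then "$@"; else echo "$@"; fi; }

# Step 4: probe the brick ports.
scan_once() { run nmap -Pn -p"$PORTS" 127.0.0.1; }

# Step 5: query volume status, including per-client details.
status_once() {
    run gluster volume status "$VOLNAME"
    run gluster volume status "$VOLNAME" clients
}

# To reproduce, run both in endless loops, e.g.:
#   while true; do scan_once; done &
#   while true; do status_once; done &
```

The dry-run default makes it safe to sanity-check the generated commands before pointing the script at a real cluster.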
Created attachment 1177254 [details] core file 1
Created attachment 1177255 [details] core file 2
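For completeness, traces like the attached ones can be collected with stock gdb. A minimal sketch follows; the glusterfsd path, core file name and PID are illustrative, and the RUN dry-run switch is our own addition.

```shell
#!/bin/sh
# How backtraces like the attached ones can be collected with gdb.
# The brick binary path, core file name and PID are illustrative.
RUN=${RUN:-}    # empty = dry run (print the gdb command); RUN=1 = execute

run() { if [ -n "$RUN" ]; then "$@"; else echo "$@"; fi; }

# Crash variant 1: backtrace of all threads from a core file.
bt_from_core() { run gdb -batch -ex 'thread apply all backtrace' "$1" "$2"; }

# Crash variant 2 (100% CPU): attach to the live brick process.
bt_from_pid() { run gdb -batch -ex 'thread apply all backtrace' -p "$1"; }
```

Example (dry run): bt_from_core /usr/sbin/glusterfsd core.1234 prints the gdb invocation instead of executing it.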
While still investigating the issue, we got the same cores even without the file workload.
Reverting http://review.gluster.org/#/c/13658 and http://review.gluster.org/#/c/14739 as suggested by Soumya Koduri did not help: I got exactly the same crash with the same stack trace.
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.