Created attachment 1177299 [details]
"thread apply all backtrace" output for 0% CPU usage

Description of problem:

Given a distributed-replicated volume (we did not test other layouts), multiple brick processes can crash under load while the brick TCP ports are being probed.

Version-Release number of selected component (if applicable):

CentOS 7.2, GlusterFS 3.7.12 with the following patches:

===
Jiffin Tony Thottan (1):
      gfapi : check the value "iovec" in glfs_io_async_cbk only for read

Kaleb S KEITHLEY (1):
      build: RHEL7 unpackaged files .../hooks/S57glusterfind-delete-post.{pyc,pyo}

Kotresh HR (1):
      changelog/rpc: Fix rpc_clnt_t mem leaks

Pranith Kumar K (1):
      features/index: Exclude gfid-type for '.', '..'

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Raghavendra Talur (1):
      gfapi: update count when glfs_buf_copy is used

Ravishankar N (1):
      afr: Don't wind reads for files in metadata split-brain

Soumya Koduri (1):
      gfapi/handleops: Avoid using glfd during create
===

How reproducible:

Reliably (see below).

Steps to Reproduce:

All the actions below were performed on one node. The other node in the replica was not used (except for maintaining the replica itself), and the bricks there did not crash.

1. create a distributed-replicated (or, we suspect, any other) volume and start it;
2. mount the volume on some client via FUSE;
3. find out which TCP ports are used by the volume on one of the hosts where the crash is to be triggered;
4. start nmap'ing those ports in a loop: "while true; do nmap -Pn -p49163-49167 127.0.0.1; done";
5. start generating some workload on the volume (we wrote lots of zero-filled files, stat'ed them and removed them in parallel; see the sketch below);
6. ...wait...
7. observe one or more bricks crash on the node whose TCP ports are being probed.

Actual results:

Two variants:
1. the brick crashes and generates a core file;
2. the brick hangs, consuming 0% or 100% of CPU time.

Expected results:

Do not crash, of course :).
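For step 5, the workload was nothing special; a minimal sketch of the same kind of load (mount point, worker count and file sizes are placeholders, not the exact script we used) could look like:

===
#!/bin/bash
# Hypothetical load generator: create, stat and remove zero-filled files
# in parallel on the FUSE mount point. Stop it with Ctrl-C.
MNT=/mnt/testvol            # placeholder mount point
for i in $(seq 1 8); do     # 8 parallel workers (arbitrary)
    (
        while true; do
            f="$MNT/load-$i-$RANDOM"
            dd if=/dev/zero of="$f" bs=4k count=16 2>/dev/null
            stat "$f" >/dev/null
            rm -f "$f"
        done
    ) &
done
wait
===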
Additional info:

If the brick crashes generating a core file, gdb gives us the following stacktrace:

===
#0  0x00007fc6cb66ebd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fc6cc80755d in gf_log_set_log_buf_size (buf_size=buf_size@entry=0) at logging.c:256
#2  0x00007fc6cc8076f7 in gf_log_disable_suppression_before_exit (ctx=0x7fc6cdc51010) at logging.c:428
#3  0x00007fc6cc829775 in gf_print_trace (signum=11, ctx=0x7fc6cdc51010) at common-utils.c:579
#4  <signal handler called>
#5  0x00007fc6cc82d149 in __inode_ctx_free (inode=inode@entry=0x7fc6b4050c74) at inode.c:336
#6  0x00007fc6cc82e1c7 in __inode_destroy (inode=0x7fc6b4050c74) at inode.c:358
#7  inode_table_prune (table=table@entry=0x7fc6b80cea00) at inode.c:1540
#8  0x00007fc6cc82e454 in inode_unref (inode=0x7fc6b4050c74) at inode.c:529
#9  0x00007fc6cc841354 in fd_destroy (bound=_gf_true, fd=0x7fc6b80cee20) at fd.c:537
#10 fd_unref (fd=0x7fc6b80cee20) at fd.c:573
#11 0x00007fc6b7ddd397 in server3_3_releasedir (req=0x7fc6b6058190) at server-rpc-fops.c:4072
#12 0x00007fc6cc5cc6ab in rpcsvc_handle_rpc_call (svc=0x7fc6b8030080, trans=trans@entry=0x7fc6b80e9f10, msg=0x7fc6b80f44f0) at rpcsvc.c:705
#13 0x00007fc6cc5cc87b in rpcsvc_notify (trans=0x7fc6b80e9f10, mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at rpcsvc.c:799
#14 0x00007fc6cc5ce7c3 in rpc_transport_notify (this=this@entry=0x7fc6b80e9f10, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fc6b80f44f0) at rpc-transport.c:546
#15 0x00007fc6c14959b4 in socket_event_poll_in (this=this@entry=0x7fc6b80e9f10) at socket.c:2353
#16 0x00007fc6c14985f4 in socket_event_handler (fd=fd@entry=15, idx=idx@entry=6, data=0x7fc6b80e9f10, poll_in=1, poll_out=0, poll_err=0) at socket.c:2466
#17 0x00007fc6cc872e6a in event_dispatch_epoll_handler (event=0x7fc6bf7bae80, event_pool=0x7fc6cdc70290) at event-epoll.c:575
#18 event_dispatch_epoll_worker (data=0x7fc6cdcbcd70) at event-epoll.c:678
#19 0x00007fc6cb66cdc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fc6cafb1ced in clone () from /lib64/libc.so.6
===

Additionally, we attach the corresponding core for the stacktrace above.

If the brick hangs consuming 0% of CPU time, we attached gdb to the brick process and got stacktraces of all threads (see the attached "all_threads_stacktrace.log.xz" file).
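For reference, the all-threads dump of a hung brick can be taken non-interactively along these lines (the pgrep filter is a rough placeholder; pick the PID of the actual hung brick):

===
# Attach gdb to the hung brick process and dump every thread's backtrace.
BRICK_PID=$(pgrep -f glusterfsd | head -n1)   # placeholder: choose the right brick
gdb -p "$BRICK_PID" -batch \
    -ex "thread apply all backtrace" > all_threads_stacktrace.log
xz all_threads_stacktrace.log
===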
Created attachment 1177300 [details]
core file 1
Reverting http://review.gluster.org/#/c/13658 and http://review.gluster.org/#/c/14739 as suggested by Soumya Koduri did not help: I got exactly the same crash with the same stacktrace.
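For anyone who wants to repeat this, reverting a Gerrit change on top of a local source tree looks roughly like the following sketch; the patchset numbers ("/1") are guesses, so check the review pages for the actual latest patchset:

===
# Fetch each change from Gerrit and apply its inverse on the local tree.
# Ref format is refs/changes/<last-2-digits>/<change>/<patchset>;
# the "/1" patchset numbers below are guesses.
git fetch https://review.gluster.org/glusterfs refs/changes/58/13658/1
git revert --no-edit FETCH_HEAD
git fetch https://review.gluster.org/glusterfs refs/changes/39/14739/1
git revert --no-edit FETCH_HEAD
===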
Another bunch of cores I got during extra tests.

=== core.27675 ===
#0  0x00007fb97754abd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fb9786e36fd in gf_log_set_log_buf_size (buf_size=buf_size@entry=0) at logging.c:256
#2  0x00007fb9786e3897 in gf_log_disable_suppression_before_exit (ctx=0x7fb97a3f1010) at logging.c:428
#3  0x00007fb978705915 in gf_print_trace (signum=11, ctx=0x7fb97a3f1010) at common-utils.c:579
#4  <signal handler called>
#5  0x00007fb978709449 in __inode_ctx_free (inode=inode@entry=0x7fb933ed16d8) at inode.c:336
#6  0x00007fb97870a4c7 in __inode_destroy (inode=0x7fb933ed16d8) at inode.c:358
#7  inode_table_prune (table=table@entry=0x7fb9640ce160) at inode.c:1540
#8  0x00007fb97870a754 in inode_unref (inode=0x7fb933ed16d8) at inode.c:529
#9  0x00007fb9786de152 in loc_wipe (loc=loc@entry=0x7fb975c91e14) at xlator.c:695
#10 0x00007fb97871c794 in call_stub_wipe_args (stub=0x7fb975c91dd4) at call-stub.c:2511
#11 call_stub_destroy (stub=0x7fb975c91dd4) at call-stub.c:2550
#12 0x00007fb968eb4363 in iot_worker (data=0x7fb964055bf0) at io-threads.c:215
#13 0x00007fb977548dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fb976e8dced in clone () from /lib64/libc.so.6
===

=== core.8362 ===
#0  0x00007f8613718bdd in __gf_free (free_ptr=0x7f86000ebbd0) at mem-pool.c:313
#1  0x00007f85fef6ffa5 in io_stats_release (this=0x7f860001dc70, fd=0x7f85f8003d14) at io-stats.c:2540
#2  0x00007f86137166c7 in fd_destroy (bound=_gf_true, fd=0x7f85f8003d14) at fd.c:520
#3  fd_unref (fd=0x7f85f8003d14) at fd.c:573
#4  0x00007f85fed3f227 in server3_3_release (req=0x7f85fdfba06c) at server-rpc-fops.c:4030
#5  0x00007f86134a16ab in rpcsvc_handle_rpc_call (svc=0x7f860002f8b0, trans=trans@entry=0x7f86000d9d80, msg=0x7f85f8038120) at rpcsvc.c:705
#6  0x00007f86134a187b in rpcsvc_notify (trans=0x7f86000d9d80, mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at rpcsvc.c:799
#7  0x00007f86134a37c3 in rpc_transport_notify (this=this@entry=0x7f86000d9d80, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f85f8038120) at rpc-transport.c:546
#8  0x00007f860836a654 in socket_event_poll_in (this=this@entry=0x7f86000d9d80) at socket.c:2245
#9  0x00007f860836d294 in socket_event_handler (fd=fd@entry=20, idx=idx@entry=9, data=0x7f86000d9d80, poll_in=1, poll_out=0, poll_err=0) at socket.c:2358
#10 0x00007f86137481ca in event_dispatch_epoll_handler (event=0x7f85fed21e80, event_pool=0x7f8614d02290) at event-epoll.c:575
#11 event_dispatch_epoll_worker (data=0x7f86000205d0) at event-epoll.c:678
#12 0x00007f8612541dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f8611e86ced in clone () from /lib64/libc.so.6
===

=== core.8420 ===
#0  0x00007f1404043fa8 in ?? ()
#1  0x00007f141f83c7c3 in rpc_transport_notify (this=this@entry=0x7f14040434f0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f14040434f0) at rpc-transport.c:546
#2  0x00007f1414706242 in socket_event_poll_err (this=0x7f14040434f0) at socket.c:1158
#3  socket_event_handler (fd=fd@entry=25, idx=idx@entry=10, data=0x7f14040434f0, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2365
#4  0x00007f141fae11ca in event_dispatch_epoll_handler (event=0x7f140b150e80, event_pool=0x7f1421eb6290) at event-epoll.c:575
#5  event_dispatch_epoll_worker (data=0x7f140c0205d0) at event-epoll.c:678
#6  0x00007f141e8dadc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f141e21fced in clone () from /lib64/libc.so.6
===

=== core.8439 ===
#0  0x00007f3a7b79bc10 in list_del_init (old=0x7f3a74027e18) at ../../../../libglusterfs/src/list.h:88
#1  server_rpc_notify (rpc=<optimized out>, xl=0x7f3a7c01f1e0, event=<optimized out>, data=0x7f3a74027360) at server.c:542
#2  0x00007f3a8ffec4bf in rpcsvc_handle_disconnect (svc=0x7f3a7c02f8b0, trans=trans@entry=0x7f3a74027360) at rpcsvc.c:756
#3  0x00007f3a8ffee830 in rpcsvc_notify (trans=0x7f3a74027360, mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at rpcsvc.c:794
#4  0x00007f3a8fff07c3 in rpc_transport_notify (this=this@entry=0x7f3a74027360, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f3a74027360) at rpc-transport.c:546
#5  0x00007f3a84eba242 in socket_event_poll_err (this=0x7f3a74027360) at socket.c:1158
#6  socket_event_handler (fd=fd@entry=27, idx=idx@entry=10, data=0x7f3a74027360, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2365
#7  0x00007f3a902951ca in event_dispatch_epoll_handler (event=0x7f3a831dce80, event_pool=0x7f3a90eb6290) at event-epoll.c:575
#8  event_dispatch_epoll_worker (data=0x7f3a90f02d70) at event-epoll.c:678
#9  0x00007f3a8f08edc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3a8e9d3ced in clone () from /lib64/libc.so.6
===

Attaching xz-compressed core files to this ticket. Please note that these cores were obtained running v3.7.13 with the following patches applied:

===
Kaleb S KEITHLEY (1):
      build: RHEL7 unpackaged files .../hooks/S57glusterfind-delete-post.{pyc,pyo}

Kotresh HR (1):
      changelog/rpc: Fix rpc_clnt_t mem leaks

Mohit Agrawal (1):
      rpc/socket.c: Modify approach to cleanup threads of socket_poller in socket_spawn.

N Balachandran (1):
      rpc/socket: pthread resources are not cleaned up

Niels de Vos (1):
      rpc: invalid argument when function setsockopt sets option TCP_USER_TIMEOUT

Pranith Kumar K (1):
      features/index: Exclude gfid-type for '.', '..'

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks
===

Also, please note that core.27675 is the coredump of a brick of the volume with the following options (applied roughly as shown in the sketch below):

===
nfs.disable: off
network.ping-timeout: 10
storage.linux-aio: on
server.event-threads: 4
performance.write-behind: on
performance.write-behind-window-size: 4194304
performance.stat-prefetch: on
performance.read-ahead: on
performance.quick-read: on
performance.io-thread-count: 2
performance.flush-behind: on
performance.client-io-threads: off
performance.cache-size: 33554432
performance.cache-max-file-size: 1048576
network.inode-lru-limit: 4096
features.cache-invalidation: on
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.data-self-heal: off
cluster.readdir-optimize: on
cluster.lookup-optimize: on
client.event-threads: 2
performance.readdir-ahead: on
===

The rest of the cores were obtained from bricks of a volume with default options.
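Those non-default options were set with the standard CLI; a short sketch (the volume name is a placeholder, and only a few of the options are shown):

===
# Apply a few of the non-default options listed above; repeat the
# "volume set" line for the rest, then verify. Volume name is hypothetical.
VOL=testvol
gluster volume set "$VOL" network.ping-timeout 10
gluster volume set "$VOL" performance.io-thread-count 2
gluster volume set "$VOL" network.inode-lru-limit 4096
gluster volume set "$VOL" cluster.lookup-optimize on
gluster volume info "$VOL"
===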
Created attachment 1181462 [details]
core.8362
Created attachment 1181463 [details]
core.8420
Created attachment 1181464 [details]
core.8439
Created attachment 1181465 [details]
core.27675
A new bunch of cores, re-tested with 3.7.14 with the following patches:

===
Aravinda VK (1):
      packaging: Remove ".py" extension from symlink target

Atin Mukherjee (1):
      rpc : build_prog_details should iterate program list inside critical section

Jiffin Tony Thottan (2):
      gfapi : Avoid double freeing of dict in glfs_*_*getxattr
      xlator/trash : append '/' at the end in trash_notify_lookup_cbk

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Soumya Koduri (2):
      glfs/upcall: entries should be removed under mutex lock
      gfapi/upcall: Fix a ref leak

Susant Palai (1):
      posix: fix posix_fgetxattr to return the correct error
===

core.25072:
===
#0  0x00007f0e0d0875f7 in raise () from /lib64/libc.so.6
#1  0x00007f0e0d088ce8 in abort () from /lib64/libc.so.6
#2  0x00007f0e0d0c7327 in __libc_message () from /lib64/libc.so.6
#3  0x00007f0e0d0cf053 in _int_free () from /lib64/libc.so.6
#4  0x00007f0e0e993b29 in dict_destroy (this=0x7f0e0ec76e18) at dict.c:565
#5  0x00007f0e0e9d7974 in call_stub_wipe_args (stub=0x7f0e0bf4b06c) at call-stub.c:2533
#6  call_stub_destroy (stub=0x7f0e0bf4b06c) at call-stub.c:2550
#7  0x00007f0dfb166363 in iot_worker (data=0x7f0dfc055de0) at io-threads.c:215
#8  0x00007f0e0d803dc5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f0e0d148ced in clone () from /lib64/libc.so.6
===

core.25078:
===
#0  0x00007f336bbe55f7 in raise () from /lib64/libc.so.6
#1  0x00007f336bbe6ce8 in abort () from /lib64/libc.so.6
#2  0x00007f336bc25327 in __libc_message () from /lib64/libc.so.6
#3  0x00007f336bc2d053 in _int_free () from /lib64/libc.so.6
#4  0x00007f336219080f in fini (this=0x7f334400d7d0) at socket.c:4215
#5  0x00007f336d2c35cd in rpc_transport_destroy (this=this@entry=0x7f334400d7d0) at rpc-transport.c:471
#6  0x00007f336d2c3767 in rpc_transport_unref (this=this@entry=0x7f334400d7d0) at rpc-transport.c:529
#7  0x00007f336218d24a in socket_event_handler (fd=fd@entry=20, idx=idx@entry=8, data=0x7f334400d7d0, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:2366
#8  0x00007f336d56832a in event_dispatch_epoll_handler (event=0x7f335cbf4e80, event_pool=0x7f336dfb4300) at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7f3358020870) at event-epoll.c:678
#10 0x00007f336c361dc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f336bca6ced in clone () from /lib64/libc.so.6
===

core.6484 (similar to the ones introduced previously):
===
#0  0x00007fcec4c5fbd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fcec5df885d in gf_log_set_log_buf_size (buf_size=buf_size@entry=0) at logging.c:256
#2  0x00007fcec5df89f7 in gf_log_disable_suppression_before_exit (ctx=0x7fcec7b60010) at logging.c:428
#3  0x00007fcec5e1aa75 in gf_print_trace (signum=11, ctx=0x7fcec7b60010) at common-utils.c:579
#4  <signal handler called>
#5  0x00007fcec5e1e5a9 in __inode_ctx_free (inode=inode@entry=0x7fce8d659ac4) at inode.c:336
#6  0x00007fcec5e1f627 in __inode_destroy (inode=0x7fce8d659ac4) at inode.c:358
#7  inode_table_prune (table=table@entry=0x7fceb40d06a0) at inode.c:1540
#8  0x00007fcec5e1f8b4 in inode_unref (inode=0x7fce8d659ac4) at inode.c:529
#9  0x00007fcec5e327b4 in fd_destroy (bound=_gf_true, fd=0x7fceb40d0b2c) at fd.c:537
#10 fd_unref (fd=0x7fceb40d0b2c) at fd.c:573
#11 0x00007fceb139e397 in server3_3_releasedir (req=0x7fceb8010190) at server-rpc-fops.c:4072
#12 0x00007fcec5bbd6db in rpcsvc_handle_rpc_call (svc=0x7fceb4030130, trans=trans@entry=0x7fceb40cda10, msg=0x7fcea4009bb0) at rpcsvc.c:705
#13 0x00007fcec5bbd8ab in rpcsvc_notify (trans=0x7fceb40cda10, mydata=<optimized out>, event=<optimized out>, data=<optimized out>) at rpcsvc.c:799
#14 0x00007fcec5bbf7f3 in rpc_transport_notify (this=this@entry=0x7fceb40cda10, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7fcea4009bb0) at rpc-transport.c:546
#15 0x00007fcebaa86654 in socket_event_poll_in (this=this@entry=0x7fceb40cda10) at socket.c:2245
#16 0x00007fcebaa89294 in socket_event_handler (fd=fd@entry=10, idx=idx@entry=5, data=0x7fceb40cda10, poll_in=1, poll_out=0, poll_err=0) at socket.c:2358
#17 0x00007fcec5e6432a in event_dispatch_epoll_handler (event=0x7fceb037ee80, event_pool=0x7fcec7b7f300) at event-epoll.c:575
#18 event_dispatch_epoll_worker (data=0x7fceb4002ac0) at event-epoll.c:678
#19 0x00007fcec4c5ddc5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fcec45a2ced in clone () from /lib64/libc.so.6
===

Attaching xz-compressed corefiles as well; see below for how to open them.
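To inspect one of the attached cores, decompress it and load it against a glusterfsd binary from the exact same build; a sketch (the binary path is the usual CentOS location, adjust to match):

===
# Decompress an attached core and print its backtraces. The glusterfsd
# binary must come from the same GlusterFS build that produced the core.
xz -dk core.6484.xz
gdb /usr/sbin/glusterfsd core.6484 -batch \
    -ex "bt full" \
    -ex "thread apply all backtrace"
===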
Created attachment 1187470 [details]
core.6484
Created attachment 1187471 [details]
core.25072
Created attachment 1187472 [details]
core.25078
Tried http://review.gluster.org/#/c/15087/ as suggested by Soumya, but got the same crash.

core.18159:
===
#0  0x00007ff5478c8bd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007ff548a6185d in gf_log_set_log_buf_size (buf_size=buf_size@entry=0) at logging.c:256
#2  0x00007ff548a619f7 in gf_log_disable_suppression_before_exit (ctx=0x7ff5494f8010) at logging.c:428
#3  0x00007ff548a83a75 in gf_print_trace (signum=11, ctx=0x7ff5494f8010) at common-utils.c:579
#4  <signal handler called>
#5  0x00007ff548a875a9 in __inode_ctx_free (inode=inode@entry=0x7ff51027f8f0) at inode.c:336
#6  0x00007ff548a8864f in __inode_destroy (inode=0x7ff51027f8f0) at inode.c:358
#7  inode_table_prune (table=table@entry=0x7ff5340cf8b0) at inode.c:1548
#8  0x00007ff548a88944 in inode_unref (inode=0x7ff51027f8f0) at inode.c:529
#9  0x00007ff548a5c2b2 in loc_wipe (loc=loc@entry=0x7ff546010cc8) at xlator.c:695
#10 0x00007ff548a9a994 in call_stub_wipe_args (stub=0x7ff546010c88) at call-stub.c:2511
#11 call_stub_destroy (stub=0x7ff546010c88) at call-stub.c:2550
#12 0x00007ff539232363 in iot_worker (data=0x7ff534055c20) at io-threads.c:215
#13 0x00007ff5478c6dc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007ff54720bced in clone () from /lib64/libc.so.6
===
Created attachment 1187491 [details]
core.18159
This bug is being closed because GlusterFS 3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen it against the newer release.