+++ This bug was initially created as a clone of Bug #1346549 +++

Description of problem:

Create a volume like this:

Volume Name: test
Type: Distributed-Disperse
Volume ID: 78bd1b85-cfe9-401e-ac1e-dc9e072ed4db
Status: Started
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: node-1:/disk1
Brick2: node-2:/disk1
Brick3: node-3:/disk1
Brick4: node-1:/disk2
Brick5: node-2:/disk2
Brick6: node-3:/disk2
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on

Then I unmounted /disk{1..3} and set /disk{1..3} read-only. I made several attempts to run "gluster vol start test force", and sometimes glusterfsd crashed.

glusterfsd's log:

[2016-06-15 09:44:20.567687] I [MSGID: 100030] [glusterfsd.c:2338:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.12 (args: /usr/sbin/glusterfsd -s node-1 --volfile-id test.node-1.disk2 -p /var/lib/glusterd/vols/test/run/node-1-disk2.pid -S /var/run/gluster/bddd1d1330cb529b05a3a9266879baee.socket --brick-name /disk2 -l /var/log/glusterfs/bricks/disk2.log --xlator-option *-posix.glusterd-uuid=dee1dcb8-280b-4b4c-b5a6-6ad7dbd0360a --brick-port 49153 --xlator-option test-server.listen-port=49153)
[2016-06-15 09:44:20.575048] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-06-15 09:44:20.580116] I [graph.c:269:gf_add_cmdline_options] 0-test-server: adding option 'listen-port' for volume 'test-server' with value '49153'
[2016-06-15 09:44:20.580187] I [graph.c:269:gf_add_cmdline_options] 0-test-posix: adding option 'glusterd-uuid' for volume 'test-posix' with value 'dee1dcb8-280b-4b4c-b5a6-6ad7dbd0360a'
[2016-06-15 09:44:20.580607] I [MSGID: 115034] [server.c:403:_check_for_auth_option] 0-/disk2: skip format check for non-addr auth option auth.login./disk2.allow
[2016-06-15 09:44:20.580765] I [MSGID: 115034] [server.c:403:_check_for_auth_option] 0-/disk2: skip format check for non-addr auth option auth.login.8306814a-3bf6-49b0-b75a-95665c2ba483.password
[2016-06-15 09:44:20.582297] I [rpcsvc.c:2196:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2016-06-15 09:44:20.582499] W [MSGID: 101002] [options.c:957:xl_opt_validate] 0-test-server: option 'listen-port' is deprecated, preferred is 'transport.socket.listen-port', continuing with correction
[2016-06-15 09:44:20.583012] W [socket.c:3759:reconfigure] 0-test-quota: NBIO on -1 failed (Bad file descriptor)
[2016-06-15 09:44:20.583141] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2016-06-15 09:44:20.588207] E [index.c:188:index_dir_create] 0-test-index: /disk2/.glusterfs/indices/xattrop: Failed to create (Permission denied)
[2016-06-15 09:44:20.588401] E [MSGID: 101019] [xlator.c:435:xlator_init] 0-test-index: Initialization of volume 'test-index' failed, review your volfile again
[2016-06-15 09:44:20.588512] E [graph.c:322:glusterfs_graph_init] 0-test-index: initializing translator failed
[2016-06-15 09:44:20.588613] E [graph.c:662:glusterfs_graph_activate] 0-graph: init failed
[2016-06-15 09:44:20.590554] W [glusterfsd.c:1251:cleanup_and_exit] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x307) [0x40dbe7] -->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x13a) [0x408c7a] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x5f) [0x40831f] ) 0-: received signum (1), shutting down
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset:
git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-06-15 09:44:20
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1

gdb bt:

Core was generated by `/usr/sbin/glusterfsd -s node-1 --volfile-id test.node-1.disk2 -p /var/lib/glust'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd73c6ff688 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) bt
#0  0x00007fd73c6ff688 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007fd73c7006f8 in _Unwind_Backtrace () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#2  0x00007fd7418dae26 in __GI___backtrace (array=array@entry=0x7fd735bbfb80, size=size@entry=200) at ../sysdeps/x86_64/backtrace.c:109
#3  0x00007fd742411ea2 in _gf_msg_backtrace_nomem (level=level@entry=GF_LOG_ALERT, stacksize=stacksize@entry=200) at logging.c:1095
#4  0x00007fd74243713d in gf_print_trace (signum=11, ctx=0x2049010) at common-utils.c:615
#5  <signal handler called>
#6  0x00007fd73644adb0 in ?? ()
#7  0x00007fd7421df8e4 in rpc_clnt_notify (trans=<optimized out>, mydata=0x7fd73803ef80, event=<optimized out>, data=0x7fd7380420f0) at rpc-clnt.c:957
#8  0x00007fd7421db593 in rpc_transport_notify (this=this@entry=0x7fd7380420f0, event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x7fd7380420f0) at rpc-transport.c:546
#9  0x00007fd73d579f8f in socket_connect_finish (this=this@entry=0x7fd7380420f0) at socket.c:2429
#10 0x00007fd73d57a3af in socket_event_handler (fd=fd@entry=12, idx=idx@entry=3, data=0x7fd7380420f0, poll_in=0, poll_out=4, poll_err=0) at socket.c:2459
#11 0x00007fd74247f9fa in event_dispatch_epoll_handler (event=0x7fd735bc0e90, event_pool=0x2067da0) at event-epoll.c:575
#12 event_dispatch_epoll_worker (data=0x7fd73801e670) at event-epoll.c:678
#13 0x00007fd741b9d182 in start_thread (arg=0x7fd735bc1700) at pthread_create.c:312
#14 0x00007fd7418ca47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

Version-Release number of selected component (if applicable):
glusterfs 3.7.12 (per the log above)

How reproducible:
Intermittent; several attempts of "gluster vol start test force" may be needed.

Steps to Reproduce:
1. Create and start the Distributed-Disperse volume described above, with quota enabled.
2. Unmount /disk{1..3} on the brick nodes and set them read-only.
3. Run "gluster vol start test force" several times.

Actual results:
glusterfsd sometimes crashes with SIGSEGV while shutting down.

Expected results:
The brick process fails to start cleanly, without crashing.

Additional info:

--- Additional comment from jiademing.dd on 2016-06-14 22:04:29 EDT ---

After analysis: rpc_clnt_notify() will call quota_enforcer_notify(), because quota registers that callback via rpc_clnt_register_notify(rpc, quota_enforcer_notify, this). On exit, glusterfsd calls glusterfs_graph_destroy(), which dlclose()s each xlator's dlhandle. So if dlclose(xl->dlhandle) runs before rpc_clnt_notify() fires, the pointer to quota.so's quota_enforcer_notify() is no longer valid, and the call crashes.
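To make the mechanism concrete, here is a minimal standalone C sketch, not GlusterFS code: the library name libquota.so and the exported symbol are stand-ins. It shows the pattern described above, where a callback pointer taken from a shared object is invoked after the object has been dlclose()d, so the call lands in unmapped text, just like frame #6 in the backtrace.

    /* build: gcc -o uaf uaf.c -ldl
     * assumes a ./libquota.so exporting quota_enforcer_notify;
     * both names are hypothetical stand-ins */
    #include <dlfcn.h>
    #include <stdio.h>

    typedef void (*notify_fn_t)(void);

    int main(void)
    {
        void *handle = dlopen("./libquota.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        /* like rpc_clnt_register_notify() stashing quota_enforcer_notify */
        notify_fn_t notify = (notify_fn_t) dlsym(handle, "quota_enforcer_notify");

        /* like glusterfs_graph_destroy() unloading the xlator */
        dlclose(handle);

        /* like the epoll thread delivering the connect event afterwards:
         * the pointer typically targets unmapped memory now -> SIGSEGV */
        if (notify)
            notify();
        return 0;
    }

In the real process the two sides run on different threads (cleanup_and_exit() versus the epoll worker), which is why the crash is intermittent rather than deterministic.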
From the looks of it, it seems to be a race between the connect event and graph destroy. So the component is either core, which handles the graph switch, or protocol/server, which should have waited until the graph was activated before listening for incoming connections.
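One possible ordering fix, sketched below with toy types (these are illustrative assumptions, not the real xlator_t or glusterfs_graph_t definitions): quiesce every translator's RPC activity first, and only dlclose() the shared objects once no transport event can call back into them.

    #include <dlfcn.h>
    #include <stddef.h>

    /* toy stand-in for an xlator; field names are assumptions */
    struct toy_xlator {
        struct toy_xlator *next;
        void (*fini)(struct toy_xlator *xl); /* must unregister rpc notify fns */
        void *dlhandle;
    };

    static void
    toy_graph_destroy(struct toy_xlator *first)
    {
        struct toy_xlator *xl;

        /* pass 1: quiesce -- each fini() disconnects its rpc-clnt and
         * unregisters callbacks such as quota_enforcer_notify, so no
         * pending connect event can fire into the .so afterwards */
        for (xl = first; xl != NULL; xl = xl->next)
            if (xl->fini != NULL)
                xl->fini(xl);

        /* pass 2: only now is it safe to unmap the shared objects;
         * unmapping first is exactly the race described above */
        for (xl = first; xl != NULL; xl = xl->next)
            if (xl->dlhandle != NULL)
                dlclose(xl->dlhandle);
    }

Equivalently, protocol/server could defer accepting connections until glusterfs_graph_activate() has succeeded, so no client transport exists while the graph can still be torn down.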
Thank you for your bug report. We are no longer releasing any bug fixes, or any other updates, for this version. This bug will be set to CLOSED WONTFIX to reflect this. Please reopen if the problem continues to be observed after upgrading to the latest version. [Also, considering this is in the path of cleanup_and_exit(), i.e., at the time the process is stopping, we won't be focusing on this bug any time soon.]