Description of problem:
After installing incorrect SSL/TLS certificates on one node, glusterd crashes; the bricks on that node then go down and the cluster ends up in an inconsistent state.

Version-Release number of selected component (if applicable):

[root@gqas009 ~]# rpm -qa | grep gluster
glusterfs-api-devel-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hive-0.1-11.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hbase-0.1-3.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_fs_counters-0.1-10.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_multiuser_support-0.1-3.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_hcfs_fileappend-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_hadoop-0.1-121.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_hcfs_quota-0.1-5.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_multiple_volumes-0.1-17.noarch
glusterfs-libs-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_dfsio_io_exception-0.1-8.noarch
glusterfs-fuse-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_shim_access_error_messages-0.1-5.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_sqoop-0.1-1.noarch
glusterfs-devel-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_gluster-0.2-77.noarch
glusterfs-resource-agents-3.5.3-1.fc20.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_brick_sorted_order_of_filenames-0.1-1.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_bigtop-0.2.1-23.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_erroneous_multivolume_filepaths-0.1-3.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_gluster_selfheal-0.1-5.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_file_dir_permissions-0.1-8.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_selinux_persistently_disabled-0.1-1.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_user_mapred_job-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_generate_gridmix2_data-0.1-2.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_hadoop_security-0.0.1-7.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_dfsio-0.1-1.noarch
glusterfs-api-3.6.2-1.fc20.x86_64
glusterfs-extra-xlators-3.6.2-1.fc20.x86_64
glusterfs-server-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_common-0.2-111.noarch
glusterfs-hadoop-2.1.2-2.fc20.noarch
glusterfs-geo-replication-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_special_char_in_path-0.1-1.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_groovy_sync-0.1-23.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_gluster_quota_selfheal-0.2-10.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_multifilewc_null_pointer_exception-0.1-5.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_pig-0.1-8.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_gridmix3-0.1-1.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_setting_working_directory-0.1-1.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_rhs_georep-0.1-2.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_home_dir_listing-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_hcfs_testcli-0.2-6.noarch
glusterfs-hadoop-javadoc-2.1.2-2.fc20.noarch
glusterfs-debuginfo-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_missing_dirs_create-0.1-3.noarch
glusterfs-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_mapreduce-0.1-5.noarch
glusterfs-cli-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_append_to_file-0.1-5.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_mahout-0.1-5.noarch
glusterfs-rdma-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop-0.1-7.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_default_block_size-0.1-3.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_ldap-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_junit_shim-0.1-12.noarch
[root@gqas009 ~]#

[root@gqas005 ~]# yum info openssl
Installed Packages
Name        : openssl
Arch        : x86_64
Epoch       : 1
Version     : 1.0.1e
Release     : 42.fc20
Size        : 1.5 M
Repo        : installed
From repo   : fedora-updates
Summary     : Utilities from the general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.
[root@gqas005 ~]#

How reproducible:
Tried once

Steps to Reproduce:
1. Install Fedora 20 and GlusterFS 3.6.2 (4 physical server machines, 1 physical client machine).
2. Create a volume and start it.
3. Enable the SSL options (client.ssl and server.ssl).
4. Create separate private keys for all the server nodes and clients.
5. Create the public certificate (with a CN) for each node and client, concatenate the public certificates (client and server) into a glusterfs.ca file, and copy it to /etc/ssl on the server nodes and clients.
6. Add the CNs to the auth.ssl-allow list for the volume.
7. Restart the volume (a command sketch for steps 3, 6 and 7 follows Expected results below).
8. Mount from the client using FUSE.

Actual results:
The bricks on one node go down because that node does not have the correct certificates, and glusterd crashes.

Expected results:
There should not be any crashes; the bad certificates should be handled gracefully.
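For reference, a minimal sketch of the volume-side commands behind steps 3, 6 and 7 (the volume name and the client CN come from the logs in Additional info; the server CNs shown here are placeholders, and the option names follow the admin_ssl.md guide linked below):

gluster volume set testvol2 client.ssl on
gluster volume set testvol2 server.ssl on
gluster volume set testvol2 auth.ssl-allow 'client1.example.com,<server-node CNs>'
gluster volume stop testvol2
gluster volume start testvol2

# From the client, mount over FUSE (step 8):
mount -t glusterfs 10.16.156.24:/testvol2 /mnt/test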
Additional info:

Doc reference: https://github.com/gluster/glusterfs/blob/master/doc/admin-guide/en-US/markdown/admin_ssl.md

[root@gqas009 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.16.156.36
Uuid: 10490a3d-10d8-48ca-963c-a85a6a195d1a
State: Peer in Cluster (Connected)

Hostname: 10.16.156.45
Uuid: de2bdc1a-cf40-4b4f-bb6a-5a261cd90db1
State: Peer in Cluster (Connected)

Hostname: 10.16.156.42
Uuid: dd3509dd-4fa4-4b0b-ae42-440ba22d8ec2
State: Peer in Cluster (Connected)
[root@gqas009 ~]#

Enable Nested Virtualization
============================
cat /sys/module/kvm_intel/parameters/nested
N

Temporarily remove the KVM Intel kernel module, make nested virtualization persistent across reboots, and add the kernel module back:

sudo rmmod kvm-intel
sudo sh -c "echo 'options kvm-intel nested=y' >> /etc/modprobe.d/dist.conf"
sudo modprobe kvm-intel

Ensure the nested KVM kernel module parameter for Intel is enabled on the host:

cat /sys/module/kvm_intel/parameters/nested
Y
modinfo kvm_intel | grep nested
parm:           nested:bool

Generate private keys for the gluster nodes and clients
========================================================
[root@gqas009 ~]# openssl genrsa -out glusterfs.key 1024
Generating RSA private key, 1024 bit long modulus
...++++++
....................++++++
e is 65537 (0x10001)
[root@gqas009 ~]#

[root@gqas013 ~]# openssl genrsa -out glusterfs.key 1024
Generating RSA private key, 1024 bit long modulus
...++++++
...........................++++++
e is 65537 (0x10001)
[root@gqas013 ~]#

[root@gqas015 ~]# openssl genrsa -out glusterfs.key 1024
Generating RSA private key, 1024 bit long modulus
........................++++++
.....................................................................++++++
e is 65537 (0x10001)
[root@gqas015 ~]#

[root@gqas016 ~]# openssl genrsa -out glusterfs.key 1024
Generating RSA private key, 1024 bit long modulus
...........++++++
.............................................++++++
e is 65537 (0x10001)
[root@gqas016 ~]#

From brick logs:
================
signal received: 11
time of crash: 2015-04-08 13:54:58
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.2
[2015-04-08 13:54:58.330721] E [socket.c:384:ssl_setup_connection] 0-tcp.testvol2-server: SSL connect error
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f6fe8782362]
/lib64/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f6fe879985d]
/lib64/libc.so.6(+0x358f0)[0x7f6fe779c8f0]
/lib64/libcrypto.so.10(X509_subject_name_cmp+0x3)[0x7f6fe7c3f6c3]
/lib64/libcrypto.so.10(OBJ_bsearch_ex_+0x64)[0x7f6fe7b93614]
/lib64/libcrypto.so.10(+0xe34c5)[0x7f6fe7c084c5]
/lib64/libcrypto.so.10(+0x12067f)[0x7f6fe7c4567f]
/lib64/libcrypto.so.10(X509_STORE_CTX_get1_issuer+0xe7)[0x7f6fe7c465c7]
/lib64/libcrypto.so.10(X509_verify_cert+0x90b)[0x7f6fe7c427fb]
/lib64/libssl.so.10(ssl3_output_cert_chain+0x1a8)[0x7f6fdd83cb68]
/lib64/libssl.so.10(ssl3_send_server_certificate+0x35)[0x7f6fdd8303d5]
/lib64/libssl.so.10(ssl3_accept+0xd1d)[0x7f6fdd83184d]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0x478a)[0x7f6fdda7f78a]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0x5e70)[0x7f6fdda80e70]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0xb149)[0x7f6fdda86149]
/lib64/libpthread.so.0(+0x7ee5)[0x7f6fe7f14ee5]
/lib64/libc.so.6(clone+0x6d)[0x7f6fe785bd1d]

From client:
============
[root@gqas005 ssl]# openssl req -new -x509 -key glusterfs.key -subj /CN=client1.example.com -out glusterfs.pem
[root@gqas005 ssl]# ls

Servers with same CN (client.example.com)
=========================================
[root@gqas005 ssl]# openssl req -new -x509 -key glusterfs.key -subj /CN=client.example.com -out glusterfs.pem
[root@gqas005 ssl]#
[root@gqas005 ssl]# mount -t glusterfs 10.16.156.24:/testvol2 /mnt/test
Mount failed. Please check the log file for more details.
[root@gqas005 ssl]#

Crash from glusterd logs:
=========================
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f60b99ed362]
/lib64/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f60b9a0485d]
/lib64/libc.so.6(+0x358f0)[0x7f60b8a078f0]
/lib64/libssl.so.10(SSL_write+0x4)[0x7f60ac636a94]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0x4602)[0x7f60ac865602]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0x495e)[0x7f60ac86595e]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0x4ff2)[0x7f60ac865ff2]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0x54a4)[0x7f60ac8664a4]
/lib64/libgfrpc.so.0(rpc_clnt_submit+0x2b2)[0x7f60b97c01a2]
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(glusterd_submit_request_unlocked+0x164)[0x7f60aec0e8a4]
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(glusterd_submit_request+0x7a)[0x7f60aec0ea1a]
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(glusterd_peer_dump_version+0x8e)[0x7f60aec49b1e]
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(__glusterd_peer_rpc_notify+0x2ee)[0x7f60aebfc58e]
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x4c)[0x7f60aebf501c]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1a0)[0x7f60b97c13d0]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f60b97bd2f3]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0x5977)[0x7f60ac866977]
/usr/lib64/glusterfs/3.6.2/rpc-transport/socket.so(+0xabff)[0x7f60ac86bbff]
/lib64/libglusterfs.so.0(+0x765f2)[0x7f60b9a425f2]
/usr/sbin/glusterd(main+0x502)[0x7f60b9e96012]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f60b89f3d65]
/usr/sbin/glusterd(+0x63b1)[0x7f60b9e963b1]
Sobhan, I still don't understand what is wrong with your environment. I myself have never faced any crashes, and I know that other people who had SSL setup issues didn't face crashes either. I'll try to find time early next week to sit down with you; I'd like to observe exactly what you are doing.
In what way are one server's certificates incorrect? The description seems to show certificates being created the same way everywhere, and that way looks valid.
The problem is that while concatenating the public certificates to form the glusterfs.ca file, I missed one (the client's public certificate) and copied that incomplete glusterfs.ca file to /etc/ssl on the server nodes. I tried to mount from the client and it failed. Then I realized I had missed one entry in the glusterfs.ca file on the server nodes. When I ran "gluster volume status" I saw the brick was down on one node, and there had been a brick crash. I understand that the SSL connection failed because incorrect SSL certificates were copied to the server nodes. In any case, glusterd should not crash; it should handle the error gracefully.
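For illustration, a sketch of the concatenation step described above, assuming four server certificates and one client certificate (the .pem file names here are hypothetical; only the target path /etc/ssl/glusterfs.ca matches the setup):

# Incomplete CA file (what happened here): the client certificate is missing,
# so the servers cannot validate the client and the SSL handshake fails.
cat server1.pem server2.pem server3.pem server4.pem > /etc/ssl/glusterfs.ca

# Complete CA file: every server and client certificate is concatenated in.
cat server1.pem server2.pem server3.pem server4.pem client1.pem > /etc/ssl/glusterfs.ca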
Sobhan, it would be good if you can reproduce this, collect all the CLI steps, and create a document so that everyone is clear about which steps you followed in what exact sequence. Post that doc as an attachment to the BZ; that should reduce the confusion and ambiguity that typically crops up when you put something in words.

Also, you said:
>How reproducible:
>Tried once

Not very encouraging :) IMHO, as a bug reporter you should report a bug only if it reproduces reliably, or at least state that it happened just once and is not happening again, which is an important data point for the developer.
Created attachment 1013873 [details]
Reproduction steps for glusterd crash
I reproduced the issue and noted exactly where glusterd crashes; please see the attachment. I also still have the machines in the same state and am happy to share the test setup for further investigation.
The following backtrace may be useful for debugging further.

(gdb) bt
#0  SSL_write (s=0x0, buf=0x5555557983a0, num=4) at ssl_lib.c:990
#1  0x00007fffe6854602 in ssl_do (buf=0x5555557983a0, len=4, func=0x7fffe6625a90 <SSL_write>, this=0x5555557e8690, this=0x5555557e8690) at socket.c:281
#2  0x00007fffe685495e in __socket_rwv (this=this@entry=0x5555557e8690, vector=<optimized out>, count=<optimized out>, pending_vector=pending_vector@entry=0x5555557984b0, pending_count=pending_count@entry=0x5555557984b8, bytes=bytes@entry=0x0, write=write@entry=1) at socket.c:565
#3  0x00007fffe6854ff2 in __socket_writev (pending_count=<optimized out>, pending_vector=<optimized out>, count=<optimized out>, vector=<optimized out>, this=0x5555557e8690) at socket.c:684
#4  __socket_ioq_churn_entry (this=this@entry=0x5555557e8690, entry=entry@entry=0x555555798390, direct=direct@entry=1) at socket.c:1082
#5  0x00007fffe68554a4 in socket_submit_request (this=0x5555557e8690, req=<optimized out>) at socket.c:3304
#6  0x00007ffff792f1a2 in rpc_clnt_submit (rpc=rpc@entry=0x5555557e45a0, prog=prog@entry=0x7fffed061f60 <glusterd_dump_prog>, procnum=procnum@entry=1, cbkfn=cbkfn@entry=0x7fffecdb4970 <glusterd_peer_dump_version_cbk>, proghdr=proghdr@entry=0x7fffffffcc30, proghdrcount=1, progpayload=progpayload@entry=0x0, progpayloadcount=progpayloadcount@entry=0, iobref=iobref@entry=0x5555557f85e0, frame=frame@entry=0x7ffff5bb26b4, rsphdr=rsphdr@entry=0x0, rsphdr_count=rsphdr_count@entry=0, rsp_payload=rsp_payload@entry=0x0, rsp_payload_count=rsp_payload_count@entry=0, rsp_iobref=rsp_iobref@entry=0x0) at rpc-clnt.c:1555
#7  0x00007fffecd7d8a4 in glusterd_submit_request_unlocked (rpc=rpc@entry=0x5555557e45a0, req=req@entry=0x7fffffffcd10, frame=frame@entry=0x7ffff5bb26b4, prog=prog@entry=0x7fffed061f60 <glusterd_dump_prog>, procnum=procnum@entry=1, iobref=0x5555557f85e0, iobref@entry=0x0, this=this@entry=0x55555579ca30, cbkfn=cbkfn@entry=0x7fffecdb4970 <glusterd_peer_dump_version_cbk>, xdrproc=xdrproc@entry=0x7ffff7714380 <xdr_gf_dump_req>) at glusterd-utils.c:271
#8  0x00007fffecd7da1a in glusterd_submit_request (rpc=0x5555557e45a0, req=req@entry=0x7fffffffcd10, frame=0x7ffff5bb26b4, prog=prog@entry=0x7fffed061f60 <glusterd_dump_prog>, procnum=procnum@entry=1, iobref=iobref@entry=0x0, this=this@entry=0x55555579ca30, cbkfn=cbkfn@entry=0x7fffecdb4970 <glusterd_peer_dump_version_cbk>, xdrproc=0x7ffff7714380 <xdr_gf_dump_req>) at glusterd-utils.c:296
#9  0x00007fffecdb8b1e in glusterd_peer_dump_version (this=this@entry=0x55555579ca30, rpc=rpc@entry=0x5555557e45a0, peerctx=peerctx@entry=0x5555557d6ac0) at glusterd-handshake.c:2008
#10 0x00007fffecd6b58e in __glusterd_peer_rpc_notify (rpc=rpc@entry=0x5555557e45a0, mydata=mydata@entry=0x5555557d6ac0, event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0) at glusterd-handler.c:4352
#11 0x00007fffecd6401c in glusterd_big_locked_notify (rpc=0x5555557e45a0, mydata=0x5555557d6ac0, event=RPC_CLNT_CONNECT, data=0x0, notify_fn=0x7fffecd6b2a0 <__glusterd_peer_rpc_notify>) at glusterd-handler.c:69
#12 0x00007ffff79303d0 in rpc_clnt_notify (trans=<optimized out>, mydata=0x5555557e45d0, event=<optimized out>, data=<optimized out>) at rpc-clnt.c:923
#13 0x00007ffff792c2f3 in rpc_transport_notify (this=this@entry=0x5555557e8690, event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x5555557e8690) at rpc-transport.c:516
#14 0x00007fffe6855977 in socket_connect_finish (this=this@entry=0x5555557e8690) at socket.c:2301
#15 0x00007fffe685abff in socket_event_handler (fd=<optimized out>, idx=2, data=data@entry=0x5555557e8690, poll_in=1, poll_out=4, poll_err=16) at socket.c:2331
#16 0x00007ffff7bb15f2 in event_dispatch_epoll_handler (i=<optimized out>, events=0x5555557dc2a0, event_pool=0x555555788d00) at event-epoll.c:384
#17 event_dispatch_epoll (event_pool=0x555555788d00) at event-epoll.c:445
#18 0x000055555555a012 in main (argc=3, argv=0x7fffffffe0d8) at glusterfsd.c:2043
(gdb)
Created attachment 1014229 [details]
Same content as reproduction_steps_for_glusterd_crash.odt, converted to plaintext
Proper steps were not followed to enable SSL on the management path. Precisely, these are the steps to be followed (a command-level sketch follows the list):

1) Stop all gluster processes on all nodes (glusterd, glusterfs, glusterfsd).
2) Create the file /var/lib/glusterd/secure-access on all nodes.
3) Start the glusterd process.

However, this bug will be kept open and verified once again after we have a fix for bug 1244415.
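A minimal command-level sketch of those three steps, assuming systemd-managed glusterd on Fedora 20 (pkill is just one way to stop leftover brick and client processes):

# On every node: stop glusterd plus any remaining brick and client processes
systemctl stop glusterd
pkill glusterfsd
pkill glusterfs

# On every node: turn on TLS for the management path
touch /var/lib/glusterd/secure-access

# On every node: start glusterd again
systemctl start glusterd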
This is not a security bug, and it is not going to be fixed in 3.6.x because of http://www.gluster.org/pipermail/gluster-users/2016-July/027682.html
If the issue persists in the latest releases, please feel free to clone this bug against them.