Description of problem:
The NFS server crashed on one node of a 4-node gluster cluster after management SSL was enabled, glusterd was restarted, and a mount was attempted from the client.

Version-Release number of selected component (if applicable):

[root@gqas003 ssl]# rpm -qa | grep gluster
glusterfs-3.7.0-2.el6rhs.x86_64
glusterfs-api-3.7.0-2.el6rhs.x86_64
glusterfs-geo-replication-3.7.0-2.el6rhs.x86_64
gluster-nagios-common-0.1.4-1.el6rhs.noarch
glusterfs-libs-3.7.0-2.el6rhs.x86_64
glusterfs-client-xlators-3.7.0-2.el6rhs.x86_64
glusterfs-fuse-3.7.0-2.el6rhs.x86_64
glusterfs-server-3.7.0-2.el6rhs.x86_64
vdsm-gluster-4.16.8.1-6.2.el6rhs.noarch
gluster-nagios-addons-0.1.16-1.el6rhs.x86_64
rhs-tests-rhs-tests-beaker-rhs-gluster-qe-libs-dev-bturner-2.37-0.noarch
glusterfs-rdma-3.7.0-2.el6rhs.x86_64
glusterfs-cli-3.7.0-2.el6rhs.x86_64
[root@gqas003 ssl]#

Client (OpenStack Nova VM)
===========================
[fedora@myvm-admin-fedora1 ~]$ rpm -qa | grep gluster
glusterfs-server-3.7.0-2.fc20.x86_64
glusterfs-libs-3.7.0-2.fc20.x86_64
glusterfs-client-xlators-3.7.0-2.fc20.x86_64
glusterfs-cli-3.7.0-2.fc20.x86_64
glusterfs-3.7.0-2.fc20.x86_64
glusterfs-api-3.7.0-2.fc20.x86_64
glusterfs-fuse-3.7.0-2.fc20.x86_64
[fedora@myvm-admin-fedora1 ~]$

Client/Nova VM OpenSSL
=======================
[fedora@myvm-admin-fedora1 ~]$ yum info openssl
Installed Packages
Name        : openssl
Arch        : x86_64
Epoch       : 1
Version     : 1.0.1e
Release     : 42.fc20
Size        : 1.5 M
Repo        : installed
From repo   : updates
Summary     : Utilities from the general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.
[fedora@myvm-admin-fedora1 ~]$
[fedora@myvm-admin-fedora1 ~]$ rpm -qa | grep openssl
openssl-libs-1.0.1e-42.fc20.x86_64
openssl-1.0.1e-42.fc20.x86_64
[fedora@myvm-admin-fedora1 ~]$

Server OpenSSL
===============
[root@gqas003 ssl]# yum info openssl
Loaded plugins: aliases, changelog, downloadonly, product-id, security, subscription-manager, tmprepo, verify, versionlock
rhel-6-server-rpms                       | 3.7 kB 00:00
rhel-scalefs-for-rhel-6-server-rpms      | 4.6 kB 00:00
rhs                                      | 2.9 kB 00:00
rhs-3-for-rhel-6-server-rpms             | 3.1 kB 00:00
uspace-rcu                               | 2.9 kB 00:00
Installed Packages
Name        : openssl
Arch        : x86_64
Version     : 1.0.1e
Release     : 30.el6_6.8
Size        : 4.0 M
Repo        : installed
From repo   : rhel-6-server-rpms
Summary     : A general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.

Available Packages
Name        : openssl
Arch        : i686
Version     : 1.0.1e
Release     : 30.el6_6.8
Size        : 1.5 M
Repo        : rhel-6-server-rpms
Summary     : A general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.

[root@gqas003 ssl]#
[root@gqas003 ssl]# rpm -qa | grep openssl
openssl-1.0.1e-30.el6_6.8.x86_64
[root@gqas003 ssl]#

How reproducible:
Happens consistently.

Steps to Reproduce:
1. Install RHS3.1 build (4 server physical machines), 1 client VM with RHS3.1 client bits for f20.
2. Create a volume and start it.
3. Enable the SSL options (client.ssl and server.ssl).
4. Create separate private keys for all the server nodes and the client.
5. Create the public key and CN for each, concatenate the public keys (client and server) into a glusterfs.ca file, and copy it to the server nodes (/etc/ssl) and clients (/etc/ssl).
6. Add the CNs to the ssl-auth-allow list for the volume.
7. Restart the volume.
8. Mount from the client using fuse and verify that it succeeds.
9. Enable management SSL by creating the file /var/lib/glusterd/secure-access on server and client, and restart glusterd.
10. Mount the volume through fuse (the mount fails and the NFS server crashes on one of the nodes).

Actual results:
The NFS server crashed on one node after step 9, when the mount in step 10 was attempted.

Expected results:
The NFS server should not crash.

Additional info:
sosreport: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/sosreport-gqas009-20150522085134-f9ab.tar.xz
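For reference, the concatenation in step 5 can be sketched roughly as below. This is only an illustration, not the procedure that was actually run: it assumes the "public keys" are PEM-encoded certificates, and the helper name build_ca_bundle and the file names server1.pem and client.pem are hypothetical. Only the bundle name glusterfs.ca comes from the steps above.

```python
import tempfile
from pathlib import Path

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"


def build_ca_bundle(cert_paths, bundle_path):
    """Concatenate PEM certificates into one CA bundle; return how many were written."""
    blocks = []
    for path in cert_paths:
        text = Path(path).read_text().strip()
        if PEM_BEGIN not in text:
            # Assumption: every input is a PEM certificate, not a bare key.
            raise ValueError(f"{path} does not look like a PEM certificate")
        blocks.append(text)
    Path(bundle_path).write_text("\n".join(blocks) + "\n")
    return len(blocks)


def demo():
    """Build a bundle from two placeholder files in a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        certs = []
        for name in ("server1.pem", "client.pem"):  # hypothetical file names
            path = Path(tmp) / name
            path.write_text(
                f"{PEM_BEGIN}\n(placeholder for {name})\n-----END CERTIFICATE-----\n"
            )
            certs.append(path)
        bundle = Path(tmp) / "glusterfs.ca"
        count = build_ca_bundle(certs, bundle)
        return count, bundle.read_text().count(PEM_BEGIN)
```

The same bundle then has to be copied to /etc/ssl on every server and client, as in step 5, so that each side can verify the others.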
Sequence of logs after management SSL is enabled and glusterd is restarted:
=============================================================================
[root@gqas003 ssl]# touch /var/lib/glusterd/secure-access

NFS server is running on all nodes
===================================
[root@gqas003 ssl]# gluster volume status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.156.6:/rhs/brick1/br8           49152     0          Y       8400
Brick 10.16.156.18:/rhs/brick1/br4          49152     0          Y       5581
Brick 10.16.156.24:/rhs/brick1/br1          49152     0          Y       21183
Brick 10.16.156.36:/rhs/brick1/br4          49152     0          Y       6020
NFS Server on localhost                     2049      0          Y       8421
Self-heal Daemon on localhost               N/A       N/A        Y       8427
NFS Server on 10.16.156.24                  2049      0          Y       21204
Self-heal Daemon on 10.16.156.24            N/A       N/A        Y       21210
NFS Server on 10.16.156.36                  2049      0          Y       6040
Self-heal Daemon on 10.16.156.36            N/A       N/A        Y       6047
NFS Server on 10.16.156.18                  2049      0          Y       5601
Self-heal Daemon on 10.16.156.18            N/A       N/A        Y       5608

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@gqas003 ssl]# service glusterd restart
Starting glusterd:[ OK ]

NFS server has crashed on one of the nodes after mounting from the client:
====================================================================
[root@gqas003 ssl]# gluster volume status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.156.6:/rhs/brick1/br8           49152     0          Y       8400
Brick 10.16.156.18:/rhs/brick1/br4          49152     0          Y       5581
Brick 10.16.156.24:/rhs/brick1/br1          49152     0          Y       21183
Brick 10.16.156.36:/rhs/brick1/br4          49152     0          Y       6020
NFS Server on localhost                     2049      0          Y       9129
Self-heal Daemon on localhost               N/A       N/A        Y       9136
NFS Server on 10.16.156.18                  2049      0          Y       6292
Self-heal Daemon on 10.16.156.18            N/A       N/A        Y       6300
NFS Server on 10.16.156.24                  N/A       N/A        N       N/A     ---> Crashed
Self-heal Daemon on 10.16.156.24            N/A       N/A        N       N/A
NFS Server on 10.16.156.36                  2049      0          Y       6731
Self-heal Daemon on 10.16.156.36            N/A       N/A        Y       6738

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

Attempt mount from client
=========================
[fedora@myvm-admin-fedora1 ~]$ sudo mount -t glusterfs 10.16.156.6:/testvol /mnt/test
WARNING: getfattr not found, certain checks will be skipped..
Mount failed. Please check the log file for more details.
[fedora@myvm-admin-fedora1 ~]$

+------------------------------------------------------------------------------+
[2015-05-22 11:28:48.960170] I [socket.c:401:ssl_setup_connection] 0-testvol-client-0: peer CN = server1.example.com
[2015-05-22 11:28:48.964507] I [socket.c:401:ssl_setup_connection] 0-testvol-client-1: peer CN = server2.example.com
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-05-22 11:28:48
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.0
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3919a24b96]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3919a435af]
/lib64/libc.so.6[0x34b42326a0]
/usr/lib64/libgfrpc.so.0(rpc_transport_connect+0xc)[0x391a20b9fc]
/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xd9)[0x391a20e729]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0x113)[0x3919a45573]
/lib64/libpthread.so.0[0x34b46079d1]
/lib64/libc.so.6(clone+0x6d)[0x34b42e88fd]
---------
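When re-running the test, the offline daemons can be picked out of the "gluster volume status" output mechanically instead of by eye. A minimal sketch follows, assuming the five-column layout shown above (process, TCP port, RDMA port, Online, Pid); the function name offline_processes and the SAMPLE text are illustrative and not part of any gluster tooling.

```python
import re

# A data row ends with: <tcp port> <rdma port> <online Y/N> <pid>.
# Ports and pid are either a number or the literal N/A.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<tcp>\d+|N/A)\s+(?P<rdma>\d+|N/A)"
    r"\s+(?P<online>[YN])\s+(?P<pid>\d+|N/A)$"
)


def offline_processes(status_text):
    """Return the names of processes whose Online column is N."""
    offline = []
    for line in status_text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group("online") == "N":
            offline.append(match.group("name"))
    return offline


# A few rows copied from the status output in this report.
SAMPLE = """\
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick 10.16.156.6:/rhs/brick1/br8           49152     0          Y       8400
NFS Server on localhost                     2049      0          Y       9129
NFS Server on 10.16.156.24                  N/A       N/A        N       N/A
Self-heal Daemon on 10.16.156.24            N/A       N/A        N       N/A
NFS Server on 10.16.156.36                  2049      0          Y       6731
"""

if __name__ == "__main__":
    for name in offline_processes(SAMPLE):
        print("offline:", name)
```

Rows carrying a trailing annotation (such as the "---> Crashed" note added by hand above) do not match the pattern and would need the annotation stripped first.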
After the above issue, gluster commands fail with "Request timed out".

[root@gqas003 ~]# gluster volume info
Error : Request timed out
No volumes present
[root@gqas003 ~]#

[root@gqas007 ~]# service glusterd status
glusterd (pid 6157) is running...
[root@gqas007 ~]# gluster peer status
Error : Request timed out
[root@gqas007 ~]#
Can you please provide the coredump for the crashed nfs process?
Sobhan gave me access to the systems on which he hit the issue. What I found was that the bricks were left running when the switch to management encryption was done. This is incorrect.

When enabling or disabling management encryption, all GlusterFS processes (GlusterD, bricks, clients etc.) need to be stopped and started. This is needed because:
1. Interactions between processes trying to use encrypted connections and processes using unencrypted connections are undefined, and will lead to failures as observed here.
2. All GlusterFS processes communicate with GlusterD, so changing management encryption's state affects all of them.
3. It is not possible to dynamically switch an unencrypted connection to encrypted or vice versa.

Sobhan was following [1], which isn't complete with respect to upgrade procedures. This lack of documentation was one of the issues we found when I got involved with GlusterFS network encryption and Manila. As a result, I've written up a guide on how to use network encryption with GlusterFS at [2], which covers many different scenarios of enabling network encryption in GlusterFS, including enabling management encryption on an existing cluster (as is the case here).

I'll work with the documentation team to provide proper official documentation for RHGS based on [2]. Until we have the official documentation, please refer to [2] for network encryption guidance.

Sobhan, could you please re-run your tests following the guidelines given in [2]? You shouldn't face any issues if you follow it. If you do hit issues even when following the guidelines, please let me know.

As this issue is not really a bug in GlusterFS, but arose from an incorrect setup and incorrect steps, I suggest closing this bug. I'll do so if there are no objections.

[1]: https://github.com/gluster/glusterfs/blob/master/doc/admin-guide/en-US/markdown/admin_ssl.md
[2]: https://kshlm.in/network-encryption-in-glusterfs/
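To make the ordering requirement concrete, here is a rough dry-run sketch that only builds the list of commands and runs nothing. It is my own illustration of "stop everything, switch, start everything", not a copy of the procedure in [2]: the function name, the host names, and the use of pkill to catch leftover brick, NFS and self-heal processes are assumptions. The secure-access path, volume name and mount command are taken from this report.

```python
SECURE_ACCESS = "/var/lib/glusterd/secure-access"


def management_encryption_plan(servers, clients, volume, mountpoint):
    """Return (phase, host, command) tuples in the order they should run."""
    plan = []

    # Phase 1: stop every GlusterFS process, clients included.
    for client in clients:
        plan.append(("stop", client, f"umount {mountpoint}"))
    plan.append(("stop", servers[0], f"gluster --mode=script volume stop {volume}"))
    for server in servers:
        plan.append(("stop", server, "service glusterd stop"))
        # Assumption: pkill is used to catch any glusterfs/glusterfsd left over.
        plan.append(("stop", server, "pkill glusterfs"))

    # Phase 2: flip the management encryption flag everywhere while nothing runs.
    for host in list(servers) + list(clients):
        plan.append(("flag", host, f"touch {SECURE_ACCESS}"))

    # Phase 3: start everything again, then remount.
    for server in servers:
        plan.append(("start", server, "service glusterd start"))
    plan.append(("start", servers[0], f"gluster volume start {volume}"))
    for client in clients:
        plan.append(
            ("start", client, f"mount -t glusterfs {servers[0]}:/{volume} {mountpoint}")
        )
    return plan


if __name__ == "__main__":
    for phase, host, command in management_encryption_plan(
        ["server1", "server2"], ["client1"], "testvol", "/mnt/test"
    ):
        print(f"[{phase}] {host}: {command}")
```

The point of the three phases is that no process started before the flag change is still running after it, which is the condition that was violated in the failing setup.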
Created attachment 1029922
verification logs