Bug 1224289 - NFS server crashed in one of the node after enabling the management SSL
Summary: NFS server crashed in one of the node after enabling the management SSL
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kaushal
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1223636
Reported: 2015-05-22 12:58 UTC by ssamanta
Modified: 2015-06-11 01:09 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-05-26 12:35:36 UTC
Embargoed:


Attachments
verification logs (23.39 KB, application/vnd.oasis.opendocument.text)
2015-05-26 12:35 UTC, ssamanta

Description ssamanta 2015-05-22 12:58:58 UTC
Description of problem:
The NFS server crashed on one node of a 4-node gluster cluster after management SSL was enabled, glusterd was restarted, and a mount was attempted from the client.


Version-Release number of selected component (if applicable):
[root@gqas003 ssl]# rpm -qa | grep gluster
glusterfs-3.7.0-2.el6rhs.x86_64
glusterfs-api-3.7.0-2.el6rhs.x86_64
glusterfs-geo-replication-3.7.0-2.el6rhs.x86_64
gluster-nagios-common-0.1.4-1.el6rhs.noarch
glusterfs-libs-3.7.0-2.el6rhs.x86_64
glusterfs-client-xlators-3.7.0-2.el6rhs.x86_64
glusterfs-fuse-3.7.0-2.el6rhs.x86_64
glusterfs-server-3.7.0-2.el6rhs.x86_64
vdsm-gluster-4.16.8.1-6.2.el6rhs.noarch
gluster-nagios-addons-0.1.16-1.el6rhs.x86_64
rhs-tests-rhs-tests-beaker-rhs-gluster-qe-libs-dev-bturner-2.37-0.noarch
glusterfs-rdma-3.7.0-2.el6rhs.x86_64
glusterfs-cli-3.7.0-2.el6rhs.x86_64
[root@gqas003 ssl]# 

Client (OpenStack Nova VM)
==========================
[fedora@myvm-admin-fedora1 ~]$ rpm -qa | grep gluster
glusterfs-server-3.7.0-2.fc20.x86_64
glusterfs-libs-3.7.0-2.fc20.x86_64
glusterfs-client-xlators-3.7.0-2.fc20.x86_64
glusterfs-cli-3.7.0-2.fc20.x86_64
glusterfs-3.7.0-2.fc20.x86_64
glusterfs-api-3.7.0-2.fc20.x86_64
glusterfs-fuse-3.7.0-2.fc20.x86_64
[fedora@myvm-admin-fedora1 ~]$ 

Client/Nova VM OpenSSL
=======================
[fedora@myvm-admin-fedora1 ~]$ yum info openssl
Installed Packages
Name        : openssl
Arch        : x86_64
Epoch       : 1
Version     : 1.0.1e
Release     : 42.fc20
Size        : 1.5 M
Repo        : installed
From repo   : updates
Summary     : Utilities from the general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.

[fedora@myvm-admin-fedora1 ~]$ 

[fedora@myvm-admin-fedora1 ~]$ rpm -qa | grep openssl
openssl-libs-1.0.1e-42.fc20.x86_64
openssl-1.0.1e-42.fc20.x86_64
[fedora@myvm-admin-fedora1 ~]$ 

Server OpenSSL
===============
[root@gqas003 ssl]# yum info openssl
Loaded plugins: aliases, changelog, downloadonly, product-id, security, subscription-manager, tmprepo, verify, versionlock
rhel-6-server-rpms                                                                                                             | 3.7 kB     00:00     
rhel-scalefs-for-rhel-6-server-rpms                                                                                            | 4.6 kB     00:00     
rhs                                                                                                                            | 2.9 kB     00:00     
rhs-3-for-rhel-6-server-rpms                                                                                                   | 3.1 kB     00:00     
uspace-rcu                                                                                                                     | 2.9 kB     00:00     
Installed Packages
Name        : openssl
Arch        : x86_64
Version     : 1.0.1e
Release     : 30.el6_6.8
Size        : 4.0 M
Repo        : installed
From repo   : rhel-6-server-rpms
Summary     : A general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.

Available Packages
Name        : openssl
Arch        : i686
Version     : 1.0.1e
Release     : 30.el6_6.8
Size        : 1.5 M
Repo        : rhel-6-server-rpms
Summary     : A general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.

[root@gqas003 ssl]# 

[root@gqas003 ssl]# rpm -qa | grep openssl
openssl-1.0.1e-30.el6_6.8.x86_64
[root@gqas003 ssl]# 

How reproducible:
Happens consistently.

Steps to Reproduce:
1. Install the RHS 3.1 build on 4 physical server machines and set up 1 client VM with the RHS 3.1 client bits for f20.
2. Create a volume and start it.
3. Enable the SSL options (client.ssl and server.ssl).
4. Create separate private keys for all the server nodes and the client.
5. Create the public key with a CN for each node, concatenate the public keys (client and servers) into a glusterfs.ca file, and copy it to /etc/ssl on the server nodes and clients.
6. Add the CNs to the volume's auth.ssl-allow list.
7. Restart the volume.
8. Mount from the client using fuse and verify that it is successful.
9. Enable management SSL by creating the file /var/lib/glusterd/secure-access on the servers and the client, and restart glusterd.
10. Mount the volume through fuse (the mount fails and the NFS server crashes on one of the nodes).
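
For reference, steps 3 to 6 can be sketched roughly as below. This is only an illustration, not the exact commands that were run: the file names glusterfs.key and glusterfs.pem follow the upstream SSL guide, the client CN and the per-node .pem file names passed to build_ca_bundle are placeholders, and the certificate validity is an arbitrary choice. With DRY_RUN=1 (the default) the commands are only printed.

```shell
#!/bin/sh
# Sketch of steps 3-6. DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
VOLNAME="${VOLNAME:-testvol}"      # volume name as seen in the logs of this report
SSL_DIR="${SSL_DIR:-/etc/ssl}"     # directory named in step 5
# server1/server2 appear in the crash log; client.example.com is a placeholder.
ALLOWED_CNS="${ALLOWED_CNS:-server1.example.com,server2.example.com,client.example.com}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 4-5: private key and self-signed certificate for one node (run on that node).
make_node_cert() {
    cn="$1"
    run openssl genrsa -out "$SSL_DIR/glusterfs.key" 2048
    run openssl req -new -x509 -days 365 -key "$SSL_DIR/glusterfs.key" \
        -subj "/CN=$cn" -out "$SSL_DIR/glusterfs.pem"
}

# Step 5: concatenate the collected certificates of all servers and the client.
build_ca_bundle() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ cat $* > $SSL_DIR/glusterfs.ca"
    else
        cat "$@" > "$SSL_DIR/glusterfs.ca"
    fi
}

# Steps 3 and 6: enable I/O encryption and restrict access to the listed CNs.
enable_volume_ssl() {
    run gluster volume set "$VOLNAME" client.ssl on
    run gluster volume set "$VOLNAME" server.ssl on
    run gluster volume set "$VOLNAME" auth.ssl-allow "$ALLOWED_CNS"
}

make_node_cert "server1.example.com"
build_ca_bundle server1.pem server2.pem client.pem
enable_volume_ssl
```

Set DRY_RUN=0 to actually execute the commands; the resulting glusterfs.ca still has to be copied to every server and client by hand.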

Actual results:
NFS server crashed on a node after step 9.

Expected results:
The NFS server should not crash.

Additional info:
sosreport: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/sosreport-gqas009-20150522085134-f9ab.tar.xz

Sequence of logs after management SSL is enabled and glusterd is restarted:
===========================================================================

[root@gqas003 ssl]# touch /var/lib/glusterd/secure-access

NFS server is running on all nodes
===================================
[root@gqas003 ssl]# gluster volume status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.156.6:/rhs/brick1/br8           49152     0          Y       8400 
Brick 10.16.156.18:/rhs/brick1/br4          49152     0          Y       5581 
Brick 10.16.156.24:/rhs/brick1/br1          49152     0          Y       21183
Brick 10.16.156.36:/rhs/brick1/br4          49152     0          Y       6020 
NFS Server on localhost                     2049      0          Y       8421 
Self-heal Daemon on localhost               N/A       N/A        Y       8427 
NFS Server on 10.16.156.24                  2049      0          Y       21204
Self-heal Daemon on 10.16.156.24            N/A       N/A        Y       21210
NFS Server on 10.16.156.36                  2049      0          Y       6040 
Self-heal Daemon on 10.16.156.36            N/A       N/A        Y       6047 
NFS Server on 10.16.156.18                  2049      0          Y       5601 
Self-heal Daemon on 10.16.156.18            N/A       N/A        Y       5608 
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@gqas003 ssl]# service glusterd restart
Starting glusterd:[  OK  ]

NFS server crashed on one of the nodes after mounting from the client:
======================================================================
[root@gqas003 ssl]# gluster volume status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.156.6:/rhs/brick1/br8           49152     0          Y       8400 
Brick 10.16.156.18:/rhs/brick1/br4          49152     0          Y       5581 
Brick 10.16.156.24:/rhs/brick1/br1          49152     0          Y       21183
Brick 10.16.156.36:/rhs/brick1/br4          49152     0          Y       6020 
NFS Server on localhost                     2049      0          Y       9129 
Self-heal Daemon on localhost               N/A       N/A        Y       9136 
NFS Server on 10.16.156.18                  2049      0          Y       6292 
Self-heal Daemon on 10.16.156.18            N/A       N/A        Y       6300  
NFS Server on 10.16.156.24                  N/A       N/A        N       N/A ---> Crashed  
Self-heal Daemon on 10.16.156.24            N/A       N/A        N       N/A  
NFS Server on 10.16.156.36                  2049      0          Y       6731 
Self-heal Daemon on 10.16.156.36            N/A       N/A        Y       6738 
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks
 
Attempt mount from client
=========================
[fedora@myvm-admin-fedora1 ~]$ sudo mount -t glusterfs 10.16.156.6:/testvol /mnt/test
WARNING: getfattr not found, certain checks will be skipped..
Mount failed. Please check the log file for more details.
[fedora@myvm-admin-fedora1 ~]$ 


+------------------------------------------------------------------------------+
[2015-05-22 11:28:48.960170] I [socket.c:401:ssl_setup_connection] 0-testvol-client-0: peer CN = server1.example.com
[2015-05-22 11:28:48.964507] I [socket.c:401:ssl_setup_connection] 0-testvol-client-1: peer CN = server2.example.com
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-05-22 11:28:48
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.0
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3919a24b96]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3919a435af]
/lib64/libc.so.6[0x34b42326a0]
/usr/lib64/libgfrpc.so.0(rpc_transport_connect+0xc)[0x391a20b9fc]
/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xd9)[0x391a20e729]
/usr/lib64/libglusterfs.so.0(gf_timer_proc+0x113)[0x3919a45573]
/lib64/libpthread.so.0[0x34b46079d1]
/lib64/libc.so.6(clone+0x6d)[0x34b42e88fd]
---------

Comment 2 ssamanta 2015-05-25 04:57:10 UTC
After the above issue, gluster commands fail with "Request timed out".

[root@gqas003 ~]# gluster volume info
Error : Request timed out
No volumes present
[root@gqas003 ~]#

[root@gqas007 ~]# service glusterd status
glusterd (pid  6157) is running...
[root@gqas007 ~]# gluster peer status
Error : Request timed out
[root@gqas007 ~]#

Comment 4 Kaushal 2015-05-25 07:42:53 UTC
Can you please provide the coredump for the crashed nfs process?

Comment 6 Kaushal 2015-05-25 11:13:27 UTC
Sobhan provided me with access to the systems on which he faces the issues.

What I found was that the bricks were left running when the switch to management encryption was done. This is incorrect. When enabling or disabling management encryption, all GlusterFS processes - GlusterD, bricks, clients etc. - need to be stopped and started. This is needed because:
1. Interactions between processes trying to use encrypted connections and processes using unencrypted connections are undefined, and will lead to failures as observed here.
2. All GlusterFS processes communicate with GlusterD, so changing management encryption's state affects all of them.
3. It is not possible to dynamically switch an unencrypted connection to encrypted or vice versa.
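
A minimal sketch of such a full stop/start cycle on a server node is given below. It is an assumption about one reasonable ordering, not the official procedure: the helper names phase_stop and phase_start are made up for illustration, the volume name is the one from this report, and clients are expected to be unmounted beforehand. With DRY_RUN=1 (the default) the commands are only printed.

```shell
#!/bin/sh
# Sketch: stop every GlusterFS process, enable management encryption, start again.
DRY_RUN="${DRY_RUN:-1}"
VOLNAME="${VOLNAME:-testvol}"
SECURE_ACCESS="/var/lib/glusterd/secure-access"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Phase 1: must finish on ALL servers before phase 2 starts on any of them.
# Clients should already be unmounted at this point.
phase_stop() {
    # Stopping the volume stops the bricks; it only needs to be issued from one node.
    run gluster --mode=script volume stop "$VOLNAME"
    run service glusterd stop
    # Kill leftover glusterfs/glusterfsd processes (NFS server, self-heal daemon, ...).
    run pkill glusterfs
}

# Phase 2: create the marker file, then bring everything back up.
phase_start() {
    run touch "$SECURE_ACCESS"
    run service glusterd start
    # Starting the volume only needs to be issued from one node.
    run gluster volume start "$VOLNAME"
}

phase_stop
phase_start
```

The client needs the same secure-access file before it mounts again; the point of the two phases is that no process using unencrypted management connections is still running when the first encrypted one starts.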

Sobhan was following [1], which isn't complete with respect to upgrade procedures. This lack of documentation was one of the issues we found when I got involved with GlusterFS network encryption and Manila. As a result, I've written up a guide on how to use network encryption with GlusterFS at [2], which covers many different scenarios of enabling network encryption in GlusterFS, including enabling management encryption on an existing cluster (as is the case here). I'll work with the documentation team to provide proper official documentation for RHGS based on [2]. Until we get the official documentation, please refer to [2] for network encryption guidance.

Sobhan, could you please re-run your tests following the guidelines given in [2]? You shouldn't face any issues if you follow it. If you do hit issues even when following the guidelines, please let me know.

As this issue is not really a bug in GlusterFS, but arose because of an incorrect setup and the steps followed, I suggest closing this bug. I'll do so if there are no objections.


[1]: https://github.com/gluster/glusterfs/blob/master/doc/admin-guide/en-US/markdown/admin_ssl.md
[2]: https://kshlm.in/network-encryption-in-glusterfs/

Comment 10 ssamanta 2015-05-26 12:35:00 UTC
Created attachment 1029922 [details]
verification logs

