Created attachment 1459060 [details]
Installation procedure for Gluster 4.X

Hello,

This is my first time reporting a bug on Bugzilla, so let me know if I post something wrong.

Description of problem:
I am doing some tests with GlusterFS 4.0 and 4.1 and I can't seem to solve some SSL/TLS issues. I am trying to set up a 2-node replicated Gluster volume with SSL/TLS. For this setup, I use 3 KVM VMs (2 storage nodes + 1 client node). For the networking part, I use a dedicated private LAN for the KVM VMs. Each VM is able to ping the others, so there is no connectivity problem.

Version-Release number of selected component (if applicable):
These are the installed packages on gluster-client:

[root@gluster-client ~]# rpm -qa | grep "gluster\|fuse"
glusterfs-4.1.1-1.el7.x86_64
centos-release-gluster41-1.0-1.el7.centos.x86_64
glusterfs-libs-4.1.1-1.el7.x86_64
glusterfs-client-xlators-4.1.1-1.el7.x86_64
glusterfs-fuse-4.1.1-1.el7.x86_64

And these are the installed packages on the gluster1 and gluster2 storage nodes:

[root@gluster1 ~]# rpm -qa | grep "gluster\|fuse"
glusterfs-api-4.1.1-1.el7.x86_64
centos-release-gluster41-1.0-1.el7.centos.x86_64
glusterfs-libs-4.1.1-1.el7.x86_64
glusterfs-4.1.1-1.el7.x86_64
glusterfs-cli-4.1.1-1.el7.x86_64
glusterfs-fuse-4.1.1-1.el7.x86_64
glusterfs-server-4.1.1-1.el7.x86_64
glusterfs-client-xlators-4.1.1-1.el7.x86_64

=====================================================

This is the information regarding the Gluster volume:

[root@gluster1 ~]# gluster volume info vol01

Volume Name: vol01
Type: Replicate
Volume ID: ab7426a5-23ab-40ff-91af-a5b977152553
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster1:/data/glusterfs/gluster1/vol01/brick1
Brick2: gluster2:/data/glusterfs/gluster2/vol01/brick1
Options Reconfigured:
ssl.cipher-list: ALL
network.ping-timeout: 5
server.ssl: on
client.ssl: on
auth.ssl-allow: *
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

=====================================================

Here is the peer information:

[root@gluster1 ~]# gluster peer status
Number of Peers: 1

Hostname: gluster2
Uuid: f506bf62-6551-46b0-8a5b-457ae1fde839
State: Peer in Cluster (Connected)

=====================================================

Here is the volume status:

[root@gluster1 ~]# gluster volume status vol01
Status of volume: vol01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster1:/data/glusterfs/gluster1/vol
01/brick1                                   49152     0          Y       11196
Brick gluster2:/data/glusterfs/gluster2/vol
01/brick1                                   49152     0          Y       11013
Self-heal Daemon on localhost               N/A       N/A        Y       11315
Self-heal Daemon on gluster2                N/A       N/A        Y       11086

Task Status of Volume vol01
------------------------------------------------------------------------------
There are no active volume tasks

=====================================================

How reproducible:

Steps to Reproduce:
1. Install GlusterFS 4.0 or 4.1.
2. Create a 2-node replicated Gluster volume with SSL/TLS.
3. After applying all the necessary settings, try to copy a file to the FUSE mount on the client node.

I've also attached a .txt file with my procedure for installing the Gluster nodes and the client. Let me know if you see anything wrong with it.

Actual results:
I receive the error "Transport endpoint is not connected" after I issue the copy command.

Expected results:
I expected the file to be copied without a problem, as in version 3.12.
Additional info:
There is a Gluster mailing-list thread about this; I will post it here so that the two are linked:
https://lists.gluster.org/pipermail/gluster-users/2018-July/034353.html

The mount works fine until I try to copy an archive, multiple smaller files, or a bigger file onto it (meaning it shows up correctly in "df -Th" and I can create several files with "touch file1 file2 ..."). Basically, after any data transfer, I get these errors.

I followed the instructions from the Red Hat page:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.1/html/administration_guide/chap-network_encryption

UPDATE 1: I tried the exact same steps on Gluster 3.12 and had no problem: the steps worked, SSL/TLS was enabled (which I verified), and there was no transport error. Afterwards, I also tried the new 4.1 release and the problem persists (the same "Transport endpoint is not connected" error).

Let me know if you need any other info. Any help is much appreciated.

Regards,
Andrei H.
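For reference, a minimal reproduction on the client might look like the sketch below; the mount point and the ISO file name are illustrative assumptions, not taken from the report:

# mount the volume over FUSE (mount point is an assumption)
mount -t glusterfs gluster1:/vol01 /mnt/vol01

# small metadata operations succeed
df -Th /mnt/vol01
touch /mnt/vol01/file1 /mnt/vol01/file2

# any bulk data transfer then fails with
# "Transport endpoint is not connected"
cp CentOS-7-x86_64-Minimal.iso /mnt/vol01/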
I can confirm the same issue. Copying a few small files onto the FUSE mount is no problem, but as soon as you put any "load" onto it (that is, more than a few files, or big files such as ISO images), the connection is interrupted with the error message shown above. Our current workaround is to disable server.ssl and client.ssl for the volumes. We never had this problem with Gluster 3.12.
As per step 8:

8. Set up TLS/SSL encryption on all nodes and clients (gluster1, gluster2, gluster-client):

openssl genrsa -out /etc/ssl/glusterfs.key 2048

On the gluster1 node:
openssl req -new -x509 -key /etc/ssl/glusterfs.key -subj "/CN=gluster1" -out /etc/ssl/glusterfs.pem

On the gluster2 node:
openssl req -new -x509 -key /etc/ssl/glusterfs.key -subj "/CN=gluster2" -out /etc/ssl/glusterfs.pem

On the gluster-client node:
openssl req -new -x509 -key /etc/ssl/glusterfs.key -subj "/CN=gluster-client" -out /etc/ssl/glusterfs.pem

----------

As per step 15:

15. Set up SSL/TLS access to the volume:

gluster volume set vol01 auth.ssl-allow 'gluster01,gluster02,gluster-client'
gluster volume set vol01 client.ssl on
gluster volume set vol01 server.ssl on
gluster volume set vol01 network.ping-timeout "5"
gluster volume start vol01

----------

Please note that the Common Name used during SSL key/cert generation is "gluster1", but the name given in auth.ssl-allow is "gluster01"; note the '0' prefixed to the '1'. Is this a typo in the bug report, or an actual typo in the volume configuration? If it is a typo in the volume configuration, it needs to be corrected. Please set auth.ssl-allow with:

gluster volume set vol01 auth.ssl-allow 'gluster1,gluster2,gluster-client'
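For completeness, the Red Hat guide referenced earlier also has each node and client trust its peers through an /etc/ssl/glusterfs.ca file; with self-signed certificates, that file is typically the concatenation of all the .pem files. A minimal sketch of that step, assuming the hostnames from the report and scp as the (assumed) copy mechanism:

# collect the three self-signed certificates on one machine
scp gluster1:/etc/ssl/glusterfs.pem gluster1.pem
scp gluster2:/etc/ssl/glusterfs.pem gluster2.pem
scp gluster-client:/etc/ssl/glusterfs.pem gluster-client.pem

# concatenate them into the CA bundle and distribute it to every node and client
cat gluster1.pem gluster2.pem gluster-client.pem > glusterfs.ca
for host in gluster1 gluster2 gluster-client; do
    scp glusterfs.ca ${host}:/etc/ssl/glusterfs.ca
done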
We use auth.ssl-allow "*" and we have the same issue so I'm guessing that's not the problem...
Hello,

It's just a typo in the bug report. I also tried Omar's setting with auth.ssl-allow "*" and the issue was the same.

Let me know if you need any other info. Thank you.
Milind - please see comment 4. Has any further investigation been done?
I've built RPMs from the release-4.1 branch with commit f33a61086da43af5a5de2ba99b4045a63cf5bd79 at HEAD.

I see no issues with the SSL configuration itself. As per the steps listed in the attachment, the server and client pem files are not signed by a CA. Since this is an upstream BZ, I'll recommend the user look at:
https://stackoverflow.com/questions/21297139/how-do-you-sign-a-certificate-signing-request-with-your-certification-authority

-----

That said, there is also no problem when using self-signed certificates.
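If the CA-signing route suggested above is taken, a rough sketch under the assumption of a simple private CA (file names are illustrative) could be:

# create a private CA once (illustrative names)
openssl genrsa -out glusterCA.key 2048
openssl req -new -x509 -key glusterCA.key -subj "/CN=gluster-ca" -days 365 -out glusterCA.pem

# on each node/client, generate a CSR instead of a self-signed certificate
openssl req -new -key /etc/ssl/glusterfs.key -subj "/CN=gluster1" -out gluster1.csr

# sign the CSR with the CA and install the result as the node's certificate
openssl x509 -req -in gluster1.csr -CA glusterCA.pem -CAkey glusterCA.key \
    -CAcreateserial -days 365 -out /etc/ssl/glusterfs.pem

# the CA certificate then serves as /etc/ssl/glusterfs.ca on all nodes and clients
cp glusterCA.pem /etc/ssl/glusterfs.ca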
I tried copying a 900MB ISO and saw the following problems.

I can see these errors in the client/mount log:

[2018-08-17 07:21:20.602283] E [socket.c:2167:__socket_read_frag] 0-rpc: wrong MSG-TYPE (574) received from 192.168.122.87:24007
[2018-08-17 07:21:20.602297] T [socket.c:2801:socket_poller] 0-patchy-client-0: disconnecting socket
[2018-08-17 07:21:20.602365] D [MSGID: 0] [client.c:2241:client_rpc_notify] 0-patchy-client-0: got RPC_CLNT_DISCONNECT
[2018-08-17 07:21:20.602379] I [MSGID: 114018] [client.c:2254:client_rpc_notify] 0-patchy-client-0: disconnected from patchy-client-0. Client process will keep trying to connect to glusterd until brick's port is available

On the brick side:

[2018-08-17 07:21:00.723552] E [MSGID: 115067] [server-rpc-fops_v2.c:1316:server4_writev_cbk] 0-patchy-server: 562: WRITEV 0 (3fd3cf86-419e-43eb-88ad-72b12263fab6), client: CTX_ID:47717648-2a74-49b5-8e39-4069a86b2246-GRAPH_ID:0-PID:1553-HOST:centos7-2-PC_NAME:patchy-client-0-RECON_NO:-0, error-xlator: - [Bad file descriptor]
I just re-tested using the commit tagged as v4.1.2 (044f9df65) and the problems persist as described above. The log messages are the same as the ones Milind is getting. From the client's perspective, the copy operation of an ISO file aborts with an error message; a few small files can be copied with no problems. Milind, do you therefore confirm that a problem exists, or is it still unclear?
I confirm that the problem is present.
REVIEW: https://review.gluster.org/20993 (rpc: handle EAGAIN when SSL_ERROR_SYSCALL is returned) posted (#2) for review on release-4.1 by Milind Changire
Please note that the master branch and release-4.1 branch have diverged significantly. So the patch is not applicable to master branch. Also, this issue has already been addressed in the master branch.
COMMIT: https://review.gluster.org/20993 committed in release-4.1 by "Shyamsundar Ranganathan" <srangana> with the commit message:

rpc: handle EAGAIN when SSL_ERROR_SYSCALL is returned

Problem:
A return value of ENODATA was forcibly returned in the case where SSL_get_error(r) returned SSL_ERROR_SYSCALL. Sometimes SSL_ERROR_SYSCALL is a transient error, which is identified by errno being set to EAGAIN. EAGAIN is not a fatal error and indicates that the syscall needs to be retried.

Solution:
Bubble up the errno in case SSL_get_error(r) returns SSL_ERROR_SYSCALL and let the upper layers handle it appropriately.

fixes: bz#1601356
Change-Id: I76eff278378930ee79abbf9fa267a7e77356eed6
Signed-off-by: Milind Changire <mchangir>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-4.1.5, please open a new bug report.

glusterfs-4.1.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-September/000113.html
[2] https://www.gluster.org/pipermail/gluster-users/
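Before reopening, it may be worth confirming that the updated packages are actually in place on every node and client, for example:

glusterfs --version
rpm -qa | grep "gluster\|fuse"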
Mounted dir gets error in GlusterFS storage cluster with SSL/TLS encryption when doing add-brick and remove-brick repeatedly

Hi all,

I'm afraid there is something wrong with a GlusterFS storage cluster using SSL/TLS encryption, in the latest version 4.1.5 as well as in older versions.

First, I enabled SSL for the GlusterFS replicated volume afr_vol, which is mounted on /mnt/gluster by the GlusterFS native client. Then, I ran add-brick and remove-brick repeatedly while the mounted dir was read or written continuously for a while; here I used the command "find /mnt/gluster" or "ls /mnt/gluster" to reproduce it. Eventually, the error message is as below:

find: ‘/mnt/gluster’: Transport endpoint is not connected

However, everything is OK if I disable SSL for the replicated volume afr_vol. So it seems that bug 1601356 ("Problem with SSL/TLS encryption on Gluster 4.0 & 4.1") hasn't been fixed yet.

Hope someone can help me. Thanks a lot.
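A rough reproduction sketch of this scenario; the extra peer name, brick path, and replica counts are assumptions (the report only names the volume afr_vol and the mount point /mnt/gluster):

# keep reading the SSL-enabled volume through the FUSE mount
while true; do find /mnt/gluster > /dev/null; done &

# meanwhile, repeatedly grow and shrink the volume
# (peer name and brick path are illustrative)
gluster volume add-brick afr_vol replica 3 node3:/data/brick1/afr_vol
gluster volume remove-brick afr_vol replica 2 node3:/data/brick1/afr_vol force

# after a few iterations the mount fails with:
# find: '/mnt/gluster': Transport endpoint is not connected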
As the mounted dir cannot be accessed, the glusterfs process sits at 100% CPU. The result of strace on the glusterfs process is as below (the same poll() call repeats continuously):

strace -f -p 6576

[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295) = 1 ([{fd=12, revents=POLLIN}])
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295) = 1 ([{fd=12, revents=POLLIN}])
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295) = 1 ([{fd=12, revents=POLLIN}])
[... the same poll() line repeats many more times ...]
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295^CProcess 6576 detached
Process 6577 detached
Process 6578 detached
Process 6579 detached
Process 6580 detached
Process 6581 detached
Process 6584 detached
Process 6585 detached
Process 6596 detached
Process 6597 detached
Process 1578 detached
Process 1581 detached
Process 11623 detached
Process 14581 detached
Process 14601 detached
<detached ...>
Process 15032 detached