Bug 1279628

Summary: [GSS]-gluster v heal volname info does not work with enabled ssl/tls
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Harold Miller <hamiller>
Component: replicate
Assignee: Ashish Pandey <aspandey>
Status: CLOSED ERRATA
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: aspandey, asrivast, biryulini, bkunal, ccalhoun, john, nchilaka, olim, pkarampu, ravishankar, rhinduja, rhs-bugs, sankarshan, smohan, storage-qa-internal
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: RHGS 3.1.3
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.7.9-2
Doc Type: Bug Fix
Doc Text:
When management encryption via SSL is enabled, glusterd only allows encrypted connections on port 24007. However, the self heal daemon did not use an encrypted connection when attempting to fetch its volfile. This meant that when management encryption was enabled, running the "gluster volume heal info" command resulted in error messages, and users could not see the list of files that needed to be healed. The self heal daemon now communicates correctly over an encrypted connection and "gluster volume heal info" works as expected.
Story Points: ---
Clone Of: 1258931
Clones: 1320388 (view as bug list)
Environment:
Last Closed: 2016-06-23 04:56:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1258931
Bug Blocks: 1299183, 1320388, 1321514, 1369170

Description Harold Miller 2015-11-09 21:43:26 UTC
+++ This bug was initially created as a clone of Bug #1258931 +++

Description of problem:
With SSL/TLS enabled, the command "gluster v heal <VOLUME> info" returns:
"VOLNAME: Not able to fetch volfile from glusterd
Volume heal failed."


Version-Release number of selected component (if applicable):
glusterfs 3.7.4-2 on RHEL 7.1 with TLS/SSL enabled.


How reproducible:
Consistently, once SSL/TLS is enabled.

Steps to Reproduce:
1. Set up an installation with a 2-brick replica volume.
2. Enable ssl/tls:
Generate a private key for each system.
openssl genrsa -out /etc/ssl/glusterfs.key 2048

Use the generated private key to create a signed certificate by running the following command:
openssl req -new -x509 -key /etc/ssl/glusterfs.key -subj "/CN=COMMONNAME" -out /etc/ssl/glusterfs.pem

Concatenate every node's glusterfs.pem into /etc/ssl/glusterfs.ca and copy the result to all nodes.

Unmount the mount point.

Enable encryption of management traffic:
touch /var/lib/glusterd/secure-access

Stop the volume:
gluster volume stop VOLNAME

Set up the list of allowed servers and clients:
gluster volume set VOLNAME auth.ssl-allow 'server1,server2,server3,client1,client2,client3'

Enable the SSL volume options:
gluster volume set VOLNAME client.ssl on
gluster volume set VOLNAME server.ssl on

Stop all glusterfs services:
/etc/init.d/glusterfs-server stop
pkill glusterd
pkill glusterfs 
pkill glusterfsd

Start the glusterfs service:
/etc/init.d/glusterfs-server start

Start the volume:
gluster volume start VOLNAME

Mount the share again if required.

3. Run "gluster v heal <VOLUME> info" (a consolidated sketch of these steps follows).
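
For convenience, the same reproduction as a single sketch. The volume name, server/client names, and the certificate CN are placeholders; the init-script path matches the Ubuntu packaging used in this report:

#!/bin/bash
# Consolidated sketch of steps 1-3 above; adjust placeholders for
# your environment.
VOLNAME=myvol

# On each node: generate a private key and a self-signed certificate.
openssl genrsa -out /etc/ssl/glusterfs.key 2048
openssl req -new -x509 -key /etc/ssl/glusterfs.key \
    -subj "/CN=$(hostname)" -out /etc/ssl/glusterfs.pem

# Concatenate every node's glusterfs.pem into glusterfs.ca and copy
# the result to all nodes (only the local pem is shown here).
cat /etc/ssl/glusterfs.pem >> /etc/ssl/glusterfs.ca

# Unmount clients, then enable management encryption.
touch /var/lib/glusterd/secure-access

# Stop the volume and enable SSL on it.
gluster volume stop "$VOLNAME"
gluster volume set "$VOLNAME" auth.ssl-allow 'server1,server2,client1'
gluster volume set "$VOLNAME" client.ssl on
gluster volume set "$VOLNAME" server.ssl on

# Restart all gluster processes, then start the volume again.
/etc/init.d/glusterfs-server stop
pkill glusterd; pkill glusterfs; pkill glusterfsd
/etc/init.d/glusterfs-server start
gluster volume start "$VOLNAME"

# Step 3 - on an affected build this fails with:
#   "VOLNAME: Not able to fetch volfile from glusterd / Volume heal failed."
gluster volume heal "$VOLNAME" info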

Actual results:
VOLNAME: Not able to fetch volfile from glusterd
Volume heal failed.

Expected results:
A list of the files that need healing.


Additional info:
OS: Ubuntu 14.04.3 LTS

In glfsheal-VOLNAME.log:
[2015-09-01 14:02:03.666757] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-09-01 14:02:03.670743] W [socket.c:642:__socket_rwv] 0-gfapi: readv on 127.0.0.1:24007 failed (No data available)
[2015-09-01 14:02:03.671053] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7feb297ebf46] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7feb27fad54e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7feb27fad65e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7feb27faef1c] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7feb27faf6b8] ))))) 0-gfapi: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2015-09-01 14:02:03.670590 (xid=0x1)
[2015-09-01 14:02:03.671076] E [MSGID: 104007] [glfs-mgmt.c:637:glfs_mgmt_getspec_cbk] 0-glfs-mgmt: failed to fetch volume file (key:repofiles) [Invalid argument]
[2015-09-01 14:02:03.671102] E [MSGID: 104024] [glfs-mgmt.c:738:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with remote-host: localhost (No data available) [No data available]
[2015-09-01 14:02:03.671115] I [MSGID: 104025] [glfs-mgmt.c:744:mgmt_rpc_notify] 0-glfs-mgmt: Exhausted all volfile servers [Transport endpoint is not connected]

In cli.log:
[2015-09-01 14:02:03.596437] I [cli.c:720:main] 0-cli: Started running gluster with version 3.7.3
[2015-09-01 14:02:03.599274] I [socket.c:3971:socket_init] 0-glusterfs: SSL support for glusterd is ENABLED
[2015-09-01 14:02:03.600027] I [socket.c:3971:socket_init] 0-glusterfs: SSL support for glusterd is ENABLED
[2015-09-01 14:02:03.659342] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-09-01 14:02:03.659426] I [socket.c:2409:socket_event_handler] 0-transport: disconnecting now
[2015-09-01 14:02:03.672269] I [input.c:36:cli_batch] 0-: Exiting with: 255

In etc-glusterfs-glusterd.vol.log every 3 sec next errors:
[2015-09-01 14:22:01.690274] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:04.690595] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:07.690876] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:10.691273] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:13.691728] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:16.692098] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:19.692517] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:22.692831] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)
[2015-09-01 14:22:25.693201] W [socket.c:642:__socket_rwv] 0-nfs: readv on /var/run/gluster/c6f1a58b839ee4334fc0c8731ca06078.socket failed (Invalid argument)

--- Additional comment from JWeir on 2015-10-07 12:09:03 EDT ---

Experiencing the exact same issue:  

Gluster 3.7.4
Ubuntu 14.04.3

The bug does not occur when the secure-access file is absent.

Comment 3 Cal Calhoun 2016-03-17 16:00:17 UTC
Can I get an update regarding this BZ to provide to my customer?

Comment 5 Cal Calhoun 2016-03-18 15:09:04 UTC
Hi Pranith,

  From the customer:

  >Yes, basically I expect a fix as soon as possible. Release dates are
  >too far away. We also have to plan any updates, especially because they
  >have an impact on our customers' applications.

  >Of course, the fix must be reliable; we only have production
  >environments, we cannot play with them.

  So yes, if we can provide one that will fit the customer's requirements, a HOTFIX would be great.

Cal

Comment 8 Cal Calhoun 2016-03-23 16:23:53 UTC
Hello Ashish,

  I've queried the customer to confirm the version they're running in their prod environment.  Can you confirm that this fix has been tested?  The customer has expressed concerns because they have no environment to test in prior to applying it.

Regards,

Cal

Comment 10 Cal Calhoun 2016-04-04 21:20:32 UTC
RHGS 3.1.2 rpm glusterfs-server-3.7.5-19.el7rhgs.x86_64

Comment 14 Cal Calhoun 2016-05-02 15:18:43 UTC
Customer has already applied a HOTFIX from BZ 1310740.  If a HOTFIX results from this BZ, will they be cumulative?

Comment 15 Pranith Kumar K 2016-05-03 05:27:17 UTC
I am sorry, I don't understand your comment, Cal. Only one bugfix is needed to fix this bug; is that what you are asking?

Comment 16 Cal Calhoun 2016-05-03 15:27:37 UTC
My apologies.  One of my customers, on SF case 01587696, is interested in a hotfix when this BZ has been through QE.  They have previously applied a hotfix from BZ 1310740 and would like to know if a hotfix resulting from this BZ would be safe to apply on top of the previous one.

Comment 17 Pranith Kumar K 2016-05-04 04:55:20 UTC
Yes, please go ahead. If you face any problems applying it, let us know, but I don't think there should be any issue.

Comment 18 Nag Pavan Chilakam 2016-05-27 10:39:34 UTC
QATP and the results:
===================


BUG#1279628 -  [GSS]-gluster v heal volname info does not work with enabled ssl/tls

    Description of Problem:

    When SSL/TLS is enabled, the command "gluster v heal <VOLUME> info" returns "VOLNAME: Not able to fetch volfile from glusterd Volume heal failed."


    Patch Info: http://review.gluster.org/#/c/13815/

    glfs/heal: Use encrypted connection in shd. When management encryption is enabled, GlusterD only allows encrypted connections on port 24007, but SHD was trying to fetch its volfile over an unencrypted connection. If /var/lib/glusterd/secure-access is present, i.e. if management SSL is enabled, use an encrypted connection to fetch the volfile from glusterd.
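
    In practice this means the heal-info helper keys off the same trigger file glusterd uses. A minimal illustration (VOLNAME is a placeholder):

    # Management encryption is driven by the presence of this file; with
    # the patch, the heal-info process honours it when fetching its
    # volfile, as the other gluster daemons already did.
    if [ -f /var/lib/glusterd/secure-access ]; then
        echo "management encryption on: volfile fetch must use SSL (port 24007)"
    fi
    gluster volume heal VOLNAME info   # succeeds once the fix is applied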

    QATP:

    TC#1: For a new volume, gluster v heal info should display the right information when SSL is enabled (both mngt and data traffic) -->PASS (a sketch of the volume steps follows this test case)

    1. Create a cluster and identify client(s)

    2. Now enable SSL for both mngt and data traffic using steps mentioned in https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html

    3. Now create a dist-rep volume

    4. Now issue heal info on that volume

    Heal info must give the right information and not throw "unable to fetch volfile" error
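
    A minimal sketch of TC#1's volume steps (server names, brick paths, and the volume name are placeholders; SSL for both layers is assumed to be enabled already per step 2):

    # Create and start a 2x2 distributed-replicate volume.
    gluster volume create distrep replica 2 \
        server1:/bricks/b1 server2:/bricks/b1 \
        server1:/bricks/b2 server2:/bricks/b2
    gluster volume start distrep

    # Expected: a per-brick list of entries needing heal (possibly empty),
    # not "Not able to fetch volfile from glusterd".
    gluster volume heal distrep info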


    TC#2: For an existing volume, gluster v heal info should display the right information when SSL is enabled (both mngt and data traffic) -->PASS

    1. Create a cluster and identify client(s)

    2. Create a distrep volume and populate some data

    3. Now enable SSL for both mngt and data traffic using steps mentioned in https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html

    4. Now issue heal info on that volume

    Heal info must give the right information and not throw "unable to fetch volfile" error


    TC#3: For a new volume, gluster v heal info should display the right information when only mngt-layer (glusterd) SSL is enabled -->PASS

    1. Create a cluster and identify client(s)

    2. Now enable SSL for only mngt using steps mentioned in https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html

    3. Now create a dist-rep volume

    4. Now issue heal info on that volume

    Heal info must give the right information and not throw "unable to fetch volfile" error


    TC#4: For an existing volume, gluster v heal info should display the right information when only mngt-layer (glusterd) SSL is enabled -->PASS

    1. Create a cluster and identify client(s)

    2. Now create a dist-rep volume

    3. Now enable SSL for only mngt using steps mentioned in https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html

    4. Now issue heal info on that volume

    Heal info must give the right information and not throw "unable to fetch volfile" error


    TC#5: While IOs are going on, gluster v heal info should display the right information when SSL is enabled (both mngt and data) -->FAIL

    1. Create a cluster and identify client(s)

    2. Now create a dist-rep volume

    3. Now enable SSL for both mngt and data traffic using steps mentioned in https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Network_Encryption.html

    4. Now mount the volume on fuse on multiple clients

    5. Now trigger IOs using dd or any other method

    IOs must not hang

    However, IOs and heal info hang; a bug has already been raised: 1337863 - [SSL] : I/O hangs when run from multiple clients on an SSL enabled volume

    6. Now also issue heal info on that volume

    Heal info must give the right information and not throw "unable to fetch volfile" error

    However, IOs and heal info hang; a bug has already been raised: 1337863 - [SSL] : I/O hangs when run from multiple clients on an SSL enabled volume

              Note: I had also set "gluster volume set <volname> locking-scheme granular" while the IOs were going on, so as to avoid the false positives mentioned in BUG#1311839 - False positives in heal info
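
              For reference, a sketch of that command with the option's fully qualified name (the cluster.* prefix is an assumption based on the upstream option table; VOLNAME is a placeholder):

              # Use granular entry-locking during heals to avoid heal-info
              # false positives while I/O is in flight (BUG#1311839).
              gluster volume set VOLNAME cluster.locking-scheme granular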





Test version:
============
[root@dhcp35-191 ~]# rpm -qa|grep gluster
glusterfs-cli-3.7.9-6.el7rhgs.x86_64
glusterfs-libs-3.7.9-6.el7rhgs.x86_64
glusterfs-fuse-3.7.9-6.el7rhgs.x86_64
glusterfs-client-xlators-3.7.9-6.el7rhgs.x86_64
glusterfs-server-3.7.9-6.el7rhgs.x86_64
python-gluster-3.7.9-5.el7rhgs.noarch
glusterfs-3.7.9-6.el7rhgs.x86_64
glusterfs-api-3.7.9-6.el7rhgs.x86_64

Comment 20 Ashish Pandey 2016-06-03 10:17:13 UTC
Hi Laura,

The text does not quite give the correct picture of the issue we faced.
It was the self-heal daemon that was not using an SSL connection to communicate with glusterd.

Modified text is - 
------------------
When management encryption is enabled, glusterd only allows encrypted connections on port 24007, but the Self Heal Daemon was trying to fetch its volfile using an unencrypted connection. This meant that when management SSL was enabled, running the "gluster volume heal info" command resulted in error messages, and users could not see the list of files that needed to be healed. The Self Heal Daemon now communicates correctly over an encrypted connection and "gluster volume heal info" works as expected.
------------------

Comment 22 Ashish Pandey 2016-06-06 04:42:25 UTC
Laura,

The description you provided in comment #21 looks perfect to me.
I have no further comments on it.

Comment 25 errata-xmlrpc 2016-06-23 04:56:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240