Bug 1408576
| Field | Value |
|---|---|
| Summary | [Ganesha+SSL] : Bonnie++ hangs during rewrites. |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | core |
| Version | rhgs-3.2 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | Ambarish <asoman> |
| Assignee | Mohit Agrawal <moagrawa> |
| QA Contact | Ambarish <asoman> |
| CC | amukherj, asoman, bturner, dang, ffilz, jdarcy, jthottan, kaushal, mbenjamin, moagrawa, rcyriac, rgowdapp, rhinduja, rhs-bugs, rjoseph, rkavunga, rtalur, skoduri, srangana, storage-qa-internal |
| Target Release | RHGS 3.2.0 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | glusterfs-3.8.4-14 |
| Type | Bug |
| Clones | 1418213 1418539 1418541 (view as bug list) |
| Bug Blocks | 1351528, 1418213, 1418539, 1418541 |
| Last Closed | 2017-03-23 05:59:45 UTC |
Description (Ambarish, 2016-12-25 09:16:31 UTC)
A few other things we know:

- The test passes in a non-SSL environment on 4 Ganesha v4 mounts.
- The test passes on FUSE and SMB mounts with SSL enabled.

Proposing this as a blocker since the application side is impacted.

From the gfapi logs on gqas008 and gqas014 there are a lot of timeout/disconnect messages:

```
[2016-12-20 13:30:18.882704] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x75400 sent = 2016-12-20 13:00:14.948068. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.882937] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvolD-client-0: remote operation failed [Transport endpoint is not connected]
[2016-12-20 13:30:18.909913] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753ff sent = 2016-12-20 13:00:14.941327. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.910054] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753e5 sent = 2016-12-20 13:00:14.930079. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.910334] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753e4 sent = 2016-12-20 13:00:14.923329. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.910495] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753e3 sent = 2016-12-20 13:00:14.916624. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.910680] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753e2 sent = 2016-12-20 13:00:14.909907. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.910940] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753e1 sent = 2016-12-20 13:00:14.903172. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.911102] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753e0 sent = 2016-12-20 13:00:14.896423. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.911300] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753df sent = 2016-12-20 13:00:14.889711. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.911467] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753de sent = 2016-12-20 13:00:14.882998. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 13:30:18.911867] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(WRITE(13)) xid = 0x753dd sent = 2016-12-20 13:00:14.876168. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 14:00:19.143825] W [MSGID: 108035] [afr-transaction.c:2166:afr_changelog_fsync_cbk] 0-testvolD-replicate-0: fsync(b77fbb20-c131-4886-bfd4-a5bd003bbe2f) failed on subvolume testvolD-client-0. Transaction was WRITE [Transport endpoint is not connected]
The message "W [MSGID: 114031] [client-rpc-fops.c:974:client3_3_fsync_cbk] 0-testvolD-client-0: remote operation failed [Transport endpoint is not connected]" repeated 206 times between [2016-12-20 14:00:19.143743] and [2016-12-20 14:00:19.177858]
The message "W [MSGID: 108035] [afr-transaction.c:2166:afr_changelog_fsync_cbk] 0-testvolD-replicate-0: fsync(b77fbb20-c131-4886-bfd4-a5bd003bbe2f) failed on subvolume testvolD-client-0. Transaction was WRITE [Transport endpoint is not connected]" repeated 206 times between [2016-12-20 14:00:19.143825] and [2016-12-20 14:00:19.177891]
[2016-12-20 14:30:29.406483] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x7553b sent = 2016-12-20 14:00:19.144275. timeout = 1800 for 192.168.79.140:49152
[2016-12-20 14:30:29.406634] E [rpc-clnt.c:200:call_bail] 0-testvolD-client-0: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x7553a sent = 2016-12-20 14:00:19.143995. timeout = 1800 for 192.168.79.140:49152
The message "W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-testvolD-client-0: remote operation failed" repeated 206 times between [2016-12-20 14:30:29.379019] and [2016-12-20 14:30:29.406657]
```

I tried to look into the core, but it complains while extracting (tried with tar, unzip and gzip; the core should give a clear idea of where it was stuck). On the bricks there was a disconnect around the same time, but they reconnected after a while. Requesting Talur/Du/Rafi to provide their input on this bug.

From the pstack on the server gqas008, there are 201 threads waiting in syncop_writev and the remaining 55 threads are waiting in mdcache_getattrs(). The statedumps on the brick side look okay.

The core given in the location is not usable. Ambarish and I are trying to reproduce the issue on my setup.

Tried to reproduce the issue with Ambarish's help in a two-server/two-client scenario using a 2x2 volume. In the first run, on one client the rewrite passed and it then errored out with:

```
Delete files in sequential order...done.
Create files in random order...Can't create file jkLcF20000000001
Cleaning up test directory after error.
```

But on the second client:

```
Writing intelligently...Can't write block.: Input/output error
Can't write block 1463302
```

I guess these errors are thrown because one of the subvolumes got completely filled. When I ran the test from a single client, it passed. I then retried the two-client scenario; in this run one client succeeded and the other errored out in a similar way. I also tried iozone -a from two different clients (a benchmark similar to Bonnie that uses smaller file sizes) and it passed. Since the issue is not reproduced on the dev setup, requesting QA to reproduce the issue again.

Got a reproducer.
Setup details shared with Dev as of now for further RCA.

Jiffin, reproduced again on the same setup. This time Bonnie was hung on rewrites on one client. On the other three clients it looks hung on the writes themselves for the past 12 hours. Can you please take a look when you get a chance?

I found the following in the bricks.

At brick0 on gqas13:

```
Thread 4 (Thread 0x7effaf266700 (LWP 22296)):
#0  0x00007effafc53dfd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007effa5d6ec43 in ssl_do.isra.9 () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#2  0x00007effa5d6f126 in __socket_rwv () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#3  0x00007effa5d6f882 in __socket_ioq_churn_entry () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#4  0x00007effa5d6fd2c in socket_submit_reply () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#5  0x00007effb1279e62 in rpcsvc_transport_submit () from /lib64/libgfrpc.so.0
#6  0x00007effb127b9a0 in rpcsvc_submit_generic () from /lib64/libgfrpc.so.0
#7  0x00007eff9bdc0f19 in server_submit_reply () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so
#8  0x00007eff9bdd78d7 in server_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so
#9  0x00007effa043002d in io_stats_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so
#10 0x00007effb1534ea2 in default_readv_cbk () from /lib64/libglusterfs.so.0
#11 0x00007effa10b9630 in up_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#12 0x00007effa12cf9f2 in leases_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/leases.so
#13 0x00007effa190358f in pl_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/locks.so
#14 0x00007effa284ef4d in ctr_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#15 0x00007effa329c970 in posix_readv () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#16 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#17 0x00007effa284ace0 in ctr_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#18 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#19 0x00007effa1d372ca in br_stub_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#20 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#21 0x00007effa1906546 in pl_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/locks.so
#22 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#23 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#24 0x00007effa12d310e in leases_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/leases.so
#25 0x00007effa10c1edb in up_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#26 0x00007effb1549d94 in default_readv_resume () from /lib64/libglusterfs.so.0
#27 0x00007effb14db1f2 in call_resume_wind () from /lib64/libglusterfs.so.0
#28 0x00007effb14db64d in call_resume () from /lib64/libglusterfs.so.0
#29 0x00007effa0eac957 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#30 0x00007effb0319dc5 in start_thread (arg=0x7effaf266700) at pthread_create.c:308
#31 0x00007effafc5e73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7eff641f6700 (LWP 22299)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007effb031bd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007effb031bc08 in __GI___pthread_mutex_lock (mutex=0x7eff900066d8) at pthread_mutex_lock.c:64
#3  0x00007effa5d6fbe2 in socket_submit_reply () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#4  0x00007effb1279e62 in rpcsvc_transport_submit () from /lib64/libgfrpc.so.0
#5  0x00007effb127b9a0 in rpcsvc_submit_generic () from /lib64/libgfrpc.so.0
#6  0x00007eff9bdc0f19 in server_submit_reply () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so
#7  0x00007eff9bdd78d7 in server_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so
#8  0x00007effa043002d in io_stats_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so
#9  0x00007effb1534ea2 in default_readv_cbk () from /lib64/libglusterfs.so.0
#10 0x00007effa10b9630 in up_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#11 0x00007effa12cf9f2 in leases_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/leases.so
#12 0x00007effa190358f in pl_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/locks.so
#13 0x00007effa284ef4d in ctr_readv_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#14 0x00007effa329c970 in posix_readv () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#15 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#16 0x00007effa284ace0 in ctr_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#17 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#18 0x00007effa1d372ca in br_stub_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#19 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#20 0x00007effa1906546 in pl_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/locks.so
#21 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#22 0x00007effb1531317 in default_readv () from /lib64/libglusterfs.so.0
#23 0x00007effa12d310e in leases_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/leases.so
#24 0x00007effa10c1edb in up_readv () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#25 0x00007effb1549d94 in default_readv_resume () from /lib64/libglusterfs.so.0
#26 0x00007effb14db1f2 in call_resume_wind () from /lib64/libglusterfs.so.0
#27 0x00007effb14db64d in call_resume () from /lib64/libglusterfs.so.0
#28 0x00007effa0eac957 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#29 0x00007effb0319dc5 in start_thread (arg=0x7eff641f6700) at pthread_create.c:308
#30 0x00007effafc5e73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
```

At brick3 on gqas011:

```
Thread 27 (Thread 0x7f84df7fe700 (LWP 15855)):
#0  0x00007f850d080dfd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f850319bc43 in ssl_do.isra.9 () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#2  0x00007f850319c126 in __socket_rwv () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#3  0x00007f850319c882 in __socket_ioq_churn_entry () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#4  0x00007f850319cd2c in socket_submit_reply () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#5  0x00007f850e6a6e62 in rpcsvc_transport_submit () from /lib64/libgfrpc.so.0
#6  0x00007f850e6a89a0 in rpcsvc_submit_generic () from /lib64/libgfrpc.so.0
#7  0x00007f84f934af19 in server_submit_reply () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so
#8  0x00007f84f9360f72 in server_fsync_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so
#9  0x00007f84f97b4e2b in io_stats_fsync_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so
#10 0x00007f84f9e0414e in barrier_fsync_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/barrier.so
#11 0x00007f850e967286 in default_fsync_cbk () from /lib64/libglusterfs.so.0
#12 0x00007f84fa654da6 in leases_fsync_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/leases.so
#13 0x00007f850e8fc977 in call_resume_unwind () from /lib64/libglusterfs.so.0
#14 0x00007f850e9086fd in call_unwind_error () from /lib64/libglusterfs.so.0
#15 0x00007f85006e2cec in posix_fsyncer_process () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#16 0x00007f85006e2eb3 in posix_fsyncer () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#17 0x00007f850d746dc5 in start_thread (arg=0x7f84df7fe700) at pthread_create.c:308
#18 0x00007f850d08b73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
```

There are inodelk waits on this server
as well. The other two bricks are fine. The setup is still in the same state. Requesting Rajesh to have a look.

IMHO this is what might have happened: two of the requests, a readv (brick0, gqas13) and an fsync (brick3, gqas011), got stuck at ssl_do()->poll(). All other requests in the rpc layer are then waiting on these requests to finish, which resulted in a hang.

From the initial analysis it seems that the SSL_write function failed, probably due to an underlying socket failure. After the failure, instead of returning an error, we wait in poll until the socket is available for write. I am not very familiar with this code; maybe Jeff can take a look and comment. The failure is seen in the function ssl_do, where SSL_get_error returns SSL_ERROR_WANT_WRITE. Kaushal, could you help here if possible?

Rajesh explained this to me last week and also proposed a possible solution. I'm not really familiar with the actual SSL bits of RPC; I am more familiar with the integration of RPC and SSL. As Rajesh mentioned above, we hang when we get an SSL_ERROR_WANT_WRITE error while expecting to read from or write to the socket. If we get an SSL_ERROR_WANT_READ error in the same situation, we return an error instead of waiting for a poll event. An error should be returned for SSL_ERROR_WANT_WRITE as well, to allow the upper layers to retry. I don't know whether this cause is right, or whether the solution is complete or correct; people more familiar with the SSL routines and errors should weigh in, ideally Jeff, as he knows more about it.

Resetting the needinfo back to Jeff, as Kaushal accidentally cleared it.

(In reply to Kaushal from comment #16) First, I just have to ask: was this tried with own-thread instead of epoll? If so, what was the result? As explained in the comment for the WANT_READ case, the reason we return an error there but not for WANT_WRITE is tied to the upper layers' error/retry behavior, which was correct when both SSL and own-thread were originally being developed. Since the later epoll code did not adhere to the same error/retry model, it might well require different behavior from ssl_do. It certainly seems worth trying the fix Kaushal has suggested. It would probably break own-thread, but that is becoming deprecated anyway.

Some thoughts on the bug: Returning from ssl_do when the error is SSL_ERROR_WANT_WRITE and the request was an SSL_write is fine (of course, r should be -1 and errno EAGAIN). NOTE: from comment #12 the stacks seem to be attempting a write (socket_submit_reply), so the above change may help. Getting SSL_ERROR_WANT_WRITE while attempting an SSL_read is due to SSL renegotiation that could be happening. Returning from ssl_do when the error is SSL_ERROR_WANT_WRITE and the request was an SSL_read is not OK on its face, as that code path in socket.c may not enable POLL_OUT in the events; if there is data to be read from the socket, the handler will just keep getting triggered for read until the SSL renegotiation write is possible. IOW, it will keep getting triggered until the socket can accept writes, which should happen at some point in time (when the SendQ for that socket has space, for example).
Coming to specifics in this hang:

- If the client end is closing the socket after some period of inactivity (due to the brick process getting stuck in the WANT_WRITE->poll), then poll on the brick should break with POLLHUP/ERR/NVAL. Why is this not happening?
- The poll is there to check whether we can write to the socket, and it will not return while the SendQ is full. So in this case, why is the SendQ not getting drained? (Also, can we check the netstat queue sizes for these sockets on the client and on the server?)
- Further, the SendQ may not get drained by the client due to the RPC throttling in our code; is that happening here? Can we also test without RPC throttling, if possible?
- The throttle code does not get triggered with own-thread, so testing with own-thread may add credence to the theory that the SendQ is getting full or is not being drained periodically.

Finally, in the code here we may want to return r=-1 and errno=EAGAIN very specifically. At the moment we return r as received from the SSL call, and only in some cases set errno appropriately.

(In reply to Shyam's points above) On the first point, I assume this is not the case; as you mentioned, we should have got some kind of error on poll. On the SendQ and throttling points, I suspect this might be the real cause: on one NFS setup I have seen a lot of TCP Zero Window messages when testing with the FIO test suite. Jiffin, can you please verify whether this is the real cause?

(In reply to rjoseph from comment #20) When I got the setup, there were no request packets being sent from the Ganesha server (the glusterfs client) to the bricks; I took a packet trace for half an hour. So I am not sure whether this happened.

Hi,

I have tried to reproduce the issue in a 2x2 environment by running Bonnie, and I am able to reproduce it. I am getting the same stack traces mentioned in the earlier comments. The traces show that threads (in socket_submit_reply) are waiting to take the lock, but the lock is held by thread 19931, which is blocked in a poll call because no POLLOUT event arrives on the socket.

Reproducer steps:
1) Set up a 2x2 environment (SSL + Ganesha)
2) Run Bonnie from 3 clients in parallel

After the test case reached "Rewriting" in Bonnie, I observed that no response packets were coming to the client.
All the threads waiting on the lock show the following in their stacks:

```
#3  0x00007fa71af73bc2 in socket_submit_reply (this=0x7fa70c003df0, reply=0x7fa6edef6720) at socket.c:3503
3503        pthread_mutex_lock (&priv->lock);
(gdb) p this->peerinfo->identifier
$1 = "10.65.6.244:49143", '\000' <repeats 90 times>
(gdb) p this->myinfo->identifier
$2 = "10.65.6.244:49152", '\000' <repeats 90 times>
(gdb) p priv->sock
$3 = 30
(gdb) p priv->lock
$4 = {__data = {__lock = 2, __count = 0, __owner = 19931, __nusers = 1, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\333M\000\000\001", '\000' <repeats 26 times>, __align = 2}
```

The lock owner is thread 19931:

```
(gdb) thread 25
[Switching to thread 25 (Thread 0x7fa6f88f4700 (LWP 19931))]
#0  0x00007fa724e57c3d in poll () from /lib64/libc.so.6
(gdb) f 3
#3  0x00007fa71af73106 in __socket_rwv (this=this@entry=0x7fa70c003df0, vector=<optimized out>, count=<optimized out>, pending_vector=pending_vector@entry=0x7fa728231e00, pending_count=pending_count@entry=0x7fa728231e08, bytes=bytes@entry=0x0, write=write@entry=1) at socket.c:542
542         ret = ssl_write_one (this, opvector->iov_base,
(gdb) p ((socket_private_t*) this->private)->sock
$46 = 30
```

Below is the full stack trace of thread 19931:

```
#0  0x00007fa724e57c3d in poll () from /lib64/libc.so.6
#1  0x00007fa71af72c23 in poll (__timeout=-1, __nfds=1, __fds=0x7fa6f88f2530) at /usr/include/bits/poll2.h:46
#2  ssl_do (buf=0x7fa6efc40000, len=131072, func=0x7fa71ad41d50 <SSL_write>, this=0x7fa70c003df0, this=0x7fa70c003df0) at socket.c:268
#3  0x00007fa71af73106 in __socket_rwv (this=this@entry=0x7fa70c003df0, vector=<optimized out>, count=<optimized out>, pending_vector=pending_vector@entry=0x7fa728231e00, pending_count=pending_count@entry=0x7fa728231e08, bytes=bytes@entry=0x0, write=write@entry=1) at socket.c:542
#4  0x00007fa71af73862 in __socket_writev (pending_count=<optimized out>, pending_vector=<optimized out>, count=<optimized out>, vector=<optimized out>, this=0x7fa70c003df0) at socket.c:661
#5  __socket_ioq_churn_entry (this=this@entry=0x7fa70c003df0, entry=entry@entry=0x7fa728231ce0, direct=direct@entry=1) at socket.c:1097
#6  0x00007fa71af73d0c in socket_submit_reply (this=0x7fa70c003df0, reply=<optimized out>) at socket.c:3521
#7  0x00007fa72647ceb2 in rpcsvc_transport_submit (trans=trans@entry=0x7fa70c003df0, rpchdr=rpchdr@entry=0x7fa6f88f27f0, rpchdrcount=rpchdrcount@entry=1, proghdr=proghdr@entry=0x7fa6f88f28a0, proghdrcount=proghdrcount@entry=1, progpayload=progpayload@entry=0x7fa6f88f2e80, progpayloadcount=progpayloadcount@entry=1, iobref=iobref@entry=0x7fa7282a0880, priv=0x0) at rpcsvc.c:1125
#8  0x00007fa72647e9f0 in rpcsvc_submit_generic (req=req@entry=0x7fa710e76ffc, proghdr=proghdr@entry=0x7fa6f88f28a0, hdrcount=hdrcount@entry=1, payload=payload@entry=0x7fa6f88f2e80, payloadcount=payloadcount@entry=1, iobref=iobref@entry=0x7fa7282a0880) at rpcsvc.c:1309
#9  0x00007fa711349ff9 in server_submit_reply (frame=frame@entry=0x7fa7241ba110, req=0x7fa710e76ffc, arg=arg@entry=0x7fa6f88f2970, payload=payload@entry=0x7fa6f88f2e80, payloadcount=payloadcount@entry=1, iobref=iobref@entry=0x7fa7282a0880, xdrproc=0x7fa725e4c980 <xdr_gfs3_read_rsp>) at server.c:186
#10 0x00007fa711361004 in server_readv_cbk (frame=0x7fa7241ba110, cookie=<optimized out>, this=0x7fa71402e400, op_ret=131072, op_errno=0, vector=0x7fa6f88f2e80, count=1, stbuf=0x7fa6f88f2e90, iobref=0x7fa7282a0880, xdata=0x0) at server-rpc-fops.c:1451
#11 0x00007fa7117b4f96 in io_stats_readv_cbk (frame=0x7fa7241a2420, cookie=<optimized out>, this=<optimized out>, op_ret=131072, op_errno=0, vector=0x7fa6f88f2e80, count=1, buf=0x7fa6f88f2e90, iobref=0x7fa7282a0880, xdata=0x0) at io-stats.c:2049
#12 0x00007fa726738ebb in default_readv_cbk (frame=0x7fa7241cb0b0, cookie=<optimized out>, this=<optimized out>, op_ret=131072, op_errno=0, vector=0x7fa6f88f2e80, count=1, stbuf=0x7fa6f88f2e90, iobref=0x7fa7282a0880, xdata=0x0) at defaults.c:1105
#13 0x00007fa712441ce1 in up_readv_cbk (frame=0x7fa7241aff94, cookie=<optimized out>, this=<optimized out>, op_ret=131072, op_errno=0, vector=0x7fa6f88f2e80, count=1, stbuf=0x7fa6f88f2e90, iobref=0x7fa7282a0880, xdata=0x0) at upcall.c:167
#14 0x00007fa712658a8b in leases_readv_cbk (frame=frame@entry=0x7fa7241bbabc, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=131072, op_errno=op_errno@entry=0, vector=vector@entry=0x7fa6f88f2e80, count=count@entry=1, stbuf=stbuf@entry=0x7fa6f88f2e90, iobref=iobref@entry=0x7fa7282a0880, xdata=xdata@entry=0x0) at leases.c:145
#15 0x00007fa712c911fe in pl_readv_cbk (frame=0x7fa7241c7018, cookie=<optimized out>, this=<optimized out>, op_ret=131072, op_errno=0, vector=0x7fa6f88f2e80, count=1, stbuf=0x7fa6f88f2e90, iobref=0x7fa7282a0880, xdata=<optimized out>) at posix.c:1759
#16 0x00007fa713be94a4 in ctr_readv_cbk (frame=frame@entry=0x7fa7241c2314, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=131072, op_errno=0, vector=vector@entry=0x7fa6f88f2e80, count=count@entry=1, stbuf=stbuf@entry=0x7fa6f88f2e90, iobref=iobref@entry=0x7fa7282a0880, xdata=xdata@entry=0x0) at changetimerecorder.c:1630
#17 0x00007fa7184a4f64 in posix_readv (frame=0x7fa7241b7ff0, this=<optimized out>, fd=<optimized out>, size=<optimized out>, offset=<optimized out>, flags=<optimized out>, xdata=0x0) at posix.c:3167
#18 0x00007fa713be4fba in ctr_readv (frame=frame@entry=0x7fa7241c2314, this=this@entry=0x7fa714005f60, fd=fd@entry=0x7fa7269cf328, size=size@entry=131072, off=off@entry=19660800, flags=flags@entry=0, xdata=xdata@entry=0x0) at changetimerecorder.c:1661
#19 0x00007fa7267351d1 in default_readv (frame=frame@entry=0x7fa7241c2314, this=this@entry=0x7fa714018cf0, fd=fd@entry=0x7fa7269cf328, size=size@entry=131072, offset=offset@entry=19660800, flags=flags@entry=0, xdata=xdata@entry=0x0) at defaults.c:2345
#20 0x00007fa7130cf6b4 in br_stub_readv (frame=frame@entry=0x7fa7241c2314, this=this@entry=0x7fa71401ad20, fd=fd@entry=0x7fa7269cf328, size=size@entry=131072, offset=offset@entry=19660800, flags=flags@entry=0, xdata=xdata@entry=0x0) at bit-rot-stub.c:1752
#21 0x00007fa7267351d1 in default_readv (frame=0x7fa7241c2314, this=<optimized out>, fd=0x7fa7269cf328, size=131072, offset=19660800, flags=0, xdata=0x0) at defaults.c:2345
#22 0x00007fa712c9a010 in pl_readv (frame=frame@entry=0x7fa7241c7018, this=this@entry=0x7fa71401d7f0, fd=fd@entry=0x7fa7269cf328, size=size@entry=131072, offset=offset@entry=19660800, flags=flags@entry=0, xdata=xdata@entry=0x0) at posix.c:1920
#23 0x00007fa7267351d1 in default_readv (frame=frame@entry=0x7fa7241c7018, this=this@entry=0x7fa71401ec30, fd=fd@entry=0x7fa7269cf328, size=size@entry=131072, offset=offset@entry=19660800, flags=flags@entry=0, xdata=xdata@entry=0x0) at defaults.c:2345
#24 0x00007fa7267351d1 in default_readv (frame=0x7fa7241c7018, this=<optimized out>, fd=0x7fa7269cf328, size=131072, offset=19660800, flags=0, xdata=0x0) at defaults.c:2345
#25 0x00007fa71265c5a8 in leases_readv (frame=0x7fa7241bbabc, this=0x7fa7140217a0, fd=0x7fa7269cf328, size=131072, offset=19660800, flags=0, xdata=0x0) at leases.c:180
#26 0x00007fa71244ac45 in up_readv (frame=0x7fa7241aff94, this=0x7fa714022cd0, fd=0x7fa7269cf328, size=131072, offset=19660800, flags=0, xdata=0x0) at upcall.c:190
#27 0x00007fa72674f9be in default_readv_resume (frame=0x7fa7241cb0b0, this=0x7fa714024270, fd=0x7fa7269cf328, size=131072, offset=19660800, flags=0, xdata=0x0) at defaults.c:1687
#28 0x00007fa7266de872 in call_resume_wind (stub=0x7fa723a83dac) at call-stub.c:2039
#29 0x00007fa7266decc5 in call_resume (stub=0x7fa723a83dac) at call-stub.c:2508
#30 0x00007fa712234dd4 in iot_worker (data=0x7fa71407f800) at io-threads.c:222
#31 0x00007fa72551cdc5 in start_thread (arg=0x7fa6f88f4700) at pthread_create.c:308
#32 0x00007fa724e6228d in clone () from /lib64/libc.so.6
```

Below is the current code of ssl_do. In the current code, on SSL_ERROR_WANT_WRITE we wait (indefinitely) in poll for a POLLOUT event on the socket:

```c
case SSL_ERROR_WANT_WRITE:
        pfd.fd = priv->sock;
        pfd.events = POLLOUT;
        if (poll(&pfd,1,-1) < 0) {
                gf_log(this->name,GF_LOG_ERROR,"poll error %d",
                       errno);
        }
```

I think we need to update the code here: when the calling function is SSL_write (or SSL_read, as Shyam suggested), we can return with an error code (EAGAIN) instead of waiting in poll, just as we already do for SSL_ERROR_WANT_READ; the error is already handled in the upper functions. I am testing my patch and will update with the result soon.

Regards,
Mohit Agrawal

Hi,

I have tested the patch by running the Bonnie tools; it is working fine:

```
bonnie++ -u root
Using uid:0, gid:0.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...
```

Regards,
Mohit Agrawal

upstream patch: https://review.gluster.org/#/c/16501/
downstream patch: https://code.engineering.redhat.com/gerrit/#/c/96684/

Clean run of Bonnie++ on my Ganesha+SSL setup with 3.8.4-14/2.4.1-7. Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html