Bug 1641969

Summary: Mounted Dir Gets Error in GlusterFS Storage Cluster with SSL/TLS Encryption When Doing add-brick and remove-brick Repeatedly
Product: [Community] GlusterFS
Reporter: Terry Cui <i_chips>
Component: core
Assignee: Amar Tumballi <amarts>
Status: CLOSED DEFERRED
Severity: high
Priority: medium
Version: 4.1
CC: atumball, bugs, pasik
Hardware: x86_64   
OS: Linux   
Last Closed: 2019-08-06 11:28:27 UTC
Type: Bug

Description Terry Cui 2018-10-23 09:14:25 UTC
Description of problem:
The mounted dir returns errors in a GlusterFS storage cluster with SSL/TLS encryption when add-brick and remove-brick are run repeatedly.


Version-Release number of selected component (if applicable):
4.1.5 or older versions


How reproducible:
It could be reproduced easily.


Steps to Reproduce:
1. SSL is enabled for the GlusterFS replicated volume afr_vol, which is mounted on /mnt/gluster by the GlusterFS native client.
2. The "add-brick" and "remove-brick" commands are executed repeatedly.
3. At the same time, the mounted dir is read or written continuously for a while. Here I used the command "find /mnt/gluster" or "ls /mnt/gluster".
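
The steps above can be sketched as a small driver script. This is only a minimal sketch under stated assumptions, not the reporter's actual procedure: the brick `server3:/bricks/afr_vol`, the replica counts 2 and 3, and the helper names are hypothetical, and it assumes the `gluster` CLI is installed and afr_vol is already mounted with SSL enabled. It also interleaves the directory listing with the brick operations instead of running them truly in parallel, which is a simplification of step 3.

```python
import subprocess

VOLUME = "afr_vol"      # volume name from the report
MOUNT = "/mnt/gluster"  # mount point from the report
EXTRA_BRICK = "server3:/bricks/afr_vol"  # hypothetical brick, not from the report


def add_brick_cmd(volume, brick, replica):
    """Build the CLI call that grows the replica set to `replica` bricks."""
    return ["gluster", "--mode=script", "volume", "add-brick", volume,
            "replica", str(replica), brick, "force"]


def remove_brick_cmd(volume, brick, replica):
    """Build the CLI call that shrinks the replica set back to `replica` bricks."""
    return ["gluster", "--mode=script", "volume", "remove-brick", volume,
            "replica", str(replica), brick, "force"]


def reproduce(iterations, run=subprocess.run):
    """Alternate add-brick and remove-brick, listing the mount after each one."""
    for _ in range(iterations):
        run(add_brick_cmd(VOLUME, EXTRA_BRICK, 3), check=False)
        run(["ls", MOUNT], check=False)
        run(remove_brick_cmd(VOLUME, EXTRA_BRICK, 2), check=False)
        run(["ls", MOUNT], check=False)
```

Calling `reproduce(100)` on a test cluster would run the loop for real; passing a different `run` callable makes it possible to inspect the generated commands without touching a cluster.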


Actual results:
After a while, the following error appears:
find: ‘/mnt/gluster’: Transport endpoint is not connected


Expected results:
The mounted dir should always remain accessible for reads and writes.


Additional info:
However, everything works if SSL is disabled for the GlusterFS replicated volume afr_vol.

While the mounted dir cannot be accessed, the glusterfs process uses 100% CPU. The strace output of the glusterfs process is shown below:

strace -f -p 6576


[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295) = 1 ([{fd=12, revents=POLLIN}])
[the identical poll() line above repeats 32 more times]
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295^CProcess 6576 detached
Process 6577 detached
Process 6578 detached
Process 6579 detached
Process 6580 detached
Process 6581 detached
Process 6584 detached
Process 6585 detached
Process 6596 detached
Process 6597 detached
Process 1578 detached
Process 1581 detached
Process 11623 detached
Process 14581 detached
Process 14601 detached
 <detached ...>
Process 15032 detached
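
One way to read the trace above: `4294967295` is `0xFFFFFFFF`, which is a `poll` timeout of `-1` printed as an unsigned 32-bit value, meaning "wait indefinitely". The same call nevertheless keeps completing with `POLLIN` on fd 12, which, together with the 100% CPU usage, would be consistent with a thread spinning on a descriptor that stays readable. That reading is an interpretation, not a confirmed root cause. The sketch below finds the longest run of identical syscall lines in saved `strace -f` output; the function names are hypothetical and the line format is assumed to match the `[pid N] ...` lines shown here.

```python
import re
from itertools import groupby

# Matches lines such as "[pid 14601] poll(...) = 1 (...)".
LINE_RE = re.compile(r"^\[pid\s+(\d+)\]\s+(.*)$")


def as_signed32(value):
    """Reinterpret an unsigned 32-bit integer as a signed one."""
    return value - (1 << 32) if value >= (1 << 31) else value


def longest_repeat(lines):
    """Return (pid, syscall_text, count) for the longest run of identical lines."""
    parsed = [m.groups() for m in map(LINE_RE.match, lines) if m]
    best = (None, None, 0)
    # groupby only merges consecutive equal (pid, call) pairs, which is what
    # a busy loop in a single thread looks like in the trace.
    for (pid, call), run in groupby(parsed):
        count = sum(1 for _ in run)
        if count > best[2]:
            best = (int(pid), call, count)
    return best
```

For example, `as_signed32(4294967295)` evaluates to `-1`, and feeding the saved trace lines to `longest_repeat` would report pid 14601 with the repeated `poll` call.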

Comment 1 Terry Cui 2018-11-19 05:45:42 UTC
Is there someone who can help me with this issue?
Thanks a lot. :-)

Comment 2 Amar Tumballi 2019-07-16 04:00:49 UTC
Terry, apologies for the delay.

While the issue may be valid, the use case of 'repeated' add-brick and remove-brick was not in the scope of Gluster's design. That is, it is true that Gluster is a software-defined storage solution, and we do provide scale-out and scale-down operations, but we designed them to be rare operations (e.g., once a year or once a quarter).

Comment 3 Amar Tumballi 2019-08-06 11:28:27 UTC
The use case of 'repeated' add-brick and remove-brick is not something we are focusing on right now. Marking it DEFERRED until it gets focus.