Description of problem:
The mounted directory becomes inaccessible in a GlusterFS storage cluster with SSL/TLS encryption enabled when add-brick and remove-brick are run repeatedly.

Version-Release number of selected component (if applicable):
4.1.5 or older versions

How reproducible:
Easily reproducible.

Steps to Reproduce:
1. Enable SSL for the GlusterFS replicated volume afr_vol, which is mounted on /mnt/gluster by the GlusterFS native client.
2. Run the "add-brick" and "remove-brick" commands repeatedly.
3. At the same time, read from or write to the mounted directory continuously for a while. Here I used the command "find /mnt/gluster" or "ls /mnt/gluster".

Actual results:
After a while, the following error message appears:
find: ‘/mnt/gluster’: Transport endpoint is not connected

Expected results:
The mounted directory should always remain accessible for reads and writes.

Additional info:
Everything works if SSL is disabled for the GlusterFS replicated volume afr_vol. When the mounted directory cannot be accessed, the glusterfs process uses 100% CPU.
The strace output of the glusterfs process is as follows:

strace -f -p 6576
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295) = 1 ([{fd=12, revents=POLLIN}])
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295) = 1 ([{fd=12, revents=POLLIN}])
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295) = 1 ([{fd=12, revents=POLLIN}])
(the identical poll() line repeats continuously)
[pid 14601] poll([{fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 4294967295^CProcess 6576 detached
Process 6577 detached
Process 6578 detached
Process 6579 detached
Process 6580 detached
Process 6581 detached
Process 6584 detached
Process 6585 detached
Process 6596 detached
Process 6597 detached
Process 1578 detached
Process 1581 detached
Process 11623 detached
Process 14581 detached
Process 14601 detached
 <detached ...>
Process 15032 detached
Is there someone who can help me with this issue? Thanks a lot. :-)
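The reproduction steps in the report could be scripted roughly as below. This is a sketch under stated assumptions: the brick "server3:/data/brick3" and the replica counts 2 and 3 are hypothetical (the report does not give them), and the sketch interleaves the directory listing with the brick operations, whereas the report ran them at the same time. The volume name and mount point come from the report.

```python
import subprocess

VOLUME = "afr_vol"               # from the report
MOUNT = "/mnt/gluster"           # from the report
BRICK = "server3:/data/brick3"   # hypothetical brick, not from the report


def add_brick_cmd(volume, brick, replica):
    """Build a non-interactive 'gluster volume add-brick' command line."""
    return ["gluster", "--mode=script", "volume", "add-brick",
            volume, "replica", str(replica), brick, "force"]


def remove_brick_cmd(volume, brick, replica):
    """Build a non-interactive 'gluster volume remove-brick' command line."""
    return ["gluster", "--mode=script", "volume", "remove-brick",
            volume, "replica", str(replica), brick, "force"]


def reproduce(cycles, run=subprocess.run):
    """Alternate add-brick and remove-brick, listing the mount in between.

    `run` is injectable so the command sequence can be inspected
    without a live cluster.
    """
    for _ in range(cycles):
        run(add_brick_cmd(VOLUME, BRICK, 3))     # assumed: grow replica 2 -> 3
        run(["ls", MOUNT])
        run(remove_brick_cmd(VOLUME, BRICK, 2))  # assumed: shrink back to 2
        run(["ls", MOUNT])

# Example (needs a test cluster with SSL enabled on the volume):
#   reproduce(cycles=50)
```

Run this only against a disposable test volume, since remove-brick with force discards the brick without migrating data.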
Terry, apologies for the delay. While the issue may be valid, the use case of 'repeated' add-brick and remove-brick was not within the scope of Gluster's design. That is, Gluster is indeed a software-defined storage solution, and we do provide scale-out and scale-down operations, but we designed them to be rare operations (e.g., once a year or once a quarter).
The use case of 'repeated' add-brick and remove-brick is not something we are focusing on right now. Marking it DEFERRED until it gets attention.