Bug 1387960 - Sequential volume start & stop is failing with an SSL-enabled setup.
Summary: Sequential volume start & stop is failing with an SSL-enabled setup.
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: 3.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kaushal
QA Contact:
URL:
Whiteboard:
Depends On: 1336371 1336508
Blocks: 1336339 1336369 1336376 1381822
 
Reported: 2016-10-24 04:14 UTC by Kaushal
Modified: 2017-03-08 12:32 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1336371
Environment:
Last Closed: 2017-03-08 12:32:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kaushal 2016-10-24 04:14:51 UTC
+++ This bug was initially created as a clone of Bug #1336371 +++

+++ This bug was initially created as a clone of Bug #1336369 +++

+++ This bug was initially created as a clone of Bug #1336339 +++

Description of problem:
=======================
I performed 400 sequential volume start and stop operations; after some time the operations started failing
with the following CLI error messages:
"""
volume start: rep: success
volume stop: rep: success
volume start: rep: success
volume stop: rep: failed: Commit failed on localhost. Please check the log file for more details.
volume start: rep: failed: Pre Validation failed on dhcp43-215.lab.eng.blr.redhat.com. Volume rep already started
volume stop: rep: failed: Volume rep is not in the started state
volume start: rep: failed: Pre-validation failed on localhost. Please check log file for details
volume stop: rep: failed: Volume rep is not in the started state
volume start: rep: failed: Pre-validation failed on localhost. Please check lo
"""

The glusterd logs are posted in the comments below.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4.


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Create a 16 x 2 = 32 volume using two nodes.
2. FUSE mount and unmount the volume 300 times.
3. Start and stop the volume 400 times; it fails after some number of start/stop operations. A minimal loop to drive this step is sketched after this list.
4. Check the glusterd logs for error messages and warnings.
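
A minimal shell loop for driving step 3, assuming a volume named "rep" (as in this report) on a cluster that already has SSL enabled; the use of "gluster --mode=script" to suppress the stop confirmation prompt is an assumption about how the test was driven, not something stated in the report:

"""
#!/bin/bash
# Start and stop the volume repeatedly, stopping at the first failing
# iteration so the matching window in the glusterd log is easy to locate.
VOL=rep          # volume name used in this report
ITERATIONS=400   # number of start/stop cycles from the reproduction steps

for i in $(seq 1 "$ITERATIONS"); do
    gluster volume start "$VOL" || { echo "start failed at iteration $i"; break; }
    # --mode=script answers the interactive stop confirmation non-interactively
    gluster --mode=script volume stop "$VOL" || { echo "stop failed at iteration $i"; break; }
done
"""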

Actual results:
===============
Sequential volume start and stop does not work reliably on an SSL-enabled setup.


Expected results:
=================
Sequential volume start and stop should work on an SSL-enabled setup.



Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-05-16 13:58:53 IST ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Byreddy on 2016-05-16 13:59:59 IST ---

[2016-05-16 06:52:45.028639] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028646] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028654] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028661] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028674] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028679] I [socket.c:459:ssl_setup_connection] 0-socket.management: peer CN = dhcp43-215.lab.eng.blr.redhat.com
[2016-05-16 06:52:45.028689] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028750] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028760] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028768] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028776] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028784] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028792] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028806] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028815] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028823] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028831] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028838] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028864] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028872] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028880] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028888] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028895] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028903] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028911] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028918] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028926] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028934] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028941] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028949] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028956] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028965] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.028973] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029132] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029152] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029168] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029183] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029191] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029198] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029206] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029213] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029221] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029228] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029236] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029244] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029251] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.029258] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.030909] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-05-16 06:52:45.031075] E [socket.c:2868:socket_server_event_handler] 0-socket.management: could not create pipe
[2016-05-16 06:52:45.041646] I [socket.c:459:ssl_setup_connection] 0-socket.management: peer CN = dhcp43-215.lab.eng.blr.redhat.com
[2016-05-16 06:52:45.043605] E [socket.c:2616:socket_poller] 0-socket.management: error in polling loop
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:44.061901] and [2016-05-16 06:52:47.161342]
[2016-05-16 06:52:47.163334] E [MSGID: 106280] [glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume rep is not in the started state
[2016-05-16 06:52:47.163376] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Stop' failed on localhost : Volume rep is not in the started state
[2016-05-16 06:52:47.163551] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:47.163551] and [2016-05-16 06:52:47.252524]
[2016-05-16 06:52:47.254692] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 06:52:47.254771] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error

[2016-05-16 06:52:47.254784] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 06:52:47.254793] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed.
[2016-05-16 06:52:47.254800] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node
[2016-05-16 06:52:47.254809] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
[2016-05-16 06:52:47.255009] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:47.255009] and [2016-05-16 06:52:50.350609]
[2016-05-16 06:52:50.352683] E [MSGID: 106280] [glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume rep is not in the started state
[2016-05-16 06:52:50.352756] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Stop' failed on localhost : Volume rep is not in the started state
[2016-05-16 06:52:50.352909] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:50.352909] and [2016-05-16 06:52:50.448995]
[2016-05-16 06:52:50.451250] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 06:52:50.451338] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error

[2016-05-16 06:52:50.451353] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 06:52:50.451363] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed.
[2016-05-16 06:52:50.451371] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node
[2016-05-16 06:52:50.451381] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
[2016-05-16 06:52:50.451505] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace.
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:50.451505] and [2016-05-16 06:52:53.542390]
[2016-05-16 06:52:53.544425] E [MSGID: 106280] [glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume rep is not in the started state
[2016-05-16 06:52:53.544467] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Stop' failed on localhost : Volume rep is not in the star:

[2016-05-16 06:53:00.020201] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error

[2016-05-16 06:53:00.020216] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 06:53:00.020225] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed.
[2016-05-16 06:53:00.020233] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node
[2016-05-16 06:53:00.020243] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
[2016-05-16 06:53:00.020359] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace.
[2016-05-16 06:56:30.764560] I [MSGID: 106487] [glusterd-handler.c:1472:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2016-05-16 07:27:02.476621] E [MSGID: 101012] [logging.c:376:gf_log_rotate] 0-logrotate: failed to open logfile [Too many open files]
[2016-05-16 07:27:02.476615] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.476753] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.476902] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.476983] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/stripe.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.477002] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/stripe.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.477010] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/debug/io-stats.so: cannot open shared object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/debug/io-stats.so: cannot open shared object file: Too many open files" repeated 13 times between [2016-05-16 07:27:02.477010] and [2016-05-16 07:27:02.477115]
[2016-05-16 07:27:02.477124] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so: cannot open shared object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so: cannot open shared object file: Too many open files" repeated 4 times between [2016-05-16 07:27:02.477124] and [2016-05-16 07:27:02.477170]
[2016-05-16 07:27:02.477177] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so: cannot open shared object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so: cannot open shared object file: Too many open files" repeated 6 times between [2016-05-16 07:27:02.477177] and [2016-05-16 07:27:02.477235]
[2016-05-16 07:27:02.477242] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/write-behind.so: cannot open shared object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/write-behind.so: cannot open shared object file: Too many open files" repeated 7 times between [2016-05-16 07:27:02.477242] and [2016-05-16 07:27:02.477312]
[2016-05-16 07:27:02.477319] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/open-behind.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.477343] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/open-behind.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.477350] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/read-ahead.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.477371] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/md-cache.so: cannot open shared object file: Too many open files
[2016-05-16 07:27:02.477390] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/encryption/crypt.so: cannot open shared object file: Too many open files
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/encryption/crypt.so: cannot open shared object file: Too many open files" repeated 2 times between [2016-05-16 07:27:02.477390] and [2016-05-16 07:27:02.477412]
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so: cannot open shared object file: Too many open files" repeated 9 times between [2016-05-16 07:27:02.476615] and [2016-05-16 07:27:02.478514]
The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so: cannot open shared object file: Too many open files" repeated 2 times between [2016-05-16 07:27:02.478527] and [2016-05-16 07:27:02.478556]
[2016-05-16 08:04:49.794095] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace.
[2016-05-16 08:04:49.849469] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume rep
[2016-05-16 08:04:49.851449] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Status' failed on localhost : Volume rep is not started
[2016-05-16 08:04:55.999928] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:04:56.000034] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error

[2016-05-16 08:04:56.000059] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:04:56.000096] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed.
[2016-05-16 08:04:56.000107] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node
[2016-05-16 08:04:56.000117] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
[2016-05-16 08:05:27.990060] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:05:27.990090] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error

[2016-05-16 08:05:27.990157] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:05:27.990177] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed.
[2016-05-16 08:05:27.990246] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node
[2016-05-16 08:05:27.990272] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 10 times between [2016-05-16 08:04:49.794095] and [2016-05-16 08:05:27.990523]
[2016-05-16 08:13:05.184001] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace.
[2016-05-16 08:13:05.186580] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:13:05.186639] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error

[2016-05-16 08:13:05.186654] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:13:05.186665] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed.
[2016-05-16 08:13:05.186675] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node
[2016-05-16 08:13:05.186690] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
[2016-05-16 08:13:06.581765] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides
[2016-05-16 08:13:06.581767] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed.
[2016-05-16 08:13:06.581833] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node
[2016-05-16 08:13:06.581722] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error
 [Unknown error -11]
[2016-05-16 08:13:06.581761] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error

[2016-05-16 08:13:06.581873] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 5 times between [2016-05-16 08:13:05.184001] and [2016-05-16 08:13:06.581980]
[2016-05-16 08:19:40.223288] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2016-05-16 08:19:40.239294] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req

--- Additional comment from Kaushal on 2016-05-16 15:12:08 IST ---

Each encrypted connection creates a pipe (which uses 2 fds) that is not closed on disconnection. This is an fd leak, which can lead to resource starvation and cause subsequent operations to fail.
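
A quick way to observe this leak while the start/stop loop runs (a sketch only; it assumes a single glusterd process and must run as root so /proc/<pid>/fd is readable). On an affected build the pipe fd count keeps climbing until the process hits its fd limit and accept() starts failing with "Too many open files", as in the log above:

"""
#!/bin/bash
# Periodically count glusterd's open fds and how many of them are pipes.
# /proc/<pid>/fd entries for pipes are symlinks of the form "pipe:[inode]".
PID=$(pidof glusterd)

while sleep 5; do
    total=$(ls /proc/"$PID"/fd | wc -l)
    pipes=$(ls -l /proc/"$PID"/fd | grep -c pipe)
    echo "$(date -u +%FT%TZ) total_fds=$total pipe_fds=$pipes"
done
"""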

--- Additional comment from Vijay Bellur on 2016-05-16 15:18:56 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#1) for review on master by Kaushal M (kaushal)

--- Additional comment from Vijay Bellur on 2016-05-16 15:53:41 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#2) for review on master by Kaushal M (kaushal)

--- Additional comment from Vijay Bellur on 2016-05-18 15:58:03 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#3) for review on master by Kaushal M (kaushal)

--- Additional comment from Vijay Bellur on 2016-06-15 13:36:26 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#4) for review on master by Kaushal M (kaushal)

--- Additional comment from Atin Mukherjee on 2016-08-10 12:34:15 IST ---

http://review.gluster.org/#/c/14694/ actually fixes this problem.

--- Additional comment from Kaushal on 2016-10-20 14:13:19 IST ---

http://review.gluster.org/#/c/14694/ doesn't fix the issue completely.

With that change, pipes are closed only when spawning the own-thread (socket_spawn) fails. They are still left open when a working connection disconnects.

So the fix is only partial. It still requires http://review.gluster.org/14356, which closes the pipe after an SSL connection disconnects.

--- Additional comment from Worker Ant on 2016-10-20 14:15:18 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#5) for review on master by Kaushal M (kaushal)

--- Additional comment from Worker Ant on 2016-10-20 20:08:23 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#6) for review on master by Kaushal M (kaushal)

--- Additional comment from Worker Ant on 2016-10-20 20:22:14 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#7) for review on master by Kaushal M (kaushal)

--- Additional comment from Worker Ant on 2016-10-21 11:34:59 IST ---

REVIEW: http://review.gluster.org/14356 (rpc/socket: Close pipe on disconnection) posted (#8) for review on master by Kaushal M (kaushal)

--- Additional comment from Worker Ant on 2016-10-24 09:40:54 IST ---

COMMIT: http://review.gluster.org/14356 committed in master by Raghavendra G (rgowdapp) 
------
commit d4b993610c1ee64ab7c2f41cf457e63a3da7f521
Author: Kaushal M <kaushal>
Date:   Mon May 16 15:15:49 2016 +0530

    rpc/socket: Close pipe on disconnection
    
    Encrypted connections create a pipe, which isn't closed when the
    connection disconnects. This leaks fds, and gluster eventually ends up
    in a situation with fd starvation which leads to operation failures.
    
    Change-Id: I144e1f767cec8c6fc1aa46b00cd234129d2a4adc
    BUG: 1336371
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/14356
    Tested-by: MOHIT AGRAWAL <moagrawa>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 1 Worker Ant 2016-10-24 04:18:26 UTC
REVIEW: http://review.gluster.org/15702 (rpc/socket: Close pipe on disconnection) posted (#1) for review on release-3.9 by Kaushal M (kaushal)

Comment 2 Worker Ant 2016-10-25 11:09:11 UTC
COMMIT: http://review.gluster.org/15702 committed in release-3.9 by Raghavendra G (rgowdapp) 
------
commit 4d59c16154e794677bae61842e32fa8aef6d419f
Author: Kaushal M <kaushal>
Date:   Mon May 16 15:15:49 2016 +0530

    rpc/socket: Close pipe on disconnection
    
    Encrypted connections create a pipe, which isn't closed when the
    connection disconnects. This leaks fds, and gluster eventually ends up
    in a situation with fd starvation which leads to operation failures.
    
    > Change-Id: I144e1f767cec8c6fc1aa46b00cd234129d2a4adc
    > BUG: 1336371
    > Signed-off-by: Kaushal M <kaushal>
    > Reviewed-on: http://review.gluster.org/14356
    > Tested-by: MOHIT AGRAWAL <moagrawa>
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Reviewed-by: Raghavendra G <rgowdapp>
    > Signed-off-by: Kaushal M <kaushal>
    
    Change-Id: I144e1f767cec8c6fc1aa46b00cd234129d2a4adc
    BUG: 1387960
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/15702
    Tested-by: Atin Mukherjee <amukherj>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 3 Kaushal 2017-03-08 12:32:17 UTC
This bug is getting closed because GlusterFS-3.9 has reached its end-of-life [1].

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please open a new bug against the newer release.

[1]: https://www.gluster.org/community/release-schedule/

