Description of problem:
=======================
I ran 400 sequential volume start/stop iterations; after some time they began failing with CLI errors:
"""
volume start: rep: success
volume stop: rep: success
volume start: rep: success
volume stop: rep: failed: Commit failed on localhost. Please check the log file for more details.
volume start: rep: failed: Pre Validation failed on dhcp43-215.lab.eng.blr.redhat.com. Volume rep already started
volume stop: rep: failed: Volume rep is not in the started state
volume start: rep: failed: Pre-validation failed on localhost. Please check log file for details
volume stop: rep: failed: Volume rep is not in the started state
volume start: rep: failed: Pre-validation failed on localhost. Please check lo
"""
glusterd logs are posted in the comments below.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4.

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Create a 16 x 2 = 32 volume using two nodes.
2. FUSE mount and unmount the volume 300 times.
3. Run volume start/stop 400 times; it fails after some number of start/stop operations.
4. Check the glusterd logs for error messages and warnings.

Actual results:
===============
Sequential volume start and stop does not work on an SSL-enabled setup.

Expected results:
=================
Sequential volume start and stop should work on an SSL-enabled setup.

Additional info:
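A minimal way to confirm the leak while running the loop above is to watch glusterd's open fd count from /proc. This is a hedged sketch, not part of the original report; the helper name and the glusterd pid-file path shown in the comment are assumptions.

```python
import os

def count_open_fds(pid):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Hypothetical usage against a running glusterd, inside the start/stop loop:
#   pid = int(open("/var/run/glusterd.pid").read())   # path is an assumption
#   print(count_open_fds(pid))
# A count that grows steadily across iterations and never comes back down
# points at an fd leak like the one reported here.
```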
[2016-05-16 06:52:45.028639] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028646] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028654] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028661] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028674] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028679] I [socket.c:459:ssl_setup_connection] 0-socket.management: peer CN = dhcp43-215.lab.eng.blr.redhat.com [2016-05-16 06:52:45.028689] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028750] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028760] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028768] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028776] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028784] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028792] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028806] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028815] W 
[socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028823] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028831] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028838] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028864] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028872] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028880] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028888] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028895] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028903] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028911] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028918] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028926] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028934] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028941] W [socket.c:2719:socket_server_event_handler] 0-socket.management: 
accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028949] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028956] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028965] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.028973] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029132] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029152] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029168] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029183] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029191] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029198] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029206] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029213] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029221] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029228] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 
06:52:45.029236] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029244] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029251] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.029258] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.030909] W [socket.c:2719:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files) [2016-05-16 06:52:45.031075] E [socket.c:2868:socket_server_event_handler] 0-socket.management: could not create pipe [2016-05-16 06:52:45.041646] I [socket.c:459:ssl_setup_connection] 0-socket.management: peer CN = dhcp43-215.lab.eng.blr.redhat.com [2016-05-16 06:52:45.043605] E [socket.c:2616:socket_poller] 0-socket.management: error in polling loop The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:44.061901] and [2016-05-16 06:52:47.161342] [2016-05-16 06:52:47.163334] E [MSGID: 106280] [glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume rep is not in the started state [2016-05-16 06:52:47.163376] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Stop' failed on localhost : Volume rep is not in the started state [2016-05-16 06:52:47.163551] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace. The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." 
repeated 2 times between [2016-05-16 06:52:47.163551] and [2016-05-16 06:52:47.252524] [2016-05-16 06:52:47.254692] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error [Unknown error -11] [2016-05-16 06:52:47.254771] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error [2016-05-16 06:52:47.254784] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides [2016-05-16 06:52:47.254793] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. [2016-05-16 06:52:47.254800] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2016-05-16 06:52:47.254809] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed [2016-05-16 06:52:47.255009] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace. The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:47.255009] and [2016-05-16 06:52:50.350609] [2016-05-16 06:52:50.352683] E [MSGID: 106280] [glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume rep is not in the started state [2016-05-16 06:52:50.352756] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Stop' failed on localhost : Volume rep is not in the started state [2016-05-16 06:52:50.352909] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace. 
The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 2 times between [2016-05-16 06:52:50.352909] and [2016-05-16 06:52:50.448995] [2016-05-16 06:52:50.451250] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error [Unknown error -11] [2016-05-16 06:52:50.451338] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error [2016-05-16 06:52:50.451353] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides [2016-05-16 06:52:50.451363] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. [2016-05-16 06:52:50.451371] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2016-05-16 06:52:50.451381] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed [2016-05-16 06:52:50.451505] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace. The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." 
repeated 2 times between [2016-05-16 06:52:50.451505] and [2016-05-16 06:52:53.542390] [2016-05-16 06:52:53.544425] E [MSGID: 106280] [glusterd-volume-ops.c:1653:glusterd_op_stage_stop_volume] 0-management: Volume rep is not in the started state [2016-05-16 06:52:53.544467] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Stop' failed on localhost : Volume rep is not in the star: [2016-05-16 06:53:00.020201] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error [2016-05-16 06:53:00.020216] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides [2016-05-16 06:53:00.020225] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. [2016-05-16 06:53:00.020233] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2016-05-16 06:53:00.020243] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed [2016-05-16 06:53:00.020359] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace. 
[2016-05-16 06:56:30.764560] I [MSGID: 106487] [glusterd-handler.c:1472:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req [2016-05-16 07:27:02.476621] E [MSGID: 101012] [logging.c:376:gf_log_rotate] 0-logrotate: failed to open logfile [Too many open files] [2016-05-16 07:27:02.476615] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.476753] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.476902] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/disperse.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.476983] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/stripe.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.477002] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/stripe.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.477010] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/debug/io-stats.so: cannot open shared object file: Too many open files The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/debug/io-stats.so: cannot open shared object file: Too many open files" repeated 13 times between [2016-05-16 07:27:02.477010] and [2016-05-16 07:27:02.477115] [2016-05-16 07:27:02.477124] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so: cannot open shared object file: Too many open files The message "W [MSGID: 101095] 
[xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-cache.so: cannot open shared object file: Too many open files" repeated 4 times between [2016-05-16 07:27:02.477124] and [2016-05-16 07:27:02.477170] [2016-05-16 07:27:02.477177] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so: cannot open shared object file: Too many open files The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so: cannot open shared object file: Too many open files" repeated 6 times between [2016-05-16 07:27:02.477177] and [2016-05-16 07:27:02.477235] [2016-05-16 07:27:02.477242] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/write-behind.so: cannot open shared object file: Too many open files The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/write-behind.so: cannot open shared object file: Too many open files" repeated 7 times between [2016-05-16 07:27:02.477242] and [2016-05-16 07:27:02.477312] [2016-05-16 07:27:02.477319] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/open-behind.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.477343] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/open-behind.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.477350] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/read-ahead.so: cannot open shared object file: Too many open files [2016-05-16 07:27:02.477371] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/performance/md-cache.so: 
cannot open shared object file: Too many open files [2016-05-16 07:27:02.477390] W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/encryption/crypt.so: cannot open shared object file: Too many open files The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/encryption/crypt.so: cannot open shared object file: Too many open files" repeated 2 times between [2016-05-16 07:27:02.477390] and [2016-05-16 07:27:02.477412] The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so: cannot open shared object file: Too many open files" repeated 9 times between [2016-05-16 07:27:02.476615] and [2016-05-16 07:27:02.478514] The message "W [MSGID: 101095] [xlator.c:148:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so: cannot open shared object file: Too many open files" repeated 2 times between [2016-05-16 07:27:02.478527] and [2016-05-16 07:27:02.478556] [2016-05-16 08:04:49.794095] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace. 
[2016-05-16 08:04:49.849469] I [MSGID: 106499] [glusterd-handler.c:4330:__glusterd_handle_status_volume] 0-management: Received status volume req for volume rep [2016-05-16 08:04:49.851449] E [MSGID: 106301] [glusterd-syncop.c:1281:gd_stage_op_phase] 0-management: Staging of operation 'Volume Status' failed on localhost : Volume rep is not started [2016-05-16 08:04:55.999928] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error [Unknown error -11] [2016-05-16 08:04:56.000034] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error [2016-05-16 08:04:56.000059] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides [2016-05-16 08:04:56.000096] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. 
[2016-05-16 08:04:56.000107] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2016-05-16 08:04:56.000117] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed [2016-05-16 08:05:27.990060] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error [Unknown error -11] [2016-05-16 08:05:27.990090] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error [2016-05-16 08:05:27.990157] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides [2016-05-16 08:05:27.990177] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. [2016-05-16 08:05:27.990246] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2016-05-16 08:05:27.990272] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 10 times between [2016-05-16 08:04:49.794095] and [2016-05-16 08:05:27.990523] [2016-05-16 08:13:05.184001] W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace. 
[2016-05-16 08:13:05.186580] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error [Unknown error -11] [2016-05-16 08:13:05.186639] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error [2016-05-16 08:13:05.186654] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides [2016-05-16 08:13:05.186665] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. [2016-05-16 08:13:05.186675] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2016-05-16 08:13:05.186690] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed [2016-05-16 08:13:06.581765] E [MSGID: 106187] [glusterd-volume-ops.c:1486:glusterd_op_stage_start_volume] 0-management: Could not find peer on which brick dhcp43-214.lab.eng.blr.redhat.com:/bricks/brick0/brr0 resides [2016-05-16 08:13:06.581767] W [MSGID: 106122] [glusterd-mgmt.c:198:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. 
[2016-05-16 08:13:06.581833] E [MSGID: 106122] [glusterd-mgmt.c:879:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2016-05-16 08:13:06.581722] E [MSGID: 106408] [glusterd-peer-utils.c:120:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: System error [Unknown error -11] [2016-05-16 08:13:06.581761] E [MSGID: 101075] [common-utils.c:3137:gf_is_local_addr] 0-management: error in getaddrinfo: System error [2016-05-16 08:13:06.581873] E [MSGID: 106122] [glusterd-mgmt.c:1991:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed The message "W [MSGID: 101088] [common-utils.c:3662:gf_backtrace_save] 0-management: Failed to save the backtrace." repeated 5 times between [2016-05-16 08:13:05.184001] and [2016-05-16 08:13:06.581980] [2016-05-16 08:19:40.223288] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req [2016-05-16 08:19:40.239294] I [MSGID: 106488] [glusterd-handler.c:1533:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
Each encrypted connection creates a pipe (which uses 2 fds) that is not closed on disconnect. This is an fd leak, which can lead to resource starvation and cause subsequent operations to fail.
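The mechanism can be simulated in a few lines of Python. This is an illustration of the failure mode, not GlusterFS code: lower the process fd limit, then "leak" one pipe per connection until the kernel refuses with EMFILE, the "Too many open files" error flooding the logs above.

```python
import errno
import os
import resource

# Lower the soft fd limit so the leak bites quickly (illustration only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

leaked = []
try:
    while True:
        # One pipe (2 fds) per "encrypted connection", never closed
        # on disconnect -- the leak described in this bug.
        leaked.append(os.pipe())
except OSError as e:
    assert e.errno == errno.EMFILE  # "Too many open files"
    print(f"fd table exhausted after {len(leaked)} leaked pipes")
finally:
    # The fix amounts to doing this cleanup when the connection goes away:
    for r, w in leaked:
        os.close(r)
        os.close(w)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

Once the fd table is full, every later open/accept/dlopen in the same process fails the same way, which explains why unrelated operations (volume stop staging, getaddrinfo, loading xlator .so files) all start erroring out together.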
Upstream patch http://review.gluster.org/14356 has been posted for review. I'd prefer to mark this as a known issue, since restarting glusterd can serve as a workaround for the problem.
Kaushal, could you fill in the doc text, marking it as a known issue? ~Atin
It is not just volume start/stop: I see this log message whenever I do a replace-brick, add-brick, enable quota, etc.: [socket.c:2618:socket_poller] 0-tcp.testvol-server: error in polling loop
Clearing the needinfo from Kaushal, as the doc text is already provided.
http://review.gluster.org/14891 has been merged in the 3.8 branch.
Upstream mainline: http://review.gluster.org/14694 Upstream 3.8: http://review.gluster.org/14891 The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
This bug was verified using the build glusterfs-3.8.4-2. Repeated the reproduction steps mentioned above; the issue no longer occurs. Moving to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html