Bug 1381822 - glusterd.log is flooded with socket.management: accept on 11 failed (Too many open files) and glusterd service stops
Summary: glusterd.log is flooded with socket.management: accept on 11 failed (Too many...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Mohit Agrawal
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On: 1336371 1387960
Blocks: 1351528
 
Reported: 2016-10-05 06:43 UTC by RamaKasturi
Modified: 2017-03-23 06:08 UTC (History)
CC List: 14 users

Fixed In Version: glusterfs-3.8.4-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 06:08:03 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC
Red Hat Bugzilla 1427958 None CLOSED [GSS] Error 0-socket.management: socket_poller XX.XX.XX.XX:YYY failed (Input/output error) during any volume operation 2019-08-22 05:42:07 UTC

Internal Links: 1427958

Description RamaKasturi 2016-10-05 06:43:53 UTC
Description of problem:
I see that glusterd.log is flooded with the messages below; the log file grows huge within minutes and glusterd just stops.

[2016-10-05 06:20:46.491886] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 11 failed (Too many open files)
[2016-10-05 06:20:46.491895] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-10-05 06:20:46.491904] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 11 failed (Too many open files)
[2016-10-05 06:20:46.491913] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-10-05 06:20:46.491922] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 11 failed (Too many open files)


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-2.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Installed HC stack on RHV-H and arbiter
2. Enabled SSL on the volumes
3. Just left the setup overnight.

Actual results:
All of a sudden I see that glusterd.log gets flooded with the messages reported in the description, the log file grows huge (up to 14G), and the glusterd service stops.

Expected results:
glusterd.log should not be flooded with these messages and glusterd service should not stop.

Additional info:
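For anyone reproducing this, a quick sketch of how to confirm fd exhaustion on an affected node (these are generic /proc diagnostics, not part of the original report; the commands fall back to the current shell's pid so they run anywhere, and the log path is the standard one from the description):

```shell
# Compare the daemon's open-fd count against its "Max open files" limit.
# Substitute the real glusterd pid on an affected node; fall back to $$ otherwise.
pid=$(pidof glusterd || echo $$)
echo "open fds: $(ls /proc/$pid/fd | wc -l)"
grep 'Max open files' /proc/$pid/limits
# Tally the flooded accept failures in the log (if present)
grep -c 'Too many open files' /var/log/glusterfs/glusterd.log 2>/dev/null || true
```

If the open-fd count is at or near the limit, every new accept() will fail with EMFILE, which is exactly the "Too many open files" warning flooding the log.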

Comment 3 RamaKasturi 2016-10-05 07:07:44 UTC
I have also copied the glusterd.log file, which has grown to 14G, to the link mentioned in comment 2.

Comment 4 RamaKasturi 2016-10-06 05:55:17 UTC
Proposing this bug as a blocker, as glusterd.log gets flooded with these messages and all glusterd commands hang on my system.

Comment 5 SATHEESARAN 2016-10-07 11:26:39 UTC
Additional information:

There are lots of errors saying "error in polling loop", which should not happen once the setup is up and running with SSL/TLS configured properly.

There are about 1580 such error messages in 1 day, and the count keeps ticking.

# grep -i 'error in polling loop' /var/log/glusterfs/glusterd.log  | wc -l
1580

This makes the glusterd log grow considerably.
# ls -lsah /var/log/glusterfs/glusterd.log 
8.1M -rw-------. 1 root root 8.1M Oct  7 16:51 /var/log/glusterfs/glusterd.log

Comment 6 Atin Mukherjee 2016-10-07 11:57:17 UTC
(In reply to RamaKasturi from comment #4)
> proposing this bug as blocker as glusterd.log gets flooded with these
> messages and all glusterd commands are hung on my system.

Please do not use the blocker flag until dev freeze. The TestBlocker keyword should suffice.

Comment 7 Vivek Das 2016-10-20 06:02:43 UTC
Hi, I am facing a similar kind of issue. The glusterd logs are flooded with messages, filling the root disk and blocking ongoing work.
I did only a very limited amount of activity (say twice): removing and recreating the secure-access file on the management node, mounting and unmounting on the client, and restarting glusterd. Yet within a span of 2 days a huge amount of data (5GB) was written to the glusterd log file.

# df -h /
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/rhgs_dhcp47--12-root  7.6G  7.6G   20K 100% /
# du -sh /var/log/glusterfs
4.6G /var/log/glusterfs
# du -sh /var/log/glusterfs/glusterd.log
4.1G /var/log/glusterfs/glusterd.log

I am raising this at the highest priority, as within a week's time it will grow to 100GB of data and block the ongoing work.

Comment 9 RamaKasturi 2016-10-20 06:22:44 UTC
Atin, apologies for the delay. I will update this by EOD. I have been continuously hitting other RHV-H problems, due to which I could not test this.

Comment 10 Kaushal 2016-10-20 08:41:07 UTC
This is happening because the fix for #1336339 that was merged with http://review.gluster.org/14891 is only partial.

With that change, pipes are only closed when spawning the own-thread (socket_spawn) fails. They are still left open when a working connection disconnects.

So the fix is only partial. It still requires http://review.gluster.org/14356, which closes the pipes after an SSL connection disconnects.

I'll refresh my change upstream and get it merged. It can then be backported.

Comment 11 Mohit Agrawal 2016-10-20 09:37:46 UTC
Hi Kasturi,

  Here is an updated test RPM that contains both fixes, built after the patch shared by Kaushal was merged: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11924412

You can test it and share the result.

Regards
Mohit Agrawal

Comment 13 Mohit Agrawal 2016-10-21 02:46:00 UTC
Hi Kasturi,

  There was a problem with the last build, so I have made a minor update.
  I have done some basic SSL-specific testing on the build below, and it is working fine.
 
     https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11928955

 Please install this and share the result.

Regards
Mohit Agrawal

Comment 14 RamaKasturi 2016-10-21 06:42:55 UTC
Hi Mohith,
  
   I installed the latest build you provided in comment 13 and I am able to mount the volume. I am monitoring the system and will update the status in the bug.

Thanks
kasturi

Comment 15 Mohit Agrawal 2016-10-21 06:43:55 UTC
Hi Kasturi,

   Thank you very much for sharing the result.
   

Regards
Mohit Agrawal

Comment 16 RamaKasturi 2016-10-21 12:41:36 UTC
I have not seen the logs flooded with the "Too many open files" message. I will leave the setup overnight, monitor it, and update the bug.

Comment 19 Atin Mukherjee 2016-10-24 04:53:47 UTC
upstream mainline : http://review.gluster.org/14356
upstream 3.8 : http://review.gluster.org/#/c/15703 (not merged)
upstream 3.9 : http://review.gluster.org/#/c/15702 (not merged)
downstream patch :https://code.engineering.redhat.com/gerrit/87981

Comment 21 RamaKasturi 2016-10-24 05:34:01 UTC
Tested with the private build provided. I do not see glusterd.log getting flooded with "Too many open files" errors.

Comment 23 RamaKasturi 2016-11-07 05:23:49 UTC
Verified and works fine with build glusterfs-3.8.4-3.el7rhgs.x86_64.

I do not see the log files getting flooded with the "Too many open files" issue.

Comment 26 Mohit Agrawal 2017-02-01 03:46:30 UTC
Hi Abhishek/Atin,

  This bugzilla was raised to resolve the "Too many open files" error caused by an fd leak in the socket_poller code.
  We can provide a patch that resolves only this issue (the fd leak), but in later versions we made multiple further changes to the socket code; with this patch alone you will not hit the fd leak, but you can face other issues.

  
Regards
Mohit Agrawal

Comment 39 errata-xmlrpc 2017-03-23 06:08:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

