Bug 1381822

Summary: glusterd.log is flooded with socket.management: accept on 11 failed (Too many open files) and glusterd service stops
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: RamaKasturi <knarra>
Component: glusterfsAssignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA QA Contact: RamaKasturi <knarra>
Severity: high Docs Contact:
Priority: urgent    
Version: rhgs-3.2CC: abhishku, amukherj, asoman, bsrirama, kaushal, knarra, moagrawa, rabhat, rhinduja, rhs-bugs, sasundar, storage-qa-internal, vbellur, vdas
Target Milestone: ---Keywords: TestBlocker
Target Release: RHGS 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-3 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-23 06:08:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1336371, 1387960    
Bug Blocks: 1351528    

Description RamaKasturi 2016-10-05 06:43:53 UTC
Description of problem:
I see that glusterd.log is flooded with messages below and log file grows huge with in minutes and glusterd just stops.

[2016-10-05 06:20:46.491886] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 11 failed (Too many open files)
[2016-10-05 06:20:46.491895] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-10-05 06:20:46.491904] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 11 failed (Too many open files)
[2016-10-05 06:20:46.491913] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 10 failed (Too many open files)
[2016-10-05 06:20:46.491922] W [socket.c:2645:socket_server_event_handler] 0-socket.management: accept on 11 failed (Too many open files)


Version-Release number of selected component (if applicable):
glusterfs-3.8.4-2.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Installed HC stack on RHV-H and arbiter
2. Enabled SSL on the volumes
3. Just left the setup overnight.

Actual results:
All of a sudden i see that glusterd.log gets flooded with  messages reported in  description and the log file grows huge i.e upto 14G and glusterd service stops.

Expected results:
glusterd.log should not be flooded with these messages and glusterd service should not stop.

Additional info:

Comment 3 RamaKasturi 2016-10-05 07:07:44 UTC
I have also copied the glusterd.log file which has grown up to 14G to the link mentioned in comment 2

Comment 4 RamaKasturi 2016-10-06 05:55:17 UTC
proposing this bug as blocker as glusterd.log gets flooded with these messages and all glusterd commands are hung on my system.

Comment 5 SATHEESARAN 2016-10-07 11:26:39 UTC
Additional information:

There are lots of error saying that - "error in polling loop", which should not happen once the setup is up and running with SSL/TLS configured properly.

There are about 1580 such error messages in 1 day and it keeps ticking.

# grep -i 'error in polling loop' /var/log/glusterfs/glusterd.log  | wc -l
1580

This considerably makes glusterd log grow.
# ls -lsah /var/log/glusterfs/glusterd.log 
8.1M -rw-------. 1 root root 8.1M Oct  7 16:51 /var/log/glusterfs/glusterd.log

Comment 6 Atin Mukherjee 2016-10-07 11:57:17 UTC
(In reply to RamaKasturi from comment #4)
> proposing this bug as blocker as glusterd.log gets flooded with these
> messages and all glusterd commands are hung on my system.

Please do not use blocker flag till dev freeze. TestBlocker keyword should suffice.

Comment 7 Vivek Das 2016-10-20 06:02:43 UTC
Hi, I am facing similar kind of issue. With glusterd logs are flooded with messages, resulting in root disk space getting full and blocks the ongoing work.
I did very limited (say twice) removing and creating the secure-access file from the management and trying the mount & umount in the client along with glusterd restart. But with in a span of 2 days huge number of data is created (5GB) in the glusterd log file.

# df -h /
#Filesystem                        Size  Used Avail Use% Mounted on
#/dev/mapper/rhgs_dhcp47--12-root  7.6G  7.6G   20K 100% /
# du -sh /var/log/glusterfs
#4.6G /var/log/glusterfs
# du -sh /var/log/glusterfs/glusterd.log
#4.1G /var/log/glusterfs/glusterd.log

I am raising this as a highest of priority as within a week's time it will overload to 100GB of data resulting to a block for the on going work.

Comment 9 RamaKasturi 2016-10-20 06:22:44 UTC
Atin, apologies for the delay. I will update this by EOD. I have been continuously hitting other RHV-H problems due to which i could not test this.

Comment 10 Kaushal 2016-10-20 08:41:07 UTC
This is happening because the fix for #1336339 that was merged with http://review.gluster.org/14891 is only partial.

With the change pipes are being only closed when spawning own-thread (socket_spawn) fails. They are still being left open when a working connection disconnects.

So the fix is only partial. It still requires http://review.gluster.org/14356 which closes pipes after an SSL connection disconnects.

I'll refresh my change upstream and get it merged. It can then be backported.

Comment 11 Mohit Agrawal 2016-10-20 09:37:46 UTC
Hi Kasturi,

  This is updated test rpm( this is updated link https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11924412
<magrawal> it have both fixes) after merged patch is shared by kaushal.

You can test it and share the result.

Regards
Mohit Agrawal

Comment 13 Mohit Agrawal 2016-10-21 02:46:00 UTC
Hi Kasturi,

  There was some problem in last build,I have done some minor update.
  I have tried some basic testing on below build specific to SSL ,it is working fine.
 
     https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11928955

 Please do install this and share the result.

Regards
Mohit Agrawal

Comment 14 RamaKasturi 2016-10-21 06:42:55 UTC
Hi Mohith,
  
   I installed the latest build which you provided in comment 13 and i am able to mount the volume. Monitoring the system, will update the status in the bug.

Thanks
kasturi

Comment 15 Mohit Agrawal 2016-10-21 06:43:55 UTC
Hi Kasturi,

   Thank you very much for sharing the result.
   

Regards
Mohit Agrawal

Comment 16 RamaKasturi 2016-10-21 12:41:36 UTC
Has not seen the logs flooded with Too many open files message. I will leave the setup over night monitor and update the bug.

Comment 19 Atin Mukherjee 2016-10-24 04:53:47 UTC
upstream mainline : http://review.gluster.org/14356
upstream 3.8 : http://review.gluster.org/#/c/15703 (not merged)
upstream 3.9 : http://review.gluster.org/#/c/15702 (not merged)
downstream patch :https://code.engineering.redhat.com/gerrit/87981

Comment 21 RamaKasturi 2016-10-24 05:34:01 UTC
Tested with the private build given. Do not see that glusterd.log files is getting flooded with too many open file errors.

Comment 23 RamaKasturi 2016-11-07 05:23:49 UTC
Verified and works fine with build glusterfs-3.8.4-3.el7rhgs.x86_64.

I do not see the log files getting flooded with "Too many files open issue".

Comment 26 Mohit Agrawal 2017-02-01 03:46:30 UTC
Hi Abhishek/Atin,

  The bugzilla was raised to resolve the error "Too many open files" due to fd leak issue in the socket_poller code.
  We can provide the patch only to resolve this issue(fd leak) but in later version(after this version) we had done multiple changes in socket code,after this patch you will not get fd leak issue but you can face other issue.

  
Regards
Mohit Agrawal

Comment 39 errata-xmlrpc 2017-03-23 06:08:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html