Bug 1030393 - dist-geo-rep: Worker crashing with "ValueError: filedescriptor out of range in select()" backtrace in logs
Summary: dist-geo-rep: Worker crashing with "ValueError: filedescriptor out of range i...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Aravinda VK
QA Contact: Bhaskar Bandari
URL:
Whiteboard:
Duplicates: 1050754 (view as bug list)
Depends On:
Blocks: 1126355
TreeView+ depends on / blocked
 
Reported: 2013-11-14 11:13 UTC by M S Vishwanath Bhat
Modified: 2018-12-09 17:17 UTC (History)
14 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A file descriptor leak in the tar+ssh sync engine. Consequence: When tar+ssh is used as the sync engine, the fd leak drives the open descriptor count past the maximum allowed limit and crashes the gsync daemon. Fix: Plug the file descriptor leak. Result: No geo-rep worker crash.
Clone Of:
Clones: 1126355 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:29:40 UTC
Embargoed:


Attachments (Terms of Use)
geo-rep master log from the node where crash was found. (503.96 KB, text/x-log)
2013-11-14 11:13 UTC, M S Vishwanath Bhat
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description M S Vishwanath Bhat 2013-11-14 11:13:17 UTC
Created attachment 823875 [details]
geo-rep master log from the node where crash was found.

Description of problem:
While the geo-rep session was running and some nodes were rebooted, the worker crashed three times with the "ValueError: filedescriptor out of range in select()" backtrace in the logs. The good news is that the monitor restarts the worker, but this happened three times at irregular intervals.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44rhs-1.el6rhs.x86_64

How reproducible:
Not consistently reproducible. Hit three times, all on a single setup.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2x2 dist-rep master and a 2x2 dist-rep slave.
2. Start creating small files and tar files using smallfiles.py and crefi.py respectively.
3. Reboot a passive node.
4. Bring down an active node for 2 hours and then bring it up again.
5. After about 5-6 hours, reboot the same active node that was down for 2 hours.

Actual results:
[2013-11-14 12:56:54.371954] E [syncdutils(/rhs/bricks/brick0):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1157, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 461, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1027, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 815, in process
    if self.process_change(change, done, retry):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 804, in process_change
    if self.syncdata(datas):
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 223, in syncdata
    self.syncer.sync()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1249, in sync
    self.syncjob()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1241, in syncjob
    po = self.slave.tarssh(pb)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1266, in tarssh
    return sup(self, files, self.slaveurl)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 35, in sup
    return getattr(super(type(x), x), sys._getframe(1).f_code.co_name)(*a, **kw)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 785, in tarssh
    p1.terminate_geterr(fail_on_err = False)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 240, in terminate_geterr
    if not select([self.stderr],[],[],0.1)[0]:
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 293, in select
    return eintr_wrap(oselect.select, oselect.error, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 286, in eintr_wrap
    return func(*a)
ValueError: filedescriptor out of range in select()
[2013-11-14 12:56:54.910108] I [syncdutils(/rhs/bricks/brick0):159:finalize] <top>: exiting.
[2013-11-14 12:56:54.931972] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-11-14 12:57:07.114229] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------


The status goes to faulty with the above crash, but the monitor restarts the worker.

Expected results:
There should be no worker crashes.

Additional info:
Will attach the complete log file.

NOTE: The crash was found in a node which was never rebooted.

Comment 2 Vijaykumar Koppad 2013-12-02 09:00:26 UTC
I observed this crash as well.

The backtrace was:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-12-02 14:19:13.673443] E [syncdutils(/bricks/brick1):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1363, in syncjob
    po = self.sync_engine(pb)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1266, in tarssh
    return sup(self, files, self.slaveurl)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 35, in sup
    return getattr(super(type(x), x), sys._getframe(1).f_code.co_name)(*a, **kw)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 785, in tarssh
    p1.terminate_geterr(fail_on_err = False)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 240, in terminate_geterr
    if not select([self.stderr],[],[],0.1)[0]:
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 293, in select
    return eintr_wrap(oselect.select, oselect.error, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 286, in eintr_wrap
    return func(*a)
ValueError: filedescriptor out of range in select()
[2013-12-02 14:19:13.721365] I [syncdutils(/bricks/brick1):159:finalize] <top>: exiting.
[2013-12-02 14:19:13.734653] I [monitor(monitor):81:set_state] Monitor: new state: faulty

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 6 Venky Shankar 2014-03-13 09:30:42 UTC
This backtrace is the result of an fd leak when tar+ssh is used as the sync engine. A missing cleanup of the tar process caused the file descriptor numbers assigned to subsequent "tar" processes to grow without bound. This would eventually max out the number of open fds per process and crash, but Python's select() has its own limitation -- it can only handle fd numbers up to 1023 -- so the limit is hit even earlier, as shown below:

In [2]: select.select([1024], [], [])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-9f305b4e42e8> in <module>()
----> 1 select.select([1024], [], [])

ValueError: filedescriptor out of range in select()


Therefore, as time progressed and there were changes in the file system, geo-replication kept invoking the sync engine ("tar+ssh" in this case) without ever properly cleaning up.

The fix is to clean up properly (wait on the process and close its open fds) so that file descriptor numbers are recycled.
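To illustrate the cleanup pattern (this is only a sketch of the idea, not the actual gsyncd patch; the helper name and command are made up):

```python
import subprocess

def run_and_reap(cmd):
    """Spawn a child with a piped stderr, then reap it and close the pipe.

    Without the wait() and close() below, every invocation leaks a pipe
    fd, so a long-running daemon eventually hands select() a descriptor
    number >= 1024 and hits the ValueError shown above.
    """
    p = subprocess.Popen(cmd, stderr=subprocess.PIPE)
    err = p.stderr.read()   # drain stderr so the child can exit
    p.wait()                # reap the child (avoids a zombie process)
    p.stderr.close()        # release the pipe fd so the number is recycled
    return p.returncode, err
```

Calling this in a loop keeps the process's open-fd count flat, whereas omitting the last two lines makes it grow by one pipe per call.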

Comment 7 Venky Shankar 2014-03-13 11:26:25 UTC
review link: https://code.engineering.redhat.com/gerrit/#/c/18580/

Comment 8 Dustin Black 2014-05-16 20:32:17 UTC
It looks like this was fixed in a build already. Can we confirm whether this will be included in 3.0?

Comment 9 Nagaprasad Sathyanarayana 2014-05-19 10:56:20 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 11 Vijaykumar Koppad 2014-06-10 11:07:44 UTC
Verified on the build glusterfs-3.6.0.15-1.el6rhs.

Verification steps:
1. Create and start a geo-rep session between master and slave.
2. Set the config option use-tarssh to true.
3. Start creating data on the master.
4. Track the fds opened by the geo-rep worker process in /proc/<worker_PID>/fd/.
5. The number of fds no longer keeps growing. With the bug, it keeps growing; for example, entries like these accumulate:

lrwx------ 1 root root 64 Jun 10 16:17 23 -> pipe:[17609585]
lr-x------ 1 root root 64 Jun 10 16:19 28 -> pipe:[17609582]
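The fd tracking in step 4 can be done with a small helper like the following (a sketch; /proc is Linux-specific, and the worker PID is whatever ps reports for the gsyncd worker process):

```python
import os

def open_fd_count(pid):
    """Return the number of open file descriptors for a process.

    Linux-only: counts the entries under /proc/<pid>/fd. On a fixed
    build this number stays roughly constant while data is syncing; on
    a leaky build it grows with every tar+ssh invocation.
    """
    return len(os.listdir("/proc/%d/fd" % pid))
```

Sampling this every few minutes while data is being created on the master makes the leak (or its absence) obvious.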

Comment 12 Vijaykumar Koppad 2014-06-12 05:44:42 UTC
*** Bug 1050754 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2014-09-22 19:29:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

