886808 – geo-rep: select() returning EBADF not handled correctly

Bug 886808 - geo-rep: select() returning EBADF not handled correctly

Summary: geo-rep: select() returning EBADF not handled correctly

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	geo-replication
Sub Component:
Version:	pre-release
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Assignee:	Niels de Vos
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	880308 902206
TreeView+	depends on / blocked

Reported:	2012-12-13 08:53 UTC by Niels de Vos
Modified:	2013-07-24 17:31 UTC (History)
CC List:	3 users (show)
Fixed In Version:	glusterfs-3.4.0
Clone Of:	880308
Clones:	902206 (view as bug list)
Environment:
Last Closed:	2013-07-24 17:31:46 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Niels de Vos 2012-12-13 08:53:48 UTC

+++ This bug was initially created as a clone of Bug #880308 +++

Created attachment 652074 [details]
geo-replication log containing the error

Description of problem:

The georep_debug.log contains the following error:
E [syncdutils:190:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 216, in twrap
tf(*aa)
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 123, in tailer
poe, _ ,_ = select([po.stderr for po in errstore], [], [], 1)
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 276, in select
return eintr_wrap(oselect.select, oselect.error, *a)
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 269, in eintr_wrap
return func(*a)
error: (9, 'Bad file descriptor')

Version-Release number of selected component (if applicable):
glusterfs-3.3.0.5rhs-37.el6rhs.x86_64

How reproducible:
Very often for this customer

Steps to Reproduce:
Unclear :-/
  
Actual results:
Georeplication does not finish completely, and is restarted continuously.


Expected results:
Georeplication should finish completely.

Additional info:
syncdutils.py contains a wrapper function eintr_wrap(). This function is used to retry a function when errno is set to EINTR.
resource.py callss syncdutils.py:select(), which is a select.select() wrapped in eintr_wrap().

When select() returns a select.error (this is a class) instance, the eintr_wrap() function checks if errno is set to EINTR. If select() encounters a "bad filedescriptor" (the file has likely been closed already), errno is set to EBADF and eintr_wrap() raises the exception to the upper handler in resource.py:tailer().

The handler in resource.py:tailer() should catch the select.error exception and ignore it, continuing with the loop.

Example of the incorrect error handling:

--- %< ---
from select import error as selecterror
from select import select
import sys, os

fo = open("/tmp/select.out", "w")
fd = fo.fileno()
fo.close()

try:
        ret = select([fd], [], [], 3)
except ValueError, selecterror:
        print "anonymous error caught"
except ValueError as e:
        print "ValueError caugth: %s" % e 
except selecterror as e:
        print "selecterror caugth: %s" % e 
else:
        print "no error caught"
--- >% ---

Output:
selecterror caugth: (9, 'Bad file descriptor')

The code in resource.py uses "except ValueError, selecterror" which does not seem to catch the select.error as expected. This may well work in newer versions of Python, but definitely does not on RHEL6.

A change like the following should prevent the error in the logs:
diff --git a/xlators/features/marker/utils/syncdaemon/resource.py b/xlators/features/marker/utils/syncdaemon/resource.py
index 7eced82..e1384a0 100644
--- a/xlators/features/marker/utils/syncdaemon/resource.py
+++ b/xlators/features/marker/utils/syncdaemon/resource.py
@@ -121,7 +121,9 @@ class Popen(subprocess.Popen):
                 errstore = cls.errstore.copy()
                 try:
                     poe, _ ,_ = select([po.stderr for po in errstore], [], [], 1)
-                except ValueError, selecterror:
+                except ValueError:
+                    continue
+                except selecterror:
                     continue
                 for po in errstore:
                     if po.stderr not in poe:


When the exception is not caught, the final handler in syncdutils.py:log_raise_exception() calls sys.exit() and causes the thread to exit. This is likely the cause for the incomplete geo-sync to stop processing further files, until the next run (which may hit a premature error too).

Note that we have not found the root cause yet (why does select() return EBADF)...

--- Additional comment from Vijay Bellur on 2012-11-29 01:24:02 CET ---

CHANGE: http://review.gluster.org/4233 (geo-replication: catch select.error on select()) merged in master by Anand Avati (avati)

Note You need to log in before you can comment on or make changes to this bug.