1026780 – Dist-geo-rep : geo-rep worker crashed while init with [Errno 34] Numerical result out of range.

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	geo-replication
Sub Component:
Version:	2.1
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Bug Updates Notification Mailing List
QA Contact:	storage-qa-internal@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1285200 1294588 1313311

Reported:	2013-11-05 12:35 UTC by Vijaykumar Koppad
Modified:	2016-03-01 11:27 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1285200
Environment:
Last Closed:
Embargoed:
Dependent Products:

Description Vijaykumar Koppad 2013-11-05 12:35:14 UTC

Description of problem: A geo-rep worker crashed during init with [Errno 34] Numerical result out of range. This happened on one of the nodes after new nodes were added and geo-rep was restarted.

Python backtrace
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-05 17:52:06.95596] I [master(/bricks/brick3):917:crawl] _GMaster: finished hybrid crawl syncing
[2013-11-05 17:52:06.253268] E [syncdutils(/bricks/brick3):207:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 535, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1134, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 397, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 137, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 256, in get_sys_volinfo
    fgn_vis, nat_vi = self.master.server.aggregated.foreign_volume_infos(), \
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 862, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 87, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 57, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2013-11-05 17:52:06.256297] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
[2013-11-05 17:52:06.866612] I [monitor(monitor):81:set_state] Monitor: new state: faulty

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable): glusterfs-3.4.0.39rhs-1.el6rhs.x86_64


How reproducible: didn't try to reproduce.


Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Add new nodes to the master volume.
3. Stop geo-rep.
4. Run gsec_create, then create push-pem force.
5. Start geo-rep again.

Actual results: The worker crashed with "Numerical result out of range"


Expected results: The worker shouldn't crash.


Additional info:
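
For readers decoding the traceback: errno 34 is ERANGE on Linux. For the listxattr family of calls, ERANGE means the caller's buffer was too small to hold the attribute name list. The snippet below is only a small illustration of mapping the numeric code back to its symbolic name and message; the helper name describe_errno is hypothetical and not part of gsyncd.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value.

    Hypothetical helper for illustration only; it is not part of gsyncd.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # The message text comes from the platform's C library.
    print(describe_errno(errno.ERANGE))
```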

Comment 2 Amar Tumballi 2013-11-11 12:23:56 UTC

diff --git a/geo-replication/syncdaemon/libcxattr.py b/geo-replication/syncdaemon/libcxattr.py
index b5b6956..75c89ef 100644
--- a/geo-replication/syncdaemon/libcxattr.py
+++ b/geo-replication/syncdaemon/libcxattr.py
@@ -54,9 +54,13 @@ class Xattr(object):
 
     @classmethod
     def llistxattr(cls, path, siz=0):
-        ret = cls._query_xattr(path, siz, 'llistxattr')
-        if isinstance(ret, str):
-            ret = ret.split('\0')
+
+        try:
+            ret = cls._query_xattr(path, siz, 'llistxattr')
+            if isinstance(ret, str):
+                ret = ret.split('\0')
+        except:
+            ret = -1
         return ret
 
     @classmethod

Comment 3 Venky Shankar 2013-11-11 16:45:51 UTC

Amar,

I think try ... catch won't help here as the call is via ctypes. A probable fix would be to handle ERANGE for _all_ getxattr calls. What do you think?
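
A minimal sketch of what "handling ERANGE" could look like, under stated assumptions. The traceback suggests llistxattr_buf first asks for the required buffer size and then reads with that size; if the attribute list grows between the two calls, the second call can fail with ERANGE. One common approach is to retry the size probe and the read. The function list_xattr_with_retry, the query callable, and the max_retries parameter are all hypothetical names for illustration; this is not the fix that was applied to libcxattr.py.

```python
import errno
import os


def list_xattr_with_retry(query, path, max_retries=5):
    """List xattr names, retrying when the buffer turns out too small.

    `query(path, size)` is an assumed stand-in for the low-level call:
    with size == 0 it returns the required buffer size as an int,
    otherwise it returns the raw NUL-separated name list as bytes or
    raises OSError.
    """
    for _ in range(max_retries):
        size = query(path, 0)          # probe for the required size
        if size == 0:
            return []                  # no extended attributes
        try:
            raw = query(path, size)    # read with the probed size
        except OSError as e:
            if e.errno == errno.ERANGE:
                continue               # list grew in between; probe again
            raise                      # any other error is not retried
        return [name for name in raw.split(b"\0") if name]
    # Still racing after max_retries attempts: report ERANGE to the caller.
    raise OSError(errno.ERANGE, os.strerror(errno.ERANGE))
```

Only ERANGE is retried here; other errors propagate unchanged, which avoids masking unrelated failures the way a bare except would.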

Comment 4 Vijaykumar Koppad 2014-02-04 05:43:06 UTC

This has happened again in the build glusterfs-3.4.0.58rhs-1. 


backtrace

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-02-04 11:05:25.961899] I [master(/bricks/master_brick9):438:crawlwrap] _GMaster: crawl interval: 60 seconds
[2014-02-04 11:05:25.967730] I [master(/bricks/master_brick9):918:update_worker_status] _GMaster: Creating new /var/lib/glusterd/geo-replication/master_10.70.43.76_slave/_bricks_master_brick9.status
[2014-02-04 11:05:25.975267] I [master(/bricks/master_brick9):1129:crawl] _GMaster: starting hybrid crawl...
[2014-02-04 11:05:25.991782] E [syncdutils(/bricks/master_brick1):240:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1156, in service_loop
    g1.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 422, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 138, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 278, in get_sys_volinfo
    fgn_vis, nat_vi = self.master.server.aggregated.foreign_volume_infos(), \
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 885, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 87, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 57, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2014-02-04 11:05:25.995131] I [syncdutils(/bricks/master_brick1):192:finalize] <top>: exiting.
[2014-02-04 11:05:26.662705] I [master(/bricks/master_brick5):58:gmaster_builder] <top>: setting up xsync change detection mode

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 5 Vijaykumar Koppad 2014-05-16 08:50:58 UTC

This has happened again in the build, glusterfs-3.6.0.2-1.el6rhs. 

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2014-05-16 12:17:02.880822] I [master(/bricks/master_brick5):1251:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.114%3Agluster%3A%2F%2F127.0.0.1%3Aslave/994c39ebca2dac30ef18cf407ed3322f/xsync/XSYNC-CHANGELOG.1400222822
[2014-05-16 12:17:02.891830] I [master(/bricks/master_brick5):1248:crawl] _GMaster: finished hybrid crawl syncing
[2014-05-16 12:17:02.896193] E [syncdutils(/bricks/master_brick9):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 633, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1298, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 447, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 155, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 303, in get_sys_volinfo
    self.master.server.aggregated.foreign_volume_infos(),
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 958, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 99, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 69, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2014-05-16 12:17:02.902026] I [syncdutils(/bricks/master_brick9):214:finalize] <top>: exiting.
[2014-05-16 12:17:02.905908] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Comment 8 Rahul Hinduja 2015-06-22 08:59:37 UTC

I see this happening with build: glusterfs-geo-replication-3.7.1-4.el6rhs.x86_64 

This happened even without adding any new node, when the volume type was "disperse" for both master and slave.

[root@georep2 ~]# grep "OSError:" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.154%3Agluster%3A%2F%2F127.0.0.1%3Aslave.log
OSError: [Errno 34] Numerical result out of range
OSError: [Errno 34] Numerical result out of range
[root@georep2 ~]# 

[2015-06-22 19:39:07.959126] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-06-22 19:39:07.959476] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-06-22 19:39:08.93621] I [gsyncd(/rhs/brick1/b1):649:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.46.103:gluster://localhost:slave
[2015-06-22 19:39:08.95005] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-06-22 19:39:11.153399] I [master(/rhs/brick1/b1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-06-22 19:39:11.153790] I [master(/rhs/brick1/b1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-06-22 19:39:11.155164] I [master(/rhs/brick1/b1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-06-22 19:39:11.155376] I [master(/rhs/brick1/b1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-06-22 19:39:11.156248] I [master(/rhs/brick1/b1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-06-22 19:39:11.156491] I [master(/rhs/brick1/b1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-06-22 19:39:13.201039] I [master(/rhs/brick1/b1):1208:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.46.154%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b89ac45352ab8c894d210d136dd56/xsync
[2015-06-22 19:39:13.201385] I [resource(/rhs/brick1/b1):1432:service_loop] GLUSTER: Register time: 1434982153
[2015-06-22 19:39:15.791850] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 514, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 172, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 332, in get_sys_volinfo
    self.master.server.aggregated.foreign_volume_infos(),
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1084, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 94, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 69, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2015-06-22 19:39:15.793708] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.
[2015-06-22 19:39:15.795592] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-06-22 19:39:15.795979] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-06-22 19:39:16.160041] I [monitor(monitor):282:monitor] Monitor: worker(/rhs/brick1/b1) died in startup phase
[2015-06-22 19:39:26.346771] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
:


This happened on one of the nodes in the master cluster, just after starting the geo-rep session, and the status went to Faulty. After multiple tries, the worker comes back and the status correctly becomes Passive.

Will be attaching the new logs.

Comment 10 Rahul Hinduja 2015-09-09 06:50:44 UTC

Hit this bug on the normal distributed-volume as well with build glusterfs-3.7.1-14.el7rhgs.x86_64

[2015-09-08 17:50:29.959184] I [master(/bricks/brick2/master_brick8):1249:crawl] _GMaster: finished hybrid crawl syncing, stime: (1441734629, 0)
[2015-09-08 17:50:29.960731] E [syncdutils(/bricks/brick0/master_brick0):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1445, in service_loop
    g1.crawlwrap(oneshot=True, register_time=register_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 525, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 172, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 332, in get_sys_volinfo
    self.master.server.aggregated.foreign_volume_infos(),
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1084, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 94, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 69, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range


[root@georep1 syncdaemon]# gluster volume info master
 
Volume Name: master
Type: Distributed-Replicate
Volume ID: 114cc338-b4ae-469a-8db7-105b5f671f9c
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp

Comment 11 Aravinda VK 2015-11-25 08:50:49 UTC

Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
