Description of problem:
The geo-rep worker crashed during initialization with "[Errno 34] Numerical result out of range". This happened on one of the nodes when new nodes were added and geo-rep was restarted.

Python backtrace:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-05 17:52:06.95596] I [master(/bricks/brick3):917:crawl] _GMaster: finished hybrid crawl syncing
[2013-11-05 17:52:06.253268] E [syncdutils(/bricks/brick3):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 535, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1134, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 397, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 137, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 256, in get_sys_volinfo
    fgn_vis, nat_vi = self.master.server.aggregated.foreign_volume_infos(), \
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 862, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 87, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 57, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2013-11-05 17:52:06.256297] I [syncdutils(/bricks/brick3):159:finalize] <top>: exiting.
[2013-11-05 17:52:06.866612] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.39rhs-1.el6rhs.x86_64

How reproducible:
Didn't try to reproduce.

Steps to Reproduce:
1. Create and start a geo-rep relationship between master and slave.
2. Add new nodes to the master volume.
3. Stop geo-rep.
4. Run gsec_create, then create push-pem force.
5. Start geo-rep again.

Actual results:
The worker crashed with "Numerical result out of range".

Expected results:
The worker shouldn't crash.

Additional info:
diff --git a/geo-replication/syncdaemon/libcxattr.py b/geo-replication/syncdaemon/libcxattr.py
index b5b6956..75c89ef 100644
--- a/geo-replication/syncdaemon/libcxattr.py
+++ b/geo-replication/syncdaemon/libcxattr.py
@@ -54,9 +54,13 @@ class Xattr(object):
 
     @classmethod
     def llistxattr(cls, path, siz=0):
-        ret = cls._query_xattr(path, siz, 'llistxattr')
-        if isinstance(ret, str):
-            ret = ret.split('\0')
+
+        try:
+            ret = cls._query_xattr(path, siz, 'llistxattr')
+            if isinstance(ret, str):
+                ret = ret.split('\0')
+        except:
+            ret = -1
         return ret
 
     @classmethod
Amar, I think try ... catch won't help here as the call is via ctypes. A probable fix would be to handle ERANGE for _all_ getxattr calls. What do you think?
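For reference, ERANGE from llistxattr(2) means the supplied buffer was too small to hold the attribute list. The backtrace suggests llistxattr_buf fetches the list with a previously obtained size, so one plausible cause is the list growing between the size query and the fetch. Below is a minimal sketch of handling ERANGE by retrying, under that assumption. The names query_with_retry and query are hypothetical and are not part of libcxattr.py; query(0) is assumed to return the required size, and query(n) to return the data or raise OSError.

```python
import errno
import os


def query_with_retry(query, max_attempts=5):
    """Retry a size-then-fetch xattr query while it fails with ERANGE.

    `query` is a hypothetical callable standing in for the ctypes wrapper:
    query(0) returns the buffer size needed right now, and query(size)
    returns the data or raises OSError.
    """
    for _ in range(max_attempts):
        # Ask again for the size on every attempt, since it may have grown.
        size = query(0)
        try:
            return query(size)
        except OSError as e:
            # Only ERANGE is worth retrying; anything else is a real error.
            if e.errno != errno.ERANGE:
                raise
    # Give up after max_attempts and report the original condition.
    raise OSError(errno.ERANGE, os.strerror(errno.ERANGE))
```

The same wrapper could in principle be applied to the getxattr-style calls as well, since they follow the same size-then-fetch pattern. Whether a bounded retry is acceptable here is a design decision for the maintainers.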
This has happened again in the build glusterfs-3.4.0.58rhs-1.

Backtrace:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-02-04 11:05:25.961899] I [master(/bricks/master_brick9):438:crawlwrap] _GMaster: crawl interval: 60 seconds
[2014-02-04 11:05:25.967730] I [master(/bricks/master_brick9):918:update_worker_status] _GMaster: Creating new /var/lib/glusterd/geo-replication/master_10.70.43.76_slave/_bricks_master_brick9.status
[2014-02-04 11:05:25.975267] I [master(/bricks/master_brick9):1129:crawl] _GMaster: starting hybrid crawl...
[2014-02-04 11:05:25.991782] E [syncdutils(/bricks/master_brick1):240:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 150, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 540, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1156, in service_loop
    g1.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 422, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 138, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 278, in get_sys_volinfo
    fgn_vis, nat_vi = self.master.server.aggregated.foreign_volume_infos(), \
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 885, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 87, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 57, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 35, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2014-02-04 11:05:25.995131] I [syncdutils(/bricks/master_brick1):192:finalize] <top>: exiting.
[2014-02-04 11:05:26.662705] I [master(/bricks/master_brick5):58:gmaster_builder] <top>: setting up xsync change detection mode
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
This has happened again in the build, glusterfs-3.6.0.2-1.el6rhs.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-05-16 12:17:02.880822] I [master(/bricks/master_brick5):1251:crawl] _GMaster: processing xsync changelog /var/run/gluster/master/ssh%3A%2F%2Froot%4010.70.43.114%3Agluster%3A%2F%2F127.0.0.1%3Aslave/994c39ebca2dac30ef18cf407ed3322f/xsync/XSYNC-CHANGELOG.1400222822
[2014-05-16 12:17:02.891830] I [master(/bricks/master_brick5):1248:crawl] _GMaster: finished hybrid crawl syncing
[2014-05-16 12:17:02.896193] E [syncdutils(/bricks/master_brick9):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 633, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1298, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 447, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 155, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 303, in get_sys_volinfo
    self.master.server.aggregated.foreign_volume_infos(),
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 958, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 99, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 69, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2014-05-16 12:17:02.902026] I [syncdutils(/bricks/master_brick9):214:finalize] <top>: exiting.
[2014-05-16 12:17:02.905908] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I see this happening with build glusterfs-geo-replication-3.7.1-4.el6rhs.x86_64, even without adding any new node. It happened when the volume type is "disperse" for both master and slave.

[root@georep2 ~]# grep "OSError:" /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.154%3Agluster%3A%2F%2F127.0.0.1%3Aslave.log
OSError: [Errno 34] Numerical result out of range
OSError: [Errno 34] Numerical result out of range
[root@georep2 ~]#

[2015-06-22 19:39:07.959126] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------
[2015-06-22 19:39:07.959476] I [monitor(monitor):222:monitor] Monitor: starting gsyncd worker
[2015-06-22 19:39:08.93621] I [gsyncd(/rhs/brick1/b1):649:main_i] <top>: syncing: gluster://localhost:master -> ssh://root.46.103:gluster://localhost:slave
[2015-06-22 19:39:08.95005] I [changelogagent(agent):75:__init__] ChangelogAgent: Agent listining...
[2015-06-22 19:39:11.153399] I [master(/rhs/brick1/b1):83:gmaster_builder] <top>: setting up xsync change detection mode
[2015-06-22 19:39:11.153790] I [master(/rhs/brick1/b1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-06-22 19:39:11.155164] I [master(/rhs/brick1/b1):83:gmaster_builder] <top>: setting up changelog change detection mode
[2015-06-22 19:39:11.155376] I [master(/rhs/brick1/b1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-06-22 19:39:11.156248] I [master(/rhs/brick1/b1):83:gmaster_builder] <top>: setting up changeloghistory change detection mode
[2015-06-22 19:39:11.156491] I [master(/rhs/brick1/b1):404:__init__] _GMaster: using 'rsync' as the sync engine
[2015-06-22 19:39:13.201039] I [master(/rhs/brick1/b1):1208:register] _GMaster: xsync temp directory: /var/lib/misc/glusterfsd/master/ssh%3A%2F%2Froot%4010.70.46.154%3Agluster%3A%2F%2F127.0.0.1%3Aslave/c19b89ac45352ab8c894d210d136dd56/xsync
[2015-06-22 19:39:13.201385] I [resource(/rhs/brick1/b1):1432:service_loop] GLUSTER: Register time: 1434982153
[2015-06-22 19:39:15.791850] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1438, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 514, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 172, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 332, in get_sys_volinfo
    self.master.server.aggregated.foreign_volume_infos(),
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1084, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 94, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 69, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range
[2015-06-22 19:39:15.793708] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.
[2015-06-22 19:39:15.795592] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-06-22 19:39:15.795979] I [syncdutils(agent):220:finalize] <top>: exiting.
[2015-06-22 19:39:16.160041] I [monitor(monitor):282:monitor] Monitor: worker(/rhs/brick1/b1) died in startup phase
[2015-06-22 19:39:26.346771] I [monitor(monitor):221:monitor] Monitor: ------------------------------------------------------------

This happened on one of the nodes in the master cluster, just after starting the geo-rep session, and the status went to Faulty. After multiple tries, the worker comes back and the status correctly becomes Passive. Will be attaching the new logs.
Hit this bug on the normal distributed-volume as well with build glusterfs-3.7.1-14.el7rhgs.x86_64

[2015-09-08 17:50:29.959184] I [master(/bricks/brick2/master_brick8):1249:crawl] _GMaster: finished hybrid crawl syncing, stime: (1441734629, 0)
[2015-09-08 17:50:29.960731] E [syncdutils(/bricks/brick0/master_brick0):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1445, in service_loop
    g1.crawlwrap(oneshot=True, register_time=register_time)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 525, in crawlwrap
    volinfo_sys = self.volinfo_hook()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 172, in volinfo_hook
    return self.get_sys_volinfo()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 332, in get_sys_volinfo
    self.master.server.aggregated.foreign_volume_infos(),
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1084, in foreign_volume_infos
    xattr_list = Xattr.llistxattr_buf('.')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 94, in llistxattr_buf
    return cls.llistxattr(path, size)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 69, in llistxattr
    ret = cls._query_xattr(path, siz, 'llistxattr')
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 34] Numerical result out of range

[root@georep1 syncdaemon]# gluster volume info master
Volume Name: master
Type: Distributed-Replicate
Volume ID: 114cc338-b4ae-469a-8db7-105b5f671f9c
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.