Description of problem:

Seeing "Directory not empty" and "Stale file handle" errors in the geo-rep logs during history crawl while files are being deleted. The geo-rep session went faulty.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.51-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create and start a geo-rep session.
2. Create data (10k files) on the master mount.
3. Ensure the files are synced to the slave volume.
4. Start deleting the files, and while the deletes are running, stop the geo-rep session.
5. Once all the deletes have finished on the master volume, start the geo-rep session.
6. The session switches to history crawl and starts syncing to the slave volume.
7. On the master node, check the geo-rep slave.log and look for errors and tracebacks.

Actual results:
Found errors like:
  OSError: [Errno 116] Stale file handle: '.gfid/cee167df-dfc0-406a-bd0d-ed593faeaa9c/5502b36a%%JZKEPVPX8R'
  OSError: [Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread4'
The data was eventually synced to the slave, but the geo-rep session went to the Faulty state and only came back to Active after a while. (A minimal standalone illustration of the errno values involved follows the status output below.)

Expected results:
These tracebacks should not appear in the logs.

Additional info:

[2015-03-13 15:48:23.847239] E [repce(/bricks/brick0/master_brick0):207:__call__] RepceClient: call 16550:140374215104256:1426241903.22 (entry_ops) failed on peer with OSError
[2015-03-13 15:48:23.847932] E [syncdutils(/bricks/brick0/master_brick0):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 645, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1329, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 553, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1334, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1017, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 981, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 39] Directory not empty: '.gfid/00000000-0000-0000-0000-000000000001/thread4'

[2015-03-13 15:48:39.740008] E [repce(/bricks/brick0/master_brick0):207:__call__] RepceClient: call 16662:140258876471040:1426241919.52 (entry_ops) failed on peer with OSError
[2015-03-13 15:48:39.740557] E [syncdutils(/bricks/brick0/master_brick0):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 645, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1329, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 553, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1334, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1017, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 981, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 116] Stale file handle: '.gfid/cee167df-dfc0-406a-bd0d-ed593faeaa9c/5502b36a%%JZKEPVPX8R'

[2015-03-13 15:46:27.528390] I [syncdutils(/bricks/brick0/master_brick0):214:finalize] <top>: exiting.
[2015-03-13 15:46:27.532459] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-03-13 15:46:27.532964] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-03-13 15:46:28.258347] I [monitor(monitor):280:monitor] Monitor: worker(/bricks/brick0/master_brick0) died in startup phase
[2015-03-13 15:46:28.259366] I [monitor(monitor):146:set_state] Monitor: new state: faulty
[2015-03-13 15:46:38.598536] I [monitor(monitor):220:monitor] Monitor: ------------------------------------------------------------
[2015-03-13 15:46:38.599097] I [monitor(monitor):221:monitor] Monitor: starting gsyncd worker
[2015-03-13 15:46:38.887062] I [gsyncd(/bricks/brick0/master_brick0):635:main_i] <top>: syncing: gluster://localhost:master -> ssh://root@falcon:gluster://localhost:slave
[2015-03-13 15:46:38.901046] I [changelogagent(agent):72:__init__] ChangelogAgent: Agent listining...

# gluster v i

Volume Name: master
Type: Distributed-Replicate
Volume ID: 82b646f9-1f38-443f-862d-dadd24b668c5
Status: Started
Snap Volume: no
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: spitfire:/bricks/brick0/master_brick0
Brick2: mustang:/bricks/brick0/master_brick1
Brick3: harrier:/bricks/brick0/master_brick2
Brick4: typhoon:/bricks/brick0/master_brick3
Brick5: ccr:/bricks/brick0/master_brick4
Brick6: metallica:/bricks/brick0/master_brick5
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
features.quota: on
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

# gluster v geo-rep master falcon::slave status

MASTER NODE    MASTER VOL    MASTER BRICK                    SLAVE USER    SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS
---------------------------------------------------------------------------------------------------------------------------------------------
spitfire       master        /bricks/brick0/master_brick0    root          falcon::slave         Faulty     N/A                  Changelog Crawl
typhoon        master        /bricks/brick0/master_brick3    root          lightning::slave      Faulty     N/A                  Changelog Crawl
metallica      master        /bricks/brick0/master_brick5    root          interceptor::slave    Passive    N/A                  N/A
mustang        master        /bricks/brick0/master_brick1    root          interceptor::slave    Passive    N/A                  N/A
harrier        master        /bricks/brick0/master_brick2    root          hornet::slave         Passive    N/A                  N/A
ccr            master        /bricks/brick0/master_brick4    root          falcon::slave         Passive    N/A                  N/A
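For reference, the errno values in the tracebacks above are ENOTEMPTY (39) and ESTALE (116) on Linux. The following is a minimal standalone Python sketch (not gsyncd code) showing how the "Directory not empty" case arises from a plain os.rmdir() on a directory that still has entries, which is the situation the slave-side entry_ops runs into when an RMDIR is replayed before all of the child UNLINKs:

import errno
import os
import tempfile

# errno values from the tracebacks above (numeric values are Linux-specific):
#   errno.ENOTEMPTY == 39   -> "Directory not empty"
#   errno.ESTALE    == 116  -> "Stale file handle"
print(errno.ENOTEMPTY, errno.ESTALE)

# Removing a directory that still has entries raises OSError(ENOTEMPTY) --
# the same error the slave hits when it replays an RMDIR while some of the
# child UNLINKs have not been applied yet.
parent = tempfile.mkdtemp()
open(os.path.join(parent, "child"), "w").close()
try:
    os.rmdir(parent)                      # fails: directory is not empty
except OSError as e:
    print("rmdir failed:", errno.errorcode[e.errno], "-", e.strerror)

# Once the child is gone, the same rmdir succeeds.
os.unlink(os.path.join(parent, "child"))
os.rmdir(parent)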
Also saw "OSError: [Errno 61] No data available" error for hardlinks: [2015-03-13 22:47:46.456286] E [syncdutils(/bricks/brick0/master_brick0):270:log_raise_exception] <top>: FAIL: Traceback (most recent call last): File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main main_i() File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 645, in main_i local.service_loop(*[r for r in [remote] if r]) File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1329, in service_loop g3.crawlwrap(oneshot=True) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 553, in crawlwrap self.crawl(no_stime_update=no_stime_update) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1334, in crawl self.process(changes) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1017, in process self.process_change(change, done, retry) File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 981, in process_change self.slave.server.entry_ops(entries) File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__ return self.ins(self.meth, *a) File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__ raise res OSError: [Errno 61] No data available: '.gfid/b8c9ca6c-cb69-4930-b4d5-c50ca7710f66/hardlink_to_files/55030214%%8W5FSWYQIO' [2015-03-13 22:47:46.459382] I [syncdutils(/bricks/brick0/master_brick0):214:finalize] <top>: exiting. [2015-03-13 22:47:46.464014] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF. [2015-03-13 22:47:46.464807] I [syncdutils(agent):214:finalize] <top>: exiting. [2015-03-13 22:47:46.887911] I [monitor(monitor):280:monitor] Monitor: worker(/bricks/brick0/master_brick0) died in startup phase
Hit the same issue with just the changelog crawl; the geo-rep session was never stopped. Executed the following from FUSE and NFS clients of the master volume (a small standalone sketch of this delete-ordering race follows the history log listing below):

Fuse:
====
[root@wingo master]# for i in {1..10}; do cp -rf /etc etc.1 ; sleep 10 ; rm -rf etc.1 ; sleep 10 ; done

NFS:
====
[root@wingo master_nfs]# for i in {1..10}; do cp -rf /etc etc.2 ; sleep 10 ; rm -rf etc.2 ; sleep 10 ; done

Status moved from Active to Faulty:
===================================
[root@georep1 ~]# gluster volume geo-replication vol0 10.70.46.100::vol1 status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS
--------------------------------------------------------------------------------------------------------------------------------
georep1        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Active     N/A                  Changelog Crawl
georep1        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Active     N/A                  Changelog Crawl
georep3        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Passive    N/A                  N/A
georep3        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick1/b1    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick2/b2    root          10.70.46.101::vol1    Passive    N/A                  N/A

[root@georep1 ~]# gluster volume geo-replication vol0 10.70.46.100::vol1 status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS
--------------------------------------------------------------------------------------------------------------------------------
georep1        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    faulty     N/A                  N/A
georep1        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    faulty     N/A                  N/A
georep3        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Passive    N/A                  N/A
georep3        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick1/b1    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick2/b2    root          10.70.46.101::vol1    Passive    N/A                  N/A

From Faulty it went back to Active and changed the crawl to history crawl:
==========================================================================
[root@georep1 ~]# gluster volume geo-replication vol0 10.70.46.100::vol1 status

MASTER NODE    MASTER VOL    MASTER BRICK      SLAVE USER    SLAVE                 STATUS     CHECKPOINT STATUS    CRAWL STATUS
--------------------------------------------------------------------------------------------------------------------------------
georep1        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Active     N/A                  History Crawl
georep1        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Active     N/A                  History Crawl
georep3        vol0          /rhs/brick1/b1    root          10.70.46.101::vol1    Passive    N/A                  N/A
georep3        vol0          /rhs/brick2/b2    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick1/b1    root          10.70.46.100::vol1    Passive    N/A                  N/A
georep2        vol0          /rhs/brick2/b2    root          10.70.46.101::vol1    Passive    N/A                  N/A

Log Snippet:
============
[2015-03-16 18:53:29.265461] I [syncdutils(/rhs/brick1/b1):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.264503] E [syncdutils(/rhs/brick2/b2):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 645, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1329, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 553, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1334, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1017, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 981, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 39] Directory not empty: '.gfid/70b3b3b8-3e8d-4f32-9123-fab73574ce91/yum'
[2015-03-16 18:53:29.266381] I [syncdutils(/rhs/brick2/b2):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.268383] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-03-16 18:53:29.268708] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.269619] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2015-03-16 18:53:29.270034] I [syncdutils(agent):214:finalize] <top>: exiting.
[2015-03-16 18:53:29.685185] I [monitor(monitor):280:monitor] Monitor: worker(/rhs/brick1/b1) died in startup phase
[2015-03-16 18:53:30.220942] I [monitor(monitor):280:monitor] Monitor: worker(/rhs/brick2/b2) died in startup phase

Output of history log:
======================
[root@georep1 ssh%3A%2F%2Froot%4010.70.46.100%3Agluster%3A%2F%2F127.0.0.1%3Avol1]# find .
.
./c19b89ac45352ab8c894d210d136dd56
./c19b89ac45352ab8c894d210d136dd56/xsync
./c19b89ac45352ab8c894d210d136dd56/.history
./c19b89ac45352ab8c894d210d136dd56/.history/tracker
./c19b89ac45352ab8c894d210d136dd56/.history/.current
./c19b89ac45352ab8c894d210d136dd56/.history/.processed
./c19b89ac45352ab8c894d210d136dd56/.history/.processing
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512222
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512176
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512161
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512131
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512191
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512207
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512146
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512116
./c19b89ac45352ab8c894d210d136dd56/.history/.processing/CHANGELOG.1426512101
./c19b89ac45352ab8c894d210d136dd56/tracker
./c19b89ac45352ab8c894d210d136dd56/.current
./c19b89ac45352ab8c894d210d136dd56/.processed
./c19b89ac45352ab8c894d210d136dd56/.processed/archive_201503.tar
./c19b89ac45352ab8c894d210d136dd56/.processing
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512372
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512342
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512282
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512357
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512327
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512267
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512387
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512312
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512237
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512252
./c19b89ac45352ab8c894d210d136dd56/.processing/CHANGELOG.1426512297
./764586b145d7206a154a778f64bd2f50
./764586b145d7206a154a778f64bd2f50/xsync
./764586b145d7206a154a778f64bd2f50/.history
./764586b145d7206a154a778f64bd2f50/.history/tracker
./764586b145d7206a154a778f64bd2f50/.history/.current
./764586b145d7206a154a778f64bd2f50/.history/.processed
./764586b145d7206a154a778f64bd2f50/.history/.processing
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512222
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512176
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512161
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512131
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512207
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512146
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512192
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512116
./764586b145d7206a154a778f64bd2f50/.history/.processing/CHANGELOG.1426512101
./764586b145d7206a154a778f64bd2f50/tracker
./764586b145d7206a154a778f64bd2f50/.current
./764586b145d7206a154a778f64bd2f50/.processed
./764586b145d7206a154a778f64bd2f50/.processed/archive_201503.tar
./764586b145d7206a154a778f64bd2f50/.processing
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512372
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512342
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512282
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512357
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512327
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512267
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512387
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512312
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512237
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512252
./764586b145d7206a154a778f64bd2f50/.processing/CHANGELOG.1426512297
[root@georep1 ssh%3A%2F%2Froot%4010.70.46.100%3Agluster%3A%2F%2F127.0.0.1%3Avol1]#
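To make the delete-ordering race described above concrete, here is a small standalone Python sketch; it is not gsyncd code, and the thread layout, sleeps, and retry loop are purely illustrative. One thread removes the children of a directory one by one (standing in for UNLINK entries that are still being replayed), while another keeps attempting the RMDIR and treats ENOTEMPTY as a transient condition rather than a fatal error:

import errno
import os
import tempfile
import threading
import time

parent = tempfile.mkdtemp()
for i in range(5):
    open(os.path.join(parent, "f%d" % i), "w").close()

def replay_unlinks():
    # Stands in for a worker that is still replaying the UNLINK entries.
    for name in os.listdir(parent):
        time.sleep(0.1)
        os.unlink(os.path.join(parent, name))

def replay_rmdir():
    # Stands in for the worker replaying the RMDIR entry: retry on
    # ENOTEMPTY instead of letting the OSError propagate and kill the worker.
    while True:
        try:
            os.rmdir(parent)
            print("rmdir succeeded")
            return
        except OSError as e:
            if e.errno != errno.ENOTEMPTY:
                raise
            print("rmdir deferred: directory not empty yet")
            time.sleep(0.1)

t1 = threading.Thread(target=replay_unlinks)
t2 = threading.Thread(target=replay_rmdir)
t1.start(); t2.start()
t1.join(); t2.join()

With the ordering hit in this bug, the worker instead raised the OSError and died, which is what produces the Faulty/Active flapping shown in the status output above.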
Upstream patch sent for review: http://review.gluster.org/#/c/10204/
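The authoritative change is in the review linked above. Purely as a hedged sketch of the general approach (this is not the patch itself, and safe_delete/BENIGN_DELETE_ERRNOS are hypothetical names), the slave-side entry operations can treat the errno values reported in this bug as benign when replaying deletes, instead of letting the OSError propagate and take the worker down:

import errno
import logging
import os

# errno values that can legitimately show up while replaying deletes on the
# slave: the directory still has children whose UNLINKs have not been
# replayed yet, the entry is already gone, or the gfid handle no longer
# resolves.
BENIGN_DELETE_ERRNOS = (errno.ENOTEMPTY, errno.ESTALE, errno.ENOENT, errno.ENODATA)

def safe_delete(path, is_dir):
    # Hypothetical helper: perform an UNLINK/RMDIR entry op and swallow the
    # benign errors above instead of raising them to the caller.
    try:
        if is_dir:
            os.rmdir(path)
        else:
            os.unlink(path)
    except OSError as e:
        if e.errno in BENIGN_DELETE_ERRNOS:
            logging.warning("ignoring %s on delete of %s",
                            errno.errorcode.get(e.errno, e.errno), path)
        else:
            raise

How the directory-not-empty case is ultimately resolved (retried later, cleaned up recursively, or left to a subsequent changelog) is a design decision made in the actual patch; see the review for the real implementation.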
Verified with build: glusterfs-3.7.1-9.el6rhs.x86_64

The worker no longer crashes with "Directory not empty" or "Stale file handle" (ESTALE). I do still see many "directory not empty" log messages during recursive removes, and eventually some directories do not get removed from the slave; that is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1235633#c5

Moving this bug to the verified state.

[root@georep1 ~]# grep -i "OSError" /var/log/glusterfs/geo-replication/master/*
[root@georep1 ~]#

[root@georep3 ~]# grep -i "OSError" /var/log/glusterfs/geo-replication-slaves/1712d430-4ccd-4560-9f7c-d826537a9600\:gluster%3A%2F%2F127.0.0.1%3Aslave.*
[root@georep3 ~]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html