Bug 1222856
| Summary: | [geo-rep]: worker died with "ESTALE" when performed rm -rf on a directory from mount of master volume | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> | |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> | |
| Status: | CLOSED ERRATA | QA Contact: | Rahul Hinduja <rhinduja> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | high | |||
| Version: | rhgs-3.1 | CC: | aavati, annair, asrivast, avishwan, bmohanra, csaba, khiremat, nlevinki, nsathyan, vagarwal | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.1.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.7.1-6 | Doc Type: | Bug Fix | |
| Doc Text: | Previously, when DHT could not resolve a GFID or path, it could raise an ESTALE error in addition to the usual ENOENT error. Because the ESTALE exception was unhandled, the geo-replication worker crashed and tracebacks were printed in the log files. With this release, ESTALE errors are handled by the geo-replication worker in the same way as ENOENT errors, so the worker no longer crashes (a minimal sketch of this handling follows the table). | Story Points: | --- | |
| Clone Of: | ||||
| : | 1223280 1232912 (view as bug list) | Environment: | ||
| Last Closed: | 2015-07-29 04:43:38 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1223286 | |||
| Bug Blocks: | 1202842, 1223636, 1232912, 1236093 | |||
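
The fix described in the Doc Text amounts to treating ESTALE the same way as ENOENT when the worker applies entry operations that may race with an `rm -rf` on the master. The snippet below is a minimal, self-contained sketch of that idea in plain Python; the helper name `apply_entry_op` and its arguments are hypothetical and are not taken from the gsyncd source.

```python
import errno
import os


def apply_entry_op(op, *args):
    """Run a single filesystem entry operation, tolerating races.

    Hypothetical helper: if the entry was already removed by a
    concurrent rm -rf on the master, the underlying call may fail
    with ENOENT or, when DHT cannot resolve the GFID/path, with
    ESTALE. Both are treated as "nothing left to do" instead of
    letting the exception escape and kill the worker.
    """
    try:
        return op(*args)
    except OSError as e:
        if e.errno in (errno.ENOENT, errno.ESTALE):
            return None  # entry already gone; safe to ignore
        raise  # any other error is still propagated


# Example: unlinking a path that a parallel rm -rf may have removed.
apply_entry_op(os.unlink, "/tmp/possibly-removed-file")
```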
Patches:

- master: http://review.gluster.org/#/c/10837/
- release-3.7: http://review.gluster.org/10913
- downstream: https://code.engineering.redhat.com/gerrit/#/c/49674/

I still see the issue with build glusterfs-3.7.1-1. Moving bug back to assigned state.

```
[root@georep1 scripts]# rpm -qa | grep gluster
glusterfs-client-xlators-3.7.1-1.el6rhs.x86_64
glusterfs-server-3.7.1-1.el6rhs.x86_64
glusterfs-3.7.1-1.el6rhs.x86_64
glusterfs-api-3.7.1-1.el6rhs.x86_64
glusterfs-cli-3.7.1-1.el6rhs.x86_64
glusterfs-geo-replication-3.7.1-1.el6rhs.x86_64
glusterfs-libs-3.7.1-1.el6rhs.x86_64
glusterfs-fuse-3.7.1-1.el6rhs.x86_64
glusterfs-debuginfo-3.7.1-1.el6rhs.x86_64

[root@georep1 scripts]# cat /var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.46.154%3Agluster%3A%2F%2F127.0.0.1%3Aslave.log | grep "OSError"
[2015-06-11 22:34:23.111248] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 20852:140282122651392:1434042220.8 (entry_ops) failed on peer with OSError
[2015-06-11 22:34:46.175925] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 21689:140594955093760:1434042280.85 (entry_ops) failed on peer with OSError
OSError: [Errno 116] Stale file handle
[2015-06-11 22:35:08.149015] E [repce(/rhs/brick2/b2):207:__call__] RepceClient: call 21766:140460004030208:1434042303.43 (entry_ops) failed on peer with OSError
OSError: [Errno 116] Stale file handle
[root@georep1 scripts]#
```

Upstream Patch (Master): http://review.gluster.org/#/c/11296/
Upstream Patch (3.7): http://review.gluster.org/#/c/11430/
Downstream Patch: https://code.engineering.redhat.com/gerrit/#/c/51709/

Hi Aravinda,

The doc text is updated. Please review it and share your technical review comments. If it looks OK, then sign off on it.

Regards,
Bhavana

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
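
The verification above greps the master-side geo-replication log for OSError entries. A hedged equivalent in Python is sketched below; the log path comes from the comment above, and the helper name `count_estale_errors` is hypothetical rather than part of any gluster tooling.

```python
import re
import sys


def count_estale_errors(log_path):
    """Count log lines that report a stale file handle (errno 116).

    Hypothetical helper mirroring the `grep "OSError"` check above:
    it scans a geo-replication log and reports how many entries show
    the ESTALE failure that used to crash the worker.
    """
    pattern = re.compile(r"OSError: \[Errno 116\] Stale file handle")
    count = 0
    with open(log_path, errors="replace") as f:
        for line in f:
            if pattern.search(line):
                count += 1
    return count


if __name__ == "__main__":
    # Usage: python check_estale.py <geo-rep master log file>
    print(count_estale_errors(sys.argv[1]))
```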
Description of problem:
=======================
Whenever rm -rf was performed on the master volume, the worker died with the following backtrace:

```
[2015-05-19 15:33:13.868683] E [syncdutils(/rhs/brick2/b2):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 659, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1440, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 580, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1150, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1059, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 946, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 902, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 116] Stale file handle
[2015-05-19 15:33:13.870326] I [syncdutils(/rhs/brick2/b2):220:finalize] <top>: exiting.
[2015-05-19 15:33:13.874784] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
```

Every time the monitor tries to spawn the worker process again, it dies in the startup phase.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.0-2.el6rhs.x86_64

How reproducible:
=================
Tried a couple of times and reproduced it every time.

Steps Carried:
==============
1. Created master cluster
2. Created and started master volume
3. Created shared volume (gluster_shared_storage)
4. Mounted the shared volume on /var/run/gluster/shared_storage
5. Created slave cluster
6. Created and started slave volume
7. Created geo-rep session between master and slave
8. Configured use_meta_volume true
9. Started geo-rep
10. Mounted master volume over FUSE and NFS on the client
11. Copied files /etc{1..10} from the FUSE mount
12. Copied files /etc{11..20} from the NFS mount
13. Sync completed successfully
14. Removed the files etc.2 from FUSE and etc.12 from NFS
15. Looked into the geo-rep session; it was Faulty
16. Looked into the logs; they showed a continuous traceback

Actual results:
===============
The worker crashed and comes back with crawl type as history.

Expected results:
=================
The worker should not crash; it should handle ESTALE gracefully.
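
The traceback above ends in `raise res` inside repce.py: the slave side returns the exception it hit, and the client proxy re-raises it on the master, so a slave-side ESTALE from entry_ops surfaces as an uncaught OSError in the worker's crawl loop. The sketch below is a simplified, self-contained illustration of that propagation pattern, not the actual repce implementation; the class and function names are hypothetical.

```python
import errno


class RemoteProxy:
    """Toy stand-in for an RPC proxy in the style of RepceClient.

    The remote side returns either a result or the exception it hit;
    the local side re-raises the exception, which is why a slave-side
    OSError shows up in the master worker's traceback.
    """

    def __init__(self, func):
        self.func = func

    def call(self, *args):
        res = self._remote(*args)
        if isinstance(res, Exception):
            raise res          # mirrors the `raise res` seen in repce.py
        return res

    def _remote(self, *args):
        # Pretend this runs on the slave and ships the outcome back.
        try:
            return self.func(*args)
        except Exception as e:
            return e


def entry_ops(path):
    # Simulate DHT failing to resolve a GFID/path during rm -rf.
    raise OSError(errno.ESTALE, "Stale file handle", path)


proxy = RemoteProxy(entry_ops)
try:
    proxy.call("/rhs/brick2/b2/some/dir")
except OSError as e:
    # Without ESTALE handling in the caller, this exception would
    # escape the crawl loop and the worker process would exit.
    print("worker would have died with:", e)
```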