Description of problem:
Geo-rep worker goes faulty on some bricks (not all bricks) if there is file rotation inside the GlusterFS mount.

Version-Release number of selected component (if applicable):
4.1.5 on CentOS 7.5 (I have not tested on other versions or operating systems)

How reproducible:
Always

Steps to Reproduce:
1. Mount a geo-replicated volume from the Master node
2. Create a file (such as a log file)
3. Rotate that file a few times

Actual results:
Geo-rep worker goes faulty on some bricks (not all bricks).

gsyncd.log on Master
--------------------
[2018-12-01 20:39:49.653356] E [repce(worker /mnt/BRICK3):197:__call__] RepceClient: call failed call=25197:139717822179136:1543696787.31 method=entry_ops error=OSError
[2018-12-01 20:39:49.653767] E [syncdutils(worker /mnt/BRICK3):332:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 311, in main
    func(args)
  File "/usr/libexec/glusterfs/python/syncdaemon/subcmds.py", line 72, in subcmd_worker
    local.service_loop(remote)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1295, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 615, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1545, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1445, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1280, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1179, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 216, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 198, in __call__
    raise res
OSError: [Errno 116] Stale file handle
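The steps above can be sketched as a shell snippet. This is a minimal, hypothetical reproducer: it assumes a logrotate-style rename scheme (rename each suffix up by one, then recreate the base file), and `MNT` is a stand-in path you would point at the master-side GlusterFS mount (it defaults to a local demo directory here so the snippet runs anywhere).

```shell
# Hypothetical reproduction sketch. MNT is an assumption: point it at the
# geo-replicated GlusterFS mount on the Master node.
MNT="${MNT:-/tmp/georep-demo}"
mkdir -p "$MNT"

# Step 2: create a log file inside the mount
echo "line 1" > "$MNT/glue_app_debug_log.log"

# Step 3: rotate it logrotate-style: shift existing suffixes up by one...
for i in 3 2 1; do
    [ -f "$MNT/glue_app_debug_log.log.$i" ] && \
        mv "$MNT/glue_app_debug_log.log.$i" "$MNT/glue_app_debug_log.log.$((i+1))"
done
# ...then rename the live file to .1 and recreate the base file
mv "$MNT/glue_app_debug_log.log" "$MNT/glue_app_debug_log.log.1"
echo "new line" > "$MNT/glue_app_debug_log.log"
```

Run against a real geo-replicated mount, the renames generate the changelog entry_ops that the faulty workers then fail to replay on the slave.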
gsyncd.log on Slave
-------------------
[2018-12-01 20:59:52.571860] W [syncdutils(slave gluster-eadmin-data.vm/mnt/BRICK3):552:errno_wrap] <top>: reached maximum retries args=['.gfid/86ba8c38-5ab0-417e-9130-64dd2d7cf4aa/glue_app_debug_log.log.82', '.gfid/86ba8c38-5ab0-417e-9130-64dd2d7cf4aa/glue_app_debug_log.log.83'] error=[Errno 116] Stale file handle
[2018-12-01 20:59:52.572635] E [repce(slave gluster-eadmin-data.vm/mnt/BRICK3):105:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 101, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 675, in entry_ops
    uid, gid)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 526, in rename_with_disk_gfid_confirmation
    [ENOENT, EEXIST], [ESTALE, EBUSY])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 540, in errno_wrap
    return call(*arg)
OSError: [Errno 116] Stale file handle

Expected results:
Geo-rep worker stays in the normal (non-faulty) state.

Additional info:
Those errors go away if I move the rotated files (glue_app_debug_log.log.82 and glue_app_debug_log.log.83 in the log above) from the Gluster mount to a temporary place and then move them back to their original location in the Gluster mount.
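The move-out-and-back workaround described under "Additional info" could look roughly like the following sketch. `MNT` is again a stand-in for the master-side mount (defaulting to a local demo directory so the snippet is runnable), and the demo `touch` lines merely fabricate the rotated files that would already exist in a real setup.

```shell
# Hypothetical workaround sketch: move the rotated files that trigger
# ESTALE out of the Gluster mount, then move them back, so geo-rep sees
# fresh entries for them. MNT is an assumption.
MNT="${MNT:-/tmp/georep-demo}"
TMP="$(mktemp -d)"

mkdir -p "$MNT"
touch "$MNT/glue_app_debug_log.log.82" "$MNT/glue_app_debug_log.log.83"  # demo stand-ins

for f in "$MNT"/glue_app_debug_log.log.*; do
    mv "$f" "$TMP/"                    # move out of the mount...
    mv "$TMP/$(basename "$f")" "$f"    # ...and back to the original place
done
rmdir "$TMP"
```

This only clears the symptom for the files already affected; the rotation of new files will hit the same error until the underlying bug is fixed.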
I confirm that this error still happens on version 4.1.6. I have also tried invalidating the related volume config, but still no luck.
I think this should solve the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1694820. The patch is merged upstream. Can you please verify and let us know whether it solves the problem? -Sunny
Closing this bug as the fix is available upstream.