Bug 1306255 - [georep+disperse]: Geo-rep worker goes to faulty and restarts due to error Input/output error on a file
Status: NEW
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 3.1
Hardware: x86_64 Linux
Priority: unspecified  Severity: urgent
Assigned To: Bug Updates Notification Mailing List
storage-qa-internal@redhat.com
Keywords: ZStream
Depends On:
Blocks:
 
Reported: 2016-02-10 07:31 EST by Rahul Hinduja
Modified: 2017-12-31 13:12 EST
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Rahul Hinduja 2016-02-10 07:31:56 EST
Description of problem:
=======================

If the slave volume is dispersed {2 x (4+2)}, the geo-rep worker dies and restarts with the following traceback:

Master's slave.log:
===================

[2016-02-10 08:44:49.383415] E [repce(/bricks/brick1/master_brick6):207:__call__] RepceClient: call 7565:139988489930560:1455093871.58 (entry_ops) failed on peer with OSError
[2016-02-10 08:44:49.383992] E [syncdutils(/bricks/brick1/master_brick6):276:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 662, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1439, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 610, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1226, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1010, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 965, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error: '.gfid/a6d2e818-af15-48b2-ae51-4f114ef757fb/56ba4416%%64JWUS02W9'


On Slave node:
==============

[2016-02-10 08:45:02.999108] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 678, in entry_ops
    er = entry_purge(entry, gfid)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 603, in entry_purge
    er = errno_wrap(os.unlink, [entry], [ENOENT, ESTALE, EISDIR])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
    return call(*arg)
OSError: [Errno 5] Input/output error: '.gfid/a4117e5c-c02b-499d-ab18-238fb7f1da16/56ba441b%%FU8VFXYZ7O'
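
For context on why a single EIO takes the whole worker down: errno_wrap() in the traceback above only swallows the errnos it is passed (ENOENT, ESTALE, EISDIR for unlink), so the Input/output error returned by the dispersed slave is re-raised out of entry_ops(), travels back over the repce RPC channel, and the session goes Faulty while the worker restarts, as reported. A minimal sketch of that whitelisting pattern (simplified and illustrative, not the actual syncdutils.errno_wrap):

# Simplified sketch of the errno-whitelisting pattern seen in the traceback;
# names are illustrative. Only the listed errnos are ignored or retried, so
# EIO (Errno 5) from the EC slave is re-raised and the worker goes Faulty.
import errno
import os
import time

def errno_wrap_sketch(call, args=(), ignore=(), retry=(), attempts=5, delay=1):
    for _ in range(attempts):
        try:
            return call(*args)
        except OSError as e:
            if e.errno in ignore:
                return             # e.g. ENOENT/ESTALE: entry already gone
            if e.errno in retry:
                time.sleep(delay)  # e.g. EBUSY: transient, try again
                continue
            raise                  # EIO falls through here in this bug

# Mirrors the failing call in entry_purge (path is a placeholder):
# errno_wrap_sketch(os.unlink, ('.gfid/<parent-gfid>/<name>',),
#                   (errno.ENOENT, errno.ESTALE, errno.EISDIR))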


Slave client logs {gluster.log}
===============================

[2016-02-10 07:44:14.530667] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-1: Failed to complete preop lock [Input/output error]
[2016-02-10 07:44:14.532831] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-0: Failed to complete preop lock [Stale file handle]
[2016-02-10 07:44:14.535377] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-1: Failed to complete preop lock [Input/output error]
[2016-02-10 07:44:14.535414] W [MSGID: 109005] [dht-selfheal.c:1142:dht_selfheal_dir_mkdir_cbk] 0-slave-dht: Directory selfheal failed: path = /.gfid/ee3d6f67-dc0a-4213-9cf7-8b110ba359aa/level13, gfid = 00000000-0000-0000-0000-000000000000 [Input/output error]


For the file:
=============

842-77acaf5d7bd7> (7439b876-c90d-4bcb-9842-77acaf5d7bd7) [No such file or directory]
[2016-02-10 08:45:01.756947] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-slave-client-11: remote operation failed. Path: <gfid:7439b876-c90d-4bcb-9842-77acaf5d7bd7> (7439b876-c90d-4bcb-9842-77acaf5d7bd7) [No such file or directory]
[2016-02-10 08:45:01.756992] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-slave-client-10: remote operation failed. Path: <gfid:7439b876-c90d-4bcb-9842-77acaf5d7bd7> (7439b876-c90d-4bcb-9842-77acaf5d7bd7) [No such file or directory]
[2016-02-10 08:45:02.991912] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-slave-disperse-0: Operation failed on some subvolumes (up=3F, mask=1F, remaining=0, good=1F, bad=20)
[2016-02-10 08:45:02.998614] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-0: Failed to complete preop lock [Input/output error]
[2016-02-10 08:45:02.998752] W [fuse-bridge.c:1348:fuse_unlink_cbk] 0-glusterfs-fuse: 4416: UNLINK() /.gfid/a4117e5c-c02b-499d-ab18-238fb7f1da16/56ba441b%%FU8VFXYZ7O => -1 (Input/output error)
[2016-02-10 08:45:03.007464] I [MSGID: 122002] [ec-heal.c:2344:ec_heal_do] 0-slave-disperse-0: /.gfid/a4117e5c-c02b-499d-ab18-238fb7f1da16/56ba441b%%N3BGBHSRVG: name heal failed [Transport endpoint is not connected]

One suspect is that multiple workers from the master may try to perform the same unlink on the slave, which then fails with an Input/output error.

Eventually the files do sync from master to slave, but for a brief period the worker goes to Faulty and restarts.
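
If that race theory holds, one possible slave-side mitigation (purely a hypothetical sketch, not a proposed or shipped fix) would be to treat EIO on unlink/rmdir as retriable and only report a failure if the entry still exists afterwards:

# Hypothetical mitigation sketch; purge_with_eio_retry is illustrative and
# not part of gsyncd. Retries EIO and treats "entry already gone" as success.
import errno
import os
import time

def purge_with_eio_retry(path, is_dir=False, attempts=3, delay=1):
    remove = os.rmdir if is_dir else os.unlink
    for _ in range(attempts):
        try:
            remove(path)
            return True
        except OSError as e:
            if e.errno in (errno.ENOENT, errno.ESTALE):
                return True        # another worker won the race
            if e.errno == errno.EIO:
                time.sleep(delay)  # give the EC subvolume time to settle
                continue
            raise
    # After retries, fail the entry only if it is really still present.
    return not os.path.lexists(path)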


Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.7.5-19


How reproducible:
=================

1/1


Steps to Reproduce:
===================
1. Create a master volume as DR (distribute-replicate) and a slave volume as EC (dispersed)
2. Create and start a geo-rep session between them
3. Perform operations on the master: create, chmod, chown, chgrp, hardlink, symlink, truncate, rename, remove (see the sketch below)
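
A rough reproduction sketch of these steps (hostnames, volume names, brick paths and the master's replica count are placeholders; only the slave's 2 x (4+2) dispersed layout comes from this report, and geo-rep prerequisites such as passwordless SSH are assumed to be already in place):

# Illustrative reproduction driver; every name below is a placeholder.
# Master-side and slave-side volume creation run on their respective
# clusters in practice; they are shown together here for brevity.
import subprocess

def sh(cmd):
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True)

# Slave: 2 x (4+2) dispersed volume, as in the report.
slave_bricks = " ".join("slavenode%d:/bricks/brick1/slave_brick%d" % (i, i)
                        for i in range(1, 13))
sh("gluster volume create slave disperse 6 redundancy 2 %s force" % slave_bricks)
sh("gluster volume start slave")

# Master: distribute-replicate volume (assumed 3x2; exact layout not given).
master_bricks = " ".join("masternode%d:/bricks/brick1/master_brick%d" % (i, i)
                         for i in range(1, 7))
sh("gluster volume create master replica 2 %s force" % master_bricks)
sh("gluster volume start master")

# Geo-replication session from master to slave, then start it.
sh("gluster volume geo-replication master slavenode1::slave create push-pem")
sh("gluster volume geo-replication master slavenode1::slave start")

# Step 3: run create/chmod/chown/chgrp/hardlink/symlink/truncate/rename/remove
# on a FUSE mount of the master volume and watch the geo-rep status.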

Actual results:
===============

All the fops succeed and sync to the slave, but the logs report the traceback and the worker restarts.


Expected results:
=================

All the fops should succeed and sync to the slave without the session going to Faulty.
Comment 3 Rahul Hinduja 2017-07-30 04:42:55 EDT
Looks like I am hitting this issue consistently upon rmdir at the master when the slave volume is EC:

Master:
-------

[2017-07-29 16:24:37.529525] E [syncdutils(/bricks/brick1/master_brick6):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1143, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1118, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1001, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 942, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 5] Input/output error: '.gfid/4b0cb9a6-4240-4bf7-aa2e-0fb922d5c537/symlink_to_files'


Slave:
------

[2017-07-29 16:24:37.496778] E [repce(slave):117:worker] <top>: call failed: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 725, in entry_ops
    er = entry_purge(entry, gfid)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 626, in entry_purge
    ENOTEMPTY], [EBUSY])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
OSError: [Errno 5] Input/output error: '.gfid/4b0cb9a6-4240-4bf7-aa2e-0fb922d5c537/symlink_to_files'

Client Log:
-----------

[2017-07-29 16:24:37.496357] W [MSGID: 122033] [ec-common.c:1542:ec_locked] 0-slave-disperse-1: Failed to complete preop lock [Input/output error]
[2017-07-29 16:24:37.496469] W [fuse-bridge.c:1355:fuse_unlink_cbk] 0-glusterfs-fuse: 263167: RMDIR() /.gfid/4b0cb9a6-4240-4bf7-aa2e-0fb922d5c537/symlink_to_files => -1 (Input/output error)
