| Summary: | [georep+disperse]: Geo-rep worker goes to faulty and restarts due to error Input/output error on a file | | |
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Rahul Hinduja <rhinduja> |
| Component: | geo-replication | Assignee: | Sunny Kumar <sunkumar> |
| Status: | CLOSED DEFERRED | QA Contact: | Prasanth <pprakash> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | rhgs-3.1 | CC: | aspandey, chrisw, csaba, kiyer, nlevinki, sheggodu, sunkumar |
| Target Milestone: | --- | Keywords: | Reopened, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-02 08:58:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Looks like I am hitting this issue consistently upon rmdir at master, with the failure on the slave (Slave-EC):
Master:
-------
[2017-07-29 16:24:37.529525] E [syncdutils(/bricks/brick1/master_brick6):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
main_i()
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
local.service_loop(*[r for r in [remote] if r])
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
g2.crawlwrap()
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
self.crawl()
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1143, in crawl
self.changelogs_batch_process(changes)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1118, in changelogs_batch_process
self.process(batch)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1001, in process
self.process_change(change, done, retry)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 942, in process_change
failures = self.slave.server.entry_ops(entries)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
return self.ins(self.meth, *a)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
raise res
OSError: [Errno 5] Input/output error: '.gfid/4b0cb9a6-4240-4bf7-aa2e-0fb922d5c537/symlink_to_files'
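
Note on the failure path: the traceback ends at repce.py's raise res. entry_ops() actually runs on the slave, and whatever exception it hits there is marshalled back over the RPC channel and re-raised inside the master worker, which is why a slave-side EIO crashes the worker and the session flips to Faulty. A minimal sketch of that remote-exception re-raise pattern (an illustrative simplification with made-up names, not the actual repce code):

import errno
import queue

# Hypothetical simplification of how a slave-side exception resurfaces in the
# master worker (the real logic lives in repce.py; all names here are made up).
def remote_entry_ops(send, recv, entries):
    send(("entry_ops", entries))      # master ships the entry batch to the slave
    failed, res = recv()              # slave replies with (error-flag, payload)
    if failed:
        # A slave-side OSError(EIO) is re-raised here in the worker's context;
        # the worker dies and the session shows Faulty until it is restarted.
        raise res
    return res

# Toy usage: simulate a slave that answers entry_ops with EIO.
to_slave, from_slave = queue.Queue(), queue.Queue()
from_slave.put((True, OSError(errno.EIO, "Input/output error")))
try:
    remote_entry_ops(to_slave.put, from_slave.get, entries=[])
except OSError as exc:
    print("worker would go Faulty:", exc)
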
Slave:
------
[2017-07-29 16:24:37.496778] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
res = getattr(self.obj, rmeth)(*in_data[2:])
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 725, in entry_ops
er = entry_purge(entry, gfid)
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 626, in entry_purge
ENOTEMPTY], [EBUSY])
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
return call(*arg)
OSError: [Errno 5] Input/output error: '.gfid/4b0cb9a6-4240-4bf7-aa2e-0fb922d5c537/symlink_to_files'
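
On the slave, the purge goes through errno_wrap(), which the traceback shows being called with lists of errnos to tolerate or retry ([... ENOTEMPTY], [EBUSY] for the rmdir above; [ENOENT, ESTALE, EISDIR] for the unlink in the original description further down). EIO is in neither list, so the OSError propagates back to the master worker instead of being swallowed. A minimal sketch of that whitelist behaviour, with a simplified, hypothetical signature (not the actual syncdutils implementation):

import os
import time
from errno import ENOENT, ESTALE, EISDIR

# Hypothetical simplification of an errno_wrap-style helper: errnos in `ignore`
# are swallowed, errnos in `retry` trigger another attempt, and anything else
# (such as the EIO returned by the disperse slave here) is re-raised and ends
# up crashing the geo-rep worker.
def errno_wrap_sketch(call, args, ignore, retry=(), attempts=5):
    for _ in range(attempts):
        try:
            return call(*args)
        except OSError as e:
            if e.errno in ignore:
                return            # tolerated, e.g. ENOENT: already removed elsewhere
            if e.errno in retry:
                time.sleep(1)     # retryable, e.g. EBUSY: back off and try again
                continue
            raise                 # neither ignored nor retried, e.g. EIO

# Toy usage mirroring the unlink path: ENOENT is tolerated, so this is a no-op.
errno_wrap_sketch(os.unlink, ["/tmp/does-not-exist-example"], [ENOENT, ESTALE, EISDIR])
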
Client Log:
-----------
[2017-07-29 16:24:37.496357] W [MSGID: 122033] [ec-common.c:1542:ec_locked] 0-slave-disperse-1: Failed to complete preop lock [Input/output error]
[2017-07-29 16:24:37.496469] W [fuse-bridge.c:1355:fuse_unlink_cbk] 0-glusterfs-fuse: 263167: RMDIR() /.gfid/4b0cb9a6-4240-4bf7-aa2e-0fb922d5c537/symlink_to_files => -1 (Input/output error)
Reducing the severity of the bug. Not in plan for immediate next release.
Description of problem:
=======================
If the slave volume is dispersed {2x(4+2)}, the geo-rep worker dies and restarts with the following traceback.

Master's slave.log:
===================
[2016-02-10 08:44:49.383415] E [repce(/bricks/brick1/master_brick6):207:__call__] RepceClient: call 7565:139988489930560:1455093871.58 (entry_ops) failed on peer with OSError
[2016-02-10 08:44:49.383992] E [syncdutils(/bricks/brick1/master_brick6):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 165, in main
main_i()
File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 662, in main_i
local.service_loop(*[r for r in [remote] if r])
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1439, in service_loop
g3.crawlwrap(oneshot=True)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 610, in crawlwrap
self.crawl()
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1226, in crawl
self.changelogs_batch_process(changes)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1132, in changelogs_batch_process
self.process(batch)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1010, in process
self.process_change(change, done, retry)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 965, in process_change
failures = self.slave.server.entry_ops(entries)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
return self.ins(self.meth, *a)
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
raise res
OSError: [Errno 5] Input/output error: '.gfid/a6d2e818-af15-48b2-ae51-4f114ef757fb/56ba4416%%64JWUS02W9'

On Slave node:
==============
[2016-02-10 08:45:02.999108] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
res = getattr(self.obj, rmeth)(*in_data[2:])
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 678, in entry_ops
er = entry_purge(entry, gfid)
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 603, in entry_purge
er = errno_wrap(os.unlink, [entry], [ENOENT, ESTALE, EISDIR])
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 475, in errno_wrap
return call(*arg)
OSError: [Errno 5] Input/output error: '.gfid/a4117e5c-c02b-499d-ab18-238fb7f1da16/56ba441b%%FU8VFXYZ7O'

Slave client logs {gluster.log}:
================================
[2016-02-10 07:44:14.530667] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-1: Failed to complete preop lock [Input/output error]
[2016-02-10 07:44:14.532831] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-0: Failed to complete preop lock [Stale file handle]
[2016-02-10 07:44:14.535377] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-1: Failed to complete preop lock [Input/output error]
[2016-02-10 07:44:14.535414] W [MSGID: 109005] [dht-selfheal.c:1142:dht_selfheal_dir_mkdir_cbk] 0-slave-dht: Directory selfheal failed: path = /.gfid/ee3d6f67-dc0a-4213-9cf7-8b110ba359aa/level13, gfid = 00000000-0000-0000-0000-000000000000 [Input/output error]

For the file:
=============
842-77acaf5d7bd7> (7439b876-c90d-4bcb-9842-77acaf5d7bd7) [No such file or directory]
[2016-02-10 08:45:01.756947] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-slave-client-11: remote operation failed. Path: <gfid:7439b876-c90d-4bcb-9842-77acaf5d7bd7> (7439b876-c90d-4bcb-9842-77acaf5d7bd7) [No such file or directory]
[2016-02-10 08:45:01.756992] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-slave-client-10: remote operation failed. Path: <gfid:7439b876-c90d-4bcb-9842-77acaf5d7bd7> (7439b876-c90d-4bcb-9842-77acaf5d7bd7) [No such file or directory]
[2016-02-10 08:45:02.991912] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-slave-disperse-0: Operation failed on some subvolumes (up=3F, mask=1F, remaining=0, good=1F, bad=20)
[2016-02-10 08:45:02.998614] W [MSGID: 122033] [ec-common.c:1297:ec_locked] 0-slave-disperse-0: Failed to complete preop lock [Input/output error]
[2016-02-10 08:45:02.998752] W [fuse-bridge.c:1348:fuse_unlink_cbk] 0-glusterfs-fuse: 4416: UNLINK() /.gfid/a4117e5c-c02b-499d-ab18-238fb7f1da16/56ba441b%%FU8VFXYZ7O => -1 (Input/output error)
[2016-02-10 08:45:03.007464] I [MSGID: 122002] [ec-heal.c:2344:ec_heal_do] 0-slave-disperse-0: /.gfid/a4117e5c-c02b-499d-ab18-238fb7f1da16/56ba441b%%N3BGBHSRVG: name heal failed [Transport endpoint is not connected]

One suspected cause is that multiple workers from the master try to perform the unlink concurrently, and the later attempt fails with Input/output error. The files do eventually sync from master to slave, but the worker goes to Faulty for a brief period and restarts.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.5-19

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Create the master volume as DR and the slave volume as EC.
2. Create and start a geo-rep session (a rough CLI sketch of these steps follows after this description).
3. Perform operations on the master: create, chmod, chown, chgrp, hardlink, symlink, truncate, rename, remove.

Actual results:
===============
All the fops succeeded and synced to the slave, but the logs report the traceback and the worker restarts.

Expected results:
=================
All the fops should succeed and sync to the slave without the session going to Faulty.
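
For reference, a rough CLI sketch of the reproduction steps above. Host names, brick paths and brick counts are placeholders, and this assumes "DR" means a distributed-replicate master layout; the geo-replication commands are the standard gluster CLI (passwordless SSH / pem groundwork between master and slave nodes is assumed to be already in place):

# Master cluster: distributed-replicate master volume (hosts/bricks are placeholders)
gluster volume create master replica 2 m1:/bricks/brick1/mb m2:/bricks/brick1/mb m1:/bricks/brick2/mb m2:/bricks/brick2/mb
gluster volume start master

# Slave cluster: 2x(4+2) dispersed slave volume, 12 placeholder bricks
gluster volume create slave disperse 6 redundancy 2 s{1..6}:/bricks/brick1/sb s{1..6}:/bricks/brick2/sb
gluster volume start slave

# Create and start the geo-rep session, then drive the listed fops on a FUSE mount of master
gluster volume geo-replication master s1::slave create push-pem
gluster volume geo-replication master s1::slave start
gluster volume geo-replication master s1::slave status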