Bug 1645482 - Files pending heal in Arbiter volume
Summary: Files pending heal in Arbiter volume
Keywords:
Status: CLOSED DUPLICATE of bug 1645480
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: arbiter
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ravishankar N
QA Contact: Karan Sandha
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-02 10:41 UTC by Anees Patel
Modified: 2018-11-02 10:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-02 10:45:05 UTC
Embargoed:



Description Anees Patel 2018-11-02 10:41:20 UTC
Description of problem:

Hit an issue while testing replica 3 to arbiter conversion: heal does not complete on an arbiter volume 1 x (2 + 1) when a rename operation is performed while a brick is down.

Version-Release number of selected component (if applicable):

# rpm -qa | grep gluster
glusterfs-3.12.2-25.el7rhgs.x86_64
tendrl-gluster-integration-1.5.4-14.el7rhgs.noarch
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-server-3.12.2-25.el7rhgs.x86_64


How reproducible:

Most of the time (3 out of 4 times)

Steps to Reproduce:
1. Disable all client side heals
2. Create a file from the client (fuse mount) while all three bricks are up:
# echo "Hi" >>retry1
3. Bring down brick 2 and append the file
# echo "Hi2" >>retry1
4. Bring down brick 1 and bring up brick 2.
cat on the file now fails:
# cat retry1
cat: retry1: Input/output error
5. Perform rename
# mv retry1 retry2
# ls
retry2
6. After the rename, cat works on retry2, which was previously giving an I/O error:
# cat retry2 
Hi
7. Append to retry2:
# echo "W" >>retry2
The append succeeds this time.
8. Bring all bricks up and issue a heal.
9. # cat retry2 
cat: retry2: Input/output error
10. heal info also lists retry2 as pending heal (a consolidated command sketch of these steps is given below).
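
For reference, a consolidated command sketch of the steps above. It assumes /mnt/newvol/dir2 on the fuse mount as the working directory, that bricks are brought down by killing the brick process (PIDs taken from 'gluster volume status newvol', shown here as placeholders) and brought back with 'gluster volume start newvol force'; the exact method used in the original run is not recorded.

Step 1, disable all client-side heals (matches "Options Reconfigured" in the volume info below):
# gluster volume set newvol cluster.data-self-heal off
# gluster volume set newvol cluster.metadata-self-heal off
# gluster volume set newvol cluster.entry-self-heal off

Step 2, create the file while all three bricks are up:
# cd /mnt/newvol/dir2
# echo "Hi" >>retry1

Step 3, kill the brick-2 process (10.70.46.213) and append:
# kill <brick2-pid>
# echo "Hi2" >>retry1

Step 4, bring brick 2 back up, then bring brick 1 down; reads now fail:
# gluster volume start newvol force
# kill <brick1-pid>
# cat retry1

Steps 5-7, rename, read and append against the surviving copy:
# mv retry1 retry2
# cat retry2
# echo "W" >>retry2

Steps 8-10, bring everything up, trigger heal and re-check:
# gluster volume start newvol force
# gluster volume heal newvol
# cat retry2
# gluster volume heal newvol info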

A similar bug was raised upstream: BZ 1357000.
Actual results:
Tried this scenario four times; it hit the issue three times.
# cat retry1
cat: retry1: No such file or directory
# cat file2
cat: file2: Input/output error
# cat nile2
cat: nile2: Input/output error

Heal is pending for the renamed file:
# gluster v heal newvol info
Brick 10.70.47.130:/bricks/brick1/newvol
Status: Connected
Number of entries: 0

Brick 10.70.46.213:/bricks/brick1/newvol
<gfid:b9291fe7-06ae-4fea-b492-90882e91c299>/file2 
<gfid:b9291fe7-06ae-4fea-b492-90882e91c299>/nile2 
<gfid:b9291fe7-06ae-4fea-b492-90882e91c299>/retry2 
Status: Connected
Number of entries: 3

Brick 10.70.47.38:/bricks/brick1/newvol1
/dir2/file2 
/dir2/nile2 
/dir2/retry2 
Status: Connected
Number of entries: 3
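
The <gfid:...> entries above show the parent directory's gfid instead of a path. Assuming the standard .glusterfs backend layout on the brick (where a directory's gfid entry is a symlink to its real location), the gfid can be mapped back to the directory, e.g. on 10.70.46.213:
# ls -l /bricks/brick1/newvol/.glusterfs/b9/29/b9291fe7-06ae-4fea-b492-90882e91c299
which resolves to /dir2, matching the paths reported by the arbiter brick.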

At step 9 the file cannot be read, so this appears to be a data-unavailability issue.

Expected results:
At step 7 the append should not be allowed, since the brick containing the good copy is down. Heal should also complete, with no files pending heal.

Additional info:

xattrs of the entries:

Brick1
# getfattr -d -m . -e hex /bricks/brick1/newvol/dir2/retry2
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/newvol/dir2/retry2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.gfid=0x5e9646bf976c4cd8832c1da9b10c730f
trusted.gfid2path.3cf2e0799590a8e0=0x62393239316665372d303661652d346665612d623439322d3930383832653931633239392f726574727932

Brick2 
# getfattr -d -m . -e hex /bricks/brick1/newvol/dir2/retry2 
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/newvol/dir2/retry2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.newvol-client-0=0x000000010000000000000000
trusted.gfid=0x5e9646bf976c4cd8832c1da9b10c730f
trusted.gfid2path.3cf2e0799590a8e0=0x62393239316665372d303661652d346665612d623439322d3930383832653931633239392f726574727932

Brick3 / arbiter

# getfattr -d -m . -e hex /bricks/brick1/newvol1/dir2/retry2
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/newvol1/dir2/retry2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.newvol-client-0=0x000000010000000000000000
trusted.afr.newvol-client-1=0x000000010000000000000000
trusted.gfid=0x5e9646bf976c4cd8832c1da9b10c730f
trusted.gfid2path.3cf2e0799590a8e0=0x62393239316665372d303661652d346665612d623439322d3930383832653931633239392f726574727932
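
Reading the changelog xattrs: trusted.afr.<volume>-client-N holds three big-endian 32-bit counters (pending data, metadata and entry operations). A quick bash sketch to split the value seen above (here the brick-2 value of trusted.afr.newvol-client-0):
# afr=000000010000000000000000
# echo "data=$((16#${afr:0:8})) metadata=$((16#${afr:8:8})) entry=$((16#${afr:16:8}))"
data=1 metadata=0 entry=0
So brick 2 blames brick 1 (client-0) for one pending data operation, the arbiter blames both data bricks, and brick 1 carries no trusted.afr xattrs at all; since the arbiter cannot serve as a data source, no readable source can be chosen, which matches the "split-brain observed" errors in the client log below.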


Client logs:

[2018-11-02 09:56:14.772374] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 2218: STAT() /dir2 => -1 (Input/output error)
[2018-11-02 09:56:15.961426] I [rpc-clnt.c:2007:rpc_clnt_reconfig] 4-newvol-client-1: changing port to 49162 (from 0)
[2018-11-02 09:56:15.967799] I [MSGID: 114057] [client-handshake.c:1397:select_server_supported_programs] 4-newvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-11-02 09:56:15.972517] I [MSGID: 114046] [client-handshake.c:1150:client_setvolume_cbk] 4-newvol-client-1: Connected to newvol-client-1, attached to remote volume '/bricks/brick1/newvol'.
[2018-11-02 09:56:15.972553] I [MSGID: 114047] [client-handshake.c:1161:client_setvolume_cbk] 4-newvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2018-11-02 09:56:15.972704] I [MSGID: 108002] [afr-common.c:5164:afr_notify] 4-newvol-replicate-0: Client-quorum is met
[2018-11-02 09:56:15.973448] I [MSGID: 114035] [client-handshake.c:121:client_set_lk_version_cbk] 4-newvol-client-1: Server lk version = 1
[2018-11-02 09:56:42.620808] W [fuse-bridge.c:1396:fuse_err_cbk] 0-glusterfs-fuse: 2285: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2018-11-02 09:57:13.664335] I [rpc-clnt.c:2007:rpc_clnt_reconfig] 4-newvol-client-0: changing port to 49158 (from 0)
[2018-11-02 09:57:13.669657] I [MSGID: 114057] [client-handshake.c:1397:select_server_supported_programs] 4-newvol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-11-02 09:57:13.670849] I [MSGID: 114046] [client-handshake.c:1150:client_setvolume_cbk] 4-newvol-client-0: Connected to newvol-client-0, attached to remote volume '/bricks/brick1/newvol'.
[2018-11-02 09:57:13.670888] I [MSGID: 114047] [client-handshake.c:1161:client_setvolume_cbk] 4-newvol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2018-11-02 09:57:13.671372] I [MSGID: 114035] [client-handshake.c:121:client_set_lk_version_cbk] 4-newvol-client-0: Server lk version = 1
[2018-11-02 09:57:32.177276] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 4-newvol-replicate-0: Failing STAT on gfid 5e9646bf-976c-4cd8-832c-1da9b10c730f: split-brain observed. [Input/output error]
[2018-11-02 09:57:32.177385] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 2294: STAT() /dir2/retry2 => -1 (Input/output error)
[2018-11-02 10:16:43.798183] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 4-newvol-replicate-0: Failing READ on gfid 5e9646bf-976c-4cd8-832c-1da9b10c730f: split-brain observed. [Input/output error]
[2018-11-02 10:16:43.798368] W [fuse-bridge.c:2337:fuse_readv_cbk] 0-glusterfs-fuse: 2404: READ => -1 gfid=5e9646bf-976c-4cd8-832c-1da9b10c730f fd=0x7f552000f3b0 (Input/output error)
[2018-11-02 10:17:07.922620] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 4-newvol-replicate-0: Failing READ on gfid 53ba229a-a933-48a9-a7c9-2abd99c1e557: split-brain observed. [Input/output error]
[2018-11-02 10:17:07.922731] W [fuse-bridge.c:2337:fuse_readv_cbk] 0-glusterfs-fuse: 2417: READ => -1 gfid=53ba229a-a933-48a9-a7c9-2abd99c1e557 fd=0x7f552002ac00 (Input/output error)
[2018-11-02 10:19:49.578790] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 4-newvol-replicate-0: Failing READ on gfid f20e1de8-c29f-4bd9-9d2b-c9c56af6ce33: split-brain observed. [Input/output error]
[2018-11-02 10:19:49.578953] W [fuse-bridge.c:2337:fuse_readv_cbk] 0-glusterfs-fuse: 2433: READ => -1 gfid=f20e1de8-c29f-4bd9-9d2b-c9c56af6ce33 fd=0x7f552001ad20 (Input/output error)

Volume Info:
# gluster v info newvol
 
Volume Name: newvol
Type: Replicate
Volume ID: 05cd2dec-7a61-4eb0-88cd-b5b829bed17f
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.130:/bricks/brick1/newvol
Brick2: 10.70.46.213:/bricks/brick1/newvol
Brick3: 10.70.47.38:/bricks/brick1/newvol1 (arbiter)
Options Reconfigured:
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Comment 2 Anees Patel 2018-11-02 10:45:05 UTC
This bug got created twice due to a browser refresh;
hence closing it as a duplicate of BZ#1645480.

*** This bug has been marked as a duplicate of bug 1645480 ***

