+++ This bug was initially created as a clone of Bug #1356974 +++

Description of problem:
======================
When a file with non-zero size has pending data heals and the source data brick is down, stat works on the file from the mount point but displays wrong information with respect to the size. Applications consuming these details can end up getting wrong info. A zero size in stat makes sense for an empty file, but for a file with contents we must either return the correct metadata (if a metadata heal can fix it) or fail with EIO.

If we return EIO, I would also suggest logging the probable reason, e.g. that a data heal is pending (in the fuse mount and shd logs).

Also, note that when we bring down both data brick 1 and data brick 2 (with only the arbiter brick up), a stat returns EIO, the stated reason being that we don't want to display wrong info. Hence this bug is being raised, as we are displaying spurious details.

Version-Release number of selected component (if applicable):
===================
glusterfs 3.9dev built on Jul 11 2016 10:04:54

How reproducible:
==================
Always

Steps to Reproduce:
====================
1. Create a 1x(2+1) replicate arbiter volume.
2. Mount the volume via fuse.
3. Create a directory, say dir1.
4. Bring down the first data brick.
5. Create a file, say f1, under dir1 with some contents.
6. Bring down the other data brick too.
7. Bring up the first data brick which was down.
8. Check heal info and trigger a manual heal.
9. Do an ls -l on the mount; the file f1 is shown.
10. Now do a stat on the file; it can be seen as below:

  File: ‘f1’
  Size: 0             Blocks: 0          IO Block: 131072  regular empty file
Device: 2ah/42d       Inode: 13754313043819253517  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:fusefs_t:s0
Access: 2016-07-15 17:17:29.467779000 +0530
Modify: 2016-07-15 17:17:29.467779000 +0530
Change: 2016-07-15 17:17:29.471878457 +0530
 Birth: -

Actual results:
================
The file shows as zero size.

Expected results:
===================
Return EIO until the data heal happens, or collect and display the right file size.

Additional info:
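The reproducer steps can be sketched roughly as below. This is an illustrative sketch only; the volume name, hostnames and brick paths are placeholders, not taken from the original report, and bringing a brick down is done by killing its brick process (PID visible in `gluster volume status`):

```sh
# Placeholder volume/hosts/paths; adjust to your test setup.
gluster volume create testvol replica 3 arbiter 1 \
    host1:/bricks/b1 host2:/bricks/b2 host3:/bricks/arb
gluster volume start testvol
mount -t glusterfs host1:/testvol /mnt/testvol

mkdir /mnt/testvol/dir1
# kill the brick process for host1:/bricks/b1 (first data brick)
echo "some contents" > /mnt/testvol/dir1/f1
# kill the brick process for host2:/bricks/b2, then restart
# only the first data brick
gluster volume heal testvol info
gluster volume heal testvol
ls -l /mnt/testvol/dir1
stat /mnt/testvol/dir1/f1   # shows Size: 0 although f1 has contents
```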
After some code-reading and debugging, I found that this is not a bug specific to arbiter. In AFR we do have checks to fail afr_stat() or any read transaction when the only good copy is down. But the problem here is that the stat is not reaching AFR, it is getting served from the kernel cache as a part of the lookup response sent by AFR. For example, if we fuse mount the volume with attribute-timeout=0 and entry-timeout=0, we will get EIO for the reproducer given in the description because there is no kernel caching and afr_stat will be hit, which will fail the fop with EIO. Note: I'm not changing the component from arbiter to replicate because I think the acks might be lost if I do that.
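For reference, the kernel attribute/entry caching mentioned above can be disabled at mount time, which makes the stat reach AFR and fail with EIO as described (the server, volume and path names here are illustrative):

```sh
# With attribute and entry caching disabled, stat is not served from the
# kernel cache; afr_stat() is hit and fails the fop with EIO while the
# only good copy is down.
mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 \
    server:/volname /mnt/volname
stat /mnt/volname/dir1/f1
```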
On a fuse mount, stat is converted to a lookup. If lookup on such files fails, unlink on them will never happen, especially for files in split-brain. While it does show the file's stat output wrongly, nothing much can be done with the contents of such data anyway:

root@dhcp35-190 - ~
22:26:41 :) ⚡ gluster v heal r2 info
Brick localhost.localdomain:/home/gfs/r2_0
Status: Connected
Number of entries: 0

Brick localhost.localdomain:/home/gfs/r2_1
Status: Transport endpoint is not connected
Number of entries: -

Brick localhost.localdomain:/home/gfs/r2_2
/di1/a
Status: Connected
Number of entries: 1

root@dhcp35-190 - ~
22:26:46 :) ⚡ ls -l /mnt/r2/di1
total 0
-rw-r--r--. 1 root root 0 Nov 20 22:24 a

root@dhcp35-190 - ~
22:26:53 :) ⚡ cp -r /mnt/r2/di1/ /mnt/r2/di2
cp: error reading '/mnt/r2/di1/a': Input/output error   <<-----

root@dhcp35-190 - ~
22:30:41 :( ⚡ truncate -s 0 /mnt/r2/di1/a
truncate: failed to truncate '/mnt/r2/di1/a' at 0 bytes: Input/output error

root@dhcp35-190 - ~
22:30:48 :( ⚡ echo abc > /mnt/r2/di1/a
-bash: /mnt/r2/di1/a: Input/output error

root@dhcp35-190 - ~
22:30:56 :( ⚡ unlink /mnt/r2/di1/a   <<---- Only deletion is successful.

Ravi, I was under the impression that the 'cp -r' would succeed; it just occurred to me that I wasn't thinking through it correctly. I think we can fix this a bit later as well. Could you check whether the above cases work the same for other protocols like NFS/Samba? If yes, we can defer this.

Pranith
Tested using gluster NFS and NFS-Ganesha mounts; confirmed that reads and writes fail with EIO as expected.
In light of comment #5, and after discussing with Nag and Pranith, this BZ can be moved out to 3.2.0 or beyond. To elaborate: as described in the RCA (comment #4), the incorrect stat size is due to caching in the kernel. But since stat is the only command that succeeds, i.e. reads/writes etc. would still fail with EIO (because they hit AFR) at the application level, we should be good. I am not providing doc text since no workaround is needed per se; once the bricks come back up, heal and I/O can continue.