Bug 1802013

Summary: read() returns more than file size when using direct I/O
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Krutika Dhananjay <kdhananj>
Component: sharding
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: urgent
Docs Contact:
Priority: urgent
Version: unspecified
CC: atumball, bugs, csaba, kdhananj, khiremat, kwolf, nsoffer, pkarampu, pprakash, puebele, rabhat, rgowdapp, rhs-bugs, rkavunga, rkothiya, sabose, sasundar, sheggodu, storage-qa-internal, teigland, tnisan, vjuranek
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: RHGS 3.5.z Batch Update 2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-6.0-34
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1738419
Clones: 1802016 (view as bug list)
Environment:
Last Closed: 2020-06-16 06:19:37 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1740316
Bug Blocks: 1801892, 1802016

Description Krutika Dhananjay 2020-02-12 07:40:57 UTC
+++ This bug was initially created as a clone of Bug #1738419 +++

+++ This bug was initially created as a clone of Bug #1737141 +++

Description of problem:

When using direct I/O, reading from a file returns more data than the file size,
padding the file data with zeroes.

Here is an example.

## On a host mounting gluster using fuse

$ pwd
/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/dom_md


$ mount | grep glusterfs
voodoo4.tlv.redhat.com:/gv0 on /rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)


$ stat metadata 
  File: metadata
  Size: 501       	Blocks: 1          IO Block: 131072 regular file
Device: 31h/49d	Inode: 13313776956941938127  Links: 1
Access: (0644/-rw-r--r--)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
Context: system_u:object_r:fusefs_t:s0
Access: 2019-08-01 22:21:49.186381528 +0300
Modify: 2019-08-01 22:21:49.427404135 +0300
Change: 2019-08-01 22:21:49.969739575 +0300
 Birth: -


$ cat metadata 
ALIGNMENT=1048576
BLOCK_SIZE=4096
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=4k-gluster
POOL_DOMAINS=de566475-5b67-4987-abf3-3dc98083b44c:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=44cfb532-3144-48bd-a08c-83065a5a1032
REMOTE_PATH=voodoo4.tlv.redhat.com:/gv0
ROLE=Master
SDUUID=de566475-5b67-4987-abf3-3dc98083b44c
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=3d1cb836f4c93679fc5a4e7218425afe473e3cfa


$ dd if=metadata bs=4096 count=1 of=/dev/null
0+1 records in
0+1 records out
501 bytes copied, 0.000340298 s, 1.5 MB/s


$ dd if=metadata bs=4096 count=1 of=/dev/null iflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00398529 s, 1.0 MB/s

Checking the copied data shows that the actual file content is padded
with zeroes to 4096 bytes.
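
One way to confirm the padding (a sketch, not output from the affected setup; the /tmp path and scratch file name are arbitrary) is to capture the direct read into a scratch file and inspect the bytes past the real end of the file:

$ dd if=metadata bs=4096 count=1 iflag=direct of=/tmp/metadata.direct
$ stat -c %s /tmp/metadata.direct                  # 4096 on the affected mount instead of 501
$ od -An -tx1 -j 501 /tmp/metadata.direct | head   # everything past offset 501 is zeroes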


## On one of the gluster nodes

$ pwd
/export/vdo0/brick/de566475-5b67-4987-abf3-3dc98083b44c/dom_md


$ stat metadata 
  File: metadata
  Size: 501       	Blocks: 16         IO Block: 4096   regular file
Device: fd02h/64770d	Inode: 149         Links: 2
Access: (0644/-rw-r--r--)  Uid: (   36/ UNKNOWN)   Gid: (   36/     kvm)
Context: system_u:object_r:usr_t:s0
Access: 2019-08-01 22:21:50.380425478 +0300
Modify: 2019-08-01 22:21:49.427397589 +0300
Change: 2019-08-01 22:21:50.374425302 +0300
 Birth: -


$ dd if=metadata bs=4096 count=1 of=/dev/null
0+1 records in
0+1 records out
501 bytes copied, 0.000991636 s, 505 kB/s


$ dd if=metadata bs=4096 count=1 of=/dev/null iflag=direct
0+1 records in
0+1 records out
501 bytes copied, 0.0011381 s, 440 kB/s

Since the same read directly from the brick returns the correct 501 bytes even with direct I/O, the issue is in gluster rather than in the underlying filesystem.


# gluster volume info gv0
 
Volume Name: gv0
Type: Replicate
Volume ID: cbc5a2ad-7246-42fc-a78f-70175fb7bf22
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: voodoo4.tlv.redhat.com:/export/vdo0/brick
Brick2: voodoo5.tlv.redhat.com:/export/vdo0/brick
Brick3: voodoo8.tlv.redhat.com:/export/vdo0/brick (arbiter)
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: disable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on


$ xfs_info /export/vdo0
meta-data=/dev/mapper/vdo0       isize=512    agcount=4, agsize=6553600 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


Version-Release number of selected component (if applicable):

Server:

$ rpm -qa | grep glusterfs
glusterfs-libs-6.4-1.fc29.x86_64
glusterfs-api-6.4-1.fc29.x86_64
glusterfs-client-xlators-6.4-1.fc29.x86_64
glusterfs-fuse-6.4-1.fc29.x86_64
glusterfs-6.4-1.fc29.x86_64
glusterfs-cli-6.4-1.fc29.x86_64
glusterfs-server-6.4-1.fc29.x86_64

Client:

$ rpm -qa | grep glusterfs
glusterfs-client-xlators-6.4-1.fc29.x86_64
glusterfs-6.4-1.fc29.x86_64
glusterfs-rdma-6.4-1.fc29.x86_64
glusterfs-cli-6.4-1.fc29.x86_64
glusterfs-libs-6.4-1.fc29.x86_64
glusterfs-fuse-6.4-1.fc29.x86_64
glusterfs-api-6.4-1.fc29.x86_64


How reproducible:
Always.

Steps to Reproduce:
1. Provision a gluster volume over VDO (not checked without VDO)
2. Create a file of 501 bytes
3. Read the file using direct I/O (see the sketch below)
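
A minimal reproducer sketch of these steps (the mount point and file name below are hypothetical):

# On a FUSE mount of a sharded gluster volume
$ head -c 501 /dev/urandom > /mnt/gv0/unaligned-file
$ dd if=/mnt/gv0/unaligned-file bs=4096 count=1 iflag=direct of=/dev/null
# Affected versions report "4096 bytes copied"; the expected output is "501 bytes copied"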

Actual results:
read() returns 4096 bytes, padding the file data with zeroes

Expected results:
read() returns actual file data (501 bytes)

--- Additional comment from Nir Soffer on 2019-08-02 19:21:20 UTC ---

David, do you think this can affect sanlock?

--- Additional comment from Nir Soffer on 2019-08-02 19:25:02 UTC ---

Kevin, do you think this can affect qemu/qemu-img?

--- Additional comment from Amar Tumballi on 2019-08-05 05:33:57 UTC ---

@Nir, thanks for the report. We will look into this.

--- Additional comment from Kevin Wolf on 2019-08-05 09:16:16 UTC ---

(In reply to Nir Soffer from comment #2)
> Kevin, do you think this can affect qemu/qemu-img?

This is not a problem for QEMU as long as the file size is correct. If gluster didn't do the zero padding, QEMU would do it internally.

In fact, fixing this in gluster may break the case of unaligned image sizes with QEMU, because QEMU rounds the image size up to sector (512-byte) granularity and its gluster driver turns short reads into errors. This would actually affect non-O_DIRECT access too, which already seems to behave this way, so can you give this a quick test?
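
For illustration only (shell arithmetic, not a QEMU invocation): the sector-rounded size QEMU would use for the 501-byte file above is

$ echo $(( (501 + 511) / 512 * 512 ))
512

so gluster would have to satisfy a 512-byte read against a 501-byte file, and a 501-byte short read would be turned into an error by the driver.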

--- Additional comment from David Teigland on 2019-08-05 15:08:32 UTC ---

(In reply to Nir Soffer from comment #1)
> David, do you think this can affect sanlock?

I don't think so.  sanlock doesn't use any space that it didn't first write to initialize.

--- Additional comment from Worker Ant on 2019-08-08 05:56:04 UTC ---

REVIEW: https://review.gluster.org/23175 (features/shard: Send correct size when reads are sent beyond file size) posted (#1) for review on master by Krutika Dhananjay

--- Additional comment from Worker Ant on 2019-08-12 13:30:56 UTC ---

REVIEW: https://review.gluster.org/23175 (features/shard: Send correct size when reads are sent beyond file size) merged (#3) on master by Krutika Dhananjay

Comment 1 Sahina Bose 2020-02-25 12:20:51 UTC
Prasanth, can you provide qa ack for this to take it into 3.5.2? We need this for RHHI-V (given the changes added for 4K block support in recent releases)

Comment 2 Prasanth 2020-02-27 12:32:38 UTC
(In reply to Sahina Bose from comment #1)
> Prasanth, can you provide qa ack for this to take it into 3.5.2? We need this
> for RHHI-V (given the changes added for 4K block support in recent releases)

Sahina, considering the importance of the requirement, I'm providing qa_ack+ for this BZ for 3.5.2

Comment 13 SATHEESARAN 2020-06-06 11:57:09 UTC
Verified with RHVH 4.4.1 and RHGS 3.5.2 - glusterfs-6.0-37.el8rhgs with the following steps:

[root@ ~]# ls /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_vmstore/977e8d86-afd8-46c1-bf15-ed19d3cb6ed1/dom_md/
ids  inbox  leases  metadata  outbox  xleases

[root@ ~ ]# stat metadata 
  File: metadata
  Size: 391       	Blocks: 1          IO Block: 131072 regular file
Device: 34h/52d	Inode: 10208956554895298979  Links: 1
Access: (0644/-rw-r--r--)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
Context: system_u:object_r:fusefs_t:s0
Access: 2020-06-03 18:59:17.547192000 +0000
Modify: 2020-06-03 18:59:17.548192011 +0000
Change: 2020-06-03 18:59:17.600192582 +0000
 Birth: -

[root@ ~ ]# cat metadata 
ALIGNMENT=1048576
BLOCK_SIZE=4096
CLASS=Data
DESCRIPTION=vmstore
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=0f3fc724-a5ca-11ea-a7a6-004755204901
REMOTE_PATH=rhsqa-grafton7.lab.eng.blr.redhat.com:/vmstore
ROLE=Regular
SDUUID=977e8d86-afd8-46c1-bf15-ed19d3cb6ed1
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=771d06cb29cd1ee6a7e5b4c72be119cd5078a87e

[root@ ~]# dd if=metadata of=/dev/null bs=4096 count=1
0+1 records in
0+1 records out
391 bytes copied, 0.000101469 s, 3.9 MB/s
[root@ ~]# dd if=metadata of=/dev/null bs=4096 count=1 iflag=direct
0+1 records in
0+1 records out
391 bytes copied, 0.00143502 s, 272 kB/s


So no zeroes are padded: the direct I/O read returns only the actual file data (391 bytes).
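
As an additional quick check (not part of the original verification steps), counting the bytes actually returned by the direct read should also print 391 on the fixed build:

$ dd if=metadata bs=4096 count=1 iflag=direct 2>/dev/null | wc -c
391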

Comment 15 errata-xmlrpc 2020-06-16 06:19:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2572