Bug 1753843 - [Disperse volume]: Regression in IO performance seen in sequential read for large file
Summary: [Disperse volume]: Regression in IO performance seen in sequential read for large file
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: read-ahead
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-20 04:12 UTC by Raghavendra G
Modified: 2019-10-11 06:04 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-11 06:04:59 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments


Links
System: Gluster.org Gerrit
ID: 23454
Private: 0
Priority: None
Status: Merged
Summary: performance/read-ahead: update expected offset before unwinding read response
Last Updated: 2019-10-11 06:04:58 UTC

Comment 1 Raghavendra G 2019-09-20 04:13:41 UTC
Description of problem:
While running the perf test for large files, a regression was seen in sequential read performance.


=================================================================================================
Build           glusterfs-3.12.2-47     glusterfs-6.0-9         glusterfs-6.0-9 vs 
                (RHEL 7.7)                                      glusterfs-3.12.2-47 (RHEL 7.7)
=================================================================================================
Seq write       3509108.198             3575702.62               1.90%
Seq read        2894066.582             2531160.094             -12.54%  <<<<<<<<
Random read     107318.61               106771.402              -0.51%
Random write    551146.86               535243.422              -2.89%
=================================================================================================

Comment 2 Raghavendra G 2019-09-20 04:14:37 UTC
I've checked the information provided by one of the runs (sequential-read-and-reread-profile-fuse-large-file-test-run0.txt) and I've found the reason (though not the cause) of the regression:


On 3.12, the first sequential read reads a total of 171.3 GiB based on iozone data. Based on gluster profile info, the total amount of data read is 172.8 GiB, which matches quite well.

However on 6.0, iozone reports 177.9 GiB read, while profile info reports 266.4 GiB read. This means that almost 50% more data is read from bricks.
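(For reference: 266.4 GiB / 177.9 GiB ≈ 1.50, so the bricks served roughly 50% more data than iozone actually requested.)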

I will try to determine the cause of this, but right now I think there are two possible candidates:

1. ec's read-policy is set to round-robin

   Apparently this is not the case, as RHGS 3.5.0 has "gfid-hash" as the default value and the option doesn't appear as modified in the volume options.

2. Some issue in read-ahead xlator or other xlators that could send reads (maybe io-cache ?)

Xavi

Comment 4 Raghavendra G 2019-09-20 04:15:25 UTC
I've done some tests and the reason for the increased amount of reads seems to be the read-ahead xlator.

I've checked the amount of data read when an 8 GiB file is read with dd. With the read-ahead xlator disabled, profile info reports 8 GiB read; however, with read-ahead enabled it reports more than 12 GiB read.

Looking at the trace logs of the read-ahead xlator I see the following sequence:

[2019-09-19 11:32:23.347873] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39321600 for size=131072
[2019-09-19 11:32:23.347877] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39321600) when page_count=4
[2019-09-19 11:32:23.347881] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39321600.
[2019-09-19 11:32:23.347886] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=39714816
[2019-09-19 11:32:23.348248] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=39845888
[2019-09-19 11:32:23.348608] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=39976960
[2019-09-19 11:32:23.349153] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40108032
[2019-09-19 11:32:23.350774] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39452672 for size=131072
[2019-09-19 11:32:23.350783] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39452672) when page_count=4
[2019-09-19 11:32:23.350788] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39452672.
[2019-09-19 11:32:23.351186] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39583744 for size=131072
[2019-09-19 11:32:23.351192] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39583744) when page_count=4
[2019-09-19 11:32:23.351197] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39583744.
[2019-09-19 11:32:23.351573] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39714816 for size=131072
[2019-09-19 11:32:23.351578] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39714816) when page_count=4
[2019-09-19 11:32:23.351582] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39714816.
[2019-09-19 11:32:23.351900] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39845888 for size=131072
[2019-09-19 11:32:23.351904] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39845888) when page_count=4
[2019-09-19 11:32:23.351908] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39845888.
[2019-09-19 11:32:23.351913] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40239104
[2019-09-19 11:32:23.352229] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40370176
[2019-09-19 11:32:23.352797] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40501248
[2019-09-19 11:32:23.353324] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40632320
[2019-09-19 11:32:23.354954] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39976960 for size=131072
[2019-09-19 11:32:23.354969] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39976960) when page_count=4
[2019-09-19 11:32:23.354983] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39976960.
[2019-09-19 11:32:23.355003] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40108032 for size=131072
[2019-09-19 11:32:23.355019] T [MSGID: 0] [read-ahead.c:473:ra_readv] 0-ec-read-ahead: unexpected offset (39976960 != 40108032) resetting
[2019-09-19 11:32:23.355035] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40108032.
[2019-09-19 11:32:23.355047] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40108032.
[2019-09-19 11:32:23.355839] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40239104 for size=131072
[2019-09-19 11:32:23.355843] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (40239104) when page_count=0
[2019-09-19 11:32:23.355850] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40239104.
[2019-09-19 11:32:23.355859] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40239104.
[2019-09-19 11:32:23.357077] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40370176 for size=131072
[2019-09-19 11:32:23.357088] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (40370176) when page_count=1
[2019-09-19 11:32:23.357101] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40370176.
[2019-09-19 11:32:23.357110] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40370176.
[2019-09-19 11:32:23.357562] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40501248
[2019-09-19 11:32:23.358147] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40501248 for size=131072
[2019-09-19 11:32:23.358155] T [MSGID: 0] [read-ahead.c:473:ra_readv] 0-ec-read-ahead: unexpected offset (40370176 != 40501248) resetting
[2019-09-19 11:32:23.358171] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40501248.
[2019-09-19 11:32:23.358178] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40501248.

My understanding of what happens is that 'dd' sends the requests one by one, sequentially. The read-ahead xlator starts reading more pages than requested, which makes it possible for future requests to be served directly from the cached data. At some point the kernel also detects the sequential access and starts sending more read requests in parallel. Many of these requests are served very fast by the read-ahead xlator, but after a few requests the cache is empty.

Apparently there's a race when this happens:

At 2019-09-19 11:32:23.354954 a new request at offset 39976960 arrives. It's the expected offset and the cache contains that page, so it's served immediately. However, at 2019-09-19 11:32:23.355003 a new request arrives for the next offset (40108032), but read-ahead has not yet updated the expected offset, so it incorrectly believes that the read is not sequential, drops the cached data and starts sending requests to the bricks again. Apparently this causes the same offsets to be read more than once, increasing the read volume on the bricks.
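
Below is a minimal, single-threaded C sketch of this ordering problem. It is not the actual read-ahead.c code: the struct and the names (ra_state, expected_offset, PAGE_SIZE) are illustrative assumptions that only model the behaviour visible in the trace above, where the reply for offset 39976960 is delivered before the expected offset is advanced, so the back-to-back request at 40108032 looks non-sequential and resets the cache.

/* Minimal single-file sketch (NOT the actual read-ahead.c code) modelling the
 * race.  The types and names below are illustrative assumptions only.
 * Compile with: cc -o ra_race ra_race.c */
#include <stdio.h>

#define PAGE_SIZE 131072LL

struct ra_state {
    long long expected_offset;   /* next offset expected for a sequential read */
    int       page_count;        /* pages currently read ahead                 */
    long long extra_brick_bytes; /* bytes re-read from bricks after a reset    */
};

/* Models ra_readv(): decide whether the incoming read looks sequential. */
static void ra_readv(struct ra_state *ra, long long offset)
{
    if (offset != ra->expected_offset) {
        /* Looks non-sequential: drop the read-ahead pages and go to the
         * bricks, so already-prefetched data gets fetched again later. */
        printf("unexpected offset (%lld != %lld) resetting -> MISS\n",
               ra->expected_offset, offset);
        ra->page_count = 0;
        ra->extra_brick_bytes += PAGE_SIZE;
    } else {
        printf("expected offset (%lld) when page_count=%d -> HIT\n",
               offset, ra->page_count);
    }
}

/* Models the buggy ordering in the read callback: the response is unwound
 * (here: the next sequential read is issued immediately, as the kernel does)
 * BEFORE the expected offset is updated. */
static void ra_readv_cbk_buggy(struct ra_state *ra, long long offset)
{
    ra_readv(ra, offset + PAGE_SIZE);          /* next read races the update */
    ra->expected_offset = offset + PAGE_SIZE;  /* updated too late           */
}

int main(void)
{
    struct ra_state ra = { .expected_offset = 39976960, .page_count = 4 };

    ra_readv(&ra, 39976960);           /* served from cache, as in the trace  */
    ra_readv_cbk_buggy(&ra, 39976960); /* reply unwound, then offset updated  */

    printf("extra bytes re-read from bricks: %lld\n", ra.extra_brick_bytes);
    return 0;
}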

Raghavendra, can you take a look to see if it's an easy change?

Comment 5 Raghavendra G 2019-09-20 04:15:55 UTC
The bug in read-ahead is fixed by https://review.gluster.org/#/c/glusterfs/+/23454/
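
Conceptually, the patch reorders things so that the expected offset is advanced before the read response is unwound. Reusing the illustrative types from the sketch in comment 4 (again an assumption-laden sketch, not the literal patch), the corrected callback would look roughly like this:

/* Same illustrative types as the comment 4 sketch; not the literal patch. */
static void ra_readv_cbk_fixed(struct ra_state *ra, long long offset)
{
    ra->expected_offset = offset + PAGE_SIZE; /* advance the offset first ... */
    ra_readv(ra, offset + PAGE_SIZE);         /* ... then unwind the reply;
                                                 the next back-to-back read now
                                                 sees the updated offset and is
                                                 still treated as sequential */
}

With this ordering, the request at offset 40108032 in the trace above stays sequential (a HIT in the sketch) instead of resetting the read-ahead cache.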

Comment 6 Worker Ant 2019-09-20 04:21:38 UTC
REVIEW: https://review.gluster.org/23454 (performance/read-ahead: update expected offset before unwinding read response) posted (#2) for review on master by Raghavendra G

Comment 7 Worker Ant 2019-10-11 06:04:59 UTC
REVIEW: https://review.gluster.org/23454 (performance/read-ahead: update expected offset before unwinding read response) merged (#3) on master by Raghavendra G

