Description of problem:
While running the perf test for large files, a regression was seen in sequential read.

=================================================================================================
Build           glusterfs-3.12.2-47   glusterfs-6.0-9   glusterfs-6.0-9 vs
                (RHEL 7.7)            (RHEL 7.7)        glusterfs-3.12.2-47
=================================================================================================
Seq write       3509108.198           3575702.62          1.90%
Seq read        2894066.582           2531160.094       -12.54%  <<<<<<<<
Random read      107318.61             106771.402         -0.51%
Random write     551146.86             535243.422         -2.89%
=================================================================================================
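As a sanity check, the percentage column of the table can be reproduced from the raw throughput numbers (units are as reported by the test harness; the dictionary names below are just illustrative):

```python
# Recompute the percentage deltas in the table above from the raw numbers.
baseline = {   # glusterfs-3.12.2-47 (RHEL 7.7)
    "Seq write": 3509108.198,
    "Seq read": 2894066.582,
    "Random read": 107318.61,
    "Random write": 551146.86,
}
candidate = {   # glusterfs-6.0-9 (RHEL 7.7)
    "Seq write": 3575702.62,
    "Seq read": 2531160.094,
    "Random read": 106771.402,
    "Random write": 535243.422,
}

for op in baseline:
    delta = (candidate[op] - baseline[op]) / baseline[op] * 100
    print(f"{op}: {delta:+.2f}%")
```

All four deltas match the table; sequential read is the only one outside normal run-to-run noise.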
I've checked the information provided by one of the runs (sequential-read-and-reread-profile-fuse-large-file-test-run0.txt) and I've found the reason (though not the cause) of the regression:

On 3.12, the first sequential read reads a total of 171.3 GiB according to iozone. Gluster profile info reports 172.8 GiB read from the bricks, which matches quite well. On 6.0, however, iozone reports 177.9 GiB read while profile info reports 266.4 GiB, which means that almost 50% more data is read from the bricks.

I will try to determine the cause of this, but right now I see two possible candidates:

1. ec's read-policy is set to round-robin. Apparently this is not the case, as RHGS 3.5.0 has "gfid-hash" as the default value and the option doesn't appear as modified in the volume options.

2. Some issue in the read-ahead xlator, or in another xlator that could send extra reads (maybe io-cache?).

Xavi
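The mismatch between what the application read and what the bricks served can be quantified directly from the figures above (values in GiB; the helper name is just illustrative):

```python
# Compare what the application read (iozone) against what the bricks
# actually served (gluster profile), using the GiB figures above.
def brick_overhead_pct(profile_gib, iozone_gib):
    return (profile_gib / iozone_gib - 1) * 100

print(f"3.12: {brick_overhead_pct(172.8, 171.3):.1f}% extra data read from bricks")
print(f"6.0:  {brick_overhead_pct(266.4, 177.9):.1f}% extra data read from bricks")
```

Roughly 0.9% overhead on 3.12 versus roughly 49.7% on 6.0, which is the amplification the rest of this thread tracks down.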
I've done some tests and the reason for the increased amount of data read seems to be the read-ahead xlator. I've checked the amount of data read from the bricks when an 8 GiB file is read with dd: with the read-ahead xlator disabled, profile info reports 8 GiB read, but with read-ahead enabled it reports more than 12 GiB.

Looking at the trace logs of the read-ahead xlator I see the following sequence:

[2019-09-19 11:32:23.347873] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39321600 for size=131072
[2019-09-19 11:32:23.347877] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39321600) when page_count=4
[2019-09-19 11:32:23.347881] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39321600.
[2019-09-19 11:32:23.347886] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=39714816
[2019-09-19 11:32:23.348248] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=39845888
[2019-09-19 11:32:23.348608] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=39976960
[2019-09-19 11:32:23.349153] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40108032
[2019-09-19 11:32:23.350774] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39452672 for size=131072
[2019-09-19 11:32:23.350783] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39452672) when page_count=4
[2019-09-19 11:32:23.350788] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39452672.
[2019-09-19 11:32:23.351186] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39583744 for size=131072
[2019-09-19 11:32:23.351192] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39583744) when page_count=4
[2019-09-19 11:32:23.351197] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39583744.
[2019-09-19 11:32:23.351573] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39714816 for size=131072
[2019-09-19 11:32:23.351578] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39714816) when page_count=4
[2019-09-19 11:32:23.351582] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39714816.
[2019-09-19 11:32:23.351900] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39845888 for size=131072
[2019-09-19 11:32:23.351904] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39845888) when page_count=4
[2019-09-19 11:32:23.351908] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39845888.
[2019-09-19 11:32:23.351913] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40239104
[2019-09-19 11:32:23.352229] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40370176
[2019-09-19 11:32:23.352797] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40501248
[2019-09-19 11:32:23.353324] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40632320
[2019-09-19 11:32:23.354954] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=39976960 for size=131072
[2019-09-19 11:32:23.354969] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (39976960) when page_count=4
[2019-09-19 11:32:23.354983] T [MSGID: 0] [read-ahead.c:382:dispatch_requests] 0-ec-read-ahead: HIT at offset=39976960.
[2019-09-19 11:32:23.355003] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40108032 for size=131072
[2019-09-19 11:32:23.355019] T [MSGID: 0] [read-ahead.c:473:ra_readv] 0-ec-read-ahead: unexpected offset (39976960 != 40108032) resetting
[2019-09-19 11:32:23.355035] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40108032.
[2019-09-19 11:32:23.355047] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40108032.
[2019-09-19 11:32:23.355839] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40239104 for size=131072
[2019-09-19 11:32:23.355843] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (40239104) when page_count=0
[2019-09-19 11:32:23.355850] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40239104.
[2019-09-19 11:32:23.355859] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40239104.
[2019-09-19 11:32:23.357077] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40370176 for size=131072
[2019-09-19 11:32:23.357088] T [MSGID: 0] [read-ahead.c:479:ra_readv] 0-ec-read-ahead: expected offset (40370176) when page_count=1
[2019-09-19 11:32:23.357101] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40370176.
[2019-09-19 11:32:23.357110] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40370176.
[2019-09-19 11:32:23.357562] T [MSGID: 0] [read-ahead.c:317:read_ahead] 0-ec-read-ahead: RA at offset=40501248
[2019-09-19 11:32:23.358147] T [MSGID: 0] [read-ahead.c:459:ra_readv] 0-ec-read-ahead: NEW REQ at offset=40501248 for size=131072
[2019-09-19 11:32:23.358155] T [MSGID: 0] [read-ahead.c:473:ra_readv] 0-ec-read-ahead: unexpected offset (40370176 != 40501248) resetting
[2019-09-19 11:32:23.358171] T [MSGID: 0] [read-ahead.c:388:dispatch_requests] 0-ec-read-ahead: IN-TRANSIT at offset=40501248.
[2019-09-19 11:32:23.358178] T [MSGID: 0] [read-ahead.c:402:dispatch_requests] 0-ec-read-ahead: MISS at offset=40501248.

My understanding of what happens is that dd sends the requests one by one, sequentially.
The read-ahead xlator starts reading more pages than requested, which makes it possible to serve future requests directly from cached data. At some point the kernel also detects the sequential access pattern and starts sending multiple read requests in parallel. Many of these requests are served very quickly by the read-ahead xlator, but after some requests the cache is empty.

Apparently there's a race when this happens: at 2019-09-19 11:32:23.354954 a new request at offset 39976960 arrives. It's the expected offset and the cache contains that page, so it's served immediately. However, at 2019-09-19 11:32:23.355003 a new request arrives for the next offset (40108032), but read-ahead has not yet updated the expected offset, so it incorrectly concludes that the access is not sequential, drops the cached data and starts sending requests to the bricks again. This causes the same offsets to be read more than once, increasing the read volume on the bricks.

Raghavendra, can you take a look to see if it's an easy change?
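The race can be illustrated with a minimal single-threaded sketch (all names hypothetical, not GlusterFS code) that replays both orderings: the buggy one, where a second in-flight request performs its sequentiality check before the first one has updated the expected offset, and the fixed one, where the offset is updated before the response is unwound. The reset path throws away prefetched pages, so the same data ends up being fetched from the bricks again:

```python
PAGE = 131072  # page size seen in the trace above


class ReadAhead:
    """Toy model of the read-ahead bookkeeping (all names hypothetical)."""

    def __init__(self):
        self.expected = 0      # next offset expected for a sequential stream
        self.cache = set()     # offsets already prefetched
        self.brick_reads = 0   # page reads that actually reach the bricks

    def check_and_serve(self, offset):
        # Sequentiality check: a mismatch looks like random access and
        # drops all read-ahead state (the "resetting" seen in the trace).
        if offset != self.expected:
            self.cache.clear()
        if offset in self.cache:
            self.cache.remove(offset)      # HIT: served from cache
        else:
            self.brick_reads += 1          # MISS: fetched from a brick
        for i in range(1, 5):              # prefetch the next 4 pages
            nxt = offset + i * PAGE
            if nxt not in self.cache:
                self.cache.add(nxt)
                self.brick_reads += 1

    def update_expected(self, offset):
        self.expected = offset + PAGE


def run(num_pages, buggy_ordering):
    """Read num_pages pages sequentially. With buggy_ordering, pairs of
    in-flight requests are interleaved so the second request checks
    `expected` before the first one has updated it (the race)."""
    ra = ReadAhead()
    offsets = [i * PAGE for i in range(num_pages)]
    i = 0
    while i < len(offsets):
        if buggy_ordering and i + 1 < len(offsets):
            ra.check_and_serve(offsets[i])
            ra.check_and_serve(offsets[i + 1])   # stale `expected` -> reset
            ra.update_expected(offsets[i])
            ra.update_expected(offsets[i + 1])
            i += 2
        else:
            ra.check_and_serve(offsets[i])       # fix: update before unwind
            ra.update_expected(offsets[i])
            i += 1
    return ra.brick_reads


print("fixed ordering:", run(1000, False), "brick page reads")
print("buggy ordering:", run(1000, True), "brick page reads")
```

Because the toy interleaves every pair of requests, it exaggerates the amplification compared to the roughly 50% overhead measured, but it shows how every reset discards prefetched pages and inflates brick-side reads.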
The bug in read-ahead is fixed by https://review.gluster.org/#/c/glusterfs/+/23454/
REVIEW: https://review.gluster.org/23454 (performance/read-ahead: update expected offset before unwinding read response) posted (#2) for review on master by Raghavendra G
REVIEW: https://review.gluster.org/23454 (performance/read-ahead: update expected offset before unwinding read response) merged (#3) on master by Raghavendra G