Created attachment 1024653 [details]
Example of gluster reading data from a file more than once.

Description of problem:

While trying to get the SAS Grid Calibration workload to run successfully in a Gluster environment, I noticed we were doing an enormous amount of read I/O compared to other shared-file-system environments. The tools we use to capture I/O rates and aggregate I/O statistics show we are reading 2-3 times more data than we should.

This was verified by running a simple test that drops the page cache on a client and reads a file from the Gluster volume. Monitoring network data rates and page cache utilization, we were able to show more than 2x the expected network read traffic. Additionally, the way the page cache was being filled implies we are reading the same blocks on the Gluster server more than once.

To get the SAS Calibration workload to run without errors, we have had to turn off virtually all of the Gluster translators; only the read-ahead translator is on.

Using the attached test case, which shows the problem in its simplest form, I learned that the problem stems from having the read-ahead translator on and the stat-prefetch translator off. If both are on, or both are off, everything works fine.

The interesting part of the attachment is the cache and ETHiA (accumulated) values. They should be virtually the same, but ETHiA is much higher. The dstat tool confirms this.

Version-Release number of selected component (if applicable):

RHEL 6.6 -504 kernel on clients and servers
Gluster version glusterfs-3.6.0.53-1.el6rhs.x86_64

How reproducible:
Every time.

Steps to Reproduce:
1. See attachment.

Actual results:

Network reads were 13GB. The cache and ETHiA (accumulated) values in the attachment should be virtually the same. The dstat tool confirms this behavior as well.

Expected results:

Network reads should have been around 5.3GB.

Additional info:

Gluster config ...
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: f7f43d39-0607-4e27-8917-83f3ecb4709d
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gprfs025-b-10ge:/brick/vol0
Brick2: gprfs026-b-10ge:/brick/vol0
Brick3: gprfs027-b-10ge:/brick/vol0
Brick4: gprfs028-b-10ge:/brick/vol0
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.readdir-ahead: off
server.allow-insecure: on
performance.io-cache: off
performance.stat-prefetch: off
performance.quick-read: off
performance.read-ahead: on
performance.write-behind: off
cluster.read-hash-mode: 1
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

Clients mount with gid-timeout=2,entry-timeout=0 as well.
Thank you for your report. This bug is filed against a component for which no further development is being undertaken. If you find the bug relevant in higher versions as well, please feel free to reopen it.
(In reply to Barry Marson from comment #0)
> Using the attached test case, which shows the problem in its simplest form,
> I learned that the problem stems from having the read-ahead translator on
> and the stat-prefetch translator off. If both are on, or both are off,
> everything works fine.

I have a hypothesis that explains this behaviour. read-ahead flushes its cache on (f)stat; the only reason I can think of for this behaviour is atime updates. With stat-prefetch turned on, stats are absorbed by stat-prefetch and read-ahead never sees them, so its cache is not flushed.
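The hypothesis above can be illustrated with a toy model (illustrative Python only, not GlusterFS source; all class and variable names here are made up): a read-ahead layer that prefetches the next block on every read and discards its cache whenever it sees an (f)stat. Interleaving stats with sequential reads forces every prefetched block to be fetched from the backend twice, roughly doubling network reads, which matches the 13GB-vs-5.3GB observation.

```python
# Toy model of a read-ahead translator that flushes its cache on (f)stat.
# Illustrative only; this is not GlusterFS code.

class ReadAhead:
    def __init__(self):
        self.cache = {}          # block number -> data
        self.backend_reads = 0   # counts reads that hit the "network"

    def _fetch(self, block):
        self.backend_reads += 1
        return f"data-{block}"

    def read(self, block):
        data = self.cache.pop(block, None)
        if data is None:
            data = self._fetch(block)
        # prefetch the next block into the cache
        self.cache[block + 1] = self._fetch(block + 1)
        return data

    def fstat(self):
        # read-ahead discards cached data on (f)stat,
        # presumably to keep attributes such as atime coherent
        self.cache.clear()

# Sequential read of 100 blocks with no interleaved stats:
ra_seq = ReadAhead()
for b in range(100):
    ra_seq.read(b)
print(ra_seq.backend_reads)   # 101: every prefetched block is consumed

# Same workload, but an fstat arrives between consecutive reads:
ra = ReadAhead()
for b in range(100):
    ra.read(b)
    ra.fstat()
print(ra.backend_reads)       # 200: every prefetched block is thrown away
```

With stats interleaved, each block is fetched once as a prefetch (then discarded by the flush) and once again for the application's actual read.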
(In reply to Raghavendra G from comment #5)
> I have a hypothesis that explains this behaviour. read-ahead flushes its
> cache on (f)stat; the only reason I can think of for this behaviour is
> atime updates. With stat-prefetch turned on, stats are absorbed by
> stat-prefetch and read-ahead never sees them, so its cache is not flushed.

To (dis)prove the hypothesis, we need to either:

1. Capture the fops seen by read-ahead while the test (cat) is being run. This can be done by loading a debug/trace xlator on top of read-ahead. It is also useful to see which fops go out of read-ahead, so loading another instance of debug/trace as a child of read-ahead will help.
2. Capture a fusedump.

I think 1 is sufficient to debug the issue.
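The trace-sandwich idea in step 1 can be mimicked in a toy model (illustrative Python only; real debug/trace xlators are loaded via the volfile, not like this): one trace layer records the fops entering read-ahead, a second records the fops it forwards, so a flush-induced re-read shows up as the same block being fetched from the backend twice.

```python
# Toy model of sandwiching a caching layer between two trace layers, the
# way debug/trace would be loaded above and below read-ahead.
# Illustrative only; not Gluster code.

class Trace:
    def __init__(self, name, child, log):
        self.name, self.child, self.log = name, child, log

    def read(self, block):
        self.log.append((self.name, "read", block))
        return self.child.read(block)

    def fstat(self):
        self.log.append((self.name, "fstat"))
        self.child.fstat()

class Backend:
    def read(self, block):
        return f"data-{block}"
    def fstat(self):
        pass

class ReadAhead:
    """Prefetches the next block; flushes its cache on fstat."""
    def __init__(self, child):
        self.child, self.cache = child, {}

    def read(self, block):
        data = self.cache.pop(block, None)
        if data is None:
            data = self.child.read(block)
        self.cache[block + 1] = self.child.read(block + 1)  # prefetch
        return data

    def fstat(self):
        self.cache.clear()
        self.child.fstat()

log = []
below = Trace("below-ra", Backend(), log)
stack = Trace("above-ra", ReadAhead(below), log)

stack.read(0)   # one read in, two reads out (read + prefetch of block 1)
stack.fstat()   # flushes the prefetched block
stack.read(1)   # prefetched data is gone: block 1 is fetched again

for entry in log:
    print(entry)
```

The "below-ra" entries show block 1 being read from the backend twice, which is exactly the signature to look for in a real trace.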
Looks to be a duplicate of bz 1214489
(In reply to Raghavendra G from comment #6)
> To (dis)prove the hypothesis, we need to either:
> 1. Capture the fops seen by read-ahead while the test (cat) is being run.
> 2. Capture a fusedump.
>
> I think 1 is sufficient to debug the issue.

From the client logs attached to bz 1393419, I can see that reads from the kernel are interspersed with attr calls. These fstat calls flush the read-ahead cache, so data is read more than once: once by read-ahead, and again when the application actually issues the read. This explains the extra data read over the network.

From the same logs, the read-ahead logic also looks a bit aggressive, which makes the problem more prominent. Had there been no fstat calls from the kernel, the prefetched data would eventually have been consumed as cache hits and this would not have been a problem.

Even with stat-prefetch/md-cache turned on we can hit this bug, since the default timeout for md-cache is 1s and there is a very good chance the cache has timed out by the time it actually sees an fstat. So, if we are using md-cache we need to turn on the "group metadata-cache" profile, which sets larger timeouts on md-cache and uses upcall to handle cache-coherency issues.
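The md-cache timeout caveat can be added to the same kind of toy model (again illustrative Python, not the real translator; names and the timeout mechanics are simplified assumptions): an md-cache layer absorbs fstat calls only while its cached attributes are fresher than its timeout, so a large timeout shields read-ahead from almost all flushes, while a timeout shorter than the stat interval shields it from none.

```python
# Toy model: md-cache (stat-prefetch) in front of a read-ahead cache that
# flushes on fstat. Illustrative only; not Gluster code.

class ReadAhead:
    def __init__(self):
        self.cache = {}
        self.backend_reads = 0

    def _fetch(self, block):
        self.backend_reads += 1
        return f"data-{block}"

    def read(self, block):
        data = self.cache.pop(block, None)
        if data is None:
            data = self._fetch(block)
        self.cache[block + 1] = self._fetch(block + 1)  # prefetch
        return data

    def fstat(self):
        self.cache.clear()   # flush on (f)stat

class MdCache:
    """Absorbs fstat calls while its cached attributes are still fresh."""
    def __init__(self, child, timeout):
        self.child = child
        self.timeout = timeout
        self.attr_time = None   # when attributes were last cached

    def read(self, block):
        return self.child.read(block)

    def fstat(self, now):
        if self.attr_time is None or now - self.attr_time > self.timeout:
            self.child.fstat()      # cache expired: fstat reaches read-ahead
            self.attr_time = now
        # otherwise the stat is served from md-cache and absorbed

# Large timeout (the "group metadata-cache" idea): stats mostly absorbed.
ra = ReadAhead()
md = MdCache(ra, timeout=600)
for t, b in enumerate(range(100)):
    md.read(b)
    md.fstat(now=t)
print(ra.backend_reads)   # near the ideal ~101

# Timeout shorter than the stat interval: every fstat reaches read-ahead.
ra2 = ReadAhead()
md2 = MdCache(ra2, timeout=0.5)
for t, b in enumerate(range(100)):
    md2.read(b)
    md2.fstat(now=t)
print(ra2.backend_reads)  # back to ~2x the data
```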
An upstream bug has been opened for this (1595088). Any further work on it will be taken up upstream.