Turning on performance.read-ahead results in a huge increase in network traffic during reads, up to 2.5X the amount of traffic expected.
Red Hat Enterprise Linux Server release 6.6 (Santiago)
Red Hat Storage Server 3.0 Update 2
8-node cluster, 10G network, 6 JBOD bricks per node, replication=2
This is the test protocol:
1. Run: gluster volume set HadoopVol performance.read-ahead on
2. Write a 1 GB file to gluster. We monitor performance stats during tests with collectl and colmux (collectl.sourceforge.net). When writing a 1 GB file, we see 2GB leave the client node over the network. Each server node (replication=2) receives 1GB over the net and writes 1GB to disk, as expected.
3. Drop the linux page cache on all nodes.
4. Read the file sequentially on the node that wrote it, with an I/O size of 256KB. We see one of the two servers read 1GB from disk and send 2.5 GB to the client node. That's 2.5 times the amount of network traffic expected.
The factor of 2.5X does not depend on the file size, from 10 MB up to 10 GB.
The factor does depend on the I/O size. For a read size of 256KB or less, the factor is about 2.5X. For a 1 MB read size, the factor is 1.6X. For a read size of 16 MB, the extra traffic is negligible. It looks like each read causes an unnecessary 500-600 KB of traffic.
When we turn off performance.read-ahead, this problem goes away.
Just in case there was a problem with the the counters used by collectl, we captured tcpdump traces during the tests, and added up the packet sizes. These results agree with the collectl data.
The cluster is in the Phoenix lab. Contact me for access.
Looks to be a duplicate of bz 1220845
From client logs attached to bz 1393419, I could see that reads from kernel are interspersed with attr calls. These fstat calls flush the read-ahead cache. So, data is read more than once - once for read-ahead and later when application actually issues read. This explains the extra data read over network.
From the same logs, it also looks read-ahead logic is bit aggressive making this problem more prominent. Had there been no fstat calls from kernel, the prefetched data would be eventually consumed as cache hit and it would not have been a problem.
Even with stat-prefetch/md-cache turned on we can hit this bug as default timeout for md-cache is 1s and there is a very good chance that this cache might've timedout when it actually sees an fstat. So, if we are using md-cache we need to turn on "group metadata-cache" profile which makes sure larger timeouts are set on md-cache and upcall used for handling cache-coherency issues.
At the release stakeholders meeting this morning, it was agreed to push this out of proposed list of 3.4.3, and to be considered for a future batch update.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.