Bug 1583662 - RFE: load-balance reads even when the read-policy is set to gfid-hash when multiple clients read same file
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-29 12:33 UTC by Nag Pavan Chilakam
Modified: 2018-11-19 06:14 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-19 06:14:58 UTC
Embargoed:



Description Nag Pavan Chilakam 2018-05-29 12:33:47 UTC
Description of problem:
----------------------
To improve performance, especially with gNFS or with applications that use atime-based cache invalidation, the read policy for EC volumes was changed to gfid-hash (BZ#1559084 - [EC] Read performance of EC volume exported over gNFS is significantly lower than write performance).
This ensures that a given file is always read from the same 4 bricks.
However, if the same file is read from multiple clients, say 10, all of them are served from the same 4 bricks. That means the load is not balanced well, since the other 2 bricks stay idle.
With round-robin the problem was cache invalidation due to minute atime differences.

Instead, we could try to combine the advantages of round-robin and gfid-hash with something like a weighted load balance.
Below is an example:

Let's say we have 6 bricks b1, b2, b3, b4, b5, b6 (a 4+2 disperse set) and a file f1 whose gfid-hash maps to b1-b4,
and that there are 10 clients all reading f1 simultaneously.
That means b1-b4 are loaded while b5 and b6 are idle.

Instead, let the load be divided so that each client always reads from the same set of bricks, but that set need not be based only on the gfid. With this we can make sure b5 and b6 also participate, without any single client toggling between bricks, and hence without a performance hit.

We could try load-balancing by doing something like gfid-hash + client-hash (just an example).
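
A minimal sketch of how such a combined selection could work, purely as an illustration (this is not ec xlator code; the toy FNV hash, the string gfid/client identifiers, and the brick layout are assumptions made for the example):

/* Hypothetical sketch: combine a per-file hash (gfid) with a per-client
 * hash so that each client keeps reading a file from a fixed set of 4
 * bricks, while different clients may land on different 4-brick subsets
 * of the 6 (4+2) bricks. Not real Gluster code. */
#include <stdint.h>
#include <stdio.h>

#define TOTAL_BRICKS 6  /* EC 4+2 */
#define READ_BRICKS  4

/* Toy FNV-1a hash; a real implementation would hash the 16-byte gfid. */
static uint32_t toy_hash(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s)
        h = (h ^ (uint8_t)*s++) * 16777619u;
    return h;
}

/* Pick READ_BRICKS consecutive bricks starting at an offset derived from
 * both the file's gfid and the client's identity. */
static void pick_bricks(const char *gfid, const char *client, int out[READ_BRICKS])
{
    uint32_t start = (toy_hash(gfid) + toy_hash(client)) % TOTAL_BRICKS;
    for (int i = 0; i < READ_BRICKS; i++)
        out[i] = (int)((start + i) % TOTAL_BRICKS);
}

int main(void)
{
    const char *clients[] = { "client-1", "client-2", "client-3" };
    for (int c = 0; c < 3; c++) {
        int bricks[READ_BRICKS];
        pick_bricks("gfid-of-f1", clients[c], bricks);
        printf("%s reads f1 from:", clients[c]);
        for (int i = 0; i < READ_BRICKS; i++)
            printf(" b%d", bricks[i] + 1);
        printf("\n");
    }
    return 0;
}

Each client's choice stays fixed (so the atimes it sees stay consistent), while different clients can pull b5 and b6 into service.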



Version-Release number of selected component (if applicable):
-----------
3.12.2-11


Steps to Reproduce:
1. Create an EC volume and mount it on 10 gNFS clients.
2. Create a 1 GB file and clear the cache on the client(s).
3. Now read the file from all 10 clients simultaneously.


Actual results:
=============
As the read policy is gfid-hash, only 4 of the 6 bricks participate in serving the reads.

Expected results:
--------------
Provide a method to load-balance the reads in the scenario described above.

Comment 3 Xavi Hernandez 2018-10-26 09:29:06 UTC
The reason why we moved from 'round-robin' to 'gfid-hash' was that caching effects and reducing read amplification were more important than load distribution.

Based on this, if we have multiple clients reading the same file, making them use different bricks will nullify caching effects, which have proven to be very important.

I think it will be better to send all read requests to the same bricks because this way we will take advantage of cached data (reads will be answered faster) and we'll minimize read amplification.

Since most probably we'll already have other files being read concurrently, this will cause a load-balancing effect. This is similar to what DHT does: if only one file is accessed, the same set of bricks is used. However, since volumes will have multiple accesses normally, all DHT subvolumes are active at the same time.

From my point of view, I think this option won't provide any improvement, but I would like to know what others think.

Comment 4 Pranith Kumar K 2018-10-29 05:25:40 UTC
(In reply to Xavi Hernandez from comment #3)
> The reason why we moved from 'round-robin' to 'gfid-hash' was that caching
> effects and reducing read amplification were more important than load
> distribution.
> 
> Based on this, if we have multiple clients reading the same file, making
> them use different bricks will nullify caching effects, which have proven to
> be very important.

I didn't understand the point above. If we use the same set of bricks per mount for reading, cache effects will still apply, because for that mount the same bricks are always used for reading, so the 'times' will always be the same in the stat data we send back in the responses. So as per my understanding this should work fine; we could do something like read-hash-mode=2 in AFR, where pid+gfid is used for hashing to choose the bricks to read from.
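
(As an illustration only, not the actual AFR code: a per-mount read-child choice based on pid+gfid could look roughly like the sketch below; the gfid hash value and child count are made-up examples.)

/* Hypothetical sketch of pid+gfid based read-child selection for a
 * replica set: every read of this file from this mount goes to the same
 * child, but different mounts (different pids) may pick different children. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int pick_read_child(uint32_t gfid_hash, pid_t pid, int child_count)
{
    return (int)((gfid_hash + (uint32_t)pid) % (uint32_t)child_count);
}

int main(void)
{
    uint32_t gfid_hash = 0x5ee0f00du; /* stands in for a hashed gfid */
    int child_count = 3;              /* e.g. replica 3 */

    printf("this mount (pid %d) reads from child %d\n",
           (int)getpid(),
           pick_read_child(gfid_hash, getpid(), child_count));
    return 0;
}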

> 
> I think it will be better to send all read requests to the same bricks
> because this way we will take advantage of cached data (reads will be
> answered faster) and we'll minimize read amplification.
> 
> Since most probably we'll already have other files being read concurrently,
> this will cause a load-balancing effect. This is similar to what DHT does:
> if only one file is accessed, the same set of bricks is used. However, since
> volumes will have multiple accesses normally, all DHT subvolumes are active
> at the same time.

Yes, while in theory we can improve it, I am not sure how much practical value the suggested solution brings. So I am not sure we should spend time on adding yet another option.

> 
> From my point of view, I think this option won't provide any improvement,
> but I would like to know what others think.

Comment 5 Ashish Pandey 2018-10-29 05:39:59 UTC
(In reply to Pranith Kumar K from comment #4)
> (In reply to Xavi Hernandez from comment #3)
> > The reason why we moved from 'round-robin' to 'gfid-hash' was that caching
> > effects and reducing read amplification were more important than load
> > distribution.
> > 
> > Based on this, if we have multiple clients reading the same file, making
> > them use different bricks will nullify caching effects, which have proven to
> > be very important.
> 
> Didn't understand the point above. If we use same set of bricks per mount
> for reading, cache-affects will be enabled because for that mount same
> bricks are used for reading, so the 'times' will always be same in stat data
> when we send the info in the responses. So as per my understanding this
> should work fine, we can do something like read-has-mode=2 in afr where it
> takes pid+gfid for hashing the bricks it wants to choose.
> 

In my understanding, Xavi meant hard-drive-level caching.
If two clients want to read the same data from a file, it is better to
send those reads to the same set of 4 (EC 4+2) bricks, because the data would already be cached at the hard-drive level and disk access can be avoided.

> > 
> > I think it will be better to send all read requests to the same bricks
> > because this way we will take advantage of cached data (reads will be
> > answered faster) and we'll minimize read amplification.
> > 
> > Since most probably we'll already have other files being read concurrently,
> > this will cause a load-balancing effect. This is similar to what DHT does:
> > if only one file is accessed, the same set of bricks is used. However, since
> > volumes will have multiple accesses normally, all DHT subvolumes are active
> > at the same time.
> 
> Yes while in theory we can improve it, I am not sure how much practical
> value the solution suggested brings in. So not sure if we should spend time
> on giving yet another option.
> 
> > 
> > From my point of view, I think this option won't provide any improvement,
> > but I would like to know what others think.


I also agree that this option is not worth the extra effort.

Comment 6 Pranith Kumar K 2018-10-29 06:19:10 UTC
> 
> In my understanding, Xavi meant to say hard drive level caching.
> If two clients want to read same set of data from a file then it is better to
> send that on same set of 4 (EC 4+2) bricks because that would have been
> cached on hard drive level and access to hard drive can be avoided.

This may not hold true for O_DIRECT reads, as in VM/block workloads, so having each client read from a different set of bricks may not be a bad idea in that case. Theoretically there is value in what this bug suggests, but we need data to confirm that it is indeed worth implementing. So if we see a problem reported by customers, as we did with round-robin, we can consider it.

Comment 7 Xavi Hernandez 2018-10-29 08:43:47 UTC
(In reply to Pranith Kumar K from comment #4)
> (In reply to Xavi Hernandez from comment #3)
> > The reason why we moved from 'round-robin' to 'gfid-hash' was that caching
> > effects and reducing read amplification were more important than load
> > distribution.
> > 
> > Based on this, if we have multiple clients reading the same file, making
> > them use different bricks will nullify caching effects, which have proven to
> > be very important.
> 
> Didn't understand the point above. If we use same set of bricks per mount
> for reading, cache-affects will be enabled because for that mount same
> bricks are used for reading, so the 'times' will always be same in stat data
> when we send the info in the responses. So as per my understanding this
> should work fine, we can do something like read-has-mode=2 in afr where it
> takes pid+gfid for hashing the bricks it wants to choose.

What I was trying to say is that if we allow different requests to the same file to be served by different bricks of the same disperse set, we'll lose the benefits of brick caches (and we'll hit the time issue again). Dispersing reads has proven to be bad because of cache misses and read amplification.

My comment was an attempt to justify why I think this option is not interesting/important (at least right now). So you are right, the current approach should work well.

I'm not sure that pid+gfid can bring any advantage.

> 
> > 
> > I think it will be better to send all read requests to the same bricks
> > because this way we will take advantage of cached data (reads will be
> > answered faster) and we'll minimize read amplification.
> > 
> > Since most probably we'll already have other files being read concurrently,
> > this will cause a load-balancing effect. This is similar to what DHT does:
> > if only one file is accessed, the same set of bricks is used. However, since
> > volumes will have multiple accesses normally, all DHT subvolumes are active
> > at the same time.
> 
> Yes while in theory we can improve it, I am not sure how much practical
> value the solution suggested brings in. So not sure if we should spend time
> on giving yet another option.

Note that the reason for changing to 'gfid-hash' by default was that the overhead caused by read amplification and cache misses was huge, leading to severe performance degradation for reads. So I don't see how an intermediate solution could bring any benefit.

> 
> > 
> > From my point of view, I think this option won't provide any improvement,
> > but I would like to know what others think.

Comment 8 Xavi Hernandez 2018-10-29 08:50:38 UTC
(In reply to Pranith Kumar K from comment #6)
> > 
> > In my understanding, Xavi meant to say hard drive level caching.
> > If two clients want to read same set of data from a file then it is better to
> > send that on same set of 4 (EC 4+2) bricks because that would have been
> > cached on hard drive level and access to hard drive can be avoided.
> 
> This may not hold true for o-direct reads like in VM/block workloads, so per
> client reading from multiple bricks may not be a bad idea in that case.
> Theoretically there is value in what this bug suggests but we need data to
> make sure that it is indeed worth implementing. So if we find a problem from
> customers like we did with round-robin we can consider it.

O_DIRECT access won't benefit from caching, but it won't suffer either from always accessing the same bricks. If we have sharding enabled, as is recommended for VM workloads, reads will already be balanced among bricks. And if we have more than a single VM, which is very likely, brick load will be balanced quite well, so I don't see a real use case for this option right now.

Comment 9 Pranith Kumar K 2018-10-29 09:10:12 UTC
> 
> What I was trying to say is that if we allow different requests to the same
> file to be served by different bricks of the same disperse set, we'll lose
> the benefits of brick caches (as well as the time issue). Dispersing reads
> has proven to be bad because of cache misses and read amplification.
> 

Ah! I think I understand why my earlier comment was confusing. What I was describing applies to FUSE mounts, or to different gNFS servers, not to NFS clients. When multiple FUSE mounts or gNFS servers read the same file, we can send them to a different set of bricks per FUSE client/gNFS server, but always the same bricks for that FUSE client/gNFS server.
For example: for a file f1 on a disperse (2+1) volume we can choose bricks 0 and 1 on mount-1, bricks 1 and 2 on mount-2, and bricks 0 and 2 on mount-3. This way there is load-balancing, but it still always yields correct atime attributes.

One way to do that is to hash on the pid of the FUSE-mount/gNFS process together with the gfid. In AFR we have this as an option.
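
(A hypothetical sketch of that selection for the 2+1 example above, not actual EC code; the pids and hash value are invented for illustration. Each mount drops one brick, chosen from pid+gfid, and reads its 2 fragments from the rest.)

/* Hypothetical: per-mount choice of 2 out of 3 bricks for a 2+1 volume,
 * stable for a given mount (pid) and file (gfid), different across mounts. */
#include <stdint.h>
#include <stdio.h>

static void pick_subset(uint32_t gfid_hash, uint32_t mount_pid,
                        int brick_count, int out[])
{
    int skip = (int)((gfid_hash + mount_pid) % (uint32_t)brick_count);
    int n = 0;
    for (int i = 0; i < brick_count; i++)
        if (i != skip)
            out[n++] = i;
}

int main(void)
{
    uint32_t gfid_hash = 0xf1f1f1f1u;             /* placeholder for f1's gfid hash */
    uint32_t mount_pids[] = { 1201, 1202, 1203 }; /* three example mount processes */

    for (int m = 0; m < 3; m++) {
        int chosen[2];
        pick_subset(gfid_hash, mount_pids[m], 3, chosen);
        printf("mount-%d reads f1 from bricks %d and %d\n",
               m + 1, chosen[0], chosen[1]);
    }
    return 0;
}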

So I am saying there could be value in doing this, but we need a use-case which makes it necessary to have this option.

Comment 10 Xavi Hernandez 2018-10-31 12:30:53 UTC
(In reply to Pranith Kumar K from comment #9)
> > 
> > What I was trying to say is that if we allow different requests to the same
> > file to be served by different bricks of the same disperse set, we'll lose
> > the benefits of brick caches (as well as the time issue). Dispersing reads
> > has proven to be bad because of cache misses and read amplification.
> > 
> 
> Ah! I think I understood why my earlier comment is confusing. What I was
> mentioning is for fuse-mounts, or different gNFS servers not NFS clients.
> When multiple fuse-mounts or gNFS servers try to read same file, we can send
> them to different set of bricks per fuse-client/gNFS server but always same
> bricks for that fuse-client/gNFS server.
> For example: For a file f1 on volume with disperse (2+1) we can choose
> bricks 0, 1 on mount-1 and 1, 2 on mount-2 and 0, 2 on mount 3 this way
> there is load-balancing but it will always lead to correct atime attributes.
> 
> One way to do that is by taking pid of the fuse-mount/gNFS process and gfid
> for doing the hashing. In afr we have this as an option.
> 
> So I am saying there could be value in doing this, but we need a use-case
> which makes it necessary to have this option.

I still don't see the benefit. Let's say we have two mounts M1 and M2, and the first one reads one file from bricks B1 and B2, and the other from B2 and B3.

In this configuration, some requests that B1 could have answered from cached data will instead need B3 to access the disk, which is orders of magnitude slower. Even worse, most read requests come with some read-ahead buffer, which wastes even more I/O bandwidth and memory. In this scenario we'll also have two copies of the same data in cache (on B1 and B3), making the caches less efficient, since we are using physical memory to store data that is already cached on the other node.

I was of the same opinion that balancing I/O seems like a good thing, but what I learned from that case is that balancing reads is very bad for performance unless we are sure that reads won't overlap, and considering read-ahead, the only way to be sure we won't have overlaps is to only use different bricks for different files.

Adding the option shouldn't take much effort, but I don't see any benefit from it.

Comment 11 Pranith Kumar K 2018-11-08 09:02:33 UTC
(In reply to Xavi Hernandez from comment #10)
> (In reply to Pranith Kumar K from comment #9)
> > > 
> > > What I was trying to say is that if we allow different requests to the same
> > > file to be served by different bricks of the same disperse set, we'll lose
> > > the benefits of brick caches (as well as the time issue). Dispersing reads
> > > has proven to be bad because of cache misses and read amplification.
> > > 
> > 
> > Ah! I think I understood why my earlier comment is confusing. What I was
> > mentioning is for fuse-mounts, or different gNFS servers not NFS clients.
> > When multiple fuse-mounts or gNFS servers try to read same file, we can send
> > them to different set of bricks per fuse-client/gNFS server but always same
> > bricks for that fuse-client/gNFS server.
> > For example: For a file f1 on volume with disperse (2+1) we can choose
> > bricks 0, 1 on mount-1 and 1, 2 on mount-2 and 0, 2 on mount 3 this way
> > there is load-balancing but it will always lead to correct atime attributes.
> > 
> > One way to do that is by taking pid of the fuse-mount/gNFS process and gfid
> > for doing the hashing. In afr we have this as an option.
> > 
> > So I am saying there could be value in doing this, but we need a use-case
> > which makes it necessary to have this option.
> 
> I still don't see the benefit. Let's say we have two mounts M1 and M2, and
> the first one reads one file from bricks B1 and B2, and the other from B2
> and B3.
> 
> In this configuration, some requests that B1 could have answered with cached
> data, B3 will need to access disk, which is orders of magnitude slower. Even
> worse, most read requests come with some read-ahead buffer that wastes more
> I/O bandwidth and memory. In this scenario we'll also have two copies of the
> same data in caches (B1 and B3), making caches less efficient since we are
> using physical memory to store data we already have cached on the other node.
> 
> I was of the same opinion that balancing I/O seems like a good thing, but
> what I learned from that case is that balancing reads is very bad for
> performance unless we are sure that reads won't overlap, and considering
> read-ahead, the only way to be sure we won't have overlaps is to only use
> different bricks for different files.
> 
> Adding the option shouldn't take much effort, but I don't see any benefit
> from it.

I agree. Do you think it makes sense to close this bug as WONTFIX in that case? If you do, go ahead. I'm leaving a needinfo on you so that you get this notification.

Comment 12 Pranith Kumar K 2018-11-08 09:03:46 UTC
Nag,
   Based on the discussion, it makes sense to close this bug for now. Let us know if you are fine with that decision.

Pranith

Comment 13 Xavi Hernandez 2018-11-09 09:04:24 UTC
I'll wait until Nag gives his point of view on this bug before closing or doing anything else.

