Description of problem:

The original source of the 2ms RTT maximum latency is/was GFS; however, many deployments do not make use of GFS.

This bug is to test, and later to update our documentation to reflect, what latencies can be tolerated in non-GFS use cases.

Higher thresholds will become increasingly important as stretch cluster deployments become more popular (such as for OpenStack).

Version-Release number of selected component (if applicable):

The immediate goal is to develop a procedure for establishing new numbers for 7.4, after which we may look to repeat for older versions.

How reproducible:

NA

Steps to Reproduce:
1. (If not already the case) Make it possible to run the QE tests without GFS present
2. Simulate higher latencies (see the netem sketch after this report)
3. Run the tests
4. Repeat

Actual results:

Maximum RTT Latency 2ms

Expected results:

Maximum RTT Latency ~1s

Additional info:

Once a new number is established, it will be important to sanity test what happens when the latencies are asymmetrical, specifically when there are two sets of machines and the latencies are ~2ms between machines in the same set but much higher between machines in different sets.
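For step 2, one way to simulate the higher latencies on a test cluster is the kernel's netem queueing discipline. A minimal sketch, assuming the cluster interconnect is eth0 (the interface name and delay values are placeholders for whatever the test environment uses):

    # add ~100ms of one-way delay with 10ms of jitter to everything leaving eth0
    tc qdisc add dev eth0 root netem delay 100ms 10ms

    # remove it again once the test run is done
    tc qdisc del dev eth0 root

Applying the same qdisc on every node keeps the added latency symmetric; the asymmetric case from the additional info would need per-destination rules instead.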
(In reply to Andrew Beekhof from comment #0)
> Description of problem:
>
> The original source of the 2ms RTT maximum latency is/was GFS, however many
> deployments do not make use of GFS.
>
> This bug is to test, and later to update our documentation to reflect, what
> latencies can be tolerated in non-GFS use cases.
>
> Higher thresholds will become increasingly important as stretch cluster
> deployments become more popular (such as for OpenStack).
>
> Version-Release number of selected component (if applicable):
>
> The immediate goal is to develop a procedure for establishing new numbers
> for 7.4, after which we may look to repeat for older versions.
>
> How reproducible:
>
> NA
>
> Steps to Reproduce:
> 1. (If not already the case) Make it possible to run the QE tests without
> GFS present
> 2. Simulate higher latencies
> 3. Run the tests
> 4. Repeat
>
> Actual results:
>
> Maximum RTT Latency 2ms
>
> Expected results:
>
> Maximum RTT Latency ~1s

~1s might be excessive. I'd say we first identify the max latency with default settings, and then we take it from there. Already a 500ms latency means crossing the oceans twice or more.

EMEA <-> US ping is ~200ms
EMEA <-> JP ping is ~300ms

> Additional info:
>
> Once a new number is established, it will be important to sanity test what
> happens when the latencies are asymmetrical. Specifically when there are
> two sets of machines and the latencies are ~2ms between machines in the same
> set but much higher between machines in different sets.

My suggestion would be to start with a 2 node cluster to keep it easier (both nodes would have symmetrical config for latency). Then I agree we should test 4 / 6 / 8, where nodes are split 50/50 and the ones in the same DC would have no added latency, but communication with the nodes on the other DC would.
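For the 4/6/8-node split described above, netem can also be attached behind a classifier so that only traffic towards the other DC picks up the delay. A rough sketch, assuming the remote site's nodes sit in a hypothetical 192.168.2.0/24 network:

    # prio qdisc as a classifier; band 3 carries the delayed traffic
    tc qdisc add dev eth0 root handle 1: prio
    tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 200ms 20ms
    # only packets destined for the remote DC subnet go through the netem band
    tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
        match ip dst 192.168.2.0/24 flowid 1:3

Traffic between nodes in the same DC stays on the default bands and keeps its native latency.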
~1s was my roundabout way of saying let's see how high we can go, rather than putting an artificial ceiling on the tests. I wasn't actually expecting to get that high.
I'm not sure it makes sense to define it in those terms. I would guess that the important thing here is token rotation time. That will depend upon the number of nodes, the time the token spends on each node before being passed on, and the time taken to pass the token from node to node on each part of the ring. RTT only really makes sense for point to point connections, rather than the ring structure that corosync uses.

Btw, does corosync have anything built in to measure the node to node token time and/or rotation time? That would be a good thing to export to PCP, for example, to make it easier for people to keep track of it.
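Partially answering the question above: corosync 2.x does keep some runtime totem statistics in its cmap database, though the exact key names vary between versions and, as far as I know, nothing is exported to PCP out of the box. A hedged example of how to look at whatever the running version exposes:

    corosync-cmapctl | grep runtime.totem

Whether those counters include a usable per-hop token time or full rotation time is something to confirm as part of this testing.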
While I do agree that raising the limit here will give us more flexibility with our customers, I'd like to caution against developing a support matrix with varying policies across versions and/or configurations. Where we already have policies that carve out different standards for the various releases or conditions, we find customers often get it wrong, the policies become difficult to maintain as new releases and combinations come out that we have to consider, and in the end it just drives more dissatisfaction with the product.

Right now it sounds like we're discussing:

- Some higher limit for 7.4+ without GFS2
- 2ms for 7.4+ with GFS2
- 2ms for < 7.4

I can already say this is likely going to result in confusion and dissatisfaction from customers. If we can make any higher latency allowance retroactive for older releases, that definitely makes this easier to accept.

I'm also in favor of Andrew's suggestion that we simply see how high we can go. Building on Fabio's response, if 1s latency already means several trips around the globe, then successful tests at 1s would - in my opinion - be good reason to just drop the latency limits altogether. Do we expect anyone to deploy a cluster with a 10s latency and to be disappointed that they need to do some tuning to make it stable? If we've tested at 1s and the cluster doesn't fall apart, then what concerns do we have about leaving the door open to any latency conditions being eligible for support? If it reduces the concern, we could include a caveat in our support policy doc stating something like: "High latencies above 1s may trigger instability without further tuning of communication settings and timings".

Finally, I'd like to ask what the concern is around high latency with GFS2? Of course, I realize this will affect performance negatively. Does that go as far as causing instability, or are we just not confident that performance levels will be satisfactory with those latencies? Again, I have to ask what our concern is if a customer tried to do this? Do we think they're going to use a high-latency link and then be surprised or disappointed that it negatively impacts performance in their cluster? And again, would a simple warning in the policy be enough to address those concerns, like "Performance of components that process a high volume of network messages will decline with higher latencies, and may be unacceptable beyond latencies of 1s. Those components include: GFS, GFS2, cmirror, clvmd, [...]".

In summary: let's make the policy simple by lifting limits wherever possible and applying the same rules across all of our offerings if we can.
(In reply to John Ruemker from comment #6)
> I'm also in favor of Andrew's suggestion that we simply see how high we can
> go. Building on Fabio's response, if 1s latency already means several trips
> around the globe, then successful tests at 1s would - in my opinion - be
> good reason to just drop the latency limits altogether. Do we expect anyone
> to deploy a cluster with a 10s latency and to be disappointed that they need
> to do some tuning to make it stable? If we've tested at 1s and the cluster
> doesn't fall apart, then what concerns do we have about leaving the door
> open to any latency conditions being eligible for support? If it reduces
> the concern, we could include a caveat in our support policy doc stating
> something like: "High latencies above 1s may trigger instability without
> further tuning of communication settings and timings".

And to be clear, I wasn't suggesting that 1s is the threshold we must hit without instability in order to lift the limit altogether. My point was more that if we're testing at latencies higher than imaginable in an enterprise environment and don't see instability outside of just needing to tune the settings, then I feel the latency limit provides little benefit to our customers, and could just be replaced by a recommendation to tune and test.
(In reply to John Ruemker from comment #6)
> While I do agree that raising the limit here will give us more flexibility
> with our customers, I'd like to caution against developing a support matrix
> with varying policies across versions and/or configurations. Where we
> already have policies that carve out different standards for the various
> releases or conditions, we find customers often get it wrong, the policies
> become difficult to maintain as new releases and combinations come out that
> we have to consider, and in the end it just drives more dissatisfaction with
> the product.
>
> Right now it sounds like we're discussing:
>
> - Some higher limit for 7.4+ without GFS2
> - 2ms for 7.4+ with GFS2
> - 2ms for < 7.4
>
> I can already say this is likely going to result in confusion and
> dissatisfaction from customers. If we can make any higher latency
> allowance retroactive for older releases, that definitely makes this easier
> to accept.
>
> I'm also in favor of Andrew's suggestion that we simply see how high we can
> go. Building on Fabio's response, if 1s latency already means several trips
> around the globe, then successful tests at 1s would - in my opinion - be
> good reason to just drop the latency limits altogether. Do we expect anyone
> to deploy a cluster with a 10s latency and to be disappointed that they need
> to do some tuning to make it stable? If we've tested at 1s and the cluster
> doesn't fall apart, then what concerns do we have about leaving the door
> open to any latency conditions being eligible for support? If it reduces
> the concern, we could include a caveat in our support policy doc stating
> something like: "High latencies above 1s may trigger instability without
> further tuning of communication settings and timings".

I would suggest we take this in stages, depending on when QE has bandwidth to test.

First I'd like to see how far we can go without tuning (default settings).

Second would be to see if we can do it retroactively (I am fairly confident we can, given that corosync didn't get any major changes between releases).

I'd like to keep the tuning option for later releases, depending on how good/bad the current status is. Right now we don't have an automatic system to measure internal latency yet, and tuning would all be manual, meaning that nobody will ever get it right.

The only discriminating factor could be GFS2, depending on Steven's input.
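For context on the "tuning would all be manual" point: the main knob is the totem token timeout in /etc/corosync/corosync.conf, which has to be edited, synced to every node, and then the cluster restarted. A hedged sketch only; the values are illustrative, not a recommendation:

    totem {
        version: 2
        cluster_name: mycluster    # placeholder name
        # token timeout in milliseconds; must comfortably exceed the
        # worst-case token rotation time for the given latency and node count
        token: 10000
    }

This is exactly the kind of per-deployment arithmetic customers are unlikely to get right by hand, which supports staging the tuning work for later.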
(In reply to Steve Whitehouse from comment #4)
> I'm not sure it makes sense to define it in those terms. I would guess that
> the important thing here is token rotation time. That will depend upon the
> number of nodes, the time the token spends on each node before being passed
> on, and the time taken to pass the token from node to node on each part of
> the ring. RTT only really makes sense for point to point connections, rather
> than the ring structure that corosync uses.

Technically this is correct, but we need to translate that into something customers can measure before deploying. RTT is a good one and gives a pretty good indication of how things are going to behave without running corosync and benchmark tools.

> Btw, does corosync have anything built in to measure the node to node token
> time and/or rotation time? That would be a good thing to export to PCP, for
> example, to make it easier for people to keep track of it.
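As a concrete example of what "customers can measure before deploying" looks like, a plain ping between a node in each site already produces the RTT figures we would document against (the hostname is a placeholder):

    # 100 probes, summary output only
    ping -c 100 -q node-in-other-dc.example.com

The "rtt min/avg/max/mdev" line in the summary is the number to compare against whatever limit we end up publishing.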
For GFS2, lower latency is going to be much preferable, otherwise I can see we'll land up with all kinds of performance issues, particularly for metadata intensive operations, and those using lots of inodes.

I don't think we've ever characterised a specific RTT that would make sense for GFS2, however Bob has been doing a lot of perf work looking at the DLM recently so he likely has some current figures for that. There is of course not just DLM to consider, but also plock latency which is corosync based too.

So the 2ms figure is really plucked out of the air, but by the time you've multiplied it up several times for lock operations, and then again by the number of inodes, something simple like a "find" over a fs tree could result in it being _really_ slow even if the storage is fast.

The trend is towards faster storage, and even 2ms looks very, very slow when compared with NVMeF storage latencies. We are putting in a lot of effort to figure out how to speed up the existing locking and allowing people to use really slow networks just doesn't make sense.
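To make the multiplication effect concrete, a rough and purely illustrative calculation, assuming just one cross-site DLM round trip per inode:

    100,000 inodes x 1 lock round trip x 2ms RTT  = 200 seconds
    100,000 inodes x 1 lock round trip x 50ms RTT = 5,000 seconds (about 83 minutes)

In practice each inode can need several lock operations, so real numbers for a cold-cache "find" would be worse still; this is the amplification Steve describes.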
> (In reply to Steve Whitehouse from comment #4)
> > I'm not sure it makes sense to define it in those terms. I would guess that
> > the important thing here is token rotation time. That will depend upon the
> > number of nodes, the time the token spends on each node before being passed
> > on, and the time taken to pass the token from node to node on each part of
> > the ring. RTT only really makes sense for point to point connections, rather
> > than the ring structure that corosync uses.
>
> technically this is correct, but we need to translate that into something
> customers can measure before deploying. RTT is a good one and gives a pretty
> good indication on how things are going to behave without running corosync
> and benchmark tools.

Well the question is whether they'll measure RTT between all possible combinations of nodes? My suspicion is that they will not, they'll just choose a couple of different nodes and leave it at that. Therefore if corosync can measure that directly and report it, that will help a lot in tracking down problems in due course.
(In reply to Steve Whitehouse from comment #10)
> For GFS2, lower latency is going to be much preferable, otherwise I can see
> we'll land up with all kinds of performance issues, particularly for
> metadata intensive operations, and those using lots of inodes.
>
> I don't think we've ever characterised a specific RTT that would make sense
> for GFS2, however Bob has been doing a lot of perf work looking at the DLM
> recently so he likely has some current figures for that. There is of course
> not just DLM to consider, but also plock latency which is corosync based too.
>
> So the 2ms figure is really plucked out of the air, but by the time you've
> multiplied it up several times for lock operations, and then again by the
> number of inodes, something simple like a "find" over a fs tree could result
> in it being _really_ slow even if the storage is fast.
>
> The trend is towards faster storage, and even 2ms looks very, very slow when
> compared with NVMeF storage latencies. We are putting in a lot of effort to
> figure out how to speed up the existing locking and allowing people to use
> really slow networks just doesn't make sense.

That's exactly why we need to disconnect the support matrix between RS and HA. HA doesn't have those strict requirements, and this max 2ms latency is killing HA adoption in several areas.
(In reply to Steve Whitehouse from comment #11)
> > (In reply to Steve Whitehouse from comment #4)
> > > I'm not sure it makes sense to define it in those terms. I would guess that
> > > the important thing here is token rotation time. That will depend upon the
> > > number of nodes, the time the token spends on each node before being passed
> > > on, and the time taken to pass the token from node to node on each part of
> > > the ring. RTT only really makes sense for point to point connections, rather
> > > than the ring structure that corosync uses.
> >
> > technically this is correct, but we need to translate that into something
> > customers can measure before deploying. RTT is a good one and gives a pretty
> > good indication on how things are going to behave without running corosync
> > and benchmark tools.
>
> Well the question is whether they'll measure RTT between all possible
> combinations of nodes? My suspicion is that they will not, they'll just
> choose a couple of different nodes and leave it at that. Therefore if
> corosync can measure that directly and report it, that will help a lot in
> tracking down problems in due course.

We require the measurement between the 2 datacenters. That's generally more than enough.
(In reply to Steve Whitehouse from comment #10)
> So the 2ms figure is really plucked out of the air, but by the time you've
> multiplied it up several times for lock operations, and then again by the
> number of inodes, something simple like a "find" over a fs tree could result
> in it being _really_ slow even if the storage is fast.
>
> The trend is towards faster storage, and even 2ms looks very, very slow when
> compared with NVMeF storage latencies. We are putting in a lot of effort to
> figure out how to speed up the existing locking and allowing people to use
> really slow networks just doesn't make sense.

I just don't understand why we prefer the support policy to deal with this if that's also going to limit some customers from using the product with full acceptance and awareness of the ramifications of the higher latency. If a customer deploys an environment with suboptimal conditions and gets poor performance, then we identify that and explain how they need to improve that component of their environment. That's already responsible for the vast majority of our GFS2-based cases, so it's not like our support policies are doing a great job of keeping the product out of the problematic conditions.

We don't draw the lines of supportability on storage latency, which could have as much impact on GFS2 performance as network latency. Customers can (and do) deploy terribly designed workloads on the file system that are just going to perform miserably, but we don't draw the line on support for those.

If we split the policy between HA and RS, then it is not a given that it will be a net savings in resources spent supporting customers. The confusion or lack of familiarity with what limit applies where means we'll still get into these latency-based performance investigations; the dissatisfaction some customers will express over not being able to deploy GFS2 with higher latency will require us to spend time having that debate with them or their account team; and some customers are going to just do it anyways despite the policy, with us having somewhat opened the door to higher latency with HA.

I just don't see how completely blocking the use case leaves us in a better position than simply publishing guidance and expectations around dealing with higher latency. We would be keeping the support policy simple and understandable and we would let customers decide what performance is acceptable vs unacceptable - potentially bringing more users to the product.
I've created the following article to cover our existing latency policy: https://access.redhat.com/articles/2823721

This is not yet available to customers, as it's part of a larger structure of support policies and other documents that we're working on that needs to be completed and pushed out together. But this is where we'll need to define any changes to the policy once they're sorted out and ready.
I'm assuming this is beyond consideration for 7.4 - even with it being TestOnly. Proposing for 7.5.
Updated High Availability policy: https://access.redhat.com/articles/2823721

And created a new Resilient Storage policy guide that maintains the 2ms limit there: https://access.redhat.com/articles/3163171

These changes are both live and available to customers. Let me know if there are any concerns or suggestions.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0920