Bug 1138706 - Qpid HA catchup phase can be extremely long, an improvement / analysis needed
Summary: Qpid HA catchup phase can be extremely long, an improvement / analysis needed
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: 3.3
Target Release: ---
Assignee: messaging-bugs
QA Contact: Messaging QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-09-05 13:29 UTC by Frantisek Reznicek
Modified: 2024-01-19 19:11 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1117863 1 None None None 2021-01-20 06:05:38 UTC

Internal Links: 1117863

Description Frantisek Reznicek 2014-09-05 13:29:45 UTC
Description of problem:

The Qpid HA catchup phase can be extremely long; an improvement / analysis is needed.

Situations have been observed in a 3-node cluster where broker catchup takes more than 10 minutes with only a single qpid-send and a single qpid-receive client running continuously.

This partly highlights a weakness of the active-passive approach, but also the intensity of the qpidd catchup process itself.


The proposal is as follows: reduce the window during which the primary is serving with no backup in the 'ready' state by introducing:
 * normal catchup (taking place when there is at least one other backup 'ready')
 * fast catchup (taking place when there is no other backup 'ready')

See further details in bug 1117863 comment 4, bug 1117863 comment 5, bug 1117863 comment 6.
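
A minimal sketch of the proposed mode selection, using hypothetical names (none of the types or functions below exist in the qpid-cpp HA code); the idea is simply that catchup behaviour is chosen from whether any other backup is already 'ready':

// Hypothetical sketch only: illustrative names, not the qpid-cpp HA classes.
#include <cstddef>

enum class CatchupMode { NORMAL, FAST };

struct ClusterView {
    std::size_t readyBackups;   // backups already in the 'ready' state
};

// Use FAST catchup while the primary is serving with no ready backup at all;
// use NORMAL catchup once at least one other backup provides redundancy.
CatchupMode selectCatchupMode(const ClusterView& view) {
    return view.readyBackups == 0 ? CatchupMode::FAST : CatchupMode::NORMAL;
}

What FAST catchup would actually do (for example, getting a larger share of the primary's replication capacity) is the open question discussed in the comments below.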


Version-Release number of selected component (if applicable):
qpid-cpp-*0.22-48.el6

How reproducible:
100%

Steps to Reproduce:
1. bug 1128051, reproducer ha-tx-atomic.sh

Actual results:
  Very long time to catch up joining brokers into the ready state

Expected results:
  Reduced time to catch up joining brokers into the ready state (when there is no other backup ready)

Additional info:

Comment 1 Alan Conway 2014-09-05 16:01:04 UTC
Just to clarify my understanding and add some thoughts:

(In reply to Frantisek Reznicek from comment #0)
> Description of problem:
> 
> Qpid HA catchup phase can be extremely long, an improvement / analysis
> needed.
> 
> There are observed situations where in 3-node cluster broker catchup is
> taking more than 10 minutes when permanently running single qpid-send and
> qpid-receive clients.

I'd like to confirm - all these situations involve large queue depths at the point where the new backup joins?

> This partly highlights a weakness of the active-passive approach, but also
> the intensity of the qpidd catchup process itself.

active-active has the same problem. New brokers have to find out what the existing cluster knows before they can be reliable replicas.

> The proposal is following.
> Reduce situations when primary is serving and there is no backup in ready

We are talking about lone primary, all backups dead. Where there are multiple ready backups at the time of failure, the new primary should recover contact with those existing backups quickly.

> by
> introducing:
>  * normal catchup (taking place when there is at least one other backup
> 'ready')
>  * fast catchup (taking place when there is no other backup 'ready')
> 
> See further details in bug 1117863 comment 4, bug 1117863 comment 5, bug
> 1117863 comment 6.

My understanding is that we are assuming we can speed up one backup by 'handicapping' the others so it has less competition. We'll have to measure how much speed-up we can get this way - the theoretical limit is the speed of a single backup catching up by itself compared to 2 or more backups catching up at the same time.
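
A quick way to see that theoretical limit: if N backups catch up concurrently and share the primary's replication capacity roughly evenly, dedicating that capacity to a single backup speeds it up by at most a factor of N. A tiny illustration (the aggregate rate is an assumed figure, not a measurement of qpidd):

// Illustrative only: upper bound on the speed-up from letting one backup use
// all of the primary's replication capacity instead of sharing it.
#include <cstdio>

int main() {
    const double aggregateRate = 20000.0; // msgs/sec the primary can replicate (assumed)
    const int    backups       = 2;       // backups catching up concurrently

    double sharedRate    = aggregateRate / backups; // each backup's share when competing
    double dedicatedRate = aggregateRate;           // a lone 'fast catchup' backup

    std::printf("speed-up limit: %.1fx\n", dedicatedRate / sharedRate); // == backups
    return 0;
}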

Another approach we should investigate is just making replication faster in general. 

AMQP 1.0 shows hope of improved throughput over 0.10. Latencies look a little worse but throughput is what we need to clear large queue depths. It's a substantial job because it means moving federation to 1.0 also. However it seems like a good thing in general if we plan to drop 0.10 eventually.

If we stick with 0.10 there is a long-standing FIXME in the HA code where we set qpid.sync-frequency=1. This is not a high-performance setting. There are some consistency issues to changing this but it should be doable.
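
To illustrate the cost, here is a generic sketch of the batching idea (not the qpid-cpp replication code): with a sync frequency of 1, every replicated message pays for a sync point, while a larger frequency amortises that cost over a batch at the price of more unconfirmed messages in flight if the backup fails mid-batch.

// Generic batching sketch, not qpid code; 'sync' stands for whatever
// flush/confirmation round-trip the replication link performs.
#include <cstddef>
#include <functional>

void replicate(std::size_t messageCount,
               std::size_t syncFrequency,                    // 1 == current HA setting
               const std::function<void(std::size_t)>& send, // replicate one message
               const std::function<void()>& sync)            // sync point
{
    for (std::size_t i = 0; i < messageCount; ++i) {
        send(i);
        if ((i + 1) % syncFrequency == 0)
            sync();   // far fewer of these with a higher sync frequency
    }
    sync();           // final sync so nothing is left unconfirmed
}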

There are probably other areas we could optimize, a bit of profiling is due.

For persistent messages, we could get brokers to use their stores as part of catch-up (currently they dump their stores and download everything from the primary). This would reduce load on the primary, but the effect on overall catch-up time depends on the speed of the store vs. the subscription.
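
A hypothetical sketch of what store-assisted catch-up could look like (illustrative types only, not the qpid-cpp store or HA API): the joining backup compares the message IDs it already holds on disk with the primary's queue contents and fetches only the difference over the network.

// Hypothetical store-assisted catch-up: fetch only what the local store lacks.
#include <cstdint>
#include <set>
#include <vector>

using MessageId = std::uint64_t;

std::vector<MessageId> messagesToFetch(const std::set<MessageId>& localStore,
                                       const std::set<MessageId>& primaryQueue)
{
    std::vector<MessageId> missing;
    for (MessageId id : primaryQueue)
        if (localStore.find(id) == localStore.end())
            missing.push_back(id);          // only these go over the network
    return missing;
}

Whether this wins overall depends, as noted above, on how fast the store can be read compared to simply downloading everything through the catch-up subscription.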

Note that no matter what we do, unlimited queue depths will give unlimited catch-up times. There's no way around that. We need to think about what acceptable catch-up times are for a given queue depth.

Note2: The active-active cluster had the same problem during its update phase and a sneaky version of the problem after update. After a long update the new broker appears to be active, but clients see long delays while it scrambles to catch up with the stream of corosync messages it missed while it was updating. One of the goals in the new design was to eliminate this "infinite catch-up" problem. However we can probably still improve the trade-off between reliability and responsiveness.

Comment 2 Frantisek Reznicek 2014-09-05 18:58:19 UTC
(In reply to Alan Conway from comment #1)
> Just to clarify my understanding and add some thoughts:
> 
> (In reply to Frantisek Reznicek from comment #0)
> > Description of problem:
> > 
> > Qpid HA catchup phase can be extremely long, an improvement / analysis
> > needed.
> > 
> > There are observed situations where in 3-node cluster broker catchup is
> > taking more than 10 minutes when permanently running single qpid-send and
> > qpid-receive clients.
> 
> I'd like to confirm - all these situations involve large queue depths at the
> point where the new backup joins?

I'd like to add that the 10-minute catchup was seen with just 2 medium-sized queues. With more than 10 queues, my wait for HA with 1p+1b was timing out, so I expect this to be >> 10 minutes.

I'll add timings once I have a spare minute, but my current feeling is that with 20 clients and 100 queues the catchup would take hours, which is not acceptable, as we would (I guess) like to handle recurring failures within a shorter period.

> 
> > This partly highlights a weakness of the active-passive approach, but also
> > the intensity of the qpidd catchup process itself.
> 
> active-active has the same problem. New brokers have to find out what the
> existing cluster knows before they can be reliable replicas.

Fair enough, you're right here; the catch-up was lowering the overall performance of the old clustering.

> 
> > The proposal is following.
> > Reduce situations when primary is serving and there is no backup in ready
> 
> We are talking about lone primary, all backups dead. Where there are
> multiple ready backups at the time of failure, the new primary should
> recover contact with those existing backups quickly.

+1

> 
> > by
> > introducing:
> >  * normal catchup (taking place when there is at least one other backup
> > 'ready')
> >  * fast catchup (taking place when there is no other backup 'ready')
> > 
> > See further details in bug 1117863 comment 4, bug 1117863 comment 5, bug
> > 1117863 comment 6.
> 
> My understanding is that we are assuming we can speed up one backup by
> 'handicapping' the others so it has less competition. We'll have to measure
> how much speed-up we can get this way - the theoretical limit is the speed
> of a single backup catching up by itself compared to 2 or more backups
> catching up at the same time.

I was hoping we could speed up the catchup process even for the case where multiple brokers are catching up, and let multiple brokers run in fast catchup when no other backup is ready. If we cannot speed it up generally, then we need to mark just one broker for fast catchup.

> 
> Another approach we should investigate is just making replication faster in
> general. 

Yes, I'd propose investigating it; this would be ideal.

> 
> AMQP 1.0 shows hope of improved throughput over 0.10. Latencies look a
> little worse but throughput is what we need to clear large queue depths. It's
> a substantial job because it means moving federation to 1.0 also. However it
> seems like a good thing in general if we plan to drop 0.10 eventually.
> 
> If we stick with 0.10 there is a long-standing FIXME in the HA code where we
> set qpid.sync-frequency=1. This is not a high-performance setting. There are
> some consistency issues to changing this but it should be doable.
> 
> There are probably other areas we could optimize, a bit of profiling is due.
> 
> For persistent messages, we could get brokers to use their stores as part of
> catch-up (currently they dump their stores and download everything from
> primary.) This would reduce load on the primary, but effect on overall
> catch-up time depends on the speed of store vs. subscription.
> 
> Note that no matter what we do, unlimited queue depths will give unlimited
> catch-up times. There's no way around that. We need to think about what are
> acceptable catch-up times for a given queue depth.

This should come from real-world deployments: a certain number of queues, a certain amount of client traffic, and a target catch-up time.

> 
> Note2: The active-active cluster had the same problem during its update
> phase and a sneaky version of the problem after update. After a long update
> the new broker appears to be active, but clients see long delays while it
> scrambles to catch up with the stream of corosync messages it missed while
> it was updating. One of the goals in the new design was to eliminate this
> "infinite catch-up" problem. However we can probably still improve the
> trade-off between reliability and responsiveness.

Comment 3 Alan Conway 2014-09-09 14:21:13 UTC
(In reply to Frantisek Reznicek from comment #2)
>
> I'd like to add that the 10-minute catchup was seen with just 2 medium-sized
> queues. With more than 10 queues, my wait for HA with 1p+1b was timing out,
> so I expect this to be >> 10 minutes.
>

Can you attach a reproducer when you have time? I've seen similarly long catchups with a single queue holding 500,000 1k messages.
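
For a rough sense of scale, single-queue catch-up time is roughly queue depth times message size divided by the effective replication throughput. A back-of-envelope sketch for the 500,000 x 1k case; the 10 MB/s figure is an assumption for illustration only (a 10+ minute catch-up at this depth would imply an effective rate well below 1 MB/s):

// Back-of-envelope catch-up estimate; the throughput is an assumed example,
// not a measurement of qpidd.
#include <cstdio>

int main() {
    const double messages      = 500000.0;  // queue depth at join time
    const double messageBytes  = 1024.0;    // ~1k payload per message
    const double replBytesPerS = 10.0e6;    // assumed replication throughput (10 MB/s)

    double seconds = messages * messageBytes / replBytesPerS;
    std::printf("estimated catch-up: ~%.0f s (~%.1f min)\n", seconds, seconds / 60.0);
    return 0;
}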
 
There has been no work yet on optimizing HA performance, so I am sure we can find ways to improve. Hopefully we can improve overall performance at the same time, since catch-up is basically the same process as active replication.

