Bug 1099809 - [Doc Bug Fix] 18.8.5. About High-availability (HA) Failover - only one backup is supported
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Documentation
Version: 6.2.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER7
Target Release: EAP 6.3.0
Assignee: Nichola Moore
QA Contact: Russell Dickenson
URL:
Whiteboard:
Duplicates: 1090420
Depends On:
Blocks:
Reported: 2014-05-21 08:54 UTC by Miroslav Novak
Modified: 2014-08-14 15:20 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-06-28 15:29:52 UTC
Type: Bug
Embargoed:


Attachments
hornetq_subystem_attributes.txt (220.01 KB, text/plain)
2014-05-26 07:18 UTC, Miroslav Novak

Description Miroslav Novak 2014-05-21 08:54:51 UTC
There is misleading information in chapter 18.8.5. About High-availability (HA) Failover in the Administration and Configuration Guide [1]. Only one backup is supported.

Change:
High-availability failover is available with either automatic client failover, or application-level failover, through a live-backup structure. Each live server has a backup server, which can also be backed up by as many servers as necessary. 

The backup server only takes over if the live server crashes and there is a failover. Simultaneously, one of the secondary backup servers takes over as the passive backup server, from the new live server. After the failover, and after the former live server has been restarted, it becomes a secondary backup server, or the backup server if there are only two. 

to:
High-availability failover is available with either automatic client failover, or application-level failover, through a live-backup structure. Each live server has a backup server. Only one backup server per live is supported.

The backup server only activates if the live server crashes. After the live server has been restarted, it becomes the live server again if and only if the "allow-failback" attribute is set to true. In this case the backup automatically becomes passive.

Remove "Important" note:
A shared file-system-directory is required, in order for the backup server to send/receive messages as a response to the messages received by the previous live server. 

################################

Adding GSS and HornetQ dev team to cc to review. 

If we're going to support more than one backup, we'll need to write new tests for this. There are some tricky scenarios, as described in bz#1079763. Creating a new RFE for EAP 6.x, agreed by PM/DEV/QE, would be the way to go if customers want this.

[1] http://documentation-devel.engineering.redhat.com/site/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html-single/Administration_and_Configuration_Guide/index.html#About_High-availability_HA_Failover

Comment 1 Justin Bertram 2014-05-21 15:19:26 UTC
The documentation changes look fine given that only 1 backup is supported currently.

Comment 2 Nichola Moore 2014-05-22 06:11:35 UTC
I have updated topic 4820, revision 642535 to reflect the changes outlined above.

Comment 3 Nichola Moore 2014-05-22 06:16:17 UTC
Hi Miroslav,
I have a question before releasing this bug to the wild. 
In the changes above we say that when the original live server comes back, the backup server automatically goes passive. In the next couple of paragraphs it seems to suggest that you have to kill the new live server when the original one comes back online. Can you have a look at:

http://documentation-devel.engineering.redhat.com/site/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html/Administration_and_Configuration_Guide/About_High-availability_HA_Failover.html

Specifically have a look at the paragraphs that start with 'High availability cluster topology...' and 'After a live server has failed and a backup...'.  Do they make sense in the light of the changes made for this bug?

It's best to get this sorted out now rather than risk having it slip through.

Thanks,

Nichola.

Comment 4 Miroslav Novak 2014-05-22 09:21:11 UTC
Thanks for looking at this. You're right that my second paragraph, with the allow-failback attribute, makes this more complicated to understand as a whole. How about changing it to:

The backup server only takes over if the live server crashes and there is a failover. After the failover, the backup server becomes the new live server. If the old live server is restarted, it becomes a backup for the new live server.

We have to change "hornetq-configuration.xml" to "standalone...xml" in the whole chapter.

#########################################

I also did a review of previous chapters to make all this more clear. 

Chapter - 18.9.3. HornetQ Message Replication

We can't allow random live-backup pairing, which leads to the problem discussed in bz#1079763 (a backup server can become a backup for another backup server rather than for a live server). The backup-group-name attribute must identify the live-backup pair and must be unique. The configuration file name is also wrong: hornetq-configuration.xml should be standalone...xml.

I suggest this change:

How a backup server looks for a live server to replicate data from depends on whether the backup-group-name parameter has been defined in the hornetq-configuration.xml file. A backup server will only connect to a live server that shares the same group name. In the absence of this parameter, a backup server will try and connect to any live server. 

to something like: 
How a backup server looks for a live server to replicate data from depends on whether the backup-group-name parameter has been defined in the standalone...xml file. This parameter must be defined in the configuration of both the live and the backup server. A backup server will only connect to a live server that shares the same group name.
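
To illustrate the pairing rule, here is a minimal sketch of two independent live-backup pairs. The group names pairA and pairB are made up for the example, and wrapping the settings in a hornetq-server element is an assumption about where they sit in each server's configuration file; only the backup-group-name and backup settings themselves come from this bug.

```xml
<!-- Pair A, live server (group name "pairA" is a hypothetical example) -->
<hornetq-server>
   <backup-group-name>pairA</backup-group-name>
</hornetq-server>

<!-- Pair A, backup server: same group name as its live server -->
<hornetq-server>
   <backup>true</backup>
   <backup-group-name>pairA</backup-group-name>
</hornetq-server>

<!-- Pair B, live server: a different, unique group name -->
<hornetq-server>
   <backup-group-name>pairB</backup-group-name>
</hornetq-server>

<!-- Pair B, backup server: matches pair B only -->
<hornetq-server>
   <backup>true</backup>
   <backup-group-name>pairB</backup-group-name>
</hornetq-server>
```

Each fragment belongs to a different server's configuration file. Because every group name is shared by exactly one live and one backup, the pair B backup has no reason to attach to the pair A live server.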

##########################################

Chapter 18.9.4. Configuring the HornetQ Servers for Replication

Change the chapter to something like this:
To configure the live and backup servers to be a replicating pair, configure both standalone...xml files to have: 

<shared-store>false</shared-store>
<backup-group-name>nameOfLiveBackupPair</backup-group-name>
<check-for-live-server>true</check-for-live-server>
.
.
.
<cluster-connections>
   <cluster-connection name="my-cluster">
      ...
   </cluster-connection>
</cluster-connections>

Attributes:
shared-store -> "Whether this server is using shared store or not. Default is false."
backup-group-name -> "The name of live/backup pair that should replicate with each other"
check-for-live-server -> "If a replicated live server should check the current cluster to see if there is already a live server with the same node id"
failover-on-shutdown -> "Whether this backup server (if it is a backup server) should come live on a normal server shutdown. This must be specified on both of the servers."

The backup server must also be flagged explicitly as a backup:

<backup>true</backup>

Attributes:
allow-failback -> "Whether this server will automatically shutdown if the original live server comes back up."
max-saved-replicated-journal-size -> "The maximum number of backup journals to keep after failback occurs. This needs to be specified only if allow-failback is true. The default value is 2, which means that after 2 failbacks the backup server must be restarted in order to be able to replicate the journal from the live server and become a backup again."
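
Putting the settings above together, the backup server's side of a replicating pair might look roughly like the sketch below. This assumes each attribute maps to a child element of the same name inside a hornetq-server element, as the elements quoted earlier do. The values chosen for failover-on-shutdown and allow-failback are illustrative, not recommendations.

```xml
<!-- Backup server of a replicating pair: illustrative sketch only -->
<hornetq-server>
   <!-- Flag this server explicitly as the backup -->
   <backup>true</backup>
   <!-- Replication rather than a shared store -->
   <shared-store>false</shared-store>
   <!-- Must match the value configured on the live server -->
   <backup-group-name>nameOfLiveBackupPair</backup-group-name>
   <check-for-live-server>true</check-for-live-server>
   <!-- Example value; must be the same on both servers -->
   <failover-on-shutdown>true</failover-on-shutdown>
   <!-- Step down automatically when the original live server returns -->
   <allow-failback>true</allow-failback>
   <!-- Only relevant when allow-failback is true; 2 is the stated default -->
   <max-saved-replicated-journal-size>2</max-saved-replicated-journal-size>
   <!-- cluster-connections omitted; see the fragment above -->
</hornetq-server>
```

The live server would carry the same shared-store, backup-group-name, check-for-live-server and failover-on-shutdown values, without the backup flag.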

Comment 5 Nichola Moore 2014-05-23 05:32:10 UTC
Hi Miroslav,
It's great to pick up these issues, so thanks for the feedback.

I've done most of this, but I have some questions.

1. In About High-availability (HA) Failover, at first we put in that allow-failback = 'true' had to be set to make a live server go live again. Is this still the case? If so I think it should be left in, don't you?

2.  Are all these instructions only for standalone servers? I want to make referring to the configuration file a bit neater, as there are a few for standalone servers. Can you please let me know so I can handle the configuration file names a bit more elegantly. 

3. Whilst I'm here, and if I can't find this info in the HornetQ documentation, could you please provide default values for the attributes listed above. Some have them and some don't. It would be good to be consistent.

Just to let you know, I'm on a course next week, so you may not hear from me until the week after. I have taken bz#1079763 and I'll fix that up too if it's not addressed by the changes for this bug.

Cheers,

Nichola

Comment 6 Miroslav Novak 2014-05-26 07:17:58 UTC
Hi Nichola,

- 1. Yes, it's still the case. We can leave allow-failback in. 

- 2. We have 3 configuration files by default: standalone.xml, standalone-full.xml and standalone-full-ha.xml. HornetQ is configured only in standalone-full.xml and standalone-full-ha.xml. 
I suggest naming it standalone-...xml or, if we want to be consistent across all "HA" chapters, standalone-full-ha.xml. We could say at the beginning that the following changes are for standalone-full-ha.xml, as all the configurations are derived from it.

- 3. I've attached the hornetq_subystem_attributes.txt file. It contains all attributes (with default values) for HornetQ. 

See you next week,
Mirek

Comment 7 Miroslav Novak 2014-05-26 07:18:37 UTC
Created attachment 899198 [details]
hornetq_subystem_attributes.txt

Comment 8 Nichola Moore 2014-06-05 01:56:23 UTC
Updated topics

13565 revision 651596
13566 revision 651594

to change standalone.xml to standalone-X.xml. I also added some default values to 13566.

Comment 10 Miroslav Novak 2014-06-05 08:44:40 UTC
Hi Nichola,

First, congrats on the course :-)

We can change it to:
The backup server will activate only if the live server has failed and the backup server is able to connect to more than half of the servers within the cluster.

The first condition, "live has failed", is clear: the backup has to activate when the live server has crashed. "Activate" means that the backup will start and open the HornetQ ports (5445 and some others; the ports are defined in the acceptors). After activation, all clients can fail over from live to backup.
The second condition is meant to prevent split-brain syndrome. The backup can lose its network connection for some reason, for example because someone unplugged the network cable. In this case we don't want the backup to activate, because once the network cable is reconnected the live and the backup would both be active at the same time.

Yes, the paragraph is somewhat odd. Feel free to change it.

Thanks,
Mirek

Comment 11 Nichola Moore 2014-06-06 00:12:02 UTC
Updated topic 13565 revision 658944 with the information graciously provided above.

Comment 13 Miroslav Novak 2014-06-16 10:47:44 UTC
Nice work Nichola! Setting as verified.

Comment 14 Nichola Moore 2014-07-21 04:47:27 UTC
*** Bug 1090420 has been marked as a duplicate of this bug. ***

