There is misleading information in chapter 18.8.5, About High-availability (HA) Failover, in the Administration and Configuration Guide [1]. Only one backup is supported.

Change:

High-availability failover is available with either automatic client failover, or application-level failover, through a live-backup structure. Each live server has a backup server, which can also be backed up by as many servers as necessary. The backup server only takes over if the live server crashes and there is a failover. Simultaneously, one of the secondary backup servers takes over as the passive backup server for the new live server. After the failover, and after the former live server has been restarted, it becomes a secondary backup server, or the backup server if there are only two.

to:

High-availability failover is available with either automatic client failover, or application-level failover, through a live-backup structure. Each live server has a backup server. Only one backup server per live server is supported. The backup server only activates if the live server crashes. After the live server has been restarted, it becomes the live server again if and only if the attribute "allow-failback" is set to true. In this case the backup automatically becomes passive again.

Remove the "Important" note:

A shared file-system directory is required in order for the backup server to send/receive messages in response to the messages received by the previous live server.

################################

Adding GSS and the HornetQ dev team to cc to review. If we're going to support more than one backup, we'll need to write new tests for this. There are some tricky scenarios, as described in bz#1079763. Creating a new RFE for EAP 6.x, agreed on by PM/DEV/QE, would be the way to go in case customers want this.

[1] http://documentation-devel.engineering.redhat.com/site/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html-single/Administration_and_Configuration_Guide/index.html#About_High-availability_HA_Failover
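For reference, the failback behavior proposed above would look roughly like this in the backup server's messaging subsystem configuration. This is only a sketch to illustrate where "allow-failback" sits; the element values are illustrative and the exact placement within the hornetq-server element may differ in a real standalone-full-ha.xml:

```xml
<!-- Sketch of a backup server's HornetQ configuration (EAP 6 messaging subsystem).
     "allow-failback" controls whether this backup shuts down and becomes
     passive again when the original live server comes back. -->
<hornetq-server>
    <backup>true</backup>
    <allow-failback>true</allow-failback>
    <!-- ... acceptors, connectors, etc. ... -->
</hornetq-server>
```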
The documentation changes look fine given that only 1 backup is supported currently.
I have updated topic 4820, revision 642535 to reflect the changes outlined above.
Hi Miroslav, I have a question before releasing this bug to the wild. In the changes above we say that when the original live server comes back, the backup server automatically goes passive. In the next couple of paragraphs it seems to suggest that you have to kill the new live server when the original one comes back online. Can you have a look at: http://documentation-devel.engineering.redhat.com/site/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html/Administration_and_Configuration_Guide/About_High-availability_HA_Failover.html Specifically have a look at the paragraphs that start with 'High availability cluster topology...' and 'After a live server has failed and a backup...'. Do they make sense in the light of the changes made for this bug? It's best to get this sorted out now rather than risk having it slip through. Thanks, Nichola.
Thanks for looking at this. You're right that my second paragraph, with the allow-failback attribute, makes this more complicated to understand as a whole. What about changing it to:

The backup server only takes over if the live server crashes and there is a failover. After the failover the backup server becomes the new live server. After the old live server has been restarted, it becomes a backup for the new live server.

We also have to change "hornetq-configuration.xml" to "standalone...xml" throughout the whole chapter.

#########################################

I also did a review of the previous chapters to make all this clearer.

Chapter 18.9.3. HornetQ Message Replication

We can't allow random live-backup pairing, which leads to the problem discussed in bz#1079763 (a backup server can become the backup for another backup server instead of for a live server). The backup-group-name attribute must identify a live-backup pair and must be unique. There is also a wrong configuration file name (hornetq-configuration.xml should be standalone...xml). I suggest changing this:

How a backup server looks for a live server to replicate data from depends on whether the backup-group-name parameter has been defined in the hornetq-configuration.xml file. A backup server will only connect to a live server that shares the same group name. In the absence of this parameter, a backup server will try and connect to any live server.

to something like:

How a backup server looks for a live server to replicate data from depends on whether the backup-group-name parameter has been defined in the standalone...xml file. This parameter must be defined in the configuration of both the live and the backup server. A backup server will only connect to a live server that shares the same group name.

##########################################

Chapter 18.9.4.
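To illustrate the pairing rule suggested above, both servers in a live-backup pair would carry the same backup-group-name. This is a sketch only; "pair-A" is an arbitrary illustrative group name, not a required value:

```xml
<!-- Live server (sketch): declares the pair it belongs to -->
<hornetq-server>
    <backup>false</backup>
    <backup-group-name>pair-A</backup-group-name>
</hornetq-server>

<!-- Backup server (sketch): must use the same group name, so it can only
     pair with the live server above and never with another backup -->
<hornetq-server>
    <backup>true</backup>
    <backup-group-name>pair-A</backup-group-name>
</hornetq-server>
```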
Configuring the HornetQ Servers for Replication

Change the chapter to something like this:

To configure the live and backup servers as a replicating pair, configure both standalone...xml files to have:

<shared-store>false</shared-store>
<backup-group-name>nameOfLiveBackupPair</backup-group-name>
<check-for-live-server>true</check-for-live-server>
. . .
<cluster-connections>
   <cluster-connection name="my-cluster">
      ...
   </cluster-connection>
</cluster-connections>

Attributes:

shared-store - "Whether this server is using shared store or not. Default is false."
backup-group-name - "The name of the live/backup pair that should replicate with each other."
check-for-live-server - "Whether a replicated live server should check the current cluster to see if there is already a live server with the same node ID."
failover-on-shutdown - "Whether this backup server (if it is a backup server) should become live on a normal server shutdown. This must be specified on both of the servers."

The backup server must also be flagged explicitly as a backup:

<backup>true</backup>

Attributes:

allow-failback - "Whether this server will automatically shut down if the original live server comes back up."
max-saved-replicated-journal-size - "The maximum number of backup journals to keep after failback occurs. This needs to be specified only if allow-failback is true. The default value is 2, which means that after 2 failbacks the backup server must be restarted in order to be able to replicate the journal from the live server and become a backup again."
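Putting the pieces above together, a backup server's replication configuration might look roughly like the following. This is a sketch assembled from the fragments quoted in this comment, not a tested configuration; "nameOfLiveBackupPair" and "my-cluster" are placeholder names from the proposed chapter text:

```xml
<!-- Sketch of a backup server configured for replication (standalone-full-ha.xml) -->
<hornetq-server>
    <backup>true</backup>
    <shared-store>false</shared-store>
    <backup-group-name>nameOfLiveBackupPair</backup-group-name>
    <check-for-live-server>true</check-for-live-server>
    <allow-failback>true</allow-failback>
    <!-- Default is 2: after 2 failbacks the backup must be restarted
         before it can replicate the journal and act as a backup again -->
    <max-saved-replicated-journal-size>2</max-saved-replicated-journal-size>
    <cluster-connections>
        <cluster-connection name="my-cluster">
            <!-- ... connector and discovery settings ... -->
        </cluster-connection>
    </cluster-connections>
</hornetq-server>
```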
Hi Miroslav,

It's great to pick up these issues, so thanks for the feedback. I've done most of this, but I have some questions.

1. In About High-availability (HA) Failover, at first we put in that allow-failback = 'true' had to be set to make a live server go live again. Is this still the case? If so, I think it should be left in, don't you?

2. Are all these instructions only for standalone servers? I want to make referring to the configuration file a bit neater, as there are a few for standalone servers. Can you please let me know, so I can handle the configuration file names a bit more elegantly.

3. Whilst I'm here, and if I can't find this info in the HornetQ documentation, could you please provide default values for the attributes listed above? Some have them and some don't. It would be good to be consistent.

Just to let you know, I'm on a course next week, so you may not hear from me until the week after. I have taken bz#1079763 and I'll fix that up too if it's not addressed by the changes for this bug.

Cheers, Nichola
Hi Nichola,

- 1. Yes, it's still the case. We can leave allow-failback in.

- 2. We have 3 configuration files by default: standalone.xml, standalone-full.xml and standalone-full-ha.xml. HornetQ is configured only in standalone-full.xml and standalone-full-ha.xml. I suggest naming it standalone-...xml, or, if we want to be consistent across all the "HA" chapters, standalone-full-ha.xml. We could say at the beginning that the following changes are for standalone-full-ha.xml, as all the configurations are derived from it.

- 3. I've attached the hornetq_subystem_attributes.txt file. It contains all the attributes (with default values) for HornetQ.

See you next week, Mirek
Created attachment 899198 [details] hornetq_subystem_attributes.txt
Updated topics 13565 (revision 651596) and 13566 (revision 651594) to change standalone.xml to standalone-X.xml. I also added some default values to 13566.
Hi Nichola, first of all, good luck with the course :-)

We can change it to:

The backup server will activate only if the live server has failed and the backup server is able to connect to more than half of the servers within the cluster.

The first condition, "the live server has failed", is clear: the backup has to activate when the live server crashes. Here "activate" means that the backup will start and open its HornetQ ports (5445 and some others; those ports are defined in the acceptors). After activation, all clients can fail over from the live server to the backup.

The second condition is there to prevent split-brain syndrome. It can happen that for some reason the backup loses its connection to the network, for example because someone unplugged a network cable. In this case we don't want the backup to activate, because after the network cable is reconnected there would be both a live and a backup server active at the same time.

Yes, the paragraph is somewhat odd. Feel free to change it.

Thanks, Mirek
Updated topic 13565 revision 658944 with the information graciously provided above.
This can be verified on DocStage here: http://documentation-devel.engineering.redhat.com/site/documentation/en-US/JBoss_Enterprise_Application_Platform/6.3/html-single/Administration_and_Configuration_Guide/index.html#HornetQ_Message_Replication
Nice work Nichola! Setting as verified.
*** Bug 1090420 has been marked as a duplicate of this bug. ***