Bug 1013536 - Reset HornetQ backup after failback
Summary: Reset HornetQ backup after failback
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ER6
Target Release: EAP 6.2.0
Assignee: Miroslav Novak
QA Contact: Miroslav Novak
Docs Contact: Russell Dickenson
URL:
Whiteboard:
Depends On: 1016141
Blocks:
 
Reported: 2013-09-30 10:10 UTC by Miroslav Novak
Modified: 2019-06-13 07:55 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-12-15 16:16:07 UTC
Type: Enhancement
Embargoed:



Description Miroslav Novak 2013-09-30 10:10:25 UTC
Description of problem:
At the moment we're forced to restart the EAP 6 server hosting the HQ backup after failback so that failover can happen again.

Imagine the following scenario:
1. There are 2 EAP 6.x servers in a collocated HA topology with a replicated journal
2. Kill the first server -> failover occurs
3. Start the first server -> failback occurs
4. Now the HQ backup on the 2nd server must be restarted, otherwise failover can't happen again

Log from the backup server on the 2nd server:
11:59:33,176 WARN  [org.hornetq.core.server] (Thread-97) HQ222163: Server is being completely stopped, since this was a replicated backup there may be journal files that need cleaning up. The HornetQ server will have to be manually restarted.

We cannot force users to restart the 2nd EAP 6.x server just because HornetQ needs it.

Expected results:
I'm not sure what the reason was for requiring a restart of the backup server after failback; we can discuss the best solution.
The restart of the backup server could happen automatically, or we could add a CLI operation for admins to trigger it.

Comment 1 Clebert Suconic 2013-09-30 14:34:42 UTC
I agree it's an issue... but I would do it after 5.2.0..

Can we schedule it accordingly?

Comment 2 Miroslav Novak 2013-09-30 15:53:02 UTC
From QE point of view this issue can negatively impact the customer production environment and break their HA. Restarting 2nd server would actually force restart of 1st server (and so on). We'd like to see the enhancement very soon.

I'm adding John Doyle with "need info" so you can plan best dates.

Comment 3 Clebert Suconic 2013-09-30 17:04:29 UTC
We are definitely going to fix it. It's not a new issue; it has behaved this way since HornetQ 2.3.0 and is actually working as designed, but I agree it's better for the backup to restart itself after failback.


It's just that... does it require a release now?

If you think it's that important we will do it. It's just that it only affects users doing failover with replication, and only at failback time, so I consider it to be a minor issue.


Usually users doing dedicated failover are running WildFly just to support HornetQ, so I don't think it would be a big issue to restart the server.


Again: we will fix it. I'm just asking about when you require it.

Comment 4 Miroslav Novak 2013-10-01 08:19:55 UTC
Thanks for your feedback. From QE point of view, we'd like to see the enhancement very soon, so possibly in EAP 6.2.0. But I understand you need to plan according to priorities, resources and PM.

Comment 5 Clebert Suconic 2013-10-01 14:37:23 UTC
I've asked Andy Taylor to take a look at this.

It was just opened. I think it would take us at least 1 week to resolve it. We can do a release if there's space on the release schedule. We will communicate accordingly within a week.

Comment 6 John Doyle 2013-10-01 15:49:07 UTC
It would have to be a low risk change for me to approve it for 6.2.  We only have 2 builds remaining after Beta and we cannot endanger the date.

I do agree that it's better that it restarts.

Comment 7 Clebert Suconic 2013-10-01 15:52:02 UTC
@John, we will be working on it anyway. We are only committing low-risk changes on 2.3.x after all, but I would prefer not having to do a release of HornetQ for 6.2. If we have to do it, we will be ready.

Comment 8 Andy Taylor 2013-10-02 09:15:26 UTC
The reason we don't restart is so we don't delete the journal. Deleting a journal should never be done automatically as it may be needed by admins; this should always be a manual task. So the solutions are:

1. Move the journal and start the server using the CLI; this can be done now as far as I am aware.

2. We move the journal to another directory and restart. The issue here is file space, although we could have a max-replicated-journal-cache-size or something.

Is there any reason 1 isn't usable?
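Option 1 above could look roughly like the following. This is only a sketch: the journal paths, the server profile, and the use of `:reload` are assumptions for illustration, not steps taken from this bug.

```
# After failback, move the stale replicated journal aside (paths are assumptions)
mv $JBOSS_HOME/standalone/data/messagingjournal \
   $JBOSS_HOME/standalone/data/messagingjournal.saved

# Then bring the backup HornetQ server back up via the management CLI.
# EAP 6 has no per-subsystem restart, so :reload restarts all services:
$JBOSS_HOME/bin/jboss-cli.sh --connect --command=":reload"
```

As comment 12 later notes, the lack of a CLI operation that restarts only the HornetQ subsystem is the main limitation of this approach.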

Comment 9 John Doyle 2013-10-02 09:45:02 UTC
If proposal 1 can be done now (passes testing), we should document it as a supported solution for the near term.  Who can validate this and provide detailed steps for Doc?

Comment 10 Clebert Suconic 2013-10-02 12:01:12 UTC
@John: Actually, this is how replication is supposed to work. We can't simply remove the data, and moving it to another directory would make the user leak files on their file system.


We could maybe add an INFO-level log message and document it:


Something like: "Server is failing back, you need to save or remove your previous journal files"


We can't remove the journal automatically as we don't want to risk losing data, say if something happens after failback.

Comment 11 Andy Taylor 2013-10-02 12:27:43 UTC
If 1 doesn't suit, I can do 2 but with a default max size of, say, 2, so after 2 failbacks the server won't restart, but that should give users enough time to clean up. Or they can change the default to -1 if they aren't bothered about file space.

Comment 12 Miroslav Novak 2013-10-03 06:49:50 UTC
Regarding 1, I could not find a CLI operation which would restart only the HornetQ subsystem. (There is the :reload() operation, but it restarts all the services of the EAP server.) This option seems to be the safest for EAP 6.2.

Option 2 looks more enterprise-ready because it's automatic. Andy is right that an option for how many journals to keep is necessary. There could probably also be an option to delete the oldest journal files when a new failback occurs, so that the newest journal files are kept.

Comment 13 Miroslav Novak 2013-10-22 12:10:16 UTC
I've set the max-saved-replicated-journal-size attribute on the backup to 5 and can see that the backup does not need to be restarted for 4 failbacks.

Setting bz as verified in EAP 6.2.0.ER6.

I'll update bz#927867 - "Document: How to configure message replication/shared store for HornetQ in EAP 6.1" - to document this new attribute.
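For reference, setting the attribute via the management CLI might look like the following. The resource address and the server name `backup` are assumptions (they depend on the collocated topology's configuration), and per comment 14 the attribute was not yet exposed by the messaging subsystem at this point:

```
/subsystem=messaging/hornetq-server=backup:write-attribute( \
    name=max-saved-replicated-journal-size, value=5)
:reload
```

The `:reload` is needed because journal-related attributes only take effect after the server configuration is reloaded.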

Comment 14 Justin Bertram 2013-10-25 21:29:38 UTC
FYI - Andy's fix added support for <max-saved-replicated-journals-size> to HornetQ, but from what I can see the messaging subsystem in EAP 6 was *not* updated to support this new parameter.  Therefore, users will be stuck with the default value for max-saved-replicated-journals-size (i.e. 2) until the messaging subsystem is updated appropriately.

Comment 15 Miroslav Novak 2013-10-26 06:20:51 UTC
I've set the attribute using the CLI, but you're right that it's missing in the XSD schema. This should be fixed. I'll set the bz back to assigned.

Comment 16 Miroslav Novak 2013-10-26 06:22:17 UTC
@Jeff
Can you take a look at this, please?

Comment 18 Clebert Suconic 2013-12-03 19:26:54 UTC
I'm setting this back to ON_QA as you need to bump the schema version in your test.

Comment 19 Martin Svehla 2013-12-05 14:12:48 UTC
<max-saved-replicated-journals-size> in config schema 1.4 works as expected in EAP 6.2.0.GA. Thanks guys
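In XML form, the verified configuration might look like this sketch. The element name and schema version are taken from this bug; the namespace URI, the server name, and the surrounding elements are assumptions for illustration:

```xml
<subsystem xmlns="urn:jboss:domain:messaging:1.4">
    <hornetq-server name="backup">
        <!-- replicated (non-shared-store) backup -->
        <backup>true</backup>
        <shared-store>false</shared-store>
        <!-- keep up to 5 saved journals, so up to 5 failbacks
             before a manual restart/cleanup is required -->
        <max-saved-replicated-journals-size>5</max-saved-replicated-journals-size>
        <!-- connectors, acceptors, addresses elided -->
    </hornetq-server>
</subsystem>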

