Bug 1023325
| Summary: | Lost messages after fail in cluster with message grouping | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Miroslav Novak <mnovak> |
| Component: | HornetQ | Assignee: | Andy Taylor <ataylor> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Miroslav Novak <mnovak> |
| Severity: | high | Docs Contact: | Russell Dickenson <rdickens> |
| Priority: | unspecified | | |
| Version: | 6.2.0 | CC: | ataylor, csuconic, jbertram, jdoyle, jmesnil, lcosti, msvehla |
| Target Milestone: | CR1 | Flags: | rdickens: needinfo- |
| Target Release: | EAP 6.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| URL: | https://c.na7.visual.force.com/apex/Case_View?id=500A000000FWZosIAH&sfdc.override=1 | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: |
Cause:
In a clustered HornetQ environment, a failure could occur in which the node with the local grouping handler went offline before the other nodes. When the node with the local grouping handler came back online, it could attempt to distribute messages to nodes that were still starting up or were still offline.
Consequence:
This situation could result in lost messages.
Fix:
* On restart or failover, new checks have been added to make sure the cluster has had time to initialize. If the bindings have not been added within this configurable wait time, the group id bindings are removed. The wait time is configurable on the grouping handler via the 'timeout' property.
* There is now a 'reaper' which runs on the local grouping handler node and periodically cleans out the bindings journal. It is configurable on the local grouping handler with the following properties (see the sketch after this table):
** `group-timeout`: the time in milliseconds that a group id will be bound to a node (default -1, meaning never expire).
** `reaper-period`: how often, in milliseconds, the reaper thread runs (default 30000).
Result:
These fixes ensure that messages are not lost following a failure involving the local grouping handler in a clustered HornetQ environment.
|
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-12-15 16:14:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1026553 | | |
| Bug Blocks: | | | |
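To make the reaper behaviour described in the Doc Text easier to picture, here is a minimal, hypothetical Java sketch (not HornetQ's actual implementation; the class and member names `GroupBindingReaperSketch`, `lastUsed`, `touch` and `reap` are invented for illustration). It models the two documented properties: bindings idle for longer than `group-timeout` are dropped, and the check runs every `reaper-period` milliseconds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; names and structure are hypothetical,
// not HornetQ's real local grouping handler.
public class GroupBindingReaperSketch {

    // group id -> timestamp (ms) of the last time the binding was used
    private final Map<String, Long> lastUsed = new ConcurrentHashMap<>();

    private final long groupTimeout;  // 'group-timeout': -1 means bindings never expire
    private final long reaperPeriod;  // 'reaper-period': how often the reaper runs, default 30000 ms

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public GroupBindingReaperSketch(long groupTimeout, long reaperPeriod) {
        this.groupTimeout = groupTimeout;
        this.reaperPeriod = reaperPeriod;
    }

    public void start() {
        if (groupTimeout < 0) {
            return; // default -1: group ids stay bound to their node forever
        }
        scheduler.scheduleAtFixedRate(this::reap, reaperPeriod, reaperPeriod, TimeUnit.MILLISECONDS);
    }

    /** Record that a group id was just routed, keeping its binding alive. */
    public void touch(String groupId) {
        lastUsed.put(groupId, System.currentTimeMillis());
    }

    /** Drop bindings idle longer than group-timeout so the bindings journal cannot grow forever. */
    private void reap() {
        long now = System.currentTimeMillis();
        lastUsed.entrySet().removeIf(e -> now - e.getValue() > groupTimeout);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

With the default `group-timeout` of -1 the reaper never removes anything, matching the documented default; a positive value keeps the bindings journal from growing indefinitely when many short-lived groups are used.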
Description
Miroslav Novak
2013-10-25 08:17:30 UTC
This has been fixed in https://github.com/hornetq/hornetq/pull/1353 and https://github.com/hornetq/hornetq/pull/1352. There are two parts to the fix:

1. On restart (or failover) I have added a check to make sure the cluster has had time to form. If the bindings have not been added after this configurable timeout, we assume the remote queues were never created or no longer exist, and the group id bindings are removed. The wait time is configurable on the grouping handler via the 'timeout' property.
2. Remote group id bindings were never removed, so if you used a lot of groups that were only used for a short period, the bindings journal would grow indefinitely. I've added a reaper that runs on the local grouping handler node. This again is configurable on the grouping handler via the following properties: a) 'group-timeout', the time in milliseconds that a group id will be bound to a node (default -1, never); b) 'reaper-period', how often in ms the reaper thread should be run (default 30000); c) 'reaper-priority', the thread priority of the reaper thread (default 3). This will all need documenting.

Actually it's only a) and b); ignore 'reaper-priority' as I removed it.

FYI, this fix will require an update of the messaging subsystem to support these new configuration elements. For that reason this won't be able to make 6.2.0. I will probably revert this fix if I have to make a 6.2.0; we can make one into 6.3.0. It could be a one-off for customers who need this, or maybe we play with default values?

As mentioned above, the messaging subsystem still needs to be updated to support these new configuration elements. Moving back to ASSIGNED. (Checked in EAP 6.2.0.CR1.)

Miroslav: we decided to use the default values and system properties. If you really want the updated schema, you could open a BZ to update the schema, as the fix is working fine here. I don't agree with the FailedQA.

Ok, I have nothing against using system properties, but it must be documented. Setting requires_doc_text to ?.

After reading the related support case I'm still not able to understand the test scenario. Could you describe it here, please? Flags: needinfo?(mnovak)

I would like to reproduce the issue and verify the fix manually. Still, I'm not able to understand the test scenario. Can you help me with it, please? Thanks, Mirek

To verify (a minimal client sketch follows at the end of this description):

1. Start the two servers and send some messages.
2. Kill the node with the local grouping handler.
3. Kill the other node.
4. Restart the node with the grouping handler.

The bug was that it would try to distribute messages to the node that had disappeared.

Thanks Andy for the help. I've managed to reproduce the issue with EAP 6.2.0.ER7. When I tried with EAP 6.2.0.CR1, no message was lost and there was no exception as described in the customer ticket. Nice work! Setting as verified for EAP 6.2.0.CR1.

`Requires doc text` flag cleared as it's too late to include this in the JBoss EAP 6.2.0 Release Notes.
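The verification steps above can be driven by a plain JMS client: send a batch of messages that all carry the same `JMSXGroupID` (the standard JMS property HornetQ uses for message grouping), perform the kill/restart sequence by hand between the send and receive runs, then count what arrives. This is a rough sketch only; the lookup names `jms/RemoteConnectionFactory` and `jms/queue/testQueue`, the `guest`/`guest` credentials and the port are assumptions to be adjusted for the actual test setup.

```java
import java.util.Properties;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.Context;
import javax.naming.InitialContext;

// Minimal JMS sketch for the verification scenario above.
// JNDI names, credentials and the port are assumptions; adjust for your setup.
public class GroupedMessageCheck {

    private static final int COUNT = 500;

    public static void main(String[] args) throws Exception {
        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.remote.client.InitialContextFactory");
        env.put(Context.PROVIDER_URL, "remote://localhost:4447"); // assumption: default EAP 6 remote naming port
        Context ctx = new InitialContext(env);

        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory");
        Queue queue = (Queue) ctx.lookup("jms/queue/testQueue");

        Connection connection = cf.createConnection("guest", "guest");
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        if (args.length > 0 && "send".equals(args[0])) {
            // Step 1: send messages that all belong to the same group.
            MessageProducer producer = session.createProducer(queue);
            for (int i = 0; i < COUNT; i++) {
                TextMessage msg = session.createTextMessage("message " + i);
                msg.setStringProperty("JMSXGroupID", "group-1");
                producer.send(msg);
            }
            System.out.println("sent " + COUNT + " grouped messages");
        } else {
            // Steps 2-4 (kill the grouping-handler node, kill the other node,
            // restart the grouping-handler node) are performed manually between
            // the "send" run and this "receive" run.
            // Step 5: count what is left; with the fix, nothing should be missing.
            connection.start();
            MessageConsumer consumer = session.createConsumer(queue);
            int received = 0;
            while (consumer.receive(5000) != null) {
                received++;
            }
            System.out.println("received " + received + " of " + COUNT);
        }

        connection.close();
        ctx.close();
    }
}
```

Run once with the `send` argument before the failover sequence, then again with no argument after the grouping-handler node has been restarted; a received count below the sent count would indicate the lost-message bug.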