Bug 625540
Summary: cluster safe assertion from within qpid::broker::SemanticState::attached()
Product: Red Hat Enterprise MRG
Reporter: Gordon Sim <gsim>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED NOTABUG
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium
Priority: high
Version: beta
CC: aconway, ppecka
Target Milestone: 1.4
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-08-23 21:07:43 UTC

Description
Gordon Sim
2010-08-19 18:48:45 UTC
Reproducer not yet known; stack traces reported as above, however against the first beta packages.

The stack trace shows that the broker is connecting to give an update to a new broker while at the same time it is receiving an update connection. I don't know how this could come about. Are there any log files available, or a description of what was happening at the time of the core dump? If there's no more information available, it would make sense for a broker to refuse the erroneous update connection in this situation. I can make that change.

Created attachment 440424
broker log file 1
Created attachment 440425
broker log file 2
Created attachment 440426
client log file 1
Created attachment 440427
client log file 2

The reproducer is as attached here: https://bugzilla.redhat.com/attachment.cgi?id=436252

Additionally, the following information was given by the user:

"This client run was using the 'failover_exchange' cluster failover method. Just to confirm, the sequence of events for these logs was as follows:
- start broker 1
- start broker 2
- start client app, confirm messages are being sent/received (the 'Sent:/got message' logging)
- stop broker 1
- broker 2 becomes primary, app successfully fails over to broker 2 and continues sending/receiving
- start broker 1. Note I did not clean broker 1's state files before starting - I don't believe this should be necessary though.

At this point broker 2 crashes with the attached backtrace, and the client comes to a halt. I took a number of thread dumps at different points in the lifecycle, which are all in the client log file (pubsubtest.out.0.tgz). Eventually the client gives up and exits. Broker 1 appears to start, but is unusable:
- tried to re-run the test client app. It is unable to connect to broker 1. Client logs inc. stack traces attached. (pubsubtest.out.1.tgz)
- manually stop broker 1."

The brokers are, I believe, running in VMs; not sure if that is relevant.

It appears that the broker's cluster-url has been specified incorrectly. From the log:

2010-08-23 12:11:31 notice cluster(10.34.22.65:8356 UPDATER) sending update to 10.34.22.64:16813 at amqp:tcp:10.34.22.65:5678,tcp:10.34.22.64:5678

Note that the URL includes _both_ brokers' addresses. This is causing the updater broker to connect to itself rather than to the new broker, causing the crash. Each broker's cluster-url should specify only the address(es) of that one broker. The brokers collaborate to provide the clients with the URLs of all brokers in the cluster.

I will change the broker to refuse the erroneous catch-up connection attempt with a sensible error message rather than crash, and check if the documentation for cluster-url needs clarification.

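To make the misconfiguration concrete, here is a rough sketch of how a cluster-url could be checked for addresses that do not belong to the broker advertising it. It is not qpid code: `hostsInClusterUrl` and `foreignHosts` are hypothetical helpers, and the parser assumes only the URL shape seen in the log line above (an optional `amqp:` prefix followed by comma-separated `tcp:HOST:PORT` entries). It also assumes the logged URL is the one advertised by the joining broker, 10.34.22.64.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of qpid: extract the HOST parts from a URL of
// the form "amqp:tcp:HOST:PORT,tcp:HOST:PORT" (the shape seen in the log).
std::vector<std::string> hostsInClusterUrl(const std::string& url) {
    const std::string scheme = "amqp:";
    std::string rest = url.compare(0, scheme.size(), scheme) == 0
        ? url.substr(scheme.size()) : url;
    std::vector<std::string> hosts;
    std::istringstream in(rest);
    std::string entry;
    while (std::getline(in, entry, ',')) {
        const std::string proto = "tcp:";
        if (entry.compare(0, proto.size(), proto) == 0)
            entry = entry.substr(proto.size());
        // Drop the trailing ":PORT" if there is one.
        hosts.push_back(entry.substr(0, entry.rfind(':')));
    }
    return hosts;
}

// Hypothetical check: return every host in the URL that is not one of the
// advertising broker's own addresses. A non-empty result indicates the kind
// of misconfiguration described in this bug.
std::vector<std::string> foreignHosts(const std::string& url,
                                      const std::set<std::string>& ownHosts) {
    std::vector<std::string> foreign;
    for (const std::string& host : hostsInClusterUrl(url))
        if (ownHosts.count(host) == 0) foreign.push_back(host);
    return foreign;
}
```

Called with the URL from the log and 10.34.22.64 as the only own address, `foreignHosts` would report 10.34.22.65, the updater's address that should not have been listed.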
I've moved the bug to 1.4; I don't think it's critical for the 1.3 release.

Additional error checking committed on trunk, r988312:

Check for and abort invalid catchup connections. Detect attempt to make a catch-up connection while we are not expecting an update.

*** Bug 634168 has been marked as a duplicate of this bug. ***
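
For readers unfamiliar with the fix, the idea of refusing a catch-up connection when no update is expected can be pictured roughly as below. This is only an illustrative sketch, not the code committed in r988312: the `MemberState` enum, its values, and `checkCatchUpConnection` are hypothetical names, and the real cluster code has its own state machine and error handling.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical member states; the real qpid cluster code defines its own.
enum class MemberState { Joining, Updater, Ready };

// Hypothetical guard: a catch-up (update) connection is acceptable only while
// this broker is joining the cluster and waiting to receive an update.
// Anything else is rejected with an error instead of tripping an assertion.
void checkCatchUpConnection(MemberState state, const std::string& peer) {
    if (state != MemberState::Joining) {
        throw std::runtime_error(
            "Rejecting catch-up connection from " + peer +
            ": not expecting an update (check cluster-url)");
    }
}
```

In the scenario from this bug, the updater would be in the `Updater` state when its own misdirected connection arrives, so the guard would throw and the connection would be dropped with a readable message.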