Description of problem: When source broker fails but the network/transport layer is still "operational" thus LISTENing for connections and accepting them gracefully, so the connection get ESTABLISHED on both sides, but the application layer do not respond at all, the destination broker after heartbeat timeouted, do not initiate failover procedure, but make new connections to same destination (source broker). Additionally the link state remains Operational. Version-Release number of selected component (if applicable): qpid-cpp-0.22-6 How reproducible: 100% Steps to Reproduce: 1. start two brokers 2. create link between them 3. check the link's connection source port 3. SIGSTOP the source broker 4. wait until hearbeat time out 5. observe netstat that link's connection has changed Actual results: Destination broker do not initiate failover procedure and link state remains operational. Expected results: Destination broker initiate failover procedure after heartbeat time out, and makes the link state Waiting (or somehing else than Operational) Additional info: ### NOTES Source broker port=10000 Destination broker port=20000 ### Destination Broker's log: 2013-07-09 11:29:35 [Broker] error Federation link connection qpid.tcp:10.34.27.65:10000 missed 2 heartbeats - closing connection 2013-07-09 11:29:35 [Broker] info Inter-broker link disconnected from 10.34.27.65:10000 Closed by peer 2013-07-09 11:29:37 [System] info Connecting: 10.34.27.65:10000 2013-07-09 11:29:37 [Broker] info Inter-broker link established to 10.34.27.65:10000 2013-07-09 11:29:37 [Broker] info Inter-broker link connecting to 10.34.27.65:10000 2013-07-09 11:29:47 [System] error Connection qpid.tcp:10.34.27.65:10000 No protocol received closing 2013-07-09 11:29:47 [Broker] info Inter-broker link disconnected from 10.34.27.65:10000 Closed by peer 2013-07-09 11:29:49 [System] info Connecting: 10.34.27.65:10000 2013-07-09 11:29:49 [Broker] info Inter-broker link established to 10.34.27.65:10000 2013-07-09 11:29:49 [Broker] info Inter-broker link connecting to 10.34.27.65:10000 ### Netstat Pre-stop tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 17639/qpidd tcp 0 0 0.0.0.0:20000 0.0.0.0:* LISTEN 17640/qpidd tcp 0 0 10.34.27.65:10000 10.34.27.65:36943 ESTABLISHED 17639/qpidd tcp 0 0 10.34.27.65:36943 10.34.27.65:10000 ESTABLISHED 17640/qpidd tcp 0 0 :::10000 :::* LISTEN 17639/qpidd tcp 0 0 :::20000 :::* LISTEN 17640/qpidd ### Netstat post-timeout tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 17639/qpidd tcp 0 0 0.0.0.0:20000 0.0.0.0:* LISTEN 17640/qpidd tcp 9 0 10.34.27.65:10000 10.34.27.65:36950 CLOSE_WAIT - tcp 0 0 10.34.27.65:36951 10.34.27.65:10000 ESTABLISHED 17640/qpidd tcp 8 0 10.34.27.65:10000 10.34.27.65:36951 ESTABLISHED - tcp 0 0 10.34.27.65:36950 10.34.27.65:10000 FIN_WAIT2 - tcp 1 0 10.34.27.65:10000 10.34.27.65:36943 CLOSE_WAIT 17639/qpidd tcp 0 0 10.34.27.65:36943 10.34.27.65:10000 FIN_WAIT2 - tcp 0 0 :::10000 :::* LISTEN 17639/qpidd tcp 0 0 :::20000 :::* LISTEN 17640/qpidd ### Even further Netstat tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 17639/qpidd tcp 0 0 10.34.27.65:10000 10.34.27.65:36963 SYN_RECV - tcp 0 0 10.34.27.65:10000 10.34.27.65:36962 SYN_RECV - tcp 0 0 0.0.0.0:20000 0.0.0.0:* LISTEN 17640/qpidd tcp 0 0 10.34.27.65:36961 10.34.27.65:10000 FIN_WAIT2 - tcp 9 0 10.34.27.65:10000 10.34.27.65:36956 CLOSE_WAIT - tcp 9 0 10.34.27.65:10000 10.34.27.65:36961 CLOSE_WAIT - tcp 9 0 10.34.27.65:10000 10.34.27.65:36952 CLOSE_WAIT - tcp 0 0 10.34.27.65:36959 10.34.27.65:10000 FIN_WAIT2 - tcp 9 0 10.34.27.65:10000 10.34.27.65:36950 CLOSE_WAIT - tcp 0 8 10.34.27.65:36963 10.34.27.65:10000 ESTABLISHED 17640/qpidd tcp 0 9 10.34.27.65:36962 10.34.27.65:10000 FIN_WAIT1 - tcp 9 0 10.34.27.65:10000 10.34.27.65:36958 CLOSE_WAIT - tcp 9 0 10.34.27.65:10000 10.34.27.65:36960 CLOSE_WAIT - tcp 9 0 10.34.27.65:10000 10.34.27.65:36959 CLOSE_WAIT - tcp 9 0 10.34.27.65:10000 10.34.27.65:36955 CLOSE_WAIT - tcp 9 0 10.34.27.65:10000 10.34.27.65:36957 CLOSE_WAIT - tcp 9 0 10.34.27.65:10000 10.34.27.65:36951 CLOSE_WAIT - tcp 1 0 10.34.27.65:10000 10.34.27.65:36943 CLOSE_WAIT 17639/qpidd tcp 9 0 10.34.27.65:10000 10.34.27.65:36954 CLOSE_WAIT - tcp 0 0 10.34.27.65:36960 10.34.27.65:10000 FIN_WAIT2 - tcp 0 0 :::10000 :::* LISTEN 17639/qpidd tcp 0 0 :::20000 :::* LISTEN 17640/qpidd
Hi Zdenek, Isn't it fixed by bz1094781 (in/since qpid-cpp-0.18-23) / bz1097559 (in/since 0.22-42) ?
Pavel, sorry for delayed response. tested with qpid-cpp-0.30-8 the broker do a "failover" but since in the scenario is only one source-broker, it is trying that one, which it is correct behaviour I think. the problem in this scenarion is that route state within that link is still treated as Operatinoal even though it can't connect to a source broker. # qpid-route link list guest/guest.37.193:5672 Host Port Transport Durable State Last Error ============================================================================= 10.34.37.192 5672 tcp N Operational Closed by peer also I hit one moment when state was Waiting, then it jumped back to Operational, while source broker was SIGSTOPped. so after SIGCONT the source broker, link is apparently established, and qpid-route still show Operational and last error closed by peer, so I would say, it's only matter of confusion in QMF. I can confirm that after source broker is enabled to run again and link is established, messages are passing through it. As I'm reviewing this issue, I now think it could be only problem in qpid-route. I'm also for lowering the severity of this issue.