Bug 982553 - Qpid don't initiate a failover on federation link when source broker don't respond but tcp is established
Summary: Qpid don't initiate a failover on federation link when source broker don't re...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.4
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: ---
Assignee: Andrew Stitcher
QA Contact: Messaging QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2013-07-09 09:41 UTC by Zdenek Kraus
Modified: 2020-05-22 14:35 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-22 14:35:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Zdenek Kraus 2013-07-09 09:41:37 UTC
Description of problem:
When source broker fails but the network/transport layer is still "operational" thus LISTENing for connections and accepting them gracefully, so the connection get ESTABLISHED on both sides, but the application layer do not respond at all, the destination broker after heartbeat timeouted, do not initiate failover procedure, but make new connections to same destination (source broker). Additionally the link state remains Operational.

Version-Release number of selected component (if applicable):
qpid-cpp-0.22-6

How reproducible:
100%

Steps to Reproduce:
1. start two brokers
2. create link between them
3. check the link's connection source port
3. SIGSTOP the source broker
4. wait until hearbeat time out
5. observe netstat that link's connection has changed

Actual results:
Destination broker do not initiate failover procedure and link state remains operational.

Expected results:
Destination broker initiate failover procedure after heartbeat time out, and makes the link state Waiting (or somehing else than Operational)

Additional info:

### NOTES
Source broker      port=10000
Destination broker port=20000

### Destination Broker's log:
2013-07-09 11:29:35 [Broker] error Federation link connection qpid.tcp:10.34.27.65:10000 missed 2 heartbeats - closing connection
2013-07-09 11:29:35 [Broker] info Inter-broker link disconnected from 10.34.27.65:10000 Closed by peer
2013-07-09 11:29:37 [System] info Connecting: 10.34.27.65:10000
2013-07-09 11:29:37 [Broker] info Inter-broker link established to 10.34.27.65:10000
2013-07-09 11:29:37 [Broker] info Inter-broker link connecting to 10.34.27.65:10000
2013-07-09 11:29:47 [System] error Connection qpid.tcp:10.34.27.65:10000 No protocol received closing
2013-07-09 11:29:47 [Broker] info Inter-broker link disconnected from 10.34.27.65:10000 Closed by peer
2013-07-09 11:29:49 [System] info Connecting: 10.34.27.65:10000
2013-07-09 11:29:49 [Broker] info Inter-broker link established to 10.34.27.65:10000
2013-07-09 11:29:49 [Broker] info Inter-broker link connecting to 10.34.27.65:10000


### Netstat Pre-stop
tcp        0      0 0.0.0.0:10000               0.0.0.0:*                   LISTEN      17639/qpidd         
tcp        0      0 0.0.0.0:20000               0.0.0.0:*                   LISTEN      17640/qpidd         
tcp        0      0 10.34.27.65:10000           10.34.27.65:36943           ESTABLISHED 17639/qpidd         
tcp        0      0 10.34.27.65:36943           10.34.27.65:10000           ESTABLISHED 17640/qpidd         
tcp        0      0 :::10000                    :::*                        LISTEN      17639/qpidd         
tcp        0      0 :::20000                    :::*                        LISTEN      17640/qpidd  

### Netstat post-timeout
tcp        0      0 0.0.0.0:10000               0.0.0.0:*                   LISTEN      17639/qpidd         
tcp        0      0 0.0.0.0:20000               0.0.0.0:*                   LISTEN      17640/qpidd         
tcp        9      0 10.34.27.65:10000           10.34.27.65:36950           CLOSE_WAIT  -                   
tcp        0      0 10.34.27.65:36951           10.34.27.65:10000           ESTABLISHED 17640/qpidd         
tcp        8      0 10.34.27.65:10000           10.34.27.65:36951           ESTABLISHED -                   
tcp        0      0 10.34.27.65:36950           10.34.27.65:10000           FIN_WAIT2   -                   
tcp        1      0 10.34.27.65:10000           10.34.27.65:36943           CLOSE_WAIT  17639/qpidd         
tcp        0      0 10.34.27.65:36943           10.34.27.65:10000           FIN_WAIT2   -                   
tcp        0      0 :::10000                    :::*                        LISTEN      17639/qpidd         
tcp        0      0 :::20000                    :::*                        LISTEN      17640/qpidd  
 
### Even further Netstat
tcp        0      0 0.0.0.0:10000               0.0.0.0:*                   LISTEN      17639/qpidd         
tcp        0      0 10.34.27.65:10000           10.34.27.65:36963           SYN_RECV    -                   
tcp        0      0 10.34.27.65:10000           10.34.27.65:36962           SYN_RECV    -                   
tcp        0      0 0.0.0.0:20000               0.0.0.0:*                   LISTEN      17640/qpidd         
tcp        0      0 10.34.27.65:36961           10.34.27.65:10000           FIN_WAIT2   -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36956           CLOSE_WAIT  -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36961           CLOSE_WAIT  -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36952           CLOSE_WAIT  -                   
tcp        0      0 10.34.27.65:36959           10.34.27.65:10000           FIN_WAIT2   -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36950           CLOSE_WAIT  -                   
tcp        0      8 10.34.27.65:36963           10.34.27.65:10000           ESTABLISHED 17640/qpidd         
tcp        0      9 10.34.27.65:36962           10.34.27.65:10000           FIN_WAIT1   -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36958           CLOSE_WAIT  -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36960           CLOSE_WAIT  -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36959           CLOSE_WAIT  -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36955           CLOSE_WAIT  -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36957           CLOSE_WAIT  -                   
tcp        9      0 10.34.27.65:10000           10.34.27.65:36951           CLOSE_WAIT  -                   
tcp        1      0 10.34.27.65:10000           10.34.27.65:36943           CLOSE_WAIT  17639/qpidd         
tcp        9      0 10.34.27.65:10000           10.34.27.65:36954           CLOSE_WAIT  -                   
tcp        0      0 10.34.27.65:36960           10.34.27.65:10000           FIN_WAIT2   -                   
tcp        0      0 :::10000                    :::*                        LISTEN      17639/qpidd         
tcp        0      0 :::20000                    :::*                        LISTEN      17640/qpidd

Comment 1 Pavel Moravec 2015-07-04 14:09:47 UTC
Hi Zdenek,
Isn't it fixed by bz1094781 (in/since qpid-cpp-0.18-23) / bz1097559 (in/since 0.22-42) ?

Comment 2 Zdenek Kraus 2015-08-03 11:57:33 UTC
Pavel,

 sorry for delayed response.

tested with qpid-cpp-0.30-8

the broker do a "failover" but since in the scenario is only one source-broker, it is trying that one, which it is correct behaviour I think. the problem in this scenarion is that route state within that link is still treated as Operatinoal even though it can't connect to a source broker. 


# qpid-route link list guest/guest.37.193:5672

Host            Port    Transport Durable  State             Last Error
=============================================================================
10.34.37.192    5672    tcp          N     Operational       Closed by peer


also I hit one moment when state was Waiting, then it jumped back to Operational, while source broker was SIGSTOPped.


so after SIGCONT the source broker, link is apparently established, and qpid-route still show Operational and last error closed by peer, so I would say, it's only matter of confusion in QMF. I can confirm that after source broker is enabled to run again and link is established, messages are passing through it.

As I'm reviewing this issue, I now think it could be only problem in qpid-route. I'm also for lowering the severity of this issue.


Note You need to log in before you can comment on or make changes to this bug.