Bug 1111628

Summary: Lost large message in cluster with network failures
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: Miroslav Novak <mnovak>
Component: HornetQ
Assignee: Clebert Suconic <csuconic>
Status: CLOSED EOL
QA Contact: Miroslav Novak <mnovak>
Severity: high
Priority: unspecified
Version: 6.3.0
CC: ataylor, jbertram, msvehla, myarboro
Target Milestone: ---   
Target Release: EAP 6.4.0   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2019-08-19 12:45:52 UTC
Type: Bug
Attachments: logs.zip

Description Miroslav Novak 2014-06-20 14:39:48 UTC
Created attachment 910795
logs.zip

A large message is lost in a cluster with network failures. This can have a negative impact on customers.

Test scenario:
1. Start 2 servers in an HQ cluster with "queue/InQueue" deployed (reconnect-attempts for the cluster connection is set to -1, i.e. unlimited retries).
2. Start a producer that sends large messages (1 MB) to InQueue on the 1st server.
3. Start a consumer on the 2nd server that reads messages from InQueue.
4. During steps 2 and 3, disconnect the network between the servers, wait 2 minutes, and reconnect.
5. Stop the producer and wait for the consumer to receive all messages.
6. Verify that the number of sent and received messages is equal.
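Step 6 compares only the number of sent and received messages. As a hedged sketch of a stronger check, assuming the producer and consumer each record the _HQ_DUPL_ID of every message they handle (the report does not say the test does this), the comparison can also name the missing message and flag duplicates, since a redelivered message could otherwise offset a lost one in the count. DeliveryCheck and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for step 6: compares the _HQ_DUPL_ID values recorded
// by the producer with those recorded by the consumer.
class DeliveryCheck {

    // IDs that were sent but never received, in send order.
    static List<String> missing(List<String> sentIds, List<String> receivedIds) {
        Set<String> received = new HashSet<>(receivedIds);
        List<String> lost = new ArrayList<>();
        for (String id : sentIds) {
            if (!received.contains(id)) {
                lost.add(id);
            }
        }
        return lost;
    }

    // IDs received more than once; a duplicate can hide a loss from a count-only check.
    static List<String> duplicates(List<String> receivedIds) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String id : receivedIds) {
            if (!seen.add(id)) {
                dups.add(id);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> sent = List.of("id-1", "id-2", "id-3");
        List<String> received = List.of("id-1", "id-3", "id-3");
        // Counts match (3 == 3), yet id-2 was lost and id-3 was delivered twice.
        System.out.println("counts equal: " + (sent.size() == received.size()));
        System.out.println("missing: " + missing(sent, received));
        System.out.println("duplicates: " + duplicates(received));
    }
}
```

With the sample data the count check passes while the ID check reports id-2 as missing, which is exactly the case a count-only assertion cannot see.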

Sometimes 1 large message is missing. I'm attaching logs from the test, server1 and server2.

The ID of the lost large message is ID:96424df7-f870-11e3-a232-dd298b5ff7de, and it has _HQ_DUPL_ID=a4640014-8f4d-4912-9ab2-273b495cccd21403264819220 set.
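The TRACE excerpts quoted in this report show the same message under messageID=3770 on server2 and messageID=2845 on server1; what they share is this _HQ_DUPL_ID (and counter=904). So when searching the attached logs for this message, matching on _HQ_DUPL_ID is more reliable than matching on messageID. A minimal sketch using java.util.regex, assuming the TRACE lines keep the `messageID=...` and `_HQ_DUPL_ID=...` format seen in the excerpts; LogCorrelation and its methods are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for correlating HornetQ TRACE lines across servers.
class LogCorrelation {
    private static final Pattern MESSAGE_ID = Pattern.compile("messageID=(\\d+)");
    private static final Pattern DUPL_ID = Pattern.compile("_HQ_DUPL_ID=([0-9a-f-]+)");

    // First capture group of the first match, if any.
    static Optional<String> first(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    static Optional<String> messageId(String line) { return first(MESSAGE_ID, line); }

    static Optional<String> duplId(String line) { return first(DUPL_ID, line); }

    // True if both lines carry the same _HQ_DUPL_ID.
    static boolean sameMessage(String a, String b) {
        Optional<String> idA = duplId(a);
        return idA.isPresent() && idA.equals(duplId(b));
    }

    public static void main(String[] args) {
        // Shortened copies of the two TRACE lines quoted in this report.
        String server2 = "sendLarge::LargeServerMessage[messageID=3770,priority=4, ... "
                + "_HQ_DUPL_ID=a4640014-8f4d-4912-9ab2-273b495cccd21403264819220, count=903, ...";
        String server1 = "Acking Reference[2845]:RELIABLE:LargeServerMessage[messageID=2845, ... "
                + "_HQ_DUPL_ID=a4640014-8f4d-4912-9ab2-273b495cccd21403264819220, color=RED, ...";
        System.out.println(messageId(server2).get() + " vs " + messageId(server1).get());
        System.out.println("same message: " + sameMessage(server1, server2));
    }
}
```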

I can see that server1 acked delivery of the large message to server2. However, server2 did not receive the whole message: it got only the first packet, carrying the initial large-message header, and none of the other chunks:

From server2:
13:49:25,301 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 590887547, [id: 0x23383a7b, /192.168.40.2:5445])) sendLarge::LargeServerMessage[messageID=3770,priority=4,expiration=[null], durable=true, address=jms.queue.InQueue,properties=TypedProperties[{_HQ_BRIDGE_DUP=[B@562798e8, _HQ_LARGE_SIZE=2147811, counter=904, _HQ_ROUTE_TO=[B@41a7d388, _HQ_DUPL_ID=a4640014-8f4d-4912-9ab2-273b495cccd21403264819220, count=903, color=RED, __HQ_CID=0e83a18b-f870-11e3-a232-dd298b5ff7de}]]@1899013988

From server1:
13:49:25,551 TRACE [org.hornetq.core.server] (Old I/O client worker ([id: 0x1ab03f84, /127.0.0.1:50286 => localhost/127.0.0.1:43812])) ClusterConnectionBridge@5b0508f3 [name=sf.my-cluster.03a9ec30-f870-11e3-8ee0-afecb4b03472, queue=QueueImpl[name=sf.my-cluster.03a9ec30-f870-11e3-8ee0-afecb4b03472, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=ff82ff19-f86f-11e3-839a-4d2347c499b7]]@40882c7f targetConnector=ServerLocatorImpl (identity=(Cluster-connection-bridge::ClusterConnectionBridge@5b0508f3 [name=sf.my-cluster.03a9ec30-f870-11e3-8ee0-afecb4b03472, queue=QueueImpl[name=sf.my-cluster.03a9ec30-f870-11e3-8ee0-afecb4b03472, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=ff82ff19-f86f-11e3-839a-4d2347c499b7]]@40882c7f targetConnector=ServerLocatorImpl [initialConnectors=[TransportConfiguration(name=connector-to-proxy-directing-to-this-server, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=43812&host=localhost], discoveryGroupConfiguration=null]]::ClusterConnectionImpl@605474553[nodeUUID=ff82ff19-f86f-11e3-839a-4d2347c499b7, connector=TransportConfiguration(name=connector-to-proxy-directing-to-this-server, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=43821&host=localhost, address=jms, server=HornetQServerImpl::serverUUID=ff82ff19-f86f-11e3-839a-4d2347c499b7])) [initialConnectors=[TransportConfiguration(name=connector-to-proxy-directing-to-this-server, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=43812&host=localhost], discoveryGroupConfiguration=null]] Acking Reference[2845]:RELIABLE:LargeServerMessage[messageID=2845,priority=4,expiration=[null], durable=true, address=jms.queue.InQueue,properties=TypedProperties[{counter=904, _HQ_LARGE_SIZE=2147811, count=903, _HQ_DUPL_ID=a4640014-8f4d-4912-9ab2-273b495cccd21403264819220, _HQ_ROUTE_TOsf.my-cluster.03a9ec30-f870-11e3-8ee0-afecb4b03472=[B@3918e589, color=RED, 
__HQ_CID=0e83a18b-f870-11e3-a232-dd298b5ff7de}]]@1915834324 on queue QueueImpl[name=sf.my-cluster.03a9ec30-f870-11e3-8ee0-afecb4b03472, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=ff82ff19-f86f-11e3-839a-4d2347c499b7]]@40882c7f
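The server2 line above is the initial large-message packet. It carries _HQ_LARGE_SIZE=2147811, the size HornetQ records for the large message body, so a receiver knows how many bytes of body chunks to expect before the message is complete. A hedged sketch of that bookkeeping follows; LargeMessageAssembly is a hypothetical class, not HornetQ's implementation, and the 102400-byte chunk size is an assumption chosen only for illustration.

```java
// Hypothetical model of reassembling a large message on the receiving side:
// the initial packet announces the body size, and chunks are counted until
// that size is reached.
class LargeMessageAssembly {
    private final long expectedBodySize; // value of _HQ_LARGE_SIZE
    private long receivedBytes;
    private int chunks;

    LargeMessageAssembly(long expectedBodySize) {
        this.expectedBodySize = expectedBodySize;
    }

    // Records one body chunk; rejects chunks that would overflow the announced size.
    void onChunk(int length) {
        if (length < 0 || receivedBytes + length > expectedBodySize) {
            throw new IllegalStateException("chunk exceeds announced size");
        }
        receivedBytes += length;
        chunks++;
    }

    boolean isComplete() { return receivedBytes == expectedBodySize; }

    long missingBytes() { return expectedBodySize - receivedBytes; }

    int chunkCount() { return chunks; }

    public static void main(String[] args) {
        long size = 2147811L;   // _HQ_LARGE_SIZE from the server2 log
        int chunkSize = 102400; // assumed chunk size, for illustration only

        // Situation observed on server2: header received, no body chunks.
        LargeMessageAssembly headerOnly = new LargeMessageAssembly(size);
        System.out.println(headerOnly.isComplete() + " missing=" + headerOnly.missingBytes());

        // Normal case: every chunk arrives.
        LargeMessageAssembly full = new LargeMessageAssembly(size);
        for (long sent = 0; sent < size; sent += chunkSize) {
            full.onChunk((int) Math.min(chunkSize, size - sent));
        }
        System.out.println(full.isComplete() + " chunks=" + full.chunkCount());
    }
}
```

Running it prints `false missing=2147811` for the header-only case (the state server2 was left in) and `true chunks=21` when all chunks arrive under the assumed chunk size.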

The network was disconnected at:
13:49:25,756 main INFO  [org.jboss.qa.hornetq.test.bridges.NetworkFailuresHornetQCoreBridges:616] Stop all proxies.
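Putting the three timestamps together: server2 logged the initial large-message packet at 13:49:25,301, server1 acked the message on the cluster bridge at 13:49:25,551, and the test stopped the proxies at 13:49:25,756. A small sketch of that arithmetic with java.time, assuming the test JVM and both servers log against the same clock (the logs do not confirm this; server2's connection comes from 192.168.40.2). Timeline and millisBetween are hypothetical names.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Computes the gaps between the log timestamps quoted in this report.
// Assumes all three logs share one clock; values are copied from the excerpts.
class Timeline {
    static final DateTimeFormatter LOG_TIME = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    static long millisBetween(String from, String to) {
        return Duration.between(LocalTime.parse(from, LOG_TIME), LocalTime.parse(to, LOG_TIME)).toMillis();
    }

    public static void main(String[] args) {
        String headerOnServer2 = "13:49:25,301";
        String ackOnServer1 = "13:49:25,551";
        String proxiesStopped = "13:49:25,756";
        System.out.println("header -> ack:        " + millisBetween(headerOnServer2, ackOnServer1) + " ms");
        System.out.println("ack -> disconnect:    " + millisBetween(ackOnServer1, proxiesStopped) + " ms");
        System.out.println("header -> disconnect: " + millisBetween(headerOnServer2, proxiesStopped) + " ms");
    }
}
```

Under that same-clock assumption, server1 acked the message about 205 ms before the network was cut, and about 250 ms after server2 logged the header packet.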