1106505 – Cannot handshake version when deploying two nodes in parallel

Summary: Cannot handshake version when deploying two nodes in parallel

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Storage Node
Sub Component:
Version:	JON 3.2.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	JON 4.0.0
Assignee:	Michael Burman
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:	1120418
Blocks:

Reported:	2014-06-09 13:57 UTC by Filip Brychta
Modified:	2019-08-05 14:52 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-08-05 14:52:03 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1084651	0	unspecified	CLOSED	Storage Node (un)deployment can cause deadlock in rhq_storage_node table	2021-02-22 00:41:40 UTC

Internal Links: 1084651

Description Filip Brychta 2014-06-09 13:57:04 UTC

Description of problem:
This problem occurs only when two or more new storage nodes are deployed in parallel. Sequential deployment (waiting until the previous storage node has joined, i.e. its status is NORMAL) works correctly.
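
The sequential workaround described above can be sketched roughly as follows. This is a minimal illustration, not JON tooling: `start_node` and `get_status` are hypothetical callables standing in for whatever starts a storage node and reads its cluster status, and the polling interval and timeout are arbitrary. Only the status name NORMAL comes from this report.

```python
import time


def deploy_sequentially(nodes, start_node, get_status,
                        poll_interval=5.0, timeout=600.0):
    """Start each node only after the previous one has reached NORMAL.

    start_node(node) and get_status(node) are hypothetical callables
    supplied by the caller; they are not part of any JON or RHQ API.
    """
    deployed = []
    for node in nodes:
        start_node(node)
        deadline = time.monotonic() + timeout
        # Block here until this node has fully joined the cluster.
        while get_status(node) != "NORMAL":
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{node} did not reach NORMAL within {timeout}s")
            time.sleep(poll_interval)
        deployed.append(node)
    return deployed
```

The point of the loop is simply that the next node is not started while an earlier one is still JOINING.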

Version-Release number of selected component (if applicable):
Version: 3.2.0.GA Update 02
Build Number: 055b880:0620403

How reproducible:
Always

Steps to Reproduce:
1. JON server, storage node and agent are installed and running on server1
2. install second storage node (do not start it) on server2
3. install third storage node (do not start it) on server3
4. start both storage nodes on server2 and server3
5. run 'Manual Autodiscovery' operation on platform resources for server2 and server3

Actual results:
- both storage nodes are JOINING
- after a while, both storage nodes are NORMAL
- both storage nodes write the following messages to rhq-storage.log, roughly once per millisecond:
INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,355 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,356 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185

Note that the two nodes are trying to handshake with each other: 10.16.23.185 is the IP address of server3, and the storage log on server3 contains the same messages except that the IP address is that of server2.
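
To gauge how often the failure occurs and against which peer, the log can be summarised with a short script. The sketch below is an assumption-laden helper, not part of JON: it only relies on the message text quoted above ("Cannot handshake version with /<address>"), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the failure lines quoted in this report, for example:
# TRACE [HANDSHAKE-/10.16.23.185] ... Cannot handshake version with /10.16.23.185
FAILURE = re.compile(r"Cannot handshake version with /(\S+)")


def count_handshake_failures(lines):
    """Return a Counter mapping peer address to number of failed handshakes."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file object for rhq-storage.log as `lines` gives the per-peer totals; running it on both server2 and server3 should show each node failing against the other.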


Expected results:
The handshake should succeed.

Additional info:
Trace-level exceptions:
 INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
	at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:203)
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
	at java.io.InputStream.read(InputStream.java:101)
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:400)
DEBUG [WRITE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 338) Target max version is -2147483648; no version information yet, will retry
TRACE [WRITE-/10.16.23.185] 2014-06-09 08:24:07,345 MessagingService.java (line 826) Assuming current protocol version for /10.16.23.185
 INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,345 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,345 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:400)


The issue does not disappear even after both storage nodes are restarted.

Comment 1 John Sanda 2014-06-09 14:40:50 UTC

I think that the safe, conservative approach is to allow only one node to be deployed at a time in order to avoid problems like schema disagreement. There is currently no mechanism in place to prevent multiple deployments from running simultaneously. I considered implementing some optimistic locking in 3.2.0, but there was not enough time.

This (the locking) can probably be done for 3.3.0 because it also affects bug 1102887 and bug 1103841.
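
For illustration only: optimistic locking of this kind is commonly built on a version column, where a deployment may start only if a conditional UPDATE still sees the version it read. The sketch below uses an in-memory SQLite table. The table `deploy_lock`, its columns and the function names are hypothetical and are not taken from the RHQ schema or code.

```python
import sqlite3


def init_lock(conn):
    """Create the single-row lock table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE deploy_lock ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "holder TEXT, "
        "version INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT INTO deploy_lock (id, holder, version) VALUES (1, NULL, 0)")
    conn.commit()


def try_begin_deploy(conn, node):
    """Claim the single deployment slot; return True on success."""
    holder, version = conn.execute(
        "SELECT holder, version FROM deploy_lock WHERE id = 1"
    ).fetchone()
    if holder is not None:
        return False  # another deployment is already in progress
    # Compare-and-set: succeeds only if the row is unchanged since the read.
    cur = conn.execute(
        "UPDATE deploy_lock SET holder = ?, version = version + 1 "
        "WHERE id = 1 AND version = ?",
        (node, version),
    )
    conn.commit()
    return cur.rowcount == 1


def finish_deploy(conn, node):
    """Release the slot, but only if this node still holds it."""
    cur = conn.execute(
        "UPDATE deploy_lock SET holder = NULL, version = version + 1 "
        "WHERE id = 1 AND holder = ?",
        (node,),
    )
    conn.commit()
    return cur.rowcount == 1
```

With such a guard, a second deployment request arriving while the first node is still joining would be rejected (or queued by the caller) instead of proceeding in parallel.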

Comment 3 John Sanda 2014-08-29 12:19:21 UTC

Bumping the target release due to time constraints. Work has started, though, in the storage_workflow branch.

Comment 5 Filip Brychta 2019-08-05 14:52:03 UTC

JBoss ON is coming to the end of its product life cycle. For more information regarding this transition, see https://access.redhat.com/articles/3827121.
This bug report/request is being closed. If you feel this issue should not be closed or requires further review, please create a new bug report against the latest supported JBoss ON 3.3 version.