Bug 1106505

Summary:	Cannot handshake version when deploying two nodes in parallel
Product:	[JBoss] JBoss Operations Network	Reporter:	Filip Brychta <fbrychta>
Component:	Storage Node	Assignee:	Michael Burman <miburman>
Status:	CLOSED EOL	QA Contact:	Mike Foley <mfoley>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	JON 3.2.1	CC:	mfoley
Target Milestone:	---
Target Release:	JON 4.0.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-08-05 14:52:03 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1120418
Bug Blocks:

Description Filip Brychta 2014-06-09 13:57:04 UTC

Description of problem:
This problem occurs only when two or more new storage nodes are deployed in parallel. Sequential deployment (waiting until the previous storage node has joined and its status is NORMAL before starting the next one) works correctly.
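
The sequential workaround described above can be sketched roughly as follows. This is only an illustration: `start_node` and `get_status` are hypothetical callables supplied by the caller (they are not JON or RHQ APIs), and the polling interval and timeout are assumed values.

```python
import time


def deploy_sequentially(nodes, start_node, get_status,
                        poll_seconds=5.0, timeout_seconds=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Start each node only after the previous one reports NORMAL.

    start_node(node) and get_status(node) are hypothetical hooks provided
    by the caller; get_status is assumed to return strings such as
    "JOINING" or "NORMAL", matching the states named in this report.
    """
    for node in nodes:
        start_node(node)
        deadline = clock() + timeout_seconds
        # Poll until the node has joined the cluster.
        while get_status(node) != "NORMAL":
            if clock() >= deadline:
                raise TimeoutError(f"{node} did not reach NORMAL in time")
            sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.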

Version-Release number of selected component (if applicable):
Version :	
3.2.0.GA Update 02
Build Number :	
055b880:0620403

How reproducible:
Always

Steps to Reproduce:
1. JON server, storage node and agent are installed and running on server1
2. install a second storage node on server2 (do not start it)
3. install a third storage node on server3 (do not start it)
4. start both storage nodes on server2 and server3
5. run the 'Manual Autodiscovery' operation on the platform resources for server2 and server3

Actual results:
- both storage nodes are JOINING
- both storage nodes become NORMAL after a while
- both storage nodes write the following messages to rhq-storage.log continuously (roughly every millisecond):
INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,355 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,356 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185

Note that the nodes are trying to handshake with each other: 10.16.23.185 is the IP of server3, and the storage log on server3 contains the same messages except that the IP points to server2.
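
To quantify how often the handshake fails per peer, a small script along these lines could be run over rhq-storage.log. It is a sketch that assumes the message format shown above and that TRACE logging is enabled (the failure line is logged at TRACE level); the function name and sample list are illustrative only.

```python
import re
from collections import Counter

# Matches the failure message format quoted in this report.
FAILURE = re.compile(
    r"Cannot handshake version with /(\d{1,3}(?:\.\d{1,3}){3})"
)


def count_handshake_failures(lines):
    """Return a Counter mapping peer IP to the number of failed handshakes."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two log lines quoted above, used as sample input.
SAMPLE = [
    "INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,355 "
    "OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185",
    "TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,356 "
    "OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185",
]

if __name__ == "__main__":
    for peer, failures in count_handshake_failures(SAMPLE).items():
        print(peer, failures)
```

For a real log, pass an open file object instead of `SAMPLE`, since the function accepts any iterable of lines.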


Expected results:
Handshake should be successful

Additional info:
trace level exception:
 INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
	at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:203)
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
	at java.io.InputStream.read(InputStream.java:101)
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:400)
DEBUG [WRITE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 338) Target max version is -2147483648; no version information yet, will retry
TRACE [WRITE-/10.16.23.185] 2014-06-09 08:24:07,345 MessagingService.java (line 826) Assuming current protocol version for /10.16.23.185
 INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,345 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,345 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185
java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:400)


The issue doesn't disappear even when both storage nodes are restarted.

Comment 1 John Sanda 2014-06-09 14:40:50 UTC

I think that the safe, conservative approach is to allow only one node to be deployed at a time in order to avoid problems like schema disagreement. There is currently no mechanism in place to prevent multiple deployments from being done simultaneously. I considered implementing some optimistic locking in 3.2.0, but there was not enough time.

This (the locking) can probably be done for 3.3.0 because it also affects bug 1102887 and bug 1103841.
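
As a rough illustration of the optimistic locking idea (not the design that was planned for JON), the sketch below guards a shared "deployment in progress" record with a version number. `ClusterState`, `try_begin_deploy` and `finish_deploy` are hypothetical names; a real implementation would keep the record in the server database rather than in memory.

```python
import threading


class StaleVersionError(Exception):
    """Raised when the record changed after the caller read it."""


class ClusterState:
    """Hypothetical in-memory stand-in for a shared, versioned record."""

    def __init__(self):
        # This lock only makes the in-memory stand-in atomic; in a database
        # the same effect would come from a conditional UPDATE on the version.
        self._guard = threading.Lock()
        self.version = 0
        self.deploying = None  # node currently being deployed, or None

    def read(self):
        with self._guard:
            return self.version, self.deploying

    def compare_and_set(self, expected_version, deploying):
        with self._guard:
            if self.version != expected_version:
                raise StaleVersionError("record was modified concurrently")
            self.deploying = deploying
            self.version += 1


def try_begin_deploy(state, node):
    """Claim the single deployment slot; return False if it is taken."""
    version, current = state.read()
    if current is not None:
        return False
    try:
        state.compare_and_set(version, node)
    except StaleVersionError:
        return False  # another deployment won the race
    return True


def finish_deploy(state, node):
    """Release the slot if this node holds it."""
    version, current = state.read()
    if current == node:
        state.compare_and_set(version, None)
```

With this scheme, a second deployment started while the first is still in progress is rejected instead of proceeding in parallel.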

Comment 3 John Sanda 2014-08-29 12:19:21 UTC

Bumping the target release due to time constraints. Work has started, though, in the storage_workflow branch.

Comment 5 Filip Brychta 2019-08-05 14:52:03 UTC

JBoss ON is coming to the end of its product life cycle. For more information regarding this transition, see https://access.redhat.com/articles/3827121.
This bug report/request is being closed. If you feel this issue should not be closed or requires further review, please create a new bug report against the latest supported JBoss ON 3.3 version.