Description of problem:
This problem occurs only when two or more new storage nodes are deployed in parallel. Sequential deployment (waiting until the previous storage node has joined and its status is NORMAL) works correctly.
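The sequential workaround can be sketched as follows. This is only an illustration, not RHQ or JBoss ON code: `SequentialDeployer`, `deployOneAtATime`, and the `deploy` / `statusOf` callbacks are hypothetical names, and how a node is actually deployed or its status read is left to the caller. Only the JOINING and NORMAL status names come from this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper (not part of RHQ/JBoss ON): deploys nodes strictly one at a time.
final class SequentialDeployer {

    /**
     * Deploys each node, then polls its status until it reports NORMAL.
     * Stops at the first node that does not reach NORMAL within maxPolls,
     * so the next node is never started while the previous one is still joining.
     * Returns the nodes that reached NORMAL, in deployment order.
     */
    static List<String> deployOneAtATime(List<String> nodes,
                                         Consumer<String> deploy,
                                         Function<String, String> statusOf,
                                         int maxPolls,
                                         long pollMillis) {
        List<String> joined = new ArrayList<>();
        for (String node : nodes) {
            deploy.accept(node);
            boolean normal = false;
            for (int i = 0; i < maxPolls && !normal; i++) {
                normal = "NORMAL".equals(statusOf.apply(node));
                if (!normal) {
                    try {
                        Thread.sleep(pollMillis);
                    } catch (InterruptedException e) {
                        // Preserve the interrupt flag and give up early.
                        Thread.currentThread().interrupt();
                        return joined;
                    }
                }
            }
            if (!normal) {
                break; // previous node never became NORMAL: do not deploy the rest
            }
            joined.add(node);
        }
        return joined;
    }

    public static void main(String[] args) {
        // Stub callbacks for demonstration: every node reports NORMAL immediately.
        List<String> joined = deployOneAtATime(
                Arrays.asList("server2", "server3"),
                node -> System.out.println("deploying " + node),
                node -> "NORMAL",
                10, 100L);
        System.out.println("joined: " + joined);
    }
}
```

In a real setup the two callbacks would wrap whatever mechanism starts the storage node and reports its cluster status; the stub in `main` only demonstrates the control flow.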
Version-Release number of selected component (if applicable):
3.2.0.GA Update 02
Build Number :
Steps to Reproduce:
1. JON server, storage node, and agent are installed and running on server1
2. install second storage node (do not start it) on server2
3. install third storage node (do not start it) on server3
4. start both storage nodes on server2 and server3
5. run 'Manual Autodiscovery' operation on platform resources for server2 and server3
Actual results:
- both storage nodes are JOINING
- both storage nodes become NORMAL after a while
- both storage nodes write a large number of the following messages (one every millisecond) to rhq-storage.log:
INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,355 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,356 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185
Note that the two nodes are attempting the handshake with each other: 10.16.23.185 is the IP of server3, and the storage log on server3 contains the same messages except that the IP points to server2.
Expected results:
The handshake should succeed.
Additional info:
Exception logged at TRACE level:
INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
DEBUG [WRITE-/10.16.23.185] 2014-06-09 08:24:07,344 OutboundTcpConnection.java (line 338) Target max version is -2147483648; no version information yet, will retry
TRACE [WRITE-/10.16.23.185] 2014-06-09 08:24:07,345 MessagingService.java (line 826) Assuming current protocol version for /10.16.23.185
INFO [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,345 OutboundTcpConnection.java (line 399) Handshaking version with /10.16.23.185
TRACE [HANDSHAKE-/10.16.23.185] 2014-06-09 08:24:07,345 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.16.23.185
The issue does not disappear even after both storage nodes are restarted.
I think that the safe, conservative approach is to allow only one node to be deployed at a time in order to avoid problems like schema disagreement. There is currently no mechanism in place to prevent multiple deployments from running simultaneously. I considered implementing some optimistic locking in 3.2.0, but there was not enough time.
This (the locking) can probably be done for 3.3.0 because it also affects bug 1102887 and bug 1103841.
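As a rough illustration of the kind of optimistic locking mentioned above, the sketch below uses a version counter so that a deployment request based on stale state is rejected. It is not the planned implementation and not RHQ code: `ClusterDeploymentState`, `tryBeginDeployment`, and `finishDeployment` are hypothetical names, and the state is kept in memory here, whereas a real server would presumably persist it.

```java
// Hypothetical sketch (not RHQ code): version-checked guard allowing one deployment at a time.
final class ClusterDeploymentState {
    private long version = 0;            // bumped on every successful state change
    private String deployingNode = null; // node currently being deployed, or null when idle

    synchronized long version() {
        return version;
    }

    /**
     * Claims the deployment slot only if the caller saw the latest version
     * and no other deployment is in progress.
     */
    synchronized boolean tryBeginDeployment(String node, long expectedVersion) {
        if (version != expectedVersion || deployingNode != null) {
            return false; // stale view or another deployment is running
        }
        deployingNode = node;
        version++;
        return true;
    }

    /** Releases the slot, but only for the node that currently holds it. */
    synchronized boolean finishDeployment(String node) {
        if (!node.equals(deployingNode)) {
            return false;
        }
        deployingNode = null;
        version++;
        return true;
    }

    public static void main(String[] args) {
        ClusterDeploymentState state = new ClusterDeploymentState();
        long seen = state.version(); // both requests read the same version

        System.out.println(state.tryBeginDeployment("server2", seen)); // true
        System.out.println(state.tryBeginDeployment("server3", seen)); // false: stale version
        state.finishDeployment("server2");
        System.out.println(state.tryBeginDeployment("server3", state.version())); // true
    }
}
```

With a guard like this, the second of two simultaneous deployment requests would fail fast and could be retried once the first node is NORMAL, rather than both nodes joining at once.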
Bumping the target release due to time constraints. Work has already started, though, in the storage_workflow branch.
JBoss ON is coming to the end of its product life cycle. For more information regarding this transition, see https://access.redhat.com/articles/3827121.
This bug report/request is being closed. If you feel this issue should not be closed or requires further review, please create a new bug report against the latest supported JBoss ON 3.3 version.