Description of problem:
If deployment fails at the start of the add maintenance phase, a deadlock occurs that prevents an error message from being stored on the storage node entity. Consequently, the cluster status remains JOINING when it should change to DOWN.

Here are the errors from a server log that show the issue:

12:33:05,541 ERROR [org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean] (Reconnection-0) Aborting storage node deployment due to unexpected error while performing add node maintenance.: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
    at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64) [cassandra-driver-core-1.0.2-rhq-1.2.4.jar:]
    at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:214) [cassandra-driver-core-1.0.2-rhq-1.2.4.jar:]
    at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:169) [cassandra-driver-core-1.0.2-rhq-1.2.4.jar:]
    at com.datastax.driver.core.Session.execute(Session.java:110) [cassandra-driver-core-1.0.2-rhq-1.2.4.jar:]
    at com.datastax.driver.core.Session.execute(Session.java:79) [cassandra-driver-core-1.0.2-rhq-1.2.4.jar:]
    at org.rhq.server.metrics.StorageSession.execute(StorageSession.java:36) [rhq-server-metrics-4.9.0-SNAPSHOT.jar:4.9.0-SNAPSHOT]
    at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean.updateReplicationFactor(StorageNodeOperationsHandlerBean.java:850) [rhq-server.jar:4.9.0-SNAPSHOT]
    at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean.updateSchemaIfNecessary(StorageNodeOperationsHandlerBean.java:836) [rhq-server.jar:4.9.0-SNAPSHOT]
    at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean.performAddNodeMaintenance(StorageNodeOperationsHandlerBean.java:223) [rhq-server.jar:4.9.0-SNAPSHOT]
    at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean.performAddNodeMaintenanceIfNecessary(StorageNodeOperationsHandlerBean.java:200) [rhq-server.jar:4.9.0-SNAPSHOT]
    ...
12:43:05,416 WARN [com.arjuna.ats.arjuna] (Reconnection-0) ARJUNA012077: Abort called on already aborted atomic action 0:ffff0a101777:-1b97486a:5220c752:7292
12:43:05,417 ERROR [org.jboss.as.ejb3.invocation] (Reconnection-0) JBAS014134: EJB Invocation failed on component StorageNodeOperationsHandlerBean for method public abstract void org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerLocal.performAddNodeMaintenanceIfNecessary(java.net.InetAddress): javax.ejb.EJBTransactionRolledbackException: Transaction rolled back

I have pushed a fix to master (commit hash: 9b3c7ffa8ce).

There was a deadlock issue that could manifest itself in the performAddNodeMaintenanceIfNecessary and performRemoveNodeMaintenanceIfNecessary methods when an error occurred. Both methods had nested transactions in which both the outer and inner transactions tried to update the same storage node entity. The transactions are no longer nested.
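
To illustrate why two transactions updating the same entity can block each other when one is nested inside the other, here is a minimal, self-contained sketch. It is not the RHQ code: the class name, method names, and timeout values are hypothetical, and a one-permit Semaphore stands in for the database row lock on the storage node entity. It assumes the inner transaction runs independently while the outer one is suspended and still holds the lock.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model: a one-permit Semaphore plays the role of the row lock on
// the storage node entity. A Semaphore is not re-entrant, which mirrors a row
// lock that is owned by a different transaction.
class NestedTxSketch {

    // Try to take the row lock, giving up after the timeout.
    private static boolean tryLock(Semaphore rowLock, long timeoutMillis) {
        try {
            return rowLock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Nested shape: the outer "transaction" updates the row and, before it
    // finishes, starts an inner "transaction" that updates the same row.
    // Returns whether the inner update could obtain the lock.
    static boolean nestedUpdate(Semaphore rowLock, long timeoutMillis) {
        rowLock.acquireUninterruptibly();        // outer tx locks the row
        try {
            boolean innerGotLock = tryLock(rowLock, timeoutMillis);
            if (innerGotLock) {
                rowLock.release();               // inner tx would commit here
            }
            return innerGotLock;
        } finally {
            rowLock.release();                   // outer tx ends
        }
    }

    // Non-nested shape: the first "transaction" finishes and releases the row
    // before the second one starts. Returns whether the second update could
    // obtain the lock.
    static boolean sequentialUpdate(Semaphore rowLock, long timeoutMillis) {
        rowLock.acquireUninterruptibly();        // first tx locks the row
        rowLock.release();                       // first tx ends
        boolean secondGotLock = tryLock(rowLock, timeoutMillis);
        if (secondGotLock) {
            rowLock.release();                   // second tx ends
        }
        return secondGotLock;
    }

    public static void main(String[] args) {
        Semaphore rowLock = new Semaphore(1);
        System.out.println("nested, inner update got lock: " + nestedUpdate(rowLock, 200));
        System.out.println("sequential, second update got lock: " + sequentialUpdate(rowLock, 200));
    }
}
```

In this model the nested variant can only end by timing out, because the lock holder is waiting on the very call that needs the lock. That is at least consistent with the roughly ten-minute gap between the two log entries above, although the sketch does not establish that as the cause. Running the two updates one after the other removes the wait.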