Bug 1096927

Summary: JON storage node does not get properly configured if rhqctl install is run a second time to complete a failed install
Product: [JBoss] JBoss Operations Network
Reporter: Larry O'Leary <loleary>
Component: Installer
Assignee: John Mazzitelli <mazz>
Status: CLOSED CURRENTRELEASE
QA Contact: Filip Brychta <fbrychta>
Severity: high
Docs Contact:
Priority: unspecified
Version: JON 3.2
CC: ahovsepy, fbrychta, mazz
Target Milestone: DR01
Flags: jmorgan: needinfo?
Target Release: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-12-11 14:03:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Larry O'Leary 2014-05-12 17:33:41 UTC
Description of problem:
If the installer is re-run to finish a failed install or to adjust configuration parameters, the storage node will not be available to the JBoss ON system.

WARN  [org.rhq.enterprise.server.storage.StorageClientManagerBean] (EJB default - 2) Storage client subsystem wasn't initialized. The RHQ server will be set to MAINTENANCE mode. Please verify  that the storage cluster is operational.: java.lang.RuntimeException: Authentication error on host jboss.example.com/192.168.1.1: Username and/or password are incorrect
	at org.rhq.enterprise.server.storage.StorageClientManagerBean.checkSchemaCompability(StorageClientManagerBean.java:257) [rhq-server.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.storage.StorageClientManagerBean.createSession(StorageClientManagerBean.java:341) [rhq-server.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.storage.StorageClientManagerBean.init(StorageClientManagerBean.java:154) [rhq-server.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.storage.StorageClientManagerBean.storageSessionMaintenance(StorageClientManagerBean.java:129) [rhq-server.jar:4.9.0.JON320GA]
    ...
Caused by: com.datastax.driver.core.exceptions.AuthenticationException: Authentication error on host jboss.example.com/192.168.1.1: Username and/or password are incorrect
	at com.datastax.driver.core.Connection.initializeTransport(Connection.java:168) [cassandra-driver-core-1.0.2.jar:]
	at com.datastax.driver.core.Connection.<init>(Connection.java:129) [cassandra-driver-core-1.0.2.jar:]
    ...
	at org.rhq.cassandra.util.ClusterBuilder.build(ClusterBuilder.java:130) [rhq-cassandra-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.cassandra.schema.SessionManager.initSession(SessionManager.java:33) [rhq-cassandra-schema-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.cassandra.schema.AbstractManager.initClusterSession(AbstractManager.java:100) [rhq-cassandra-schema-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.cassandra.schema.AbstractManager.initClusterSession(AbstractManager.java:90) [rhq-cassandra-schema-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.cassandra.schema.VersionManager.checkCompatibility(VersionManager.java:257) [rhq-cassandra-schema-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.cassandra.schema.SchemaManager.checkCompatibility(SchemaManager.java:112) [rhq-cassandra-schema-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.storage.StorageClientManagerBean.checkSchemaCompability(StorageClientManagerBean.java:253) [rhq-server.jar:4.9.0.JON320GA]
	... 49 more

Version-Release number of selected component (if applicable):
3.2.0

How reproducible:
Always

Steps to Reproduce:
1.  Create a shell script that can be used to simulate a storage node shutdown failure similar to what happens in bug 1089757:

        cat >/tmp/neverDie <<EOF
#!/bin/bash

_okayToExit=false

function handleSignal() {
    echo "A signal was received. \$1"
    \${_okayToExit} && exit \$1
    _okayToExit=true
}

trap handleSignal SIGHUP SIGINT SIGTERM

while true; do
    echo "Running..."
    sleep 5s
done
EOF
        chmod +x /tmp/neverDie

2.  Run rhqctl install:

        "${RHQ_SERVER_HOME}"'/bin/rhqctl' install 
        
3.  After the installer has been running for a few seconds, execute the following commands from another terminal:

        /tmp/neverDie &
        newPID=$!
        echo -n ${newPID} >"${RHQ_SERVER_HOME}"'/rhq-storage/bin/cassandra.pid'

    You must ensure that RHQ_SERVER_HOME is set to the correct path. The goal here is to replace the Cassandra PID stored in cassandra.pid with the PID of our neverDie shell script. This simulates a shutdown failure for the RHQ Storage Node during the execution of the installer.
    
4.  Wait for the installer to fail:

        11:08:30,024 ERROR [org.rhq.server.control.RHQControl] Process [11951] did not finish yet. Terminate it manually and retry.
        11:08:30,025 WARN  [org.rhq.server.control.command.Install] UNDO: Removing agent install directory
        11:08:30,047 WARN  [org.rhq.server.control.command.Install] UNDO: Removing server-installed marker file and management user
        11:08:30,048 WARN  [org.rhq.server.control.command.Install] UNDO: Stopping component: --server
        11:08:30,049 WARN  [org.rhq.server.control.command.Install] UNDO: Stopping component: --storage
        11:08:40,069 WARN  [org.rhq.server.control.command.Install] UNDO: Removing storage node data and install directories
        11:08:40,098 WARN  [org.rhq.server.control.command.Install] UNDO: Reverting server properties file

5.  Re-execute the installer using the --start command-line argument:

        "${RHQ_SERVER_HOME}"'/bin/rhqctl' install --start
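
Before re-running the installer in step 5, it can help to confirm whether the process recorded in cassandra.pid is still alive, since the error in step 4 asks for it to be terminated manually. The following is a minimal sketch only; pid_file_process_alive is a hypothetical helper and is not part of rhqctl:

```shell
#!/bin/bash
# Hypothetical helper (not part of rhqctl): succeed when the PID recorded in
# the given pid file belongs to a process that is still running.
pid_file_process_alive() {
    local pid_file=$1 pid
    [ -r "${pid_file}" ] || return 1        # no readable pid file
    pid=$(tr -d '[:space:]' <"${pid_file}")
    case "${pid}" in
        ''|*[!0-9]*) return 1 ;;            # empty or non-numeric content
    esac
    # Signal 0 delivers nothing; it only tests that the process can be signalled.
    # It also fails for a live process owned by another user.
    kill -0 "${pid}" 2>/dev/null
}

# Example use, assuming RHQ_SERVER_HOME is set as in the steps above:
#   if pid_file_process_alive "${RHQ_SERVER_HOME}/rhq-storage/bin/cassandra.pid"; then
#       echo "Storage node process is still running" >&2
#   fi
```

With the neverDie script in place, this check would be expected to keep succeeding until the script is killed with a second signal.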

Actual results:
Server logs the following warning and enters maintenance mode:

    WARN  [org.rhq.enterprise.server.storage.StorageClientManagerBean] (EJB default - 2) Storage client subsystem wasn't initialized. The RHQ server will be set to MAINTENANCE mode. Please verify  that the storage cluster is operational.: java.lang.RuntimeException: Authentication error on host jboss.example.com/192.168.1.1: Username and/or password are incorrect


Expected results:
The storage node and cluster are running and no warnings or errors are logged in server.log.
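
A rough way to check for the failure condition is to search server.log for the warning text quoted above. This is only a sketch; storage_client_warning_logged is a hypothetical helper, and the path to server.log has to be supplied by the caller:

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) when the given log file contains the
# storage client initialization warning quoted in this report.
storage_client_warning_logged() {
    local log_file=$1
    grep -q "Storage client subsystem wasn't initialized" "${log_file}" 2>/dev/null
}

# Example use (the log path is an assumption about the local install):
#   if storage_client_warning_logged /path/to/server.log; then
#       echo "Server reported a storage client failure" >&2
#   fi
```

The check only matches the one message from this report, so a clean result does not rule out other warnings or errors.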

Additional info:
It is not clear why this happens. During testing, if I delete the contents of the file system and re-run the installer, this problem doesn't seem to occur. Perhaps the rhq-server.properties and rhq-storage.properties files are not actually getting reverted back to their pre-install versions? Keep in mind that I would expect the user-provided configuration to remain in the property files, but not the settings that were added or changed during the install itself.
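
One way to test the theory about the property files is to copy them aside before the first install attempt and diff them after the failed run. The helpers below are a hypothetical sketch (snapshot_props and diff_props are not part of the product), and the property file paths are passed in by the caller rather than assumed:

```shell
#!/bin/bash
# Hypothetical helpers: keep a copy of property files from before the
# installer runs, then compare the live files against that copy afterwards.

# snapshot_props SNAPSHOT_DIR FILE...
snapshot_props() {
    local snapshot_dir=$1 f
    shift
    mkdir -p "${snapshot_dir}" || return 1
    for f in "$@"; do
        cp -p "${f}" "${snapshot_dir}/$(basename "${f}")" || return 1
    done
}

# diff_props SNAPSHOT_DIR FILE...
# Prints a unified diff per file; returns non-zero if any file differs.
diff_props() {
    local snapshot_dir=$1 f status=0
    shift
    for f in "$@"; do
        diff -u "${snapshot_dir}/$(basename "${f}")" "${f}" || status=1
    done
    return "${status}"
}
```

Running snapshot_props before step 2 and diff_props after step 4 would show any entries the UNDO phase left behind. Files with the same base name in different directories would collide in the snapshot directory, which is acceptable for the two files named here.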

Comment 1 John Mazzitelli 2014-05-13 21:08:32 UTC
Note that with the fix to bug #1089757 in place, the replication procedures specified in this issue will no longer cause the install to fail. It's possible, therefore, that the fix to that issue will also fix this issue (though I have not verified that yet).

Comment 2 John Mazzitelli 2014-05-14 13:24:20 UTC
Is there another replication procedure to cause this to happen other than the failure of the storage node to shutdown at the end? Because once the fix to that other bug is in place, that failure will no longer cause the rollback to happen, so this BZ's error won't happen.  But I suspect there might be other conditions where this might happen (for example, what did you mean "re-run the installer ... or to adjust configuration parameters" - I'm not aware of being able to re-run the installer after it has successfully run.)

Comment 3 Larry O'Leary 2014-05-14 13:59:13 UTC
The reproducer steps here simply simulate an actual delayed shutdown of the storage node upon install.

For example, this can happen when it is running on a slower machine or when any other shutdown failure occurs. This is different than what is happening in the other bug. In the other bug, the shutdown is really successful but due to the bug, we still say it was unsuccessful. In this case, the shutdown is really not working.

As for configuration changes + re-run I am referring to issues like "unable to bind to address" and "invalid port" and "SELinux in enforcing mode". In those cases, the install will fail and the user would update their configuration and re-run.

Comment 4 John Mazzitelli 2014-05-14 16:01:41 UTC
I found one issue that may or may not be related (I can't see right now that it is related, but I'm going to commit it with this BZ number associated with it since I saw it while trying to replicate this issue).

If you restart the installer and jboss.bind.address.management wasn't set, we need to fall back to jboss.bind.address (we don't in the broken code, I will fix that now).

git commit to master: 6fccc15
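
The fallback rule described above can be illustrated with a small shell sketch. This is not the installer's code (the actual fix is in the referenced commit); get_prop and resolve_management_bind_address are hypothetical helpers, and treating an empty value the same as a missing one is an assumption:

```shell
#!/bin/bash
# Hypothetical helpers illustrating the fallback rule; not the installer's code.

# get_prop FILE KEY
# Print the value of KEY from a Java-style properties file (last one wins).
get_prop() {
    awk -v key="$2" '
        {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1); gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k != key) next
            v = substr($0, eq + 1);    gsub(/^[ \t]+|[ \t]+$/, "", v)
            val = v
        }
        END { print val }
    ' "$1"
}

# resolve_management_bind_address FILE
# Use jboss.bind.address.management when set, else jboss.bind.address.
# Assumption: an empty value counts as "not set".
resolve_management_bind_address() {
    local file=$1 addr
    addr=$(get_prop "${file}" jboss.bind.address.management)
    [ -n "${addr}" ] || addr=$(get_prop "${file}" jboss.bind.address)
    printf '%s\n' "${addr}"
}
```

The parser is deliberately simple: it handles key=value lines only and ignores properties-file features such as line continuations or ":" separators.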

Comment 5 John Mazzitelli 2014-05-14 16:36:58 UTC
I have not been able to replicate on master which has the latest fix to the installer from that previous BZ. I also tried hitting control-C at different times during the installer run to see if the UNDO steps are not getting performed properly but I've not seen the problem. I'll try running on JON 3.2.0 and try to determine why it happens there.

Comment 6 John Mazzitelli 2014-05-15 11:53:23 UTC
I was able to replicate this easily on 3.2.0 using the replication procedures. However, I did not see this error on a master build. I will see if I can verify that it is the fix from the other BZ that fixed the issue.

Comment 7 John Mazzitelli 2014-05-15 13:21:53 UTC
OK, I rebuilt release/3.2.x branch and reproduced the error - verified the problem exists there in that branch.

Then I cherry-picked these two commits (correcting some minor conflicts) into the release/3.2.x branch:

3a7cec6
845cc38

These are documented in bug #1089757.

Re-ran the test - the installation doesn't fail with any abort/undo. The installations are complete. You can start up everything and you won't get the error mentioned in the description.

Comment 8 John Mazzitelli 2014-05-15 13:25:32 UTC
(In reply to John Mazzitelli from comment #7)
> Then I cherry-picked these two commits (correcting some minor conflicts) into
> the release/3.2.x branch:
> 
> 3a7cec6
> 845cc38

I forgot: as part of the conflict resolution, I removed an import that should not have been removed. So I committed this to fix that:

87c31fc24abea9b281f4852f43e276046535eb09

Comment 9 Simeon Pinder 2014-07-31 15:51:31 UTC
Moving to ON_QA as available to test with brew build of DR01: https://brewweb.devel.redhat.com//buildinfo?buildID=373993

Comment 10 Filip Brychta 2014-08-15 09:02:09 UTC
I tried to invoke a revert of the installation in many different ways (incorrect properties, Ctrl+C during different phases of installation) and I was not able to reproduce the issue.

Version: 3.3.0.DR01
Build Number: 6468454:dda0a47