Description of problem:
Upgrade from JON 3.2.0 to JON 3.3.3 failed with "Could not verify that the node is up and running". I believe this is caused by a slow environment:

22:53:18,695 INFO  [org.rhq.storage.installer.StorageInstaller] Starting RHQ Storage Node
22:53:28,836 WARN  [org.rhq.storage.installer.StorageInstaller] Could not verify that the node is up and running.
22:53:28,836 WARN  [org.rhq.storage.installer.StorageInstaller] Check the log file at ../../logs/rhq-storage.log for errors.

There are 10 s between "Starting RHQ Storage" and "Could not verify that the node", and it seems that the storage node does not start within 10 s on this slow environment. There are no valid errors in the storage log.

Version-Release number of selected component (if applicable):
Version : 3.3.0.GA Update 03
Build Number : 82ad0cc:a25836e

How reproducible:
5/5

Steps to Reproduce:
1. install and start JON 3.2.0
2. unzip JON 3.3.0
3. unzip CP3
4. apply CP3 on JON 3.3.0
5. stop JON 3.2.0
6. start upgrade (c:\jon-server-3.3.0.GA\bin>rhqctl upgrade --from-server-dir c:\jon-server-3.2.0.GA)

Actual results:
22:53:28,836 WARN  [org.rhq.storage.installer.StorageInstaller] Could not verify that the node is up and running.
22:53:28,836 WARN  [org.rhq.storage.installer.StorageInstaller] Check the log file at ../../logs/rhq-storage.log for errors.
22:53:28,836 WARN  [org.rhq.storage.installer.StorageInstaller] The storage installer will now exit
22:53:28,867 INFO  [org.rhq.server.control.command.Upgrade] The storage node upgrade has finished with an exit value of [2]
The RHQ Server [rhqserver-WIN-2008] service was not running.
Stopping the RHQ Storage [rhqstorage-WIN-2008] service...
RHQ Storage [rhqstorage-WIN-2008] service stopped.
RHQ storage node has stopped
22:53:34,883 ERROR [org.rhq.server.control.RHQControl] The storage node upgrade failed with exit code [2]

Expected results:
Upgrade is successful

Additional info:
This issue will most probably occur only on environments where starting the storage node takes more than 10 s. If the assumption is correct, it should occur during installation as well, any time the storage node takes longer than 10 s to start.
Created attachment 1038040 [details] storage log
Created attachment 1038041 [details] console log
This isn't a timeout problem. Cassandra is returning false for NativeTransportRunning, so we don't retry (we only retry if there's an exception). I'll fix this by falling through to the retry policy when false is returned.
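To illustrate the behaviour described above, here is a minimal sketch of a retry loop that treats a false result the same way as an exception. It is not the code from the actual fix: the class name NativeTransportCheck, the method waitUntilTrue, and the maxAttempts and delayMillis parameters are all hypothetical, and the status check is passed in as a plain Callable so that the sketch does not depend on any RHQ or Cassandra API.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only; not part of RHQ.
public class NativeTransportCheck {

    /**
     * Polls the given check until it returns true or the attempts run out.
     * A false result is retried in the same way as an exception, instead of
     * being treated as a final answer.
     */
    public static boolean waitUntilTrue(Callable<Boolean> check, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(check.call())) {
                    return true; // the check reports that the transport is running
                }
                // false (or null): the node answered but is not ready yet, so retry
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                // the check itself failed (for example, connection refused), so retry
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis); // wait before the next attempt
            }
        }
        return false; // never became true within maxAttempts
    }
}
```

A caller would wrap whatever reads NativeTransportRunning in the Callable and report "Could not verify that the node is up and running" only once waitUntilTrue returns false.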
Fixed in the master:

commit 0cde115e1081f5aa982170e9f3838da4fd79963f
Author: Michael Burman <miburman>
Date:   Mon Nov 9 15:52:07 2015 +0200

    [BZ 1231199] If Cassandra returns NativeTransportRunning is false, force retry policy to try again
Merged to release/jon3.3.x:

commit 9c049818e3caef5321f2b35f45adee9e9b1b8a69
Author: Michael Burman <miburman>
Date:   Mon Nov 9 15:52:07 2015 +0200

    [BZ 1231199] If Cassandra returns NativeTransportRunning is false, force retry policy to try again

    (cherry picked from commit 0cde115e1081f5aa982170e9f3838da4fd79963f)
Moving to ON_QA as available to test with the following brew build:

JON Cumulative patch build: https://brewweb.devel.redhat.com/buildinfo?buildID=469635

*Note: jon-server-patch-3.3.0.GA.zip maps to DR01 build of jon-server-3.3.0.GA-update-05.zip.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-0118.html