Created attachment 733544 [details]
rhq_sh_log.txt

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1. Download the latest RHQ server installer (with Cassandra integrated).
2. Run rhq.sh as a non-root user.

Actual results:
RHQ server with storage and agent is installed, but there are exceptions in the log.

Expected results:
RHQ server with storage and agent is installed without exceptions in the log.

Additional info:
The log is attached.
I am closing this out since rhq.sh has been replaced by rhqctl.sh. Please retest with rhqctl.sh.

I may have investigated these exceptions at some point, and I want to say that it looked like an issue specific to the environment. That said, if this error occurs again, please run the RHQ installer from a master build to see if you get the error there as well. Even if the issue was not introduced in the cassandra-backend branch, and even if it looks environment-specific, someone ought to determine what is causing it, because at some point someone else is bound to run into it.
Sorry for reopening. The exception also occurs when running rhq-install.sh as a non-root user. Editing the description of the bug.
Armine, can you please retest this with a master build as well?
The bug is visible in the master build as well. Removing the dependency on the Cassandra back-end tracker.
This bz probably causes the following exception in the RHQ server log:

01:58:26,230 ERROR [org.quartz.core.JobRunShell] (RHQScheduler_Worker-5) Job CassandraClusterHeartBeatJobGroup.CassandraClusterHeartBeatJob threw an unhandled Exception: : java.lang.IllegalArgumentException: Expected string of the form, hostname|jmxPort|nativeTransportPort
    at org.rhq.core.domain.cloud.StorageNode.parseNodeInformation(StorageNode.java:272) [rhq-core-domain-ejb3.jar:4.8.0-SNAPSHOT]
    at org.rhq.enterprise.server.cassandra.CassandraClusterHeartBeatJob.execute(CassandraClusterHeartBeatJob.java:72) [rhq-enterprise-server-ejb3.jar:4.8.0-SNAPSHOT]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]

This exception is thrown roughly every 5 seconds.

AFAIK it should be caused by incorrect format of 'rhq.cassandra.seeds' property in rhq-server.properties but this property is set correctly. This exception disappears once the rhq server is restarted.

So I think it goes like this: the RHQ server is not restarted after the installation because of this bz -> the configuration is not reloaded -> the exception from this comment is thrown.
This is a bit more serious. A few tests from our CLI automation suite failed because the following exception was thrown:

Caused by: javax.ejb.EJBException: [Warning] java.lang.IllegalStateException: JBAS011048: Failed to construct component instance
    at org.rhq.enterprise.server.remote.RemoteSafeInvocationHandler.invoke(RemoteSafeInvocationHandler.java:116)
    at org.jboss.remoting.ServerInvoker.invoke(ServerInvoker.java:967)
    at org.jboss.remoting.transport.servlet.ServletServerInvoker.processRequest(ServletServerInvoker.java:416)
    at sun.reflect.GeneratedMethodAccessor183.invoke(Unknown Source)
    . . .
Caused by: java.lang.IllegalArgumentException: [Warning] Expected string of the form, hostname|jmxPort|nativeTransportPort
    at org.rhq.core.util.exception.WrappedRemotingException.getCause(WrappedRemotingException.java:121)
    at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:706)
    at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:708)

When the RHQ server is restarted after the installation (before the tests are launched), the tests pass.
This would indicate that rhq.cassandra.seeds was missing or not properly formed. This property is typically set in rhq-server.properties. Check to make sure it is correct. I have pushed a change to include the invalid property setting in the exception so future builds should show a bit more information.
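For anyone hitting this later, here is a minimal sketch of the kind of check being discussed: parsing one hostname|jmxPort|nativeTransportPort entry and reporting the offending value in the exception message. This is not the actual StorageNode.parseNodeInformation code; the class SeedsPropertyParser, the Seed holder, the wording of the extended message, and the port numbers in the example are all assumptions made for illustration.

```java
public class SeedsPropertyParser {

    /** Hypothetical value holder for one parsed seed entry. */
    static final class Seed {
        final String hostname;
        final int jmxPort;
        final int nativeTransportPort;

        Seed(String hostname, int jmxPort, int nativeTransportPort) {
            this.hostname = hostname;
            this.jmxPort = jmxPort;
            this.nativeTransportPort = nativeTransportPort;
        }
    }

    /**
     * Parses a single "hostname|jmxPort|nativeTransportPort" entry.
     * The invalid value is echoed in the message so the log shows what was read.
     */
    static Seed parseSeed(String entry) {
        // The pipe is a regex metacharacter, so it has to be escaped for split().
        String[] parts = entry == null ? new String[0] : entry.trim().split("\\|");
        if (parts.length != 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                "Expected string of the form, hostname|jmxPort|nativeTransportPort but got [" + entry + "]");
        }
        try {
            return new Seed(parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Expected numeric ports in hostname|jmxPort|nativeTransportPort but got [" + entry + "]", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative value only; the real ports depend on the installation.
        Seed seed = parseSeed("localhost|7299|9142");
        System.out.println(seed.hostname + " jmx=" + seed.jmxPort + " native=" + seed.nativeTransportPort);
        try {
            parseSeed("localhost");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the value in the message, an empty or stale property would be visible directly in the server log instead of only the expected format.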
(In reply to Jay Shaughnessy from comment #8)
> This would indicate that rhq.cassandra.seeds was missing or not properly
> formed.
>
> This property is typically set in rhq-server.properties. Check to make sure
> it is correct.
>
> I have pushed a change to include the invalid property setting in the
> exception so future builds should show a bit more information.

As I noted in comment 6:
"AFAIK it should be caused by incorrect format of 'rhq.cassandra.seeds' property in rhq-server.properties but this property is set correctly. This exception disappears once the rhq server is restarted."

So the problem is not an incorrect property but the fact that the server did not reload its configuration after the installation and must be restarted manually. Then the issue disappears.
I created a separate bz 971377 to distinguish which installation script was used (rhq-install.sh or rhqctl).
I am not sure if the cause here had to do with the rhq.cassandra.seeds property, but the behavior you describe looks like an issue which I am currently addressing. If all of the storage nodes go down for whatever reason, the driver basically becomes stale and won't be able to reconnect when storage nodes come back up. The session object needs to be recycled.

I have added logic so that when the server goes into maintenance mode as a result of all storage nodes being down, we also shut down what I am now calling the storage client subsystem. This includes shutting down the driver session. When one or more storage nodes come back online, the storage client subsystem will be reinitialized with a new driver session.

You should see log statements like the following when the server goes into maintenance mode:

12:15:01,137 WARN [org.rhq.enterprise.server.cassandra.CassandraClusterHeartBeatJob] (RHQScheduler_Worker-2) Server[id=10001,name=localhost,address=localhost,port=7080,securePort=7443] is unable to connect to any Cassandra node. Server will go into maintenance mode.
12:15:01,137 INFO [org.rhq.enterprise.server.cassandra.CassandraClusterHeartBeatJob] (RHQScheduler_Worker-2) Moving Server[id=10001,name=localhost,address=localhost,port=7080,securePort=7443] from NORMAL to MAINTENANCE
12:15:01,146 INFO [org.rhq.enterprise.server.cassandra.CassandraClusterHeartBeatJob] (RHQScheduler_Worker-2) Preparing to shut down storage client subsystem
12:15:01,147 INFO [org.rhq.enterprise.server.cassandra.StorageClientManagerBean] (RHQScheduler_Worker-2) Shuttting down storage client subsystem

And when the server comes out of maintenance mode, you should see log statements like:

12:25:01,161 INFO [org.rhq.enterprise.server.cassandra.CassandraClusterHeartBeatJob] (RHQScheduler_Worker-3) Moving Server[id=10001,name=localhost,address=localhost,port=7080,securePort=7443] from MAINTENANCE to NORMAL
12:25:01,168 INFO [org.rhq.enterprise.server.cassandra.CassandraClusterHeartBeatJob] (RHQScheduler_Worker-3) Restarting storage client subsystem...
12:25:01,168 INFO [org.rhq.enterprise.server.cassandra.StorageClientManagerBean] (RHQScheduler_Worker-3) Initializing storage client subsystem
12:25:01,546 INFO [org.rhq.server.metrics.MetricsDAO] (RHQScheduler_Worker-3) Initializing prepared statements
12:25:01,564 INFO [org.rhq.server.metrics.MetricsDAO] (RHQScheduler_Worker-3) Finished initializing prepared statements in 18 ms
12:25:01,565 INFO [org.rhq.enterprise.server.cassandra.StorageClientManagerBean] (RHQScheduler_Worker-3) Storage client subsystem is now initialized
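The recycle behavior described in this comment can be sketched roughly as follows. This is only an illustration of the state transitions, not the code in CassandraClusterHeartBeatJob or StorageClientManagerBean; every type here (Session, StorageClient, HeartBeat, ServerMode) is a hypothetical stand-in, and the availability check is reduced to a boolean passed in by the caller.

```java
public class StorageClientRecycleSketch {

    enum ServerMode { NORMAL, MAINTENANCE }

    /** Hypothetical stand-in for a driver session. */
    static final class Session {
        private boolean closed;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    /** Hypothetical storage client subsystem that owns exactly one session. */
    static final class StorageClient {
        private Session session;

        void init() { session = new Session(); }

        void shutdown() {
            if (session != null) {
                session.close(); // the stale session is never reused
                session = null;
            }
        }

        boolean isInitialized() { return session != null; }
        Session getSession() { return session; }
    }

    /** Hypothetical heartbeat: one call to tick() per scheduled run. */
    static final class HeartBeat {
        private final StorageClient client;
        private ServerMode mode = ServerMode.NORMAL;

        HeartBeat(StorageClient client) { this.client = client; }

        ServerMode tick(boolean anyNodeUp) {
            if (mode == ServerMode.NORMAL && !anyNodeUp) {
                // All nodes down: enter maintenance and drop the session.
                mode = ServerMode.MAINTENANCE;
                client.shutdown();
            } else if (mode == ServerMode.MAINTENANCE && anyNodeUp) {
                // At least one node is back: build a fresh session, then resume.
                client.init();
                mode = ServerMode.NORMAL;
            }
            return mode;
        }
    }

    public static void main(String[] args) {
        StorageClient client = new StorageClient();
        client.init();
        HeartBeat heartBeat = new HeartBeat(client);
        Session first = client.getSession();

        System.out.println(heartBeat.tick(false) + ", old session closed: " + first.isClosed());
        System.out.println(heartBeat.tick(true) + ", new session: " + (client.getSession() != first));
    }
}
```

The point of the sketch is that the old session object is closed and discarded rather than reconnected, which matches the "session object needs to be recycled" observation above.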
These changes are in build 2295 or later of the rhq-master Jenkins job. Moving to ON_QA so this can be retested.
Verified that installer exception is not there any more. Version: 4.8.0-SNAPSHOT Build Number: bf44b14
Sorry, I have to reopen the issue. rhq-install still produces errors in the log (the error is attached).
I tried that again and it works for me. I checked with Armine that we used the same version of RHQ. We use the same architecture and java version so there is some magic going on.

Armine:
uname -a
Linux rhqserver.bc.jonqe.lab.eng.bos.redhat.com 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Mine:
uname -a
Linux fbry-test2.bc.jonqe.lab.eng.bos.redhat.com 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
Please re-test. The storage node code changed significantly since the bug was reported. Currently the storage cluster availability no longer uses the JMX port but rather events from the CQL driver. The problem should be resolved.
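As a rough illustration of the event-driven approach mentioned here, availability can be tracked from up/down callbacks instead of polling each node. The NodeStateListener interface below is a hypothetical stand-in written for this sketch; it is not the CQL driver's own listener type, and the tracker is not the RHQ implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ClusterAvailabilitySketch {

    /** Hypothetical callback interface, modeled loosely on driver-style state events. */
    interface NodeStateListener {
        void onUp(String address);
        void onDown(String address);
    }

    /** Keeps the set of nodes currently reported as up. */
    static final class AvailabilityTracker implements NodeStateListener {
        // Concurrent set, since callbacks may arrive on driver threads.
        private final Set<String> upNodes = ConcurrentHashMap.newKeySet();

        @Override
        public void onUp(String address) { upNodes.add(address); }

        @Override
        public void onDown(String address) { upNodes.remove(address); }

        /** The cluster counts as available while at least one node is up. */
        boolean isClusterAvailable() { return !upNodes.isEmpty(); }
    }

    public static void main(String[] args) {
        AvailabilityTracker tracker = new AvailabilityTracker();
        tracker.onUp("node1");
        System.out.println("available: " + tracker.isClusterAvailable());
        tracker.onDown("node1");
        System.out.println("available: " + tracker.isClusterAvailable());
    }
}
```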
Created attachment 790421 [details] rhq-storage-installer.png
Reopening. rhq-storage-installer.sh exits with status 1 and storage is not installed. Please see the attached screenshot.
Comment 18 does not look related to the initial JMX parsing issues. Please open a different BZ if this is an issue. Also, please review and attach the logs for both the storage installer (rhq-storage-installer.log) and the storage server (rhq-storage.log) for the cause of the error.
Closing this bug as verified since rhq.sh has been removed and rhqctl install works as an equivalent. The bug related to comment 18 is #972602.