Description of problem:
Too many exceptions in server.log

Version-Release number of selected component (if applicable):
e2a1811

How reproducible:
Very frequently

Steps to Reproduce:
1. Install and start the RHQ server, storage node, and agent on ip1
2. Stop and start the storage node on ip1
3. Install and start a storage node and agent on ip2
4. Stop and start, or restart, the storage node on ip1
5. Undeploy the storage node on ip2
6. Deploy the storage node on ip2 again
7. Run the prepare for bootstrap operation
8. Cancel the prepare for bootstrap operation
9. Run the prepare for bootstrap operation
10. Run the add node maintenance operation

Additional info:
*** EJB Invocation failed on component MeasurementDataManagerBean, with a stack trace
-- Occurs when no host is available to connect to. I would expect this exception to be handled when the server is in maintenance mode.

*** Failed to get live availability.: java.lang.IllegalStateException, with a stack trace
-- Occurs when one of the agents is not available to connect to. I would expect this exception to be handled.

*** Sending exception to client: [1377693404302] : org.rhq.enterprise.server.resource.ResourceNotFoundException: A Resource with id 10591 does not exist in inventory, with a stack trace
-- I could not identify which earlier action triggered this. I would expect this exception to be handled.

*** EJB Invocation failed on component OperationManagerBean for method public abstract void - with a stack trace
-- Occurs when a bootstrap operation is cancelled. I would expect this exception to be handled.

server.log uploaded here for detailed investigation -> http://d.pr/f/iqZj
Some of the exceptions in the server log are due to bug 1002238. Other exceptions, such as com.datastax.driver.core.exceptions.UnavailableException, can occur when we try to read/write metrics while a node is being added to or removed from the cluster and the cluster is being rebalanced. com.datastax.driver.core.exceptions.NoHostAvailableException is thrown when we try to read/write metrics while the storage cluster is down. Both are RuntimeExceptions, and they are getting wrapped in an EJBException, which results in a much larger stack trace than necessary. The following will help clean things up a bit: I will add a new StorageException class that wraps those C* exceptions and make it an application exception. We will then get a stack trace that does not include all of the internal container calls, which will help a lot with debugging.
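A minimal sketch of the wrapping idea, not the actual change. It uses only the JDK so that it runs on its own: MetricsStore, DriverCall, and execute are hypothetical names, and a plain RuntimeException stands in for the driver exceptions. In the server, StorageException would additionally be marked as an application exception (for example with @javax.ejb.ApplicationException) so the container does not wrap it in an EJBException.

```java
// Hypothetical sketch. In the server this class would also be declared an
// application exception so the container passes it through unwrapped.
class StorageException extends RuntimeException {
    StorageException(String message, Throwable cause) {
        super(message, cause);
    }
}

class MetricsStore {
    // Stand-in for a call into the storage driver (assumption, not RHQ code).
    interface DriverCall<T> {
        T run();
    }

    // Runs the driver call and rethrows any driver-level RuntimeException
    // as a StorageException, keeping the original exception as the cause.
    static <T> T execute(DriverCall<T> call) {
        try {
            return call.run();
        } catch (StorageException e) {
            throw e; // already wrapped, do not wrap twice
        } catch (RuntimeException e) {
            throw new StorageException("Storage cluster request failed: " + e.getMessage(), e);
        }
    }
}
```

Callers would then catch StorageException in one place instead of dealing with each driver exception type, and the original driver exception stays available through getCause().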
I have made some changes to reduce the noise in server.log. From my commit message:

There were some methods in MeasurementDataManagerBean with default transaction support, but they should be NOT_SUPPORTED since they read from and write to Cassandra. This will help reduce stack traces because when exceptions bubble up from those methods, they will no longer get wrapped in EJBExceptions.

When an error occurs while inserting raw data, we no longer log the full exception. There is a better than likely chance that if an exception occurs for one write, it will occur for several. Logging each of the exceptions resulted in a lot of noise in the logs. Now only the error message is logged. The full exception is logged with DEBUG logging.

master commit hash: 98c76cebf

These changes should be in build 2596 of the rhq-master job.
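The logging change can be sketched roughly as follows. This is an illustration only, not the code from the commit: RawDataInsertLogger and logInsertFailure are hypothetical names, and java.util.logging is used to keep the example self-contained, with Level.FINE standing in for DEBUG.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class RawDataInsertLogger {
    // Logs a failed raw data insert. With debug-level logging enabled the
    // full exception (and its stack trace) is attached; otherwise only the
    // error message is logged, so repeated write failures stay one line each.
    static void logInsertFailure(Logger log, Throwable error) {
        if (log.isLoggable(Level.FINE)) {
            log.log(Level.FINE, "Failed to insert raw data", error);
        } else {
            log.warning("Failed to insert raw data: " + error.getMessage());
        }
    }
}
```

With the logger at its default level, a burst of failed writes produces one short line per failure. Lowering the level to FINE brings the stack traces back when they are needed for debugging.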
Update: there are new time-out exceptions in server.log: http://pastebin.test.redhat.com/161393. I will update the bug as soon as they are reproduced.
The description lists a few different exceptions. As I mentioned in comment 1, one of the exceptions is related to bug 1002238. The other exceptions are addressed by commit 98c76cebf, cited in comment 2. The error cited in comment 3 is unrelated, and if necessary I would rather track it in a separate BZ. I do not want this BZ to become a catch-all bucket for errors that appear in the server log.
I have opened bug 1003191 to track the issue cited in comment 3.