Description of problem:
After successfully inventorying a Tomcat Server, you can repeatedly Stop and Start its Tomcat Connectors (e.g. http-8080) without problems. However, if you then run a successful Restart operation on the Tomcat Server (8080) itself and afterwards try a new Stop or Start operation on the same Tomcat Connector (e.g. http-8080), the operation fails with the stack trace shown below.
The only way to get the operations on the Tomcat Connectors to work again is to Uninventory, re-discover and Import the Tomcat Server again.
Version-Release number of selected component (if applicable):
build number: b9ca90d
JBoss Operations Network
build number: 10745:647a602
Note that this failure is easier to reproduce on a Tomcat5 server than on a Tomcat6 server. On Tomcat6, the Stop operation on the Tomcat Connector appears successful, yet the port it was bound to is not actually released, as netstat shows. That is a separate defect and should be opened against Apache.
# netstat -lpn | grep 8080
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 :::8080 :::* LISTEN -
Note that the PID is shown as "-", so the port remains bound to an unknown process.
Actual RHQ/JON results:
org.mc4j.ems.connection.EmsInvocationException: Exception on invocation of [stop]java.lang.reflect.UndeclaredThrowableException
at sun.reflect.GeneratedMethodAccessor228.invoke(Unknown Source)
Caused by: java.lang.reflect.UndeclaredThrowableException
at $Proxy60.invoke(Unknown Source)
... 11 more
Caused by: java.rmi.ConnectException: Connection refused to host: 127.0.0.1; nested exception is:
java.net.ConnectException: Connection refused
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
... 13 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
... 24 more
So this is an interesting bug that touches on an area that's been a problem for me in the past, and I've worked around it. This time I think the fix is in place for the root cause.
The issue is in the way the JMX plugin caches the component mbean. The caching is fine most of the time, but the bean's validity is only verified in the MBeanResourceComponent implementation of getAvailability(). This is problematic because there can be a significant window of time (minutes) between the bean becoming invalid and the next call to getAvailability(). And if, as in this issue report, getAvailability() is overridden without calling the superclass implementation, the bean will never get refreshed short of an agent shutdown.
When the server restart operation happens, the TC server is shut down and restarted. The mbean connections are all lost at that point and the cached beans become invalid. They stay that way at least until the next availability check. Note that this check is scheduled by the plugin container; it is unrelated to the fact that the server has been restarted via the operation.
So metric collection, operations, etc. are all going to be in trouble until the avail check. And TC connectors, due to the override, will not perform correctly at all after the restart.
The solution is to change the implementation of MBeanResourceComponent.getEmsBean(). This method typically returns the cached bean. I'm adding a (fast) check to ensure that the cached bean's connection matches the current emsConnection. If it does not, the bean is reset.
This has possible benefit to all JMX based plugins. I am sure there must be other code paths, especially for plugins offering stop/start/restart capability, where this could have been a problem.
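For illustration only, the check described above for getEmsBean() can be sketched roughly as follows. This is not the code from the fix commit: Connection, Bean, Component, getBean() and reconnect() are hypothetical stand-ins invented for this sketch, not the real EMS or plugin classes. The only point is that the cached bean remembers which connection it came from and is dropped as soon as that connection is no longer the current one.

```java
public class CachedBeanSketch {

    /** Hypothetical stand-in for a JMX connection (not the real EMS API). */
    static final class Connection {
        final String id;
        Connection(String id) { this.id = id; }
    }

    /** Hypothetical stand-in for an mbean proxy bound to exactly one connection. */
    static final class Bean {
        final String name;
        final Connection owner;
        Bean(String name, Connection owner) { this.name = name; this.owner = owner; }
    }

    /** Hypothetical component that caches its bean between calls. */
    static final class Component {
        private final String beanName;
        private Connection currentConnection;
        private Bean cachedBean;
        int loads = 0; // counts how often the bean had to be (re)loaded

        Component(String beanName, Connection initial) {
            this.beanName = beanName;
            this.currentConnection = initial;
        }

        /** Simulates the server restart: the old connection is replaced by a new one. */
        void reconnect(Connection fresh) {
            this.currentConnection = fresh;
        }

        Bean getBean() {
            // Fast identity check: a bean obtained from an older connection is stale.
            if (cachedBean != null && cachedBean.owner != currentConnection) {
                cachedBean = null;
            }
            if (cachedBean == null) {
                // In this sketch "loading" just creates a new object bound to the
                // current connection; a real plugin would look the bean up remotely.
                cachedBean = new Bean(beanName, currentConnection);
                loads++;
            }
            return cachedBean;
        }
    }

    public static void main(String[] args) {
        Component connector = new Component("http-8080", new Connection("before-restart"));
        connector.getBean();
        connector.getBean(); // served from the cache, no reload

        connector.reconnect(new Connection("after-restart"));
        Bean refreshed = connector.getBean(); // stale bean is discarded and reloaded

        System.out.println("loads=" + connector.loads + ", owner=" + refreshed.owner.id);
    }
}
```

Because the check sits in the accessor rather than in the availability check, every caller (operations, metric collection) gets a valid bean regardless of when the next avail check runs or whether a subclass overrides it.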
note - reviewed with mazz.
fix commit: c6a959a6fd636f15c76493bf20a9ad779e441175
In addition to verifying the scenario written up in this BZ, I would recommend that QA also attempt a similar test with an AS4 restart. After the restart, try an operation on some child service rather than on the server itself, analogous to the connector operation used for TC.
Tested against Tomcat5 and it seems to be fine. Will test against AS4.
QA Verified against Tomcat5 and EAP4.3. After performing a restart op on the server and then stop/start against some child resource, things look fine.
Mass-closure of verified bugs against JON.