Description of problem:
If an agent has been registered with a server and is later configured to register with a different (unrelated) server, the original server will continue to talk to the agent even though the agent is no longer connected to it.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Always

Steps to Reproduce:
1. Install a JBoss ON system and start the agent using a public interface.
2. Import the platform, agent, and server into inventory.
3. Verify everything is up and showing available.
4. From JONServer1, view the Platform Utilization Report.
5. Shut down the agent.
6. Reconfigure the agent to communicate with another JBoss ON server (JONServer2) on another host.
7. Import the platform and agent into inventory on JONServer2.
8. Verify everything is up and showing available on JONServer2.
9. Verify everything is down and showing unavailable on JONServer1.
10. From JONServer1, view the Platform Utilization Report.

Actual results:
The Platform Utilization Report is not displayed and a "Globally uncaught exception" appears in the message center.
com.google.gwt.core.client.JavaScriptException:(TypeError) stack: org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_$copyValues__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_2Lorg_rhq_core_domain_resource_composite_PlatformMetricsSummary_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90371 org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_copyValues__Ljava_lang_Object_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90403 org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_copyValues__Ljava_lang_Object_2ZLcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:58330 org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_$buildRecords__Lorg_rhq_enterprise_gui_coregui_client_util_RPCDataSource_2Ljava_util_Collection_2Z_3Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:58049 org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_$onSuccess__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_2Lorg_rhq_core_domain_util_PageList_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90422 org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_onSuccess__Ljava_lang_Object_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90440 
com_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_$onResponseReceived__Lcom_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_2Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:10006 org_rhq_enterprise_gui_coregui_client_util_rpc_TrackingRequestCallback_onResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:148704 com_google_gwt_http_client_Request_$fireOnResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:4388 com_google_gwt_http_client_RequestBuilder$1_onReadyStateChange__Lcom_google_gwt_xhr_client_XMLHttpRequest_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:4573 this$static.onreadystatechange<@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:13131 com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:2735 com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:2773 @http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:2758 fileName: http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html lineNumber: 
90371 columnNumber: 4: summary.org_rhq_core_domain_resource_composite_PlatformMetricsSummary_idleCPU is null --- STACK TRACE FOLLOWS --- ... columnNumber: 4: summary.org_rhq_core_domain_resource_composite_PlatformMetricsSummary_idleCPU is null at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_$copyValues__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_2Lorg_rhq_core_domain_resource_composite_PlatformMetricsSummary_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source) at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_copyValues__Ljava_lang_Object_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source) at Unknown.org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_copyValues__Ljava_lang_Object_2ZLcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source) at Unknown.org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_$buildRecords__Lorg_rhq_enterprise_gui_coregui_client_util_RPCDataSource_2Ljava_util_Collection_2Z_3Lcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source) at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_$onSuccess__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_2Lorg_rhq_core_domain_util_PageList_2V(Unknown Source) at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_onSuccess__Ljava_lang_Object_2V(Unknown Source) at Unknown.com_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_$onResponseReceived__Lcom_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_2Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V(Unknown Source) at 
Unknown.org_rhq_enterprise_gui_coregui_client_util_rpc_TrackingRequestCallback_onResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V(Unknown Source)
at Unknown.com_google_gwt_http_client_Request_$fireOnResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
at Unknown.com_google_gwt_http_client_RequestBuilder$1_onReadyStateChange__Lcom_google_gwt_xhr_client_XMLHttpRequest_2V(Unknown Source)
at Unknown.this$static.onreadystatechange<(Unknown Source)
at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
at Unknown.anonymous(Unknown Source)
at Unknown.anonymous(Unknown Source)

Additionally, the agent logs contained the following warnings:

WARN [WorkerThread#0[10.3.113.30:53249]] (rhq.core.pc.measurement.MeasurementManager)- Can not get resource container for resource with id 10001
WARN [WorkerThread#0[10.3.113.30:53249]] (rhq.core.pc.measurement.MeasurementManager)- Can not get resource container for resource with id 10001

The IP address in these warnings is that of JONServer1, even though the agent was registered with and connected to JONServer2.

Expected results:
The Platform Utilization Report shouldn't throw an error. If more than one platform is in inventory, the utilization data for the other platform(s) should be displayed.

Additional info:
Essentially, what this comes down to is that the agent was originally registered with and connected to JONServer1, so JONServer1 knows the agent's endpoint address. Even after the agent's configuration is wiped and it is registered with a completely different server, it continues to receive invocation requests from JONServer1.
In this case, the only reason the Platform Utilization Report threw an error was that the platform resource ID JONServer1 knew about wasn't the resource ID of an existing resource in JONServer2.

Take the following scenario. This could happen where an agent is registered with a server and has its platform or other child resources added to inventory. Later, the administrator decides to shut down the agent and register it with a different server. For example, the initial roll-out of the JBoss ON system is one server for the entire enterprise, but later it is decided that separate JBoss ON systems will be installed, one per data center (multiple servers not in an HA configuration).

- Server 1
-- Agent 1
-- Agent 2
-- Agent 3
-- Agent 4

Is converted to:

- Server 1
-- Agent 1
- Server 2
-- Agent 2
- Server 3
-- Agent 3
- Server 4
-- Agent 4

The result is that Server 1 will still know about Agents 1, 2, 3, and 4. Assuming their network addresses have not changed, Server 1 will still be able to connect to the agents and invoke operations.
You can secure the server-agent communications with SSL/certificates. The agent will reject any server whose certificate is not in the agent's truststore. Thus, if you have multiple JON environments and are afraid that agents will end up getting registered in both (a rare edge-case I would presume) then securing the comm with certificates is the answer. That is how JON can be told to restrict which servers can talk to which agents.
Understood, but we need a way to handle this situation to prevent random failures due to the agent not returning expected data/inventory when invoking operations. In the reproducer we can see that the platform utilization data fails to be displayed due to empty responses. What if the response wasn't empty but contained data for a different resource altogether? It still seems that the server should validate the agent's identity prior to executing any commands so that it can respond appropriately.
So there are two things the customer can do for such a rare edge case, one proactive and one reactive:

1) As mentioned earlier, if the user wants to proactively prohibit this from happening, they can configure the servers and agents with certificates.
2) If the user doesn't do #1, they can reactively (i.e. after the problem happens) uninventory/delete the agent in whichever JON environment they don't want it in.

I guess the thing to say is that it simply is not supported for an agent to be actively registered in two separate JON environments; you are required to uninventory/delete it from one of them. If you do leave it registered in both, errors and problems are expected to happen: the resource IDs will not be correct, nor will the agent tokens, and probably more things.
> It still seems that the server should validate the agent's identity prior to
> executing any commands so that it can respond appropriately.

As for this specifically, "the server should validate the agent's identity prior to executing any commands" - technically, that's what the SSL certificates do, so it can be said we do support this feature already (albeit optionally).
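To make the request concrete, here is a minimal sketch of an application-level identity check, independent of SSL. It is not RHQ code: `AgentEndpointSketch`, `FixedAgentSketch`, and `GuardedInvokerSketch` are hypothetical names, and it assumes the agent can report the token it currently holds. The server compares that token with the one it issued at registration and skips the invocation on a mismatch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Optional;

// Hypothetical view of a remote agent; not an RHQ interface.
interface AgentEndpointSketch {
    String reportedToken();        // token the agent currently holds
    String invoke(String command); // run a command on the agent
}

// Fixed-value agent used only to exercise the sketch.
final class FixedAgentSketch implements AgentEndpointSketch {
    private final String token;
    int invocations = 0;

    FixedAgentSketch(String token) {
        this.token = token;
    }

    @Override
    public String reportedToken() {
        return token;
    }

    @Override
    public String invoke(String command) {
        invocations++;
        return "ran " + command;
    }
}

// Server-side guard: only invoke when the agent still holds the token
// this server issued (assumption: one token per agent per server).
final class GuardedInvokerSketch {
    private final byte[] expectedToken;

    GuardedInvokerSketch(String expectedToken) {
        this.expectedToken = expectedToken.getBytes(StandardCharsets.UTF_8);
    }

    Optional<String> invoke(AgentEndpointSketch agent, String command) {
        String reported = agent.reportedToken();
        byte[] reportedBytes = reported == null
                ? new byte[0]
                : reported.getBytes(StandardCharsets.UTF_8);
        // Time-constant comparison of the two tokens.
        if (!MessageDigest.isEqual(expectedToken, reportedBytes)) {
            return Optional.empty(); // agent belongs to another server; do not invoke
        }
        return Optional.of(agent.invoke(command));
    }
}
```

With a guard like this, the caller gets an explicit "not my agent" result (the empty `Optional`) that it can report as such, rather than an empty or mismatched response it has to interpret.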
But I think we are missing the real issues here:

* This isn't really so much an edge case. Although this configuration would never be deliberate, it can actually happen fairly easily. For example, I might test an agent in dev using the dev server and then move it over to the production server. In that case, the dev server will still know about the agent, and there is a chance that its end-point is the same. The user isn't thinking, "Oh, I guess I need to find all the old servers this agent may have, at one time or another, been registered to and remove it from their inventories."
* The UI fails to do anything useful and returns a meaningless "globally uncaught exception" when this situation occurs. This is unacceptable regardless of how the problem occurs. The system should continue to function as normal, and one agent configuration issue shouldn't render my JBoss ON system useless.
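The TypeError in the trace is raised in `copyValues` because `idleCPU` on the summary is null. A minimal sketch of the defensive handling being asked for might look like the following. `SummarySketch` and `PlatformRowsSketch` are hypothetical stand-ins, not the actual `PlatformMetricsSummary` or `PlatformMetricDataSource` classes, and the `"n/a"` placeholder is an assumption. The point is only that a platform with missing metrics gets a placeholder row while the other platforms still render.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a per-platform metrics summary.
final class SummarySketch {
    final String platformName;
    final Double idleCpu; // fraction 0..1; null when the agent returned no data

    SummarySketch(String platformName, Double idleCpu) {
        this.platformName = platformName;
        this.idleCpu = idleCpu;
    }
}

final class PlatformRowsSketch {
    static final String NO_DATA = "n/a"; // assumed placeholder text

    // Build one display row; never dereference a missing metric.
    static Map<String, String> toRow(SummarySketch summary) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("name", summary.platformName);
        row.put("idleCpu", summary.idleCpu == null
                ? NO_DATA
                : String.format(Locale.ROOT, "%.1f%%", summary.idleCpu * 100));
        return row;
    }

    // One bad summary no longer aborts the whole report.
    static List<Map<String, String>> toRows(List<SummarySketch> summaries) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (SummarySketch summary : summaries) {
            rows.add(toRow(summary));
        }
        return rows;
    }
}
```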
Mazz, can you provide any input on comment 10 ?
git commit to master: c160be0
Moving to ON_QA for testing in the next build.
Created attachment 817475 platform_util
verified with jon 3.2 er4