Bug 965886

Summary: Servers can continue to invoke operations for agents that have since been registered to a different server
Product: [JBoss] JBoss Operations Network
Reporter: Larry O'Leary <loleary>
Component: Agent, Inventory, Operations
Assignee: John Mazzitelli <mazz>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Docs Contact:
Priority: urgent
Version: JON 3.1.2
CC: ahovsepy, aneelica, asantos, djorm, hrupp, mazz, myarboro, theute
Target Milestone: ER04
Target Release: JON 3.2.0
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-02 20:34:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1012435
Attachments:
Description: platform_util
Flags: none

Description Larry O'Leary 2013-05-21 22:15:23 UTC
Description of problem:
If an agent has been registered with a server and then later configured to register with a different (unrelated) server, the original server that the agent registered to will continue to talk to the agent even though the agent is not connected to it.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Always

Steps to Reproduce:
1.  Install JBoss ON system and start the agent using a public interface.
2.  Import platform, agent, and server into inventory.
3.  Verify everything is up and showing available.
4.  From JONServer1, view the Platform Utilization Report.
5.  Shut down the agent.
6.  Reconfigure agent to communicate with another JBoss ON server (JONServer2) (on another host).
7.  Import platform and agent into inventory on JONServer2.
8.  Verify everything is up and showing available on JONServer2.
9.  Verify everything is down and showing unavailable on JONServer1.
10. From JONServer1, view the Platform Utilization Report.

Actual results:
Platform utilization report does not get displayed and a "Globally uncaught exception" appears in the message center.

    com.google.gwt.core.client.JavaScriptException:(TypeError) 
     stack: org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_$copyValues__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_2Lorg_rhq_core_domain_resource_composite_PlatformMetricsSummary_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90371
    org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_copyValues__Ljava_lang_Object_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90403
    org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_copyValues__Ljava_lang_Object_2ZLcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:58330
    org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_$buildRecords__Lorg_rhq_enterprise_gui_coregui_client_util_RPCDataSource_2Ljava_util_Collection_2Z_3Lcom_smartgwt_client_widgets_grid_ListGridRecord_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:58049
    org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_$onSuccess__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_2Lorg_rhq_core_domain_util_PageList_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90422
    org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_onSuccess__Ljava_lang_Object_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:90440
    com_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_$onResponseReceived__Lcom_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_2Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:10006
    org_rhq_enterprise_gui_coregui_client_util_rpc_TrackingRequestCallback_onResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:148704
    com_google_gwt_http_client_Request_$fireOnResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:4388
    com_google_gwt_http_client_RequestBuilder$1_onReadyStateChange__Lcom_google_gwt_xhr_client_XMLHttpRequest_2V@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:4573
    this$static.onreadystatechange<@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:13131
    com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:2735
    com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2@http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:2773
    @http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html:2758

     fileName: http://localhost:7080/coregui/org.rhq.enterprise.gui.coregui.CoreGUI/2AB681FED10D6980DED6B17E0559474A.cache.html
     lineNumber: 90371
     columnNumber: 4: summary.org_rhq_core_domain_resource_composite_PlatformMetricsSummary_idleCPU is null
    --- STACK TRACE FOLLOWS ---
    ...
     columnNumber: 4: summary.org_rhq_core_domain_resource_composite_PlatformMetricsSummary_idleCPU is null
       at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_$copyValues__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_2Lorg_rhq_core_domain_resource_composite_PlatformMetricsSummary_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source)
       at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource_copyValues__Ljava_lang_Object_2Lcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source)
       at Unknown.org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_copyValues__Ljava_lang_Object_2ZLcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source)
       at Unknown.org_rhq_enterprise_gui_coregui_client_util_RPCDataSource_$buildRecords__Lorg_rhq_enterprise_gui_coregui_client_util_RPCDataSource_2Ljava_util_Collection_2Z_3Lcom_smartgwt_client_widgets_grid_ListGridRecord_2(Unknown Source)
       at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_$onSuccess__Lorg_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_2Lorg_rhq_core_domain_util_PageList_2V(Unknown Source)
       at Unknown.org_rhq_enterprise_gui_coregui_client_dashboard_portlets_platform_PlatformMetricDataSource$1_onSuccess__Ljava_lang_Object_2V(Unknown Source)
       at Unknown.com_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_$onResponseReceived__Lcom_google_gwt_user_client_rpc_impl_RequestCallbackAdapter_2Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V(Unknown Source)
       at Unknown.org_rhq_enterprise_gui_coregui_client_util_rpc_TrackingRequestCallback_onResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_Response_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_Request_$fireOnResponseReceived__Lcom_google_gwt_http_client_Request_2Lcom_google_gwt_http_client_RequestCallback_2V(Unknown Source)
       at Unknown.com_google_gwt_http_client_RequestBuilder$1_onReadyStateChange__Lcom_google_gwt_xhr_client_XMLHttpRequest_2V(Unknown Source)
       at Unknown.this$static.onreadystatechange<(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_apply__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.com_google_gwt_core_client_impl_Impl_entry0__Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2Ljava_lang_Object_2(Unknown Source)
       at Unknown.anonymous(Unknown Source)
       at Unknown.anonymous(Unknown Source)

Additionally, the agent logs contained the following warnings:

    WARN  [WorkerThread#0[10.3.113.30:53249]] (rhq.core.pc.measurement.MeasurementManager)- Can not get resource container for resource with id 10001
    WARN  [WorkerThread#0[10.3.113.30:53249]] (rhq.core.pc.measurement.MeasurementManager)- Can not get resource container for resource with id 10001

Where the IP address is that of JONServer1 even though the agent was registered and connected to JONServer2.

Expected results:
The platform utilization report shouldn't throw an error. If more than one platform is in inventory, the utilization data for the other platform(s) should still be displayed.
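
A minimal sketch of the expected behavior described above, assuming the report is built from one summary object per platform and that a missing metric arrives as null (as the "idleCPU is null" error suggests). The class, field, and method names here are hypothetical stand-ins for illustration and are not the RHQ PlatformMetricDataSource or PlatformMetricsSummary code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one platform's metrics; not the RHQ PlatformMetricsSummary class.
final class MetricsSummarySketch {
    final String platformName;
    final Double idleCpu;     // null when the agent returned no usable data
    final Double freeMemory;  // null when the agent returned no usable data

    MetricsSummarySketch(String platformName, Double idleCpu, Double freeMemory) {
        this.platformName = platformName;
        this.idleCpu = idleCpu;
        this.freeMemory = freeMemory;
    }
}

final class PlatformReportSketch {
    static final String NO_DATA = "n/a";

    // Render a possibly missing metric without dereferencing null.
    static String render(Double value) {
        return value == null ? NO_DATA : String.valueOf(value);
    }

    // Build one row per platform; a platform with missing metrics gets
    // placeholder cells instead of aborting the whole report.
    static List<Map<String, String>> buildRows(List<MetricsSummarySketch> summaries) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (MetricsSummarySketch s : summaries) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put("platform", s.platformName);
            row.put("idleCpu", render(s.idleCpu));
            row.put("freeMemory", render(s.freeMemory));
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<MetricsSummarySketch> summaries = new ArrayList<>();
        summaries.add(new MetricsSummarySketch("moved-platform", null, null));
        summaries.add(new MetricsSummarySketch("healthy-platform", 0.5, 2048.0));
        for (Map<String, String> row : buildRows(summaries)) {
            System.out.println(row);
        }
    }
}
```

The point of the sketch is only that one platform with no data should produce a placeholder row rather than an uncaught exception that hides the rows for every other platform.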

Additional info:
Essentially, what this comes down to is that the agent was originally registered and connected to JONServer1, so JONServer1 knows the agent's endpoint address. Even after the agent's configuration is wiped and it is registered with a completely different server, it continues to receive invocation requests from JONServer1. In this case, the only reason the platform utilization report threw an error was that the platform resource ID JONServer1 knew about was not the ID of any existing resource in JONServer2's inventory.

Take the following scenario:

This could happen when an agent is registered with a server and has its platform or other child resources added to inventory, and the administrator later decides to shut down the agent and register it with a different server. For example, the initial roll-out of the JBoss ON system might use one server for the entire enterprise, but later it is decided to install separate JBoss ON systems, one per data center (multiple servers not in an HA configuration).

- Server 1
-- Agent 1
-- Agent 2
-- Agent 3
-- Agent 4


Is converted to:

- Server 1
-- Agent 1
- Server 2
-- Agent 2
- Server 3
-- Agent 3
- Server 4
-- Agent 4

The result will be that Server 1 will still know about Agents 1, 2, 3, and 4. Assuming their network addresses have not changed, Server 1 will still be able to connect to those agents and invoke operations.

Comment 1 John Mazzitelli 2013-05-22 03:13:00 UTC
You can secure the server-agent communications with SSL/certificates. The agent will reject any server whose certificate is not in the agent's truststore.

Thus, if you have multiple JON environments and are afraid that agents will end up getting registered in both (a rare edge-case I would presume) then securing the comm with certificates is the answer. That is how JON can be told to restrict which servers can talk to which agents.

Comment 2 Larry O'Leary 2013-05-22 13:05:26 UTC
Understood, but we need a way to handle this situation to prevent random failures due to the agent not returning expected data/inventory when invoking operations. In the reproducer we can see that the platform utilization data fails to be displayed due to empty responses. What if the response wasn't empty but contained data for a different resource altogether? 

It still seems that the server should validate the agent's identity prior to executing any commands so that it can respond appropriately.
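
A rough sketch of the kind of check being asked for here, assuming the server holds the token it issued to the agent at registration and can obtain the token the agent currently presents. Both of those are assumptions made for illustration; the class name, the outcome names, and the way the tokens are obtained are hypothetical and do not describe how JBoss ON implements or later fixed this:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical guard a server could run before invoking an operation on an agent.
final class InvocationGuardSketch {
    enum Outcome { INVOKE, SKIP_AGENT_NOT_OURS }

    // expectedToken: token this server issued when the agent registered with it.
    // reportedToken: token the agent presents now (null if it presents none).
    static Outcome check(String expectedToken, String reportedToken) {
        if (expectedToken == null || reportedToken == null) {
            return Outcome.SKIP_AGENT_NOT_OURS;
        }
        // Compare the token bytes; a mismatch means the agent has since
        // registered elsewhere and its resource IDs can no longer be trusted.
        boolean same = MessageDigest.isEqual(
                expectedToken.getBytes(StandardCharsets.UTF_8),
                reportedToken.getBytes(StandardCharsets.UTF_8));
        return same ? Outcome.INVOKE : Outcome.SKIP_AGENT_NOT_OURS;
    }

    public static void main(String[] args) {
        System.out.println(check("token-from-server1", "token-from-server1"));
        System.out.println(check("token-from-server1", "token-from-server2"));
    }
}
```

With a check like this, the server could mark the agent as no longer its own and report that to the user, instead of sending requests whose resource IDs mean nothing to the agent.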

Comment 3 John Mazzitelli 2013-05-22 13:30:41 UTC
So there are two things the customer can do for such a rare edge case - one to be done proactively and one reactively:

1) As mentioned earlier, if the user wants to proactively prohibit this from happening, they can configure the servers and agents with certificates

2) If the user doesn't do #1, they can reactively (i.e. after the problem happens) uninventory/delete the agent in whichever JON environment they don't want it in.

I guess the thing to say is that it simply is not supported for an agent to be actively registered in two separate JON environments - you are required to uninventory/delete it from one of them. If you leave it registered in both, errors and problems are expected to happen - the resource IDs alone will not be correct, nor will the agent tokens, and probably more things.

Comment 4 John Mazzitelli 2013-05-22 13:47:57 UTC
> It still seems that the server should validate the agent's identity prior to
> executing any commands so that it can respond appropriately.

As for this specifically, "the server should validate the agent's identity prior to executing any commands" - technically, that's what the SSL certificates do, so it can be said we do support this feature already (albeit optionally).

Comment 5 Larry O'Leary 2013-05-22 15:22:38 UTC
But I think we are missing the real issues here:

 * This isn't really so much of an edge case. Although this configuration would never be deliberate, it can actually happen fairly easily. For example, I might test an agent in dev using the dev server and then move it over to the production server. In that case, the dev server will still know about the agent and there is a chance that its end-point is the same. The user isn't thinking, "Oh, I guess I need to find all the old servers this agent may have, at one time or another, been registered to and remove them from inventory."

 * The UI fails to do anything and returns a meaningless "globally uncaught exception" when this situation occurs. This is unacceptable regardless of how the problem occurs. The system should continue to function as normal and one agent configuration issue shouldn't render my JBoss ON system useless.

Comment 11 David Jorm 2013-08-20 10:52:59 UTC
Mazz, can you provide any input on comment 10?

Comment 19 John Mazzitelli 2013-10-02 15:46:32 UTC
git commit to master: c160be0

Comment 20 Simeon Pinder 2013-10-24 04:10:23 UTC
Moving to ON_QA for testing in the next build.

Comment 21 Armine Hovsepyan 2013-10-30 14:47:33 UTC
Created attachment 817475
platform_util

Comment 22 Armine Hovsepyan 2013-10-30 14:48:27 UTC
Verified with JON 3.2 ER4.