Bug 536373 (RHQ-729)

Summary:	Refine the ResourceComponent API
Product:	[Other] RHQ Project	Reporter:	Jason Dobies <__jdobies>
Component:	Inventory	Assignee:	Jason Dobies <__jdobies>
Status:	CLOSED NOTABUG	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	1.0.1	Keywords:	Improvement
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	All
URL:	http://jira.rhq-project.org/browse/RHQ-729
Whiteboard:
Fixed In Version:	1.1	Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jason Dobies 2008-08-12 21:02:00 UTC

The lifecycle of a ResourceComponent has changed since its original intention. Originally, the start/stop methods were meant to correspond to whether or not the resource was running. This has become inadequate for a number of reasons. One of these is the current implementation of plugin configuration updates, which fails if the subsequent component start call (which is part of the plugin configuration workflow) fails. This causes a Catch-22 if the plugin configuration change was to enable the resource to be started.

While there are other ways to solve the aforementioned plugin configuration issue, the component API could use some revisiting with our better knowledge of how our plugins have evolved.


============================================================
Ideas
============================================================


Split the API to have start/stop and connect/disconnect calls
==============================

Overview
--------------------

One of the biggest issues faced in any of these changes is over complicating what it takes to write a plugin. The trick to this approach is being able to clearly and simply define the difference between start and connect, as well as enumerating which will be used in what circumstances.

* Start - Does any one-time loading necessary for the resource component to function. For instance, if the component were to load anything from the file system (i.e. a persistence mechanism using the agent's temporary directory structures). I'm not completely sure what you would do here v. the component's constructor. My suspicion is that this will likely be a no-op in most cases. We also need to better define what to do if this start call fails, but I'm thinking a failure at this point represents an error condition within the component (see notes later on where to pass the plugin configuration).

* Connect - Tells the component to connect to the resource. This call will fail if the resource itself is not running. A failure here shouldn't represent an error in the system, but simply a connection problem to the resource (specifically not an error if the resource is known to not be running).

* Stop/Disconnect - The definitions and contracts for these are largely contingent on what is decided for start/connect, simply undoing what is supposed to be done in each of those phases.


Questions
--------------------

- When is each called? An immediate port of the current setup would be to call start and connect in succession. That would grant us a differentiation between what we consider an error (fail to start) v. the inability to connect to a resource, though this could be achieved through specific exceptions thrown from the start method.

We would likely want to be more intelligent than a simple port of the existing infrastructure, but the question arises in the case that either failed. We currently attempt to call start on a stopped resource on each call to getAvailability. Logically, this would become an attempt to connect in the case that it is disconnected, however what do we do if the initial call to start failed? We'd effectively be in the same place, just calling two calls instead of one.

- Where is the resource context passed in? This is currently done in the start method and much of the data in there makes sense as part of the start operation. But what about the plugin configuration?

We have grown to overuse (in my opinion) the plugin configuration. In addition to properties on how to connect to a resource, we're now using it to specify resource-wide properties that span subsystems (i.e. things that should be applied to all operations on the resource). This blurs where it should be specified.

If they are in fact connection properties, it would make sense to pass them into the connect method. This preserves the multiple call nature of connect/disconnect and more rarely used start/stop on the component. However, if we continue to use these properties as resource-wide configuration properties, it's possible they are being used at component start time (granted, the solution may be to tell plugin developers to not do this).

Looking specfically at the plugin configuration update use case, we would first disconnect the component, update the configuration, and then attempt to connect again. If we cannot connect, there are two possibilities:
- The plugin configuration update failed, perhaps from a plugin-side validation.
- The connection failed, either because an incorrect (but semantically valid) property was specified or simply because the resource is not running.

To maintain that distinction in this set up, we would likely want multiple exceptions to be thrown from the connect call, similar to what was mentioned in the first bullet point (except in that case, with regard to start).


Separate method for setting the plugin configuration
==============================

This may be in addition to the above suggestion. The goal here is to more cleanly differentiate between a configuration that failed validation v. a failure to connect. This is alluded to in the previous discussion.

Additionally, this lets us continue to treat the plugin configuration as more than just connection properties by not tying it to the connect method. We can continue to pass it to start as part of the resource context. When we want to update it, we call disconnect, set plugin configuration, and then connect. The explicit call to set plugin configuration lets the plugin writer hang any other non-connection changes it needs to do off of the set method rather than the connect method.


New Interface
==============================

For either of these approaches, the suggestion was made to have a "connectable" interface that may optionally be implemented by the component. This would cause us to not only have clearly defined differences between start and connect, but also be able to take into account that connect may not always be available. Additionally, this further complicates writing a plugin as its optional nature will likely be confusing. I'm against the idea, but wanted to mention it for completeness.


Recursive
==============================

There is a comment in the InventoryManager on the call to activate a resource that's been deactivated in order to update its plugin configuration.

// TODO: What about re-activating the Resource's descendants?

Now would probably be a good time to address this.


============================================================
Other Tasks
============================================================

- Readdress existing plugins to split out start v. connect
- Adjust the InventoryManager to make the appropriate calls to start, connect, and set plugin configuration.
--- This includes understanding our usage and differentiation between the ResourceContainer.resourceComponentState (started, stopped) and Resource.connected (true, false). With this change, this is far from an ideal way to represent the differences as we can be in an inconsistent state (stopped and connected) as well as having two different locations to check (resource v. resource container). I'm also wondering if we treat these are 4 different states or if connected is an attribute on the started state, which will affect how we use these variables.
- There are no unit tests for InventoryManager, which raises the risk level of this change significantly (taking into account the current state of that code). 
- Tests will include smoke testing to make sure resource components function correctly, viewing the results of stopping a resource outside of RHQ, updating a plugin configuration when either connected or disconnected.

Comment 1 Joseph Marques 2008-08-14 04:40:23 UTC

wow, that's an impressive amount of detail jdob - nice leg work.

Comment 2 Jason Dobies 2008-08-14 17:04:45 UTC

*This comment largely talks in terms of the (unfortunately) currently closed source JBoss AS plugin code, however the concepts of what and how to connect to a resource transcend a particular plugin and apply to plugin best practices in general.*

The initial issue that started me down this path was the AS plugin failing on start if the AS instance isn't running. I wonder if I'm looking at the solution in the wrong place. The issue might not be in our architecture of having a start call on a resource component. It might be in the implementation of the AS plugin itself.

If the AS resource component can't connect to a resource when the component is started, is that necessarily a failure?

We currently have code that will attempt to start a stopped resource component each time it's used if it's not already running. That same sort of logic might have a place in the AS plugin. Perhaps we treat the inability to connect to an AS instance as not a failure. Instead, if we are not connected when we go to use the JMX connection, we once again attempt to load it. At that point, if we still can't, we can't perform the requested operation (metrics collection, operation, etc.) and THAT is where the failure lies.

At the same time, we still have a means for showing that connection is not made in the availability check. So even if the start succeeds, the availability indicator of the resource will still show there is an issue.

As much as I thought this was a new idea, it's largely covered in the ResourceComponent documentation:

* <p>This method typically examines the plugin configuration from {@link ResourceContext#getPluginConfiguration()}
* and uses that information to connect to the managed resource. If this method finds that the plugin configuration
* is invalid which causes the connection to the managed resource to fail, then this method should throw
* {@link InvalidPluginConfigurationException}. This exception typically should not be thrown if the connection
* failed for some reason other than an invalid plugin configuration (e.g. in the case the managed resource is
* simply not running now) because usually those conditions can be tracked as part of the resource's
* {@link #getAvailability() availability} data.</p>

The current incarnation of the start method (redundantly) throws InvalidPluginConfigurationException and Exception. I don't believe the infrastructure properly reacts to the former and I'd like to do some more research into that to make sure we still have the information necessary to inform the user the plugin configuration needs to be updated.

Comment 3 Jason Dobies 2008-08-14 20:04:24 UTC

There are a number of issues in play in the current code. Depending on the solution, some of these may disappear, but it's worth noting where we stand now.

InventoryManager.activateResource
------------------------------

1.) When we try to start the resource, there is code to check to see if the activation was specifically due to a plugin configuration change. If it is, we rewrap the exception as an InvalidPluginConfigurationException. There are two potential issues with this:
- The rewrap is often unnecessary; if the plugin is correctly honoring the ResourceComponent contract, it will have already thrown an InvalidPluginConfigurationException as it is.
- We swallow *any* exception from the call to ResourceComponent.start and wrap it as an InvalidPluginConfiguration exception. This prevents us from being able to differentiate between a failure and an invalid config.

As a result of the second point, if the plugin configuration change was successful (but a connection cannot be established), we have no way of knowing that and will mark the change request as failed.

2.) I may be missing something, but if a plugin configuration update fails, we add a new ResourceGotActivatedListener. That listener is notified when a resource is successfully activated. In the listener, the server is sent a message to clear the resource configuration error (DiscoveryServerService.clearResourceConfigError).

The problem as I see it is that a new instance of this listener is added to the listener list each time a plugin configuration fails. The instance itself isn't scoped to a particular resource. So if we fail to update a plugin configuration multiple times, or even for multiple resources before the correct configuration is entered, there will be two results:
- The first time a resource is successfully activated, there will be multiple calls to the server to clear its resource configuration error.
- Each of these instances removes itself from the listener list after it's been triggered the first time. If there were multiple resources that had invalid plugin configurations, when they are resolved the listeners won't be present to make the call to clear the configuration error.

InventoryManager.updatePluginConfiguration
------------------------------

1.) There is a TODO in place for this. After activating the resource whose plugin configuration has just changed, we don't attempt to activate its children. I suspect in the rest of the discovery chaos we accidentally activate them, otherwise we'd have run into this already. Regardless, we should better understand if we need to explicitly add this or if they will be reliably be started elsewhere.

2.) Similarly, we don't deactivate children, just the resource being updated. We should put some thought into whether or not this would be a good idea.

ConfigurationManagerBean.executePluginConfigurationUpdate
------------------------------

1.) The call is made to DiscoveryAgentService.updatePluginConfiguration, which is fielded by the InventoryManager to actually do the update. The agent service method can throw one of two exceptions: InvalidPluginConfigurationException and PluginContainerException.

In the ConfigurationManagerBean, however, we simply catch Exception. Thus even when the InventoryManager is fixed to not mask the indication that a plugin configuration update was invalid, the server won't treat it differently anyway.

JBossASServerComponent.start
------------------------------
1.) This should not throw an exception if we cannot connect to the AS resource it if it not running, as per the documentation in ResourceComponent.start

Comment 4 Jason Dobies 2008-08-14 20:06:31 UTC

Given that the proposed split of start and connect would affect more than just plugin configuration updates, the following summarizes the usage of the InventoryManager.activateResource call.

AvailabilityExecutor
- Checking the availability of a resource that's not currently started

InventoryManager
- Starting a resource component after it is loaded from disk
- Starting the resource component of a manually inventoried resource
- "Merging resource from discovery" when in the embedded console
- In refreshResourceComponentState - ???  (I'm not fully sure what this does, but once again there is a flag in this method that does different handling if a plugin configuration was updated).
- After a plugin configuration is updated

Comment 5 Joseph Marques 2008-08-15 15:59:56 UTC

This reply only scratches the surface of all the various issues you addressed:

there doesn't need to be a state distinction between started/stopped, connected/disconnected. to me, this is all just "state", and as such a single state diagram could comprise the entire thing.

* started->connected,disconnected // so, a resource is started by the container, and then depending on whether the connect() succeeds can either move to connected or disconnected
* disconnected<-->connected // the resource will move back and forth between these two states depending on whether the plugin configuration entries allow it to connect or not
* disconnected->stopped // if a resource wants to be stopped, it must first be in the disconnected state, so disconnect() might first have to be called before stop()

then, once you have this single, unified state model, plugin configuration updates don't have to worry about putting the resource in an inconsistent stopped/connected state tuple. instead, plugin config updates look quite simple:

* if resource is connected, disconnect()
* update plugin config
* connect()

if connect throws plugin configuration exception, then the plugin container assumes that the entries are not valid to be able to connect to the resource. the resource would remain in the disconnected state, but the new plugin configuration is kept locally on that agent -- because there's no reason to NOT update the plugin config, it's just data (that in this case has special meaning for resource connectivity)

-----

i think the interface idea is interesting. but to make it easier for the plugin developer i think we can provide reasonable defaults. for instance, if you don't override connect(), the default impl is to return true, meaning that the container will always succeed in putting your resource into the connected state.

-----

i think container services need to be directly linked to the state of the resource. essentially, i only care about availability reports, measurement reports, events, etc when my resource is in the connected state. if it's not in the connected state, then it is a feature of the plugin container to always return an "unknown" availability state for that resource...but this might not be something we want to handle.

basically, do we really want to differentiate between: 1) i can't connect to the resource, so i have no idea what state it's in, and 2) i can connect to the resource, but it's in a downed state. we have to consider how often case #2 can really occur. i'd think it would be relatively rare. more often, if the resource is down, then the plugin won't be able to connect to it either. so the inability for the plugin to reach the connected state is the same thing as red availability.

perhaps, from a plugin developer's perspective, this should be the container-provided, default implementation for the availability report. but users are free to override the availability facet and provide more complicated mechanism (essentially, one of the rare scenarios that case #2 embodies).

-----

so, similar to availability, once we have a solid notion of state (connected/disconnected), we can come up with what a container-level service failure is. basically, i feel that we should ONLY be starting/running container services if the resource is in the connected state. if it's not connected, then everything else is suppressed. no data is polled for, no listeners are registered, no files are parsed, etc etc. this way, there is a clean separation / abstraction. services that logically should depend on the ability to talk to the resource don't themselves execute unless that line of communication is open. if that line goes down, it should be the responsibility of the container to shut down those services until some time in the future when that connection is re-established.

Comment 6 Jason Dobies 2008-08-15 16:18:26 UTC

@ "then, once you have this single, unified state model, plugin configuration updates don't have to worry about putting the resource in an inconsistent stopped/connected state tuple. instead, plugin config updates look quite simple"

I agree that if we were to do the start/connect split, it'd make sense to do it in a unified state model. Look at my original comment under Other Tasks. The issue is that it'd be a pretty significant change to our domain model. I believe the server uses the Resource.connected flag, so it couldn't be outright removed. Nor does it make sense to expose the ResourceContainer's state to the server. So at best, we could add the unified state model and maintain the Resource.connected flag. It's just ugly.

This all still falls victim to the much bigger question of whether or not it even makes sense to split them. I'd like to see some more thought into what that would get us. Or put another way, how having the two would affect plugin development.

@ "if connect throws plugin configuration exception ... but the new plugin configuration is kept locally on that agent -- because there's no reason to NOT update the plugin config"

There is a reason to not update it, it's invalid. If the plugin rejects a configuration as invalid, there's no reason to keep it around in the agent; the plugin indicated that it's not going to use it.

@ "for instance, if you don't override connect()"

This could be an enhancement later once we clean up InventoryManager and better understand how it should work, but for now I'm fine with letting the plugin writer stub it out themselves if necessary.

@ "this should be the container-provided, default implementation for the availability report. but users are free to override the availability facet"

FYI, availability is not a facet. It's one of the three method in ResourceComponent. I don't think we give them much at all by providing simple stubs, and it we just couldn't achieve the plugin flexibility we have now without using interfaces.

In fact, I'd argue we go in the other direction. If we don't run with connect, we use the getAvailability call to provide the connected logic for us. They aren't necessarily the same thing, but again, I'm not sure what we gain from explicit "is this connected" logic.

@ "basically, do we really want to differentiate between: 1) i can't connect to the resource, so i have no idea what state it's in, and 2) i can connect to the resource, but it's in a downed state"

I think we do want this differentiation. If I update the plugin connection properties, I want to know if I specified a bad value. Even if the resource is down, that didn't stop me from updating the configuration. The issue is, I'm not sure how possible it will be in most cases to determine that a given set of properties is valid, however the resource isn't running.

@ "i think container services need to be directly linked to the state of the resource." and "if that line goes down, it should be the responsibility of the container to shut down those services until some time in the future when that connection is re-established."

Agreed, and they are now to a point. The problem is, some do lazy connecting if the connection failed, so it's tricky to see what is going on.

Comment 7 Joseph Marques 2008-08-15 16:42:21 UTC

Thanks for the feedback.  Just one thing...

@@ "if connect throws plugin configuration exception ... but the new plugin configuration is kept locally on that agent -- because there's no reason to NOT update the plugin config"

@ "There is a reason to not update it, it's invalid. If the plugin rejects a configuration as invalid, there's no reason to keep it around in the agent; the plugin indicated that it's not going to use it. "

Well, how do we know for sure it's NOT a good configuration?  What if the managed resource was down (not started), but the user wanted to update the plugin configuration anyway.

In my opinion, the server-side initiated update of a plugin config should always succeed.  Things should not fail because an indirect hook or processing side-effect failed.  A plugin config update though should fail if we literally couldn't update the plugin config (problems communicating with the agent, there was a timeout because we sat waiting for some lock too long [indicating we have a concurrent bug somewhere], component was not in the started sate [indicating we might have a life-cycle bug in the plugin container logic], etc).

This way, if we update the plugin config, and always keep that last updated record on the agent, if the managed resource is ever restarted the plugin will naturally reconnect to it during it's next availability poll.  And, from an aggregate standpoint, if i'm updating the plugin config for 100 JBAS servers, I *want* that update to succeed.  Each agent should get the new plugin config, even if they can't connect.  Then I go to the aggregate control / operations section and execute a start operation against the same group.  Because each agent has that most recent plugin config I pushed to them, they should all startup because they have the correct data.  If the agent didn't keep this update, it might cache an older / stale plugin config, which would cause the start operation to fail later.

Comment 8 Jason Dobies 2008-08-15 16:54:37 UTC

Ahh, ok. This is a fundamental question.

What was called "plugin configuration exception" is actually two different exceptions thrown from start. To me, if the plugin throws InvalidPluginConfigurationException, it means we know it's not a good configuration. Perhaps it failed some plugin-only validation (as compared to the static validation, i.e. is port an integer, which we can do on the server). For instance, if the resource type's plugin configuration takes a directory on the file system, it might be considered an invalid configuration if that directory doesn't exist.

What this gets us is the ability to show the user in the UI that property X was deemed invalid by the plugin and prompt them to enter a new value. However, this isn't exactly working properly due to a few of the various bugs I listed earlier.

There may be a use case for being able to say "update the configuration anyway, validation be damned." But otherwise, I treat that exception as the plugin saying the configuration values will not be enforced, and thus the agent shouldn't care to hold on to them and the server UI should reflect that.

The big issue that led me here is the case of us knowing if a plugin configuration was set (i.e. passed plugin-side validation) however we still can't connect to the resource. For example, if we specified bad login credentials for connecting to JMX.

That's a tricky use case. If we attempt to connect and cannot with those credentials, we could reject the plugin configuration update as invalid. But if the AS instance isn't running, we can't actually test those credentials. In the latter, is that still a failed plugin configuration update? What happens when the AS instance comes online and the credentials are bad?

Comment 9 Joseph Marques 2008-08-15 17:26:47 UTC

@ "What was called "plugin configuration exception" is actually two different exceptions thrown from start. To me, if the plugin throws InvalidPluginConfigurationException, it means we know it's not a good configuration. Perhaps it failed some plugin-only validation (as compared to the static validation, i.e. is port an integer, which we can do on the server). For instance, if the resource type's plugin configuration takes a directory on the file system, it might be considered an invalid configuration if that directory doesn't exist."

Agreed. If there is plugin-only validation, we could fail-fast as we know this config never has a chance of being correct.

@ "What this gets us is the ability to show the user in the UI that property X was deemed invalid by the plugin and prompt them to enter a new value. However, this isn't exactly working properly due to a few of the various bugs I listed earlier."

True, which would work for individual config updates, but maybe not for aggregate plugin config updates...or maybe it would. An aggregate update, in the end, is just a bunch of clones of some config update, so we should be able to validate the aggegrate update ahead of time. Basically, we could have the aggregate pushed down to any (it doesn't matter which) agent. If validation fails, don't submit the job and don't create the individual plugin config update children (because presumably if the aggregate failed validation, so too will the children plugin config updates).

@ "There may be a use case for being able to say "update the configuration anyway, validation be damned." But otherwise, I treat that exception as the plugin saying the configuration values will not be enforced, and thus the agent shouldn't care to hold on to them and the server UI should reflect that."

I'd say that validation failures are the primary (only?) ones we should be concerned with. If validation passes, then the config may or may not be correct. (BTW, if we want to be able to do the aggregate validation ahead of time, I'd say that we should separate the validation method from the update method on the agent side.)

@ "The big issue that led me here is the case of us knowing if a plugin configuration was set (i.e. passed plugin-side validation) however we still can't connect to the resource. For example, if we specified bad login credentials for connecting to JMX."

@ "That's a tricky use case. If we attempt to connect and cannot with those credentials, we could reject the plugin configuration update as invalid. But if the AS instance isn't running, we can't actually test those credentials. In the latter, is that still a failed plugin configuration update? What happens when the AS instance comes online and the credentials are bad?"

Agreed, unless we draw a very clear distinction between errors that occur when we can't find the running process versus errors that occur when we can ping the process but don't get back the data we're expecting, this is going to be tough to track down. Perhaps we could use the events subsystem to notify of these things? This way, the user will see that the resource is red in the web console, and they can go to the events tab to see why it can't connect. The event details will either be that the instance could not be found (case when the server isn't started), or that the instance could not be connected up (case when the server was up, but creds are bad).

Comment 10 Jason Dobies 2008-08-15 17:33:26 UTC

@ "I'd say that we should separate the validation method from the update method on the agent side"

I was thinking the same thing based on your first comment in that last reply. That'd also save us the unnecessary stop/disconnect if we know the plugin config won't work anyway.

I want to do some looking at other plugins a bit before I continue. I think I've gotten too into the mentality of the AS plugin and its EMS connection. I wonder how much that carries over into other domains. We may be debating adding a connect method that will be unused in 90% of the cases.

One other comment at this time, I'm currently leaning towards not adding connect and fixing the AS usage of start. I don't really view not being able to connect as a reason for start to fail. Availability might be usable for that, although isn't exactly the same thing (there could be two interpretations... the resource is down or we simply can't get to it to check). But again, I wonder if looking at other plugins will make the generic case a little easier to see.

Comment 11 Jason Dobies 2008-08-26 19:50:57 UTC

This was punted on for RHQ 1.1. At this time, there doesn't seem to be a compelling use case for changing the semantics of start. To work around the JBoss AS issue that prompted this, the AS plugin was changed to correctly honor the ResourceComponent API as it already stood.

Comment 12 Red Hat Bugzilla 2009-11-10 21:15:42 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-729