Red Hat Bugzilla – Bug 854805
Tracker: UI visual feedback on interactions of distributed components
Last modified: 2015-02-01 18:29:04 EST
RHQ/JON is a distributed system with many decoupled parts that interact asynchronously. This leads to a natural, and unavoidable, latency between a user triggering an action and the result of that action, or even its progress, being returned to the user. The root of this issue is that latency combined with a lack of visual cues to let the user know when things should or will happen. Our customers tend to expect things to happen in real time, at least in the default case where we don't show them anything different. We need to address the cases where the architecture doesn't allow a real-time response and make sure we correctly set users' expectations around the progress or result of an action.
The child bugs blocking this tracker should capture individual examples of this issue and how we can go about addressing them.
Changed the title a little to remove the notion that these issues are only related to latency. https://bugzilla.redhat.com/show_bug.cgi?id=743727 is a good example of an issue where better UI feedback is needed to help the user understand what they need to fix in their system.
While not directly related to this, another important aspect of the "feedback loop" to the UI is the fact that the user has no clue about the load the RHQ server and individual agents are under and thus cannot anticipate the latency of the individual operations (which are the "things" this BZ is actually about).
The tracker bug 855744 deals with that problem and should include the following areas:
1) Data purge job duration - every hour the RHQ server is doing data compaction and purge for measurement tables, etc.
Having an indication of how busy this job is (i.e. the percentage of the hour the job uses before the next one kicks in) would be a great indicator for the user of how well the RHQ server is able to keep up with the inflow of data the agents are generating.
2) A number of subsystems in the RHQ agent run different jobs on a schedule. As with the above, the user should be given an indication of how "saturated" these schedules are and thus how the agent is keeping up with the work laid upon it (discovery, availability, measurement, configuration, event, content subsystems - all of them run different kinds of schedules).
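To make the "saturation" idea in 1) and 2) concrete, here is a minimal sketch of how such an indicator could be computed: the time a job run took, expressed as a percentage of the interval between runs. The names (JobRun, saturation_percent, describe) and the 75% warning threshold are assumptions made for illustration; none of them are existing RHQ code.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One completed run of a scheduled job (hypothetical type, not an RHQ class)."""
    name: str
    duration_seconds: float   # how long the run took
    interval_seconds: float   # how often the job is scheduled to start


def saturation_percent(run: JobRun) -> float:
    """Return the share of the schedule interval consumed by the run, in percent."""
    if run.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 100.0 * run.duration_seconds / run.interval_seconds


def describe(run: JobRun, warn_at: float = 75.0) -> str:
    """Turn the raw percentage into a short status line a UI could display.

    warn_at is an arbitrary illustrative threshold, not a recommended value.
    """
    pct = saturation_percent(run)
    if pct >= 100.0:
        status = "falling behind"
    elif pct >= warn_at:
        status = "near capacity"
    else:
        status = "keeping up"
    return f"{run.name}: {pct:.0f}% of interval used ({status})"
```

For example, an hourly purge job that takes 15 minutes would show as 25% saturated, while an availability scan that takes 70 seconds on a 60-second schedule would show as falling behind.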
The big challenge is that we currently lack a way to report errors, status updates, etc. in threads that are *not* processing UI requests. Here are some examples. When resources are imported into inventory, a Quartz job is scheduled to periodically send the updated inventory status of resources to the respective agents. This is part of the inventory sync workflow between server and agent. If any kind of error occurs, we write it out to the server log but have nowhere to report it in the UI.
Another example is installing or updating a plugin at server startup. If the installation/upgrade fails at server startup, we have nowhere in the UI to report it. It is actually possible for the plugin to appear installed without any of its metadata actually being in the database. This can and does lead to hard-to-debug situations for users.
I should point out that there are plenty of things that happen in non-UI request threads that are reported in the UI. For example, when a user schedules a resource operation, that operation is sent to the agent and executed asynchronously. When the operation completes, the agent sends back the results, which can be viewed in the operation history for that resource. We have other similar audit trails like resource configuration history, bundle deployment history, etc.
Instead of all these separate audit trails, we need a global audit trail where events can be reported. Errors that happen outside of a UI request/response can be reported there and we can provide a place (or places) in the UI where that info can be viewed. This audit trail should also not be tied to individual resources so that valuable information is not lost when resources are removed from inventory.
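As a rough illustration of the global audit trail proposed here, the sketch below shows an in-memory trail that any thread can report to, with the resource reference kept optional so an event is not tied to a resource's lifetime. Everything in it (AuditEvent, GlobalAuditTrail, the field names, the severity strings) is hypothetical and only meant to show the shape of the idea; a real version would persist events rather than hold them in a list.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AuditEvent:
    """A single audit entry (hypothetical type, not an RHQ class)."""
    source: str                        # subsystem that reported the event
    severity: str                      # e.g. "INFO" or "ERROR"
    message: str
    resource_id: Optional[int] = None  # plain value, so the event outlives the resource
    timestamp: float = field(default_factory=time.time)


class GlobalAuditTrail:
    """Thread-safe trail that background jobs and UI requests can both write to."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: List[AuditEvent] = []

    def report(self, source: str, severity: str, message: str,
               resource_id: Optional[int] = None) -> AuditEvent:
        """Record an event; safe to call from any thread."""
        event = AuditEvent(source, severity, message, resource_id)
        with self._lock:
            self._events.append(event)
        return event

    def events(self, severity: Optional[str] = None) -> List[AuditEvent]:
        """Return a snapshot of recorded events, optionally filtered by severity."""
        with self._lock:
            snapshot = list(self._events)
        if severity is None:
            return snapshot
        return [e for e in snapshot if e.severity == severity]
```

With something like this, a failure in the inventory sync job or in plugin deployment at startup could call report(...) from its own thread, and a UI view could later list events(severity="ERROR").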
As an aside, this could be a very good fit for our new metrics database.
(In reply to comment #3)
One thing that we need to keep in mind is the difference between events/health/audit data about the JON system and that same data for the customer's environment. IMHO they should be clearly separated: only JON admins should really care about the former, but all regular JON users should care about the latter.
An audit trail for the JON system itself can be logically separated from resource audit trails, but I do not see a compelling reason for physical, implementation-level separation like we have today. We have different classes and tables to represent the same type of data. All we really need is a filtering mechanism.
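A minimal sketch of what "one representation plus a filtering mechanism" could look like, assuming a single record type with a scope field that distinguishes system events from resource events. The names (Scope, AuditRecord, filter_records, visible_to) and the admin-only rule for system records are illustrative assumptions, not existing classes or an agreed policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Scope(Enum):
    SYSTEM = "system"      # events about the JON/RHQ system itself
    RESOURCE = "resource"  # events about the managed environment


@dataclass(frozen=True)
class AuditRecord:
    """One record type for every kind of audit data (hypothetical)."""
    scope: Scope
    category: str          # e.g. "operation", "configuration", "plugin"
    message: str
    resource_id: Optional[int] = None


def filter_records(records: Iterable[AuditRecord],
                   scope: Optional[Scope] = None,
                   category: Optional[str] = None,
                   resource_id: Optional[int] = None) -> List[AuditRecord]:
    """Return the records matching every criterion that is not None."""
    return [
        r for r in records
        if (scope is None or r.scope == scope)
        and (category is None or r.category == category)
        and (resource_id is None or r.resource_id == resource_id)
    ]


def visible_to(records: Iterable[AuditRecord], is_admin: bool) -> List[AuditRecord]:
    """Logical separation only: non-admins see resource records, admins see all."""
    if is_admin:
        return list(records)
    return filter_records(records, scope=Scope.RESOURCE)
```

The operation history for one resource then becomes filter_records(records, category="operation", resource_id=...), and the system-level view is the same data filtered on Scope.SYSTEM.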