Bug 736494 - Generate synthetic agent deletes in cumin based on heartbeats and agent list
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: cumin
Version: Development
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: grid-maint-list
QA Contact: MRG Quality Engineering
Depends On: 702440
Reported: 2011-09-07 20:21 UTC by Trevor McKay
Modified: 2016-05-26 20:14 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 702440
Last Closed: 2016-05-26 20:14:30 UTC


Description Trevor McKay 2011-09-07 20:21:36 UTC
Missed agent deletes can still cause stale data to accumulate.

During dynamic configuration of a pool via wallaby, we saw a case in which it appears that an agent delete message was never delivered to cumin from the qmf console object.  Not creating a BZ against QMF at this point because I don't have a reproducer.

However, cumin could benefit from a backup defense against missed delete messages.
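
A minimal sketch of such a backup defense, assuming cumin can obtain both the set of agent ids it currently tracks and the set the console currently reports. The function names and the `on_delete` callback are hypothetical and are not taken from the cumin source.

```python
def find_missed_deletes(known_agent_ids, live_agent_ids):
    """Return ids that are still tracked but no longer reported as live."""
    return sorted(set(known_agent_ids) - set(live_agent_ids))


def apply_synthetic_deletes(known_agent_ids, live_agent_ids, on_delete):
    """Invoke on_delete(agent_id) once for each agent whose delete was missed.

    on_delete stands in for whatever handler already processes real
    agent delete messages (an assumption for this sketch).
    """
    missed = find_missed_deletes(known_agent_ids, live_agent_ids)
    for agent_id in missed:
        on_delete(agent_id)
    return missed
```

Routing the synthetic deletes through the same handler as real delete messages would keep the cleanup logic in one place.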

+++ This bug was initially created as a clone of Bug #702440 +++

Description of problem:

This issue is closely related to several that have been addressed in the past but it is slightly different.

It is possible for stale data to collect in the cumin database that will never be deleted even if all agents are using stable ids.  This can happen whenever cumin sees a population of agents, cumin is shut down, and cumin is restarted in an environment where the agent population is not a superset of the population at shutdown time (because cumin is started pointing at a different broker, or configuration changes were made, or machines/agents went down unexpectedly, etc).

Cumin needs a mechanism to discover in its data references to agents which do not exist as of its start time.  This can be done in a background thread with a (configurable) timeout and run interval to garbage collect old agents and their objects.
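
One way the background thread could look, as a sketch only. The class name, the `collect` callback, and both timing parameters are assumptions for illustration; the actual garbage collection work is left to the callback.

```python
import threading


class StaleAgentCollector(threading.Thread):
    """Run collect() after an initial delay, then once per interval."""

    def __init__(self, collect, initial_delay, run_interval):
        super().__init__(daemon=True)
        self.collect = collect              # hypothetical cleanup callback
        self.initial_delay = initial_delay  # seconds to wait after startup
        self.run_interval = run_interval    # seconds between passes
        self.stop_requested = threading.Event()

    def run(self):
        # Event.wait() returns True as soon as stop() has been called,
        # so the thread exits promptly instead of sleeping out the delay.
        if self.stop_requested.wait(self.initial_delay):
            return
        while True:
            self.collect()
            if self.stop_requested.wait(self.run_interval):
                return

    def stop(self):
        self.stop_requested.set()
```

The initial delay gives agents time to show up after cumin starts, so they are not collected before their first update arrives.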

How reproducible:


Steps to Reproduce:
1.  Start cumin pointed at a broker, let it run for a while.
2.  Start one or more sesame agents pointed at the same broker.
3.  Shut cumin down.
4.  Restart cumin pointed at a different broker, or restart cumin pointed at the same broker after shutting down some of the sesame agents.
5.  The inventory tab will show systems that do not exist in the current environment.

Actual results:

No matter how long you wait, those systems never go away.

Expected results:

Cumin should detect stale entities.

Additional info:

If agents are restarted with the same ids as the original and then removed while cumin is running, the systems will disappear from the display.

--- Additional comment from tmckay@redhat.com on 2011-05-05 12:53:25 EDT ---

BZ595774 is related to this, and contains a bunch of links to other related BZs.

--- Additional comment from tmckay@redhat.com on 2011-05-05 13:00:27 EDT ---

Possible solution outline:

Add a table in the database which tracks agents and the last time a heartbeat was heard from each agent.  Periodically scan the table and delete agents from which no heartbeat has been received in the last N seconds (the first run of the thread needs to be offset from cumin start time to give agents a chance to "show up").  Also delete any objects associated with each such agent, as we do on agent delete or agent creation.
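
A rough sketch of the heartbeat table and the periodic scan. It uses sqlite3 only so that the example is self-contained; the table names, column names, and function names are all hypothetical and do not reflect the cumin schema.

```python
import sqlite3


def create_tables(conn):
    # Hypothetical schema: one heartbeat row per agent, plus the objects it owns.
    conn.execute(
        "CREATE TABLE agent_heartbeat ("
        "agent_id TEXT PRIMARY KEY, last_seen REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE agent_object ("
        "id INTEGER PRIMARY KEY, agent_id TEXT NOT NULL)"
    )


def record_heartbeat(conn, agent_id, now):
    """Insert or refresh the last-heard time for an agent."""
    conn.execute(
        "INSERT OR REPLACE INTO agent_heartbeat (agent_id, last_seen) "
        "VALUES (?, ?)",
        (agent_id, now),
    )


def reap_stale_agents(conn, now, max_age):
    """Delete agents not heard from within max_age seconds, and their objects."""
    cutoff = now - max_age
    rows = conn.execute(
        "SELECT agent_id FROM agent_heartbeat "
        "WHERE last_seen < ? ORDER BY agent_id",
        (cutoff,),
    ).fetchall()
    stale = [row[0] for row in rows]
    for agent_id in stale:
        conn.execute("DELETE FROM agent_object WHERE agent_id = ?", (agent_id,))
        conn.execute("DELETE FROM agent_heartbeat WHERE agent_id = ?", (agent_id,))
    return stale
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes the N-second cutoff easy to exercise in a test.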

--- Additional comment from tmckay@redhat.com on 2011-06-10 13:19:15 EDT ---

We already delete all objects associated with an agent when we first see that agent after cumin starts up (restricted by bound classes).  This covers objects that we will see again, as well as objects that we would not have seen, associated with that agent.

We also delete objects when a broker tells us an agent went away.

The only group left is objects associated with agents that we will never see during a given session (if they show up late, we will delete their objects, noted above).  This is in fact the group that we are targeting in this BZ.

The union of these groups (objects from agents we will see again and objects from agents we will never see) is simply all objects of bound classes for a particular cumin-data instance.  So doesn't handling phantom data just reduce to deleting all objects of all bound classes when cumin-data starts?

I think yes.  Will try.

(sample data is not deleted except by the expiration thread, when it is 24 hours old)
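
To illustrate the idea, a toy model in which the database is a dict mapping table name to a list of rows. The function name and the table names in the usage note are made up for this sketch; real code would issue deletes against the database instead.

```python
def purge_on_startup(tables, bound_classes):
    """Empty the object table of every bound class; leave other tables alone.

    tables is a dict of table name -> list of rows (a stand-in for the
    real database).  Returns the number of rows removed per class.
    """
    purged = {}
    for cls in bound_classes:
        rows = tables.get(cls)
        if rows is None:
            continue  # this class has no table in the toy database
        purged[cls] = len(rows)
        rows.clear()
    return purged
```

For example, with tables named `Slot`, `Slot_samples`, and `users` and a bound class list of `["Slot"]`, only the `Slot` rows are removed. The sample and user tables are untouched because they are never named in the bound class list.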

--- Additional comment from tmckay@redhat.com on 2011-06-10 13:22:53 EDT ---

One additional note: we do not delete the Collector object when we see its agent created or deleted.  I don't understand why not; maybe there is some historical reason.  Especially with collector filtering turned off, I can't see that this is too much of a problem.

--- Additional comment from tmckay@redhat.com on 2011-06-13 10:35:42 EDT ---

Fixed in revision 4808.

All non-sample data associated with bound classes is deleted by cumin-data instances on startup.  User is presented with a friendly banner noting the absence of the collector until the collector object is seen.

There is no longer any need to delete data when an agent create is seen.  Objects are still deleted when an agent delete is seen.

User data is also preserved.

--- Additional comment from jsarenik@redhat.com on 2011-07-21 06:51:43 EDT ---

Verified with cumin-0.1.4878-1.el5

--- Additional comment from tmckay@redhat.com on 2011-07-25 11:46:04 EDT ---

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    New Contents:
    Cumin did not have a way to recognize inactive agents in its database.

    Stopping agents while cumin was shut down or configuring cumin to point to a different broker could cause objects from missing agents to display in the UI when cumin was restarted.  These stale objects would never be deleted.

    All dynamic data in cumin is deleted when cumin starts.  Agents and objects are rediscovered as cumin runs.

    Cumin will only display data from active agents.  Overall performance is not discernibly affected by deletion of dynamic data on startup.

--- Additional comment from errata-xmlrpc@redhat.com on 2011-09-07 12:43:39 EDT ---

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Comment 2 Anne-Louise Tangring 2016-05-26 20:14:30 UTC
MRG-Grid is in maintenance and only customer escalations will be considered. This issue can be reopened if a customer escalation associated with it occurs.
