Bug 736685

Summary: cannot uninventory resource that has condition log not associated with an alert
Product: [Other] RHQ Project Reporter: John Mazzitelli <mazz>
Component: Alerts Assignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1 CC: hrupp, ian.springer, jshaughn, skondkar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.2 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-07 19:27:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 625146, 734807    

Description John Mazzitelli 2011-09-08 12:24:56 UTC
If you have a resource with an alert definition whose conditions use the AND (aka ALL) conjunction (for example "if metric > 40 AND if metric < 60"), and one of those conditions is true but the other is not (that is, an alert hasn't been triggered yet because the full condition isn't true yet), you cannot uninventory that resource.

The reason is that the alert_id column is null in the rhq_alert_condition_log table, but our bulk delete assumes a non-null alert.
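To see how a condition log ends up with a null alert_id, here is a hypothetical, simplified Java model of ALL-conjunction evaluation. The class and method names are made up for illustration and are not RHQ's alerts engine API. The assumption is that a log row is written as soon as its own condition is true, and is only linked to an alert once every condition in the definition is satisfied.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of ALL-conjunction evaluation.
// Names are illustrative only; this is not RHQ's alerts engine API.
final class AllConjunctionSketch {

    /** A condition log row; alertId stays null until an alert actually fires. */
    static final class ConditionLog {
        final int conditionId;
        Integer alertId; // plays the role of the nullable alert_id column

        ConditionLog(int conditionId) {
            this.conditionId = conditionId;
        }
    }

    private final Set<Integer> conditionIds;
    private final List<ConditionLog> logs = new ArrayList<>();
    private int nextAlertId = 1;

    AllConjunctionSketch(Set<Integer> conditionIds) {
        this.conditionIds = new LinkedHashSet<>(conditionIds);
    }

    /**
     * Records that one condition evaluated true. Returns the new alert ID if
     * every condition now has a pending log, otherwise null (no alert yet).
     */
    Integer conditionTrue(int conditionId) {
        logs.add(new ConditionLog(conditionId));

        // Which conditions have a log that is not yet tied to an alert?
        Set<Integer> pending = new LinkedHashSet<>();
        for (ConditionLog log : logs) {
            if (log.alertId == null) {
                pending.add(log.conditionId);
            }
        }
        if (!pending.containsAll(conditionIds)) {
            return null; // partial match: the new row keeps alertId == null
        }

        // ALL conditions satisfied: fire and attach every pending log to the alert.
        int alertId = nextAlertId++;
        for (ConditionLog log : logs) {
            if (log.alertId == null) {
                log.alertId = alertId;
            }
        }
        return alertId;
    }

    List<ConditionLog> logs() {
        return logs;
    }
}
```

Calling conditionTrue(1) on a definition with conditions {1, 2} returns null and leaves one log whose alertId is null, the same state found in the database. A later conditionTrue(2) fires alert 1 and links both logs to it.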

See the AlertConditionLog named query and the bulk uninventory code that uses it:

    @NamedQuery(name = AlertConditionLog.QUERY_DELETE_BY_RESOURCES, //
    query = "DELETE AlertConditionLog acl " //
        + "   WHERE acl.alert.id IN ( SELECT alert.id " //
        + "                             FROM AlertDefinition ad " //
        + "                             JOIN ad.alerts alert " //
        + "                            WHERE ad.resource.id IN ( :resourceIds ) )"),

Notice that in the case in question acl.alert.id is null, so it will never match the set returned by the subquery. The subquery only ever returns alert IDs, and in SQL "NULL IN (...)" never evaluates to true. So if a condition log is not yet associated with any alert, this DELETE won't remove it.

Leaving a condition log in the DB prevents the resource itself from being removed because of foreign key constraints.
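The NULL behavior of the original WHERE clause can be modeled in plain Java. This is an illustrative sketch, not RHQ code: it represents SQL's UNKNOWN truth value as a null Boolean and assumes the subquery returns only non-null alert IDs (they are primary keys). A bulk DELETE removes only rows whose predicate is TRUE, so a row with a null alert_id always survives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative model of the original bulk delete's WHERE clause; not RHQ code.
final class OldDeleteSketch {

    /** One rhq_alert_condition_log row; alertId is null until an alert fires. */
    static final class ConditionLogRow {
        final int id;
        final Integer alertId;

        ConditionLogRow(int id, Integer alertId) {
            this.id = id;
            this.alertId = alertId;
        }
    }

    /**
     * SQL-style "value IN (values)" for a set of non-null IDs.
     * Returns TRUE or FALSE, or null to stand for SQL's UNKNOWN.
     */
    static Boolean sqlIn(Integer value, Set<Integer> values) {
        if (value == null) {
            // NULL IN (empty) is FALSE; NULL IN (non-empty) is UNKNOWN.
            return values.isEmpty() ? Boolean.FALSE : null;
        }
        return values.contains(value);
    }

    /** DELETE removes only rows whose WHERE clause is TRUE; FALSE and UNKNOWN rows survive. */
    static List<ConditionLogRow> deleteWhereAlertIn(List<ConditionLogRow> rows,
                                                    Set<Integer> alertIdsOfResources) {
        List<ConditionLogRow> survivors = new ArrayList<>();
        for (ConditionLogRow row : rows) {
            if (!Boolean.TRUE.equals(sqlIn(row.alertId, alertIdsOfResources))) {
                survivors.add(row);
            }
        }
        return survivors;
    }
}
```

With one row linked to alert 7 and one row whose alertId is null, deleting where the alert is in {7} leaves the null-alert row behind, which is the leftover row that later trips the foreign key check.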

Comment 1 John Mazzitelli 2011-09-08 12:32:04 UTC
How to replicate:

You need to have an alert definition with more than one condition but using the ALL conjunction ("fire alert when ALL conditions are true").

For example, on a platform resource, create a condition you know will be true (e.g. for the platform resource, use something like "if free memory > 1").

Then create one or more additional conditions that you know will not be true (e.g. for the platform resource, "if operation 'Manual discovery' executes with FAILURE"). Just don't execute that operation :)

Wait for the free memory metric to be collected and sent up by the agent. At that point, you should be able to look in the rhq_alert_condition_log table and see a condition log row that has null for alert_id (this condition was true, but no alert has fired yet because the other conditions are not).

Now try to uninventory the platform resource, and expect to see a foreign key constraint violation once the async delete job kicks off and tries to remove the platform resource from the DB.

Comment 2 Jay Shaughnessy 2011-09-08 16:31:45 UTC
Now deletes condition logs via the alert def, not via alerts, because a
condition log may not yet be associated with an alert. This also avoids
joining with the potentially large alert table.
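The fix described above, matching condition logs by the alert definition they belong to, can be sketched as follows. This is a hedged illustration rather than the committed change: the JPQL string assumes AlertConditionLog has a condition association and AlertCondition has an alertDefinition association, and the in-memory method models that predicate with a hypothetical condition-to-resource map. Because the alert is never consulted, a null alert_id no longer matters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of deleting condition logs through their alert definitions rather than
// through alerts. The entity paths in the JPQL string (acl.condition,
// ac.alertDefinition) are assumptions, not the committed RHQ query.
final class DeleteViaAlertDefSketch {

    static final String HYPOTHETICAL_JPQL =
        "DELETE AlertConditionLog acl "
            + "   WHERE acl.condition.id IN ( SELECT ac.id "
            + "                                 FROM AlertCondition ac "
            + "                                WHERE ac.alertDefinition.resource.id IN ( :resourceIds ) )";

    /** One rhq_alert_condition_log row; alertId may be null and is never consulted. */
    static final class ConditionLogRow {
        final int id;
        final int conditionId;
        final Integer alertId;

        ConditionLogRow(int id, int conditionId, Integer alertId) {
            this.id = id;
            this.conditionId = conditionId;
            this.alertId = alertId;
        }
    }

    /**
     * In-memory equivalent of the query above. conditionToResource stands in for
     * the condition -> alert definition -> resource path (hypothetical helper).
     */
    static List<ConditionLogRow> deleteByResources(List<ConditionLogRow> rows,
                                                   Map<Integer, Integer> conditionToResource,
                                                   Set<Integer> resourceIds) {
        List<ConditionLogRow> survivors = new ArrayList<>();
        for (ConditionLogRow row : rows) {
            Integer resourceId = conditionToResource.get(row.conditionId);
            boolean matches = resourceId != null && resourceIds.contains(resourceId);
            if (!matches) {
                survivors.add(row);
            }
        }
        return survivors;
    }
}
```

With logs for condition 100 (on resource 5) and condition 200 (on resource 6), deleting by resource 5 removes both condition-100 logs, including the one with a null alertId, and keeps the condition-200 log.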

Comment 3 Heiko W. Rupp 2011-09-28 16:25:50 UTC
Jay, did you already fix this? If so, what version?

Comment 4 Jay Shaughnessy 2011-09-28 17:22:38 UTC
Yes, already fixed. Must have forgotten to update the BZ:

commit bbad56e4cc4bd0e2fae2d7a28e22ed6724162315
Author: Jay Shaughnessy <jshaughn>
Date:   Thu Sep 8 12:09:19 2011 -0400

Comment 5 Sunil Kondkar 2011-09-30 09:50:50 UTC
Verified on build#449 (Version: 4.1.0-SNAPSHOT Build Number: 4d56f0b)

Created an alert definition with the conditions 'if free memory > 1' and 'Manual discovery executes with FAILURE' using the ALL conjunction. The rhq_alert_condition_log table has a condition log row with null for alert_id:

rhq4=# select * from rhq_alert_condition_log;
  id   |     ctime     | alert_id | condition_id |    value    
-------+---------------+----------+--------------+-------------
 10001 | 1317375258767 |          |        10001 | 8.4496384E7
(1 row)

Navigated to 'Inventory menu->Platforms', selected the platform and clicked on 'Uninventory' button. The platform is uninventoried successfully.

The server log shows no foreign key constraint violation; it displays:

2011-09-30 15:11:15,247 INFO  [org.rhq.enterprise.server.scheduler.jobs.AsyncResourceDeleteJob] Async resource deletion - 833 successful, 0 failed, took [44669] ms

Marking as verified.

Comment 6 Mike Foley 2011-10-17 19:54:07 UTC
*** Bug 736452 has been marked as a duplicate of this bug. ***

Comment 7 Mike Foley 2012-02-07 19:27:03 UTC
changing status of VERIFIED BZs for JON 2.4.2 and JON 3.0 to CLOSED/CURRENTRELEASE