Bug 761320

Summary: Drift detection scans do not run with default intervals
Product: [Other] RHQ Project
Reporter: John Sanda <jsanda>
Component: drift
Assignee: John Sanda <jsanda>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: urgent
Priority: urgent
Version: 4.2
CC: loleary
Target Milestone: ---   
Target Release: JON 3.0.1   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=767634
Doc Type: Bug Fix
Clone Of:
: 785845 (view as bug list) Environment:
Last Closed: 2013-09-03 11:17:57 EDT
Bug Blocks: 760116, 761709, 785845    

Description John Sanda 2011-12-07 22:12:42 EST
Description of problem:
The default detection interval for a drift definition is 30 minutes, and the drift detector task on the agent runs every minute. I discovered that when using these default settings, the drift detector task/job always skips doing the file system scan for the definition.

It turns out that this problem can occur any time the interval for the drift detection job is shorter than the intervals of the definitions it processes. Suppose we have a definition with the default interval of 30 minutes. The drift detector job runs on the definition for the first time at 12:00. We will say that the job takes one minute to complete to keep the math simple. This means that drift detection will run again on the definition no earlier than 12:31.

The drift detector job finished at 12:01. Since it is configured by default to run every minute, it will run again at 12:02. Since there is only the one definition in the queue, the drift detector will grab the definition from the schedule queue. Drift detection will not run on the definition, though, since it is not scheduled to run again until 12:31. Here is where the problem lies. When the definition is put back on the schedule queue, its next scan time is updated. So now drift detection is scheduled to run at 12:32. The drift detector will again at 12:03 pull the definition off the queue, see that it's not scheduled to run, update the next scan time to 12:33, and put the definition back on the queue.

The drift detector always updates the timestamp for the definition's next detection run; however, the timestamp should only be updated when drift detection actually runs. More precisely, the timestamp should only be updated when the file system scan has completed for the definition.
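The walkthrough above can be reproduced with a small simulation. This is only a sketch in Python, not the agent's actual Java code: the `Schedule` class, `run_detector`, and `count_scans` are hypothetical names, time is counted in whole minutes, and the scan itself is assumed to take no time.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """Hypothetical stand-in for a drift definition's schedule entry."""
    interval: int        # minutes between detection runs for the definition
    next_scan: int = 0   # minute at which the next file system scan is due


def run_detector(schedule, now, update_when_skipped):
    """One pass of the detector job; returns True if a scan actually ran."""
    scanned = now >= schedule.next_scan
    if scanned or update_when_skipped:
        # With update_when_skipped=True this mimics the reported bug:
        # the next scan time moves forward even though nothing was scanned.
        schedule.next_scan = now + schedule.interval
    return scanned


def count_scans(update_when_skipped, minutes=120, period=1, interval=30):
    """Run the detector every `period` minutes and count the real scans."""
    schedule = Schedule(interval=interval)
    return sum(run_detector(schedule, now, update_when_skipped)
               for now in range(0, minutes, period))


buggy_scans = count_scans(update_when_skipped=True)
fixed_scans = count_scans(update_when_skipped=False)
print(buggy_scans, fixed_scans)
```

Under these assumptions the buggy variant scans only once in two hours (the initial run), while the variant that leaves the timestamp alone on a skipped run scans every 30 minutes.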

Version-Release number of selected component (if applicable):

How reproducible:
Whenever a drift definition's detection interval is set higher than the agent's drift detector task interval, which is set via the rhq.agent.plugins.drift-detection.period-secs property in agent-configuration.xml.

Steps to Reproduce:
1. Create a drift definition with the default interval or an interval that is reasonably larger (by at least 30 seconds or so) than the agent's drift detection period
2. Immediately after the agent does the initial detection run to generate the initial snapshot, make a change that will result in some drift

Actual results:
Drift will never get reported because the time for the next scheduled drift detection will keep getting pushed out further and further.

Expected results:
The scheduled time should not get updated until the file system scan actually runs again, which would result in the drift getting detected and reported to the server.

Additional info:
There are some workarounds for this. One way to avoid the problem is to set the definitions' intervals to values lower than the agent's value for rhq.agent.plugins.drift-detection.period-secs, which is 60 seconds by default.

Alternatively, the drift detection period on the agent could be set to a value higher than the interval on any of the definitions.

A third approach could be a combination of the two where you decrease the definitions' intervals and also increase the agent's drift detection period. This would be a meet-in-the-middle approach.

With the first approach of setting each of the definitions' intervals lower than the agent's drift detection interval, drift detection will run more frequently. This will increase the load on the agent particularly around CPU utilization and disk IO. 

With the second approach of increasing the drift detection period on the agent, drift detection will run less frequently. A definition's interval has a default value of 30 minutes. Suppose we set the agent's drift detection period (i.e., the frequency at which the detector job runs) to 60 minutes. Let's say we create 5 drift definitions in succession. It will take over five hours for the initial snapshot to be generated for all of the definitions.
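The estimate above can be written out as simple arithmetic. This sketch assumes that the detector job handles one definition per run and that the first run happens one full period after the definitions are created; neither assumption is confirmed here beyond the example itself, and `initial_snapshot_hours` is a hypothetical helper.

```python
def initial_snapshot_hours(num_definitions, period_minutes):
    """Lower bound on the time until every definition has a snapshot.

    Assumes one definition is scanned per detector run and ignores the
    time the scans themselves take.
    """
    return num_definitions * period_minutes / 60


hours = initial_snapshot_hours(num_definitions=5, period_minutes=60)
print(hours)
```

With 5 definitions and a 60 minute period the bound is 5 hours; the time spent in the scans themselves pushes the real figure past that.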

The third approach offers the most flexibility. If your agent and the machine on which it is running can handle the increased load, you could, for example, drop the interval of each definition down to 10 minutes. At the same time, you could set rhq.agent.plugins.drift-detection.period-secs to a safe value of 900 (i.e., 15 minutes).
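A quick way to check whether a given combination of settings falls into the problem case is to compare the two values directly. The helper below is a hypothetical sketch, not part of RHQ; it simplifies the condition to "definition interval longer than the detector period" and ignores the margin of 30 seconds or so mentioned in the steps to reproduce.

```python
def is_affected(definition_interval_secs, detector_period_secs=60):
    """True if an unpatched agent would keep postponing this definition.

    detector_period_secs corresponds to
    rhq.agent.plugins.drift-detection.period-secs (60 by default).
    """
    return definition_interval_secs > detector_period_secs


default_case = is_affected(30 * 60)          # 30 minute interval, 60 s period
third_approach = is_affected(10 * 60, 900)   # 10 minute interval, 900 s period
print(default_case, third_approach)
```

The default settings are flagged as affected, while the 10 minute / 900 second combination from the third approach is not.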
Comment 1 John Sanda 2011-12-08 14:18:11 EST
We no longer update the detection schedule if detection does not run for a definition. Drift detection might not run for one of three reasons: 1) the definition is disabled, 2) it is too early for the next scheduled run, 3) the server has not acknowledged the previous change set sent by the agent. The fix also includes additional debug logging that outputs the next scheduled detection time. It looks like this:

2011-12-08 14:20:32,512 DEBUG [pool-3-thread-1] (rhq.core.pc.drift.ScheduleQueueImpl)- The next drift detection run for DriftDetectionSchedule[resourceId: 10001, driftDefinitionId: 10001, driftDefinitionName: DRIFT_1] is set for 02:20:37:744 PM

master commit hash: 518ccccdc4ad05353de809068c289c1fc64a5384
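The behavior described in this comment can be sketched as follows. This is an illustration in Python, not the code from the commit: `Schedule`, `skip_reason`, and `process` are hypothetical names, and the three checks simply mirror the three reasons listed above.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """Hypothetical schedule entry for one drift definition."""
    enabled: bool = True
    interval_ms: int = 30 * 60 * 1000
    next_scan_ms: int = 0
    previous_change_set_acked: bool = True


def skip_reason(schedule, now_ms):
    """Return why detection should be skipped, or None if it should run."""
    if not schedule.enabled:
        return "definition is disabled"
    if now_ms < schedule.next_scan_ms:
        return "too early for the next scheduled run"
    if not schedule.previous_change_set_acked:
        return "server has not acknowledged the previous change set"
    return None


def process(schedule, now_ms, scan):
    """Run detection if allowed; reschedule only after a scan has run."""
    reason = skip_reason(schedule, now_ms)
    if reason is None:
        scan(schedule)
        schedule.next_scan_ms = now_ms + schedule.interval_ms
    return reason


s = Schedule()
first = process(s, 0, scan=lambda _: None)        # scan runs, gets rescheduled
second = process(s, 60_000, scan=lambda _: None)  # skipped, schedule untouched
print(first, second, s.next_scan_ms)
```

The key point is that `next_scan_ms` changes only on the path where the scan callback was invoked, so a skipped run one minute later leaves the 30 minute deadline in place.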
Comment 3 Charles Crouch 2012-01-30 13:50:50 EST
As I mentioned on 785845, this could only have been verified on master, so setting this to ON-QA for it to be verified as part of the jon3.0.1 release.
Comment 4 Mike Foley 2012-02-02 14:08:35 EST
Verified on JON 3.0.1 RC#2.
Comment 5 Heiko W. Rupp 2013-09-03 11:17:57 EDT
Bulk closing of old issues in VERIFIED state.