Red Hat Bugzilla – Bug 535283
fix mtime-based inventory sync logic
Last modified: 2014-11-09 17:49:19 EST
once upon a time, measurement schedule updates required the agent to be online to receive the changes; otherwise the flow would fail. this might not have seemed so bad for updating schedules on a single resource, but we also support updating schedules for compatible groups, auto groups, and across resource types (measurement templates). this was corrected in RHQ-792, which allowed the agent to be down and still receive the updates.
the fix for RHQ-792 piggybacked on the logic the agent uses to determine which resources it needs to merge, namely the mtime. if the last modified time of the resource on the server is LATER than the one stored on the agent, then something about that resource changed and it will be synced...and one of the items in the synchronization payload is the set of measurement schedules for that resource.
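the sync rule above boils down to a per-resource timestamp comparison. here is a minimal sketch of it, assuming a hypothetical, simplified `ResourceState` record (id plus mtime in epoch millis) rather than RHQ's real classes. it also assumes that a resource the agent has no mtime for is treated as mtime 0, so it always looks stale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified view of a resource: only the fields the sync rule needs.
record ResourceState(int id, long mtime) {}

final class MtimeSync {
    private MtimeSync() {}

    // Returns the ids whose server-side mtime is LATER than the agent's copy.
    // ASSUMPTION: a missing agent entry is treated as mtime 0.
    static List<Integer> resourcesToSync(List<ResourceState> serverSide,
                                         Map<Integer, Long> agentMtimes) {
        List<Integer> stale = new ArrayList<>();
        for (ResourceState r : serverSide) {
            long agentMtime = agentMtimes.getOrDefault(r.id(), 0L);
            if (r.mtime() > agentMtime) {
                stale.add(r.id()); // the merge payload for this id carries its schedules
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<ResourceState> server = List.of(new ResourceState(1, 2000L), new ResourceState(2, 1000L));
        Map<Integer, Long> agent = Map.of(1, 1500L, 2, 1000L);
        System.out.println(resourcesToSync(server, agent)); // [1]
    }
}
```

note that equal mtimes mean "no change", which is why anything that bumps or resets the mtime on either side directly decides what gets synced.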
as part of general performance fixes for the 1.2 release, we determined that the mtime on the server was being reset too often. this was because we were using a hibernate @OnUpdate persistence hook to set the mtime field to the current time each time the resource was merged through the entityManager. however, the agent doesn't care about every single update made to a resource or a related object, nor about unnecessary entityManager.merge(Resource) calls. it only cares whether the basic properties on a resource change, whether its measurement schedules have been updated, or whether its plugin configuration has been updated.
to remedy this, the @OnUpdate hook was removed and "System.currentTimeMillis()" assignments were placed into the setter methods for the properties of concern. the mtime was also set manually in the code paths that update the measurement schedules (note: resetting the mtime on plugin configuration updates is not necessary at the time of this writing, because those updates are synchronous).
unfortunately, this solution ignored one very evident issue: the domain entities are shared between the agent and the server. so, although resetting the mtime as a side effect of certain setter methods was correct on the server side, it caused problems when those same setters reset the mtime on the agent side. remember, differing mtimes are the primary criterion for deciding whether a resource needs to be synced / merged between agent and server.
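the problem can be reproduced with a toy entity. this is only a sketch: `SharedResource` is a hypothetical stand-in for the shared Resource entity, written the way the 1.2 fix worked (setters stamp the mtime), and the clock is injected so the demo is deterministic.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for the Resource entity shared by agent and server.
// As in the 1.2 approach, the setter bumps the mtime as a side effect.
class SharedResource {
    private final LongSupplier clock; // injected so the demo is deterministic
    private String name;
    private long mtime;

    SharedResource(String name, long mtime, LongSupplier clock) {
        this.name = name;
        this.mtime = mtime;
        this.clock = clock;
    }

    void setName(String name) {
        this.name = name;
        this.mtime = clock.getAsLong(); // side effect runs on the agent side too
    }

    String getName() { return name; }
    long getMtime() { return mtime; }
}

final class SetterSideEffectDemo {
    private SetterSideEffectDemo() {}

    // The agent pulls a resource only if the server copy is strictly newer.
    static boolean agentWouldSync(long serverMtime, long agentMtime) {
        return serverMtime > agentMtime;
    }

    public static void main(String[] args) {
        long serverMtime = 100L; // schedules changed on the server while the agent was down
        // Agent copy last synced at t=50; some agent-side code path calls a setter at t=200.
        SharedResource agentCopy = new SharedResource("cpu0", 50L, () -> 200L);
        System.out.println(agentWouldSync(serverMtime, agentCopy.getMtime())); // true
        agentCopy.setName("cpu0");
        System.out.println(agentWouldSync(serverMtime, agentCopy.getMtime())); // false: server change is masked
    }
}
```

once the agent's own copy carries a newer mtime, the server's pending change no longer looks "later", which is why resetting the agent-side mtime to 0 works as a one-off workaround.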
consequently, the mtime was being reset too often, which turned out to be the root cause of RHQ-1990. although that was fixed with a one-off solution (namely, resetting the mtime back to 0 at the end of the manual add workflow on the agent side), this mtime issue could come back to bite us later. the proper fix is therefore to remove the mtime side effect from the setter methods of the Resource entity, and instead set the mtime explicitly in the server-side code paths that update the resource.
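a minimal sketch of the proposed fix, assuming hypothetical `PlainResource` and `ResourceUpdateService` classes in place of the real entity and server-side session bean: setters become side-effect free, and only server-side update paths stamp the mtime.

```java
import java.util.function.LongSupplier;

// Hypothetical entity after the fix: setters have no side effects,
// so agent-side code can call them without disturbing the mtime.
class PlainResource {
    private String name;
    private String description;
    private long mtime;

    PlainResource(String name, String description, long mtime) {
        this.name = name;
        this.description = description;
        this.mtime = mtime;
    }

    void setName(String name) { this.name = name; }
    void setDescription(String description) { this.description = description; }
    void setMtime(long mtime) { this.mtime = mtime; }

    String getName() { return name; }
    String getDescription() { return description; }
    long getMtime() { return mtime; }
}

// Hypothetical server-side service: the only place that stamps the mtime,
// and only on the update paths the agent cares about.
class ResourceUpdateService {
    private final LongSupplier clock;

    ResourceUpdateService(LongSupplier clock) { this.clock = clock; }

    void updateBasicProperties(PlainResource r, String name, String description) {
        r.setName(name);
        r.setDescription(description);
        r.setMtime(clock.getAsLong()); // explicit, server-side only
    }

    void markSchedulesUpdated(PlainResource r) {
        r.setMtime(clock.getAsLong()); // schedules changed, so the agent must re-sync
    }
}

final class ExplicitMtimeDemo {
    public static void main(String[] args) {
        PlainResource agentCopy = new PlainResource("cpu0", "CPU 0", 50L);
        agentCopy.setName("cpu0"); // agent-side setter call leaves the mtime alone
        System.out.println(agentCopy.getMtime()); // 50

        PlainResource serverCopy = new PlainResource("cpu0", "CPU 0", 50L);
        new ResourceUpdateService(() -> 300L).updateBasicProperties(serverCopy, "cpu0", "first CPU");
        System.out.println(serverCopy.getMtime()); // 300
    }
}
```

keeping the timestamp logic in the service layer means the shared entity behaves identically on both sides, and the server decides when a change is sync-worthy.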
there's not enough time left in this release to address this with enough QA to give us confidence that we haven't regressed...and inventory synchronization is too important to gloss over. let's try to get this into the next release.
Automated tests as described in RHQ-792 should be added to confirm this fix.
Use the following procedure:
1) Inventory only the platform.
2) execute "inventory --xml --export=export-1.xml" on the agent command line
3) shut down the agent
4) Go to Administration>System Configuration>Templates
5) Edit Metric Templates for CPU
6) Check "Update schedules for existing resources" and set the collection interval for the "Idle" metric to 10 minutes (this also enables that metric)
7) Check that this metric was updated for each individual CPU resource in its Monitor>Schedule sub-tab
8) start up the agent
9) wait 30s (shouldn't be necessary)
10) execute "inventory --xml --export=export-2.xml" on the agent command line
11) Locate the <schedule> element with name CpuPerc.idle for the same CPU resource in both export-1.xml and export-2.xml
The schedule in export-1.xml should be disabled, with an interval of 1200000 (20 minutes).
The schedule in export-2.xml should be enabled, with an interval of 600000 (10 minutes).
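Step 11 and the expected results could be turned into the automated test requested above. The sketch below assumes a hypothetical, simplified export layout in which each schedule is a <schedule> element with "name", "enabled" and "interval" attributes under an <inventory> root; the real agent export format may differ, and a real test would also need to select the schedule of the same CPU resource in both files.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical, simplified view of one <schedule> element.
record Schedule(boolean enabled, long intervalMillis) {}

final class ScheduleExportCheck {
    private ScheduleExportCheck() {}

    // ASSUMPTION: the attribute names "name", "enabled", "interval" are illustrative only.
    static Schedule findSchedule(String xml, String scheduleName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("unparseable export", e);
        }
        NodeList nodes = doc.getElementsByTagName("schedule");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (scheduleName.equals(e.getAttribute("name"))) {
                return new Schedule(Boolean.parseBoolean(e.getAttribute("enabled")),
                        Long.parseLong(e.getAttribute("interval")));
            }
        }
        throw new IllegalArgumentException("no schedule named " + scheduleName);
    }

    // Expected results: disabled at 20 minutes before, enabled at 10 minutes after.
    static boolean offlineUpdateWasSynced(Schedule before, Schedule after) {
        return !before.enabled() && before.intervalMillis() == TimeUnit.MINUTES.toMillis(20)
                && after.enabled() && after.intervalMillis() == TimeUnit.MINUTES.toMillis(10);
    }

    public static void main(String[] args) {
        String export1 = "<inventory><schedule name=\"CpuPerc.idle\" enabled=\"false\" interval=\"1200000\"/></inventory>";
        String export2 = "<inventory><schedule name=\"CpuPerc.idle\" enabled=\"true\" interval=\"600000\"/></inventory>";
        Schedule before = findSchedule(export1, "CpuPerc.idle");
        Schedule after = findSchedule(export2, "CpuPerc.idle");
        System.out.println(offlineUpdateWasSynced(before, after)); // true
    }
}
```

With the bug present, both exports contain the same schedule, so the check returns false.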
Actual results (prior to this bug being fixed):
The schedule looks the same in both files.
Fixed by updating the mtime externally from SLSB methods, rather than inside Resource entity's setters (git commit e22d84ed25a635fc2cbc416bdee87dc340166749). Tested fix using the above repro steps. Also tested that name/description/location updates done while the Agent is offline get synced when the Agent comes back online.
Fixed in RHQ_1_2_0_GA_PERF branch - r5255.
Ian, can you merge this into 1.3.1 too? Thanks.
Associated case: 344497
Fixed in 1.3 CP branch - r5256. Charles, is that what you meant by 1.3.1?
r5258 - When updating metric schedules, if pushing the updated schedules to the Agents fails, throw an Exception, rather than just touching the mtime on the corresponding Resources; ***NOTE*** THIS IS A SPECIAL FIX FOR THIS BRANCH ONLY AND SHOULD NOT BE MERGED INTO TRUNK!
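The two failure-handling strategies can be contrasted in a short sketch. AgentClient, ScheduleService and their methods are hypothetical names for illustration only. The sketch models just the failure path of a schedule push: in trunk mode the Resource is marked for an mtime touch so the Agent merges the change on its next sync, and in branch-only mode the failure is thrown to the caller. Whether trunk also touches the mtime after a successful push is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical agent connection: a push may fail when the Agent is offline.
interface AgentClient {
    void pushSchedules(int resourceId) throws Exception;
}

final class ScheduleService {
    private final AgentClient agent;
    private final boolean failOnPushError; // true models the branch-only r5258 behavior
    private final List<Integer> touched = new ArrayList<>(); // ids whose mtime was bumped

    ScheduleService(AgentClient agent, boolean failOnPushError) {
        this.agent = agent;
        this.failOnPushError = failOnPushError;
    }

    void updateSchedules(int resourceId) {
        try {
            agent.pushSchedules(resourceId);
        } catch (Exception e) {
            if (failOnPushError) {
                // branch-only: surface the failure to the caller
                throw new IllegalStateException("could not push schedules for resource " + resourceId, e);
            }
            // trunk: touch the mtime so the Agent picks up the change on its next sync
            touched.add(resourceId);
        }
    }

    List<Integer> touchedResourceIds() { return List.copyOf(touched); }
}

final class PushFailureDemo {
    public static void main(String[] args) {
        AgentClient offline = id -> { throw new Exception("agent offline"); };
        ScheduleService trunk = new ScheduleService(offline, false);
        trunk.updateSchedules(7);
        System.out.println(trunk.touchedResourceIds()); // [7]
        try {
            new ScheduleService(offline, true).updateSchedules(7);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // could not push schedules for resource 7
        }
    }
}
```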
git commit for fix to trunk: e22d84ed25a635fc2cbc416bdee87dc340166749
It turns out the mtime also needs to be set on Resources when inventoryStatus is updated, otherwise inventory statuses never get synced to Agents... This is fixed by the following commits:
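A minimal sketch of why the inventory status path needs the same treatment, using a hypothetical InventoryStatus enum subset and TrackedResource class (the real entity and enum may differ): unless the status update stamps the mtime, the server copy never looks newer than the Agent's, so the status change is never merged.

```java
import java.util.function.LongSupplier;

// Hypothetical subset of inventory statuses, for illustration only.
enum InventoryStatus { NEW, COMMITTED, IGNORED }

class TrackedResource {
    InventoryStatus status;
    long mtime;

    TrackedResource(InventoryStatus status, long mtime) {
        this.status = status;
        this.mtime = mtime;
    }
}

final class InventoryStatusUpdater {
    private final LongSupplier clock;

    InventoryStatusUpdater(LongSupplier clock) { this.clock = clock; }

    // Server-side update path: change the status AND stamp the mtime.
    void updateInventoryStatus(TrackedResource serverCopy, InventoryStatus newStatus) {
        serverCopy.status = newStatus;
        serverCopy.mtime = clock.getAsLong(); // without this the Agent never sees the change
    }

    // Same rule as the general sync logic: server copy must be strictly newer.
    static boolean agentWouldSync(TrackedResource serverCopy, long agentMtime) {
        return serverCopy.mtime > agentMtime;
    }

    public static void main(String[] args) {
        TrackedResource server = new TrackedResource(InventoryStatus.NEW, 100L);
        long agentMtime = 100L; // Agent is currently in sync
        new InventoryStatusUpdater(() -> 500L).updateInventoryStatus(server, InventoryStatus.COMMITTED);
        System.out.println(agentWouldSync(server, agentMtime)); // true
    }
}
```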
master - commit fdd23ec2bb428e6bde53e70d9b31c6ae22465da0
1.3 CP branch - r5262
1.2 PERF branch - r5261
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1996
This bug relates to RHQ-792
The backward patching was incomplete in the RHQ_1_3_0_GA_CP branch. This has been resolved in r5269.
qa -> gneelaka
This has the same repro steps as
Mass-closure of verified bugs against JON.