608487 – agent startup without --clean fails to sync with server

Summary: agent startup without --clean fails to sync with server

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Plugin Container
Sub Component:
Version:	3.0.0
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Joseph Marques
QA Contact:	Sudhir D
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	jon-sprint11-bugs

Reported:	2010-06-27 19:34 UTC by Joseph Marques
Modified:	2010-08-12 16:47 UTC
CC List:	1 user
Fixed In Version:	2.4
Clone Of:
Environment:
Last Closed:	2010-08-12 16:47:28 UTC
Embargoed:

Description Joseph Marques 2010-06-27 19:34:46 UTC

I believe there is a very serious regression in the agent/server sync stuff right now.  My use case is as follows:

* already committed inventory
* shut down agent
* update some measurement schedule in the GUI
* start agent normally (i.e., *not* clean)

After approximately two full days of positive and negative testing, I believe I've isolated the root cause down to a single block of code.  If you put a breakpoint in Inventory.handleReport, you will notice that execution never gets past the following check at the beginning of the method:

    if (report.getAddedRoots().isEmpty()) {
        return true; // nothing to do
    }

As a result, it thinks it has nothing to do and returns quickly.  Although it's true that it might not have any resources from an auto-discovery or runtime-discovery report to sync *up to* the server, it may nevertheless have data to sync *down from* the server.
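
To make the failure mode concrete, here is a minimal sketch of the two behaviours. Everything in it (HandleReportSketch, Report, Server, and both handleReport variants) is a hypothetical stand-in written for illustration, not the actual RHQ classes or signatures; it only assumes that the server's reply to a report is what carries server-side changes back down to the agent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the real RHQ classes.
final class HandleReportSketch {

    /** Minimal stand-in for an inventory report: only the newly discovered roots. */
    static final class Report {
        final List<String> addedRoots = new ArrayList<>();
    }

    /** Stand-in for the server: counts how many reports it has received. */
    static final class Server {
        int reportsReceived = 0;

        /** Returns the names of resources changed server-side while the agent was offline. */
        List<String> mergeInventoryReport(Report report) {
            reportsReceived++;
            List<String> modified = new ArrayList<>();
            modified.add("CPU-0"); // pretend a schedule was edited in the GUI
            return modified;
        }
    }

    /** Regressed behaviour: an empty report short-circuits, so nothing syncs down. */
    static List<String> handleReportWithEarlyReturn(Report report, Server server) {
        if (report.addedRoots.isEmpty()) {
            return new ArrayList<>(); // "nothing to do", but server-side changes are missed
        }
        return server.mergeInventoryReport(report);
    }

    /** Behaviour with the check removed: the report is always sent. */
    static List<String> handleReportAlwaysSend(Report report, Server server) {
        return server.mergeInventoryReport(report);
    }

    public static void main(String[] args) {
        Report empty = new Report();
        Server server = new Server();
        System.out.println("early return: " + handleReportWithEarlyReturn(empty, server));
        System.out.println("always send:  " + handleReportAlwaysSend(empty, server));
    }
}
```

With an empty report, the first variant never contacts the server at all, which matches the symptom seen at the breakpoint.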

-----

A few examples...

If an agent is offline when a measurement schedule is updated, we indicate this fact by resetting the mtime on the resource.  Then, when the agent comes back online, if the server's mtime is more recent than the mtime of the corresponding agent-local resource that was rehydrated from disk, this is considered a "modified" resource, which then goes through a "merge" process to sync the plugin configuration from the server, the latest set of enabled measurement schedules, the latest mtime from the server, etc.
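
The mtime comparison described above could be sketched roughly as follows. The class and method names are hypothetical, and representing each side as a map of resource id to mtime is an assumption made to keep the example small; the real agent works on resource trees.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the id -> mtime maps are an assumed simplification.
final class MtimeSyncSketch {

    /** Returns the ids whose server-side mtime is strictly newer than the agent-local mtime. */
    static List<Integer> findModified(Map<Integer, Long> localMtimes, Map<Integer, Long> serverMtimes) {
        List<Integer> modified = new ArrayList<>();
        for (Map.Entry<Integer, Long> entry : serverMtimes.entrySet()) {
            Long local = localMtimes.get(entry.getKey());
            // Only resources known on both sides can be "modified"; a newer
            // server mtime means the resource needs to go through the merge.
            if (local != null && entry.getValue() > local) {
                modified.add(entry.getKey());
            }
        }
        return modified;
    }

    public static void main(String[] args) {
        Map<Integer, Long> local = new LinkedHashMap<>();
        local.put(10001, 100L);
        local.put(10002, 200L);

        Map<Integer, Long> server = new LinkedHashMap<>();
        server.put(10001, 150L); // schedule updated while the agent was offline
        server.put(10002, 200L); // unchanged

        System.out.println("modified: " + findModified(local, server));
    }
}
```

Each id returned here would then be merged: plugin configuration, enabled measurement schedules, and the new mtime are pulled from the server.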

If an agent is offline when a resource is uninventoried, it doesn't get the "message" because we don't send anything from the server down to the agent in a guaranteed-delivery way.  Instead, the agent relies on the resulting ResourceSyncInfo tree that it gets back from the server after sending its InventoryReport.  It compares the syncInfo tree to the local resource tree rehydrated from disk and anything that occurs in the latter but not the former is considered an "obsolete" resource and is deleted from the local inventory (afterwards it may be rediscovered and either auto-committed [if it's a service or nested-server] or require an import action [if it's a platform or top-level server]).
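
A rough sketch of that "obsolete" check, under the assumption that both trees can be reduced to sets of resource ids. Node, collectIds and findObsolete are hypothetical names for illustration, not the real ResourceSyncInfo API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; Node is a stand-in for both the local tree and the sync info tree.
final class ObsoleteResourceSketch {

    /** Tree node carrying only a resource id and its children. */
    static final class Node {
        final int id;
        final List<Node> children = new ArrayList<>();

        Node(int id, Node... kids) {
            this.id = id;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Recursively collects every resource id in the tree rooted at node. */
    static void collectIds(Node node, Set<Integer> ids) {
        ids.add(node.id);
        for (Node child : node.children) {
            collectIds(child, ids);
        }
    }

    /** Ids present in the local tree but absent from the server's sync info tree. */
    static Set<Integer> findObsolete(Node localRoot, Node syncInfoRoot) {
        Set<Integer> local = new TreeSet<>();
        Set<Integer> server = new TreeSet<>();
        collectIds(localRoot, local);
        collectIds(syncInfoRoot, server);
        local.removeAll(server);
        return local;
    }

    public static void main(String[] args) {
        Node local = new Node(1, new Node(2), new Node(3));
        Node syncInfo = new Node(1, new Node(2)); // resource 3 was uninventoried
        System.out.println("obsolete: " + findObsolete(local, syncInfo));
    }
}
```

Note that this comparison can only run if the agent actually receives a sync info tree, which is exactly what the early return prevents.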

.....

Tracing the SCM history, I find that this block of suspect code was introduced fairly recently as a fix for BZ-545595[1].  Luckily, the commit[2] shows that this change was introduced after the 2.3.0 release (though I do wonder whether it got into any of the 2.3.1 CPs we delivered to customers).  In any event, the change is rather isolated, so even though the ramifications are severe, the fix is rather simple: delete this block of code.  Granted, rolling back this change will re-introduce the issue described in that bugzilla report, but I think we can either:

a) ignore this error for now, assuming we don't actually want to support customers installing *only* the platform plugin, or
b) fix this bug in some other way, probably on the server-side, but which doesn't violate the expected post-conditions of the DiscoveryServerService.mergeInventoryReport, namely to return a ResourceSyncInfo tree that represents the latest known state of the server-side inventory.

[1] -- https://bugzilla.redhat.com/show_bug.cgi?id=545595
[2] -- commit 1f5b8ad2c49728a54ca655616f9072edbc697460 @ 2009-12-08 16:57:27 EST

Comment 1 Joseph Marques 2010-06-27 22:20:22 UTC

commit 44e97f20fd026fc660d38d819497cb843a353384
Author: Joseph Marques <joseph>
Date:   Sun Jun 27 18:04:02 2010 -0400

    BZ-608487: fix sync logic by always sending an InventoryReport to the server

----

as mentioned above, this is expected to cause a regression in https://bugzilla.redhat.com/show_bug.cgi?id=545595

Comment 2 Joseph Marques 2010-06-27 22:22:55 UTC

This bug has the exact same reproduction procedures as were detailed by Ian Springer here:

https://bugzilla.redhat.com/show_bug.cgi?id=535283#c3

This bug was fixed together with performance improvements to this subsystem, detailed here:

https://bugzilla.redhat.com/show_bug.cgi?id=608057

Thus, verifying either of these bugs actually verifies both of them.

Comment 3 Corey Welton 2010-06-29 14:22:21 UTC

Verified per the steps in aforementioned bug


               <schedule>
                  <schedule-id>10171</schedule-id>
                  <name>CpuPerc.idle</name>
                  <enabled>false</enabled>
                  <interval>1200000</interval>
               </schedule>

versus

               <schedule>
                  <schedule-id>10171</schedule-id>
                  <name>CpuPerc.idle</name>
                  <enabled>true</enabled>
                  <interval>600000</interval>
               </schedule>

Comment 4 Sudhir D 2010-06-29 14:26:25 UTC

OK, I verified this with jon-server-2.4.0.GA_QA Build# 43, using the steps below from bug 535283:

1) Inventory only the platform.
2) execute "inventory --xml --export=export-1.xml" on the agent command line
3) shut down the agent
4) Go to Administration>System Configuration>Templates
5) Edit Metric Templates for CPU
6) check "Update schedules for existing resources", set collection interval to 10 minutes for "Idle" metrics (this will enable that metric as well)
7) Check that this metric got updated for the individual CPU resources in their Monitor>Schedule sub tabs
8) start up the agent
9) wait 30s (shouldn't be necessary)
10) execute "inventory --xml --export=export-2.xml" on the agent commandline
11) Locate the <schedule> element with name CpuPerc.idle for the same CPU resource in both export-1.xml and export-2.xml
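
Step 11 can be done by eye, but it could also be scripted. The sketch below is only an illustration: it assumes nothing about the export format beyond the <schedule> element and its <name>, <enabled> and <interval> children shown in this bug, and ScheduleDiffSketch and its methods are hypothetical names. It reports the first matching schedule only, so an export containing several CPUs would need an extra filter on the resource.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; only the <schedule> child element names come from this bug.
final class ScheduleDiffSketch {

    /** Text of the first descendant element with the given tag, or null if absent. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Describes the first <schedule> whose <name> equals metricName, or returns null. */
    static String describeSchedule(String xml, String metricName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList schedules = doc.getElementsByTagName("schedule");
            for (int i = 0; i < schedules.getLength(); i++) {
                Element schedule = (Element) schedules.item(i);
                if (metricName.equals(childText(schedule, "name"))) {
                    return "enabled=" + childText(schedule, "enabled")
                            + " interval=" + childText(schedule, "interval");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse export XML", e);
        }
    }

    public static void main(String[] args) {
        // Inline stand-ins for the contents of export-1.xml and export-2.xml.
        String before = "<inventory><schedule><schedule-id>10171</schedule-id>"
                + "<name>CpuPerc.idle</name><enabled>false</enabled>"
                + "<interval>1200000</interval></schedule></inventory>";
        String after = "<inventory><schedule><schedule-id>10171</schedule-id>"
                + "<name>CpuPerc.idle</name><enabled>true</enabled>"
                + "<interval>600000</interval></schedule></inventory>";
        System.out.println("export-1: " + describeSchedule(before, "CpuPerc.idle"));
        System.out.println("export-2: " + describeSchedule(after, "CpuPerc.idle"));
    }
}
```

The <inventory> wrapper element in main is made up so that the strings are well-formed XML; real input would be the file contents read from disk.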

The results are as expected for both CPU-0 and CPU-1, as shown below.
The schedule in export-1.xml is disabled and the interval is 1200000.
               <schedule>
                  <schedule-id>12744</schedule-id>
                  <name>CpuPerc.idle</name>
                  <enabled>false</enabled>
                  <interval>1200000</interval>
               </schedule>


The schedule in export-2.xml is enabled and interval is 600000.
               <schedule>
                  <schedule-id>12743</schedule-id>
                  <name>CpuPerc.idle</name>
                  <enabled>true</enabled>
                  <interval>600000</interval>
               </schedule>


Marking this bug as verified.

Comment 5 Corey Welton 2010-06-29 14:55:49 UTC

[10:40] < joseph> cswiii: sudhir_lurk: for https://bugzilla.redhat.com/show_bug.cgi?id=608487 make sure you test it at the resource-level, group-level, auto-group-level, and metric template level.
[10:40] < joseph> you should repeat all test cases for the disable button, enable button, and interval set of buttons
[10:40] < joseph> that will test all code-paths
[10:41] <@cswiii> the what-what?
[10:41] < joseph> then you should do that while the agent is online, to see that the updates make it to the agent in steady-state, as well as test when the agent is offline (and then when it starts back up, the schedules get sync'ed)
[10:42] <@cswiii> joseph: i don't know what you mean by cases for the disable button, enable button and interval set of buttons. 
[10:43] < joseph> ccrouch: https://bugzilla.redhat.com/show_bug.cgi?id=609158
[10:43] < joseph> cswiii: when you look at monitor>schedules, you have different controls at the bottom of the table
[10:43] < joseph> even though they all do similar things, each of them execute different code

Reopening for further examination.

Comment 6 Sudhir D 2010-06-30 12:30:06 UTC

I have verified the Enable and Disable functionality at the resource level: the table was updated with values for the enabled metrics, and once disabled they were no longer available.

Comment 7 Sudhir D 2010-06-30 12:45:29 UTC

Verified at the group level: I created a group, added the resources to it, and enabled metric collection for the group's resources. After monitoring them, I changed the interval and confirmed it was updated correctly, then disabled collection again.

Comment 8 Corey Welton 2010-06-30 18:44:11 UTC

Tested against dynagroups.  I think we can mark this bug as verified at this point.

Comment 9 Corey Welton 2010-08-12 16:47:28 UTC

Mass-closure of verified bugs against JON.