534721 – (RHQ-1490) Availability computation is wrong when rhq server is down and agent is spooling

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	RHQ-1490
Product:	RHQ Project
Classification:	Other
Component:	No Component
Sub Component:
Version:	1.2
Hardware:	All
OS:	All
Priority:	urgent
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jay Shaughnessy
QA Contact:
Docs Contact:
URL:	http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:	rhq_triage 741450

Reported:	2009-02-06 13:39 UTC by Heiko W. Rupp
Modified:	2013-09-01 19:20 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:	rev 2940
Last Closed:	2013-09-01 19:20:07 UTC
Embargoed:

Description Heiko W. Rupp 2009-02-06 13:39:00 UTC

Start the rhq server + agent - let them run for some time.
Shut down the rhq server
Go to the agent and type "avail" - verify that the server is down
Wait for 5 mins.
Verify in the agent via "avail" that the server is down, run "dumpspool" to see that the agent is spooling

Start the server again, log in, go to the servers monitor tab - it shows all green

Look at the db - the downtime is not in rhq_availability

When I am faking AvailabilityType.DOWN from within the AS plugin when agent + server are running, the availability data is correctly ending up in the DB and shown on screen.

Going to the agent prompt: 

Sending the availability report to the server...
Done.
> 
> dumpspool object
data/command-spool.dat
1
[0] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
> 

Even though the agent said "sending avail report", there is no such report in the spool. This might come from InventoryManager.handleReport(AvailabilityReport), where the intermediate report is possibly not spooled, but a full report is requested for the next time - around line 571.


Sending the availability report to the server...
Done.
> dumpspool object
data/command-spool.dat
4
[0] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
[1] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
[2] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.event.EventServerService, invocation=NameBasedInvocation[mergeEventReport]}]; callback=null
[3] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
>

Comment 1 John Mazzitelli 2009-02-06 13:44:30 UTC

Availability reporting is not guaranteed to be delivered; therefore it is never spooled.

From DiscoveryServerService:

    // GH: Disabled temporarily (JBNADM-2385) @Asynchronous( guaranteedDelivery = true )
    @LimitedConcurrency(CONCURRENCY_LIMIT_AVAILABILITY_REPORT)
    boolean mergeAvailabilityReport(AvailabilityReport availabilityReport);
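
To illustrate why the availability report never shows up in dumpspool, here is a minimal sketch of how a sender might treat guaranteed and non-guaranteed commands while the server is unreachable. SketchSender and all of its methods are hypothetical names made up for this illustration; this is not the RHQ agent's actual comm layer, only a model of the behaviour described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for an agent-side sender; not RHQ's actual API. */
class SketchSender {
    private final Deque<String> spool = new ArrayDeque<>();   // commands waiting for the server
    private final List<String> delivered = new ArrayList<>(); // commands the server received
    private boolean serverUp = true;

    void setServerUp(boolean up) {
        serverUp = up;
        if (up) {
            // On reconnect, flush spooled commands in their original order.
            while (!spool.isEmpty()) {
                delivered.add(spool.removeFirst());
            }
        }
    }

    /** Returns true if the command was delivered or spooled, false if it was dropped. */
    boolean send(String command, boolean guaranteedDelivery) {
        if (serverUp) {
            delivered.add(command);
            return true;
        }
        if (guaranteedDelivery) {
            spool.addLast(command);
            return true;
        }
        return false; // a non-guaranteed command is lost while the server is down
    }

    int spoolSize() {
        return spool.size();
    }

    List<String> delivered() {
        return delivered;
    }
}
```

In this model a measurement report sent with guaranteedDelivery = true survives the outage, while an availability report sent with guaranteedDelivery = false is silently dropped, which matches the spool contents shown in the description.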

Comment 2 John Mazzitelli 2009-02-06 13:49:01 UTC

This was the description of the issue and why avail reporting is not guaranteed/spooled:

"Slow processing of measurement reports causes blocks to the availability report handling. This allows the backfiller to come along and mark everything down even though the agent knows everything is fine. The change to one asynch sending thread for agent comm's appears to have been the local cause to the problem though we'd still likely hit it at a slightly larger scale even with more threads sending (plus that caused other problems).

For now, we will try sending the avail reports synchronously (and not reliably)."

Comment 3 Heiko W. Rupp 2009-02-06 13:53:12 UTC

This is plain wrong, as the customer will see no metrics for the resource, but all lights are green - he will just be confused.

We made availability a first-class citizen in RHQ on purpose.

Writing a batch of availability reports to the database (in batch even) should not be more expensive than doing the same for metrics - which we did not disable.

If it's about alerting on past unavailability, then we'd need to disable alerting when we see that spooled
data is coming in for the timeframe [start of spooling, now].
But we need to at least show the data to the user -- he might need it for SLA computations or the like.
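
A rough sketch of that idea, assuming the server can tell when spooling started: every availability change is stored, but DOWN changes whose timestamp falls inside the spooled window do not raise an alert. SketchAvailabilityMerger and its fields are hypothetical names for illustration only; this is not how the RHQ server is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; class and method names are illustrative, not RHQ's API. */
class SketchAvailabilityMerger {
    final List<String> stored = new ArrayList<>(); // rows that would end up in rhq_availability
    final List<String> alerts = new ArrayList<>(); // alerts that would fire

    /**
     * Stores every availability change, but alerts only on DOWN changes that
     * were not replayed from the spool, i.e. whose timestamp lies outside
     * the window from spoolStart (inclusive) to now (exclusive).
     */
    void merge(long timestamp, boolean up, long spoolStart, long now) {
        stored.add(timestamp + ":" + (up ? "UP" : "DOWN"));
        boolean replayed = timestamp >= spoolStart && timestamp < now;
        if (!up && !replayed) {
            alerts.add("DOWN at " + timestamp);
        }
    }
}
```

With this split the user still sees the past downtime (for SLA computations, for example), and only live changes trigger alerts.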

Comment 4 Red Hat Bugzilla 2009-11-10 20:34:13 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1490

Comment 5 wes hayutin 2010-02-16 16:53:02 UTC

Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs.

keyword:
new = Tracking + FutureFeature + SubBug

Comment 6 wes hayutin 2010-02-16 16:58:33 UTC

making sure we're not missing any bugs in rhq_triage

Comment 7 Jay Shaughnessy 2012-02-28 20:20:58 UTC

I'm not sure, but I think the fact that agent avail is no longer
tied to avail reporting, or other changes made and described here [1],
may take care of this issue.

Asking Heiko to review and see if this can be closed.

[1] http://rhq-project.org/display/RHQ/Design-Availability+Checking

Comment 8 Heiko W. Rupp 2012-03-02 16:48:27 UTC

I think the changes mentioned in the wiki document will address this issue.

Comment 9 Jay Shaughnessy 2012-03-30 20:41:54 UTC

This is in master and can likely be closed; testing is somewhat implicit.

Comment 10 Heiko W. Rupp 2013-09-01 19:20:07 UTC

Bulk closing of BZs that have no target version set, but which are ON_QA for more than a year and thus are in production for a long time.