Bug 534721 - (RHQ-1490) Availability computation is wrong when rhq server is down and agent is spooling
Status: CLOSED CURRENTRELEASE
Product: RHQ Project
Classification: Other
Component: No Component
Version: 1.2
Hardware: All, OS: All
Priority: urgent, Severity: medium
Assigned To: Jay Shaughnessy
http://jira.rhq-project.org/browse/RH...
Keywords: SubBug
Depends On:
Blocks: rhq_triage 741450
Reported: 2009-02-06 08:39 EST by Heiko W. Rupp
Modified: 2013-09-01 15:20 EDT
Doc Type: Bug Fix
Environment: rev 2940
Last Closed: 2013-09-01 15:20:07 EDT
Description Heiko W. Rupp 2009-02-06 08:39:00 EST
Start the rhq server + agent - let them run for some time.
Shut down the rhq server.
Go to the agent and type "avail" - verify that the server is down
Wait for 5 mins.
Verify in the agent via "avail" that the server is down, run "dumpspool" to see that the agent is spooling

Start the server again, log in, and go to the server's monitor tab - it shows all green

Look at the db - the downtime is not in rhq_availability

When I fake AvailabilityType.DOWN from within the AS plugin while agent + server are running, the availability data correctly ends up in the DB and is shown on screen.

Going to the agent prompt: 

Sending the availability report to the server...
Done.
> 
> dumpspool object
data/command-spool.dat
1
[0] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
> 

Even though the agent said "sending avail report", there is no such report in the spool. This might come from InventoryManager.handleReport(AvailabilityReport), where the intermediate report is possibly not spooled, but a full report is requested for the next time (around line 571).


Sending the availability report to the server...
Done.
> dumpspool object
data/command-spool.dat
4
[0] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
[1] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
[2] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.event.EventServerService, invocation=NameBasedInvocation[mergeEventReport]}]; callback=null
[3] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
> 



Comment 1 John Mazzitelli 2009-02-06 08:44:30 EST
Availability reporting is not guaranteed to be delivered, and therefore it is never spooled.

From DiscoveryServerService:

    // GH: Disabled temporarily (JBNADM-2385) @Asynchronous( guaranteedDelivery = true )
    @LimitedConcurrency(CONCURRENCY_LIMIT_AVAILABILITY_REPORT)
    boolean mergeAvailabilityReport(AvailabilityReport availabilityReport);
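
The effect of the disabled guaranteedDelivery flag can be illustrated with a minimal sketch. This is not RHQ code: the class SpoolSketch, its Command type, and all method names are hypothetical. It only assumes what is stated above, namely that commands sent with guaranteed delivery are spooled while the server is unreachable and all other commands are dropped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; none of these types are RHQ classes.
class SpoolSketch {
    // A command name plus the flag that decides whether it may be spooled.
    static final class Command {
        final String name;
        final boolean guaranteedDelivery;

        Command(String name, boolean guaranteedDelivery) {
            this.name = name;
            this.guaranteedDelivery = guaranteedDelivery;
        }
    }

    private final Deque<Command> spool = new ArrayDeque<>();
    private final List<Command> delivered = new ArrayList<>();
    private boolean serverUp = true;

    void setServerUp(boolean up) {
        serverUp = up;
    }

    // Deliver immediately if the server is reachable; otherwise spool
    // only commands flagged for guaranteed delivery.
    void send(Command c) {
        if (serverUp) {
            delivered.add(c);
        } else if (c.guaranteedDelivery) {
            spool.addLast(c);
        }
        // Any other command is dropped while the server is down.
    }

    // Replay spooled commands in their original order once the server is back.
    void flush() {
        while (serverUp && !spool.isEmpty()) {
            delivered.add(spool.removeFirst());
        }
    }

    int spooledCount() {
        return spool.size();
    }

    List<Command> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        SpoolSketch agent = new SpoolSketch();
        agent.setServerUp(false);
        agent.send(new Command("mergeMeasurementReport", true));
        agent.send(new Command("mergeAvailabilityReport", false));
        System.out.println("spooled while down: " + agent.spooledCount());
        agent.setServerUp(true);
        agent.flush();
        System.out.println("delivered after restart: " + agent.delivered().size());
    }
}
```

Under these assumptions the measurement report survives the outage and the availability report does not, which matches the dumpspool output in the description.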
Comment 2 John Mazzitelli 2009-02-06 08:49:01 EST
This was the description of the issue and why avail reporting is not guaranteed/spooled:

"Slow processing of measurement reports causes blocks to the availability report handling. This allows the backfiller to come along and mark everything down even though the agent knows everything is fine. The change to one asynch sending thread for agent comm's appears to have been the local cause to the problem though we'd still likely hit it at a slightly larger scale even with more threads sending (plus that caused other problems).

For now, we will try sending the avail reports synchronously (and not reliably)."
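
The backfiller behaviour described in that quote can be sketched roughly as follows. The real backfiller logic is not shown in this bug, so everything here is an assumption: the class BackfillSketch, the per-agent timestamp map, and the maxQuietMillis threshold are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the RHQ backfiller implementation.
class BackfillSketch {
    // Time at which the server last finished processing an availability
    // report from each agent (assumed bookkeeping).
    private final Map<String, Long> lastAvailReportMillis = new HashMap<>();
    private final long maxQuietMillis;

    BackfillSketch(long maxQuietMillis) {
        this.maxQuietMillis = maxQuietMillis;
    }

    // Called when an availability report has actually been processed.
    void onAvailabilityReport(String agentName, long nowMillis) {
        lastAvailReportMillis.put(agentName, nowMillis);
    }

    // Agents whose last processed report is older than the threshold.
    // A backfiller would mark their resources DOWN, even if the report
    // is merely stuck behind slow measurement reports.
    List<String> findAgentsToBackfill(long nowMillis) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastAvailReportMillis.entrySet()) {
            if (nowMillis - e.getValue() > maxQuietMillis) {
                suspects.add(e.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        BackfillSketch backfiller = new BackfillSketch(60_000L);
        backfiller.onAvailabilityReport("snert", 0L);
        System.out.println(backfiller.findAgentsToBackfill(30_000L));
        System.out.println(backfiller.findAgentsToBackfill(90_000L));
    }
}
```

In this sketch a report that is delayed past the threshold is indistinguishable from a report that was never sent, which is the failure mode the quote describes.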
Comment 3 Heiko W. Rupp 2009-02-06 08:53:12 EST
This is plain wrong, as the customer will see no metrics for the resource, but all lights are green - he will just be confused.

We made availability a first-class citizen in RHQ on purpose.

Writing a batch of availability reports to the database (in batch even) should not be more expensive than doing the same for metrics - which we did not disable.

If it's about alerting on past unavailability, then we'd need to disable alerting when we see that spooled
data is coming in for the timeframe [start of spooling, now].
But we need to at least show the data to the user -- he might need it for SLA computations or the like.
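
The proposal above (store spooled availability data, but do not alert on it) could look roughly like the following sketch. It is not RHQ code: SpooledAlertSketch, Sample, and merge are hypothetical names, and the sketch assumes the server knows when spooling started and treats the closed interval [spoolStart, now] as the spooled timeframe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "persist everything, alert only on live data".
class SpooledAlertSketch {
    enum Avail { UP, DOWN }

    // One availability reading with the time it was taken on the agent.
    static final class Sample {
        final long timestamp;
        final Avail avail;

        Sample(long timestamp, Avail avail) {
            this.timestamp = timestamp;
            this.avail = avail;
        }
    }

    final List<Sample> stored = new ArrayList<>();
    final List<Sample> alerted = new ArrayList<>();

    // Store every sample so the downtime is visible to the user; raise
    // alerts only for DOWN samples outside the spooled timeframe.
    void merge(List<Sample> samples, long spoolStart, long now) {
        for (Sample s : samples) {
            stored.add(s);
            boolean fromSpool = s.timestamp >= spoolStart && s.timestamp <= now;
            if (s.avail == Avail.DOWN && !fromSpool) {
                alerted.add(s);
            }
        }
    }

    public static void main(String[] args) {
        SpooledAlertSketch server = new SpooledAlertSketch();
        List<Sample> spooled = new ArrayList<>();
        spooled.add(new Sample(150L, Avail.DOWN));
        server.merge(spooled, 100L, 200L);
        System.out.println("stored=" + server.stored.size()
                + " alerted=" + server.alerted.size());
    }
}
```

The design choice is that storage and alerting are decided separately, so historical downtime still reaches the database for SLA reporting without triggering late alerts.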
Comment 4 Red Hat Bugzilla 2009-11-10 15:34:13 EST
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1490
Comment 5 wes hayutin 2010-02-16 11:53:02 EST
Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs.

keyword:
new = Tracking + FutureFeature + SubBug
Comment 6 wes hayutin 2010-02-16 11:58:33 EST
making sure we're not missing any bugs in rhq_triage
Comment 7 Jay Shaughnessy 2012-02-28 15:20:58 EST
I'm not sure, but I think the fact that agent avail is no longer
tied to avail reporting, or other changes made and described here [1],
may take care of this issue.

Asking Heiko to review and see if this can be closed.

[1] http://rhq-project.org/display/RHQ/Design-Availability+Checking
Comment 8 Heiko W. Rupp 2012-03-02 11:48:27 EST
I think the changes mentioned in the wiki document will address this issue.
Comment 9 Jay Shaughnessy 2012-03-30 16:41:54 EDT
This is in master and can likely be closed; testing is somewhat implicit.
Comment 10 Heiko W. Rupp 2013-09-01 15:20:07 EDT
Bulk closing of BZs that have no target version set, but which are ON_QA for more than a year and thus are in production for a long time.