Steps to reproduce:

1. Start the RHQ server and agent and let them run for some time.
2. Shut down the RHQ server.
3. Go to the agent and type "avail"; verify that the server is down.
4. Wait for 5 minutes.
5. Verify in the agent via "avail" that the server is still down, then run "dumpspool" to see that the agent is spooling.
6. Start the server again, log in, and go to the server's Monitor tab: it shows all green.
7. Look at the DB: the downtime is not in rhq_availability.

When I fake AvailabilityType.DOWN from within the AS plugin while agent and server are both running, the availability data correctly ends up in the DB and is shown on screen.

At the agent prompt:

    Sending the availability report to the server... Done.
    >
    > dumpspool object data/command-spool.dat 1
    [0] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
    >

Even though the agent said "Sending the availability report", there is no such report in the spool. This might come from InventoryManager.handleReport(AvailabilityReport), where the intermediate report is possibly not spooled, but a full report is requested for the next time (around line 571).

    Sending the availability report to the server... Done.
    > dumpspool object data/command-spool.dat 4
    [0] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
    [1] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
    [2] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.event.EventServerService, invocation=NameBasedInvocation[mergeEventReport]}]; callback=null
    [3] org.rhq.enterprise.communications.command.client.CommandAndCallback: command=Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.agent-name=snert, rhq.security-token=1233073809826-109582636-1750379262810976394, rhq.send-throttle=true, rhq.guaranteed-delivery=true}]; params=[{targetInterfaceName=org.rhq.core.clientapi.server.measurement.MeasurementServerService, invocation=NameBasedInvocation[mergeMeasurementReport]}]; callback=null
    >
Availability reporting is not guaranteed to be delivered; therefore it is never spooled. From DiscoveryServerService:

    // GH: Disabled temporarily (JBNADM-2385) @Asynchronous( guaranteedDelivery = true )
    @LimitedConcurrency(CONCURRENCY_LIMIT_AVAILABILITY_REPORT)
    boolean mergeAvailabilityReport(AvailabilityReport availabilityReport);
This was the description of the issue and why avail reporting is not guaranteed/spooled: "Slow processing of measurement reports causes blocks to the availability report handling. This allows the backfiller to come along and mark everything down even though the agent knows everything is fine. The change to one asynch sending thread for agent comm's appears to have been the local cause to the problem though we'd still likely hit it at a slightly larger scale even with more threads sending (plus that caused other problems). For now, we will try sending the avail reports synchronously (and not reliably)."
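To illustrate why the availability report never shows up in the dumpspool output, here is a minimal sketch of a sender that spools only commands flagged for guaranteed delivery. All names in it (SpoolSketch, Command, send, flush) are hypothetical and are not the RHQ agent's actual classes. The in-memory queue merely stands in for data/command-spool.dat.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not RHQ code: a sender that spools only guaranteed commands.
class SpoolSketch {
    // A command name plus a flag comparable to rhq.guaranteed-delivery in the dump.
    static final class Command {
        final String name;
        final boolean guaranteedDelivery;

        Command(String name, boolean guaranteedDelivery) {
            this.name = name;
            this.guaranteedDelivery = guaranteedDelivery;
        }
    }

    private final Deque<Command> spool = new ArrayDeque<>();
    private final List<String> delivered = new ArrayList<>();
    private boolean serverUp = true;

    void setServerUp(boolean up) {
        serverUp = up;
    }

    // Returns true if the command was delivered or spooled, false if it was dropped.
    boolean send(Command c) {
        if (serverUp) {
            delivered.add(c.name);
            return true;
        }
        if (c.guaranteedDelivery) {
            spool.addLast(c); // kept for replay once the server is back
            return true;
        }
        return false; // not guaranteed: silently lost while the server is down
    }

    // Replays spooled commands in order; returns how many were replayed.
    int flush() {
        if (!serverUp) {
            return 0;
        }
        int replayed = 0;
        while (!spool.isEmpty()) {
            delivered.add(spool.removeFirst().name);
            replayed++;
        }
        return replayed;
    }

    int spoolSize() {
        return spool.size();
    }

    List<String> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        SpoolSketch agent = new SpoolSketch();
        agent.setServerUp(false);
        agent.send(new Command("mergeMeasurementReport", true));
        agent.send(new Command("mergeAvailabilityReport", false));
        System.out.println("spooled while down: " + agent.spoolSize());
        agent.setServerUp(true);
        agent.flush();
        System.out.println("delivered after restart: " + agent.delivered());
    }
}
```

In this model only the measurement report survives the outage, which matches what the spool dump shows: measurement and event reports are present, but no availability report.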
This is plain wrong: the customer will see no metrics for the resource while all lights are green, and he will just be confused. We made availability a first-class citizen in RHQ on purpose. Writing a batch of availability reports to the database (even in batch) should not be more expensive than doing the same for metrics, which we did not disable. If the concern is alerting on past unavailability, then we would need to disable alerting when we see that spooled data is coming in for the timeframe [start of spooling, now]. But we need to at least show the data to the user; he might need it for SLA computations or similar.
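The proposal to store spooled availability data without alerting on it could look roughly like the sketch below. Everything in it is an assumption made for illustration: AvailMergeSketch, Sample, the Avail enum (a stand-in for AvailabilityType), and the explicit spooling window are not part of the RHQ server API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not RHQ server code.
class AvailMergeSketch {
    // Stand-in for the AvailabilityType values mentioned in this report.
    enum Avail { UP, DOWN }

    static final class Sample {
        final int resourceId;
        final Avail avail;
        final long timestamp; // epoch millis at which the agent observed the state

        Sample(int resourceId, Avail avail, long timestamp) {
            this.resourceId = resourceId;
            this.avail = avail;
            this.timestamp = timestamp;
        }
    }

    private final List<Sample> stored = new ArrayList<>();
    private final List<Sample> alerted = new ArrayList<>();

    // Stores every sample, but raises alerts only for DOWN samples that fall
    // outside the spooling window [spoolStart, spoolEnd).
    void merge(List<Sample> report, long spoolStart, long spoolEnd) {
        for (Sample s : report) {
            stored.add(s); // always persisted, so the user can see the downtime
            boolean historical = s.timestamp >= spoolStart && s.timestamp < spoolEnd;
            if (s.avail == Avail.DOWN && !historical) {
                alerted.add(s);
            }
        }
    }

    List<Sample> stored() {
        return stored;
    }

    List<Sample> alerted() {
        return alerted;
    }

    public static void main(String[] args) {
        AvailMergeSketch server = new AvailMergeSketch();
        List<Sample> report = new ArrayList<>();
        report.add(new Sample(1, Avail.DOWN, 100L)); // observed while spooling
        report.add(new Sample(1, Avail.UP, 180L));   // observed while spooling
        report.add(new Sample(1, Avail.DOWN, 250L)); // observed after reconnect
        server.merge(report, 50L, 200L);
        System.out.println("stored=" + server.stored().size()
                + " alerted=" + server.alerted().size());
    }
}
```

With this split, the past downtime is still recorded and available for SLA computations, while alert handling only sees data from after the spooling window.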
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1490
Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs. keyword: new = Tracking + FutureFeature + SubBug
making sure we're not missing any bugs in rhq_triage
I'm not sure, but I think the fact that agent avail is no longer tied to avail reporting, or other changes made and described here [1], may take care of this issue. Asking Heiko to review and see if this can be closed. [1] http://rhq-project.org/display/RHQ/Design-Availability+Checking
I think the changes mentioned in the wiki document will address this issue.
This is in master and can likely be closed, testing is somewhat implicit.
Bulk closing of BZs that have no target version set, but which are ON_QA for more than a year and thus are in production for a long time.