543888 – JON server may drop collected metric data after outage

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Database
Sub Component:
Version:	unspecified
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	RHQ Project Maintainer
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	rhq_triage

Reported:	2009-12-03 12:31 UTC by Rodrigo A B Freire
Modified:	2010-05-18 13:30 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-05-18 13:30:56 UTC
Embargoed:

Attachments
JON screenshot (160.46 KB, image/png) 2009-12-03 12:43 UTC, Rodrigo A B Freire	no flags
JON Sysout showing the shutdown, startup and the errors. (11.04 KB, application/zip) 2009-12-03 12:45 UTC, Rodrigo A B Freire	no flags

Description Rodrigo A B Freire 2009-12-03 12:31:29 UTC

Description of problem:
The JON server may drop agent-collected metric data after a server outage.

Version-Release number of selected component (if applicable):
JON 2.3.0GA

How reproducible:
Rarely. This is the first time the issue has occurred; reporting it to bring it to attention.

Steps to Reproduce:
1. Under Administration > System Settings > Config, set "Agent Max Quiet Time Allowed" to 4000 minutes.
2. Leave the agent running on a remote system, collecting data.
3. Shut down the JON server (and its database) normally, i.e. a clean, non-disruptive shutdown.
4. After 10 hours, bring the server back up.
  
Actual results:
When the agent sent its backlog of collected metrics to be merged, some of the data was lost.

Expected results:
All of the collected data should be merged.

Additional info:
This is a rather unusual setup (it is a proof of concept).
The JON server runs on my work notebook, which is online only from 9:00 to 18:00, and during that window a remote server sends its metrics to my machine. It is connected to the internet over a 3G cellphone connection, which may fail a few times a day. The remote server reaches the JON server through a dynamic DNS hostname, and this works perfectly.
Today, however, there was a gap from 5 AM until the time I brought the server back up. Note that JON correctly merged the data from 6 PM, when the server went down, until 5 AM. After that, the server threw some spurious SQL errors.
See the attached agent.log and screenshot.

Comment 1 Rodrigo A B Freire 2009-12-03 12:43:11 UTC

Created attachment 375756
JON screenshot

This picture shows the approximate time when the server was shut down and when the merge stopped.

Comment 2 Rodrigo A B Freire 2009-12-03 12:45:09 UTC

Created attachment 375757
JON Sysout showing the shutdown, startup and the errors.

Note that the read errors are due to the poor condition of the 3G cellphone connection.

Comment 3 wes hayutin 2010-02-16 16:59:13 UTC

Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs.

keyword:
new = Tracking + FutureFeature + SubBug

Comment 4 wes hayutin 2010-02-16 17:04:01 UTC

making sure we're not missing any bugs in rhq_triage

Comment 5 Charles Crouch 2010-05-18 03:35:17 UTC

Rodrigo, this looks like a problem with connecting to the JON database...

2009-12-03 09:33:30,820 WARN  [org.jboss.resource.connectionmanager.TxConnectionManager] Connection error occured: org.jboss.resource.connectionmanager.TxConnectionManager$TxConnectionEventListener@1b5b88c[state=NORMAL mc=org.jboss.resource.adapter.jdbc.xa.XAManagedConnection@67946a handles=1 lastUse=1259839366711 permit=true trackByTx=true mcp=org.jboss.resource.connectionmanager.JBossManagedConnectionPool$OnePool@923ce0 context=org.jboss.resource.connectionmanager.InternalManagedConnectionPool@dacdcf xaResource=org.jboss.resource.connectionmanager.xa.JcaXAResourceWrapper@7618f8 txSync=null]
java.sql.BatchUpdateException: Batch entry 0 INSERT  /*+ APPEND */ INTO RHQ_MEAS_DATA_NUM_R02(schedule_id,time_stamp,value) VALUES(17548,1259823971366,9.5727616E7) was aborted.  Call getNextException to see the cause.
	at org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2531)
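
The BatchUpdateException above only says that the batch entry was aborted and to "Call getNextException to see the cause", so the underlying database error is not visible in this excerpt. A minimal sketch of how that chain could be walked with the standard java.sql API follows. The class name BatchErrorChain, the method describeChain and the simulated root cause are hypothetical and only for illustration; they are not taken from the RHQ code base or from the attached log.

```java
import java.sql.BatchUpdateException;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchErrorChain {

    // Returns one descriptive line per exception in the JDBC
    // "next exception" chain, starting with the exception passed in.
    static List<String> describeChain(SQLException first) {
        List<String> lines = new ArrayList<>();
        for (SQLException e = first; e != null; e = e.getNextException()) {
            lines.add(e.getClass().getSimpleName()
                    + " [SQLState=" + e.getSQLState() + "] "
                    + e.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Simulated failure: the real root cause is NOT known from the bug
        // report, so a placeholder message and SQLState are used here.
        SQLException cause = new SQLException(
                "placeholder root cause (not from the attached log)", "HY000");
        BatchUpdateException batch = new BatchUpdateException(
                "Batch entry 0 was aborted.", new int[0]);
        batch.setNextException(cause);

        // Print the whole chain, outermost exception first.
        for (String line : describeChain(batch)) {
            System.out.println(line);
        }
    }
}
```

Logging the chain this way at the point where the batch insert fails would show whether the abort came from a dropped connection or from a rejected row, which the top-level message alone does not reveal.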


If you can provide steps that reproduce this issue, we can take a look; otherwise I'm going to close it.
Thanks

Comment 6 Rodrigo A B Freire 2010-05-18 13:08:15 UTC

Hi Charles

I no longer support that environment and am not in a position to reproduce the error, so feel free to close the issue.

Huge thanks

- RF
