Bug 535352 (RHQ-2057)
Summary: | runaway avail reporting | ||
---|---|---|---|
Product: | [Other] RHQ Project | Reporter: | John Mazzitelli <mazz> |
Component: | Agent | Assignee: | RHQ Project Maintainer <rhq-maint> |
Status: | CLOSED WORKSFORME | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | unspecified | CC: | cwelton, fredrik.joakim.sandstrom, jshaughn |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | All | ||
URL: | http://jira.rhq-project.org/browse/RHQ-2057 | ||
Whiteboard: | |||
Fixed In Version: | 1.4 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2012-02-28 19:47:26 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 534375 |
Description
John Mazzitelli
2009-05-05 01:48:00 UTC
seeing this in cmd trace: 2009-05-04 22:25:24,318 TRACE {send.complete}=>>DiscoveryServerService.setResourceError|{rhq.agent-name=mazzthink, rhq.retry=1, rhq.security-token=1241245112171-1980743266-8190587527248950016, rhq.send-throttle=true, rhq.guaranteed-delivery=true}|failed:java.lang.reflect.InvocationTargetException:null -> org.rhq.core.util.exception.WrappedRemotingException:javax.persistence.PersistenceException: org.hibernate.PropertyValueException: not-null property references a null or transient value: org.rhq.core.domain.resource.ResourceError.summary -> org.rhq.core.util.exception.WrappedRemotingException:org.hibernate.PropertyValueException: not-null property references a null or transient value: org.rhq.core.domain.resource.ResourceError.summary -> org.rhq.core.util.exception.WrappedRemotingException:not-null property references a null or transient value: org.rhq.core.domain.resource.ResourceError.summary I have it where the agent goes into an infinite loop constantly sending avai reports... I don't know why yet. This sendResourceErrorToServer is getting called every 1 second or so. I do have two Embedded Tomcat connector resources that are failing their avialability checks. but why is this constantly calling setRersourceError - the resource errors are getting stored in the DB (I checked) and the agent isn't getting exceptions from server. Daemon Thread [InventoryManager.availability-1] (Suspended (breakpoint at line 1192 in org.rhq.core.pc.inventory.InventoryManager)) owns: java.lang.Object (id=7013) org.rhq.core.pc.inventory.InventoryManager.sendResourceErrorToServer(org.rhq.core.domain.resource.ResourceError) line: 1192 org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(org.rhq.core.domain.resource.Resource, org.rhq.core.domain.discovery.AvailabilityReport, boolean, boolean) line: 178 org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(org.rhq.core.domain.resource.Resource, org.rhq.core.domain.discovery.AvailabilityReport, boolean, boolean) line: 210 org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(org.rhq.core.domain.resource.Resource, org.rhq.core.domain.discovery.AvailabilityReport, boolean, boolean) line: 210 org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(org.rhq.core.domain.resource.Resource, org.rhq.core.domain.discovery.AvailabilityReport, boolean, boolean) line: 210 org.rhq.core.pc.inventory.AvailabilityExecutor.call() line: 106 org.rhq.core.pc.inventory.AvailabilityExecutor.run() line: 70 java.util.concurrent.Executors$RunnableAdapter<T>.call() line: 441 java.util.concurrent.FutureTask$Sync.innerRun() line: 303 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask<V>(java.util.concurrent.FutureTask<V>).run() line: 138 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.access$301(java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask) line: 98 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.run() line: 207 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(java.lang.Runnable) line: 885 java.util.concurrent.ThreadPoolExecutor$Worker.run() line: 907 java.lang.Thread.run() line: 619 I think the root cause is this: id uuid name resource_key parent_resource_id 500639 92f9a2a2-df5a-4673-a251-aac93f89a00c Tomcat Connector (127.0.0.1:8009) jboss.web:address=127.0.0.1,port=8009,type=Connector 501246 500640 92f9a2a2-df5a-4673-a251-aac93f89a00c Tomcat Connector (127.0.0.1:8009) jboss.web:address=127.0.0.1,port=8009,type=Connector 501246 500641 b71e8fb5-c050-4ce5-be4d-a694a5c32c06 Tomcat Connector (127.0.0.1:8080) jboss.web:address=127.0.0.1,port=8080,type=Connector 501246 500642 b71e8fb5-c050-4ce5-be4d-a694a5c32c06 Tomcat Connector (127.0.0.1:8080) jboss.web:address=127.0.0.1,port=8080,type=Connector 501246 resource IDs 39,40 are dups - as is 41,42. Some how, we inventoried the same resources twice. Then what happens is in InventoryManager.synchInventory: if (!syncedResources.isEmpty() || !unknownResourceIds.isEmpty() || !modifiedResourceIds.isEmpty()) { performAvailabilityChecks(true); this.inventoryThreadPoolExecutor.schedule((Callable<? extends Object>) this.serviceScanExecutor, 5, TimeUnit.SECONDS); } the modifiedResourceIds is never empty - first it has two elements of 500639, 500641 - then it comes back within a second and it has 500640,500642 and then it comes back and repeats. It always thinks its needs to perform new avail checks and service scans - which puts us in an infinite loop (we scan, avail check, then scan avail check, scan....and so on) i think we need a unique constraint on the uuid column in rhq_resource - at the very least it would have caught this and not allowed the data to get into the DB (and its those dup rows that caused the agent to wig out). I manually deleted the 2 dup rows from the database and the agent is acting fine now... no infinite loop I did the ?debug=1 trick on the resource hub so I can uninventory those dup services. Now, every 15 minutes, I'm seeing this in teh server log: 08:07:11,537 INFO [CoreServerServiceImpl] Agent [mazzthink][1.3.0-SNAPSHOT(3877)] would like to connect to this server 08:07:11,803 INFO [CoreServerServiceImpl] Agent [mazzthink] has connected to this server at Tue May 05 08:07:11 EDT 2009 08:22:16,943 INFO [CoreServerServiceImpl] Agent [mazzthink][1.3.0-SNAPSHOT(3877)] would like to connect to this server 08:22:17,115 INFO [CoreServerServiceImpl] Agent [mazzthink] has connected to this server at Tue May 05 08:22:17 EDT 2009 08:37:22,162 INFO [CoreServerServiceImpl] Agent [mazzthink][1.3.0-SNAPSHOT(3877)] would like to connect to this server 08:37:22,287 INFO [CoreServerServiceImpl] Agent [mazzthink] has connected to this server at Tue May 05 08:37:22 EDT 2009 08:52:28,646 INFO [CoreServerServiceImpl] Agent [mazzthink][1.3.0-SNAPSHOT(3877)] would like to connect to this server 08:52:29,068 INFO [CoreServerServiceImpl] Agent [mazzthink] has connected to this server at Tue May 05 08:52:29 EDT 2009 that last comment may be a red herring - currently debugging this and this is getting caused by the failover code - my box recently lost its VPN and then I reconnected to the VPN, I have a feeling this is causing the problem. I see CannotConnectExceptions when trying to send up inventory and meas. reports, probably due to DNS and IP changes when going on and off VPN running for a while again, and I saw my avails running away again. so I checked to see if I have dups: id name agent_id parent_resource_id uuid resource_key 502550 mazzthink RHQ Server, JBossAS 4.2.1.GA default (0.0.0.0:2099) 500550 500550 c9c201a3-ea1a-4827-9cdf-846255157f28 C:\mazz\source\rhq\trunk\modules\enterprise\server\container\target\rhq-server-1.3.0-SNAPSHOT\jbossas\server\default 502551 mazzthink RHQ Server, JBossAS 4.2.1.GA default (0.0.0.0:2099) 500550 500550 c9c201a3-ea1a-4827-9cdf-846255157f28 C:\mazz\source\rhq\trunk\modules\enterprise\server\container\target\rhq-server-1.3.0-SNAPSHOT\jbossas\server\default 502552 mazzthink RHQ Server, JBossAS 4.2.1.GA default (0.0.0.0:2099) 500550 500550 c9c201a3-ea1a-4827-9cdf-846255157f28 C:\mazz\source\rhq\trunk\modules\enterprise\server\container\target\rhq-server-1.3.0-SNAPSHOT\jbossas\server\default 502553 mazzthink RHQ Server, JBossAS 4.2.1.GA default (0.0.0.0:2099) 500550 500550 95b5bc13-dbf0-46c1-b610-f86b02719058 C:\mazz\source\jopr\trunk\modules\dist\server\target\jopr-server-2.3.0-SNAPSHOT\jbossas\server\default 502554 mazzthink RHQ Server, JBossAS 4.2.1.GA default (0.0.0.0:2099) 500550 500550 95b5bc13-dbf0-46c1-b610-f86b02719058 C:\mazz\source\jopr\trunk\modules\dist\server\target\jopr-server-2.3.0-SNAPSHOT\jbossas\server\default now, I know for sure that I had network outages (my comcast went down several times today - not long, but 1 minute here, 10 minutes there, about 3 times or so today). This means that I lost my VPN. I'm not sure how that would cause this problem of duplicate resources, but the two times this has happened to me, I did lose my VPN connection. This is the SQL you can use to find out if you have any duplicate resources: select * from rhq_resource a, rhq_resource b where a.id != b.id and a.uuid = b.uuid I'm getting this again, but I do not have dup resources. I compeltely deleted my /data directory, started with -l and I still get the problem. BTW: I saw this happen when I shutdown the agent, deleted the agent install dir, and re-installed a new agent but kept the agent config prefs intact. does: boolean ok = configuration.getServerServices().getDiscoveryServerService().mergeAvailabilityReport( report); turn around and ask for another avail report to be sent? I hit this in a debugger, tried to step over while the VM was paused and it never returned. Once I restarted all the VM threads, everthing kept going but it kept looping put a breakpoint in InventoryManager.syncInventory - it keeps getting called. I execute this while in this method: System.out.print(this.resourceContainers.get("783cb684-ce28-4316-a657-ed32cc432a40")); System.out.print(this.resourceContainers.get("582ff136-024d-4067-8f9e-032ff78cbc06")); and look what comes out - notice the UUIDs are different but the IDs are the same!!! Two resources, same ID in the agent inventory: ResourceContainer: resource=[Resource[id=500067, type=Java Service Wrapper Launcher, key=jsw, name=RHQ Agent Java Service Wrapper, parent=mazzthink RHQ Agent, version=1.3.0-SNAPSHOT]] ResourceContainer: resource=[Resource[id=500067, type=Datasource, key=jboss.jca:name=RHQDS,service=DataSourceBinding, name=RHQDS Datasource, parent=mazzthink RHQ Server, JBossAS 4.2.3.GA default (0.0.0.0:2099)]] inventory -x shows this: <resource> <id>500067</id> <key>jboss.jca:name=RHQDS,service=DataSourceBinding</key> <name>RHQDS Datasource</name> <version></version> <uuid>582ff136-024d-4067-8f9e-032ff78cbc06</uuid> but database shows this: 500067 783cb684-ce28-4316-a657-ed32cc432a40 RHQ Agent Java Service Wrapper somehow the agent sync'ed incorrectly - the DB says 500067 is the JSW, but the agent things that ID is for the datasource. this is the root cause of the avail looping because syncInventory goes batty and keeps getting called because the agent things it isn't fully sync'ed i was using an old inventory.dat to try to get teh agent to be out of sync, but I think that was the problem. I had a bogus test. this may not be a problem. i'm leaving this jira open just in case i see this under different circumstances. the problem with the duplicate resources still needs to be fixed. Push to 1.4 I'm seeing this again in the GSS environment (a QA build RHQ r5013) .. 2009-09-11 09:15:55,144 INFO [org.rhq.enterprise.server.discovery.DiscoveryServerS erviceImpl] Processed AV:[vm138.gsslab.rdu.redhat.com][348][full] - need full=[fals e] in (156)ms 2009-09-11 09:15:56,786 INFO [org.rhq.enterprise.server.discovery.DiscoveryServerS erviceImpl] Processed AV:[vm138.gsslab.rdu.redhat.com][348][full] - need full=[fals e] in (147)ms ... I started seeing this after I restarted the agent with -u (previously I was seeing my JBAS instance show red when it was infact up) I checked for dupe resources using the SQL and saw none. Restarting the agent with --clean seems to have resolved the multiple avail reports. ------------------ After some more thought I think this can probably be explained by some nasty clock issues. The fast avail reporting seem to start around mindnight last night. Yesterday afternoon the clocks were reset using ntpd underneath the feet of the agent and server, from around midnight back to 4pm: (3:46:00 PM) ccrouch: [gss@vm138 bin]$ date (3:46:00 PM) ccrouch: Fri Sep 11 00:30:33 EDT 2009 (3:46:00 PM) ccrouch: [gss@vm138 bin]$ date (3:46:00 PM) ccrouch: Thu Sep 10 16:46:02 EDT 2009 I bet when we came across old date with a timestamp in the future we got confused :-/ This bug was previously known as http://jira.rhq-project.org/browse/RHQ-2057 This bug relates to RHQ-1885 mazz - what's left here? can we revisit unique restraint solution? we've done lots of changes to inventory sync since this was created. I would close this as this problem hasn't been seen in a while. That said, this might still be something we should consider: "i think we need a unique constraint on the uuid column in rhq_resource - at the very least it would have caught this and not allowed the data to get into the DB (and its those dup rows that caused the agent to wig out). " The UUIDs should be unique across the entire inventory. This would require a change to dbsetup/dbupgrade scripts (or liquibase if we ever move to that) FYI: the hibernate errors during calls to setResourceError are being fixed via bug 664126 I'm closing this since it's been stale for a while and we have massive amounts of refactored/change code in the area of question. |