Description of problem:
If the JBoss ON database is taken offline for backup or maintenance, or a network partition event temporarily interrupts the JBoss ON server's communication with the database, agents can get permanently backfilled, resulting in stale inventory availability reporting.

Version-Release number of selected component (if applicable):
3.1.2.GA

How reproducible:
Always

Steps to Reproduce:
1. Start the JBoss ON system.
2. Import an agent into inventory.
3. Verify that all agent resources are reported as UP.
4. Shut down the database and wait for 5 minutes.
5. Start the database immediately after seeing the agent log message:

INFO [RHQ Agent Ping Thread-1] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.ping-executor.start-polling-after-exception}Starting polling to determine sender status (server ping failed)

Actual results:
All of the agent's resources are marked with "unknown" availability and the platform resource is marked as "down". Additionally, the server log reports that the affected agents have been backfilled:

2014-05-05 18:07:51,277 INFO [org.quartz.core.QuartzSchedulerThread] releaseTriggerRetryLoop: connection restored.
2014-05-05 18:08:08,193 INFO [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [Agent 01053671 on jboss.example.com] since [Mon May 05 17:59:40 CDT 2014]. Will be backfilled since we suspect it is down

Expected results:
The agent's inventory continues to report the correct availability state.

Additional info:
To summarize what I think is happening here, I am going to oversimplify things. The agent's ping fails because the server is not able to update the agent's last ping time in the database (the database is down). The server is also not able to backfill anything because the database is down, so it cannot run its "check for stale agents" job. However, as soon as the database comes back online, there is up to a one-minute window in which the "check for stale agents" job can execute before the agent's ping is restored.
If the server checks for stale agents before the ping is restored, it will see that the agent's last ping time was prior to the database going offline. If that is longer than 5 minutes ago, the agent gets backfilled. Because the agent never disconnected and does not know it has been backfilled, it continues to report only availability changes. Therefore, the agent's inventory will only reflect availability changes that have occurred since it was backfilled.
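The timing described above can be sketched as a minimal, self-contained model. This is illustrative only; the class, method, and constant names are hypothetical, not RHQ's actual API. It just shows why a stale last-ping timestamp left over from the outage trips the 5-minute backfill threshold when the job fires before the ping is restored.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical, simplified model of the server's "check for stale agents" job.
public class StaleAgentCheck {
    static final Duration BACKFILL_THRESHOLD = Duration.ofMinutes(5);

    /** Returns true if the agent should be backfilled (marked down/unknown). */
    public static boolean shouldBackfill(Instant lastPing, Instant now) {
        return Duration.between(lastPing, now).compareTo(BACKFILL_THRESHOLD) > 0;
    }

    public static void main(String[] args) {
        // Last successful ping write happened just before the DB went down.
        Instant lastPing = Instant.parse("2014-05-05T17:59:40Z");
        // The stale-agent job fires ~8 minutes later, right after the DB comes
        // back but before the agent's next successful ping updates the timestamp.
        Instant jobRuns = lastPing.plus(Duration.ofMinutes(8));
        System.out.println("backfill? " + shouldBackfill(lastPing, jobRuns)); // true
    }
}
```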
I have a question on the replication procedure... it says:

> 4. Shutdown database and wait for 5 minutes.
> 5. Start database immediately after seeing agent log message:

Do I stop the DB, wait for 5 minutes (regardless of agent log messages), AND THEN start the DB again upon seeing the agent log message? Or do I stop the DB and start it back up immediately when I see the agent log message, even if 5 minutes have not yet expired?
(In reply to John Mazzitelli from comment #3)
> I have a question on the replication procedures... it says:
>
> 4. Shutdown database and wait for 5 minutes.
> 5. Start database immediately after seeing agent log message:
>
> Do I stop the DB, wait for 5 minutes (regardless of agent log messages) AND
> THEN upon seeing the agent log message start the DB again?

Sorry about that. Yes. You shut down the database, wait for at least 5 minutes, and then, after the 5 minutes have passed, start the database as soon as you see the agent log message. This is to ensure that the database has adequate time to start before the agent re-attempts its ping.
BZ 913606 appears to address this issue (or at least a very similar one).
I replicated on 3.1.2. I really think the fix for bug #913606 will fix this issue as well. It's almost the same thing. Specifically, git commit 277c8b1fc4379e6a652931b7db1bbdd4ce3ce5ac will hopefully fix this issue in 3.1.2.
(In reply to John Mazzitelli from comment #6)
> I replicated on 3.1.2. I really think the fix for bug #913606 will fix this
> issue as well. Its almost the same thing.

WORKAROUND: Once the issue is replicated and the agent is backfilled, the workaround to get its resources back to normal (without restarting the server or agent) is to execute the agent prompt command "avail". This sends up a full availability report and puts the agent's resources back into their proper availability state.
(In reply to John Mazzitelli from comment #6)
> I replicated on 3.1.2. I really think the fix for bug #913606 will fix this
> issue as well. Its almost the same thing.
>
> Specifically, git commit 277c8b1fc4379e6a652931b7db1bbdd4ce3ce5ac will
> hopefully fix this issue in 3.1.2.

I cherry-picked that commit to a local release/jon3.1.x branch on my box, rebuilt, and re-ran the test. I can no longer replicate the issue. I think that commit fixes this issue as well as the original issue.
I am adding bug 913606 to this bug's depends list, as I think 913606 is the logical upstream version of this bug.
When the agent starts up and registers with the server, if you see this WARN message in the agent log: "The server thinks this agent is down. Will notify the server of up-to-date information when possible." but the bug still shows up, that is very bad. Most likely there is some kind of race condition between the agent sending an avail report and the server doing the backfill.
I'm thinking that, to protect against this condition further, perhaps we should check whether the agent is backfilled anytime we get a "changes-only" avail report from it. If it is backfilled, we should tell the agent to send us a full report next time. The problem with that is it might make avail report processing slower, since we would potentially be doing another query. (I will have to look at the code; if we are already fetching the agent data we need, maybe we already have the information without needing another query.)
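The idea in the previous comment could look roughly like this. A minimal sketch with hypothetical names (Agent, mergeChangesOnlyReport, fullReportRequested are illustrative, not RHQ's real API): when a changes-only report arrives from an agent the server has backfilled, don't trust the partial data; instead flag that a full report is needed.

```java
// Hypothetical sketch: reject a changes-only avail report from a backfilled
// agent and request a full report instead.
public class AvailReportMerge {
    public static class Agent {
        public boolean backfilled;
        public boolean fullReportRequested;
        public Agent(boolean backfilled) { this.backfilled = backfilled; }
    }

    /** Returns true if the changes-only report was merged, false if a full report was requested instead. */
    public static boolean mergeChangesOnlyReport(Agent agent) {
        if (agent.backfilled) {
            // The extra check mentioned in the comment: the agent is backfilled,
            // so a changes-only report cannot restore correct availability.
            agent.fullReportRequested = true;
            return false;
        }
        return true; // normal merge path
    }
}
```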
After more examination, Mazz and I think we have pinpointed the issue. Assuming the analysis is correct, the problem affects all release lines. Currently testing the fix against 3.1.x. If successful, we'll also apply it to 3.2.x (same code change as 3.1.x) and master (slightly different approach).
Fix applied to release/jon3.1.x branch:

commit 732521cc576be356401b5b6b54e2b6b8124e569f
Merge: b47c483 7f42302
Author: jshaughn <jshaughn>
Date:   Mon May 19 13:16:28 2014 -0400

    Merge pull request #37 from rhq-project/jshaughn/1094540

    Jshaughn/1094540

Reviewed by mazz and loleary. Approved for 3.1.x.

commit 7f423025d54173fb438713d8449f0d79675c0622
Author: Jay Shaughnessy <jshaughn>
Date:   Mon May 19 12:24:33 2014 -0400

    Don't unset the backfill flag in the ping logic, just make sure the agent
    knows we need a full avail report sent and defer the unsetting to when we
    actually get a report.
Fix applied to release/jon3.2.x branch:

commit d8622d88ea43b8333adf915e1ad90d33ef7fe676
Author: Jay Shaughnessy <jshaughn>
Date:   Mon May 19 13:27:55 2014 -0400

    Don't unset the backfill flag in the ping logic, just make sure the agent
    knows we need a full avail report sent and defer the unsetting to when we
    actually get a report.
Similar fix applied to master:

commit 94008542694eef157289e6f9884669480021b565
Merge: 618c6c9 dcc27a2
Author: jshaughn <jshaughn>
Date:   Wed May 21 13:33:15 2014 -0400

    Merge pull request #38 from rhq-project/jshaughn/1094540

commit dcc27a2c1f1acbf9fb818c92eb27ce278ef6db99
Author: Jay Shaughnessy <jshaughn>
Date:   Mon May 19 21:54:05 2014 -0400

    Don't unset the backfill flag in the ping logic, just make sure the agent
    knows the server thinks it's down, and let it resolve the situation by
    sending a full avail report, which when processed will unset the backfill
    flag.
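The before/after behavior described in these commit messages can be sketched as follows. This is a simplified illustration; AgentState and the handler methods are hypothetical names, not the actual RHQ classes. The point is that the ping path no longer clears the backfill flag itself; it only signals that a full report is needed, and the flag is cleared when that report is actually merged.

```java
// Hypothetical sketch of the fix: defer unsetting the backfill flag from the
// ping logic to the point where a full availability report is merged.
public class PingHandler {
    public static class AgentState {
        public boolean backfilled;
        public boolean needsFullAvailReport;
    }

    /** Old (buggy) behavior: the ping cleared the flag before any report arrived. */
    public static void handlePingOld(AgentState s) {
        s.backfilled = false;
    }

    /** Fixed behavior: only ask for a full report; leave the flag set. */
    public static void handlePingFixed(AgentState s) {
        if (s.backfilled) {
            s.needsFullAvailReport = true;
        }
    }

    /** Called when a full availability report is merged; this clears the flag. */
    public static void mergeFullAvailReport(AgentState s) {
        s.backfilled = false;
        s.needsFullAvailReport = false;
    }
}
```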
Via product triage, determined that this bug is to be included for DR01 target milestone.
Bug 918205 is probably a duplicate of this bug. The behavior I observed was that the agent was still sending messages to the server (metrics, etc.) but didn't tell the server it was back up.
Moving to ON_QA as available for test in the latest cumulative patch build (DR01): http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/5-29-2014/
I reproduced this issue on:
Version: 3.2.0.GA Update 02
Build Number: 055b880:0620403

even when I used:

<validation>
    <validate-on-match>true</validate-on-match>
</validation>

which made the issue from bz 1095016 go away.
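For context, the `<validation>` element above belongs inside the datasource definition in the server's configuration (for a JON 3.2 server running on EAP 6, typically the standalone XML configuration). A minimal sketch of where it fits; the JNDI name, pool name, and elided elements here are illustrative, not a verbatim copy of the JON configuration:

```xml
<datasource jndi-name="java:jboss/datasources/RHQDS" pool-name="RHQDS">
    <!-- connection-url, driver, and security settings elided -->
    <validation>
        <!-- Validate the connection when it is matched from the pool, so a
             stale connection left over from a DB outage is not handed out. -->
        <validate-on-match>true</validate-on-match>
    </validation>
</datasource>
```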
This problem doesn't really have to do with the <validate-on-match> setting, although I suppose you need it to proceed after a DB loss. I'll be honest, I don't immediately see how this problem could occur with the fix in place. I'd probably ask for a re-test in DR2.
Right. The only reason the `<validate-on-match>true</validate-on-match>` configuration is needed is so that the issue identified in bug 1095016 doesn't prevent the back-fill job from being executed in the first place.

I would be curious to know what the test results/behavior from comment 19 were after the fix for BZ 1095016 had been applied. What I would expect is:

- agent reports ping fails
- server reports it cannot connect to the database
- database connection gets restored
- server reports its database connection has been reestablished
- server reports that it has not heard from the agent and that the agent is back-filled
- agent's resources show availability of "unknown" because they are backfilled
- agent reports ping has been restored and its next avail report is a "full" report type
- agent's resources now show correct availability
After a few more attempts I found the following:

1. The steps described in Larry's comment 21 were correct in 3 out of 5 tries (everything was OK).
2. The other 2 tries had the following result:
- agent reports ping fails
- server reports it cannot connect to the database
- database connection gets restored
- server reports its database connection has been reestablished
- server reports that it has not heard from the agent and that the agent is back-filled
- resources show availability of "unknown" because they are backfilled
- agent reports ping has been restored, BUT no avail report was visible in the agent's log
- resources show unknown availability until the agent's "avail -f" operation is called
- the agent's "avail -f" operation fixed the resources' availability

The only difference between the 3 rounds that were OK and the 2 rounds that failed was this: during the 2 rounds that failed, I stayed in the UI on the agent resource, which means the UI was periodically invoking an avail scan (I'm not sure exactly what the UI does here).
@Jay, it sounds like we might be missing something, based on comment 22. Perhaps the live availability check is breaking the fix somehow?
I think I know what is happening here. super annoying...
Moving to DR03 as didn't get included for earlier payload.
master commit 2fa7c545cb0225916ae27965a143de2a6f85748f
Author: Jay Shaughnessy <jshaughn>
Date:   Fri Jun 13 12:06:08 2014 -0400

    Clean up some of the live availability doco/workflow:
    - Fix the jdoc for DiscoveryAgentService.getCurrentAvailability() to
      reflect reality.
    - Fix ResourceManagerBean.getLiveResourceAvailability() to mark its avail
      report as a "ServerSideReport". This prevents it from interfering with
      server-agent backfill coordination. Also, correct some confusing inline
      doco and do some cleanup.
    - Fix AvailabilityManagerBean.mergeAvailabilityReport() such that logic
      checking for the backfill flag does not execute *after* the backfill
      flag has been reset. And clean up the overall logic a bit more.
    - Improve jdoc for
      AvailabilityManagerLocal.updateLastAvailabilityReportInNewTransaction()
      to clearly indicate its side-effect of clearing the backfill flag.
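The mergeAvailabilityReport() ordering problem named in that commit can be illustrated with a minimal sketch. The names here (AgentState, wasBackfilledBuggy, etc.) are hypothetical stand-ins for the real RHQ methods: the backfill flag must be read *before* the last-report-time update, because that update clears the flag as a side effect.

```java
import java.time.Instant;

// Hypothetical sketch of the ordering fix in mergeAvailabilityReport().
public class MergeOrdering {
    public static class AgentState {
        public boolean backfilled = true;
        public Instant lastReportTime;
    }

    /** Side effect noted in the commit: updating the report time clears the flag. */
    static void updateLastAvailabilityReport(AgentState s, Instant t) {
        s.lastReportTime = t;
        s.backfilled = false;
    }

    /** Buggy order: the flag is read after it has already been cleared. */
    public static boolean wasBackfilledBuggy(AgentState s) {
        updateLastAvailabilityReport(s, Instant.now());
        return s.backfilled; // always false at this point
    }

    /** Fixed order: read the flag first, then update. */
    public static boolean wasBackfilledFixed(AgentState s) {
        boolean wasBackfilled = s.backfilled;
        updateLastAvailabilityReport(s, Instant.now());
        return wasBackfilled;
    }
}
```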
Oops, I overwrote a change made by spinder to set this to DR3. Resetting...

BTW, he states that the original change did not make it into DR2. That's interesting and may have contributed to the re-test failure, but I believe there were still some issues, which I worked on and committed to master above. I will backport to 3.2.x in a moment...

release/jon3.2.x commit 95e3a6837eabf41808b71ed73792f636afeb7c00
Author: Jay Shaughnessy <jshaughn>
Date:   Fri Jun 13 12:06:08 2014 -0400

    Conflicts:
        modules/core/client-api/src/main/java/org/rhq/core/clientapi/agent/discovery/DiscoveryAgentService.java
        modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java

    Cherry-Pick master 2fa7c545cb0225916ae27965a143de2a6f85748f
Moving to ON_QA as available for test in the latest build: http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/6-28-2014/
Verified on:
Version: 3.2.0.GA Update 02
Build Number: 558495c:44428f7
*** Bug 918205 has been marked as a duplicate of this bug. ***
This has been verified and released in Red Hat JBoss Operations Network 3.2 Update 02 (3.2.2), available from the Red Hat Customer Portal [1].

[1]: https://access.redhat.com/jbossnetwork/restricted/softwareDetail.html?softwareId=31783