Bug 1094540 - Agents get permanently backfilled if backend database goes offline for a moment
Summary: Agents get permanently backfilled if backend database goes offline for a moment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Inventory
Version: JON 3.1.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: DR03
Target Release: JON 3.2.2
Assignee: Jay Shaughnessy
QA Contact: Mike Foley
URL:
Whiteboard:
Duplicates: 918205
Depends On: 913606
Blocks: 918205 1095018
TreeView+ depends on / blocked
 
Reported: 2014-05-05 23:53 UTC by Larry O'Leary
Modified: 2018-12-06 16:25 UTC
CC: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1095018
Environment:
Last Closed: 2014-07-29 00:17:03 UTC
Type: Bug
Embargoed:


Links:
    Red Hat Bugzilla 1095016 (CLOSED): Quartz stops working if database connection is reset (last updated 2021-02-22 00:41:40 UTC)
    Red Hat Knowledge Base (Solution) 753883

Internal Links: 1095016

Description Larry O'Leary 2014-05-05 23:53:40 UTC
Description of problem:
If the JBoss ON database is taken offline for backup or maintenance, or a network partition temporarily interrupts the JBoss ON server's communication with the database, agents can be permanently backfilled, resulting in stale availability reporting in the inventory.

Version-Release number of selected component (if applicable):
3.1.2.GA

How reproducible:
Always

Steps to Reproduce:
1.  Start the JBoss ON system.
2.  Import an agent into inventory.
3.  Verify that all agent resources are reported as UP.
4.  Shut down the database and wait for 5 minutes.
5.  Start the database immediately after seeing the agent log message:

        INFO  [RHQ Agent Ping Thread-1] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.ping-executor.start-polling-after-exception}Starting polling to determine sender status (server ping failed)

Actual results:
All of the agent's resources are marked with "unknown" availability and the platform resource is marked as "down". Additionally, the server log reports that the affected agents have been backfilled:

    2014-05-05 18:07:51,277 INFO  [org.quartz.core.QuartzSchedulerThread] releaseTriggerRetryLoop: connection restored.
    2014-05-05 18:08:08,193 INFO  [org.rhq.enterprise.server.core.AgentManagerBean] Have not heard from agent [Agent 01053671 on jboss.example.com] since [Mon May 05 17:59:40 CDT 2014]. Will be backfilled since we suspect it is down

Expected results:
Agent inventory continues to report correct availability state.

Additional info:
To summarize what I think is happening here, I am going to oversimplify things.

The agent's ping fails because the server is not able to update the agent's last ping time in the database (the database is down). The server is also not able to backfill anything because the database is down, so it cannot run its "check for stale agents" job. However, as soon as the database comes back online, there is up to a one-minute window in which the "check for stale agents" job can execute before the agent's ping is restored. If the server checks for stale agents before the ping is restored, it will see that the agent's last ping time was prior to the database going offline. If that is more than 5 minutes ago, the agent gets backfilled.

Because the agent never disconnected and doesn't know it has been backfilled, it continues to report only availability changes. Therefore, the agent's inventory will only reflect availability changes that have occurred since it was backfilled.
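
A hedged sketch of that window, with illustrative names (this is not the actual RHQ code, just the shape of the check described above):

    // Sketch: the periodic "check for stale agents" job. All names here are
    // illustrative assumptions, not the real RHQ implementation.
    import java.util.List;

    final class StaleAgentCheckSketch {
        static final long MAX_QUIET_MILLIS = 5 * 60 * 1000L; // 5-minute threshold

        record AgentRecord(String name, long lastPingMillis) {}

        // Runs on roughly a 1-minute schedule once the DB is reachable again.
        static void checkForStaleAgents(List<AgentRecord> agents, long nowMillis) {
            for (AgentRecord a : agents) {
                // While the DB was down, no ping times could be persisted, so
                // lastPingMillis still predates the outage. If this job fires
                // before the agent's next successful ping is recorded, the
                // agent is backfilled even though it never went down.
                if (nowMillis - a.lastPingMillis() > MAX_QUIET_MILLIS) {
                    backfill(a);
                }
            }
        }

        static void backfill(AgentRecord a) {
            System.out.println("Backfilling " + a.name()
                    + ": platform DOWN, child resources UNKNOWN");
        }
    }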

Comment 3 John Mazzitelli 2014-05-07 14:40:22 UTC
I have a question on the replication procedures... it says:

4.  Shut down the database and wait for 5 minutes.
5.  Start the database immediately after seeing the agent log message:

Do I stop the DB, wait for 5 minutes (regardless of agent log messages) AND THEN upon seeing the agent log message start the DB again?

Or do I stop the DB, and start it immediately back up when I see the agent log message, even if 5 minutes has not yet expired?

Comment 4 Larry O'Leary 2014-05-07 15:09:06 UTC
(In reply to John Mazzitelli from comment #3)
> I have a question on the replication procedures... it says:
> 
> 4.  Shut down the database and wait for 5 minutes.
> 5.  Start the database immediately after seeing the agent log message:
> 
> Do I stop the DB, wait for 5 minutes (regardless of agent log messages) AND
> THEN upon seeing the agent log message start the DB again?

Sorry about that.

Yes. You shut down the database... wait for at least 5 minutes... then, after the 5 minutes have passed, start the database as soon as you see the agent log message. This is to ensure that the database has adequate time to start before the agent re-attempts its ping.

Comment 5 John Mazzitelli 2014-05-07 15:46:27 UTC
BZ 913606 appears to address this issue (or at least very similar)

Comment 6 John Mazzitelli 2014-05-07 18:30:13 UTC
I replicated this on 3.1.2. I really think the fix for bug #913606 will fix this issue as well. It's almost the same thing.

Specifically, git commit 277c8b1fc4379e6a652931b7db1bbdd4ce3ce5ac will hopefully fix this issue in 3.1.2.

Comment 7 John Mazzitelli 2014-05-07 18:33:29 UTC
(In reply to John Mazzitelli from comment #6)
> I replicated on 3.1.2. I really think the fix for bug #913606 will fix this
> issue as well. Its almost the same thing.

WORKAROUND: Once the issue is replicated and the agent is backfilled, the workaround to get its resources back to normal (without restarting the server or agent) is to execute the prompt command "avail". This sends up a full avail report and puts the agent's resources back into their proper availability state.
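
For reference, the command is typed at the agent's interactive prompt; comment 22 below also uses the forced variant. An illustrative session (prompt rendering approximate):

    > avail
    > avail -f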

Comment 8 John Mazzitelli 2014-05-07 20:29:13 UTC
(In reply to John Mazzitelli from comment #6)
> I replicated on 3.1.2. I really think the fix for bug #913606 will fix this
> issue as well. Its almost the same thing.
> 
> Specifically, git commit 277c8b1fc4379e6a652931b7db1bbdd4ce3ce5ac will
> hopefully fix this issue in 3.1.2.

I cherry picked that commit to a local release/jon3.1.x branch on my box, rebuilt, and re-ran the test. I can no longer replicate the issue. I think that commit fixes this issue as well as that original issue.

Comment 9 Larry O'Leary 2014-05-15 00:23:48 UTC
I am adding bug 913606 to this bug's depends list, as I think 913606 is the logical upstream version of this bug.

Comment 10 John Mazzitelli 2014-05-16 20:53:33 UTC
When the agent starts up and registers with the server, if you see this WARN message in the agent log:

    "The server thinks this agent is down. Will notify the server of up-to-date information when possible."

but the bug still shows up, that is very bad. It is most likely some kind of race condition between the agent sending an avail report and the server doing the backfill.

Comment 11 John Mazzitelli 2014-05-16 21:24:23 UTC
I'm thinking that, to protect against this condition further, perhaps we should check whether the agent is backfilled any time we get a "changes-only" avail report from it. If it is backfilled, we should tell the agent to send us a full report next time. The problem with that is it might make avail report processing slower, since we would potentially be doing another query. I will have to look at the code; if we are already fetching the agent data we need, we may already have the information without another query.
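
Roughly, the idea is (a hedged sketch; names are illustrative, not the RHQ API):

    // Sketch: on receiving a "changes-only" avail report, consult the
    // backfill flag and, if set, ask the agent for a full report next time.
    final class ChangesOnlyCheckSketch {
        static final class AgentStatus {
            volatile boolean backfilled;
        }

        // Returns true if the agent should be told to send a full report on
        // its next cycle. The potential cost noted above is that reading the
        // backfill flag may require an extra query, unless the agent data is
        // already loaded as part of report processing.
        static boolean onChangesOnlyReport(AgentStatus status) {
            return status.backfilled;
        }
    }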

Comment 12 Jay Shaughnessy 2014-05-19 15:00:39 UTC
After more examination, Mazz and I think we have pinpointed the issue. Assuming the analysis is correct, the problem affects all release lines. We are currently testing the fix against 3.1.x; if successful, we'll also apply it to 3.2.x (same code change as 3.1.x) and master (slightly different approach).

Comment 13 Jay Shaughnessy 2014-05-19 17:21:39 UTC
Fix applied to release/jon3.1.x branch:

commit 732521cc576be356401b5b6b54e2b6b8124e569f
Merge: b47c483 7f42302
Author: jshaughn <jshaughn>
Date:   Mon May 19 13:16:28 2014 -0400

    Merge pull request #37 from rhq-project/jshaughn/1094540
    
    Jshaughn/1094540 reviewed by mazz and loleary.  Approved for 3.1.x.

commit 7f423025d54173fb438713d8449f0d79675c0622
Author: Jay Shaughnessy <jshaughn>
Date:   Mon May 19 12:24:33 2014 -0400

    Don't unset the backfill flag in the ping logic, just make
    sure the agent knows we need a full avail report sent and defer
    the unsetting to when we actually get a report.
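
In rough terms, the change amounts to the following (a hedged sketch with illustrative names, not the actual RHQ code):

    // Sketch of the fix described in the commit message above.
    final class PingHandlerSketch {
        static final class AgentStatus {
            volatile boolean backfilled;
            volatile long lastPingMillis;
        }

        // Old behavior: the ping handler itself cleared the backfill flag,
        // so the server could "forget" it was still waiting for a full
        // availability report from the agent.
        static void handlePingOld(AgentStatus s, long nowMillis) {
            s.lastPingMillis = nowMillis;
            s.backfilled = false; // cleared too early
        }

        // New behavior: the ping only records the time and tells the agent
        // whether a full report is needed; the flag is unset later, when a
        // full report is actually processed.
        static boolean handlePingNew(AgentStatus s, long nowMillis) {
            s.lastPingMillis = nowMillis;
            return s.backfilled; // true => agent should send a full avail report
        }

        static void processFullAvailabilityReport(AgentStatus s) {
            // ... merge resource availabilities here ...
            s.backfilled = false; // deferred unset, per the commit message
        }
    }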

Comment 14 Jay Shaughnessy 2014-05-19 17:30:31 UTC
Fix applied to release/jon3.2.x branch:

commit d8622d88ea43b8333adf915e1ad90d33ef7fe676
Author: Jay Shaughnessy <jshaughn>
Date:   Mon May 19 13:27:55 2014 -0400

    Don't unset the backfill flag in the ping logic, just make
    sure the agent knows we need a full avail report sent and defer
    the unsetting to when we actually get a report.

Comment 15 Jay Shaughnessy 2014-05-21 18:00:29 UTC
Similar fix applied to master:

commit 94008542694eef157289e6f9884669480021b565
Merge: 618c6c9 dcc27a2
Author: jshaughn <jshaughn>
Date:   Wed May 21 13:33:15 2014 -0400

    Merge pull request #38 from rhq-project/jshaughn/1094540

commit dcc27a2c1f1acbf9fb818c92eb27ce278ef6db99
Author: Jay Shaughnessy <jshaughn>
Date:   Mon May 19 21:54:05 2014 -0400

    Don't unset the backfill flag in the ping logic, just make
    sure the agent knows the server thinks its down, and let it
    resolve the situation by sending a full avail report, which when
    processed will unset the backfill flag.

Comment 16 Simeon Pinder 2014-05-22 15:15:36 UTC
Via product triage, it was determined that this bug is to be included for the DR01 target milestone.

Comment 17 Elias Ross 2014-05-22 16:04:18 UTC
Bug 918205 is probably a duplicate of this bug.

The behavior I observed was that the agent was still sending messages to the server (metrics, etc.) but didn't tell the server it was back up.

Comment 18 Simeon Pinder 2014-05-30 02:43:33 UTC
Moving to ON_QA as available for test in latest cumulative patch build(DR01):
http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/5-29-2014/

Comment 19 Filip Brychta 2014-06-02 15:28:53 UTC
I reproduced this issue on Version 3.2.0.GA Update 02, Build Number 055b880:0620403

even when I used

<validation>
      <validate-on-match>true</validate-on-match>
</validation>

which made the issue from bz 1095016 go away.
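
For context, that element sits inside the server's datasource definition; a hedged sketch of the surrounding structure, with illustrative names and everything else omitted:

    <datasource jndi-name="java:jboss/datasources/RHQDS" pool-name="RHQDS">
        <!-- connection-url, driver, pool and security settings omitted -->
        <validation>
            <!-- validate each connection as it is handed out, so a pooled
                 connection broken by the DB outage is detected and replaced -->
            <validate-on-match>true</validate-on-match>
        </validation>
    </datasource>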

Comment 20 Jay Shaughnessy 2014-06-02 18:33:34 UTC
This problem doesn't really have to do with the <validate-on-match> setting, although I suppose you need that to proceed after a DB loss.

I'll be honest, I don't immediately see how this problem could occur with the fix in place.  I'd probably ask for a re-test in DR2.

Comment 21 Larry O'Leary 2014-06-02 19:20:21 UTC
Right. The only reason the `<validate-on-match>true</validate-on-match>` configuration is needed is so that the issue identified in bug 1095016 doesn't prevent the back-fill job from being executed in the first place.

I would be curious to know what the test results/behavior were in comment 19 after the fix for BZ 1095016 had been applied. What I would expect is (the agent-side step is sketched after this list):

 - agent reports ping fails
 - server reports it cannot connect to the database
 - database connection gets restored
 - server reports its database connection has been reestablished
 - server reports that it has not heard from the agent and that the agent is back-filled
 - agent's resources show availability of unknown because they are backfilled
 - agent reports ping has been restored and its next avail report is a "full" report type
 - agent's resources show correct availability now
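
A hedged sketch of the agent-side step in that sequence (illustrative names, not the actual agent code):

    // Sketch: when a ping succeeds again after failures, the server's reply
    // indicates whether it considers the agent down (backfilled); if so, the
    // agent escalates its next availability report to a full report.
    final class AgentPingSketch {
        private volatile boolean sendFullReportNext;

        void onPingResponse(boolean serverThinksAgentIsDown) {
            if (serverThinksAgentIsDown) {
                sendFullReportNext = true;
            }
        }

        // Consulted when the next scheduled availability report is built.
        boolean consumeFullReportFlag() {
            boolean full = sendFullReportNext;
            sendFullReportNext = false;
            return full;
        }
    }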

Comment 22 Filip Brychta 2014-06-03 10:46:54 UTC
After a few more attempts I found the following:
1 - the steps described in Larry's comment 21 played out correctly in 3 out of 5 tries (everything was OK)
2 - 2 tries had the following result:
 - agent reports ping fails
 - server reports it cannot connect to the database
 - database connection gets restored
 - server reports its database connection has been reestablished
 - server reports that it has not heard from the agent and that the agent is back-filled
 - resources show availability of unknown because they are backfilled
 - agent reports ping has been restored BUT no avail report was visible in the agent's log
 - resources show unknown availability until the agent's avail -f operation is invoked
 - the agent's avail -f operation fixed the resources' availability

The only difference in approach between the 3 rounds which were OK and the 2 rounds which failed was the following:
- during the 2 rounds which failed I stayed in the UI on the agent resource, which means the UI was periodically invoking an avail scan (not sure what exactly the UI does here)

Comment 23 Larry O'Leary 2014-06-03 14:39:34 UTC
@Jay, it sounds like we might be missing something based on comment 22. 

Perhaps the live availability is breaking the fix somehow?

Comment 24 Jay Shaughnessy 2014-06-12 21:51:58 UTC
I think I know what is happening here. Super annoying...

Comment 25 Simeon Pinder 2014-06-13 14:39:04 UTC
Moving to DR03 as it didn't get included in an earlier payload.

Comment 26 Jay Shaughnessy 2014-06-13 16:09:25 UTC
master commit 2fa7c545cb0225916ae27965a143de2a6f85748f
Author: Jay Shaughnessy <jshaughn>
Date:   Fri Jun 13 12:06:08 2014 -0400

 Clean up some of the live availability doco/workflow:
 - Fix the jdoc for DiscoveryAgentService.getCurrentAvailability() to reflect
   reality
 - Fix ResourceManagerBean.getLiveResourceAvailability() to mark its avail report
   as a "ServerSideReport".  This prevents it from interfering with server-agent
   backfill coordination.  Also, correct some confusing inline doco and doe some
   cleanup.
 - Fix AvailabilityManagerBean.mergeAvailabilityReport() such that logic checking
   for the backfill flag does not execute *after* the backfill flag has been
   reset. And clean up the overall logic a bit more.
 - Improve jdoc for
   AvailabilityManagerLocal.updateLastAvailabilityReportInNewTransaction()
   to clearly indicate its side-effect of clearing the backfill flag.
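
A hedged sketch of the coordination problem the first two bullets address (illustrative names, not the actual RHQ types):

    // Live-availability checks triggered from the UI produce an internal
    // report. If such a report is treated like an agent-sent report, it can
    // clear the backfill flag without the agent ever sending the full report
    // the server is waiting for.
    final class MergeSketch {
        static final class AgentStatus {
            volatile boolean backfilled = true;
        }

        static final class AvailReport {
            final boolean serverSideReport; // true for UI live-availability checks
            final boolean changesOnly;

            AvailReport(boolean serverSideReport, boolean changesOnly) {
                this.serverSideReport = serverSideReport;
                this.changesOnly = changesOnly;
            }
        }

        static void mergeAvailabilityReport(AvailReport r, AgentStatus s) {
            // ... merge resource availabilities here ...
            // Only a full report actually sent by the agent may clear the
            // backfill flag; server-side reports no longer interfere with
            // server-agent backfill coordination.
            if (!r.serverSideReport && !r.changesOnly) {
                s.backfilled = false;
            }
        }
    }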

Comment 27 Jay Shaughnessy 2014-06-13 16:31:02 UTC
Oops, I overwrote a change made by spinder to set this to DR3. Resetting...

btw, he states that the original change did not make it into DR2. That's interesting and may have contributed to the re-test failure, but I believe there were still some issues, which I worked on and committed to master above; I will backport to 3.2.x in a moment...


release/jon3.2.x commit 95e3a6837eabf41808b71ed73792f636afeb7c00
Author: Jay Shaughnessy <jshaughn>
Date:   Fri Jun 13 12:06:08 2014 -0400

Conflicts:
    modules/core/client-api/src/main/java/org/rhq/core/clientapi/agent/discovery/DiscoveryAgentService.java
    modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/resource/ResourceManagerBean.java

Cherry-Pick master 2fa7c545cb0225916ae27965a143de2a6f85748f

Comment 28 Simeon Pinder 2014-06-30 06:03:03 UTC
Moving to ON_QA as available for test in latest build:
http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/6-28-2014/

Comment 29 Filip Brychta 2014-06-30 09:08:28 UTC
Verified on
Version :	
3.2.0.GA Update 02
Build Number :	
558495c:44428f7

Comment 30 Jay Shaughnessy 2014-07-18 18:39:31 UTC
*** Bug 918205 has been marked as a duplicate of this bug. ***

Comment 31 Larry O'Leary 2014-07-29 00:17:03 UTC
This has been verified and released in Red Hat JBoss Operations Network 3.2 Update 02 (3.2.2), available from the Red Hat Customer Portal [1].



[1]: https://access.redhat.com/jbossnetwork/restricted/softwareDetail.html?softwareId=31783

