Bug 725881

Summary: Ensure server restarts are safe for running agents
Product: [Other] RHQ Project
Reporter: Jay Shaughnessy <jshaughn>
Component: Core Server
Assignee: RHQ Project Maintainer <rhq-maint>
Status: NEW
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: high
Version: 4.1
CC: hrupp
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=534375
Doc Type: Bug Fix
Bug Depends On:    
Bug Blocks: 678340, 729848    

Description Jay Shaughnessy 2011-07-26 16:30:16 EDT
Fast server restarts can happen undetected by the rhq agent. This can
leave corrupted entries in the rhq_agent table.  Running agents will 
not re-register after their server restarts, leaving their 
rhq_agent.server_id null.  This can lead to various issues, from the
agent being ignored when loading server-side caches (like the alert
condition cache) to improper agent counts in the admin server list.

Ensuring a slow server start time is the current workaround, but this
introduces an artificial delay and works against the general desire for
fast startup times, which are becoming more achievable with better
hardware and AS releases.
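
To illustrate the agent-count symptom described above, here is a minimal sketch. It assumes the admin server list counts agents by grouping on rhq_agent.server_id; the class and method names are hypothetical and are not RHQ code. An agent whose server_id was nulled during a fast restart drops out of every server's count even though it is still running.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not an RHQ class.
class AgentCountSketch {

    /**
     * Counts agents per server from a list of rhq_agent.server_id values.
     * A null entry stands for an agent whose server_id was cleared and never
     * restored; it is skipped, so it contributes to no server's count.
     */
    static Map<Integer, Integer> countAgentsPerServer(List<Integer> agentServerIds) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Integer serverId : agentServerIds) {
            if (serverId == null) {
                continue; // running agent that did not re-register
            }
            counts.merge(serverId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three running agents; the second lost its server_id after a fast restart.
        List<Integer> serverIds = Arrays.asList(1, null, 2);
        System.out.println(countAgentsPerServer(serverIds));
    }
}
```

Three agents are running in this example, but only two are counted.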
Comment 1 Jay Shaughnessy 2011-07-26 16:41:55 EDT
(2:19:15 PM) jshaughn: ok, so the null server_id for a running agent is most likely explained by a fast server restart
(2:21:09 PM) jshaughn: we do null the server_id refs on all relevant agents for a server starting up
(2:21:41 PM) jshaughn: it's in the infamous servermanagebean.establishCurrentServerMode()
(2:21:54 PM) mazz: ok
(2:21:54 PM) jshaughn: it's also true coming out of maintenance mode
(2:22:15 PM) mazz: just up the rhq-server.properties setting like we have
(2:22:19 PM) jshaughn: so riddle me this, why not have the server try and tell the agent to re-register
(2:22:29 PM) mazz: when?
(2:22:44 PM) jshaughn: at the same time as the server startup
(2:22:57 PM) mazz: is the comm layer even up at that point?
(2:23:00 PM) jshaughn: to try and ensure the agent doesn't miss a fast startup
(2:23:04 PM) jshaughn: dunno
(2:23:06 PM) jshaughn: maybe not
(2:23:12 PM) jshaughn: but perhaps it could be deferred
(2:23:49 PM) mazz: what if the agent is already talking to another server?
(2:23:57 PM) jshaughn: hmmm
(2:24:09 PM) jshaughn: ah, then it's not a problem
(2:24:21 PM) jshaughn: because they wouldn't be associated with the server coming up
(2:24:36 PM) mazz: my only reservation is this means we have to shotgun N messages to agents, where N is the number of agents that should be talking to this server. Right?
(2:24:48 PM) mazz: we don't want to send this to ALL agents, that's a non starter
(2:25:02 PM) mazz: but I suppose we could attempt to shotgun to all agents that should be talking to this server
(2:25:09 PM) mazz: just do a fire and forget
(2:25:13 PM) jshaughn: yeah, it would be one message to try and reach the agents that were talking to this server
(2:25:32 PM) mazz: if the agent registers, great, otherwise, log a message saying something like we can't tell it to register and to watch out
(2:26:04 PM) jshaughn: yeah, something like that.  If the message can't be delivered then we can assume it was down anyway
(2:26:29 PM) jshaughn: this is only for agents that still think they were talking to a living server
(2:52:57 PM) jshaughn: another way we could solve the restart problem above would be to simply not clear the server_id column when a server comes up or goes down.  Just leave it alone and let dead agents just stay listed.
(2:53:34 PM) jshaughn: Any live agent would still be listed correctly.
(2:55:02 PM) jshaughn: also, I don't know why, when an agent shuts down gracefully, we don't clear this value.
(2:55:14 PM) mazz: there's a general issue there
(2:55:16 PM) mazz: it's in a BZ
(2:56:36 PM) mazz: the agent tells the server that it is going down
(2:56:41 PM) mazz: but we don't do anything with it
(2:56:47 PM) mazz: other than log a message in the server log
(2:56:50 PM) jshaughn: I also don't know why when an agent goes down gracefully we don't set the avail to DOWN
(2:57:10 PM) mazz: there is a BZ that says, "we should do something, since we know the agent is going down, like backfilling it immediately"
(2:57:19 PM) mazz: there is no reason why we don't 
(2:57:24 PM) mazz: we just never implemented it
(2:57:28 PM) mazz: the method is there
(2:57:31 PM) mazz: the message is there
(2:57:36 PM) mazz: we are told when the agent shuts down
(2:57:39 PM) mazz: we just don't do anything
(2:57:46 PM) mazz: no reason other than, no time or personnel to do it
(2:57:59 PM) jshaughn: these things all sort of tie together
(2:58:56 PM) mazz: https://bugzilla.redhat.com/show_bug.cgi?id=534375
(2:59:01 PM) mazz: only 2 and a half years old :)
(2:59:35 PM) mazz: "We should make sure we do all the stuff we can do in here. For example, in the HAAC view of the servers, the agent count doesn't go down when an agent shuts down. We should ensure that the agent count goes down when the agent tells us it is shutting down.  We could also clear the alert cache for that agent to lower the footprint of the cache."
(3:00:02 PM) jshaughn: also: https://bugzilla.redhat.com/show_bug.cgi?id=701092
(3:00:20 PM) mazz: yup - that's the backfilling
(3:00:28 PM) mazz: larry even attached a patch to my old BZ
(3:00:31 PM) mazz: that does that :}
(3:00:54 PM) jshaughn: yeesh, ok, I'm moving this to block rhq3
(3:02:42 PM) mazz: larry may be doing too much in that patch
(3:02:49 PM) mazz: the backfill should already set avails
(3:02:54 PM) jshaughn: so, did you say there was an issue with:
(3:03:00 PM) mazz: also, I would not be surprised if backfilling also cleared the agent cache
(3:03:26 PM) jshaughn: (2:52:57 PM) jshaughn: another way we could solve the restart problem above would be to simply not clear the server_id column when a server comes up or goes down.  Just leave it alone and let dead agents just stay listed.
(3:03:40 PM) jshaughn: backfilling does not clear the cache
(3:06:29 PM) mazz: I can go either way with clearing/not clearing server_id
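
The fire-and-forget idea from the first part of this discussion could be sketched roughly as follows. This is only an illustration under stated assumptions: AgentRow, AgentTransport, and notifyOnStartup are hypothetical names, not RHQ classes or APIs, and the sketch keeps the current behavior of clearing server_id at startup, which the discussion left undecided. Only agents still associated with the starting server are contacted; a failed delivery is reported so it can be logged and is otherwise ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types exist in RHQ.
class ReRegisterSketch {

    /** Stand-in for a row in the rhq_agent table. */
    static final class AgentRow {
        final String name;
        Integer serverId; // null means "not attached to any server"

        AgentRow(String name, Integer serverId) {
            this.name = name;
            this.serverId = serverId;
        }
    }

    /** Assumed one-way transport; returns false if the agent could not be reached. */
    interface AgentTransport {
        boolean sendReRegisterRequest(String agentName);
    }

    /**
     * For a server that is starting up: clear server_id on the agents that
     * still point at it and ask each of them, once, to re-register.
     * Returns the names of agents that could not be reached.
     */
    static List<String> notifyOnStartup(int startingServerId, List<AgentRow> agents,
                                        AgentTransport transport) {
        List<String> unreachable = new ArrayList<>();
        for (AgentRow agent : agents) {
            if (agent.serverId == null || agent.serverId != startingServerId) {
                continue; // agent belongs to another server or to none; leave it alone
            }
            agent.serverId = null; // restored only when the agent re-registers
            boolean delivered;
            try {
                delivered = transport.sendReRegisterRequest(agent.name);
            } catch (RuntimeException e) {
                delivered = false; // fire and forget: no retry
            }
            if (!delivered) {
                unreachable.add(agent.name);
            }
        }
        return unreachable;
    }

    public static void main(String[] args) {
        List<AgentRow> agents = Arrays.asList(
            new AgentRow("agent-a", 1), new AgentRow("agent-b", 1), new AgentRow("agent-c", 2));
        // Pretend agent-b is down and cannot be reached.
        AgentTransport transport = name -> !name.equals("agent-b");
        for (String name : notifyOnStartup(1, agents, transport)) {
            System.out.println("Could not ask " + name + " to re-register; assuming it is down");
        }
    }
}
```

The number of messages is bounded by the agents previously attached to the restarting server, which addresses the concern about not sending to all agents.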
Comment 2 Charles Crouch 2011-07-26 17:33:55 EDT
(4:25:22 PM) ccrouch: ok, i'll settle for this:
(4:25:22 PM) ccrouch: getting 725881 scheduled for the next sprint, and before we start coding post to the devel list a detailed summary of the issue and how this will fix it
(4:25:56 PM) ccrouch: that way we start getting more folks familiar with it, without requiring a full scale public inquiry to get anything done
(4:29:12 PM) ccrouch: then if people have comments, ideas they can contribute to them