Red Hat Bugzilla – Bug 536587
add command line option to start server in maintenance mode
Last modified: 2009-02-02 11:47:00 EST
(9:54:20 AM) jshaughn: mazz, ghinkle just to confirm a detail from that convo - yes, if a server goes down in MM it comes up in MM, allowing that maintenance session to pretty much do what it wants. MM is initiated and ended via HAAC. Note that the server does not have to be up when HAAC switches it from MM to NORMAL.
(9:55:05 AM) ghinkle: so... i could go in the db and switch it?
(9:55:13 AM) jshaughn: that too, of course :)
(9:55:33 AM) jshaughn: ghinkle: but only when the server is down
(9:55:36 AM) ghinkle: this would be if i brought a server down or it crashed or something... and i want to bring it up to investigate without having the agents beat it up
(9:55:42 AM) ghinkle: yup
(9:55:46 AM) jshaughn: right
(9:55:46 AM) ghinkle: good enough
(9:55:55 AM) jshaughn: but you could do it from HAAC as long as another server was up
(9:56:20 AM) jshaughn: might be a nice command line option for the single server scenario
(9:56:42 AM) ghinkle: right... i was thinking rhq-server.sh -maintenance
(9:56:45 AM) mazz: ah
(9:56:49 AM) mazz: I see - withOUT a server running
(9:56:50 AM) jshaughn: right
This option would be ignored when starting the server for the first time (i.e. at install time).
We could have the rhq-server script look for "-maintenance" on the command line and, if it sees it, add "-Drhq.server.maintenance-mode-at-startup=true" to the VM opts. Otherwise, "-Drhq.server.maintenance-mode-at-startup=false" should be passed. The property needs to always be passed to make it easier to support launching via Java Service Wrapper.
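The flag-to-property mapping proposed above can be sketched as follows. This is only an illustration in Java of the logic the launch script would carry out (the real script is shell); the class name, method name, and the "start" argument are hypothetical, and only the "-maintenance" flag and the property name come from this report.

```java
import java.util.Arrays;

public class MaintenanceFlagSketch {
    // Flag and property name as proposed in this bug report.
    static final String FLAG = "-maintenance";
    static final String PROP = "rhq.server.maintenance-mode-at-startup";

    // Always returns a -D option, true or false, so that a launcher with a
    // fixed argument list (such as a service wrapper) can rely on it being set.
    static String toVmOption(String[] scriptArgs) {
        boolean requested = Arrays.asList(scriptArgs).contains(FLAG);
        return "-D" + PROP + "=" + requested;
    }

    public static void main(String[] args) {
        // "start" is a hypothetical script argument used only for illustration.
        System.out.println(toVmOption(new String[] {"start", "-maintenance"}));
        System.out.println(toVmOption(new String[] {"start"}));
    }
}
```

The point of the sketch is that the option is emitted in both cases, rather than being omitted when the flag is absent.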
We could alternatively add this setting to the rhq-server.properties and you can manually set it to "true" in the file.
Then the code in StartupServlet would look for this setting and, if it is true, ignore what's in the DB and immediately set the mode to MM.
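A minimal sketch of that startup decision, assuming the setting is available as a plain property. The class, enum, and method names here are hypothetical and are not taken from the actual StartupServlet; only the property name comes from this report.

```java
import java.util.Properties;

public class StartupModeSketch {
    // Hypothetical stand-in for the server's operation mode.
    enum OperationMode { NORMAL, MAINTENANCE }

    // Property name as proposed in this bug report.
    static final String PROP = "rhq.server.maintenance-mode-at-startup";

    // If the setting is true, the mode stored in the DB is ignored and the
    // server comes up in maintenance mode; otherwise the DB value is kept.
    static OperationMode resolveStartupMode(Properties config, OperationMode modeInDb) {
        boolean forceMaintenance = Boolean.parseBoolean(config.getProperty(PROP, "false"));
        return forceMaintenance ? OperationMode.MAINTENANCE : modeInDb;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(PROP, "true");
        // The DB says NORMAL, but the setting overrides it.
        System.out.println(resolveStartupMode(config, OperationMode.NORMAL));
    }
}
```

Defaulting to "false" when the property is missing keeps the existing behaviour (use whatever the DB says) for servers that do not set it.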
Without something like this it's hard to recover from certain failure scenarios...
-Assume your server cloud is humming along and your DB goes down.
-You shut the JON servers down since they can't talk to the DB and are just throwing exceptions. [Note you can't put the servers in maintenance mode since you can't write to the DB.]
-You fix the DB and want to bring the Servers back up. Note agents stayed up spooling data during the DB outage.
-You go through and start the Servers.
While you are starting the Servers, the agents are polling all the servers every 60 seconds to see which might be up. An agent will try to talk to the first server it finds up, and it won't check whether its primary is up for another hour (or until you run the 'Download Latest Failover List' operation on the agent). So if you spend more than 1 minute starting each server, odds are *all* your agents are going to be trying to talk to the first server you started. With a large number of agents that is probably going to exceed the concurrency limits, so it may not be possible for all the agents to get properly connected (agents being rejected for exceeding concurrency limits does not result in failover). Without all agents successfully connected you can't reliably execute the 'Download Latest Failover List' operation, so your best bet is just to wait an hour for the agents to check again whether their primary server is alive.
This situation would be somewhat alleviated if you could start the servers in maintenance mode, because you could then switch them all at once to Normal mode.
(9:49:23 AM) mazz: we COULD have them update RHQ_SERVER, turn all of them into Maintenance Mode
(9:49:27 AM) mazz: only THEN start the servers
(9:49:33 AM) mazz: then go to Admin>ListServers
(9:49:41 AM) mazz: check all of them and "switch to normal"
(9:50:08 AM) mazz: hopefully, it creates less stress since all servers will come online at about the same time
(9:57:59 AM) mazz: the ONLY problem I see with this
(9:58:02 AM) ccrouch: and we assume that their DB can take 110 concurrent connections
(9:58:11 AM) mazz: putting the servers in MM is NOT immediate.
(9:58:24 AM) mazz: the servers have a 30s timer
(9:58:32 AM) mazz: wakes up, reads DB - "am I in MM?"
(9:58:46 AM) mazz: so there is a 30s lag between the first server going online and the last
(9:59:06 AM) mazz: in that 30s period, all the agents that happen to poll at that time will not connect to the server(s) not yet in MM
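The lag described in the chat above can be worked through with a small sketch. It assumes each server re-reads its mode from the DB on its own 30s timer; the timer phases and the time of the DB change are made-up example values, and the class and method names are hypothetical.

```java
public class ModePollLagSketch {
    // Poll interval mentioned in the chat: each server re-reads its mode every 30s.
    static final long POLL_MS = 30_000;

    // Delay (ms) between the DB change and the next timer tick of a server
    // whose ticks occur at phaseMs + k * POLL_MS.
    static long noticeDelay(long phaseMs, long changeAtMs) {
        long sinceLastTick = Math.floorMod(changeAtMs - phaseMs, POLL_MS);
        return sinceLastTick == 0 ? 0 : POLL_MS - sinceLastTick;
    }

    // Gap between the first and the last server noticing the change.
    static long spread(long[] phasesMs, long changeAtMs) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long phase : phasesMs) {
            long delay = noticeDelay(phase, changeAtMs);
            min = Math.min(min, delay);
            max = Math.max(max, delay);
        }
        return max - min;
    }

    public static void main(String[] args) {
        // Hypothetical timer phases for three servers; DB updated at t = 10s.
        long[] phases = {0, 25_000, 10_000};
        System.out.println(spread(phases, 10_000) + " ms between first and last server");
    }
}
```

With these example values the servers notice the change after 20s, 15s and 0s respectively, so the spread is 20s; whatever the phases, it stays under the 30s poll interval.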
Right now the only way to get downed servers into maintenance mode is to use SQL as mazz describes above; a command line option would be preferable.
Going to add a setting to rhq-server.properties. This way, you can configure the server via the UI config tab (and all the goodness that entails). It is also consistent with all our other server-specific configs.
1) updating StartupServlet to look at the config and change the server mode prior to comm module setup
2) updating the container build to add the new setting to the .properties
3) updating installer so you can set this setting at install time
4) updating rhq-server plugin so it can see the new setting in config tab
rhq-server plugin can now manage this new setting: https://jira.jboss.org/jira/browse/JOPR-41
QA Verified. Entered RHQ-1459 re: the location of the UI element, but the issue technically passes.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-921
This bug is related to RHQ-1082