534270 – (RHQ-1082) if MM check fails, server should go into MM

Summary: if MM check fails, server should go into MM

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	RHQ-1082
Product:	RHQ Project
Classification:	Other
Component:	Core Server
Sub Component:
Version:	1.1
Hardware:	All
OS:	All
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	RHQ Project Maintainer
QA Contact:
Docs Contact:
URL:	http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:	jon30-bugs

Reported:	2008-11-07 16:39 UTC by John Mazzitelli
Modified:	2014-05-02 20:15 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-05-02 20:15:12 UTC
Embargoed:

Description John Mazzitelli 2008-11-07 16:39:00 UTC

If the server job that checks the server's MM (maintenance mode) flag fails to get to the database several times (i.e., it gets a SQL exception), the server should flip to MM, since it is probably in a bad state and we'll want agents to go to another server.
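A minimal sketch of the policy described above, assuming a periodic job that reads the operation mode through a hypothetical `ModeReader` interface and forces MAINTENANCE after a configurable number of consecutive failures. `ModeCheckJob`, `ModeReader`, `OperationMode`, `runOnce`, and the threshold are illustrative names and parameters, not RHQ's actual classes. The forced mode is held in memory only, because the database that would normally record it is the thing that can't be reached.

```java
import java.sql.SQLException;

/**
 * Hypothetical sketch: a periodic job that reads the server's operation mode
 * and forces MAINTENANCE after N consecutive database failures.
 */
class ModeCheckJob {

    /** Hypothetical stand-in for the server's operation modes. */
    public enum OperationMode { NORMAL, MAINTENANCE }

    /** Hypothetical abstraction over the DB read of the MM flag. */
    public interface ModeReader {
        OperationMode readMode() throws SQLException;
    }

    private final ModeReader reader;
    private final int maxConsecutiveFailures; // assumed threshold for "several times"
    private int consecutiveFailures = 0;
    private OperationMode currentMode = OperationMode.NORMAL;

    public ModeCheckJob(ModeReader reader, int maxConsecutiveFailures) {
        this.reader = reader;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Called once per scheduled run; returns the mode the server should now be in. */
    public OperationMode runOnce() {
        try {
            currentMode = reader.readMode();
            consecutiveFailures = 0; // a successful read resets the failure count
        } catch (SQLException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= maxConsecutiveFailures) {
                // The DB has been unreachable too many times in a row:
                // assume a bad state so agents fail over to another server.
                currentMode = OperationMode.MAINTENANCE;
            }
            // Below the threshold, keep the last known mode.
        }
        return currentMode;
    }
}
```

Counting consecutive failures (rather than total failures) means a single transient SQL exception never pushes the server into MM on its own.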

Comment 1 Jay Shaughnessy 2008-11-07 17:00:45 UTC

We need to discuss what the behavior should be. In general, if the DB can't be reached, it may be best to actually shut down the server. Putting the server into MM is a bit of a chicken-and-egg problem: going into MM requires that the operation mode be updated *in* the database.

Comment 2 John Mazzitelli 2008-11-07 17:13:41 UTC

In our case, we used another server in the cloud to flip the bit to MM.

But the server that was put into MM didn't know it, because it couldn't get to the DB to read it.

Comment 3 Jay Shaughnessy 2008-11-07 17:34:32 UTC

That's an interesting scenario, but I still think this is more an issue of what to do when the DB is hosed. If the DB is unreachable, nothing is really going to work well. If we can agree on what constitutes a DB failure (and the server manager job being unable to get the current operation mode is very possibly a good candidate), we may want to consider taking the server down, perhaps also trying to send an email to the rhqadmin email address. Once the DB is verified as stable, the admin can bring the server(s) back up. If we implement the other feature about starting a server in MM, that may tie in well here, if necessary.

Comment 4 Jay Shaughnessy 2008-11-11 17:13:00 UTC

This is an edge case. In general, failing this read from the DB indicates a severe condition (the DB is down), but we'll retry after a delay in case the condition is temporary. If we still can't get the operation mode, then force MM.

Comment 5 John Mazzitelli 2008-11-11 17:28:15 UTC

We need to put this into 1.2 to avoid the problem where a server can't even determine whether it should be in MM. When this happens, the server needs to retry that check after a pause of a few seconds and, if it still fails, immediately assume the mode is MAINTENANCE.
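The retry-then-assume-MAINTENANCE behavior proposed in Comments 4 and 5 could look roughly like the following. This is a sketch, not the fix that was committed to RHQ: `OperationModeCheck`, `determineMode`, and the local `OperationMode` enum are hypothetical, the database read is abstracted as a `java.util.concurrent.Callable`, and any exception from that read (not just a SQL exception) is treated as a failure.

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch: read the operation mode; on failure, pause and retry
 * once; if the retry also fails, assume MAINTENANCE.
 */
final class OperationModeCheck {

    /** Hypothetical stand-in for the server's operation modes. */
    enum OperationMode { NORMAL, MAINTENANCE }

    private OperationModeCheck() {}

    /**
     * @param readFromDb  hypothetical DB read of this server's operation mode
     * @param pauseMillis wait before the single retry ("a few seconds" per Comment 5)
     */
    static OperationMode determineMode(Callable<OperationMode> readFromDb, long pauseMillis) {
        try {
            return readFromDb.call();
        } catch (Exception firstFailure) {
            // First read failed; the DB may only be briefly unavailable, so retry below.
        }
        try {
            Thread.sleep(pauseMillis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return OperationMode.MAINTENANCE;   // can't complete the retry; be conservative
        }
        try {
            return readFromDb.call();
        } catch (Exception secondFailure) {
            // Still can't determine the mode: assume MAINTENANCE so agents go elsewhere.
            return OperationMode.MAINTENANCE;
        }
    }
}
```

A single short retry keeps the window small during which a server with a broken DB connection keeps accepting agents, while still tolerating a momentary hiccup.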

Comment 6 Red Hat Bugzilla 2009-11-10 20:23:32 UTC

This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1082
This bug relates to RHQ-921