Description of problem:

StorageClusterInitJob and StorageClusterCredentialsJob are wrongly implemented as Quartz jobs. With Quartz jobs there is no guarantee on which node of an HA environment they will run, and there is only one job instance for the entire HA environment. A job might therefore be scheduled on a different node each run, or it might run in perpetuity on one single node. The design dictates that these two jobs run on each server of an HA environment, so the current implementation is wrong. These two jobs, or the tasks performed by them, should be run via EJB timers.

Side effects of the current implementation:

1) StorageClusterInitJob:
- If the storage cluster is down when the HA RHQ servers are started, only one server will get its storage session initialized once the storage cluster becomes available.
- The first node to run the Quartz job configures the storage session for that particular node, and the job is then cancelled.
- The other nodes never get a properly initialized storage connection because the job is cancelled after the first successful run.

2) StorageClusterCredentialsJob:
- When the cluster credentials change, it is unpredictable if/when all HA nodes will get their cluster sessions restarted with the new, correct credentials.
- Because this is a Quartz job, the node that executes it cannot be predicted. The same HA node may execute the job in perpetuity.

This bug does not affect single-node RHQ installations, since there is only one executor for the Quartz jobs.

Steps to Reproduce (StorageClusterInitJob):
1. Create an environment with at least two HA server nodes and storage nodes
2. Make sure the storage cluster is not started
3. Start each of the servers without starting the storage cluster
4. Wait for each HA server to be initialized
5. Start the storage cluster

Actual results:
Only one HA server will be in Normal operational mode. The other HA servers will stay in maintenance mode and their storage sessions will not get initialized properly.

Expected results:
All HA servers get the storage session properly initialized.

Steps to Reproduce (StorageClusterCredentialsJob):
1. Create an environment with at least two HA server nodes and storage nodes
2. Start the storage cluster and the HA servers
3. Wait for each HA server to be initialized
4. Change the storage cluster password
5. Watch for session re-initialization events

Actual results:
Not all HA servers will have the storage session reinitialized with the new credentials. Some servers might take longer, and some might never get the storage session reinitialized.

Expected results:
All HA servers have the storage session reinitialized within minutes.

Additional info:
This bug only affects HA deployments. The fix is simple: just move the code to be executed via EJB timers; no functional changes are required.
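The core of the problem above is that a single shared Quartz job cannot initialize a per-server session on every node, while a timer owned by each server can. The following container-free Java sketch (not RHQ code; all names are hypothetical, and the real delays of 30s/90s are shortened to milliseconds) simulates several HA servers, each running its own repeating timer that retries session initialization until the storage cluster comes up:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class PerNodeTimerSketch {
    // The real fix uses a 30s initial delay and a 90s interval; shortened
    // here so the simulation finishes quickly.
    static final long INITIAL_DELAY_MS = 10;
    static final long INTERVAL_MS = 20;

    /** Simulates `nodes` HA servers, each with its own repeating timer. */
    public static boolean[] run(int nodes) throws InterruptedException {
        AtomicBoolean clusterUp = new AtomicBoolean(false);
        AtomicBoolean[] initialized = new AtomicBoolean[nodes];
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(nodes);
        for (int i = 0; i < nodes; i++) {
            AtomicBoolean done = initialized[i] = new AtomicBoolean(false);
            // Per-node timer: keeps retrying until THIS node's session is up.
            scheduler.scheduleAtFixedRate(() -> {
                if (clusterUp.get()) {
                    done.set(true); // stands in for initializing the session
                }
            }, INITIAL_DELAY_MS, INTERVAL_MS, TimeUnit.MILLISECONDS);
        }
        Thread.sleep(50);     // servers start while the cluster is still down
        clusterUp.set(true);  // storage cluster comes online
        Thread.sleep(200);    // give every node's timer a chance to fire again
        scheduler.shutdownNow();
        boolean[] result = new boolean[nodes];
        for (int i = 0; i < nodes; i++) result[i] = initialized[i].get();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        for (boolean b : run(3)) System.out.println("node initialized: " + b);
    }
}
```

With a per-node timer every server eventually initializes its own session; under the buggy Quartz model, the single job would run (and then be cancelled) on only one node.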
QE activities:
1) Please create documented TCMS test cases for these two HA scenarios
2) Please create a documented TCMS test case run for qualification
3) Please perform a general smoke test on HA (simulating outages, etc.)
Converted the functionality implemented by StorageClusterInitJob and StorageClusterCredentialsJob to an EJB timer on StorageClientManagerBean. This guarantees that session initialization and credential refresh happen on every single server of an HA environment. The timer runs at a 90-second interval after an initial wait of 30 seconds, which reduces the possibility of subsequent runs waiting on the credentials process to finish.

release/jon3.2.x branch commits:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=06c3109c8ef0f3438297ecc5706a6519822e19c6
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=eac44d57849e2fa2a2f2226d0f7827e91ffc3a75
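For reference, an interval timer with a 30-second initial delay and a 90-second period can be created on a singleton EJB roughly as follows. This is a hedged skeleton, not the actual RHQ source: the class and method bodies are hypothetical, and only the standard EJB 3.1 timer API calls are real. It is container-managed code and will not run standalone.

```java
import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerConfig;
import javax.ejb.TimerService;

// Hypothetical sketch of the approach, not the RHQ implementation.
@Singleton
@Startup
public class StorageClientTimerSketch {

    @Resource
    private TimerService timerService;

    @PostConstruct
    void schedule() {
        // Non-persistent timer: each server in the HA environment creates
        // its own, so the work runs on every node (unlike a Quartz job).
        TimerConfig config = new TimerConfig(null, false);
        timerService.createIntervalTimer(30_000L, 90_000L, config);
    }

    @Timeout
    void onTimer(Timer timer) {
        // Initialize the storage session / refresh credentials on THIS node.
    }
}
```

Because the timer is non-persistent and created in `@PostConstruct`, every server instance schedules its own copy at startup, which is exactly the per-node behavior the design requires.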
Slightly updated the code based on a code review from Jay. This change reduces the chance of a server not starting when the storage cluster is unavailable (the desired outcome). Also removed some redundant code left over from the Quartz jobs.

release/jon3.2.x branch commit:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=d69b30f4d2c8d89910619848d9e02e48b036d395
Moving to ON_QA as available for testing with new brew build.
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
verified