Bug 1028622

Summary: Storage Cluster Maintenance Jobs Not Created Properly
Product: [JBoss] JBoss Operations Network Reporter: Stefan Negrea <snegrea>
Component: Core Server Assignee: Stefan Negrea <snegrea>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: urgent Docs Contact:
Priority: urgent    
Version: JON 3.2 CC: ahovsepy, jkandasa, theute
Target Milestone: ER07   
Target Release: JON 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1029122 Environment:
Last Closed: 2014-01-02 20:36:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1029122    
Bug Blocks: 1012435    

Description Stefan Negrea 2013-11-08 21:41:25 UTC
Description of problem:
StorageClusterInitJob and StorageClusterCredentialsJob are wrongly created as Quartz jobs.

With Quartz jobs there is no guarantee as to which node of an HA environment they will run on, and there is only one job for the entire HA environment. So a job might be scheduled on a different node each run, or it might run in perpetuity on one single node.

The design dictates that these two jobs run on each server of an HA environment, so the current implementation is wrong to use Quartz jobs. These two jobs, or the tasks they perform, should be run via EJB timers.


Here are the side effects of the current implementation:
1) StorageClusterInitJob:
- If the storage cluster is down when the HA RHQ servers are started then only one server will get the storage session initialized once the storage cluster becomes available. 
- The first node to run the quartz job will configure the storage session for the particular node and then the job gets cancelled.
- The other nodes will never have a properly initialized storage connection because the job is cancelled on the first successful run.

2) StorageClusterCredentialsJob:
- When the cluster credentials change, it is unpredictable if/when all the HA nodes will get the cluster sessions restarted with the new correct credentials.
- Because this is a Quartz job, the node that executes it cannot be predicted. It is possible that the same HA node executes the job in perpetuity.
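
The StorageClusterInitJob side effect above can be illustrated with a small plain-Java simulation. This is only a sketch and not RHQ code: ServerNode, ClusterJobSimulation, and every method in it are hypothetical names, and the choice of which node executes the cluster-wide job is passed in by hand to stand in for the scheduler's unpredictable choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one HA server; not an RHQ class.
class ServerNode {
    final String name;
    boolean storageSessionInitialized = false;

    ServerNode(String name) {
        this.name = name;
    }

    // Stand-in for initializing the storage session on this server only.
    void initStorageSession() {
        storageSessionInitialized = true;
    }
}

class ClusterJobSimulation {

    // Builds a list of servers, none of which has a storage session yet.
    static List<ServerNode> newCluster(int size) {
        List<ServerNode> nodes = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            nodes.add(new ServerNode("server-" + i));
        }
        return nodes;
    }

    // Cluster-wide job: a single job exists for the whole HA environment.
    // The node that happens to execute it initializes its own session,
    // and the job is then cancelled, so no other node ever runs it.
    static void runClusterWideJob(List<ServerNode> nodes, int executingNodeIndex) {
        nodes.get(executingNodeIndex).initStorageSession();
    }

    // Per-server timers: every node runs its own copy of the task.
    static void runPerServerTimers(List<ServerNode> nodes) {
        for (ServerNode node : nodes) {
            node.initStorageSession();
        }
    }

    static long countInitialized(List<ServerNode> nodes) {
        return nodes.stream().filter(n -> n.storageSessionInitialized).count();
    }

    public static void main(String[] args) {
        List<ServerNode> withClusterJob = newCluster(3);
        runClusterWideJob(withClusterJob, 0);
        System.out.println("cluster-wide job: " + countInitialized(withClusterJob) + " of 3 initialized");

        List<ServerNode> withTimers = newCluster(3);
        runPerServerTimers(withTimers);
        System.out.println("per-server timers: " + countInitialized(withTimers) + " of 3 initialized");
    }
}
```

In this model the cluster-wide job leaves all but one server uninitialized regardless of which index is chosen, while the per-server variant initializes every node.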


This bug does not affect single RHQ node installations, since there is only one executor for the Quartz jobs.


Steps to Reproduce StorageClusterInitJob:
1. Create an environment with at least two HA server nodes and storage nodes
2. Make sure the storage cluster is not started
3. Start each of the servers without starting the storage cluster
4. Wait for each HA server to be initialized
5. Start the storage cluster

Actual results:
Only one HA server will be in Normal operational mode. The other HA servers will stay in maintenance mode and their storage sessions will not get initialized properly.

Expected results:
All HA servers get the storage session properly initialized.



Steps to Reproduce StorageClusterCredentialsJob:
1. Create an environment with at least two HA server nodes and storage nodes
2. Start the storage cluster and HA servers
3. Wait for each HA server to be initialized
4. Change the storage cluster password
5. Watch for session re-initialization events

Actual results:
Not all the HA servers will have the storage session reinitialized with the new credentials. Some servers might take longer, and some servers might never get the storage session reinitialized.

Expected results:
All HA servers have the storage session reinitialized within minutes.




Additional info:
This bug only affects HA deployments. The fix is simple: move the code so that it is executed via EJB timers. No functional changes are required.

Comment 1 Mike Foley 2013-11-08 21:47:01 UTC
QE activities:

1) please create documented TCMS Testcases for these 2 HA scenarios 
2) please create a documented TCMS Testcase run for qualification.
3) please perform a general smoketest on HA (simulating outages, etc...)

Comment 2 Stefan Negrea 2013-11-11 17:24:07 UTC
Converted the functionality implemented by StorageClusterInitJob and StorageClusterCredentialsJob to an EJB timer on the StorageClientManagerBean. This guarantees that session init and credential refresh happen on every single server of an HA environment. Also, the timer runs every 90 seconds after an initial wait of 30 seconds. This reduces the possibility of subsequent runs having to wait for the credentials process to finish.
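
For illustration only, the per-server timer described in this comment can be approximated in plain Java with a ScheduledExecutorService. This is a sketch under assumptions and not the committed EJB code: StorageSessionMaintenance, tick(), and the status strings are hypothetical, cluster availability and the current password are injected by hand, and only the 30 second initial wait and 90 second interval come from the comment.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical per-server maintenance task; one instance would live on each HA server.
class StorageSessionMaintenance {
    static final long INITIAL_DELAY_SECONDS = 30; // initial wait, from the comment
    static final long INTERVAL_SECONDS = 90;      // run interval, from the comment

    // Stand-ins for state the real code would read from the storage cluster.
    private boolean clusterAvailable;
    private String currentPassword;

    // State of this server's own storage session.
    private boolean sessionInitialized = false;
    private String sessionPassword = null;

    StorageSessionMaintenance(boolean clusterAvailable, String currentPassword) {
        this.clusterAvailable = clusterAvailable;
        this.currentPassword = currentPassword;
    }

    synchronized void setClusterAvailable(boolean available) {
        this.clusterAvailable = available;
    }

    synchronized void setCurrentPassword(String password) {
        this.currentPassword = password;
    }

    // One timer run: initialize the session if needed, otherwise
    // restart it only when the cluster credentials have changed.
    synchronized String tick() {
        if (!clusterAvailable) {
            return "WAITING";
        }
        if (!sessionInitialized) {
            sessionInitialized = true;
            sessionPassword = currentPassword;
            return "INITIALIZED";
        }
        if (!currentPassword.equals(sessionPassword)) {
            sessionPassword = currentPassword;
            return "REFRESHED";
        }
        return "UNCHANGED";
    }

    // Schedules tick() on this server with the delay and interval from the comment.
    ScheduledFuture<?> schedule(ScheduledExecutorService executor) {
        return executor.scheduleAtFixedRate(this::tick, INITIAL_DELAY_SECONDS, INTERVAL_SECONDS, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        StorageSessionMaintenance task = new StorageSessionMaintenance(false, "old-password");
        System.out.println(task.tick()); // storage cluster still down
        task.setClusterAvailable(true);
        System.out.println(task.tick()); // first run after the cluster comes up
        task.setCurrentPassword("new-password");
        System.out.println(task.tick()); // run after a credentials change

        // Show the scheduling call, then stop right away so the demo exits.
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        task.schedule(executor);
        executor.shutdownNow();
    }
}
```

Because each server owns its own instance and its own timer, no server depends on another node having executed the task, which is the property the Quartz jobs lacked.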


release/jon3.2.x branch commits:

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=06c3109c8ef0f3438297ecc5706a6519822e19c6

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=eac44d57849e2fa2a2f2226d0f7827e91ffc3a75

Comment 3 Stefan Negrea 2013-11-14 21:31:40 UTC
Slightly updated the code based on a code review from Jay. This change reduces the chance of a server not starting because the storage cluster is not available (which is the desired outcome). Also removed some redundant code left from the quartz jobs.

release/jon3.2.x branch commit:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=d69b30f4d2c8d89910619848d9e02e48b036d395

Comment 4 Simeon Pinder 2013-11-19 15:48:49 UTC
Moving to ON_QA as available for testing with new brew build.

Comment 5 Simeon Pinder 2013-11-22 05:14:14 UTC
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.

Comment 6 Armine Hovsepyan 2013-12-02 12:47:52 UTC
verified