Description of problem:

StorageClusterInitJob and StorageClusterCredentialsJob are wrongly implemented as Quartz jobs. With Quartz jobs there is no guarantee on which node of an HA environment they will run, and there is only one job instance for the entire HA environment. A job might therefore be scheduled on a different node each run, or it might run in perpetuity on one single node. The design dictates that these two jobs run on each server of an HA environment, so the current implementation is wrong. These two jobs, or the tasks performed by them, should be run via EJB timers.

Side effects of the current implementation:

1) StorageClusterInitJob:
- If the storage cluster is down when the HA RHQ servers are started, only one server will get its storage session initialized once the storage cluster becomes available.
- The first node to run the Quartz job configures the storage session for that particular node, and the job is then cancelled.
- The other nodes never get a properly initialized storage connection because the job is cancelled after the first successful run.

2) StorageClusterCredentialsJob:
- When the cluster credentials change, it is unpredictable if/when all HA nodes will get their cluster sessions restarted with the new, correct credentials.
- Because this is a Quartz job, the node that executes it cannot be predicted. The same HA node may execute the job in perpetuity.

This bug does not affect single-node RHQ installations, since there is only one executor for the Quartz jobs.

Steps to Reproduce (StorageClusterInitJob):
1. Create an environment with at least two HA server nodes and storage nodes
2. Make sure the storage cluster is not started
3. Start each of the servers without starting the storage cluster
4. Wait for each HA server to be initialized
5. Start the storage cluster

Actual results:
Only one HA server will be in Normal operational mode. The other HA servers will stay in maintenance mode and their storage sessions will not get initialized properly.

Expected results:
All HA servers get the storage session properly initialized.

Steps to Reproduce (StorageClusterCredentialsJob):
1. Create an environment with at least two HA server nodes and storage nodes
2. Start the storage cluster and the HA servers
3. Wait for each HA server to be initialized
4. Change the storage cluster password
5. Watch for session re-initialization events

Actual results:
Not all HA servers will have the storage session reinitialized with the new credentials. Some servers might take longer, and some might never get the storage session reinitialized.

Expected results:
All HA servers have the storage session reinitialized within minutes.

Additional info:
This bug only affects HA deployments. The fix is simple: just move the code to be executed via EJB timers; no functional changes are required.
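The core of the problem above is that a single shared Quartz job cannot initialize a per-server session on every node, while a timer owned by each server can. The following container-free Java sketch (not RHQ code; all names are hypothetical, and the real delays of 30s/90s are shortened to milliseconds) simulates several HA servers, each running its own repeating timer that retries session initialization until the storage cluster comes up:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class PerNodeTimerSketch {
    // The real fix uses a 30s initial delay and a 90s interval; shortened
    // here so the simulation finishes quickly.
    static final long INITIAL_DELAY_MS = 10;
    static final long INTERVAL_MS = 20;

    /** Simulates `nodes` HA servers, each with its own repeating timer. */
    public static boolean[] run(int nodes) throws InterruptedException {
        AtomicBoolean clusterUp = new AtomicBoolean(false);
        AtomicBoolean[] initialized = new AtomicBoolean[nodes];
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(nodes);
        for (int i = 0; i < nodes; i++) {
            AtomicBoolean done = initialized[i] = new AtomicBoolean(false);
            // Per-node timer: keeps retrying until THIS node's session is up.
            scheduler.scheduleAtFixedRate(() -> {
                if (clusterUp.get()) {
                    done.set(true); // stands in for initializing the session
                }
            }, INITIAL_DELAY_MS, INTERVAL_MS, TimeUnit.MILLISECONDS);
        }
        Thread.sleep(50);     // servers start while the cluster is still down
        clusterUp.set(true);  // storage cluster comes online
        Thread.sleep(200);    // give every node's timer a chance to fire again
        scheduler.shutdownNow();
        boolean[] result = new boolean[nodes];
        for (int i = 0; i < nodes; i++) result[i] = initialized[i].get();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        for (boolean b : run(3)) System.out.println("node initialized: " + b);
    }
}
```

With a per-node timer every server eventually initializes its own session; under the buggy Quartz model, the single job would run (and then be cancelled) on only one node.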
QE activities:
1) Please create documented TCMS test cases for these two HA scenarios
2) Please create a documented TCMS test case run for qualification
3) Please perform a general smoke test on HA (simulating outages, etc.)
Converted the functionality implemented by StorageClusterInitJob and StorageClusterCredentialsJob to an EJB timer on StorageClientManagerBean. This guarantees that session initialization and credential refresh happen on every single server of an HA environment. The timer runs at a 90-second interval after an initial wait of 30 seconds, which reduces the possibility of subsequent runs waiting on the credentials process to finish.

release/jon3.2.x branch commits:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=06c3109c8ef0f3438297ecc5706a6519822e19c6
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=eac44d57849e2fa2a2f2226d0f7827e91ffc3a75
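For reference, an interval timer with a 30-second initial delay and a 90-second period can be created on a singleton EJB roughly as follows. This is a hedged skeleton, not the actual RHQ source: the class and method bodies are hypothetical, and only the standard EJB 3.1 timer API calls are real. It is container-managed code and will not run standalone.

```java
import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerConfig;
import javax.ejb.TimerService;

// Hypothetical sketch of the approach, not the RHQ implementation.
@Singleton
@Startup
public class StorageClientTimerSketch {

    @Resource
    private TimerService timerService;

    @PostConstruct
    void schedule() {
        // Non-persistent timer: each server in the HA environment creates
        // its own, so the work runs on every node (unlike a Quartz job).
        TimerConfig config = new TimerConfig(null, false);
        timerService.createIntervalTimer(30_000L, 90_000L, config);
    }

    @Timeout
    void onTimer(Timer timer) {
        // Initialize the storage session / refresh credentials on THIS node.
    }
}
```

Because the timer is non-persistent and created in `@PostConstruct`, every server instance schedules its own copy at startup, which is exactly the per-node behavior the design requires.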
Slightly updated the code based on a code review from Jay. This change reduces the chance of a server not starting when the storage cluster is unavailable (the desired outcome). Also removed some redundant code left over from the Quartz jobs.

release/jon3.2.x branch commit:
https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?h=release/jon3.2.x&id=d69b30f4d2c8d89910619848d9e02e48b036d395
Moving to ON_QA as available for testing with new brew build.
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
verified