+++ This bug was initially created as a clone of Bug #1003679 +++

Description of problem:
When a bundle deployment to a platform resource fails and I try to run it a second time, the bundle gets into a weird state. I am deploying sample files to a group of 2 platform resources. I uploaded an incomplete bundle via the CLI, so the deployment must fail, because the ant deployer will miss some files. After the 2nd attempt, both platforms are marked as FAILED, but the bundle deployment itself still shows IN_PROGRESS.

Version-Release number of selected component (if applicable):
RHQ 4.9 - master

How reproducible:
you need enough luck and a full moon

Steps to Reproduce:
1. run the attached CLI script (several times if you don't hit the issue)

Actual results:
The script fails and the bundle deployment times out. Looking through the UI, you should see X attempts to deploy bundle version 1.1 (incomplete). All of them should be FAILED, yet the last one stays IN_PROGRESS.

Expected results:
After the script runs, you should see X failed deploy attempts.

Additional info:

--- Additional comment from Libor Zoubek on 2013-09-02 13:24:11 EDT ---

Bug 1003681 contains a reproducing script with simple bundles.

--- Additional comment from Filip Brychta on 2013-09-19 10:00:38 EDT ---

I found RHQ Version: 4.10.0-SNAPSHOT Build Number: 1e23623 in similar state. Bundle deployment was stuck in In progress state. This behaviour is nondeterministic.

Scenario:
1 - cli script deploys a simple bundle on group of linux platforms into /tmp/myBundle directory
2 - script never exits and GUI shows that bundle deployment is still in progress (see attached screenshot)

see attached complete logs, cli script started at 2013-09-18 18:50:16.763

--- Additional comment from Filip Brychta on 2013-09-19 10:01:18 EDT ---

--- Additional comment from Filip Brychta on 2013-09-19 10:03:59 EDT ---

--- Additional comment from Filip Brychta on 2013-09-19 10:23:13 EDT ---

(In reply to Filip Brychta from comment #2)
> I found RHQ Version: 4.10.0-SNAPSHOT Build Number: 1e23623 in similar state.
> Bundle deployment was stuck in In progress state. This behaviour is
> nondeterministic.
>
> Scenario:
> 1 - cli script deploys a simple bundle on group of linux platforms into
> /tmp/myBundle directory
> 2 - script never exits and GUI shows that bundle deployment is still in
> progress (see attached screenshot)

The script never exits because of

    ...
    while (deployment.status == BundleDeploymentStatus.PENDING || deployment.status == BundleDeploymentStatus.IN_PROGRESS)
    ...

>
> see attached complete logs, cli script started at 2013-09-18 18:50:16.763

--- Additional comment from Lukas Krejci on 2013-10-02 10:54:41 EDT ---

There is probably a race condition underlying this problem.

The BundleDeployment.status field is only updated in a single business method (and as a consequence of setting a failure message):

BundleManagerBean#setBundleResourceDeploymentStatus()

This method is called by agents once the deployment of the bundle on an individual resource from the target resource group has completed (it can also be called by the server if it cannot schedule the deployment on an agent, but that's not important in our case).

The method first updates the status of the single resource deployment in question and then goes on to check the "overarching" bundle deployment. It goes through all of the bundle deployment's already existing resource deployments and then assigns the overall status of the bundle deployment according to the following logic:

    if (someInProgress) {
        deployment.setStatus(BundleDeploymentStatus.IN_PROGRESS);
    } else if (someSuccess) {
        deployment.setStatus(someFailure ? BundleDeploymentStatus.MIXED : BundleDeploymentStatus.SUCCESS);
    } else {
        deployment.setStatus(BundleDeploymentStatus.FAILURE);
    }

Let's consider this situation:

Out of N members of a resource group, only the last 2 agents haven't reported back their deployment status: agent Y and agent Z.
The reports from Y and Z arrive nearly simultaneously, so the two calls to BundleManagerBean#setBundleResourceDeploymentStatus() run concurrently, each in its own transaction.

1) In DB, BundleDeployment.status == IN_PROGRESS.
2) Transaction for report from agent Y starts (T(Y)).
3) Transaction for report from agent Z starts (T(Z)).
4) In T(Y), resource deployment R(Y) is updated to SUCCESS.
5) In T(Z), resource deployment R(Z) is updated to SUCCESS.
6) In T(Y), all the resource deployments are checked.
7) In T(Y), we see R(Z) as IN_PROGRESS because T(Z) has not committed yet.
8) In T(Z), all the resource deployments are checked.
9) In T(Z), we see R(Y) as IN_PROGRESS because T(Y) has not committed yet.
10) In T(Y), according to the above logic, BundleDeployment.status is set to IN_PROGRESS.
11) T(Y) completes.
12) In T(Z), according to the above logic, BundleDeployment.status is set to IN_PROGRESS.
13) T(Z) completes.

As a result, all the resource deployments have completed, yet the bundle deployment status lingers in the IN_PROGRESS state forever.

There are a couple of options to fix this problem:

1) Load the resource deployments of a BundleDeployment entity eagerly and compute the overall status instead of storing it in DB.
2) Come up with some atomic counter in DB that would determine the true number of completed requests despite our transaction isolation level.
3) Have a scheduled job that would periodically compute the BundleDeployment.status for all deployments in progress.

--- Additional comment from Lukas Krejci on 2013-10-02 10:58:01 EDT ---

As for reproduction tips:

I think the likelihood of this happening is proportional to the size of the resource group and inversely proportional to the speed of the DB. I.e. the bigger the resource group and/or the slower the DB, the more likely this should be.
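The interleaving in steps 1) to 13) above can be replayed deterministically without a database. The following is only a minimal sketch, not RHQ code: BundleStatusRace, its nested Status enum, and the viewOf() helper are hypothetical names, and "what a transaction sees" is modelled as the committed rows plus the transaction's own uncommitted write, which assumes a read-committed style of isolation. Only the if/else chain in overall() is taken from the logic quoted earlier in this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BundleStatusRace {

    // Simplified stand-in for BundleDeploymentStatus (hypothetical, not the RHQ enum).
    enum Status { IN_PROGRESS, SUCCESS, MIXED, FAILURE }

    // Same decision logic as the snippet quoted in the analysis above.
    static Status overall(List<Status> resourceStatuses) {
        boolean someInProgress = resourceStatuses.contains(Status.IN_PROGRESS);
        boolean someSuccess = resourceStatuses.contains(Status.SUCCESS);
        boolean someFailure = resourceStatuses.contains(Status.FAILURE);
        if (someInProgress) {
            return Status.IN_PROGRESS;
        } else if (someSuccess) {
            return someFailure ? Status.MIXED : Status.SUCCESS;
        } else {
            return Status.FAILURE;
        }
    }

    // What one transaction sees: the committed rows plus its own uncommitted write.
    static List<Status> viewOf(List<Status> committed, int ownRow, Status ownWrite) {
        List<Status> view = new ArrayList<>(committed);
        view.set(ownRow, ownWrite);
        return view;
    }

    public static void main(String[] args) {
        // Rows are [R(X), R(Y), R(Z)]; only Y and Z have not reported yet.
        List<Status> committed = new ArrayList<>(
                Arrays.asList(Status.SUCCESS, Status.IN_PROGRESS, Status.IN_PROGRESS));

        // Steps 4-9: both transactions compute the overall status before either commits.
        Status computedByY = overall(viewOf(committed, 1, Status.SUCCESS));
        Status computedByZ = overall(viewOf(committed, 2, Status.SUCCESS));

        // Steps 10-13: both commit their row update together with the stale overall status.
        committed.set(1, Status.SUCCESS);
        committed.set(2, Status.SUCCESS);
        Status stored = computedByZ; // last writer wins; both wrote the same value anyway

        System.out.println("T(Y) computed: " + computedByY);
        System.out.println("T(Z) computed: " + computedByZ);
        System.out.println("stored status: " + stored);
        System.out.println("recomputed from committed rows: " + overall(committed));
    }
}
```

In this model both transactions compute IN_PROGRESS, while recomputing from the committed rows afterwards gives SUCCESS. That gap is what options 1) and 3) close by deriving the status outside the per-report transactions.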
--- Additional comment from Lukas Krejci on 2013-10-04 14:19:04 EDT ---

commit 7637832ec430a746b7d0a8195980988c9c451521
Author: Lukas Krejci <lkrejci>
Date:   Fri Oct 4 20:12:45 2013 +0200

    [BZ 1003679] - New job to check bundle deployment completion.

    The quartz job checks for the completion of a bundle deployment instead of
    checking for that inline with the handling of individual resource deployment
    reports.

    This avoids the possibility of a race condition when the last two reports,
    if running simultaneously, could leave the deployment in an IN_PROGRESS
    state even though all the resource deployments have completed.
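To illustrate the idea behind a periodic completion check, here is a rough sketch. It is not the code from the commit above: it uses the JDK's ScheduledExecutorService rather than Quartz, and CompletionCheckSketch, its two in-memory maps (stand-ins for the DB tables), and checkOnce() are all hypothetical names introduced for this example. The point it tries to show is that the overall status is derived in one place, from already committed resource statuses, instead of inside each agent report's transaction.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class CompletionCheckSketch {

    // Simplified stand-in for BundleDeploymentStatus (hypothetical).
    enum Status { IN_PROGRESS, SUCCESS, MIXED, FAILURE }

    // Hypothetical in-memory stand-ins for the committed DB state, keyed by deployment id.
    final Map<Integer, List<Status>> resourceStatuses = new ConcurrentHashMap<>();
    final Map<Integer, Status> deploymentStatus = new ConcurrentHashMap<>();

    // Same decision logic as quoted earlier in this report.
    static Status overall(List<Status> rows) {
        if (rows.contains(Status.IN_PROGRESS)) {
            return Status.IN_PROGRESS;
        } else if (rows.contains(Status.SUCCESS)) {
            return rows.contains(Status.FAILURE) ? Status.MIXED : Status.SUCCESS;
        } else {
            return Status.FAILURE;
        }
    }

    // One pass of the job: re-derive the status of every deployment still IN_PROGRESS.
    // Returns how many deployments were moved to a final status.
    int checkOnce() {
        int finished = 0;
        for (Map.Entry<Integer, Status> entry : deploymentStatus.entrySet()) {
            if (entry.getValue() != Status.IN_PROGRESS) {
                continue; // already final
            }
            List<Status> rows = resourceStatuses.get(entry.getKey());
            if (rows == null || rows.isEmpty()) {
                continue; // nothing reported yet, leave it alone
            }
            Status derived = overall(rows);
            if (derived != Status.IN_PROGRESS) {
                deploymentStatus.put(entry.getKey(), derived);
                finished++;
            }
        }
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        CompletionCheckSketch job = new CompletionCheckSketch();
        job.deploymentStatus.put(1, Status.IN_PROGRESS);
        job.resourceStatuses.put(1, Arrays.asList(Status.SUCCESS, Status.SUCCESS));

        // Run the check periodically on a single thread, so passes never overlap.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(job::checkOnce, 0, 1, TimeUnit.SECONDS);
        Thread.sleep(200);
        scheduler.shutdownNow();
        System.out.println("deployment 1: " + job.deploymentStatus.get(1));
    }
}
```

Because a single thread runs every pass, two passes cannot interleave the way T(Y) and T(Z) do. The trade-off is that a finished deployment may stay IN_PROGRESS for up to one scheduling interval.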
This has been pulled into JON before the creation of dedicated release/jon3.2.x branch, thus I assume it is part of ER3.
The upstream BZ 1003679 has been returned to DEV with a problem. Pulling this out of the QA queue until we have a fixed version in JON again, too. Resetting priority to URGENT so we don't leave JON with a broken version of the fix.
gonna do this: https://bugzilla.redhat.com/show_bug.cgi?id=1003679#c13
(In reply to John Mazzitelli from comment #3)
> gonna do this: https://bugzilla.redhat.com/show_bug.cgi?id=1003679#c13

cherry picked to release 3.2 branch: 4892c50
Moving to ON_QA as available for testing with new brew build.
Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.
verified on CR1