Bug 1015658

Summary:	An attempt to deploy a bundle can lead to infinite IN_PROGRESS state
Product:	[JBoss] JBoss Operations Network	Reporter:	Lukas Krejci <lkrejci>
Component:	Provisioning	Assignee:	John Mazzitelli <mazz>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Mike Foley <mfoley>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	JON 3.2	CC:	fbrychta, hrupp, lkrejci, lzoubek, theute
Target Milestone:	ER07
Target Release:	JON 3.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	1003679	Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1003679
Bug Blocks:	1012435

Description Lukas Krejci 2013-10-04 18:20:06 UTC

+++ This bug was initially created as a clone of Bug #1003679 +++

Description of problem: When a bundle deployment to a platform resource fails and I try to run it a second time, the bundle gets into a weird state. I am deploying sample files to a group of 2 platform resources. I uploaded an incomplete bundle via the CLI, so the deployment must fail, because the Ant deployer will be missing some files. After the 2nd attempt, both platforms are marked as FAILED, but the bundle itself still shows IN_PROGRESS.


Version-Release number of selected component (if applicable): 
RHQ 4.9 - master


How reproducible: you need enough luck and a full moon


Steps to Reproduce:
1. run attached CLI script (more times if you don't hit the issue)

Actual results: The script fails and the bundle deployment times out. Looking through the UI, you should see X attempts to deploy bundle version 1.1 (incomplete); all of them should be FAILED, but the last one stays IN_PROGRESS.



Expected results: After script runs, you should see X failed deploy attempts


Additional info:

--- Additional comment from Libor Zoubek on 2013-09-02 13:24:11 EDT ---

Bug 1003681 contains reproducing script with simple bundles

--- Additional comment from Filip Brychta on 2013-09-19 10:00:38 EDT ---

I found RHQ Version: 4.10.0-SNAPSHOT Build Number: 1e23623 in a similar state.
The bundle deployment was stuck in the IN_PROGRESS state. This behaviour is nondeterministic.

Scenario:
1 - CLI script deploys a simple bundle on a group of Linux platforms into the /tmp/myBundle directory
2 - script never exits and the GUI shows that the bundle deployment is still in progress (see attached screenshot)

see attached complete logs, cli script started at 2013-09-18 18:50:16.763

--- Additional comment from Filip Brychta on 2013-09-19 10:01:18 EDT ---



--- Additional comment from Filip Brychta on 2013-09-19 10:03:59 EDT ---



--- Additional comment from Filip Brychta on 2013-09-19 10:23:13 EDT ---

(In reply to Filip Brychta from comment #2)
> I found RHQ Version: 4.10.0-SNAPSHOT Build Number: 1e23623 in similar state.
> Bundle deployment was stuck in In progress state. This behaviour is
> nondeterministic.
> 
> Scenario:
> 1 - cli script deploys a simple bundle on group of linux platforms into
> /tmp/myBundle direcotory
> 2 - script never exits and GUI shows that bundle deployment is still in
> progress (see attached screenshot)

The script never exits because of its polling loop: while (deployment.status == BundleDeploymentStatus.PENDING || deployment.status == BundleDeploymentStatus.IN_PROGRESS) ...

> 
> see attached complete logs, cli script started at 2013-09-18 18:50:16.763
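
For illustration, the polling condition quoted in this reply can be bounded by a deadline so that a deployment stuck in IN_PROGRESS does not hang the caller. This is a minimal Java sketch under stated assumptions, not the attached CLI script: the Status enum is a local stand-in for BundleDeploymentStatus, and waitForCompletion, the Supplier used to read the status, and the timeout values are all hypothetical.

```java
import java.util.function.Supplier;

final class DeploymentPoller {
    // Local stand-in for the status values named in this report; not the RHQ enum itself.
    enum Status { PENDING, IN_PROGRESS, MIXED, SUCCESS, FAILURE }

    /**
     * Polls statusSource until it returns a terminal status or the timeout elapses.
     * Returns the last status seen, which is still PENDING or IN_PROGRESS on timeout.
     */
    static Status waitForCompletion(Supplier<Status> statusSource, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Status status = statusSource.get();
        while ((status == Status.PENDING || status == Status.IN_PROGRESS)
                && System.nanoTime() < deadline) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return status;
            }
            status = statusSource.get();
        }
        return status;
    }

    public static void main(String[] args) {
        // Hypothetical status source that never leaves IN_PROGRESS, like the stuck deployment here.
        Status result = waitForCompletion(() -> Status.IN_PROGRESS, 200, 20);
        System.out.println("stopped waiting with status " + result);
    }
}
```

A caller can then treat a returned PENDING or IN_PROGRESS as a timeout and report it, rather than looping forever.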

--- Additional comment from Lukas Krejci on 2013-10-02 10:54:41 EDT ---

There probably is a race condition underlying this problem. 

The BundleDeployment.status field is only being updated in a single business method (+ as a consequence of setting a failure message):

BundleManagerBean#setBundleResourceDeploymentStatus()

This method is called from agents once a deployment of the bundle on an individual resource from the target resource group is completed (it can also be called by the server if it cannot schedule the deployment on an agent, but that's not important in our case).

The method first updates the status of the single resource deployment in question and then goes on to check the "overarching" bundle deployment. It goes through all of its already existing resource deployments and then assigns the overall progress of the bundle deployment according to the following logic:

if (someInProgress) {
    deployment.setStatus(BundleDeploymentStatus.IN_PROGRESS);
} else if (someSuccess) {
    deployment.setStatus(someFailure ? BundleDeploymentStatus.MIXED : BundleDeploymentStatus.SUCCESS);
} else {
    deployment.setStatus(BundleDeploymentStatus.FAILURE);
}

Let's consider this situation:

Out of N members of a resource group only the last 2 agents haven't reported back their deployment status: agent Y and agent Z.

The reports from Y and Z arrive nearly simultaneously, so the two calls to BundleManagerBean#setBundleResourceDeploymentStatus() run concurrently, each in its own transaction.

1) in DB, BundleDeployment.status == IN_PROGRESS
2) Transaction for report from agent Y starts (T(Y)).
3) Transaction for report from agent Z starts (T(Z)).
4) In T(Y), resource deployment R(Y) is updated to SUCCESS.
5) In T(Z), resource deployment R(Z) is updated to SUCCESS.
6) In T(Y), all the resource deployments are checked. 
7) In T(Y), we see R(Z) as IN_PROGRESS because T(Z) has not committed yet.
8) In T(Z), all resource deployments are checked.
9) In T(Z), we see R(Y) as IN_PROGRESS because T(Y) has not committed yet.
10) In T(Y), according to the above logic, BundleDeployment.status is set to IN_PROGRESS.
11) T(Y) completes.
12) In T(Z), according to the above logic, BundleDeployment.status is set to IN_PROGRESS.
13) T(Z) completes.

As a result, we have all the resource deployments completed, yet the bundle deployment status lingers in IN_PROGRESS state forever.
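
The interleaving above can be replayed deterministically. The following is a minimal Java sketch, not the RHQ code: the enums and the overall() helper are local stand-ins that mirror the quoted decision logic, and each list models what one transaction can see, assuming a transaction sees its own uncommitted write but only the committed value of the other resource deployment.

```java
import java.util.Arrays;
import java.util.List;

final class BundleStatusRace {
    // Local stand-ins for the statuses named in this report; not the RHQ enums.
    enum ResourceStatus { IN_PROGRESS, SUCCESS, FAILURE }
    enum BundleStatus { IN_PROGRESS, MIXED, SUCCESS, FAILURE }

    // Same decision logic as quoted above, applied to the statuses a transaction can see.
    static BundleStatus overall(List<ResourceStatus> visible) {
        boolean someInProgress = visible.contains(ResourceStatus.IN_PROGRESS);
        boolean someSuccess = visible.contains(ResourceStatus.SUCCESS);
        boolean someFailure = visible.contains(ResourceStatus.FAILURE);
        if (someInProgress) {
            return BundleStatus.IN_PROGRESS;
        } else if (someSuccess) {
            return someFailure ? BundleStatus.MIXED : BundleStatus.SUCCESS;
        } else {
            return BundleStatus.FAILURE;
        }
    }

    public static void main(String[] args) {
        // View of T(Y): R(Y) updated by itself, R(Z) still at its committed value.
        List<ResourceStatus> seenByTY = Arrays.asList(ResourceStatus.SUCCESS, ResourceStatus.IN_PROGRESS);
        // View of T(Z): R(Z) updated by itself, R(Y) still at its committed value.
        List<ResourceStatus> seenByTZ = Arrays.asList(ResourceStatus.IN_PROGRESS, ResourceStatus.SUCCESS);
        // State of the resource deployments once both transactions have committed.
        List<ResourceStatus> committed = Arrays.asList(ResourceStatus.SUCCESS, ResourceStatus.SUCCESS);

        System.out.println("T(Y) writes: " + overall(seenByTY));
        System.out.println("T(Z) writes: " + overall(seenByTZ));
        System.out.println("Correct value after both commit: " + overall(committed));
    }
}
```

Both transactions write IN_PROGRESS, and nothing recomputes the status afterwards, even though the committed resource deployments would yield SUCCESS.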

There are a couple of options to fix this problem:

1) Load the resource deployments of a BundleDeployment entity eagerly and compute the overall status instead of storing it in the DB.

2) Come up with some atomic counter in DB that would determine the true number of completed requests despite our transaction isolation level.

3) Have a scheduled job that would compute the BundleDeployment.status for all deployments in progress periodically.
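
To picture option 2, here is an in-memory analogy in Java. It is only a sketch of the idea: an AtomicInteger stands in for the atomic counter that would have to live in the DB, and the CompletionCounter class and its methods are hypothetical names, not part of RHQ. The point is that exactly one reporter observes the counter reaching zero, so exactly one of them is responsible for finalizing the overall status.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CompletionCounter {
    private final AtomicInteger outstanding;                      // reports still expected
    private final AtomicInteger failures = new AtomicInteger();   // failed reports so far

    CompletionCounter(int groupSize) {
        this.outstanding = new AtomicInteger(groupSize);
    }

    /** Records one report; returns true only for the caller that delivered the last one. */
    boolean report(boolean success) {
        if (!success) {
            failures.incrementAndGet();
        }
        // decrementAndGet is atomic, so only one caller can ever see the value 0.
        return outstanding.decrementAndGet() == 0;
    }

    int failures() {
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CompletionCounter counter = new CompletionCounter(2);
        AtomicInteger finalizers = new AtomicInteger();
        // Agents Y and Z report at nearly the same time.
        Runnable agent = () -> {
            if (counter.report(true)) {
                finalizers.incrementAndGet(); // the last reporter would set the final status here
            }
        };
        Thread y = new Thread(agent);
        Thread z = new Thread(agent);
        y.start();
        z.start();
        y.join();
        z.join();
        System.out.println("finalizers: " + finalizers.get());
    }
}
```

However the two threads interleave, only one of them sees the counter hit zero, which is the property the inline status check lacks.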

--- Additional comment from Lukas Krejci on 2013-10-02 10:58:01 EDT ---

As for reproduction tips:

I think the likelihood of this happening is proportional to the size of the resource group and inversely proportional to the speed of the DB.

I.e. the bigger the resource group and/or the slower the DB, the more likely this should be.

--- Additional comment from Lukas Krejci on 2013-10-04 14:19:04 EDT ---

commit 7637832ec430a746b7d0a8195980988c9c451521
Author: Lukas Krejci <lkrejci>
Date:   Fri Oct 4 20:12:45 2013 +0200

    [BZ 1003679] - New job to check bundle deployment completion.
    
    The quartz job checks for the completion of a bundle deployment instead of
    checking for that inline with the handling of individual resource
    deployment reports. This avoids the possibility of a race condition when
    the last two reports, if running simultaneously, could leave the deployment
    in an IN_PROGRESS state even though all the resource deployments have
    completed.
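
The approach in the commit message can be sketched roughly as follows. This is not the committed code: it uses java.util.concurrent.ScheduledExecutorService as a stand-in for the Quartz job, and the Deployment class, the sweep() method, and the 100 ms interval are hypothetical. The idea shown is that a single periodic task, rather than the concurrent report handlers, decides when a deployment has left IN_PROGRESS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class CompletionSweep {
    enum Status { IN_PROGRESS, MIXED, SUCCESS, FAILURE }

    // Hypothetical in-memory stand-in for a bundle deployment and its resource deployments.
    static final class Deployment {
        volatile Status status = Status.IN_PROGRESS;
        final List<Status> resources; // per-resource statuses: IN_PROGRESS, SUCCESS or FAILURE

        Deployment(List<Status> resources) {
            this.resources = resources;
        }
    }

    // Recomputes the overall status of every deployment that is still marked IN_PROGRESS.
    static void sweep(List<Deployment> deployments) {
        for (Deployment d : deployments) {
            if (d.status != Status.IN_PROGRESS) {
                continue; // already finalized
            }
            if (d.resources.contains(Status.IN_PROGRESS)) {
                continue; // some resource deployment has not reported yet
            }
            boolean someSuccess = d.resources.contains(Status.SUCCESS);
            boolean someFailure = d.resources.contains(Status.FAILURE);
            d.status = someSuccess
                    ? (someFailure ? Status.MIXED : Status.SUCCESS)
                    : Status.FAILURE;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Deployment> deployments = new ArrayList<>();
        // Both resource deployments completed, but the overall status was left IN_PROGRESS.
        deployments.add(new Deployment(Arrays.asList(Status.SUCCESS, Status.SUCCESS)));

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> sweep(deployments), 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(300);
        scheduler.shutdownNow();
        System.out.println("status after sweep: " + deployments.get(0).status);
    }
}
```

Because the sweep runs on a single thread and only reads already reported resource statuses, it does not depend on the ordering of the last two reports.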

Comment 1 Lukas Krejci 2013-10-14 13:01:40 UTC

This has been pulled into JON before the creation of dedicated release/jon3.2.x branch, thus I assume it is part of ER3.

Comment 2 Lukas Krejci 2013-10-23 12:12:45 UTC

The upstream BZ 1003679 has been returned to DEV with a problem. Pulling this out of the QA queue until we have a fixed version in JON again, too.

Resetting priority to URGENT so we don't leave JON with a broken version of the fix.

Comment 3 John Mazzitelli 2013-11-05 15:37:17 UTC

gonna do this: https://bugzilla.redhat.com/show_bug.cgi?id=1003679#c13

Comment 4 John Mazzitelli 2013-11-05 21:52:34 UTC

(In reply to John Mazzitelli from comment #3)
> gonna do this: https://bugzilla.redhat.com/show_bug.cgi?id=1003679#c13

cherry picked to release 3.2 branch: 4892c50

Comment 5 Simeon Pinder 2013-11-19 15:48:04 UTC

Moving to ON_QA as available for testing with new brew build.

Comment 6 Simeon Pinder 2013-11-22 05:13:37 UTC

Mass moving all of these from ER6 to target milestone ER07 since the ER6 build was bad and QE was halted for the same reason.

Comment 7 Libor Zoubek 2013-12-05 14:51:28 UTC

verified on CR1