1003679 – An attempt to deploy a bundle can lead to infinite IN_PROGRESS state

Bug 1003679 - An attempt to deploy a bundle can lead to infinite IN_PROGRESS state

Summary: An attempt to deploy a bundle can lead to infinite IN_PROGRESS state

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Provisioning
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	RHQ 4.10
Assignee:	John Mazzitelli
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1015658
TreeView+	depends on / blocked

Reported:	2013-09-02 17:06 UTC by Libor Zoubek
Modified:	2015-11-02 00:43 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Clone Of:
Clones:	1015658 (view as bug list)
Environment:
Last Closed:	2014-04-23 12:30:56 UTC
Embargoed:

Attachments	(Terms of Use)
cli script (134.92 KB, application/javascript) 2013-09-02 17:06 UTC, Libor Zoubek	no flags	Details
complete logs (525.74 KB, application/x-gzip) 2013-09-19 14:01 UTC, Filip Brychta	no flags	Details
screenshot (99.90 KB, image/png) 2013-09-19 14:03 UTC, Filip Brychta	no flags	Details
View All

Description Libor Zoubek 2013-09-02 17:06:58 UTC

Created attachment 792933 [details]
cli script

Description of problem: When a bundle deployment to platform resource fails and I try to run it for the 2nd time, bundle gets into weird state. I am deploying sample files to group of 2 platform resources. I have uploaded incomplete bundle via CLI, so it must fail, because ant deployer will miss some files. After 2nd atempt, both platforms are marked as FAILED, but bundle itself still shows IN_PROGRESS.


Version-Release number of selected component (if applicable): 
RHQ 4.9 - master


How reproducible: you need enough luck and fullmoon


Steps to Reproduce:
1. run attached CLI script (more times if you don't hit the issue)

Actual results: Script fails and Bundle deployment times out, Look through UI, you should see X attempts to deploy a bundle version 1.1 (incomplete), all must be FAILED and the last one stays in IN_PROGRESS.



Expected results: After script runs, you should see X failed deploy attempts


Additional info:

Comment 1 Libor Zoubek 2013-09-02 17:24:11 UTC

Bug 1003681 contains reproducing script with simple bundles

Comment 2 Filip Brychta 2013-09-19 14:00:38 UTC

I found RHQ Version: 4.10.0-SNAPSHOT Build Number: 1e23623 in similar state.
Bundle deployment was stuck in In progress state. This behaviour is nondeterministic.

Scenario:
1 - cli script deploys a simple bundle on group of linux platforms into /tmp/myBundle direcotory
2 - script never exits and GUI shows that bundle deployment is still in progress (see attached screenshot)

see attached complete logs, cli script started at 2013-09-18 18:50:16.763

Comment 3 Filip Brychta 2013-09-19 14:01:18 UTC

Created attachment 799964 [details]
complete logs

Comment 4 Filip Brychta 2013-09-19 14:03:59 UTC

Created attachment 799969 [details]
screenshot

Comment 5 Filip Brychta 2013-09-19 14:23:13 UTC

(In reply to Filip Brychta from comment #2)
> I found RHQ Version: 4.10.0-SNAPSHOT Build Number: 1e23623 in similar state.
> Bundle deployment was stuck in In progress state. This behaviour is
> nondeterministic.
> 
> Scenario:
> 1 - cli script deploys a simple bundle on group of linux platforms into
> /tmp/myBundle direcotory
> 2 - script never exits and GUI shows that bundle deployment is still in
> progress (see attached screenshot)

script never exists because of ...  while (deployment.status == BundleDeploymentStatus.PENDING || deployment.status == BundleDeploymentStatus.IN_PROGRESS) ... 

> 
> see attached complete logs, cli script started at 2013-09-18 18:50:16.763

Comment 6 Lukas Krejci 2013-10-02 14:54:41 UTC

There probably is a race condition underlying this problem. 

The BundleDeployment.status field is only being updated in a single business method (+ as a consequence of setting a failure message):

BundleManagerBean#setBundleResourceDeploymentStatus()

This method is called from agents once a deployment of the bundle on an individual resource from the target resource group is completed (it can also be called by the server if it cannot schedule the deployment on an agent, but that's not important in our case).

The method first updates the status of the single resource deployment in question and then goes on to check the "overarching" bundle deployment. It goes through all of its already existing resource deployments and then assigns the overall progress of the bundle deployment according to the following logic:

if (someInProgress) {
    deployment.setStatus(BundleDeploymentStatus.IN_PROGRESS);
} else if (someSuccess) {
    deployment.setStatus(someFailure ? BundleDeploymentStatus.MIXED : BundleDeploymentStatus.SUCCESS);
} else {
    deployment.setStatus(BundleDeploymentStatus.FAILURE);
}

Let's consider this situation:

Out of N members of a resource group only the last 2 agents haven't reported back their deployment status: agent Y and agent Z.

The reports from Y and Z come nearly simultaneously and therefore the calls to BundleManagerBean#setBundleResourceDeploymentStatus() each run in a standalone transaction.

1) in DB, BundleDeployment.status == IN_PROGRESS
2) Transaction for report from agent Y starts (T(Y)).
3) Transaction for report from agent Z starts (T(Z)).
4) In T(Y), resource deployment R(Y) is updated to SUCCESS.
5) In T(Z), resource deployment R(Z) is updated to SUCCESS.
6) In T(Y), all the resource deployments are checked. 
7) In T(Y), we see R(Z) as IN_PROGRESS because T(Z) has not committed yet.
8) In T(Z), all resource deployments are checked.
9) In T(Z), we see R(Y) as IN_PROGRESS because T(Y) has not committed yet.
10) In T(Y), according to the above logic, BundleDeployment.status is set to IN_PROGRESS.
11) T(Y) completes.
12) In T(Z), according to the above logic, BundleDeployment.status is set to IN_PROGRESS.
13) T(Z) completes.

As a result, we have all the resource deployments completed, yet the bundle deployment status lingers in IN_PROGRESS state forever.

There are a couple of options to fix this problem:

1) load the resource deployments of a BundleDeployment entity eagerly and compute the overall status instead of storing it in DB.

2) Come up with some atomic counter in DB that would determine the true number of completed requests despite our transaction isolation level.

3) Have a scheduled job that would compute the BundleDeployment.status for all deployments in progress periodically.

Comment 7 Lukas Krejci 2013-10-02 14:58:01 UTC

As for reproduction tips:

I think the likelyhood of this happening is proportional to the size of the resource group and inversely proportional to the speed of the DB.

I.e. the bigger the resource group and/or the slower the DB, the more likely this should be.

Comment 8 Lukas Krejci 2013-10-04 18:19:04 UTC

commit 7637832ec430a746b7d0a8195980988c9c451521
Author: Lukas Krejci <lkrejci>
Date:   Fri Oct 4 20:12:45 2013 +0200

    [BZ 1003679] - New job to check bundle deployment completion.
    
    The quartz job checks for the completion of a bundle deployment instead of
    checking for that inline with the handling of individual resource
    deployment reports. This avoids the possibility of a race condition when
    the last two reports, if running simultaneously, could leave the deployment
    in an IN_PROGRESS state even though all the resource deployments have
    completed.

Comment 9 Libor Zoubek 2013-10-23 11:06:52 UTC

Hi Lukas, 

after your fix, bundle deployment in RHQ 4.10 master got a little broken. I have deployment going on 2 agents and deployment is marked as MIXED when both agent deployments finished with SUCCESS.

On one agent deployment history is an error message: 

Failed to schedule, agent on [Resource[id=10001, uuid=bb81426e-d81a-44b4-bc26-95d712168e78, type={Platforms}Linux, key=auto-rhq.bc.jonqe.lab.eng.bos.redhat.com, name=auto-rhq.bc.jonqe.lab.eng.bos.redhat.com, parent=, version=Linux 2.6.32-358.el6.x86_64]] may be down: org.quartz.ObjectAlreadyExistsException: Unable to store Job with name: '10041' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.

Comment 10 Libor Zoubek 2013-10-23 13:13:38 UTC

One important note: I am getting this mixed state only when upgrading bundle deployment. 

Test scenario is like this:
1. deploy bundle on 2 agents => pass
2. upgrade bundle => fail .. mixed status, but success on both agents

Comment 11 John Mazzitelli 2013-11-04 21:48:54 UTC

(In reply to Libor Zoubek from comment #1)
> Bug 1003681 contains reproducing script with simple bundles

I can't replicate this on master branch. I took bundles.zip from that bug #1003681, unzipped, and ran bundles.js using the CLI and I get an error:

"Exception in thread "main" org.rhq.enterprise.client.script.CLIScriptException: org.rhq.scripting.javascript.engine.util.ExtendedScriptException: org.mozilla.javascript.EcmaError: ReferenceError: "bundleIncomplete" is not defined. (<Unknown source>#4304) in <Unknown source> at line number 4304"

I then tried to deploy a bundle, then upgrade the bundle to a new version, to a group of two platforms. I tried several times. Each time, a success and I never saw the destination's status as "mixed" soon after everything completed.

Comment 12 John Mazzitelli 2013-11-05 15:22:03 UTC

(In reply to Libor Zoubek from comment #9)
> org.quartz.ObjectAlreadyExistsException: Unable to store Job with name:
> '10041' and group:
> 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob',
> because one already exists with this identification.

I think the code we have to alter is this - we should only schedule the job if it doesn't already exist. If the job already exists, there is no need to schedule another one, since one job is all we need in order to periodically check it. I also think the job should start a few seconds in the future, rather than immediately upon scheduling since we're almost assuredly not going to have anything different at schedule time (and this might actually lead to a race condition if we aren't done scheduling all the bundle deployments):

                    // schedule the bundle deployment completion check. Due to timing issues, we cannot determine
                    // the overall completion status of the bundle deployment while receiving the individual resource
                    // deployment statuses. This needs to be done out of band by a quartz job.
                    // See https://bugzilla.redhat.com/show_bug.cgi?id=1003679 for details.
                    JobDetail jobDetail = BundleDeploymentStatusCheckJob.getJobDetail(deployment.getId());
                    Trigger trigger = QuartzUtil.getRepeatingTrigger(jobDetail, 0, 10000);

                    quartzScheduler.scheduleJob(jobDetail, trigger);

Comment 13 John Mazzitelli 2013-11-05 15:30:30 UTC

after talking with lukas, we think it would make more sense to move the scheduling of the job inside scheduleBundleDeploymentImpl after the for-loop that schedules all the resource deployments:

        for (Resource groupMember : groupMembers) {
            try {
                scheduleBundleResourceDeployment(subject, newDeployment, groupMember, isCleanDeployment, isRevert);
            } catch (Throwable t) {
                log.error("Failed to complete scheduling of bundle deployment to [" + groupMember
                    + "]. Other bundle deployments to other resources may have been scheduled. ", t);
            }
        }

/// TODO SCHEDULE CHECK JOB HERE

Comment 14 John Mazzitelli 2013-11-05 21:48:11 UTC

git commit to master: 85aa4b3

hopefully, this fixes the race condition

Comment 15 John Mazzitelli 2013-11-05 21:52:00 UTC

(In reply to John Mazzitelli from comment #14)
> git commit to master: 85aa4b3
> 
> hopefully, this fixes the race condition

scratch that - that is the wrong git commit hash - the real commit hash to master branch is 7e32d8c

Comment 16 Heiko W. Rupp 2014-04-23 12:30:56 UTC

Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.

Note You need to log in before you can comment on or make changes to this bug.