Bug 1463351 - Bundle deployment sometimes fails with message "Unable to store Trigger with name: '<timestamp>' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification."
Bundle deployment sometimes fails with message "Unable to store Trigger with ...
Status: POST
Product: JBoss Operations Network
Classification: JBoss
Component: Provisioning (Show other bugs)
JON 3.3.8
Unspecified Unspecified
high Severity high
: ER01
: JON 3.3.9
Assigned To: Michael Burman
Mike Foley
: Triaged
Depends On: 1463352
  Show dependency treegraph
Reported: 2017-06-20 11:29 EDT by bkramer
Modified: 2017-08-03 07:42 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

  None (edit)
Description bkramer 2017-06-20 11:29:38 EDT
Description of problem:

Bundle deployment sometimes fails with the error like:

... INFO  [org.quartz.core.JobRunShell] (RHQScheduler_Worker-5) Job org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.39199 threw a JobExecutionException: : org.quartz.JobExecutionException: Could not schedule the bundle deployment completion check job for BundleDeployment[id=39199, name=Deployment [3] of Version [3.0.0] to [my_platform]. Upgrade from Version [2.0.0]]. [See nested exception: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.]
	at org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.execute(BundleDeploymentStatusCheckJob.java:97) [rhq-server.jar:4.12.0.JON330GA-redhat-8]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]
Caused by: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.

The trigger name in the above message is a timestamp meaning that if multiple deployments were created at the same time, there is a chance that a duplicate trigger for a different job was created. Our code confirms this - BundleDeploymentStatusCheckJob.execute:

064     @Override
065     public void execute(JobExecutionContext context) throws JobExecutionException {
066         BundleManagerLocal bundleManager = LookupUtil.getBundleManager();
067         SubjectManagerLocal subjectManager = LookupUtil.getSubjectManager();
069         Subject overlord = subjectManager.getOverlord();
071         PageList<BundleDeployment> deployments = bundleManager.findBundleDeploymentsByCriteria(overlord,
072             getCriteriaFromContext(context));
074         if (deployments.size() > 0) {
075             BundleDeployment bundleDeployment = deployments.get(0);
076             SchedulerLocal scheduler = LookupUtil.getSchedulerBean();
077             JobDetail jobDetail = context.getJobDetail();
079             BundleDeploymentStatus bundleDeploymentStatus = bundleManager.determineBundleDeploymentStatus(bundleDeployment.getId());
080             if (bundleDeploymentStatus.isTerminal()) {
081                 // delete this job, we've assigned a final status
082                 try {
083                     context.setResult(bundleDeploymentStatus); // Return status to possible listeners
084                     scheduler.deleteJob(jobDetail.getName(), jobDetail.getGroup());
085                 } catch (SchedulerException e) {
086                     throw new JobExecutionException("Could not delete the bundle deployment completion check job for "
087                         + bundleDeployment + ".", e);
088                 }
089             } else {
090                 // try again in 10s
091                 try {
092                     Trigger trigger = QuartzUtil.getFireOnceOffsetTrigger(jobDetail, 10000L);
093                     // just need a trigger name unique for this job
094                     trigger.setName(String.valueOf(System.currentTimeMillis()));
095                     scheduler.scheduleJob(trigger);
096                 } catch (SchedulerException e) {
097                     throw new JobExecutionException(
098                         "Could not schedule the bundle deployment completion check job for " + bundleDeployment + ".",
099                         e);
100                 }
101             }
102         }
103     }

where "trigger.setName(String.valueOf(System.currentTimeMillis()));". 

Version-Release number of selected component (if applicable):
JBoss ON 3.3.8

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
Comment 2 Larry O'Leary 2017-07-12 17:55:47 EDT
There is really no way to reliably reproduce this. Obviously, the more concurrency, the easier it becomes to reproduce the error consistently. However, it is still all about timing. In my testing, I could see that most instances of the trigger used a name that was made up of a timestamp that was within 30ms of one another. However, I had to run the test with 8 concurrent deployment attempt for nearly a day before two of the trigger names eventually used the exact same system time.

The bug is very evident though. In a multi-server and multi-threaded environment, it is very easy to see how two bundle deployment requests could result in the same trigger name. The buggy code is at line 94 of BundleDeploymentStatusCheckJob[1]. 

To fix this, the call to trigger.setName must use a trigger name that not only includes the timestamp, but perhaps the original bundle deployment job name as its prefix. This should make it unique enough seeing that the bundle deployment job name includes the deployment id if I am not mistaking. Meaning that even if two users deploy the exact same bundle to even the same destination at the exact same time, the trigger name should still be unique due to the database transaction and auto-increment nature of the deployment id.

From a testing perspective, there isn't a whole lot that can be tested here. However, a concurrent bundle deployment soak/tolerance test could be beneficial.

Such a test would include deploying multiple bundles to multiple agents simultaneously. This could be against a single or multiple JBoss ON servers. Each bundle would need to have a deployment delay built into its recipe so that there is more of a chance that the BundleDeploymentStatusCheckJob would need to be rescheduled which is what is causing the issue here.

For my own testing I used 8 versions of a single bundle and deployed them to a single agent. The recipe used the Ant sleep task within a postInstallTarget called dosleep:

  <target name="dosleep">
    <rhq:audit status="SUCCESS" action="Taking a nap" info="All done deploying bundle to \${rhq.deploy.dir}. Sleeping for 21 seconds."/>
    <sleep seconds="21"/>
    <rhq:audit status="SUCCESS" action="Waking up from my nap" info="That was a good sleep."/>

This 21 second delay means that each of the 8 bundles will take at least 21 seconds to have their status updated to SUCCESS. In which case the JBoss ON server will schedule the BundleDeploymentStatusCheckJob every 10 seconds to query the deployment status. The important thing is to execute the bundle deployments in their own threads (concurrently).

[1]: https://github.com/rhq-project/rhq/blob/RHQ_4_12_0_JON338GA/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/scheduler/jobs/BundleDeploymentStatusCheckJob.java#L94
Comment 3 Michael Burman 2017-08-03 07:42:14 EDT
Fixed in the master:

commit 1b6de1d45e0520b43a10d2cf3339e176be5146db
Author: Michael Burman <miburman@redhat.com>
Date:   Thu Aug 3 14:41:47 2017 +0300

    [BZ 1463351] Replace statusCheckJob name to randomUUID(), which should be random as long as there is enough entropy in the system

Note You need to log in before you can comment on or make changes to this bug.