Bug 1463351 - Bundle deployment sometimes fails with message "Unable to store Trigger with name: '<timestamp>' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification."
Summary: Bundle deployment sometimes fails with message "Unable to store Trigger with ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Provisioning
Version: JON 3.3.8
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: DR01
: JON 3.3.9
Assignee: Michael Burman
QA Contact: Filip Brychta
URL:
Whiteboard:
Depends On: 1463352
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-06-20 15:29 UTC by bkramer
Modified: 2020-08-13 09:25 UTC (History)
4 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-10-02 17:22:36 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3077601 0 None None None 2017-09-28 20:25:14 UTC
Red Hat Product Errata RHEA-2017:2846 0 normal SHIPPED_LIVE Red Hat JBoss Operations Network 3.3.9 bug fix update 2017-10-02 21:21:28 UTC

Description bkramer 2017-06-20 15:29:38 UTC
Description of problem:

Bundle deployment sometimes fails with the error like:

********************************************************
... INFO  [org.quartz.core.JobRunShell] (RHQScheduler_Worker-5) Job org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.39199 threw a JobExecutionException: : org.quartz.JobExecutionException: Could not schedule the bundle deployment completion check job for BundleDeployment[id=39199, name=Deployment [3] of Version [3.0.0] to [my_platform]. Upgrade from Version [2.0.0]]. [See nested exception: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.]
	at org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.execute(BundleDeploymentStatusCheckJob.java:97) [rhq-server.jar:4.12.0.JON330GA-redhat-8]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]
Caused by: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.
********************************************************

The trigger name in the above message is a timestamp meaning that if multiple deployments were created at the same time, there is a chance that a duplicate trigger for a different job was created. Our code confirms this - BundleDeploymentStatusCheckJob.execute:

**********************************************************
064     @Override
065     public void execute(JobExecutionContext context) throws JobExecutionException {
066         BundleManagerLocal bundleManager = LookupUtil.getBundleManager();
067         SubjectManagerLocal subjectManager = LookupUtil.getSubjectManager();
068 
069         Subject overlord = subjectManager.getOverlord();
070 
071         PageList<BundleDeployment> deployments = bundleManager.findBundleDeploymentsByCriteria(overlord,
072             getCriteriaFromContext(context));
073 
074         if (deployments.size() > 0) {
075             BundleDeployment bundleDeployment = deployments.get(0);
076             SchedulerLocal scheduler = LookupUtil.getSchedulerBean();
077             JobDetail jobDetail = context.getJobDetail();
078 
079             BundleDeploymentStatus bundleDeploymentStatus = bundleManager.determineBundleDeploymentStatus(bundleDeployment.getId());
080             if (bundleDeploymentStatus.isTerminal()) {
081                 // delete this job, we've assigned a final status
082                 try {
083                     context.setResult(bundleDeploymentStatus); // Return status to possible listeners
084                     scheduler.deleteJob(jobDetail.getName(), jobDetail.getGroup());
085                 } catch (SchedulerException e) {
086                     throw new JobExecutionException("Could not delete the bundle deployment completion check job for "
087                         + bundleDeployment + ".", e);
088                 }
089             } else {
090                 // try again in 10s
091                 try {
092                     Trigger trigger = QuartzUtil.getFireOnceOffsetTrigger(jobDetail, 10000L);
093                     // just need a trigger name unique for this job
094                     trigger.setName(String.valueOf(System.currentTimeMillis()));
095                     scheduler.scheduleJob(trigger);
096                 } catch (SchedulerException e) {
097                     throw new JobExecutionException(
098                         "Could not schedule the bundle deployment completion check job for " + bundleDeployment + ".",
099                         e);
100                 }
101             }
102         }
103     }
**********************************************************

where "trigger.setName(String.valueOf(System.currentTimeMillis()));". 


Version-Release number of selected component (if applicable):
JBoss ON 3.3.8

How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Larry O'Leary 2017-07-12 21:55:47 UTC
There is really no way to reliably reproduce this. Obviously, the more concurrency, the easier it becomes to reproduce the error consistently. However, it is still all about timing. In my testing, I could see that most instances of the trigger used a name that was made up of a timestamp that was within 30ms of one another. However, I had to run the test with 8 concurrent deployment attempt for nearly a day before two of the trigger names eventually used the exact same system time.

The bug is very evident though. In a multi-server and multi-threaded environment, it is very easy to see how two bundle deployment requests could result in the same trigger name. The buggy code is at line 94 of BundleDeploymentStatusCheckJob[1]. 

To fix this, the call to trigger.setName must use a trigger name that not only includes the timestamp, but perhaps the original bundle deployment job name as its prefix. This should make it unique enough seeing that the bundle deployment job name includes the deployment id if I am not mistaking. Meaning that even if two users deploy the exact same bundle to even the same destination at the exact same time, the trigger name should still be unique due to the database transaction and auto-increment nature of the deployment id.



From a testing perspective, there isn't a whole lot that can be tested here. However, a concurrent bundle deployment soak/tolerance test could be beneficial.

Such a test would include deploying multiple bundles to multiple agents simultaneously. This could be against a single or multiple JBoss ON servers. Each bundle would need to have a deployment delay built into its recipe so that there is more of a chance that the BundleDeploymentStatusCheckJob would need to be rescheduled which is what is causing the issue here.

For my own testing I used 8 versions of a single bundle and deployed them to a single agent. The recipe used the Ant sleep task within a postInstallTarget called dosleep:

  <target name="dosleep">
    <rhq:audit status="SUCCESS" action="Taking a nap" info="All done deploying bundle to \${rhq.deploy.dir}. Sleeping for 21 seconds."/>
    <sleep seconds="21"/>
    <rhq:audit status="SUCCESS" action="Waking up from my nap" info="That was a good sleep."/>
  </target>


This 21 second delay means that each of the 8 bundles will take at least 21 seconds to have their status updated to SUCCESS. In which case the JBoss ON server will schedule the BundleDeploymentStatusCheckJob every 10 seconds to query the deployment status. The important thing is to execute the bundle deployments in their own threads (concurrently).



[1]: https://github.com/rhq-project/rhq/blob/RHQ_4_12_0_JON338GA/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/scheduler/jobs/BundleDeploymentStatusCheckJob.java#L94

Comment 3 Michael Burman 2017-08-03 11:42:14 UTC
Fixed in the master:

commit 1b6de1d45e0520b43a10d2cf3339e176be5146db
Author: Michael Burman <miburman>
Date:   Thu Aug 3 14:41:47 2017 +0300

    [BZ 1463351] Replace statusCheckJob name to randomUUID(), which should be random as long as there is enough entropy in the system

Comment 5 Simeon Pinder 2017-09-01 15:34:06 UTC
Moving to ON_QA as available for test with the following binary:
http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/133/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip

*NOTE Build represents JON 3.3.9 DR01 build.

Please report issues as you encounter them.

Comment 8 errata-xmlrpc 2017-10-02 17:22:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2846


Note You need to log in before you can comment on or make changes to this bug.