Bug 1463351

Summary: Bundle deployment sometimes fails with message "Unable to store Trigger with name: '<timestamp>' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification."
Product: [JBoss] JBoss Operations Network Reporter: bkramer <bkramer>
Component: ProvisioningAssignee: Michael Burman <miburman>
Status: CLOSED ERRATA QA Contact: Filip Brychta <fbrychta>
Severity: high Docs Contact:
Priority: high    
Version: JON 3.3.8CC: fbrychta, hhovsepy, loleary, spinder
Target Milestone: DR01Keywords: Triaged
Target Release: JON 3.3.9   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-10-02 17:22:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1463352    
Bug Blocks:    

Description bkramer 2017-06-20 15:29:38 UTC
Description of problem:

Bundle deployment sometimes fails with the error like:

********************************************************
... INFO  [org.quartz.core.JobRunShell] (RHQScheduler_Worker-5) Job org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.39199 threw a JobExecutionException: : org.quartz.JobExecutionException: Could not schedule the bundle deployment completion check job for BundleDeployment[id=39199, name=Deployment [3] of Version [3.0.0] to [my_platform]. Upgrade from Version [2.0.0]]. [See nested exception: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.]
	at org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.execute(BundleDeploymentStatusCheckJob.java:97) [rhq-server.jar:4.12.0.JON330GA-redhat-8]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]
Caused by: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.
********************************************************

The trigger name in the above message is a timestamp meaning that if multiple deployments were created at the same time, there is a chance that a duplicate trigger for a different job was created. Our code confirms this - BundleDeploymentStatusCheckJob.execute:

**********************************************************
064     @Override
065     public void execute(JobExecutionContext context) throws JobExecutionException {
066         BundleManagerLocal bundleManager = LookupUtil.getBundleManager();
067         SubjectManagerLocal subjectManager = LookupUtil.getSubjectManager();
068 
069         Subject overlord = subjectManager.getOverlord();
070 
071         PageList<BundleDeployment> deployments = bundleManager.findBundleDeploymentsByCriteria(overlord,
072             getCriteriaFromContext(context));
073 
074         if (deployments.size() > 0) {
075             BundleDeployment bundleDeployment = deployments.get(0);
076             SchedulerLocal scheduler = LookupUtil.getSchedulerBean();
077             JobDetail jobDetail = context.getJobDetail();
078 
079             BundleDeploymentStatus bundleDeploymentStatus = bundleManager.determineBundleDeploymentStatus(bundleDeployment.getId());
080             if (bundleDeploymentStatus.isTerminal()) {
081                 // delete this job, we've assigned a final status
082                 try {
083                     context.setResult(bundleDeploymentStatus); // Return status to possible listeners
084                     scheduler.deleteJob(jobDetail.getName(), jobDetail.getGroup());
085                 } catch (SchedulerException e) {
086                     throw new JobExecutionException("Could not delete the bundle deployment completion check job for "
087                         + bundleDeployment + ".", e);
088                 }
089             } else {
090                 // try again in 10s
091                 try {
092                     Trigger trigger = QuartzUtil.getFireOnceOffsetTrigger(jobDetail, 10000L);
093                     // just need a trigger name unique for this job
094                     trigger.setName(String.valueOf(System.currentTimeMillis()));
095                     scheduler.scheduleJob(trigger);
096                 } catch (SchedulerException e) {
097                     throw new JobExecutionException(
098                         "Could not schedule the bundle deployment completion check job for " + bundleDeployment + ".",
099                         e);
100                 }
101             }
102         }
103     }
**********************************************************

where "trigger.setName(String.valueOf(System.currentTimeMillis()));". 


Version-Release number of selected component (if applicable):
JBoss ON 3.3.8

How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Larry O'Leary 2017-07-12 21:55:47 UTC
There is really no way to reliably reproduce this. Obviously, the more concurrency, the easier it becomes to reproduce the error consistently. However, it is still all about timing. In my testing, I could see that most instances of the trigger used a name that was made up of a timestamp that was within 30ms of one another. However, I had to run the test with 8 concurrent deployment attempt for nearly a day before two of the trigger names eventually used the exact same system time.

The bug is very evident though. In a multi-server and multi-threaded environment, it is very easy to see how two bundle deployment requests could result in the same trigger name. The buggy code is at line 94 of BundleDeploymentStatusCheckJob[1]. 

To fix this, the call to trigger.setName must use a trigger name that not only includes the timestamp, but perhaps the original bundle deployment job name as its prefix. This should make it unique enough seeing that the bundle deployment job name includes the deployment id if I am not mistaking. Meaning that even if two users deploy the exact same bundle to even the same destination at the exact same time, the trigger name should still be unique due to the database transaction and auto-increment nature of the deployment id.



From a testing perspective, there isn't a whole lot that can be tested here. However, a concurrent bundle deployment soak/tolerance test could be beneficial.

Such a test would include deploying multiple bundles to multiple agents simultaneously. This could be against a single or multiple JBoss ON servers. Each bundle would need to have a deployment delay built into its recipe so that there is more of a chance that the BundleDeploymentStatusCheckJob would need to be rescheduled which is what is causing the issue here.

For my own testing I used 8 versions of a single bundle and deployed them to a single agent. The recipe used the Ant sleep task within a postInstallTarget called dosleep:

  <target name="dosleep">
    <rhq:audit status="SUCCESS" action="Taking a nap" info="All done deploying bundle to \${rhq.deploy.dir}. Sleeping for 21 seconds."/>
    <sleep seconds="21"/>
    <rhq:audit status="SUCCESS" action="Waking up from my nap" info="That was a good sleep."/>
  </target>


This 21 second delay means that each of the 8 bundles will take at least 21 seconds to have their status updated to SUCCESS. In which case the JBoss ON server will schedule the BundleDeploymentStatusCheckJob every 10 seconds to query the deployment status. The important thing is to execute the bundle deployments in their own threads (concurrently).



[1]: https://github.com/rhq-project/rhq/blob/RHQ_4_12_0_JON338GA/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/scheduler/jobs/BundleDeploymentStatusCheckJob.java#L94

Comment 3 Michael Burman 2017-08-03 11:42:14 UTC
Fixed in the master:

commit 1b6de1d45e0520b43a10d2cf3339e176be5146db
Author: Michael Burman <miburman>
Date:   Thu Aug 3 14:41:47 2017 +0300

    [BZ 1463351] Replace statusCheckJob name to randomUUID(), which should be random as long as there is enough entropy in the system

Comment 5 Simeon Pinder 2017-09-01 15:34:06 UTC
Moving to ON_QA as available for test with the following binary:
http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/133/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip

*NOTE Build represents JON 3.3.9 DR01 build.

Please report issues as you encounter them.

Comment 8 errata-xmlrpc 2017-10-02 17:22:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2846