Description of problem: Bundle deployment sometimes fails with the error like: ******************************************************** ... INFO [org.quartz.core.JobRunShell] (RHQScheduler_Worker-5) Job org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.39199 threw a JobExecutionException: : org.quartz.JobExecutionException: Could not schedule the bundle deployment completion check job for BundleDeployment[id=39199, name=Deployment [3] of Version [3.0.0] to [my_platform]. Upgrade from Version [2.0.0]]. [See nested exception: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification.] at org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob.execute(BundleDeploymentStatusCheckJob.java:97) [rhq-server.jar:4.12.0.JON330GA-redhat-8] at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5] Caused by: org.quartz.ObjectAlreadyExistsException: Unable to store Trigger with name: '1497219902994' and group: 'org.rhq.enterprise.server.scheduler.jobs.BundleDeploymentStatusCheckJob', because one already exists with this identification. ******************************************************** The trigger name in the above message is a timestamp meaning that if multiple deployments were created at the same time, there is a chance that a duplicate trigger for a different job was created. Our code confirms this - BundleDeploymentStatusCheckJob.execute: ********************************************************** 064 @Override 065 public void execute(JobExecutionContext context) throws JobExecutionException { 066 BundleManagerLocal bundleManager = LookupUtil.getBundleManager(); 067 SubjectManagerLocal subjectManager = LookupUtil.getSubjectManager(); 068 069 Subject overlord = subjectManager.getOverlord(); 070 071 PageList<BundleDeployment> deployments = bundleManager.findBundleDeploymentsByCriteria(overlord, 072 getCriteriaFromContext(context)); 073 074 if (deployments.size() > 0) { 075 BundleDeployment bundleDeployment = deployments.get(0); 076 SchedulerLocal scheduler = LookupUtil.getSchedulerBean(); 077 JobDetail jobDetail = context.getJobDetail(); 078 079 BundleDeploymentStatus bundleDeploymentStatus = bundleManager.determineBundleDeploymentStatus(bundleDeployment.getId()); 080 if (bundleDeploymentStatus.isTerminal()) { 081 // delete this job, we've assigned a final status 082 try { 083 context.setResult(bundleDeploymentStatus); // Return status to possible listeners 084 scheduler.deleteJob(jobDetail.getName(), jobDetail.getGroup()); 085 } catch (SchedulerException e) { 086 throw new JobExecutionException("Could not delete the bundle deployment completion check job for " 087 + bundleDeployment + ".", e); 088 } 089 } else { 090 // try again in 10s 091 try { 092 Trigger trigger = QuartzUtil.getFireOnceOffsetTrigger(jobDetail, 10000L); 093 // just need a trigger name unique for this job 094 trigger.setName(String.valueOf(System.currentTimeMillis())); 095 scheduler.scheduleJob(trigger); 096 } catch (SchedulerException e) { 097 throw new JobExecutionException( 098 "Could not schedule the bundle deployment completion check job for " + bundleDeployment + ".", 099 e); 100 } 101 } 102 } 103 } ********************************************************** where "trigger.setName(String.valueOf(System.currentTimeMillis()));". Version-Release number of selected component (if applicable): JBoss ON 3.3.8 How reproducible: Sometimes Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
There is really no way to reliably reproduce this. Obviously, the more concurrency, the easier it becomes to reproduce the error consistently. However, it is still all about timing. In my testing, I could see that most instances of the trigger used a name that was made up of a timestamp that was within 30ms of one another. However, I had to run the test with 8 concurrent deployment attempt for nearly a day before two of the trigger names eventually used the exact same system time. The bug is very evident though. In a multi-server and multi-threaded environment, it is very easy to see how two bundle deployment requests could result in the same trigger name. The buggy code is at line 94 of BundleDeploymentStatusCheckJob[1]. To fix this, the call to trigger.setName must use a trigger name that not only includes the timestamp, but perhaps the original bundle deployment job name as its prefix. This should make it unique enough seeing that the bundle deployment job name includes the deployment id if I am not mistaking. Meaning that even if two users deploy the exact same bundle to even the same destination at the exact same time, the trigger name should still be unique due to the database transaction and auto-increment nature of the deployment id. From a testing perspective, there isn't a whole lot that can be tested here. However, a concurrent bundle deployment soak/tolerance test could be beneficial. Such a test would include deploying multiple bundles to multiple agents simultaneously. This could be against a single or multiple JBoss ON servers. Each bundle would need to have a deployment delay built into its recipe so that there is more of a chance that the BundleDeploymentStatusCheckJob would need to be rescheduled which is what is causing the issue here. For my own testing I used 8 versions of a single bundle and deployed them to a single agent. The recipe used the Ant sleep task within a postInstallTarget called dosleep: <target name="dosleep"> <rhq:audit status="SUCCESS" action="Taking a nap" info="All done deploying bundle to \${rhq.deploy.dir}. Sleeping for 21 seconds."/> <sleep seconds="21"/> <rhq:audit status="SUCCESS" action="Waking up from my nap" info="That was a good sleep."/> </target> This 21 second delay means that each of the 8 bundles will take at least 21 seconds to have their status updated to SUCCESS. In which case the JBoss ON server will schedule the BundleDeploymentStatusCheckJob every 10 seconds to query the deployment status. The important thing is to execute the bundle deployments in their own threads (concurrently). [1]: https://github.com/rhq-project/rhq/blob/RHQ_4_12_0_JON338GA/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/scheduler/jobs/BundleDeploymentStatusCheckJob.java#L94
Fixed in the master: commit 1b6de1d45e0520b43a10d2cf3339e176be5146db Author: Michael Burman <miburman> Date: Thu Aug 3 14:41:47 2017 +0300 [BZ 1463351] Replace statusCheckJob name to randomUUID(), which should be random as long as there is enough entropy in the system
Moving to ON_QA as available for test with the following binary: http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/133/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip *NOTE Build represents JON 3.3.9 DR01 build. Please report issues as you encounter them.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2846