Description of problem:
If I cancel a job, I do not expect it to run later. However, this is exactly what happens. I discovered this after canceling several thousand jobs in the queue. When I came back later, I looked at the list of jobs and saw that the number of canceled jobs was less than it had been previously. I looked up the status of a job I *knew* I had canceled, and it is now listed as "FINISHED".

Version-Release number of selected component (if applicable):

How reproducible:
Always, though if the job queue is being processed quickly, it may be hard to cancel a job before it gets picked up.

Steps to Reproduce:
1. Create a job, for example refresh an owner's pools. Make note of the job id.
2. Cancel the job before it gets processed, using the job id. The job status should be "CANCELED".
3. Wait for the job queue to catch up. If you retrieve the job details at the right times, the job status will switch to "RUNNING" and then to "FINISHED". (A reproduction sketch follows below.)

Actual results:
A canceled job gets processed.

Expected results:
A canceled job does not get processed, and remains canceled.
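For reference, here is roughly how I reproduce it with a small Python script using the requests library. The endpoint paths, job states, host, owner key, and credentials below are my best recollection of the relevant Candlepin REST API and are assumptions on my part; adjust them for your deployment.

#!/usr/bin/env python
# Reproduction sketch -- the endpoint paths and job state names are
# assumptions based on my reading of the Candlepin REST API.
import time
import requests

BASE = "https://candlepin.example.com/candlepin"  # hypothetical host
AUTH = ("admin", "admin")                         # hypothetical credentials
OWNER = "example_org"                             # hypothetical owner key

session = requests.Session()
session.auth = AUTH
session.verify = False  # self-signed certs are common in dev setups

# 1. Kick off a refresh-pools job for the owner and note the job id.
resp = session.put(f"{BASE}/owners/{OWNER}/subscriptions")
resp.raise_for_status()
job_id = resp.json()["id"]
print("created job", job_id)

# 2. Cancel the job before the queue picks it up.
session.delete(f"{BASE}/jobs/{job_id}").raise_for_status()

# 3. Poll the job; if the bug is present, the state moves from
#    CANCELED to RUNNING and finally FINISHED once the queue catches up.
for _ in range(30):
    state = session.get(f"{BASE}/jobs/{job_id}").json()["state"]
    print("job state:", state)
    if state in ("FINISHED", "FAILED"):
        break
    time.sleep(10)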
Hmm. On further evaluation, this is not as reproducible as I thought. Perhaps our Candlepin instance's Quartz state was messed up enough that it kept processing the canceled jobs. As it stands, my canceled jobs are now staying canceled, which is good. If the project maintainer wants to mark this bug as "could not reproduce", I'm fine with that for now.
Do you have any more details on what caused the strange Quartz state in this instance? For example, was there heavy load on the Candlepin instance? Perhaps a lot of jobs were created and canceled quickly?
This came about because none of the tasks (or at least it seemed that way) were getting processed. Querying the database directly, we saw that several jobs were executing but never seemed to finish. Meanwhile, there were 6000+ jobs waiting to run, many of which were over a week old (a query sketch follows below). Unfortunately it's a bit of a mystery as to why it got backed up the way it did. I have a (weak) theory that a few orgs (with many subscriptions and consumers) had refresh pools tasks that clogged up the works, but that's a shot in the dark. What I ended up doing was canceling those several thousand jobs. Eventually, we restarted Tomcat so that we could enable some advanced logging on the app and in Quartz, and then the queue started getting processed... including the canceled jobs. Thankfully they all processed fast enough that the queue emptied in a matter of hours.
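For what it's worth, this is roughly how we inspected the queue. A minimal sketch assuming a PostgreSQL backend and a cp_job table with id, state, and created columns; the table and column names are from memory and may differ between Candlepin versions (in some versions the state may be stored as an integer ordinal rather than a name), so treat them as assumptions.

#!/usr/bin/env python
# Queue inspection sketch -- assumes a PostgreSQL backend and a cp_job
# table with "state" and "created" columns; names and state
# representation may differ between Candlepin versions.
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="candlepin",   # hypothetical connection info
    user="candlepin", password="candlepin",
)

with conn, conn.cursor() as cur:
    # Count jobs per state to see how far the queue has backed up.
    cur.execute("SELECT state, COUNT(*) FROM cp_job GROUP BY state ORDER BY state")
    for state, count in cur.fetchall():
        print(state, count)

    # List the oldest jobs still waiting, which is how we spotted the
    # week-old entries mentioned above.
    cur.execute(
        "SELECT id, state, created FROM cp_job "
        "WHERE state NOT IN ('FINISHED', 'CANCELED', 'FAILED') "
        "ORDER BY created ASC LIMIT 20"
    )
    for job_id, state, created in cur.fetchall():
        print(job_id, state, created)

conn.close()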
I'm inclined to close this without a reproducer. With no easy way to reproduce, there's not much we can do here. I am curious though: which environment was this in (if it was a dev environment, I'm less concerned).
Since this was a non-prod environment, I'm closing for now. Feel free to reopen if you see this again, or if you come up with a reproducer.