Bug 1144223

Summary: Make the job queue choose jobs fairly
Product: [Community] Bugzilla Reporter: Jason McDonald <jmcdonal>
Component: Internal Tools Assignee: PnT DevOps Devs <hss-ied-bugs>
Internal Tools sub component: Rules Engine QA Contact: tools-bugs <tools-bugs>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: urgent    
Priority: urgent CC: jmcdonal, qgong
Version: 4.4   
Target Milestone: 4.4   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.4.6026 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-27 02:19:26 UTC Type: Bug

Description Jason McDonald 2014-09-19 03:50:37 UTC
Description of problem:
So far we've had two incidents where the job queue got into a state where it wasn't keeping up with the inflow of Rules Engine jobs and we had to manually delete all jobs from the queue.

Diagnosis showed that the queue contained a large number of jobs that were blocked because the queue contained older jobs for the same bugs.  The job queue processor was stuck because it kept choosing the blocked jobs (and having to decline them) rather than choosing unblocked jobs.

This happens because the decision of which job to choose from the queue is based only on the priority of the job (useless, as all jobs currently have the same priority) and on whether the job's grab_until and run_after times are in the past.  This effectively turns the job queue into a pool of jobs where most jobs are "runnable", and because of the way MySQL works, the same set of jobs keeps getting selected (and declined), ad infinitum.
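
To illustrate the failure mode, here is a minimal Python simulation. It is not the actual job queue code. The Job class, the function names, and the idea that ties are broken by a fixed storage order are assumptions made for the sketch; only the field names priority, insert_time, run_after and grab_until come from the description above. Jobs 2 and 3 sit first in storage order but are blocked by the older job 1 for the same bug, so they are picked and declined in every round while jobs 1 and 4 are never reached.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    bug_id: int
    insert_time: int
    priority: int = 0
    run_after: int = 0
    grab_until: int = 0


def is_blocked(job, queue):
    # A job is blocked while an older job for the same bug is still queued.
    return any(
        other.bug_id == job.bug_id and other.insert_time < job.insert_time
        for other in queue
    )


def choose_unfair(queue, now, limit):
    # Assumed model of the old behaviour: filter on run_after/grab_until,
    # order by priority only. The sort is stable, so equal priorities keep
    # the storage order and the same jobs come back every time.
    runnable = [j for j in queue if j.run_after <= now and j.grab_until <= now]
    runnable.sort(key=lambda j: -j.priority)
    return runnable[:limit]


def run_rounds(queue, rounds, limit):
    completed, declined = [], 0
    for now in range(100, 100 + rounds):
        for job in choose_unfair(queue, now, limit):
            if is_blocked(job, queue):
                declined += 1  # job stays where it is, still "runnable"
            else:
                queue.remove(job)
                completed.append(job.job_id)
    return completed, declined


# Storage order deliberately differs from insert_time order.
queue = [Job(2, 1, 2), Job(3, 1, 3), Job(1, 1, 1), Job(4, 2, 4)]
completed, declined = run_rounds(queue, rounds=5, limit=2)
print(completed, declined)
```

In this model no job ever completes, however many rounds are run, because nothing about a declined job changes between rounds.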

The solution:
The job queue needs to be an actual queue.  The oldest eligible jobs need to be processed first, and if a job needs to be declined it must go to the end of the queue, either by resetting the insert_time to the current time or discarding the job and creating a new one.  The former option may be easier, as jobs have other attributes that need to be preserved if the job is declined (e.g. the retry count).
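
A rough Python sketch of the first option (resetting insert_time on decline) follows. It is only an illustration under stated assumptions, not the real implementation: the Job class and the choose_oldest, decline and process helpers are hypothetical names, time is a simple integer counter, and job 1 is given a run_after in the future purely so that job 2 has something to be blocked by.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    bug_id: int
    insert_time: int
    run_after: int = 0
    grab_until: int = 0
    retry_count: int = 0


def is_blocked(job, queue):
    # A job is blocked while an older job for the same bug is still queued.
    return any(
        other.bug_id == job.bug_id and other.insert_time < job.insert_time
        for other in queue
    )


def choose_oldest(queue, now):
    # Eligible jobs are ordered by insert_time, so the oldest one goes first.
    eligible = [j for j in queue if j.run_after <= now and j.grab_until <= now]
    return min(eligible, key=lambda j: j.insert_time, default=None)


def decline(job, now):
    # Send the job to the end of the queue. Other attributes, such as
    # retry_count, are left untouched.
    job.insert_time = now


def process(queue, start, rounds):
    completed, declined = [], []
    for now in range(start, start + rounds):
        job = choose_oldest(queue, now)
        if job is None:
            continue
        if is_blocked(job, queue):
            decline(job, now)
            declined.append(job.job_id)
        else:
            queue.remove(job)
            completed.append(job.job_id)
    return completed, declined


queue = [
    Job(1, bug_id=1, insert_time=1, run_after=102),  # not yet runnable
    Job(2, bug_id=1, insert_time=2, retry_count=2),  # blocked by job 1
    Job(3, bug_id=2, insert_time=3),
    Job(4, bug_id=3, insert_time=4),
]
job2 = queue[1]
completed, declined = process(queue, start=100, rounds=5)
print(completed, declined, job2.retry_count)  # [3, 1, 4, 2] [2] 2
```

Job 2 is declined once, moves behind jobs 3 and 4, and is processed after job 1 has run, with its retry count intact.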

Comment 3 Rony Gong 🔥 2014-10-10 12:00:59 UTC
Verified on QA environment (bzweb01-qe) with version 4.4.6026-4.
Result: Pass
1. Generate a large number of jobs by executing the attachment.
2. In the build that contains the fix, the bugzilla service consumes jobs at a rate of 1500/h, and the "Declined job" message almost never appears in the log.
3. In the build without the fix, the bugzilla service consumes jobs very slowly and eventually hangs, and the log contains many "Declined job" messages.