Description of problem: For a particular large account it seems errata are being processed slowly, causing a number of issues which may be symptoms of the same problem. The issues are as follows:

1) Update status icon in webui is inaccurate for some systems: Systems that have errata that apply to them (as indicated by up2date -l) are shown with a blue 'up2date' icon in the RHN webui. The systems are correctly checking into RHN automatically, yet their update status according to the webui icon is not corrected (it does not show the applicable errata). Running rhn_check manually on the affected systems seems to correct the webui update status icon for these systems.

2) "Affected Systems" tab for a particular erratum is incorrect: Systems that have errata that apply to them (as indicated by up2date -l) are not listed in the "Affected Systems" tab for those errata. This seems to be an issue of slow errata processing. The customer affected by this noted that 1 system per hour or so seems to appear in the "Affected Systems" list. Another thing to note is that the same systems that were missing from the erratum's "Affected Systems" list and that had auto errata update set on them showed the erratum scheduled for update in their SDC > Events > Pending tab.

3) SDC "Errata" tab does not show all applicable errata for a system: Systems that have errata that apply to them (as indicated by up2date -l) do not show those errata in the SDC > Errata tab. As in issue #2 above, systems with auto errata update do show those errata as pending / scheduled in the SDC > Events > Pending tab.
From the specific examples above it seems that this bug does not affect whether or not the errata are actually scheduled / processed by auto-errata update. Rather, it seems to create an incorrect display of information for non-auto-errata update systems in the webui.
One more comment! May be related to bug 201349
More reports/details on this bug available in the thread: https://www.redhat.com/archives/nahant-list/2007-February/msg00058.html
As mentioned in the mailing list thread, running up2date -p manually on the system seems to correct the problem in the webui, though I am not sure how long the correction remains in effect.
I am researching this problem now and have a couple of theories:

1) Taskomatic errata processing is heavily serialized - We process errata one at a time, and one org at a time per erratum. This is obviously far from optimal and not at all how this should work. I think this is a historical vestige of the Perl version of Taskomatic and also reflects a desire to not "melt" the database when RHN had a much smaller DB server.

2) We had several problems with errata processing during the 410 release. The result was that several errata did not get autoupdate scheduling run for them. AFAIK, nothing was ever done to correct this, meaning that these errata were never rescheduled. I need to research this to verify.

I am currently modifying the ErrataQueue, ErrataCache, and ErrataMail tasks to be more parallel and hopefully substantially increase throughput. After that, I will be spelunking through the data to determine if anything else needs to be done.
Machines which have auto errata updates set are getting these updates. So it is not a matter of the update not being there. It is just a matter of whether the GUI is showing them as being there.
This checkin attempts to address the problems reported. Specifically, it:

* Introduces a generic threaded queue model which can be used by many Taskomatic tasks when parallelizing work processing is considered to be beneficial.

* Ports the ErrataCache and ErrataQueue tasks to use the new threaded framework. This should fix many of the problems reported, since errata processing was a largely serial process. That resulted in out-of-date data displays in the UI until the task was able to process all its work items, and the delay could be sizable given the large queues which can develop during large errata releases. Each of these tasks will default to 2 worker threads unless configured to use more.

* Ports the RepollEntitlement task to use the new threaded queues as well. I have done significant testing on webdev to ensure that repoll continues to work as expected. Future QA pushes prior to RHN 500 GA should smoke out any issues, as we have at least one more bulk repoll to do.
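For readers unfamiliar with the pattern, here is a minimal sketch of a threaded work queue with a small fixed pool of workers. It is not the Taskomatic code (which is Java): the names run_threaded_queue, handler and DEFAULT_WORKERS are hypothetical, and only the default of 2 worker threads is taken from the comment above.

```python
import queue
import threading

# Assumption for illustration: mirrors the "default to 2 worker threads"
# behaviour described in the checkin comment. Not a real Taskomatic setting.
DEFAULT_WORKERS = 2


def run_threaded_queue(work_items, handler, workers=DEFAULT_WORKERS):
    """Process work_items with a fixed pool of worker threads.

    Returns (results, errors). results holds (item, value) pairs and
    errors holds (item, exception) pairs; ordering is not guaranteed.
    """
    pending = queue.Queue()
    for item in work_items:
        pending.put(item)

    results, errors = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                value = handler(item)
                with lock:
                    results.append((item, value))
            except Exception as exc:
                # Record the failure and keep going, so one bad work item
                # does not stall everything queued behind it.
                with lock:
                    errors.append((item, exc))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors
```

In a serial task a single slow or failing item delays every item behind it. With a pool, the other workers keep draining the queue, which is the throughput gain the checkin is aiming for. The worker count stays small by default, presumably to limit load on the database.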
Taskomatic was missing updates for some systems which have auto-updates enabled. Checked in a fix for this. Should be part of the next QA push. The testing used for comments #18 and #19 should be used again.
onqa
onqa for reals
My auto-update systems are not scheduling errata. I registered a system (rlx-2-16.rhndev.redhat.com) to the iowastate account, set it to auto-update, pushed an erratum, and noticed that the erratum applies to that system. However the update is not getting auto-scheduled.
Yeah, that's not totally unexpected, unfortunately. I just found a bug in Taskomatic a few minutes ago which could cause this task to fail before the updates have been scheduled. *sigh* Next QA push should have a good fix in it.
As I mentioned before, what is the status concerning the errata even appearing in the Web GUI? I can't schedule an update if it isn't showing up.
Hi Dave, we are currently working on the issue. The workaround is to run rhn_check manually on the affected systems - you might want to try running rhn_check in a cron script on the systems in the meantime while we work on this. Hope this helps.
OK. Fix has just been committed which should handle the DB errors. Also, it appears that rebuilding the errata cache might fix the UI display problems as well. Running down why that would help now.
I have retested this to the best of my knowledge. Please re-open if the issue resurfaces after rhn500h is released.
Reopening the bug due to another use case which I just found. There are two basic ways a given server's errata cache can be recalculated:

1) User logs into RHN - This only works for orgs with few servers (I think the limit is less than 30).

2) Server's setup is changed - This generally happens when a server registers, errata are pushed to a channel to which the server is subscribed, or a server's base channel is changed.

The fix that was verified addressed only the 2nd use case, not the first. I have modified the fix to handle both use cases.

Suggested Test Plans: There need to be two test plans:

1) The same testing used to verify the bug previously.

2) Simulate use case #1:
a) Find a user for an org with fewer than 30 servers.
b) Register another server.
c) Wait 15-20 mins. This gives Taskomatic time to run and do its thing.
d) Open a terminal connected to rhnjava.back-webqa.redhat.com. Watch Tomcat's log with this command: tail -f /var/log/tomcat5/catalina.out
e) Log in as the user selected in step a.
f) Verify that no errors were logged by Tomcat during the login. Inspect the server registered in step b and verify that relevant errata are applied and scheduled.
Scenario 1 smoked out an issue with the ErrataQueue task which would cause it to not process errata records. :( Code checked in this evening addresses that issue, so that _should_ fix scenario one.

One thing to note is that ErrataQueue processing is not fast. It might take up to 30 mins for it to process a single erratum. You can monitor the process, though, by looking at the rhnErrataQueue table like so:

select count(*) from rhnErrataQueue;

or

select * from rhnErrataQueue;

Also, the fix for ErrataQueue has increased the memory requirements for Taskomatic. I was seeing regular OOMs with the current min/max heap sizes. I've bumped up the memory to 192 MB and hope that will be enough.
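The rhnErrataQueue count query mentioned above can be polled from a script instead of being rerun by hand. The following is only a sketch: it assumes a Python DB-API cursor already connected to the RHN database, and the helper names, the 60 second poll interval and the 1800 second timeout (chosen to match the "up to 30 mins" estimate) are all hypothetical.

```python
import time


def errata_queue_depth(cursor):
    """Return the number of rows still waiting in rhnErrataQueue."""
    cursor.execute("select count(*) from rhnErrataQueue")
    return cursor.fetchone()[0]


def wait_for_drain(cursor, poll_seconds=60, timeout_seconds=1800,
                   sleep=time.sleep):
    """Poll the queue until it is empty or the timeout is reached.

    Returns the list of depths observed, oldest first, so the caller can
    see whether the queue is actually shrinking between polls.
    """
    depths = []
    waited = 0
    while True:
        depth = errata_queue_depth(cursor)
        depths.append(depth)
        if depth == 0 or waited >= timeout_seconds:
            return depths
        sleep(poll_seconds)
        waited += poll_seconds
```

If the last observed depth is still non-zero after the timeout, or the depths are not decreasing, the ErrataQueue task is probably stuck rather than merely slow.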
Pushed this fix to webqa. Imported: RHSA-2007:0009. Will check on status in the morning.
I re-ran my original tests from yesterday: Scenario 1 works. Scenario 2 works. Is it my imagination or is this running *a lot* faster than before?
Thanks for all your work on these issues. With any luck our problem resolution may improve things for others too. We really appreciate your efforts on this.
Closed in rhn500h Release.