Description of problem:
For a particular large account it seems errata are being processed slowly,
causing a number of issues which may be symptoms of the same problem. The issues
are as follows:
1) Update status icon in webui is inaccurate for some systems:
Systems that have errata that apply to them (as indicated by up2date -l) are
shown with a blue 'up2date' icon in the RHN webui. The systems are correctly
checking into RHN automatically yet their update status according to the webui
icon is not corrected (Does not show the applicable errata.) Running rhn_check
manually on the affected systems seems to correct the webui update status icon
for these systems.
2) "Affected Systems" tab for a particular erratum is incorrect:
Systems that have errata that apply to them (as indicated by up2date -l) are not
listed in the "Affected Systems" tab for those errata. This seems to be an issue
of slow errata processing. The customer affected by this noted that 1 system per
hour or so seems to appear in the "Systems Affected" list. Another thing to note
is that the same systems that were missing from the erratum's "Systems Affected"
list that had auto errata update set on them showed the erratum scheduled for
update in their SDC > Events > Pending tab.
3) SDC "errata" tab does not show all applicable errata for a system:
Systems that have errata that apply to them (as indicated by up2date -l) do not
show those errata in the SDC > Errata tab. As in issue #2 above, systems with
auto errata update do show those errata as pending / scheduled in the SDC >
Events > Pending tab.
From the specific examples above it seems that this bug does not affect whether
or not the errata are actually scheduled / processed by auto-errata update.
Rather, it seems to create an incorrect display of information for
non-auto-errata update systems in the webui.
One more comment: this may be related to bug 201349.
More reports/details on this bug available in the thread:
As mentioned in the mailing list thread, running up2date -p on the system
manually seems to correct the problem in the web ui... not sure how long it is
effective for though.
I am researching this problem now and have a couple of theories:
1) Taskomatic errata processing is heavily serialized - We process errata one at
a time and one org at a time per errata. This is obviously far from optimal and
not at all how this should work. I think this is a historical vestige of the
Perl version of Taskomatic and also reflects a desire to not "melt" the database
when RHN had a much smaller DB server.
2) We had several problems with errata processing during the 410 release. The
result was that several errata did not get autoupdate scheduling run for them.
AFAIK - nothing was ever done to correct this, meaning that these errata were
never rescheduled. I need to research this to verify.
I am currently modifying the ErrataQueue, ErrataCache, and ErrataMail tasks to
be more parallel and hopefully substantially increase throughput. After that, I
will be spelunking through data to determine if anything needs to be done.
Machines which have auto errata updates set are getting these updates. So it is
not a matter of the update not being there. It is just a matter of whether the
GUI is showing them as being there.
This checkin attempts to address the problems reported. Specifically, it:
* Introduces a generic threaded queue model which can be used by
many Taskomatic tasks when parallelizing work processing is
considered to be beneficial.
* Ports the ErrataCache and ErrataQueue tasks to use the new threaded
framework. This should fix many of the problems reported, since errata
processing was a largely serial process, which resulted in out-of-date data
displays in the UI until the task was able to process all its work items.
This delay could be sizable given the large queues which can develop during
large errata releases. Each of these tasks defaults to 2 worker threads
unless configured to use more.
* Ports the RepollEntitlement task to use the new threaded queues as well. I
have done significant testing on webdev to ensure that repoll continues to
work as expected. Future QA pushes prior to RHN 500 GA should smoke out any
issues as we have at least one more bulk repoll to do.
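For anyone following along, the threaded queue idea in the first bullet can be sketched roughly as below. This is a minimal Python illustration only, not the Taskomatic code that was checked in. The names process_in_parallel, handler, and DEFAULT_WORKERS are made up for the example; the only detail taken from this checkin is the default of 2 worker threads.

```python
import queue
import threading

# Default taken from the checkin notes; all names here are hypothetical.
DEFAULT_WORKERS = 2


def process_in_parallel(items, handler, workers=DEFAULT_WORKERS):
    """Run handler(item) for every item using a fixed pool of worker threads.

    Returns a list of (item, exception) pairs for items whose handler raised.
    """
    work = queue.Queue()
    for item in items:
        work.put(item)

    failures = []
    failures_lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                handler(item)
            except Exception as exc:
                # One failing work item should not stop the rest of the queue.
                with failures_lock:
                    failures.append((item, exc))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

A task would pass its pending work items and a per-item handler, for example process_in_parallel(pending_ids, recalc_one, workers=4), where pending_ids and recalc_one are again placeholders. Recording failures instead of re-raising means one bad item does not block the remaining queue.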
Taskomatic was missing updates for some systems which have auto-updates enabled.
Checked in a fix for this. Should be part of the next QA push. The testing used
for comments #18 and #19 should be used again.
onqa for reals
My auto-update systems are not scheduling errata. I registered a system
(rlx-2-16.rhndev.redhat.com) to the iowastate account, set it to auto-update,
pushed an erratum, and noticed that the erratum applies to that system. However
the update is not getting auto-scheduled.
Yeah, that's not totally unexpected, unfortunately. I just found a bug in
Taskomatic a few minutes ago which could cause this task to fail before the
updates have been scheduled.
Next QA push should have a good fix in it.
As I mentioned before, what is the status concerning the errata even appearing
in the Web GUI? I can't schedule an update if it isn't showing up?
Hi Dave, we are currently working on the issue. The workaround is to run
rhn_check manually on the affected systems - you might want to try running
rhn_check in a cron script on the systems in the meantime while we work on this.
Hope this helps.
OK. Fix has just been committed which should handle the DB errors. Also, it
appears that rebuilding the errata cache might fix the UI display problems as
well. Running down why that would help now.
I have retested this to the best of my knowledge. Please re-open if the issue
resurfaces after rhn500h is released.
Reopening the bug due to another use case which I just found. There are two
basic ways a given server's errata cache can be recalculated:
1) User logs into RHN - This only works for orgs with few servers (I think the
limit is less than 30).
2) Server's setup is changed - This generally happens when a server registers,
errata is pushed to a channel to which the server is subscribed, or a server's
base channel is changed.
The fix verified only addressed the 2nd use case not the first. I have modified
the fix to handle both use cases.
Suggested Test Plans:
There need to be two test plans:
1) The same testing used to verify the bug previously.
2) Simulate use case #1:
a) Find a user for an org with fewer than 30 servers.
b) Register another server.
c) Wait 15-20 mins. This gives Taskomatic time to run and do its thing.
d) Open a terminal connected to rhnjava.back-webqa.redhat.com. Watch Tomcat's
log with this command: tail -f /var/log/tomcat5/catalina.out
e) Login as the user selected in step a.
f) Verify that no errors were logged by Tomcat during the login. Inspect the
server registered in step b and verify that relevant errata are applied and
Scenario smoked out an issue with the ErrataQueue task which would cause it to
not process errata records. :(
Code checked in this evening addresses that issue so that _should_ fix scenario
one. One thing to note is that ErrataQueue processing is not fast. It might take
up to 30 mins for it to process a single erratum. You can monitor the process,
though, by looking at the rhnErrataQueue table like so:
select count(*) from rhnErrataQueue;
select * from rhnErrataQueue;
Also the fix for ErrataQueue has increased the memory requirements for
Taskomatic. I was seeing regular OOMs with the current min/max heap sizes. I've
bumped up the memory to 192 MB and hope that will be enough.
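If you would rather not re-run the count query against rhnErrataQueue by hand, the polling could be scripted along these lines. This is a rough sketch that assumes a Python DB-API connection. The function names, the errata_id column, and the in-memory SQLite table are stand-ins for illustration, not the real RHN schema or database. The defaults (poll once a minute, up to 30 times) simply mirror the "up to 30 mins" estimate above.

```python
import sqlite3
import time


def queue_depth(conn):
    """Return the number of rows still waiting in rhnErrataQueue."""
    cur = conn.cursor()
    cur.execute("select count(*) from rhnErrataQueue")
    return cur.fetchone()[0]


def wait_until_drained(conn, poll_seconds=60, max_polls=30):
    """Poll the queue; return True once it is empty, False if polls run out."""
    for _ in range(max_polls):
        if queue_depth(conn) == 0:
            return True
        time.sleep(poll_seconds)
    return False


# Stand-in demo only: an in-memory SQLite table that borrows the table name.
# The errata_id column is hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute("create table rhnErrataQueue (errata_id integer)")
demo.execute("insert into rhnErrataQueue values (1)")
print(queue_depth(demo))
demo.execute("delete from rhnErrataQueue")
print(wait_until_drained(demo, poll_seconds=0))
```

Against the real database you would pass a connection from whatever driver is in use instead of the SQLite stand-in.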
pushed this fix to webqa.
Will check on status in the morning.
I re-ran my original tests from yesterday:
Scenario 1 works.
Scenario 2 works.
Is it my imagination or is this running *a lot* faster than before?
Thanks for all your work on these issues. With any luck our problem resolution
may improve things for others too. We really appreciate your efforts on this.
Closed in rhn500h Release.