226897 – Errata Processing Slowness (?) causing number of problems

Summary: Errata Processing Slowness (?) causing number of problems

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite 5
Classification:	Red Hat
Component:	Provisioning
Sub Component:
Version:	unspecified
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Kevin A. Smith
QA Contact:	Beth Nackashi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	166615

Reported:	2007-02-01 21:42 UTC by Máirín Duffy
Modified:	2007-10-24 01:52 UTC
CC List:	8 users
Fixed In Version:	rhn500h
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-03-13 13:50:15 UTC
Target Upstream Version:
Embargoed:

Description Máirín Duffy 2007-02-01 21:42:09 UTC

Description of problem:

For a particular large account it seems errata are being processed slowly,
causing a number of issues which may be symptoms of the same problem. The issues
are as follows:

1) Update status icon in webui is inaccurate for some systems: 

Systems that have errata that apply to them (as indicated by up2date -l) are
shown with a blue 'up2date' icon in the RHN webui. The systems are correctly
checking into RHN automatically yet their update status according to the webui
icon is not corrected (it does not show the applicable errata). Running rhn_check
manually on the affected systems seems to correct the webui update status icon
for these systems.

2) "Affected Systems" tab for a particular erratum is incorrect:

Systems that have errata that apply to them (as indicated by up2date -l) are not
listed in the "Affected Systems" tab for those errata. This seems to be an issue
of slow errata processing. The customer affected by this noted that about one
system per hour seems to appear in the "Affected Systems" list. Note also that
the systems missing from the erratum's "Affected Systems" list that had auto
errata update set showed the erratum scheduled for update in their SDC >
Events > Pending tab.

3) SDC "errata" tab does not show all applicable errata for a system:

Systems that have errata that apply to them (as indicated by up2date -l) do not
show those errata in the SDC > Errata tab. As in issue #2 above, systems with
auto errata update do show those errata as pending / scheduled in the SDC >
Events > Pending tab.

Comment 2 Máirín Duffy 2007-02-01 21:50:41 UTC

From the specific examples above it seems that this bug does not affect whether
or not the errata are actually scheduled / processed by auto-errata update.
Rather, it seems to create an incorrect display of information for
non-auto-errata update systems in the webui.

Comment 9 Máirín Duffy 2007-02-01 22:29:52 UTC

One more comment: this may be related to bug 201349.

Comment 11 Máirín Duffy 2007-02-07 15:08:16 UTC

More reports/details on this bug available in the thread:
https://www.redhat.com/archives/nahant-list/2007-February/msg00058.html

Comment 12 Máirín Duffy 2007-02-08 15:01:30 UTC

As mentioned in the mailing list thread, running up2date -p on the system
manually seems to correct the problem in the webui, though I am not sure how
long the correction lasts.

Comment 14 Kevin A. Smith 2007-02-08 15:54:42 UTC

I am researching this problem now and have a couple of theories:

1) Taskomatic errata processing is heavily serialized: we process errata one at
a time, and one org at a time per erratum. This is obviously far from optimal and
not at all how this should work. I think this is a historical vestige of the
Perl version of Taskomatic and also reflects a desire to not "melt" the database
when RHN had a much smaller DB server.

2) We had several problems with errata processing during the 410 release. The
result was that several errata did not get autoupdate scheduling run for them.
AFAIK - nothing was ever done to correct this, meaning that these errata were
never rescheduled. I need to research this to verify.

I am currently modifying the ErrataQueue, ErrataCache, and ErrataMail tasks to
be more parallel and hopefully substantially increase throughput. After that, I
will be spelunking thru data to determine if anything needs to be done.

Comment 15 Dave Edsall 2007-02-08 16:40:38 UTC

Machines which have auto errata updates set are getting these updates. So it is
not a matter of the update not being there. It is just a matter of whether the
GUI is showing them as being there.

Comment 16 Kevin A. Smith 2007-02-12 21:58:58 UTC

This checkin attempts to address the problems reported. Specifically, it:

* Introduces a generic threaded queue model which can be used by 
  many Taskomatic tasks when parallelizing work processing is 
  considered to be beneficial.

* Ports the ErrataCache and ErrataQueue tasks to use the new threaded 
  framework. This should fix many of the problems reported, since errata 
  processing was a largely serial process. That resulted in out-of-date data 
  displays in the UI until the task was able to process all its work items. 
  This delay could be sizable given the large queues which can develop during 
  large errata releases. Each of these tasks defaults to 2 worker threads 
  unless configured to use more.

* Ports the RepollEntitlement task to use the new threaded queues as well. I 
  have done significant testing on webdev to ensure that repoll continues to 
  work as expected. Future QA pushes prior to RHN 500 GA should smoke out any 
  issues as we have at least one more bulk repoll to do.
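
A minimal sketch of the threaded queue idea described above, assuming a plain fixed-size thread pool from java.util.concurrent. The class name ThreadedQueueSketch, the method processAll, and the placeholder work items are hypothetical illustrations and are not taken from the Taskomatic source. Only the default of 2 worker threads comes from this comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: names here are illustrative, not Taskomatic's.
class ThreadedQueueSketch {
    // The comment above states that each ported task defaults to 2 workers.
    static final int DEFAULT_WORKERS = 2;

    // Applies handler to every work item on a fixed-size pool and
    // returns the results in the order the items were submitted.
    static <T, R> List<R> processAll(List<T> items, int workers,
                                     Function<T, R> handler) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (T item : items) {
                Callable<R> job = () -> handler.apply(item);
                pending.add(pool.submit(job));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : pending) {
                // Blocks until this item is done; a failed item surfaces
                // here as an ExecutionException.
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder work items standing in for queued errata.
        List<String> queued = List.of("erratum-a", "erratum-b", "erratum-c");
        List<String> done =
            processAll(queued, DEFAULT_WORKERS, id -> id + ":processed");
        System.out.println(done);
    }
}
```

With a single worker this degenerates to the serial behaviour described earlier in the thread; raising the worker count trades database load for throughput.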

Comment 24 Kevin A. Smith 2007-03-01 00:07:40 UTC

Taskomatic was missing updates for some systems which have auto-updates enabled.
Checked in a fix for this. Should be part of the next QA push. The testing used
for comments #18 and #19 should be used again.

Comment 25 Mike McCune 2007-03-01 04:59:11 UTC

onqa

Comment 26 Mike McCune 2007-03-01 05:02:19 UTC

onqa for reals

Comment 27 Beth Nackashi 2007-03-01 19:10:36 UTC

My auto-update systems are not scheduling errata.  I registered a system
(rlx-2-16.rhndev.redhat.com) to the iowastate account, set it to auto-update,
pushed an erratum, and noticed that the erratum applies to that system.  However
the update is not getting auto-scheduled.

Comment 28 Kevin A. Smith 2007-03-01 19:32:12 UTC

Yeah, that's not totally unexpected, unfortunately. I just found a bug in
Taskomatic a few minutes ago which could cause this task to fail before the
updates have been scheduled.

*sigh*

Next QA push should have a good fix in it.

Comment 30 Dave Edsall 2007-03-01 20:08:54 UTC

As I mentioned before, what is the status concerning the errata even appearing
in the Web GUI? I can't schedule an update if it isn't showing up.

Comment 31 Máirín Duffy 2007-03-01 20:13:57 UTC

Hi Dave, we are currently working on the issue. The workaround is to run
rhn_check manually on the affected systems - you might want to try running
rhn_check in a cron script on the systems in the meantime while we work on this.
Hope this helps.

Comment 32 Kevin A. Smith 2007-03-01 21:05:41 UTC

OK. Fix has just been committed which should handle the DB errors. Also, it
appears that rebuilding the errata cache might fix the UI display problems as
well. Running down why that would help now.

Comment 33 Mike McCune 2007-03-01 23:19:00 UTC

onqa

Comment 34 Beth Nackashi 2007-03-02 18:11:51 UTC

I have retested this to the best of my knowledge.  Please re-open if the issue
resurfaces after rhn500h is released.

Comment 35 Kevin A. Smith 2007-03-02 19:50:38 UTC

Reopening the bug due to another use case which I just found. There are two
basic ways a given server's errata cache can be recalculated:

1) User logs into RHN - This only works for orgs with few servers (I think the
limit is fewer than 30).

2) Server's setup is changed - This generally happens when a server registers,
an erratum is pushed to a channel to which the server is subscribed, or a
server's base channel is changed.

The verified fix addressed only the second use case, not the first. I have
modified the fix to handle both use cases.

Suggested Test Plans:

There need to be two test plans:

1) The same testing used to verify the bug previously.

2) Simulate use case #1:

  a) Find a user for an org with fewer than 30 servers.

  b) Register another server.

  c) Wait 15-20 mins. This gives Taskomatic time to run and do its thing.

  d) Open a terminal connected to rhnjava.back-webqa.redhat.com. Watch Tomcat's
log with this command: tail -f /var/log/tomcat5/catalina.out

  e) Log in as the user selected in step a.

  f) Verify that no errors were logged by Tomcat during the login. Inspect the
server registered in step b and verify that relevant errata are applied and
scheduled.

Comment 36 Mike McCune 2007-03-02 21:04:25 UTC

onqa

Comment 38 Kevin A. Smith 2007-03-03 02:36:56 UTC

Scenario smoked out an issue with the ErrataQueue task which would cause it to
not process errata records. :(

Code checked in this evening addresses that issue, so that _should_ fix scenario
one. One thing to note is that ErrataQueue processing is not fast. It might take
up to 30 mins for it to process a single erratum. You can monitor the process,
though, by looking at the rhnErrataQueue table like so:

select count(*) from rhnErrataQueue;

or

select * from rhnErrataQueue;
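
The count query above can also be polled in a loop until the queue drains. The following is a rough sketch only: the class ErrataQueueWatch and the method waitForDrain are hypothetical names, and the row count is supplied through an IntSupplier so the loop can be exercised without a database. In real use that supplier would be assumed to run select count(*) from rhnErrataQueue over a database connection.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: not part of Taskomatic or RHN.
class ErrataQueueWatch {
    // Polls countSource until it reports 0 rows or maxPolls is reached.
    // Returns the number of polls made, or -1 if the queue never drained.
    static int waitForDrain(IntSupplier countSource, int maxPolls,
                            long sleepMillis) throws InterruptedException {
        for (int poll = 1; poll <= maxPolls; poll++) {
            int remaining = countSource.getAsInt();
            System.out.println("rhnErrataQueue rows remaining: " + remaining);
            if (remaining == 0) {
                return poll;
            }
            Thread.sleep(sleepMillis);
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the real query: pretend one row is processed per poll.
        int[] fakeRows = {3};
        int polls = waitForDrain(() -> fakeRows[0]--, 10, 10L);
        System.out.println("polls made: " + polls);
    }
}
```

Given the note that a single erratum can take up to 30 minutes, a real poll interval would be assumed to be on the order of minutes rather than the milliseconds used in this demo.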

Also the fix for ErrataQueue has increased the memory requirements for
Taskomatic. I was seeing regular OOMs with the current min/max heap sizes. I've
bumped up the memory to 192 MB and hope that will be enough.

Comment 39 Mike McCune 2007-03-03 08:51:59 UTC

pushed this fix to webqa.

Imported:

RHSA-2007:0009

Will check on status in the morning.

Comment 40 Beth Nackashi 2007-03-03 13:54:51 UTC

I re-ran my original tests from yesterday:
Scenario 1 works.
Scenario 2 works.

Is it my imagination or is this running *a lot* faster than before?

Comment 41 John T. Rose 2007-03-03 18:28:38 UTC

Thanks for all your work on these issues. With any luck our problem resolution
may improve things for others too. We really appreciate your efforts on this.

Comment 42 Brandon Perkins 2007-03-13 13:50:15 UTC

Closed in rhn500h Release.