Red Hat Bugzilla – Bug 495815
yum metadata generation in /var/cache/rhn can cause extreme server load
Last modified: 2009-09-23 11:04:45 EDT
Cloning for sat 5.3.0 - even though the code may not be necessary in satellite 5.3.0, QA should cover this case to make sure this is not seen in 5.3.0. (Per mmccune and prad on IRC just now).
+++ This bug was initially created as a clone of Bug #495814 +++
If the cache for a given channel needs to be regenerated in /var/cache/rhn, every client request for that new metadata kicks off a process to regenerate the files.
This can put extreme load on the Satellite server, with each thread essentially doing the same work.
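The duplicated-work pattern suggests an obvious mitigation: serialize regeneration per channel so that concurrent requests wait for one regenerator instead of each starting their own. A minimal sketch of that idea (hypothetical helper using a file lock, not the actual Satellite code):

```python
import fcntl
import os

def regenerate_metadata(channel, cache_dir="/var/cache/rhn"):
    """Hypothetical sketch: only one process regenerates metadata for a
    channel; the rest block on an exclusive lock, then reuse the fresh cache.
    """
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, channel + ".lock")
    marker = os.path.join(cache_dir, channel + ".repodata")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # serialize regeneration
        if not os.path.exists(marker):     # another process may have finished it
            # ... expensive repodata generation would go here ...
            with open(marker, "w") as f:
                f.write("regenerated\n")
    return marker
```

With this guard, a cache wipe costs one regeneration per channel rather than one per in-flight client request.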
To reproduce this issue I wrote a simple multi-threaded python utility that spawns multiple yum requests against an RHN Satellite server. The client spins up 10 threads, each doing:
yum clean all && yum search zsh
with a separate --installroot parameter to allow simultaneous execution.
After setting up 2 RHEL5 clients, each running my load simulator, I was
quickly able to drive my Satellite to a load of *40-80*, at which point it
eventually ceased to be accessible.
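The load simulator itself is not attached here, but a minimal sketch of the approach described above (thread count and yum commands from this report; the per-thread /tmp installroot layout is an assumption) might look like:

```python
import subprocess
import threading

def hammer(root):
    """One worker: wipe the local yum cache and force a metadata fetch,
    using a private --installroot so threads don't trip over each other."""
    for cmd in (["yum", "--installroot", root, "clean", "all"],
                ["yum", "--installroot", root, "search", "zsh"]):
        subprocess.run(cmd, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def run_load_test(n_threads=10, runner=hammer):
    # `runner` is injectable so the threading loop can be exercised
    # without invoking yum at all.
    threads = [threading.Thread(target=runner, args=("/tmp/yumroot-%d" % i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    run_load_test()
```

Every thread's `yum clean all` discards the downloaded metadata, so the following `yum search` forces a fresh metadata request to the Satellite.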
** Steps to reproduce the yum 'metadata storm' on a 5.2 Satellite:
1) Register at least 2 RHEL5 clients to your Satellite
2) Make sure your RHEL5 channel is populated and synced
3) Check out:
4) On each RHEL5 client as root execute: 'python yum-load-test.py'
5) On your RHN Satellite server run: 'rm -rf /var/cache/rhn/'
6) Wait. This will cause each client request to start regenerating the
metadata for the rhel5 channel. As these requests pile up, the
server is quickly brought to its knees.
The more clients you have, the quicker it will die.
bug 495814 for sat52maint
bug 495816 for sat51maint
bug 495815 for sat530-triage
1) Registered (2) systems and subscribed to fully sync'd RHEL 5 channel.
2) Started python script http://svn.rhndev.redhat.com/viewcvs/trunk/eng/scripts/load-testing/yum-load-test.py on both systems.
3) rm -rf /var/cache/rhn/ on satellite.
4) Waited until /var/cache/rhn/repodata/ was being regenerated.
5) Accessed the Satellite web UI over the next 20 minutes; the Satellite still seems accessible.
The Satellite does not appear to die.
Verified in stage -> RELEASE_PENDING.
* registered 2 rhel5 clients
* started yum-load-test.py
* removed files from /var/cache/rhn/
* load didn't exceed 1.5
# sar -q 30 10
Linux 2.6.9-89.0.3.ELsmp (dell-pem710-01.rhts.eng.bos.redhat.com) 08/14/2009
07:52:25 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
07:52:55 AM 12 454 1.39 1.24 0.68
07:53:25 AM 1 454 1.37 1.24 0.70
07:53:55 AM 1 452 1.22 1.22 0.71
07:54:25 AM 0 452 1.13 1.20 0.72
07:54:55 AM 1 450 1.30 1.23 0.74
07:55:25 AM 0 451 1.49 1.28 0.78
07:55:55 AM 0 449 0.98 1.17 0.75
07:56:25 AM 0 451 0.59 1.06 0.73
07:56:55 AM 0 449 0.36 0.96 0.70
07:57:25 AM 0 451 0.28 0.88 0.69
Average: 2 451 1.01 1.15 0.72
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.